What is PySpark and Why is it Important?
In today’s data-driven world, companies generate massive volumes of data every second. To process such huge datasets efficiently, we need a tool that is fast, scalable, and reliable. That’s where Apache Spark comes in — a powerful open-source engine for big data processing.
PySpark is the Python API for Apache Spark. It lets you harness the power of Spark using Python, one of the most popular and beginner-friendly programming languages in data science. PySpark makes it easy to perform distributed computing, run complex transformations, and analyze massive datasets — all with Python code.
Whether you’re building data pipelines, training machine learning models, or analyzing logs, PySpark helps you scale your work without worrying about the complexity of cluster management.
In this blog, we’ll explore commonly asked PySpark interview questions — along with clear answers and code examples — to help you crack your next data engineering or data science interview.
Top 32 PySpark Interview Questions and Answers for 2025
Q1. What are the core features of PySpark?
Ans – The core features of PySpark are:
- Lazy evaluation
- Distributed computing
- In-memory computation
- Fault tolerance
- Integration with Hadoop and other Big Data tools
- Real-time stream processing with Spark Streaming (see the streaming sketch after this list)
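As a quick illustration of the streaming feature, here is a minimal sketch using the Structured Streaming API; the socket source and its host/port are placeholder assumptions, not part of the original answer:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Read a stream of text lines from a socket (host and port are placeholders)
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split each line into words and maintain a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the running counts to the console until the query is stopped
counts.writeStream.outputMode("complete").format("console").start().awaitTermination()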
Q2. What is RDD in PySpark?
Ans – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark. An RDD is an immutable, distributed collection of objects that can be processed in parallel and is fault-tolerant.
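A minimal sketch of creating and using an RDD, assuming a SparkSession named spark already exists (as created in Q3):
# Create an RDD from a local Python list via the SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# map() is a lazy transformation; collect() is an action that triggers execution
squares = rdd.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]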
Q3. How do I create a DataFrame in PySpark?
Ans –
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()
Q4. Explain lazy evaluation in PySpark.
Ans – In PySpark, transformations like map, filter, etc., do not execute immediately. Spark builds a logical execution plan, and execution begins only when an action such as count() or collect() is called.
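For example, using the df from Q3, nothing is computed until the final action:
filtered = df.filter(df["id"] > 1)   # transformation: only added to the logical plan
names = filtered.select("name")      # still no execution
names.show()                         # action: Spark now runs the whole plan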
Q5. What are the different types of joins in PySpark?
Ans – PySpark supports the following types of joins (a usage sketch follows the list):
- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Semi Join
- Anti Join
- Cross Join
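A minimal sketch of how the join type is chosen via the how parameter; df1, df2, and the key column id are assumed here purely for illustration:
inner_df = df1.join(df2, on="id", how="inner")      # only matching keys
left_df = df1.join(df2, on="id", how="left")        # all rows from df1
semi_df = df1.join(df2, on="id", how="left_semi")   # rows of df1 that have a match (df1 columns only)
anti_df = df1.join(df2, on="id", how="left_anti")   # rows of df1 with no match in df2
cross_df = df1.crossJoin(df2)                       # Cartesian product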
Q6. What is the difference between collect() and show()?
Ans –
- collect() returns all the data to the driver node as a list of rows.
- show() prints the first 20 rows (by default) in a tabular format on the console.
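For example, with the df from Q3:
rows = df.collect()   # brings every row to the driver as a list of Row objects
df.show(5)            # prints only the first 5 rows; the default is 20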
Q7. How to create a DataFrame in PySpark?
Ans – from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [(1, "Raman"), (2, "Manish")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()
Q8. Read a CSV and Display the Schema
Ans – df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.printSchema()
Q9. How to handle missing data in PySpark?
Ans – You can use the na functions to handle null values:
df.na.drop() # Drop rows with null values
df.na.fill("Unknown") # Fill nulls with a default value
Q10. What is broadcast join in PySpark?
Ans – Broadcast join is an optimization technique in PySpark used to join two DataFrames when one of them is small. The smaller DataFrame is copied (broadcast) to every executor node, so the join avoids an expensive shuffle.
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "id")
Q11. What are window functions in PySpark?
Ans – Window functions perform operations across an input range of rows, such as ranking, running totals, etc.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_num", row_number().over(windowSpec)).show()
Q12. What is caching in PySpark?
Ans – Caching is an optimization technique that stores intermediate results in memory for faster reuse in later computations. You can use the cache() and persist() methods.
df.cache()
df.persist()
Q13. How to write data to Hive or Parquet in PySpark?
Ans – df.write.mode("overwrite").parquet("/path/to/save")
df.write.saveAsTable("my_table")
Q14. Find the top 3 highest salaries in each department.
Ans – from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
windowSpec = Window.partitionBy("department").orderBy(df["salary"].desc())
df.withColumn("rank", dense_rank().over(windowSpec)) \
.filter("rank <= 3") \
.show()
Q15. Find the top 3 salaried employees by department.
Ans – from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
windowSpec = Window.partitionBy("department").orderBy(df["salary"].desc())
df.withColumn("rank", dense_rank().over(windowSpec)) \
.filter("rank <= 3") \
.show()
Q16. Remove Duplicates Based on One or More Columns
Ans – df.dropDuplicates(["employee_id"]).show()
Q17. Count Null Values in Each Column
Ans – from pyspark.sql.functions import col, sum
null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()
Q18. Join Two DataFrames and Filter Rows on Specific Values
Ans – joined_df = df1.join(df2, df1["id"] == df2["emp_id"], "inner")
filtered_df = joined_df.filter(df1["salary"] > 50000)
filtered_df.show()
Q19. Find the average salary, department-wise.
Ans – df.groupBy("department").agg({"salary": "avg"}).show()
Q20. Calculate Rolling Average Using Window Function
Ans –
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
windowSpec = Window.partitionBy("id").orderBy("date").rowsBetween(-2, 0)
df.withColumn("rolling_avg", avg("value").over(windowSpec)).show()
Q21. Find Difference Between Two Date Columns
Ans – from pyspark.sql.functions import datediff
df.withColumn("days_diff", datediff(df["end_date"], df["start_date"])).show()
Q22. Find Difference Between Two Date Columns
Ans – from pyspark.sql.functions import datediff
df.withColumn("days_diff", datediff(df["end_date"], df["start_date"])).show()
Q23. Remove Duplicates Based on One or More Columns
Ans – df.dropDuplicates(["employee_id"]).show()
Q24. Count Null Values in Each Column
Ans – from pyspark.sql.functions import col, sum
null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()
Q25. Calculate Rolling Average Using Window Function
Ans – from pyspark.sql.window import Window
from pyspark.sql.functions import avg
windowSpec = Window.partitionBy("id").orderBy("date").rowsBetween(-2, 0)
df.withColumn("rolling_avg", avg("value").over(windowSpec)).show()
Q26. Filter Records Where Value Did Not Change for 30 Days
Ans –
from pyspark.sql.functions import lag, col, datediff
from pyspark.sql.window import Window
windowSpec = Window.partitionBy("id").orderBy("date")
df = df.withColumn("prev_value", lag("value").over(windowSpec))
df = df.withColumn("change_day", datediff("date", lag("date").over(windowSpec)))
df.filter((col("value") == col("prev_value")) & (col("change_day") >= 30)).show()
Q27. Find Duplicate Records
Ans –
df.groupBy(df.columns).count().filter("count > 1").show()
Q28. Explode Array Column into Rows
Ans –
from pyspark.sql.functions import explode
df.withColumn("item", explode("items")).show()
Q29. Pivot DataFrame (e.g., Sales by Month)
Ans – df.groupBy("product").pivot("month").sum("sales").show()
Q30. Optimize Spark Job (Use Broadcast Join)
Ans –
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id").show()
Q31. Get First and Last Transaction Date for Each Customer
Ans – from pyspark.sql.functions import min, max
df.groupBy("customer_id").agg(
    min("txn_date").alias("first_txn"),
    max("txn_date").alias("last_txn")
).show()
Q32. Create a Derived Column Using UDF (e.g., Card Type Segmentation)
Ans – from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def card_segment(card_type):
    if card_type in ["Visa Platinum", "Master Platinum"]:
        return "Premium"
    elif card_type in ["Visa", "Master"]:
        return "Consumer"
    return "Others"

segment_udf = udf(card_segment, StringType())
df.withColumn("card_segment", segment_udf(col("card_type"))).show()
Conclusion
PySpark is a highly in-demand tool today, as organizations rely heavily on data for their business growth. PySpark developers command strong salaries and have a growing number of opportunities. You can learn it by enrolling with the Console Flare institute, where you will get training from industry experts along with strong placement support.
For more such content and regular updates, follow us on Facebook, Instagram, LinkedIn