Top 32 PySpark Interview Questions and Answers for 2025

What is PySpark and Why is it Important?

In today’s data-driven world, companies generate massive volumes of data every second. To process such huge datasets efficiently, we need a tool that is fast, scalable, and reliable. That’s where Apache Spark comes in — a powerful open-source engine for big data processing.

PySpark is the Python API for Apache Spark. It lets you harness the power of Spark using Python, one of the most popular and beginner-friendly programming languages in data science. PySpark makes it easy to perform distributed computing, run complex transformations, and analyze massive datasets — all with Python code.

Whether you’re building data pipelines, training machine learning models, or analyzing logs, PySpark helps you scale your work without worrying about the complexity of cluster management.

In this blog, we’ll explore commonly asked PySpark interview questions — along with clear answers and code examples — to help you crack your next data engineering or data science interview.

Top 32 PySpark Interview Questions and Answers for 2025

Q1. What are the core features of PySpark?

Ans – The core features of PySpark are given below:

  • Lazy evaluation
  • Distributed computing
  • In-memory computation
  • Fault tolerance
  • Integration with Hadoop and other Big Data tools
  • Real-time stream processing with Spark Streaming

Q2. What is RDD in PySpark?

Ans – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable, distributed collection of objects that can be processed in parallel and is fault-tolerant.
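
For illustration, here is a minimal sketch of creating and using an RDD; it assumes an existing SparkSession named spark (created as shown in the next question):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])   # distribute a Python list across the cluster
squared = rdd.map(lambda x: x * x)                       # transformation (lazy, not executed yet)
print(squared.collect())                                 # action triggers execution: [1, 4, 9, 16, 25]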

Q3. How do I create a DataFrame in PySpark?

Ans –

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()

Q4. Explain lazy evaluation in PySpark.

Ans – In PySpark, transformations like map, filter, etc., do not execute immediately. Spark builds a logical plan for the computation, and execution begins only when an action like count() or collect() is called.
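
A minimal sketch of this behaviour, assuming an existing DataFrame df with a salary column:

filtered = df.filter(df["salary"] > 50000)   # transformation only: nothing runs yet
print(filtered.count())                      # action: Spark now builds and runs the physical plan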

Q5. What are the different types of joins in PySpark?

Ans – PySpark supports the following types of joins (a short sketch follows the list):

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Semi Join
  • Anti Join
  • Cross Join
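
As a short sketch, the join type is chosen with the how argument of join(); df1 and df2 are assumed to share an id column:

df1.join(df2, on="id", how="inner").show()       # only matching rows
df1.join(df2, on="id", how="left").show()        # all rows of df1, nulls where there is no match
df1.join(df2, on="id", how="left_anti").show()   # rows of df1 that have no match in df2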

Q6. What is the difference between collect() and show()?

Ans –

  • collect() returns all the data to the driver as a list of Row objects.
  • show() prints the first 20 rows (by default) in a tabular format.
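
A quick sketch of the difference, assuming an existing DataFrame df:

rows = df.collect()   # returns a list of Row objects to the driver (avoid on very large DataFrames)
df.show()             # prints the first 20 rows; df.show(5) prints only 5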

Q7. How to create a DataFrame in PySpark?

Ans –

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
data = [(1, "Raman"), (2, "Manish")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()

Q8. Read a CSV and Display the Schema

Ans –

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.printSchema()

Q9. How to handle missing data in PySpark?

Ans – You can use the na functions to handle null values:

df.na.drop()          # Drop rows with null values
df.na.fill("Unknown") # Fill nulls with a default value

Q10. What is broadcast join in PySpark?

Ans – Broadcast join is an optimization technique in PySpark for joining a large DataFrame with a much smaller one. The smaller DataFrame is copied (broadcast) to every executor, so the join avoids shuffling the large DataFrame.

from pyspark.sql.functions import broadcast

df1.join(broadcast(df2), "id")

Q11. What are window functions in PySpark?

Ans – Window functions perform calculations over a defined range of rows related to the current row, such as rankings, running totals, etc.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_num", row_number().over(windowSpec)).show()

Q12. What is caching in PySpark?

Ans – Caching is an optimization technique that stores an intermediate result in memory so that later actions can reuse it instead of recomputing it. You can use the cache() and persist() methods.

df.cache()
df.persist()
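
As a sketch, persist() also accepts an explicit storage level; the DataFrame df is assumed to exist:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk if they do not fit in memory
df.count()                                 # the first action materializes the cache
df.unpersist()                             # release the cached data when it is no longer needed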

Q13. How to write data to Hive or Parquet in PySpark?

Ans –

df.write.mode("overwrite").parquet("/path/to/save")
df.write.saveAsTable("my_table")

Q14. Find the top 3 highest salaries in each department.

Ans –

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

windowSpec = Window.partitionBy("department").orderBy(df["salary"].desc())
df.withColumn("rank", dense_rank().over(windowSpec)) \
  .filter("rank <= 3") \
  .show()

Q15. Find the Top 3 Salaried Employees in Each Department

Ans –

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

windowSpec = Window.partitionBy("department").orderBy(df["salary"].desc())
df.withColumn("rank", dense_rank().over(windowSpec)) \
  .filter("rank <= 3") \
  .show()

Q16. Remove Duplicates Based on One or More Columns

Ans – df.dropDuplicates(["employee_id"]).show()

Q17. Count Null Values in Each Column

Ans –

from pyspark.sql.functions import col, sum

null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()

Q18. Join Two DataFrames and Filter Rows on specific values.

Ans –

joined_df = df1.join(df2, df1["id"] == df2["emp_id"], "inner")
filtered_df = joined_df.filter(df1["salary"] > 50000)
filtered_df.show()

Q19. Find the Average Salary by Department

Ans – df.groupBy("department").agg({"salary": "avg"}).show()

Q20. Calculate Rolling Average Using Window Function

Ans –

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

windowSpec = Window.partitionBy("id").orderBy("date").rowsBetween(-2, 0)
df.withColumn("rolling_avg", avg("value").over(windowSpec)).show()

Q21. Find the Difference Between Two Date Columns

Ans –

from pyspark.sql.functions import datediff

df.withColumn("days_diff", datediff(df["end_date"], df["start_date"])).show()

Q22. Find the Difference Between Two Date Columns

Ans –

from pyspark.sql.functions import datediff

df.withColumn("days_diff", datediff(df["end_date"], df["start_date"])).show()

Q23. Remove Duplicates Based on One or More Columns

Ans – df.dropDuplicates(["employee_id"]).show()

Q24. Count Null Values in Each Column

Ans –

from pyspark.sql.functions import col, sum

null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()

Q25. Calculate a Rolling Average Using a Window Function

Ans –

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

windowSpec = Window.partitionBy("id").orderBy("date").rowsBetween(-2, 0)
df.withColumn("rolling_avg", avg("value").over(windowSpec)).show()

Q26. Filter Records Where Value Did Not Change for 30 Days

Ans –

from pyspark.sql.functions import lag, col, datediff
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("id").orderBy("date")
df = df.withColumn("prev_value", lag("value").over(windowSpec))
df = df.withColumn("change_day", datediff("date", lag("date").over(windowSpec)))
df.filter((col("value") == col("prev_value")) & (col("change_day") >= 30)).show()

Q27. Find Duplicate Records

Ans –

df.groupBy(df.columns).count().filter("count > 1").show()

Q28. Explode an Array Column into Rows

Ans –

from pyspark.sql.functions import explode

df.withColumn("item", explode("items")).show()

Q29. Pivot a DataFrame (e.g., Sales by Month)

Ans –

df.groupBy("product").pivot("month").sum("sales").show()

Q30. Optimize a Spark Job (Use a Broadcast Join)

Ans –

from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id").show()

Q31. Get the First and Last Transaction Date for Each Customer

Ans –

from pyspark.sql.functions import min, max

df.groupBy("customer_id").agg(
    min("txn_date").alias("first_txn"),
    max("txn_date").alias("last_txn")
).show()

Q32. Create a Derived Column Using a UDF (e.g., Card Type Segmentation)

Ans –

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def card_segment(card_type):
    if card_type in ["Visa Platinum", "Master Platinum"]:
        return "Premium"
    elif card_type in ["Visa", "Master"]:
        return "Consumer"
    return "Others"

segment_udf = udf(card_segment, StringType())
df.withColumn("card_segment", segment_udf(col("card_type"))).show()

Conclusion

PySpark is a highly in-demand tool today, as almost every organization depends on data for its business growth, and skilled PySpark developers have access to well-paid roles and a large number of opportunities. You can learn PySpark by enrolling with the Console Flare institute, where you will get training from industry experts along with strong placement support.

For more such content and regular updates, follow us on Facebook, Instagram, and LinkedIn.
