Channel: Welcome To TechBrothersIT

PySpark Tutorial: fillna() Function to Replace Null or Missing Values | #PySparkTutorial #PySpark

How to Use fillna() Function in PySpark | Step-by-Step Guide


Author: Aamir Shahzad

Date: March 2025

Introduction

In this tutorial, we will learn how to handle missing or null values in PySpark DataFrames using the fillna() function. Handling missing data is a critical part of data cleaning in data engineering workflows.

Why Use fillna() in PySpark?

  • Replace NULL values in DataFrame columns with specific values.
  • Apply different replacement values to different columns.
  • Clean your dataset before analysis or feeding it into machine learning models.

Step 1: Import SparkSession and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkFillnaFunction") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", None, 4000),
    ("Raza", "Marketing", None),
    (None, "Sales", 4500),
    ("Ali", None, None)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

df.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
| Amir Shahzad|Engineering|  5000|
|          Ali|       null|  4000|
|         Raza|  Marketing|  null|
|         null|      Sales|  4500|
|          Ali|       null|  null|
+-------------+-----------+------+

Step 3: Fill All NULL Values

Chain two fillna() calls to replace NULLs with 'Unknown' in string columns and 0 in numeric columns; each fill value is applied only to columns of a matching type.

df_fill_all = df.fillna("Unknown").fillna(0)

df_fill_all.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
| Amir Shahzad|Engineering|  5000|
|          Ali|    Unknown|  4000|
|         Raza|  Marketing|     0|
|      Unknown|      Sales|  4500|
|          Ali|    Unknown|     0|
+-------------+-----------+------+

Step 4: Fill NULLs with Column-Specific Values

df_fill_columns = df.fillna({
    "Department": "NA",
    "Salary": 10000
})

df_fill_columns.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
| Amir Shahzad|Engineering|  5000|
|          Ali|         NA|  4000|
|         Raza|  Marketing| 10000|
|         null|      Sales|  4500|
|          Ali|         NA| 10000|
+-------------+-----------+------+

Step 5: Fill NULLs in a Specific Column Only

df_fill_name = df.fillna("No Name", subset=["Name"])

df_fill_name.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
| Amir Shahzad|Engineering|  5000|
|          Ali|       null|  4000|
|         Raza|  Marketing|  null|
|      No Name|      Sales|  4500|
|          Ali|       null|  null|
+-------------+-----------+------+

Conclusion

Handling null and missing values is an essential part of data processing in PySpark. The fillna() function provides a simple and flexible way to replace these values, ensuring your data is clean and ready for further analysis or modeling.
