Channel: Welcome To TechBrothersIT

Use distinct() to Remove Duplicates from DataFrames | Get Unique Rows with distinct() in PySpark

How to Use distinct() Function in PySpark

The distinct() method in PySpark returns a new DataFrame with duplicate rows removed. Two rows count as duplicates only when they match in every column, and the original DataFrame is left unchanged. This makes distinct() a common tool for data cleaning and analysis in big data workflows.

Example Dataset

id  name    department
1   Alice   IT
2   Bob     HR
9   Aamir   Finance
4   Alice   IT
5   Eve     HR
6   Frank   Finance
7   Bob     HR
8   Grace   IT
9   Aamir   Finance

Create DataFrame

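from pyspark.sql import SparkSession

# Reuse the active session if one exists (for example in a notebook), otherwise create one
spark = SparkSession.builder.getOrCreate()

data = [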
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance")
]

df = spark.createDataFrame(data, ["id", "name", "department"])

df.show()

Removing Duplicate Rows using distinct()

df_distinct = df.distinct()

df_distinct.show()
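
As a rough cross-check of what df_distinct.show() should contain, the row-level comparison can be modeled in plain Python. This is only a sketch of the semantics, not how Spark executes distinct() internally, and the names rows and unique_rows are made up for illustration. Because every column takes part in the comparison, only the repeated (9, "Aamir", "Finance") row collapses; the two Alice rows and the two Bob rows survive because their ids differ.

```python
# Plain-Python model of full-row deduplication (illustrative only; Spark
# does this in a distributed way and does not guarantee row order).
rows = [
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance"),
]

# dict.fromkeys keeps the first occurrence of each tuple and drops exact repeats.
unique_rows = list(dict.fromkeys(rows))

print(len(rows), len(unique_rows))  # 9 8
```

With the example data, df_distinct should therefore hold 8 rows, one fewer than df.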

Getting Unique Values from a Single Column

df.select("department").distinct().show()
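
The column selection runs before the deduplication, so the result depends on which columns are kept. The plain-Python sketch below models that order of operations on the same data; it is an illustration with hypothetical variable names, not Spark's implementation.

```python
# Model of select(...) followed by distinct(): project first, then deduplicate.
rows = [
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance"),
]

# Keeping only the department column leaves three unique values.
departments = {dept for _id, _name, dept in rows}

# Keeping name and department: the ids no longer tell the Alice and Bob rows apart.
name_dept = {(name, dept) for _id, name, dept in rows}

print(sorted(departments))  # ['Finance', 'HR', 'IT']
print(len(name_dept))       # 6
```

So df.select("department").distinct() should show three rows for this dataset. Spark does not guarantee their order, so the rows may appear in a different sequence from run to run.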

Summary

Use distinct() to remove rows that are identical across all columns, or combine it with select() to list the unique values of a specific column. It is a common step in data preprocessing and cleaning tasks in data engineering projects.
