Channel: Welcome To TechBrothersIT

How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark | PySpark Tutorial

This guide walks through six PySpark functions for selecting and transforming DataFrame columns: select(), selectExpr(), col(), expr(), when(), and lit(). Each section shows one short example that runs against the same sample DataFrame.

Topics Covered

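  • select() - Select specific columns or column expressions from a DataFrame.
  • selectExpr() - Select columns using SQL expression strings.
  • col() - Reference a column by name as a Column object.
  • expr() - Parse a SQL expression string into a Column.
  • when() - Build conditional (if/else) column values.
  • lit() - Create a column that holds a constant value.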

1. Sample DataFrame Creation

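from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, lit

# Create (or reuse) a SparkSession. Notebooks and the pyspark shell usually
# provide `spark` already; the app name here is arbitrary.
spark = SparkSession.builder.appName("SelectFunctionsDemo").getOrCreate()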

# Sample Data
data = [
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45)
]

# Creating DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

# Show DataFrame
df.show()

2. Selecting Specific Columns

df.select("name", "salary").show()

3. Using col() Function

df.select(col("name"), col("department")).show()

4. Renaming Columns Using alias()

df.select(col("name").alias("Employee_Name"), col("salary").alias("Employee_Salary")).show()

5. Using Expressions in select()

df.select("name", "salary", expr("salary * 1.10 AS increased_salary")).show()

6. Using Conditional Expressions with when()

df.select(
    "name",
    "salary",
    when(col("salary") > 7000, "High").otherwise("Low").alias("Salary_Category")
).show()

7. Using selectExpr() for SQL-like Expressions

df.selectExpr("name", "salary * 2 as double_salary").show()

8. Adding Constant Columns Using lit()

df.select("name", "department", lit("Active").alias("status")).show()

9. Selecting Columns Dynamically

columns_to_select = ["name", "salary", "department"]
df.select(*columns_to_select).show()

10. Selecting All Columns Except One

df.select([column for column in df.columns if column != "age"]).show()

Author: Aamir Shahzad

