Spark - The Elite List


Filter rows by column value in PySpark.


from pyspark.sql import functions as F

# Keep rows where col_name is neither null nor an empty string
df.where(F.col("col_name").isNotNull() & (F.col("col_name") != "")).show()
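
The same where/filter API handles exact matches and set membership. A minimal sketch, with status and country as hypothetical column names:

# Hypothetical column names, for illustration only
df.where(F.col("status") == "active").show()           # exact match
df.filter(F.col("country").isin("DE", "FR")).show()    # membership test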

Create a new column in a Spark DataFrame from an existing column using the withColumn method.


# Slicing a Column is shorthand for substr(startPos, length): this keeps
# the first two characters of old_col (positions are 1-based)
df = df.withColumn('new_col', df.old_col[1:2])
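
withColumn accepts any column expression, not just slices. A minimal sketch of an arithmetic and a conditional column, with amount as a hypothetical column and 1.19 as an arbitrary factor:

# Hypothetical column names; assumes F is imported as above
df = df.withColumn('gross', F.col('amount') * 1.19)
df = df.withColumn('size', F.when(F.col('amount') > 100, 'large').otherwise('small'))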

Groupby and aggregate example in PySpark.


df.groupby("col").agg(F.min('agg_col').alias('agg')).orderBy('col', ascending=True)

Save a Spark DataFrame as Apache Parquet in HDFS.


df.write.parquet("hdfs:///user/abc123/outputs/df.parquet")
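
By default the write fails if the target path already exists; mode and partitionBy control that and the on-disk layout. A minimal sketch reusing the path above, with year as a hypothetical partition column:

# 'year' is a hypothetical partition column, for illustration only
df.write.mode("overwrite").partitionBy("year").parquet("hdfs:///user/abc123/outputs/df.parquet")

# Reading the Parquet output back into a DataFrame
df2 = spark.read.parquet("hdfs:///user/abc123/outputs/df.parquet")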

Load a fixed-width text file as a DataFrame in PySpark.


# spark.read.text puts each line in a single 'value' column;
# substr(startPos, length) slices the fixed-width fields out (positions are 1-based)
df = spark.read.text("hdfs:///path/to/dir")
df = df.select(
    df.value.substr(1, 2).alias('ID'),     # positions 1-2
    df.value.substr(4, 57).alias('NAME'),  # positions 4-60
)
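
Fixed-width fields usually keep their padding, so trimming (and casting, where a field is known to be numeric) is a common follow-up. A minimal sketch building on the frame above, assuming ID holds only digits:

# Assumes F is imported as above; the int cast assumes ID contains only digits
df = df.select(
    F.trim(F.col('ID')).cast('int').alias('ID'),
    F.trim(F.col('NAME')).alias('NAME'),
)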