├── README.md
└── sparksession.md

/README.md:
--------------------------------------------------------------------------------

Using PySpark we can process data from Hadoop HDFS, AWS S3, and many other file systems. PySpark is also used to process real-time data with Spark Streaming and Kafka.

All Spark examples provided in this PySpark (Spark with Python) tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn PySpark and advance their career in Big Data and Machine Learning.

## Features of PySpark

- In-memory computation
- Distributed processing across the cluster
- Fault tolerance
- Immutable DataFrames and lazy evaluation of transformations
- Cache & persistence
- Built-in optimization when using DataFrames
- Supports ANSI SQL

# PySpark Quick Reference

A quick reference guide to the most commonly used patterns and functions in PySpark SQL.

### Read a CSV file into a DataFrame with a header, inferred schema, and comma delimiter
```
df = spark.read.options(header='True', inferSchema='True', delimiter=',').csv("/tmp/resources/sales.csv")
```
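The snippet above infers the schema from the data. To supply an explicit schema instead, something like the following sketch works; the column names and types here are assumptions for illustration, not taken from an actual sales.csv:
```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical schema - adjust field names and types to match your file
schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True),
])

df = spark.read.option("header", True).schema(schema).csv("/tmp/resources/sales.csv")
```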
### Easily reference these as F.func() and T.type()
```
from pyspark.sql import functions as F, types as T
```

## Common Operations
```
# Filter on an equals condition
df = df.filter(df.is_adult == 'Y')

# Filter on a >, <, >=, <= condition
df = df.filter(df.age > 25)

# Sort results
df = df.orderBy(df.age.asc())
df = df.orderBy(df.age.desc())

# Multiple conditions require parentheses around each condition
df = df.filter((df.age > 25) & (df.is_adult == 'Y'))

# Compare against a list of allowed values
df = df.filter(F.col('age').isin([3, 4, 7]))
```

## Joins
```
# Left join in another dataset
df = df.join(person_lookup_table, 'person_id', 'left')

# Match on different columns in left & right datasets
df = df.join(other_table, df.id == other_table.person_id, 'left')

# Match on multiple columns
df = df.join(other_table, ['first_name', 'last_name'], 'left')

# Useful for one-liner lookup code joins if you have a bunch of them
def lookup_and_replace(df1, df2, df1_key, df2_key, df2_value):
    return (
        df1
        .join(df2[[df2_key, df2_value]], df1[df1_key] == df2[df2_key], 'left')
        .withColumn(df1_key, F.coalesce(F.col(df2_value), F.col(df1_key)))
        .drop(df2_key)
        .drop(df2_value)
    )

df = lookup_and_replace(people, pay_codes, 'id', 'pay_code_id', 'pay_code_desc')
```

## Column Operations
```
# Add a new static column
df = df.withColumn('status', F.lit('PASS'))

# Construct a new dynamic column
df = df.withColumn('full_name', F.when(
    (df.fname.isNotNull() & df.lname.isNotNull()), F.concat(df.fname, df.lname)
).otherwise(F.lit('N/A')))

# Pick which columns to keep, optionally rename some
df = df.select(
    'name',
    'age',
    F.col('dob').alias('date_of_birth'),
)
```

## Casting & Coalescing Null Values & Duplicates
```
# Cast a column to a different type
df = df.withColumn('price', df.price.cast(T.DoubleType()))

# Replace all nulls with a specific value
df = df.fillna({
    'first_name': 'Sundar',
    'age': 18,
})

# Take the first value that is not null
df = df.withColumn('last_name', F.coalesce(df.last_name, df.surname, F.lit('N/A')))

# Drop duplicate rows in a dataset (distinct)
df = df.dropDuplicates()

# Drop duplicate rows, but consider only specific columns
df = df.dropDuplicates(['name', 'height'])

# Replace empty strings with null (leave out the subset keyword arg to replace in all columns)
df = df.replace({"": None}, subset=["name"])

# Convert Python/PySpark/NumPy NaN values to null
df = df.replace(float("nan"), None)

# Remove columns
df = df.drop('mod_dt', 'mod_username')

# Rename a column
df = df.withColumnRenamed('dob', 'date_of_birth')

# Keep only the columns that also occur in another dataset
df = df.select(*(F.col(c) for c in df2.columns))

# Batch rename/clean columns
for col in df.columns:
    df = df.withColumnRenamed(col, col.lower().replace(' ', '_').replace('-', '_'))
```

## String Operations
### String Filters
```
# Contains - col.contains(string)
df = df.filter(df.name.contains('o'))

# Starts With - col.startswith(string)
df = df.filter(df.name.startswith('Al'))

# Ends With - col.endswith(string)
df = df.filter(df.name.endswith('ice'))

# Is Null - col.isNull()
df = df.filter(df.is_adult.isNull())

# Is Not Null - col.isNotNull()
df = df.filter(df.first_name.isNotNull())

# Like - col.like(string_with_sql_wildcards)
df = df.filter(df.name.like('Al%'))

# Regex Like - col.rlike(regex)
df = df.filter(df.name.rlike('[A-Z]*ice$'))

# Is In List - col.isin(*cols)
df = df.filter(df.name.isin('Bob', 'Mike'))
```

### String Functions
```
# Substring - col.substr(startPos, length)
df = df.withColumn('short_id', df.id.substr(1, 10))

# Trim - F.trim(col)
df = df.withColumn('name', F.trim(df.name))

# Left Pad - F.lpad(col, len, pad)
# Right Pad - F.rpad(col, len, pad)
df = df.withColumn('id', F.lpad('id', 4, '0'))

# Left Trim - F.ltrim(col)
# Right Trim - F.rtrim(col)
df = df.withColumn('id', F.ltrim('id'))

# Concatenate - F.concat(*cols)
df = df.withColumn('full_name', F.concat('fname', F.lit(' '), 'lname'))

# Concatenate with Separator/Delimiter - F.concat_ws(delimiter, *cols)
df = df.withColumn('full_name', F.concat_ws('-', 'fname', 'lname'))

# Regex Replace - F.regexp_replace(str, pattern, replacement)
df = df.withColumn('id', F.regexp_replace('id', '0F1(.*)', '1F1-$1'))

# Regex Extract - F.regexp_extract(str, pattern, idx)
df = df.withColumn('id', F.regexp_extract('id', '[0-9]*', 0))
```

## Number Operations
```
# Round - F.round(col, scale=0)
df = df.withColumn('price', F.round('price', 0))

# Floor - F.floor(col)
df = df.withColumn('price', F.floor('price'))

# Ceiling - F.ceil(col)
df = df.withColumn('price', F.ceil('price'))

# Absolute Value - F.abs(col)
df = df.withColumn('price', F.abs('price'))

# X raised to power Y - F.pow(x, y)
df = df.withColumn('exponential_growth', F.pow('x', 'y'))

# Select the smallest value out of multiple columns - F.least(*cols)
df = df.withColumn('least', F.least('subtotal', 'total'))

# Select the largest value out of multiple columns - F.greatest(*cols)
df = df.withColumn('greatest', F.greatest('subtotal', 'total'))
```

## Date & Timestamp Operations
```
# Convert a string of known format to a date (excludes time information)
df = df.withColumn('date_of_birth', F.to_date('date_of_birth', 'yyyy-MM-dd'))

# Convert a string of known format to a timestamp (includes time information)
df = df.withColumn('time_of_birth', F.to_timestamp('time_of_birth', 'yyyy-MM-dd HH:mm:ss'))

# Get year from date: F.year(col)
# Get month from date: F.month(col)
# Get day from date: F.dayofmonth(col)
# Get hour from date: F.hour(col)
# Get minute from date: F.minute(col)
# Get second from date: F.second(col)
df = df.filter(F.year('date_of_birth') == F.lit('2017'))

# Add & subtract days
df = df.withColumn('three_days_after', F.date_add('date_of_birth', 3))
df = df.withColumn('three_days_before', F.date_sub('date_of_birth', 3))

# Add & subtract months
df = df.withColumn('next_month', F.add_months('date_of_birth', 1))

# Get the number of days between two dates (end - start)
df = df.withColumn('days_between', F.datediff('end', 'start'))

# Get the number of months between two dates (end - start)
df = df.withColumn('months_between', F.months_between('end', 'start'))

# Keep only rows where date_of_birth is between 2017-05-10 and 2018-07-21
df = df.filter(
    (F.col('date_of_birth') >= F.lit('2017-05-10')) &
    (F.col('date_of_birth') <= F.lit('2018-07-21'))
)
```
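As a small worked example combining the functions above (assuming the date_of_birth column exists in your data), an approximate age in whole years can be derived like this:
```
# Approximate age in whole years, derived from the assumed date_of_birth column
df = df.withColumn(
    'age',
    F.floor(F.months_between(F.current_date(), F.col('date_of_birth')) / 12)
)
```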
## Array Operations
```
# Column Array - F.array(*cols)
df = df.withColumn('full_name', F.array('fname', 'lname'))

# Empty Array - F.array(*cols)
df = df.withColumn('empty_array_column', F.array([]))

# Array Size/Length - F.size(col)
df = df.withColumn('array_length', F.size(F.col('my_array')))
```

## Aggregation Operations
```
# Row Count: F.count()
# Sum of Rows in Group: F.sum(*cols)
# Mean of Rows in Group: F.mean(*cols)
# Max of Rows in Group: F.max(*cols)
# Min of Rows in Group: F.min(*cols)
# First Row in Group: F.first(col)
df = df.groupBy('gender').agg(F.max('age').alias('max_age_by_gender'))

# Collect a Set of all Rows in Group: F.collect_set(col)
# Collect a List of all Rows in Group: F.collect_list(col)
df = df.groupBy('age').agg(F.collect_set('name').alias('person_names'))

# Just take the latest row for each combination (window functions)
from pyspark.sql import Window as W

window = W.partitionBy("first_name", "last_name").orderBy(F.desc("date"))
df = df.withColumn("row_number", F.row_number().over(window))
df = df.filter(F.col("row_number") == 1)
df = df.drop("row_number")
```

## Advanced Operations
### Repartitioning
```
# Repartition - df.repartition(num_output_partitions)
df = df.repartition(1)
```

### UDFs (User Defined Functions)
```
# Multiply each row's age column by two
times_two_udf = F.udf(lambda x: x * 2)
df = df.withColumn('age', times_two_udf(df.age))

# Randomly choose a value to use as a row's name
import random

random_name_udf = F.udf(lambda: random.choice(['Bob', 'Tom', 'Amy', 'Jenna']))
df = df.withColumn('name', random_name_udf())
```
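Note that F.udf defaults to a StringType return type, so times_two_udf above actually yields a string column. A sketch of the same UDF with an explicit return type (the T alias comes from the imports at the top of this guide):
```
# Same UDF, but declare the return type so the column stays numeric
times_two_int_udf = F.udf(lambda x: x * 2, T.IntegerType())
df = df.withColumn('age', times_two_int_udf(df.age))

# Equivalent decorator form
@F.udf(returnType=T.IntegerType())
def times_two(x):
    return x * 2

df = df.withColumn('age', times_two(df.age))
```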
## Window Functions

| Window Function Usage & Syntax | PySpark Window Function Description |
|---|---|
| `row_number(): Column` | Returns a sequential number starting from 1 within a window partition. |
| `rank(): Column` | Returns the rank of rows within a window partition, with gaps. |
| `percent_rank(): Column` | Returns the percentile rank of rows within a window partition. |
| `dense_rank(): Column` | Returns the rank of rows within a window partition without any gaps, whereas rank() returns ranks with gaps. |
| `ntile(n: Int): Column` | Returns the ntile id in a window partition. |
| `cume_dist(): Column` | Returns the cumulative distribution of values within a window partition. |
| `lag(e: Column, offset: Int): Column`<br>`lag(columnName: String, offset: Int): Column`<br>`lag(columnName: String, offset: Int, defaultValue: Any): Column` | Returns the value that is `offset` rows before the current row, and null if there are fewer than `offset` rows before the current row. |
| `lead(columnName: String, offset: Int): Column`<br>`lead(columnName: String, offset: Int, defaultValue: Any): Column` | Returns the value that is `offset` rows after the current row, and null if there are fewer than `offset` rows after the current row. |
### row_number Window Function
```
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("department").orderBy("salary")

df.withColumn("row_number", row_number().over(windowSpec)) \
    .show(truncate=False)
```
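### rank, dense_rank, lag & lead Window Functions
Building on the row_number example, here is a hedged sketch applying a few of the other window functions from the table above; the same department and salary columns are assumed:
```
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("department").orderBy("salary")

# rank/dense_rank number rows per department; lag/lead look at neighbouring salaries
df.withColumn("rank", F.rank().over(windowSpec)) \
    .withColumn("dense_rank", F.dense_rank().over(windowSpec)) \
    .withColumn("prev_salary", F.lag("salary", 1).over(windowSpec)) \
    .withColumn("next_salary", F.lead("salary", 1).over(windowSpec)) \
    .show(truncate=False)
```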
--------------------------------------------------------------------------------
/sparksession.md:
--------------------------------------------------------------------------------

# SparkSession
### SparkSession, introduced in version 2.0, is an entry point to the underlying PySpark functionality, used to programmatically create PySpark RDDs and DataFrames.
```
# Create a SparkSession
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('myapp.com') \
    .getOrCreate()
```
Its object `spark` is available by default in the pyspark shell, and it can also be created programmatically using the SparkSession builder.

**Spark Session available APIs in different contexts –** SparkSession subsumes SQLContext and HiveContext and also provides access to the underlying SparkContext.

You can create as many SparkSession objects as you want using either **SparkSession.builder** or **SparkSession.newSession**.

**SparkSession Commonly Used Methods**

| Method | Description |
|---|---|
| version() | Returns the Spark version your application is running on, i.e. the Spark version the cluster is configured with. |
| createDataFrame() | Creates a DataFrame from a collection or an RDD. |
| getActiveSession() | Returns the active SparkSession. |
| read() | Returns an instance of the DataFrameReader class, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame. |
| readStream() | Returns an instance of the DataStreamReader class, used to read streaming data into a DataFrame. |
| sparkContext() | Returns a SparkContext. |
| sql() | Returns a DataFrame after executing the SQL statement provided. |
| sqlContext() | Returns a SQLContext. |
| stop() | Stops the current SparkContext. |
| table() | Returns a DataFrame of a table or view. |
| udf() | Creates a PySpark UDF to use with DataFrames and SQL. |
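A short hedged sketch exercising a few of these methods; the sample rows are made up for illustration:
```
# Build a DataFrame from a local collection, register it as a view, and query it with SQL
data = [("James", 30), ("Anna", 25)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()

# version is exposed as a property in PySpark
print(spark.version)

# A separate session that shares the same underlying SparkContext
new_session = spark.newSession()
```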