├── LICENSE
├── README.md
├── Chapter06
│   └── Chapter 6 Code.ipynb
├── Chapter05
│   └── Chapter 5 Code.ipynb
└── Chapter04
    └── Chapter 4 Code.ipynb

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 Packt
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Databricks Certified Associate Developer for Apache Spark Using Python
2 | 
3 | 
4 | This is the code repository for [Databricks Certified Associate Developer for Apache Spark Using Python](https://www.packtpub.com/product/databricks-certified-associate-developer-for-apache-spark-using-python/9781804619780), published by Packt.
5 | 
6 | **The ultimate guide to getting certified in Apache Spark using practical examples with Python**
7 | 
8 | ## What is this book about?
9 | This guide gets you ready for certification with expert-backed content, key exam concepts, and topic reviews. Additionally, you’ll be able to make the most of Apache Spark 3.0 to modernize workloads using specific tools and techniques.
10 | 
11 | This book covers the following exciting features:
12 | * Create and manipulate SQL queries in Spark
13 | * Build complex Spark functions using Spark UDFs
14 | * Architect big data apps with Spark fundamentals for optimal design
15 | * Apply techniques to manipulate and optimize big data applications
16 | * Build real-time or near-real-time applications using Spark Streaming
17 | * Work with Apache Spark for machine learning applications
18 | 
19 | If you feel this book is for you, get your [copy](https://www.amazon.com/Databricks-Certified-Associate-Developer-Apache/dp/1804619787) today!
20 | 
21 | https://www.packtpub.com/
23 | 
24 | ## Instructions and Navigations
25 | All of the code is organized into folders. For example, Chapter04.
26 | 
27 | The code will look like the following:
28 | ```
29 | # Perform an aggregation to calculate the average salary
30 | average_salary = spark.sql("SELECT AVG(Salary) AS average_salary FROM employees")
31 | 
32 | ```
33 | 
34 | **Following is what you need for this book:**
35 | This book is for you if you’re a professional looking to venture into the world of big data and data engineering, a data professional who wants to validate their knowledge of Spark, or a student.
Although working knowledge of Python is required, no prior Spark knowledge is needed. Additionally, experience with Pyspark will be beneficial. 36 | 37 | With the following software and hardware list you can run all code files present in the book (Chapter 4-8). 38 | ### Software and Hardware List 39 | | Chapter | Software required | OS required | 40 | | -------- | ------------------------------------ | ----------------------------------- | 41 | | 4-8 | Python | Windows, Mac OS X, and Linux | 42 | | 4-8 | Spark | Windows, Mac OS X, and Linux | 43 | 44 | ### Related products 45 | * Business Intelligence with Databricks SQL [[Packt]](https://www.packtpub.com/product/business-intelligence-with-databricks-sql/9781803235332) [[Amazon]](https://www.amazon.com/Business-Intelligence-Databricks-SQL-intelligence/dp/1803235330/ref=sr_1_1?crid=1QYCAOZP9E3NH&dib=eyJ2IjoiMSJ9.nKZ7dRFPdDZyRvWwKM_NiTSZyweCLZ8g9JdktemcYzaWNiGWg9PuoxY2yb2jogGyK8hgRliKebDQfdHu2rRnTZTWZbsWOJAN33k65RFkAgdFX-csS8HgTFfjZj-SFKLpp4FC6LHwQvWr9Nq6f5x6eg.jh99qre-Hl4OHA9rypXLmSGsQp4exBvaZ2xUOPDQ0mM&dib_tag=se&keywords=Business+Intelligence+with+Databricks+SQL&qid=1718173191&s=books&sprefix=business+intelligence+with+databricks+sql%2Cstripbooks-intl-ship%2C553&sr=1-1) 46 | 47 | * Azure Databricks Cookbook [[Packt]](https://www.packtpub.com/product/azure-databricks-cookbook/9781789809718) [[Amazon]](https://www.amazon.com/Azure-Databricks-Cookbook-Jonathan-Wood/dp/1789809711) 48 | 49 | ## Get to Know the Author 50 | **Saba Shah** is a Data and AI Architect and Evangelist with a wide technical breadth and deep understanding of big data and machine learning technologies. She has experience leading data science and data engineering teams in Fortune 500s as well as startups. She started her career as a software engineer but soon transitioned to big data. She is currently a solutions architect at Databricks and works with enterprises building their data strategy and helping them create a vision for the future with machine learning and predictive analytics. Saba graduated with a degree in Computer Science and later earned an MS degree in Advanced Web Technologies. She is passionate about all things data and cricket. She currently resides in RTP, NC. 
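To try the snippets outside Databricks, the sketch below shows a minimal local setup. It assumes PySpark has been installed with `pip install pyspark`; the `employees` view, the sample rows, and the column names simply mirror the shape of the chapter examples and are illustrative, not taken from the book's data files. On Databricks a `SparkSession` named `spark` already exists, so only the view registration and query are needed there.

```
# Minimal local quick-start sketch (assumes `pip install pyspark` has been run).
from pyspark.sql import SparkSession

# Locally you create the session yourself; on Databricks `spark` is predefined.
spark = SparkSession.builder.appName("BookCodeQuickStart").getOrCreate()

# Illustrative rows shaped like the employees examples used in Chapters 4-6.
salary_data = [(1, "John", "Field-eng", 3500, 40),
               (2, "Robert", "Sales", 4000, 38)]
columns = ["ID", "Employee", "Department", "Salary", "Age"]
employees_df = spark.createDataFrame(salary_data, schema=columns)

# Register a temporary view so the SQL snippets from the book can run as-is.
employees_df.createOrReplaceTempView("employees")

# The aggregation shown in the snippet above.
spark.sql("SELECT AVG(Salary) AS average_salary FROM employees").show()

spark.stop()
```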
51 | -------------------------------------------------------------------------------- /Chapter06/Chapter 6 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "5f9b4e65-714c-46a7-b0ca-2c22d87e349e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 6: SQL Queries in Spark" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 0, 21 | "metadata": { 22 | "application/vnd.databricks.v1+cell": { 23 | "cellMetadata": { 24 | "byteLimit": 2048000, 25 | "rowLimit": 10000 26 | }, 27 | "inputWidgets": {}, 28 | "nuid": "e4b6f89b-2d68-4937-a30a-ac64ef7caa49", 29 | "showTitle": true, 30 | "title": "Create Salary dataframe" 31 | } 32 | }, 33 | "outputs": [ 34 | { 35 | "output_type": "stream", 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500, 40), \\\n", 45 | " (2, \"Robert\", \"Sales\", 4000, 38), \\\n", 46 | " (3, \"Maria\", \"Finance\", 3500, 28), \\\n", 47 | " (4, \"Michael\", \"Sales\", 3000, 20), \\\n", 48 | " (5, \"Kelly\", \"Finance\", 3500, 35), \\\n", 49 | " (6, \"Kate\", \"Finance\", 3000, 45), \\\n", 50 | " (7, \"Martin\", \"Finance\", 3500, 26), \\\n", 51 | " (8, \"Kiran\", \"Sales\", 2200, 35), \\\n", 52 | " ]\n", 53 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\", \"Age\"]\n", 54 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n", 55 | "salary_data_with_id.show()\n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 0, 61 | "metadata": { 62 | "application/vnd.databricks.v1+cell": { 63 | "cellMetadata": { 64 | "byteLimit": 2048000, 65 | "rowLimit": 10000 66 | }, 67 | "inputWidgets": {}, 68 | "nuid": "e075be3c-bb49-4c81-a7b9-7dfbb4370056", 69 | "showTitle": true, 70 | "title": "Writing csv file" 71 | } 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "salary_data_with_id.write.format(\"csv\").mode(\"overwrite\").option(\"header\", \"true\").save(\"salary_data.csv\")\n" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 0, 81 | "metadata": { 82 | "application/vnd.databricks.v1+cell": { 83 | "cellMetadata": { 84 | "byteLimit": 2048000, 85 | "rowLimit": 10000 86 | }, 87 | "inputWidgets": {}, 88 | "nuid": "a11fe3af-e723-4e97-abbf-d503acf033e3", 89 | "showTitle": true, 90 | "title": "Reading csv file" 91 | } 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "csv_data = spark.read.csv('/salary_data.csv', header=True)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 0, 101 | "metadata": { 102 | "application/vnd.databricks.v1+cell": { 103 | "cellMetadata": { 104 | "byteLimit": 2048000, 105 | "rowLimit": 10000 106 | }, 107 | "inputWidgets": {}, 108 | "nuid": "e1aa8d82-a393-448a-8fd0-5110b1eb1af2", 109 | "showTitle": true, 110 | "title": "Showing data" 111 | } 112 | }, 113 | 
"outputs": [ 114 | { 115 | "output_type": "stream", 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "csv_data.show()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 0, 130 | "metadata": { 131 | "application/vnd.databricks.v1+cell": { 132 | "cellMetadata": { 133 | "byteLimit": 2048000, 134 | "rowLimit": 10000 135 | }, 136 | "inputWidgets": {}, 137 | "nuid": "7f186780-97f0-4b7f-af17-c33073c872ce", 138 | "showTitle": true, 139 | "title": "# Perform transformations on the loaded data" 140 | } 141 | }, 142 | "outputs": [ 143 | { 144 | "output_type": "stream", 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 7| Martin| Finance| 3500| 26|\n+---+--------+----------+------+---+\n\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "# Perform transformations on the loaded data \n", 154 | "processed_data = csv_data.filter(csv_data[\"Salary\"] > 3000) \n", 155 | "# Save the processed data as a table \n", 156 | "processed_data.createOrReplaceTempView(\"high_salary_employees\") \n", 157 | "# Perform SQL queries on the saved table \n", 158 | "results = spark.sql(\"SELECT * FROM high_salary_employees \") \n", 159 | "results.show()\n" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 0, 165 | "metadata": { 166 | "application/vnd.databricks.v1+cell": { 167 | "cellMetadata": { 168 | "byteLimit": 2048000, 169 | "rowLimit": 10000 170 | }, 171 | "inputWidgets": {}, 172 | "nuid": "022c96b3-91f2-41ea-91c4-58ecd511a2f6", 173 | "showTitle": true, 174 | "title": "Saving Transformed Data as a View" 175 | } 176 | }, 177 | "outputs": [ 178 | { 179 | "output_type": "stream", 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| John| Field-eng| 3500| 40|\n| Robert| Sales| 4000| 38|\n| Kelly| Finance| 3500| 35|\n| Kate| Finance| 3000| 45|\n| Kiran| Sales| 2200| 35|\n+--------+----------+------+---+\n\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "# Save the processed data as a view \n", 189 | "salary_data_with_id.createOrReplaceTempView(\"employees\") \n", 190 | "#Apply filtering on data\n", 191 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30\") \n", 192 | "# Display the results \n", 193 | "filtered_data.show()\n" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 0, 199 | "metadata": { 200 | "application/vnd.databricks.v1+cell": { 201 | "cellMetadata": { 202 | "byteLimit": 2048000, 203 | "rowLimit": 10000 204 | }, 205 | "inputWidgets": {}, 206 | "nuid": "174c84ac-fa9e-42a7-a3b5-a166193e63b0", 207 | "showTitle": true, 208 | "title": "Aggregating data" 209 | } 210 | }, 211 | "outputs": [ 212 | { 213 | 
"output_type": "stream", 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "+--------------+\n|average_salary|\n+--------------+\n| 3275.0|\n+--------------+\n\n" 218 | ] 219 | } 220 | ], 221 | "source": [ 222 | "# Perform an aggregation to calculate the average salary \n", 223 | "average_salary = spark.sql(\"SELECT AVG(Salary) AS average_salary FROM employees\") \n", 224 | "# Display the average salary \n", 225 | "average_salary.show() \n" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 0, 231 | "metadata": { 232 | "application/vnd.databricks.v1+cell": { 233 | "cellMetadata": { 234 | "byteLimit": 2048000, 235 | "rowLimit": 10000 236 | }, 237 | "inputWidgets": {}, 238 | "nuid": "97bfd3f0-1a65-4a58-bdc7-58b55dd58840", 239 | "showTitle": true, 240 | "title": "Sorting data" 241 | } 242 | }, 243 | "outputs": [ 244 | { 245 | "output_type": "stream", 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 2| Robert| Sales| 4000| 38|\n| 1| John| Field-eng| 3500| 40|\n| 7| Martin| Finance| 3500| 26|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 4| Michael| Sales| 3000| 20|\n| 6| Kate| Finance| 3000| 45|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "# Sort the data based on the salary column in descending order \n", 255 | "sorted_data = spark.sql(\"SELECT * FROM employees ORDER BY Salary DESC\") \n", 256 | "# Display the sorted data \n", 257 | "sorted_data.show() \n" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 0, 263 | "metadata": { 264 | "application/vnd.databricks.v1+cell": { 265 | "cellMetadata": { 266 | "byteLimit": 2048000, 267 | "rowLimit": 10000 268 | }, 269 | "inputWidgets": {}, 270 | "nuid": "38606a23-aa13-49ca-b7dc-d669dd472f55", 271 | "showTitle": true, 272 | "title": "Combining Aggregations" 273 | } 274 | }, 275 | "outputs": [ 276 | { 277 | "output_type": "stream", 278 | "name": "stdout", 279 | "output_type": "stream", 280 | "text": [ 281 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| Robert| Sales| 4000| 38|\n| John| Field-eng| 3500| 40|\n| Kelly| Finance| 3500| 35|\n+--------+----------+------+---+\n\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "# Sort the data based on the salary column in descending order \n", 287 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30 AND Salary > 3000 ORDER BY Salary DESC\") \n", 288 | "# Display the results \n", 289 | "filtered_data.show()\n" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 0, 295 | "metadata": { 296 | "application/vnd.databricks.v1+cell": { 297 | "cellMetadata": { 298 | "byteLimit": 2048000, 299 | "rowLimit": 10000 300 | }, 301 | "inputWidgets": {}, 302 | "nuid": "7701c892-0883-4bc2-9b5e-f51cf55fcc78", 303 | "showTitle": true, 304 | "title": "Grouping data" 305 | } 306 | }, 307 | "outputs": [ 308 | { 309 | "output_type": "stream", 310 | "name": "stdout", 311 | "output_type": "stream", 312 | "text": [ 313 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Sales|3066.6666666666665|\n| Finance| 3375.0|\n| Field-eng| 3500.0|\n+----------+------------------+\n\n" 314 | ] 315 | } 316 | ], 317 | "source": [ 318 | "# Group the data 
based on the Department column and take average salary for each department \n", 319 | "grouped_data = spark.sql(\"SELECT Department, avg(Salary) FROM employees GROUP BY Department\") \n", 320 | "# Display the results \n", 321 | "grouped_data.show()\n" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 0, 327 | "metadata": { 328 | "application/vnd.databricks.v1+cell": { 329 | "cellMetadata": { 330 | "byteLimit": 2048000, 331 | "rowLimit": 10000 332 | }, 333 | "inputWidgets": {}, 334 | "nuid": "abafb986-88fa-4c43-8f65-6f7bbfa70ec6", 335 | "showTitle": true, 336 | "title": "Grouping with multiple aggregations" 337 | } 338 | }, 339 | "outputs": [ 340 | { 341 | "output_type": "stream", 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "+----------+------------+----------+\n|Department|total_salary|max_salary|\n+----------+------------+----------+\n| Sales| 9200| 4000|\n| Finance| 13500| 3500|\n| Field-eng| 3500| 3500|\n+----------+------------+----------+\n\n" 346 | ] 347 | } 348 | ], 349 | "source": [ 350 | "# Perform grouping and multiple aggregations \n", 351 | "aggregated_data = spark.sql(\"SELECT Department, sum(Salary) AS total_salary, max(Salary) AS max_salary FROM employees GROUP BY Department\") \n", 352 | "\n", 353 | "# Display the results \n", 354 | "aggregated_data.show()\n" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 0, 360 | "metadata": { 361 | "application/vnd.databricks.v1+cell": { 362 | "cellMetadata": { 363 | "byteLimit": 2048000, 364 | "rowLimit": 10000 365 | }, 366 | "inputWidgets": {}, 367 | "nuid": "3b85baa3-2e70-4038-af6c-440346d96d78", 368 | "showTitle": true, 369 | "title": "Window functions" 370 | } 371 | }, 372 | "outputs": [ 373 | { 374 | "output_type": "stream", 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "+---+--------+----------+------+---+--------------+\n| ID|Employee|Department|Salary|Age|cumulative_sum|\n+---+--------+----------+------+---+--------------+\n| 1| John| Field-eng| 3500| 40| 3500|\n| 7| Martin| Finance| 3500| 26| 3500|\n| 3| Maria| Finance| 3500| 28| 7000|\n| 5| Kelly| Finance| 3500| 35| 10500|\n| 6| Kate| Finance| 3000| 45| 13500|\n| 4| Michael| Sales| 3000| 20| 3000|\n| 8| Kiran| Sales| 2200| 35| 5200|\n| 2| Robert| Sales| 4000| 38| 9200|\n+---+--------+----------+------+---+--------------+\n\n" 379 | ] 380 | } 381 | ], 382 | "source": [ 383 | "from pyspark.sql.window import Window\n", 384 | "from pyspark.sql.functions import col, sum\n", 385 | "\n", 386 | "# Define the window specification\n", 387 | "window_spec = Window.partitionBy(\"Department\").orderBy(\"Age\")\n", 388 | "\n", 389 | "# Calculate the cumulative sum using window function\n", 390 | "df_with_cumulative_sum = salary_data_with_id.withColumn(\"cumulative_sum\", sum(col(\"Salary\")).over(window_spec))\n", 391 | "\n", 392 | "# Display the result\n", 393 | "df_with_cumulative_sum.show()\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 0, 399 | "metadata": { 400 | "application/vnd.databricks.v1+cell": { 401 | "cellMetadata": { 402 | "byteLimit": 2048000, 403 | "rowLimit": 10000 404 | }, 405 | "inputWidgets": {}, 406 | "nuid": "5d8c6978-6e33-47db-8ad6-8af7cd77d522", 407 | "showTitle": true, 408 | "title": "Using udfs" 409 | } 410 | }, 411 | "outputs": [ 412 | { 413 | "output_type": "stream", 414 | "name": "stdout", 415 | "output_type": "stream", 416 | "text": [ 417 | "+---+--------+----------+------+---+----------------+\n| 
ID|Employee|Department|Salary|Age|capitalized_name|\n+---+--------+----------+------+---+----------------+\n| 1| John| Field-eng| 3500| 40| JOHN|\n| 2| Robert| Sales| 4000| 38| ROBERT|\n| 3| Maria| Finance| 3500| 28| MARIA|\n| 4| Michael| Sales| 3000| 20| MICHAEL|\n| 5| Kelly| Finance| 3500| 35| KELLY|\n| 6| Kate| Finance| 3000| 45| KATE|\n| 7| Martin| Finance| 3500| 26| MARTIN|\n| 8| Kiran| Sales| 2200| 35| KIRAN|\n+---+--------+----------+------+---+----------------+\n\n" 418 | ] 419 | } 420 | ], 421 | "source": [ 422 | "from pyspark.sql import SparkSession\n", 423 | "from pyspark.sql.functions import udf\n", 424 | "from pyspark.sql.types import StringType\n", 425 | "\n", 426 | "# Define a UDF to capitalize a string\n", 427 | "capitalize_udf = udf(lambda x: x.upper(), StringType())\n", 428 | "\n", 429 | "# Apply the UDF to a column\n", 430 | "df_with_capitalized_names = salary_data_with_id.withColumn(\"capitalized_name\", capitalize_udf(\"Employee\"))\n", 431 | "\n", 432 | "# Display the result\n", 433 | "df_with_capitalized_names.show()\n" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 0, 439 | "metadata": { 440 | "application/vnd.databricks.v1+cell": { 441 | "cellMetadata": { 442 | "byteLimit": 2048000, 443 | "rowLimit": 10000 444 | }, 445 | "inputWidgets": {}, 446 | "nuid": "19166b9b-a9c2-4e89-959f-abbac8ef8eff", 447 | "showTitle": false, 448 | "title": "" 449 | } 450 | }, 451 | "outputs": [ 452 | { 453 | "output_type": "stream", 454 | "name": "stdout", 455 | "output_type": "stream", 456 | "text": [ 457 | "+---+--------+----------+------+---+----------------+\n| ID|Employee|Department|Salary|Age|capitalized_name|\n+---+--------+----------+------+---+----------------+\n| 1| John| Field-eng| 3500| 40| JOHN|\n| 2| Robert| Sales| 4000| 38| ROBERT|\n| 3| Maria| Finance| 3500| 28| MARIA|\n| 4| Michael| Sales| 3000| 20| MICHAEL|\n| 5| Kelly| Finance| 3500| 35| KELLY|\n| 6| Kate| Finance| 3000| 45| KATE|\n| 7| Martin| Finance| 3500| 26| MARTIN|\n| 8| Kiran| Sales| 2200| 35| KIRAN|\n+---+--------+----------+------+---+----------------+\n\n" 458 | ] 459 | } 460 | ], 461 | "source": [ 462 | "from pyspark.sql.functions import udf\n", 463 | "from pyspark.sql.types import StringType\n", 464 | "\n", 465 | "# Define a UDF to capitalize a string\n", 466 | "capitalize_udf = udf(lambda x: x.upper(), StringType())\n", 467 | "\n", 468 | "# Apply the UDF to a column\n", 469 | "df_with_capitalized_names = salary_data_with_id.withColumn(\"capitalized_name\", capitalize_udf(\"Employee\"))\n", 470 | "\n", 471 | "# Display the result\n", 472 | "df_with_capitalized_names.show()" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 0, 478 | "metadata": { 479 | "application/vnd.databricks.v1+cell": { 480 | "cellMetadata": { 481 | "byteLimit": 2048000, 482 | "rowLimit": 10000 483 | }, 484 | "inputWidgets": {}, 485 | "nuid": "df1d9134-f8bf-4597-9b46-1e0142a0acac", 486 | "showTitle": true, 487 | "title": "Applying functions" 488 | } 489 | }, 490 | "outputs": [ 491 | { 492 | "output_type": "stream", 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "+-----------------------+\n|pandas_plus_one(Salary)|\n+-----------------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+-----------------------+\n\n" 497 | ] 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "from pyspark.sql.functions import pandas_udf\n", 503 | "\n", 504 | "@pandas_udf('long')\n", 505 | "def pandas_plus_one(series: 
pd.Series) -> pd.Series:\n", 506 | " # Simply plus one by using pandas Series.\n", 507 | " return series + 1\n", 508 | "\n", 509 | "salary_data_with_id.select(pandas_plus_one(salary_data_with_id.Salary)).show()\n" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 0, 515 | "metadata": { 516 | "application/vnd.databricks.v1+cell": { 517 | "cellMetadata": { 518 | "byteLimit": 2048000, 519 | "rowLimit": 10000 520 | }, 521 | "inputWidgets": {}, 522 | "nuid": "d0141f5b-f209-4f66-928f-1d631a99ca58", 523 | "showTitle": true, 524 | "title": "Pandas udfs" 525 | } 526 | }, 527 | "outputs": [ 528 | { 529 | "output_type": "stream", 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "+---------------+\n|add_one(Salary)|\n+---------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+---------------+\n\n" 534 | ] 535 | } 536 | ], 537 | "source": [ 538 | "@pandas_udf(\"integer\")\n", 539 | "def add_one(s: pd.Series) -> pd.Series:\n", 540 | " return s + 1\n", 541 | "\n", 542 | "spark.udf.register(\"add_one\", add_one)\n", 543 | "spark.sql(\"SELECT add_one(Salary) FROM employees\").show()\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 0, 549 | "metadata": { 550 | "application/vnd.databricks.v1+cell": { 551 | "cellMetadata": {}, 552 | "inputWidgets": {}, 553 | "nuid": "4f6c3ea0-f650-4806-93cf-0368b01c2dd2", 554 | "showTitle": false, 555 | "title": "" 556 | } 557 | }, 558 | "outputs": [], 559 | "source": [] 560 | } 561 | ], 562 | "metadata": { 563 | "application/vnd.databricks.v1+notebook": { 564 | "dashboards": [], 565 | "language": "python", 566 | "notebookMetadata": { 567 | "mostRecentlyExecutedCommandWithImplicitDF": { 568 | "commandId": 969987236417588, 569 | "dataframes": [ 570 | "_sqldf" 571 | ] 572 | }, 573 | "pythonIndentUnit": 2 574 | }, 575 | "notebookName": "Chapter 6 Code", 576 | "widgets": {} 577 | } 578 | }, 579 | "nbformat": 4, 580 | "nbformat_minor": 0 581 | } 582 | -------------------------------------------------------------------------------- /Chapter05/Chapter 5 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 5: Advanced Operations in Spark Code" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 0, 21 | "metadata": { 22 | "application/vnd.databricks.v1+cell": { 23 | "cellMetadata": { 24 | "byteLimit": 2048000, 25 | "rowLimit": 10000 26 | }, 27 | "inputWidgets": {}, 28 | "nuid": "0c029f8c-dfbc-4e10-b09d-ccbea7b62eec", 29 | "showTitle": true, 30 | "title": "Create Salary dataframe" 31 | } 32 | }, 33 | "outputs": [ 34 | { 35 | "output_type": "stream", 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 
4500|\n+--------+----------+------+\n\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n", 45 | " (\"Michael\", \"Field-eng\", 4500), \n", 46 | " (\"Robert\", None, 4000), \n", 47 | " (\"Maria\", \"Finance\", 3500), \n", 48 | " (\"John\", \"Sales\", 3000), \n", 49 | " (\"Kelly\", \"Finance\", 3500), \n", 50 | " (\"Kate\", \"Finance\", 3000), \n", 51 | " (\"Martin\", None, 3500), \n", 52 | " (\"Kiran\", \"Sales\", 2200), \n", 53 | " (\"Michael\", \"Field-eng\", 4500) \n", 54 | " ]\n", 55 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n", 56 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n", 57 | "salary_data.printSchema()\n", 58 | "salary_data.show()\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 0, 64 | "metadata": { 65 | "application/vnd.databricks.v1+cell": { 66 | "cellMetadata": { 67 | "byteLimit": 2048000, 68 | "rowLimit": 10000 69 | }, 70 | "inputWidgets": {}, 71 | "nuid": "5c64523b-97f8-4cdf-8c73-13723a7f7453", 72 | "showTitle": true, 73 | "title": "Using Groupby in a Dataframe" 74 | } 75 | }, 76 | "outputs": [ 77 | { 78 | "output_type": "execute_result", 79 | "data": { 80 | "text/plain": [ 81 | "GroupedData[grouping expressions: [Department], value: [Employee: string, Department: string, Salary: bigint], type: GroupBy]" 82 | ] 83 | }, 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "salary_data.groupby('Department')" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 0, 96 | "metadata": { 97 | "application/vnd.databricks.v1+cell": { 98 | "cellMetadata": { 99 | "byteLimit": 2048000, 100 | "rowLimit": 10000 101 | }, 102 | "inputWidgets": {}, 103 | "nuid": "73e2c600-8160-4138-968f-835e6757f06c", 104 | "showTitle": false, 105 | "title": "" 106 | } 107 | }, 108 | "outputs": [ 109 | { 110 | "output_type": "stream", 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Field-eng| 4166.666666666667|\n| Sales| 2600.0|\n| NULL| 3750.0|\n| Finance|3333.3333333333335|\n+----------+------------------+\n\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "salary_data.groupby('Department').avg().show()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 0, 125 | "metadata": { 126 | "application/vnd.databricks.v1+cell": { 127 | "cellMetadata": { 128 | "byteLimit": 2048000, 129 | "rowLimit": 10000 130 | }, 131 | "inputWidgets": {}, 132 | "nuid": "d437c9f0-2336-4687-83b4-7c8142b4085f", 133 | "showTitle": true, 134 | "title": "Complex Groupby Statement" 135 | } 136 | }, 137 | "outputs": [ 138 | { 139 | "output_type": "stream", 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "+----------+------+\n|Department|Salary|\n+----------+------+\n| NULL| 7500|\n| Field-eng| 12500|\n| Finance| 10000|\n| Sales| 5200|\n+----------+------+\n\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "from pyspark.sql.functions import col, round\n", 149 | "\n", 150 | "salary_data.groupBy('Department')\\\n", 151 | " .sum('Salary')\\\n", 152 | " .withColumn('sum(Salary)',round(col('sum(Salary)'), 2))\\\n", 153 | " .withColumnRenamed('sum(Salary)', 'Salary')\\\n", 154 | " .orderBy('Department')\\\n", 155 | " .show()\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 0, 161 | "metadata": { 162 | 
"application/vnd.databricks.v1+cell": { 163 | "cellMetadata": { 164 | "byteLimit": 2048000, 165 | "rowLimit": 10000 166 | }, 167 | "inputWidgets": {}, 168 | "nuid": "dfc73dea-aa0c-4a54-aded-a4c3814f01a9", 169 | "showTitle": true, 170 | "title": "Joining Dataframes in Spark" 171 | } 172 | }, 173 | "outputs": [ 174 | { 175 | "output_type": "stream", 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500), \\\n", 185 | " (2, \"Robert\", \"Sales\", 4000), \\\n", 186 | " (3, \"Maria\", \"Finance\", 3500), \\\n", 187 | " (4, \"Michael\", \"Sales\", 3000), \\\n", 188 | " (5, \"Kelly\", \"Finance\", 3500), \\\n", 189 | " (6, \"Kate\", \"Finance\", 3000), \\\n", 190 | " (7, \"Martin\", \"Finance\", 3500), \\\n", 191 | " (8, \"Kiran\", \"Sales\", 2200), \\\n", 192 | " ]\n", 193 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n", 194 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n", 195 | "salary_data_with_id.show()\n" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 0, 201 | "metadata": { 202 | "application/vnd.databricks.v1+cell": { 203 | "cellMetadata": { 204 | "byteLimit": 2048000, 205 | "rowLimit": 10000 206 | }, 207 | "inputWidgets": {}, 208 | "nuid": "125e73d8-c716-4e1c-8900-859c1ec666e9", 209 | "showTitle": true, 210 | "title": "Employee data" 211 | } 212 | }, 213 | "outputs": [ 214 | { 215 | "output_type": "stream", 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "+---+-----+------+\n| ID|State|Gender|\n+---+-----+------+\n| 1| NY| M|\n| 2| NC| M|\n| 3| NY| F|\n| 4| TX| M|\n| 5| NY| F|\n| 6| AZ| F|\n+---+-----+------+\n\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "employee_data = [(1, \"NY\", \"M\"), \\\n", 225 | " (2, \"NC\", \"M\"), \\\n", 226 | " (3, \"NY\", \"F\"), \\\n", 227 | " (4, \"TX\", \"M\"), \\\n", 228 | " (5, \"NY\", \"F\"), \\\n", 229 | " (6, \"AZ\", \"F\") \\\n", 230 | " ]\n", 231 | "columns= [\"ID\", \"State\", \"Gender\"]\n", 232 | "employee_data = spark.createDataFrame(data = employee_data, schema = columns)\n", 233 | "employee_data.show()\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 0, 239 | "metadata": { 240 | "application/vnd.databricks.v1+cell": { 241 | "cellMetadata": { 242 | "byteLimit": 2048000, 243 | "rowLimit": 10000 244 | }, 245 | "inputWidgets": {}, 246 | "nuid": "c0137bf4-d318-4417-86ca-df79f2fb80be", 247 | "showTitle": true, 248 | "title": "Inner join" 249 | } 250 | }, 251 | "outputs": [ 252 | { 253 | "output_type": "stream", 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| 
F|\n+---+--------+----------+------+---+-----+------+\n\n" 258 | ] 259 | } 260 | ], 261 | "source": [ 262 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"inner\").show()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 0, 268 | "metadata": { 269 | "application/vnd.databricks.v1+cell": { 270 | "cellMetadata": { 271 | "byteLimit": 2048000, 272 | "rowLimit": 10000 273 | }, 274 | "inputWidgets": {}, 275 | "nuid": "f34ff657-b0dd-4485-96f0-6d7c6126a1bd", 276 | "showTitle": true, 277 | "title": "Outer join" 278 | } 279 | }, 280 | "outputs": [ 281 | { 282 | "output_type": "stream", 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"outer\").show()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 0, 297 | "metadata": { 298 | "application/vnd.databricks.v1+cell": { 299 | "cellMetadata": { 300 | "byteLimit": 2048000, 301 | "rowLimit": 10000 302 | }, 303 | "inputWidgets": {}, 304 | "nuid": "868ca315-ab44-4eb6-b8f1-92481d770911", 305 | "showTitle": true, 306 | "title": "Left join" 307 | } 308 | }, 309 | "outputs": [ 310 | { 311 | "output_type": "stream", 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"left\").show()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 0, 326 | "metadata": { 327 | "application/vnd.databricks.v1+cell": { 328 | "cellMetadata": { 329 | "byteLimit": 2048000, 330 | "rowLimit": 10000 331 | }, 332 | "inputWidgets": {}, 333 | "nuid": "4cba2965-54b3-4d04-a456-77e9d9af6e1f", 334 | "showTitle": true, 335 | "title": "Right join" 336 | } 337 | }, 338 | "outputs": [ 339 | { 340 | "output_type": "stream", 341 | "name": "stdout", 342 | "output_type": "stream", 343 | "text": [ 344 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n+---+--------+----------+------+---+-----+------+\n\n" 345 
| ] 346 | } 347 | ], 348 | "source": [ 349 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"right\").show()" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 0, 355 | "metadata": { 356 | "application/vnd.databricks.v1+cell": { 357 | "cellMetadata": { 358 | "byteLimit": 2048000, 359 | "rowLimit": 10000 360 | }, 361 | "inputWidgets": {}, 362 | "nuid": "dd9f95c1-4109-4ceb-925d-7b10cf838fdd", 363 | "showTitle": true, 364 | "title": "Union" 365 | } 366 | }, 367 | "outputs": [ 368 | { 369 | "output_type": "stream", 370 | "name": "stdout", 371 | "output_type": "stream", 372 | "text": [ 373 | "root\n |-- ID: long (nullable = true)\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "salary_data_with_id_2 = [(1, \"John\", \"Field-eng\", 3500), \\\n", 379 | " (2, \"Robert\", \"Sales\", 4000), \\\n", 380 | " (3, \"Aliya\", \"Finance\", 3500), \\\n", 381 | " (4, \"Nate\", \"Sales\", 3000), \\\n", 382 | " ]\n", 383 | "columns2= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n", 384 | "\n", 385 | "salary_data_with_id_2 = spark.createDataFrame(data = salary_data_with_id_2, schema = columns2)\n", 386 | "\n", 387 | "salary_data_with_id_2.printSchema()\n", 388 | "salary_data_with_id_2.show(truncate=False)\n", 389 | "\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 0, 395 | "metadata": { 396 | "application/vnd.databricks.v1+cell": { 397 | "cellMetadata": { 398 | "byteLimit": 2048000, 399 | "rowLimit": 10000 400 | }, 401 | "inputWidgets": {}, 402 | "nuid": "2eb3d433-2a89-47b4-9d21-2d79194809c1", 403 | "showTitle": false, 404 | "title": "" 405 | } 406 | }, 407 | "outputs": [ 408 | { 409 | "output_type": "stream", 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Maria |Finance |3500 |\n|4 |Michael |Sales |3000 |\n|5 |Kelly |Finance |3500 |\n|6 |Kate |Finance |3000 |\n|7 |Martin |Finance |3500 |\n|8 |Kiran |Sales |2200 |\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "unionDF = salary_data_with_id.union(salary_data_with_id_2)\n", 419 | "unionDF.show(truncate=False)\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": { 425 | "application/vnd.databricks.v1+cell": { 426 | "cellMetadata": {}, 427 | "inputWidgets": {}, 428 | "nuid": "d84b0031-6f62-41a8-9533-b510a487ab0f", 429 | "showTitle": false, 430 | "title": "" 431 | } 432 | }, 433 | "source": [ 434 | "Reading and Writing Data" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 0, 440 | "metadata": { 441 | "application/vnd.databricks.v1+cell": { 442 | "cellMetadata": { 443 | "byteLimit": 2048000, 444 | "rowLimit": 10000 445 | }, 446 | "inputWidgets": {}, 447 | "nuid": "d3c8eb85-7d75-4010-977d-370b7940b57e", 448 | "showTitle": true, 449 | "title": "Reading and writing CSV files" 450 | } 451 | }, 452 | "outputs": [ 
453 | { 454 | "output_type": "stream", 455 | "name": "stdout", 456 | "output_type": "stream", 457 | "text": [ 458 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 459 | ] 460 | } 461 | ], 462 | "source": [ 463 | "\n", 464 | "salary_data_with_id.write.csv('salary_data.csv', mode='overwrite', header=True)\n", 465 | "spark.read.csv('/salary_data.csv', header=True).show()\n" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 0, 471 | "metadata": { 472 | "application/vnd.databricks.v1+cell": { 473 | "cellMetadata": { 474 | "byteLimit": 2048000, 475 | "rowLimit": 10000 476 | }, 477 | "inputWidgets": {}, 478 | "nuid": "b033bc47-7a90-4ae1-b37b-692860e06482", 479 | "showTitle": false, 480 | "title": "" 481 | } 482 | }, 483 | "outputs": [ 484 | { 485 | "output_type": "stream", 486 | "name": "stdout", 487 | "output_type": "stream", 488 | "text": [ 489 | "+---+-------+---------+\n| ID| State| Gender|\n+---+-------+---------+\n| 1| John|Field-eng|\n| 2| Robert| Sales|\n| 3| Maria| Finance|\n| 4|Michael| Sales|\n| 5| Kelly| Finance|\n| 6| Kate| Finance|\n| 7| Martin| Finance|\n| 8| Kiran| Sales|\n+---+-------+---------+\n\n" 490 | ] 491 | } 492 | ], 493 | "source": [ 494 | "from pyspark.sql.types import *\n", 495 | "\n", 496 | "filePath = '/salary_data.csv'\n", 497 | "columns= [\"ID\", \"State\", \"Gender\"] \n", 498 | "schema = StructType([\n", 499 | " StructField(\"ID\", IntegerType(),True),\n", 500 | " StructField(\"State\", StringType(),True),\n", 501 | " StructField(\"Gender\", StringType(),True)\n", 502 | "])\n", 503 | " \n", 504 | "read_data = spark.read.format(\"csv\").option(\"header\",\"true\").schema(schema).load(filePath)\n", 505 | "read_data.show()\n" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 0, 511 | "metadata": { 512 | "application/vnd.databricks.v1+cell": { 513 | "cellMetadata": { 514 | "byteLimit": 2048000, 515 | "rowLimit": 10000 516 | }, 517 | "inputWidgets": {}, 518 | "nuid": "bfd8f639-d141-48c9-be8e-dffd764aa0ee", 519 | "showTitle": true, 520 | "title": "Reading and writing Parquet files" 521 | } 522 | }, 523 | "outputs": [ 524 | { 525 | "output_type": "stream", 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "salary_data_with_id.write.parquet('salary_data.parquet', mode='overwrite')\n", 535 | "spark.read.parquet('/salary_data.parquet').show()\n" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 0, 541 | "metadata": { 542 | "application/vnd.databricks.v1+cell": { 543 | "cellMetadata": { 544 | "byteLimit": 2048000, 545 | "rowLimit": 10000 546 | }, 547 | "inputWidgets": {}, 548 | "nuid": "492b344b-3719-44cd-a8dc-034d20f3a409", 549 | "showTitle": true, 550 | "title": "Reading and writing ORC files" 551 | } 552 | }, 553 | "outputs": [ 554 | { 555 | 
"output_type": "stream", 556 | "name": "stdout", 557 | "output_type": "stream", 558 | "text": [ 559 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n+---+--------+----------+------+\n\n" 560 | ] 561 | } 562 | ], 563 | "source": [ 564 | "salary_data_with_id.write.orc('salary_data.orc', mode='overwrite')\n", 565 | "spark.read.orc('/salary_data.orc').show()" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 0, 571 | "metadata": { 572 | "application/vnd.databricks.v1+cell": { 573 | "cellMetadata": { 574 | "byteLimit": 2048000, 575 | "rowLimit": 10000 576 | }, 577 | "inputWidgets": {}, 578 | "nuid": "9b3c1309-4a00-4a92-ac3e-7f2a9d491445", 579 | "showTitle": true, 580 | "title": "Reading and writing Delta files" 581 | } 582 | }, 583 | "outputs": [ 584 | { 585 | "output_type": "stream", 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "salary_data_with_id.write.format(\"delta\").save(\"/FileStore/tables/salary_data_with_id\", mode='overwrite')\n", 595 | "df = spark.read.load(\"/FileStore/tables/salary_data_with_id\")\n", 596 | "df.show()\n" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 0, 602 | "metadata": { 603 | "application/vnd.databricks.v1+cell": { 604 | "cellMetadata": { 605 | "byteLimit": 2048000, 606 | "rowLimit": 10000 607 | }, 608 | "inputWidgets": {}, 609 | "nuid": "d616d17f-7848-4527-aae3-78eec9d3214d", 610 | "showTitle": true, 611 | "title": "Using SQL in Spark" 612 | } 613 | }, 614 | "outputs": [ 615 | { 616 | "output_type": "stream", 617 | "name": "stdout", 618 | "output_type": "stream", 619 | "text": [ 620 | "+--------+\n|count(1)|\n+--------+\n| 8|\n+--------+\n\n" 621 | ] 622 | } 623 | ], 624 | "source": [ 625 | "salary_data_with_id.createOrReplaceTempView(\"SalaryTable\")\n", 626 | "spark.sql(\"SELECT count(*) from SalaryTable\").show()\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": { 632 | "application/vnd.databricks.v1+cell": { 633 | "cellMetadata": {}, 634 | "inputWidgets": {}, 635 | "nuid": "f549a552-a92a-477c-bcbd-0eaf6104c207", 636 | "showTitle": false, 637 | "title": "" 638 | } 639 | }, 640 | "source": [ 641 | "Catalyst Optimizer" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 0, 647 | "metadata": { 648 | "application/vnd.databricks.v1+cell": { 649 | "cellMetadata": { 650 | "byteLimit": 2048000, 651 | "rowLimit": 10000 652 | }, 653 | "inputWidgets": {}, 654 | "nuid": "b66004b0-07ac-4c06-966e-1370a2e1b3d6", 655 | "showTitle": true, 656 | "title": "Catalyst Optimizer in Action" 657 | } 658 | }, 659 | "outputs": [ 660 | { 661 | "output_type": "stream", 662 | "name": "stdout", 663 | "output_type": "stream", 664 | "text": [ 665 | "== Physical Plan ==\n*(1) Project [employee#129490, department#129491]\n+- *(1) Filter (isnotnull(salary#129492) AND 
(salary#129492 > 3500))\n +- FileScan csv [Employee#129490,Department#129491,Salary#129492] Batched: false, DataFilters: [isnotnull(Salary#129492), (Salary#129492 > 3500)], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/salary_data.csv], PartitionFilters: [], PushedFilters: [IsNotNull(Salary), GreaterThan(Salary,3500)], ReadSchema: struct\n\n\n" 666 | ] 667 | } 668 | ], 669 | "source": [ 670 | "# SparkSession setup \n", 671 | "from pyspark.sql import SparkSession \n", 672 | "spark = SparkSession.builder.appName(\"CatalystOptimizerExample\").getOrCreate() \n", 673 | "# Load data \n", 674 | "df = spark.read.csv(\"/salary_data.csv\", header=True, inferSchema=True) \n", 675 | "# Query with Catalyst Optimizer \n", 676 | "result_df = df.select(\"employee\", \"department\").filter(df[\"salary\"] > 3500) \n", 677 | "# Explain the optimized query plan \n", 678 | "result_df.explain() \n" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 0, 684 | "metadata": { 685 | "application/vnd.databricks.v1+cell": { 686 | "cellMetadata": { 687 | "byteLimit": 2048000, 688 | "rowLimit": 10000 689 | }, 690 | "inputWidgets": {}, 691 | "nuid": "08ba28ee-80d0-4210-acb7-4a45bee2815b", 692 | "showTitle": true, 693 | "title": "Unpersisting Data" 694 | } 695 | }, 696 | "outputs": [ 697 | { 698 | "output_type": "execute_result", 699 | "data": { 700 | "text/plain": [ 701 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 702 | ] 703 | }, 704 | "execution_count": 24, 705 | "metadata": {}, 706 | "output_type": "execute_result" 707 | } 708 | ], 709 | "source": [ 710 | "# Cache a DataFrame \n", 711 | "df.cache() \n", 712 | "# Unpersist the cached DataFrame \n", 713 | "df.unpersist() \n" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 0, 719 | "metadata": { 720 | "application/vnd.databricks.v1+cell": { 721 | "cellMetadata": { 722 | "byteLimit": 2048000, 723 | "rowLimit": 10000 724 | }, 725 | "inputWidgets": {}, 726 | "nuid": "25bff695-c206-4800-80c0-e6dde6962438", 727 | "showTitle": true, 728 | "title": "Repartitioning Data" 729 | } 730 | }, 731 | "outputs": [ 732 | { 733 | "output_type": "execute_result", 734 | "data": { 735 | "text/plain": [ 736 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 737 | ] 738 | }, 739 | "execution_count": 25, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "# Repartition a DataFrame into 8 partitions \n", 746 | "df.repartition(8) \n" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 0, 752 | "metadata": { 753 | "application/vnd.databricks.v1+cell": { 754 | "cellMetadata": { 755 | "byteLimit": 2048000, 756 | "rowLimit": 10000 757 | }, 758 | "inputWidgets": {}, 759 | "nuid": "3eb87f1e-ba2d-47fb-a7c1-b74e1f293598", 760 | "showTitle": true, 761 | "title": "Coalescing Data" 762 | } 763 | }, 764 | "outputs": [ 765 | { 766 | "output_type": "execute_result", 767 | "data": { 768 | "text/plain": [ 769 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 770 | ] 771 | }, 772 | "execution_count": 26, 773 | "metadata": {}, 774 | "output_type": "execute_result" 775 | } 776 | ], 777 | "source": [ 778 | "# Coalesce a DataFrame to 4 partitions \n", 779 | "df.coalesce(4) \n" 780 | ] 781 | } 782 | ], 783 | "metadata": { 784 | "application/vnd.databricks.v1+notebook": { 785 | "dashboards": [], 786 | "language": "python", 787 | "notebookMetadata": { 788 | "mostRecentlyExecutedCommandWithImplicitDF": { 789 | 
"commandId": 969987236417588, 790 | "dataframes": [ 791 | "_sqldf" 792 | ] 793 | }, 794 | "pythonIndentUnit": 2 795 | }, 796 | "notebookName": "Chapter 5 Code", 797 | "widgets": {} 798 | } 799 | }, 800 | "nbformat": 4, 801 | "nbformat_minor": 0 802 | } 803 | -------------------------------------------------------------------------------- /Chapter04/Chapter 4 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 4 : Spark Dataframes and Operations Code" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "application/vnd.databricks.v1+cell": { 22 | "cellMetadata": {}, 23 | "inputWidgets": {}, 24 | "nuid": "c8b85703-a9de-4ac7-892a-b1fb92ac4442", 25 | "showTitle": false, 26 | "title": "" 27 | } 28 | }, 29 | "source": [ 30 | "Create Dataframe Operations" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 0, 36 | "metadata": { 37 | "application/vnd.databricks.v1+cell": { 38 | "cellMetadata": { 39 | "byteLimit": 2048000, 40 | "rowLimit": 10000 41 | }, 42 | "inputWidgets": {}, 43 | "nuid": "0269a412-f57f-4c05-b079-4f7236b5cbc8", 44 | "showTitle": true, 45 | "title": "Create Dataframe from list of rows" 46 | } 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "import pandas as pd\n", 51 | "from datetime import datetime, date\n", 52 | "from pyspark.sql import Row\n", 53 | "\n", 54 | "data_df = spark.createDataFrame([\n", 55 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n", 56 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n", 57 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n", 58 | "])\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 0, 64 | "metadata": { 65 | "application/vnd.databricks.v1+cell": { 66 | "cellMetadata": { 67 | "byteLimit": 2048000, 68 | "rowLimit": 10000 69 | }, 70 | "inputWidgets": {}, 71 | "nuid": "70b00e29-29b9-47c6-9353-aecc105f5aba", 72 | "showTitle": true, 73 | "title": "Create Dataframe from list of rows using schema" 74 | } 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "import pandas as pd\n", 79 | "from datetime import datetime, date\n", 80 | "from pyspark.sql import Row\n", 81 | "\n", 82 | "data_df = spark.createDataFrame([\n", 83 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n", 84 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n", 85 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n", 86 | "], schema=' col_1 long, col_2 double, col_3 string, col_4 date, col_5 timestamp')\n" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 0, 92 | "metadata": { 93 | "application/vnd.databricks.v1+cell": { 94 | "cellMetadata": { 95 | "byteLimit": 2048000, 96 | "rowLimit": 10000 97 | }, 98 | "inputWidgets": {}, 99 | "nuid": "d0238c11-4d56-4175-89b2-bb4ddcc4976a", 100 | "showTitle": true, 101 | "title": "Create Dataframe from pandas dataframe" 102 | } 103 | 
}, 104 | "outputs": [], 105 | "source": [ 106 | "import pandas as pd\n", 107 | "from datetime import datetime, date\n", 108 | "from pyspark.sql import Row\n", 109 | "\n", 110 | "pandas_df = pd.DataFrame({\n", 111 | " 'col_1': [100, 200, 400],\n", 112 | " 'col_2': [200., 300., 500.],\n", 113 | " 'col_3': ['string_test_1', 'string_test_2', 'string_test_3'],\n", 114 | " 'col_4': [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],\n", 115 | " 'col_5': [datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]\n", 116 | "})\n", 117 | "df = spark.createDataFrame(pandas_df)\n" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 0, 123 | "metadata": { 124 | "application/vnd.databricks.v1+cell": { 125 | "cellMetadata": { 126 | "byteLimit": 2048000, 127 | "rowLimit": 10000 128 | }, 129 | "inputWidgets": {}, 130 | "nuid": "a1791d2e-ff43-4557-ad31-a2d92c3a21a8", 131 | "showTitle": false, 132 | "title": "" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "from datetime import datetime, date\n", 138 | "from pyspark.sql import SparkSession\n", 139 | "\n", 140 | "spark = SparkSession.builder.getOrCreate()\n", 141 | "\n", 142 | "rdd = spark.sparkContext.parallelize([\n", 143 | " (100, 200., 'string_test_1', date(2023, 1, 1), datetime(2023, 1, 1, 12, 0)),\n", 144 | " (200, 300., 'string_test_2', date(2023, 2, 1), datetime(2023, 1, 2, 12, 0)),\n", 145 | " (300, 400., 'string_test_3', date(2023, 3, 1), datetime(2023, 1, 3, 12, 0))\n", 146 | "])\n", 147 | "data_df = spark.createDataFrame(rdd, schema=['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "application/vnd.databricks.v1+cell": { 154 | "cellMetadata": {}, 155 | "inputWidgets": {}, 156 | "nuid": "d87a4498-fa76-444e-984b-25cec32fb37c", 157 | "showTitle": false, 158 | "title": "" 159 | } 160 | }, 161 | "source": [ 162 | "How to View the Dataframes" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 0, 168 | "metadata": { 169 | "application/vnd.databricks.v1+cell": { 170 | "cellMetadata": { 171 | "byteLimit": 2048000, 172 | "rowLimit": 10000 173 | }, 174 | "inputWidgets": {}, 175 | "nuid": "d4e5a140-b2bd-4cf3-8798-6fecb6164064", 176 | "showTitle": true, 177 | "title": "Viewing DataFrames " 178 | } 179 | }, 180 | "outputs": [ 181 | { 182 | "output_type": "stream", 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00|\n+-----+-----+-------------+----------+-------------------+\n\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "data_df.show()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 0, 197 | "metadata": { 198 | "application/vnd.databricks.v1+cell": { 199 | "cellMetadata": { 200 | "byteLimit": 2048000, 201 | "rowLimit": 10000 202 | }, 203 | "inputWidgets": {}, 204 | "nuid": "465e9144-4bf7-472d-bb62-35dad761c240", 205 | "showTitle": true, 206 | "title": "Viewing top n rows" 207 | } 208 | }, 209 | "outputs": [ 210 | { 211 | "output_type": "stream", 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| 
col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n+-----+-----+-------------+----------+-------------------+\nonly showing top 2 rows\n\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "data_df.show(2)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 0, 226 | "metadata": { 227 | "application/vnd.databricks.v1+cell": { 228 | "cellMetadata": { 229 | "byteLimit": 2048000, 230 | "rowLimit": 10000 231 | }, 232 | "inputWidgets": {}, 233 | "nuid": "3882bfa7-6fa4-4a7c-b039-d919edc5fb07", 234 | "showTitle": true, 235 | "title": "Viewing DataFrame schema" 236 | } 237 | }, 238 | "outputs": [ 239 | { 240 | "output_type": "stream", 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "root\n |-- col_1: long (nullable = true)\n |-- col_2: double (nullable = true)\n |-- col_3: string (nullable = true)\n |-- col_4: date (nullable = true)\n |-- col_5: timestamp (nullable = true)\n\n" 245 | ] 246 | } 247 | ], 248 | "source": [ 249 | "data_df.printSchema()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 0, 255 | "metadata": { 256 | "application/vnd.databricks.v1+cell": { 257 | "cellMetadata": { 258 | "byteLimit": 2048000, 259 | "rowLimit": 10000 260 | }, 261 | "inputWidgets": {}, 262 | "nuid": "49b81183-36c8-4862-b2fb-4778ceb6d16c", 263 | "showTitle": true, 264 | "title": "Viewing data vertically" 265 | } 266 | }, 267 | "outputs": [ 268 | { 269 | "output_type": "stream", 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "-RECORD 0--------------------\n col_1 | 100 \n col_2 | 200.0 \n col_3 | string_test_1 \n col_4 | 2023-01-01 \n col_5 | 2023-01-01 12:00:00 \nonly showing top 1 row\n\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "data_df.show(1, vertical=True)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 0, 284 | "metadata": { 285 | "application/vnd.databricks.v1+cell": { 286 | "cellMetadata": { 287 | "byteLimit": 2048000, 288 | "rowLimit": 10000 289 | }, 290 | "inputWidgets": {}, 291 | "nuid": "18974b22-4fae-4787-aacb-67eb458c20a2", 292 | "showTitle": true, 293 | "title": "Viewing columns of data " 294 | } 295 | }, 296 | "outputs": [ 297 | { 298 | "output_type": "execute_result", 299 | "data": { 300 | "text/plain": [ 301 | "['col_1', 'col_2', 'col_3', 'col_4', 'col_5']" 302 | ] 303 | }, 304 | "execution_count": 7, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "data_df.columns" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 0, 316 | "metadata": { 317 | "application/vnd.databricks.v1+cell": { 318 | "cellMetadata": { 319 | "byteLimit": 2048000, 320 | "rowLimit": 10000 321 | }, 322 | "inputWidgets": {}, 323 | "nuid": "f607973a-991a-4892-ada8-a6f8e2daf5d1", 324 | "showTitle": true, 325 | "title": "Counting number of rows of data" 326 | } 327 | }, 328 | "outputs": [ 329 | { 330 | "output_type": "execute_result", 331 | "data": { 332 | "text/plain": [ 333 | "3" 334 | ] 335 | }, 336 | "execution_count": 8, 337 | "metadata": {}, 338 | "output_type": "execute_result" 339 | } 340 | ], 341 | "source": [ 342 | "data_df.count()" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 0, 348 | "metadata": { 349 | "application/vnd.databricks.v1+cell": { 350 | "cellMetadata": { 351 | "byteLimit": 2048000, 352 | "rowLimit": 
10000 353 | }, 354 | "inputWidgets": {}, 355 | "nuid": "3513085f-5679-4735-9085-7e7b3de398b4", 356 | "showTitle": true, 357 | "title": "Viewing summary statistics " 358 | } 359 | }, 360 | "outputs": [ 361 | { 362 | "output_type": "stream", 363 | "name": "stdout", 364 | "output_type": "stream", 365 | "text": [ 366 | "+-------+-----+-----+-------------+\n|summary|col_1|col_2| col_3|\n+-------+-----+-----+-------------+\n| count| 3| 3| 3|\n| mean|200.0|300.0| NULL|\n| stddev|100.0|100.0| NULL|\n| min| 100|200.0|string_test_1|\n| max| 300|400.0|string_test_3|\n+-------+-----+-----+-------------+\n\n" 367 | ] 368 | } 369 | ], 370 | "source": [ 371 | "data_df.select('col_1', 'col_2', 'col_3').describe().show()" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 0, 377 | "metadata": { 378 | "application/vnd.databricks.v1+cell": { 379 | "cellMetadata": { 380 | "byteLimit": 2048000, 381 | "rowLimit": 10000 382 | }, 383 | "inputWidgets": {}, 384 | "nuid": "550f6a6b-61cc-4913-9220-9cb02962045c", 385 | "showTitle": true, 386 | "title": "Collecting the data" 387 | } 388 | }, 389 | "outputs": [ 390 | { 391 | "output_type": "execute_result", 392 | "data": { 393 | "text/plain": [ 394 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0)),\n", 395 | " Row(col_1=200, col_2=300.0, col_3='string_test_2', col_4=datetime.date(2023, 2, 1), col_5=datetime.datetime(2023, 1, 2, 12, 0)),\n", 396 | " Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]" 397 | ] 398 | }, 399 | "execution_count": 10, 400 | "metadata": {}, 401 | "output_type": "execute_result" 402 | } 403 | ], 404 | "source": [ 405 | "data_df.collect()" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 0, 411 | "metadata": { 412 | "application/vnd.databricks.v1+cell": { 413 | "cellMetadata": { 414 | "byteLimit": 2048000, 415 | "rowLimit": 10000 416 | }, 417 | "inputWidgets": {}, 418 | "nuid": "d1a9456e-7fc7-4b73-bdfe-b3bd45f01f68", 419 | "showTitle": true, 420 | "title": "Using take" 421 | } 422 | }, 423 | "outputs": [ 424 | { 425 | "output_type": "execute_result", 426 | "data": { 427 | "text/plain": [ 428 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]" 429 | ] 430 | }, 431 | "execution_count": 11, 432 | "metadata": {}, 433 | "output_type": "execute_result" 434 | } 435 | ], 436 | "source": [ 437 | "data_df.take(1)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 0, 443 | "metadata": { 444 | "application/vnd.databricks.v1+cell": { 445 | "cellMetadata": { 446 | "byteLimit": 2048000, 447 | "rowLimit": 10000 448 | }, 449 | "inputWidgets": {}, 450 | "nuid": "5f5c4426-d270-40be-9088-c74d727af5b1", 451 | "showTitle": true, 452 | "title": "Using tail" 453 | } 454 | }, 455 | "outputs": [ 456 | { 457 | "output_type": "execute_result", 458 | "data": { 459 | "text/plain": [ 460 | "[Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]" 461 | ] 462 | }, 463 | "execution_count": 12, 464 | "metadata": {}, 465 | "output_type": "execute_result" 466 | } 467 | ], 468 | "source": [ 469 | "data_df.tail(1)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 0, 475 | "metadata": { 476 | "application/vnd.databricks.v1+cell": { 477 | "cellMetadata": { 478 | 
"byteLimit": 2048000, 479 | "rowLimit": 10000 480 | }, 481 | "inputWidgets": {}, 482 | "nuid": "fbdd9347-da63-4575-ac6d-b55b593044ff", 483 | "showTitle": true, 484 | "title": "Using head" 485 | } 486 | }, 487 | "outputs": [ 488 | { 489 | "output_type": "execute_result", 490 | "data": { 491 | "text/plain": [ 492 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]" 493 | ] 494 | }, 495 | "execution_count": 13, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "data_df.head(1)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 0, 507 | "metadata": { 508 | "application/vnd.databricks.v1+cell": { 509 | "cellMetadata": { 510 | "byteLimit": 2048000, 511 | "rowLimit": 10000 512 | }, 513 | "inputWidgets": {}, 514 | "nuid": "2e48afa4-79e2-4ac5-86f0-cb95a8145ef0", 515 | "showTitle": true, 516 | "title": "Converting Pyspark dataframe to Pandas" 517 | } 518 | }, 519 | "outputs": [ 520 | { 521 | "output_type": "execute_result", 522 | "data": { 523 | "text/html": [ 524 | "
\n", 525 | "\n", 538 | "\n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | "
col_1col_2col_3col_4col_5
0100200.0string_test_12023-01-012023-01-01 12:00:00
1200300.0string_test_22023-02-012023-01-02 12:00:00
2300400.0string_test_32023-03-012023-01-03 12:00:00
\n", 576 | "
" 577 | ], 578 | "text/plain": [ 579 | " col_1 col_2 col_3 col_4 col_5\n", 580 | "0 100 200.0 string_test_1 2023-01-01 2023-01-01 12:00:00\n", 581 | "1 200 300.0 string_test_2 2023-02-01 2023-01-02 12:00:00\n", 582 | "2 300 400.0 string_test_3 2023-03-01 2023-01-03 12:00:00" 583 | ] 584 | }, 585 | "execution_count": 15, 586 | "metadata": {}, 587 | "output_type": "execute_result" 588 | } 589 | ], 590 | "source": [ 591 | "data_df.toPandas()" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": { 597 | "application/vnd.databricks.v1+cell": { 598 | "cellMetadata": {}, 599 | "inputWidgets": {}, 600 | "nuid": "f0502651-b54a-45e6-84ae-62ea7e1600ad", 601 | "showTitle": false, 602 | "title": "" 603 | } 604 | }, 605 | "source": [ 606 | "How to do Data Manipulation - Rows and Columns" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 0, 612 | "metadata": { 613 | "application/vnd.databricks.v1+cell": { 614 | "cellMetadata": { 615 | "byteLimit": 2048000, 616 | "rowLimit": 10000 617 | }, 618 | "inputWidgets": {}, 619 | "nuid": "e4f34241-a0de-47d4-89d0-5596681206c5", 620 | "showTitle": true, 621 | "title": "Selecting Columns" 622 | } 623 | }, 624 | "outputs": [ 625 | { 626 | "output_type": "stream", 627 | "name": "stdout", 628 | "output_type": "stream", 629 | "text": [ 630 | "+-------------+\n| col_3|\n+-------------+\n|string_test_1|\n|string_test_2|\n|string_test_3|\n+-------------+\n\n" 631 | ] 632 | } 633 | ], 634 | "source": [ 635 | "from pyspark.sql import Column\n", 636 | "\n", 637 | "data_df.select(data_df.col_3).show()\n" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 0, 643 | "metadata": { 644 | "application/vnd.databricks.v1+cell": { 645 | "cellMetadata": { 646 | "byteLimit": 2048000, 647 | "rowLimit": 10000 648 | }, 649 | "inputWidgets": {}, 650 | "nuid": "47fd50e6-1433-4add-ae2f-1ddcfb1a0e7c", 651 | "showTitle": true, 652 | "title": "Creating Columns" 653 | } 654 | }, 655 | "outputs": [ 656 | { 657 | "output_type": "stream", 658 | "name": "stdout", 659 | "output_type": "stream", 660 | "text": [ 661 | "+-----+-----+-------------+----------+-------------------+-----+\n|col_1|col_2| col_3| col_4| col_5|col_6|\n+-----+-----+-------------+----------+-------------------+-----+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00| A|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00| A|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00| A|\n+-----+-----+-------------+----------+-------------------+-----+\n\n" 662 | ] 663 | } 664 | ], 665 | "source": [ 666 | "from pyspark.sql import functions as F\n", 667 | "data_df = data_df.withColumn(\"col_6\", F.lit(\"A\"))\n", 668 | "data_df.show()\n" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 0, 674 | "metadata": { 675 | "application/vnd.databricks.v1+cell": { 676 | "cellMetadata": { 677 | "byteLimit": 2048000, 678 | "rowLimit": 10000 679 | }, 680 | "inputWidgets": {}, 681 | "nuid": "62323798-4638-483c-9d30-e9206ef826de", 682 | "showTitle": true, 683 | "title": "Dropping Columns" 684 | } 685 | }, 686 | "outputs": [ 687 | { 688 | "output_type": "stream", 689 | "name": "stdout", 690 | "output_type": "stream", 691 | "text": [ 692 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 
693 | ] 694 | } 695 | ], 696 | "source": [ 697 | "data_df = data_df.drop(\"col_5\")\n", 698 | "data_df.show()\n" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 0, 704 | "metadata": { 705 | "application/vnd.databricks.v1+cell": { 706 | "cellMetadata": { 707 | "byteLimit": 2048000, 708 | "rowLimit": 10000 709 | }, 710 | "inputWidgets": {}, 711 | "nuid": "fb759316-d4e6-43b0-8ced-d073f1a20f97", 712 | "showTitle": true, 713 | "title": "Updating Columns" 714 | } 715 | }, 716 | "outputs": [ 717 | { 718 | "output_type": "stream", 719 | "name": "stdout", 720 | "output_type": "stream", 721 | "text": [ 722 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100| 2.0|string_test_1|2023-01-01| A|\n| 200| 3.0|string_test_2|2023-02-01| A|\n| 300| 4.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 723 | ] 724 | } 725 | ], 726 | "source": [ 727 | "data_df.withColumn(\"col_2\", F.col(\"col_2\") / 100).show()" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": 0, 733 | "metadata": { 734 | "application/vnd.databricks.v1+cell": { 735 | "cellMetadata": { 736 | "byteLimit": 2048000, 737 | "rowLimit": 10000 738 | }, 739 | "inputWidgets": {}, 740 | "nuid": "2cc409f3-6407-48a1-a5c9-21f08062a7f3", 741 | "showTitle": true, 742 | "title": "Renaming Columns" 743 | } 744 | }, 745 | "outputs": [ 746 | { 747 | "output_type": "stream", 748 | "name": "stdout", 749 | "output_type": "stream", 750 | "text": [ 751 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 752 | ] 753 | } 754 | ], 755 | "source": [ 756 | "data_df = data_df.withColumnRenamed(\"col_3\", \"string_col\")\n", 757 | "data_df.show()\n" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 0, 763 | "metadata": { 764 | "application/vnd.databricks.v1+cell": { 765 | "cellMetadata": { 766 | "byteLimit": 2048000, 767 | "rowLimit": 10000 768 | }, 769 | "inputWidgets": {}, 770 | "nuid": "fd1755b4-eec5-44f5-b8d3-946cb0359432", 771 | "showTitle": true, 772 | "title": "Finding Unique Values in a Column" 773 | } 774 | }, 775 | "outputs": [ 776 | { 777 | "output_type": "stream", 778 | "name": "stdout", 779 | "output_type": "stream", 780 | "text": [ 781 | "+-----+\n|col_6|\n+-----+\n| A|\n+-----+\n\n" 782 | ] 783 | } 784 | ], 785 | "source": [ 786 | "data_df.select(\"col_6\").distinct().show()" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 0, 792 | "metadata": { 793 | "application/vnd.databricks.v1+cell": { 794 | "cellMetadata": { 795 | "byteLimit": 2048000, 796 | "rowLimit": 10000 797 | }, 798 | "inputWidgets": {}, 799 | "nuid": "beddedbc-ced1-477d-b8bd-30a102ef10dd", 800 | "showTitle": false, 801 | "title": "" 802 | } 803 | }, 804 | "outputs": [ 805 | { 806 | "output_type": "stream", 807 | "name": "stdout", 808 | "output_type": "stream", 809 | "text": [ 810 | "+------------+\n|Total_Unique|\n+------------+\n| 1|\n+------------+\n\n" 811 | ] 812 | } 813 | ], 814 | "source": [ 815 | "data_df.select(F.countDistinct(\"col_6\").alias(\"Total_Unique\")).show()" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 0, 821 | "metadata": { 822 | "application/vnd.databricks.v1+cell": { 
823 | "cellMetadata": { 824 | "byteLimit": 2048000, 825 | "rowLimit": 10000 826 | }, 827 | "inputWidgets": {}, 828 | "nuid": "469c586b-c14e-4652-be89-97c9b62e5818", 829 | "showTitle": true, 830 | "title": "Change case of a Column" 831 | } 832 | }, 833 | "outputs": [ 834 | { 835 | "output_type": "stream", 836 | "name": "stdout", 837 | "output_type": "stream", 838 | "text": [ 839 | "+-----+-----+-------------+----------+-----+----------------+\n|col_1|col_2| string_col| col_4|col_6|upper_string_col|\n+-----+-----+-------------+----------+-----+----------------+\n| 100|200.0|string_test_1|2023-01-01| A| STRING_TEST_1|\n| 200|300.0|string_test_2|2023-02-01| A| STRING_TEST_2|\n| 300|400.0|string_test_3|2023-03-01| A| STRING_TEST_3|\n+-----+-----+-------------+----------+-----+----------------+\n\n" 840 | ] 841 | } 842 | ], 843 | "source": [ 844 | "from pyspark.sql.functions import upper\n", 845 | "\n", 846 | "data_df.withColumn('upper_string_col', upper(data_df.string_col)).show()\n" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": 0, 852 | "metadata": { 853 | "application/vnd.databricks.v1+cell": { 854 | "cellMetadata": { 855 | "byteLimit": 2048000, 856 | "rowLimit": 10000 857 | }, 858 | "inputWidgets": {}, 859 | "nuid": "c7b3e23f-84f5-41de-a09d-a7a20e9404b5", 860 | "showTitle": true, 861 | "title": "Filtering a Dataframe" 862 | } 863 | }, 864 | "outputs": [ 865 | { 866 | "output_type": "stream", 867 | "name": "stdout", 868 | "output_type": "stream", 869 | "text": [ 870 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 871 | ] 872 | } 873 | ], 874 | "source": [ 875 | "data_df.filter(data_df.col_1 == 100).show()" 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": 0, 881 | "metadata": { 882 | "application/vnd.databricks.v1+cell": { 883 | "cellMetadata": { 884 | "byteLimit": 2048000, 885 | "rowLimit": 10000 886 | }, 887 | "inputWidgets": {}, 888 | "nuid": "889da414-a8e1-4014-ab7f-5e1f10d362fa", 889 | "showTitle": true, 890 | "title": "Logical Operators in a Dataframe" 891 | } 892 | }, 893 | "outputs": [ 894 | { 895 | "output_type": "stream", 896 | "name": "stdout", 897 | "output_type": "stream", 898 | "text": [ 899 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 900 | ] 901 | } 902 | ], 903 | "source": [ 904 | "data_df.filter((data_df.col_1 == 100)\n", 905 | "\t\t& (data_df.col_6 == 'A')).show()\n" 906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 0, 911 | "metadata": { 912 | "application/vnd.databricks.v1+cell": { 913 | "cellMetadata": { 914 | "byteLimit": 2048000, 915 | "rowLimit": 10000 916 | }, 917 | "inputWidgets": {}, 918 | "nuid": "582e12c8-1081-4a3f-a539-8fbf1080e316", 919 | "showTitle": false, 920 | "title": "" 921 | } 922 | }, 923 | "outputs": [ 924 | { 925 | "output_type": "stream", 926 | "name": "stdout", 927 | "output_type": "stream", 928 | "text": [ 929 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 930 | ] 931 | } 932 | ], 933 | 
"source": [ 934 | "data_df.filter((data_df.col_1 == 100)\n", 935 | "\t\t| (data_df.col_2 == 300.00)).show()\n" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 0, 941 | "metadata": { 942 | "application/vnd.databricks.v1+cell": { 943 | "cellMetadata": { 944 | "byteLimit": 2048000, 945 | "rowLimit": 10000 946 | }, 947 | "inputWidgets": {}, 948 | "nuid": "ef3afdd9-e883-40cc-bd79-cc9fcff57a7e", 949 | "showTitle": true, 950 | "title": "Using Isin()" 951 | } 952 | }, 953 | "outputs": [ 954 | { 955 | "output_type": "stream", 956 | "name": "stdout", 957 | "output_type": "stream", 958 | "text": [ 959 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 960 | ] 961 | } 962 | ], 963 | "source": [ 964 | "list = [100, 200]\n", 965 | "data_df.filter(data_df.col_1.isin(list)).show()\n" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": 0, 971 | "metadata": { 972 | "application/vnd.databricks.v1+cell": { 973 | "cellMetadata": { 974 | "byteLimit": 2048000, 975 | "rowLimit": 10000 976 | }, 977 | "inputWidgets": {}, 978 | "nuid": "fb4e727a-c8b3-4010-bbde-2224b82aab79", 979 | "showTitle": true, 980 | "title": "Datatype conversions" 981 | } 982 | }, 983 | "outputs": [ 984 | { 985 | "output_type": "stream", 986 | "name": "stdout", 987 | "output_type": "stream", 988 | "text": [ 989 | "root\n |-- col_1: integer (nullable = true)\n |-- col_2: double (nullable = true)\n |-- string_col: string (nullable = true)\n |-- col_4: string (nullable = true)\n |-- col_6: string (nullable = false)\n\n+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 990 | ] 991 | } 992 | ], 993 | "source": [ 994 | "from pyspark.sql.functions import col\n", 995 | "from pyspark.sql.types import StringType,BooleanType,DateType,IntegerType\n", 996 | "\n", 997 | "data_df_2 = data_df.withColumn(\"col_4\",col(\"col_4\").cast(StringType())) \\\n", 998 | " .withColumn(\"col_1\",col(\"col_1\").cast(IntegerType()))\n", 999 | "data_df_2.printSchema()\n", 1000 | "data_df.show()\n", 1001 | "\n" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "execution_count": 0, 1007 | "metadata": { 1008 | "application/vnd.databricks.v1+cell": { 1009 | "cellMetadata": { 1010 | "byteLimit": 2048000, 1011 | "rowLimit": 10000 1012 | }, 1013 | "inputWidgets": {}, 1014 | "nuid": "908430a1-d7f1-4372-aebc-ef371f1efc96", 1015 | "showTitle": false, 1016 | "title": "" 1017 | } 1018 | }, 1019 | "outputs": [ 1020 | { 1021 | "output_type": "stream", 1022 | "name": "stdout", 1023 | "output_type": "stream", 1024 | "text": [ 1025 | "root\n |-- col_4: date (nullable = true)\n |-- col_1: long (nullable = true)\n\n" 1026 | ] 1027 | } 1028 | ], 1029 | "source": [ 1030 | "data_df_3 = data_df_2.selectExpr(\"cast(col_4 as date) col_4\",\n", 1031 | " \"cast(col_1 as long) col_1\")\n", 1032 | "data_df_3.printSchema()\n" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 0, 1038 | "metadata": { 1039 | "application/vnd.databricks.v1+cell": { 1040 | "cellMetadata": { 1041 | "byteLimit": 2048000, 1042 | "rowLimit": 10000 1043 | }, 1044 | 
"inputWidgets": {}, 1045 | "nuid": "08646a8d-923c-440a-bae9-305e390b529e", 1046 | "showTitle": false, 1047 | "title": "" 1048 | } 1049 | }, 1050 | "outputs": [ 1051 | { 1052 | "output_type": "stream", 1053 | "name": "stdout", 1054 | "output_type": "stream", 1055 | "text": [ 1056 | "root\n |-- col_1: double (nullable = true)\n |-- col_4: date (nullable = true)\n\n+-----+----------+\n|col_1|col_4 |\n+-----+----------+\n|100.0|2023-01-01|\n|200.0|2023-02-01|\n|300.0|2023-03-01|\n+-----+----------+\n\n" 1057 | ] 1058 | } 1059 | ], 1060 | "source": [ 1061 | "data_df_3.createOrReplaceTempView(\"CastExample\")\n", 1062 | "data_df_4 = spark.sql(\"SELECT DOUBLE(col_1), DATE(col_4) from CastExample\")\n", 1063 | "data_df_4.printSchema()\n", 1064 | "data_df_4.show(truncate=False)\n" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": 0, 1070 | "metadata": { 1071 | "application/vnd.databricks.v1+cell": { 1072 | "cellMetadata": { 1073 | "byteLimit": 2048000, 1074 | "rowLimit": 10000 1075 | }, 1076 | "inputWidgets": {}, 1077 | "nuid": "92dcefea-630b-4ecf-8649-705fbf78b93c", 1078 | "showTitle": true, 1079 | "title": "Dropping null values from a Dataframe" 1080 | } 1081 | }, 1082 | "outputs": [ 1083 | { 1084 | "output_type": "stream", 1085 | "name": "stdout", 1086 | "output_type": "stream", 1087 | "text": [ 1088 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1089 | ] 1090 | } 1091 | ], 1092 | "source": [ 1093 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n", 1094 | " (\"Michael\", \"Field-eng\", 4500), \n", 1095 | " (\"Robert\", None, 4000), \n", 1096 | " (\"Maria\", \"Finance\", 3500), \n", 1097 | " (\"John\", \"Sales\", 3000), \n", 1098 | " (\"Kelly\", \"Finance\", 3500), \n", 1099 | " (\"Kate\", \"Finance\", 3000), \n", 1100 | " (\"Martin\", None, 3500), \n", 1101 | " (\"Kiran\", \"Sales\", 2200), \n", 1102 | " (\"Michael\", \"Field-eng\", 4500) \n", 1103 | " ]\n", 1104 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n", 1105 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n", 1106 | "salary_data.printSchema()\n", 1107 | "salary_data.show()\n" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": 0, 1113 | "metadata": { 1114 | "application/vnd.databricks.v1+cell": { 1115 | "cellMetadata": { 1116 | "byteLimit": 2048000, 1117 | "rowLimit": 10000 1118 | }, 1119 | "inputWidgets": {}, 1120 | "nuid": "5f7ded6f-cc9d-4bfe-8f63-afe560278772", 1121 | "showTitle": false, 1122 | "title": "" 1123 | } 1124 | }, 1125 | "outputs": [ 1126 | { 1127 | "output_type": "stream", 1128 | "name": "stdout", 1129 | "output_type": "stream", 1130 | "text": [ 1131 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1132 | ] 1133 | } 1134 | ], 1135 | "source": [ 1136 | "salary_data.dropna().show()" 1137 | ] 1138 
| }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": 0, 1142 | "metadata": { 1143 | "application/vnd.databricks.v1+cell": { 1144 | "cellMetadata": { 1145 | "byteLimit": 2048000, 1146 | "rowLimit": 10000 1147 | }, 1148 | "inputWidgets": {}, 1149 | "nuid": "50cb0f47-2786-4d8a-9514-913108e338af", 1150 | "showTitle": true, 1151 | "title": "Dropping Duplicates from a Dataframe" 1152 | } 1153 | }, 1154 | "outputs": [ 1155 | { 1156 | "output_type": "stream", 1157 | "name": "stdout", 1158 | "output_type": "stream", 1159 | "text": [ 1160 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Sales| 3000|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n" 1161 | ] 1162 | } 1163 | ], 1164 | "source": [ 1165 | "new_salary_data = salary_data.dropDuplicates()\nnew_salary_data.show()" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": { 1171 | "application/vnd.databricks.v1+cell": { 1172 | "cellMetadata": {}, 1173 | "inputWidgets": {}, 1174 | "nuid": "26b09b00-3377-47f5-b54b-350d3126e92c", 1175 | "showTitle": false, 1176 | "title": "" 1177 | } 1178 | }, 1179 | "source": [ 1180 | "Using Aggregates in a Dataframe" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "code", 1185 | "execution_count": 0, 1186 | "metadata": { 1187 | "application/vnd.databricks.v1+cell": { 1188 | "cellMetadata": { 1189 | "byteLimit": 2048000, 1190 | "rowLimit": 10000 1191 | }, 1192 | "inputWidgets": {}, 1193 | "nuid": "151c07a1-1bf3-4cc0-98f4-46aad67e67d3", 1194 | "showTitle": true, 1195 | "title": "Average (avg)" 1196 | } 1197 | }, 1198 | "outputs": [ 1199 | { 1200 | "output_type": "stream", 1201 | "name": "stdout", 1202 | "output_type": "stream", 1203 | "text": [ 1204 | "+-----------+\n|avg(Salary)|\n+-----------+\n| 3520.0|\n+-----------+\n\n" 1205 | ] 1206 | } 1207 | ], 1208 | "source": [ 1209 | "from pyspark.sql.functions import countDistinct, avg\n", 1210 | "salary_data.select(avg('Salary')).show()\n" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "code", 1215 | "execution_count": 0, 1216 | "metadata": { 1217 | "application/vnd.databricks.v1+cell": { 1218 | "cellMetadata": { 1219 | "byteLimit": 2048000, 1220 | "rowLimit": 10000 1221 | }, 1222 | "inputWidgets": {}, 1223 | "nuid": "b3104546-2fba-4638-987b-100fb9ac53cf", 1224 | "showTitle": true, 1225 | "title": "Count" 1226 | } 1227 | }, 1228 | "outputs": [ 1229 | { 1230 | "output_type": "stream", 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "+-------------+\n|count(Salary)|\n+-------------+\n| 10|\n+-------------+\n\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "salary_data.agg({'Salary':'count'}).show()" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "code", 1244 | "execution_count": 0, 1245 | "metadata": { 1246 | "application/vnd.databricks.v1+cell": { 1247 | "cellMetadata": { 1248 | "byteLimit": 2048000, 1249 | "rowLimit": 10000 1250 | }, 1251 | "inputWidgets": {}, 1252 | "nuid": "8a396838-9c08-46f9-ae42-6b5db6c094c8", 1253 | "showTitle": true, 1254 | "title": "Count distinct values" 1255 | } 1256 | }, 1257 | "outputs": [ 1258 | { 1259 | "output_type": "stream", 1260 | "name": "stdout", 1261 | "output_type": "stream", 1262 | "text": [ 1263 | "+---------------+\n|Distinct Salary|\n+---------------+\n| 5|\n+---------------+\n\n" 1264 | ] 1265 | } 1266 | ], 1267 | "source": [
1268 | "salary_data.select(countDistinct(\"Salary\").alias(\"Distinct Salary\")).show()" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 0, 1274 | "metadata": { 1275 | "application/vnd.databricks.v1+cell": { 1276 | "cellMetadata": { 1277 | "byteLimit": 2048000, 1278 | "rowLimit": 10000 1279 | }, 1280 | "inputWidgets": {}, 1281 | "nuid": "0d17be07-1395-470b-afec-93e6d3a8b893", 1282 | "showTitle": true, 1283 | "title": "Finding maximums (max)" 1284 | } 1285 | }, 1286 | "outputs": [ 1287 | { 1288 | "output_type": "stream", 1289 | "name": "stdout", 1290 | "output_type": "stream", 1291 | "text": [ 1292 | "+-----------+\n|max(Salary)|\n+-----------+\n| 4500|\n+-----------+\n\n" 1293 | ] 1294 | } 1295 | ], 1296 | "source": [ 1297 | "salary_data.agg({'Salary':'max'}).show() " 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "code", 1302 | "execution_count": 0, 1303 | "metadata": { 1304 | "application/vnd.databricks.v1+cell": { 1305 | "cellMetadata": { 1306 | "byteLimit": 2048000, 1307 | "rowLimit": 10000 1308 | }, 1309 | "inputWidgets": {}, 1310 | "nuid": "dfb27aef-8f62-4c89-b005-e046bc924699", 1311 | "showTitle": true, 1312 | "title": "Sum" 1313 | } 1314 | }, 1315 | "outputs": [ 1316 | { 1317 | "output_type": "stream", 1318 | "name": "stdout", 1319 | "output_type": "stream", 1320 | "text": [ 1321 | "+-----------+\n|sum(Salary)|\n+-----------+\n| 35200|\n+-----------+\n\n" 1322 | ] 1323 | } 1324 | ], 1325 | "source": [ 1326 | "salary_data.agg({'Salary':'sum'}).show()" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": 0, 1332 | "metadata": { 1333 | "application/vnd.databricks.v1+cell": { 1334 | "cellMetadata": { 1335 | "byteLimit": 2048000, 1336 | "rowLimit": 10000 1337 | }, 1338 | "inputWidgets": {}, 1339 | "nuid": "7d0b8f92-9d16-4cc4-9bc8-8b67db6befd6", 1340 | "showTitle": true, 1341 | "title": "Sort data with OrderBy" 1342 | } 1343 | }, 1344 | "outputs": [ 1345 | { 1346 | "output_type": "stream", 1347 | "name": "stdout", 1348 | "output_type": "stream", 1349 | "text": [ 1350 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Kiran| Sales| 2200|\n| John| Sales| 3000|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| John| Field-eng| 3500|\n| Robert| NULL| 4000|\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1351 | ] 1352 | } 1353 | ], 1354 | "source": [ 1355 | "salary_data.orderBy(\"Salary\").show()" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": 0, 1361 | "metadata": { 1362 | "application/vnd.databricks.v1+cell": { 1363 | "cellMetadata": { 1364 | "byteLimit": 2048000, 1365 | "rowLimit": 10000 1366 | }, 1367 | "inputWidgets": {}, 1368 | "nuid": "021a9f0c-59f2-497e-adbe-f39a2f7b7b21", 1369 | "showTitle": false, 1370 | "title": "" 1371 | } 1372 | }, 1373 | "outputs": [ 1374 | { 1375 | "output_type": "stream", 1376 | "name": "stdout", 1377 | "output_type": "stream", 1378 | "text": [ 1379 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Field-eng| 3500|\n| Martin| NULL| 3500|\n| Kelly| Finance| 3500|\n| Maria| Finance| 3500|\n| Kate| Finance| 3000|\n| John| Sales| 3000|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n" 1380 | ] 1381 | } 1382 | ], 1383 | "source": [ 1384 | 
"salary_data.orderBy(salary_data[\"Salary\"].desc()).show()" 1385 | ] 1386 | } 1387 | ], 1388 | "metadata": { 1389 | "application/vnd.databricks.v1+notebook": { 1390 | "dashboards": [], 1391 | "language": "python", 1392 | "notebookMetadata": { 1393 | "mostRecentlyExecutedCommandWithImplicitDF": { 1394 | "commandId": 969987236417588, 1395 | "dataframes": [ 1396 | "_sqldf" 1397 | ] 1398 | }, 1399 | "pythonIndentUnit": 2 1400 | }, 1401 | "notebookName": "Chapter 4 Code", 1402 | "widgets": {} 1403 | } 1404 | }, 1405 | "nbformat": 4, 1406 | "nbformat_minor": 0 1407 | } 1408 | --------------------------------------------------------------------------------