├── LICENSE
├── README.md
├── Chapter06
│   └── Chapter 6 Code.ipynb
├── Chapter05
│   └── Chapter 5 Code.ipynb
└── Chapter04
    └── Chapter 4 Code.ipynb

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 Packt
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Databricks Certified Associate Developer for Apache Spark Using Python
2 | 
3 | 
4 | This is the code repository for [Databricks Certified Associate Developer for Apache Spark Using Python](https://www.packtpub.com/product/databricks-certified-associate-developer-for-apache-spark-using-python/9781804619780), published by Packt.
5 | 
6 | **The ultimate guide to getting certified in Apache Spark using practical examples with Python**
7 | 
8 | ## What is this book about?
9 | This guide gets you ready for certification with expert-backed content, key exam concepts, and topic reviews. Additionally, you’ll be able to make the most of Apache Spark 3.0 to modernize workloads using specific tools and techniques.
10 | 
11 | This book covers the following exciting features:
12 | * Create and manipulate SQL queries in Spark
13 | * Build complex Spark functions using Spark UDFs
14 | * Architect big data apps with Spark fundamentals for optimal design
15 | * Apply techniques to manipulate and optimize big data applications
16 | * Build real-time or near-real-time applications using Spark Streaming
17 | * Work with Apache Spark for machine learning applications
18 | 
19 | If you feel this book is for you, get your [copy](https://www.amazon.com/Databricks-Certified-Associate-Developer-Apache/dp/1804619787) today!
20 | 
21 | https://www.packtpub.com/
23 | 
24 | ## Instructions and Navigations
25 | All of the code is organized into folders. For example, Chapter04.
26 | 
27 | The code will look like the following:
28 | ```
29 | # Perform an aggregation to calculate the average salary
30 | average_salary = spark.sql("SELECT AVG(Salary) AS average_salary FROM employees")
31 | 
32 | ```
33 | 
34 | **Following is what you need for this book:**
35 | This book is for you if you’re a professional looking to venture into the world of big data and data engineering, a data professional who wants to validate their knowledge of Spark, or a student.
Although working knowledge of Python is required, no prior Spark knowledge is needed. Additionally, experience with Pyspark will be beneficial. 36 | 37 | With the following software and hardware list you can run all code files present in the book (Chapter 4-8). 38 | ### Software and Hardware List 39 | | Chapter | Software required | OS required | 40 | | -------- | ------------------------------------ | ----------------------------------- | 41 | | 4-8 | Python | Windows, Mac OS X, and Linux | 42 | | 4-8 | Spark | Windows, Mac OS X, and Linux | 43 | 44 | ### Related products 45 | * Business Intelligence with Databricks SQL [[Packt]](https://www.packtpub.com/product/business-intelligence-with-databricks-sql/9781803235332) [[Amazon]](https://www.amazon.com/Business-Intelligence-Databricks-SQL-intelligence/dp/1803235330/ref=sr_1_1?crid=1QYCAOZP9E3NH&dib=eyJ2IjoiMSJ9.nKZ7dRFPdDZyRvWwKM_NiTSZyweCLZ8g9JdktemcYzaWNiGWg9PuoxY2yb2jogGyK8hgRliKebDQfdHu2rRnTZTWZbsWOJAN33k65RFkAgdFX-csS8HgTFfjZj-SFKLpp4FC6LHwQvWr9Nq6f5x6eg.jh99qre-Hl4OHA9rypXLmSGsQp4exBvaZ2xUOPDQ0mM&dib_tag=se&keywords=Business+Intelligence+with+Databricks+SQL&qid=1718173191&s=books&sprefix=business+intelligence+with+databricks+sql%2Cstripbooks-intl-ship%2C553&sr=1-1) 46 | 47 | * Azure Databricks Cookbook [[Packt]](https://www.packtpub.com/product/azure-databricks-cookbook/9781789809718) [[Amazon]](https://www.amazon.com/Azure-Databricks-Cookbook-Jonathan-Wood/dp/1789809711) 48 | 49 | ## Get to Know the Author 50 | **Saba Shah** is a Data and AI Architect and Evangelist with a wide technical breadth and deep understanding of big data and machine learning technologies. She has experience leading data science and data engineering teams in Fortune 500s as well as startups. She started her career as a software engineer but soon transitioned to big data. She is currently a solutions architect at Databricks and works with enterprises building their data strategy and helping them create a vision for the future with machine learning and predictive analytics. Saba graduated with a degree in Computer Science and later earned an MS degree in Advanced Web Technologies. She is passionate about all things data and cricket. She currently resides in RTP, NC. 
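To try the snippets outside Databricks, the sketch below shows a minimal local setup. It assumes PySpark has been installed with `pip install pyspark`; the `employees` view, the sample rows, and the column names simply mirror the shape of the chapter examples and are illustrative, not taken from the book's data files. On Databricks a `SparkSession` named `spark` already exists, so only the view registration and query are needed there.

```
# Minimal local quick-start sketch (assumes `pip install pyspark` has been run).
from pyspark.sql import SparkSession

# Locally you create the session yourself; on Databricks `spark` is predefined.
spark = SparkSession.builder.appName("BookCodeQuickStart").getOrCreate()

# Illustrative rows shaped like the employees examples used in Chapters 4-6.
salary_data = [(1, "John", "Field-eng", 3500, 40),
               (2, "Robert", "Sales", 4000, 38)]
columns = ["ID", "Employee", "Department", "Salary", "Age"]
employees_df = spark.createDataFrame(salary_data, schema=columns)

# Register a temporary view so the SQL snippets from the book can run as-is.
employees_df.createOrReplaceTempView("employees")

# The aggregation shown in the snippet above.
spark.sql("SELECT AVG(Salary) AS average_salary FROM employees").show()

spark.stop()
```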
51 | -------------------------------------------------------------------------------- /Chapter06/Chapter 6 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "5f9b4e65-714c-46a7-b0ca-2c22d87e349e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 6: SQL Queries in Spark" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 0, 21 | "metadata": { 22 | "application/vnd.databricks.v1+cell": { 23 | "cellMetadata": { 24 | "byteLimit": 2048000, 25 | "rowLimit": 10000 26 | }, 27 | "inputWidgets": {}, 28 | "nuid": "e4b6f89b-2d68-4937-a30a-ac64ef7caa49", 29 | "showTitle": true, 30 | "title": "Create Salary dataframe" 31 | } 32 | }, 33 | "outputs": [ 34 | { 35 | "output_type": "stream", 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500, 40), \\\n", 45 | " (2, \"Robert\", \"Sales\", 4000, 38), \\\n", 46 | " (3, \"Maria\", \"Finance\", 3500, 28), \\\n", 47 | " (4, \"Michael\", \"Sales\", 3000, 20), \\\n", 48 | " (5, \"Kelly\", \"Finance\", 3500, 35), \\\n", 49 | " (6, \"Kate\", \"Finance\", 3000, 45), \\\n", 50 | " (7, \"Martin\", \"Finance\", 3500, 26), \\\n", 51 | " (8, \"Kiran\", \"Sales\", 2200, 35), \\\n", 52 | " ]\n", 53 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\", \"Age\"]\n", 54 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n", 55 | "salary_data_with_id.show()\n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 0, 61 | "metadata": { 62 | "application/vnd.databricks.v1+cell": { 63 | "cellMetadata": { 64 | "byteLimit": 2048000, 65 | "rowLimit": 10000 66 | }, 67 | "inputWidgets": {}, 68 | "nuid": "e075be3c-bb49-4c81-a7b9-7dfbb4370056", 69 | "showTitle": true, 70 | "title": "Writing csv file" 71 | } 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "salary_data_with_id.write.format(\"csv\").mode(\"overwrite\").option(\"header\", \"true\").save(\"salary_data.csv\")\n" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 0, 81 | "metadata": { 82 | "application/vnd.databricks.v1+cell": { 83 | "cellMetadata": { 84 | "byteLimit": 2048000, 85 | "rowLimit": 10000 86 | }, 87 | "inputWidgets": {}, 88 | "nuid": "a11fe3af-e723-4e97-abbf-d503acf033e3", 89 | "showTitle": true, 90 | "title": "Reading csv file" 91 | } 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "csv_data = spark.read.csv('/salary_data.csv', header=True)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 0, 101 | "metadata": { 102 | "application/vnd.databricks.v1+cell": { 103 | "cellMetadata": { 104 | "byteLimit": 2048000, 105 | "rowLimit": 10000 106 | }, 107 | "inputWidgets": {}, 108 | "nuid": "e1aa8d82-a393-448a-8fd0-5110b1eb1af2", 109 | "showTitle": true, 110 | "title": "Showing data" 111 | } 112 | }, 113 | 
"outputs": [ 114 | { 115 | "output_type": "stream", 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "csv_data.show()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 0, 130 | "metadata": { 131 | "application/vnd.databricks.v1+cell": { 132 | "cellMetadata": { 133 | "byteLimit": 2048000, 134 | "rowLimit": 10000 135 | }, 136 | "inputWidgets": {}, 137 | "nuid": "7f186780-97f0-4b7f-af17-c33073c872ce", 138 | "showTitle": true, 139 | "title": "# Perform transformations on the loaded data" 140 | } 141 | }, 142 | "outputs": [ 143 | { 144 | "output_type": "stream", 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 7| Martin| Finance| 3500| 26|\n+---+--------+----------+------+---+\n\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "# Perform transformations on the loaded data \n", 154 | "processed_data = csv_data.filter(csv_data[\"Salary\"] > 3000) \n", 155 | "# Save the processed data as a table \n", 156 | "processed_data.createOrReplaceTempView(\"high_salary_employees\") \n", 157 | "# Perform SQL queries on the saved table \n", 158 | "results = spark.sql(\"SELECT * FROM high_salary_employees \") \n", 159 | "results.show()\n" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 0, 165 | "metadata": { 166 | "application/vnd.databricks.v1+cell": { 167 | "cellMetadata": { 168 | "byteLimit": 2048000, 169 | "rowLimit": 10000 170 | }, 171 | "inputWidgets": {}, 172 | "nuid": "022c96b3-91f2-41ea-91c4-58ecd511a2f6", 173 | "showTitle": true, 174 | "title": "Saving Transformed Data as a View" 175 | } 176 | }, 177 | "outputs": [ 178 | { 179 | "output_type": "stream", 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| John| Field-eng| 3500| 40|\n| Robert| Sales| 4000| 38|\n| Kelly| Finance| 3500| 35|\n| Kate| Finance| 3000| 45|\n| Kiran| Sales| 2200| 35|\n+--------+----------+------+---+\n\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "# Save the processed data as a view \n", 189 | "salary_data_with_id.createOrReplaceTempView(\"employees\") \n", 190 | "#Apply filtering on data\n", 191 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30\") \n", 192 | "# Display the results \n", 193 | "filtered_data.show()\n" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 0, 199 | "metadata": { 200 | "application/vnd.databricks.v1+cell": { 201 | "cellMetadata": { 202 | "byteLimit": 2048000, 203 | "rowLimit": 10000 204 | }, 205 | "inputWidgets": {}, 206 | "nuid": "174c84ac-fa9e-42a7-a3b5-a166193e63b0", 207 | "showTitle": true, 208 | "title": "Aggregating data" 209 | } 210 | }, 211 | "outputs": [ 212 | { 213 | 
"output_type": "stream", 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "+--------------+\n|average_salary|\n+--------------+\n| 3275.0|\n+--------------+\n\n" 218 | ] 219 | } 220 | ], 221 | "source": [ 222 | "# Perform an aggregation to calculate the average salary \n", 223 | "average_salary = spark.sql(\"SELECT AVG(Salary) AS average_salary FROM employees\") \n", 224 | "# Display the average salary \n", 225 | "average_salary.show() \n" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 0, 231 | "metadata": { 232 | "application/vnd.databricks.v1+cell": { 233 | "cellMetadata": { 234 | "byteLimit": 2048000, 235 | "rowLimit": 10000 236 | }, 237 | "inputWidgets": {}, 238 | "nuid": "97bfd3f0-1a65-4a58-bdc7-58b55dd58840", 239 | "showTitle": true, 240 | "title": "Sorting data" 241 | } 242 | }, 243 | "outputs": [ 244 | { 245 | "output_type": "stream", 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 2| Robert| Sales| 4000| 38|\n| 1| John| Field-eng| 3500| 40|\n| 7| Martin| Finance| 3500| 26|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 4| Michael| Sales| 3000| 20|\n| 6| Kate| Finance| 3000| 45|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "# Sort the data based on the salary column in descending order \n", 255 | "sorted_data = spark.sql(\"SELECT * FROM employees ORDER BY Salary DESC\") \n", 256 | "# Display the sorted data \n", 257 | "sorted_data.show() \n" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 0, 263 | "metadata": { 264 | "application/vnd.databricks.v1+cell": { 265 | "cellMetadata": { 266 | "byteLimit": 2048000, 267 | "rowLimit": 10000 268 | }, 269 | "inputWidgets": {}, 270 | "nuid": "38606a23-aa13-49ca-b7dc-d669dd472f55", 271 | "showTitle": true, 272 | "title": "Combining Aggregations" 273 | } 274 | }, 275 | "outputs": [ 276 | { 277 | "output_type": "stream", 278 | "name": "stdout", 279 | "output_type": "stream", 280 | "text": [ 281 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| Robert| Sales| 4000| 38|\n| John| Field-eng| 3500| 40|\n| Kelly| Finance| 3500| 35|\n+--------+----------+------+---+\n\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "# Sort the data based on the salary column in descending order \n", 287 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30 AND Salary > 3000 ORDER BY Salary DESC\") \n", 288 | "# Display the results \n", 289 | "filtered_data.show()\n" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 0, 295 | "metadata": { 296 | "application/vnd.databricks.v1+cell": { 297 | "cellMetadata": { 298 | "byteLimit": 2048000, 299 | "rowLimit": 10000 300 | }, 301 | "inputWidgets": {}, 302 | "nuid": "7701c892-0883-4bc2-9b5e-f51cf55fcc78", 303 | "showTitle": true, 304 | "title": "Grouping data" 305 | } 306 | }, 307 | "outputs": [ 308 | { 309 | "output_type": "stream", 310 | "name": "stdout", 311 | "output_type": "stream", 312 | "text": [ 313 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Sales|3066.6666666666665|\n| Finance| 3375.0|\n| Field-eng| 3500.0|\n+----------+------------------+\n\n" 314 | ] 315 | } 316 | ], 317 | "source": [ 318 | "# Group the data 
based on the Department column and take average salary for each department \n", 319 | "grouped_data = spark.sql(\"SELECT Department, avg(Salary) FROM employees GROUP BY Department\") \n", 320 | "# Display the results \n", 321 | "grouped_data.show()\n" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 0, 327 | "metadata": { 328 | "application/vnd.databricks.v1+cell": { 329 | "cellMetadata": { 330 | "byteLimit": 2048000, 331 | "rowLimit": 10000 332 | }, 333 | "inputWidgets": {}, 334 | "nuid": "abafb986-88fa-4c43-8f65-6f7bbfa70ec6", 335 | "showTitle": true, 336 | "title": "Grouping with multiple aggregations" 337 | } 338 | }, 339 | "outputs": [ 340 | { 341 | "output_type": "stream", 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "+----------+------------+----------+\n|Department|total_salary|max_salary|\n+----------+------------+----------+\n| Sales| 9200| 4000|\n| Finance| 13500| 3500|\n| Field-eng| 3500| 3500|\n+----------+------------+----------+\n\n" 346 | ] 347 | } 348 | ], 349 | "source": [ 350 | "# Perform grouping and multiple aggregations \n", 351 | "aggregated_data = spark.sql(\"SELECT Department, sum(Salary) AS total_salary, max(Salary) AS max_salary FROM employees GROUP BY Department\") \n", 352 | "\n", 353 | "# Display the results \n", 354 | "aggregated_data.show()\n" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 0, 360 | "metadata": { 361 | "application/vnd.databricks.v1+cell": { 362 | "cellMetadata": { 363 | "byteLimit": 2048000, 364 | "rowLimit": 10000 365 | }, 366 | "inputWidgets": {}, 367 | "nuid": "3b85baa3-2e70-4038-af6c-440346d96d78", 368 | "showTitle": true, 369 | "title": "Window functions" 370 | } 371 | }, 372 | "outputs": [ 373 | { 374 | "output_type": "stream", 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "+---+--------+----------+------+---+--------------+\n| ID|Employee|Department|Salary|Age|cumulative_sum|\n+---+--------+----------+------+---+--------------+\n| 1| John| Field-eng| 3500| 40| 3500|\n| 7| Martin| Finance| 3500| 26| 3500|\n| 3| Maria| Finance| 3500| 28| 7000|\n| 5| Kelly| Finance| 3500| 35| 10500|\n| 6| Kate| Finance| 3000| 45| 13500|\n| 4| Michael| Sales| 3000| 20| 3000|\n| 8| Kiran| Sales| 2200| 35| 5200|\n| 2| Robert| Sales| 4000| 38| 9200|\n+---+--------+----------+------+---+--------------+\n\n" 379 | ] 380 | } 381 | ], 382 | "source": [ 383 | "from pyspark.sql.window import Window\n", 384 | "from pyspark.sql.functions import col, sum\n", 385 | "\n", 386 | "# Define the window specification\n", 387 | "window_spec = Window.partitionBy(\"Department\").orderBy(\"Age\")\n", 388 | "\n", 389 | "# Calculate the cumulative sum using window function\n", 390 | "df_with_cumulative_sum = salary_data_with_id.withColumn(\"cumulative_sum\", sum(col(\"Salary\")).over(window_spec))\n", 391 | "\n", 392 | "# Display the result\n", 393 | "df_with_cumulative_sum.show()\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 0, 399 | "metadata": { 400 | "application/vnd.databricks.v1+cell": { 401 | "cellMetadata": { 402 | "byteLimit": 2048000, 403 | "rowLimit": 10000 404 | }, 405 | "inputWidgets": {}, 406 | "nuid": "5d8c6978-6e33-47db-8ad6-8af7cd77d522", 407 | "showTitle": true, 408 | "title": "Using udfs" 409 | } 410 | }, 411 | "outputs": [ 412 | { 413 | "output_type": "stream", 414 | "name": "stdout", 415 | "output_type": "stream", 416 | "text": [ 417 | "+---+--------+----------+------+---+----------------+\n| 
ID|Employee|Department|Salary|Age|capitalized_name|\n+---+--------+----------+------+---+----------------+\n| 1| John| Field-eng| 3500| 40| JOHN|\n| 2| Robert| Sales| 4000| 38| ROBERT|\n| 3| Maria| Finance| 3500| 28| MARIA|\n| 4| Michael| Sales| 3000| 20| MICHAEL|\n| 5| Kelly| Finance| 3500| 35| KELLY|\n| 6| Kate| Finance| 3000| 45| KATE|\n| 7| Martin| Finance| 3500| 26| MARTIN|\n| 8| Kiran| Sales| 2200| 35| KIRAN|\n+---+--------+----------+------+---+----------------+\n\n" 418 | ] 419 | } 420 | ], 421 | "source": [ 422 | "from pyspark.sql import SparkSession\n", 423 | "from pyspark.sql.functions import udf\n", 424 | "from pyspark.sql.types import StringType\n", 425 | "\n", 426 | "# Define a UDF to capitalize a string\n", 427 | "capitalize_udf = udf(lambda x: x.upper(), StringType())\n", 428 | "\n", 429 | "# Apply the UDF to a column\n", 430 | "df_with_capitalized_names = salary_data_with_id.withColumn(\"capitalized_name\", capitalize_udf(\"Employee\"))\n", 431 | "\n", 432 | "# Display the result\n", 433 | "df_with_capitalized_names.show()\n" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 0, 439 | "metadata": { 440 | "application/vnd.databricks.v1+cell": { 441 | "cellMetadata": { 442 | "byteLimit": 2048000, 443 | "rowLimit": 10000 444 | }, 445 | "inputWidgets": {}, 446 | "nuid": "19166b9b-a9c2-4e89-959f-abbac8ef8eff", 447 | "showTitle": false, 448 | "title": "" 449 | } 450 | }, 451 | "outputs": [ 452 | { 453 | "output_type": "stream", 454 | "name": "stdout", 455 | "output_type": "stream", 456 | "text": [ 457 | "+---+--------+----------+------+---+----------------+\n| ID|Employee|Department|Salary|Age|capitalized_name|\n+---+--------+----------+------+---+----------------+\n| 1| John| Field-eng| 3500| 40| JOHN|\n| 2| Robert| Sales| 4000| 38| ROBERT|\n| 3| Maria| Finance| 3500| 28| MARIA|\n| 4| Michael| Sales| 3000| 20| MICHAEL|\n| 5| Kelly| Finance| 3500| 35| KELLY|\n| 6| Kate| Finance| 3000| 45| KATE|\n| 7| Martin| Finance| 3500| 26| MARTIN|\n| 8| Kiran| Sales| 2200| 35| KIRAN|\n+---+--------+----------+------+---+----------------+\n\n" 458 | ] 459 | } 460 | ], 461 | "source": [ 462 | "from pyspark.sql.functions import udf\n", 463 | "from pyspark.sql.types import StringType\n", 464 | "\n", 465 | "# Define a UDF to capitalize a string\n", 466 | "capitalize_udf = udf(lambda x: x.upper(), StringType())\n", 467 | "\n", 468 | "# Apply the UDF to a column\n", 469 | "df_with_capitalized_names = salary_data_with_id.withColumn(\"capitalized_name\", capitalize_udf(\"Employee\"))\n", 470 | "\n", 471 | "# Display the result\n", 472 | "df_with_capitalized_names.show()" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 0, 478 | "metadata": { 479 | "application/vnd.databricks.v1+cell": { 480 | "cellMetadata": { 481 | "byteLimit": 2048000, 482 | "rowLimit": 10000 483 | }, 484 | "inputWidgets": {}, 485 | "nuid": "df1d9134-f8bf-4597-9b46-1e0142a0acac", 486 | "showTitle": true, 487 | "title": "Applying functions" 488 | } 489 | }, 490 | "outputs": [ 491 | { 492 | "output_type": "stream", 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "+-----------------------+\n|pandas_plus_one(Salary)|\n+-----------------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+-----------------------+\n\n" 497 | ] 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "from pyspark.sql.functions import pandas_udf\n", 503 | "\n", 504 | "@pandas_udf('long')\n", 505 | "def pandas_plus_one(series: 
pd.Series) -> pd.Series:\n", 506 | " # Simply plus one by using pandas Series.\n", 507 | " return series + 1\n", 508 | "\n", 509 | "salary_data_with_id.select(pandas_plus_one(salary_data_with_id.Salary)).show()\n" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 0, 515 | "metadata": { 516 | "application/vnd.databricks.v1+cell": { 517 | "cellMetadata": { 518 | "byteLimit": 2048000, 519 | "rowLimit": 10000 520 | }, 521 | "inputWidgets": {}, 522 | "nuid": "d0141f5b-f209-4f66-928f-1d631a99ca58", 523 | "showTitle": true, 524 | "title": "Pandas udfs" 525 | } 526 | }, 527 | "outputs": [ 528 | { 529 | "output_type": "stream", 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "+---------------+\n|add_one(Salary)|\n+---------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+---------------+\n\n" 534 | ] 535 | } 536 | ], 537 | "source": [ 538 | "@pandas_udf(\"integer\")\n", 539 | "def add_one(s: pd.Series) -> pd.Series:\n", 540 | " return s + 1\n", 541 | "\n", 542 | "spark.udf.register(\"add_one\", add_one)\n", 543 | "spark.sql(\"SELECT add_one(Salary) FROM employees\").show()\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 0, 549 | "metadata": { 550 | "application/vnd.databricks.v1+cell": { 551 | "cellMetadata": {}, 552 | "inputWidgets": {}, 553 | "nuid": "4f6c3ea0-f650-4806-93cf-0368b01c2dd2", 554 | "showTitle": false, 555 | "title": "" 556 | } 557 | }, 558 | "outputs": [], 559 | "source": [] 560 | } 561 | ], 562 | "metadata": { 563 | "application/vnd.databricks.v1+notebook": { 564 | "dashboards": [], 565 | "language": "python", 566 | "notebookMetadata": { 567 | "mostRecentlyExecutedCommandWithImplicitDF": { 568 | "commandId": 969987236417588, 569 | "dataframes": [ 570 | "_sqldf" 571 | ] 572 | }, 573 | "pythonIndentUnit": 2 574 | }, 575 | "notebookName": "Chapter 6 Code", 576 | "widgets": {} 577 | } 578 | }, 579 | "nbformat": 4, 580 | "nbformat_minor": 0 581 | } 582 | -------------------------------------------------------------------------------- /Chapter05/Chapter 5 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 5: Advanced Operations in Spark Code" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 0, 21 | "metadata": { 22 | "application/vnd.databricks.v1+cell": { 23 | "cellMetadata": { 24 | "byteLimit": 2048000, 25 | "rowLimit": 10000 26 | }, 27 | "inputWidgets": {}, 28 | "nuid": "0c029f8c-dfbc-4e10-b09d-ccbea7b62eec", 29 | "showTitle": true, 30 | "title": "Create Salary dataframe" 31 | } 32 | }, 33 | "outputs": [ 34 | { 35 | "output_type": "stream", 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 
4500|\n+--------+----------+------+\n\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n", 45 | " (\"Michael\", \"Field-eng\", 4500), \n", 46 | " (\"Robert\", None, 4000), \n", 47 | " (\"Maria\", \"Finance\", 3500), \n", 48 | " (\"John\", \"Sales\", 3000), \n", 49 | " (\"Kelly\", \"Finance\", 3500), \n", 50 | " (\"Kate\", \"Finance\", 3000), \n", 51 | " (\"Martin\", None, 3500), \n", 52 | " (\"Kiran\", \"Sales\", 2200), \n", 53 | " (\"Michael\", \"Field-eng\", 4500) \n", 54 | " ]\n", 55 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n", 56 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n", 57 | "salary_data.printSchema()\n", 58 | "salary_data.show()\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 0, 64 | "metadata": { 65 | "application/vnd.databricks.v1+cell": { 66 | "cellMetadata": { 67 | "byteLimit": 2048000, 68 | "rowLimit": 10000 69 | }, 70 | "inputWidgets": {}, 71 | "nuid": "5c64523b-97f8-4cdf-8c73-13723a7f7453", 72 | "showTitle": true, 73 | "title": "Using Groupby in a Dataframe" 74 | } 75 | }, 76 | "outputs": [ 77 | { 78 | "output_type": "execute_result", 79 | "data": { 80 | "text/plain": [ 81 | "GroupedData[grouping expressions: [Department], value: [Employee: string, Department: string, Salary: bigint], type: GroupBy]" 82 | ] 83 | }, 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "salary_data.groupby('Department')" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 0, 96 | "metadata": { 97 | "application/vnd.databricks.v1+cell": { 98 | "cellMetadata": { 99 | "byteLimit": 2048000, 100 | "rowLimit": 10000 101 | }, 102 | "inputWidgets": {}, 103 | "nuid": "73e2c600-8160-4138-968f-835e6757f06c", 104 | "showTitle": false, 105 | "title": "" 106 | } 107 | }, 108 | "outputs": [ 109 | { 110 | "output_type": "stream", 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Field-eng| 4166.666666666667|\n| Sales| 2600.0|\n| NULL| 3750.0|\n| Finance|3333.3333333333335|\n+----------+------------------+\n\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "salary_data.groupby('Department').avg().show()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 0, 125 | "metadata": { 126 | "application/vnd.databricks.v1+cell": { 127 | "cellMetadata": { 128 | "byteLimit": 2048000, 129 | "rowLimit": 10000 130 | }, 131 | "inputWidgets": {}, 132 | "nuid": "d437c9f0-2336-4687-83b4-7c8142b4085f", 133 | "showTitle": true, 134 | "title": "Complex Groupby Statement" 135 | } 136 | }, 137 | "outputs": [ 138 | { 139 | "output_type": "stream", 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "+----------+------+\n|Department|Salary|\n+----------+------+\n| NULL| 7500|\n| Field-eng| 12500|\n| Finance| 10000|\n| Sales| 5200|\n+----------+------+\n\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "from pyspark.sql.functions import col, round\n", 149 | "\n", 150 | "salary_data.groupBy('Department')\\\n", 151 | " .sum('Salary')\\\n", 152 | " .withColumn('sum(Salary)',round(col('sum(Salary)'), 2))\\\n", 153 | " .withColumnRenamed('sum(Salary)', 'Salary')\\\n", 154 | " .orderBy('Department')\\\n", 155 | " .show()\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 0, 161 | "metadata": { 162 | 
"application/vnd.databricks.v1+cell": { 163 | "cellMetadata": { 164 | "byteLimit": 2048000, 165 | "rowLimit": 10000 166 | }, 167 | "inputWidgets": {}, 168 | "nuid": "dfc73dea-aa0c-4a54-aded-a4c3814f01a9", 169 | "showTitle": true, 170 | "title": "Joining Dataframes in Spark" 171 | } 172 | }, 173 | "outputs": [ 174 | { 175 | "output_type": "stream", 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500), \\\n", 185 | " (2, \"Robert\", \"Sales\", 4000), \\\n", 186 | " (3, \"Maria\", \"Finance\", 3500), \\\n", 187 | " (4, \"Michael\", \"Sales\", 3000), \\\n", 188 | " (5, \"Kelly\", \"Finance\", 3500), \\\n", 189 | " (6, \"Kate\", \"Finance\", 3000), \\\n", 190 | " (7, \"Martin\", \"Finance\", 3500), \\\n", 191 | " (8, \"Kiran\", \"Sales\", 2200), \\\n", 192 | " ]\n", 193 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n", 194 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n", 195 | "salary_data_with_id.show()\n" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 0, 201 | "metadata": { 202 | "application/vnd.databricks.v1+cell": { 203 | "cellMetadata": { 204 | "byteLimit": 2048000, 205 | "rowLimit": 10000 206 | }, 207 | "inputWidgets": {}, 208 | "nuid": "125e73d8-c716-4e1c-8900-859c1ec666e9", 209 | "showTitle": true, 210 | "title": "Employee data" 211 | } 212 | }, 213 | "outputs": [ 214 | { 215 | "output_type": "stream", 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "+---+-----+------+\n| ID|State|Gender|\n+---+-----+------+\n| 1| NY| M|\n| 2| NC| M|\n| 3| NY| F|\n| 4| TX| M|\n| 5| NY| F|\n| 6| AZ| F|\n+---+-----+------+\n\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "employee_data = [(1, \"NY\", \"M\"), \\\n", 225 | " (2, \"NC\", \"M\"), \\\n", 226 | " (3, \"NY\", \"F\"), \\\n", 227 | " (4, \"TX\", \"M\"), \\\n", 228 | " (5, \"NY\", \"F\"), \\\n", 229 | " (6, \"AZ\", \"F\") \\\n", 230 | " ]\n", 231 | "columns= [\"ID\", \"State\", \"Gender\"]\n", 232 | "employee_data = spark.createDataFrame(data = employee_data, schema = columns)\n", 233 | "employee_data.show()\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 0, 239 | "metadata": { 240 | "application/vnd.databricks.v1+cell": { 241 | "cellMetadata": { 242 | "byteLimit": 2048000, 243 | "rowLimit": 10000 244 | }, 245 | "inputWidgets": {}, 246 | "nuid": "c0137bf4-d318-4417-86ca-df79f2fb80be", 247 | "showTitle": true, 248 | "title": "Inner join" 249 | } 250 | }, 251 | "outputs": [ 252 | { 253 | "output_type": "stream", 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| 
F|\n+---+--------+----------+------+---+-----+------+\n\n" 258 | ] 259 | } 260 | ], 261 | "source": [ 262 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"inner\").show()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 0, 268 | "metadata": { 269 | "application/vnd.databricks.v1+cell": { 270 | "cellMetadata": { 271 | "byteLimit": 2048000, 272 | "rowLimit": 10000 273 | }, 274 | "inputWidgets": {}, 275 | "nuid": "f34ff657-b0dd-4485-96f0-6d7c6126a1bd", 276 | "showTitle": true, 277 | "title": "Outer join" 278 | } 279 | }, 280 | "outputs": [ 281 | { 282 | "output_type": "stream", 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"outer\").show()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 0, 297 | "metadata": { 298 | "application/vnd.databricks.v1+cell": { 299 | "cellMetadata": { 300 | "byteLimit": 2048000, 301 | "rowLimit": 10000 302 | }, 303 | "inputWidgets": {}, 304 | "nuid": "868ca315-ab44-4eb6-b8f1-92481d770911", 305 | "showTitle": true, 306 | "title": "Left join" 307 | } 308 | }, 309 | "outputs": [ 310 | { 311 | "output_type": "stream", 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"left\").show()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 0, 326 | "metadata": { 327 | "application/vnd.databricks.v1+cell": { 328 | "cellMetadata": { 329 | "byteLimit": 2048000, 330 | "rowLimit": 10000 331 | }, 332 | "inputWidgets": {}, 333 | "nuid": "4cba2965-54b3-4d04-a456-77e9d9af6e1f", 334 | "showTitle": true, 335 | "title": "Right join" 336 | } 337 | }, 338 | "outputs": [ 339 | { 340 | "output_type": "stream", 341 | "name": "stdout", 342 | "output_type": "stream", 343 | "text": [ 344 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n+---+--------+----------+------+---+-----+------+\n\n" 345 
| ] 346 | } 347 | ], 348 | "source": [ 349 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"right\").show()" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 0, 355 | "metadata": { 356 | "application/vnd.databricks.v1+cell": { 357 | "cellMetadata": { 358 | "byteLimit": 2048000, 359 | "rowLimit": 10000 360 | }, 361 | "inputWidgets": {}, 362 | "nuid": "dd9f95c1-4109-4ceb-925d-7b10cf838fdd", 363 | "showTitle": true, 364 | "title": "Union" 365 | } 366 | }, 367 | "outputs": [ 368 | { 369 | "output_type": "stream", 370 | "name": "stdout", 371 | "output_type": "stream", 372 | "text": [ 373 | "root\n |-- ID: long (nullable = true)\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "salary_data_with_id_2 = [(1, \"John\", \"Field-eng\", 3500), \\\n", 379 | " (2, \"Robert\", \"Sales\", 4000), \\\n", 380 | " (3, \"Aliya\", \"Finance\", 3500), \\\n", 381 | " (4, \"Nate\", \"Sales\", 3000), \\\n", 382 | " ]\n", 383 | "columns2= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n", 384 | "\n", 385 | "salary_data_with_id_2 = spark.createDataFrame(data = salary_data_with_id_2, schema = columns2)\n", 386 | "\n", 387 | "salary_data_with_id_2.printSchema()\n", 388 | "salary_data_with_id_2.show(truncate=False)\n", 389 | "\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 0, 395 | "metadata": { 396 | "application/vnd.databricks.v1+cell": { 397 | "cellMetadata": { 398 | "byteLimit": 2048000, 399 | "rowLimit": 10000 400 | }, 401 | "inputWidgets": {}, 402 | "nuid": "2eb3d433-2a89-47b4-9d21-2d79194809c1", 403 | "showTitle": false, 404 | "title": "" 405 | } 406 | }, 407 | "outputs": [ 408 | { 409 | "output_type": "stream", 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Maria |Finance |3500 |\n|4 |Michael |Sales |3000 |\n|5 |Kelly |Finance |3500 |\n|6 |Kate |Finance |3000 |\n|7 |Martin |Finance |3500 |\n|8 |Kiran |Sales |2200 |\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "unionDF = salary_data_with_id.union(salary_data_with_id_2)\n", 419 | "unionDF.show(truncate=False)\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": { 425 | "application/vnd.databricks.v1+cell": { 426 | "cellMetadata": {}, 427 | "inputWidgets": {}, 428 | "nuid": "d84b0031-6f62-41a8-9533-b510a487ab0f", 429 | "showTitle": false, 430 | "title": "" 431 | } 432 | }, 433 | "source": [ 434 | "Reading and Writing Data" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 0, 440 | "metadata": { 441 | "application/vnd.databricks.v1+cell": { 442 | "cellMetadata": { 443 | "byteLimit": 2048000, 444 | "rowLimit": 10000 445 | }, 446 | "inputWidgets": {}, 447 | "nuid": "d3c8eb85-7d75-4010-977d-370b7940b57e", 448 | "showTitle": true, 449 | "title": "Reading and writing CSV files" 450 | } 451 | }, 452 | "outputs": [ 
453 | { 454 | "output_type": "stream", 455 | "name": "stdout", 456 | "output_type": "stream", 457 | "text": [ 458 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 459 | ] 460 | } 461 | ], 462 | "source": [ 463 | "\n", 464 | "salary_data_with_id.write.csv('salary_data.csv', mode='overwrite', header=True)\n", 465 | "spark.read.csv('/salary_data.csv', header=True).show()\n" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 0, 471 | "metadata": { 472 | "application/vnd.databricks.v1+cell": { 473 | "cellMetadata": { 474 | "byteLimit": 2048000, 475 | "rowLimit": 10000 476 | }, 477 | "inputWidgets": {}, 478 | "nuid": "b033bc47-7a90-4ae1-b37b-692860e06482", 479 | "showTitle": false, 480 | "title": "" 481 | } 482 | }, 483 | "outputs": [ 484 | { 485 | "output_type": "stream", 486 | "name": "stdout", 487 | "output_type": "stream", 488 | "text": [ 489 | "+---+-------+---------+\n| ID| State| Gender|\n+---+-------+---------+\n| 1| John|Field-eng|\n| 2| Robert| Sales|\n| 3| Maria| Finance|\n| 4|Michael| Sales|\n| 5| Kelly| Finance|\n| 6| Kate| Finance|\n| 7| Martin| Finance|\n| 8| Kiran| Sales|\n+---+-------+---------+\n\n" 490 | ] 491 | } 492 | ], 493 | "source": [ 494 | "from pyspark.sql.types import *\n", 495 | "\n", 496 | "filePath = '/salary_data.csv'\n", 497 | "columns= [\"ID\", \"State\", \"Gender\"] \n", 498 | "schema = StructType([\n", 499 | " StructField(\"ID\", IntegerType(),True),\n", 500 | " StructField(\"State\", StringType(),True),\n", 501 | " StructField(\"Gender\", StringType(),True)\n", 502 | "])\n", 503 | " \n", 504 | "read_data = spark.read.format(\"csv\").option(\"header\",\"true\").schema(schema).load(filePath)\n", 505 | "read_data.show()\n" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 0, 511 | "metadata": { 512 | "application/vnd.databricks.v1+cell": { 513 | "cellMetadata": { 514 | "byteLimit": 2048000, 515 | "rowLimit": 10000 516 | }, 517 | "inputWidgets": {}, 518 | "nuid": "bfd8f639-d141-48c9-be8e-dffd764aa0ee", 519 | "showTitle": true, 520 | "title": "Reading and writing Parquet files" 521 | } 522 | }, 523 | "outputs": [ 524 | { 525 | "output_type": "stream", 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "salary_data_with_id.write.parquet('salary_data.parquet', mode='overwrite')\n", 535 | "spark.read.parquet('/salary_data.parquet').show()\n" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 0, 541 | "metadata": { 542 | "application/vnd.databricks.v1+cell": { 543 | "cellMetadata": { 544 | "byteLimit": 2048000, 545 | "rowLimit": 10000 546 | }, 547 | "inputWidgets": {}, 548 | "nuid": "492b344b-3719-44cd-a8dc-034d20f3a409", 549 | "showTitle": true, 550 | "title": "Reading and writing ORC files" 551 | } 552 | }, 553 | "outputs": [ 554 | { 555 | 
"output_type": "stream", 556 | "name": "stdout", 557 | "output_type": "stream", 558 | "text": [ 559 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n+---+--------+----------+------+\n\n" 560 | ] 561 | } 562 | ], 563 | "source": [ 564 | "salary_data_with_id.write.orc('salary_data.orc', mode='overwrite')\n", 565 | "spark.read.orc('/salary_data.orc').show()" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 0, 571 | "metadata": { 572 | "application/vnd.databricks.v1+cell": { 573 | "cellMetadata": { 574 | "byteLimit": 2048000, 575 | "rowLimit": 10000 576 | }, 577 | "inputWidgets": {}, 578 | "nuid": "9b3c1309-4a00-4a92-ac3e-7f2a9d491445", 579 | "showTitle": true, 580 | "title": "Reading and writing Delta files" 581 | } 582 | }, 583 | "outputs": [ 584 | { 585 | "output_type": "stream", 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n" 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "salary_data_with_id.write.format(\"delta\").save(\"/FileStore/tables/salary_data_with_id\", mode='overwrite')\n", 595 | "df = spark.read.load(\"/FileStore/tables/salary_data_with_id\")\n", 596 | "df.show()\n" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 0, 602 | "metadata": { 603 | "application/vnd.databricks.v1+cell": { 604 | "cellMetadata": { 605 | "byteLimit": 2048000, 606 | "rowLimit": 10000 607 | }, 608 | "inputWidgets": {}, 609 | "nuid": "d616d17f-7848-4527-aae3-78eec9d3214d", 610 | "showTitle": true, 611 | "title": "Using SQL in Spark" 612 | } 613 | }, 614 | "outputs": [ 615 | { 616 | "output_type": "stream", 617 | "name": "stdout", 618 | "output_type": "stream", 619 | "text": [ 620 | "+--------+\n|count(1)|\n+--------+\n| 8|\n+--------+\n\n" 621 | ] 622 | } 623 | ], 624 | "source": [ 625 | "salary_data_with_id.createOrReplaceTempView(\"SalaryTable\")\n", 626 | "spark.sql(\"SELECT count(*) from SalaryTable\").show()\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": { 632 | "application/vnd.databricks.v1+cell": { 633 | "cellMetadata": {}, 634 | "inputWidgets": {}, 635 | "nuid": "f549a552-a92a-477c-bcbd-0eaf6104c207", 636 | "showTitle": false, 637 | "title": "" 638 | } 639 | }, 640 | "source": [ 641 | "Catalyst Optimizer" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 0, 647 | "metadata": { 648 | "application/vnd.databricks.v1+cell": { 649 | "cellMetadata": { 650 | "byteLimit": 2048000, 651 | "rowLimit": 10000 652 | }, 653 | "inputWidgets": {}, 654 | "nuid": "b66004b0-07ac-4c06-966e-1370a2e1b3d6", 655 | "showTitle": true, 656 | "title": "Catalyst Optimizer in Action" 657 | } 658 | }, 659 | "outputs": [ 660 | { 661 | "output_type": "stream", 662 | "name": "stdout", 663 | "output_type": "stream", 664 | "text": [ 665 | "== Physical Plan ==\n*(1) Project [employee#129490, department#129491]\n+- *(1) Filter (isnotnull(salary#129492) AND 
(salary#129492 > 3500))\n +- FileScan csv [Employee#129490,Department#129491,Salary#129492] Batched: false, DataFilters: [isnotnull(Salary#129492), (Salary#129492 > 3500)], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/salary_data.csv], PartitionFilters: [], PushedFilters: [IsNotNull(Salary), GreaterThan(Salary,3500)], ReadSchema: struct\n\n\n" 666 | ] 667 | } 668 | ], 669 | "source": [ 670 | "# SparkSession setup \n", 671 | "from pyspark.sql import SparkSession \n", 672 | "spark = SparkSession.builder.appName(\"CatalystOptimizerExample\").getOrCreate() \n", 673 | "# Load data \n", 674 | "df = spark.read.csv(\"/salary_data.csv\", header=True, inferSchema=True) \n", 675 | "# Query with Catalyst Optimizer \n", 676 | "result_df = df.select(\"employee\", \"department\").filter(df[\"salary\"] > 3500) \n", 677 | "# Explain the optimized query plan \n", 678 | "result_df.explain() \n" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 0, 684 | "metadata": { 685 | "application/vnd.databricks.v1+cell": { 686 | "cellMetadata": { 687 | "byteLimit": 2048000, 688 | "rowLimit": 10000 689 | }, 690 | "inputWidgets": {}, 691 | "nuid": "08ba28ee-80d0-4210-acb7-4a45bee2815b", 692 | "showTitle": true, 693 | "title": "Unpersisting Data" 694 | } 695 | }, 696 | "outputs": [ 697 | { 698 | "output_type": "execute_result", 699 | "data": { 700 | "text/plain": [ 701 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 702 | ] 703 | }, 704 | "execution_count": 24, 705 | "metadata": {}, 706 | "output_type": "execute_result" 707 | } 708 | ], 709 | "source": [ 710 | "# Cache a DataFrame \n", 711 | "df.cache() \n", 712 | "# Unpersist the cached DataFrame \n", 713 | "df.unpersist() \n" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 0, 719 | "metadata": { 720 | "application/vnd.databricks.v1+cell": { 721 | "cellMetadata": { 722 | "byteLimit": 2048000, 723 | "rowLimit": 10000 724 | }, 725 | "inputWidgets": {}, 726 | "nuid": "25bff695-c206-4800-80c0-e6dde6962438", 727 | "showTitle": true, 728 | "title": "Repartitioning Data" 729 | } 730 | }, 731 | "outputs": [ 732 | { 733 | "output_type": "execute_result", 734 | "data": { 735 | "text/plain": [ 736 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 737 | ] 738 | }, 739 | "execution_count": 25, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "# Repartition a DataFrame into 8 partitions \n", 746 | "df.repartition(8) \n" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 0, 752 | "metadata": { 753 | "application/vnd.databricks.v1+cell": { 754 | "cellMetadata": { 755 | "byteLimit": 2048000, 756 | "rowLimit": 10000 757 | }, 758 | "inputWidgets": {}, 759 | "nuid": "3eb87f1e-ba2d-47fb-a7c1-b74e1f293598", 760 | "showTitle": true, 761 | "title": "Coalescing Data" 762 | } 763 | }, 764 | "outputs": [ 765 | { 766 | "output_type": "execute_result", 767 | "data": { 768 | "text/plain": [ 769 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]" 770 | ] 771 | }, 772 | "execution_count": 26, 773 | "metadata": {}, 774 | "output_type": "execute_result" 775 | } 776 | ], 777 | "source": [ 778 | "# Coalesce a DataFrame to 4 partitions \n", 779 | "df.coalesce(4) \n" 780 | ] 781 | } 782 | ], 783 | "metadata": { 784 | "application/vnd.databricks.v1+notebook": { 785 | "dashboards": [], 786 | "language": "python", 787 | "notebookMetadata": { 788 | "mostRecentlyExecutedCommandWithImplicitDF": { 789 | 
"commandId": 969987236417588, 790 | "dataframes": [ 791 | "_sqldf" 792 | ] 793 | }, 794 | "pythonIndentUnit": 2 795 | }, 796 | "notebookName": "Chapter 5 Code", 797 | "widgets": {} 798 | } 799 | }, 800 | "nbformat": 4, 801 | "nbformat_minor": 0 802 | } 803 | -------------------------------------------------------------------------------- /Chapter04/Chapter 4 Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Chapter 4 : Spark Dataframes and Operations Code" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "application/vnd.databricks.v1+cell": { 22 | "cellMetadata": {}, 23 | "inputWidgets": {}, 24 | "nuid": "c8b85703-a9de-4ac7-892a-b1fb92ac4442", 25 | "showTitle": false, 26 | "title": "" 27 | } 28 | }, 29 | "source": [ 30 | "Create Dataframe Operations" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 0, 36 | "metadata": { 37 | "application/vnd.databricks.v1+cell": { 38 | "cellMetadata": { 39 | "byteLimit": 2048000, 40 | "rowLimit": 10000 41 | }, 42 | "inputWidgets": {}, 43 | "nuid": "0269a412-f57f-4c05-b079-4f7236b5cbc8", 44 | "showTitle": true, 45 | "title": "Create Dataframe from list of rows" 46 | } 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "import pandas as pd\n", 51 | "from datetime import datetime, date\n", 52 | "from pyspark.sql import Row\n", 53 | "\n", 54 | "data_df = spark.createDataFrame([\n", 55 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n", 56 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n", 57 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n", 58 | "])\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 0, 64 | "metadata": { 65 | "application/vnd.databricks.v1+cell": { 66 | "cellMetadata": { 67 | "byteLimit": 2048000, 68 | "rowLimit": 10000 69 | }, 70 | "inputWidgets": {}, 71 | "nuid": "70b00e29-29b9-47c6-9353-aecc105f5aba", 72 | "showTitle": true, 73 | "title": "Create Dataframe from list of rows using schema" 74 | } 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "import pandas as pd\n", 79 | "from datetime import datetime, date\n", 80 | "from pyspark.sql import Row\n", 81 | "\n", 82 | "data_df = spark.createDataFrame([\n", 83 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n", 84 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n", 85 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n", 86 | "], schema=' col_1 long, col_2 double, col_3 string, col_4 date, col_5 timestamp')\n" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 0, 92 | "metadata": { 93 | "application/vnd.databricks.v1+cell": { 94 | "cellMetadata": { 95 | "byteLimit": 2048000, 96 | "rowLimit": 10000 97 | }, 98 | "inputWidgets": {}, 99 | "nuid": "d0238c11-4d56-4175-89b2-bb4ddcc4976a", 100 | "showTitle": true, 101 | "title": "Create Dataframe from pandas dataframe" 102 | } 103 | 
}, 104 | "outputs": [], 105 | "source": [ 106 | "import pandas as pd\n", 107 | "from datetime import datetime, date\n", 108 | "from pyspark.sql import Row\n", 109 | "\n", 110 | "pandas_df = pd.DataFrame({\n", 111 | " 'col_1': [100, 200, 400],\n", 112 | " 'col_2': [200., 300., 500.],\n", 113 | " 'col_3': ['string_test_1', 'string_test_2', 'string_test_3'],\n", 114 | " 'col_4': [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],\n", 115 | " 'col_5': [datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]\n", 116 | "})\n", 117 | "df = spark.createDataFrame(pandas_df)\n" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 0, 123 | "metadata": { 124 | "application/vnd.databricks.v1+cell": { 125 | "cellMetadata": { 126 | "byteLimit": 2048000, 127 | "rowLimit": 10000 128 | }, 129 | "inputWidgets": {}, 130 | "nuid": "a1791d2e-ff43-4557-ad31-a2d92c3a21a8", 131 | "showTitle": false, 132 | "title": "" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "from datetime import datetime, date\n", 138 | "from pyspark.sql import SparkSession\n", 139 | "\n", 140 | "spark = SparkSession.builder.getOrCreate()\n", 141 | "\n", 142 | "rdd = spark.sparkContext.parallelize([\n", 143 | " (100, 200., 'string_test_1', date(2023, 1, 1), datetime(2023, 1, 1, 12, 0)),\n", 144 | " (200, 300., 'string_test_2', date(2023, 2, 1), datetime(2023, 1, 2, 12, 0)),\n", 145 | " (300, 400., 'string_test_3', date(2023, 3, 1), datetime(2023, 1, 3, 12, 0))\n", 146 | "])\n", 147 | "data_df = spark.createDataFrame(rdd, schema=['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "application/vnd.databricks.v1+cell": { 154 | "cellMetadata": {}, 155 | "inputWidgets": {}, 156 | "nuid": "d87a4498-fa76-444e-984b-25cec32fb37c", 157 | "showTitle": false, 158 | "title": "" 159 | } 160 | }, 161 | "source": [ 162 | "How to View the Dataframes" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 0, 168 | "metadata": { 169 | "application/vnd.databricks.v1+cell": { 170 | "cellMetadata": { 171 | "byteLimit": 2048000, 172 | "rowLimit": 10000 173 | }, 174 | "inputWidgets": {}, 175 | "nuid": "d4e5a140-b2bd-4cf3-8798-6fecb6164064", 176 | "showTitle": true, 177 | "title": "Viewing DataFrames " 178 | } 179 | }, 180 | "outputs": [ 181 | { 182 | "output_type": "stream", 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00|\n+-----+-----+-------------+----------+-------------------+\n\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "data_df.show()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 0, 197 | "metadata": { 198 | "application/vnd.databricks.v1+cell": { 199 | "cellMetadata": { 200 | "byteLimit": 2048000, 201 | "rowLimit": 10000 202 | }, 203 | "inputWidgets": {}, 204 | "nuid": "465e9144-4bf7-472d-bb62-35dad761c240", 205 | "showTitle": true, 206 | "title": "Viewing top n rows" 207 | } 208 | }, 209 | "outputs": [ 210 | { 211 | "output_type": "stream", 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| 
col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n+-----+-----+-------------+----------+-------------------+\nonly showing top 2 rows\n\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "data_df.show(2)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 0, 226 | "metadata": { 227 | "application/vnd.databricks.v1+cell": { 228 | "cellMetadata": { 229 | "byteLimit": 2048000, 230 | "rowLimit": 10000 231 | }, 232 | "inputWidgets": {}, 233 | "nuid": "3882bfa7-6fa4-4a7c-b039-d919edc5fb07", 234 | "showTitle": true, 235 | "title": "Viewing DataFrame schema" 236 | } 237 | }, 238 | "outputs": [ 239 | { 240 | "output_type": "stream", 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "root\n |-- col_1: long (nullable = true)\n |-- col_2: double (nullable = true)\n |-- col_3: string (nullable = true)\n |-- col_4: date (nullable = true)\n |-- col_5: timestamp (nullable = true)\n\n" 245 | ] 246 | } 247 | ], 248 | "source": [ 249 | "data_df.printSchema()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 0, 255 | "metadata": { 256 | "application/vnd.databricks.v1+cell": { 257 | "cellMetadata": { 258 | "byteLimit": 2048000, 259 | "rowLimit": 10000 260 | }, 261 | "inputWidgets": {}, 262 | "nuid": "49b81183-36c8-4862-b2fb-4778ceb6d16c", 263 | "showTitle": true, 264 | "title": "Viewing data vertically" 265 | } 266 | }, 267 | "outputs": [ 268 | { 269 | "output_type": "stream", 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "-RECORD 0--------------------\n col_1 | 100 \n col_2 | 200.0 \n col_3 | string_test_1 \n col_4 | 2023-01-01 \n col_5 | 2023-01-01 12:00:00 \nonly showing top 1 row\n\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "data_df.show(1, vertical=True)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 0, 284 | "metadata": { 285 | "application/vnd.databricks.v1+cell": { 286 | "cellMetadata": { 287 | "byteLimit": 2048000, 288 | "rowLimit": 10000 289 | }, 290 | "inputWidgets": {}, 291 | "nuid": "18974b22-4fae-4787-aacb-67eb458c20a2", 292 | "showTitle": true, 293 | "title": "Viewing columns of data " 294 | } 295 | }, 296 | "outputs": [ 297 | { 298 | "output_type": "execute_result", 299 | "data": { 300 | "text/plain": [ 301 | "['col_1', 'col_2', 'col_3', 'col_4', 'col_5']" 302 | ] 303 | }, 304 | "execution_count": 7, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "data_df.columns" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 0, 316 | "metadata": { 317 | "application/vnd.databricks.v1+cell": { 318 | "cellMetadata": { 319 | "byteLimit": 2048000, 320 | "rowLimit": 10000 321 | }, 322 | "inputWidgets": {}, 323 | "nuid": "f607973a-991a-4892-ada8-a6f8e2daf5d1", 324 | "showTitle": true, 325 | "title": "Counting number of rows of data" 326 | } 327 | }, 328 | "outputs": [ 329 | { 330 | "output_type": "execute_result", 331 | "data": { 332 | "text/plain": [ 333 | "3" 334 | ] 335 | }, 336 | "execution_count": 8, 337 | "metadata": {}, 338 | "output_type": "execute_result" 339 | } 340 | ], 341 | "source": [ 342 | "data_df.count()" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 0, 348 | "metadata": { 349 | "application/vnd.databricks.v1+cell": { 350 | "cellMetadata": { 351 | "byteLimit": 2048000, 352 | "rowLimit": 
10000 353 | }, 354 | "inputWidgets": {}, 355 | "nuid": "3513085f-5679-4735-9085-7e7b3de398b4", 356 | "showTitle": true, 357 | "title": "Viewing summary statistics " 358 | } 359 | }, 360 | "outputs": [ 361 | { 362 | "output_type": "stream", 363 | "name": "stdout", 364 | "output_type": "stream", 365 | "text": [ 366 | "+-------+-----+-----+-------------+\n|summary|col_1|col_2| col_3|\n+-------+-----+-----+-------------+\n| count| 3| 3| 3|\n| mean|200.0|300.0| NULL|\n| stddev|100.0|100.0| NULL|\n| min| 100|200.0|string_test_1|\n| max| 300|400.0|string_test_3|\n+-------+-----+-----+-------------+\n\n" 367 | ] 368 | } 369 | ], 370 | "source": [ 371 | "data_df.select('col_1', 'col_2', 'col_3').describe().show()" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 0, 377 | "metadata": { 378 | "application/vnd.databricks.v1+cell": { 379 | "cellMetadata": { 380 | "byteLimit": 2048000, 381 | "rowLimit": 10000 382 | }, 383 | "inputWidgets": {}, 384 | "nuid": "550f6a6b-61cc-4913-9220-9cb02962045c", 385 | "showTitle": true, 386 | "title": "Collecting the data" 387 | } 388 | }, 389 | "outputs": [ 390 | { 391 | "output_type": "execute_result", 392 | "data": { 393 | "text/plain": [ 394 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0)),\n", 395 | " Row(col_1=200, col_2=300.0, col_3='string_test_2', col_4=datetime.date(2023, 2, 1), col_5=datetime.datetime(2023, 1, 2, 12, 0)),\n", 396 | " Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]" 397 | ] 398 | }, 399 | "execution_count": 10, 400 | "metadata": {}, 401 | "output_type": "execute_result" 402 | } 403 | ], 404 | "source": [ 405 | "data_df.collect()" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 0, 411 | "metadata": { 412 | "application/vnd.databricks.v1+cell": { 413 | "cellMetadata": { 414 | "byteLimit": 2048000, 415 | "rowLimit": 10000 416 | }, 417 | "inputWidgets": {}, 418 | "nuid": "d1a9456e-7fc7-4b73-bdfe-b3bd45f01f68", 419 | "showTitle": true, 420 | "title": "Using take" 421 | } 422 | }, 423 | "outputs": [ 424 | { 425 | "output_type": "execute_result", 426 | "data": { 427 | "text/plain": [ 428 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]" 429 | ] 430 | }, 431 | "execution_count": 11, 432 | "metadata": {}, 433 | "output_type": "execute_result" 434 | } 435 | ], 436 | "source": [ 437 | "data_df.take(1)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 0, 443 | "metadata": { 444 | "application/vnd.databricks.v1+cell": { 445 | "cellMetadata": { 446 | "byteLimit": 2048000, 447 | "rowLimit": 10000 448 | }, 449 | "inputWidgets": {}, 450 | "nuid": "5f5c4426-d270-40be-9088-c74d727af5b1", 451 | "showTitle": true, 452 | "title": "Using tail" 453 | } 454 | }, 455 | "outputs": [ 456 | { 457 | "output_type": "execute_result", 458 | "data": { 459 | "text/plain": [ 460 | "[Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]" 461 | ] 462 | }, 463 | "execution_count": 12, 464 | "metadata": {}, 465 | "output_type": "execute_result" 466 | } 467 | ], 468 | "source": [ 469 | "data_df.tail(1)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 0, 475 | "metadata": { 476 | "application/vnd.databricks.v1+cell": { 477 | "cellMetadata": { 478 | 
"byteLimit": 2048000, 479 | "rowLimit": 10000 480 | }, 481 | "inputWidgets": {}, 482 | "nuid": "fbdd9347-da63-4575-ac6d-b55b593044ff", 483 | "showTitle": true, 484 | "title": "Using head" 485 | } 486 | }, 487 | "outputs": [ 488 | { 489 | "output_type": "execute_result", 490 | "data": { 491 | "text/plain": [ 492 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]" 493 | ] 494 | }, 495 | "execution_count": 13, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "data_df.head(1)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 0, 507 | "metadata": { 508 | "application/vnd.databricks.v1+cell": { 509 | "cellMetadata": { 510 | "byteLimit": 2048000, 511 | "rowLimit": 10000 512 | }, 513 | "inputWidgets": {}, 514 | "nuid": "2e48afa4-79e2-4ac5-86f0-cb95a8145ef0", 515 | "showTitle": true, 516 | "title": "Converting Pyspark dataframe to Pandas" 517 | } 518 | }, 519 | "outputs": [ 520 | { 521 | "output_type": "execute_result", 522 | "data": { 523 | "text/html": [ 524 | "
\n", 525 | "\n", 538 | "\n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | "
col_1col_2col_3col_4col_5
0100200.0string_test_12023-01-012023-01-01 12:00:00
1200300.0string_test_22023-02-012023-01-02 12:00:00
2300400.0string_test_32023-03-012023-01-03 12:00:00
\n", 576 | "
" 577 | ], 578 | "text/plain": [ 579 | " col_1 col_2 col_3 col_4 col_5\n", 580 | "0 100 200.0 string_test_1 2023-01-01 2023-01-01 12:00:00\n", 581 | "1 200 300.0 string_test_2 2023-02-01 2023-01-02 12:00:00\n", 582 | "2 300 400.0 string_test_3 2023-03-01 2023-01-03 12:00:00" 583 | ] 584 | }, 585 | "execution_count": 15, 586 | "metadata": {}, 587 | "output_type": "execute_result" 588 | } 589 | ], 590 | "source": [ 591 | "data_df.toPandas()" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": { 597 | "application/vnd.databricks.v1+cell": { 598 | "cellMetadata": {}, 599 | "inputWidgets": {}, 600 | "nuid": "f0502651-b54a-45e6-84ae-62ea7e1600ad", 601 | "showTitle": false, 602 | "title": "" 603 | } 604 | }, 605 | "source": [ 606 | "How to do Data Manipulation - Rows and Columns" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 0, 612 | "metadata": { 613 | "application/vnd.databricks.v1+cell": { 614 | "cellMetadata": { 615 | "byteLimit": 2048000, 616 | "rowLimit": 10000 617 | }, 618 | "inputWidgets": {}, 619 | "nuid": "e4f34241-a0de-47d4-89d0-5596681206c5", 620 | "showTitle": true, 621 | "title": "Selecting Columns" 622 | } 623 | }, 624 | "outputs": [ 625 | { 626 | "output_type": "stream", 627 | "name": "stdout", 628 | "output_type": "stream", 629 | "text": [ 630 | "+-------------+\n| col_3|\n+-------------+\n|string_test_1|\n|string_test_2|\n|string_test_3|\n+-------------+\n\n" 631 | ] 632 | } 633 | ], 634 | "source": [ 635 | "from pyspark.sql import Column\n", 636 | "\n", 637 | "data_df.select(data_df.col_3).show()\n" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 0, 643 | "metadata": { 644 | "application/vnd.databricks.v1+cell": { 645 | "cellMetadata": { 646 | "byteLimit": 2048000, 647 | "rowLimit": 10000 648 | }, 649 | "inputWidgets": {}, 650 | "nuid": "47fd50e6-1433-4add-ae2f-1ddcfb1a0e7c", 651 | "showTitle": true, 652 | "title": "Creating Columns" 653 | } 654 | }, 655 | "outputs": [ 656 | { 657 | "output_type": "stream", 658 | "name": "stdout", 659 | "output_type": "stream", 660 | "text": [ 661 | "+-----+-----+-------------+----------+-------------------+-----+\n|col_1|col_2| col_3| col_4| col_5|col_6|\n+-----+-----+-------------+----------+-------------------+-----+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00| A|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00| A|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00| A|\n+-----+-----+-------------+----------+-------------------+-----+\n\n" 662 | ] 663 | } 664 | ], 665 | "source": [ 666 | "from pyspark.sql import functions as F\n", 667 | "data_df = data_df.withColumn(\"col_6\", F.lit(\"A\"))\n", 668 | "data_df.show()\n" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 0, 674 | "metadata": { 675 | "application/vnd.databricks.v1+cell": { 676 | "cellMetadata": { 677 | "byteLimit": 2048000, 678 | "rowLimit": 10000 679 | }, 680 | "inputWidgets": {}, 681 | "nuid": "62323798-4638-483c-9d30-e9206ef826de", 682 | "showTitle": true, 683 | "title": "Dropping Columns" 684 | } 685 | }, 686 | "outputs": [ 687 | { 688 | "output_type": "stream", 689 | "name": "stdout", 690 | "output_type": "stream", 691 | "text": [ 692 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 
693 | ] 694 | } 695 | ], 696 | "source": [ 697 | "data_df = data_df.drop(\"col_5\")\n", 698 | "data_df.show()\n" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 0, 704 | "metadata": { 705 | "application/vnd.databricks.v1+cell": { 706 | "cellMetadata": { 707 | "byteLimit": 2048000, 708 | "rowLimit": 10000 709 | }, 710 | "inputWidgets": {}, 711 | "nuid": "fb759316-d4e6-43b0-8ced-d073f1a20f97", 712 | "showTitle": true, 713 | "title": "Updating Columns" 714 | } 715 | }, 716 | "outputs": [ 717 | { 718 | "output_type": "stream", 719 | "name": "stdout", 720 | "output_type": "stream", 721 | "text": [ 722 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100| 2.0|string_test_1|2023-01-01| A|\n| 200| 3.0|string_test_2|2023-02-01| A|\n| 300| 4.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 723 | ] 724 | } 725 | ], 726 | "source": [ 727 | "data_df.withColumn(\"col_2\", F.col(\"col_2\") / 100).show()" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": 0, 733 | "metadata": { 734 | "application/vnd.databricks.v1+cell": { 735 | "cellMetadata": { 736 | "byteLimit": 2048000, 737 | "rowLimit": 10000 738 | }, 739 | "inputWidgets": {}, 740 | "nuid": "2cc409f3-6407-48a1-a5c9-21f08062a7f3", 741 | "showTitle": true, 742 | "title": "Renaming Columns" 743 | } 744 | }, 745 | "outputs": [ 746 | { 747 | "output_type": "stream", 748 | "name": "stdout", 749 | "output_type": "stream", 750 | "text": [ 751 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 752 | ] 753 | } 754 | ], 755 | "source": [ 756 | "data_df = data_df.withColumnRenamed(\"col_3\", \"string_col\")\n", 757 | "data_df.show()\n" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 0, 763 | "metadata": { 764 | "application/vnd.databricks.v1+cell": { 765 | "cellMetadata": { 766 | "byteLimit": 2048000, 767 | "rowLimit": 10000 768 | }, 769 | "inputWidgets": {}, 770 | "nuid": "fd1755b4-eec5-44f5-b8d3-946cb0359432", 771 | "showTitle": true, 772 | "title": "Finding Unique Values in a Column" 773 | } 774 | }, 775 | "outputs": [ 776 | { 777 | "output_type": "stream", 778 | "name": "stdout", 779 | "output_type": "stream", 780 | "text": [ 781 | "+-----+\n|col_6|\n+-----+\n| A|\n+-----+\n\n" 782 | ] 783 | } 784 | ], 785 | "source": [ 786 | "data_df.select(\"col_6\").distinct().show()" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 0, 792 | "metadata": { 793 | "application/vnd.databricks.v1+cell": { 794 | "cellMetadata": { 795 | "byteLimit": 2048000, 796 | "rowLimit": 10000 797 | }, 798 | "inputWidgets": {}, 799 | "nuid": "beddedbc-ced1-477d-b8bd-30a102ef10dd", 800 | "showTitle": false, 801 | "title": "" 802 | } 803 | }, 804 | "outputs": [ 805 | { 806 | "output_type": "stream", 807 | "name": "stdout", 808 | "output_type": "stream", 809 | "text": [ 810 | "+------------+\n|Total_Unique|\n+------------+\n| 1|\n+------------+\n\n" 811 | ] 812 | } 813 | ], 814 | "source": [ 815 | "data_df.select(F.countDistinct(\"col_6\").alias(\"Total_Unique\")).show()" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 0, 821 | "metadata": { 822 | "application/vnd.databricks.v1+cell": { 
823 | "cellMetadata": { 824 | "byteLimit": 2048000, 825 | "rowLimit": 10000 826 | }, 827 | "inputWidgets": {}, 828 | "nuid": "469c586b-c14e-4652-be89-97c9b62e5818", 829 | "showTitle": true, 830 | "title": "Change case of a Column" 831 | } 832 | }, 833 | "outputs": [ 834 | { 835 | "output_type": "stream", 836 | "name": "stdout", 837 | "output_type": "stream", 838 | "text": [ 839 | "+-----+-----+-------------+----------+-----+----------------+\n|col_1|col_2| string_col| col_4|col_6|upper_string_col|\n+-----+-----+-------------+----------+-----+----------------+\n| 100|200.0|string_test_1|2023-01-01| A| STRING_TEST_1|\n| 200|300.0|string_test_2|2023-02-01| A| STRING_TEST_2|\n| 300|400.0|string_test_3|2023-03-01| A| STRING_TEST_3|\n+-----+-----+-------------+----------+-----+----------------+\n\n" 840 | ] 841 | } 842 | ], 843 | "source": [ 844 | "from pyspark.sql.functions import upper\n", 845 | "\n", 846 | "data_df.withColumn('upper_string_col', upper(data_df.string_col)).show()\n" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": 0, 852 | "metadata": { 853 | "application/vnd.databricks.v1+cell": { 854 | "cellMetadata": { 855 | "byteLimit": 2048000, 856 | "rowLimit": 10000 857 | }, 858 | "inputWidgets": {}, 859 | "nuid": "c7b3e23f-84f5-41de-a09d-a7a20e9404b5", 860 | "showTitle": true, 861 | "title": "Filtering a Dataframe" 862 | } 863 | }, 864 | "outputs": [ 865 | { 866 | "output_type": "stream", 867 | "name": "stdout", 868 | "output_type": "stream", 869 | "text": [ 870 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 871 | ] 872 | } 873 | ], 874 | "source": [ 875 | "data_df.filter(data_df.col_1 == 100).show()" 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": 0, 881 | "metadata": { 882 | "application/vnd.databricks.v1+cell": { 883 | "cellMetadata": { 884 | "byteLimit": 2048000, 885 | "rowLimit": 10000 886 | }, 887 | "inputWidgets": {}, 888 | "nuid": "889da414-a8e1-4014-ab7f-5e1f10d362fa", 889 | "showTitle": true, 890 | "title": "Logical Operators in a Dataframe" 891 | } 892 | }, 893 | "outputs": [ 894 | { 895 | "output_type": "stream", 896 | "name": "stdout", 897 | "output_type": "stream", 898 | "text": [ 899 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 900 | ] 901 | } 902 | ], 903 | "source": [ 904 | "data_df.filter((data_df.col_1 == 100)\n", 905 | "\t\t& (data_df.col_6 == 'A')).show()\n" 906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 0, 911 | "metadata": { 912 | "application/vnd.databricks.v1+cell": { 913 | "cellMetadata": { 914 | "byteLimit": 2048000, 915 | "rowLimit": 10000 916 | }, 917 | "inputWidgets": {}, 918 | "nuid": "582e12c8-1081-4a3f-a539-8fbf1080e316", 919 | "showTitle": false, 920 | "title": "" 921 | } 922 | }, 923 | "outputs": [ 924 | { 925 | "output_type": "stream", 926 | "name": "stdout", 927 | "output_type": "stream", 928 | "text": [ 929 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 930 | ] 931 | } 932 | ], 933 | 
"source": [ 934 | "data_df.filter((data_df.col_1 == 100)\n", 935 | "\t\t| (data_df.col_2 == 300.00)).show()\n" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 0, 941 | "metadata": { 942 | "application/vnd.databricks.v1+cell": { 943 | "cellMetadata": { 944 | "byteLimit": 2048000, 945 | "rowLimit": 10000 946 | }, 947 | "inputWidgets": {}, 948 | "nuid": "ef3afdd9-e883-40cc-bd79-cc9fcff57a7e", 949 | "showTitle": true, 950 | "title": "Using Isin()" 951 | } 952 | }, 953 | "outputs": [ 954 | { 955 | "output_type": "stream", 956 | "name": "stdout", 957 | "output_type": "stream", 958 | "text": [ 959 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 960 | ] 961 | } 962 | ], 963 | "source": [ 964 | "list = [100, 200]\n", 965 | "data_df.filter(data_df.col_1.isin(list)).show()\n" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": 0, 971 | "metadata": { 972 | "application/vnd.databricks.v1+cell": { 973 | "cellMetadata": { 974 | "byteLimit": 2048000, 975 | "rowLimit": 10000 976 | }, 977 | "inputWidgets": {}, 978 | "nuid": "fb4e727a-c8b3-4010-bbde-2224b82aab79", 979 | "showTitle": true, 980 | "title": "Datatype conversions" 981 | } 982 | }, 983 | "outputs": [ 984 | { 985 | "output_type": "stream", 986 | "name": "stdout", 987 | "output_type": "stream", 988 | "text": [ 989 | "root\n |-- col_1: integer (nullable = true)\n |-- col_2: double (nullable = true)\n |-- string_col: string (nullable = true)\n |-- col_4: string (nullable = true)\n |-- col_6: string (nullable = false)\n\n+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n" 990 | ] 991 | } 992 | ], 993 | "source": [ 994 | "from pyspark.sql.functions import col\n", 995 | "from pyspark.sql.types import StringType,BooleanType,DateType,IntegerType\n", 996 | "\n", 997 | "data_df_2 = data_df.withColumn(\"col_4\",col(\"col_4\").cast(StringType())) \\\n", 998 | " .withColumn(\"col_1\",col(\"col_1\").cast(IntegerType()))\n", 999 | "data_df_2.printSchema()\n", 1000 | "data_df.show()\n", 1001 | "\n" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "execution_count": 0, 1007 | "metadata": { 1008 | "application/vnd.databricks.v1+cell": { 1009 | "cellMetadata": { 1010 | "byteLimit": 2048000, 1011 | "rowLimit": 10000 1012 | }, 1013 | "inputWidgets": {}, 1014 | "nuid": "908430a1-d7f1-4372-aebc-ef371f1efc96", 1015 | "showTitle": false, 1016 | "title": "" 1017 | } 1018 | }, 1019 | "outputs": [ 1020 | { 1021 | "output_type": "stream", 1022 | "name": "stdout", 1023 | "output_type": "stream", 1024 | "text": [ 1025 | "root\n |-- col_4: date (nullable = true)\n |-- col_1: long (nullable = true)\n\n" 1026 | ] 1027 | } 1028 | ], 1029 | "source": [ 1030 | "data_df_3 = data_df_2.selectExpr(\"cast(col_4 as date) col_4\",\n", 1031 | " \"cast(col_1 as long) col_1\")\n", 1032 | "data_df_3.printSchema()\n" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 0, 1038 | "metadata": { 1039 | "application/vnd.databricks.v1+cell": { 1040 | "cellMetadata": { 1041 | "byteLimit": 2048000, 1042 | "rowLimit": 10000 1043 | }, 1044 | 
"inputWidgets": {}, 1045 | "nuid": "08646a8d-923c-440a-bae9-305e390b529e", 1046 | "showTitle": false, 1047 | "title": "" 1048 | } 1049 | }, 1050 | "outputs": [ 1051 | { 1052 | "output_type": "stream", 1053 | "name": "stdout", 1054 | "output_type": "stream", 1055 | "text": [ 1056 | "root\n |-- col_1: double (nullable = true)\n |-- col_4: date (nullable = true)\n\n+-----+----------+\n|col_1|col_4 |\n+-----+----------+\n|100.0|2023-01-01|\n|200.0|2023-02-01|\n|300.0|2023-03-01|\n+-----+----------+\n\n" 1057 | ] 1058 | } 1059 | ], 1060 | "source": [ 1061 | "data_df_3.createOrReplaceTempView(\"CastExample\")\n", 1062 | "data_df_4 = spark.sql(\"SELECT DOUBLE(col_1), DATE(col_4) from CastExample\")\n", 1063 | "data_df_4.printSchema()\n", 1064 | "data_df_4.show(truncate=False)\n" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": 0, 1070 | "metadata": { 1071 | "application/vnd.databricks.v1+cell": { 1072 | "cellMetadata": { 1073 | "byteLimit": 2048000, 1074 | "rowLimit": 10000 1075 | }, 1076 | "inputWidgets": {}, 1077 | "nuid": "92dcefea-630b-4ecf-8649-705fbf78b93c", 1078 | "showTitle": true, 1079 | "title": "Dropping null values from a Dataframe" 1080 | } 1081 | }, 1082 | "outputs": [ 1083 | { 1084 | "output_type": "stream", 1085 | "name": "stdout", 1086 | "output_type": "stream", 1087 | "text": [ 1088 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1089 | ] 1090 | } 1091 | ], 1092 | "source": [ 1093 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n", 1094 | " (\"Michael\", \"Field-eng\", 4500), \n", 1095 | " (\"Robert\", None, 4000), \n", 1096 | " (\"Maria\", \"Finance\", 3500), \n", 1097 | " (\"John\", \"Sales\", 3000), \n", 1098 | " (\"Kelly\", \"Finance\", 3500), \n", 1099 | " (\"Kate\", \"Finance\", 3000), \n", 1100 | " (\"Martin\", None, 3500), \n", 1101 | " (\"Kiran\", \"Sales\", 2200), \n", 1102 | " (\"Michael\", \"Field-eng\", 4500) \n", 1103 | " ]\n", 1104 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n", 1105 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n", 1106 | "salary_data.printSchema()\n", 1107 | "salary_data.show()\n" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": 0, 1113 | "metadata": { 1114 | "application/vnd.databricks.v1+cell": { 1115 | "cellMetadata": { 1116 | "byteLimit": 2048000, 1117 | "rowLimit": 10000 1118 | }, 1119 | "inputWidgets": {}, 1120 | "nuid": "5f7ded6f-cc9d-4bfe-8f63-afe560278772", 1121 | "showTitle": false, 1122 | "title": "" 1123 | } 1124 | }, 1125 | "outputs": [ 1126 | { 1127 | "output_type": "stream", 1128 | "name": "stdout", 1129 | "output_type": "stream", 1130 | "text": [ 1131 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1132 | ] 1133 | } 1134 | ], 1135 | "source": [ 1136 | "salary_data.dropna().show()" 1137 | ] 1138 
| }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": 0, 1142 | "metadata": { 1143 | "application/vnd.databricks.v1+cell": { 1144 | "cellMetadata": { 1145 | "byteLimit": 2048000, 1146 | "rowLimit": 10000 1147 | }, 1148 | "inputWidgets": {}, 1149 | "nuid": "50cb0f47-2786-4d8a-9514-913108e338af", 1150 | "showTitle": true, 1151 | "title": "Dropping Duplicates from a Dataframe" 1152 | } 1153 | }, 1154 | "outputs": [ 1155 | { 1156 | "output_type": "stream", 1157 | "name": "stdout", 1158 | "output_type": "stream", 1159 | "text": [ 1160 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Sales| 3000|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n" 1161 | ] 1162 | } 1163 | ], 1164 | "source": [ 1165 | "new_salary_data = salary_data.dropDuplicates()\nnew_salary_data.show()" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": { 1171 | "application/vnd.databricks.v1+cell": { 1172 | "cellMetadata": {}, 1173 | "inputWidgets": {}, 1174 | "nuid": "26b09b00-3377-47f5-b54b-350d3126e92c", 1175 | "showTitle": false, 1176 | "title": "" 1177 | } 1178 | }, 1179 | "source": [ 1180 | "Using Aggregates in a Dataframe" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "code", 1185 | "execution_count": 0, 1186 | "metadata": { 1187 | "application/vnd.databricks.v1+cell": { 1188 | "cellMetadata": { 1189 | "byteLimit": 2048000, 1190 | "rowLimit": 10000 1191 | }, 1192 | "inputWidgets": {}, 1193 | "nuid": "151c07a1-1bf3-4cc0-98f4-46aad67e67d3", 1194 | "showTitle": true, 1195 | "title": "Average (avg)" 1196 | } 1197 | }, 1198 | "outputs": [ 1199 | { 1200 | "output_type": "stream", 1201 | "name": "stdout", 1202 | "output_type": "stream", 1203 | "text": [ 1204 | "+-----------+\n|avg(Salary)|\n+-----------+\n| 3520.0|\n+-----------+\n\n" 1205 | ] 1206 | } 1207 | ], 1208 | "source": [ 1209 | "from pyspark.sql.functions import countDistinct, avg\n", 1210 | "salary_data.select(avg('Salary')).show()\n" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "code", 1215 | "execution_count": 0, 1216 | "metadata": { 1217 | "application/vnd.databricks.v1+cell": { 1218 | "cellMetadata": { 1219 | "byteLimit": 2048000, 1220 | "rowLimit": 10000 1221 | }, 1222 | "inputWidgets": {}, 1223 | "nuid": "b3104546-2fba-4638-987b-100fb9ac53cf", 1224 | "showTitle": true, 1225 | "title": "Count" 1226 | } 1227 | }, 1228 | "outputs": [ 1229 | { 1230 | "output_type": "stream", 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "+-------------+\n|count(Salary)|\n+-------------+\n| 10|\n+-------------+\n\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "salary_data.agg({'Salary':'count'}).show()" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "code", 1244 | "execution_count": 0, 1245 | "metadata": { 1246 | "application/vnd.databricks.v1+cell": { 1247 | "cellMetadata": { 1248 | "byteLimit": 2048000, 1249 | "rowLimit": 10000 1250 | }, 1251 | "inputWidgets": {}, 1252 | "nuid": "8a396838-9c08-46f9-ae42-6b5db6c094c8", 1253 | "showTitle": true, 1254 | "title": "Count distinct values" 1255 | } 1256 | }, 1257 | "outputs": [ 1258 | { 1259 | "output_type": "stream", 1260 | "name": "stdout", 1261 | "output_type": "stream", 1262 | "text": [ 1263 | "+---------------+\n|Distinct Salary|\n+---------------+\n| 5|\n+---------------+\n\n" 1264 | ] 1265 | } 1266 | ], 1267 | "source": [
1268 | "salary_data.select(countDistinct(\"Salary\").alias(\"Distinct Salary\")).show()" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 0, 1274 | "metadata": { 1275 | "application/vnd.databricks.v1+cell": { 1276 | "cellMetadata": { 1277 | "byteLimit": 2048000, 1278 | "rowLimit": 10000 1279 | }, 1280 | "inputWidgets": {}, 1281 | "nuid": "0d17be07-1395-470b-afec-93e6d3a8b893", 1282 | "showTitle": true, 1283 | "title": "Finding maximums (max)" 1284 | } 1285 | }, 1286 | "outputs": [ 1287 | { 1288 | "output_type": "stream", 1289 | "name": "stdout", 1290 | "output_type": "stream", 1291 | "text": [ 1292 | "+-----------+\n|max(Salary)|\n+-----------+\n| 4500|\n+-----------+\n\n" 1293 | ] 1294 | } 1295 | ], 1296 | "source": [ 1297 | "salary_data.agg({'Salary':'max'}).show() " 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "code", 1302 | "execution_count": 0, 1303 | "metadata": { 1304 | "application/vnd.databricks.v1+cell": { 1305 | "cellMetadata": { 1306 | "byteLimit": 2048000, 1307 | "rowLimit": 10000 1308 | }, 1309 | "inputWidgets": {}, 1310 | "nuid": "dfb27aef-8f62-4c89-b005-e046bc924699", 1311 | "showTitle": true, 1312 | "title": "Sum" 1313 | } 1314 | }, 1315 | "outputs": [ 1316 | { 1317 | "output_type": "stream", 1318 | "name": "stdout", 1319 | "output_type": "stream", 1320 | "text": [ 1321 | "+-----------+\n|sum(Salary)|\n+-----------+\n| 35200|\n+-----------+\n\n" 1322 | ] 1323 | } 1324 | ], 1325 | "source": [ 1326 | "salary_data.agg({'Salary':'sum'}).show()" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": 0, 1332 | "metadata": { 1333 | "application/vnd.databricks.v1+cell": { 1334 | "cellMetadata": { 1335 | "byteLimit": 2048000, 1336 | "rowLimit": 10000 1337 | }, 1338 | "inputWidgets": {}, 1339 | "nuid": "7d0b8f92-9d16-4cc4-9bc8-8b67db6befd6", 1340 | "showTitle": true, 1341 | "title": "Sort data with OrderBy" 1342 | } 1343 | }, 1344 | "outputs": [ 1345 | { 1346 | "output_type": "stream", 1347 | "name": "stdout", 1348 | "output_type": "stream", 1349 | "text": [ 1350 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Kiran| Sales| 2200|\n| John| Sales| 3000|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| John| Field-eng| 3500|\n| Robert| NULL| 4000|\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n" 1351 | ] 1352 | } 1353 | ], 1354 | "source": [ 1355 | "salary_data.orderBy(\"Salary\").show()" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": 0, 1361 | "metadata": { 1362 | "application/vnd.databricks.v1+cell": { 1363 | "cellMetadata": { 1364 | "byteLimit": 2048000, 1365 | "rowLimit": 10000 1366 | }, 1367 | "inputWidgets": {}, 1368 | "nuid": "021a9f0c-59f2-497e-adbe-f39a2f7b7b21", 1369 | "showTitle": false, 1370 | "title": "" 1371 | } 1372 | }, 1373 | "outputs": [ 1374 | { 1375 | "output_type": "stream", 1376 | "name": "stdout", 1377 | "output_type": "stream", 1378 | "text": [ 1379 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Field-eng| 3500|\n| Martin| NULL| 3500|\n| Kelly| Finance| 3500|\n| Maria| Finance| 3500|\n| Kate| Finance| 3000|\n| John| Sales| 3000|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n" 1380 | ] 1381 | } 1382 | ], 1383 | "source": [ 1384 | 
"salary_data.orderBy(salary_data[\"Salary\"].desc()).show()" 1385 | ] 1386 | } 1387 | ], 1388 | "metadata": { 1389 | "application/vnd.databricks.v1+notebook": { 1390 | "dashboards": [], 1391 | "language": "python", 1392 | "notebookMetadata": { 1393 | "mostRecentlyExecutedCommandWithImplicitDF": { 1394 | "commandId": 969987236417588, 1395 | "dataframes": [ 1396 | "_sqldf" 1397 | ] 1398 | }, 1399 | "pythonIndentUnit": 2 1400 | }, 1401 | "notebookName": "Chapter 4 Code", 1402 | "widgets": {} 1403 | } 1404 | }, 1405 | "nbformat": 4, 1406 | "nbformat_minor": 0 1407 | } 1408 | --------------------------------------------------------------------------------