├── LICENSE
├── README.md
├── Chapter06
│   └── Chapter 6 Code.ipynb
├── Chapter05
│   └── Chapter 5 Code.ipynb
└── Chapter04
    └── Chapter 4 Code.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Packt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Databricks Certified Associate Developer for Apache Spark Using Python
2 |
3 |
4 | This is the code repository for [Databricks Certified Associate Developer for Apache Spark Using Python](https://www.packtpub.com/product/databricks-certified-associate-developer-for-apache-spark-using-python/9781804619780), published by Packt.
5 |
6 | **The ultimate guide to getting certified in Apache Spark using practical examples with Python**
7 |
8 | ## What is this book about?
9 | This guide gets you ready for certification with expert-backed content, key exam concepts, and topic reviews. Along the way, you’ll learn to make the most of Apache Spark 3.0, applying specific tools and techniques to modernize your workloads.
10 |
11 | This book covers the following exciting features:
12 | * Create and manipulate SQL queries in Spark
13 | * Build complex Spark functions using Spark UDFs
14 | * Architect big data apps with Spark fundamentals for optimal design
15 | * Apply techniques to manipulate and optimize big data applications
16 | * Build real-time or near-real-time applications using Spark Streaming
17 | * Work with Apache Spark for machine learning applications
18 |
19 | If you feel this book is for you, get your [copy](https://www.amazon.com/Databricks-Certified-Associate-Developer-Apache/dp/1804619787) today!
20 |
21 |
23 |
24 | ## Instructions and Navigations
25 | All of the code is organized into folders. For example, Chapter04.
26 |
27 | The code will look like the following:
28 | ```
29 | # Perform an aggregation to calculate the average salary
30 | average_salary = spark.sql(
31 |     "SELECT AVG(Salary) AS average_salary FROM employees")
32 | ```
33 |
34 | **Following is what you need for this book:**
35 | This book is for you if you’re a professional looking to venture into the world of big data and data engineering, a data professional who wants to validate their knowledge of Spark, or a student. Although working knowledge of Python is required, no prior Spark knowledge is needed. Additionally, experience with PySpark will be beneficial.
36 |
37 | With the following software and hardware list, you can run all of the code files present in the book (Chapters 4-8).
38 | ### Software and Hardware List
39 | | Chapter | Software required | OS required |
40 | | -------- | ------------------------------------ | ----------------------------------- |
41 | | 4-8 | Python | Windows, Mac OS X, and Linux |
42 | | 4-8 | Spark | Windows, Mac OS X, and Linux |
43 |
44 | ### Related products
45 | * Business Intelligence with Databricks SQL [[Packt]](https://www.packtpub.com/product/business-intelligence-with-databricks-sql/9781803235332) [[Amazon]](https://www.amazon.com/Business-Intelligence-Databricks-SQL-intelligence/dp/1803235330)
46 |
47 | * Azure Databricks Cookbook [[Packt]](https://www.packtpub.com/product/azure-databricks-cookbook/9781789809718) [[Amazon]](https://www.amazon.com/Azure-Databricks-Cookbook-Jonathan-Wood/dp/1789809711)
48 |
49 | ## Get to Know the Author
50 | **Saba Shah** is a data and AI architect and evangelist with wide technical breadth and a deep understanding of big data and machine learning technologies. She has experience leading data science and data engineering teams at Fortune 500 companies as well as startups. She started her career as a software engineer but soon transitioned to big data. She is currently a solutions architect at Databricks, where she works with enterprises to build their data strategy and helps them create a vision for the future with machine learning and predictive analytics. Saba graduated with a degree in Computer Science and later earned an MS in Advanced Web Technologies. She is passionate about all things data and cricket. She currently resides in RTP, NC.
51 |
--------------------------------------------------------------------------------
/Chapter06/Chapter 6 Code.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "application/vnd.databricks.v1+cell": {
7 | "cellMetadata": {},
8 | "inputWidgets": {},
9 | "nuid": "5f9b4e65-714c-46a7-b0ca-2c22d87e349e",
10 | "showTitle": false,
11 | "title": ""
12 | }
13 | },
14 | "source": [
15 | "# Chapter 6: SQL Queries in Spark"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 0,
21 | "metadata": {
22 | "application/vnd.databricks.v1+cell": {
23 | "cellMetadata": {
24 | "byteLimit": 2048000,
25 | "rowLimit": 10000
26 | },
27 | "inputWidgets": {},
28 | "nuid": "e4b6f89b-2d68-4937-a30a-ac64ef7caa49",
29 | "showTitle": true,
30 | "title": "Create Salary dataframe"
31 | }
32 | },
33 | "outputs": [
34 | {
36 | "name": "stdout",
37 | "output_type": "stream",
38 | "text": [
39 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n"
40 | ]
41 | }
42 | ],
43 | "source": [
44 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500, 40), \\\n",
45 | " (2, \"Robert\", \"Sales\", 4000, 38), \\\n",
46 | " (3, \"Maria\", \"Finance\", 3500, 28), \\\n",
47 | " (4, \"Michael\", \"Sales\", 3000, 20), \\\n",
48 | " (5, \"Kelly\", \"Finance\", 3500, 35), \\\n",
49 | " (6, \"Kate\", \"Finance\", 3000, 45), \\\n",
50 | " (7, \"Martin\", \"Finance\", 3500, 26), \\\n",
51 | " (8, \"Kiran\", \"Sales\", 2200, 35), \\\n",
52 | " ]\n",
53 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\", \"Age\"]\n",
54 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n",
55 | "salary_data_with_id.show()\n"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 0,
61 | "metadata": {
62 | "application/vnd.databricks.v1+cell": {
63 | "cellMetadata": {
64 | "byteLimit": 2048000,
65 | "rowLimit": 10000
66 | },
67 | "inputWidgets": {},
68 | "nuid": "e075be3c-bb49-4c81-a7b9-7dfbb4370056",
69 | "showTitle": true,
70 | "title": "Writing csv file"
71 | }
72 | },
73 | "outputs": [],
74 | "source": [
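    "# Note: save() writes a directory named salary_data.csv containing partitioned part files, not a single CSV file\n",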
75 | "salary_data_with_id.write.format(\"csv\").mode(\"overwrite\").option(\"header\", \"true\").save(\"salary_data.csv\")\n"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 0,
81 | "metadata": {
82 | "application/vnd.databricks.v1+cell": {
83 | "cellMetadata": {
84 | "byteLimit": 2048000,
85 | "rowLimit": 10000
86 | },
87 | "inputWidgets": {},
88 | "nuid": "a11fe3af-e723-4e97-abbf-d503acf033e3",
89 | "showTitle": true,
90 | "title": "Reading csv file"
91 | }
92 | },
93 | "outputs": [],
94 | "source": [
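    "# Assumes a Databricks workspace: the relative write path above lands under dbfs:/, so the data is read back as /salary_data.csv\n",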
95 | "csv_data = spark.read.csv('/salary_data.csv', header=True)"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 0,
101 | "metadata": {
102 | "application/vnd.databricks.v1+cell": {
103 | "cellMetadata": {
104 | "byteLimit": 2048000,
105 | "rowLimit": 10000
106 | },
107 | "inputWidgets": {},
108 | "nuid": "e1aa8d82-a393-448a-8fd0-5110b1eb1af2",
109 | "showTitle": true,
110 | "title": "Showing data"
111 | }
112 | },
113 | "outputs": [
114 | {
116 | "name": "stdout",
117 | "output_type": "stream",
118 | "text": [
119 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 4| Michael| Sales| 3000| 20|\n| 5| Kelly| Finance| 3500| 35|\n| 6| Kate| Finance| 3000| 45|\n| 7| Martin| Finance| 3500| 26|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "csv_data.show()"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 0,
130 | "metadata": {
131 | "application/vnd.databricks.v1+cell": {
132 | "cellMetadata": {
133 | "byteLimit": 2048000,
134 | "rowLimit": 10000
135 | },
136 | "inputWidgets": {},
137 | "nuid": "7f186780-97f0-4b7f-af17-c33073c872ce",
138 | "showTitle": true,
139 |     "title": "Perform transformations on the loaded data"
140 | }
141 | },
142 | "outputs": [
143 | {
145 | "name": "stdout",
146 | "output_type": "stream",
147 | "text": [
148 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 1| John| Field-eng| 3500| 40|\n| 2| Robert| Sales| 4000| 38|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 7| Martin| Finance| 3500| 26|\n+---+--------+----------+------+---+\n\n"
149 | ]
150 | }
151 | ],
152 | "source": [
153 | "# Perform transformations on the loaded data \n",
154 | "processed_data = csv_data.filter(csv_data[\"Salary\"] > 3000) \n",
155 | "# Save the processed data as a table \n",
156 | "processed_data.createOrReplaceTempView(\"high_salary_employees\") \n",
157 | "# Perform SQL queries on the saved table \n",
158 | "results = spark.sql(\"SELECT * FROM high_salary_employees \") \n",
159 | "results.show()\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 0,
165 | "metadata": {
166 | "application/vnd.databricks.v1+cell": {
167 | "cellMetadata": {
168 | "byteLimit": 2048000,
169 | "rowLimit": 10000
170 | },
171 | "inputWidgets": {},
172 | "nuid": "022c96b3-91f2-41ea-91c4-58ecd511a2f6",
173 | "showTitle": true,
174 | "title": "Saving Transformed Data as a View"
175 | }
176 | },
177 | "outputs": [
178 | {
180 | "name": "stdout",
181 | "output_type": "stream",
182 | "text": [
183 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| John| Field-eng| 3500| 40|\n| Robert| Sales| 4000| 38|\n| Kelly| Finance| 3500| 35|\n| Kate| Finance| 3000| 45|\n| Kiran| Sales| 2200| 35|\n+--------+----------+------+---+\n\n"
184 | ]
185 | }
186 | ],
187 | "source": [
188 | "# Save the processed data as a view \n",
189 | "salary_data_with_id.createOrReplaceTempView(\"employees\") \n",
190 | "#Apply filtering on data\n",
191 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30\") \n",
192 | "# Display the results \n",
193 | "filtered_data.show()\n"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 0,
199 | "metadata": {
200 | "application/vnd.databricks.v1+cell": {
201 | "cellMetadata": {
202 | "byteLimit": 2048000,
203 | "rowLimit": 10000
204 | },
205 | "inputWidgets": {},
206 | "nuid": "174c84ac-fa9e-42a7-a3b5-a166193e63b0",
207 | "showTitle": true,
208 | "title": "Aggregating data"
209 | }
210 | },
211 | "outputs": [
212 | {
214 | "name": "stdout",
215 | "output_type": "stream",
216 | "text": [
217 | "+--------------+\n|average_salary|\n+--------------+\n| 3275.0|\n+--------------+\n\n"
218 | ]
219 | }
220 | ],
221 | "source": [
222 | "# Perform an aggregation to calculate the average salary \n",
223 | "average_salary = spark.sql(\"SELECT AVG(Salary) AS average_salary FROM employees\") \n",
224 | "# Display the average salary \n",
225 | "average_salary.show() \n"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 0,
231 | "metadata": {
232 | "application/vnd.databricks.v1+cell": {
233 | "cellMetadata": {
234 | "byteLimit": 2048000,
235 | "rowLimit": 10000
236 | },
237 | "inputWidgets": {},
238 | "nuid": "97bfd3f0-1a65-4a58-bdc7-58b55dd58840",
239 | "showTitle": true,
240 | "title": "Sorting data"
241 | }
242 | },
243 | "outputs": [
244 | {
246 | "name": "stdout",
247 | "output_type": "stream",
248 | "text": [
249 | "+---+--------+----------+------+---+\n| ID|Employee|Department|Salary|Age|\n+---+--------+----------+------+---+\n| 2| Robert| Sales| 4000| 38|\n| 1| John| Field-eng| 3500| 40|\n| 7| Martin| Finance| 3500| 26|\n| 3| Maria| Finance| 3500| 28|\n| 5| Kelly| Finance| 3500| 35|\n| 4| Michael| Sales| 3000| 20|\n| 6| Kate| Finance| 3000| 45|\n| 8| Kiran| Sales| 2200| 35|\n+---+--------+----------+------+---+\n\n"
250 | ]
251 | }
252 | ],
253 | "source": [
254 | "# Sort the data based on the salary column in descending order \n",
255 | "sorted_data = spark.sql(\"SELECT * FROM employees ORDER BY Salary DESC\") \n",
256 | "# Display the sorted data \n",
257 | "sorted_data.show() \n"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": 0,
263 | "metadata": {
264 | "application/vnd.databricks.v1+cell": {
265 | "cellMetadata": {
266 | "byteLimit": 2048000,
267 | "rowLimit": 10000
268 | },
269 | "inputWidgets": {},
270 | "nuid": "38606a23-aa13-49ca-b7dc-d669dd472f55",
271 | "showTitle": true,
272 | "title": "Combining Aggregations"
273 | }
274 | },
275 | "outputs": [
276 | {
278 | "name": "stdout",
279 | "output_type": "stream",
280 | "text": [
281 | "+--------+----------+------+---+\n|Employee|Department|Salary|Age|\n+--------+----------+------+---+\n| Robert| Sales| 4000| 38|\n| John| Field-eng| 3500| 40|\n| Kelly| Finance| 3500| 35|\n+--------+----------+------+---+\n\n"
282 | ]
283 | }
284 | ],
285 | "source": [
286 |     "# Filter for employees over 30 earning more than 3000, sorted by Salary in descending order \n",
287 | "filtered_data = spark.sql(\"SELECT Employee, Department, Salary, Age FROM employees WHERE age > 30 AND Salary > 3000 ORDER BY Salary DESC\") \n",
288 | "# Display the results \n",
289 | "filtered_data.show()\n"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": 0,
295 | "metadata": {
296 | "application/vnd.databricks.v1+cell": {
297 | "cellMetadata": {
298 | "byteLimit": 2048000,
299 | "rowLimit": 10000
300 | },
301 | "inputWidgets": {},
302 | "nuid": "7701c892-0883-4bc2-9b5e-f51cf55fcc78",
303 | "showTitle": true,
304 | "title": "Grouping data"
305 | }
306 | },
307 | "outputs": [
308 | {
310 | "name": "stdout",
311 | "output_type": "stream",
312 | "text": [
313 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Sales|3066.6666666666665|\n| Finance| 3375.0|\n| Field-eng| 3500.0|\n+----------+------------------+\n\n"
314 | ]
315 | }
316 | ],
317 | "source": [
318 | "# Group the data based on the Department column and take average salary for each department \n",
319 | "grouped_data = spark.sql(\"SELECT Department, avg(Salary) FROM employees GROUP BY Department\") \n",
320 | "# Display the results \n",
321 | "grouped_data.show()\n"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 0,
327 | "metadata": {
328 | "application/vnd.databricks.v1+cell": {
329 | "cellMetadata": {
330 | "byteLimit": 2048000,
331 | "rowLimit": 10000
332 | },
333 | "inputWidgets": {},
334 | "nuid": "abafb986-88fa-4c43-8f65-6f7bbfa70ec6",
335 | "showTitle": true,
336 | "title": "Grouping with multiple aggregations"
337 | }
338 | },
339 | "outputs": [
340 | {
342 | "name": "stdout",
343 | "output_type": "stream",
344 | "text": [
345 | "+----------+------------+----------+\n|Department|total_salary|max_salary|\n+----------+------------+----------+\n| Sales| 9200| 4000|\n| Finance| 13500| 3500|\n| Field-eng| 3500| 3500|\n+----------+------------+----------+\n\n"
346 | ]
347 | }
348 | ],
349 | "source": [
350 | "# Perform grouping and multiple aggregations \n",
351 | "aggregated_data = spark.sql(\"SELECT Department, sum(Salary) AS total_salary, max(Salary) AS max_salary FROM employees GROUP BY Department\") \n",
352 | "\n",
353 | "# Display the results \n",
354 | "aggregated_data.show()\n"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 0,
360 | "metadata": {
361 | "application/vnd.databricks.v1+cell": {
362 | "cellMetadata": {
363 | "byteLimit": 2048000,
364 | "rowLimit": 10000
365 | },
366 | "inputWidgets": {},
367 | "nuid": "3b85baa3-2e70-4038-af6c-440346d96d78",
368 | "showTitle": true,
369 | "title": "Window functions"
370 | }
371 | },
372 | "outputs": [
373 | {
375 | "name": "stdout",
376 | "output_type": "stream",
377 | "text": [
378 | "+---+--------+----------+------+---+--------------+\n| ID|Employee|Department|Salary|Age|cumulative_sum|\n+---+--------+----------+------+---+--------------+\n| 1| John| Field-eng| 3500| 40| 3500|\n| 7| Martin| Finance| 3500| 26| 3500|\n| 3| Maria| Finance| 3500| 28| 7000|\n| 5| Kelly| Finance| 3500| 35| 10500|\n| 6| Kate| Finance| 3000| 45| 13500|\n| 4| Michael| Sales| 3000| 20| 3000|\n| 8| Kiran| Sales| 2200| 35| 5200|\n| 2| Robert| Sales| 4000| 38| 9200|\n+---+--------+----------+------+---+--------------+\n\n"
379 | ]
380 | }
381 | ],
382 | "source": [
383 | "from pyspark.sql.window import Window\n",
384 | "from pyspark.sql.functions import col, sum\n",
385 | "\n",
386 | "# Define the window specification\n",
387 | "window_spec = Window.partitionBy(\"Department\").orderBy(\"Age\")\n",
388 | "\n",
389 | "# Calculate the cumulative sum using window function\n",
390 | "df_with_cumulative_sum = salary_data_with_id.withColumn(\"cumulative_sum\", sum(col(\"Salary\")).over(window_spec))\n",
391 | "\n",
392 | "# Display the result\n",
393 | "df_with_cumulative_sum.show()\n"
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": 0,
399 | "metadata": {
400 | "application/vnd.databricks.v1+cell": {
401 | "cellMetadata": {
402 | "byteLimit": 2048000,
403 | "rowLimit": 10000
404 | },
405 | "inputWidgets": {},
406 | "nuid": "5d8c6978-6e33-47db-8ad6-8af7cd77d522",
407 | "showTitle": true,
408 | "title": "Using udfs"
409 | }
410 | },
411 | "outputs": [
412 | {
414 | "name": "stdout",
415 | "output_type": "stream",
416 | "text": [
417 | "+---+--------+----------+------+---+----------------+\n| ID|Employee|Department|Salary|Age|capitalized_name|\n+---+--------+----------+------+---+----------------+\n| 1| John| Field-eng| 3500| 40| JOHN|\n| 2| Robert| Sales| 4000| 38| ROBERT|\n| 3| Maria| Finance| 3500| 28| MARIA|\n| 4| Michael| Sales| 3000| 20| MICHAEL|\n| 5| Kelly| Finance| 3500| 35| KELLY|\n| 6| Kate| Finance| 3000| 45| KATE|\n| 7| Martin| Finance| 3500| 26| MARTIN|\n| 8| Kiran| Sales| 2200| 35| KIRAN|\n+---+--------+----------+------+---+----------------+\n\n"
418 | ]
419 | }
420 | ],
421 | "source": [
422 | "from pyspark.sql import SparkSession\n",
423 | "from pyspark.sql.functions import udf\n",
424 | "from pyspark.sql.types import StringType\n",
425 | "\n",
426 | "# Define a UDF to capitalize a string\n",
427 | "capitalize_udf = udf(lambda x: x.upper(), StringType())\n",
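    "# Note: this lambda assumes non-null Employee values; x.upper() would raise on None\n",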
428 | "\n",
429 | "# Apply the UDF to a column\n",
430 | "df_with_capitalized_names = salary_data_with_id.withColumn(\"capitalized_name\", capitalize_udf(\"Employee\"))\n",
431 | "\n",
432 | "# Display the result\n",
433 | "df_with_capitalized_names.show()\n"
434 | ]
435 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": 0,
478 | "metadata": {
479 | "application/vnd.databricks.v1+cell": {
480 | "cellMetadata": {
481 | "byteLimit": 2048000,
482 | "rowLimit": 10000
483 | },
484 | "inputWidgets": {},
485 | "nuid": "df1d9134-f8bf-4597-9b46-1e0142a0acac",
486 | "showTitle": true,
487 | "title": "Applying functions"
488 | }
489 | },
490 | "outputs": [
491 | {
493 | "name": "stdout",
494 | "output_type": "stream",
495 | "text": [
496 | "+-----------------------+\n|pandas_plus_one(Salary)|\n+-----------------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+-----------------------+\n\n"
497 | ]
498 | }
499 | ],
500 | "source": [
501 | "import pandas as pd\n",
502 | "from pyspark.sql.functions import pandas_udf\n",
503 | "\n",
504 | "@pandas_udf('long')\n",
505 | "def pandas_plus_one(series: pd.Series) -> pd.Series:\n",
506 | " # Simply plus one by using pandas Series.\n",
507 | " return series + 1\n",
508 | "\n",
509 | "salary_data_with_id.select(pandas_plus_one(salary_data_with_id.Salary)).show()\n"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": 0,
515 | "metadata": {
516 | "application/vnd.databricks.v1+cell": {
517 | "cellMetadata": {
518 | "byteLimit": 2048000,
519 | "rowLimit": 10000
520 | },
521 | "inputWidgets": {},
522 | "nuid": "d0141f5b-f209-4f66-928f-1d631a99ca58",
523 | "showTitle": true,
524 | "title": "Pandas udfs"
525 | }
526 | },
527 | "outputs": [
528 | {
530 | "name": "stdout",
531 | "output_type": "stream",
532 | "text": [
533 | "+---------------+\n|add_one(Salary)|\n+---------------+\n| 3501|\n| 4001|\n| 3501|\n| 3001|\n| 3501|\n| 3001|\n| 3501|\n| 2201|\n+---------------+\n\n"
534 | ]
535 | }
536 | ],
537 | "source": [
538 | "@pandas_udf(\"integer\")\n",
539 | "def add_one(s: pd.Series) -> pd.Series:\n",
540 | " return s + 1\n",
541 | "\n",
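    "# Registering the pandas UDF makes it callable by name from SQL\n",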
542 | "spark.udf.register(\"add_one\", add_one)\n",
543 | "spark.sql(\"SELECT add_one(Salary) FROM employees\").show()\n"
544 | ]
545 | },
546 | {
547 | "cell_type": "code",
548 | "execution_count": 0,
549 | "metadata": {
550 | "application/vnd.databricks.v1+cell": {
551 | "cellMetadata": {},
552 | "inputWidgets": {},
553 | "nuid": "4f6c3ea0-f650-4806-93cf-0368b01c2dd2",
554 | "showTitle": false,
555 | "title": ""
556 | }
557 | },
558 | "outputs": [],
559 | "source": []
560 | }
561 | ],
562 | "metadata": {
563 | "application/vnd.databricks.v1+notebook": {
564 | "dashboards": [],
565 | "language": "python",
566 | "notebookMetadata": {
567 | "mostRecentlyExecutedCommandWithImplicitDF": {
568 | "commandId": 969987236417588,
569 | "dataframes": [
570 | "_sqldf"
571 | ]
572 | },
573 | "pythonIndentUnit": 2
574 | },
575 | "notebookName": "Chapter 6 Code",
576 | "widgets": {}
577 | }
578 | },
579 | "nbformat": 4,
580 | "nbformat_minor": 0
581 | }
582 |
--------------------------------------------------------------------------------
/Chapter05/Chapter 5 Code.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "application/vnd.databricks.v1+cell": {
7 | "cellMetadata": {},
8 | "inputWidgets": {},
9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22",
10 | "showTitle": false,
11 | "title": ""
12 | }
13 | },
14 | "source": [
15 | "# Chapter 5: Advanced Operations in Spark Code"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 0,
21 | "metadata": {
22 | "application/vnd.databricks.v1+cell": {
23 | "cellMetadata": {
24 | "byteLimit": 2048000,
25 | "rowLimit": 10000
26 | },
27 | "inputWidgets": {},
28 | "nuid": "0c029f8c-dfbc-4e10-b09d-ccbea7b62eec",
29 | "showTitle": true,
30 | "title": "Create Salary dataframe"
31 | }
32 | },
33 | "outputs": [
34 | {
36 | "name": "stdout",
37 | "output_type": "stream",
38 | "text": [
39 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n"
40 | ]
41 | }
42 | ],
43 | "source": [
44 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n",
45 | " (\"Michael\", \"Field-eng\", 4500), \n",
46 | " (\"Robert\", None, 4000), \n",
47 | " (\"Maria\", \"Finance\", 3500), \n",
48 | " (\"John\", \"Sales\", 3000), \n",
49 | " (\"Kelly\", \"Finance\", 3500), \n",
50 | " (\"Kate\", \"Finance\", 3000), \n",
51 | " (\"Martin\", None, 3500), \n",
52 | " (\"Kiran\", \"Sales\", 2200), \n",
53 | " (\"Michael\", \"Field-eng\", 4500) \n",
54 | " ]\n",
55 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n",
56 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n",
57 | "salary_data.printSchema()\n",
58 | "salary_data.show()\n"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 0,
64 | "metadata": {
65 | "application/vnd.databricks.v1+cell": {
66 | "cellMetadata": {
67 | "byteLimit": 2048000,
68 | "rowLimit": 10000
69 | },
70 | "inputWidgets": {},
71 | "nuid": "5c64523b-97f8-4cdf-8c73-13723a7f7453",
72 | "showTitle": true,
73 | "title": "Using Groupby in a Dataframe"
74 | }
75 | },
76 | "outputs": [
77 | {
79 | "data": {
80 | "text/plain": [
81 | "GroupedData[grouping expressions: [Department], value: [Employee: string, Department: string, Salary: bigint], type: GroupBy]"
82 | ]
83 | },
84 | "execution_count": 3,
85 | "metadata": {},
86 | "output_type": "execute_result"
87 | }
88 | ],
89 | "source": [
90 | "salary_data.groupby('Department')"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 0,
96 | "metadata": {
97 | "application/vnd.databricks.v1+cell": {
98 | "cellMetadata": {
99 | "byteLimit": 2048000,
100 | "rowLimit": 10000
101 | },
102 | "inputWidgets": {},
103 | "nuid": "73e2c600-8160-4138-968f-835e6757f06c",
104 | "showTitle": false,
105 | "title": ""
106 | }
107 | },
108 | "outputs": [
109 | {
110 | "output_type": "stream",
111 | "name": "stdout",
112 | "output_type": "stream",
113 | "text": [
114 | "+----------+------------------+\n|Department| avg(Salary)|\n+----------+------------------+\n| Field-eng| 4166.666666666667|\n| Sales| 2600.0|\n| NULL| 3750.0|\n| Finance|3333.3333333333335|\n+----------+------------------+\n\n"
115 | ]
116 | }
117 | ],
118 | "source": [
119 | "salary_data.groupby('Department').avg().show()"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 0,
125 | "metadata": {
126 | "application/vnd.databricks.v1+cell": {
127 | "cellMetadata": {
128 | "byteLimit": 2048000,
129 | "rowLimit": 10000
130 | },
131 | "inputWidgets": {},
132 | "nuid": "d437c9f0-2336-4687-83b4-7c8142b4085f",
133 | "showTitle": true,
134 | "title": "Complex Groupby Statement"
135 | }
136 | },
137 | "outputs": [
138 | {
139 | "output_type": "stream",
140 | "name": "stdout",
141 | "output_type": "stream",
142 | "text": [
143 | "+----------+------+\n|Department|Salary|\n+----------+------+\n| NULL| 7500|\n| Field-eng| 12500|\n| Finance| 10000|\n| Sales| 5200|\n+----------+------+\n\n"
144 | ]
145 | }
146 | ],
147 | "source": [
148 | "from pyspark.sql.functions import col, round\n",
149 | "\n",
150 | "salary_data.groupBy('Department')\\\n",
151 | " .sum('Salary')\\\n",
152 | " .withColumn('sum(Salary)',round(col('sum(Salary)'), 2))\\\n",
153 | " .withColumnRenamed('sum(Salary)', 'Salary')\\\n",
154 | " .orderBy('Department')\\\n",
155 | " .show()\n"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 0,
161 | "metadata": {
162 | "application/vnd.databricks.v1+cell": {
163 | "cellMetadata": {
164 | "byteLimit": 2048000,
165 | "rowLimit": 10000
166 | },
167 | "inputWidgets": {},
168 | "nuid": "dfc73dea-aa0c-4a54-aded-a4c3814f01a9",
169 | "showTitle": true,
170 | "title": "Joining Dataframes in Spark"
171 | }
172 | },
173 | "outputs": [
174 | {
175 | "output_type": "stream",
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n"
180 | ]
181 | }
182 | ],
183 | "source": [
184 | "salary_data_with_id = [(1, \"John\", \"Field-eng\", 3500), \\\n",
185 | " (2, \"Robert\", \"Sales\", 4000), \\\n",
186 | " (3, \"Maria\", \"Finance\", 3500), \\\n",
187 | " (4, \"Michael\", \"Sales\", 3000), \\\n",
188 | " (5, \"Kelly\", \"Finance\", 3500), \\\n",
189 | " (6, \"Kate\", \"Finance\", 3000), \\\n",
190 | " (7, \"Martin\", \"Finance\", 3500), \\\n",
191 | " (8, \"Kiran\", \"Sales\", 2200), \\\n",
192 | " ]\n",
193 | "columns= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n",
194 | "salary_data_with_id = spark.createDataFrame(data = salary_data_with_id, schema = columns)\n",
195 | "salary_data_with_id.show()\n"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 0,
201 | "metadata": {
202 | "application/vnd.databricks.v1+cell": {
203 | "cellMetadata": {
204 | "byteLimit": 2048000,
205 | "rowLimit": 10000
206 | },
207 | "inputWidgets": {},
208 | "nuid": "125e73d8-c716-4e1c-8900-859c1ec666e9",
209 | "showTitle": true,
210 | "title": "Employee data"
211 | }
212 | },
213 | "outputs": [
214 | {
216 | "name": "stdout",
217 | "output_type": "stream",
218 | "text": [
219 | "+---+-----+------+\n| ID|State|Gender|\n+---+-----+------+\n| 1| NY| M|\n| 2| NC| M|\n| 3| NY| F|\n| 4| TX| M|\n| 5| NY| F|\n| 6| AZ| F|\n+---+-----+------+\n\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "employee_data = [(1, \"NY\", \"M\"), \\\n",
225 | " (2, \"NC\", \"M\"), \\\n",
226 | " (3, \"NY\", \"F\"), \\\n",
227 | " (4, \"TX\", \"M\"), \\\n",
228 | " (5, \"NY\", \"F\"), \\\n",
229 | " (6, \"AZ\", \"F\") \\\n",
230 | " ]\n",
231 | "columns= [\"ID\", \"State\", \"Gender\"]\n",
232 | "employee_data = spark.createDataFrame(data = employee_data, schema = columns)\n",
233 | "employee_data.show()\n"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 0,
239 | "metadata": {
240 | "application/vnd.databricks.v1+cell": {
241 | "cellMetadata": {
242 | "byteLimit": 2048000,
243 | "rowLimit": 10000
244 | },
245 | "inputWidgets": {},
246 | "nuid": "c0137bf4-d318-4417-86ca-df79f2fb80be",
247 | "showTitle": true,
248 | "title": "Inner join"
249 | }
250 | },
251 | "outputs": [
252 | {
253 | "output_type": "stream",
254 | "name": "stdout",
255 | "output_type": "stream",
256 | "text": [
257 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n+---+--------+----------+------+---+-----+------+\n\n"
258 | ]
259 | }
260 | ],
261 | "source": [
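    "# Inner join keeps only the IDs present in both DataFrames (IDs 7 and 8 have no match and are dropped)\n",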
262 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"inner\").show()"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 0,
268 | "metadata": {
269 | "application/vnd.databricks.v1+cell": {
270 | "cellMetadata": {
271 | "byteLimit": 2048000,
272 | "rowLimit": 10000
273 | },
274 | "inputWidgets": {},
275 | "nuid": "f34ff657-b0dd-4485-96f0-6d7c6126a1bd",
276 | "showTitle": true,
277 | "title": "Outer join"
278 | }
279 | },
280 | "outputs": [
281 | {
282 | "output_type": "stream",
283 | "name": "stdout",
284 | "output_type": "stream",
285 | "text": [
286 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n"
287 | ]
288 | }
289 | ],
290 | "source": [
291 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"outer\").show()"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 0,
297 | "metadata": {
298 | "application/vnd.databricks.v1+cell": {
299 | "cellMetadata": {
300 | "byteLimit": 2048000,
301 | "rowLimit": 10000
302 | },
303 | "inputWidgets": {},
304 | "nuid": "868ca315-ab44-4eb6-b8f1-92481d770911",
305 | "showTitle": true,
306 | "title": "Left join"
307 | }
308 | },
309 | "outputs": [
310 | {
311 | "output_type": "stream",
312 | "name": "stdout",
313 | "output_type": "stream",
314 | "text": [
315 | "+---+--------+----------+------+----+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+----+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n| 7| Martin| Finance| 3500|NULL| NULL| NULL|\n| 8| Kiran| Sales| 2200|NULL| NULL| NULL|\n+---+--------+----------+------+----+-----+------+\n\n"
316 | ]
317 | }
318 | ],
319 | "source": [
320 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"left\").show()"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 0,
326 | "metadata": {
327 | "application/vnd.databricks.v1+cell": {
328 | "cellMetadata": {
329 | "byteLimit": 2048000,
330 | "rowLimit": 10000
331 | },
332 | "inputWidgets": {},
333 | "nuid": "4cba2965-54b3-4d04-a456-77e9d9af6e1f",
334 | "showTitle": true,
335 | "title": "Right join"
336 | }
337 | },
338 | "outputs": [
339 | {
340 | "output_type": "stream",
341 | "name": "stdout",
342 | "output_type": "stream",
343 | "text": [
344 | "+---+--------+----------+------+---+-----+------+\n| ID|Employee|Department|Salary| ID|State|Gender|\n+---+--------+----------+------+---+-----+------+\n| 1| John| Field-eng| 3500| 1| NY| M|\n| 2| Robert| Sales| 4000| 2| NC| M|\n| 3| Maria| Finance| 3500| 3| NY| F|\n| 4| Michael| Sales| 3000| 4| TX| M|\n| 5| Kelly| Finance| 3500| 5| NY| F|\n| 6| Kate| Finance| 3000| 6| AZ| F|\n+---+--------+----------+------+---+-----+------+\n\n"
345 | ]
346 | }
347 | ],
348 | "source": [
349 | "salary_data_with_id.join(employee_data,salary_data_with_id.ID == employee_data.ID,\"right\").show()"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": 0,
355 | "metadata": {
356 | "application/vnd.databricks.v1+cell": {
357 | "cellMetadata": {
358 | "byteLimit": 2048000,
359 | "rowLimit": 10000
360 | },
361 | "inputWidgets": {},
362 | "nuid": "dd9f95c1-4109-4ceb-925d-7b10cf838fdd",
363 | "showTitle": true,
364 | "title": "Union"
365 | }
366 | },
367 | "outputs": [
368 | {
369 | "output_type": "stream",
370 | "name": "stdout",
371 | "output_type": "stream",
372 | "text": [
373 | "root\n |-- ID: long (nullable = true)\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n"
374 | ]
375 | }
376 | ],
377 | "source": [
378 | "salary_data_with_id_2 = [(1, \"John\", \"Field-eng\", 3500), \\\n",
379 | " (2, \"Robert\", \"Sales\", 4000), \\\n",
380 | " (3, \"Aliya\", \"Finance\", 3500), \\\n",
381 | " (4, \"Nate\", \"Sales\", 3000), \\\n",
382 | " ]\n",
383 | "columns2= [\"ID\", \"Employee\", \"Department\", \"Salary\"]\n",
384 | "\n",
385 | "salary_data_with_id_2 = spark.createDataFrame(data = salary_data_with_id_2, schema = columns2)\n",
386 | "\n",
387 | "salary_data_with_id_2.printSchema()\n",
388 | "salary_data_with_id_2.show(truncate=False)\n",
389 | "\n"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 0,
395 | "metadata": {
396 | "application/vnd.databricks.v1+cell": {
397 | "cellMetadata": {
398 | "byteLimit": 2048000,
399 | "rowLimit": 10000
400 | },
401 | "inputWidgets": {},
402 | "nuid": "2eb3d433-2a89-47b4-9d21-2d79194809c1",
403 | "showTitle": false,
404 | "title": ""
405 | }
406 | },
407 | "outputs": [
408 | {
409 | "output_type": "stream",
410 | "name": "stdout",
411 | "output_type": "stream",
412 | "text": [
413 | "+---+--------+----------+------+\n|ID |Employee|Department|Salary|\n+---+--------+----------+------+\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Maria |Finance |3500 |\n|4 |Michael |Sales |3000 |\n|5 |Kelly |Finance |3500 |\n|6 |Kate |Finance |3000 |\n|7 |Martin |Finance |3500 |\n|8 |Kiran |Sales |2200 |\n|1 |John |Field-eng |3500 |\n|2 |Robert |Sales |4000 |\n|3 |Aliya |Finance |3500 |\n|4 |Nate |Sales |3000 |\n+---+--------+----------+------+\n\n"
414 | ]
415 | }
416 | ],
417 | "source": [
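    "# union() appends rows by column position and keeps duplicates; chain .distinct() to remove them\n",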
418 | "unionDF = salary_data_with_id.union(salary_data_with_id_2)\n",
419 | "unionDF.show(truncate=False)\n"
420 | ]
421 | },
422 | {
423 | "cell_type": "markdown",
424 | "metadata": {
425 | "application/vnd.databricks.v1+cell": {
426 | "cellMetadata": {},
427 | "inputWidgets": {},
428 | "nuid": "d84b0031-6f62-41a8-9533-b510a487ab0f",
429 | "showTitle": false,
430 | "title": ""
431 | }
432 | },
433 | "source": [
434 | "Reading and Writing Data"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 0,
440 | "metadata": {
441 | "application/vnd.databricks.v1+cell": {
442 | "cellMetadata": {
443 | "byteLimit": 2048000,
444 | "rowLimit": 10000
445 | },
446 | "inputWidgets": {},
447 | "nuid": "d3c8eb85-7d75-4010-977d-370b7940b57e",
448 | "showTitle": true,
449 | "title": "Reading and writing CSV files"
450 | }
451 | },
452 | "outputs": [
453 | {
454 | "output_type": "stream",
455 | "name": "stdout",
456 | "output_type": "stream",
457 | "text": [
458 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n"
459 | ]
460 | }
461 | ],
462 | "source": [
463 | "\n",
464 | "salary_data_with_id.write.csv('salary_data.csv', mode='overwrite', header=True)\n",
465 | "spark.read.csv('/salary_data.csv', header=True).show()\n"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": 0,
471 | "metadata": {
472 | "application/vnd.databricks.v1+cell": {
473 | "cellMetadata": {
474 | "byteLimit": 2048000,
475 | "rowLimit": 10000
476 | },
477 | "inputWidgets": {},
478 | "nuid": "b033bc47-7a90-4ae1-b37b-692860e06482",
479 | "showTitle": false,
480 | "title": ""
481 | }
482 | },
483 | "outputs": [
484 | {
485 | "output_type": "stream",
486 | "name": "stdout",
487 | "output_type": "stream",
488 | "text": [
489 | "+---+-------+---------+\n| ID| State| Gender|\n+---+-------+---------+\n| 1| John|Field-eng|\n| 2| Robert| Sales|\n| 3| Maria| Finance|\n| 4|Michael| Sales|\n| 5| Kelly| Finance|\n| 6| Kate| Finance|\n| 7| Martin| Finance|\n| 8| Kiran| Sales|\n+---+-------+---------+\n\n"
490 | ]
491 | }
492 | ],
493 | "source": [
494 | "from pyspark.sql.types import *\n",
495 | "\n",
496 | "filePath = '/salary_data.csv'\n",
497 | "columns= [\"ID\", \"State\", \"Gender\"] \n",
498 | "schema = StructType([\n",
499 | " StructField(\"ID\", IntegerType(),True),\n",
500 | " StructField(\"State\", StringType(),True),\n",
501 | " StructField(\"Gender\", StringType(),True)\n",
502 | "])\n",
503 | " \n",
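    "# Note: the CSV has four columns but this schema names only three, so values map positionally (Employee and Department land under State and Gender)\n",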
504 | "read_data = spark.read.format(\"csv\").option(\"header\",\"true\").schema(schema).load(filePath)\n",
505 | "read_data.show()\n"
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": 0,
511 | "metadata": {
512 | "application/vnd.databricks.v1+cell": {
513 | "cellMetadata": {
514 | "byteLimit": 2048000,
515 | "rowLimit": 10000
516 | },
517 | "inputWidgets": {},
518 | "nuid": "bfd8f639-d141-48c9-be8e-dffd764aa0ee",
519 | "showTitle": true,
520 | "title": "Reading and writing Parquet files"
521 | }
522 | },
523 | "outputs": [
524 | {
525 | "output_type": "stream",
526 | "name": "stdout",
527 | "output_type": "stream",
528 | "text": [
529 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n"
530 | ]
531 | }
532 | ],
533 | "source": [
534 | "salary_data_with_id.write.parquet('salary_data.parquet', mode='overwrite')\n",
535 | "spark.read.parquet('/salary_data.parquet').show()\n"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": 0,
541 | "metadata": {
542 | "application/vnd.databricks.v1+cell": {
543 | "cellMetadata": {
544 | "byteLimit": 2048000,
545 | "rowLimit": 10000
546 | },
547 | "inputWidgets": {},
548 | "nuid": "492b344b-3719-44cd-a8dc-034d20f3a409",
549 | "showTitle": true,
550 | "title": "Reading and writing ORC files"
551 | }
552 | },
553 | "outputs": [
554 | {
555 | "output_type": "stream",
556 | "name": "stdout",
557 | "output_type": "stream",
558 | "text": [
559 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n+---+--------+----------+------+\n\n"
560 | ]
561 | }
562 | ],
563 | "source": [
564 | "salary_data_with_id.write.orc('salary_data.orc', mode='overwrite')\n",
565 | "spark.read.orc('/salary_data.orc').show()"
566 | ]
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": 0,
571 | "metadata": {
572 | "application/vnd.databricks.v1+cell": {
573 | "cellMetadata": {
574 | "byteLimit": 2048000,
575 | "rowLimit": 10000
576 | },
577 | "inputWidgets": {},
578 | "nuid": "9b3c1309-4a00-4a92-ac3e-7f2a9d491445",
579 | "showTitle": true,
580 | "title": "Reading and writing Delta files"
581 | }
582 | },
583 | "outputs": [
584 | {
585 | "output_type": "stream",
586 | "name": "stdout",
587 | "output_type": "stream",
588 | "text": [
589 | "+---+--------+----------+------+\n| ID|Employee|Department|Salary|\n+---+--------+----------+------+\n| 1| John| Field-eng| 3500|\n| 2| Robert| Sales| 4000|\n| 3| Maria| Finance| 3500|\n| 4| Michael| Sales| 3000|\n| 5| Kelly| Finance| 3500|\n| 6| Kate| Finance| 3000|\n| 7| Martin| Finance| 3500|\n| 8| Kiran| Sales| 2200|\n+---+--------+----------+------+\n\n"
590 | ]
591 | }
592 | ],
593 | "source": [
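    "# Assumes a recent Databricks runtime, where Delta is the default source, so read.load() returns the Delta table without an explicit format\n",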
594 | "salary_data_with_id.write.format(\"delta\").save(\"/FileStore/tables/salary_data_with_id\", mode='overwrite')\n",
595 | "df = spark.read.load(\"/FileStore/tables/salary_data_with_id\")\n",
596 | "df.show()\n"
597 | ]
598 | },
599 | {
600 | "cell_type": "code",
601 | "execution_count": 0,
602 | "metadata": {
603 | "application/vnd.databricks.v1+cell": {
604 | "cellMetadata": {
605 | "byteLimit": 2048000,
606 | "rowLimit": 10000
607 | },
608 | "inputWidgets": {},
609 | "nuid": "d616d17f-7848-4527-aae3-78eec9d3214d",
610 | "showTitle": true,
611 | "title": "Using SQL in Spark"
612 | }
613 | },
614 | "outputs": [
615 | {
616 | "output_type": "stream",
617 | "name": "stdout",
618 | "output_type": "stream",
619 | "text": [
620 | "+--------+\n|count(1)|\n+--------+\n| 8|\n+--------+\n\n"
621 | ]
622 | }
623 | ],
624 | "source": [
625 | "salary_data_with_id.createOrReplaceTempView(\"SalaryTable\")\n",
626 | "spark.sql(\"SELECT count(*) from SalaryTable\").show()\n"
627 | ]
628 | },
629 | {
630 | "cell_type": "markdown",
631 | "metadata": {
632 | "application/vnd.databricks.v1+cell": {
633 | "cellMetadata": {},
634 | "inputWidgets": {},
635 | "nuid": "f549a552-a92a-477c-bcbd-0eaf6104c207",
636 | "showTitle": false,
637 | "title": ""
638 | }
639 | },
640 | "source": [
641 | "Catalyst Optimizer"
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": 0,
647 | "metadata": {
648 | "application/vnd.databricks.v1+cell": {
649 | "cellMetadata": {
650 | "byteLimit": 2048000,
651 | "rowLimit": 10000
652 | },
653 | "inputWidgets": {},
654 | "nuid": "b66004b0-07ac-4c06-966e-1370a2e1b3d6",
655 | "showTitle": true,
656 | "title": "Catalyst Optimizer in Action"
657 | }
658 | },
659 | "outputs": [
660 | {
661 | "output_type": "stream",
662 | "name": "stdout",
663 | "output_type": "stream",
664 | "text": [
665 |      "== Physical Plan ==\n*(1) Project [employee#129490, department#129491]\n+- *(1) Filter (isnotnull(salary#129492) AND (salary#129492 > 3500))\n   +- FileScan csv [Employee#129490,Department#129491,Salary#129492] Batched: false, DataFilters: [isnotnull(Salary#129492), (Salary#129492 > 3500)], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/salary_data.csv], PartitionFilters: [], PushedFilters: [IsNotNull(Salary), GreaterThan(Salary,3500)], ReadSchema: struct<Employee:string,Department:string,Salary:int>\n\n\n"
666 | ]
667 | }
668 | ],
669 | "source": [
670 | "# SparkSession setup \n",
671 | "from pyspark.sql import SparkSession \n",
672 | "spark = SparkSession.builder.appName(\"CatalystOptimizerExample\").getOrCreate() \n",
673 | "# Load data \n",
674 | "df = spark.read.csv(\"/salary_data.csv\", header=True, inferSchema=True) \n",
675 | "# Query with Catalyst Optimizer \n",
676 | "result_df = df.select(\"employee\", \"department\").filter(df[\"salary\"] > 3500) \n",
677 | "# Explain the optimized query plan \n",
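    "# PushedFilters in the plan shows the Salary predicate pushed down into the CSV scan\n",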
678 | "result_df.explain() \n"
679 | ]
680 | },
681 | {
682 | "cell_type": "code",
683 | "execution_count": 0,
684 | "metadata": {
685 | "application/vnd.databricks.v1+cell": {
686 | "cellMetadata": {
687 | "byteLimit": 2048000,
688 | "rowLimit": 10000
689 | },
690 | "inputWidgets": {},
691 | "nuid": "08ba28ee-80d0-4210-acb7-4a45bee2815b",
692 | "showTitle": true,
693 | "title": "Unpersisting Data"
694 | }
695 | },
696 | "outputs": [
697 | {
699 | "data": {
700 | "text/plain": [
701 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]"
702 | ]
703 | },
704 | "execution_count": 24,
705 | "metadata": {},
706 | "output_type": "execute_result"
707 | }
708 | ],
709 | "source": [
710 | "# Cache a DataFrame \n",
711 | "df.cache() \n",
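    "# cache() is lazy: the data is only materialized the next time an action runs\n",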
712 | "# Unpersist the cached DataFrame \n",
713 | "df.unpersist() \n"
714 | ]
715 | },
716 | {
717 | "cell_type": "code",
718 | "execution_count": 0,
719 | "metadata": {
720 | "application/vnd.databricks.v1+cell": {
721 | "cellMetadata": {
722 | "byteLimit": 2048000,
723 | "rowLimit": 10000
724 | },
725 | "inputWidgets": {},
726 | "nuid": "25bff695-c206-4800-80c0-e6dde6962438",
727 | "showTitle": true,
728 | "title": "Repartitioning Data"
729 | }
730 | },
731 | "outputs": [
732 | {
734 | "data": {
735 | "text/plain": [
736 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]"
737 | ]
738 | },
739 | "execution_count": 25,
740 | "metadata": {},
741 | "output_type": "execute_result"
742 | }
743 | ],
744 | "source": [
745 | "# Repartition a DataFrame into 8 partitions \n",
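    "# repartition() triggers a full shuffle and returns a new DataFrame; assign the result to keep it\n",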
746 | "df.repartition(8) \n"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 0,
752 | "metadata": {
753 | "application/vnd.databricks.v1+cell": {
754 | "cellMetadata": {
755 | "byteLimit": 2048000,
756 | "rowLimit": 10000
757 | },
758 | "inputWidgets": {},
759 | "nuid": "3eb87f1e-ba2d-47fb-a7c1-b74e1f293598",
760 | "showTitle": true,
761 | "title": "Coalescing Data"
762 | }
763 | },
764 | "outputs": [
765 | {
767 | "data": {
768 | "text/plain": [
769 | "DataFrame[ID: int, Employee: string, Department: string, Salary: int]"
770 | ]
771 | },
772 | "execution_count": 26,
773 | "metadata": {},
774 | "output_type": "execute_result"
775 | }
776 | ],
777 | "source": [
778 | "# Coalesce a DataFrame to 4 partitions \n",
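    "# coalesce() narrows to fewer partitions without a full shuffle, making it cheaper than repartition()\n",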
779 | "df.coalesce(4) \n"
780 | ]
781 | }
782 | ],
783 | "metadata": {
784 | "application/vnd.databricks.v1+notebook": {
785 | "dashboards": [],
786 | "language": "python",
787 | "notebookMetadata": {
788 | "mostRecentlyExecutedCommandWithImplicitDF": {
789 | "commandId": 969987236417588,
790 | "dataframes": [
791 | "_sqldf"
792 | ]
793 | },
794 | "pythonIndentUnit": 2
795 | },
796 | "notebookName": "Chapter 5 Code",
797 | "widgets": {}
798 | }
799 | },
800 | "nbformat": 4,
801 | "nbformat_minor": 0
802 | }
803 |
--------------------------------------------------------------------------------
/Chapter04/Chapter 4 Code.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "application/vnd.databricks.v1+cell": {
7 | "cellMetadata": {},
8 | "inputWidgets": {},
9 | "nuid": "7f1436a0-3357-4850-b507-a12c76e60c22",
10 | "showTitle": false,
11 | "title": ""
12 | }
13 | },
14 | "source": [
15 | "# Chapter 4 : Spark Dataframes and Operations Code"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "application/vnd.databricks.v1+cell": {
22 | "cellMetadata": {},
23 | "inputWidgets": {},
24 | "nuid": "c8b85703-a9de-4ac7-892a-b1fb92ac4442",
25 | "showTitle": false,
26 | "title": ""
27 | }
28 | },
29 | "source": [
30 | "Create Dataframe Operations"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 0,
36 | "metadata": {
37 | "application/vnd.databricks.v1+cell": {
38 | "cellMetadata": {
39 | "byteLimit": 2048000,
40 | "rowLimit": 10000
41 | },
42 | "inputWidgets": {},
43 | "nuid": "0269a412-f57f-4c05-b079-4f7236b5cbc8",
44 | "showTitle": true,
45 | "title": "Create Dataframe from list of rows"
46 | }
47 | },
48 | "outputs": [],
49 | "source": [
50 | "import pandas as pd\n",
51 | "from datetime import datetime, date\n",
52 | "from pyspark.sql import Row\n",
53 | "\n",
54 | "data_df = spark.createDataFrame([\n",
55 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n",
56 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n",
57 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n",
58 | "])\n"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 0,
64 | "metadata": {
65 | "application/vnd.databricks.v1+cell": {
66 | "cellMetadata": {
67 | "byteLimit": 2048000,
68 | "rowLimit": 10000
69 | },
70 | "inputWidgets": {},
71 | "nuid": "70b00e29-29b9-47c6-9353-aecc105f5aba",
72 | "showTitle": true,
73 | "title": "Create Dataframe from list of rows using schema"
74 | }
75 | },
76 | "outputs": [],
77 | "source": [
78 | "import pandas as pd\n",
79 | "from datetime import datetime, date\n",
80 | "from pyspark.sql import Row\n",
81 | "\n",
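    "# The schema below is a DDL-formatted string giving each column an explicit type\n",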
82 | "data_df = spark.createDataFrame([\n",
83 | " Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),\n",
84 | " Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),\n",
85 | " Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))\n",
86 | "], schema='col_1 long, col_2 double, col_3 string, col_4 date, col_5 timestamp')\n"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 0,
92 | "metadata": {
93 | "application/vnd.databricks.v1+cell": {
94 | "cellMetadata": {
95 | "byteLimit": 2048000,
96 | "rowLimit": 10000
97 | },
98 | "inputWidgets": {},
99 | "nuid": "d0238c11-4d56-4175-89b2-bb4ddcc4976a",
100 | "showTitle": true,
101 | "title": "Create Dataframe from pandas dataframe"
102 | }
103 | },
104 | "outputs": [],
105 | "source": [
106 | "import pandas as pd\n",
107 | "from datetime import datetime, date\n",
108 | "from pyspark.sql import Row\n",
109 | "\n",
110 | "pandas_df = pd.DataFrame({\n",
111 | " 'col_1': [100, 200, 400],\n",
112 | " 'col_2': [200., 300., 500.],\n",
113 | " 'col_3': ['string_test_1', 'string_test_2', 'string_test_3'],\n",
114 | " 'col_4': [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],\n",
115 | " 'col_5': [datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]\n",
116 | "})\n",
117 | "df = spark.createDataFrame(pandas_df)\n"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 0,
123 | "metadata": {
124 | "application/vnd.databricks.v1+cell": {
125 | "cellMetadata": {
126 | "byteLimit": 2048000,
127 | "rowLimit": 10000
128 | },
129 | "inputWidgets": {},
130 | "nuid": "a1791d2e-ff43-4557-ad31-a2d92c3a21a8",
131 | "showTitle": false,
132 | "title": ""
133 | }
134 | },
135 | "outputs": [],
136 | "source": [
137 | "from datetime import datetime, date\n",
138 | "from pyspark.sql import SparkSession\n",
139 | "\n",
140 | "spark = SparkSession.builder.getOrCreate()\n",
141 | "\n",
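    "# Build an RDD of tuples first, then convert it to a DataFrame by supplying column names\n",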
142 | "rdd = spark.sparkContext.parallelize([\n",
143 | " (100, 200., 'string_test_1', date(2023, 1, 1), datetime(2023, 1, 1, 12, 0)),\n",
144 | " (200, 300., 'string_test_2', date(2023, 2, 1), datetime(2023, 1, 2, 12, 0)),\n",
145 | " (300, 400., 'string_test_3', date(2023, 3, 1), datetime(2023, 1, 3, 12, 0))\n",
146 | "])\n",
147 | "data_df = spark.createDataFrame(rdd, schema=['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])"
148 | ]
149 | },
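{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the `toDF` shorthand for the same RDD-to-DataFrame conversion, assuming the `rdd` from the previous cell is still in scope (`rdd_df` is an illustrative name)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# toDF names the columns directly on the RDD\n",
"rdd_df = rdd.toDF(['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])\n",
"rdd_df.printSchema()"
]
},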
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {
153 | "application/vnd.databricks.v1+cell": {
154 | "cellMetadata": {},
155 | "inputWidgets": {},
156 | "nuid": "d87a4498-fa76-444e-984b-25cec32fb37c",
157 | "showTitle": false,
158 | "title": ""
159 | }
160 | },
161 | "source": [
162 | "How to View the Dataframes"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 0,
168 | "metadata": {
169 | "application/vnd.databricks.v1+cell": {
170 | "cellMetadata": {
171 | "byteLimit": 2048000,
172 | "rowLimit": 10000
173 | },
174 | "inputWidgets": {},
175 | "nuid": "d4e5a140-b2bd-4cf3-8798-6fecb6164064",
176 | "showTitle": true,
177 | "title": "Viewing DataFrames "
178 | }
179 | },
180 | "outputs": [
181 | {
182 | "output_type": "stream",
183 | "name": "stdout",
184 | "output_type": "stream",
185 | "text": [
186 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00|\n+-----+-----+-------------+----------+-------------------+\n\n"
187 | ]
188 | }
189 | ],
190 | "source": [
191 | "data_df.show()"
192 | ]
193 | },
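{
"cell_type": "markdown",
"metadata": {},
"source": [
"`show()` prints 20 rows and truncates string values to 20 characters by default; a short sketch of the common overrides."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Print rows without truncating long values\n",
"data_df.show(truncate=False)\n",
"\n",
"# Show at most 3 rows and cap string values at 10 characters\n",
"data_df.show(n=3, truncate=10)"
]
},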
194 | {
195 | "cell_type": "code",
196 | "execution_count": 0,
197 | "metadata": {
198 | "application/vnd.databricks.v1+cell": {
199 | "cellMetadata": {
200 | "byteLimit": 2048000,
201 | "rowLimit": 10000
202 | },
203 | "inputWidgets": {},
204 | "nuid": "465e9144-4bf7-472d-bb62-35dad761c240",
205 | "showTitle": true,
206 | "title": "Viewing top n rows"
207 | }
208 | },
209 | "outputs": [
210 | {
211 | "output_type": "stream",
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "+-----+-----+-------------+----------+-------------------+\n|col_1|col_2| col_3| col_4| col_5|\n+-----+-----+-------------+----------+-------------------+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00|\n+-----+-----+-------------+----------+-------------------+\nonly showing top 2 rows\n\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "data_df.show(2)"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 0,
226 | "metadata": {
227 | "application/vnd.databricks.v1+cell": {
228 | "cellMetadata": {
229 | "byteLimit": 2048000,
230 | "rowLimit": 10000
231 | },
232 | "inputWidgets": {},
233 | "nuid": "3882bfa7-6fa4-4a7c-b039-d919edc5fb07",
234 | "showTitle": true,
235 | "title": "Viewing DataFrame schema"
236 | }
237 | },
238 | "outputs": [
239 | {
240 | "output_type": "stream",
241 | "name": "stdout",
242 | "output_type": "stream",
243 | "text": [
244 | "root\n |-- col_1: long (nullable = true)\n |-- col_2: double (nullable = true)\n |-- col_3: string (nullable = true)\n |-- col_4: date (nullable = true)\n |-- col_5: timestamp (nullable = true)\n\n"
245 | ]
246 | }
247 | ],
248 | "source": [
249 | "data_df.printSchema()"
250 | ]
251 | },
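{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond the printed tree, the schema is also available as Python objects; a minimal sketch."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# StructType object describing the DataFrame\n",
"print(data_df.schema)\n",
"\n",
"# List of (column name, type string) tuples\n",
"print(data_df.dtypes)"
]
},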
252 | {
253 | "cell_type": "code",
254 | "execution_count": 0,
255 | "metadata": {
256 | "application/vnd.databricks.v1+cell": {
257 | "cellMetadata": {
258 | "byteLimit": 2048000,
259 | "rowLimit": 10000
260 | },
261 | "inputWidgets": {},
262 | "nuid": "49b81183-36c8-4862-b2fb-4778ceb6d16c",
263 | "showTitle": true,
264 | "title": "Viewing data vertically"
265 | }
266 | },
267 | "outputs": [
268 | {
269 | "output_type": "stream",
270 | "name": "stdout",
271 | "output_type": "stream",
272 | "text": [
273 | "-RECORD 0--------------------\n col_1 | 100 \n col_2 | 200.0 \n col_3 | string_test_1 \n col_4 | 2023-01-01 \n col_5 | 2023-01-01 12:00:00 \nonly showing top 1 row\n\n"
274 | ]
275 | }
276 | ],
277 | "source": [
278 | "data_df.show(1, vertical=True)"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": 0,
284 | "metadata": {
285 | "application/vnd.databricks.v1+cell": {
286 | "cellMetadata": {
287 | "byteLimit": 2048000,
288 | "rowLimit": 10000
289 | },
290 | "inputWidgets": {},
291 | "nuid": "18974b22-4fae-4787-aacb-67eb458c20a2",
292 | "showTitle": true,
293 | "title": "Viewing columns of data "
294 | }
295 | },
296 | "outputs": [
297 | {
298 | "output_type": "execute_result",
299 | "data": {
300 | "text/plain": [
301 | "['col_1', 'col_2', 'col_3', 'col_4', 'col_5']"
302 | ]
303 | },
304 | "execution_count": 7,
305 | "metadata": {},
306 | "output_type": "execute_result"
307 | }
308 | ],
309 | "source": [
310 | "data_df.columns"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 0,
316 | "metadata": {
317 | "application/vnd.databricks.v1+cell": {
318 | "cellMetadata": {
319 | "byteLimit": 2048000,
320 | "rowLimit": 10000
321 | },
322 | "inputWidgets": {},
323 | "nuid": "f607973a-991a-4892-ada8-a6f8e2daf5d1",
324 | "showTitle": true,
325 | "title": "Counting number of rows of data"
326 | }
327 | },
328 | "outputs": [
329 | {
330 | "output_type": "execute_result",
331 | "data": {
332 | "text/plain": [
333 | "3"
334 | ]
335 | },
336 | "execution_count": 8,
337 | "metadata": {},
338 | "output_type": "execute_result"
339 | }
340 | ],
341 | "source": [
342 | "data_df.count()"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 0,
348 | "metadata": {
349 | "application/vnd.databricks.v1+cell": {
350 | "cellMetadata": {
351 | "byteLimit": 2048000,
352 | "rowLimit": 10000
353 | },
354 | "inputWidgets": {},
355 | "nuid": "3513085f-5679-4735-9085-7e7b3de398b4",
356 | "showTitle": true,
357 | "title": "Viewing summary statistics "
358 | }
359 | },
360 | "outputs": [
361 | {
362 | "output_type": "stream",
363 | "name": "stdout",
364 | "output_type": "stream",
365 | "text": [
366 | "+-------+-----+-----+-------------+\n|summary|col_1|col_2| col_3|\n+-------+-----+-----+-------------+\n| count| 3| 3| 3|\n| mean|200.0|300.0| NULL|\n| stddev|100.0|100.0| NULL|\n| min| 100|200.0|string_test_1|\n| max| 300|400.0|string_test_3|\n+-------+-----+-----+-------------+\n\n"
367 | ]
368 | }
369 | ],
370 | "source": [
371 | "data_df.select('col_1', 'col_2', 'col_3').describe().show()"
372 | ]
373 | },
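{
"cell_type": "markdown",
"metadata": {},
"source": [
"`summary()` is a close relative of `describe()` that also accepts percentile strings; a sketch on the same numeric columns."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# describe() covers count/mean/stddev/min/max; summary() adds percentiles\n",
"data_df.select('col_1', 'col_2').summary('count', 'mean', 'min', '25%', '75%', 'max').show()"
]
},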
374 | {
375 | "cell_type": "code",
376 | "execution_count": 0,
377 | "metadata": {
378 | "application/vnd.databricks.v1+cell": {
379 | "cellMetadata": {
380 | "byteLimit": 2048000,
381 | "rowLimit": 10000
382 | },
383 | "inputWidgets": {},
384 | "nuid": "550f6a6b-61cc-4913-9220-9cb02962045c",
385 | "showTitle": true,
386 | "title": "Collecting the data"
387 | }
388 | },
389 | "outputs": [
390 | {
391 | "output_type": "execute_result",
392 | "data": {
393 | "text/plain": [
394 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0)),\n",
395 | " Row(col_1=200, col_2=300.0, col_3='string_test_2', col_4=datetime.date(2023, 2, 1), col_5=datetime.datetime(2023, 1, 2, 12, 0)),\n",
396 | " Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]"
397 | ]
398 | },
399 | "execution_count": 10,
400 | "metadata": {},
401 | "output_type": "execute_result"
402 | }
403 | ],
404 | "source": [
405 | "data_df.collect()"
406 | ]
407 | },
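{
"cell_type": "markdown",
"metadata": {},
"source": [
"`collect()` materializes every row on the driver, which can exhaust its memory on large DataFrames. A minimal sketch of `toLocalIterator()`, which streams the rows one partition at a time instead."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Iterate rows without holding the whole result on the driver at once\n",
"for row in data_df.toLocalIterator():\n",
"    print(row.col_1, row.col_3)"
]
},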
408 | {
409 | "cell_type": "code",
410 | "execution_count": 0,
411 | "metadata": {
412 | "application/vnd.databricks.v1+cell": {
413 | "cellMetadata": {
414 | "byteLimit": 2048000,
415 | "rowLimit": 10000
416 | },
417 | "inputWidgets": {},
418 | "nuid": "d1a9456e-7fc7-4b73-bdfe-b3bd45f01f68",
419 | "showTitle": true,
420 | "title": "Using take"
421 | }
422 | },
423 | "outputs": [
424 | {
425 | "output_type": "execute_result",
426 | "data": {
427 | "text/plain": [
428 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]"
429 | ]
430 | },
431 | "execution_count": 11,
432 | "metadata": {},
433 | "output_type": "execute_result"
434 | }
435 | ],
436 | "source": [
437 | "data_df.take(1)"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 0,
443 | "metadata": {
444 | "application/vnd.databricks.v1+cell": {
445 | "cellMetadata": {
446 | "byteLimit": 2048000,
447 | "rowLimit": 10000
448 | },
449 | "inputWidgets": {},
450 | "nuid": "5f5c4426-d270-40be-9088-c74d727af5b1",
451 | "showTitle": true,
452 | "title": "Using tail"
453 | }
454 | },
455 | "outputs": [
456 | {
457 | "output_type": "execute_result",
458 | "data": {
459 | "text/plain": [
460 | "[Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]"
461 | ]
462 | },
463 | "execution_count": 12,
464 | "metadata": {},
465 | "output_type": "execute_result"
466 | }
467 | ],
468 | "source": [
469 | "data_df.tail(1)"
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 0,
475 | "metadata": {
476 | "application/vnd.databricks.v1+cell": {
477 | "cellMetadata": {
478 | "byteLimit": 2048000,
479 | "rowLimit": 10000
480 | },
481 | "inputWidgets": {},
482 | "nuid": "fbdd9347-da63-4575-ac6d-b55b593044ff",
483 | "showTitle": true,
484 | "title": "Using head"
485 | }
486 | },
487 | "outputs": [
488 | {
489 | "output_type": "execute_result",
490 | "data": {
491 | "text/plain": [
492 | "[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0))]"
493 | ]
494 | },
495 | "execution_count": 13,
496 | "metadata": {},
497 | "output_type": "execute_result"
498 | }
499 | ],
500 | "source": [
501 | "data_df.head(1)"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 0,
507 | "metadata": {
508 | "application/vnd.databricks.v1+cell": {
509 | "cellMetadata": {
510 | "byteLimit": 2048000,
511 | "rowLimit": 10000
512 | },
513 | "inputWidgets": {},
514 | "nuid": "2e48afa4-79e2-4ac5-86f0-cb95a8145ef0",
515 | "showTitle": true,
516 | "title": "Converting Pyspark dataframe to Pandas"
517 | }
518 | },
519 | "outputs": [
520 | {
521 | "output_type": "execute_result",
522 | "data": {
523 | "text/html": [
524 | "\n",
525 | "\n",
538 | "
\n",
539 | " \n",
540 | " \n",
541 | " | \n",
542 | " col_1 | \n",
543 | " col_2 | \n",
544 | " col_3 | \n",
545 | " col_4 | \n",
546 | " col_5 | \n",
547 | "
\n",
548 | " \n",
549 | " \n",
550 | " \n",
551 | " | 0 | \n",
552 | " 100 | \n",
553 | " 200.0 | \n",
554 | " string_test_1 | \n",
555 | " 2023-01-01 | \n",
556 | " 2023-01-01 12:00:00 | \n",
557 | "
\n",
558 | " \n",
559 | " | 1 | \n",
560 | " 200 | \n",
561 | " 300.0 | \n",
562 | " string_test_2 | \n",
563 | " 2023-02-01 | \n",
564 | " 2023-01-02 12:00:00 | \n",
565 | "
\n",
566 | " \n",
567 | " | 2 | \n",
568 | " 300 | \n",
569 | " 400.0 | \n",
570 | " string_test_3 | \n",
571 | " 2023-03-01 | \n",
572 | " 2023-01-03 12:00:00 | \n",
573 | "
\n",
574 | " \n",
575 | "
\n",
576 | "
"
577 | ],
578 | "text/plain": [
579 | " col_1 col_2 col_3 col_4 col_5\n",
580 | "0 100 200.0 string_test_1 2023-01-01 2023-01-01 12:00:00\n",
581 | "1 200 300.0 string_test_2 2023-02-01 2023-01-02 12:00:00\n",
582 | "2 300 400.0 string_test_3 2023-03-01 2023-01-03 12:00:00"
583 | ]
584 | },
585 | "execution_count": 15,
586 | "metadata": {},
587 | "output_type": "execute_result"
588 | }
589 | ],
590 | "source": [
591 | "data_df.toPandas()"
592 | ]
593 | },
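{
"cell_type": "markdown",
"metadata": {},
"source": [
"`toPandas()` also collects everything to the driver. As a sketch, Arrow-based conversion (a Spark 3.x configuration key) usually makes the transfer much faster; `pandas_df_fast` is an illustrative name."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Enable Arrow-accelerated conversion before calling toPandas()\n",
"spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')\n",
"pandas_df_fast = data_df.toPandas()"
]
},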
594 | {
595 | "cell_type": "markdown",
596 | "metadata": {
597 | "application/vnd.databricks.v1+cell": {
598 | "cellMetadata": {},
599 | "inputWidgets": {},
600 | "nuid": "f0502651-b54a-45e6-84ae-62ea7e1600ad",
601 | "showTitle": false,
602 | "title": ""
603 | }
604 | },
605 | "source": [
606 | "How to do Data Manipulation - Rows and Columns"
607 | ]
608 | },
609 | {
610 | "cell_type": "code",
611 | "execution_count": 0,
612 | "metadata": {
613 | "application/vnd.databricks.v1+cell": {
614 | "cellMetadata": {
615 | "byteLimit": 2048000,
616 | "rowLimit": 10000
617 | },
618 | "inputWidgets": {},
619 | "nuid": "e4f34241-a0de-47d4-89d0-5596681206c5",
620 | "showTitle": true,
621 | "title": "Selecting Columns"
622 | }
623 | },
624 | "outputs": [
625 | {
626 | "output_type": "stream",
627 | "name": "stdout",
628 | "output_type": "stream",
629 | "text": [
630 | "+-------------+\n| col_3|\n+-------------+\n|string_test_1|\n|string_test_2|\n|string_test_3|\n+-------------+\n\n"
631 | ]
632 | }
633 | ],
634 | "source": [
635 | "from pyspark.sql import Column\n",
636 | "\n",
637 | "data_df.select(data_df.col_3).show()\n"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": 0,
643 | "metadata": {
644 | "application/vnd.databricks.v1+cell": {
645 | "cellMetadata": {
646 | "byteLimit": 2048000,
647 | "rowLimit": 10000
648 | },
649 | "inputWidgets": {},
650 | "nuid": "47fd50e6-1433-4add-ae2f-1ddcfb1a0e7c",
651 | "showTitle": true,
652 | "title": "Creating Columns"
653 | }
654 | },
655 | "outputs": [
656 | {
657 | "output_type": "stream",
658 | "name": "stdout",
659 | "output_type": "stream",
660 | "text": [
661 | "+-----+-----+-------------+----------+-------------------+-----+\n|col_1|col_2| col_3| col_4| col_5|col_6|\n+-----+-----+-------------+----------+-------------------+-----+\n| 100|200.0|string_test_1|2023-01-01|2023-01-01 12:00:00| A|\n| 200|300.0|string_test_2|2023-02-01|2023-01-02 12:00:00| A|\n| 300|400.0|string_test_3|2023-03-01|2023-01-03 12:00:00| A|\n+-----+-----+-------------+----------+-------------------+-----+\n\n"
662 | ]
663 | }
664 | ],
665 | "source": [
666 | "from pyspark.sql import functions as F\n",
667 | "data_df = data_df.withColumn(\"col_6\", F.lit(\"A\"))\n",
668 | "data_df.show()\n"
669 | ]
670 | },
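{
"cell_type": "markdown",
"metadata": {},
"source": [
"`withColumn` also accepts expressions built from existing columns, not just literals; a sketch assuming `F` from the previous cell is still in scope."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Derive a new column from an existing one\n",
"data_df.withColumn('col_7', F.col('col_1') * 2).show()"
]
},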
671 | {
672 | "cell_type": "code",
673 | "execution_count": 0,
674 | "metadata": {
675 | "application/vnd.databricks.v1+cell": {
676 | "cellMetadata": {
677 | "byteLimit": 2048000,
678 | "rowLimit": 10000
679 | },
680 | "inputWidgets": {},
681 | "nuid": "62323798-4638-483c-9d30-e9206ef826de",
682 | "showTitle": true,
683 | "title": "Dropping Columns"
684 | }
685 | },
686 | "outputs": [
687 | {
688 | "output_type": "stream",
689 | "name": "stdout",
690 | "output_type": "stream",
691 | "text": [
692 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
693 | ]
694 | }
695 | ],
696 | "source": [
697 | "data_df = data_df.drop(\"col_5\")\n",
698 | "data_df.show()\n"
699 | ]
700 | },
701 | {
702 | "cell_type": "code",
703 | "execution_count": 0,
704 | "metadata": {
705 | "application/vnd.databricks.v1+cell": {
706 | "cellMetadata": {
707 | "byteLimit": 2048000,
708 | "rowLimit": 10000
709 | },
710 | "inputWidgets": {},
711 | "nuid": "fb759316-d4e6-43b0-8ced-d073f1a20f97",
712 | "showTitle": true,
713 | "title": "Updating Columns"
714 | }
715 | },
716 | "outputs": [
717 | {
718 | "output_type": "stream",
719 | "name": "stdout",
720 | "output_type": "stream",
721 | "text": [
722 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| col_3| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100| 2.0|string_test_1|2023-01-01| A|\n| 200| 3.0|string_test_2|2023-02-01| A|\n| 300| 4.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
723 | ]
724 | }
725 | ],
726 | "source": [
727 | "data_df.withColumn(\"col_2\", F.col(\"col_2\") / 100).show()"
728 | ]
729 | },
730 | {
731 | "cell_type": "code",
732 | "execution_count": 0,
733 | "metadata": {
734 | "application/vnd.databricks.v1+cell": {
735 | "cellMetadata": {
736 | "byteLimit": 2048000,
737 | "rowLimit": 10000
738 | },
739 | "inputWidgets": {},
740 | "nuid": "2cc409f3-6407-48a1-a5c9-21f08062a7f3",
741 | "showTitle": true,
742 | "title": "Renaming Columns"
743 | }
744 | },
745 | "outputs": [
746 | {
747 | "output_type": "stream",
748 | "name": "stdout",
749 | "output_type": "stream",
750 | "text": [
751 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
752 | ]
753 | }
754 | ],
755 | "source": [
756 | "data_df = data_df.withColumnRenamed(\"col_3\", \"string_col\")\n",
757 | "data_df.show()\n"
758 | ]
759 | },
760 | {
761 | "cell_type": "code",
762 | "execution_count": 0,
763 | "metadata": {
764 | "application/vnd.databricks.v1+cell": {
765 | "cellMetadata": {
766 | "byteLimit": 2048000,
767 | "rowLimit": 10000
768 | },
769 | "inputWidgets": {},
770 | "nuid": "fd1755b4-eec5-44f5-b8d3-946cb0359432",
771 | "showTitle": true,
772 | "title": "Finding Unique Values in a Column"
773 | }
774 | },
775 | "outputs": [
776 | {
777 | "output_type": "stream",
778 | "name": "stdout",
779 | "output_type": "stream",
780 | "text": [
781 | "+-----+\n|col_6|\n+-----+\n| A|\n+-----+\n\n"
782 | ]
783 | }
784 | ],
785 | "source": [
786 | "data_df.select(\"col_6\").distinct().show()"
787 | ]
788 | },
789 | {
790 | "cell_type": "code",
791 | "execution_count": 0,
792 | "metadata": {
793 | "application/vnd.databricks.v1+cell": {
794 | "cellMetadata": {
795 | "byteLimit": 2048000,
796 | "rowLimit": 10000
797 | },
798 | "inputWidgets": {},
799 | "nuid": "beddedbc-ced1-477d-b8bd-30a102ef10dd",
800 | "showTitle": false,
801 | "title": ""
802 | }
803 | },
804 | "outputs": [
805 | {
806 | "output_type": "stream",
807 | "name": "stdout",
808 | "output_type": "stream",
809 | "text": [
810 | "+------------+\n|Total_Unique|\n+------------+\n| 1|\n+------------+\n\n"
811 | ]
812 | }
813 | ],
814 | "source": [
815 | "data_df.select(F.countDistinct(\"col_6\").alias(\"Total_Unique\")).show()"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 0,
821 | "metadata": {
822 | "application/vnd.databricks.v1+cell": {
823 | "cellMetadata": {
824 | "byteLimit": 2048000,
825 | "rowLimit": 10000
826 | },
827 | "inputWidgets": {},
828 | "nuid": "469c586b-c14e-4652-be89-97c9b62e5818",
829 | "showTitle": true,
830 | "title": "Change case of a Column"
831 | }
832 | },
833 | "outputs": [
834 | {
835 | "output_type": "stream",
836 | "name": "stdout",
837 | "output_type": "stream",
838 | "text": [
839 | "+-----+-----+-------------+----------+-----+----------------+\n|col_1|col_2| string_col| col_4|col_6|upper_string_col|\n+-----+-----+-------------+----------+-----+----------------+\n| 100|200.0|string_test_1|2023-01-01| A| STRING_TEST_1|\n| 200|300.0|string_test_2|2023-02-01| A| STRING_TEST_2|\n| 300|400.0|string_test_3|2023-03-01| A| STRING_TEST_3|\n+-----+-----+-------------+----------+-----+----------------+\n\n"
840 | ]
841 | }
842 | ],
843 | "source": [
844 | "from pyspark.sql.functions import upper\n",
845 | "\n",
846 | "data_df.withColumn('upper_string_col', upper(data_df.string_col)).show()\n"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": 0,
852 | "metadata": {
853 | "application/vnd.databricks.v1+cell": {
854 | "cellMetadata": {
855 | "byteLimit": 2048000,
856 | "rowLimit": 10000
857 | },
858 | "inputWidgets": {},
859 | "nuid": "c7b3e23f-84f5-41de-a09d-a7a20e9404b5",
860 | "showTitle": true,
861 | "title": "Filtering a Dataframe"
862 | }
863 | },
864 | "outputs": [
865 | {
866 | "output_type": "stream",
867 | "name": "stdout",
868 | "output_type": "stream",
869 | "text": [
870 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
871 | ]
872 | }
873 | ],
874 | "source": [
875 | "data_df.filter(data_df.col_1 == 100).show()"
876 | ]
877 | },
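{
"cell_type": "markdown",
"metadata": {},
"source": [
"`where()` is an alias for `filter()`, and both also accept SQL expression strings; a minimal sketch."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Column expression and SQL string forms are interchangeable\n",
"data_df.where('col_1 = 100').show()\n",
"data_df.filter('col_1 > 100').show()"
]
},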
878 | {
879 | "cell_type": "code",
880 | "execution_count": 0,
881 | "metadata": {
882 | "application/vnd.databricks.v1+cell": {
883 | "cellMetadata": {
884 | "byteLimit": 2048000,
885 | "rowLimit": 10000
886 | },
887 | "inputWidgets": {},
888 | "nuid": "889da414-a8e1-4014-ab7f-5e1f10d362fa",
889 | "showTitle": true,
890 | "title": "Logical Operators in a Dataframe"
891 | }
892 | },
893 | "outputs": [
894 | {
895 | "output_type": "stream",
896 | "name": "stdout",
897 | "output_type": "stream",
898 | "text": [
899 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
900 | ]
901 | }
902 | ],
903 | "source": [
904 | "data_df.filter((data_df.col_1 == 100)\n",
905 | "\t\t& (data_df.col_6 == 'A')).show()\n"
906 | ]
907 | },
908 | {
909 | "cell_type": "code",
910 | "execution_count": 0,
911 | "metadata": {
912 | "application/vnd.databricks.v1+cell": {
913 | "cellMetadata": {
914 | "byteLimit": 2048000,
915 | "rowLimit": 10000
916 | },
917 | "inputWidgets": {},
918 | "nuid": "582e12c8-1081-4a3f-a539-8fbf1080e316",
919 | "showTitle": false,
920 | "title": ""
921 | }
922 | },
923 | "outputs": [
924 | {
925 | "output_type": "stream",
926 | "name": "stdout",
927 | "output_type": "stream",
928 | "text": [
929 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
930 | ]
931 | }
932 | ],
933 | "source": [
934 | "data_df.filter((data_df.col_1 == 100)\n",
935 | "\t\t| (data_df.col_2 == 300.00)).show()\n"
936 | ]
937 | },
938 | {
939 | "cell_type": "code",
940 | "execution_count": 0,
941 | "metadata": {
942 | "application/vnd.databricks.v1+cell": {
943 | "cellMetadata": {
944 | "byteLimit": 2048000,
945 | "rowLimit": 10000
946 | },
947 | "inputWidgets": {},
948 | "nuid": "ef3afdd9-e883-40cc-bd79-cc9fcff57a7e",
949 | "showTitle": true,
950 | "title": "Using Isin()"
951 | }
952 | },
953 | "outputs": [
954 | {
955 | "output_type": "stream",
956 | "name": "stdout",
957 | "output_type": "stream",
958 | "text": [
959 | "+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
960 | ]
961 | }
962 | ],
963 | "source": [
964 | "list = [100, 200]\n",
965 | "data_df.filter(data_df.col_1.isin(list)).show()\n"
966 | ]
967 | },
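{
"cell_type": "markdown",
"metadata": {},
"source": [
"`~` negates a Column condition; a sketch assuming `filter_values` from the previous cell."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Rows whose col_1 is NOT in the list\n",
"data_df.filter(~data_df.col_1.isin(filter_values)).show()"
]
},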
968 | {
969 | "cell_type": "code",
970 | "execution_count": 0,
971 | "metadata": {
972 | "application/vnd.databricks.v1+cell": {
973 | "cellMetadata": {
974 | "byteLimit": 2048000,
975 | "rowLimit": 10000
976 | },
977 | "inputWidgets": {},
978 | "nuid": "fb4e727a-c8b3-4010-bbde-2224b82aab79",
979 | "showTitle": true,
980 | "title": "Datatype conversions"
981 | }
982 | },
983 | "outputs": [
984 | {
985 | "output_type": "stream",
986 | "name": "stdout",
987 | "output_type": "stream",
988 | "text": [
989 | "root\n |-- col_1: integer (nullable = true)\n |-- col_2: double (nullable = true)\n |-- string_col: string (nullable = true)\n |-- col_4: string (nullable = true)\n |-- col_6: string (nullable = false)\n\n+-----+-----+-------------+----------+-----+\n|col_1|col_2| string_col| col_4|col_6|\n+-----+-----+-------------+----------+-----+\n| 100|200.0|string_test_1|2023-01-01| A|\n| 200|300.0|string_test_2|2023-02-01| A|\n| 300|400.0|string_test_3|2023-03-01| A|\n+-----+-----+-------------+----------+-----+\n\n"
990 | ]
991 | }
992 | ],
993 | "source": [
994 | "from pyspark.sql.functions import col\n",
995 | "from pyspark.sql.types import StringType,BooleanType,DateType,IntegerType\n",
996 | "\n",
997 | "data_df_2 = data_df.withColumn(\"col_4\",col(\"col_4\").cast(StringType())) \\\n",
998 | " .withColumn(\"col_1\",col(\"col_1\").cast(IntegerType()))\n",
999 | "data_df_2.printSchema()\n",
1000 | "data_df.show()\n",
1001 | "\n"
1002 | ]
1003 | },
1004 | {
1005 | "cell_type": "code",
1006 | "execution_count": 0,
1007 | "metadata": {
1008 | "application/vnd.databricks.v1+cell": {
1009 | "cellMetadata": {
1010 | "byteLimit": 2048000,
1011 | "rowLimit": 10000
1012 | },
1013 | "inputWidgets": {},
1014 | "nuid": "908430a1-d7f1-4372-aebc-ef371f1efc96",
1015 | "showTitle": false,
1016 | "title": ""
1017 | }
1018 | },
1019 | "outputs": [
1020 | {
1021 | "output_type": "stream",
1022 | "name": "stdout",
1023 | "output_type": "stream",
1024 | "text": [
1025 | "root\n |-- col_4: date (nullable = true)\n |-- col_1: long (nullable = true)\n\n"
1026 | ]
1027 | }
1028 | ],
1029 | "source": [
1030 | "data_df_3 = data_df_2.selectExpr(\"cast(col_4 as date) col_4\",\n",
1031 | " \"cast(col_1 as long) col_1\")\n",
1032 | "data_df_3.printSchema()\n"
1033 | ]
1034 | },
1035 | {
1036 | "cell_type": "code",
1037 | "execution_count": 0,
1038 | "metadata": {
1039 | "application/vnd.databricks.v1+cell": {
1040 | "cellMetadata": {
1041 | "byteLimit": 2048000,
1042 | "rowLimit": 10000
1043 | },
1044 | "inputWidgets": {},
1045 | "nuid": "08646a8d-923c-440a-bae9-305e390b529e",
1046 | "showTitle": false,
1047 | "title": ""
1048 | }
1049 | },
1050 | "outputs": [
1051 | {
1052 | "output_type": "stream",
1053 | "name": "stdout",
1054 | "output_type": "stream",
1055 | "text": [
1056 | "root\n |-- col_1: double (nullable = true)\n |-- col_4: date (nullable = true)\n\n+-----+----------+\n|col_1|col_4 |\n+-----+----------+\n|100.0|2023-01-01|\n|200.0|2023-02-01|\n|300.0|2023-03-01|\n+-----+----------+\n\n"
1057 | ]
1058 | }
1059 | ],
1060 | "source": [
1061 | "data_df_3.createOrReplaceTempView(\"CastExample\")\n",
1062 | "data_df_4 = spark.sql(\"SELECT DOUBLE(col_1), DATE(col_4) from CastExample\")\n",
1063 | "data_df_4.printSchema()\n",
1064 | "data_df_4.show(truncate=False)\n"
1065 | ]
1066 | },
1067 | {
1068 | "cell_type": "code",
1069 | "execution_count": 0,
1070 | "metadata": {
1071 | "application/vnd.databricks.v1+cell": {
1072 | "cellMetadata": {
1073 | "byteLimit": 2048000,
1074 | "rowLimit": 10000
1075 | },
1076 | "inputWidgets": {},
1077 | "nuid": "92dcefea-630b-4ecf-8649-705fbf78b93c",
1078 | "showTitle": true,
1079 | "title": "Dropping null values from a Dataframe"
1080 | }
1081 | },
1082 | "outputs": [
1083 | {
1084 | "output_type": "stream",
1085 | "name": "stdout",
1086 | "output_type": "stream",
1087 | "text": [
1088 | "root\n |-- Employee: string (nullable = true)\n |-- Department: string (nullable = true)\n |-- Salary: long (nullable = true)\n\n+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n"
1089 | ]
1090 | }
1091 | ],
1092 | "source": [
1093 | "salary_data = [(\"John\", \"Field-eng\", 3500), \n",
1094 | " (\"Michael\", \"Field-eng\", 4500), \n",
1095 | " (\"Robert\", None, 4000), \n",
1096 | " (\"Maria\", \"Finance\", 3500), \n",
1097 | " (\"John\", \"Sales\", 3000), \n",
1098 | " (\"Kelly\", \"Finance\", 3500), \n",
1099 | " (\"Kate\", \"Finance\", 3000), \n",
1100 | " (\"Martin\", None, 3500), \n",
1101 | " (\"Kiran\", \"Sales\", 2200), \n",
1102 | " (\"Michael\", \"Field-eng\", 4500) \n",
1103 | " ]\n",
1104 | "columns= [\"Employee\", \"Department\", \"Salary\"]\n",
1105 | "salary_data = spark.createDataFrame(data = salary_data, schema = columns)\n",
1106 | "salary_data.printSchema()\n",
1107 | "salary_data.show()\n"
1108 | ]
1109 | },
1110 | {
1111 | "cell_type": "code",
1112 | "execution_count": 0,
1113 | "metadata": {
1114 | "application/vnd.databricks.v1+cell": {
1115 | "cellMetadata": {
1116 | "byteLimit": 2048000,
1117 | "rowLimit": 10000
1118 | },
1119 | "inputWidgets": {},
1120 | "nuid": "5f7ded6f-cc9d-4bfe-8f63-afe560278772",
1121 | "showTitle": false,
1122 | "title": ""
1123 | }
1124 | },
1125 | "outputs": [
1126 | {
1127 | "output_type": "stream",
1128 | "name": "stdout",
1129 | "output_type": "stream",
1130 | "text": [
1131 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Maria| Finance| 3500|\n| John| Sales| 3000|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Kiran| Sales| 2200|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n"
1132 | ]
1133 | }
1134 | ],
1135 | "source": [
1136 | "salary_data.dropna().show()"
1137 | ]
1138 | },
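{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dropna` can be restricted to particular columns, and nulls can be replaced rather than dropped; a minimal sketch where 'Unknown' is an arbitrary placeholder value."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Only consider Department when deciding which rows to drop\n",
"salary_data.dropna(subset=['Department']).show()\n",
"\n",
"# Replace nulls instead of dropping the rows\n",
"salary_data.fillna({'Department': 'Unknown'}).show()"
]
},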
1139 | {
1140 | "cell_type": "code",
1141 | "execution_count": 0,
1142 | "metadata": {
1143 | "application/vnd.databricks.v1+cell": {
1144 | "cellMetadata": {
1145 | "byteLimit": 2048000,
1146 | "rowLimit": 10000
1147 | },
1148 | "inputWidgets": {},
1149 | "nuid": "50cb0f47-2786-4d8a-9514-913108e338af",
1150 | "showTitle": true,
1151 | "title": "Dropping Duplicates from a Dataframe"
1152 | }
1153 | },
1154 | "outputs": [
1155 | {
1156 | "output_type": "stream",
1157 | "name": "stdout",
1158 | "output_type": "stream",
1159 | "text": [
1160 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| John| Field-eng| 3500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Sales| 3000|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n"
1161 | ]
1162 | }
1163 | ],
1164 | "source": [
1165 | "new_salary_data = salary_data.dropDuplicates().show()"
1166 | ]
1167 | },
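{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dropDuplicates` optionally takes a subset of columns to deduplicate on; a short sketch."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Keep one row per Employee, regardless of the other columns\n",
"salary_data.dropDuplicates(['Employee']).show()"
]
},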
1168 | {
1169 | "cell_type": "markdown",
1170 | "metadata": {
1171 | "application/vnd.databricks.v1+cell": {
1172 | "cellMetadata": {},
1173 | "inputWidgets": {},
1174 | "nuid": "26b09b00-3377-47f5-b54b-350d3126e92c",
1175 | "showTitle": false,
1176 | "title": ""
1177 | }
1178 | },
1179 | "source": [
1180 | "Using Aggregrates in a Dataframe"
1181 | ]
1182 | },
1183 | {
1184 | "cell_type": "code",
1185 | "execution_count": 0,
1186 | "metadata": {
1187 | "application/vnd.databricks.v1+cell": {
1188 | "cellMetadata": {
1189 | "byteLimit": 2048000,
1190 | "rowLimit": 10000
1191 | },
1192 | "inputWidgets": {},
1193 | "nuid": "151c07a1-1bf3-4cc0-98f4-46aad67e67d3",
1194 | "showTitle": true,
1195 | "title": "Average (avg)"
1196 | }
1197 | },
1198 | "outputs": [
1199 | {
1200 | "output_type": "stream",
1201 | "name": "stdout",
1202 | "output_type": "stream",
1203 | "text": [
1204 | "+-----------+\n|avg(Salary)|\n+-----------+\n| 3520.0|\n+-----------+\n\n"
1205 | ]
1206 | }
1207 | ],
1208 | "source": [
1209 | "from pyspark.sql.functions import countDistinct, avg\n",
1210 | "salary_data.select(avg('Salary')).show()\n"
1211 | ]
1212 | },
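{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aggregates are most often combined with `groupBy`; a sketch assuming `F` (pyspark.sql.functions), imported earlier, is still in scope."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Per-department aggregates instead of a whole-table average\n",
"salary_data.groupBy('Department').agg(\n",
"    F.avg('Salary').alias('avg_salary'),\n",
"    F.count('Salary').alias('headcount')\n",
").show()"
]
},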
1213 | {
1214 | "cell_type": "code",
1215 | "execution_count": 0,
1216 | "metadata": {
1217 | "application/vnd.databricks.v1+cell": {
1218 | "cellMetadata": {
1219 | "byteLimit": 2048000,
1220 | "rowLimit": 10000
1221 | },
1222 | "inputWidgets": {},
1223 | "nuid": "b3104546-2fba-4638-987b-100fb9ac53cf",
1224 | "showTitle": true,
1225 | "title": "Count"
1226 | }
1227 | },
1228 | "outputs": [
1229 | {
1230 | "output_type": "stream",
1231 | "name": "stdout",
1232 | "output_type": "stream",
1233 | "text": [
1234 | "+-------------+\n|count(Salary)|\n+-------------+\n| 10|\n+-------------+\n\n"
1235 | ]
1236 | }
1237 | ],
1238 | "source": [
1239 | "salary_data.agg({'Salary':'count'}).show()"
1240 | ]
1241 | },
1242 | {
1243 | "cell_type": "code",
1244 | "execution_count": 0,
1245 | "metadata": {
1246 | "application/vnd.databricks.v1+cell": {
1247 | "cellMetadata": {
1248 | "byteLimit": 2048000,
1249 | "rowLimit": 10000
1250 | },
1251 | "inputWidgets": {},
1252 | "nuid": "8a396838-9c08-46f9-ae42-6b5db6c094c8",
1253 | "showTitle": true,
1254 | "title": "Count distinct values"
1255 | }
1256 | },
1257 | "outputs": [
1258 | {
1259 | "output_type": "stream",
1260 | "name": "stdout",
1261 | "output_type": "stream",
1262 | "text": [
1263 | "+---------------+\n|Distinct Salary|\n+---------------+\n| 5|\n+---------------+\n\n"
1264 | ]
1265 | }
1266 | ],
1267 | "source": [
1268 | "salary_data.select(countDistinct(\"Salary\").alias(\"Distinct Salary\")).show()"
1269 | ]
1270 | },
1271 | {
1272 | "cell_type": "code",
1273 | "execution_count": 0,
1274 | "metadata": {
1275 | "application/vnd.databricks.v1+cell": {
1276 | "cellMetadata": {
1277 | "byteLimit": 2048000,
1278 | "rowLimit": 10000
1279 | },
1280 | "inputWidgets": {},
1281 | "nuid": "0d17be07-1395-470b-afec-93e6d3a8b893",
1282 | "showTitle": true,
1283 | "title": "Finding maximums (max)"
1284 | }
1285 | },
1286 | "outputs": [
1287 | {
1288 | "output_type": "stream",
1289 | "name": "stdout",
1290 | "output_type": "stream",
1291 | "text": [
1292 | "+-----------+\n|max(Salary)|\n+-----------+\n| 4500|\n+-----------+\n\n"
1293 | ]
1294 | }
1295 | ],
1296 | "source": [
1297 | "salary_data.agg({'Salary':'max'}).show() "
1298 | ]
1299 | },
1300 | {
1301 | "cell_type": "code",
1302 | "execution_count": 0,
1303 | "metadata": {
1304 | "application/vnd.databricks.v1+cell": {
1305 | "cellMetadata": {
1306 | "byteLimit": 2048000,
1307 | "rowLimit": 10000
1308 | },
1309 | "inputWidgets": {},
1310 | "nuid": "dfb27aef-8f62-4c89-b005-e046bc924699",
1311 | "showTitle": true,
1312 | "title": "Sum"
1313 | }
1314 | },
1315 | "outputs": [
1316 | {
1317 | "output_type": "stream",
1318 | "name": "stdout",
1319 | "output_type": "stream",
1320 | "text": [
1321 | "+-----------+\n|sum(Salary)|\n+-----------+\n| 35200|\n+-----------+\n\n"
1322 | ]
1323 | }
1324 | ],
1325 | "source": [
1326 | "salary_data.agg({'Salary':'sum'}).show()"
1327 | ]
1328 | },
1329 | {
1330 | "cell_type": "code",
1331 | "execution_count": 0,
1332 | "metadata": {
1333 | "application/vnd.databricks.v1+cell": {
1334 | "cellMetadata": {
1335 | "byteLimit": 2048000,
1336 | "rowLimit": 10000
1337 | },
1338 | "inputWidgets": {},
1339 | "nuid": "7d0b8f92-9d16-4cc4-9bc8-8b67db6befd6",
1340 | "showTitle": true,
1341 | "title": "Sort data with OrderBy"
1342 | }
1343 | },
1344 | "outputs": [
1345 | {
1346 | "output_type": "stream",
1347 | "name": "stdout",
1348 | "output_type": "stream",
1349 | "text": [
1350 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Kiran| Sales| 2200|\n| John| Sales| 3000|\n| Kate| Finance| 3000|\n| Martin| NULL| 3500|\n| Maria| Finance| 3500|\n| Kelly| Finance| 3500|\n| John| Field-eng| 3500|\n| Robert| NULL| 4000|\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n+--------+----------+------+\n\n"
1351 | ]
1352 | }
1353 | ],
1354 | "source": [
1355 | "salary_data.orderBy(\"Salary\").show()"
1356 | ]
1357 | },
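{
"cell_type": "markdown",
"metadata": {},
"source": [
"`orderBy` can also sort on several columns with mixed directions; a minimal sketch."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Department ascending, Salary descending within each department\n",
"salary_data.orderBy(['Department', 'Salary'], ascending=[True, False]).show()"
]
},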
1358 | {
1359 | "cell_type": "code",
1360 | "execution_count": 0,
1361 | "metadata": {
1362 | "application/vnd.databricks.v1+cell": {
1363 | "cellMetadata": {
1364 | "byteLimit": 2048000,
1365 | "rowLimit": 10000
1366 | },
1367 | "inputWidgets": {},
1368 | "nuid": "021a9f0c-59f2-497e-adbe-f39a2f7b7b21",
1369 | "showTitle": false,
1370 | "title": ""
1371 | }
1372 | },
1373 | "outputs": [
1374 | {
1375 | "output_type": "stream",
1376 | "name": "stdout",
1377 | "output_type": "stream",
1378 | "text": [
1379 | "+--------+----------+------+\n|Employee|Department|Salary|\n+--------+----------+------+\n| Michael| Field-eng| 4500|\n| Michael| Field-eng| 4500|\n| Robert| NULL| 4000|\n| John| Field-eng| 3500|\n| Martin| NULL| 3500|\n| Kelly| Finance| 3500|\n| Maria| Finance| 3500|\n| Kate| Finance| 3000|\n| John| Sales| 3000|\n| Kiran| Sales| 2200|\n+--------+----------+------+\n\n"
1380 | ]
1381 | }
1382 | ],
1383 | "source": [
1384 | "salary_data.orderBy(salary_data[\"Salary\"].desc()).show()"
1385 | ]
1386 | }
1387 | ],
1388 | "metadata": {
1389 | "application/vnd.databricks.v1+notebook": {
1390 | "dashboards": [],
1391 | "language": "python",
1392 | "notebookMetadata": {
1393 | "mostRecentlyExecutedCommandWithImplicitDF": {
1394 | "commandId": 969987236417588,
1395 | "dataframes": [
1396 | "_sqldf"
1397 | ]
1398 | },
1399 | "pythonIndentUnit": 2
1400 | },
1401 | "notebookName": "Chapter 4 Code",
1402 | "widgets": {}
1403 | }
1404 | },
1405 | "nbformat": 4,
1406 | "nbformat_minor": 0
1407 | }
1408 |
--------------------------------------------------------------------------------