├── Chapter0
│   ├── sample.txt
│   ├── README.md
│   ├── nyc_taxi_zone.csv
│   └── nyc_taxi_zone.json
├── Chapter11
│   └── README.md
├── Chapter5
│   ├── README.md
│   └── chapter5.ipynb
├── Chapter9
│   └── README.md
├── Chapter1
│   ├── department.csv
│   ├── README.md
│   ├── user.csv
│   └── employee.csv
├── Chapter6
│   └── README.md
├── Chapter2
│   ├── README.md
│   └── employee.csv
├── Chapter7
│   ├── README.md
│   └── 00000000000000000000.json
├── Chapter8
│   └── README.md
├── Chapter3
│   ├── README.md
│   ├── chapter3-CovidData.ipynb
│   ├── chapter3_YellowCab.ipynb
│   └── chapter3_PublicHoliday.ipynb
├── Chapter10
│   └── README.md
├── Chapter12
│   ├── commands.sh
│   ├── README.md
│   ├── Chapter12_1.ipynb
│   ├── Chapter12_2.ipynb
│   └── Chapter12.ipynb
├── Chapter4
│   └── README.md
└── README.md
/Chapter0/sample.txt:
--------------------------------------------------------------------------------
1 | 1,chris,USA
2 | 2,Mark,AUS
3 | 3,Jags,IND
4 | 4,Adam,UK
--------------------------------------------------------------------------------
/Chapter1/department.csv:
--------------------------------------------------------------------------------
1 | department_id,department_name
2 | 1005,Sales
3 | 1002,Finance
4 | 1004,Purchase
5 | 1001,Operations
6 | 1006,Marketing
7 | 1003,Technology
--------------------------------------------------------------------------------
/Chapter6/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 6 -> Spark ETL with APIs
3 |
4 | Tasks to do
5 | 1. Call an API and load the data into a dataframe
6 | 2. Create a temp table or view and analyse the data
7 | 3. Filter the data and store it in CSV format on the file server
8 | 4. Filter the data and store it in JSON format
9 |
10 | Reference:
11 | https://api.publicapis.org/entries
12 |
13 | Solution Notebook:
14 | [Spark Notebook](chapter6.ipynb)
15 |
16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/18/spark-etl-chapter-6-with-apis/
18 |
19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg
21 |
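22 | Example (a minimal sketch, not the solution notebook): call the reference API above, load the JSON entries into a dataframe, then filter and write CSV/JSON. The output paths and the `Category` column are assumptions based on that endpoint's response shape.
23 |
24 | ```python
25 | import json
26 | import requests
27 | from pyspark.sql import SparkSession
28 |
29 | spark = SparkSession.builder.appName("chapter6").getOrCreate()
30 |
31 | # 1. Call the API and load the entries into a dataframe (one JSON document per record)
32 | entries = requests.get("https://api.publicapis.org/entries").json()["entries"]
33 | df = spark.read.json(spark.sparkContext.parallelize([json.dumps(e) for e in entries]))
34 |
35 | # 2. Create a temp view and analyse the data with SQL
36 | df.createOrReplaceTempView("apis")
37 | filtered = spark.sql("SELECT * FROM apis WHERE Category = 'Technology'")
38 |
39 | # 3-4. Store the filtered data in CSV and JSON format
40 | filtered.write.mode("overwrite").option("header", "true").csv("api_csv")
41 | filtered.write.mode("overwrite").json("api_json")
42 | ```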
--------------------------------------------------------------------------------
/Chapter5/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 5 -> Spark ETL with Hive tables
3 |
4 | Tasks to do
5 | 1. Read data from a source (here we use our MongoDB collection)
6 | 2. Create a dataframe from the source
7 | 3. Create a Hive table from the dataframe
8 | 4. Create a temp Hive view from the dataframe
9 | 5. Create a global Hive view from the dataframe
10 | 6. List databases and the tables in each database
11 | 7. Drop all the created tables and views in the default database
12 | 8. Create the Dataeng database and create global and temp views using SQL
13 | 9. Access the global view from another session
14 |
15 |
16 | Solution Notebook:
17 | [Spark Notebook](chapter5.ipynb)
18 |
19 | Blog with Explanation:
20 |
21 | YouTube video with Explanation:
22 |
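23 | Example (a minimal sketch): the same table/view steps with a tiny inline dataframe standing in for the MongoDB source; it assumes a Spark build with Hive support.
24 |
25 | ```python
26 | from pyspark.sql import SparkSession
27 |
28 | spark = SparkSession.builder.appName("chapter5").enableHiveSupport().getOrCreate()
29 | df = spark.createDataFrame([(1, "chris", "USA"), (2, "Mark", "AUS")], ["id", "name", "country"])
30 |
31 | df.write.mode("overwrite").saveAsTable("users")    # managed Hive table
32 | df.createOrReplaceTempView("users_tmp")            # session-scoped temp view
33 | df.createOrReplaceGlobalTempView("users_gtmp")     # global view, registered under global_temp
34 |
35 | spark.sql("SHOW DATABASES").show()
36 | spark.sql("SHOW TABLES IN default").show()
37 | spark.sql("SELECT * FROM global_temp.users_gtmp").show()  # how another session reads the global view
38 |
39 | spark.sql("DROP TABLE IF EXISTS default.users")    # clean up the default database
40 | spark.sql("CREATE DATABASE IF NOT EXISTS Dataeng")
41 | ```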
--------------------------------------------------------------------------------
/Chapter2/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 2 -> Spark ETL with NoSQL Database (MongoDB)
3 |
4 | Tasks to do
5 | 1. Install the required Spark libraries
6 | 2. Create a connection with the NoSQL database
7 | 3. Read data from the NoSQL database
8 | 4. Transform data
9 | 5. Write data into the NoSQL database
10 |
11 | Spark Libraries
12 | https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
13 | 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1'
14 |
15 | Solution Notebook:
16 | [Spark Notebook](chapter2.ipynb)
17 |
18 | Blog with Explanation:
19 | https://developershome.blog/2023/03/07/spark-etl-chapter-2-with-nosql-database-mongodb-cassandra/
20 |
21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=vPZV_GF0klE&list=PLYqhYQOVe-qNwwWJdhiLM_In2l9kDwkAa&index=5
23 |
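24 | Example (a minimal sketch for the connector version pinned above): the MongoDB host, database, collection, and the `salary` field are placeholders.
25 |
26 | ```python
27 | from pyspark.sql import SparkSession
28 |
29 | spark = (SparkSession.builder.appName("chapter2")
30 |          .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
31 |          .config("spark.mongodb.input.uri", "mongodb://localhost:27017/DATAENG.employee")
32 |          .config("spark.mongodb.output.uri", "mongodb://localhost:27017/DATAENG.employee_out")
33 |          .getOrCreate())
34 |
35 | # Read the source collection into a dataframe
36 | df = spark.read.format("mongo").load()
37 |
38 | # Transform (assumes a numeric salary field, as in employee.csv), then write back
39 | df.filter(df.salary > 100000).write.format("mongo").mode("append").save()
40 | ```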
--------------------------------------------------------------------------------
/Chapter7/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 7 -> Spark ETL with Lakehouse | Delta Lake
3 |
4 | Tasks to do
5 | 1. Read data from the MySQL server into Spark
6 | 2. Create a Hive temp view from the data frame
7 | 3. Load filtered data into Delta format (create the initial table)
8 | 4. Load filtered data into Delta format again, into the same table
9 | 5. Read the Delta table using a Spark data frame
10 | 6. Create a temp Hive view of the Delta table
11 | 7. Write queries to read the data and explore table versions
12 |
13 | Solution Notebook:
14 | [Spark Notebook](chapter7.ipynb)
15 |
16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/18/spark-etl-chapter-6-with-apis/
18 |
19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg
21 |
22 | Medium Blog Channel:
23 | https://medium.com/@developershome
24 |
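25 | Example (a minimal sketch): two appends to the same Delta path create versions 0 and 1, which can then be read back and time-travelled. Delta 1.1.0 matches the engineInfo in this chapter's transaction log; the path and sample row are placeholders.
26 |
27 | ```python
28 | from pyspark.sql import SparkSession
29 |
30 | spark = (SparkSession.builder.appName("chapter7")
31 |          .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0")
32 |          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
33 |          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
34 |          .getOrCreate())
35 |
36 | df = spark.createDataFrame([(1, "Todd", "Wilson", 110000)], ["id", "first_name", "last_name", "salary"])
37 |
38 | df.write.format("delta").mode("append").save("/tmp/delta/employee")  # creates the table (version 0)
39 | df.write.format("delta").mode("append").save("/tmp/delta/employee")  # second load (version 1)
40 |
41 | # Read the Delta table back and register a temp Hive view
42 | spark.read.format("delta").load("/tmp/delta/employee").createOrReplaceTempView("employee_delta")
43 | spark.sql("SELECT count(*) FROM employee_delta").show()
44 |
45 | # Explore versions: time-travel back to the first commit
46 | v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/employee")
47 | ```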
--------------------------------------------------------------------------------
/Chapter1/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 1 -> Spark ETL with SQL Database (MySQL | PostgreSQL)
3 |
4 | Tasks to do
5 | 1. Install the required Spark libraries
6 | 2. Create a connection with the SQL database
7 | 3. Read data from the SQL database
8 | 4. Transform data
9 | 5. Write data into the SQL server
10 |
11 | Spark Libraries
12 | https://mvnrepository.com/artifact/mysql/mysql-connector-java
13 |
14 | Solution Notebook:
15 | [Spark Notebook](chapter1.ipynb)
16 |
17 | Blog with Explanation:
18 | https://developershome.blog/2023/03/06/spark-etl-with-sql-databases-mysql-postgresql/
19 | https://medium.com/@fylfotbeta/spark-etl-chapter-1-with-sql-databases-mysql-postgresql-a0a589f7f9ff
20 |
21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=PHahcWd1AqM&list=PLYqhYQOVe-qNwwWJdhiLM_In2l9kDwkAa&index=3
23 |
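24 | Example (a minimal sketch): the host, database, and credentials are placeholders (they mirror the values used in the Chapter 12 notebooks).
25 |
26 | ```python
27 | from pyspark.sql import SparkSession, functions as F
28 |
29 | spark = (SparkSession.builder.appName("chapter1")
30 |          .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.32")
31 |          .getOrCreate())
32 |
33 | # Read the employee table from MySQL into a dataframe
34 | employee = (spark.read.format("jdbc")
35 |             .option("driver", "com.mysql.cj.jdbc.Driver")
36 |             .option("url", "jdbc:mysql://localhost:3306/DATAENG")
37 |             .option("dbtable", "employee")
38 |             .option("user", "root").option("password", "mysql")
39 |             .load())
40 |
41 | # Transform: keep each employee's highest salary
42 | top = employee.groupBy("id", "first_name", "last_name").agg(F.max("salary").alias("salary"))
43 |
44 | # Write the result back to the SQL server
45 | (top.write.format("jdbc")
46 |  .option("driver", "com.mysql.cj.jdbc.Driver")
47 |  .option("url", "jdbc:mysql://localhost:3306/DATAENG")
48 |  .option("dbtable", "employee_top_salary")
49 |  .option("user", "root").option("password", "mysql")
50 |  .mode("append").save())
51 | ```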
--------------------------------------------------------------------------------
/Chapter8/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 8 -> Spark ETL with Lakehouse | Apache HUDI
3 |
4 | Tasks to do
5 | 1. Read data from the MySQL server into Spark
6 | 2. Create a Hive temp view from the data frame
7 | 3. Load filtered data into HUDI format (create the initial table)
8 | 4. Load filtered data into HUDI format again, into the same table
9 | 5. Read the HUDI table using a Spark data frame
10 | 6. Create a temp Hive view of the HUDI table
11 | 7. Write queries to read the data and explore table versions
12 |
13 | Solution Notebook:
14 | [Spark Notebook](chapter8.ipynb)
15 |
16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-8-with-lakehouse-apache-hudi/
18 |
19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg
21 |
22 | Medium Blog Channel:
23 | https://medium.com/@developershome
24 |
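25 | Example (a minimal sketch): the Hudi bundle version, path, and option values are assumptions for a Spark 3.2 setup; the record key and precombine fields follow the employee schema used elsewhere in this repo.
26 |
27 | ```python
28 | from pyspark.sql import SparkSession
29 |
30 | spark = (SparkSession.builder.appName("chapter8")
31 |          .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.2")
32 |          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
33 |          .getOrCreate())
34 |
35 | df = spark.createDataFrame([(1, "Todd", "Wilson", 110000)], ["id", "first_name", "last_name", "salary"])
36 |
37 | hudi_options = {
38 |     "hoodie.table.name": "employee_hudi",
39 |     "hoodie.datasource.write.recordkey.field": "id",
40 |     "hoodie.datasource.write.precombine.field": "salary",
41 |     "hoodie.datasource.write.operation": "upsert",
42 | }
43 |
44 | # The initial load creates the table; a second append upserts into the same table
45 | df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/employee")
46 | df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/employee")
47 |
48 | # Read back and register a temp Hive view
49 | spark.read.format("hudi").load("/tmp/hudi/employee").createOrReplaceTempView("employee_hudi_view")
50 | spark.sql("SELECT * FROM employee_hudi_view").show()
51 | ```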
--------------------------------------------------------------------------------
/Chapter11/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 11 -> Spark ETL with Lakehouse | Delta table Optimization (Partition, ZORDER & Optimize)
3 |
4 | Tasks to do
5 | 1. Read data from a CSV file into Spark
6 | 2. Create a Hive temp view from the data frame
7 | 3. Load data into Delta format (create the initial table)
8 | 4. Load data into Delta format with a partition column
9 | 5. Apply Optimize executeCompaction on the Delta table
10 | 6. Apply Optimize ZOrder on the Delta table
11 | 7. Check performance
12 |
13 | Solution Notebook:
14 | [Spark Notebook](chapter9.ipynb)
15 |
16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/
18 |
19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg
21 |
22 | Medium Blog Channel:
23 | https://medium.com/@developershome
24 |
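25 | Example (a minimal sketch): OPTIMIZE and ZORDER need Delta Lake 2.x (they are not in the 1.1.0 release used elsewhere in this repo); the path and columns follow the nyc_taxi_zone.csv file from Chapter 0, and the delta-spark Python package is assumed to be installed.
26 |
27 | ```python
28 | from pyspark.sql import SparkSession
29 | from delta.tables import DeltaTable
30 |
31 | spark = (SparkSession.builder.appName("chapter11")
32 |          .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
33 |          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
34 |          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
35 |          .getOrCreate())
36 |
37 | df = spark.read.option("header", "true").option("inferSchema", "true").csv("nyc_taxi_zone.csv")
38 |
39 | # Load into Delta, partitioned by borough
40 | df.write.format("delta").mode("overwrite").partitionBy("Borough").save("/tmp/delta/taxi_zone")
41 |
42 | dt = DeltaTable.forPath(spark, "/tmp/delta/taxi_zone")
43 | dt.optimize().executeCompaction()            # bin-packing compaction of small files
44 | dt.optimize().executeZOrderBy("LocationID")  # co-locate rows by LocationID
45 | ```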
--------------------------------------------------------------------------------
/Chapter9/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 9 -> Spark ETL with Lakehouse | Apache Iceberg
3 |
4 | Tasks to do
5 | 1. Read data from the MySQL server into Spark
6 | 2. Create a Hive temp view from the data frame
7 | 3. Load filtered data into Iceberg format (create the initial table)
8 | 4. Load filtered data into Iceberg format again, into the same table
9 | 5. Read the Iceberg table using a Spark data frame
10 | 6. Create a temp Hive view of the Iceberg table
11 | 7. Write queries to read the data and explore versions
12 |
13 | Solution Notebook:
14 | [Spark Notebook](chapter9.ipynb)
15 |
16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/
18 |
19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg
21 |
22 | Medium Blog Channel:
23 | https://medium.com/@developershome
24 |
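25 | Example (a minimal sketch): the Iceberg runtime version and the local Hadoop catalog below are assumptions for a Spark 3.2 setup.
26 |
27 | ```python
28 | from pyspark.sql import SparkSession
29 |
30 | spark = (SparkSession.builder.appName("chapter9")
31 |          .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2")
32 |          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
33 |          .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
34 |          .config("spark.sql.catalog.local.type", "hadoop")
35 |          .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg")
36 |          .getOrCreate())
37 |
38 | df = spark.createDataFrame([(1, "Todd", "Wilson", 110000)], ["id", "first_name", "last_name", "salary"])
39 |
40 | df.writeTo("local.db.employee").createOrReplace()  # initial table
41 | df.writeTo("local.db.employee").append()           # second load -> a new snapshot
42 |
43 | spark.read.table("local.db.employee").createOrReplaceTempView("employee_iceberg")
44 | spark.sql("SELECT * FROM local.db.employee.snapshots").show()  # explore versions
45 | ```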
--------------------------------------------------------------------------------
/Chapter3/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 3 -> Spark ETL with Azure (Blob | ADLS)
3 |
4 | Tasks to do
5 | 1. Install the required Spark libraries
6 | 2. Create a connection with Azure Blob storage
7 | 3. Read data from the blob and store it in a dataframe
8 | 4. Transform data
9 | 5. Write data into a parquet file
10 | 6. Write data into a CSV file
11 |
12 | Reference:
13 | https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog
14 |
15 | Solution Notebook:
16 | [Spark Notebook With NYC Yellow Taxi blob](chapter3_YellowCab.ipynb)
17 | [Spark Notebook With Covid Public Data blob](chapter3_CovidData.ipynb)
18 | [Spark Notebook With Public Holiday blob](chapter3_PublicHoliday.ipynb)
19 |
20 | Blog with Explanation:
21 | https://developershome.blog/2023/03/08/spark-etl-chapter-3-with-cloud-data-lakes-azure-blob-azure-adls/
22 |
23 | YouTube video with Explanation:
24 |
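25 | Example (a minimal sketch mirroring the notebooks): read a public Azure Open Dataset over wasbs://; it assumes the hadoop-azure libraries are on the classpath, and the SAS token is empty because the container is public.
26 |
27 | ```python
28 | from pyspark.sql import SparkSession
29 |
30 | spark = SparkSession.builder.appName("chapter3").getOrCreate()
31 |
32 | blob_account_name = "pandemicdatalake"
33 | blob_container_name = "public"
34 | blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
35 | blob_sas_token = r""
36 |
37 | # Allow Spark to read from the blob remotely
38 | wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (
39 |     blob_container_name, blob_account_name, blob_relative_path)
40 | spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net" % (
41 |     blob_container_name, blob_account_name), blob_sas_token)
42 |
43 | df = spark.read.parquet(wasbs_path)
44 |
45 | # Transform, then write out in parquet and CSV
46 | df.createOrReplaceTempView("tempSource")
47 | sample = spark.sql("SELECT * FROM tempSource LIMIT 10")
48 | sample.write.format("parquet").option("compression", "snappy").save("parquetdata", mode="append")
49 | sample.write.format("csv").option("header", "true").save("csvdata", mode="append")
50 | ```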
--------------------------------------------------------------------------------
/Chapter10/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 10 -> Spark ETL with Lakehouse | Delta Lake vs Apache Iceberg vs Apache HUDI
3 |
4 | Tasks to do
5 | 1. Read data from the MySQL server into Spark
6 | 2. Create a Hive temp view from the data frame
7 | 3. Load filtered data into Delta format (create the initial table)
8 | 4. Load filtered data into HUDI format (create the initial table)
9 | 5. Load filtered data into Iceberg format (create the initial table)
10 | 6. Read the data back from the Delta | HUDI | Iceberg formats
11 |
12 | Solution Notebook:
13 | [Spark Notebook Delta](chapter10_delta.ipynb)
14 | [Spark Notebook HUDI](chapter10_hudi.ipynb)
15 | [Spark Notebook Iceberg](chapter10_iceberg.ipynb)
16 |
17 | Blog with Explanation:
18 | https://developershome.blog/?s=etl&category=spark
19 |
20 | YouTube video with Explanation:
21 | https://www.youtube.com/watch?v=eL1xIjranhg
22 |
23 | Medium Blog Channel:
24 | https://medium.com/@developershome
25 |
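26 | Example (a minimal side-by-side sketch): one session configured for all three formats; every package version, catalog name, and path here is an assumption for a Spark 3.2 setup.
27 |
28 | ```python
29 | from pyspark.sql import SparkSession
30 |
31 | spark = (SparkSession.builder.appName("chapter10")
32 |          .config("spark.jars.packages", ",".join([
33 |              "io.delta:delta-core_2.12:2.0.0",
34 |              "org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.2",
35 |              "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2"]))
36 |          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
37 |          .config("spark.sql.extensions",
38 |                  "io.delta.sql.DeltaSparkSessionExtension,"
39 |                  "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
40 |          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
41 |          .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
42 |          .config("spark.sql.catalog.local.type", "hadoop")
43 |          .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg")
44 |          .getOrCreate())
45 |
46 | df = spark.createDataFrame([(1, "Todd", "Wilson", 110000)], ["id", "first_name", "last_name", "salary"])
47 |
48 | # Write the same dataframe in all three formats
49 | df.write.format("delta").mode("append").save("/tmp/compare/delta")
50 | df.write.format("hudi") \
51 |     .option("hoodie.table.name", "employee") \
52 |     .option("hoodie.datasource.write.recordkey.field", "id") \
53 |     .option("hoodie.datasource.write.precombine.field", "salary") \
54 |     .mode("append").save("/tmp/compare/hudi")
55 | df.writeTo("local.db.employee").createOrReplace()
56 |
57 | # Read each format back
58 | spark.read.format("delta").load("/tmp/compare/delta").show()
59 | spark.read.format("hudi").load("/tmp/compare/hudi").show()
60 | spark.read.table("local.db.employee").show()
61 | ```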
--------------------------------------------------------------------------------
/Chapter12/commands.sh:
--------------------------------------------------------------------------------
1 | # Command for listing topics
2 | kafka-topics.sh --bootstrap-server=localhost:9092 --list
3 |
4 | # Command for creating topic
5 | kafka-topics.sh --create --topic dataeng --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
6 | kafka-topics.sh --create --bootstrap-server localhost:9092 --topic test_topic
7 |
8 | # Describe a topic in detail
9 | kafka-topics.sh --bootstrap-server=localhost:9092 --describe --topic dataeng
10 |
11 | # Commands for a producer to publish messages
12 | kafka-console-producer.sh --topic dataeng --bootstrap-server localhost:9092
13 | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test_topic --property "parse.key=true" --property "key.separator=:"
14 |
15 | # Commands for a consumer to receive messages
16 | kafka-console-consumer.sh --topic dataeng --from-beginning --bootstrap-server localhost:9092
17 | kafka-console-consumer.sh --topic test_topic --from-beginning --bootstrap-server localhost:9092
--------------------------------------------------------------------------------
/Chapter0/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 0 -> Spark ETL with files (CSV | JSON | Parquet | Text | Spark Dataframe)
3 |
4 | Tasks to do
5 | 1. Read a CSV file and load it into a dataframe
6 | 2. Read a JSON file and load it into a dataframe
7 | 3. Read a Parquet file and load it into a dataframe
8 | 4. Read a text file and load it into a dataframe
9 | 5. Create temp tables for all of them
10 | 6. Create a JSON file from the CSV dataframe
11 | 7. Create a CSV file from the Parquet dataframe
12 | 8. Create a Parquet file from the JSON dataframe
13 | 9. Create an ORC file from the JSON dataframe
14 |
15 | Reference Data:
16 | https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
17 |
18 | Solution Notebook:
19 | [Spark Notebook](chapter0.ipynb)
20 |
21 | Blog with Explanation:
22 | https://medium.com/@fylfotbeta/spark-etl-chapter-0-with-files-csv-json-parquet-orc-87359909c568
23 |
24 | https://developershome.blog/2023/03/02/spark-etl-chapter-0-with-files-csv-json-parquet-orc/
25 |
26 | YouTube video with Explanation:
27 | https://youtu.be/fL_DpgyU040
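28 |
29 | Example (a minimal sketch of the file round-trips, using the files in this folder; the output directory names are placeholders):
30 |
31 | ```python
32 | from pyspark.sql import SparkSession
33 |
34 | spark = SparkSession.builder.appName("chapter0").getOrCreate()
35 |
36 | # 1. Read a CSV file into a dataframe (with a header row and schema inference)
37 | csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("nyc_taxi_zone.csv")
38 |
39 | # 2. Read a JSON file into a dataframe
40 | json_df = spark.read.json("nyc_taxi_zone.json")
41 |
42 | # 4. Read a text file into a dataframe (a single string column named "value")
43 | text_df = spark.read.text("sample.txt")
44 |
45 | # 5. Create a temp table so the data can be queried with SQL
46 | csv_df.createOrReplaceTempView("taxi_zone")
47 | spark.sql("SELECT Borough, count(*) FROM taxi_zone GROUP BY Borough").show()
48 |
49 | # 6-9. Write the dataframes back out in the other formats
50 | csv_df.write.mode("overwrite").json("taxi_zone_json")
51 | json_df.write.mode("overwrite").parquet("taxi_zone_parquet")
52 | json_df.write.mode("overwrite").orc("taxi_zone_orc")
53 | ```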
--------------------------------------------------------------------------------
/Chapter4/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 4 -> Spark ETL with AWS (S3 bucket)
3 |
4 | Tasks to do
5 | 1. Install the required Spark libraries
6 | 2. Create a connection with the AWS S3 bucket
7 | 3. Read data from the S3 bucket and store it in a dataframe
8 | 4. Transform data
9 | 5. Write data into a parquet file
10 | 6. Write data into a JSON file
11 |
12 | Reference:
13 | https://registry.opendata.aws/speedtest-global-performance/
14 |
15 | Commands:
16 | aws s3 ls --no-sign-request s3://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet
17 | aws s3 cp --no-sign-request s3://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet sample.parquet
18 |
19 | Solution Notebook:
20 | [Spark Notebook](chapter4.ipynb)
21 | [Spark Notebook Dataset1](chapter4-dataset1.ipynb)
22 |
23 | Blog with Explanation:
24 | https://developershome.blog/2023/03/12/spark-etl-chapter-4-with-cloud-data-lakes-aws-s3-bucket/
25 |
26 | YouTube video with Explanation:
27 |
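28 | Example (a minimal sketch): the dataset path comes from the commands above; the hadoop-aws version and anonymous-credentials setting are assumptions for a Spark 3.2 build.
29 |
30 | ```python
31 | from pyspark.sql import SparkSession
32 |
33 | spark = (SparkSession.builder.appName("chapter4")
34 |          .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
35 |          .config("spark.hadoop.fs.s3a.aws.credentials.provider",
36 |                  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
37 |          .getOrCreate())
38 |
39 | # Read the public Ookla parquet file straight from S3
40 | df = spark.read.parquet(
41 |     "s3a://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/"
42 |     "2019-01-01_performance_fixed_tiles.parquet")
43 | df.printSchema()
44 |
45 | # Transform, then write locally as parquet and JSON
46 | sample = df.limit(1000)
47 | sample.write.mode("overwrite").parquet("s3_sample_parquet")
48 | sample.write.mode("overwrite").json("s3_sample_json")
49 | ```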
--------------------------------------------------------------------------------
/Chapter7/00000000000000000000.json:
--------------------------------------------------------------------------------
1 | {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
2 | {"metaData":{"id":"29ae77fe-b695-4e96-b941-16d6ffd0ade5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"FOODNAME\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"SCIENTIFICNAME\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"GROUP\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"SUBGROUP\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}}]}","partitionColumns":[],"configuration":{},"createdTime":1679214189993}}
3 | {"add":{"path":"part-00000-4865d56d-ee2d-4f57-9e81-645e1547bfe3-c000.snappy.parquet","partitionValues":{},"size":2938,"modificationTime":1679214191340,"dataChange":true}}
4 | {"commitInfo":{"timestamp":1679214191401,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"52","numOutputBytes":"2938"},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.1.0"}}
5 |
--------------------------------------------------------------------------------
/Chapter1/user.csv:
--------------------------------------------------------------------------------
1 | id,name
2 | 1,Dustin Smith
3 | 2,Jay Ramirez
4 | 3,Joseph Cooke
5 | 4,Melinda Young
6 | 5,Sean Parker
7 | 6,Ian Foster
8 | 7,Christopher Schmitt
9 | 8,Patrick Gutierrez
10 | 9,Dennis Douglas
11 | 10,Brenda Morris
12 | 11,Jeffery Hernandez
13 | 12,David Rice
14 | 13,Charles Foster
15 | 14,Keith Perez DVM
16 | 15,Dean Cuevas
17 | 16,Melissa Bishop
18 | 17,Alexander Howell
19 | 18,Austin Robertson
20 | 19,Sherri Mcdaniel
21 | 20,Nancy Nguyen
22 | 21,Melody Ball
23 | 22,Christopher Stokes
24 | 23,Joseph Hamilton
25 | 24,Kevin Fischer
26 | 25,Crystal Berg
27 | 26,Barbara Larson
28 | 27,Jacqueline Heath
29 | 28,Eric Gardner
30 | 29,Daniel Kennedy
31 | 30,Kaylee Sims
32 | 31,Shannon Green
33 | 32,Stacy Collins
34 | 33,Donna Ortiz
35 | 34,Jennifer Simmons
36 | 35,Michael Gill
37 | 36,Alyssa Shaw
38 | 37,Destiny Clark
39 | 38,Thomas Lara
40 | 39,Mark Diaz
41 | 40,Stacy Bryant
42 | 41,Howard Rose
43 | 42,Brian Schwartz
44 | 43,Kimberly Potter
45 | 44,Cassidy Ryan
46 | 45,Benjamin Mcbride
47 | 46,Elizabeth Ward
48 | 47,Christina Price
49 | 48,Pamela Cox
50 | 49,Jessica Peterson
51 | 50,Michael Nelson
--------------------------------------------------------------------------------
/Chapter12/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Chapter 12 -> Spark ETL with Apache Kafka
3 |
4 | Tasks to do
5 | 1. Create an Apache Kafka producer, create the topic, and publish messages
6 | 2. Create an Apache Kafka consumer, subscribe to the topic, and receive messages
7 | 3. Create a Spark session and install the required libraries for Apache Kafka
8 | 4. From the Spark session, subscribe to the earlier-created topic
9 | 5. Stream messages to the console
10 | 6. Write streaming messages into files (CSV, JSON, or Delta format)
11 | 7. Write streaming messages to a database (MySQL, PostgreSQL, or MongoDB)
12 |
13 | Solution Notebook:
14 | [Spark Notebook For Streaming messages on Console](chapter12.ipynb)
15 | [Spark Notebook For Streaming messages in CSV files](chapter12_1.ipynb)
16 | [Spark Notebook For Streaming messages in SQL Server](chapter12_2.ipynb)
17 |
18 | Blog with Explanation:
19 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/
20 |
21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=eL1xIjranhg
23 |
24 | Medium Blog Channel:
25 | https://medium.com/@developershome
26 |
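27 | Example (a minimal console-sink sketch; the notebooks below use a LAN broker at 192.168.1.102:9092 and the topic News_XYZ_Technology, so the localhost broker and dataeng topic here are placeholders):
28 |
29 | ```python
30 | from pyspark.sql import SparkSession
31 |
32 | spark = (SparkSession.builder.appName("chapter12")
33 |          .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0")
34 |          .getOrCreate())
35 |
36 | # 4. Subscribe to the earlier-created topic
37 | df = (spark.readStream.format("kafka")
38 |       .option("kafka.bootstrap.servers", "localhost:9092")
39 |       .option("subscribe", "dataeng")
40 |       .option("startingOffsets", "earliest")
41 |       .load())
42 |
43 | # 5. Kafka values arrive as binary, so cast to string, then stream to the console
44 | (df.selectExpr("cast(value as string)")
45 |  .writeStream
46 |  .format("console")
47 |  .outputMode("append")
48 |  .start()
49 |  .awaitTermination())
50 | ```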
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Spark ETL
2 | Extract, Transform, and Load (ETL) using Spark, or
3 | Extract, Load, and Transform (ELT) using Spark
4 |
5 |
6 |
7 | Here, we will create Spark notebooks for all of the ETL processes below. Once we have learned about all the ETL processes, we will start working on projects using Spark.
8 | Please find the list of ETL pipelines:
9 |
10 | 0. Chapter0 -> [Spark ETL with Files (CSV | JSON | Parquet)](Chapter0/README.md)
11 | 1. Chapter1 -> [Spark ETL with SQL Database (MySQL | PostgreSQL)](Chapter1/README.md)
12 | 2. Chapter2 -> [Spark ETL with NoSQL Database (MongoDB)](Chapter2/README.md)
13 | 3. Chapter3 -> [Spark ETL with Azure (Blob | ADLS)](Chapter3/README.md)
14 | 4. Chapter4 -> [Spark ETL with AWS (S3 bucket)](Chapter4/README.md)
15 | 5. Chapter5 -> [Spark ETL with Hive tables](Chapter5/README.md)
16 | 6. Chapter6 -> [Spark ETL with APIs](Chapter6/README.md)
17 | 7. Chapter7 -> [Spark ETL with Lakehouse (Delta Lake)](Chapter7/README.md)
18 | 8. Chapter8 -> [Spark ETL with Lakehouse (Apache HUDI)](Chapter8/README.md)
19 | 9. Chapter9 -> [Spark ETL with Lakehouse (Apache Iceberg)](Chapter9/README.md)
20 | 10. Chapter10 -> [Spark ETL with Lakehouse (Delta Lake vs Apache Iceberg vs Apache HUDI)](Chapter10/README.md)
21 | 11. Chapter11 -> [Spark ETL with Lakehouse (Delta table Optimization)](Chapter11/README.md)
22 | 12. Chapter12 -> [Spark ETL with Apache Kafka](Chapter12/README.md)
23 | 13. Chapter13 -> [Spark ETL with GCP (Big Query)](Chapter13/README.md)
24 |
25 |
26 | Also see the blog below for an explanation of all the data engineering ETL chapters:
27 |
28 | https://developershome.blog/category/data-engineering/spark-etl
29 |
30 | Also see the YouTube channel below for explanations of all the data engineering chapters and to learn new data engineering concepts:
31 |
32 | https://www.youtube.com/@developershomeIn
33 |
--------------------------------------------------------------------------------
/Chapter1/employee.csv:
--------------------------------------------------------------------------------
1 | "id","first_name","last_name","salary","department_id"
2 | 1,Todd,Wilson,110000,1006
3 | 1,Todd,Wilson,106119,1006
4 | 2,Justin,Simon,128922,1005
5 | 2,Justin,Simon,130000,1005
6 | 3,Kelly,Rosario,42689,1002
7 | 4,Patricia,Powell,162825,1004
8 | 4,Patricia,Powell,170000,1004
9 | 5,Sherry,Golden,44101,1002
10 | 6,Natasha,Swanson,79632,1005
11 | 6,Natasha,Swanson,90000,1005
12 | 7,Diane,Gordon,74591,1002
13 | 8,Mercedes,Rodriguez,61048,1005
14 | 9,Christy,Mitchell,137236,1001
15 | 9,Christy,Mitchell,140000,1001
16 | 9,Christy,Mitchell,150000,1001
17 | 10,Sean,Crawford,182065,1006
18 | 10,Sean,Crawford,190000,1006
19 | 11,Kevin,Townsend,166861,1002
20 | 12,Joshua,Johnson,123082,1004
21 | 13,Julie,Sanchez,185663,1001
22 | 13,Julie,Sanchez,200000,1001
23 | 13,Julie,Sanchez,210000,1001
24 | 14,John,Coleman,152434,1001
25 | 15,Anthony,Valdez,96898,1001
26 | 16,Briana,Rivas,151668,1005
27 | 17,Jason,Burnett,42525,1006
28 | 18,Jeffrey,Harris,14491,1002
29 | 18,Jeffrey,Harris,20000,1002
30 | 19,Michael,Ramsey,63159,1003
31 | 20,Cody,Gonzalez,112809,1004
32 | 21,Stephen,Berry,123617,1002
33 | 22,Brittany,Scott,162537,1002
34 | 23,Angela,Williams,100875,1004
35 | 24,William,Flores,142674,1003
36 | 25,Pamela,Matthews,57944,1005
37 | 26,Allison,Johnson,128782,1001
38 | 27,Anthony,Ball,34386,1003
39 | 28,Alexis,Beck,12260,1005
40 | 29,Jason,Olsen,51937,1006
41 | 30,Stephen,Smith,194791,1001
42 | 31,Kimberly,Brooks,95327,1003
43 | 32,Eric,Zimmerman,83093,1006
44 | 33,Peter,Holt,69945,1002
45 | 34,Justin,Dunn,67992,1003
46 | 35,John,Ball,47795,1004
47 | 36,Jesus,Ward,36078,1005
48 | 37,Philip,Gillespie,36424,1006
49 | 38,Nicole,Lewis,114079,1001
50 | 39,Linda,Clark,186781,1002
51 | 40,Colleen,Carrillo,147723,1004
52 | 41,John,George,21642,1001
53 | 42,Traci,Williams,138892,1003
54 | 42,Traci,Williams,150000,1003
55 | 42,Traci,Williams,160000,1003
56 | 42,Traci,Williams,180000,1003
57 | 43,Joseph,Rogers,22800,1005
58 | 44,Trevor,Carter,38670,1001
59 | 45,Kevin,Duncan,45210,1003
60 | 46,Joshua,Ewing,73088,1003
61 | 47,Kimberly,Dean,71416,1003
62 | 48,Robert,Lynch,117960,1004
63 | 49,Amber,Harding,77764,1002
64 | 50,Victoria,Wilson,176620,1002
65 | 51,Theresa,Everett,31404,1002
66 | 52,Kara,Smith,192838,1004
67 | 53,Teresa,Cohen,98860,1001
68 | 54,Wesley,Tucker,90221,1005
69 | 55,Michael,Morris,106799,1005
70 | 56,Rachael,Williams,103585,1002
71 | 57,Patricia,Harmon,147417,1005
72 | 58,Edward,Sharp,41077,1005
73 | 59,Kevin,Robinson,100924,1005
74 | 60,Charles,Pearson,173317,1004
75 | 61,Ryan,Brown,110225,1003
76 | 61,Ryan,Brown,120000,1003
77 | 62,Dale,Hayes,97662,1005
78 | 63,Richard,Sanford,136083,1001
79 | 64,Danielle,Williams,98655,1006
80 | 64,Danielle,Williams,110000,1006
81 | 64,Danielle,Williams,120000,1006
82 | 65,Deborah,Martin,67389,1004
83 | 66,Dustin,Bush,47567,1004
84 | 67,Tyler,Green,111085,1002
85 | 68,Antonio,Carpenter,83684,1002
86 | 69,Ernest,Peterson,115993,1005
87 | 70,Karen,Fernandez,101238,1003
88 | 71,Kristine,Casey,67651,1003
89 | 72,Christine,Frye,137244,1004
90 | 73,William,Preston,155225,1003
91 | 74,Richard,Cole,180361,1003
92 | 75,Julia,Ramos,61398,1006
93 | 75,Julia,Ramos,70000,1006
94 | 75,Julia,Ramos,83000,1006
95 | 75,Julia,Ramos,90000,1006
96 | 75,Julia,Ramos,105000,1006
97 |
--------------------------------------------------------------------------------
/Chapter2/employee.csv:
--------------------------------------------------------------------------------
1 | "id","first_name","last_name","salary","department_id"
2 | 1,Todd,Wilson,110000,1006
3 | 1,Todd,Wilson,106119,1006
4 | 2,Justin,Simon,128922,1005
5 | 2,Justin,Simon,130000,1005
6 | 3,Kelly,Rosario,42689,1002
7 | 4,Patricia,Powell,162825,1004
8 | 4,Patricia,Powell,170000,1004
9 | 5,Sherry,Golden,44101,1002
10 | 6,Natasha,Swanson,79632,1005
11 | 6,Natasha,Swanson,90000,1005
12 | 7,Diane,Gordon,74591,1002
13 | 8,Mercedes,Rodriguez,61048,1005
14 | 9,Christy,Mitchell,137236,1001
15 | 9,Christy,Mitchell,140000,1001
16 | 9,Christy,Mitchell,150000,1001
17 | 10,Sean,Crawford,182065,1006
18 | 10,Sean,Crawford,190000,1006
19 | 11,Kevin,Townsend,166861,1002
20 | 12,Joshua,Johnson,123082,1004
21 | 13,Julie,Sanchez,185663,1001
22 | 13,Julie,Sanchez,200000,1001
23 | 13,Julie,Sanchez,210000,1001
24 | 14,John,Coleman,152434,1001
25 | 15,Anthony,Valdez,96898,1001
26 | 16,Briana,Rivas,151668,1005
27 | 17,Jason,Burnett,42525,1006
28 | 18,Jeffrey,Harris,14491,1002
29 | 18,Jeffrey,Harris,20000,1002
30 | 19,Michael,Ramsey,63159,1003
31 | 20,Cody,Gonzalez,112809,1004
32 | 21,Stephen,Berry,123617,1002
33 | 22,Brittany,Scott,162537,1002
34 | 23,Angela,Williams,100875,1004
35 | 24,William,Flores,142674,1003
36 | 25,Pamela,Matthews,57944,1005
37 | 26,Allison,Johnson,128782,1001
38 | 27,Anthony,Ball,34386,1003
39 | 28,Alexis,Beck,12260,1005
40 | 29,Jason,Olsen,51937,1006
41 | 30,Stephen,Smith,194791,1001
42 | 31,Kimberly,Brooks,95327,1003
43 | 32,Eric,Zimmerman,83093,1006
44 | 33,Peter,Holt,69945,1002
45 | 34,Justin,Dunn,67992,1003
46 | 35,John,Ball,47795,1004
47 | 36,Jesus,Ward,36078,1005
48 | 37,Philip,Gillespie,36424,1006
49 | 38,Nicole,Lewis,114079,1001
50 | 39,Linda,Clark,186781,1002
51 | 40,Colleen,Carrillo,147723,1004
52 | 41,John,George,21642,1001
53 | 42,Traci,Williams,138892,1003
54 | 42,Traci,Williams,150000,1003
55 | 42,Traci,Williams,160000,1003
56 | 42,Traci,Williams,180000,1003
57 | 43,Joseph,Rogers,22800,1005
58 | 44,Trevor,Carter,38670,1001
59 | 45,Kevin,Duncan,45210,1003
60 | 46,Joshua,Ewing,73088,1003
61 | 47,Kimberly,Dean,71416,1003
62 | 48,Robert,Lynch,117960,1004
63 | 49,Amber,Harding,77764,1002
64 | 50,Victoria,Wilson,176620,1002
65 | 51,Theresa,Everett,31404,1002
66 | 52,Kara,Smith,192838,1004
67 | 53,Teresa,Cohen,98860,1001
68 | 54,Wesley,Tucker,90221,1005
69 | 55,Michael,Morris,106799,1005
70 | 56,Rachael,Williams,103585,1002
71 | 57,Patricia,Harmon,147417,1005
72 | 58,Edward,Sharp,41077,1005
73 | 59,Kevin,Robinson,100924,1005
74 | 60,Charles,Pearson,173317,1004
75 | 61,Ryan,Brown,110225,1003
76 | 61,Ryan,Brown,120000,1003
77 | 62,Dale,Hayes,97662,1005
78 | 63,Richard,Sanford,136083,1001
79 | 64,Danielle,Williams,98655,1006
80 | 64,Danielle,Williams,110000,1006
81 | 64,Danielle,Williams,120000,1006
82 | 65,Deborah,Martin,67389,1004
83 | 66,Dustin,Bush,47567,1004
84 | 67,Tyler,Green,111085,1002
85 | 68,Antonio,Carpenter,83684,1002
86 | 69,Ernest,Peterson,115993,1005
87 | 70,Karen,Fernandez,101238,1003
88 | 71,Kristine,Casey,67651,1003
89 | 72,Christine,Frye,137244,1004
90 | 73,William,Preston,155225,1003
91 | 74,Richard,Cole,180361,1003
92 | 75,Julia,Ramos,61398,1006
93 | 75,Julia,Ramos,70000,1006
94 | 75,Julia,Ramos,83000,1006
95 | 75,Julia,Ramos,90000,1006
96 | 75,Julia,Ramos,105000,1006
97 |
--------------------------------------------------------------------------------
/Chapter12/Chapter12_1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "0032fcbf-b834-4e38-a99a-269f80ae104f",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# First Load all the required library and also Start Spark Session\n",
11 | "# Load all the required library\n",
12 | "from pyspark.sql import SparkSession"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "id": "d0223468-96e6-42d3-a40d-3b0a68af8226",
19 | "metadata": {},
20 | "outputs": [
21 | {
22 | "name": "stderr",
23 | "output_type": "stream",
24 | "text": [
25 | "WARNING: An illegal reflective access operation has occurred\n",
26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
29 | "WARNING: All illegal access operations will be denied in a future release\n"
30 | ]
31 | },
32 | {
33 | "name": "stdout",
34 | "output_type": "stream",
35 | "text": [
36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
37 | ]
38 | },
39 | {
40 | "name": "stderr",
41 | "output_type": "stream",
42 | "text": [
43 | "Ivy Default Cache set to: /root/.ivy2/cache\n",
44 | "The jars for the packages stored in: /root/.ivy2/jars\n",
45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n",
46 | "mysql#mysql-connector-java added as a dependency\n",
47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-07a625e6-7632-4ff2-81bf-49eb66cddb8d;1.0\n",
48 | "\tconfs: [default]\n",
49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n",
50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n",
51 | "\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n",
52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n",
53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n",
54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n",
55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n",
56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n",
58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n",
59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n",
60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n",
61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n",
62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n",
63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n",
64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n",
65 | ":: resolution report :: resolve 3102ms :: artifacts dl 113ms\n",
66 | "\t:: modules in use:\n",
67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n",
68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from central in [default]\n",
69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n",
70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n",
71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n",
72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n",
73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n",
74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n",
75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n",
76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n",
77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n",
78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n",
79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n",
80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n",
81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n",
83 | "\t---------------------------------------------------------------------\n",
84 | "\t| | modules || artifacts |\n",
85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n",
86 | "\t---------------------------------------------------------------------\n",
87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n",
88 | "\t---------------------------------------------------------------------\n",
89 | ":: retrieving :: org.apache.spark#spark-submit-parent-07a625e6-7632-4ff2-81bf-49eb66cddb8d\n",
90 | "\tconfs: [default]\n",
91 | "\t0 artifacts copied, 15 already retrieved (0kB/45ms)\n",
92 | "23/08/02 11:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
94 | "Setting default log level to \"WARN\".\n",
95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
96 | "23/08/02 11:06:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "#Start Spark Session\n",
102 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n",
103 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n",
104 | " .getOrCreate()\n",
105 | "sqlContext = SparkSession(spark)\n",
106 | "#Dont Show warning only error\n",
107 | "spark.sparkContext.setLogLevel(\"WARN\")"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 3,
113 | "id": "49bdd94c-a4a5-41ea-89c7-0e36c173cae1",
114 | "metadata": {},
115 | "outputs": [
116 | {
117 | "data": {
118 | "text/html": [
119 | "<div><b>SparkSession - in-memory</b></div>\n",
120 | "<div><b>SparkContext</b> (Spark UI)</div>\n",
121 | "<dl><dt>Version</dt><dd>v3.2.1</dd><dt>Master</dt><dd>local[*]</dd><dt>AppName</dt><dd>chapter12</dd></dl>"
140 | ],
141 | "text/plain": [
142 | ""
143 | ]
144 | },
145 | "execution_count": 3,
146 | "metadata": {},
147 | "output_type": "execute_result"
148 | }
149 | ],
150 | "source": [
151 | "spark"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 6,
157 | "id": "3fd21dc9-d9df-4819-98ce-5efd44bf31c7",
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n",
162 | "KAFKA_TOPIC = \"News_XYZ_Technology\""
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 8,
168 | "id": "540b5af6-09e5-4269-a9de-dc03f5bd2b28",
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "df = spark.readStream.format(\"kafka\") \\\n",
173 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n",
174 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n",
175 | " .option(\"startingOffsets\", \"earliest\") \\\n",
176 | " .load()"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "id": "01227128-d113-4641-8e0e-60630be4481d",
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "name": "stderr",
187 | "output_type": "stream",
188 | "text": [
189 | "23/08/02 11:14:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
190 | " \r"
191 | ]
192 | }
193 | ],
194 | "source": [
195 | "df.selectExpr(\"cast(value as string)\") \\\n",
196 | " .writeStream \\\n",
197 | " .format(\"csv\") \\\n",
198 | " .option(\"checkpointLocation\", \"/opt/spark/SparkETL/Chapter12/csv_checkpoint\") \\\n",
199 | " .option(\"path\", \"/opt/spark/SparkETL/Chapter12/csv_data\") \\\n",
200 | " .outputMode(\"append\") \\\n",
201 | " .start() \\\n",
202 | " .awaitTermination()"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "id": "787af49a-d12e-4b15-8277-e2b5aa894c81",
209 | "metadata": {},
210 | "outputs": [],
211 | "source": []
212 | }
213 | ],
214 | "metadata": {
215 | "kernelspec": {
216 | "display_name": "Python 3 (ipykernel)",
217 | "language": "python",
218 | "name": "python3"
219 | },
220 | "language_info": {
221 | "codemirror_mode": {
222 | "name": "ipython",
223 | "version": 3
224 | },
225 | "file_extension": ".py",
226 | "mimetype": "text/x-python",
227 | "name": "python",
228 | "nbconvert_exporter": "python",
229 | "pygments_lexer": "ipython3",
230 | "version": "3.8.13"
231 | }
232 | },
233 | "nbformat": 4,
234 | "nbformat_minor": 5
235 | }
236 |
--------------------------------------------------------------------------------
/Chapter12/Chapter12_2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "0032fcbf-b834-4e38-a99a-269f80ae104f",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# First Load all the required library and also Start Spark Session\n",
11 | "# Load all the required library\n",
12 | "from pyspark.sql import SparkSession"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "id": "d0223468-96e6-42d3-a40d-3b0a68af8226",
19 | "metadata": {},
20 | "outputs": [
21 | {
22 | "name": "stderr",
23 | "output_type": "stream",
24 | "text": [
25 | "WARNING: An illegal reflective access operation has occurred\n",
26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
29 | "WARNING: All illegal access operations will be denied in a future release\n"
30 | ]
31 | },
32 | {
33 | "name": "stdout",
34 | "output_type": "stream",
35 | "text": [
36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
37 | ]
38 | },
39 | {
40 | "name": "stderr",
41 | "output_type": "stream",
42 | "text": [
43 | "Ivy Default Cache set to: /root/.ivy2/cache\n",
44 | "The jars for the packages stored in: /root/.ivy2/jars\n",
45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n",
46 | "mysql#mysql-connector-java added as a dependency\n",
47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-5bd2f33a-cc11-4f99-9a05-9717c077d03e;1.0\n",
48 | "\tconfs: [default]\n",
49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n",
50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n",
51 | "\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n",
52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n",
53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n",
54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n",
55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n",
56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n",
58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n",
59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n",
60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n",
61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n",
62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n",
63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n",
64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n",
65 | ":: resolution report :: resolve 4813ms :: artifacts dl 58ms\n",
66 | "\t:: modules in use:\n",
67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n",
68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from central in [default]\n",
69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n",
70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n",
71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n",
72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n",
73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n",
74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n",
75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n",
76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n",
77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n",
78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n",
79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n",
80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n",
81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n",
83 | "\t---------------------------------------------------------------------\n",
84 | "\t| | modules || artifacts |\n",
85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n",
86 | "\t---------------------------------------------------------------------\n",
87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n",
88 | "\t---------------------------------------------------------------------\n",
89 | ":: retrieving :: org.apache.spark#spark-submit-parent-5bd2f33a-cc11-4f99-9a05-9717c077d03e\n",
90 | "\tconfs: [default]\n",
91 | "\t0 artifacts copied, 15 already retrieved (0kB/28ms)\n",
92 | "23/08/02 11:33:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
94 | "Setting default log level to \"WARN\".\n",
95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
96 | "23/08/02 11:33:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n",
97 | "23/08/02 11:33:37 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n"
98 | ]
99 | }
100 | ],
101 | "source": [
102 | "#Start Spark Session\n",
103 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n",
104 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n",
105 | " .getOrCreate()\n",
106 | "sqlContext = SparkSession(spark)\n",
107 | "#Dont Show warning only error\n",
108 | "spark.sparkContext.setLogLevel(\"WARN\")"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 3,
114 | "id": "49bdd94c-a4a5-41ea-89c7-0e36c173cae1",
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "data": {
119 | "text/html": [
120 | "<div><b>SparkSession - in-memory</b></div>\n",
121 | "<div><b>SparkContext</b> (Spark UI)</div>\n",
122 | "<dl><dt>Version</dt><dd>v3.2.1</dd><dt>Master</dt><dd>local[*]</dd><dt>AppName</dt><dd>chapter12</dd></dl>"
141 | ],
142 | "text/plain": [
143 | ""
144 | ]
145 | },
146 | "execution_count": 3,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "spark"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 4,
158 | "id": "3fd21dc9-d9df-4819-98ce-5efd44bf31c7",
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n",
163 | "KAFKA_TOPIC = \"News_XYZ_Technology\""
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 7,
169 | "id": "540b5af6-09e5-4269-a9de-dc03f5bd2b28",
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "df = spark.readStream.format(\"kafka\") \\\n",
174 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n",
175 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n",
176 | " .option(\"startingOffsets\", \"earliest\") \\\n",
177 | " .load()"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "id": "787af49a-d12e-4b15-8277-e2b5aa894c81",
184 | "metadata": {},
185 | "outputs": [
186 | {
187 | "name": "stderr",
188 | "output_type": "stream",
189 | "text": [
190 | "23/08/02 11:41:05 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-a868c49c-a037-4b16-b53a-0f9948351d7b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
191 | "23/08/02 11:41:05 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
192 | " \r"
193 | ]
194 | }
195 | ],
196 | "source": [
197 | "def foreach_batch_function(df, epoch_id):\n",
198 | " df.write \\\n",
199 | " .format(\"jdbc\") \\\n",
200 | " .option(\"driver\",\"com.mysql.cj.jdbc.Driver\") \\\n",
201 | " .option(\"url\", \"jdbc:mysql://192.168.1.102:3306/DATAENG\") \\\n",
202 | " .option(\"dbtable\", \"StreamMessagesKafka\") \\\n",
203 | " .option(\"user\", \"root\") \\\n",
204 | " .option(\"password\", \"mysql\") \\\n",
205 | " .save()\n",
206 | " pass\n",
207 | "\n",
208 | "df.selectExpr(\"cast(value as string)\") \\\n",
209 | " .writeStream \\\n",
210 | " .outputMode(\"append\") \\\n",
211 | " .foreachBatch(foreach_batch_function)\\\n",
212 | " .start() \\\n",
213 | " .awaitTermination()\n"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "id": "7c687c45-4036-4eaa-a8bc-c4f996817047",
220 | "metadata": {},
221 | "outputs": [],
222 | "source": []
223 | }
224 | ],
225 | "metadata": {
226 | "kernelspec": {
227 | "display_name": "Python 3 (ipykernel)",
228 | "language": "python",
229 | "name": "python3"
230 | },
231 | "language_info": {
232 | "codemirror_mode": {
233 | "name": "ipython",
234 | "version": 3
235 | },
236 | "file_extension": ".py",
237 | "mimetype": "text/x-python",
238 | "name": "python",
239 | "nbconvert_exporter": "python",
240 | "pygments_lexer": "ipython3",
241 | "version": "3.8.13"
242 | }
243 | },
244 | "nbformat": 4,
245 | "nbformat_minor": 5
246 | }
247 |
--------------------------------------------------------------------------------
/Chapter3/chapter3-CovidData.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900",
6 | "metadata": {
7 | "tags": []
8 | },
9 | "source": [
10 | "\n",
11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n",
12 | "\n",
13 | "Task to do \n",
14 | "1. Install required spark libraries\n",
15 | "2. Create connection with Azure Blob storage\n",
16 | "3. Read data from blob and store into dataframe\n",
17 | "4. Transform data\n",
18 | "5. write data into parquet file \n",
19 | "6. write data into CSV file\n",
20 | "\n",
21 | "Reference:\n",
22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 4,
28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# First Load all the required library and also Start Spark Session\n",
33 | "# Load all the required library\n",
34 | "from pyspark.sql import SparkSession"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 5,
40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "name": "stderr",
45 | "output_type": "stream",
46 | "text": [
47 | "WARNING: An illegal reflective access operation has occurred\n",
48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
51 | "WARNING: All illegal access operations will be denied in a future release\n",
52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
53 | "Setting default log level to \"WARN\".\n",
54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
55 | "23/03/09 21:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
56 | "23/03/09 21:50:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n"
57 | ]
58 | }
59 | ],
60 | "source": [
61 | "#Start Spark Session\n",
62 | "spark = SparkSession.builder.appName(\"chapter3_1\").getOrCreate()\n",
63 | "sqlContext = SparkSession(spark)\n",
64 | "#Dont Show warning only error\n",
65 | "spark.sparkContext.setLogLevel(\"ERROR\")"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec",
71 | "metadata": {},
72 | "source": [
73 | "1. Create connection with Azure Blob storage"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 6,
79 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473",
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "# Azure storage access info\n",
84 | "blob_account_name = \"pandemicdatalake\"\n",
85 | "blob_container_name = \"public\"\n",
86 | "blob_relative_path = \"curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet\"\n",
87 | "blob_sas_token = r\"\""
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 7,
93 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85",
94 | "metadata": {},
95 | "outputs": [
96 | {
97 | "name": "stdout",
98 | "output_type": "stream",
99 | "text": [
100 | "Remote blob path: wasbs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet\n"
101 | ]
102 | }
103 | ],
104 | "source": [
105 | "\n",
106 | "# Allow SPARK to read from Blob remotely\n",
107 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n",
108 | "spark.conf.set(\n",
109 | " 'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),\n",
110 | " blob_sas_token)\n",
111 | "print('Remote blob path: ' + wasbs_path)"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443",
117 | "metadata": {},
118 | "source": [
119 | "3. Read data from blob and store into dataframe"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 8,
125 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc",
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "name": "stderr",
130 | "output_type": "stream",
131 | "text": [
132 | " \r"
133 | ]
134 | }
135 | ],
136 | "source": [
137 | "df = spark.read.parquet(wasbs_path)"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 6,
143 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728",
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "root\n",
151 | " |-- id: integer (nullable = true)\n",
152 | " |-- updated: date (nullable = true)\n",
153 | " |-- confirmed: integer (nullable = true)\n",
154 | " |-- confirmed_change: integer (nullable = true)\n",
155 | " |-- deaths: integer (nullable = true)\n",
156 | " |-- deaths_change: short (nullable = true)\n",
157 | " |-- recovered: integer (nullable = true)\n",
158 | " |-- recovered_change: integer (nullable = true)\n",
159 | " |-- latitude: double (nullable = true)\n",
160 | " |-- longitude: double (nullable = true)\n",
161 | " |-- iso2: string (nullable = true)\n",
162 | " |-- iso3: string (nullable = true)\n",
163 | " |-- country_region: string (nullable = true)\n",
164 | " |-- admin_region_1: string (nullable = true)\n",
165 | " |-- iso_subdivision: string (nullable = true)\n",
166 | " |-- admin_region_2: string (nullable = true)\n",
167 | " |-- load_time: timestamp (nullable = true)\n",
174 | "\n"
175 | ]
176 | }
177 | ],
178 | "source": [
179 | "df.printSchema()"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 9,
185 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375",
186 | "metadata": {},
187 | "outputs": [
188 | {
189 | "name": "stderr",
190 | "output_type": "stream",
191 | "text": [
192 | " \r"
193 | ]
194 | },
195 | {
196 | "name": "stdout",
197 | "output_type": "stream",
198 | "text": [
199 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n",
200 | "| id| updated|confirmed|confirmed_change|deaths|deaths_change|recovered|recovered_change|latitude|longitude|iso2|iso3|country_region|admin_region_1|iso_subdivision|admin_region_2| load_time|\n",
201 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n",
202 | "|338995|2020-01-21| 262| null| 0| null| null| null| null| null|null|null| Worldwide| null| null| null|2023-03-09 00:04:...|\n",
203 | "|338996|2020-01-22| 313| 51| 0| 0| null| null| null| null|null|null| Worldwide| null| null| null|2023-03-09 00:04:...|\n",
204 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n",
205 | "only showing top 2 rows\n",
206 | "\n"
207 | ]
208 | }
209 | ],
210 | "source": [
211 | "df.show(n=2)"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92",
217 | "metadata": {},
218 | "source": [
219 | "4. Transform data"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 12,
225 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874",
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "Register the DataFrame as a SQL temporary view: source\n"
233 | ]
234 | }
235 | ],
236 | "source": [
237 | "print('Register the DataFrame as a SQL temporary view: source')\n",
238 | "df.createOrReplaceTempView('tempSource')"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 13,
244 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae",
245 | "metadata": {},
246 | "outputs": [
247 | {
248 | "name": "stdout",
249 | "output_type": "stream",
250 | "text": [
251 | "Displaying top 10 rows: \n"
252 | ]
253 | },
254 | {
255 | "data": {
256 | "text/plain": [
257 | "DataFrame[id: int, updated: date, confirmed: int, confirmed_change: int, deaths: int, deaths_change: smallint, recovered: int, recovered_change: int, latitude: double, longitude: double, iso2: string, iso3: string, country_region: string, admin_region_1: string, iso_subdivision: string, admin_region_2: string, load_time: timestamp]"
258 | ]
259 | },
260 | "metadata": {},
261 | "output_type": "display_data"
262 | }
263 | ],
264 | "source": [
265 | "print('Displaying top 10 rows: ')\n",
266 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 14,
272 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0",
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac",
282 | "metadata": {},
283 | "source": [
284 | "5. write data into parquet file \n",
285 | "6. write data into JSON file"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 15,
291 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262",
292 | "metadata": {},
293 | "outputs": [
294 | {
295 | "name": "stderr",
296 | "output_type": "stream",
297 | "text": [
298 | " \r"
299 | ]
300 | }
301 | ],
302 | "source": [
303 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetdata\",mode='append')"
304 | ]
305 | },
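306 |   {
307 |    "cell_type": "markdown",
308 |    "id": "added-note-parquet-readback",
309 |    "metadata": {},
310 |    "source": [
311 |     "A quick sanity check (an editor's sketch, assuming the same relative `parquetdata` path used by the save above): read the freshly written Parquet back and confirm the rows survived the round trip."
312 |    ]
313 |   },
314 |   {
315 |    "cell_type": "code",
316 |    "execution_count": null,
317 |    "id": "added-code-parquet-readback",
318 |    "metadata": {},
319 |    "outputs": [],
320 |    "source": [
321 |     "# Read the written Parquet back to verify the round trip\n",
322 |     "# (path assumed to match the save() call above).\n",
323 |     "checkdf = spark.read.parquet(\"parquetdata\")\n",
324 |     "checkdf.show(n=5)"
325 |    ]
326 |   },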
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "id": "c417172f-ae9a-4331-b871-cb1599a27446",
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')"
314 | ]
315 | }
316 | ],
317 | "metadata": {
318 | "kernelspec": {
319 | "display_name": "Python 3 (ipykernel)",
320 | "language": "python",
321 | "name": "python3"
322 | },
323 | "language_info": {
324 | "codemirror_mode": {
325 | "name": "ipython",
326 | "version": 3
327 | },
328 | "file_extension": ".py",
329 | "mimetype": "text/x-python",
330 | "name": "python",
331 | "nbconvert_exporter": "python",
332 | "pygments_lexer": "ipython3",
333 | "version": "3.8.13"
334 | }
335 | },
336 | "nbformat": 4,
337 | "nbformat_minor": 5
338 | }
339 |
--------------------------------------------------------------------------------
/Chapter0/nyc_taxi_zone.csv:
--------------------------------------------------------------------------------
1 | "LocationID","Borough","Zone","service_zone"
2 | 1,"EWR","Newark Airport","EWR"
3 | 2,"Queens","Jamaica Bay","Boro Zone"
4 | 3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
5 | 4,"Manhattan","Alphabet City","Yellow Zone"
6 | 5,"Staten Island","Arden Heights","Boro Zone"
7 | 6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
8 | 7,"Queens","Astoria","Boro Zone"
9 | 8,"Queens","Astoria Park","Boro Zone"
10 | 9,"Queens","Auburndale","Boro Zone"
11 | 10,"Queens","Baisley Park","Boro Zone"
12 | 11,"Brooklyn","Bath Beach","Boro Zone"
13 | 12,"Manhattan","Battery Park","Yellow Zone"
14 | 13,"Manhattan","Battery Park City","Yellow Zone"
15 | 14,"Brooklyn","Bay Ridge","Boro Zone"
16 | 15,"Queens","Bay Terrace/Fort Totten","Boro Zone"
17 | 16,"Queens","Bayside","Boro Zone"
18 | 17,"Brooklyn","Bedford","Boro Zone"
19 | 18,"Bronx","Bedford Park","Boro Zone"
20 | 19,"Queens","Bellerose","Boro Zone"
21 | 20,"Bronx","Belmont","Boro Zone"
22 | 21,"Brooklyn","Bensonhurst East","Boro Zone"
23 | 22,"Brooklyn","Bensonhurst West","Boro Zone"
24 | 23,"Staten Island","Bloomfield/Emerson Hill","Boro Zone"
25 | 24,"Manhattan","Bloomingdale","Yellow Zone"
26 | 25,"Brooklyn","Boerum Hill","Boro Zone"
27 | 26,"Brooklyn","Borough Park","Boro Zone"
28 | 27,"Queens","Breezy Point/Fort Tilden/Riis Beach","Boro Zone"
29 | 28,"Queens","Briarwood/Jamaica Hills","Boro Zone"
30 | 29,"Brooklyn","Brighton Beach","Boro Zone"
31 | 30,"Queens","Broad Channel","Boro Zone"
32 | 31,"Bronx","Bronx Park","Boro Zone"
33 | 32,"Bronx","Bronxdale","Boro Zone"
34 | 33,"Brooklyn","Brooklyn Heights","Boro Zone"
35 | 34,"Brooklyn","Brooklyn Navy Yard","Boro Zone"
36 | 35,"Brooklyn","Brownsville","Boro Zone"
37 | 36,"Brooklyn","Bushwick North","Boro Zone"
38 | 37,"Brooklyn","Bushwick South","Boro Zone"
39 | 38,"Queens","Cambria Heights","Boro Zone"
40 | 39,"Brooklyn","Canarsie","Boro Zone"
41 | 40,"Brooklyn","Carroll Gardens","Boro Zone"
42 | 41,"Manhattan","Central Harlem","Boro Zone"
43 | 42,"Manhattan","Central Harlem North","Boro Zone"
44 | 43,"Manhattan","Central Park","Yellow Zone"
45 | 44,"Staten Island","Charleston/Tottenville","Boro Zone"
46 | 45,"Manhattan","Chinatown","Yellow Zone"
47 | 46,"Bronx","City Island","Boro Zone"
48 | 47,"Bronx","Claremont/Bathgate","Boro Zone"
49 | 48,"Manhattan","Clinton East","Yellow Zone"
50 | 49,"Brooklyn","Clinton Hill","Boro Zone"
51 | 50,"Manhattan","Clinton West","Yellow Zone"
52 | 51,"Bronx","Co-Op City","Boro Zone"
53 | 52,"Brooklyn","Cobble Hill","Boro Zone"
54 | 53,"Queens","College Point","Boro Zone"
55 | 54,"Brooklyn","Columbia Street","Boro Zone"
56 | 55,"Brooklyn","Coney Island","Boro Zone"
57 | 56,"Queens","Corona","Boro Zone"
58 | 57,"Queens","Corona","Boro Zone"
59 | 58,"Bronx","Country Club","Boro Zone"
60 | 59,"Bronx","Crotona Park","Boro Zone"
61 | 60,"Bronx","Crotona Park East","Boro Zone"
62 | 61,"Brooklyn","Crown Heights North","Boro Zone"
63 | 62,"Brooklyn","Crown Heights South","Boro Zone"
64 | 63,"Brooklyn","Cypress Hills","Boro Zone"
65 | 64,"Queens","Douglaston","Boro Zone"
66 | 65,"Brooklyn","Downtown Brooklyn/MetroTech","Boro Zone"
67 | 66,"Brooklyn","DUMBO/Vinegar Hill","Boro Zone"
68 | 67,"Brooklyn","Dyker Heights","Boro Zone"
69 | 68,"Manhattan","East Chelsea","Yellow Zone"
70 | 69,"Bronx","East Concourse/Concourse Village","Boro Zone"
71 | 70,"Queens","East Elmhurst","Boro Zone"
72 | 71,"Brooklyn","East Flatbush/Farragut","Boro Zone"
73 | 72,"Brooklyn","East Flatbush/Remsen Village","Boro Zone"
74 | 73,"Queens","East Flushing","Boro Zone"
75 | 74,"Manhattan","East Harlem North","Boro Zone"
76 | 75,"Manhattan","East Harlem South","Boro Zone"
77 | 76,"Brooklyn","East New York","Boro Zone"
78 | 77,"Brooklyn","East New York/Pennsylvania Avenue","Boro Zone"
79 | 78,"Bronx","East Tremont","Boro Zone"
80 | 79,"Manhattan","East Village","Yellow Zone"
81 | 80,"Brooklyn","East Williamsburg","Boro Zone"
82 | 81,"Bronx","Eastchester","Boro Zone"
83 | 82,"Queens","Elmhurst","Boro Zone"
84 | 83,"Queens","Elmhurst/Maspeth","Boro Zone"
85 | 84,"Staten Island","Eltingville/Annadale/Prince's Bay","Boro Zone"
86 | 85,"Brooklyn","Erasmus","Boro Zone"
87 | 86,"Queens","Far Rockaway","Boro Zone"
88 | 87,"Manhattan","Financial District North","Yellow Zone"
89 | 88,"Manhattan","Financial District South","Yellow Zone"
90 | 89,"Brooklyn","Flatbush/Ditmas Park","Boro Zone"
91 | 90,"Manhattan","Flatiron","Yellow Zone"
92 | 91,"Brooklyn","Flatlands","Boro Zone"
93 | 92,"Queens","Flushing","Boro Zone"
94 | 93,"Queens","Flushing Meadows-Corona Park","Boro Zone"
95 | 94,"Bronx","Fordham South","Boro Zone"
96 | 95,"Queens","Forest Hills","Boro Zone"
97 | 96,"Queens","Forest Park/Highland Park","Boro Zone"
98 | 97,"Brooklyn","Fort Greene","Boro Zone"
99 | 98,"Queens","Fresh Meadows","Boro Zone"
100 | 99,"Staten Island","Freshkills Park","Boro Zone"
101 | 100,"Manhattan","Garment District","Yellow Zone"
102 | 101,"Queens","Glen Oaks","Boro Zone"
103 | 102,"Queens","Glendale","Boro Zone"
104 | 103,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
105 | 104,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
106 | 105,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
107 | 106,"Brooklyn","Gowanus","Boro Zone"
108 | 107,"Manhattan","Gramercy","Yellow Zone"
109 | 108,"Brooklyn","Gravesend","Boro Zone"
110 | 109,"Staten Island","Great Kills","Boro Zone"
111 | 110,"Staten Island","Great Kills Park","Boro Zone"
112 | 111,"Brooklyn","Green-Wood Cemetery","Boro Zone"
113 | 112,"Brooklyn","Greenpoint","Boro Zone"
114 | 113,"Manhattan","Greenwich Village North","Yellow Zone"
115 | 114,"Manhattan","Greenwich Village South","Yellow Zone"
116 | 115,"Staten Island","Grymes Hill/Clifton","Boro Zone"
117 | 116,"Manhattan","Hamilton Heights","Boro Zone"
118 | 117,"Queens","Hammels/Arverne","Boro Zone"
119 | 118,"Staten Island","Heartland Village/Todt Hill","Boro Zone"
120 | 119,"Bronx","Highbridge","Boro Zone"
121 | 120,"Manhattan","Highbridge Park","Boro Zone"
122 | 121,"Queens","Hillcrest/Pomonok","Boro Zone"
123 | 122,"Queens","Hollis","Boro Zone"
124 | 123,"Brooklyn","Homecrest","Boro Zone"
125 | 124,"Queens","Howard Beach","Boro Zone"
126 | 125,"Manhattan","Hudson Sq","Yellow Zone"
127 | 126,"Bronx","Hunts Point","Boro Zone"
128 | 127,"Manhattan","Inwood","Boro Zone"
129 | 128,"Manhattan","Inwood Hill Park","Boro Zone"
130 | 129,"Queens","Jackson Heights","Boro Zone"
131 | 130,"Queens","Jamaica","Boro Zone"
132 | 131,"Queens","Jamaica Estates","Boro Zone"
133 | 132,"Queens","JFK Airport","Airports"
134 | 133,"Brooklyn","Kensington","Boro Zone"
135 | 134,"Queens","Kew Gardens","Boro Zone"
136 | 135,"Queens","Kew Gardens Hills","Boro Zone"
137 | 136,"Bronx","Kingsbridge Heights","Boro Zone"
138 | 137,"Manhattan","Kips Bay","Yellow Zone"
139 | 138,"Queens","LaGuardia Airport","Airports"
140 | 139,"Queens","Laurelton","Boro Zone"
141 | 140,"Manhattan","Lenox Hill East","Yellow Zone"
142 | 141,"Manhattan","Lenox Hill West","Yellow Zone"
143 | 142,"Manhattan","Lincoln Square East","Yellow Zone"
144 | 143,"Manhattan","Lincoln Square West","Yellow Zone"
145 | 144,"Manhattan","Little Italy/NoLiTa","Yellow Zone"
146 | 145,"Queens","Long Island City/Hunters Point","Boro Zone"
147 | 146,"Queens","Long Island City/Queens Plaza","Boro Zone"
148 | 147,"Bronx","Longwood","Boro Zone"
149 | 148,"Manhattan","Lower East Side","Yellow Zone"
150 | 149,"Brooklyn","Madison","Boro Zone"
151 | 150,"Brooklyn","Manhattan Beach","Boro Zone"
152 | 151,"Manhattan","Manhattan Valley","Yellow Zone"
153 | 152,"Manhattan","Manhattanville","Boro Zone"
154 | 153,"Manhattan","Marble Hill","Boro Zone"
155 | 154,"Brooklyn","Marine Park/Floyd Bennett Field","Boro Zone"
156 | 155,"Brooklyn","Marine Park/Mill Basin","Boro Zone"
157 | 156,"Staten Island","Mariners Harbor","Boro Zone"
158 | 157,"Queens","Maspeth","Boro Zone"
159 | 158,"Manhattan","Meatpacking/West Village West","Yellow Zone"
160 | 159,"Bronx","Melrose South","Boro Zone"
161 | 160,"Queens","Middle Village","Boro Zone"
162 | 161,"Manhattan","Midtown Center","Yellow Zone"
163 | 162,"Manhattan","Midtown East","Yellow Zone"
164 | 163,"Manhattan","Midtown North","Yellow Zone"
165 | 164,"Manhattan","Midtown South","Yellow Zone"
166 | 165,"Brooklyn","Midwood","Boro Zone"
167 | 166,"Manhattan","Morningside Heights","Boro Zone"
168 | 167,"Bronx","Morrisania/Melrose","Boro Zone"
169 | 168,"Bronx","Mott Haven/Port Morris","Boro Zone"
170 | 169,"Bronx","Mount Hope","Boro Zone"
171 | 170,"Manhattan","Murray Hill","Yellow Zone"
172 | 171,"Queens","Murray Hill-Queens","Boro Zone"
173 | 172,"Staten Island","New Dorp/Midland Beach","Boro Zone"
174 | 173,"Queens","North Corona","Boro Zone"
175 | 174,"Bronx","Norwood","Boro Zone"
176 | 175,"Queens","Oakland Gardens","Boro Zone"
177 | 176,"Staten Island","Oakwood","Boro Zone"
178 | 177,"Brooklyn","Ocean Hill","Boro Zone"
179 | 178,"Brooklyn","Ocean Parkway South","Boro Zone"
180 | 179,"Queens","Old Astoria","Boro Zone"
181 | 180,"Queens","Ozone Park","Boro Zone"
182 | 181,"Brooklyn","Park Slope","Boro Zone"
183 | 182,"Bronx","Parkchester","Boro Zone"
184 | 183,"Bronx","Pelham Bay","Boro Zone"
185 | 184,"Bronx","Pelham Bay Park","Boro Zone"
186 | 185,"Bronx","Pelham Parkway","Boro Zone"
187 | 186,"Manhattan","Penn Station/Madison Sq West","Yellow Zone"
188 | 187,"Staten Island","Port Richmond","Boro Zone"
189 | 188,"Brooklyn","Prospect-Lefferts Gardens","Boro Zone"
190 | 189,"Brooklyn","Prospect Heights","Boro Zone"
191 | 190,"Brooklyn","Prospect Park","Boro Zone"
192 | 191,"Queens","Queens Village","Boro Zone"
193 | 192,"Queens","Queensboro Hill","Boro Zone"
194 | 193,"Queens","Queensbridge/Ravenswood","Boro Zone"
195 | 194,"Manhattan","Randalls Island","Yellow Zone"
196 | 195,"Brooklyn","Red Hook","Boro Zone"
197 | 196,"Queens","Rego Park","Boro Zone"
198 | 197,"Queens","Richmond Hill","Boro Zone"
199 | 198,"Queens","Ridgewood","Boro Zone"
200 | 199,"Bronx","Rikers Island","Boro Zone"
201 | 200,"Bronx","Riverdale/North Riverdale/Fieldston","Boro Zone"
202 | 201,"Queens","Rockaway Park","Boro Zone"
203 | 202,"Manhattan","Roosevelt Island","Boro Zone"
204 | 203,"Queens","Rosedale","Boro Zone"
205 | 204,"Staten Island","Rossville/Woodrow","Boro Zone"
206 | 205,"Queens","Saint Albans","Boro Zone"
207 | 206,"Staten Island","Saint George/New Brighton","Boro Zone"
208 | 207,"Queens","Saint Michaels Cemetery/Woodside","Boro Zone"
209 | 208,"Bronx","Schuylerville/Edgewater Park","Boro Zone"
210 | 209,"Manhattan","Seaport","Yellow Zone"
211 | 210,"Brooklyn","Sheepshead Bay","Boro Zone"
212 | 211,"Manhattan","SoHo","Yellow Zone"
213 | 212,"Bronx","Soundview/Bruckner","Boro Zone"
214 | 213,"Bronx","Soundview/Castle Hill","Boro Zone"
215 | 214,"Staten Island","South Beach/Dongan Hills","Boro Zone"
216 | 215,"Queens","South Jamaica","Boro Zone"
217 | 216,"Queens","South Ozone Park","Boro Zone"
218 | 217,"Brooklyn","South Williamsburg","Boro Zone"
219 | 218,"Queens","Springfield Gardens North","Boro Zone"
220 | 219,"Queens","Springfield Gardens South","Boro Zone"
221 | 220,"Bronx","Spuyten Duyvil/Kingsbridge","Boro Zone"
222 | 221,"Staten Island","Stapleton","Boro Zone"
223 | 222,"Brooklyn","Starrett City","Boro Zone"
224 | 223,"Queens","Steinway","Boro Zone"
225 | 224,"Manhattan","Stuy Town/Peter Cooper Village","Yellow Zone"
226 | 225,"Brooklyn","Stuyvesant Heights","Boro Zone"
227 | 226,"Queens","Sunnyside","Boro Zone"
228 | 227,"Brooklyn","Sunset Park East","Boro Zone"
229 | 228,"Brooklyn","Sunset Park West","Boro Zone"
230 | 229,"Manhattan","Sutton Place/Turtle Bay North","Yellow Zone"
231 | 230,"Manhattan","Times Sq/Theatre District","Yellow Zone"
232 | 231,"Manhattan","TriBeCa/Civic Center","Yellow Zone"
233 | 232,"Manhattan","Two Bridges/Seward Park","Yellow Zone"
234 | 233,"Manhattan","UN/Turtle Bay South","Yellow Zone"
235 | 234,"Manhattan","Union Sq","Yellow Zone"
236 | 235,"Bronx","University Heights/Morris Heights","Boro Zone"
237 | 236,"Manhattan","Upper East Side North","Yellow Zone"
238 | 237,"Manhattan","Upper East Side South","Yellow Zone"
239 | 238,"Manhattan","Upper West Side North","Yellow Zone"
240 | 239,"Manhattan","Upper West Side South","Yellow Zone"
241 | 240,"Bronx","Van Cortlandt Park","Boro Zone"
242 | 241,"Bronx","Van Cortlandt Village","Boro Zone"
243 | 242,"Bronx","Van Nest/Morris Park","Boro Zone"
244 | 243,"Manhattan","Washington Heights North","Boro Zone"
245 | 244,"Manhattan","Washington Heights South","Boro Zone"
246 | 245,"Staten Island","West Brighton","Boro Zone"
247 | 246,"Manhattan","West Chelsea/Hudson Yards","Yellow Zone"
248 | 247,"Bronx","West Concourse","Boro Zone"
249 | 248,"Bronx","West Farms/Bronx River","Boro Zone"
250 | 249,"Manhattan","West Village","Yellow Zone"
251 | 250,"Bronx","Westchester Village/Unionport","Boro Zone"
252 | 251,"Staten Island","Westerleigh","Boro Zone"
253 | 252,"Queens","Whitestone","Boro Zone"
254 | 253,"Queens","Willets Point","Boro Zone"
255 | 254,"Bronx","Williamsbridge/Olinville","Boro Zone"
256 | 255,"Brooklyn","Williamsburg (North Side)","Boro Zone"
257 | 256,"Brooklyn","Williamsburg (South Side)","Boro Zone"
258 | 257,"Brooklyn","Windsor Terrace","Boro Zone"
259 | 258,"Queens","Woodhaven","Boro Zone"
260 | 259,"Bronx","Woodlawn/Wakefield","Boro Zone"
261 | 260,"Queens","Woodside","Boro Zone"
262 | 261,"Manhattan","World Trade Center","Yellow Zone"
263 | 262,"Manhattan","Yorkville East","Yellow Zone"
264 | 263,"Manhattan","Yorkville West","Yellow Zone"
265 | 264,"Unknown","NV","N/A"
266 | 265,"Unknown","NA","N/A"
267 |
--------------------------------------------------------------------------------
/Chapter3/chapter3_YellowCab.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900",
6 | "metadata": {
7 | "tags": []
8 | },
9 | "source": [
10 | "\n",
11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n",
12 | "\n",
13 | "Task to do \n",
14 | "1. Install required spark libraries\n",
15 | "2. Create connection with Azure Blob storage\n",
16 | "3. Read data from blob and store into dataframe\n",
17 | "4. Transform data\n",
18 | "5. write data into parquet file \n",
19 | "6. write data into JSON file\n",
20 | "\n",
21 | "Reference:\n",
22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 1,
28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# First Load all the required library and also Start Spark Session\n",
33 | "# Load all the required library\n",
34 | "from pyspark.sql import SparkSession"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "name": "stderr",
45 | "output_type": "stream",
46 | "text": [
47 | "WARNING: An illegal reflective access operation has occurred\n",
48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
51 | "WARNING: All illegal access operations will be denied in a future release\n",
52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
53 | "Setting default log level to \"WARN\".\n",
54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
55 | "23/03/09 21:17:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
56 | ]
57 | }
58 | ],
59 | "source": [
60 | "#Start Spark Session\n",
61 | "spark = SparkSession.builder.appName(\"chapter3\").getOrCreate()\n",
62 | "sqlContext = SparkSession(spark)\n",
63 | "#Dont Show warning only error\n",
64 | "spark.sparkContext.setLogLevel(\"ERROR\")"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec",
70 | "metadata": {},
71 | "source": [
72 | "1. Create connection with Azure Blob storage"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 4,
78 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473",
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "# Azure storage access info\n",
83 | "blob_account_name = \"azureopendatastorage\"\n",
84 | "blob_container_name = \"nyctlc\"\n",
85 | "blob_relative_path = \"yellow\"\n",
86 | "blob_sas_token = \"r\""
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 5,
92 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85",
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "Remote blob path: wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow\n"
100 | ]
101 | }
102 | ],
103 | "source": [
104 | "\n",
105 | "# Allow SPARK to read from Blob remotely\n",
106 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n",
107 | "spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),blob_sas_token)\n",
108 | "print('Remote blob path: ' + wasbs_path)"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443",
114 | "metadata": {},
115 | "source": [
116 | "3. Read data from blob and store into dataframe"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 6,
122 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc",
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "name": "stderr",
127 | "output_type": "stream",
128 | "text": [
129 | " \r"
130 | ]
131 | }
132 | ],
133 | "source": [
134 | "df = spark.read.parquet(wasbs_path)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 7,
140 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728",
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "name": "stdout",
145 | "output_type": "stream",
146 | "text": [
147 | "root\n",
148 | " |-- vendorID: string (nullable = true)\n",
149 | " |-- tpepPickupDateTime: timestamp (nullable = true)\n",
150 | " |-- tpepDropoffDateTime: timestamp (nullable = true)\n",
151 | " |-- passengerCount: integer (nullable = true)\n",
152 | " |-- tripDistance: double (nullable = true)\n",
153 | " |-- puLocationId: string (nullable = true)\n",
154 | " |-- doLocationId: string (nullable = true)\n",
155 | " |-- startLon: double (nullable = true)\n",
156 | " |-- startLat: double (nullable = true)\n",
157 | " |-- endLon: double (nullable = true)\n",
158 | " |-- endLat: double (nullable = true)\n",
159 | " |-- rateCodeId: integer (nullable = true)\n",
160 | " |-- storeAndFwdFlag: string (nullable = true)\n",
161 | " |-- paymentType: string (nullable = true)\n",
162 | " |-- fareAmount: double (nullable = true)\n",
163 | " |-- extra: double (nullable = true)\n",
164 | " |-- mtaTax: double (nullable = true)\n",
165 | " |-- improvementSurcharge: string (nullable = true)\n",
166 | " |-- tipAmount: double (nullable = true)\n",
167 | " |-- tollsAmount: double (nullable = true)\n",
168 | " |-- totalAmount: double (nullable = true)\n",
169 | " |-- puYear: integer (nullable = true)\n",
170 | " |-- puMonth: integer (nullable = true)\n",
171 | "\n"
172 | ]
173 | }
174 | ],
175 | "source": [
176 | "df.printSchema()"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 8,
182 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375",
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "name": "stderr",
187 | "output_type": "stream",
188 | "text": [
189 | "[Stage 2:> (0 + 1) / 1]\r"
190 | ]
191 | },
192 | {
193 | "name": "stdout",
194 | "output_type": "stream",
195 | "text": [
196 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n",
197 | "|vendorID| tpepPickupDateTime|tpepDropoffDateTime|passengerCount|tripDistance|puLocationId|doLocationId| startLon| startLat| endLon| endLat|rateCodeId|storeAndFwdFlag|paymentType|fareAmount|extra|mtaTax|improvementSurcharge|tipAmount|tollsAmount|totalAmount|puYear|puMonth|\n",
198 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n",
199 | "| CMT|2012-02-29 23:53:14|2012-03-01 00:00:43| 1| 2.1| null| null|-73.980494|40.730601|-73.983532|40.752311| 1| N| CSH| 7.3| 0.5| 0.5| null| 0.0| 0.0| 8.3| 2012| 3|\n",
200 | "| VTS|2012-03-17 08:01:00|2012-03-17 08:15:00| 1| 11.06| null| null|-73.986067|40.699862|-73.814838|40.737052| 1| null| CRD| 24.5| 0.0| 0.5| null| 4.9| 0.0| 29.9| 2012| 3|\n",
201 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n",
202 | "only showing top 2 rows\n",
203 | "\n"
204 | ]
205 | },
206 | {
207 | "name": "stderr",
208 | "output_type": "stream",
209 | "text": [
210 | " \r"
211 | ]
212 | }
213 | ],
214 | "source": [
215 | "df.show(n=2)"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92",
221 | "metadata": {},
222 | "source": [
223 | "4. Transform data"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": 9,
229 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874",
230 | "metadata": {},
231 | "outputs": [
232 | {
233 | "name": "stdout",
234 | "output_type": "stream",
235 | "text": [
236 | "Register the DataFrame as a SQL temporary view: source\n"
237 | ]
238 | }
239 | ],
240 | "source": [
241 | "print('Register the DataFrame as a SQL temporary view: source')\n",
242 | "df.createOrReplaceTempView('tempSource')"
243 | ]
244 | },
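245 |   {
246 |    "cell_type": "markdown",
247 |    "id": "added-note-taxi-sql-example",
248 |    "metadata": {},
249 |    "source": [
250 |     "With the view in place, the trip data can be summarised in plain SQL. An editor's sketch (assuming the `puYear` and `fareAmount` columns shown by `printSchema()` above):"
251 |    ]
252 |   },
253 |   {
254 |    "cell_type": "code",
255 |    "execution_count": null,
256 |    "id": "added-code-taxi-sql-example",
257 |    "metadata": {},
258 |    "outputs": [],
259 |    "source": [
260 |     "# Sketch: trip count and average fare per pickup year from the temp view\n",
261 |     "# (column names assumed from the schema printed earlier).\n",
262 |     "faredf = spark.sql(\"\"\"\n",
263 |     "    SELECT puYear,\n",
264 |     "           COUNT(*)        AS trips,\n",
265 |     "           AVG(fareAmount) AS avg_fare\n",
266 |     "    FROM tempSource\n",
267 |     "    GROUP BY puYear\n",
268 |     "    ORDER BY puYear\n",
269 |     "\"\"\")\n",
270 |     "faredf.show()"
271 |    ]
272 |   },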
245 | {
246 | "cell_type": "code",
247 | "execution_count": 10,
248 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae",
249 | "metadata": {},
250 | "outputs": [
251 | {
252 | "name": "stdout",
253 | "output_type": "stream",
254 | "text": [
255 | "Displaying top 10 rows: \n"
256 | ]
257 | },
258 | {
259 | "data": {
260 | "text/plain": [
261 | "DataFrame[vendorID: string, tpepPickupDateTime: timestamp, tpepDropoffDateTime: timestamp, passengerCount: int, tripDistance: double, puLocationId: string, doLocationId: string, startLon: double, startLat: double, endLon: double, endLat: double, rateCodeId: int, storeAndFwdFlag: string, paymentType: string, fareAmount: double, extra: double, mtaTax: double, improvementSurcharge: string, tipAmount: double, tollsAmount: double, totalAmount: double, puYear: int, puMonth: int]"
262 | ]
263 | },
264 | "metadata": {},
265 | "output_type": "display_data"
266 | }
267 | ],
268 | "source": [
269 | "print('Displaying top 10 rows: ')\n",
270 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0",
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac",
286 | "metadata": {},
287 | "source": [
288 | "5. write data into parquet file \n",
289 | "6. write data into JSON file"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262",
296 | "metadata": {},
297 | "outputs": [],
298 | "source": [
299 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetdata\",mode='append')"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "id": "c417172f-ae9a-4331-b871-cb1599a27446",
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "id": "defd421f-3e4e-427e-88cd-97ecf299ac5b",
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "newdf.show()"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "id": "58739de3-0f36-4d94-bf7e-926a0de4d1b0",
326 | "metadata": {},
327 | "outputs": [],
328 | "source": []
329 | }
330 | ],
331 | "metadata": {
332 | "kernelspec": {
333 | "display_name": "Python 3 (ipykernel)",
334 | "language": "python",
335 | "name": "python3"
336 | },
337 | "language_info": {
338 | "codemirror_mode": {
339 | "name": "ipython",
340 | "version": 3
341 | },
342 | "file_extension": ".py",
343 | "mimetype": "text/x-python",
344 | "name": "python",
345 | "nbconvert_exporter": "python",
346 | "pygments_lexer": "ipython3",
347 | "version": "3.8.13"
348 | }
349 | },
350 | "nbformat": 4,
351 | "nbformat_minor": 5
352 | }
353 |
--------------------------------------------------------------------------------
/Chapter3/chapter3_PublicHoliday.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900",
6 | "metadata": {
7 | "tags": []
8 | },
9 | "source": [
10 | "\n",
11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n",
12 | "\n",
13 | "Task to do \n",
14 | "1. Install required spark libraries\n",
15 | "2. Create connection with Azure Blob storage\n",
16 | "3. Read data from blob and store into dataframe\n",
17 | "4. Transform data\n",
18 | "5. write data into parquet file \n",
19 | "6. write data into JSON file\n",
20 | "\n",
21 | "Reference:\n",
22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 1,
28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# First Load all the required library and also Start Spark Session\n",
33 | "# Load all the required library\n",
34 | "from pyspark.sql import SparkSession"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "name": "stderr",
45 | "output_type": "stream",
46 | "text": [
47 | "WARNING: An illegal reflective access operation has occurred\n",
48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
51 | "WARNING: All illegal access operations will be denied in a future release\n",
52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
53 | "Setting default log level to \"WARN\".\n",
54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
55 | "23/03/09 21:58:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
56 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n",
57 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n",
58 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.\n"
59 | ]
60 | }
61 | ],
62 | "source": [
63 | "#Start Spark Session\n",
64 | "spark = SparkSession.builder.appName(\"chapter3\").getOrCreate()\n",
65 | "sqlContext = SparkSession(spark)\n",
66 | "#Dont Show warning only error\n",
67 | "spark.sparkContext.setLogLevel(\"ERROR\")"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec",
73 | "metadata": {},
74 | "source": [
75 | "1. Create connection with Azure Blob storage"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 3,
81 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473",
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# Azure storage for Holiday \n",
86 | "blob_account_name = \"azureopendatastorage\"\n",
87 | "blob_container_name = \"holidaydatacontainer\"\n",
88 | "blob_relative_path = \"Processed\"\n",
89 | "blob_sas_token = r\"\""
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85",
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "name": "stdout",
100 | "output_type": "stream",
101 | "text": [
102 | "Remote blob path: wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed\n"
103 | ]
104 | }
105 | ],
106 | "source": [
107 | "\n",
108 | "# Allow SPARK to read from Blob remotely\n",
109 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n",
110 | "spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),blob_sas_token)\n",
111 | "print('Remote blob path: ' + wasbs_path)"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443",
117 | "metadata": {},
118 | "source": [
119 | "3. Read data from blob and store into dataframe"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 5,
125 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc",
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "name": "stderr",
130 | "output_type": "stream",
131 | "text": [
132 | " \r"
133 | ]
134 | }
135 | ],
136 | "source": [
137 | "df = spark.read.parquet(wasbs_path)"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 6,
143 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728",
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "root\n",
151 | " |-- countryOrRegion: string (nullable = true)\n",
152 | " |-- holidayName: string (nullable = true)\n",
153 | " |-- normalizeHolidayName: string (nullable = true)\n",
154 | " |-- isPaidTimeOff: boolean (nullable = true)\n",
155 | " |-- countryRegionCode: string (nullable = true)\n",
156 | " |-- date: timestamp (nullable = true)\n",
157 | "\n"
158 | ]
159 | }
160 | ],
161 | "source": [
162 | "df.printSchema()"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 7,
168 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375",
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "name": "stderr",
173 | "output_type": "stream",
174 | "text": [
175 | "[Stage 1:> (0 + 1) / 1]\r"
176 | ]
177 | },
178 | {
179 | "name": "stdout",
180 | "output_type": "stream",
181 | "text": [
182 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
183 | "|countryOrRegion| holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode| date|\n",
184 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
185 | "| Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| AR|1970-01-01 00:00:00|\n",
186 | "| Australia| New Year's Day| New Year's Day| null| AU|1970-01-01 00:00:00|\n",
187 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
188 | "only showing top 2 rows\n",
189 | "\n"
190 | ]
191 | },
192 | {
193 | "name": "stderr",
194 | "output_type": "stream",
195 | "text": [
196 | " \r"
197 | ]
198 | }
199 | ],
200 | "source": [
201 | "df.show(n=2)"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92",
207 | "metadata": {},
208 | "source": [
209 | "4. Transform data"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 8,
215 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874",
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "name": "stdout",
220 | "output_type": "stream",
221 | "text": [
222 | "Register the DataFrame as a SQL temporary view: source\n"
223 | ]
224 | }
225 | ],
226 | "source": [
227 | "print('Register the DataFrame as a SQL temporary view: source')\n",
228 | "df.createOrReplaceTempView('tempSource')"
229 | ]
230 | },
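231 |   {
232 |    "cell_type": "markdown",
233 |    "id": "added-note-holiday-sql-example",
234 |    "metadata": {},
235 |    "source": [
236 |     "The view also makes ad-hoc filtering trivial. An editor's sketch (assuming the `countryRegionCode`, `holidayName` and `date` columns shown by `printSchema()` above):"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": null,
242 |    "id": "added-code-holiday-sql-example",
243 |    "metadata": {},
244 |    "outputs": [],
245 |    "source": [
246 |     "# Sketch: list the recorded holidays for one country code\n",
247 |     "# (column names assumed from the schema printed earlier).\n",
248 |     "usdf = spark.sql(\"\"\"\n",
249 |     "    SELECT holidayName, date\n",
250 |     "    FROM tempSource\n",
251 |     "    WHERE countryRegionCode = 'US'\n",
252 |     "    ORDER BY date\n",
253 |     "\"\"\")\n",
254 |     "usdf.show(n=5)"
255 |    ]
256 |   },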
231 | {
232 | "cell_type": "code",
233 | "execution_count": 9,
234 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae",
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "name": "stdout",
239 | "output_type": "stream",
240 | "text": [
241 | "Displaying top 10 rows: \n"
242 | ]
243 | },
244 | {
245 | "data": {
246 | "text/plain": [
247 | "DataFrame[countryOrRegion: string, holidayName: string, normalizeHolidayName: string, isPaidTimeOff: boolean, countryRegionCode: string, date: timestamp]"
248 | ]
249 | },
250 | "metadata": {},
251 | "output_type": "display_data"
252 | }
253 | ],
254 | "source": [
255 | "print('Displaying top 10 rows: ')\n",
256 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 10,
262 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0",
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac",
272 | "metadata": {},
273 | "source": [
274 | "5. write data into parquet file \n",
275 | "6. write data into JSON file"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 11,
281 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262",
282 | "metadata": {},
283 | "outputs": [
284 | {
285 | "name": "stderr",
286 | "output_type": "stream",
287 | "text": [
288 | " \r"
289 | ]
290 | }
291 | ],
292 | "source": [
293 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetholidaydata\",mode='append')"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 12,
299 | "id": "c417172f-ae9a-4331-b871-cb1599a27446",
300 | "metadata": {},
301 | "outputs": [
302 | {
303 | "name": "stderr",
304 | "output_type": "stream",
305 | "text": [
306 | " \r"
307 | ]
308 | }
309 | ],
310 | "source": [
311 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 13,
317 | "id": "defd421f-3e4e-427e-88cd-97ecf299ac5b",
318 | "metadata": {},
319 | "outputs": [
320 | {
321 | "name": "stderr",
322 | "output_type": "stream",
323 | "text": [
324 | "[Stage 8:> (0 + 1) / 1]\r"
325 | ]
326 | },
327 | {
328 | "name": "stdout",
329 | "output_type": "stream",
330 | "text": [
331 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
332 | "|countryOrRegion| holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode| date|\n",
333 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
334 | "| Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| AR|1970-01-01 00:00:00|\n",
335 | "| Australia| New Year's Day| New Year's Day| null| AU|1970-01-01 00:00:00|\n",
336 | "| Austria| Neujahr| Neujahr| null| AT|1970-01-01 00:00:00|\n",
337 | "| Belgium| Nieuwjaarsdag| Nieuwjaarsdag| null| BE|1970-01-01 00:00:00|\n",
338 | "| Brazil| Ano novo| Ano novo| null| BR|1970-01-01 00:00:00|\n",
339 | "| Canada| New Year's Day| New Year's Day| null| CA|1970-01-01 00:00:00|\n",
340 | "| Colombia|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| CO|1970-01-01 00:00:00|\n",
341 | "| Croatia| Nova Godina| Nova Godina| null| HR|1970-01-01 00:00:00|\n",
342 | "| Czech| Nový rok| Nový rok| null| CZ|1970-01-01 00:00:00|\n",
343 | "| Denmark| Nytårsdag| Nytårsdag| null| DK|1970-01-01 00:00:00|\n",
344 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n",
345 | "\n"
346 | ]
347 | },
348 | {
349 | "name": "stderr",
350 | "output_type": "stream",
351 | "text": [
352 | " \r"
353 | ]
354 | }
355 | ],
356 | "source": [
357 | "newdf.show()"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": 15,
363 | "id": "58739de3-0f36-4d94-bf7e-926a0de4d1b0",
364 | "metadata": {},
365 | "outputs": [
366 | {
367 | "name": "stderr",
368 | "output_type": "stream",
369 | "text": [
370 | " \r"
371 | ]
372 | },
373 | {
374 | "data": {
375 | "text/plain": [
376 | "69557"
377 | ]
378 | },
379 | "execution_count": 15,
380 | "metadata": {},
381 | "output_type": "execute_result"
382 | }
383 | ],
384 | "source": [
385 | "df.count()"
386 | ]
387 | }
388 | ],
389 | "metadata": {
390 | "kernelspec": {
391 | "display_name": "Python 3 (ipykernel)",
392 | "language": "python",
393 | "name": "python3"
394 | },
395 | "language_info": {
396 | "codemirror_mode": {
397 | "name": "ipython",
398 | "version": 3
399 | },
400 | "file_extension": ".py",
401 | "mimetype": "text/x-python",
402 | "name": "python",
403 | "nbconvert_exporter": "python",
404 | "pygments_lexer": "ipython3",
405 | "version": "3.8.13"
406 | }
407 | },
408 | "nbformat": 4,
409 | "nbformat_minor": 5
410 | }
411 |
--------------------------------------------------------------------------------
/Chapter12/Chapter12.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "b139ba1c-52b5-4bdd-96b7-309a78d30421",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# First Load all the required library and also Start Spark Session\n",
11 | "# Load all the required library\n",
12 | "from pyspark.sql import SparkSession"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "id": "2e81bfed-7faa-4615-91a0-0761bd2e827a",
19 | "metadata": {},
20 | "outputs": [
21 | {
22 | "name": "stderr",
23 | "output_type": "stream",
24 | "text": [
25 | "WARNING: An illegal reflective access operation has occurred\n",
26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
29 | "WARNING: All illegal access operations will be denied in a future release\n"
30 | ]
31 | },
32 | {
33 | "name": "stdout",
34 | "output_type": "stream",
35 | "text": [
36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
37 | ]
38 | },
39 | {
40 | "name": "stderr",
41 | "output_type": "stream",
42 | "text": [
43 | "Ivy Default Cache set to: /root/.ivy2/cache\n",
44 | "The jars for the packages stored in: /root/.ivy2/jars\n",
45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n",
46 | "mysql#mysql-connector-java added as a dependency\n",
47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-493c4a9c-0316-4f83-965c-fea5e02a6e4d;1.0\n",
48 | "\tconfs: [default]\n",
49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n",
50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n",
51 | "\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n",
52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n",
53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n",
54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n",
55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n",
56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n",
58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n",
59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n",
60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n",
61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n",
62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n",
63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n",
64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n",
65 | ":: resolution report :: resolve 1765ms :: artifacts dl 19ms\n",
66 | "\t:: modules in use:\n",
67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n",
68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from central in [default]\n",
69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n",
70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n",
71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n",
72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n",
73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n",
74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n",
75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n",
76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n",
77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n",
78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n",
79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n",
80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n",
81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n",
83 | "\t---------------------------------------------------------------------\n",
84 | "\t| | modules || artifacts |\n",
85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n",
86 | "\t---------------------------------------------------------------------\n",
87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n",
88 | "\t---------------------------------------------------------------------\n",
89 | ":: retrieving :: org.apache.spark#spark-submit-parent-493c4a9c-0316-4f83-965c-fea5e02a6e4d\n",
90 | "\tconfs: [default]\n",
91 | "\t0 artifacts copied, 15 already retrieved (0kB/16ms)\n",
92 | "23/08/01 11:21:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
94 | "Setting default log level to \"WARN\".\n",
95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
96 | ]
97 | }
98 | ],
99 | "source": [
100 | "#Start Spark Session\n",
101 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n",
102 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n",
103 | " .getOrCreate()\n",
104 | "sqlContext = SparkSession(spark)\n",
105 | "#Dont Show warning only error\n",
106 | "spark.sparkContext.setLogLevel(\"WARN\")"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 3,
112 | "id": "8f9e8602-6b59-4ef7-ad2a-e4139dddd454",
113 | "metadata": {},
114 | "outputs": [
115 | {
116 | "data": {
117 | "text/html": [
118 | "\n",
119 | " \n",
120 | "
SparkSession - in-memory
\n",
121 | " \n",
122 | "
\n",
123 | "
SparkContext
\n",
124 | "\n",
125 | "
Spark UI
\n",
126 | "\n",
127 | "
\n",
128 | " - Version
\n",
129 | " v3.2.1 \n",
130 | " - Master
\n",
131 | " local[*] \n",
132 | " - AppName
\n",
133 | " chapter12 \n",
134 | "
\n",
135 | "
\n",
136 | " \n",
137 | "
\n",
138 | " "
139 | ],
140 | "text/plain": [
141 | ""
142 | ]
143 | },
144 | "execution_count": 3,
145 | "metadata": {},
146 | "output_type": "execute_result"
147 | }
148 | ],
149 | "source": [
150 | "spark"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 4,
156 | "id": "38a809e9-d284-4ead-a441-a95fbe0fa9f0",
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n",
161 | "KAFKA_TOPIC = \"News_XYZ_Technology\""
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 5,
167 | "id": "adac4438-f229-4831-9ba0-fb987d232b07",
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "df = spark.readStream.format(\"kafka\") \\\n",
172 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n",
173 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n",
174 | " .option(\"startingOffsets\", \"earliest\") \\\n",
175 | " .load()"
176 | ]
177 | },
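178 |   {
179 |    "cell_type": "markdown",
180 |    "id": "added-note-kafka-schema",
181 |    "metadata": {},
182 |    "source": [
183 |     "A note worth making explicit (an editor's aside): the Kafka source always produces a fixed schema (`key` and `value` arrive as binary, plus `topic`, `partition`, `offset`, `timestamp`, `timestampType`), which is why `value` is cast to a string further down. A quick check:"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": null,
189 |    "id": "added-code-kafka-schema",
190 |    "metadata": {},
191 |    "outputs": [],
192 |    "source": [
193 |     "# The Kafka source schema is fixed by Spark; key and value are binary\n",
194 |     "# and must be cast before they are readable.\n",
195 |     "df.printSchema()"
196 |    ]
197 |   },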
178 | {
179 | "cell_type": "code",
180 | "execution_count": 7,
181 | "id": "e6adc8ee-7126-4247-8f77-8798a6f59edd",
182 | "metadata": {},
183 | "outputs": [
184 | {
185 | "name": "stderr",
186 | "output_type": "stream",
187 | "text": [
188 | "23/08/01 11:22:36 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-aa2ef7ab-e536-4b1b-aa4f-3f260d4e8f09. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
189 | "23/08/01 11:22:36 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n"
190 | ]
191 | },
192 | {
193 | "data": {
194 | "text/plain": [
195 | ""
196 | ]
197 | },
198 | "execution_count": 7,
199 | "metadata": {},
200 | "output_type": "execute_result"
201 | }
202 | ],
203 | "source": [
204 | "df \\\n",
205 | " .writeStream \\\n",
206 | " .format(\"console\") \\\n",
207 | " .outputMode(\"append\") \\\n",
208 | " .start() "
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": 8,
214 | "id": "70853b26-9f33-44b4-90df-af3152ea9a3c",
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "name": "stderr",
219 | "output_type": "stream",
220 | "text": [
221 | "23/08/01 11:22:43 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-18deab41-0527-4d6e-8238-9cd5ccf2bfdf. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
222 | "23/08/01 11:22:43 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n"
223 | ]
224 | },
225 | {
226 | "data": {
227 | "text/plain": [
228 | ""
229 | ]
230 | },
231 | "execution_count": 8,
232 | "metadata": {},
233 | "output_type": "execute_result"
234 | },
235 | {
236 | "name": "stderr",
237 | "output_type": "stream",
238 | "text": [
239 | " \r"
240 | ]
241 | },
242 | {
243 | "name": "stdout",
244 | "output_type": "stream",
245 | "text": [
246 | "-------------------------------------------\n",
247 | "Batch: 0\n",
248 | "-------------------------------------------\n",
249 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
250 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
251 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
252 | "|null| [53 70 61 72 6B 20]|News_XYZ_Technology| 0| 0|2023-07-30 01:20:...| 0|\n",
253 | "|null|[4C 61 6B 65 68 6...|News_XYZ_Technology| 0| 1|2023-07-30 01:20:...| 0|\n",
254 | "|null|[44 65 6C 74 61 2...|News_XYZ_Technology| 0| 2|2023-07-30 01:20:...| 0|\n",
255 | "|null| [4F 70 65 6E 41 49]|News_XYZ_Technology| 0| 3|2023-07-30 01:20:...| 0|\n",
256 | "|null|[44 6F 6C 6C 79 2...|News_XYZ_Technology| 0| 4|2023-07-30 01:21:...| 0|\n",
257 | "|null|[44 61 74 61 62 7...|News_XYZ_Technology| 0| 5|2023-07-30 01:21:...| 0|\n",
258 | "|null|[4D 69 63 72 6F 7...|News_XYZ_Technology| 0| 6|2023-07-30 01:21:...| 0|\n",
259 | "|null|[41 7A 75 72 65 2...|News_XYZ_Technology| 0| 7|2023-07-30 01:21:...| 0|\n",
260 | "|null|[41 57 53 20 4B 6...|News_XYZ_Technology| 0| 8|2023-07-30 01:21:...| 0|\n",
261 | "|null|[44 65 6C 74 61 2...|News_XYZ_Technology| 0| 9|2023-07-30 03:23:...| 0|\n",
262 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 10|2023-07-30 04:13:...| 0|\n",
263 | "|null| [41 70 61 63 68 65]|News_XYZ_Technology| 0| 11|2023-07-30 04:14:...| 0|\n",
264 | "|null|[69 63 65 62 65 7...|News_XYZ_Technology| 0| 12|2023-07-30 04:15:...| 0|\n",
265 | "|null| [68 75 64 69]|News_XYZ_Technology| 0| 13|2023-07-30 04:15:...| 0|\n",
266 | "|null| [64 65 6C 74 61]|News_XYZ_Technology| 0| 14|2023-07-30 04:15:...| 0|\n",
267 | "|null| [66 61 62 72 69 63]|News_XYZ_Technology| 0| 15|2023-07-30 04:15:...| 0|\n",
268 | "|null| [64 65 6C 74 61]|News_XYZ_Technology| 0| 16|2023-07-30 04:15:...| 0|\n",
269 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 17|2023-07-30 05:16:...| 0|\n",
270 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 18|2023-07-30 05:16:...| 0|\n",
271 | "|null|[41 7A 75 72 65 2...|News_XYZ_Technology| 0| 19|2023-07-30 05:40:...| 0|\n",
272 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
273 | "only showing top 20 rows\n",
274 | "\n"
275 | ]
276 | },
277 | {
278 | "name": "stderr",
279 | "output_type": "stream",
280 | "text": [
281 | " \r"
282 | ]
283 | },
284 | {
285 | "name": "stdout",
286 | "output_type": "stream",
287 | "text": [
288 | "-------------------------------------------\n",
289 | "Batch: 0\n",
290 | "-------------------------------------------\n",
291 | "+-----------------+\n",
292 | "| value|\n",
293 | "+-----------------+\n",
294 | "| Spark |\n",
295 | "| Lakehouse|\n",
296 | "| Delta Lake|\n",
297 | "| OpenAI|\n",
298 | "| Dolly Model|\n",
299 | "|Databrics Events |\n",
300 | "| Microsoft Fabric|\n",
301 | "| Azure DW|\n",
302 | "| AWS Kinesis|\n",
303 | "| Delta Lake|\n",
304 | "| Apache HUdi|\n",
305 | "| Apache|\n",
306 | "| iceberg|\n",
307 | "| hudi|\n",
308 | "| delta|\n",
309 | "| fabric|\n",
310 | "| delta|\n",
311 | "| Apache Nifi|\n",
312 | "| Apache Beam|\n",
313 | "| Azure EventHub|\n",
314 | "+-----------------+\n",
315 | "only showing top 20 rows\n",
316 | "\n"
317 | ]
318 | },
319 | {
320 | "name": "stderr",
321 | "output_type": "stream",
322 | "text": [
323 | " \r"
324 | ]
325 | },
326 | {
327 | "name": "stdout",
328 | "output_type": "stream",
329 | "text": [
330 | "-------------------------------------------\n",
331 | "Batch: 1\n",
332 | "-------------------------------------------\n",
333 | "-------------------------------------------\n",
334 | "Batch: 1\n",
335 | "-------------------------------------------\n",
336 | "+----------+\n",
337 | "| value|\n",
338 | "+----------+\n",
339 | "|SQL Server|\n",
340 | "+----------+\n",
341 | "\n",
342 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
343 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
344 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
345 | "|null|[53 51 4C 20 53 6...|News_XYZ_Technology| 0| 21|2023-08-01 11:25:...| 0|\n",
346 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
347 | "\n"
348 | ]
349 | },
350 | {
351 | "name": "stderr",
352 | "output_type": "stream",
353 | "text": [
354 | " \r"
355 | ]
356 | },
357 | {
358 | "name": "stdout",
359 | "output_type": "stream",
360 | "text": [
361 | "-------------------------------------------\n",
362 | "Batch: 2\n",
363 | "-------------------------------------------\n",
364 | "-------------------------------------------\n",
365 | "Batch: 2\n",
366 | "-------------------------------------------\n",
367 | "+-----+\n",
368 | "|value|\n",
369 | "+-----+\n",
370 | "| HIVE|\n",
371 | "+-----+\n",
372 | "\n",
373 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n",
374 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
375 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n",
376 | "|null|[48 49 56 45]|News_XYZ_Technology| 0| 22|2023-08-01 11:25:...| 0|\n",
377 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n",
378 | "\n",
379 | "-------------------------------------------\n",
380 | "Batch: 3\n",
381 | "-------------------------------------------\n",
382 | "-------------------------------------------\n",
383 | "Batch: 3\n",
384 | "-------------------------------------------\n",
385 | "+-------+\n",
386 | "| value|\n",
387 | "+-------+\n",
388 | "|MongoDB|\n",
389 | "+-------+\n",
390 | "\n",
391 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
392 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
393 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
394 | "|null|[4D 6F 6E 67 6F 4...|News_XYZ_Technology| 0| 23|2023-08-01 11:26:...| 0|\n",
395 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
396 | "\n",
397 | "-------------------------------------------\n",
398 | "Batch: 4\n",
399 | "-------------------------------------------\n",
400 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
401 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
402 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
403 | "|null|[4D 79 53 51 4C]|News_XYZ_Technology| 0| 24|2023-08-01 11:26:...| 0|\n",
404 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
405 | "\n",
406 | "-------------------------------------------\n",
407 | "Batch: 4\n",
408 | "-------------------------------------------\n",
409 | "+-----+\n",
410 | "|value|\n",
411 | "+-----+\n",
412 | "|MySQL|\n",
413 | "+-----+\n",
414 | "\n"
415 | ]
416 | }
417 | ],
418 | "source": [
419 | "df.selectExpr(\"cast(value as string)\") \\\n",
420 | " .writeStream \\\n",
421 | " .format(\"console\") \\\n",
422 | " .outputMode(\"append\") \\\n",
423 | " .start() "
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "id": "a1deba01-d2f9-4519-a9de-1b895b9fbbf8",
430 | "metadata": {},
431 | "outputs": [
432 | {
433 | "name": "stderr",
434 | "output_type": "stream",
435 | "text": [
436 | "23/08/01 11:32:38 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-8a7db8ca-72f7-416c-89f3-8a5f3a93fce0. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
437 | "23/08/01 11:32:39 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
438 | " \r"
439 | ]
440 | },
441 | {
442 | "name": "stdout",
443 | "output_type": "stream",
444 | "text": [
445 | "-------------------------------------------\n",
446 | "Batch: 0\n",
447 | "-------------------------------------------\n",
448 | "+-----------------+\n",
449 | "| value|\n",
450 | "+-----------------+\n",
451 | "| Spark |\n",
452 | "| Lakehouse|\n",
453 | "| Delta Lake|\n",
454 | "| OpenAI|\n",
455 | "| Dolly Model|\n",
456 | "|Databrics Events |\n",
457 | "| Microsoft Fabric|\n",
458 | "| Azure DW|\n",
459 | "| AWS Kinesis|\n",
460 | "| Delta Lake|\n",
461 | "| Apache HUdi|\n",
462 | "| Apache|\n",
463 | "| iceberg|\n",
464 | "| hudi|\n",
465 | "| delta|\n",
466 | "| fabric|\n",
467 | "| delta|\n",
468 | "| Apache Nifi|\n",
469 | "| Apache Beam|\n",
470 | "| Azure EventHub|\n",
471 | "+-----------------+\n",
472 | "only showing top 20 rows\n",
473 | "\n"
474 | ]
475 | },
476 | {
477 | "name": "stderr",
478 | "output_type": "stream",
479 | "text": [
480 | " \r"
481 | ]
482 | },
483 | {
484 | "name": "stdout",
485 | "output_type": "stream",
486 | "text": [
487 | "-------------------------------------------\n",
488 | "Batch: 1\n",
489 | "-------------------------------------------\n",
490 | "+----------+\n",
491 | "| value|\n",
492 | "+----------+\n",
493 | "|PostgreSQL|\n",
494 | "+----------+\n",
495 | "\n",
496 | "-------------------------------------------\n",
497 | "Batch: 5\n",
498 | "-------------------------------------------\n",
499 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
500 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
501 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
502 | "|null|[50 6F 73 74 67 7...|News_XYZ_Technology| 0| 25|2023-08-01 11:33:...| 0|\n",
503 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
504 | "\n",
505 | "-------------------------------------------\n",
506 | "Batch: 5\n",
507 | "-------------------------------------------\n",
508 | "+----------+\n",
509 | "| value|\n",
510 | "+----------+\n",
511 | "|PostgreSQL|\n",
512 | "+----------+\n",
513 | "\n"
514 | ]
515 | },
516 | {
517 | "name": "stderr",
518 | "output_type": "stream",
519 | "text": [
520 | " \r"
521 | ]
522 | },
523 | {
524 | "name": "stdout",
525 | "output_type": "stream",
526 | "text": [
527 | "-------------------------------------------\n",
528 | "Batch: 2\n",
529 | "-------------------------------------------\n",
530 | "-------------------------------------------\n",
531 | "Batch: 6\n",
532 | "-------------------------------------------\n",
533 | "+--------+\n",
534 | "| value|\n",
535 | "+--------+\n",
536 | "|CosmosDB|\n",
537 | "+--------+\n",
538 | "\n",
539 | "+--------+\n",
540 | "| value|\n",
541 | "+--------+\n",
542 | "|CosmosDB|\n",
543 | "+--------+\n",
544 | "\n"
545 | ]
546 | },
547 | {
548 | "name": "stderr",
549 | "output_type": "stream",
550 | "text": [
551 | " \r"
552 | ]
553 | },
554 | {
555 | "name": "stdout",
556 | "output_type": "stream",
557 | "text": [
558 | "-------------------------------------------\n",
559 | "Batch: 6\n",
560 | "-------------------------------------------\n",
561 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
562 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
563 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
564 | "|null|[43 6F 73 6D 6F 7...|News_XYZ_Technology| 0| 26|2023-08-01 11:33:...| 0|\n",
565 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n",
566 | "\n"
567 | ]
568 | },
569 | {
570 | "name": "stderr",
571 | "output_type": "stream",
572 | "text": [
573 | " \r"
574 | ]
575 | },
576 | {
577 | "name": "stdout",
578 | "output_type": "stream",
579 | "text": [
580 | "-------------------------------------------\n",
581 | "Batch: 3\n",
582 | "-------------------------------------------\n",
583 | "+-----+\n",
584 | "|value|\n",
585 | "+-----+\n",
586 | "|Redis|\n",
587 | "+-----+\n",
588 | "\n",
589 | "-------------------------------------------\n",
590 | "Batch: 7\n",
591 | "-------------------------------------------\n",
592 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
593 | "| key| value| topic|partition|offset| timestamp|timestampType|\n",
594 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
595 | "|null|[52 65 64 69 73]|News_XYZ_Technology| 0| 27|2023-08-01 11:33:...| 0|\n",
596 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n",
597 | "\n",
598 | "-------------------------------------------\n",
599 | "Batch: 7\n",
600 | "-------------------------------------------\n",
601 | "+-----+\n",
602 | "|value|\n",
603 | "+-----+\n",
604 | "|Redis|\n",
605 | "+-----+\n",
606 | "\n"
607 | ]
608 | },
609 | {
610 | "name": "stderr",
611 | "output_type": "stream",
612 | "text": [
613 | "23/08/01 12:08:18 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 401471 ms exceeds timeout 120000 ms\n",
614 | "23/08/01 12:08:19 WARN SparkContext: Killing executors is not supported by current scheduler.\n",
615 | "23/08/01 16:55:17 ERROR AbstractCoordinator: [Consumer clientId=consumer-spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0-2, groupId=spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0] LeaveGroup request with Generation{generationId=1, memberId='consumer-spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0-2-f645cdfa-e701-4c50-9946-8d36b34dc4eb', protocol='range'} failed with error: This is not the correct coordinator.\n",
616 | "23/08/01 16:55:17 ERROR AbstractCoordinator: [Consumer clientId=consumer-spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0-5, groupId=spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0] LeaveGroup request with Generation{generationId=1, memberId='consumer-spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0-5-30076fd4-0723-4731-aa21-5c71f1971e18', protocol='range'} failed with error: This is not the correct coordinator.\n"
617 | ]
618 | }
619 | ],
620 | "source": [
621 | "df.selectExpr(\"cast(value as string)\") \\\n",
622 | " .writeStream \\\n",
623 | " .format(\"console\") \\\n",
624 | " .outputMode(\"append\") \\\n",
625 | " .start().awaitTermination()"
626 | ]
627 | }
628 | ],
629 | "metadata": {
630 | "kernelspec": {
631 | "display_name": "Python 3 (ipykernel)",
632 | "language": "python",
633 | "name": "python3"
634 | },
635 | "language_info": {
636 | "codemirror_mode": {
637 | "name": "ipython",
638 | "version": 3
639 | },
640 | "file_extension": ".py",
641 | "mimetype": "text/x-python",
642 | "name": "python",
643 | "nbconvert_exporter": "python",
644 | "pygments_lexer": "ipython3",
645 | "version": "3.8.13"
646 | }
647 | },
648 | "nbformat": 4,
649 | "nbformat_minor": 5
650 | }
651 |
--------------------------------------------------------------------------------
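Note on the streaming cells above: the interleaved, duplicated "Batch: N" headers appear because two console-sink queries run concurrently against the same Kafka topic (one printing the decoded value, the other the raw Kafka columns). A minimal, self-contained sketch of that read-and-print pattern follows; the broker address and connector package are assumptions, while the topic name News_XYZ_Technology is taken from the batch output above.

    # A minimal sketch, not the notebook's exact setup.
    # Assumptions: a local broker at localhost:9092 and the Spark 3.2.1
    # Kafka connector; the topic name comes from the output printed above.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-console-sketch")
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1")
             .getOrCreate())

    # Kafka delivers key and value as binary, hence the cast before printing.
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "News_XYZ_Technology")
          .load())

    query = (df.selectExpr("cast(value as string)")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())

    query.awaitTermination()  # block so each micro-batch prints as it arrives
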
/Chapter5/chapter5.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900",
6 | "metadata": {
7 | "tags": []
8 | },
9 | "source": [
10 | "\n",
11 | "# Chapter 5 -> Spark ETL with Hive tables\n",
12 | "\n",
13 | "Task to do \n",
14 | "1. Read data from one of the source (We take source as our MongoDB collection)\n",
15 | "2. Create dataframe from source \n",
16 | "3. Create Hive table from dataframe\n",
17 | "4. Create temp Hive view from dataframe\n",
18 | "5. Create global Hive view from dataframe\n",
19 | "6. List database and tables in database\n",
20 | "7. Drop all the created tables and views in default database\n",
21 | "8. Create Dataeng database and create global and temp view using SQL \n",
22 | "9. Access global table from other session\n"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 1,
28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# First Load all the required library and also Start Spark Session\n",
33 | "# Load all the required library\n",
34 | "from pyspark.sql import SparkSession"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "name": "stderr",
45 | "output_type": "stream",
46 | "text": [
47 | "WARNING: An illegal reflective access operation has occurred\n",
48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
51 | "WARNING: All illegal access operations will be denied in a future release\n"
52 | ]
53 | },
54 | {
55 | "name": "stdout",
56 | "output_type": "stream",
57 | "text": [
58 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
59 | ]
60 | },
61 | {
62 | "name": "stderr",
63 | "output_type": "stream",
64 | "text": [
65 | "Ivy Default Cache set to: /root/.ivy2/cache\n",
66 | "The jars for the packages stored in: /root/.ivy2/jars\n",
67 | "org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency\n",
68 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-f2ed49bd-7bd8-4327-8579-ed71257d3bbb;1.0\n",
69 | "\tconfs: [default]\n",
70 | "\tfound org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 in central\n",
71 | "\tfound org.mongodb#mongodb-driver-sync;4.0.5 in central\n",
72 | "\tfound org.mongodb#bson;4.0.5 in central\n",
73 | "\tfound org.mongodb#mongodb-driver-core;4.0.5 in central\n",
74 | ":: resolution report :: resolve 830ms :: artifacts dl 128ms\n",
75 | "\t:: modules in use:\n",
76 | "\torg.mongodb#bson;4.0.5 from central in [default]\n",
77 | "\torg.mongodb#mongodb-driver-core;4.0.5 from central in [default]\n",
78 | "\torg.mongodb#mongodb-driver-sync;4.0.5 from central in [default]\n",
79 | "\torg.mongodb.spark#mongo-spark-connector_2.12;3.0.1 from central in [default]\n",
80 | "\t---------------------------------------------------------------------\n",
81 | "\t| | modules || artifacts |\n",
82 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n",
83 | "\t---------------------------------------------------------------------\n",
84 | "\t| default | 4 | 0 | 0 | 0 || 4 | 0 |\n",
85 | "\t---------------------------------------------------------------------\n",
86 | ":: retrieving :: org.apache.spark#spark-submit-parent-f2ed49bd-7bd8-4327-8579-ed71257d3bbb\n",
87 | "\tconfs: [default]\n",
88 | "\t0 artifacts copied, 4 already retrieved (0kB/30ms)\n",
89 | "23/03/15 07:51:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
90 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
91 | "Setting default log level to \"WARN\".\n",
92 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
93 | ]
94 | }
95 | ],
96 | "source": [
97 | "#Start Spark Session\n",
98 | "spark = SparkSession.builder.appName(\"chapter5\")\\\n",
99 | " .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1')\\\n",
100 | " .getOrCreate()\n",
101 | "sqlContext = SparkSession(spark)\n",
102 | "#Dont Show warning only error\n",
103 | "spark.sparkContext.setLogLevel(\"ERROR\")"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "id": "6d2a3b43-6035-4104-94fb-d895cf2a524a",
109 | "metadata": {},
110 | "source": [
111 | "1. Read data from one of the source (We take source as our MongoDB collection)\n",
112 | "2. Create dataframe from source "
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 61,
118 | "id": "42da13f2-c156-40e2-8ade-a55ff9d09f88",
119 | "metadata": {},
120 | "outputs": [
121 | {
122 | "data": {
123 | "text/html": [
124 | "\n",
125 | " \n",
126 | "
SparkSession - in-memory
\n",
127 | " \n",
128 | "
\n",
129 | "
SparkContext
\n",
130 | "\n",
131 | "
Spark UI
\n",
132 | "\n",
133 | "
\n",
134 | " - Version
\n",
135 | " v3.2.1 \n",
136 | " - Master
\n",
137 | " local[*] \n",
138 | " - AppName
\n",
139 | " chapter5 \n",
140 | "
\n",
141 | "
\n",
142 | " \n",
143 | "
\n",
144 | " "
145 | ],
146 | "text/plain": [
147 | ""
148 | ]
149 | },
150 | "execution_count": 61,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "spark"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 3,
162 | "id": "1682daf8-0610-44bc-bd5e-9ce5f3814e20",
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "name": "stderr",
167 | "output_type": "stream",
168 | "text": [
169 | " \r"
170 | ]
171 | }
172 | ],
173 | "source": [
174 | "mongodf = spark.read.format(\"mongo\") \\\n",
175 | " .option(\"uri\", \"mongodb://root:mongodb@192.168.1.104:27017/\") \\\n",
176 | " .option(\"database\", \"dataengineering\") \\\n",
177 | " .option(\"collection\", \"employee\") \\\n",
178 | " .load()"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 4,
184 | "id": "b3944d5d-0420-4b96-844d-b395886a13f4",
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "name": "stdout",
189 | "output_type": "stream",
190 | "text": [
191 | "root\n",
192 | " |-- _id: struct (nullable = true)\n",
193 | " | |-- oid: string (nullable = true)\n",
194 | " |-- department_id: string (nullable = true)\n",
195 | " |-- first_name: string (nullable = true)\n",
196 | " |-- id: string (nullable = true)\n",
197 | " |-- last_name: string (nullable = true)\n",
198 | " |-- salary: string (nullable = true)\n",
199 | "\n"
200 | ]
201 | }
202 | ],
203 | "source": [
204 | "mongodf.printSchema()"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 5,
210 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375",
211 | "metadata": {},
212 | "outputs": [
213 | {
214 | "name": "stdout",
215 | "output_type": "stream",
216 | "text": [
217 | "+--------------------+-------------+----------+---+---------+------+\n",
218 | "| _id|department_id|first_name| id|last_name|salary|\n",
219 | "+--------------------+-------------+----------+---+---------+------+\n",
220 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n",
221 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n",
222 | "+--------------------+-------------+----------+---+---------+------+\n",
223 | "only showing top 2 rows\n",
224 | "\n"
225 | ]
226 | }
227 | ],
228 | "source": [
229 | "mongodf.show(n=2)"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "id": "d0fd01eb-361e-4a2a-a1b1-7b880344085d",
235 | "metadata": {},
236 | "source": [
237 | "3. Create Hive table from dataframe"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 7,
243 | "id": "2447464e-c348-4628-b9fe-cb1bc0c7562c",
244 | "metadata": {},
245 | "outputs": [
246 | {
247 | "name": "stderr",
248 | "output_type": "stream",
249 | "text": [
250 | " \r"
251 | ]
252 | }
253 | ],
254 | "source": [
255 | "mongodf.write.saveAsTable(\"hivesampletable\")"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "id": "c68f8a0c-c720-49c2-afaa-0fc22c4cc633",
261 | "metadata": {},
262 | "source": [
263 | "4. Create temp Hive view from dataframe\n",
264 | "5. Create global Hive view from dataframe\n",
265 | "\n",
266 | "The difference between temporary and global temporary views being subtle, it can be a source of mild confusion among developers new to Spark. A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application. Yes, you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations."
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 8,
272 | "id": "a1410b71-4c59-46e7-bf79-e2d645921aff",
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "mongodf.createOrReplaceGlobalTempView(\"sampleglobalview\")\n",
277 | "mongodf.createOrReplaceTempView(\"sampletempview\")"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "id": "b567d95f-dcf1-4083-875e-670b941b8cbb",
283 | "metadata": {},
284 | "source": [
285 | "6. List database and tables in database"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 9,
291 | "id": "9f74008c-d30e-4c77-9b20-e0895ed013e9",
292 | "metadata": {},
293 | "outputs": [
294 | {
295 | "name": "stdout",
296 | "output_type": "stream",
297 | "text": [
298 | "+---------+\n",
299 | "|namespace|\n",
300 | "+---------+\n",
301 | "| default|\n",
302 | "+---------+\n",
303 | "\n"
304 | ]
305 | }
306 | ],
307 | "source": [
308 | "spark.sql(\"show databases\").show()"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": 10,
314 | "id": "42bf3cbd-fcbc-406b-bc24-7df63c038bb4",
315 | "metadata": {},
316 | "outputs": [
317 | {
318 | "name": "stdout",
319 | "output_type": "stream",
320 | "text": [
321 | "+---------+---------------+-----------+\n",
322 | "|namespace| tableName|isTemporary|\n",
323 | "+---------+---------------+-----------+\n",
324 | "| default|hivesampletable| false|\n",
325 | "| | sampletempview| true|\n",
326 | "+---------+---------------+-----------+\n",
327 | "\n"
328 | ]
329 | }
330 | ],
331 | "source": [
332 | "spark.sql(\"show tables\").show()"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 11,
338 | "id": "9f9788ab-4fe1-4964-a047-6a88ee7c0658",
339 | "metadata": {},
340 | "outputs": [
341 | {
342 | "data": {
343 | "text/plain": [
344 | "[Database(name='default', description='default database', locationUri='file:/opt/spark/SparkETL/Chapter5/spark-warehouse')]"
345 | ]
346 | },
347 | "execution_count": 11,
348 | "metadata": {},
349 | "output_type": "execute_result"
350 | }
351 | ],
352 | "source": [
353 | "spark.catalog.listDatabases()"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": 12,
359 | "id": "ba7856c9-7ca1-4b54-84f2-c5f0f39b1db5",
360 | "metadata": {},
361 | "outputs": [
362 | {
363 | "data": {
364 | "text/plain": [
365 | "[Table(name='hivesampletable', database='default', description=None, tableType='MANAGED', isTemporary=False),\n",
366 | " Table(name='sampletempview', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]"
367 | ]
368 | },
369 | "execution_count": 12,
370 | "metadata": {},
371 | "output_type": "execute_result"
372 | }
373 | ],
374 | "source": [
375 | "spark.catalog.listTables()"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": 13,
381 | "id": "eb99c32d-bf12-4ae0-9254-7515824474d5",
382 | "metadata": {},
383 | "outputs": [
384 | {
385 | "data": {
386 | "text/plain": [
387 | "[Column(name='_id', description=None, dataType='struct', nullable=True, isPartition=False, isBucket=False),\n",
388 | " Column(name='department_id', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n",
389 | " Column(name='first_name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n",
390 | " Column(name='id', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n",
391 | " Column(name='last_name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n",
392 | " Column(name='salary', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]"
393 | ]
394 | },
395 | "execution_count": 13,
396 | "metadata": {},
397 | "output_type": "execute_result"
398 | }
399 | ],
400 | "source": [
401 | "spark.catalog.listColumns(\"hivesampletable\")"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": 14,
407 | "id": "c757cc03-36f8-4c9f-8b88-e8a1c7485fc2",
408 | "metadata": {},
409 | "outputs": [
410 | {
411 | "name": "stdout",
412 | "output_type": "stream",
413 | "text": [
414 | "+--------------------+-------------+----------+---+---------+------+\n",
415 | "| _id|department_id|first_name| id|last_name|salary|\n",
416 | "+--------------------+-------------+----------+---+---------+------+\n",
417 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n",
418 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n",
419 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|128922|\n",
420 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|130000|\n",
421 | "|{6407c350840f10d7...| 1002| Kelly| 3| Rosario| 42689|\n",
422 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|162825|\n",
423 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|170000|\n",
424 | "|{6407c350840f10d7...| 1002| Sherry| 5| Golden| 44101|\n",
425 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 79632|\n",
426 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 90000|\n",
427 | "|{6407c350840f10d7...| 1002| Diane| 7| Gordon| 74591|\n",
428 | "|{6407c350840f10d7...| 1005| Mercedes| 8|Rodriguez| 61048|\n",
429 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|137236|\n",
430 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|140000|\n",
431 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|150000|\n",
432 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|182065|\n",
433 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|190000|\n",
434 | "|{6407c350840f10d7...| 1002| Kevin| 11| Townsend|166861|\n",
435 | "|{6407c350840f10d7...| 1004| Joshua| 12| Johnson|123082|\n",
436 | "|{6407c350840f10d7...| 1001| Julie| 13| Sanchez|185663|\n",
437 | "+--------------------+-------------+----------+---+---------+------+\n",
438 | "only showing top 20 rows\n",
439 | "\n"
440 | ]
441 | }
442 | ],
443 | "source": [
444 | "sqlContext.sql(\"SELECT * FROM hivesampletable\").show()"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": 38,
450 | "id": "3fb5fafe-af29-49ed-b88a-99913818d940",
451 | "metadata": {},
452 | "outputs": [
453 | {
454 | "name": "stdout",
455 | "output_type": "stream",
456 | "text": [
457 | "+--------------------+-------------+----------+---+---------+------+\n",
458 | "| _id|department_id|first_name| id|last_name|salary|\n",
459 | "+--------------------+-------------+----------+---+---------+------+\n",
460 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|110000|\n",
461 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|106119|\n",
462 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|128922|\n",
463 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|130000|\n",
464 | "|{6402d551bd67c9b0...| 1002| Kelly| 3| Rosario| 42689|\n",
465 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|162825|\n",
466 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|170000|\n",
467 | "|{6402d551bd67c9b0...| 1002| Sherry| 5| Golden| 44101|\n",
468 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 79632|\n",
469 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 90000|\n",
470 | "|{6402d551bd67c9b0...| 1002| Diane| 7| Gordon| 74591|\n",
471 | "|{6402d551bd67c9b0...| 1005| Mercedes| 8|Rodriguez| 61048|\n",
472 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|137236|\n",
473 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|140000|\n",
474 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|150000|\n",
475 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|182065|\n",
476 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|190000|\n",
477 | "|{6402d551bd67c9b0...| 1002| Kevin| 11| Townsend|166861|\n",
478 | "|{6402d551bd67c9b0...| 1004| Joshua| 12| Johnson|123082|\n",
479 | "|{6402d551bd67c9b0...| 1001| Julie| 13| Sanchez|185663|\n",
480 | "+--------------------+-------------+----------+---+---------+------+\n",
481 | "only showing top 20 rows\n",
482 | "\n"
483 | ]
484 | }
485 | ],
486 | "source": [
487 | "spark.table(\"sampletempview\").show()"
488 | ]
489 | },
490 | {
491 | "cell_type": "code",
492 | "execution_count": 15,
493 | "id": "7902e3d4-9c4e-4553-94bf-9dbc0dbff4a7",
494 | "metadata": {},
495 | "outputs": [
496 | {
497 | "name": "stdout",
498 | "output_type": "stream",
499 | "text": [
500 | "+--------------------+-------------+----------+---+---------+------+\n",
501 | "| _id|department_id|first_name| id|last_name|salary|\n",
502 | "+--------------------+-------------+----------+---+---------+------+\n",
503 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n",
504 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n",
505 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|128922|\n",
506 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|130000|\n",
507 | "|{6407c350840f10d7...| 1002| Kelly| 3| Rosario| 42689|\n",
508 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|162825|\n",
509 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|170000|\n",
510 | "|{6407c350840f10d7...| 1002| Sherry| 5| Golden| 44101|\n",
511 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 79632|\n",
512 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 90000|\n",
513 | "|{6407c350840f10d7...| 1002| Diane| 7| Gordon| 74591|\n",
514 | "|{6407c350840f10d7...| 1005| Mercedes| 8|Rodriguez| 61048|\n",
515 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|137236|\n",
516 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|140000|\n",
517 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|150000|\n",
518 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|182065|\n",
519 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|190000|\n",
520 | "|{6407c350840f10d7...| 1002| Kevin| 11| Townsend|166861|\n",
521 | "|{6407c350840f10d7...| 1004| Joshua| 12| Johnson|123082|\n",
522 | "|{6407c350840f10d7...| 1001| Julie| 13| Sanchez|185663|\n",
523 | "+--------------------+-------------+----------+---+---------+------+\n",
524 | "only showing top 20 rows\n",
525 | "\n"
526 | ]
527 | }
528 | ],
529 | "source": [
530 | "sqlContext.sql(\"SELECT * FROM global_temp.sampleglobalview\").show()"
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "id": "a23c0bc5-e3e8-4ac3-94f3-d0e119e31f23",
536 | "metadata": {},
537 | "source": [
538 | "7. Drop all the created tables and views in default database"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 16,
544 | "id": "7e30ad8e-7ad8-4c79-b197-41719d21c654",
545 | "metadata": {},
546 | "outputs": [],
547 | "source": [
548 | "spark.catalog.dropGlobalTempView(\"sampleglobalview\")\n",
549 | "spark.catalog.dropTempView(\"sampletempview\")"
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "id": "d4eba5f1-9fd8-4d92-97b6-638250d475e8",
555 | "metadata": {},
556 | "source": [
557 | "8. Create Dataeng database and create global and temp view using SQL "
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": 17,
563 | "id": "5644d360-38b3-4a6e-867d-5a86772d268c",
564 | "metadata": {},
565 | "outputs": [
566 | {
567 | "data": {
568 | "text/plain": [
569 | "DataFrame[]"
570 | ]
571 | },
572 | "execution_count": 17,
573 | "metadata": {},
574 | "output_type": "execute_result"
575 | }
576 | ],
577 | "source": [
578 | "spark.sql(\"CREATE DATABASE dataeng\")\n",
579 | "spark.sql(\"USE dataeng\")"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": 18,
585 | "id": "36c017c5-efb4-4a20-9628-e4bcb1a2e068",
586 | "metadata": {},
587 | "outputs": [
588 | {
589 | "name": "stdout",
590 | "output_type": "stream",
591 | "text": [
592 | "+---------+\n",
593 | "|namespace|\n",
594 | "+---------+\n",
595 | "| dataeng|\n",
596 | "| default|\n",
597 | "+---------+\n",
598 | "\n"
599 | ]
600 | }
601 | ],
602 | "source": [
603 | "spark.sql(\"show databases\").show()"
604 | ]
605 | },
606 | {
607 | "cell_type": "code",
608 | "execution_count": 20,
609 | "id": "44082455-4774-41ff-b943-16a315b661f8",
610 | "metadata": {},
611 | "outputs": [],
612 | "source": [
613 | "mongodf.write.saveAsTable(\"hivesampletable\")\n",
614 | "mongodf.createOrReplaceGlobalTempView(\"sampleglobalview\")\n",
615 | "mongodf.createOrReplaceTempView(\"sampletempview\")"
616 | ]
617 | },
618 | {
619 | "cell_type": "code",
620 | "execution_count": 22,
621 | "id": "2dcdebaf-ad5d-4fa9-9ec2-49903c734386",
622 | "metadata": {},
623 | "outputs": [
624 | {
625 | "name": "stdout",
626 | "output_type": "stream",
627 | "text": [
628 | "+---------+---------------+-----------+\n",
629 | "|namespace| tableName|isTemporary|\n",
630 | "+---------+---------------+-----------+\n",
631 | "| dataeng|hivesampletable| false|\n",
632 | "| | sampletempview| true|\n",
633 | "+---------+---------------+-----------+\n",
634 | "\n"
635 | ]
636 | }
637 | ],
638 | "source": [
639 | "spark.sql(\"show tables\").show()"
640 | ]
641 | },
642 | {
643 | "cell_type": "markdown",
644 | "id": "16c741fd-d170-4efd-963d-20d917872e95",
645 | "metadata": {},
646 | "source": [
647 | "9. Access global table from other session"
648 | ]
649 | },
650 | {
651 | "cell_type": "code",
652 | "execution_count": 23,
653 | "id": "2166405a-0b3d-427d-9ebd-35d03172397d",
654 | "metadata": {},
655 | "outputs": [],
656 | "source": [
657 | "newSpark = spark.newSession()"
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": 24,
663 | "id": "bbb3b2bc-0a4d-467b-ad53-04df44944db4",
664 | "metadata": {},
665 | "outputs": [
666 | {
667 | "data": {
668 | "text/html": [
669 | "\n",
670 | " \n",
671 | "
SparkSession - in-memory
\n",
672 | " \n",
673 | "
\n",
674 | "
SparkContext
\n",
675 | "\n",
676 | "
Spark UI
\n",
677 | "\n",
678 | "
\n",
679 | " - Version
\n",
680 | " v3.2.1 \n",
681 | " - Master
\n",
682 | " local[*] \n",
683 | " - AppName
\n",
684 | " chapter5 \n",
685 | "
\n",
686 | "
\n",
687 | " \n",
688 | "
\n",
689 | " "
690 | ],
691 | "text/plain": [
692 | ""
693 | ]
694 | },
695 | "execution_count": 24,
696 | "metadata": {},
697 | "output_type": "execute_result"
698 | }
699 | ],
700 | "source": [
701 | "newSpark"
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": 59,
707 | "id": "8d170beb-4996-4c40-9374-bd0ba6747b7b",
708 | "metadata": {},
709 | "outputs": [
710 | {
711 | "name": "stdout",
712 | "output_type": "stream",
713 | "text": [
714 | "+--------------------+-------------+----------+---+---------+------+\n",
715 | "| _id|department_id|first_name| id|last_name|salary|\n",
716 | "+--------------------+-------------+----------+---+---------+------+\n",
717 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|110000|\n",
718 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|106119|\n",
719 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|128922|\n",
720 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|130000|\n",
721 | "|{6402d551bd67c9b0...| 1002| Kelly| 3| Rosario| 42689|\n",
722 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|162825|\n",
723 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|170000|\n",
724 | "|{6402d551bd67c9b0...| 1002| Sherry| 5| Golden| 44101|\n",
725 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 79632|\n",
726 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 90000|\n",
727 | "|{6402d551bd67c9b0...| 1002| Diane| 7| Gordon| 74591|\n",
728 | "|{6402d551bd67c9b0...| 1005| Mercedes| 8|Rodriguez| 61048|\n",
729 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|137236|\n",
730 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|140000|\n",
731 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|150000|\n",
732 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|182065|\n",
733 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|190000|\n",
734 | "|{6402d551bd67c9b0...| 1002| Kevin| 11| Townsend|166861|\n",
735 | "|{6402d551bd67c9b0...| 1004| Joshua| 12| Johnson|123082|\n",
736 | "|{6402d551bd67c9b0...| 1001| Julie| 13| Sanchez|185663|\n",
737 | "+--------------------+-------------+----------+---+---------+------+\n",
738 | "only showing top 20 rows\n",
739 | "\n"
740 | ]
741 | }
742 | ],
743 | "source": [
744 | "newSpark.sql(\"SELECT * FROM global_temp.sampleglobalview\").show()"
745 | ]
746 | },
747 | {
748 | "cell_type": "code",
749 | "execution_count": 25,
750 | "id": "b2fd67ca-8ced-4ddd-a47d-da71d5b2e626",
751 | "metadata": {},
752 | "outputs": [
753 | {
754 | "ename": "AnalysisException",
755 | "evalue": "Table or view not found: sampletempview; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation [sampletempview], [], false\n",
756 | "output_type": "error",
757 | "traceback": [
758 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
759 | "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
760 | "Cell \u001b[0;32mIn[25], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mnewSpark\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSELECT * FROM sampletempview\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39mshow()\n",
761 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/pyspark/sql/session.py:723\u001b[0m, in \u001b[0;36mSparkSession.sql\u001b[0;34m(self, sqlQuery)\u001b[0m\n\u001b[1;32m 707\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21msql\u001b[39m(\u001b[38;5;28mself\u001b[39m, sqlQuery):\n\u001b[1;32m 708\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Returns a :class:`DataFrame` representing the result of the given query.\u001b[39;00m\n\u001b[1;32m 709\u001b[0m \n\u001b[1;32m 710\u001b[0m \u001b[38;5;124;03m .. versionadded:: 2.0.0\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 721\u001b[0m \u001b[38;5;124;03m [Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]\u001b[39;00m\n\u001b[1;32m 722\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 723\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m DataFrame(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_jsparkSession\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[43msqlQuery\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_wrapped)\n",
762 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/py4j/java_gateway.py:1309\u001b[0m, in \u001b[0;36mJavaMember.__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1303\u001b[0m command \u001b[38;5;241m=\u001b[39m proto\u001b[38;5;241m.\u001b[39mCALL_COMMAND_NAME \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1304\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcommand_header \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1305\u001b[0m args_command \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1306\u001b[0m proto\u001b[38;5;241m.\u001b[39mEND_COMMAND_PART\n\u001b[1;32m 1308\u001b[0m answer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mgateway_client\u001b[38;5;241m.\u001b[39msend_command(command)\n\u001b[0;32m-> 1309\u001b[0m return_value \u001b[38;5;241m=\u001b[39m \u001b[43mget_return_value\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1310\u001b[0m \u001b[43m \u001b[49m\u001b[43manswer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgateway_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtarget_id\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mname\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1312\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m temp_arg \u001b[38;5;129;01min\u001b[39;00m temp_args:\n\u001b[1;32m 1313\u001b[0m temp_arg\u001b[38;5;241m.\u001b[39m_detach()\n",
763 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/pyspark/sql/utils.py:117\u001b[0m, in \u001b[0;36mcapture_sql_exception..deco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 113\u001b[0m converted \u001b[38;5;241m=\u001b[39m convert_exception(e\u001b[38;5;241m.\u001b[39mjava_exception)\n\u001b[1;32m 114\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(converted, UnknownException):\n\u001b[1;32m 115\u001b[0m \u001b[38;5;66;03m# Hide where the exception came from that shows a non-Pythonic\u001b[39;00m\n\u001b[1;32m 116\u001b[0m \u001b[38;5;66;03m# JVM exception message.\u001b[39;00m\n\u001b[0;32m--> 117\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m converted \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;28mNone\u001b[39m\n\u001b[1;32m 118\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 119\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m\n",
764 | "\u001b[0;31mAnalysisException\u001b[0m: Table or view not found: sampletempview; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation [sampletempview], [], false\n"
765 | ]
766 | }
767 | ],
768 | "source": [
769 | "newSpark.sql(\"SELECT * FROM sampletempview\").show()"
770 | ]
771 | },
772 | {
773 | "cell_type": "code",
774 | "execution_count": null,
775 | "id": "067b9a71-0dea-457f-812d-557242af506e",
776 | "metadata": {},
777 | "outputs": [],
778 | "source": []
779 | }
780 | ],
781 | "metadata": {
782 | "kernelspec": {
783 | "display_name": "Python 3 (ipykernel)",
784 | "language": "python",
785 | "name": "python3"
786 | },
787 | "language_info": {
788 | "codemirror_mode": {
789 | "name": "ipython",
790 | "version": 3
791 | },
792 | "file_extension": ".py",
793 | "mimetype": "text/x-python",
794 | "name": "python",
795 | "nbconvert_exporter": "python",
796 | "pygments_lexer": "ipython3",
797 | "version": "3.8.13"
798 | }
799 | },
800 | "nbformat": 4,
801 | "nbformat_minor": 5
802 | }
803 |
--------------------------------------------------------------------------------
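The chapter5.ipynb notebook above is also a compact demonstration of view scoping: the final cells show that a global temp view survives into a new SparkSession (via the reserved global_temp database) while a plain temp view does not. A condensed sketch of just that behaviour, assuming spark and mongodf exist as created in the notebook:

    # Condensed sketch of the session-scoping behaviour shown above.
    # Assumption: `spark` and `mongodf` are already defined as in the notebook.
    mongodf.createOrReplaceTempView("sampletempview")          # this session only
    mongodf.createOrReplaceGlobalTempView("sampleglobalview")  # whole application

    other = spark.newSession()

    # Global temp views live in the reserved global_temp database and must be
    # qualified; they are visible from the new session:
    other.sql("SELECT * FROM global_temp.sampleglobalview").show()

    # The plain temp view is not registered in the new session, so this would
    # raise AnalysisException: Table or view not found: sampletempview
    # other.sql("SELECT * FROM sampletempview").show()
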
/Chapter0/nyc_taxi_zone.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "LocationID": 1,
4 | "Borough": "EWR",
5 | "Zone": "Newark Airport",
6 | "service_zone": "EWR"
7 | },
8 | {
9 | "LocationID": 2,
10 | "Borough": "Queens",
11 | "Zone": "Jamaica Bay",
12 | "service_zone": "Boro Zone"
13 | },
14 | {
15 | "LocationID": 3,
16 | "Borough": "Bronx",
17 | "Zone": "Allerton/Pelham Gardens",
18 | "service_zone": "Boro Zone"
19 | },
20 | {
21 | "LocationID": 4,
22 | "Borough": "Manhattan",
23 | "Zone": "Alphabet City",
24 | "service_zone": "Yellow Zone"
25 | },
26 | {
27 | "LocationID": 5,
28 | "Borough": "Staten Island",
29 | "Zone": "Arden Heights",
30 | "service_zone": "Boro Zone"
31 | },
32 | {
33 | "LocationID": 6,
34 | "Borough": "Staten Island",
35 | "Zone": "Arrochar/Fort Wadsworth",
36 | "service_zone": "Boro Zone"
37 | },
38 | {
39 | "LocationID": 7,
40 | "Borough": "Queens",
41 | "Zone": "Astoria",
42 | "service_zone": "Boro Zone"
43 | },
44 | {
45 | "LocationID": 8,
46 | "Borough": "Queens",
47 | "Zone": "Astoria Park",
48 | "service_zone": "Boro Zone"
49 | },
50 | {
51 | "LocationID": 9,
52 | "Borough": "Queens",
53 | "Zone": "Auburndale",
54 | "service_zone": "Boro Zone"
55 | },
56 | {
57 | "LocationID": 10,
58 | "Borough": "Queens",
59 | "Zone": "Baisley Park",
60 | "service_zone": "Boro Zone"
61 | },
62 | {
63 | "LocationID": 11,
64 | "Borough": "Brooklyn",
65 | "Zone": "Bath Beach",
66 | "service_zone": "Boro Zone"
67 | },
68 | {
69 | "LocationID": 12,
70 | "Borough": "Manhattan",
71 | "Zone": "Battery Park",
72 | "service_zone": "Yellow Zone"
73 | },
74 | {
75 | "LocationID": 13,
76 | "Borough": "Manhattan",
77 | "Zone": "Battery Park City",
78 | "service_zone": "Yellow Zone"
79 | },
80 | {
81 | "LocationID": 14,
82 | "Borough": "Brooklyn",
83 | "Zone": "Bay Ridge",
84 | "service_zone": "Boro Zone"
85 | },
86 | {
87 | "LocationID": 15,
88 | "Borough": "Queens",
89 | "Zone": "Bay Terrace/Fort Totten",
90 | "service_zone": "Boro Zone"
91 | },
92 | {
93 | "LocationID": 16,
94 | "Borough": "Queens",
95 | "Zone": "Bayside",
96 | "service_zone": "Boro Zone"
97 | },
98 | {
99 | "LocationID": 17,
100 | "Borough": "Brooklyn",
101 | "Zone": "Bedford",
102 | "service_zone": "Boro Zone"
103 | },
104 | {
105 | "LocationID": 18,
106 | "Borough": "Bronx",
107 | "Zone": "Bedford Park",
108 | "service_zone": "Boro Zone"
109 | },
110 | {
111 | "LocationID": 19,
112 | "Borough": "Queens",
113 | "Zone": "Bellerose",
114 | "service_zone": "Boro Zone"
115 | },
116 | {
117 | "LocationID": 20,
118 | "Borough": "Bronx",
119 | "Zone": "Belmont",
120 | "service_zone": "Boro Zone"
121 | },
122 | {
123 | "LocationID": 21,
124 | "Borough": "Brooklyn",
125 | "Zone": "Bensonhurst East",
126 | "service_zone": "Boro Zone"
127 | },
128 | {
129 | "LocationID": 22,
130 | "Borough": "Brooklyn",
131 | "Zone": "Bensonhurst West",
132 | "service_zone": "Boro Zone"
133 | },
134 | {
135 | "LocationID": 23,
136 | "Borough": "Staten Island",
137 | "Zone": "Bloomfield/Emerson Hill",
138 | "service_zone": "Boro Zone"
139 | },
140 | {
141 | "LocationID": 24,
142 | "Borough": "Manhattan",
143 | "Zone": "Bloomingdale",
144 | "service_zone": "Yellow Zone"
145 | },
146 | {
147 | "LocationID": 25,
148 | "Borough": "Brooklyn",
149 | "Zone": "Boerum Hill",
150 | "service_zone": "Boro Zone"
151 | },
152 | {
153 | "LocationID": 26,
154 | "Borough": "Brooklyn",
155 | "Zone": "Borough Park",
156 | "service_zone": "Boro Zone"
157 | },
158 | {
159 | "LocationID": 27,
160 | "Borough": "Queens",
161 | "Zone": "Breezy Point/Fort Tilden/Riis Beach",
162 | "service_zone": "Boro Zone"
163 | },
164 | {
165 | "LocationID": 28,
166 | "Borough": "Queens",
167 | "Zone": "Briarwood/Jamaica Hills",
168 | "service_zone": "Boro Zone"
169 | },
170 | {
171 | "LocationID": 29,
172 | "Borough": "Brooklyn",
173 | "Zone": "Brighton Beach",
174 | "service_zone": "Boro Zone"
175 | },
176 | {
177 | "LocationID": 30,
178 | "Borough": "Queens",
179 | "Zone": "Broad Channel",
180 | "service_zone": "Boro Zone"
181 | },
182 | {
183 | "LocationID": 31,
184 | "Borough": "Bronx",
185 | "Zone": "Bronx Park",
186 | "service_zone": "Boro Zone"
187 | },
188 | {
189 | "LocationID": 32,
190 | "Borough": "Bronx",
191 | "Zone": "Bronxdale",
192 | "service_zone": "Boro Zone"
193 | },
194 | {
195 | "LocationID": 33,
196 | "Borough": "Brooklyn",
197 | "Zone": "Brooklyn Heights",
198 | "service_zone": "Boro Zone"
199 | },
200 | {
201 | "LocationID": 34,
202 | "Borough": "Brooklyn",
203 | "Zone": "Brooklyn Navy Yard",
204 | "service_zone": "Boro Zone"
205 | },
206 | {
207 | "LocationID": 35,
208 | "Borough": "Brooklyn",
209 | "Zone": "Brownsville",
210 | "service_zone": "Boro Zone"
211 | },
212 | {
213 | "LocationID": 36,
214 | "Borough": "Brooklyn",
215 | "Zone": "Bushwick North",
216 | "service_zone": "Boro Zone"
217 | },
218 | {
219 | "LocationID": 37,
220 | "Borough": "Brooklyn",
221 | "Zone": "Bushwick South",
222 | "service_zone": "Boro Zone"
223 | },
224 | {
225 | "LocationID": 38,
226 | "Borough": "Queens",
227 | "Zone": "Cambria Heights",
228 | "service_zone": "Boro Zone"
229 | },
230 | {
231 | "LocationID": 39,
232 | "Borough": "Brooklyn",
233 | "Zone": "Canarsie",
234 | "service_zone": "Boro Zone"
235 | },
236 | {
237 | "LocationID": 40,
238 | "Borough": "Brooklyn",
239 | "Zone": "Carroll Gardens",
240 | "service_zone": "Boro Zone"
241 | },
242 | {
243 | "LocationID": 41,
244 | "Borough": "Manhattan",
245 | "Zone": "Central Harlem",
246 | "service_zone": "Boro Zone"
247 | },
248 | {
249 | "LocationID": 42,
250 | "Borough": "Manhattan",
251 | "Zone": "Central Harlem North",
252 | "service_zone": "Boro Zone"
253 | },
254 | {
255 | "LocationID": 43,
256 | "Borough": "Manhattan",
257 | "Zone": "Central Park",
258 | "service_zone": "Yellow Zone"
259 | },
260 | {
261 | "LocationID": 44,
262 | "Borough": "Staten Island",
263 | "Zone": "Charleston/Tottenville",
264 | "service_zone": "Boro Zone"
265 | },
266 | {
267 | "LocationID": 45,
268 | "Borough": "Manhattan",
269 | "Zone": "Chinatown",
270 | "service_zone": "Yellow Zone"
271 | },
272 | {
273 | "LocationID": 46,
274 | "Borough": "Bronx",
275 | "Zone": "City Island",
276 | "service_zone": "Boro Zone"
277 | },
278 | {
279 | "LocationID": 47,
280 | "Borough": "Bronx",
281 | "Zone": "Claremont/Bathgate",
282 | "service_zone": "Boro Zone"
283 | },
284 | {
285 | "LocationID": 48,
286 | "Borough": "Manhattan",
287 | "Zone": "Clinton East",
288 | "service_zone": "Yellow Zone"
289 | },
290 | {
291 | "LocationID": 49,
292 | "Borough": "Brooklyn",
293 | "Zone": "Clinton Hill",
294 | "service_zone": "Boro Zone"
295 | },
296 | {
297 | "LocationID": 50,
298 | "Borough": "Manhattan",
299 | "Zone": "Clinton West",
300 | "service_zone": "Yellow Zone"
301 | },
302 | {
303 | "LocationID": 51,
304 | "Borough": "Bronx",
305 | "Zone": "Co-Op City",
306 | "service_zone": "Boro Zone"
307 | },
308 | {
309 | "LocationID": 52,
310 | "Borough": "Brooklyn",
311 | "Zone": "Cobble Hill",
312 | "service_zone": "Boro Zone"
313 | },
314 | {
315 | "LocationID": 53,
316 | "Borough": "Queens",
317 | "Zone": "College Point",
318 | "service_zone": "Boro Zone"
319 | },
320 | {
321 | "LocationID": 54,
322 | "Borough": "Brooklyn",
323 | "Zone": "Columbia Street",
324 | "service_zone": "Boro Zone"
325 | },
326 | {
327 | "LocationID": 55,
328 | "Borough": "Brooklyn",
329 | "Zone": "Coney Island",
330 | "service_zone": "Boro Zone"
331 | },
332 | {
333 | "LocationID": 56,
334 | "Borough": "Queens",
335 | "Zone": "Corona",
336 | "service_zone": "Boro Zone"
337 | },
338 | {
339 | "LocationID": 57,
340 | "Borough": "Queens",
341 | "Zone": "Corona",
342 | "service_zone": "Boro Zone"
343 | },
344 | {
345 | "LocationID": 58,
346 | "Borough": "Bronx",
347 | "Zone": "Country Club",
348 | "service_zone": "Boro Zone"
349 | },
350 | {
351 | "LocationID": 59,
352 | "Borough": "Bronx",
353 | "Zone": "Crotona Park",
354 | "service_zone": "Boro Zone"
355 | },
356 | {
357 | "LocationID": 60,
358 | "Borough": "Bronx",
359 | "Zone": "Crotona Park East",
360 | "service_zone": "Boro Zone"
361 | },
362 | {
363 | "LocationID": 61,
364 | "Borough": "Brooklyn",
365 | "Zone": "Crown Heights North",
366 | "service_zone": "Boro Zone"
367 | },
368 | {
369 | "LocationID": 62,
370 | "Borough": "Brooklyn",
371 | "Zone": "Crown Heights South",
372 | "service_zone": "Boro Zone"
373 | },
374 | {
375 | "LocationID": 63,
376 | "Borough": "Brooklyn",
377 | "Zone": "Cypress Hills",
378 | "service_zone": "Boro Zone"
379 | },
380 | {
381 | "LocationID": 64,
382 | "Borough": "Queens",
383 | "Zone": "Douglaston",
384 | "service_zone": "Boro Zone"
385 | },
386 | {
387 | "LocationID": 65,
388 | "Borough": "Brooklyn",
389 | "Zone": "Downtown Brooklyn/MetroTech",
390 | "service_zone": "Boro Zone"
391 | },
392 | {
393 | "LocationID": 66,
394 | "Borough": "Brooklyn",
395 | "Zone": "DUMBO/Vinegar Hill",
396 | "service_zone": "Boro Zone"
397 | },
398 | {
399 | "LocationID": 67,
400 | "Borough": "Brooklyn",
401 | "Zone": "Dyker Heights",
402 | "service_zone": "Boro Zone"
403 | },
404 | {
405 | "LocationID": 68,
406 | "Borough": "Manhattan",
407 | "Zone": "East Chelsea",
408 | "service_zone": "Yellow Zone"
409 | },
410 | {
411 | "LocationID": 69,
412 | "Borough": "Bronx",
413 | "Zone": "East Concourse/Concourse Village",
414 | "service_zone": "Boro Zone"
415 | },
416 | {
417 | "LocationID": 70,
418 | "Borough": "Queens",
419 | "Zone": "East Elmhurst",
420 | "service_zone": "Boro Zone"
421 | },
422 | {
423 | "LocationID": 71,
424 | "Borough": "Brooklyn",
425 | "Zone": "East Flatbush/Farragut",
426 | "service_zone": "Boro Zone"
427 | },
428 | {
429 | "LocationID": 72,
430 | "Borough": "Brooklyn",
431 | "Zone": "East Flatbush/Remsen Village",
432 | "service_zone": "Boro Zone"
433 | },
434 | {
435 | "LocationID": 73,
436 | "Borough": "Queens",
437 | "Zone": "East Flushing",
438 | "service_zone": "Boro Zone"
439 | },
440 | {
441 | "LocationID": 74,
442 | "Borough": "Manhattan",
443 | "Zone": "East Harlem North",
444 | "service_zone": "Boro Zone"
445 | },
446 | {
447 | "LocationID": 75,
448 | "Borough": "Manhattan",
449 | "Zone": "East Harlem South",
450 | "service_zone": "Boro Zone"
451 | },
452 | {
453 | "LocationID": 76,
454 | "Borough": "Brooklyn",
455 | "Zone": "East New York",
456 | "service_zone": "Boro Zone"
457 | },
458 | {
459 | "LocationID": 77,
460 | "Borough": "Brooklyn",
461 | "Zone": "East New York/Pennsylvania Avenue",
462 | "service_zone": "Boro Zone"
463 | },
464 | {
465 | "LocationID": 78,
466 | "Borough": "Bronx",
467 | "Zone": "East Tremont",
468 | "service_zone": "Boro Zone"
469 | },
470 | {
471 | "LocationID": 79,
472 | "Borough": "Manhattan",
473 | "Zone": "East Village",
474 | "service_zone": "Yellow Zone"
475 | },
476 | {
477 | "LocationID": 80,
478 | "Borough": "Brooklyn",
479 | "Zone": "East Williamsburg",
480 | "service_zone": "Boro Zone"
481 | },
482 | {
483 | "LocationID": 81,
484 | "Borough": "Bronx",
485 | "Zone": "Eastchester",
486 | "service_zone": "Boro Zone"
487 | },
488 | {
489 | "LocationID": 82,
490 | "Borough": "Queens",
491 | "Zone": "Elmhurst",
492 | "service_zone": "Boro Zone"
493 | },
494 | {
495 | "LocationID": 83,
496 | "Borough": "Queens",
497 | "Zone": "Elmhurst/Maspeth",
498 | "service_zone": "Boro Zone"
499 | },
500 | {
501 | "LocationID": 84,
502 | "Borough": "Staten Island",
503 | "Zone": "Eltingville/Annadale/Prince's Bay",
504 | "service_zone": "Boro Zone"
505 | },
506 | {
507 | "LocationID": 85,
508 | "Borough": "Brooklyn",
509 | "Zone": "Erasmus",
510 | "service_zone": "Boro Zone"
511 | },
512 | {
513 | "LocationID": 86,
514 | "Borough": "Queens",
515 | "Zone": "Far Rockaway",
516 | "service_zone": "Boro Zone"
517 | },
518 | {
519 | "LocationID": 87,
520 | "Borough": "Manhattan",
521 | "Zone": "Financial District North",
522 | "service_zone": "Yellow Zone"
523 | },
524 | {
525 | "LocationID": 88,
526 | "Borough": "Manhattan",
527 | "Zone": "Financial District South",
528 | "service_zone": "Yellow Zone"
529 | },
530 | {
531 | "LocationID": 89,
532 | "Borough": "Brooklyn",
533 | "Zone": "Flatbush/Ditmas Park",
534 | "service_zone": "Boro Zone"
535 | },
536 | {
537 | "LocationID": 90,
538 | "Borough": "Manhattan",
539 | "Zone": "Flatiron",
540 | "service_zone": "Yellow Zone"
541 | },
542 | {
543 | "LocationID": 91,
544 | "Borough": "Brooklyn",
545 | "Zone": "Flatlands",
546 | "service_zone": "Boro Zone"
547 | },
548 | {
549 | "LocationID": 92,
550 | "Borough": "Queens",
551 | "Zone": "Flushing",
552 | "service_zone": "Boro Zone"
553 | },
554 | {
555 | "LocationID": 93,
556 | "Borough": "Queens",
557 | "Zone": "Flushing Meadows-Corona Park",
558 | "service_zone": "Boro Zone"
559 | },
560 | {
561 | "LocationID": 94,
562 | "Borough": "Bronx",
563 | "Zone": "Fordham South",
564 | "service_zone": "Boro Zone"
565 | },
566 | {
567 | "LocationID": 95,
568 | "Borough": "Queens",
569 | "Zone": "Forest Hills",
570 | "service_zone": "Boro Zone"
571 | },
572 | {
573 | "LocationID": 96,
574 | "Borough": "Queens",
575 | "Zone": "Forest Park/Highland Park",
576 | "service_zone": "Boro Zone"
577 | },
578 | {
579 | "LocationID": 97,
580 | "Borough": "Brooklyn",
581 | "Zone": "Fort Greene",
582 | "service_zone": "Boro Zone"
583 | },
584 | {
585 | "LocationID": 98,
586 | "Borough": "Queens",
587 | "Zone": "Fresh Meadows",
588 | "service_zone": "Boro Zone"
589 | },
590 | {
591 | "LocationID": 99,
592 | "Borough": "Staten Island",
593 | "Zone": "Freshkills Park",
594 | "service_zone": "Boro Zone"
595 | },
596 | {
597 | "LocationID": 100,
598 | "Borough": "Manhattan",
599 | "Zone": "Garment District",
600 | "service_zone": "Yellow Zone"
601 | },
602 | {
603 | "LocationID": 101,
604 | "Borough": "Queens",
605 | "Zone": "Glen Oaks",
606 | "service_zone": "Boro Zone"
607 | },
608 | {
609 | "LocationID": 102,
610 | "Borough": "Queens",
611 | "Zone": "Glendale",
612 | "service_zone": "Boro Zone"
613 | },
614 | {
615 | "LocationID": 103,
616 | "Borough": "Manhattan",
617 | "Zone": "Governor's Island/Ellis Island/Liberty Island",
618 | "service_zone": "Yellow Zone"
619 | },
620 | {
621 | "LocationID": 104,
622 | "Borough": "Manhattan",
623 | "Zone": "Governor's Island/Ellis Island/Liberty Island",
624 | "service_zone": "Yellow Zone"
625 | },
626 | {
627 | "LocationID": 105,
628 | "Borough": "Manhattan",
629 | "Zone": "Governor's Island/Ellis Island/Liberty Island",
630 | "service_zone": "Yellow Zone"
631 | },
632 | {
633 | "LocationID": 106,
634 | "Borough": "Brooklyn",
635 | "Zone": "Gowanus",
636 | "service_zone": "Boro Zone"
637 | },
638 | {
639 | "LocationID": 107,
640 | "Borough": "Manhattan",
641 | "Zone": "Gramercy",
642 | "service_zone": "Yellow Zone"
643 | },
644 | {
645 | "LocationID": 108,
646 | "Borough": "Brooklyn",
647 | "Zone": "Gravesend",
648 | "service_zone": "Boro Zone"
649 | },
650 | {
651 | "LocationID": 109,
652 | "Borough": "Staten Island",
653 | "Zone": "Great Kills",
654 | "service_zone": "Boro Zone"
655 | },
656 | {
657 | "LocationID": 110,
658 | "Borough": "Staten Island",
659 | "Zone": "Great Kills Park",
660 | "service_zone": "Boro Zone"
661 | },
662 | {
663 | "LocationID": 111,
664 | "Borough": "Brooklyn",
665 | "Zone": "Green-Wood Cemetery",
666 | "service_zone": "Boro Zone"
667 | },
668 | {
669 | "LocationID": 112,
670 | "Borough": "Brooklyn",
671 | "Zone": "Greenpoint",
672 | "service_zone": "Boro Zone"
673 | },
674 | {
675 | "LocationID": 113,
676 | "Borough": "Manhattan",
677 | "Zone": "Greenwich Village North",
678 | "service_zone": "Yellow Zone"
679 | },
680 | {
681 | "LocationID": 114,
682 | "Borough": "Manhattan",
683 | "Zone": "Greenwich Village South",
684 | "service_zone": "Yellow Zone"
685 | },
686 | {
687 | "LocationID": 115,
688 | "Borough": "Staten Island",
689 | "Zone": "Grymes Hill/Clifton",
690 | "service_zone": "Boro Zone"
691 | },
692 | {
693 | "LocationID": 116,
694 | "Borough": "Manhattan",
695 | "Zone": "Hamilton Heights",
696 | "service_zone": "Boro Zone"
697 | },
698 | {
699 | "LocationID": 117,
700 | "Borough": "Queens",
701 | "Zone": "Hammels/Arverne",
702 | "service_zone": "Boro Zone"
703 | },
704 | {
705 | "LocationID": 118,
706 | "Borough": "Staten Island",
707 | "Zone": "Heartland Village/Todt Hill",
708 | "service_zone": "Boro Zone"
709 | },
710 | {
711 | "LocationID": 119,
712 | "Borough": "Bronx",
713 | "Zone": "Highbridge",
714 | "service_zone": "Boro Zone"
715 | },
716 | {
717 | "LocationID": 120,
718 | "Borough": "Manhattan",
719 | "Zone": "Highbridge Park",
720 | "service_zone": "Boro Zone"
721 | },
722 | {
723 | "LocationID": 121,
724 | "Borough": "Queens",
725 | "Zone": "Hillcrest/Pomonok",
726 | "service_zone": "Boro Zone"
727 | },
728 | {
729 | "LocationID": 122,
730 | "Borough": "Queens",
731 | "Zone": "Hollis",
732 | "service_zone": "Boro Zone"
733 | },
734 | {
735 | "LocationID": 123,
736 | "Borough": "Brooklyn",
737 | "Zone": "Homecrest",
738 | "service_zone": "Boro Zone"
739 | },
740 | {
741 | "LocationID": 124,
742 | "Borough": "Queens",
743 | "Zone": "Howard Beach",
744 | "service_zone": "Boro Zone"
745 | },
746 | {
747 | "LocationID": 125,
748 | "Borough": "Manhattan",
749 | "Zone": "Hudson Sq",
750 | "service_zone": "Yellow Zone"
751 | },
752 | {
753 | "LocationID": 126,
754 | "Borough": "Bronx",
755 | "Zone": "Hunts Point",
756 | "service_zone": "Boro Zone"
757 | },
758 | {
759 | "LocationID": 127,
760 | "Borough": "Manhattan",
761 | "Zone": "Inwood",
762 | "service_zone": "Boro Zone"
763 | },
764 | {
765 | "LocationID": 128,
766 | "Borough": "Manhattan",
767 | "Zone": "Inwood Hill Park",
768 | "service_zone": "Boro Zone"
769 | },
770 | {
771 | "LocationID": 129,
772 | "Borough": "Queens",
773 | "Zone": "Jackson Heights",
774 | "service_zone": "Boro Zone"
775 | },
776 | {
777 | "LocationID": 130,
778 | "Borough": "Queens",
779 | "Zone": "Jamaica",
780 | "service_zone": "Boro Zone"
781 | },
782 | {
783 | "LocationID": 131,
784 | "Borough": "Queens",
785 | "Zone": "Jamaica Estates",
786 | "service_zone": "Boro Zone"
787 | },
788 | {
789 | "LocationID": 132,
790 | "Borough": "Queens",
791 | "Zone": "JFK Airport",
792 | "service_zone": "Airports"
793 | },
794 | {
795 | "LocationID": 133,
796 | "Borough": "Brooklyn",
797 | "Zone": "Kensington",
798 | "service_zone": "Boro Zone"
799 | },
800 | {
801 | "LocationID": 134,
802 | "Borough": "Queens",
803 | "Zone": "Kew Gardens",
804 | "service_zone": "Boro Zone"
805 | },
806 | {
807 | "LocationID": 135,
808 | "Borough": "Queens",
809 | "Zone": "Kew Gardens Hills",
810 | "service_zone": "Boro Zone"
811 | },
812 | {
813 | "LocationID": 136,
814 | "Borough": "Bronx",
815 | "Zone": "Kingsbridge Heights",
816 | "service_zone": "Boro Zone"
817 | },
818 | {
819 | "LocationID": 137,
820 | "Borough": "Manhattan",
821 | "Zone": "Kips Bay",
822 | "service_zone": "Yellow Zone"
823 | },
824 | {
825 | "LocationID": 138,
826 | "Borough": "Queens",
827 | "Zone": "LaGuardia Airport",
828 | "service_zone": "Airports"
829 | },
830 | {
831 | "LocationID": 139,
832 | "Borough": "Queens",
833 | "Zone": "Laurelton",
834 | "service_zone": "Boro Zone"
835 | },
836 | {
837 | "LocationID": 140,
838 | "Borough": "Manhattan",
839 | "Zone": "Lenox Hill East",
840 | "service_zone": "Yellow Zone"
841 | },
842 | {
843 | "LocationID": 141,
844 | "Borough": "Manhattan",
845 | "Zone": "Lenox Hill West",
846 | "service_zone": "Yellow Zone"
847 | },
848 | {
849 | "LocationID": 142,
850 | "Borough": "Manhattan",
851 | "Zone": "Lincoln Square East",
852 | "service_zone": "Yellow Zone"
853 | },
854 | {
855 | "LocationID": 143,
856 | "Borough": "Manhattan",
857 | "Zone": "Lincoln Square West",
858 | "service_zone": "Yellow Zone"
859 | },
860 | {
861 | "LocationID": 144,
862 | "Borough": "Manhattan",
863 | "Zone": "Little Italy/NoLiTa",
864 | "service_zone": "Yellow Zone"
865 | },
866 | {
867 | "LocationID": 145,
868 | "Borough": "Queens",
869 | "Zone": "Long Island City/Hunters Point",
870 | "service_zone": "Boro Zone"
871 | },
872 | {
873 | "LocationID": 146,
874 | "Borough": "Queens",
875 | "Zone": "Long Island City/Queens Plaza",
876 | "service_zone": "Boro Zone"
877 | },
878 | {
879 | "LocationID": 147,
880 | "Borough": "Bronx",
881 | "Zone": "Longwood",
882 | "service_zone": "Boro Zone"
883 | },
884 | {
885 | "LocationID": 148,
886 | "Borough": "Manhattan",
887 | "Zone": "Lower East Side",
888 | "service_zone": "Yellow Zone"
889 | },
890 | {
891 | "LocationID": 149,
892 | "Borough": "Brooklyn",
893 | "Zone": "Madison",
894 | "service_zone": "Boro Zone"
895 | },
896 | {
897 | "LocationID": 150,
898 | "Borough": "Brooklyn",
899 | "Zone": "Manhattan Beach",
900 | "service_zone": "Boro Zone"
901 | },
902 | {
903 | "LocationID": 151,
904 | "Borough": "Manhattan",
905 | "Zone": "Manhattan Valley",
906 | "service_zone": "Yellow Zone"
907 | },
908 | {
909 | "LocationID": 152,
910 | "Borough": "Manhattan",
911 | "Zone": "Manhattanville",
912 | "service_zone": "Boro Zone"
913 | },
914 | {
915 | "LocationID": 153,
916 | "Borough": "Manhattan",
917 | "Zone": "Marble Hill",
918 | "service_zone": "Boro Zone"
919 | },
920 | {
921 | "LocationID": 154,
922 | "Borough": "Brooklyn",
923 | "Zone": "Marine Park/Floyd Bennett Field",
924 | "service_zone": "Boro Zone"
925 | },
926 | {
927 | "LocationID": 155,
928 | "Borough": "Brooklyn",
929 | "Zone": "Marine Park/Mill Basin",
930 | "service_zone": "Boro Zone"
931 | },
932 | {
933 | "LocationID": 156,
934 | "Borough": "Staten Island",
935 | "Zone": "Mariners Harbor",
936 | "service_zone": "Boro Zone"
937 | },
938 | {
939 | "LocationID": 157,
940 | "Borough": "Queens",
941 | "Zone": "Maspeth",
942 | "service_zone": "Boro Zone"
943 | },
944 | {
945 | "LocationID": 158,
946 | "Borough": "Manhattan",
947 | "Zone": "Meatpacking/West Village West",
948 | "service_zone": "Yellow Zone"
949 | },
950 | {
951 | "LocationID": 159,
952 | "Borough": "Bronx",
953 | "Zone": "Melrose South",
954 | "service_zone": "Boro Zone"
955 | },
956 | {
957 | "LocationID": 160,
958 | "Borough": "Queens",
959 | "Zone": "Middle Village",
960 | "service_zone": "Boro Zone"
961 | },
962 | {
963 | "LocationID": 161,
964 | "Borough": "Manhattan",
965 | "Zone": "Midtown Center",
966 | "service_zone": "Yellow Zone"
967 | },
968 | {
969 | "LocationID": 162,
970 | "Borough": "Manhattan",
971 | "Zone": "Midtown East",
972 | "service_zone": "Yellow Zone"
973 | },
974 | {
975 | "LocationID": 163,
976 | "Borough": "Manhattan",
977 | "Zone": "Midtown North",
978 | "service_zone": "Yellow Zone"
979 | },
980 | {
981 | "LocationID": 164,
982 | "Borough": "Manhattan",
983 | "Zone": "Midtown South",
984 | "service_zone": "Yellow Zone"
985 | },
986 | {
987 | "LocationID": 165,
988 | "Borough": "Brooklyn",
989 | "Zone": "Midwood",
990 | "service_zone": "Boro Zone"
991 | },
992 | {
993 | "LocationID": 166,
994 | "Borough": "Manhattan",
995 | "Zone": "Morningside Heights",
996 | "service_zone": "Boro Zone"
997 | },
998 | {
999 | "LocationID": 167,
1000 | "Borough": "Bronx",
1001 | "Zone": "Morrisania/Melrose",
1002 | "service_zone": "Boro Zone"
1003 | },
1004 | {
1005 | "LocationID": 168,
1006 | "Borough": "Bronx",
1007 | "Zone": "Mott Haven/Port Morris",
1008 | "service_zone": "Boro Zone"
1009 | },
1010 | {
1011 | "LocationID": 169,
1012 | "Borough": "Bronx",
1013 | "Zone": "Mount Hope",
1014 | "service_zone": "Boro Zone"
1015 | },
1016 | {
1017 | "LocationID": 170,
1018 | "Borough": "Manhattan",
1019 | "Zone": "Murray Hill",
1020 | "service_zone": "Yellow Zone"
1021 | },
1022 | {
1023 | "LocationID": 171,
1024 | "Borough": "Queens",
1025 | "Zone": "Murray Hill-Queens",
1026 | "service_zone": "Boro Zone"
1027 | },
1028 | {
1029 | "LocationID": 172,
1030 | "Borough": "Staten Island",
1031 | "Zone": "New Dorp/Midland Beach",
1032 | "service_zone": "Boro Zone"
1033 | },
1034 | {
1035 | "LocationID": 173,
1036 | "Borough": "Queens",
1037 | "Zone": "North Corona",
1038 | "service_zone": "Boro Zone"
1039 | },
1040 | {
1041 | "LocationID": 174,
1042 | "Borough": "Bronx",
1043 | "Zone": "Norwood",
1044 | "service_zone": "Boro Zone"
1045 | },
1046 | {
1047 | "LocationID": 175,
1048 | "Borough": "Queens",
1049 | "Zone": "Oakland Gardens",
1050 | "service_zone": "Boro Zone"
1051 | },
1052 | {
1053 | "LocationID": 176,
1054 | "Borough": "Staten Island",
1055 | "Zone": "Oakwood",
1056 | "service_zone": "Boro Zone"
1057 | },
1058 | {
1059 | "LocationID": 177,
1060 | "Borough": "Brooklyn",
1061 | "Zone": "Ocean Hill",
1062 | "service_zone": "Boro Zone"
1063 | },
1064 | {
1065 | "LocationID": 178,
1066 | "Borough": "Brooklyn",
1067 | "Zone": "Ocean Parkway South",
1068 | "service_zone": "Boro Zone"
1069 | },
1070 | {
1071 | "LocationID": 179,
1072 | "Borough": "Queens",
1073 | "Zone": "Old Astoria",
1074 | "service_zone": "Boro Zone"
1075 | },
1076 | {
1077 | "LocationID": 180,
1078 | "Borough": "Queens",
1079 | "Zone": "Ozone Park",
1080 | "service_zone": "Boro Zone"
1081 | },
1082 | {
1083 | "LocationID": 181,
1084 | "Borough": "Brooklyn",
1085 | "Zone": "Park Slope",
1086 | "service_zone": "Boro Zone"
1087 | },
1088 | {
1089 | "LocationID": 182,
1090 | "Borough": "Bronx",
1091 | "Zone": "Parkchester",
1092 | "service_zone": "Boro Zone"
1093 | },
1094 | {
1095 | "LocationID": 183,
1096 | "Borough": "Bronx",
1097 | "Zone": "Pelham Bay",
1098 | "service_zone": "Boro Zone"
1099 | },
1100 | {
1101 | "LocationID": 184,
1102 | "Borough": "Bronx",
1103 | "Zone": "Pelham Bay Park",
1104 | "service_zone": "Boro Zone"
1105 | },
1106 | {
1107 | "LocationID": 185,
1108 | "Borough": "Bronx",
1109 | "Zone": "Pelham Parkway",
1110 | "service_zone": "Boro Zone"
1111 | },
1112 | {
1113 | "LocationID": 186,
1114 | "Borough": "Manhattan",
1115 | "Zone": "Penn Station/Madison Sq West",
1116 | "service_zone": "Yellow Zone"
1117 | },
1118 | {
1119 | "LocationID": 187,
1120 | "Borough": "Staten Island",
1121 | "Zone": "Port Richmond",
1122 | "service_zone": "Boro Zone"
1123 | },
1124 | {
1125 | "LocationID": 188,
1126 | "Borough": "Brooklyn",
1127 | "Zone": "Prospect-Lefferts Gardens",
1128 | "service_zone": "Boro Zone"
1129 | },
1130 | {
1131 | "LocationID": 189,
1132 | "Borough": "Brooklyn",
1133 | "Zone": "Prospect Heights",
1134 | "service_zone": "Boro Zone"
1135 | },
1136 | {
1137 | "LocationID": 190,
1138 | "Borough": "Brooklyn",
1139 | "Zone": "Prospect Park",
1140 | "service_zone": "Boro Zone"
1141 | },
1142 | {
1143 | "LocationID": 191,
1144 | "Borough": "Queens",
1145 | "Zone": "Queens Village",
1146 | "service_zone": "Boro Zone"
1147 | },
1148 | {
1149 | "LocationID": 192,
1150 | "Borough": "Queens",
1151 | "Zone": "Queensboro Hill",
1152 | "service_zone": "Boro Zone"
1153 | },
1154 | {
1155 | "LocationID": 193,
1156 | "Borough": "Queens",
1157 | "Zone": "Queensbridge/Ravenswood",
1158 | "service_zone": "Boro Zone"
1159 | },
1160 | {
1161 | "LocationID": 194,
1162 | "Borough": "Manhattan",
1163 | "Zone": "Randalls Island",
1164 | "service_zone": "Yellow Zone"
1165 | },
1166 | {
1167 | "LocationID": 195,
1168 | "Borough": "Brooklyn",
1169 | "Zone": "Red Hook",
1170 | "service_zone": "Boro Zone"
1171 | },
1172 | {
1173 | "LocationID": 196,
1174 | "Borough": "Queens",
1175 | "Zone": "Rego Park",
1176 | "service_zone": "Boro Zone"
1177 | },
1178 | {
1179 | "LocationID": 197,
1180 | "Borough": "Queens",
1181 | "Zone": "Richmond Hill",
1182 | "service_zone": "Boro Zone"
1183 | },
1184 | {
1185 | "LocationID": 198,
1186 | "Borough": "Queens",
1187 | "Zone": "Ridgewood",
1188 | "service_zone": "Boro Zone"
1189 | },
1190 | {
1191 | "LocationID": 199,
1192 | "Borough": "Bronx",
1193 | "Zone": "Rikers Island",
1194 | "service_zone": "Boro Zone"
1195 | },
1196 | {
1197 | "LocationID": 200,
1198 | "Borough": "Bronx",
1199 | "Zone": "Riverdale/North Riverdale/Fieldston",
1200 | "service_zone": "Boro Zone"
1201 | },
1202 | {
1203 | "LocationID": 201,
1204 | "Borough": "Queens",
1205 | "Zone": "Rockaway Park",
1206 | "service_zone": "Boro Zone"
1207 | },
1208 | {
1209 | "LocationID": 202,
1210 | "Borough": "Manhattan",
1211 | "Zone": "Roosevelt Island",
1212 | "service_zone": "Boro Zone"
1213 | },
1214 | {
1215 | "LocationID": 203,
1216 | "Borough": "Queens",
1217 | "Zone": "Rosedale",
1218 | "service_zone": "Boro Zone"
1219 | },
1220 | {
1221 | "LocationID": 204,
1222 | "Borough": "Staten Island",
1223 | "Zone": "Rossville/Woodrow",
1224 | "service_zone": "Boro Zone"
1225 | },
1226 | {
1227 | "LocationID": 205,
1228 | "Borough": "Queens",
1229 | "Zone": "Saint Albans",
1230 | "service_zone": "Boro Zone"
1231 | },
1232 | {
1233 | "LocationID": 206,
1234 | "Borough": "Staten Island",
1235 | "Zone": "Saint George/New Brighton",
1236 | "service_zone": "Boro Zone"
1237 | },
1238 | {
1239 | "LocationID": 207,
1240 | "Borough": "Queens",
1241 | "Zone": "Saint Michaels Cemetery/Woodside",
1242 | "service_zone": "Boro Zone"
1243 | },
1244 | {
1245 | "LocationID": 208,
1246 | "Borough": "Bronx",
1247 | "Zone": "Schuylerville/Edgewater Park",
1248 | "service_zone": "Boro Zone"
1249 | },
1250 | {
1251 | "LocationID": 209,
1252 | "Borough": "Manhattan",
1253 | "Zone": "Seaport",
1254 | "service_zone": "Yellow Zone"
1255 | },
1256 | {
1257 | "LocationID": 210,
1258 | "Borough": "Brooklyn",
1259 | "Zone": "Sheepshead Bay",
1260 | "service_zone": "Boro Zone"
1261 | },
1262 | {
1263 | "LocationID": 211,
1264 | "Borough": "Manhattan",
1265 | "Zone": "SoHo",
1266 | "service_zone": "Yellow Zone"
1267 | },
1268 | {
1269 | "LocationID": 212,
1270 | "Borough": "Bronx",
1271 | "Zone": "Soundview/Bruckner",
1272 | "service_zone": "Boro Zone"
1273 | },
1274 | {
1275 | "LocationID": 213,
1276 | "Borough": "Bronx",
1277 | "Zone": "Soundview/Castle Hill",
1278 | "service_zone": "Boro Zone"
1279 | },
1280 | {
1281 | "LocationID": 214,
1282 | "Borough": "Staten Island",
1283 | "Zone": "South Beach/Dongan Hills",
1284 | "service_zone": "Boro Zone"
1285 | },
1286 | {
1287 | "LocationID": 215,
1288 | "Borough": "Queens",
1289 | "Zone": "South Jamaica",
1290 | "service_zone": "Boro Zone"
1291 | },
1292 | {
1293 | "LocationID": 216,
1294 | "Borough": "Queens",
1295 | "Zone": "South Ozone Park",
1296 | "service_zone": "Boro Zone"
1297 | },
1298 | {
1299 | "LocationID": 217,
1300 | "Borough": "Brooklyn",
1301 | "Zone": "South Williamsburg",
1302 | "service_zone": "Boro Zone"
1303 | },
1304 | {
1305 | "LocationID": 218,
1306 | "Borough": "Queens",
1307 | "Zone": "Springfield Gardens North",
1308 | "service_zone": "Boro Zone"
1309 | },
1310 | {
1311 | "LocationID": 219,
1312 | "Borough": "Queens",
1313 | "Zone": "Springfield Gardens South",
1314 | "service_zone": "Boro Zone"
1315 | },
1316 | {
1317 | "LocationID": 220,
1318 | "Borough": "Bronx",
1319 | "Zone": "Spuyten Duyvil/Kingsbridge",
1320 | "service_zone": "Boro Zone"
1321 | },
1322 | {
1323 | "LocationID": 221,
1324 | "Borough": "Staten Island",
1325 | "Zone": "Stapleton",
1326 | "service_zone": "Boro Zone"
1327 | },
1328 | {
1329 | "LocationID": 222,
1330 | "Borough": "Brooklyn",
1331 | "Zone": "Starrett City",
1332 | "service_zone": "Boro Zone"
1333 | },
1334 | {
1335 | "LocationID": 223,
1336 | "Borough": "Queens",
1337 | "Zone": "Steinway",
1338 | "service_zone": "Boro Zone"
1339 | },
1340 | {
1341 | "LocationID": 224,
1342 | "Borough": "Manhattan",
1343 | "Zone": "Stuy Town/Peter Cooper Village",
1344 | "service_zone": "Yellow Zone"
1345 | },
1346 | {
1347 | "LocationID": 225,
1348 | "Borough": "Brooklyn",
1349 | "Zone": "Stuyvesant Heights",
1350 | "service_zone": "Boro Zone"
1351 | },
1352 | {
1353 | "LocationID": 226,
1354 | "Borough": "Queens",
1355 | "Zone": "Sunnyside",
1356 | "service_zone": "Boro Zone"
1357 | },
1358 | {
1359 | "LocationID": 227,
1360 | "Borough": "Brooklyn",
1361 | "Zone": "Sunset Park East",
1362 | "service_zone": "Boro Zone"
1363 | },
1364 | {
1365 | "LocationID": 228,
1366 | "Borough": "Brooklyn",
1367 | "Zone": "Sunset Park West",
1368 | "service_zone": "Boro Zone"
1369 | },
1370 | {
1371 | "LocationID": 229,
1372 | "Borough": "Manhattan",
1373 | "Zone": "Sutton Place/Turtle Bay North",
1374 | "service_zone": "Yellow Zone"
1375 | },
1376 | {
1377 | "LocationID": 230,
1378 | "Borough": "Manhattan",
1379 | "Zone": "Times Sq/Theatre District",
1380 | "service_zone": "Yellow Zone"
1381 | },
1382 | {
1383 | "LocationID": 231,
1384 | "Borough": "Manhattan",
1385 | "Zone": "TriBeCa/Civic Center",
1386 | "service_zone": "Yellow Zone"
1387 | },
1388 | {
1389 | "LocationID": 232,
1390 | "Borough": "Manhattan",
1391 | "Zone": "Two Bridges/Seward Park",
1392 | "service_zone": "Yellow Zone"
1393 | },
1394 | {
1395 | "LocationID": 233,
1396 | "Borough": "Manhattan",
1397 | "Zone": "UN/Turtle Bay South",
1398 | "service_zone": "Yellow Zone"
1399 | },
1400 | {
1401 | "LocationID": 234,
1402 | "Borough": "Manhattan",
1403 | "Zone": "Union Sq",
1404 | "service_zone": "Yellow Zone"
1405 | },
1406 | {
1407 | "LocationID": 235,
1408 | "Borough": "Bronx",
1409 | "Zone": "University Heights/Morris Heights",
1410 | "service_zone": "Boro Zone"
1411 | },
1412 | {
1413 | "LocationID": 236,
1414 | "Borough": "Manhattan",
1415 | "Zone": "Upper East Side North",
1416 | "service_zone": "Yellow Zone"
1417 | },
1418 | {
1419 | "LocationID": 237,
1420 | "Borough": "Manhattan",
1421 | "Zone": "Upper East Side South",
1422 | "service_zone": "Yellow Zone"
1423 | },
1424 | {
1425 | "LocationID": 238,
1426 | "Borough": "Manhattan",
1427 | "Zone": "Upper West Side North",
1428 | "service_zone": "Yellow Zone"
1429 | },
1430 | {
1431 | "LocationID": 239,
1432 | "Borough": "Manhattan",
1433 | "Zone": "Upper West Side South",
1434 | "service_zone": "Yellow Zone"
1435 | },
1436 | {
1437 | "LocationID": 240,
1438 | "Borough": "Bronx",
1439 | "Zone": "Van Cortlandt Park",
1440 | "service_zone": "Boro Zone"
1441 | },
1442 | {
1443 | "LocationID": 241,
1444 | "Borough": "Bronx",
1445 | "Zone": "Van Cortlandt Village",
1446 | "service_zone": "Boro Zone"
1447 | },
1448 | {
1449 | "LocationID": 242,
1450 | "Borough": "Bronx",
1451 | "Zone": "Van Nest/Morris Park",
1452 | "service_zone": "Boro Zone"
1453 | },
1454 | {
1455 | "LocationID": 243,
1456 | "Borough": "Manhattan",
1457 | "Zone": "Washington Heights North",
1458 | "service_zone": "Boro Zone"
1459 | },
1460 | {
1461 | "LocationID": 244,
1462 | "Borough": "Manhattan",
1463 | "Zone": "Washington Heights South",
1464 | "service_zone": "Boro Zone"
1465 | },
1466 | {
1467 | "LocationID": 245,
1468 | "Borough": "Staten Island",
1469 | "Zone": "West Brighton",
1470 | "service_zone": "Boro Zone"
1471 | },
1472 | {
1473 | "LocationID": 246,
1474 | "Borough": "Manhattan",
1475 | "Zone": "West Chelsea/Hudson Yards",
1476 | "service_zone": "Yellow Zone"
1477 | },
1478 | {
1479 | "LocationID": 247,
1480 | "Borough": "Bronx",
1481 | "Zone": "West Concourse",
1482 | "service_zone": "Boro Zone"
1483 | },
1484 | {
1485 | "LocationID": 248,
1486 | "Borough": "Bronx",
1487 | "Zone": "West Farms/Bronx River",
1488 | "service_zone": "Boro Zone"
1489 | },
1490 | {
1491 | "LocationID": 249,
1492 | "Borough": "Manhattan",
1493 | "Zone": "West Village",
1494 | "service_zone": "Yellow Zone"
1495 | },
1496 | {
1497 | "LocationID": 250,
1498 | "Borough": "Bronx",
1499 | "Zone": "Westchester Village/Unionport",
1500 | "service_zone": "Boro Zone"
1501 | },
1502 | {
1503 | "LocationID": 251,
1504 | "Borough": "Staten Island",
1505 | "Zone": "Westerleigh",
1506 | "service_zone": "Boro Zone"
1507 | },
1508 | {
1509 | "LocationID": 252,
1510 | "Borough": "Queens",
1511 | "Zone": "Whitestone",
1512 | "service_zone": "Boro Zone"
1513 | },
1514 | {
1515 | "LocationID": 253,
1516 | "Borough": "Queens",
1517 | "Zone": "Willets Point",
1518 | "service_zone": "Boro Zone"
1519 | },
1520 | {
1521 | "LocationID": 254,
1522 | "Borough": "Bronx",
1523 | "Zone": "Williamsbridge/Olinville",
1524 | "service_zone": "Boro Zone"
1525 | },
1526 | {
1527 | "LocationID": 255,
1528 | "Borough": "Brooklyn",
1529 | "Zone": "Williamsburg (North Side)",
1530 | "service_zone": "Boro Zone"
1531 | },
1532 | {
1533 | "LocationID": 256,
1534 | "Borough": "Brooklyn",
1535 | "Zone": "Williamsburg (South Side)",
1536 | "service_zone": "Boro Zone"
1537 | },
1538 | {
1539 | "LocationID": 257,
1540 | "Borough": "Brooklyn",
1541 | "Zone": "Windsor Terrace",
1542 | "service_zone": "Boro Zone"
1543 | },
1544 | {
1545 | "LocationID": 258,
1546 | "Borough": "Queens",
1547 | "Zone": "Woodhaven",
1548 | "service_zone": "Boro Zone"
1549 | },
1550 | {
1551 | "LocationID": 259,
1552 | "Borough": "Bronx",
1553 | "Zone": "Woodlawn/Wakefield",
1554 | "service_zone": "Boro Zone"
1555 | },
1556 | {
1557 | "LocationID": 260,
1558 | "Borough": "Queens",
1559 | "Zone": "Woodside",
1560 | "service_zone": "Boro Zone"
1561 | },
1562 | {
1563 | "LocationID": 261,
1564 | "Borough": "Manhattan",
1565 | "Zone": "World Trade Center",
1566 | "service_zone": "Yellow Zone"
1567 | },
1568 | {
1569 | "LocationID": 262,
1570 | "Borough": "Manhattan",
1571 | "Zone": "Yorkville East",
1572 | "service_zone": "Yellow Zone"
1573 | },
1574 | {
1575 | "LocationID": 263,
1576 | "Borough": "Manhattan",
1577 | "Zone": "Yorkville West",
1578 | "service_zone": "Yellow Zone"
1579 | },
1580 | {
1581 | "LocationID": 264,
1582 | "Borough": "Unknown",
1583 | "Zone": "NV",
1584 | "service_zone": "N/A"
1585 | },
1586 | {
1587 | "LocationID": 265,
1588 | "Borough": "Unknown",
1589 | "Zone": "NA",
1590 | "service_zone": "N/A"
1591 | }
1592 | ]
--------------------------------------------------------------------------------
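Since this zone lookup is stored as a single JSON array (not JSON Lines), Spark's default JSON reader would treat it as corrupt records. Below is a minimal sketch of loading it into a DataFrame with the `multiLine` option; the session setup is illustrative and the path assumes the file's location in this repo (`Chapter0/nyc_taxi_zone.json`).

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session -- configuration is illustrative.
spark = SparkSession.builder.appName("nyc-taxi-zones").getOrCreate()

# The file is one JSON array spanning many lines, so multiLine must be
# enabled; the default reader expects one JSON object per line.
zones_df = (
    spark.read
    .option("multiLine", "true")
    .json("Chapter0/nyc_taxi_zone.json")
)

# Inferred schema (fields are sorted alphabetically by Spark):
#  |-- Borough: string
#  |-- LocationID: long
#  |-- Zone: string
#  |-- service_zone: string
zones_df.printSchema()

# Example use as a lookup/dimension table: count zones per borough.
zones_df.groupBy("Borough").count().show()
```

In the ETL chapters this kind of table is typically joined against trip data on `LocationID` to resolve pickup/drop-off zones; keeping it as a small broadcast-friendly DataFrame is the natural fit.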