├── Chapter0
│   ├── sample.txt
│   ├── README.md
│   ├── nyc_taxi_zone.csv
│   └── nyc_taxi_zone.json
├── Chapter11
│   └── README.md
├── Chapter5
│   ├── README.md
│   └── chapter5.ipynb
├── Chapter9
│   └── README.md
├── Chapter1
│   ├── department.csv
│   ├── README.md
│   ├── user.csv
│   └── employee.csv
├── Chapter6
│   └── README.md
├── Chapter2
│   ├── README.md
│   └── employee.csv
├── Chapter7
│   ├── README.md
│   └── 00000000000000000000.json
├── Chapter8
│   └── README.md
├── Chapter3
│   ├── README.md
│   ├── chapter3-CovidData.ipynb
│   ├── chapter3_YellowCab.ipynb
│   └── chapter3_PublicHoliday.ipynb
├── Chapter10
│   └── README.md
├── Chapter12
│   ├── commands.sh
│   ├── README.md
│   ├── Chapter12_1.ipynb
│   ├── Chapter12_2.ipynb
│   └── Chapter12.ipynb
├── Chapter4
│   └── README.md
└── README.md
-------------------------------------------------------------------------------- /Chapter0/sample.txt: -------------------------------------------------------------------------------- 1 | 1,chris,USA 2 | 2,Mark,AUS 3 | 3,Jags,IND 4 | 4,Adam,UK -------------------------------------------------------------------------------- /Chapter1/department.csv: -------------------------------------------------------------------------------- 1 | department_id,department_name 2 | 1005,Sales 3 | 1002,Finance 4 | 1004,Purchase 5 | 1001,Operations 6 | 1006,Marketing 7 | 1003,Technology -------------------------------------------------------------------------------- /Chapter6/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 6 -> Spark ETL with APIs 3 | 4 | Task to do 5 | 1. Call API and load data into Dataframe 6 | 2. Create temp table or view and analyse data 7 | 3. Filter data and store into CSV format on file server 8 | 4. Filter data and store into JSON format (see the sketch below) 9 |
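A minimal sketch of the tasks above (not the chapter notebook; the endpoint is the reference URL below, and field names such as `HTTPS` and the `entries` key depend on the actual API response):

```python
# Call a REST API with requests and load the JSON payload into a Spark DataFrame
import requests
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("chapter6").getOrCreate()

resp = requests.get("https://api.publicapis.org/entries")
resp.raise_for_status()
entries = resp.json()["entries"]            # assumes an "entries" list in the payload

df = spark.createDataFrame([Row(**e) for e in entries])
df.createOrReplaceTempView("apis")          # task 2: temp view for SQL analysis

filtered = spark.sql("SELECT * FROM apis WHERE HTTPS = true")
filtered.write.mode("overwrite").option("header", "true").csv("apis_csv")  # task 3
filtered.write.mode("overwrite").json("apis_json")                         # task 4
```

10 | Reference: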
11 | https://api.publicapis.org/entries 12 | 13 | Solution Notebook:
14 | [Spark Notebook](chapter6.ipynb) 15 | 16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/18/spark-etl-chapter-6-with-apis/ 18 | 19 | YouTube video with Explanation: 20 | https://www.youtube.com/watch?v=eL1xIjranhg 21 | -------------------------------------------------------------------------------- /Chapter5/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 5 -> Spark ETL with Hive tables 3 | 4 | Task to do 5 | 1. Read data from one of the sources (we take our MongoDB collection as the source) 6 | 2. Create dataframe from source 7 | 3. Create Hive table from dataframe 8 | 4. Create temp Hive view from dataframe 9 | 5. Create global Hive view from dataframe 10 | 6. List databases and tables in a database 11 | 7. Drop all the created tables and views in the default database 12 | 8. Create Dataeng database and create global and temp views using SQL 13 | 9. Access the global view from another session (see the sketch below) 14 | 15 |
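A minimal sketch of tasks 3–6 and 9, assuming `df` has already been loaded from the MongoDB source (table and view names are illustrative):

```python
df.write.mode("overwrite").saveAsTable("employee_hive")   # managed Hive table

df.createOrReplaceTempView("employee_tmp")                # session-scoped view
df.createOrReplaceGlobalTempView("employee_gtmp")         # application-scoped view

spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN default").show()

# Global temp views live in the reserved global_temp database and are
# visible from another session of the same Spark application:
spark.newSession().sql("SELECT * FROM global_temp.employee_gtmp").show()
```

16 | Solution Notebook: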
17 | [Spark Notebook](chapter5.ipynb) 18 | 19 | Blog with Explanation: 20 | 21 | YouTube video with Explanation: 22 | -------------------------------------------------------------------------------- /Chapter2/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 2 -> Spark ETL with NoSQL Database (MongoDB) 3 | 4 | Task to do 5 | 1. Install required Spark libraries 6 | 2. Create connection with NoSQL Database 7 | 3. Read data from NoSQL Database 8 | 4. Transform data 9 | 5. Write data into NoSQL database (see the sketch below) 10 | 11 | Spark Libraries
12 | https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
13 | 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1'
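With that package on the classpath, a minimal read/transform/write sketch looks like this (the URI, database/collection names and the `salary` filter are placeholders, not the chapter notebook):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("chapter2")
         .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/dataeng.employee")
         .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/dataeng.employee_out")
         .getOrCreate())

df = spark.read.format("mongo").load()                  # read the input collection
high_paid = df.filter(df.salary > 100000)               # transform
high_paid.write.format("mongo").mode("append").save()   # write to the output collection
```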
14 | 15 | Solution Notebook:
16 | [Spark Notebook](chapter2.ipynb) 17 | 18 | Blog with Explanation:
19 | https://developershome.blog/2023/03/07/spark-etl-chapter-2-with-nosql-database-mongodb-cassandra/
20 | 21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=vPZV_GF0klE&list=PLYqhYQOVe-qNwwWJdhiLM_In2l9kDwkAa&index=5 23 | -------------------------------------------------------------------------------- /Chapter7/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 7 -> Spark ETL with Lakehouse | Delta Lake 3 | 4 | Task to do 5 | 1. Read data from MySQL server into Spark 6 | 2. Create HIVE temp view from data frame 7 | 3. Load filtered data into Delta format (create initial table) 8 | 4. Load filtered data into Delta format again, into the same table 9 | 5. Read Delta tables using Spark data frame 10 | 6. Create temp HIVE view of Delta tables 11 | 7. Write queries to read data and also explore versions (see the sketch below) 12 |
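A minimal sketch of tasks 3–7, assuming `df` already holds the filtered rows and the session was started with the `io.delta:delta-core_2.12` package and Delta SQL extensions configured (the path is illustrative):

```python
path = "/tmp/delta/employee"

df.write.format("delta").mode("overwrite").save(path)   # initial table -> version 0
df.write.format("delta").mode("append").save(path)      # second load -> version 1

current = spark.read.format("delta").load(path)         # read back as a data frame
current.createOrReplaceTempView("employee_delta")       # temp HIVE view
spark.sql("SELECT COUNT(*) FROM employee_delta").show()

# Time travel: read an earlier version of the same table
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```

13 | Solution Notebook: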
14 | [Spark Notebook](chapter7.ipynb) 15 | 16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/18/spark-etl-chapter-6-with-apis/ 18 | 19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg 21 | 22 | Medium Blog Channel:
23 | https://medium.com/@developershome 24 | -------------------------------------------------------------------------------- /Chapter1/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 1 -> Spark ETL with SQL Database (MySQL | PostgreSQL) 3 | 4 | Task to do 5 | 1. Install required Spark libraries 6 | 2. Create connection with SQL Database 7 | 3. Read data from SQL Database 8 | 4. Transform data 9 | 5. Write data into SQL Server (see the sketch below) 10 | 11 | Spark Libraries
12 | https://mvnrepository.com/artifact/mysql/mysql-connector-java 13 |
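With that connector on the classpath, a minimal read/transform/write sketch looks like this (host, credentials and table names are placeholders for your own MySQL or PostgreSQL instance):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("chapter1")
         .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.32")
         .getOrCreate())

jdbc_url = "jdbc:mysql://localhost:3306/dataeng"
props = {"user": "root", "password": "mysql", "driver": "com.mysql.cj.jdbc.Driver"}

employee = spark.read.jdbc(url=jdbc_url, table="employee", properties=props)
top_paid = employee.filter(employee.salary > 100000)    # transform
top_paid.write.jdbc(url=jdbc_url, table="employee_top", mode="append", properties=props)
```

14 | Solution Notebook: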
15 | [Spark Notebook](chapter1.ipynb) 16 | 17 | Blog with Explanation:
18 | https://developershome.blog/2023/03/06/spark-etl-with-sql-databases-mysql-postgresql/
19 | https://medium.com/@fylfotbeta/spark-etl-chapter-1-with-sql-databases-mysql-postgresql-a0a589f7f9ff 20 | 21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=PHahcWd1AqM&list=PLYqhYQOVe-qNwwWJdhiLM_In2l9kDwkAa&index=3 23 | -------------------------------------------------------------------------------- /Chapter8/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 8 -> Spark ETL with Lakehouse | Apache HUDI 3 | 4 | Task to do 5 | 1. Read data from MySQL server into Spark 6 | 2. Create HIVE temp view from data frame 7 | 3. Load filtered data into HUDI format (create initial table) 8 | 4. Load filtered data into HUDI format again, into the same table 9 | 5. Read HUDI tables using Spark data frame 10 | 6. Create temp HIVE view of HUDI tables 11 | 7. Write queries to read data and also explore versions (see the sketch below) 12 |
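A minimal sketch of tasks 3–7, assuming `df` already holds the filtered rows, the hudi-spark bundle is on the classpath, and the schema has `id`/`salary` columns (all names and paths are illustrative):

```python
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "salary",
    "hoodie.datasource.write.operation": "upsert",
}
path = "/tmp/hudi/employee_hudi"

df.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)  # initial commit
df.write.format("hudi").options(**hudi_options).mode("append").save(path)     # second commit

hudi_df = spark.read.format("hudi").load(path)
hudi_df.createOrReplaceTempView("employee_hudi_v")                             # temp HIVE view
spark.sql("SELECT _hoodie_commit_time, id, salary FROM employee_hudi_v").show()
```

13 | Solution Notebook: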
14 | [Spark Notebook](chapter8.ipynb) 15 | 16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-8-with-lakehouse-apache-hudi/ 18 | 19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg 21 | 22 | Medium Blog Channel:
23 | https://medium.com/@developershome 24 | -------------------------------------------------------------------------------- /Chapter11/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 11 -> Spark ETL with Lakehouse | Delta table Optimization (Partition, ZORDER & Optimize) 3 | 4 | Task to do 5 | 1. Read data from CSV file into Spark 6 | 2. Create HIVE temp view from data frame 7 | 3. Load data into Delta format (create initial table) 8 | 4. Load data into Delta format with partition 9 | 5. Apply Optimize executeCompaction on the Delta table 10 | 6. Apply Optimize ZOrder on the Delta table 11 | 7. Check performance (see the sketch below) 12 |
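A minimal sketch of tasks 4–6; OPTIMIZE/ZORDER requires a Delta build that ships them (Delta Lake 2.x OSS or Databricks), and the path and column names here are illustrative:

```python
path = "/tmp/delta/taxi"

(df.write.format("delta")
   .partitionBy("puYear")                                       # task 4: partitioned write
   .mode("overwrite").save(path))

spark.sql(f"OPTIMIZE delta.`{path}`")                           # task 5: compact small files
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (puLocationId)")  # task 6: co-locate rows
```

13 | Solution Notebook: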
14 | [Spark Notebook](chapter9.ipynb) 15 | 16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/ 18 | 19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg 21 | 22 | Medium Blog Channel:
23 | https://medium.com/@developershome 24 | -------------------------------------------------------------------------------- /Chapter9/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 9 -> Spark ETL with Lakehouse | Apache Iceberg 3 | 4 | Task to do 5 | 1. Read data from MySQL server into Spark 6 | 2. Create HIVE temp view from data frame 7 | 3. Load filtered data into Iceberg format (create initial table) 8 | 4. Load filtered data into Iceberg format again, into the same table 9 | 5. Read Iceberg tables using Spark data frame 10 | 6. Create temp HIVE view of Iceberg tables 11 | 7. Write queries to read data and also explore versions (see the sketch below) 12 |
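A minimal sketch of tasks 3–7, assuming the session was started with the iceberg-spark-runtime package and a Hadoop-type catalog registered as `local` (table names are illustrative):

```python
df.writeTo("local.db.employee").using("iceberg").createOrReplace()  # initial table
df.writeTo("local.db.employee").append()                            # second load

iceberg_df = spark.table("local.db.employee")                       # read back
iceberg_df.createOrReplaceTempView("employee_iceberg")              # temp HIVE view

# Every write produces a snapshot; list them to explore versions:
spark.sql("SELECT snapshot_id, committed_at FROM local.db.employee.snapshots").show()
```

13 | Solution Notebook: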
14 | [Spark Notebook](chapter9.ipynb) 15 | 16 | Blog with Explanation:
17 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/ 18 | 19 | YouTube video with Explanation:
20 | https://www.youtube.com/watch?v=eL1xIjranhg 21 | 22 | Medium Blog Channel:
23 | https://medium.com/@developershome 24 | -------------------------------------------------------------------------------- /Chapter3/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 3 -> Spark ETL with Azure (Blob | ADLS) 3 | 4 | Task to do 5 | 1. Install required Spark libraries 6 | 2. Create connection with Azure Blob storage 7 | 3. Read data from blob and store into dataframe 8 | 4. Transform data 9 | 5. Write data into Parquet file 10 | 6. Write data into CSV file 11 | 12 | Reference:
13 | https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog 14 | 15 | Solution Notebook:
16 | [Spark Notebook With NYC Yellow Taxi blob](chapter3_YellowCab.ipynb)
[Spark Notebook With Covid Public Data blob](chapter3-CovidData.ipynb)
[Spark Notebook With Public Holiday blob](chapter3_PublicHoliday.ipynb) 19 | 20 | Blog with Explanation:
21 | https://developershome.blog/2023/03/08/spark-etl-chapter-3-with-cloud-data-lakes-azure-blob-azure-adls/ 22 | 23 | YouTube video with Explanation:
24 | -------------------------------------------------------------------------------- /Chapter10/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 10 -> Spark ETL with Lakehouse | Delta Lake vs Apache Iceberg vs Apache HUDI 3 | 4 | Task to do 5 | 1. Read data from MySQL server into Spark 6 | 2. Create HIVE temp view from data frame 7 | 3. Load filtered data into Delta format (create initial table) 8 | 4. Load filtered data into HUDI format (create initial table) 9 | 5. Load filtered data into Iceberg format (create initial table) 10 | 6. Read data from Delta | HUDI | Iceberg format 11 | 12 | Solution Notebook:
13 | [Spark Notebook Delta](chapter10_delta.ipynb)
14 | [Spark Notebook HUDI](chapter10_hudi.ipynb)
[Spark Notebook Iceberg](chapter10_iceberg.ipynb) 16 | 17 | Blog with Explanation:
18 | https://developershome.blog/?s=etl&category=spark 19 | 20 | YouTube video with Explanation:
21 | https://www.youtube.com/watch?v=eL1xIjranhg 22 | 23 | Medium Blog Channel:
24 | https://medium.com/@developershome 25 | -------------------------------------------------------------------------------- /Chapter12/commands.sh: -------------------------------------------------------------------------------- 1 | # Command for listing topics 2 | kafka-topics.sh --bootstrap-server=localhost:9092 --list 3 | 4 | # Command for creating a topic 5 | kafka-topics.sh --create --topic dataeng --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1 6 | kafka-topics.sh --create --bootstrap-server localhost:9092 --topic test_topic 7 | 8 | # Describe a topic in detail 9 | kafka-topics.sh --bootstrap-server=localhost:9092 --describe --topic dataeng 10 | 11 | # Command for producer to publish messages 12 | kafka-console-producer.sh --topic dataeng --bootstrap-server localhost:9092 13 | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test_topic --property "parse.key=true" --property "key.separator=:" 14 | 15 | # Command for subscriber to receive messages 16 | kafka-console-consumer.sh --topic dataeng --from-beginning --bootstrap-server localhost:9092 17 | kafka-console-consumer.sh --topic test_topic --from-beginning --bootstrap-server localhost:9092 -------------------------------------------------------------------------------- /Chapter0/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 0 -> Spark ETL with files (CSV | JSON | Parquet | Text | Spark Dataframe) 3 | 4 | Task to do 5 | 1. Read CSV file and write into dataframe 6 | 2. Read JSON file and write into dataframe 7 | 3. Read Parquet file and write into dataframe 8 | 4. Read text file and write into dataframe 9 | 5. Create temp table for all 10 | 6. Create JSON file from CSV dataframe 11 | 7. Create CSV file from Parquet dataframe 12 | 8. Create Parquet file from JSON dataframe 13 | 9. Create ORC file from JSON dataframe (see the sketch below) 14 |
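A minimal sketch of the round trips above (file names are illustrative; the real data files for this chapter sit next to the notebook):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chapter0").getOrCreate()

csv_df = spark.read.option("header", "true").csv("nyc_taxi_zone.csv")
json_df = spark.read.json("nyc_taxi_zone.json")
text_df = spark.read.text("sample.txt")                   # one `value` column per line

csv_df.createOrReplaceTempView("zones")                   # temp table for SQL

csv_df.write.mode("overwrite").json("zones_json")         # CSV dataframe -> JSON
json_df.write.mode("overwrite").parquet("zones_parquet")  # JSON dataframe -> Parquet
json_df.write.mode("overwrite").orc("zones_orc")          # JSON dataframe -> ORC
```

15 | Reference Data: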
16 | https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page 17 | 18 | Solution Notebook:
19 | [Spark Notebook](chapter0.ipynb) 20 | 21 | Blog with Explanation: 22 | https://medium.com/@fylfotbeta/spark-etl-chapter-0-with-files-csv-json-parquet-orc-87359909c568 23 | 24 | https://developershome.blog/2023/03/02/spark-etl-chapter-0-with-files-csv-json-parquet-orc/ 25 | 26 | YouTube video with Explanation: 27 | https://youtu.be/fL_DpgyU040 -------------------------------------------------------------------------------- /Chapter4/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 4 -> Spark ETL with AWS (S3 bucket) 3 | 4 | Task to do 5 | 1. Install required Spark libraries 6 | 2. Create connection with AWS S3 bucket 7 | 3. Read data from S3 bucket and store into dataframe 8 | 4. Transform data 9 | 5. Write data into Parquet file 10 | 6. Write data into JSON file (see the sketch below) 11 |
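A minimal sketch for reading the public Ookla bucket anonymously (the hadoop-aws version must match your Hadoop build — 3.3.1 is an assumption here — and `avg_d_kbps` follows the published Ookla schema):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("chapter4")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
         .getOrCreate())

df = spark.read.parquet(
    "s3a://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/"
    "2019-01-01_performance_fixed_tiles.parquet")

fast = df.filter(df.avg_d_kbps > 100000)                # transform
fast.write.mode("overwrite").parquet("speedtest_parquet")
fast.write.mode("overwrite").json("speedtest_json")
```

12 | Reference: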
13 | https://registry.opendata.aws/speedtest-global-performance/ 14 | 15 | Command:
16 | aws s3 ls --no-sign-request s3://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet
17 | aws s3 cp --no-sign-request s3://ookla-open-data/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet sample.parquet 18 | 19 | Solution Notebook:
20 | [Spark Notebook](chapter4.ipynb) 21 | [Spark Notebook Dataset1](chapter4-dataset1.ipynb) 22 | 23 | Blog with Explanation: 24 | https://developershome.blog/2023/03/12/spark-etl-chapter-4-with-cloud-data-lakes-aws-s3-bucket/
25 | 26 | YouTube video with Explanation: 27 | -------------------------------------------------------------------------------- /Chapter7/00000000000000000000.json: -------------------------------------------------------------------------------- 1 | {"protocol":{"minReaderVersion":1,"minWriterVersion":2}} 2 | {"metaData":{"id":"29ae77fe-b695-4e96-b941-16d6ffd0ade5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"FOODNAME\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"SCIENTIFICNAME\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"GROUP\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}},{\"name\":\"SUBGROUP\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"scale\":0}}]}","partitionColumns":[],"configuration":{},"createdTime":1679214189993}} 3 | {"add":{"path":"part-00000-4865d56d-ee2d-4f57-9e81-645e1547bfe3-c000.snappy.parquet","partitionValues":{},"size":2938,"modificationTime":1679214191340,"dataChange":true}} 4 | {"commitInfo":{"timestamp":1679214191401,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"52","numOutputBytes":"2938"},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.1.0"}} 5 | -------------------------------------------------------------------------------- /Chapter1/user.csv: -------------------------------------------------------------------------------- 1 | id,name 2 | 1,Dustin Smith 3 | 2,Jay Ramirez 4 | 3,Joseph Cooke 5 | 4,Melinda Young 6 | 5,Sean Parker 7 | 6,Ian Foster 8 | 7,Christopher Schmitt 9 | 8,Patrick Gutierrez 10 | 9,Dennis Douglas 11 | 10,Brenda Morris 12 | 11,Jeffery Hernandez 13 | 12,David Rice 14 | 13,Charles Foster 15 | 14,Keith Perez DVM 16 | 15,Dean Cuevas 17 | 16,Melissa Bishop 18 | 17,Alexander Howell 19 | 18,Austin Robertson 20 | 19,Sherri Mcdaniel 21 | 20,Nancy Nguyen 22 | 21,Melody Ball 23 | 22,Christopher Stokes 24 | 23,Joseph Hamilton 25 | 24,Kevin Fischer 26 | 25,Crystal Berg 27 | 26,Barbara Larson 28 | 27,Jacqueline Heath 29 | 28,Eric Gardner 30 | 29,Daniel Kennedy 31 | 30,Kaylee Sims 32 | 31,Shannon Green 33 | 32,Stacy Collins 34 | 33,Donna Ortiz 35 | 34,Jennifer Simmons 36 | 35,Michael Gill 37 | 36,Alyssa Shaw 38 | 37,Destiny Clark 39 | 38,Thomas Lara 40 | 39,Mark Diaz 41 | 40,Stacy Bryant 42 | 41,Howard Rose 43 | 42,Brian Schwartz 44 | 43,Kimberly Potter 45 | 44,Cassidy Ryan 46 | 45,Benjamin Mcbride 47 | 46,Elizabeth Ward 48 | 47,Christina Price 49 | 48,Pamela Cox 50 | 49,Jessica Peterson 51 | 50,Michael Nelson -------------------------------------------------------------------------------- /Chapter12/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Chapter 12 -> Spark ETL with Apache Kafka 3 | 4 | Task to do 5 | 1. Create Apache Kafka Publisher, create the topic, and publish messages 6 | 2. Create Apache Kafka Consumer, subscribe to the topic, and receive messages 7 | 3. Create a Spark session & install the required libraries for Apache Kafka 8 | 4. From the Spark session, subscribe to the earlier created topic 9 | 5. Stream messages into the console 10 | 6. Write streaming messages into files (CSV or JSON or Delta format) 11 | 7. Write streaming messages to the database (MySQL or PostgreSQL or MongoDB) 12 | 13 | Solution Notebook:
14 | [Spark Notebook For Streaming messages on Console](Chapter12.ipynb) 15 | [Spark Notebook For Streaming messages in CSV files](Chapter12_1.ipynb) 16 | [Spark Notebook For Streaming messages in SQL Server](Chapter12_2.ipynb) 17 | 18 | Blog with Explanation:
19 | https://developershome.blog/2023/03/21/spark-etl-chapter-9-with-lakehouse-apache-iceberg/ 20 | 21 | YouTube video with Explanation:
22 | https://www.youtube.com/watch?v=eL1xIjranhg 23 | 24 | Medium Blog Channel:
25 | https://medium.com/@developershome 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Spark ETL 2 | Extract, Transform and Load using Spark, or 3 | Extract, Load and Transform using Spark 4 | 5 | image 6 | 7 | Here, we will create Spark notebooks for all of the ETL processes below. Once we have learned about all the ETL processes, we will start working on projects using Spark.

8 | Please find the list of ETL pipelines below 9 | 10 | 0. Chapter0 -> [Spark ETL with Files (CSV | JSON | Parquet)](Chapter0/README.md) 11 | 1. Chapter1 -> [Spark ETL with SQL Database (MySQL | PostgreSQL)](Chapter1/README.md) 12 | 2. Chapter2 -> [Spark ETL with NoSQL Database (MongoDB)](Chapter2/README.md) 13 | 3. Chapter3 -> [Spark ETL with Azure (Blob | ADLS)](Chapter3/README.md) 14 | 4. Chapter4 -> [Spark ETL with AWS (S3 bucket)](Chapter4/README.md) 15 | 5. Chapter5 -> [Spark ETL with Hive tables](Chapter5/README.md) 16 | 6. Chapter6 -> [Spark ETL with APIs](Chapter6/README.md) 17 | 7. Chapter7 -> [Spark ETL with Lakehouse (Delta Lake)](Chapter7/README.md) 18 | 8. Chapter8 -> [Spark ETL with Lakehouse (Apache HUDI)](Chapter8/README.md) 19 | 9. Chapter9 -> [Spark ETL with Lakehouse (Apache Iceberg)](Chapter9/README.md) 20 | 10. Chapter10 -> [Spark ETL with Lakehouse (Delta Lake vs Apache Iceberg vs Apache HUDI)](Chapter10/README.md) 21 | 11. Chapter11 -> [Spark ETL with Lakehouse (Delta table Optimization)](Chapter11/README.md) 22 | 12. Chapter12 -> [Spark ETL with Apache Kafka](Chapter12/README.md) 23 | 13. Chapter13 -> [Spark ETL with GCP (BigQuery)](Chapter13/README.md) 24 | 25 | 26 | Also see the blog below for understanding all the data engineering ETL chapters 27 | 28 | https://developershome.blog/category/data-engineering/spark-etl 29 | 30 | Also see the YouTube channel below for understanding all the data engineering chapters and learning new concepts of data engineering. 31 | 32 | https://www.youtube.com/@developershomeIn 33 | -------------------------------------------------------------------------------- /Chapter1/employee.csv: -------------------------------------------------------------------------------- 1 | "id","first_name","last_name","salary","department_id" 2 | 1,Todd,Wilson,110000,1006 3 | 1,Todd,Wilson,106119,1006 4 | 2,Justin,Simon,128922,1005 5 | 2,Justin,Simon,130000,1005 6 | 3,Kelly,Rosario,42689,1002 7 | 4,Patricia,Powell,162825,1004 8 | 4,Patricia,Powell,170000,1004 9 | 5,Sherry,Golden,44101,1002 10 | 6,Natasha,Swanson,79632,1005 11 | 6,Natasha,Swanson,90000,1005 12 | 7,Diane,Gordon,74591,1002 13 | 8,Mercedes,Rodriguez,61048,1005 14 | 9,Christy,Mitchell,137236,1001 15 | 9,Christy,Mitchell,140000,1001 16 | 9,Christy,Mitchell,150000,1001 17 | 10,Sean,Crawford,182065,1006 18 | 10,Sean,Crawford,190000,1006 19 | 11,Kevin,Townsend,166861,1002 20 | 12,Joshua,Johnson,123082,1004 21 | 13,Julie,Sanchez,185663,1001 22 | 13,Julie,Sanchez,200000,1001 23 | 13,Julie,Sanchez,210000,1001 24 | 14,John,Coleman,152434,1001 25 | 15,Anthony,Valdez,96898,1001 26 | 16,Briana,Rivas,151668,1005 27 | 17,Jason,Burnett,42525,1006 28 | 18,Jeffrey,Harris,14491,1002 29 | 18,Jeffrey,Harris,20000,1002 30 | 19,Michael,Ramsey,63159,1003 31 | 20,Cody,Gonzalez,112809,1004 32 | 21,Stephen,Berry,123617,1002 33 | 22,Brittany,Scott,162537,1002 34 | 23,Angela,Williams,100875,1004 35 | 24,William,Flores,142674,1003 36 | 25,Pamela,Matthews,57944,1005 37 | 26,Allison,Johnson,128782,1001 38 | 27,Anthony,Ball,34386,1003 39 | 28,Alexis,Beck,12260,1005 40 | 29,Jason,Olsen,51937,1006 41 | 30,Stephen,Smith,194791,1001 42 | 31,Kimberly,Brooks,95327,1003 43 | 32,Eric,Zimmerman,83093,1006 44 | 33,Peter,Holt,69945,1002 45 | 34,Justin,Dunn,67992,1003 46 | 35,John,Ball,47795,1004 47 | 36,Jesus,Ward,36078,1005 48 | 37,Philip,Gillespie,36424,1006 49 | 38,Nicole,Lewis,114079,1001 50 | 39,Linda,Clark,186781,1002 51 | 40,Colleen,Carrillo,147723,1004 52 | 41,John,George,21642,1001 53 | 42,Traci,Williams,138892,1003 54 |
42,Traci,Williams,150000,1003 55 | 42,Traci,Williams,160000,1003 56 | 42,Traci,Williams,180000,1003 57 | 43,Joseph,Rogers,22800,1005 58 | 44,Trevor,Carter,38670,1001 59 | 45,Kevin,Duncan,45210,1003 60 | 46,Joshua,Ewing,73088,1003 61 | 47,Kimberly,Dean,71416,1003 62 | 48,Robert,Lynch,117960,1004 63 | 49,Amber,Harding,77764,1002 64 | 50,Victoria,Wilson,176620,1002 65 | 51,Theresa,Everett,31404,1002 66 | 52,Kara,Smith,192838,1004 67 | 53,Teresa,Cohen,98860,1001 68 | 54,Wesley,Tucker,90221,1005 69 | 55,Michael,Morris,106799,1005 70 | 56,Rachael,Williams,103585,1002 71 | 57,Patricia,Harmon,147417,1005 72 | 58,Edward,Sharp,41077,1005 73 | 59,Kevin,Robinson,100924,1005 74 | 60,Charles,Pearson,173317,1004 75 | 61,Ryan,Brown,110225,1003 76 | 61,Ryan,Brown,120000,1003 77 | 62,Dale,Hayes,97662,1005 78 | 63,Richard,Sanford,136083,1001 79 | 64,Danielle,Williams,98655,1006 80 | 64,Danielle,Williams,110000,1006 81 | 64,Danielle,Williams,120000,1006 82 | 65,Deborah,Martin,67389,1004 83 | 66,Dustin,Bush,47567,1004 84 | 67,Tyler,Green,111085,1002 85 | 68,Antonio,Carpenter,83684,1002 86 | 69,Ernest,Peterson,115993,1005 87 | 70,Karen,Fernandez,101238,1003 88 | 71,Kristine,Casey,67651,1003 89 | 72,Christine,Frye,137244,1004 90 | 73,William,Preston,155225,1003 91 | 74,Richard,Cole,180361,1003 92 | 75,Julia,Ramos,61398,1006 93 | 75,Julia,Ramos,70000,1006 94 | 75,Julia,Ramos,83000,1006 95 | 75,Julia,Ramos,90000,1006 96 | 75,Julia,Ramos,105000,1006 97 | -------------------------------------------------------------------------------- /Chapter2/employee.csv: -------------------------------------------------------------------------------- 1 | "id","first_name","last_name","salary","department_id" 2 | 1,Todd,Wilson,110000,1006 3 | 1,Todd,Wilson,106119,1006 4 | 2,Justin,Simon,128922,1005 5 | 2,Justin,Simon,130000,1005 6 | 3,Kelly,Rosario,42689,1002 7 | 4,Patricia,Powell,162825,1004 8 | 4,Patricia,Powell,170000,1004 9 | 5,Sherry,Golden,44101,1002 10 | 6,Natasha,Swanson,79632,1005 11 | 6,Natasha,Swanson,90000,1005 12 | 7,Diane,Gordon,74591,1002 13 | 8,Mercedes,Rodriguez,61048,1005 14 | 9,Christy,Mitchell,137236,1001 15 | 9,Christy,Mitchell,140000,1001 16 | 9,Christy,Mitchell,150000,1001 17 | 10,Sean,Crawford,182065,1006 18 | 10,Sean,Crawford,190000,1006 19 | 11,Kevin,Townsend,166861,1002 20 | 12,Joshua,Johnson,123082,1004 21 | 13,Julie,Sanchez,185663,1001 22 | 13,Julie,Sanchez,200000,1001 23 | 13,Julie,Sanchez,210000,1001 24 | 14,John,Coleman,152434,1001 25 | 15,Anthony,Valdez,96898,1001 26 | 16,Briana,Rivas,151668,1005 27 | 17,Jason,Burnett,42525,1006 28 | 18,Jeffrey,Harris,14491,1002 29 | 18,Jeffrey,Harris,20000,1002 30 | 19,Michael,Ramsey,63159,1003 31 | 20,Cody,Gonzalez,112809,1004 32 | 21,Stephen,Berry,123617,1002 33 | 22,Brittany,Scott,162537,1002 34 | 23,Angela,Williams,100875,1004 35 | 24,William,Flores,142674,1003 36 | 25,Pamela,Matthews,57944,1005 37 | 26,Allison,Johnson,128782,1001 38 | 27,Anthony,Ball,34386,1003 39 | 28,Alexis,Beck,12260,1005 40 | 29,Jason,Olsen,51937,1006 41 | 30,Stephen,Smith,194791,1001 42 | 31,Kimberly,Brooks,95327,1003 43 | 32,Eric,Zimmerman,83093,1006 44 | 33,Peter,Holt,69945,1002 45 | 34,Justin,Dunn,67992,1003 46 | 35,John,Ball,47795,1004 47 | 36,Jesus,Ward,36078,1005 48 | 37,Philip,Gillespie,36424,1006 49 | 38,Nicole,Lewis,114079,1001 50 | 39,Linda,Clark,186781,1002 51 | 40,Colleen,Carrillo,147723,1004 52 | 41,John,George,21642,1001 53 | 42,Traci,Williams,138892,1003 54 | 42,Traci,Williams,150000,1003 55 | 42,Traci,Williams,160000,1003 56 | 42,Traci,Williams,180000,1003 57 | 
43,Joseph,Rogers,22800,1005 58 | 44,Trevor,Carter,38670,1001 59 | 45,Kevin,Duncan,45210,1003 60 | 46,Joshua,Ewing,73088,1003 61 | 47,Kimberly,Dean,71416,1003 62 | 48,Robert,Lynch,117960,1004 63 | 49,Amber,Harding,77764,1002 64 | 50,Victoria,Wilson,176620,1002 65 | 51,Theresa,Everett,31404,1002 66 | 52,Kara,Smith,192838,1004 67 | 53,Teresa,Cohen,98860,1001 68 | 54,Wesley,Tucker,90221,1005 69 | 55,Michael,Morris,106799,1005 70 | 56,Rachael,Williams,103585,1002 71 | 57,Patricia,Harmon,147417,1005 72 | 58,Edward,Sharp,41077,1005 73 | 59,Kevin,Robinson,100924,1005 74 | 60,Charles,Pearson,173317,1004 75 | 61,Ryan,Brown,110225,1003 76 | 61,Ryan,Brown,120000,1003 77 | 62,Dale,Hayes,97662,1005 78 | 63,Richard,Sanford,136083,1001 79 | 64,Danielle,Williams,98655,1006 80 | 64,Danielle,Williams,110000,1006 81 | 64,Danielle,Williams,120000,1006 82 | 65,Deborah,Martin,67389,1004 83 | 66,Dustin,Bush,47567,1004 84 | 67,Tyler,Green,111085,1002 85 | 68,Antonio,Carpenter,83684,1002 86 | 69,Ernest,Peterson,115993,1005 87 | 70,Karen,Fernandez,101238,1003 88 | 71,Kristine,Casey,67651,1003 89 | 72,Christine,Frye,137244,1004 90 | 73,William,Preston,155225,1003 91 | 74,Richard,Cole,180361,1003 92 | 75,Julia,Ramos,61398,1006 93 | 75,Julia,Ramos,70000,1006 94 | 75,Julia,Ramos,83000,1006 95 | 75,Julia,Ramos,90000,1006 96 | 75,Julia,Ramos,105000,1006 97 | -------------------------------------------------------------------------------- /Chapter12/Chapter12_1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "0032fcbf-b834-4e38-a99a-269f80ae104f", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# First Load all the required library and also Start Spark Session\n", 11 | "# Load all the required library\n", 12 | "from pyspark.sql import SparkSession" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "id": "d0223468-96e6-42d3-a40d-3b0a68af8226", 19 | "metadata": {}, 20 | "outputs": [ 21 | { 22 | "name": "stderr", 23 | "output_type": "stream", 24 | "text": [ 25 | "WARNING: An illegal reflective access operation has occurred\n", 26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 29 | "WARNING: All illegal access operations will be denied in a future release\n" 30 | ] 31 | }, 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" 37 | ] 38 | }, 39 | { 40 | "name": "stderr", 41 | "output_type": "stream", 42 | "text": [ 43 | "Ivy Default Cache set to: /root/.ivy2/cache\n", 44 | "The jars for the packages stored in: /root/.ivy2/jars\n", 45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n", 46 | "mysql#mysql-connector-java added as a dependency\n", 47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-07a625e6-7632-4ff2-81bf-49eb66cddb8d;1.0\n", 48 | "\tconfs: [default]\n", 49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n", 50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n", 51 | 
"\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n", 52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n", 53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n", 54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n", 55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n", 56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n", 57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n", 58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n", 59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n", 60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n", 61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n", 62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n", 63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n", 64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n", 65 | ":: resolution report :: resolve 3102ms :: artifacts dl 113ms\n", 66 | "\t:: modules in use:\n", 67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n", 68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from central in [default]\n", 69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n", 70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n", 71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n", 72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n", 73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n", 74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n", 75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n", 76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n", 77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n", 78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n", 79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n", 80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n", 81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n", 82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n", 83 | "\t---------------------------------------------------------------------\n", 84 | "\t| | modules || artifacts |\n", 85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", 86 | "\t---------------------------------------------------------------------\n", 87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n", 88 | "\t---------------------------------------------------------------------\n", 89 | ":: retrieving :: org.apache.spark#spark-submit-parent-07a625e6-7632-4ff2-81bf-49eb66cddb8d\n", 90 | "\tconfs: [default]\n", 91 | "\t0 artifacts copied, 15 already retrieved (0kB/45ms)\n", 92 | "23/08/02 11:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 94 | "Setting default log level to \"WARN\".\n", 95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", 96 | "23/08/02 11:06:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "#Start Spark Session\n", 102 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n", 103 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n", 104 | " .getOrCreate()\n", 105 | "sqlContext = SparkSession(spark)\n", 106 | "#Dont Show warning only error\n", 107 | "spark.sparkContext.setLogLevel(\"WARN\")" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 3, 113 | "id": "49bdd94c-a4a5-41ea-89c7-0e36c173cae1", 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/html": [ 119 | "\n", 120 | "
\n", 121 | "

SparkSession - in-memory

\n", 122 | " \n", 123 | "
\n", 124 | "

SparkContext

\n", 125 | "\n", 126 | "

Spark UI

\n", 127 | "\n", 128 | "
\n", 129 | "
Version
\n", 130 | "
v3.2.1
\n", 131 | "
Master
\n", 132 | "
local[*]
\n", 133 | "
AppName
\n", 134 | "
chapter12
\n", 135 | "
\n", 136 | "
\n", 137 | " \n", 138 | "
\n", 139 | " " 140 | ], 141 | "text/plain": [ 142 | "" 143 | ] 144 | }, 145 | "execution_count": 3, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "spark" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 6, 157 | "id": "3fd21dc9-d9df-4819-98ce-5efd44bf31c7", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n", 162 | "KAFKA_TOPIC = \"News_XYZ_Technology\"" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 8, 168 | "id": "540b5af6-09e5-4269-a9de-dc03f5bd2b28", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "df = spark.readStream.format(\"kafka\") \\\n", 173 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n", 174 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n", 175 | " .option(\"startingOffsets\", \"earliest\") \\\n", 176 | " .load()" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "id": "01227128-d113-4641-8e0e-60630be4481d", 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stderr", 187 | "output_type": "stream", 188 | "text": [ 189 | "23/08/02 11:14:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", 190 | " \r" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "df.selectExpr(\"cast(value as string)\") \\\n", 196 | " .writeStream \\\n", 197 | " .format(\"csv\") \\\n", 198 | " .option(\"checkpointLocation\", \"/opt/spark/SparkETL/Chapter12/csv_checkpoint\") \\\n", 199 | " .option(\"path\", \"/opt/spark/SparkETL/Chapter12/csv_data\") \\\n", 200 | " .outputMode(\"append\") \\\n", 201 | " .start() \\\n", 202 | " .awaitTermination()" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "id": "787af49a-d12e-4b15-8277-e2b5aa894c81", 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [] 212 | } 213 | ], 214 | "metadata": { 215 | "kernelspec": { 216 | "display_name": "Python 3 (ipykernel)", 217 | "language": "python", 218 | "name": "python3" 219 | }, 220 | "language_info": { 221 | "codemirror_mode": { 222 | "name": "ipython", 223 | "version": 3 224 | }, 225 | "file_extension": ".py", 226 | "mimetype": "text/x-python", 227 | "name": "python", 228 | "nbconvert_exporter": "python", 229 | "pygments_lexer": "ipython3", 230 | "version": "3.8.13" 231 | } 232 | }, 233 | "nbformat": 4, 234 | "nbformat_minor": 5 235 | } 236 | -------------------------------------------------------------------------------- /Chapter12/Chapter12_2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "0032fcbf-b834-4e38-a99a-269f80ae104f", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# First Load all the required library and also Start Spark Session\n", 11 | "# Load all the required library\n", 12 | "from pyspark.sql import SparkSession" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "id": "d0223468-96e6-42d3-a40d-3b0a68af8226", 19 | "metadata": {}, 20 | "outputs": [ 21 | { 22 | "name": "stderr", 23 | "output_type": "stream", 24 | "text": [ 25 | "WARNING: An illegal reflective access operation has occurred\n", 26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor 
java.nio.DirectByteBuffer(long,int)\n", 27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 29 | "WARNING: All illegal access operations will be denied in a future release\n" 30 | ] 31 | }, 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" 37 | ] 38 | }, 39 | { 40 | "name": "stderr", 41 | "output_type": "stream", 42 | "text": [ 43 | "Ivy Default Cache set to: /root/.ivy2/cache\n", 44 | "The jars for the packages stored in: /root/.ivy2/jars\n", 45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n", 46 | "mysql#mysql-connector-java added as a dependency\n", 47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-5bd2f33a-cc11-4f99-9a05-9717c077d03e;1.0\n", 48 | "\tconfs: [default]\n", 49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n", 50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n", 51 | "\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n", 52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n", 53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n", 54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n", 55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n", 56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n", 57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n", 58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n", 59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n", 60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n", 61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n", 62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n", 63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n", 64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n", 65 | ":: resolution report :: resolve 4813ms :: artifacts dl 58ms\n", 66 | "\t:: modules in use:\n", 67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n", 68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from central in [default]\n", 69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n", 70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n", 71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n", 72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n", 73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n", 74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n", 75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n", 76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n", 77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n", 78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n", 79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n", 80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n", 81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n", 82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n", 83 | 
"\t---------------------------------------------------------------------\n", 84 | "\t| | modules || artifacts |\n", 85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", 86 | "\t---------------------------------------------------------------------\n", 87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n", 88 | "\t---------------------------------------------------------------------\n", 89 | ":: retrieving :: org.apache.spark#spark-submit-parent-5bd2f33a-cc11-4f99-9a05-9717c077d03e\n", 90 | "\tconfs: [default]\n", 91 | "\t0 artifacts copied, 15 already retrieved (0kB/28ms)\n", 92 | "23/08/02 11:33:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 94 | "Setting default log level to \"WARN\".\n", 95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", 96 | "23/08/02 11:33:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n", 97 | "23/08/02 11:33:37 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "#Start Spark Session\n", 103 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n", 104 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n", 105 | " .getOrCreate()\n", 106 | "sqlContext = SparkSession(spark)\n", 107 | "#Dont Show warning only error\n", 108 | "spark.sparkContext.setLogLevel(\"WARN\")" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 3, 114 | "id": "49bdd94c-a4a5-41ea-89c7-0e36c173cae1", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/html": [ 120 | "\n", 121 | "
\n", 122 | "

SparkSession - in-memory

\n", 123 | " \n", 124 | "
\n", 125 | "

SparkContext

\n", 126 | "\n", 127 | "

Spark UI

\n", 128 | "\n", 129 | "
\n", 130 | "
Version
\n", 131 | "
v3.2.1
\n", 132 | "
Master
\n", 133 | "
local[*]
\n", 134 | "
AppName
\n", 135 | "
chapter12
\n", 136 | "
\n", 137 | "
\n", 138 | " \n", 139 | "
\n", 140 | " " 141 | ], 142 | "text/plain": [ 143 | "" 144 | ] 145 | }, 146 | "execution_count": 3, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "spark" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 4, 158 | "id": "3fd21dc9-d9df-4819-98ce-5efd44bf31c7", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n", 163 | "KAFKA_TOPIC = \"News_XYZ_Technology\"" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 7, 169 | "id": "540b5af6-09e5-4269-a9de-dc03f5bd2b28", 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "df = spark.readStream.format(\"kafka\") \\\n", 174 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n", 175 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n", 176 | " .option(\"startingOffsets\", \"earliest\") \\\n", 177 | " .load()" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "787af49a-d12e-4b15-8277-e2b5aa894c81", 184 | "metadata": {}, 185 | "outputs": [ 186 | { 187 | "name": "stderr", 188 | "output_type": "stream", 189 | "text": [ 190 | "23/08/02 11:41:05 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-a868c49c-a037-4b16-b53a-0f9948351d7b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", 191 | "23/08/02 11:41:05 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", 192 | " \r" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "def foreach_batch_function(df, epoch_id):\n", 198 | " df.write \\\n", 199 | " .format(\"jdbc\") \\\n", 200 | " .option(\"driver\",\"com.mysql.cj.jdbc.Driver\") \\\n", 201 | " .option(\"url\", \"jdbc:mysql://192.168.1.102:3306/DATAENG\") \\\n", 202 | " .option(\"dbtable\", \"StreamMessagesKafka\") \\\n", 203 | " .option(\"user\", \"root\") \\\n", 204 | " .option(\"password\", \"mysql\") \\\n", 205 | " .save()\n", 206 | " pass\n", 207 | "\n", 208 | "df.selectExpr(\"cast(value as string)\") \\\n", 209 | " .writeStream \\\n", 210 | " .outputMode(\"append\") \\\n", 211 | " .foreachBatch(foreach_batch_function)\\\n", 212 | " .start() \\\n", 213 | " .awaitTermination()\n" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "id": "7c687c45-4036-4eaa-a8bc-c4f996817047", 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [] 223 | } 224 | ], 225 | "metadata": { 226 | "kernelspec": { 227 | "display_name": "Python 3 (ipykernel)", 228 | "language": "python", 229 | "name": "python3" 230 | }, 231 | "language_info": { 232 | "codemirror_mode": { 233 | "name": "ipython", 234 | "version": 3 235 | }, 236 | "file_extension": ".py", 237 | "mimetype": "text/x-python", 238 | "name": "python", 239 | "nbconvert_exporter": "python", 240 | "pygments_lexer": "ipython3", 241 | "version": "3.8.13" 242 | } 243 | }, 244 | "nbformat": 4, 245 | "nbformat_minor": 5 246 | } 247 | -------------------------------------------------------------------------------- /Chapter3/chapter3-CovidData.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900", 6 | 
"metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "\n", 11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n", 12 | "\n", 13 | "Task to do \n", 14 | "1. Install required spark libraries\n", 15 | "2. Create connection with Azure Blob storage\n", 16 | "3. Read data from blob and store into dataframe\n", 17 | "4. Transform data\n", 18 | "5. write data into parquet file \n", 19 | "6. write data into JSON file\n", 20 | "\n", 21 | "Reference:\n", 22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 4, 28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# First Load all the required library and also Start Spark Session\n", 33 | "# Load all the required library\n", 34 | "from pyspark.sql import SparkSession" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 5, 40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stderr", 45 | "output_type": "stream", 46 | "text": [ 47 | "WARNING: An illegal reflective access operation has occurred\n", 48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 51 | "WARNING: All illegal access operations will be denied in a future release\n", 52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 53 | "Setting default log level to \"WARN\".\n", 54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", 55 | "23/03/09 21:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 56 | "23/03/09 21:50:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "#Start Spark Session\n", 62 | "spark = SparkSession.builder.appName(\"chapter3_1\").getOrCreate()\n", 63 | "sqlContext = SparkSession(spark)\n", 64 | "#Dont Show warning only error\n", 65 | "spark.sparkContext.setLogLevel(\"ERROR\")" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec", 71 | "metadata": {}, 72 | "source": [ 73 | "1. 
Create connection with Azure Blob storage" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 6, 79 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473", 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# Azure storage access info\n", 84 | "blob_account_name = \"pandemicdatalake\"\n", 85 | "blob_container_name = \"public\"\n", 86 | "blob_relative_path = \"curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet\"\n", 87 | "blob_sas_token = r\"\"" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 7, 93 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85", 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "Remote blob path: wasbs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "\n", 106 | "# Allow SPARK to read from Blob remotely\n", 107 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n", 108 | "spark.conf.set(\n", 109 | " 'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),\n", 110 | " blob_sas_token)\n", 111 | "print('Remote blob path: ' + wasbs_path)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443", 117 | "metadata": {}, 118 | "source": [ 119 | "3. Read data from blob and store into dataframe" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 8, 125 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc", 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stderr", 130 | "output_type": "stream", 131 | "text": [ 132 | " \r" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "df = spark.read.parquet(wasbs_path)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 6, 143 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728", 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "root\n", 151 | " |-- vendorID: string (nullable = true)\n", 152 | " |-- tpepPickupDateTime: timestamp (nullable = true)\n", 153 | " |-- tpepDropoffDateTime: timestamp (nullable = true)\n", 154 | " |-- passengerCount: integer (nullable = true)\n", 155 | " |-- tripDistance: double (nullable = true)\n", 156 | " |-- puLocationId: string (nullable = true)\n", 157 | " |-- doLocationId: string (nullable = true)\n", 158 | " |-- startLon: double (nullable = true)\n", 159 | " |-- startLat: double (nullable = true)\n", 160 | " |-- endLon: double (nullable = true)\n", 161 | " |-- endLat: double (nullable = true)\n", 162 | " |-- rateCodeId: integer (nullable = true)\n", 163 | " |-- storeAndFwdFlag: string (nullable = true)\n", 164 | " |-- paymentType: string (nullable = true)\n", 165 | " |-- fareAmount: double (nullable = true)\n", 166 | " |-- extra: double (nullable = true)\n", 167 | " |-- mtaTax: double (nullable = true)\n", 168 | " |-- improvementSurcharge: string (nullable = true)\n", 169 | " |-- tipAmount: double (nullable = true)\n", 170 | " |-- tollsAmount: double (nullable = true)\n", 171 | " |-- totalAmount: double (nullable = true)\n", 172 | " |-- puYear: integer (nullable = true)\n", 173 | " |-- puMonth: integer (nullable = true)\n", 174 | "\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "df.printSchema()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 9, 185 | 
"id": "c3a9c513-d6e1-4e59-9e56-a12da2533375", 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "name": "stderr", 190 | "output_type": "stream", 191 | "text": [ 192 | " \r" 193 | ] 194 | }, 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n", 200 | "| id| updated|confirmed|confirmed_change|deaths|deaths_change|recovered|recovered_change|latitude|longitude|iso2|iso3|country_region|admin_region_1|iso_subdivision|admin_region_2| load_time|\n", 201 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n", 202 | "|338995|2020-01-21| 262| null| 0| null| null| null| null| null|null|null| Worldwide| null| null| null|2023-03-09 00:04:...|\n", 203 | "|338996|2020-01-22| 313| 51| 0| 0| null| null| null| null|null|null| Worldwide| null| null| null|2023-03-09 00:04:...|\n", 204 | "+------+----------+---------+----------------+------+-------------+---------+----------------+--------+---------+----+----+--------------+--------------+---------------+--------------+--------------------+\n", 205 | "only showing top 2 rows\n", 206 | "\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "df.show(n=2)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92", 217 | "metadata": {}, 218 | "source": [ 219 | "4. Transform data" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 12, 225 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874", 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "Register the DataFrame as a SQL temporary view: source\n" 233 | ] 234 | } 235 | ], 236 | "source": [ 237 | "print('Register the DataFrame as a SQL temporary view: source')\n", 238 | "df.createOrReplaceTempView('tempSource')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 13, 244 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae", 245 | "metadata": {}, 246 | "outputs": [ 247 | { 248 | "name": "stdout", 249 | "output_type": "stream", 250 | "text": [ 251 | "Displaying top 10 rows: \n" 252 | ] 253 | }, 254 | { 255 | "data": { 256 | "text/plain": [ 257 | "DataFrame[id: int, updated: date, confirmed: int, confirmed_change: int, deaths: int, deaths_change: smallint, recovered: int, recovered_change: int, latitude: double, longitude: double, iso2: string, iso3: string, country_region: string, admin_region_1: string, iso_subdivision: string, admin_region_2: string, load_time: timestamp]" 258 | ] 259 | }, 260 | "metadata": {}, 261 | "output_type": "display_data" 262 | } 263 | ], 264 | "source": [ 265 | "print('Displaying top 10 rows: ')\n", 266 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 14, 272 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0", 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac", 282 | "metadata": {}, 283 | "source": [ 284 | "5. write data into parquet file \n", 285 | "6. 
write data into JSON file" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 15, 291 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262", 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "name": "stderr", 296 | "output_type": "stream", 297 | "text": [ 298 | " \r" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetdata\",mode='append')" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "id": "c417172f-ae9a-4331-b871-cb1599a27446", 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')" 314 | ] 315 | } 316 | ], 317 | "metadata": { 318 | "kernelspec": { 319 | "display_name": "Python 3 (ipykernel)", 320 | "language": "python", 321 | "name": "python3" 322 | }, 323 | "language_info": { 324 | "codemirror_mode": { 325 | "name": "ipython", 326 | "version": 3 327 | }, 328 | "file_extension": ".py", 329 | "mimetype": "text/x-python", 330 | "name": "python", 331 | "nbconvert_exporter": "python", 332 | "pygments_lexer": "ipython3", 333 | "version": "3.8.13" 334 | } 335 | }, 336 | "nbformat": 4, 337 | "nbformat_minor": 5 338 | } 339 | -------------------------------------------------------------------------------- /Chapter0/nyc_taxi_zone.csv: -------------------------------------------------------------------------------- 1 | "LocationID","Borough","Zone","service_zone" 2 | 1,"EWR","Newark Airport","EWR" 3 | 2,"Queens","Jamaica Bay","Boro Zone" 4 | 3,"Bronx","Allerton/Pelham Gardens","Boro Zone" 5 | 4,"Manhattan","Alphabet City","Yellow Zone" 6 | 5,"Staten Island","Arden Heights","Boro Zone" 7 | 6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone" 8 | 7,"Queens","Astoria","Boro Zone" 9 | 8,"Queens","Astoria Park","Boro Zone" 10 | 9,"Queens","Auburndale","Boro Zone" 11 | 10,"Queens","Baisley Park","Boro Zone" 12 | 11,"Brooklyn","Bath Beach","Boro Zone" 13 | 12,"Manhattan","Battery Park","Yellow Zone" 14 | 13,"Manhattan","Battery Park City","Yellow Zone" 15 | 14,"Brooklyn","Bay Ridge","Boro Zone" 16 | 15,"Queens","Bay Terrace/Fort Totten","Boro Zone" 17 | 16,"Queens","Bayside","Boro Zone" 18 | 17,"Brooklyn","Bedford","Boro Zone" 19 | 18,"Bronx","Bedford Park","Boro Zone" 20 | 19,"Queens","Bellerose","Boro Zone" 21 | 20,"Bronx","Belmont","Boro Zone" 22 | 21,"Brooklyn","Bensonhurst East","Boro Zone" 23 | 22,"Brooklyn","Bensonhurst West","Boro Zone" 24 | 23,"Staten Island","Bloomfield/Emerson Hill","Boro Zone" 25 | 24,"Manhattan","Bloomingdale","Yellow Zone" 26 | 25,"Brooklyn","Boerum Hill","Boro Zone" 27 | 26,"Brooklyn","Borough Park","Boro Zone" 28 | 27,"Queens","Breezy Point/Fort Tilden/Riis Beach","Boro Zone" 29 | 28,"Queens","Briarwood/Jamaica Hills","Boro Zone" 30 | 29,"Brooklyn","Brighton Beach","Boro Zone" 31 | 30,"Queens","Broad Channel","Boro Zone" 32 | 31,"Bronx","Bronx Park","Boro Zone" 33 | 32,"Bronx","Bronxdale","Boro Zone" 34 | 33,"Brooklyn","Brooklyn Heights","Boro Zone" 35 | 34,"Brooklyn","Brooklyn Navy Yard","Boro Zone" 36 | 35,"Brooklyn","Brownsville","Boro Zone" 37 | 36,"Brooklyn","Bushwick North","Boro Zone" 38 | 37,"Brooklyn","Bushwick South","Boro Zone" 39 | 38,"Queens","Cambria Heights","Boro Zone" 40 | 39,"Brooklyn","Canarsie","Boro Zone" 41 | 40,"Brooklyn","Carroll Gardens","Boro Zone" 42 | 41,"Manhattan","Central Harlem","Boro Zone" 43 | 42,"Manhattan","Central Harlem North","Boro Zone" 44 | 43,"Manhattan","Central Park","Yellow Zone" 
45 | 44,"Staten Island","Charleston/Tottenville","Boro Zone" 46 | 45,"Manhattan","Chinatown","Yellow Zone" 47 | 46,"Bronx","City Island","Boro Zone" 48 | 47,"Bronx","Claremont/Bathgate","Boro Zone" 49 | 48,"Manhattan","Clinton East","Yellow Zone" 50 | 49,"Brooklyn","Clinton Hill","Boro Zone" 51 | 50,"Manhattan","Clinton West","Yellow Zone" 52 | 51,"Bronx","Co-Op City","Boro Zone" 53 | 52,"Brooklyn","Cobble Hill","Boro Zone" 54 | 53,"Queens","College Point","Boro Zone" 55 | 54,"Brooklyn","Columbia Street","Boro Zone" 56 | 55,"Brooklyn","Coney Island","Boro Zone" 57 | 56,"Queens","Corona","Boro Zone" 58 | 57,"Queens","Corona","Boro Zone" 59 | 58,"Bronx","Country Club","Boro Zone" 60 | 59,"Bronx","Crotona Park","Boro Zone" 61 | 60,"Bronx","Crotona Park East","Boro Zone" 62 | 61,"Brooklyn","Crown Heights North","Boro Zone" 63 | 62,"Brooklyn","Crown Heights South","Boro Zone" 64 | 63,"Brooklyn","Cypress Hills","Boro Zone" 65 | 64,"Queens","Douglaston","Boro Zone" 66 | 65,"Brooklyn","Downtown Brooklyn/MetroTech","Boro Zone" 67 | 66,"Brooklyn","DUMBO/Vinegar Hill","Boro Zone" 68 | 67,"Brooklyn","Dyker Heights","Boro Zone" 69 | 68,"Manhattan","East Chelsea","Yellow Zone" 70 | 69,"Bronx","East Concourse/Concourse Village","Boro Zone" 71 | 70,"Queens","East Elmhurst","Boro Zone" 72 | 71,"Brooklyn","East Flatbush/Farragut","Boro Zone" 73 | 72,"Brooklyn","East Flatbush/Remsen Village","Boro Zone" 74 | 73,"Queens","East Flushing","Boro Zone" 75 | 74,"Manhattan","East Harlem North","Boro Zone" 76 | 75,"Manhattan","East Harlem South","Boro Zone" 77 | 76,"Brooklyn","East New York","Boro Zone" 78 | 77,"Brooklyn","East New York/Pennsylvania Avenue","Boro Zone" 79 | 78,"Bronx","East Tremont","Boro Zone" 80 | 79,"Manhattan","East Village","Yellow Zone" 81 | 80,"Brooklyn","East Williamsburg","Boro Zone" 82 | 81,"Bronx","Eastchester","Boro Zone" 83 | 82,"Queens","Elmhurst","Boro Zone" 84 | 83,"Queens","Elmhurst/Maspeth","Boro Zone" 85 | 84,"Staten Island","Eltingville/Annadale/Prince's Bay","Boro Zone" 86 | 85,"Brooklyn","Erasmus","Boro Zone" 87 | 86,"Queens","Far Rockaway","Boro Zone" 88 | 87,"Manhattan","Financial District North","Yellow Zone" 89 | 88,"Manhattan","Financial District South","Yellow Zone" 90 | 89,"Brooklyn","Flatbush/Ditmas Park","Boro Zone" 91 | 90,"Manhattan","Flatiron","Yellow Zone" 92 | 91,"Brooklyn","Flatlands","Boro Zone" 93 | 92,"Queens","Flushing","Boro Zone" 94 | 93,"Queens","Flushing Meadows-Corona Park","Boro Zone" 95 | 94,"Bronx","Fordham South","Boro Zone" 96 | 95,"Queens","Forest Hills","Boro Zone" 97 | 96,"Queens","Forest Park/Highland Park","Boro Zone" 98 | 97,"Brooklyn","Fort Greene","Boro Zone" 99 | 98,"Queens","Fresh Meadows","Boro Zone" 100 | 99,"Staten Island","Freshkills Park","Boro Zone" 101 | 100,"Manhattan","Garment District","Yellow Zone" 102 | 101,"Queens","Glen Oaks","Boro Zone" 103 | 102,"Queens","Glendale","Boro Zone" 104 | 103,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone" 105 | 104,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone" 106 | 105,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone" 107 | 106,"Brooklyn","Gowanus","Boro Zone" 108 | 107,"Manhattan","Gramercy","Yellow Zone" 109 | 108,"Brooklyn","Gravesend","Boro Zone" 110 | 109,"Staten Island","Great Kills","Boro Zone" 111 | 110,"Staten Island","Great Kills Park","Boro Zone" 112 | 111,"Brooklyn","Green-Wood Cemetery","Boro Zone" 113 | 112,"Brooklyn","Greenpoint","Boro Zone" 114 | 113,"Manhattan","Greenwich Village North","Yellow Zone" 
115 | 114,"Manhattan","Greenwich Village South","Yellow Zone" 116 | 115,"Staten Island","Grymes Hill/Clifton","Boro Zone" 117 | 116,"Manhattan","Hamilton Heights","Boro Zone" 118 | 117,"Queens","Hammels/Arverne","Boro Zone" 119 | 118,"Staten Island","Heartland Village/Todt Hill","Boro Zone" 120 | 119,"Bronx","Highbridge","Boro Zone" 121 | 120,"Manhattan","Highbridge Park","Boro Zone" 122 | 121,"Queens","Hillcrest/Pomonok","Boro Zone" 123 | 122,"Queens","Hollis","Boro Zone" 124 | 123,"Brooklyn","Homecrest","Boro Zone" 125 | 124,"Queens","Howard Beach","Boro Zone" 126 | 125,"Manhattan","Hudson Sq","Yellow Zone" 127 | 126,"Bronx","Hunts Point","Boro Zone" 128 | 127,"Manhattan","Inwood","Boro Zone" 129 | 128,"Manhattan","Inwood Hill Park","Boro Zone" 130 | 129,"Queens","Jackson Heights","Boro Zone" 131 | 130,"Queens","Jamaica","Boro Zone" 132 | 131,"Queens","Jamaica Estates","Boro Zone" 133 | 132,"Queens","JFK Airport","Airports" 134 | 133,"Brooklyn","Kensington","Boro Zone" 135 | 134,"Queens","Kew Gardens","Boro Zone" 136 | 135,"Queens","Kew Gardens Hills","Boro Zone" 137 | 136,"Bronx","Kingsbridge Heights","Boro Zone" 138 | 137,"Manhattan","Kips Bay","Yellow Zone" 139 | 138,"Queens","LaGuardia Airport","Airports" 140 | 139,"Queens","Laurelton","Boro Zone" 141 | 140,"Manhattan","Lenox Hill East","Yellow Zone" 142 | 141,"Manhattan","Lenox Hill West","Yellow Zone" 143 | 142,"Manhattan","Lincoln Square East","Yellow Zone" 144 | 143,"Manhattan","Lincoln Square West","Yellow Zone" 145 | 144,"Manhattan","Little Italy/NoLiTa","Yellow Zone" 146 | 145,"Queens","Long Island City/Hunters Point","Boro Zone" 147 | 146,"Queens","Long Island City/Queens Plaza","Boro Zone" 148 | 147,"Bronx","Longwood","Boro Zone" 149 | 148,"Manhattan","Lower East Side","Yellow Zone" 150 | 149,"Brooklyn","Madison","Boro Zone" 151 | 150,"Brooklyn","Manhattan Beach","Boro Zone" 152 | 151,"Manhattan","Manhattan Valley","Yellow Zone" 153 | 152,"Manhattan","Manhattanville","Boro Zone" 154 | 153,"Manhattan","Marble Hill","Boro Zone" 155 | 154,"Brooklyn","Marine Park/Floyd Bennett Field","Boro Zone" 156 | 155,"Brooklyn","Marine Park/Mill Basin","Boro Zone" 157 | 156,"Staten Island","Mariners Harbor","Boro Zone" 158 | 157,"Queens","Maspeth","Boro Zone" 159 | 158,"Manhattan","Meatpacking/West Village West","Yellow Zone" 160 | 159,"Bronx","Melrose South","Boro Zone" 161 | 160,"Queens","Middle Village","Boro Zone" 162 | 161,"Manhattan","Midtown Center","Yellow Zone" 163 | 162,"Manhattan","Midtown East","Yellow Zone" 164 | 163,"Manhattan","Midtown North","Yellow Zone" 165 | 164,"Manhattan","Midtown South","Yellow Zone" 166 | 165,"Brooklyn","Midwood","Boro Zone" 167 | 166,"Manhattan","Morningside Heights","Boro Zone" 168 | 167,"Bronx","Morrisania/Melrose","Boro Zone" 169 | 168,"Bronx","Mott Haven/Port Morris","Boro Zone" 170 | 169,"Bronx","Mount Hope","Boro Zone" 171 | 170,"Manhattan","Murray Hill","Yellow Zone" 172 | 171,"Queens","Murray Hill-Queens","Boro Zone" 173 | 172,"Staten Island","New Dorp/Midland Beach","Boro Zone" 174 | 173,"Queens","North Corona","Boro Zone" 175 | 174,"Bronx","Norwood","Boro Zone" 176 | 175,"Queens","Oakland Gardens","Boro Zone" 177 | 176,"Staten Island","Oakwood","Boro Zone" 178 | 177,"Brooklyn","Ocean Hill","Boro Zone" 179 | 178,"Brooklyn","Ocean Parkway South","Boro Zone" 180 | 179,"Queens","Old Astoria","Boro Zone" 181 | 180,"Queens","Ozone Park","Boro Zone" 182 | 181,"Brooklyn","Park Slope","Boro Zone" 183 | 182,"Bronx","Parkchester","Boro Zone" 184 | 183,"Bronx","Pelham Bay","Boro Zone" 185 | 
184,"Bronx","Pelham Bay Park","Boro Zone" 186 | 185,"Bronx","Pelham Parkway","Boro Zone" 187 | 186,"Manhattan","Penn Station/Madison Sq West","Yellow Zone" 188 | 187,"Staten Island","Port Richmond","Boro Zone" 189 | 188,"Brooklyn","Prospect-Lefferts Gardens","Boro Zone" 190 | 189,"Brooklyn","Prospect Heights","Boro Zone" 191 | 190,"Brooklyn","Prospect Park","Boro Zone" 192 | 191,"Queens","Queens Village","Boro Zone" 193 | 192,"Queens","Queensboro Hill","Boro Zone" 194 | 193,"Queens","Queensbridge/Ravenswood","Boro Zone" 195 | 194,"Manhattan","Randalls Island","Yellow Zone" 196 | 195,"Brooklyn","Red Hook","Boro Zone" 197 | 196,"Queens","Rego Park","Boro Zone" 198 | 197,"Queens","Richmond Hill","Boro Zone" 199 | 198,"Queens","Ridgewood","Boro Zone" 200 | 199,"Bronx","Rikers Island","Boro Zone" 201 | 200,"Bronx","Riverdale/North Riverdale/Fieldston","Boro Zone" 202 | 201,"Queens","Rockaway Park","Boro Zone" 203 | 202,"Manhattan","Roosevelt Island","Boro Zone" 204 | 203,"Queens","Rosedale","Boro Zone" 205 | 204,"Staten Island","Rossville/Woodrow","Boro Zone" 206 | 205,"Queens","Saint Albans","Boro Zone" 207 | 206,"Staten Island","Saint George/New Brighton","Boro Zone" 208 | 207,"Queens","Saint Michaels Cemetery/Woodside","Boro Zone" 209 | 208,"Bronx","Schuylerville/Edgewater Park","Boro Zone" 210 | 209,"Manhattan","Seaport","Yellow Zone" 211 | 210,"Brooklyn","Sheepshead Bay","Boro Zone" 212 | 211,"Manhattan","SoHo","Yellow Zone" 213 | 212,"Bronx","Soundview/Bruckner","Boro Zone" 214 | 213,"Bronx","Soundview/Castle Hill","Boro Zone" 215 | 214,"Staten Island","South Beach/Dongan Hills","Boro Zone" 216 | 215,"Queens","South Jamaica","Boro Zone" 217 | 216,"Queens","South Ozone Park","Boro Zone" 218 | 217,"Brooklyn","South Williamsburg","Boro Zone" 219 | 218,"Queens","Springfield Gardens North","Boro Zone" 220 | 219,"Queens","Springfield Gardens South","Boro Zone" 221 | 220,"Bronx","Spuyten Duyvil/Kingsbridge","Boro Zone" 222 | 221,"Staten Island","Stapleton","Boro Zone" 223 | 222,"Brooklyn","Starrett City","Boro Zone" 224 | 223,"Queens","Steinway","Boro Zone" 225 | 224,"Manhattan","Stuy Town/Peter Cooper Village","Yellow Zone" 226 | 225,"Brooklyn","Stuyvesant Heights","Boro Zone" 227 | 226,"Queens","Sunnyside","Boro Zone" 228 | 227,"Brooklyn","Sunset Park East","Boro Zone" 229 | 228,"Brooklyn","Sunset Park West","Boro Zone" 230 | 229,"Manhattan","Sutton Place/Turtle Bay North","Yellow Zone" 231 | 230,"Manhattan","Times Sq/Theatre District","Yellow Zone" 232 | 231,"Manhattan","TriBeCa/Civic Center","Yellow Zone" 233 | 232,"Manhattan","Two Bridges/Seward Park","Yellow Zone" 234 | 233,"Manhattan","UN/Turtle Bay South","Yellow Zone" 235 | 234,"Manhattan","Union Sq","Yellow Zone" 236 | 235,"Bronx","University Heights/Morris Heights","Boro Zone" 237 | 236,"Manhattan","Upper East Side North","Yellow Zone" 238 | 237,"Manhattan","Upper East Side South","Yellow Zone" 239 | 238,"Manhattan","Upper West Side North","Yellow Zone" 240 | 239,"Manhattan","Upper West Side South","Yellow Zone" 241 | 240,"Bronx","Van Cortlandt Park","Boro Zone" 242 | 241,"Bronx","Van Cortlandt Village","Boro Zone" 243 | 242,"Bronx","Van Nest/Morris Park","Boro Zone" 244 | 243,"Manhattan","Washington Heights North","Boro Zone" 245 | 244,"Manhattan","Washington Heights South","Boro Zone" 246 | 245,"Staten Island","West Brighton","Boro Zone" 247 | 246,"Manhattan","West Chelsea/Hudson Yards","Yellow Zone" 248 | 247,"Bronx","West Concourse","Boro Zone" 249 | 248,"Bronx","West Farms/Bronx River","Boro Zone" 250 | 249,"Manhattan","West 
Village","Yellow Zone" 251 | 250,"Bronx","Westchester Village/Unionport","Boro Zone" 252 | 251,"Staten Island","Westerleigh","Boro Zone" 253 | 252,"Queens","Whitestone","Boro Zone" 254 | 253,"Queens","Willets Point","Boro Zone" 255 | 254,"Bronx","Williamsbridge/Olinville","Boro Zone" 256 | 255,"Brooklyn","Williamsburg (North Side)","Boro Zone" 257 | 256,"Brooklyn","Williamsburg (South Side)","Boro Zone" 258 | 257,"Brooklyn","Windsor Terrace","Boro Zone" 259 | 258,"Queens","Woodhaven","Boro Zone" 260 | 259,"Bronx","Woodlawn/Wakefield","Boro Zone" 261 | 260,"Queens","Woodside","Boro Zone" 262 | 261,"Manhattan","World Trade Center","Yellow Zone" 263 | 262,"Manhattan","Yorkville East","Yellow Zone" 264 | 263,"Manhattan","Yorkville West","Yellow Zone" 265 | 264,"Unknown","NV","N/A" 266 | 265,"Unknown","NA","N/A" 267 | -------------------------------------------------------------------------------- /Chapter3/chapter3_YellowCab.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "\n", 11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n", 12 | "\n", 13 | "Task to do \n", 14 | "1. Install required spark libraries\n", 15 | "2. Create connection with Azure Blob storage\n", 16 | "3. Read data from blob and store into dataframe\n", 17 | "4. Transform data\n", 18 | "5. write data into parquet file \n", 19 | "6. write data into JSON file\n", 20 | "\n", 21 | "Reference:\n", 22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# First Load all the required library and also Start Spark Session\n", 33 | "# Load all the required library\n", 34 | "from pyspark.sql import SparkSession" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stderr", 45 | "output_type": "stream", 46 | "text": [ 47 | "WARNING: An illegal reflective access operation has occurred\n", 48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 51 | "WARNING: All illegal access operations will be denied in a future release\n", 52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 53 | "Setting default log level to \"WARN\".\n", 54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", 55 | "23/03/09 21:17:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "#Start Spark Session\n", 61 | "spark = SparkSession.builder.appName(\"chapter3\").getOrCreate()\n", 62 | "sqlContext = SparkSession(spark)\n", 63 | "#Dont Show warning only error\n", 64 | "spark.sparkContext.setLogLevel(\"ERROR\")" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec", 70 | "metadata": {}, 71 | "source": [ 72 | "1. Create connection with Azure Blob storage" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# Azure storage access info\n", 83 | "blob_account_name = \"azureopendatastorage\"\n", 84 | "blob_container_name = \"nyctlc\"\n", 85 | "blob_relative_path = \"yellow\"\n", 86 | "blob_sas_token = \"r\"" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 5, 92 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85", 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Remote blob path: wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "\n", 105 | "# Allow SPARK to read from Blob remotely\n", 106 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n", 107 | "spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),blob_sas_token)\n", 108 | "print('Remote blob path: ' + wasbs_path)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443", 114 | "metadata": {}, 115 | "source": [ 116 | "3. 
Read data from blob and store into dataframe" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 6, 122 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc", 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stderr", 127 | "output_type": "stream", 128 | "text": [ 129 | " \r" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "df = spark.read.parquet(wasbs_path)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728", 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "name": "stdout", 145 | "output_type": "stream", 146 | "text": [ 147 | "root\n", 148 | " |-- vendorID: string (nullable = true)\n", 149 | " |-- tpepPickupDateTime: timestamp (nullable = true)\n", 150 | " |-- tpepDropoffDateTime: timestamp (nullable = true)\n", 151 | " |-- passengerCount: integer (nullable = true)\n", 152 | " |-- tripDistance: double (nullable = true)\n", 153 | " |-- puLocationId: string (nullable = true)\n", 154 | " |-- doLocationId: string (nullable = true)\n", 155 | " |-- startLon: double (nullable = true)\n", 156 | " |-- startLat: double (nullable = true)\n", 157 | " |-- endLon: double (nullable = true)\n", 158 | " |-- endLat: double (nullable = true)\n", 159 | " |-- rateCodeId: integer (nullable = true)\n", 160 | " |-- storeAndFwdFlag: string (nullable = true)\n", 161 | " |-- paymentType: string (nullable = true)\n", 162 | " |-- fareAmount: double (nullable = true)\n", 163 | " |-- extra: double (nullable = true)\n", 164 | " |-- mtaTax: double (nullable = true)\n", 165 | " |-- improvementSurcharge: string (nullable = true)\n", 166 | " |-- tipAmount: double (nullable = true)\n", 167 | " |-- tollsAmount: double (nullable = true)\n", 168 | " |-- totalAmount: double (nullable = true)\n", 169 | " |-- puYear: integer (nullable = true)\n", 170 | " |-- puMonth: integer (nullable = true)\n", 171 | "\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "df.printSchema()" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 8, 182 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375", 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stderr", 187 | "output_type": "stream", 188 | "text": [ 189 | "[Stage 2:> (0 + 1) / 1]\r" 190 | ] 191 | }, 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n", 197 | "|vendorID| tpepPickupDateTime|tpepDropoffDateTime|passengerCount|tripDistance|puLocationId|doLocationId| startLon| startLat| endLon| endLat|rateCodeId|storeAndFwdFlag|paymentType|fareAmount|extra|mtaTax|improvementSurcharge|tipAmount|tollsAmount|totalAmount|puYear|puMonth|\n", 198 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n", 199 | "| CMT|2012-02-29 23:53:14|2012-03-01 00:00:43| 1| 2.1| null| null|-73.980494|40.730601|-73.983532|40.752311| 1| N| CSH| 7.3| 0.5| 0.5| null| 0.0| 0.0| 8.3| 2012| 3|\n", 200 | "| VTS|2012-03-17 08:01:00|2012-03-17 08:15:00| 1| 11.06| null| null|-73.986067|40.699862|-73.814838|40.737052| 1| null| CRD| 24.5| 0.0| 
0.5| null| 4.9| 0.0| 29.9| 2012| 3|\n", 201 | "+--------+-------------------+-------------------+--------------+------------+------------+------------+----------+---------+----------+---------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+------+-------+\n", 202 | "only showing top 2 rows\n", 203 | "\n" 204 | ] 205 | }, 206 | { 207 | "name": "stderr", 208 | "output_type": "stream", 209 | "text": [ 210 | " \r" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "df.show(n=2)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92", 221 | "metadata": {}, 222 | "source": [ 223 | "4. Transform data" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 9, 229 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874", 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "Register the DataFrame as a SQL temporary view: source\n" 237 | ] 238 | } 239 | ], 240 | "source": [ 241 | "print('Register the DataFrame as a SQL temporary view: source')\n", 242 | "df.createOrReplaceTempView('tempSource')" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 10, 248 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae", 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "Displaying top 10 rows: \n" 256 | ] 257 | }, 258 | { 259 | "data": { 260 | "text/plain": [ 261 | "DataFrame[vendorID: string, tpepPickupDateTime: timestamp, tpepDropoffDateTime: timestamp, passengerCount: int, tripDistance: double, puLocationId: string, doLocationId: string, startLon: double, startLat: double, endLon: double, endLat: double, rateCodeId: int, storeAndFwdFlag: string, paymentType: string, fareAmount: double, extra: double, mtaTax: double, improvementSurcharge: string, tipAmount: double, tollsAmount: double, totalAmount: double, puYear: int, puMonth: int]" 262 | ] 263 | }, 264 | "metadata": {}, 265 | "output_type": "display_data" 266 | } 267 | ], 268 | "source": [ 269 | "print('Displaying top 10 rows: ')\n", 270 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0", 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac", 286 | "metadata": {}, 287 | "source": [ 288 | "5. write data into parquet file \n", 289 | "6. 
write data into JSON file" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262", 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetdata\",mode='append')" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "c417172f-ae9a-4331-b871-cb1599a27446", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "id": "defd421f-3e4e-427e-88cd-97ecf299ac5b", 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "newdf.show()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "58739de3-0f36-4d94-bf7e-926a0de4d1b0", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | } 330 | ], 331 | "metadata": { 332 | "kernelspec": { 333 | "display_name": "Python 3 (ipykernel)", 334 | "language": "python", 335 | "name": "python3" 336 | }, 337 | "language_info": { 338 | "codemirror_mode": { 339 | "name": "ipython", 340 | "version": 3 341 | }, 342 | "file_extension": ".py", 343 | "mimetype": "text/x-python", 344 | "name": "python", 345 | "nbconvert_exporter": "python", 346 | "pygments_lexer": "ipython3", 347 | "version": "3.8.13" 348 | } 349 | }, 350 | "nbformat": 4, 351 | "nbformat_minor": 5 352 | } 353 | -------------------------------------------------------------------------------- /Chapter3/chapter3_PublicHoliday.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "\n", 11 | "# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)\n", 12 | "\n", 13 | "Task to do \n", 14 | "1. Install required spark libraries\n", 15 | "2. Create connection with Azure Blob storage\n", 16 | "3. Read data from blob and store into dataframe\n", 17 | "4. Transform data\n", 18 | "5. write data into parquet file \n", 19 | "6. 
write data into JSON file\n", 20 | "\n", 21 | "Reference:\n", 22 | "https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# First Load all the required library and also Start Spark Session\n", 33 | "# Load all the required library\n", 34 | "from pyspark.sql import SparkSession" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stderr", 45 | "output_type": "stream", 46 | "text": [ 47 | "WARNING: An illegal reflective access operation has occurred\n", 48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 51 | "WARNING: All illegal access operations will be denied in a future release\n", 52 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 53 | "Setting default log level to \"WARN\".\n", 54 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", 55 | "23/03/09 21:58:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 56 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n", 57 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n", 58 | "23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "#Start Spark Session\n", 64 | "spark = SparkSession.builder.appName(\"chapter3\").getOrCreate()\n", 65 | "sqlContext = SparkSession(spark)\n", 66 | "#Dont Show warning only error\n", 67 | "spark.sparkContext.setLogLevel(\"ERROR\")" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "f79815e9-7597-4a08-a8ea-098ad0e556ec", 73 | "metadata": {}, 74 | "source": [ 75 | "1. 
Create connection with Azure Blob storage" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 3, 81 | "id": "d6f6fd8e-6367-494f-9682-a901d2822473", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "# Azure storage for Holiday \n", 86 | "blob_account_name = \"azureopendatastorage\"\n", 87 | "blob_container_name = \"holidaydatacontainer\"\n", 88 | "blob_relative_path = \"Processed\"\n", 89 | "blob_sas_token = r\"\"" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 4, 95 | "id": "d384937f-4d7b-4a7d-804b-2b891e08ca85", 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "name": "stdout", 100 | "output_type": "stream", 101 | "text": [ 102 | "Remote blob path: wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed\n" 103 | ] 104 | } 105 | ], 106 | "source": [ 107 | "\n", 108 | "# Allow SPARK to read from Blob remotely\n", 109 | "wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\n", 110 | "spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),blob_sas_token)\n", 111 | "print('Remote blob path: ' + wasbs_path)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "id": "1a9ffb71-9748-4dba-b0ec-42ca3c491443", 117 | "metadata": {}, 118 | "source": [ 119 | "3. Read data from blob and store into dataframe" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 5, 125 | "id": "666621b6-0800-44d0-ad87-aa400f790ffc", 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stderr", 130 | "output_type": "stream", 131 | "text": [ 132 | " \r" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "df = spark.read.parquet(wasbs_path)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 6, 143 | "id": "e26c9728-6a0d-4d6e-9870-ec0bcbd49728", 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "root\n", 151 | " |-- countryOrRegion: string (nullable = true)\n", 152 | " |-- holidayName: string (nullable = true)\n", 153 | " |-- normalizeHolidayName: string (nullable = true)\n", 154 | " |-- isPaidTimeOff: boolean (nullable = true)\n", 155 | " |-- countryRegionCode: string (nullable = true)\n", 156 | " |-- date: timestamp (nullable = true)\n", 157 | "\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "df.printSchema()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 7, 168 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375", 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "name": "stderr", 173 | "output_type": "stream", 174 | "text": [ 175 | "[Stage 1:> (0 + 1) / 1]\r" 176 | ] 177 | }, 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 183 | "|countryOrRegion| holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode| date|\n", 184 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 185 | "| Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| AR|1970-01-01 00:00:00|\n", 186 | "| Australia| New Year's Day| New Year's Day| null| AU|1970-01-01 00:00:00|\n", 187 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 188 | "only showing top 2 rows\n", 189 | "\n" 190 | ] 191 | 
}, 192 | { 193 | "name": "stderr", 194 | "output_type": "stream", 195 | "text": [ 196 | " \r" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "df.show(n=2)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "id": "34762ecc-82b1-4b81-99f4-e480a43cbf92", 207 | "metadata": {}, 208 | "source": [ 209 | "4. Transform data" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 8, 215 | "id": "47d997c9-a0da-4936-9c62-ac3907efd874", 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "Register the DataFrame as a SQL temporary view: source\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "print('Register the DataFrame as a SQL temporary view: source')\n", 228 | "df.createOrReplaceTempView('tempSource')" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 9, 234 | "id": "1ba63625-c1a8-4e45-997f-84b2150c1eae", 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "name": "stdout", 239 | "output_type": "stream", 240 | "text": [ 241 | "Displaying top 10 rows: \n" 242 | ] 243 | }, 244 | { 245 | "data": { 246 | "text/plain": [ 247 | "DataFrame[countryOrRegion: string, holidayName: string, normalizeHolidayName: string, isPaidTimeOff: boolean, countryRegionCode: string, date: timestamp]" 248 | ] 249 | }, 250 | "metadata": {}, 251 | "output_type": "display_data" 252 | } 253 | ], 254 | "source": [ 255 | "print('Displaying top 10 rows: ')\n", 256 | "display(spark.sql('SELECT * FROM tempSource LIMIT 10'))" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 10, 262 | "id": "4e6bf6f7-bf6d-455a-86c1-d1759c03eda0", 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "ca5dc217-35ac-43c2-895c-ae0a9c8bddac", 272 | "metadata": {}, 273 | "source": [ 274 | "5. write data into parquet file \n", 275 | "6. 
write data into JSON file" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 11, 281 | "id": "84f36d51-366e-4e7c-80fa-d2a72475d262", 282 | "metadata": {}, 283 | "outputs": [ 284 | { 285 | "name": "stderr", 286 | "output_type": "stream", 287 | "text": [ 288 | " \r" 289 | ] 290 | } 291 | ], 292 | "source": [ 293 | "newdf.write.format(\"parquet\").option(\"compression\",\"snappy\").save(\"parquetholidaydata\",mode='append')" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 12, 299 | "id": "c417172f-ae9a-4331-b871-cb1599a27446", 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "name": "stderr", 304 | "output_type": "stream", 305 | "text": [ 306 | " \r" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "newdf.write.format(\"csv\").option(\"header\",\"true\").save(\"csvdata\",mode='append')" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 13, 317 | "id": "defd421f-3e4e-427e-88cd-97ecf299ac5b", 318 | "metadata": {}, 319 | "outputs": [ 320 | { 321 | "name": "stderr", 322 | "output_type": "stream", 323 | "text": [ 324 | "[Stage 8:> (0 + 1) / 1]\r" 325 | ] 326 | }, 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 332 | "|countryOrRegion| holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode| date|\n", 333 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 334 | "| Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| AR|1970-01-01 00:00:00|\n", 335 | "| Australia| New Year's Day| New Year's Day| null| AU|1970-01-01 00:00:00|\n", 336 | "| Austria| Neujahr| Neujahr| null| AT|1970-01-01 00:00:00|\n", 337 | "| Belgium| Nieuwjaarsdag| Nieuwjaarsdag| null| BE|1970-01-01 00:00:00|\n", 338 | "| Brazil| Ano novo| Ano novo| null| BR|1970-01-01 00:00:00|\n", 339 | "| Canada| New Year's Day| New Year's Day| null| CA|1970-01-01 00:00:00|\n", 340 | "| Colombia|Año Nuevo [New Ye...|Año Nuevo [New Ye...| null| CO|1970-01-01 00:00:00|\n", 341 | "| Croatia| Nova Godina| Nova Godina| null| HR|1970-01-01 00:00:00|\n", 342 | "| Czech| Nový rok| Nový rok| null| CZ|1970-01-01 00:00:00|\n", 343 | "| Denmark| Nytårsdag| Nytårsdag| null| DK|1970-01-01 00:00:00|\n", 344 | "+---------------+--------------------+--------------------+-------------+-----------------+-------------------+\n", 345 | "\n" 346 | ] 347 | }, 348 | { 349 | "name": "stderr", 350 | "output_type": "stream", 351 | "text": [ 352 | " \r" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "newdf.show()" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 15, 363 | "id": "58739de3-0f36-4d94-bf7e-926a0de4d1b0", 364 | "metadata": {}, 365 | "outputs": [ 366 | { 367 | "name": "stderr", 368 | "output_type": "stream", 369 | "text": [ 370 | " \r" 371 | ] 372 | }, 373 | { 374 | "data": { 375 | "text/plain": [ 376 | "69557" 377 | ] 378 | }, 379 | "execution_count": 15, 380 | "metadata": {}, 381 | "output_type": "execute_result" 382 | } 383 | ], 384 | "source": [ 385 | "df.count()" 386 | ] 387 | } 388 | ], 389 | "metadata": { 390 | "kernelspec": { 391 | "display_name": "Python 3 (ipykernel)", 392 | "language": "python", 393 | "name": "python3" 394 | }, 395 | "language_info": { 396 | "codemirror_mode": { 397 | "name": "ipython", 398 | "version": 3 399 | }, 400 | "file_extension": ".py", 401 | "mimetype": "text/x-python", 402 | 
"name": "python", 403 | "nbconvert_exporter": "python", 404 | "pygments_lexer": "ipython3", 405 | "version": "3.8.13" 406 | } 407 | }, 408 | "nbformat": 4, 409 | "nbformat_minor": 5 410 | } 411 | -------------------------------------------------------------------------------- /Chapter12/Chapter12.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "b139ba1c-52b5-4bdd-96b7-309a78d30421", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# First Load all the required library and also Start Spark Session\n", 11 | "# Load all the required library\n", 12 | "from pyspark.sql import SparkSession" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "id": "2e81bfed-7faa-4615-91a0-0761bd2e827a", 19 | "metadata": {}, 20 | "outputs": [ 21 | { 22 | "name": "stderr", 23 | "output_type": "stream", 24 | "text": [ 25 | "WARNING: An illegal reflective access operation has occurred\n", 26 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 27 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 28 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 29 | "WARNING: All illegal access operations will be denied in a future release\n" 30 | ] 31 | }, 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" 37 | ] 38 | }, 39 | { 40 | "name": "stderr", 41 | "output_type": "stream", 42 | "text": [ 43 | "Ivy Default Cache set to: /root/.ivy2/cache\n", 44 | "The jars for the packages stored in: /root/.ivy2/jars\n", 45 | "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n", 46 | "mysql#mysql-connector-java added as a dependency\n", 47 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-493c4a9c-0316-4f83-965c-fea5e02a6e4d;1.0\n", 48 | "\tconfs: [default]\n", 49 | "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central\n", 50 | "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central\n", 51 | "\tfound org.apache.kafka#kafka-clients;2.8.0 in central\n", 52 | "\tfound org.lz4#lz4-java;1.7.1 in central\n", 53 | "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n", 54 | "\tfound org.slf4j#slf4j-api;1.7.30 in central\n", 55 | "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.1 in central\n", 56 | "\tfound org.spark-project.spark#unused;1.0.0 in central\n", 57 | "\tfound org.apache.hadoop#hadoop-client-api;3.3.1 in central\n", 58 | "\tfound org.apache.htrace#htrace-core4;4.1.0-incubating in central\n", 59 | "\tfound commons-logging#commons-logging;1.1.3 in central\n", 60 | "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n", 61 | "\tfound org.apache.commons#commons-pool2;2.6.2 in central\n", 62 | "\tfound mysql#mysql-connector-java;8.0.32 in central\n", 63 | "\tfound com.mysql#mysql-connector-j;8.0.32 in central\n", 64 | "\tfound com.google.protobuf#protobuf-java;3.21.9 in central\n", 65 | ":: resolution report :: resolve 1765ms :: artifacts dl 19ms\n", 66 | "\t:: modules in use:\n", 67 | "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n", 68 | "\tcom.google.protobuf#protobuf-java;3.21.9 from 
central in [default]\n", 69 | "\tcom.mysql#mysql-connector-j;8.0.32 from central in [default]\n", 70 | "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n", 71 | "\tmysql#mysql-connector-java;8.0.32 from central in [default]\n", 72 | "\torg.apache.commons#commons-pool2;2.6.2 from central in [default]\n", 73 | "\torg.apache.hadoop#hadoop-client-api;3.3.1 from central in [default]\n", 74 | "\torg.apache.hadoop#hadoop-client-runtime;3.3.1 from central in [default]\n", 75 | "\torg.apache.htrace#htrace-core4;4.1.0-incubating from central in [default]\n", 76 | "\torg.apache.kafka#kafka-clients;2.8.0 from central in [default]\n", 77 | "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 from central in [default]\n", 78 | "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 from central in [default]\n", 79 | "\torg.lz4#lz4-java;1.7.1 from central in [default]\n", 80 | "\torg.slf4j#slf4j-api;1.7.30 from central in [default]\n", 81 | "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n", 82 | "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n", 83 | "\t---------------------------------------------------------------------\n", 84 | "\t| | modules || artifacts |\n", 85 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", 86 | "\t---------------------------------------------------------------------\n", 87 | "\t| default | 16 | 0 | 0 | 0 || 15 | 0 |\n", 88 | "\t---------------------------------------------------------------------\n", 89 | ":: retrieving :: org.apache.spark#spark-submit-parent-493c4a9c-0316-4f83-965c-fea5e02a6e4d\n", 90 | "\tconfs: [default]\n", 91 | "\t0 artifacts copied, 15 already retrieved (0kB/16ms)\n", 92 | "23/08/01 11:21:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 93 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 94 | "Setting default log level to \"WARN\".\n", 95 | "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "#Start Spark Session\n", 101 | "spark = SparkSession.builder.appName(\"chapter12\") \\\n", 102 | " .config(\"spark.jars.packages\", \"org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,mysql:mysql-connector-java:8.0.32\") \\\n", 103 | " .getOrCreate()\n", 104 | "sqlContext = SparkSession(spark)\n", 105 | "#Dont Show warning only error\n", 106 | "spark.sparkContext.setLogLevel(\"WARN\")" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "id": "8f9e8602-6b59-4ef7-ad2a-e4139dddd454", 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/html": [ 118 | "\n", 119 | "
\n", 120 | "

SparkSession - in-memory

\n", 121 | " \n", 122 | "
\n", 123 | "

SparkContext

\n", 124 | "\n", 125 | "

Spark UI

\n", 126 | "\n", 127 | "
\n", 128 | "
Version
\n", 129 | "
v3.2.1
\n", 130 | "
Master
\n", 131 | "
local[*]
\n", 132 | "
AppName
\n", 133 | "
chapter12
\n", 134 | "
\n", 135 | "
\n", 136 | " \n", 137 | "
\n", 138 | " " 139 | ], 140 | "text/plain": [ 141 | "" 142 | ] 143 | }, 144 | "execution_count": 3, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "spark" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "id": "38a809e9-d284-4ead-a441-a95fbe0fa9f0", 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "KAFKA_BOOTSTRAP_SERVERS = \"192.168.1.102:9092\"\n", 161 | "KAFKA_TOPIC = \"News_XYZ_Technology\"" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 5, 167 | "id": "adac4438-f229-4831-9ba0-fb987d232b07", 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "df = spark.readStream.format(\"kafka\") \\\n", 172 | " .option(\"kafka.bootstrap.servers\", KAFKA_BOOTSTRAP_SERVERS) \\\n", 173 | " .option(\"subscribe\", KAFKA_TOPIC) \\\n", 174 | " .option(\"startingOffsets\", \"earliest\") \\\n", 175 | " .load()" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 7, 181 | "id": "e6adc8ee-7126-4247-8f77-8798a6f59edd", 182 | "metadata": {}, 183 | "outputs": [ 184 | { 185 | "name": "stderr", 186 | "output_type": "stream", 187 | "text": [ 188 | "23/08/01 11:22:36 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-aa2ef7ab-e536-4b1b-aa4f-3f260d4e8f09. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", 189 | "23/08/01 11:22:36 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n" 190 | ] 191 | }, 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "" 196 | ] 197 | }, 198 | "execution_count": 7, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "df \\\n", 205 | " .writeStream \\\n", 206 | " .format(\"console\") \\\n", 207 | " .outputMode(\"append\") \\\n", 208 | " .start() " 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 8, 214 | "id": "70853b26-9f33-44b4-90df-af3152ea9a3c", 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "name": "stderr", 219 | "output_type": "stream", 220 | "text": [ 221 | "23/08/01 11:22:43 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-18deab41-0527-4d6e-8238-9cd5ccf2bfdf. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. 
Important to know deleting temp checkpoint folder is best effort.\n", 222 | "23/08/01 11:22:43 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n" 223 | ] 224 | }, 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "" 229 | ] 230 | }, 231 | "execution_count": 8, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | }, 235 | { 236 | "name": "stderr", 237 | "output_type": "stream", 238 | "text": [ 239 | " \r" 240 | ] 241 | }, 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | "-------------------------------------------\n", 247 | "Batch: 0\n", 248 | "-------------------------------------------\n", 249 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 250 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 251 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 252 | "|null| [53 70 61 72 6B 20]|News_XYZ_Technology| 0| 0|2023-07-30 01:20:...| 0|\n", 253 | "|null|[4C 61 6B 65 68 6...|News_XYZ_Technology| 0| 1|2023-07-30 01:20:...| 0|\n", 254 | "|null|[44 65 6C 74 61 2...|News_XYZ_Technology| 0| 2|2023-07-30 01:20:...| 0|\n", 255 | "|null| [4F 70 65 6E 41 49]|News_XYZ_Technology| 0| 3|2023-07-30 01:20:...| 0|\n", 256 | "|null|[44 6F 6C 6C 79 2...|News_XYZ_Technology| 0| 4|2023-07-30 01:21:...| 0|\n", 257 | "|null|[44 61 74 61 62 7...|News_XYZ_Technology| 0| 5|2023-07-30 01:21:...| 0|\n", 258 | "|null|[4D 69 63 72 6F 7...|News_XYZ_Technology| 0| 6|2023-07-30 01:21:...| 0|\n", 259 | "|null|[41 7A 75 72 65 2...|News_XYZ_Technology| 0| 7|2023-07-30 01:21:...| 0|\n", 260 | "|null|[41 57 53 20 4B 6...|News_XYZ_Technology| 0| 8|2023-07-30 01:21:...| 0|\n", 261 | "|null|[44 65 6C 74 61 2...|News_XYZ_Technology| 0| 9|2023-07-30 03:23:...| 0|\n", 262 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 10|2023-07-30 04:13:...| 0|\n", 263 | "|null| [41 70 61 63 68 65]|News_XYZ_Technology| 0| 11|2023-07-30 04:14:...| 0|\n", 264 | "|null|[69 63 65 62 65 7...|News_XYZ_Technology| 0| 12|2023-07-30 04:15:...| 0|\n", 265 | "|null| [68 75 64 69]|News_XYZ_Technology| 0| 13|2023-07-30 04:15:...| 0|\n", 266 | "|null| [64 65 6C 74 61]|News_XYZ_Technology| 0| 14|2023-07-30 04:15:...| 0|\n", 267 | "|null| [66 61 62 72 69 63]|News_XYZ_Technology| 0| 15|2023-07-30 04:15:...| 0|\n", 268 | "|null| [64 65 6C 74 61]|News_XYZ_Technology| 0| 16|2023-07-30 04:15:...| 0|\n", 269 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 17|2023-07-30 05:16:...| 0|\n", 270 | "|null|[41 70 61 63 68 6...|News_XYZ_Technology| 0| 18|2023-07-30 05:16:...| 0|\n", 271 | "|null|[41 7A 75 72 65 2...|News_XYZ_Technology| 0| 19|2023-07-30 05:40:...| 0|\n", 272 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 273 | "only showing top 20 rows\n", 274 | "\n" 275 | ] 276 | }, 277 | { 278 | "name": "stderr", 279 | "output_type": "stream", 280 | "text": [ 281 | " \r" 282 | ] 283 | }, 284 | { 285 | "name": "stdout", 286 | "output_type": "stream", 287 | "text": [ 288 | "-------------------------------------------\n", 289 | "Batch: 0\n", 290 | "-------------------------------------------\n", 291 | "+-----------------+\n", 292 | "| value|\n", 293 | "+-----------------+\n", 294 | "| Spark |\n", 295 | "| Lakehouse|\n", 296 | "| Delta Lake|\n", 297 | "| OpenAI|\n", 298 | "| Dolly Model|\n", 299 | "|Databrics Events |\n", 300 | "| Microsoft Fabric|\n", 
301 | "| Azure DW|\n", 302 | "| AWS Kinesis|\n", 303 | "| Delta Lake|\n", 304 | "| Apache HUdi|\n", 305 | "| Apache|\n", 306 | "| iceberg|\n", 307 | "| hudi|\n", 308 | "| delta|\n", 309 | "| fabric|\n", 310 | "| delta|\n", 311 | "| Apache Nifi|\n", 312 | "| Apache Beam|\n", 313 | "| Azure EventHub|\n", 314 | "+-----------------+\n", 315 | "only showing top 20 rows\n", 316 | "\n" 317 | ] 318 | }, 319 | { 320 | "name": "stderr", 321 | "output_type": "stream", 322 | "text": [ 323 | " \r" 324 | ] 325 | }, 326 | { 327 | "name": "stdout", 328 | "output_type": "stream", 329 | "text": [ 330 | "-------------------------------------------\n", 331 | "Batch: 1\n", 332 | "-------------------------------------------\n", 333 | "-------------------------------------------\n", 334 | "Batch: 1\n", 335 | "-------------------------------------------\n", 336 | "+----------+\n", 337 | "| value|\n", 338 | "+----------+\n", 339 | "|SQL Server|\n", 340 | "+----------+\n", 341 | "\n", 342 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 343 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 344 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 345 | "|null|[53 51 4C 20 53 6...|News_XYZ_Technology| 0| 21|2023-08-01 11:25:...| 0|\n", 346 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 347 | "\n" 348 | ] 349 | }, 350 | { 351 | "name": "stderr", 352 | "output_type": "stream", 353 | "text": [ 354 | " \r" 355 | ] 356 | }, 357 | { 358 | "name": "stdout", 359 | "output_type": "stream", 360 | "text": [ 361 | "-------------------------------------------\n", 362 | "Batch: 2\n", 363 | "-------------------------------------------\n", 364 | "-------------------------------------------\n", 365 | "Batch: 2\n", 366 | "-------------------------------------------\n", 367 | "+-----+\n", 368 | "|value|\n", 369 | "+-----+\n", 370 | "| HIVE|\n", 371 | "+-----+\n", 372 | "\n", 373 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n", 374 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 375 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n", 376 | "|null|[48 49 56 45]|News_XYZ_Technology| 0| 22|2023-08-01 11:25:...| 0|\n", 377 | "+----+-------------+-------------------+---------+------+--------------------+-------------+\n", 378 | "\n", 379 | "-------------------------------------------\n", 380 | "Batch: 3\n", 381 | "-------------------------------------------\n", 382 | "-------------------------------------------\n", 383 | "Batch: 3\n", 384 | "-------------------------------------------\n", 385 | "+-------+\n", 386 | "| value|\n", 387 | "+-------+\n", 388 | "|MongoDB|\n", 389 | "+-------+\n", 390 | "\n", 391 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 392 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 393 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 394 | "|null|[4D 6F 6E 67 6F 4...|News_XYZ_Technology| 0| 23|2023-08-01 11:26:...| 0|\n", 395 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 396 | "\n", 397 | "-------------------------------------------\n", 398 | "Batch: 4\n", 399 | "-------------------------------------------\n", 400 | 
"+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 401 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 402 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 403 | "|null|[4D 79 53 51 4C]|News_XYZ_Technology| 0| 24|2023-08-01 11:26:...| 0|\n", 404 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 405 | "\n", 406 | "-------------------------------------------\n", 407 | "Batch: 4\n", 408 | "-------------------------------------------\n", 409 | "+-----+\n", 410 | "|value|\n", 411 | "+-----+\n", 412 | "|MySQL|\n", 413 | "+-----+\n", 414 | "\n" 415 | ] 416 | } 417 | ], 418 | "source": [ 419 | "df.selectExpr(\"cast(value as string)\") \\\n", 420 | " .writeStream \\\n", 421 | " .format(\"console\") \\\n", 422 | " .outputMode(\"append\") \\\n", 423 | " .start() " 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "id": "a1deba01-d2f9-4519-a9de-1b895b9fbbf8", 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "name": "stderr", 434 | "output_type": "stream", 435 | "text": [ 436 | "23/08/01 11:32:38 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-8a7db8ca-72f7-416c-89f3-8a5f3a93fce0. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", 437 | "23/08/01 11:32:39 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", 438 | " \r" 439 | ] 440 | }, 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "-------------------------------------------\n", 446 | "Batch: 0\n", 447 | "-------------------------------------------\n", 448 | "+-----------------+\n", 449 | "| value|\n", 450 | "+-----------------+\n", 451 | "| Spark |\n", 452 | "| Lakehouse|\n", 453 | "| Delta Lake|\n", 454 | "| OpenAI|\n", 455 | "| Dolly Model|\n", 456 | "|Databrics Events |\n", 457 | "| Microsoft Fabric|\n", 458 | "| Azure DW|\n", 459 | "| AWS Kinesis|\n", 460 | "| Delta Lake|\n", 461 | "| Apache HUdi|\n", 462 | "| Apache|\n", 463 | "| iceberg|\n", 464 | "| hudi|\n", 465 | "| delta|\n", 466 | "| fabric|\n", 467 | "| delta|\n", 468 | "| Apache Nifi|\n", 469 | "| Apache Beam|\n", 470 | "| Azure EventHub|\n", 471 | "+-----------------+\n", 472 | "only showing top 20 rows\n", 473 | "\n" 474 | ] 475 | }, 476 | { 477 | "name": "stderr", 478 | "output_type": "stream", 479 | "text": [ 480 | " \r" 481 | ] 482 | }, 483 | { 484 | "name": "stdout", 485 | "output_type": "stream", 486 | "text": [ 487 | "-------------------------------------------\n", 488 | "Batch: 1\n", 489 | "-------------------------------------------\n", 490 | "+----------+\n", 491 | "| value|\n", 492 | "+----------+\n", 493 | "|PostgreSQL|\n", 494 | "+----------+\n", 495 | "\n", 496 | "-------------------------------------------\n", 497 | "Batch: 5\n", 498 | "-------------------------------------------\n", 499 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 500 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 501 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 502 | "|null|[50 6F 73 74 67 
7...|News_XYZ_Technology| 0| 25|2023-08-01 11:33:...| 0|\n", 503 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 504 | "\n", 505 | "-------------------------------------------\n", 506 | "Batch: 5\n", 507 | "-------------------------------------------\n", 508 | "+----------+\n", 509 | "| value|\n", 510 | "+----------+\n", 511 | "|PostgreSQL|\n", 512 | "+----------+\n", 513 | "\n" 514 | ] 515 | }, 516 | { 517 | "name": "stderr", 518 | "output_type": "stream", 519 | "text": [ 520 | " \r" 521 | ] 522 | }, 523 | { 524 | "name": "stdout", 525 | "output_type": "stream", 526 | "text": [ 527 | "-------------------------------------------\n", 528 | "Batch: 2\n", 529 | "-------------------------------------------\n", 530 | "-------------------------------------------\n", 531 | "Batch: 6\n", 532 | "-------------------------------------------\n", 533 | "+--------+\n", 534 | "| value|\n", 535 | "+--------+\n", 536 | "|CosmosDB|\n", 537 | "+--------+\n", 538 | "\n", 539 | "+--------+\n", 540 | "| value|\n", 541 | "+--------+\n", 542 | "|CosmosDB|\n", 543 | "+--------+\n", 544 | "\n" 545 | ] 546 | }, 547 | { 548 | "name": "stderr", 549 | "output_type": "stream", 550 | "text": [ 551 | " \r" 552 | ] 553 | }, 554 | { 555 | "name": "stdout", 556 | "output_type": "stream", 557 | "text": [ 558 | "-------------------------------------------\n", 559 | "Batch: 6\n", 560 | "-------------------------------------------\n", 561 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 562 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 563 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 564 | "|null|[43 6F 73 6D 6F 7...|News_XYZ_Technology| 0| 26|2023-08-01 11:33:...| 0|\n", 565 | "+----+--------------------+-------------------+---------+------+--------------------+-------------+\n", 566 | "\n" 567 | ] 568 | }, 569 | { 570 | "name": "stderr", 571 | "output_type": "stream", 572 | "text": [ 573 | " \r" 574 | ] 575 | }, 576 | { 577 | "name": "stdout", 578 | "output_type": "stream", 579 | "text": [ 580 | "-------------------------------------------\n", 581 | "Batch: 3\n", 582 | "-------------------------------------------\n", 583 | "+-----+\n", 584 | "|value|\n", 585 | "+-----+\n", 586 | "|Redis|\n", 587 | "+-----+\n", 588 | "\n", 589 | "-------------------------------------------\n", 590 | "Batch: 7\n", 591 | "-------------------------------------------\n", 592 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 593 | "| key| value| topic|partition|offset| timestamp|timestampType|\n", 594 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 595 | "|null|[52 65 64 69 73]|News_XYZ_Technology| 0| 27|2023-08-01 11:33:...| 0|\n", 596 | "+----+----------------+-------------------+---------+------+--------------------+-------------+\n", 597 | "\n", 598 | "-------------------------------------------\n", 599 | "Batch: 7\n", 600 | "-------------------------------------------\n", 601 | "+-----+\n", 602 | "|value|\n", 603 | "+-----+\n", 604 | "|Redis|\n", 605 | "+-----+\n", 606 | "\n" 607 | ] 608 | }, 609 | { 610 | "name": "stderr", 611 | "output_type": "stream", 612 | "text": [ 613 | "23/08/01 12:08:18 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 401471 ms exceeds timeout 120000 ms\n", 614 | "23/08/01 12:08:19 WARN 
SparkContext: Killing executors is not supported by current scheduler.\n", 615 | "23/08/01 16:55:17 ERROR AbstractCoordinator: [Consumer clientId=consumer-spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0-2, groupId=spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0] LeaveGroup request with Generation{generationId=1, memberId='consumer-spark-kafka-source-c822c496-ebc1-4120-89e0-d54ce1aba996--1766797160-driver-0-2-f645cdfa-e701-4c50-9946-8d36b34dc4eb', protocol='range'} failed with error: This is not the correct coordinator.\n", 616 | "23/08/01 16:55:17 ERROR AbstractCoordinator: [Consumer clientId=consumer-spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0-5, groupId=spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0] LeaveGroup request with Generation{generationId=1, memberId='consumer-spark-kafka-source-87b4591e-ee0d-479e-a974-37af7e90439a--1154230701-driver-0-5-30076fd4-0723-4731-aa21-5c71f1971e18', protocol='range'} failed with error: This is not the correct coordinator.\n" 617 | ] 618 | } 619 | ], 620 | "source": [ 621 | "df.selectExpr(\"cast(value as string)\") \\\n", 622 | " .writeStream \\\n", 623 | " .format(\"console\") \\\n", 624 | " .outputMode(\"append\") \\\n", 625 | " .start().awaitTermination()" 626 | ] 627 | } 628 | ], 629 | "metadata": { 630 | "kernelspec": { 631 | "display_name": "Python 3 (ipykernel)", 632 | "language": "python", 633 | "name": "python3" 634 | }, 635 | "language_info": { 636 | "codemirror_mode": { 637 | "name": "ipython", 638 | "version": 3 639 | }, 640 | "file_extension": ".py", 641 | "mimetype": "text/x-python", 642 | "name": "python", 643 | "nbconvert_exporter": "python", 644 | "pygments_lexer": "ipython3", 645 | "version": "3.8.13" 646 | } 647 | }, 648 | "nbformat": 4, 649 | "nbformat_minor": 5 650 | } 651 | -------------------------------------------------------------------------------- /Chapter5/chapter5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3a51c7bc-a588-4d9e-bb74-a26148c92900", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "\n", 11 | "# Chapter 5 -> Spark ETL with Hive tables\n", 12 | "\n", 13 | "Task to do \n", 14 | "1. Read data from one of the source (We take source as our MongoDB collection)\n", 15 | "2. Create dataframe from source \n", 16 | "3. Create Hive table from dataframe\n", 17 | "4. Create temp Hive view from dataframe\n", 18 | "5. Create global Hive view from dataframe\n", 19 | "6. List database and tables in database\n", 20 | "7. Drop all the created tables and views in default database\n", 21 | "8. Create Dataeng database and create global and temp view using SQL \n", 22 | "9. 
Access global table from other session\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "id": "ea22d710-40f5-4b64-803e-83c583aa3472", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# First Load all the required library and also Start Spark Session\n", 33 | "# Load all the required library\n", 34 | "from pyspark.sql import SparkSession" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "id": "368cc9c2-6f26-4d6b-93ff-d9ee7541869c", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stderr", 45 | "output_type": "stream", 46 | "text": [ 47 | "WARNING: An illegal reflective access operation has occurred\n", 48 | "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", 49 | "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", 50 | "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", 51 | "WARNING: All illegal access operations will be denied in a future release\n" 52 | ] 53 | }, 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | ":: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" 59 | ] 60 | }, 61 | { 62 | "name": "stderr", 63 | "output_type": "stream", 64 | "text": [ 65 | "Ivy Default Cache set to: /root/.ivy2/cache\n", 66 | "The jars for the packages stored in: /root/.ivy2/jars\n", 67 | "org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency\n", 68 | ":: resolving dependencies :: org.apache.spark#spark-submit-parent-f2ed49bd-7bd8-4327-8579-ed71257d3bbb;1.0\n", 69 | "\tconfs: [default]\n", 70 | "\tfound org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 in central\n", 71 | "\tfound org.mongodb#mongodb-driver-sync;4.0.5 in central\n", 72 | "\tfound org.mongodb#bson;4.0.5 in central\n", 73 | "\tfound org.mongodb#mongodb-driver-core;4.0.5 in central\n", 74 | ":: resolution report :: resolve 830ms :: artifacts dl 128ms\n", 75 | "\t:: modules in use:\n", 76 | "\torg.mongodb#bson;4.0.5 from central in [default]\n", 77 | "\torg.mongodb#mongodb-driver-core;4.0.5 from central in [default]\n", 78 | "\torg.mongodb#mongodb-driver-sync;4.0.5 from central in [default]\n", 79 | "\torg.mongodb.spark#mongo-spark-connector_2.12;3.0.1 from central in [default]\n", 80 | "\t---------------------------------------------------------------------\n", 81 | "\t| | modules || artifacts |\n", 82 | "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", 83 | "\t---------------------------------------------------------------------\n", 84 | "\t| default | 4 | 0 | 0 | 0 || 4 | 0 |\n", 85 | "\t---------------------------------------------------------------------\n", 86 | ":: retrieving :: org.apache.spark#spark-submit-parent-f2ed49bd-7bd8-4327-8579-ed71257d3bbb\n", 87 | "\tconfs: [default]\n", 88 | "\t0 artifacts copied, 4 already retrieved (0kB/30ms)\n", 89 | "23/03/15 07:51:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", 90 | "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", 91 | "Setting default log level to \"WARN\".\n", 92 | "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" 93 | ] 94 | } 95 | ], 96 | "source": [ 97 | "#Start Spark Session\n", 98 | "spark = SparkSession.builder.appName(\"chapter5\")\\\n", 99 | " .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1')\\\n", 100 | " .getOrCreate()\n", 101 | "sqlContext = SparkSession(spark)\n", 102 | "#Dont Show warning only error\n", 103 | "spark.sparkContext.setLogLevel(\"ERROR\")" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "id": "6d2a3b43-6035-4104-94fb-d895cf2a524a", 109 | "metadata": {}, 110 | "source": [ 111 | "1. Read data from one of the source (We take source as our MongoDB collection)\n", 112 | "2. Create dataframe from source " 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 61, 118 | "id": "42da13f2-c156-40e2-8ade-a55ff9d09f88", 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "\n", 125 | "
\n", 126 | "

SparkSession - in-memory

\n", 127 | " \n", 128 | "
\n", 129 | "

SparkContext

\n", 130 | "\n", 131 | "

Spark UI

\n", 132 | "\n", 133 | "
\n", 134 | "
Version
\n", 135 | "
v3.2.1
\n", 136 | "
Master
\n", 137 | "
local[*]
\n", 138 | "
AppName
\n", 139 | "
chapter5
\n", 140 | "
\n", 141 | "
\n", 142 | " \n", 143 | "
\n", 144 | " " 145 | ], 146 | "text/plain": [ 147 | "" 148 | ] 149 | }, 150 | "execution_count": 61, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "spark" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 3, 162 | "id": "1682daf8-0610-44bc-bd5e-9ce5f3814e20", 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "name": "stderr", 167 | "output_type": "stream", 168 | "text": [ 169 | " \r" 170 | ] 171 | } 172 | ], 173 | "source": [ 174 | "mongodf = spark.read.format(\"mongo\") \\\n", 175 | " .option(\"uri\", \"mongodb://root:mongodb@192.168.1.104:27017/\") \\\n", 176 | " .option(\"database\", \"dataengineering\") \\\n", 177 | " .option(\"collection\", \"employee\") \\\n", 178 | " .load()" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 4, 184 | "id": "b3944d5d-0420-4b96-844d-b395886a13f4", 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "root\n", 192 | " |-- _id: struct (nullable = true)\n", 193 | " | |-- oid: string (nullable = true)\n", 194 | " |-- department_id: string (nullable = true)\n", 195 | " |-- first_name: string (nullable = true)\n", 196 | " |-- id: string (nullable = true)\n", 197 | " |-- last_name: string (nullable = true)\n", 198 | " |-- salary: string (nullable = true)\n", 199 | "\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "mongodf.printSchema()" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 5, 210 | "id": "c3a9c513-d6e1-4e59-9e56-a12da2533375", 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "+--------------------+-------------+----------+---+---------+------+\n", 218 | "| _id|department_id|first_name| id|last_name|salary|\n", 219 | "+--------------------+-------------+----------+---+---------+------+\n", 220 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n", 221 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n", 222 | "+--------------------+-------------+----------+---+---------+------+\n", 223 | "only showing top 2 rows\n", 224 | "\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "mongodf.show(n=2)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "id": "d0fd01eb-361e-4a2a-a1b1-7b880344085d", 235 | "metadata": {}, 236 | "source": [ 237 | "3. Create Hive table from dataframe" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 7, 243 | "id": "2447464e-c348-4628-b9fe-cb1bc0c7562c", 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "name": "stderr", 248 | "output_type": "stream", 249 | "text": [ 250 | " \r" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "mongodf.write.saveAsTable(\"hivesampletable\")" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "id": "c68f8a0c-c720-49c2-afaa-0fc22c4cc633", 261 | "metadata": {}, 262 | "source": [ 263 | "4. Create temp Hive view from dataframe\n", 264 | "5. Create global Hive view from dataframe\n", 265 | "\n", 266 | "The difference between temporary and global temporary views being subtle, it can be a source of mild confusion among developers new to Spark. A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application. 
Yes, you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations." 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "id": "a1410b71-4c59-46e7-bf79-e2d645921aff", 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "mongodf.createOrReplaceGlobalTempView(\"sampleglobalview\")\n", 277 | "mongodf.createOrReplaceTempView(\"sampletempview\")" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "b567d95f-dcf1-4083-875e-670b941b8cbb", 283 | "metadata": {}, 284 | "source": [ 285 | "6. List database and tables in database" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 9, 291 | "id": "9f74008c-d30e-4c77-9b20-e0895ed013e9", 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "+---------+\n", 299 | "|namespace|\n", 300 | "+---------+\n", 301 | "| default|\n", 302 | "+---------+\n", 303 | "\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "spark.sql(\"show databases\").show()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 10, 314 | "id": "42bf3cbd-fcbc-406b-bc24-7df63c038bb4", 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "+---------+---------------+-----------+\n", 322 | "|namespace| tableName|isTemporary|\n", 323 | "+---------+---------------+-----------+\n", 324 | "| default|hivesampletable| false|\n", 325 | "| | sampletempview| true|\n", 326 | "+---------+---------------+-----------+\n", 327 | "\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "spark.sql(\"show tables\").show()" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 11, 338 | "id": "9f9788ab-4fe1-4964-a047-6a88ee7c0658", 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "[Database(name='default', description='default database', locationUri='file:/opt/spark/SparkETL/Chapter5/spark-warehouse')]" 345 | ] 346 | }, 347 | "execution_count": 11, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "spark.catalog.listDatabases()" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 12, 359 | "id": "ba7856c9-7ca1-4b54-84f2-c5f0f39b1db5", 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/plain": [ 365 | "[Table(name='hivesampletable', database='default', description=None, tableType='MANAGED', isTemporary=False),\n", 366 | " Table(name='sampletempview', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]" 367 | ] 368 | }, 369 | "execution_count": 12, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "spark.catalog.listTables()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 13, 381 | "id": "eb99c32d-bf12-4ae0-9254-7515824474d5", 382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "[Column(name='_id', description=None, dataType='struct', nullable=True, isPartition=False, isBucket=False),\n", 388 | " Column(name='department_id', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n", 389 | " Column(name='first_name', description=None, 
dataType='string', nullable=True, isPartition=False, isBucket=False),\n", 390 | " Column(name='id', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n", 391 | " Column(name='last_name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),\n", 392 | " Column(name='salary', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]" 393 | ] 394 | }, 395 | "execution_count": 13, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "spark.catalog.listColumns(\"hivesampletable\")" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 14, 407 | "id": "c757cc03-36f8-4c9f-8b88-e8a1c7485fc2", 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "name": "stdout", 412 | "output_type": "stream", 413 | "text": [ 414 | "+--------------------+-------------+----------+---+---------+------+\n", 415 | "| _id|department_id|first_name| id|last_name|salary|\n", 416 | "+--------------------+-------------+----------+---+---------+------+\n", 417 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n", 418 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n", 419 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|128922|\n", 420 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|130000|\n", 421 | "|{6407c350840f10d7...| 1002| Kelly| 3| Rosario| 42689|\n", 422 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|162825|\n", 423 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|170000|\n", 424 | "|{6407c350840f10d7...| 1002| Sherry| 5| Golden| 44101|\n", 425 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 79632|\n", 426 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 90000|\n", 427 | "|{6407c350840f10d7...| 1002| Diane| 7| Gordon| 74591|\n", 428 | "|{6407c350840f10d7...| 1005| Mercedes| 8|Rodriguez| 61048|\n", 429 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|137236|\n", 430 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|140000|\n", 431 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|150000|\n", 432 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|182065|\n", 433 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|190000|\n", 434 | "|{6407c350840f10d7...| 1002| Kevin| 11| Townsend|166861|\n", 435 | "|{6407c350840f10d7...| 1004| Joshua| 12| Johnson|123082|\n", 436 | "|{6407c350840f10d7...| 1001| Julie| 13| Sanchez|185663|\n", 437 | "+--------------------+-------------+----------+---+---------+------+\n", 438 | "only showing top 20 rows\n", 439 | "\n" 440 | ] 441 | } 442 | ], 443 | "source": [ 444 | "sqlContext.sql(\"SELECT * FROM hivesampletable\").show()" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 38, 450 | "id": "3fb5fafe-af29-49ed-b88a-99913818d940", 451 | "metadata": {}, 452 | "outputs": [ 453 | { 454 | "name": "stdout", 455 | "output_type": "stream", 456 | "text": [ 457 | "+--------------------+-------------+----------+---+---------+------+\n", 458 | "| _id|department_id|first_name| id|last_name|salary|\n", 459 | "+--------------------+-------------+----------+---+---------+------+\n", 460 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|110000|\n", 461 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|106119|\n", 462 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|128922|\n", 463 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|130000|\n", 464 | "|{6402d551bd67c9b0...| 1002| Kelly| 3| Rosario| 42689|\n", 465 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|162825|\n", 
466 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|170000|\n", 467 | "|{6402d551bd67c9b0...| 1002| Sherry| 5| Golden| 44101|\n", 468 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 79632|\n", 469 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 90000|\n", 470 | "|{6402d551bd67c9b0...| 1002| Diane| 7| Gordon| 74591|\n", 471 | "|{6402d551bd67c9b0...| 1005| Mercedes| 8|Rodriguez| 61048|\n", 472 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|137236|\n", 473 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|140000|\n", 474 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|150000|\n", 475 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|182065|\n", 476 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|190000|\n", 477 | "|{6402d551bd67c9b0...| 1002| Kevin| 11| Townsend|166861|\n", 478 | "|{6402d551bd67c9b0...| 1004| Joshua| 12| Johnson|123082|\n", 479 | "|{6402d551bd67c9b0...| 1001| Julie| 13| Sanchez|185663|\n", 480 | "+--------------------+-------------+----------+---+---------+------+\n", 481 | "only showing top 20 rows\n", 482 | "\n" 483 | ] 484 | } 485 | ], 486 | "source": [ 487 | "spark.table(\"sampletempview\").show()" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 15, 493 | "id": "7902e3d4-9c4e-4553-94bf-9dbc0dbff4a7", 494 | "metadata": {}, 495 | "outputs": [ 496 | { 497 | "name": "stdout", 498 | "output_type": "stream", 499 | "text": [ 500 | "+--------------------+-------------+----------+---+---------+------+\n", 501 | "| _id|department_id|first_name| id|last_name|salary|\n", 502 | "+--------------------+-------------+----------+---+---------+------+\n", 503 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|110000|\n", 504 | "|{6407c350840f10d7...| 1006| Todd| 1| Wilson|106119|\n", 505 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|128922|\n", 506 | "|{6407c350840f10d7...| 1005| Justin| 2| Simon|130000|\n", 507 | "|{6407c350840f10d7...| 1002| Kelly| 3| Rosario| 42689|\n", 508 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|162825|\n", 509 | "|{6407c350840f10d7...| 1004| Patricia| 4| Powell|170000|\n", 510 | "|{6407c350840f10d7...| 1002| Sherry| 5| Golden| 44101|\n", 511 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 79632|\n", 512 | "|{6407c350840f10d7...| 1005| Natasha| 6| Swanson| 90000|\n", 513 | "|{6407c350840f10d7...| 1002| Diane| 7| Gordon| 74591|\n", 514 | "|{6407c350840f10d7...| 1005| Mercedes| 8|Rodriguez| 61048|\n", 515 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|137236|\n", 516 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|140000|\n", 517 | "|{6407c350840f10d7...| 1001| Christy| 9| Mitchell|150000|\n", 518 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|182065|\n", 519 | "|{6407c350840f10d7...| 1006| Sean| 10| Crawford|190000|\n", 520 | "|{6407c350840f10d7...| 1002| Kevin| 11| Townsend|166861|\n", 521 | "|{6407c350840f10d7...| 1004| Joshua| 12| Johnson|123082|\n", 522 | "|{6407c350840f10d7...| 1001| Julie| 13| Sanchez|185663|\n", 523 | "+--------------------+-------------+----------+---+---------+------+\n", 524 | "only showing top 20 rows\n", 525 | "\n" 526 | ] 527 | } 528 | ], 529 | "source": [ 530 | "sqlContext.sql(\"SELECT * FROM global_temp.sampleglobalview\").show()" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "id": "a23c0bc5-e3e8-4ac3-94f3-d0e119e31f23", 536 | "metadata": {}, 537 | "source": [ 538 | "7. 
Drop all the created tables and views in default database" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 16, 544 | "id": "7e30ad8e-7ad8-4c79-b197-41719d21c654", 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "spark.catalog.dropGlobalTempView(\"sampleglobalview\")\n", 549 | "spark.catalog.dropTempView(\"sampletempview\")" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "id": "d4eba5f1-9fd8-4d92-97b6-638250d475e8", 555 | "metadata": {}, 556 | "source": [ 557 | "8. Create Dataeng database and create global and temp view using SQL " 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 17, 563 | "id": "5644d360-38b3-4a6e-867d-5a86772d268c", 564 | "metadata": {}, 565 | "outputs": [ 566 | { 567 | "data": { 568 | "text/plain": [ 569 | "DataFrame[]" 570 | ] 571 | }, 572 | "execution_count": 17, 573 | "metadata": {}, 574 | "output_type": "execute_result" 575 | } 576 | ], 577 | "source": [ 578 | "spark.sql(\"CREATE DATABASE dataeng\")\n", 579 | "spark.sql(\"USE dataeng\")" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": 18, 585 | "id": "36c017c5-efb4-4a20-9628-e4bcb1a2e068", 586 | "metadata": {}, 587 | "outputs": [ 588 | { 589 | "name": "stdout", 590 | "output_type": "stream", 591 | "text": [ 592 | "+---------+\n", 593 | "|namespace|\n", 594 | "+---------+\n", 595 | "| dataeng|\n", 596 | "| default|\n", 597 | "+---------+\n", 598 | "\n" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | "spark.sql(\"show databases\").show()" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 20, 609 | "id": "44082455-4774-41ff-b943-16a315b661f8", 610 | "metadata": {}, 611 | "outputs": [], 612 | "source": [ 613 | "mongodf.write.saveAsTable(\"hivesampletable\")\n", 614 | "mongodf.createOrReplaceGlobalTempView(\"sampleglobalview\")\n", 615 | "mongodf.createOrReplaceTempView(\"sampletempview\")" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 22, 621 | "id": "2dcdebaf-ad5d-4fa9-9ec2-49903c734386", 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "name": "stdout", 626 | "output_type": "stream", 627 | "text": [ 628 | "+---------+---------------+-----------+\n", 629 | "|namespace| tableName|isTemporary|\n", 630 | "+---------+---------------+-----------+\n", 631 | "| dataeng|hivesampletable| false|\n", 632 | "| | sampletempview| true|\n", 633 | "+---------+---------------+-----------+\n", 634 | "\n" 635 | ] 636 | } 637 | ], 638 | "source": [ 639 | "spark.sql(\"show tables\").show()" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "id": "16c741fd-d170-4efd-963d-20d917872e95", 645 | "metadata": {}, 646 | "source": [ 647 | "9. Access global table from other session" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": 23, 653 | "id": "2166405a-0b3d-427d-9ebd-35d03172397d", 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "newSpark = spark.newSession()" 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": 24, 663 | "id": "bbb3b2bc-0a4d-467b-ad53-04df44944db4", 664 | "metadata": {}, 665 | "outputs": [ 666 | { 667 | "data": { 668 | "text/html": [ 669 | "\n", 670 | "
\n", 671 | "

SparkSession - in-memory

\n", 672 | " \n", 673 | "
\n", 674 | "

SparkContext

\n", 675 | "\n", 676 | "

Spark UI

\n", 677 | "\n", 678 | "
\n", 679 | "
Version
\n", 680 | "
v3.2.1
\n", 681 | "
Master
\n", 682 | "
local[*]
\n", 683 | "
AppName
\n", 684 | "
chapter5
\n", 685 | "
\n", 686 | "
\n", 687 | " \n", 688 | "
\n", 689 | " " 690 | ], 691 | "text/plain": [ 692 | "" 693 | ] 694 | }, 695 | "execution_count": 24, 696 | "metadata": {}, 697 | "output_type": "execute_result" 698 | } 699 | ], 700 | "source": [ 701 | "newSpark" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": 59, 707 | "id": "8d170beb-4996-4c40-9374-bd0ba6747b7b", 708 | "metadata": {}, 709 | "outputs": [ 710 | { 711 | "name": "stdout", 712 | "output_type": "stream", 713 | "text": [ 714 | "+--------------------+-------------+----------+---+---------+------+\n", 715 | "| _id|department_id|first_name| id|last_name|salary|\n", 716 | "+--------------------+-------------+----------+---+---------+------+\n", 717 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|110000|\n", 718 | "|{6402d551bd67c9b0...| 1006| Todd| 1| Wilson|106119|\n", 719 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|128922|\n", 720 | "|{6402d551bd67c9b0...| 1005| Justin| 2| Simon|130000|\n", 721 | "|{6402d551bd67c9b0...| 1002| Kelly| 3| Rosario| 42689|\n", 722 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|162825|\n", 723 | "|{6402d551bd67c9b0...| 1004| Patricia| 4| Powell|170000|\n", 724 | "|{6402d551bd67c9b0...| 1002| Sherry| 5| Golden| 44101|\n", 725 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 79632|\n", 726 | "|{6402d551bd67c9b0...| 1005| Natasha| 6| Swanson| 90000|\n", 727 | "|{6402d551bd67c9b0...| 1002| Diane| 7| Gordon| 74591|\n", 728 | "|{6402d551bd67c9b0...| 1005| Mercedes| 8|Rodriguez| 61048|\n", 729 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|137236|\n", 730 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|140000|\n", 731 | "|{6402d551bd67c9b0...| 1001| Christy| 9| Mitchell|150000|\n", 732 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|182065|\n", 733 | "|{6402d551bd67c9b0...| 1006| Sean| 10| Crawford|190000|\n", 734 | "|{6402d551bd67c9b0...| 1002| Kevin| 11| Townsend|166861|\n", 735 | "|{6402d551bd67c9b0...| 1004| Joshua| 12| Johnson|123082|\n", 736 | "|{6402d551bd67c9b0...| 1001| Julie| 13| Sanchez|185663|\n", 737 | "+--------------------+-------------+----------+---+---------+------+\n", 738 | "only showing top 20 rows\n", 739 | "\n" 740 | ] 741 | } 742 | ], 743 | "source": [ 744 | "newSpark.sql(\"SELECT * FROM global_temp.sampleglobalview\").show()" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 25, 750 | "id": "b2fd67ca-8ced-4ddd-a47d-da71d5b2e626", 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "ename": "AnalysisException", 755 | "evalue": "Table or view not found: sampletempview; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation [sampletempview], [], false\n", 756 | "output_type": "error", 757 | "traceback": [ 758 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 759 | "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", 760 | "Cell \u001b[0;32mIn[25], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mnewSpark\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSELECT * FROM sampletempview\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39mshow()\n", 761 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/pyspark/sql/session.py:723\u001b[0m, in \u001b[0;36mSparkSession.sql\u001b[0;34m(self, sqlQuery)\u001b[0m\n\u001b[1;32m 707\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21msql\u001b[39m(\u001b[38;5;28mself\u001b[39m, 
sqlQuery):\n\u001b[1;32m 708\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Returns a :class:`DataFrame` representing the result of the given query.\u001b[39;00m\n\u001b[1;32m 709\u001b[0m \n\u001b[1;32m 710\u001b[0m \u001b[38;5;124;03m .. versionadded:: 2.0.0\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 721\u001b[0m \u001b[38;5;124;03m [Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]\u001b[39;00m\n\u001b[1;32m 722\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 723\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m DataFrame(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_jsparkSession\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msql\u001b[49m\u001b[43m(\u001b[49m\u001b[43msqlQuery\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_wrapped)\n", 762 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/py4j/java_gateway.py:1309\u001b[0m, in \u001b[0;36mJavaMember.__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1303\u001b[0m command \u001b[38;5;241m=\u001b[39m proto\u001b[38;5;241m.\u001b[39mCALL_COMMAND_NAME \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1304\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcommand_header \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1305\u001b[0m args_command \u001b[38;5;241m+\u001b[39m\\\n\u001b[1;32m 1306\u001b[0m proto\u001b[38;5;241m.\u001b[39mEND_COMMAND_PART\n\u001b[1;32m 1308\u001b[0m answer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mgateway_client\u001b[38;5;241m.\u001b[39msend_command(command)\n\u001b[0;32m-> 1309\u001b[0m return_value \u001b[38;5;241m=\u001b[39m \u001b[43mget_return_value\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1310\u001b[0m \u001b[43m \u001b[49m\u001b[43manswer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgateway_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtarget_id\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mname\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1312\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m temp_arg \u001b[38;5;129;01min\u001b[39;00m temp_args:\n\u001b[1;32m 1313\u001b[0m temp_arg\u001b[38;5;241m.\u001b[39m_detach()\n", 763 | "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/pyspark/sql/utils.py:117\u001b[0m, in \u001b[0;36mcapture_sql_exception..deco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 113\u001b[0m converted \u001b[38;5;241m=\u001b[39m convert_exception(e\u001b[38;5;241m.\u001b[39mjava_exception)\n\u001b[1;32m 114\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(converted, UnknownException):\n\u001b[1;32m 115\u001b[0m \u001b[38;5;66;03m# Hide where the exception came from that shows a non-Pythonic\u001b[39;00m\n\u001b[1;32m 116\u001b[0m \u001b[38;5;66;03m# JVM exception message.\u001b[39;00m\n\u001b[0;32m--> 117\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m converted \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;28mNone\u001b[39m\n\u001b[1;32m 118\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 119\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m\n", 764 | "\u001b[0;31mAnalysisException\u001b[0m: Table or view not found: sampletempview; line 1 pos 14;\n'Project 
[*]\n+- 'UnresolvedRelation [sampletempview], [], false\n" 765 | ] 766 | } 767 | ], 768 | "source": [ 769 | "newSpark.sql(\"SELECT * FROM sampletempview\").show()" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": null, 775 | "id": "067b9a71-0dea-457f-812d-557242af506e", 776 | "metadata": {}, 777 | "outputs": [], 778 | "source": [] 779 | } 780 | ], 781 | "metadata": { 782 | "kernelspec": { 783 | "display_name": "Python 3 (ipykernel)", 784 | "language": "python", 785 | "name": "python3" 786 | }, 787 | "language_info": { 788 | "codemirror_mode": { 789 | "name": "ipython", 790 | "version": 3 791 | }, 792 | "file_extension": ".py", 793 | "mimetype": "text/x-python", 794 | "name": "python", 795 | "nbconvert_exporter": "python", 796 | "pygments_lexer": "ipython3", 797 | "version": "3.8.13" 798 | } 799 | }, 800 | "nbformat": 4, 801 | "nbformat_minor": 5 802 | } 803 | -------------------------------------------------------------------------------- /Chapter0/nyc_taxi_zone.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "LocationID": 1, 4 | "Borough": "EWR", 5 | "Zone": "Newark Airport", 6 | "service_zone": "EWR" 7 | }, 8 | { 9 | "LocationID": 2, 10 | "Borough": "Queens", 11 | "Zone": "Jamaica Bay", 12 | "service_zone": "Boro Zone" 13 | }, 14 | { 15 | "LocationID": 3, 16 | "Borough": "Bronx", 17 | "Zone": "Allerton/Pelham Gardens", 18 | "service_zone": "Boro Zone" 19 | }, 20 | { 21 | "LocationID": 4, 22 | "Borough": "Manhattan", 23 | "Zone": "Alphabet City", 24 | "service_zone": "Yellow Zone" 25 | }, 26 | { 27 | "LocationID": 5, 28 | "Borough": "Staten Island", 29 | "Zone": "Arden Heights", 30 | "service_zone": "Boro Zone" 31 | }, 32 | { 33 | "LocationID": 6, 34 | "Borough": "Staten Island", 35 | "Zone": "Arrochar/Fort Wadsworth", 36 | "service_zone": "Boro Zone" 37 | }, 38 | { 39 | "LocationID": 7, 40 | "Borough": "Queens", 41 | "Zone": "Astoria", 42 | "service_zone": "Boro Zone" 43 | }, 44 | { 45 | "LocationID": 8, 46 | "Borough": "Queens", 47 | "Zone": "Astoria Park", 48 | "service_zone": "Boro Zone" 49 | }, 50 | { 51 | "LocationID": 9, 52 | "Borough": "Queens", 53 | "Zone": "Auburndale", 54 | "service_zone": "Boro Zone" 55 | }, 56 | { 57 | "LocationID": 10, 58 | "Borough": "Queens", 59 | "Zone": "Baisley Park", 60 | "service_zone": "Boro Zone" 61 | }, 62 | { 63 | "LocationID": 11, 64 | "Borough": "Brooklyn", 65 | "Zone": "Bath Beach", 66 | "service_zone": "Boro Zone" 67 | }, 68 | { 69 | "LocationID": 12, 70 | "Borough": "Manhattan", 71 | "Zone": "Battery Park", 72 | "service_zone": "Yellow Zone" 73 | }, 74 | { 75 | "LocationID": 13, 76 | "Borough": "Manhattan", 77 | "Zone": "Battery Park City", 78 | "service_zone": "Yellow Zone" 79 | }, 80 | { 81 | "LocationID": 14, 82 | "Borough": "Brooklyn", 83 | "Zone": "Bay Ridge", 84 | "service_zone": "Boro Zone" 85 | }, 86 | { 87 | "LocationID": 15, 88 | "Borough": "Queens", 89 | "Zone": "Bay Terrace/Fort Totten", 90 | "service_zone": "Boro Zone" 91 | }, 92 | { 93 | "LocationID": 16, 94 | "Borough": "Queens", 95 | "Zone": "Bayside", 96 | "service_zone": "Boro Zone" 97 | }, 98 | { 99 | "LocationID": 17, 100 | "Borough": "Brooklyn", 101 | "Zone": "Bedford", 102 | "service_zone": "Boro Zone" 103 | }, 104 | { 105 | "LocationID": 18, 106 | "Borough": "Bronx", 107 | "Zone": "Bedford Park", 108 | "service_zone": "Boro Zone" 109 | }, 110 | { 111 | "LocationID": 19, 112 | "Borough": "Queens", 113 | "Zone": "Bellerose", 114 | "service_zone": "Boro Zone" 115 | }, 116 | { 117 | 
"LocationID": 20, 118 | "Borough": "Bronx", 119 | "Zone": "Belmont", 120 | "service_zone": "Boro Zone" 121 | }, 122 | { 123 | "LocationID": 21, 124 | "Borough": "Brooklyn", 125 | "Zone": "Bensonhurst East", 126 | "service_zone": "Boro Zone" 127 | }, 128 | { 129 | "LocationID": 22, 130 | "Borough": "Brooklyn", 131 | "Zone": "Bensonhurst West", 132 | "service_zone": "Boro Zone" 133 | }, 134 | { 135 | "LocationID": 23, 136 | "Borough": "Staten Island", 137 | "Zone": "Bloomfield/Emerson Hill", 138 | "service_zone": "Boro Zone" 139 | }, 140 | { 141 | "LocationID": 24, 142 | "Borough": "Manhattan", 143 | "Zone": "Bloomingdale", 144 | "service_zone": "Yellow Zone" 145 | }, 146 | { 147 | "LocationID": 25, 148 | "Borough": "Brooklyn", 149 | "Zone": "Boerum Hill", 150 | "service_zone": "Boro Zone" 151 | }, 152 | { 153 | "LocationID": 26, 154 | "Borough": "Brooklyn", 155 | "Zone": "Borough Park", 156 | "service_zone": "Boro Zone" 157 | }, 158 | { 159 | "LocationID": 27, 160 | "Borough": "Queens", 161 | "Zone": "Breezy Point/Fort Tilden/Riis Beach", 162 | "service_zone": "Boro Zone" 163 | }, 164 | { 165 | "LocationID": 28, 166 | "Borough": "Queens", 167 | "Zone": "Briarwood/Jamaica Hills", 168 | "service_zone": "Boro Zone" 169 | }, 170 | { 171 | "LocationID": 29, 172 | "Borough": "Brooklyn", 173 | "Zone": "Brighton Beach", 174 | "service_zone": "Boro Zone" 175 | }, 176 | { 177 | "LocationID": 30, 178 | "Borough": "Queens", 179 | "Zone": "Broad Channel", 180 | "service_zone": "Boro Zone" 181 | }, 182 | { 183 | "LocationID": 31, 184 | "Borough": "Bronx", 185 | "Zone": "Bronx Park", 186 | "service_zone": "Boro Zone" 187 | }, 188 | { 189 | "LocationID": 32, 190 | "Borough": "Bronx", 191 | "Zone": "Bronxdale", 192 | "service_zone": "Boro Zone" 193 | }, 194 | { 195 | "LocationID": 33, 196 | "Borough": "Brooklyn", 197 | "Zone": "Brooklyn Heights", 198 | "service_zone": "Boro Zone" 199 | }, 200 | { 201 | "LocationID": 34, 202 | "Borough": "Brooklyn", 203 | "Zone": "Brooklyn Navy Yard", 204 | "service_zone": "Boro Zone" 205 | }, 206 | { 207 | "LocationID": 35, 208 | "Borough": "Brooklyn", 209 | "Zone": "Brownsville", 210 | "service_zone": "Boro Zone" 211 | }, 212 | { 213 | "LocationID": 36, 214 | "Borough": "Brooklyn", 215 | "Zone": "Bushwick North", 216 | "service_zone": "Boro Zone" 217 | }, 218 | { 219 | "LocationID": 37, 220 | "Borough": "Brooklyn", 221 | "Zone": "Bushwick South", 222 | "service_zone": "Boro Zone" 223 | }, 224 | { 225 | "LocationID": 38, 226 | "Borough": "Queens", 227 | "Zone": "Cambria Heights", 228 | "service_zone": "Boro Zone" 229 | }, 230 | { 231 | "LocationID": 39, 232 | "Borough": "Brooklyn", 233 | "Zone": "Canarsie", 234 | "service_zone": "Boro Zone" 235 | }, 236 | { 237 | "LocationID": 40, 238 | "Borough": "Brooklyn", 239 | "Zone": "Carroll Gardens", 240 | "service_zone": "Boro Zone" 241 | }, 242 | { 243 | "LocationID": 41, 244 | "Borough": "Manhattan", 245 | "Zone": "Central Harlem", 246 | "service_zone": "Boro Zone" 247 | }, 248 | { 249 | "LocationID": 42, 250 | "Borough": "Manhattan", 251 | "Zone": "Central Harlem North", 252 | "service_zone": "Boro Zone" 253 | }, 254 | { 255 | "LocationID": 43, 256 | "Borough": "Manhattan", 257 | "Zone": "Central Park", 258 | "service_zone": "Yellow Zone" 259 | }, 260 | { 261 | "LocationID": 44, 262 | "Borough": "Staten Island", 263 | "Zone": "Charleston/Tottenville", 264 | "service_zone": "Boro Zone" 265 | }, 266 | { 267 | "LocationID": 45, 268 | "Borough": "Manhattan", 269 | "Zone": "Chinatown", 270 | "service_zone": "Yellow Zone" 271 | }, 272 
| { 273 | "LocationID": 46, 274 | "Borough": "Bronx", 275 | "Zone": "City Island", 276 | "service_zone": "Boro Zone" 277 | }, 278 | { 279 | "LocationID": 47, 280 | "Borough": "Bronx", 281 | "Zone": "Claremont/Bathgate", 282 | "service_zone": "Boro Zone" 283 | }, 284 | { 285 | "LocationID": 48, 286 | "Borough": "Manhattan", 287 | "Zone": "Clinton East", 288 | "service_zone": "Yellow Zone" 289 | }, 290 | { 291 | "LocationID": 49, 292 | "Borough": "Brooklyn", 293 | "Zone": "Clinton Hill", 294 | "service_zone": "Boro Zone" 295 | }, 296 | { 297 | "LocationID": 50, 298 | "Borough": "Manhattan", 299 | "Zone": "Clinton West", 300 | "service_zone": "Yellow Zone" 301 | }, 302 | { 303 | "LocationID": 51, 304 | "Borough": "Bronx", 305 | "Zone": "Co-Op City", 306 | "service_zone": "Boro Zone" 307 | }, 308 | { 309 | "LocationID": 52, 310 | "Borough": "Brooklyn", 311 | "Zone": "Cobble Hill", 312 | "service_zone": "Boro Zone" 313 | }, 314 | { 315 | "LocationID": 53, 316 | "Borough": "Queens", 317 | "Zone": "College Point", 318 | "service_zone": "Boro Zone" 319 | }, 320 | { 321 | "LocationID": 54, 322 | "Borough": "Brooklyn", 323 | "Zone": "Columbia Street", 324 | "service_zone": "Boro Zone" 325 | }, 326 | { 327 | "LocationID": 55, 328 | "Borough": "Brooklyn", 329 | "Zone": "Coney Island", 330 | "service_zone": "Boro Zone" 331 | }, 332 | { 333 | "LocationID": 56, 334 | "Borough": "Queens", 335 | "Zone": "Corona", 336 | "service_zone": "Boro Zone" 337 | }, 338 | { 339 | "LocationID": 57, 340 | "Borough": "Queens", 341 | "Zone": "Corona", 342 | "service_zone": "Boro Zone" 343 | }, 344 | { 345 | "LocationID": 58, 346 | "Borough": "Bronx", 347 | "Zone": "Country Club", 348 | "service_zone": "Boro Zone" 349 | }, 350 | { 351 | "LocationID": 59, 352 | "Borough": "Bronx", 353 | "Zone": "Crotona Park", 354 | "service_zone": "Boro Zone" 355 | }, 356 | { 357 | "LocationID": 60, 358 | "Borough": "Bronx", 359 | "Zone": "Crotona Park East", 360 | "service_zone": "Boro Zone" 361 | }, 362 | { 363 | "LocationID": 61, 364 | "Borough": "Brooklyn", 365 | "Zone": "Crown Heights North", 366 | "service_zone": "Boro Zone" 367 | }, 368 | { 369 | "LocationID": 62, 370 | "Borough": "Brooklyn", 371 | "Zone": "Crown Heights South", 372 | "service_zone": "Boro Zone" 373 | }, 374 | { 375 | "LocationID": 63, 376 | "Borough": "Brooklyn", 377 | "Zone": "Cypress Hills", 378 | "service_zone": "Boro Zone" 379 | }, 380 | { 381 | "LocationID": 64, 382 | "Borough": "Queens", 383 | "Zone": "Douglaston", 384 | "service_zone": "Boro Zone" 385 | }, 386 | { 387 | "LocationID": 65, 388 | "Borough": "Brooklyn", 389 | "Zone": "Downtown Brooklyn/MetroTech", 390 | "service_zone": "Boro Zone" 391 | }, 392 | { 393 | "LocationID": 66, 394 | "Borough": "Brooklyn", 395 | "Zone": "DUMBO/Vinegar Hill", 396 | "service_zone": "Boro Zone" 397 | }, 398 | { 399 | "LocationID": 67, 400 | "Borough": "Brooklyn", 401 | "Zone": "Dyker Heights", 402 | "service_zone": "Boro Zone" 403 | }, 404 | { 405 | "LocationID": 68, 406 | "Borough": "Manhattan", 407 | "Zone": "East Chelsea", 408 | "service_zone": "Yellow Zone" 409 | }, 410 | { 411 | "LocationID": 69, 412 | "Borough": "Bronx", 413 | "Zone": "East Concourse/Concourse Village", 414 | "service_zone": "Boro Zone" 415 | }, 416 | { 417 | "LocationID": 70, 418 | "Borough": "Queens", 419 | "Zone": "East Elmhurst", 420 | "service_zone": "Boro Zone" 421 | }, 422 | { 423 | "LocationID": 71, 424 | "Borough": "Brooklyn", 425 | "Zone": "East Flatbush/Farragut", 426 | "service_zone": "Boro Zone" 427 | }, 428 | { 429 | "LocationID": 72, 
430 | "Borough": "Brooklyn", 431 | "Zone": "East Flatbush/Remsen Village", 432 | "service_zone": "Boro Zone" 433 | }, 434 | { 435 | "LocationID": 73, 436 | "Borough": "Queens", 437 | "Zone": "East Flushing", 438 | "service_zone": "Boro Zone" 439 | }, 440 | { 441 | "LocationID": 74, 442 | "Borough": "Manhattan", 443 | "Zone": "East Harlem North", 444 | "service_zone": "Boro Zone" 445 | }, 446 | { 447 | "LocationID": 75, 448 | "Borough": "Manhattan", 449 | "Zone": "East Harlem South", 450 | "service_zone": "Boro Zone" 451 | }, 452 | { 453 | "LocationID": 76, 454 | "Borough": "Brooklyn", 455 | "Zone": "East New York", 456 | "service_zone": "Boro Zone" 457 | }, 458 | { 459 | "LocationID": 77, 460 | "Borough": "Brooklyn", 461 | "Zone": "East New York/Pennsylvania Avenue", 462 | "service_zone": "Boro Zone" 463 | }, 464 | { 465 | "LocationID": 78, 466 | "Borough": "Bronx", 467 | "Zone": "East Tremont", 468 | "service_zone": "Boro Zone" 469 | }, 470 | { 471 | "LocationID": 79, 472 | "Borough": "Manhattan", 473 | "Zone": "East Village", 474 | "service_zone": "Yellow Zone" 475 | }, 476 | { 477 | "LocationID": 80, 478 | "Borough": "Brooklyn", 479 | "Zone": "East Williamsburg", 480 | "service_zone": "Boro Zone" 481 | }, 482 | { 483 | "LocationID": 81, 484 | "Borough": "Bronx", 485 | "Zone": "Eastchester", 486 | "service_zone": "Boro Zone" 487 | }, 488 | { 489 | "LocationID": 82, 490 | "Borough": "Queens", 491 | "Zone": "Elmhurst", 492 | "service_zone": "Boro Zone" 493 | }, 494 | { 495 | "LocationID": 83, 496 | "Borough": "Queens", 497 | "Zone": "Elmhurst/Maspeth", 498 | "service_zone": "Boro Zone" 499 | }, 500 | { 501 | "LocationID": 84, 502 | "Borough": "Staten Island", 503 | "Zone": "Eltingville/Annadale/Prince's Bay", 504 | "service_zone": "Boro Zone" 505 | }, 506 | { 507 | "LocationID": 85, 508 | "Borough": "Brooklyn", 509 | "Zone": "Erasmus", 510 | "service_zone": "Boro Zone" 511 | }, 512 | { 513 | "LocationID": 86, 514 | "Borough": "Queens", 515 | "Zone": "Far Rockaway", 516 | "service_zone": "Boro Zone" 517 | }, 518 | { 519 | "LocationID": 87, 520 | "Borough": "Manhattan", 521 | "Zone": "Financial District North", 522 | "service_zone": "Yellow Zone" 523 | }, 524 | { 525 | "LocationID": 88, 526 | "Borough": "Manhattan", 527 | "Zone": "Financial District South", 528 | "service_zone": "Yellow Zone" 529 | }, 530 | { 531 | "LocationID": 89, 532 | "Borough": "Brooklyn", 533 | "Zone": "Flatbush/Ditmas Park", 534 | "service_zone": "Boro Zone" 535 | }, 536 | { 537 | "LocationID": 90, 538 | "Borough": "Manhattan", 539 | "Zone": "Flatiron", 540 | "service_zone": "Yellow Zone" 541 | }, 542 | { 543 | "LocationID": 91, 544 | "Borough": "Brooklyn", 545 | "Zone": "Flatlands", 546 | "service_zone": "Boro Zone" 547 | }, 548 | { 549 | "LocationID": 92, 550 | "Borough": "Queens", 551 | "Zone": "Flushing", 552 | "service_zone": "Boro Zone" 553 | }, 554 | { 555 | "LocationID": 93, 556 | "Borough": "Queens", 557 | "Zone": "Flushing Meadows-Corona Park", 558 | "service_zone": "Boro Zone" 559 | }, 560 | { 561 | "LocationID": 94, 562 | "Borough": "Bronx", 563 | "Zone": "Fordham South", 564 | "service_zone": "Boro Zone" 565 | }, 566 | { 567 | "LocationID": 95, 568 | "Borough": "Queens", 569 | "Zone": "Forest Hills", 570 | "service_zone": "Boro Zone" 571 | }, 572 | { 573 | "LocationID": 96, 574 | "Borough": "Queens", 575 | "Zone": "Forest Park/Highland Park", 576 | "service_zone": "Boro Zone" 577 | }, 578 | { 579 | "LocationID": 97, 580 | "Borough": "Brooklyn", 581 | "Zone": "Fort Greene", 582 | "service_zone": "Boro 
Zone" 583 | }, 584 | { 585 | "LocationID": 98, 586 | "Borough": "Queens", 587 | "Zone": "Fresh Meadows", 588 | "service_zone": "Boro Zone" 589 | }, 590 | { 591 | "LocationID": 99, 592 | "Borough": "Staten Island", 593 | "Zone": "Freshkills Park", 594 | "service_zone": "Boro Zone" 595 | }, 596 | { 597 | "LocationID": 100, 598 | "Borough": "Manhattan", 599 | "Zone": "Garment District", 600 | "service_zone": "Yellow Zone" 601 | }, 602 | { 603 | "LocationID": 101, 604 | "Borough": "Queens", 605 | "Zone": "Glen Oaks", 606 | "service_zone": "Boro Zone" 607 | }, 608 | { 609 | "LocationID": 102, 610 | "Borough": "Queens", 611 | "Zone": "Glendale", 612 | "service_zone": "Boro Zone" 613 | }, 614 | { 615 | "LocationID": 103, 616 | "Borough": "Manhattan", 617 | "Zone": "Governor's Island/Ellis Island/Liberty Island", 618 | "service_zone": "Yellow Zone" 619 | }, 620 | { 621 | "LocationID": 104, 622 | "Borough": "Manhattan", 623 | "Zone": "Governor's Island/Ellis Island/Liberty Island", 624 | "service_zone": "Yellow Zone" 625 | }, 626 | { 627 | "LocationID": 105, 628 | "Borough": "Manhattan", 629 | "Zone": "Governor's Island/Ellis Island/Liberty Island", 630 | "service_zone": "Yellow Zone" 631 | }, 632 | { 633 | "LocationID": 106, 634 | "Borough": "Brooklyn", 635 | "Zone": "Gowanus", 636 | "service_zone": "Boro Zone" 637 | }, 638 | { 639 | "LocationID": 107, 640 | "Borough": "Manhattan", 641 | "Zone": "Gramercy", 642 | "service_zone": "Yellow Zone" 643 | }, 644 | { 645 | "LocationID": 108, 646 | "Borough": "Brooklyn", 647 | "Zone": "Gravesend", 648 | "service_zone": "Boro Zone" 649 | }, 650 | { 651 | "LocationID": 109, 652 | "Borough": "Staten Island", 653 | "Zone": "Great Kills", 654 | "service_zone": "Boro Zone" 655 | }, 656 | { 657 | "LocationID": 110, 658 | "Borough": "Staten Island", 659 | "Zone": "Great Kills Park", 660 | "service_zone": "Boro Zone" 661 | }, 662 | { 663 | "LocationID": 111, 664 | "Borough": "Brooklyn", 665 | "Zone": "Green-Wood Cemetery", 666 | "service_zone": "Boro Zone" 667 | }, 668 | { 669 | "LocationID": 112, 670 | "Borough": "Brooklyn", 671 | "Zone": "Greenpoint", 672 | "service_zone": "Boro Zone" 673 | }, 674 | { 675 | "LocationID": 113, 676 | "Borough": "Manhattan", 677 | "Zone": "Greenwich Village North", 678 | "service_zone": "Yellow Zone" 679 | }, 680 | { 681 | "LocationID": 114, 682 | "Borough": "Manhattan", 683 | "Zone": "Greenwich Village South", 684 | "service_zone": "Yellow Zone" 685 | }, 686 | { 687 | "LocationID": 115, 688 | "Borough": "Staten Island", 689 | "Zone": "Grymes Hill/Clifton", 690 | "service_zone": "Boro Zone" 691 | }, 692 | { 693 | "LocationID": 116, 694 | "Borough": "Manhattan", 695 | "Zone": "Hamilton Heights", 696 | "service_zone": "Boro Zone" 697 | }, 698 | { 699 | "LocationID": 117, 700 | "Borough": "Queens", 701 | "Zone": "Hammels/Arverne", 702 | "service_zone": "Boro Zone" 703 | }, 704 | { 705 | "LocationID": 118, 706 | "Borough": "Staten Island", 707 | "Zone": "Heartland Village/Todt Hill", 708 | "service_zone": "Boro Zone" 709 | }, 710 | { 711 | "LocationID": 119, 712 | "Borough": "Bronx", 713 | "Zone": "Highbridge", 714 | "service_zone": "Boro Zone" 715 | }, 716 | { 717 | "LocationID": 120, 718 | "Borough": "Manhattan", 719 | "Zone": "Highbridge Park", 720 | "service_zone": "Boro Zone" 721 | }, 722 | { 723 | "LocationID": 121, 724 | "Borough": "Queens", 725 | "Zone": "Hillcrest/Pomonok", 726 | "service_zone": "Boro Zone" 727 | }, 728 | { 729 | "LocationID": 122, 730 | "Borough": "Queens", 731 | "Zone": "Hollis", 732 | "service_zone": "Boro 
Zone" 733 | }, 734 | { 735 | "LocationID": 123, 736 | "Borough": "Brooklyn", 737 | "Zone": "Homecrest", 738 | "service_zone": "Boro Zone" 739 | }, 740 | { 741 | "LocationID": 124, 742 | "Borough": "Queens", 743 | "Zone": "Howard Beach", 744 | "service_zone": "Boro Zone" 745 | }, 746 | { 747 | "LocationID": 125, 748 | "Borough": "Manhattan", 749 | "Zone": "Hudson Sq", 750 | "service_zone": "Yellow Zone" 751 | }, 752 | { 753 | "LocationID": 126, 754 | "Borough": "Bronx", 755 | "Zone": "Hunts Point", 756 | "service_zone": "Boro Zone" 757 | }, 758 | { 759 | "LocationID": 127, 760 | "Borough": "Manhattan", 761 | "Zone": "Inwood", 762 | "service_zone": "Boro Zone" 763 | }, 764 | { 765 | "LocationID": 128, 766 | "Borough": "Manhattan", 767 | "Zone": "Inwood Hill Park", 768 | "service_zone": "Boro Zone" 769 | }, 770 | { 771 | "LocationID": 129, 772 | "Borough": "Queens", 773 | "Zone": "Jackson Heights", 774 | "service_zone": "Boro Zone" 775 | }, 776 | { 777 | "LocationID": 130, 778 | "Borough": "Queens", 779 | "Zone": "Jamaica", 780 | "service_zone": "Boro Zone" 781 | }, 782 | { 783 | "LocationID": 131, 784 | "Borough": "Queens", 785 | "Zone": "Jamaica Estates", 786 | "service_zone": "Boro Zone" 787 | }, 788 | { 789 | "LocationID": 132, 790 | "Borough": "Queens", 791 | "Zone": "JFK Airport", 792 | "service_zone": "Airports" 793 | }, 794 | { 795 | "LocationID": 133, 796 | "Borough": "Brooklyn", 797 | "Zone": "Kensington", 798 | "service_zone": "Boro Zone" 799 | }, 800 | { 801 | "LocationID": 134, 802 | "Borough": "Queens", 803 | "Zone": "Kew Gardens", 804 | "service_zone": "Boro Zone" 805 | }, 806 | { 807 | "LocationID": 135, 808 | "Borough": "Queens", 809 | "Zone": "Kew Gardens Hills", 810 | "service_zone": "Boro Zone" 811 | }, 812 | { 813 | "LocationID": 136, 814 | "Borough": "Bronx", 815 | "Zone": "Kingsbridge Heights", 816 | "service_zone": "Boro Zone" 817 | }, 818 | { 819 | "LocationID": 137, 820 | "Borough": "Manhattan", 821 | "Zone": "Kips Bay", 822 | "service_zone": "Yellow Zone" 823 | }, 824 | { 825 | "LocationID": 138, 826 | "Borough": "Queens", 827 | "Zone": "LaGuardia Airport", 828 | "service_zone": "Airports" 829 | }, 830 | { 831 | "LocationID": 139, 832 | "Borough": "Queens", 833 | "Zone": "Laurelton", 834 | "service_zone": "Boro Zone" 835 | }, 836 | { 837 | "LocationID": 140, 838 | "Borough": "Manhattan", 839 | "Zone": "Lenox Hill East", 840 | "service_zone": "Yellow Zone" 841 | }, 842 | { 843 | "LocationID": 141, 844 | "Borough": "Manhattan", 845 | "Zone": "Lenox Hill West", 846 | "service_zone": "Yellow Zone" 847 | }, 848 | { 849 | "LocationID": 142, 850 | "Borough": "Manhattan", 851 | "Zone": "Lincoln Square East", 852 | "service_zone": "Yellow Zone" 853 | }, 854 | { 855 | "LocationID": 143, 856 | "Borough": "Manhattan", 857 | "Zone": "Lincoln Square West", 858 | "service_zone": "Yellow Zone" 859 | }, 860 | { 861 | "LocationID": 144, 862 | "Borough": "Manhattan", 863 | "Zone": "Little Italy/NoLiTa", 864 | "service_zone": "Yellow Zone" 865 | }, 866 | { 867 | "LocationID": 145, 868 | "Borough": "Queens", 869 | "Zone": "Long Island City/Hunters Point", 870 | "service_zone": "Boro Zone" 871 | }, 872 | { 873 | "LocationID": 146, 874 | "Borough": "Queens", 875 | "Zone": "Long Island City/Queens Plaza", 876 | "service_zone": "Boro Zone" 877 | }, 878 | { 879 | "LocationID": 147, 880 | "Borough": "Bronx", 881 | "Zone": "Longwood", 882 | "service_zone": "Boro Zone" 883 | }, 884 | { 885 | "LocationID": 148, 886 | "Borough": "Manhattan", 887 | "Zone": "Lower East Side", 888 | "service_zone": 
"Yellow Zone" 889 | }, 890 | { 891 | "LocationID": 149, 892 | "Borough": "Brooklyn", 893 | "Zone": "Madison", 894 | "service_zone": "Boro Zone" 895 | }, 896 | { 897 | "LocationID": 150, 898 | "Borough": "Brooklyn", 899 | "Zone": "Manhattan Beach", 900 | "service_zone": "Boro Zone" 901 | }, 902 | { 903 | "LocationID": 151, 904 | "Borough": "Manhattan", 905 | "Zone": "Manhattan Valley", 906 | "service_zone": "Yellow Zone" 907 | }, 908 | { 909 | "LocationID": 152, 910 | "Borough": "Manhattan", 911 | "Zone": "Manhattanville", 912 | "service_zone": "Boro Zone" 913 | }, 914 | { 915 | "LocationID": 153, 916 | "Borough": "Manhattan", 917 | "Zone": "Marble Hill", 918 | "service_zone": "Boro Zone" 919 | }, 920 | { 921 | "LocationID": 154, 922 | "Borough": "Brooklyn", 923 | "Zone": "Marine Park/Floyd Bennett Field", 924 | "service_zone": "Boro Zone" 925 | }, 926 | { 927 | "LocationID": 155, 928 | "Borough": "Brooklyn", 929 | "Zone": "Marine Park/Mill Basin", 930 | "service_zone": "Boro Zone" 931 | }, 932 | { 933 | "LocationID": 156, 934 | "Borough": "Staten Island", 935 | "Zone": "Mariners Harbor", 936 | "service_zone": "Boro Zone" 937 | }, 938 | { 939 | "LocationID": 157, 940 | "Borough": "Queens", 941 | "Zone": "Maspeth", 942 | "service_zone": "Boro Zone" 943 | }, 944 | { 945 | "LocationID": 158, 946 | "Borough": "Manhattan", 947 | "Zone": "Meatpacking/West Village West", 948 | "service_zone": "Yellow Zone" 949 | }, 950 | { 951 | "LocationID": 159, 952 | "Borough": "Bronx", 953 | "Zone": "Melrose South", 954 | "service_zone": "Boro Zone" 955 | }, 956 | { 957 | "LocationID": 160, 958 | "Borough": "Queens", 959 | "Zone": "Middle Village", 960 | "service_zone": "Boro Zone" 961 | }, 962 | { 963 | "LocationID": 161, 964 | "Borough": "Manhattan", 965 | "Zone": "Midtown Center", 966 | "service_zone": "Yellow Zone" 967 | }, 968 | { 969 | "LocationID": 162, 970 | "Borough": "Manhattan", 971 | "Zone": "Midtown East", 972 | "service_zone": "Yellow Zone" 973 | }, 974 | { 975 | "LocationID": 163, 976 | "Borough": "Manhattan", 977 | "Zone": "Midtown North", 978 | "service_zone": "Yellow Zone" 979 | }, 980 | { 981 | "LocationID": 164, 982 | "Borough": "Manhattan", 983 | "Zone": "Midtown South", 984 | "service_zone": "Yellow Zone" 985 | }, 986 | { 987 | "LocationID": 165, 988 | "Borough": "Brooklyn", 989 | "Zone": "Midwood", 990 | "service_zone": "Boro Zone" 991 | }, 992 | { 993 | "LocationID": 166, 994 | "Borough": "Manhattan", 995 | "Zone": "Morningside Heights", 996 | "service_zone": "Boro Zone" 997 | }, 998 | { 999 | "LocationID": 167, 1000 | "Borough": "Bronx", 1001 | "Zone": "Morrisania/Melrose", 1002 | "service_zone": "Boro Zone" 1003 | }, 1004 | { 1005 | "LocationID": 168, 1006 | "Borough": "Bronx", 1007 | "Zone": "Mott Haven/Port Morris", 1008 | "service_zone": "Boro Zone" 1009 | }, 1010 | { 1011 | "LocationID": 169, 1012 | "Borough": "Bronx", 1013 | "Zone": "Mount Hope", 1014 | "service_zone": "Boro Zone" 1015 | }, 1016 | { 1017 | "LocationID": 170, 1018 | "Borough": "Manhattan", 1019 | "Zone": "Murray Hill", 1020 | "service_zone": "Yellow Zone" 1021 | }, 1022 | { 1023 | "LocationID": 171, 1024 | "Borough": "Queens", 1025 | "Zone": "Murray Hill-Queens", 1026 | "service_zone": "Boro Zone" 1027 | }, 1028 | { 1029 | "LocationID": 172, 1030 | "Borough": "Staten Island", 1031 | "Zone": "New Dorp/Midland Beach", 1032 | "service_zone": "Boro Zone" 1033 | }, 1034 | { 1035 | "LocationID": 173, 1036 | "Borough": "Queens", 1037 | "Zone": "North Corona", 1038 | "service_zone": "Boro Zone" 1039 | }, 1040 | { 1041 | 
"LocationID": 174, 1042 | "Borough": "Bronx", 1043 | "Zone": "Norwood", 1044 | "service_zone": "Boro Zone" 1045 | }, 1046 | { 1047 | "LocationID": 175, 1048 | "Borough": "Queens", 1049 | "Zone": "Oakland Gardens", 1050 | "service_zone": "Boro Zone" 1051 | }, 1052 | { 1053 | "LocationID": 176, 1054 | "Borough": "Staten Island", 1055 | "Zone": "Oakwood", 1056 | "service_zone": "Boro Zone" 1057 | }, 1058 | { 1059 | "LocationID": 177, 1060 | "Borough": "Brooklyn", 1061 | "Zone": "Ocean Hill", 1062 | "service_zone": "Boro Zone" 1063 | }, 1064 | { 1065 | "LocationID": 178, 1066 | "Borough": "Brooklyn", 1067 | "Zone": "Ocean Parkway South", 1068 | "service_zone": "Boro Zone" 1069 | }, 1070 | { 1071 | "LocationID": 179, 1072 | "Borough": "Queens", 1073 | "Zone": "Old Astoria", 1074 | "service_zone": "Boro Zone" 1075 | }, 1076 | { 1077 | "LocationID": 180, 1078 | "Borough": "Queens", 1079 | "Zone": "Ozone Park", 1080 | "service_zone": "Boro Zone" 1081 | }, 1082 | { 1083 | "LocationID": 181, 1084 | "Borough": "Brooklyn", 1085 | "Zone": "Park Slope", 1086 | "service_zone": "Boro Zone" 1087 | }, 1088 | { 1089 | "LocationID": 182, 1090 | "Borough": "Bronx", 1091 | "Zone": "Parkchester", 1092 | "service_zone": "Boro Zone" 1093 | }, 1094 | { 1095 | "LocationID": 183, 1096 | "Borough": "Bronx", 1097 | "Zone": "Pelham Bay", 1098 | "service_zone": "Boro Zone" 1099 | }, 1100 | { 1101 | "LocationID": 184, 1102 | "Borough": "Bronx", 1103 | "Zone": "Pelham Bay Park", 1104 | "service_zone": "Boro Zone" 1105 | }, 1106 | { 1107 | "LocationID": 185, 1108 | "Borough": "Bronx", 1109 | "Zone": "Pelham Parkway", 1110 | "service_zone": "Boro Zone" 1111 | }, 1112 | { 1113 | "LocationID": 186, 1114 | "Borough": "Manhattan", 1115 | "Zone": "Penn Station/Madison Sq West", 1116 | "service_zone": "Yellow Zone" 1117 | }, 1118 | { 1119 | "LocationID": 187, 1120 | "Borough": "Staten Island", 1121 | "Zone": "Port Richmond", 1122 | "service_zone": "Boro Zone" 1123 | }, 1124 | { 1125 | "LocationID": 188, 1126 | "Borough": "Brooklyn", 1127 | "Zone": "Prospect-Lefferts Gardens", 1128 | "service_zone": "Boro Zone" 1129 | }, 1130 | { 1131 | "LocationID": 189, 1132 | "Borough": "Brooklyn", 1133 | "Zone": "Prospect Heights", 1134 | "service_zone": "Boro Zone" 1135 | }, 1136 | { 1137 | "LocationID": 190, 1138 | "Borough": "Brooklyn", 1139 | "Zone": "Prospect Park", 1140 | "service_zone": "Boro Zone" 1141 | }, 1142 | { 1143 | "LocationID": 191, 1144 | "Borough": "Queens", 1145 | "Zone": "Queens Village", 1146 | "service_zone": "Boro Zone" 1147 | }, 1148 | { 1149 | "LocationID": 192, 1150 | "Borough": "Queens", 1151 | "Zone": "Queensboro Hill", 1152 | "service_zone": "Boro Zone" 1153 | }, 1154 | { 1155 | "LocationID": 193, 1156 | "Borough": "Queens", 1157 | "Zone": "Queensbridge/Ravenswood", 1158 | "service_zone": "Boro Zone" 1159 | }, 1160 | { 1161 | "LocationID": 194, 1162 | "Borough": "Manhattan", 1163 | "Zone": "Randalls Island", 1164 | "service_zone": "Yellow Zone" 1165 | }, 1166 | { 1167 | "LocationID": 195, 1168 | "Borough": "Brooklyn", 1169 | "Zone": "Red Hook", 1170 | "service_zone": "Boro Zone" 1171 | }, 1172 | { 1173 | "LocationID": 196, 1174 | "Borough": "Queens", 1175 | "Zone": "Rego Park", 1176 | "service_zone": "Boro Zone" 1177 | }, 1178 | { 1179 | "LocationID": 197, 1180 | "Borough": "Queens", 1181 | "Zone": "Richmond Hill", 1182 | "service_zone": "Boro Zone" 1183 | }, 1184 | { 1185 | "LocationID": 198, 1186 | "Borough": "Queens", 1187 | "Zone": "Ridgewood", 1188 | "service_zone": "Boro Zone" 1189 | }, 1190 | { 1191 | 
"LocationID": 199, 1192 | "Borough": "Bronx", 1193 | "Zone": "Rikers Island", 1194 | "service_zone": "Boro Zone" 1195 | }, 1196 | { 1197 | "LocationID": 200, 1198 | "Borough": "Bronx", 1199 | "Zone": "Riverdale/North Riverdale/Fieldston", 1200 | "service_zone": "Boro Zone" 1201 | }, 1202 | { 1203 | "LocationID": 201, 1204 | "Borough": "Queens", 1205 | "Zone": "Rockaway Park", 1206 | "service_zone": "Boro Zone" 1207 | }, 1208 | { 1209 | "LocationID": 202, 1210 | "Borough": "Manhattan", 1211 | "Zone": "Roosevelt Island", 1212 | "service_zone": "Boro Zone" 1213 | }, 1214 | { 1215 | "LocationID": 203, 1216 | "Borough": "Queens", 1217 | "Zone": "Rosedale", 1218 | "service_zone": "Boro Zone" 1219 | }, 1220 | { 1221 | "LocationID": 204, 1222 | "Borough": "Staten Island", 1223 | "Zone": "Rossville/Woodrow", 1224 | "service_zone": "Boro Zone" 1225 | }, 1226 | { 1227 | "LocationID": 205, 1228 | "Borough": "Queens", 1229 | "Zone": "Saint Albans", 1230 | "service_zone": "Boro Zone" 1231 | }, 1232 | { 1233 | "LocationID": 206, 1234 | "Borough": "Staten Island", 1235 | "Zone": "Saint George/New Brighton", 1236 | "service_zone": "Boro Zone" 1237 | }, 1238 | { 1239 | "LocationID": 207, 1240 | "Borough": "Queens", 1241 | "Zone": "Saint Michaels Cemetery/Woodside", 1242 | "service_zone": "Boro Zone" 1243 | }, 1244 | { 1245 | "LocationID": 208, 1246 | "Borough": "Bronx", 1247 | "Zone": "Schuylerville/Edgewater Park", 1248 | "service_zone": "Boro Zone" 1249 | }, 1250 | { 1251 | "LocationID": 209, 1252 | "Borough": "Manhattan", 1253 | "Zone": "Seaport", 1254 | "service_zone": "Yellow Zone" 1255 | }, 1256 | { 1257 | "LocationID": 210, 1258 | "Borough": "Brooklyn", 1259 | "Zone": "Sheepshead Bay", 1260 | "service_zone": "Boro Zone" 1261 | }, 1262 | { 1263 | "LocationID": 211, 1264 | "Borough": "Manhattan", 1265 | "Zone": "SoHo", 1266 | "service_zone": "Yellow Zone" 1267 | }, 1268 | { 1269 | "LocationID": 212, 1270 | "Borough": "Bronx", 1271 | "Zone": "Soundview/Bruckner", 1272 | "service_zone": "Boro Zone" 1273 | }, 1274 | { 1275 | "LocationID": 213, 1276 | "Borough": "Bronx", 1277 | "Zone": "Soundview/Castle Hill", 1278 | "service_zone": "Boro Zone" 1279 | }, 1280 | { 1281 | "LocationID": 214, 1282 | "Borough": "Staten Island", 1283 | "Zone": "South Beach/Dongan Hills", 1284 | "service_zone": "Boro Zone" 1285 | }, 1286 | { 1287 | "LocationID": 215, 1288 | "Borough": "Queens", 1289 | "Zone": "South Jamaica", 1290 | "service_zone": "Boro Zone" 1291 | }, 1292 | { 1293 | "LocationID": 216, 1294 | "Borough": "Queens", 1295 | "Zone": "South Ozone Park", 1296 | "service_zone": "Boro Zone" 1297 | }, 1298 | { 1299 | "LocationID": 217, 1300 | "Borough": "Brooklyn", 1301 | "Zone": "South Williamsburg", 1302 | "service_zone": "Boro Zone" 1303 | }, 1304 | { 1305 | "LocationID": 218, 1306 | "Borough": "Queens", 1307 | "Zone": "Springfield Gardens North", 1308 | "service_zone": "Boro Zone" 1309 | }, 1310 | { 1311 | "LocationID": 219, 1312 | "Borough": "Queens", 1313 | "Zone": "Springfield Gardens South", 1314 | "service_zone": "Boro Zone" 1315 | }, 1316 | { 1317 | "LocationID": 220, 1318 | "Borough": "Bronx", 1319 | "Zone": "Spuyten Duyvil/Kingsbridge", 1320 | "service_zone": "Boro Zone" 1321 | }, 1322 | { 1323 | "LocationID": 221, 1324 | "Borough": "Staten Island", 1325 | "Zone": "Stapleton", 1326 | "service_zone": "Boro Zone" 1327 | }, 1328 | { 1329 | "LocationID": 222, 1330 | "Borough": "Brooklyn", 1331 | "Zone": "Starrett City", 1332 | "service_zone": "Boro Zone" 1333 | }, 1334 | { 1335 | "LocationID": 223, 1336 | 
"Borough": "Queens", 1337 | "Zone": "Steinway", 1338 | "service_zone": "Boro Zone" 1339 | }, 1340 | { 1341 | "LocationID": 224, 1342 | "Borough": "Manhattan", 1343 | "Zone": "Stuy Town/Peter Cooper Village", 1344 | "service_zone": "Yellow Zone" 1345 | }, 1346 | { 1347 | "LocationID": 225, 1348 | "Borough": "Brooklyn", 1349 | "Zone": "Stuyvesant Heights", 1350 | "service_zone": "Boro Zone" 1351 | }, 1352 | { 1353 | "LocationID": 226, 1354 | "Borough": "Queens", 1355 | "Zone": "Sunnyside", 1356 | "service_zone": "Boro Zone" 1357 | }, 1358 | { 1359 | "LocationID": 227, 1360 | "Borough": "Brooklyn", 1361 | "Zone": "Sunset Park East", 1362 | "service_zone": "Boro Zone" 1363 | }, 1364 | { 1365 | "LocationID": 228, 1366 | "Borough": "Brooklyn", 1367 | "Zone": "Sunset Park West", 1368 | "service_zone": "Boro Zone" 1369 | }, 1370 | { 1371 | "LocationID": 229, 1372 | "Borough": "Manhattan", 1373 | "Zone": "Sutton Place/Turtle Bay North", 1374 | "service_zone": "Yellow Zone" 1375 | }, 1376 | { 1377 | "LocationID": 230, 1378 | "Borough": "Manhattan", 1379 | "Zone": "Times Sq/Theatre District", 1380 | "service_zone": "Yellow Zone" 1381 | }, 1382 | { 1383 | "LocationID": 231, 1384 | "Borough": "Manhattan", 1385 | "Zone": "TriBeCa/Civic Center", 1386 | "service_zone": "Yellow Zone" 1387 | }, 1388 | { 1389 | "LocationID": 232, 1390 | "Borough": "Manhattan", 1391 | "Zone": "Two Bridges/Seward Park", 1392 | "service_zone": "Yellow Zone" 1393 | }, 1394 | { 1395 | "LocationID": 233, 1396 | "Borough": "Manhattan", 1397 | "Zone": "UN/Turtle Bay South", 1398 | "service_zone": "Yellow Zone" 1399 | }, 1400 | { 1401 | "LocationID": 234, 1402 | "Borough": "Manhattan", 1403 | "Zone": "Union Sq", 1404 | "service_zone": "Yellow Zone" 1405 | }, 1406 | { 1407 | "LocationID": 235, 1408 | "Borough": "Bronx", 1409 | "Zone": "University Heights/Morris Heights", 1410 | "service_zone": "Boro Zone" 1411 | }, 1412 | { 1413 | "LocationID": 236, 1414 | "Borough": "Manhattan", 1415 | "Zone": "Upper East Side North", 1416 | "service_zone": "Yellow Zone" 1417 | }, 1418 | { 1419 | "LocationID": 237, 1420 | "Borough": "Manhattan", 1421 | "Zone": "Upper East Side South", 1422 | "service_zone": "Yellow Zone" 1423 | }, 1424 | { 1425 | "LocationID": 238, 1426 | "Borough": "Manhattan", 1427 | "Zone": "Upper West Side North", 1428 | "service_zone": "Yellow Zone" 1429 | }, 1430 | { 1431 | "LocationID": 239, 1432 | "Borough": "Manhattan", 1433 | "Zone": "Upper West Side South", 1434 | "service_zone": "Yellow Zone" 1435 | }, 1436 | { 1437 | "LocationID": 240, 1438 | "Borough": "Bronx", 1439 | "Zone": "Van Cortlandt Park", 1440 | "service_zone": "Boro Zone" 1441 | }, 1442 | { 1443 | "LocationID": 241, 1444 | "Borough": "Bronx", 1445 | "Zone": "Van Cortlandt Village", 1446 | "service_zone": "Boro Zone" 1447 | }, 1448 | { 1449 | "LocationID": 242, 1450 | "Borough": "Bronx", 1451 | "Zone": "Van Nest/Morris Park", 1452 | "service_zone": "Boro Zone" 1453 | }, 1454 | { 1455 | "LocationID": 243, 1456 | "Borough": "Manhattan", 1457 | "Zone": "Washington Heights North", 1458 | "service_zone": "Boro Zone" 1459 | }, 1460 | { 1461 | "LocationID": 244, 1462 | "Borough": "Manhattan", 1463 | "Zone": "Washington Heights South", 1464 | "service_zone": "Boro Zone" 1465 | }, 1466 | { 1467 | "LocationID": 245, 1468 | "Borough": "Staten Island", 1469 | "Zone": "West Brighton", 1470 | "service_zone": "Boro Zone" 1471 | }, 1472 | { 1473 | "LocationID": 246, 1474 | "Borough": "Manhattan", 1475 | "Zone": "West Chelsea/Hudson Yards", 1476 | "service_zone": "Yellow Zone" 
1477 | }, 1478 | { 1479 | "LocationID": 247, 1480 | "Borough": "Bronx", 1481 | "Zone": "West Concourse", 1482 | "service_zone": "Boro Zone" 1483 | }, 1484 | { 1485 | "LocationID": 248, 1486 | "Borough": "Bronx", 1487 | "Zone": "West Farms/Bronx River", 1488 | "service_zone": "Boro Zone" 1489 | }, 1490 | { 1491 | "LocationID": 249, 1492 | "Borough": "Manhattan", 1493 | "Zone": "West Village", 1494 | "service_zone": "Yellow Zone" 1495 | }, 1496 | { 1497 | "LocationID": 250, 1498 | "Borough": "Bronx", 1499 | "Zone": "Westchester Village/Unionport", 1500 | "service_zone": "Boro Zone" 1501 | }, 1502 | { 1503 | "LocationID": 251, 1504 | "Borough": "Staten Island", 1505 | "Zone": "Westerleigh", 1506 | "service_zone": "Boro Zone" 1507 | }, 1508 | { 1509 | "LocationID": 252, 1510 | "Borough": "Queens", 1511 | "Zone": "Whitestone", 1512 | "service_zone": "Boro Zone" 1513 | }, 1514 | { 1515 | "LocationID": 253, 1516 | "Borough": "Queens", 1517 | "Zone": "Willets Point", 1518 | "service_zone": "Boro Zone" 1519 | }, 1520 | { 1521 | "LocationID": 254, 1522 | "Borough": "Bronx", 1523 | "Zone": "Williamsbridge/Olinville", 1524 | "service_zone": "Boro Zone" 1525 | }, 1526 | { 1527 | "LocationID": 255, 1528 | "Borough": "Brooklyn", 1529 | "Zone": "Williamsburg (North Side)", 1530 | "service_zone": "Boro Zone" 1531 | }, 1532 | { 1533 | "LocationID": 256, 1534 | "Borough": "Brooklyn", 1535 | "Zone": "Williamsburg (South Side)", 1536 | "service_zone": "Boro Zone" 1537 | }, 1538 | { 1539 | "LocationID": 257, 1540 | "Borough": "Brooklyn", 1541 | "Zone": "Windsor Terrace", 1542 | "service_zone": "Boro Zone" 1543 | }, 1544 | { 1545 | "LocationID": 258, 1546 | "Borough": "Queens", 1547 | "Zone": "Woodhaven", 1548 | "service_zone": "Boro Zone" 1549 | }, 1550 | { 1551 | "LocationID": 259, 1552 | "Borough": "Bronx", 1553 | "Zone": "Woodlawn/Wakefield", 1554 | "service_zone": "Boro Zone" 1555 | }, 1556 | { 1557 | "LocationID": 260, 1558 | "Borough": "Queens", 1559 | "Zone": "Woodside", 1560 | "service_zone": "Boro Zone" 1561 | }, 1562 | { 1563 | "LocationID": 261, 1564 | "Borough": "Manhattan", 1565 | "Zone": "World Trade Center", 1566 | "service_zone": "Yellow Zone" 1567 | }, 1568 | { 1569 | "LocationID": 262, 1570 | "Borough": "Manhattan", 1571 | "Zone": "Yorkville East", 1572 | "service_zone": "Yellow Zone" 1573 | }, 1574 | { 1575 | "LocationID": 263, 1576 | "Borough": "Manhattan", 1577 | "Zone": "Yorkville West", 1578 | "service_zone": "Yellow Zone" 1579 | }, 1580 | { 1581 | "LocationID": 264, 1582 | "Borough": "Unknown", 1583 | "Zone": "NV", 1584 | "service_zone": "N/A" 1585 | }, 1586 | { 1587 | "LocationID": 265, 1588 | "Borough": "Unknown", 1589 | "Zone": "NA", 1590 | "service_zone": "N/A" 1591 | } 1592 | ] --------------------------------------------------------------------------------
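`Chapter0/nyc_taxi_zone.json` above is a single JSON array of NYC TLC taxi zone lookup records (`LocationID`, `Borough`, `Zone`, `service_zone`). A minimal sketch of how such a file might be loaded and summarised with PySpark for the chapter exercises; the app name and the relative path are assumptions, so adjust them to your environment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nyc_taxi_zones").getOrCreate()

# The file is one multi-line JSON array, not newline-delimited JSON,
# so the multiLine option is required; without it spark.read.json
# treats each physical line as a record and yields corrupt rows.
zones_df = (
    spark.read
    .option("multiLine", True)
    .json("Chapter0/nyc_taxi_zone.json")  # assumed relative path
)

zones_df.printSchema()

# Example analysis: number of zones per borough and service zone.
(zones_df
    .groupBy("Borough", "service_zone")
    .count()
    .orderBy("Borough")
    .show())
```

The same DataFrame can be registered as a temp view (`zones_df.createOrReplaceTempView("taxi_zones")`) and queried with Spark SQL, matching the pattern the chapter READMEs use for analysing loaded data.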