├── Cloud Data Warehouse
├── L1 E1 - Step 1_2.ipynb
├── L1 E1 - Step 3.ipynb
├── L1 E1 - Step 4.ipynb
├── L1 E1 - Step 5.ipynb
├── L1 E1 - Step 6.ipynb
├── L1 E2 - CUBE.ipynb
├── L1 E2 - Grouping Sets.ipynb
├── L1 E2 - Roll up and Drill Down.ipynb
├── L1 E2 - Slicing and Dicing.ipynb
├── L1 E3 - Columnar Vs Row Storage.ipynb
├── Project Data Warehouse with AWS
│ ├── README.md
│ ├── RedShift_Test_Cluster.ipynb
│ ├── create_tables.py
│ ├── dwh.cfg
│ ├── etl.py
│ └── sql_queries.py
└── Readme.md
├── Data Lakes with Spark
├── Data_Inputs_Outputs.ipynb
├── Data_Wrangling.ipynb
├── Data_Wrangling_Sql.ipynb
├── Dataframe_Quiz.ipynb
├── Exercise 1 - Schema On Read.ipynb
├── Exercise 2 - Advanced Analytics NLP.ipynb
├── Exercise 3 - Data Lake on S3.ipynb
├── Mapreduce_Practice.ipynb
├── Procedural_vs_Functional_Python.ipynb
├── Project Data Lake with Spark
│ ├── README.md
│ ├── Readme.MD
│ ├── dl.cfg
│ └── etl.py
├── README.md
├── Spark_Maps_Lazy_Evaluation.ipynb
└── Spark_Sql_Quiz.ipynb
├── Data Pipeline with Airflow
├── Data Pipeline - Exercise 1.py
├── Data Pipeline - Exercise 2.py
├── Data Pipeline - Exercise 3.py
├── Data Pipeline - Exercise 4.py
├── Data Pipeline - Exercise 5.py
├── Data Pipeline - Exercise 6.py
├── Data Quality - Exercise 1.py
├── Data Quality - Exercise 2.py
├── Data Quality - Exercise 3.py
├── Data Quality - Exercise 4.py
├── Production Data Pipelines - Exercise 1.py
├── Production Data Pipelines - Exercise 2.py
├── Production Data Pipelines - Exercise 3.py
├── Production Data Pipelines - Exercise 4.py
├── Project Data Pipeline with Airflow
│ ├── DAG Graphview.png
│ ├── DAG Treeview.PNG
│ ├── Readme.MD
│ ├── create_tables.sql
│ ├── dags
│ │ ├── __pycache__
│ │ │ └── udac_example_dag.cpython-36.pyc
│ │ └── udac_example_dag.py
│ └── plugins
│ │ ├── __init__.py
│ │ ├── __pycache__
│ │ │ └── __init__.cpython-36.pyc
│ │ ├── helpers
│ │ │ ├── __init__.py
│ │ │ ├── __pycache__
│ │ │ │ ├── __init__.cpython-36.pyc
│ │ │ │ └── sql_queries.cpython-36.pyc
│ │ │ └── sql_queries.py
│ │ └── operators
│ │ │ ├── __init__.py
│ │ │ ├── __pycache__
│ │ │ │ ├── __init__.cpython-36.pyc
│ │ │ │ ├── data_quality.cpython-36.pyc
│ │ │ │ ├── load_dimension.cpython-36.pyc
│ │ │ │ ├── load_fact.cpython-36.pyc
│ │ │ │ └── stage_redshift.cpython-36.pyc
│ │ │ ├── data_quality.py
│ │ │ ├── load_dimension.py
│ │ │ ├── load_fact.py
│ │ │ └── stage_redshift.py
├── Readme.MD
├── __init__.py
├── dag.py
├── facts_calculator.py
├── has_rows.py
├── s3_to_redshift.py
├── sql_statements.py
└── subdag.py
├── Data-Modeling
├── L1 Exercise 1 Creating a Table with Postgres.ipynb
├── L1 Exercise 2 Creating a Table with Apache Cassandra.ipynb
├── L2 Exercise 1 Creating Normalized Tables.ipynb
├── L2 Exercise 2 Creating Denormalized Tables.ipynb
├── L2 Exercise 3 Creating Fact and Dimension Tables with Star Schema.ipynb
├── L3 Exercise 1 Three Queries Three Tables.ipynb
├── L3 Exercise 2 Primary Key.ipynb
├── L3 Exercise 3 Clustering Column.ipynb
├── L3 Exercise 4 Using the WHERE Clause.ipynb
├── Project 1
│ ├── Instructions 1.PNG
│ ├── Instructions 2.PNG
│ ├── Instructions 3.PNG
│ ├── Instructions 4.PNG
│ ├── Project 1 Introduction.PNG
│ ├── README.md
│ ├── create_tables.py
│ ├── data.zip
│ ├── etl.ipynb
│ ├── etl.py
│ ├── sql_queries.py
│ └── test.ipynb
├── Project 2
│ ├── Project_1B.ipynb
│ ├── Project_1B_ Project_Template.ipynb
│ ├── README.md
│ ├── event_data.rar
│ ├── event_datafile_new.csv
│ └── images.rar
└── Readme.md
└── README.md
/Cloud Data Warehouse/Project Data Warehouse with AWS/README.md:
--------------------------------------------------------------------------------
1 | Introduction
2 |
3 | A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.
4 |
5 | The task is to build an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional and fact tables so their analytics team can continue finding insights into what songs their users are listening to.
6 |
7 | Project Description
8 |
9 | This project applies data warehousing concepts and AWS to build an ETL pipeline for a database hosted on Redshift. It loads data from S3 into staging tables on Redshift and executes SQL statements that create the fact and dimension tables used for analytics.
10 |
11 | Project Datasets
12 |
13 | Song Data Path --> s3://udacity-dend/song_data
14 | Log Data Path --> s3://udacity-dend/log_data
15 | Log Data JSON Path --> s3://udacity-dend/log_json_path.json
16 |
17 | Song Dataset
18 |
19 | The first dataset is a subset of real data from the Million Song Dataset(https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
20 | For example:
21 |
22 | song_data/A/B/C/TRABCEI128F424C983.json
23 | song_data/A/A/B/TRAABJL12903CDCF1A.json
24 |
25 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
26 |
27 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
28 |
29 | Log Dataset
30 |
31 | The second dataset consists of log files in JSON format. The log files in the dataset are partitioned by year and month.
32 | For example:
33 |
34 | log_data/2018/11/2018-11-12-events.json
35 | log_data/2018/11/2018-11-13-events.json
36 |
37 | And below is an example of what a single log file, 2018-11-13-events.json, looks like.
38 |
39 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender", "F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10}
40 |
41 | Schema for Song Play Analysis
42 |
43 | A star schema is used to optimize queries for song play analysis.
44 |
45 | Fact Table
46 |
47 | songplays - records in event data associated with song plays i.e. records with page NextSong
48 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
49 |
50 | Dimension Tables
51 |
52 | users - users in the app
53 | user_id, first_name, last_name, gender, level
54 |
55 | songs - songs in music database
56 | song_id, title, artist_id, year, duration
57 |
58 | artists - artists in music database
59 | artist_id, name, location, latitude, longitude
60 |
61 | time - timestamps of records in songplays broken down into specific units
62 | start_time, hour, day, week, month, year, weekday
63 |
64 | Project Template
65 |
66 | The Project Template includes four files:
67 |
68 | 1. create_tables.py is where you'll create your fact and dimension tables for the star schema in Redshift.
69 |
70 | 2. etl.py is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
71 |
72 | 3. sql_queries.py is where you'll define your SQL statements, which will be imported into the two other files above.
73 |
74 | 4. README.md is where you'll provide discussion on your process and decisions for this ETL pipeline.
75 |
76 | Create Table Schema
77 |
78 | 1. Write a SQL CREATE statement for each of these tables in sql_queries.py
79 | 2. Complete the logic in create_tables.py to connect to the database and create these tables
80 | 3. Write SQL DROP statements to drop tables in the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
81 | 4. Launch a redshift cluster and create an IAM role that has read access to S3.
82 | 5. Add redshift database and IAM role info to dwh.cfg.
83 | 6. Test by running create_tables.py and checking the table schemas in your redshift database.
84 |
85 | Build ETL Pipeline
86 |
87 | 1. Implement the logic in etl.py to load data from S3 to staging tables on Redshift.
88 | 2. Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift.
89 | 3. Test by running etl.py after running create_tables.py and running the analytic queries on your Redshift database to compare your results with the expected results.
90 | 4. Delete your redshift cluster when finished.
91 |
92 | Final Instructions
93 |
94 | 1. Import all the necessary libraries
95 | 2. Write the AWS cluster configuration, storing the important parameters in a separate config file (dwh.cfg)
96 | 3. Configure boto3, the AWS SDK for Python
97 | 4. Using the S3 bucket, check whether the log files and song data files are present
98 | 5. Create an IAM role, assign the appropriate permissions, and create the Redshift cluster (a boto3 sketch follows this README)
99 | 6. Get the endpoint and role ARN values and put them into the main configuration file
100 | 7. Authorize the security group to allow access on the default TCP port and IP address
101 | 8. Set up the database connectivity configuration
102 | 9. In a terminal, run "python create_tables.py" and then "python etl.py"
103 | 10. The run should take around 4-10 minutes in total
104 | 11. Go back to the Jupyter notebook to verify that everything is working
105 | 12. Count the records in all the tables to validate the load
106 | 13. Finally, delete the cluster, roles, and assigned permissions
--------------------------------------------------------------------------------
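The final instructions above (steps 3-7) provision the cluster with boto3 inside RedShift_Test_Cluster.ipynb, whose cells are not reproduced here. Below is a minimal sketch of those steps, assuming the parameters in dwh.cfg and that the IAM role named there already exists with a Redshift trust policy; it is not the notebook's exact code.

```python
import configparser
import boto3

# reuse the same dwh.cfg that create_tables.py and etl.py read
config = configparser.ConfigParser()
config.read('dwh.cfg')
KEY, SECRET = config.get('AWS', 'KEY'), config.get('AWS', 'SECRET')

iam = boto3.client('iam', region_name='us-west-2',
                   aws_access_key_id=KEY, aws_secret_access_key=SECRET)
redshift = boto3.client('redshift', region_name='us-west-2',
                        aws_access_key_id=KEY, aws_secret_access_key=SECRET)

# give the (pre-existing) role read access to S3 and fetch its ARN
role_name = config.get('DWH', 'DWH_IAM_ROLE_NAME')
iam.attach_role_policy(RoleName=role_name,
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
role_arn = iam.get_role(RoleName=role_name)['Role']['Arn']

# launch the cluster with the settings from the [DWH] section
redshift.create_cluster(
    ClusterType=config.get('DWH', 'DWH_CLUSTER_TYPE'),
    NodeType=config.get('DWH', 'DWH_NODE_TYPE'),
    NumberOfNodes=int(config.get('DWH', 'DWH_NUM_NODES')),
    ClusterIdentifier=config.get('DWH', 'DWH_CLUSTER_IDENTIFIER'),
    DBName=config.get('DWH', 'DWH_DB'),
    MasterUsername=config.get('DWH', 'DWH_DB_USER'),
    MasterUserPassword=config.get('DWH', 'DWH_DB_PASSWORD'),
    IamRoles=[role_arn],
)

# once the cluster status is 'available', copy these two values into
# [CLUSTER] HOST and [IAM_ROLE] ARN in dwh.cfg (steps 6-8)
props = redshift.describe_clusters(
    ClusterIdentifier=config.get('DWH', 'DWH_CLUSTER_IDENTIFIER'))['Clusters'][0]
print(props['Endpoint']['Address'], props['IamRoles'][0]['IamRoleArn'])
```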
/Cloud Data Warehouse/Project Data Warehouse with AWS/create_tables.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import create_table_queries, drop_table_queries
4 |
5 |
6 | def drop_tables(cur, conn):
7 | for query in drop_table_queries:
8 | cur.execute(query)
9 | conn.commit()
10 |
11 |
12 | def create_tables(cur, conn):
13 | for query in create_table_queries:
14 | cur.execute(query)
15 | conn.commit()
16 |
17 |
18 | def main():
19 | config = configparser.ConfigParser()
20 | config.read('dwh.cfg')
21 |
22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
23 | cur = conn.cursor()
24 |
25 | drop_tables(cur, conn)
26 | create_tables(cur, conn)
27 |
28 | conn.close()
29 |
30 |
31 | if __name__ == "__main__":
32 | main()
--------------------------------------------------------------------------------
/Cloud Data Warehouse/Project Data Warehouse with AWS/dwh.cfg:
--------------------------------------------------------------------------------
1 | [AWS]
2 | KEY=
3 | SECRET=
4 |
5 | [DWH]
6 | DWH_CLUSTER_TYPE=multi-node
7 | DWH_NUM_NODES=4
8 | DWH_NODE_TYPE=dc2.large
9 |
10 | DWH_IAM_ROLE_NAME=dwhRole
11 | DWH_CLUSTER_IDENTIFIER=dwhCluster
12 | DWH_DB=dwh
13 | DWH_DB_USER=dwhuser
14 | DWH_DB_PASSWORD=Passw0rd
15 | DWH_PORT=5439
16 |
17 | [CLUSTER]
18 | HOST=
19 | DB_NAME=dwh
20 | DB_USER=dwhuser
21 | DB_PASSWORD=Passw0rd
22 | DB_PORT=5439
23 |
24 | [IAM_ROLE]
25 | ARN=
26 |
27 | [S3]
28 | LOG_DATA='s3://udacity-dend/log_data'
29 | LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
30 | SONG_DATA='s3://udacity-dend/song_data'
--------------------------------------------------------------------------------
/Cloud Data Warehouse/Project Data Warehouse with AWS/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import copy_table_queries, insert_table_queries
4 |
5 |
6 | def load_staging_tables(cur, conn):
7 | for query in copy_table_queries:
8 | cur.execute(query)
9 | conn.commit()
10 |
11 |
12 | def insert_tables(cur, conn):
13 | for query in insert_table_queries:
14 | cur.execute(query)
15 | conn.commit()
16 |
17 |
18 | def main():
19 | config = configparser.ConfigParser()
20 | config.read('dwh.cfg')
21 |
22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
23 | cur = conn.cursor()
24 |
25 | load_staging_tables(cur, conn)
26 | insert_tables(cur, conn)
27 |
28 | conn.close()
29 |
30 |
31 | if __name__ == "__main__":
32 | main()
--------------------------------------------------------------------------------
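With create_tables.py and etl.py both run, the README's "compare your results" step can be exercised with a quick analytic query against the star schema. A minimal sketch, assuming the fact_songplay and dim_song tables defined in sql_queries.py and the same [CLUSTER] settings in dwh.cfg:

```python
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

# top 5 most played songs: join the fact table to the song dimension
cur.execute("""
    SELECT ds.title, COUNT(*) AS plays
    FROM fact_songplay fs
    JOIN dim_song ds ON fs.song_id = ds.song_id
    GROUP BY ds.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)

conn.close()
```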
/Cloud Data Warehouse/Project Data Warehouse with AWS/sql_queries.py:
--------------------------------------------------------------------------------
1 | import configparser
2 |
3 |
4 | # CONFIG
5 | config = configparser.ConfigParser()
6 | config.read('dwh.cfg')
7 |
8 | # GLOBAL VARIABLES
9 | LOG_DATA = config.get("S3","LOG_DATA")
10 | LOG_PATH = config.get("S3", "LOG_JSONPATH")
11 | SONG_DATA = config.get("S3", "SONG_DATA")
12 | IAM_ROLE = config.get("IAM_ROLE","ARN")
13 |
14 | # DROP TABLES
15 |
16 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events"
17 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs"
18 | songplay_table_drop = "DROP TABLE IF EXISTS fact_songplay"
19 | user_table_drop = "DROP TABLE IF EXISTS dim_user"
20 | song_table_drop = "DROP TABLE IF EXISTS dim_song"
21 | artist_table_drop = "DROP TABLE IF EXISTS dim_artist"
22 | time_table_drop = "DROP TABLE IF EXISTS dim_time"
23 |
24 | # CREATE TABLES
25 |
26 | staging_events_table_create= ("""
27 | CREATE TABLE IF NOT EXISTS staging_events
28 | (
29 | artist VARCHAR,
30 | auth VARCHAR,
31 | firstName VARCHAR,
32 | gender VARCHAR,
33 | itemInSession INTEGER,
34 | lastName VARCHAR,
35 | length FLOAT,
36 | level VARCHAR,
37 | location VARCHAR,
38 | method VARCHAR,
39 | page VARCHAR,
40 | registration BIGINT,
41 | sessionId INTEGER,
42 | song VARCHAR,
43 | status INTEGER,
44 | ts TIMESTAMP,
45 | userAgent VARCHAR,
46 | userId INTEGER
47 | );
48 | """)
49 |
50 | staging_songs_table_create = ("""
51 | CREATE TABLE IF NOT EXISTS staging_songs
52 | (
53 | song_id VARCHAR,
54 | num_songs INTEGER,
55 | title VARCHAR,
56 | artist_name VARCHAR,
57 | artist_latitude FLOAT,
58 | year INTEGER,
59 | duration FLOAT,
60 | artist_id VARCHAR,
61 | artist_longitude FLOAT,
62 | artist_location VARCHAR
63 | );
64 | """)
65 |
66 | songplay_table_create = ("""
67 | CREATE TABLE IF NOT EXISTS fact_songplay
68 | (
69 | songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY sortkey,
70 | start_time TIMESTAMP,
71 | user_id INTEGER,
72 | level VARCHAR,
73 | song_id VARCHAR,
74 | artist_id VARCHAR,
75 | session_id INTEGER,
76 | location VARCHAR,
77 | user_agent VARCHAR
78 | );
79 | """)
80 |
81 | user_table_create = ("""
82 | CREATE TABLE IF NOT EXISTS dim_user
83 | (
84 | user_id INTEGER PRIMARY KEY distkey,
85 | first_name VARCHAR,
86 | last_name VARCHAR,
87 | gender VARCHAR,
88 | level VARCHAR
89 | );
90 | """)
91 |
92 | song_table_create = ("""
93 | CREATE TABLE IF NOT EXISTS dim_song
94 | (
95 | song_id VARCHAR PRIMARY KEY,
96 | title VARCHAR,
97 | artist_id VARCHAR distkey,
98 | year INTEGER,
99 | duration FLOAT
100 | );
101 | """)
102 |
103 | artist_table_create = ("""
104 | CREATE TABLE IF NOT EXISTS dim_artist
105 | (
106 | artist_id VARCHAR PRIMARY KEY distkey,
107 | name VARCHAR,
108 | location VARCHAR,
109 | latitude FLOAT,
110 | longitude FLOAT
111 | );
112 | """)
113 |
114 | time_table_create = ("""
115 | CREATE TABLE IF NOT EXISTS dim_time
116 | (
117 | start_time TIMESTAMP PRIMARY KEY sortkey distkey,
118 | hour INTEGER,
119 | day INTEGER,
120 | week INTEGER,
121 | month INTEGER,
122 | year INTEGER,
123 | weekday INTEGER
124 | );
125 | """)
126 |
127 | # STAGING TABLES
128 |
129 | staging_events_copy = ("""
130 | COPY staging_events FROM {}
131 | CREDENTIALS 'aws_iam_role={}'
132 | COMPUPDATE OFF region 'us-west-2'
133 | TIMEFORMAT as 'epochmillisecs'
134 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
135 | FORMAT AS JSON {};
136 | """).format(LOG_DATA, IAM_ROLE, LOG_PATH)
137 |
138 | staging_songs_copy = ("""
139 | COPY staging_songs FROM {}
140 | CREDENTIALS 'aws_iam_role={}'
141 | COMPUPDATE OFF region 'us-west-2'
142 | FORMAT AS JSON 'auto'
143 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL;
144 | """).format(SONG_DATA, IAM_ROLE)
145 |
146 | # FINAL TABLES
147 |
148 | songplay_table_insert = ("""
149 | INSERT INTO fact_songplay(start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
150 | SELECT DISTINCT to_timestamp(to_char(se.ts, '9999-99-99 99:99:99'),'YYYY-MM-DD HH24:MI:SS'),
151 | se.userId as user_id,
152 | se.level as level,
153 | ss.song_id as song_id,
154 | ss.artist_id as artist_id,
155 | se.sessionId as session_id,
156 | se.location as location,
157 | se.userAgent as user_agent
158 | FROM staging_events se
159 | JOIN staging_songs ss ON se.song = ss.title AND se.artist = ss.artist_name;
160 | """)
161 |
162 | user_table_insert = ("""
163 | INSERT INTO dim_user(user_id, first_name, last_name, gender, level)
164 | SELECT DISTINCT userId as user_id,
165 | firstName as first_name,
166 | lastName as last_name,
167 | gender as gender,
168 | level as level
169 | FROM staging_events
170 | where userId IS NOT NULL;
171 | """)
172 |
173 | song_table_insert = ("""
174 | INSERT INTO dim_song(song_id, title, artist_id, year, duration)
175 | SELECT DISTINCT song_id as song_id,
176 | title as title,
177 | artist_id as artist_id,
178 | year as year,
179 | duration as duration
180 | FROM staging_songs
181 | WHERE song_id IS NOT NULL;
182 | """)
183 |
184 | artist_table_insert = ("""
185 | INSERT INTO dim_artist(artist_id, name, location, latitude, longitude)
186 | SELECT DISTINCT artist_id as artist_id,
187 | artist_name as name,
188 | artist_location as location,
189 | artist_latitude as latitude,
190 | artist_longitude as longitude
191 | FROM staging_songs
192 | where artist_id IS NOT NULL;
193 | """)
194 |
195 | time_table_insert = ("""
196 | INSERT INTO dim_time(start_time, hour, day, week, month, year, weekday)
197 | SELECT distinct ts,
198 | EXTRACT(hour from ts),
199 | EXTRACT(day from ts),
200 | EXTRACT(week from ts),
201 | EXTRACT(month from ts),
202 | EXTRACT(year from ts),
203 | EXTRACT(weekday from ts)
204 | FROM staging_events
205 | WHERE ts IS NOT NULL;
206 | """)
207 |
208 | # QUERY LISTS
209 |
210 | create_table_queries = [staging_events_table_create, staging_songs_table_create, songplay_table_create, user_table_create, song_table_create, artist_table_create, time_table_create]
211 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
212 | copy_table_queries = [staging_events_copy, staging_songs_copy]
213 | insert_table_queries = [songplay_table_insert, user_table_insert, song_table_insert, artist_table_insert, time_table_insert]
214 |
--------------------------------------------------------------------------------
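Step 12 of the final instructions counts the records in every table as a sanity check. A minimal sketch of that check, assuming the table names created by sql_queries.py above and the [CLUSTER] settings in dwh.cfg:

```python
import configparser
import psycopg2

# the staging and analytics tables created by sql_queries.py
tables = ['staging_events', 'staging_songs', 'fact_songplay',
          'dim_user', 'dim_song', 'dim_artist', 'dim_time']

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

for table in tables:
    cur.execute("SELECT COUNT(*) FROM {};".format(table))
    print(table, cur.fetchone()[0])

conn.close()
```

Non-zero counts for all seven tables indicate that the COPY and INSERT statements ran end to end.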
/Cloud Data Warehouse/Readme.md:
--------------------------------------------------------------------------------
1 | This folder will contain the exercise files and details for Udacity Module - Cloud Data Warehouse
2 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/Dataframe_Quiz.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Answer Key to the Data Frame Programming Quiz\n",
8 | "\n",
9 | "Helpful resources:\n",
10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 1,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "from pyspark.sql import SparkSession\n",
20 | "from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n",
21 | "from pyspark.sql.functions import sum as Fsum\n",
22 | "from pyspark.sql.window import Window\n",
23 | "from pyspark.sql.types import IntegerType"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 1) import any other libraries you might need\n",
33 | "# 2) instantiate a Spark session \n",
34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n",
35 | "# 4) write code to answer the quiz questions \n",
36 | "\n",
37 | "spark = SparkSession \\\n",
38 | " .builder \\\n",
39 | " .appName(\"Data Frames practice\") \\\n",
40 | " .getOrCreate()\n",
41 | "\n",
42 | "df = spark.read.json(\"data/sparkify_log_small.json\")"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "# Question 1\n",
50 | "\n",
51 | "Which page did user id \"\" (empty string) NOT visit?"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 3,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "name": "stdout",
61 | "output_type": "stream",
62 | "text": [
63 | "root\n",
64 | " |-- artist: string (nullable = true)\n",
65 | " |-- auth: string (nullable = true)\n",
66 | " |-- firstName: string (nullable = true)\n",
67 | " |-- gender: string (nullable = true)\n",
68 | " |-- itemInSession: long (nullable = true)\n",
69 | " |-- lastName: string (nullable = true)\n",
70 | " |-- length: double (nullable = true)\n",
71 | " |-- level: string (nullable = true)\n",
72 | " |-- location: string (nullable = true)\n",
73 | " |-- method: string (nullable = true)\n",
74 | " |-- page: string (nullable = true)\n",
75 | " |-- registration: long (nullable = true)\n",
76 | " |-- sessionId: long (nullable = true)\n",
77 | " |-- song: string (nullable = true)\n",
78 | " |-- status: long (nullable = true)\n",
79 | " |-- ts: long (nullable = true)\n",
80 | " |-- userAgent: string (nullable = true)\n",
81 | " |-- userId: string (nullable = true)\n",
82 | "\n"
83 | ]
84 | }
85 | ],
86 | "source": [
87 | "df.printSchema()"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 4,
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "Settings\n",
100 | "Logout\n",
101 | "Submit Upgrade\n",
102 | "Error\n",
103 | "NextSong\n",
104 | "Submit Downgrade\n",
105 | "Downgrade\n",
106 | "Upgrade\n",
107 | "Save Settings\n"
108 | ]
109 | }
110 | ],
111 | "source": [
112 | "# filter for users with blank user id\n",
113 | "blank_pages = df.filter(df.userId == '') \\\n",
114 | " .select(col('page') \\\n",
115 | " .alias('blank_pages')) \\\n",
116 | " .dropDuplicates()\n",
117 | "\n",
118 | "# get a list of possible pages that could be visited\n",
119 | "all_pages = df.select('page').dropDuplicates()\n",
120 | "\n",
121 | "# find values in all_pages that are not in blank_pages\n",
122 | "# these are the pages that the blank user did not go to\n",
123 | "for row in set(all_pages.collect()) - set(blank_pages.collect()):\n",
124 | " print(row.page)"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "# Question 2 - Reflect\n",
132 | "\n",
133 | "What type of user does the empty string user id most likely refer to?\n"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "Perhaps it represents users who have not signed up yet or who are signed out and are about to log in."
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "# Question 3\n",
148 | "\n",
149 | "How many female users do we have in the data set?"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 5,
155 | "metadata": {},
156 | "outputs": [
157 | {
158 | "data": {
159 | "text/plain": [
160 | "462"
161 | ]
162 | },
163 | "execution_count": 5,
164 | "metadata": {},
165 | "output_type": "execute_result"
166 | }
167 | ],
168 | "source": [
169 | "df.filter(df.gender == 'F') \\\n",
170 | " .select('userId', 'gender') \\\n",
171 | " .dropDuplicates() \\\n",
172 | " .count()"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "# Question 4\n",
180 | "\n",
181 | "How many songs were played from the most played artist?"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 6,
187 | "metadata": {},
188 | "outputs": [
189 | {
190 | "name": "stdout",
191 | "output_type": "stream",
192 | "text": [
193 | "+--------+-----------+\n",
194 | "| Artist|Artistcount|\n",
195 | "+--------+-----------+\n",
196 | "|Coldplay| 83|\n",
197 | "+--------+-----------+\n",
198 | "only showing top 1 row\n",
199 | "\n"
200 | ]
201 | }
202 | ],
203 | "source": [
204 | "df.filter(df.page == 'NextSong') \\\n",
205 | " .select('Artist') \\\n",
206 | " .groupBy('Artist') \\\n",
207 | " .agg({'Artist':'count'}) \\\n",
208 | " .withColumnRenamed('count(Artist)', 'Artistcount') \\\n",
209 | " .sort(desc('Artistcount')) \\\n",
210 | " .show(1)"
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "# Question 5 (challenge)\n",
218 | "\n",
219 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n",
220 | "\n"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 7,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "+------------------+\n",
233 | "|avg(count(period))|\n",
234 | "+------------------+\n",
235 | "| 6.898347107438017|\n",
236 | "+------------------+\n",
237 | "\n"
238 | ]
239 | }
240 | ],
241 | "source": [
242 | "function = udf(lambda ishome : int(ishome == 'Home'), IntegerType())\n",
243 | "\n",
244 | "user_window = Window \\\n",
245 | " .partitionBy('userID') \\\n",
246 | " .orderBy(desc('ts')) \\\n",
247 | " .rangeBetween(Window.unboundedPreceding, 0)\n",
248 | "\n",
249 | "cusum = df.filter((df.page == 'NextSong') | (df.page == 'Home')) \\\n",
250 | " .select('userID', 'page', 'ts') \\\n",
251 | " .withColumn('homevisit', function(col('page'))) \\\n",
252 | " .withColumn('period', Fsum('homevisit').over(user_window))\n",
253 | "\n",
254 | "cusum.filter((cusum.page == 'NextSong')) \\\n",
255 | " .groupBy('userID', 'period') \\\n",
256 | " .agg({'period':'count'}) \\\n",
257 | " .agg({'count(period)':'avg'}).show()"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": []
266 | }
267 | ],
268 | "metadata": {
269 | "kernelspec": {
270 | "display_name": "Python 3",
271 | "language": "python",
272 | "name": "python3"
273 | },
274 | "language_info": {
275 | "codemirror_mode": {
276 | "name": "ipython",
277 | "version": 3
278 | },
279 | "file_extension": ".py",
280 | "mimetype": "text/x-python",
281 | "name": "python",
282 | "nbconvert_exporter": "python",
283 | "pygments_lexer": "ipython3",
284 | "version": "3.6.3"
285 | }
286 | },
287 | "nbformat": 4,
288 | "nbformat_minor": 2
289 | }
290 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/Procedural_vs_Functional_Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Procedural Programming\n",
8 | "\n",
9 | "This notebook contains the code from the previous screencast. The code counts the number of times a song appears in the log_of_songs variable. \n",
10 | "\n",
11 | "You'll notice that the first time you run `count_plays(\"Despacito\")`, you get the correct count. However, when you run the same code again `count_plays(\"Despacito\")`, the results are no longer correct.This is because the global variable `play_count` stores the results outside of the count_plays function. \n",
12 | "\n",
13 | "\n",
14 | "# Instructions\n",
15 | "\n",
16 | "Run the code cells in this notebook to see the problem with "
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "log_of_songs = [\n",
26 | " \"Despacito\",\n",
27 | " \"Nice for what\",\n",
28 | " \"No tears left to cry\",\n",
29 | " \"Despacito\",\n",
30 | " \"Havana\",\n",
31 | " \"In my feelings\",\n",
32 | " \"Nice for what\",\n",
33 | " \"Despacito\",\n",
34 | " \"All the stars\"\n",
35 | "]"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 2,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "play_count = 0"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "def count_plays(song_title):\n",
54 | " global play_count\n",
55 | " for song in log_of_songs:\n",
56 | " if song == song_title:\n",
57 | " play_count = play_count + 1\n",
58 | " return play_count"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 4,
64 | "metadata": {},
65 | "outputs": [
66 | {
67 | "data": {
68 | "text/plain": [
69 | "3"
70 | ]
71 | },
72 | "execution_count": 4,
73 | "metadata": {},
74 | "output_type": "execute_result"
75 | }
76 | ],
77 | "source": [
78 | "count_plays(\"Despacito\")"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 5,
84 | "metadata": {},
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "6"
90 | ]
91 | },
92 | "execution_count": 5,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "count_plays(\"Despacito\")"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "# How to Solve the Issue\n",
106 | "\n",
107 | "How might you solve this issue? You could get rid of the global variable and instead use play_count as an input to the function:\n",
108 | "\n",
109 | "```python\n",
110 | "def count_plays(song_title, play_count):\n",
111 | " for song in log_of_songs:\n",
112 | " if song == song_title:\n",
113 | " play_count = play_count + 1\n",
114 | " return play_count\n",
115 | "\n",
116 | "```\n",
117 | "\n",
118 | "How would this work with parallel programming? Spark splits up data onto multiple machines. If your songs list were split onto two machines, Machine A would first need to finish counting, and then return its own result to Machine B. And then Machine B could use the output from Machine A and add to the count.\n",
119 | "\n",
120 | "However, that isn't parallel computing. Machine B would have to wait until Machine A finishes. You'll see in the next parts of the lesson how Spark solves this issue with a functional programming paradigm.\n",
121 | "\n",
122 | "In Spark, if your data is split onto two different machines, machine A will run a function to count how many times 'Despacito' appears on machine A. Machine B will simultaneously run a function to count how many times 'Despacito' appears on machine B. After they finish counting individually, they'll combine their results together. You'll see how this works in the next parts of the lesson."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": []
131 | }
132 | ],
133 | "metadata": {
134 | "kernelspec": {
135 | "display_name": "Python 3",
136 | "language": "python",
137 | "name": "python3"
138 | },
139 | "language_info": {
140 | "codemirror_mode": {
141 | "name": "ipython",
142 | "version": 3
143 | },
144 | "file_extension": ".py",
145 | "mimetype": "text/x-python",
146 | "name": "python",
147 | "nbconvert_exporter": "python",
148 | "pygments_lexer": "ipython3",
149 | "version": "3.6.3"
150 | }
151 | },
152 | "nbformat": 4,
153 | "nbformat_minor": 2
154 | }
155 |
--------------------------------------------------------------------------------
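The closing markdown cell above describes Spark's approach: each machine counts matches in its own partition and the partial counts are then combined. A minimal PySpark sketch of that idea on the same log_of_songs list (local mode, so the parallelism is only simulated; this is not part of the original notebook):

```python
import pyspark

sc = pyspark.SparkContext(appName="functional_count_example")

log_of_songs = [
    "Despacito", "Nice for what", "No tears left to cry", "Despacito",
    "Havana", "In my feelings", "Nice for what", "Despacito", "All the stars",
]

# split the list into two partitions; each partition filters its own slice,
# then count() sums the per-partition results -- no shared play_count variable
distributed_song_log = sc.parallelize(log_of_songs, numSlices=2)
despacito_plays = distributed_song_log.filter(lambda song: song == "Despacito").count()
print(despacito_plays)  # 3, and re-running it still prints 3

sc.stop()
```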
/Data Lakes with Spark/Project Data Lake with Spark/README.md:
--------------------------------------------------------------------------------
1 | Project: Data Lake
2 |
3 | Introduction
4 |
5 | A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Its data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.
6 |
7 |
8 | Project Description
9 |
10 | Apply knowledge of Spark and data lakes to build an ETL pipeline for a data lake hosted on Amazon S3.
11 | 
12 | In this task, we have to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads it back into S3 as a set of fact and dimension tables. This will allow their analytics team to continue finding insights into what songs their users are listening to. The Spark process will be deployed on a cluster using AWS.
13 |
14 | Project Datasets
15 |
16 | Song Data Path --> s3://udacity-dend/song_data Log Data Path --> s3://udacity-dend/log_data Log Data JSON Path --> s3://udacity-dend/log_json_path.json
17 |
18 | Song Dataset
19 |
20 | The first dataset is a subset of real data from the Million Song Dataset(https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example:
21 |
22 | song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json
23 |
24 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
25 |
26 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
27 |
28 | Log Dataset
29 |
30 | The second dataset consists of log files in JSON format. The log files in the dataset are partitioned by year and month. For example:
31 |
32 | log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json
33 |
34 | And below is an example of what a single log file, 2018-11-13-events.json, looks like.
35 |
36 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender", "F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10}
37 |
38 | Schema for Song Play Analysis
39 |
40 | A star schema is used to optimize queries for song play analysis.
41 |
42 | Fact Table
43 |
44 | songplays - records in event data associated with song plays i.e. records with page NextSong songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
45 |
46 | Dimension Tables
47 |
48 | users - users in the app user_id, first_name, last_name, gender, level
49 |
50 | songs - songs in music database song_id, title, artist_id, year, duration
51 |
52 | artists - artists in music database artist_id, name, location, latitude, longitude
53 |
54 | time - timestamps of records in songplays broken down into specific units start_time, hour, day, week, month, year, weekday
55 |
56 | Project Template
57 |
58 | The Project Template includes three files:
59 |
60 | 1. etl.py reads data from S3, processes that data using Spark and writes them back to S3
61 |
62 | 2. dl.cfg contains AWS Credentials
63 |
64 | 3. README.md provides discussion on your process and decisions
65 |
66 | ETL Pipeline
67 |
68 | 1. Load the AWS credentials from dl.cfg
69 | 2. Load the song data and log data, which are JSON files stored in S3
70 | 3. Process these JSON files with Spark
71 | 4. Generate a set of fact and dimension tables
72 | 5. Write these tables back to S3 as parquet files
73 |
74 | Final Instructions
75 |
76 | 1. Write the correct AWS keys in dl.cfg
77 | 2. Open a terminal and run the command "python etl.py"
78 | 3. The run should take about 2-4 minutes in total
79 |
--------------------------------------------------------------------------------
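A quick way to confirm that etl.py (shown after dl.cfg below) actually wrote its tables is to read one of them back from S3. A minimal sketch, assuming the songs_table path and partitioning used in etl.py, with output_data set to the same s3a:// prefix passed to etl.py (the AWS keys from dl.cfg still need to be set in the environment, as etl.py does):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()

# same output location that etl.py wrote to
output_data = "s3a://udacity-dend/dloutput/"

# songs_table was written with partitionBy("year", "artist_id"), so the
# partition columns come back as regular columns when the directory is read
songs = spark.read.parquet(output_data + "songs_table/")
songs.printSchema()
songs.groupBy("year").count().show(5)
```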
/Data Lakes with Spark/Project Data Lake with Spark/Readme.MD:
--------------------------------------------------------------------------------
1 | Hello Testing
2 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/Project Data Lake with Spark/dl.cfg:
--------------------------------------------------------------------------------
1 | [AWS]
2 | AWS_ACCESS_KEY_ID=
3 | AWS_SECRET_ACCESS_KEY=
--------------------------------------------------------------------------------
/Data Lakes with Spark/Project Data Lake with Spark/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | from datetime import datetime
3 | import os
4 | from pyspark.sql import SparkSession
5 | from pyspark.sql.functions import udf, col
6 | from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
7 |
8 |
9 | config = configparser.ConfigParser()
10 | config.read_file(open('dl.cfg'))
11 |
12 | os.environ['AWS_ACCESS_KEY_ID']=config.get('AWS','AWS_ACCESS_KEY_ID')
13 | os.environ['AWS_SECRET_ACCESS_KEY']=config.get('AWS','AWS_SECRET_ACCESS_KEY')
14 |
15 |
16 | def create_spark_session():
17 | spark = SparkSession \
18 | .builder \
19 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
20 | .getOrCreate()
21 | return spark
22 |
23 |
24 | def process_song_data(spark, input_data, output_data):
25 | """
26 |     Description: This function loads song_data from S3 and processes it by extracting the songs and artists tables,
27 |     which are then written back to S3 as parquet files
28 |
29 | Parameters:
30 |             spark : the Spark session
31 |             input_data : the S3 location of the song_data files to load
32 |             output_data : the S3 location where the processed tables will be stored
33 |
34 | """
35 | # get filepath to song data file
36 | song_data = input_data + 'song_data/*/*/*/*.json'
37 |
38 | # read song data file
39 | df = spark.read.json(song_data)
40 |
41 | # created song view to write SQL Queries
42 | df.createOrReplaceTempView("song_data_table")
43 |
44 | # extract columns to create songs table
45 | songs_table = spark.sql("""
46 | SELECT sdtn.song_id,
47 | sdtn.title,
48 | sdtn.artist_id,
49 | sdtn.year,
50 | sdtn.duration
51 | FROM song_data_table sdtn
52 | WHERE song_id IS NOT NULL
53 | """)
54 |
55 | # write songs table to parquet files partitioned by year and artist
56 | songs_table.write.mode('overwrite').partitionBy("year", "artist_id").parquet(output_data+'songs_table/')
57 |
58 | # extract columns to create artists table
59 | artists_table = spark.sql("""
60 | SELECT DISTINCT arti.artist_id,
61 | arti.artist_name,
62 | arti.artist_location,
63 | arti.artist_latitude,
64 | arti.artist_longitude
65 | FROM song_data_table arti
66 | WHERE arti.artist_id IS NOT NULL
67 | """)
68 |
69 | # write artists table to parquet files
70 | artists_table.write.mode('overwrite').parquet(output_data+'artists_table/')
71 |
72 |
73 | def process_log_data(spark, input_data, output_data):
74 | """
75 |     Description: This function loads log_data from S3 and processes it by extracting the users, time and songplays tables,
76 |     which are then written back to S3 as parquet files. The songs parquet output of the previous function is also read back with spark.read.parquet
77 |
78 | Parameters:
79 |             spark : the Spark session
80 |             input_data : the S3 location of the log_data files to load
81 |             output_data : the S3 location where the processed tables will be stored
82 |
83 | """
84 | # get filepath to log data file
85 | log_path = input_data + 'log_data/*.json'
86 |
87 | # read log data file
88 | df = spark.read.json(log_path)
89 |
90 | # filter by actions for song plays
91 | df = df.filter(df.page == 'NextSong')
92 |
93 | # created log view to write SQL Queries
94 | df.createOrReplaceTempView("log_data_table")
95 |
96 | # extract columns for users table
97 | users_table = spark.sql("""
98 | SELECT DISTINCT userT.userId as user_id,
99 | userT.firstName as first_name,
100 | userT.lastName as last_name,
101 | userT.gender as gender,
102 | userT.level as level
103 | FROM log_data_table userT
104 | WHERE userT.userId IS NOT NULL
105 | """)
106 |
107 | # write users table to parquet files
108 | users_table.write.mode('overwrite').parquet(output_data+'users_table/')
109 |
110 | # create timestamp column from original timestamp column
111 | # get_timestamp = udf()
112 | # df =
113 |
114 | # create datetime column from original timestamp column
115 | # get_datetime = udf()
116 | # df =
117 |
118 | # extract columns to create time table
119 | time_table = spark.sql("""
120 | SELECT
121 | A.start_time_sub as start_time,
122 | hour(A.start_time_sub) as hour,
123 | dayofmonth(A.start_time_sub) as day,
124 | weekofyear(A.start_time_sub) as week,
125 | month(A.start_time_sub) as month,
126 | year(A.start_time_sub) as year,
127 | dayofweek(A.start_time_sub) as weekday
128 | FROM
129 | (SELECT to_timestamp(timeSt.ts/1000) as start_time_sub
130 | FROM log_data_table timeSt
131 | WHERE timeSt.ts IS NOT NULL
132 | ) A
133 | """)
134 |
135 | # write time table to parquet files partitioned by year and month
136 | time_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'time_table/')
137 |
138 | # read in song data to use for songplays table
139 | song_df = spark.read.parquet(output_data+'songs_table/')
140 |
141 | # read song data file
142 | # song_df_upd = spark.read.json(input_data + 'song_data/*/*/*/*.json')
143 | # created song view to write SQL Queries
144 | # song_df_upd.createOrReplaceTempView("song_data_table")
145 |
146 |
147 |
148 | # extract columns from joined song and log datasets to create songplays table
149 | songplays_table = spark.sql("""
150 | SELECT monotonically_increasing_id() as songplay_id,
151 | to_timestamp(logT.ts/1000) as start_time,
152 | month(to_timestamp(logT.ts/1000)) as month,
153 | year(to_timestamp(logT.ts/1000)) as year,
154 | logT.userId as user_id,
155 | logT.level as level,
156 | songT.song_id as song_id,
157 | songT.artist_id as artist_id,
158 | logT.sessionId as session_id,
159 | logT.location as location,
160 | logT.userAgent as user_agent
161 |
162 | FROM log_data_table logT
163 | JOIN song_data_table songT on logT.artist = songT.artist_name and logT.song = songT.title
164 | """)
165 |
166 | # write songplays table to parquet files partitioned by year and month
167 | songplays_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'songplays_table/')
168 |
169 |
170 | def main():
171 | spark = create_spark_session()
172 |
173 | input_data = "s3a://udacity-dend/"
174 | output_data = "s3a://udacity-dend/dloutput/"
175 |
176 | #input_data = "./"
177 | #output_data = "./dloutput/"
178 |
179 | process_song_data(spark, input_data, output_data)
180 | process_log_data(spark, input_data, output_data)
181 |
182 |
183 | if __name__ == "__main__":
184 | main()
185 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/README.md:
--------------------------------------------------------------------------------
1 | Data Lakes with Spark Exercise Files
2 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/Spark_Maps_Lazy_Evaluation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Maps\n",
8 | "\n",
9 | "In Spark, maps take data as input and then transform that data with whatever function you put in the map. They are like directions for the data telling how each input should get to the output.\n",
10 | "\n",
11 | "The first code cell creates a SparkContext object. With the SparkContext, you can input a dataset and parallelize the data across a cluster (since you are currently using Spark in local mode on a single machine, technically the dataset isn't distributed yet).\n",
12 | "\n",
13 | "Run the code cell below to instantiate a SparkContext object and then read in the log_of_songs list into Spark. "
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "### \n",
23 | "# You might have noticed this code in the screencast.\n",
24 | "#\n",
25 | "# import findspark\n",
26 | "# findspark.init('spark-2.3.2-bin-hadoop2.7')\n",
27 | "#\n",
28 | "# The findspark Python module makes it easier to install\n",
29 | "# Spark in local mode on your computer. This is convenient\n",
30 | "# for practicing Spark syntax locally. \n",
31 | "# However, the workspaces already have Spark installed and you do not\n",
32 | "# need to use the findspark module\n",
33 | "#\n",
34 | "###\n",
35 | "\n",
36 | "import pyspark\n",
37 | "sc = pyspark.SparkContext(appName=\"maps_and_lazy_evaluation_example\")\n",
38 | "\n",
39 | "log_of_songs = [\n",
40 | " \"Despacito\",\n",
41 | " \"Nice for what\",\n",
42 | " \"No tears left to cry\",\n",
43 | " \"Despacito\",\n",
44 | " \"Havana\",\n",
45 | " \"In my feelings\",\n",
46 | " \"Nice for what\",\n",
47 | " \"despacito\",\n",
48 | " \"All the stars\"\n",
49 | "]\n",
50 | "\n",
51 | "# parallelize the log_of_songs to use with Spark\n",
52 | "distributed_song_log = sc.parallelize(log_of_songs)"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "This next code cell defines a function that converts a song title to lowercase. Then there is an example converting the word \"Havana\" to \"havana\"."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 2,
65 | "metadata": {},
66 | "outputs": [
67 | {
68 | "data": {
69 | "text/plain": [
70 | "'havana'"
71 | ]
72 | },
73 | "execution_count": 2,
74 | "metadata": {},
75 | "output_type": "execute_result"
76 | }
77 | ],
78 | "source": [
79 | "def convert_song_to_lowercase(song):\n",
80 | " return song.lower()\n",
81 | "\n",
82 | "convert_song_to_lowercase(\"Havana\")"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "The following code cells demonstrate how to apply this function using a map step. The map step will go through each song in the list and apply the convert_song_to_lowercase() function. "
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 3,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "PythonRDD[1] at RDD at PythonRDD.scala:53"
101 | ]
102 | },
103 | "execution_count": 3,
104 | "metadata": {},
105 | "output_type": "execute_result"
106 | }
107 | ],
108 | "source": [
109 | "distributed_song_log.map(convert_song_to_lowercase)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "You'll notice that this code cell ran quite quickly. This is because of lazy evaluation. Spark does not actually execute the map step unless it needs to.\n",
117 | "\n",
118 | "\"RDD\" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data. \n",
119 | "\n",
120 | "To get Spark to actually run the map step, you need to use an \"action\". One available action is the collect method. The collect() method takes the results from all of the clusters and \"collects\" them into a single list on the master node."
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 4,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "data": {
130 | "text/plain": [
131 | "['despacito',\n",
132 | " 'nice for what',\n",
133 | " 'no tears left to cry',\n",
134 | " 'despacito',\n",
135 | " 'havana',\n",
136 | " 'in my feelings',\n",
137 | " 'nice for what',\n",
138 | " 'despacito',\n",
139 | " 'all the stars']"
140 | ]
141 | },
142 | "execution_count": 4,
143 | "metadata": {},
144 | "output_type": "execute_result"
145 | }
146 | ],
147 | "source": [
148 | "distributed_song_log.map(convert_song_to_lowercase).collect()"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "Note as well that Spark is not changing the original data set: Spark is merely making a copy. You can see this by running collect() on the original dataset."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 5,
161 | "metadata": {},
162 | "outputs": [
163 | {
164 | "data": {
165 | "text/plain": [
166 | "['Despacito',\n",
167 | " 'Nice for what',\n",
168 | " 'No tears left to cry',\n",
169 | " 'Despacito',\n",
170 | " 'Havana',\n",
171 | " 'In my feelings',\n",
172 | " 'Nice for what',\n",
173 | " 'despacito',\n",
174 | " 'All the stars']"
175 | ]
176 | },
177 | "execution_count": 5,
178 | "metadata": {},
179 | "output_type": "execute_result"
180 | }
181 | ],
182 | "source": [
183 | "distributed_song_log.collect()"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions as well as built-in Python functions like string.lower(). \n",
191 | "\n",
192 | "Anonymous functions are actually a Python feature for writing functional style programs."
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 6,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "data": {
202 | "text/plain": [
203 | "['despacito',\n",
204 | " 'nice for what',\n",
205 | " 'no tears left to cry',\n",
206 | " 'despacito',\n",
207 | " 'havana',\n",
208 | " 'in my feelings',\n",
209 | " 'nice for what',\n",
210 | " 'despacito',\n",
211 | " 'all the stars']"
212 | ]
213 | },
214 | "execution_count": 6,
215 | "metadata": {},
216 | "output_type": "execute_result"
217 | }
218 | ],
219 | "source": [
220 | "distributed_song_log.map(lambda song: song.lower()).collect()"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 7,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "data": {
230 | "text/plain": [
231 | "['despacito',\n",
232 | " 'nice for what',\n",
233 | " 'no tears left to cry',\n",
234 | " 'despacito',\n",
235 | " 'havana',\n",
236 | " 'in my feelings',\n",
237 | " 'nice for what',\n",
238 | " 'despacito',\n",
239 | " 'all the stars']"
240 | ]
241 | },
242 | "execution_count": 7,
243 | "metadata": {},
244 | "output_type": "execute_result"
245 | }
246 | ],
247 | "source": [
248 | "distributed_song_log.map(lambda x: x.lower()).collect()"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 9,
254 | "metadata": {},
255 | "outputs": [
256 | {
257 | "data": {
258 | "text/plain": [
259 | "9"
260 | ]
261 | },
262 | "execution_count": 9,
263 | "metadata": {},
264 | "output_type": "execute_result"
265 | }
266 | ],
267 | "source": [
268 | "distributed_song_log.map(lambda x: x.lower()).count()"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": []
277 | }
278 | ],
279 | "metadata": {
280 | "kernelspec": {
281 | "display_name": "Python 3",
282 | "language": "python",
283 | "name": "python3"
284 | },
285 | "language_info": {
286 | "codemirror_mode": {
287 | "name": "ipython",
288 | "version": 3
289 | },
290 | "file_extension": ".py",
291 | "mimetype": "text/x-python",
292 | "name": "python",
293 | "nbconvert_exporter": "python",
294 | "pygments_lexer": "ipython3",
295 | "version": "3.6.3"
296 | }
297 | },
298 | "nbformat": 4,
299 | "nbformat_minor": 2
300 | }
301 |
--------------------------------------------------------------------------------
/Data Lakes with Spark/Spark_Sql_Quiz.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Answer Key to the Data Frame Programming Quiz\n",
8 | "\n",
9 | "Helpful resources:\n",
10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 4,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "from pyspark.sql import SparkSession\n",
20 | "# from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n",
21 | "# from pyspark.sql.functions import sum as Fsum\n",
22 | "# from pyspark.sql.window import Window\n",
23 | "# from pyspark.sql.types import IntegerType"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 5,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 1) import any other libraries you might need\n",
33 | "# 2) instantiate a Spark session \n",
34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n",
35 | "# 4) create a view to use with your SQL queries\n",
36 | "# 5) write code to answer the quiz questions \n",
37 | "\n",
38 | "spark = SparkSession \\\n",
39 | " .builder \\\n",
40 | " .appName(\"Spark SQL Quiz\") \\\n",
41 | " .getOrCreate()\n",
42 | "\n",
43 | "user_log = spark.read.json(\"data/sparkify_log_small.json\")\n",
44 | "\n",
45 | "user_log.createOrReplaceTempView(\"log_table\")\n"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "# Question 1\n",
53 | "\n",
54 | "Which page did user id \"\" (empty string) NOT visit?"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 6,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "root\n",
67 | " |-- artist: string (nullable = true)\n",
68 | " |-- auth: string (nullable = true)\n",
69 | " |-- firstName: string (nullable = true)\n",
70 | " |-- gender: string (nullable = true)\n",
71 | " |-- itemInSession: long (nullable = true)\n",
72 | " |-- lastName: string (nullable = true)\n",
73 | " |-- length: double (nullable = true)\n",
74 | " |-- level: string (nullable = true)\n",
75 | " |-- location: string (nullable = true)\n",
76 | " |-- method: string (nullable = true)\n",
77 | " |-- page: string (nullable = true)\n",
78 | " |-- registration: long (nullable = true)\n",
79 | " |-- sessionId: long (nullable = true)\n",
80 | " |-- song: string (nullable = true)\n",
81 | " |-- status: long (nullable = true)\n",
82 | " |-- ts: long (nullable = true)\n",
83 | " |-- userAgent: string (nullable = true)\n",
84 | " |-- userId: string (nullable = true)\n",
85 | "\n"
86 | ]
87 | }
88 | ],
89 | "source": [
90 | "user_log.printSchema()"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 7,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "name": "stdout",
100 | "output_type": "stream",
101 | "text": [
102 | "+----+----------------+\n",
103 | "|page| page|\n",
104 | "+----+----------------+\n",
105 | "|null|Submit Downgrade|\n",
106 | "|null| Downgrade|\n",
107 | "|null| Logout|\n",
108 | "|null| Save Settings|\n",
109 | "|null| Settings|\n",
110 | "|null| NextSong|\n",
111 | "|null| Upgrade|\n",
112 | "|null| Error|\n",
113 | "|null| Submit Upgrade|\n",
114 | "+----+----------------+\n",
115 | "\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "# SELECT distinct pages for the blank user and distinc pages for all users\n",
121 | "# Right join the results to find pages that blank visitor did not visit\n",
122 | "spark.sql(\"SELECT * \\\n",
123 | " FROM ( \\\n",
124 | " SELECT DISTINCT page \\\n",
125 | " FROM log_table \\\n",
126 | " WHERE userID='') AS user_pages \\\n",
127 | " RIGHT JOIN ( \\\n",
128 | " SELECT DISTINCT page \\\n",
129 | " FROM log_table) AS all_pages \\\n",
130 | " ON user_pages.page = all_pages.page \\\n",
131 | " WHERE user_pages.page IS NULL\").show()"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "# Question 2 - Reflect\n",
139 | "\n",
140 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?\n",
141 | "\n",
142 | "Both Spark SQL and Spark Data Frames are part of the Spark SQL library. Hence, they both use the Spark SQL Catalyst Optimizer to optimize queries. \n",
143 | "\n",
144 | "You might prefer SQL over data frames because the syntax is clearer especially for teams already experienced in SQL.\n",
145 | "\n",
146 | "Spark data frames give you more control. You can break down your queries into smaller steps, which can make debugging easier. You can also [cache](https://unraveldata.com/to-cache-or-not-to-cache/) intermediate results or [repartition](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) intermediate results."
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "# Question 3\n",
154 | "\n",
155 | "How many female users do we have in the data set?"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 8,
161 | "metadata": {},
162 | "outputs": [
163 | {
164 | "name": "stdout",
165 | "output_type": "stream",
166 | "text": [
167 | "+----------------------+\n",
168 | "|count(DISTINCT userID)|\n",
169 | "+----------------------+\n",
170 | "| 462|\n",
171 | "+----------------------+\n",
172 | "\n"
173 | ]
174 | }
175 | ],
176 | "source": [
177 | "spark.sql(\"SELECT COUNT(DISTINCT userID) \\\n",
178 | " FROM log_table \\\n",
179 | " WHERE gender = 'F'\").show()"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "# Question 4\n",
187 | "\n",
188 | "How many songs were played from the most played artist?"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 9,
194 | "metadata": {},
195 | "outputs": [
196 | {
197 | "name": "stdout",
198 | "output_type": "stream",
199 | "text": [
200 | "+--------+-----+\n",
201 | "| Artist|plays|\n",
202 | "+--------+-----+\n",
203 | "|Coldplay| 83|\n",
204 | "+--------+-----+\n",
205 | "\n",
206 | "+--------+-----+\n",
207 | "| Artist|plays|\n",
208 | "+--------+-----+\n",
209 | "|Coldplay| 83|\n",
210 | "+--------+-----+\n",
211 | "\n"
212 | ]
213 | }
214 | ],
215 | "source": [
216 | "# Here is one solution\n",
217 | "spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n",
218 | " FROM log_table \\\n",
219 | " GROUP BY Artist \\\n",
220 | " ORDER BY plays DESC \\\n",
221 | " LIMIT 1\").show()\n",
222 | "\n",
223 | "# Here is an alternative solution\n",
224 | "# Get the artist play counts\n",
225 | "play_counts = spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n",
226 | " FROM log_table \\\n",
227 | " GROUP BY Artist\")\n",
228 | "\n",
229 | "# save the results in a new view\n",
230 | "play_counts.createOrReplaceTempView(\"artist_counts\")\n",
231 | "\n",
232 | "# use a self join to find where the max play equals the count value\n",
233 | "spark.sql(\"SELECT a2.Artist, a2.plays FROM \\\n",
234 | " (SELECT max(plays) AS max_plays FROM artist_counts) AS a1 \\\n",
235 | " JOIN artist_counts AS a2 \\\n",
236 | " ON a1.max_plays = a2.plays \\\n",
237 | " \").show()"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "# Question 5 (challenge)\n",
245 | "\n",
246 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n",
247 | "\n"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 31,
253 | "metadata": {},
254 | "outputs": [
255 | {
256 | "name": "stdout",
257 | "output_type": "stream",
258 | "text": [
259 | "+------------------+\n",
260 | "|avg(count(period))|\n",
261 | "+------------------+\n",
262 | "| 6.898347107438017|\n",
263 | "+------------------+\n",
264 | "\n"
265 | ]
266 | }
267 | ],
268 | "source": [
269 | "# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;\n",
270 | "is_home = spark.sql(\"SELECT userID, page, ts, CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home FROM log_table \\\n",
271 | " WHERE (page = 'NextSong') or (page = 'Home') \\\n",
272 | " \")\n",
273 | "\n",
274 | "# keep the results in a new view\n",
275 | "is_home.createOrReplaceTempView(\"is_home_table\")\n",
276 | "\n",
277 | "# find the cumulative sum over the is_home column\n",
278 | "cumulative_sum = spark.sql(\"SELECT *, SUM(is_home) OVER \\\n",
279 | " (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period \\\n",
280 | " FROM is_home_table\")\n",
281 | "\n",
282 | "# keep the results in a view\n",
283 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n",
284 | "\n",
285 | "# find the average count for NextSong\n",
286 | "spark.sql(\"SELECT AVG(count_results) FROM \\\n",
287 | " (SELECT COUNT(*) AS count_results FROM period_table \\\n",
288 | "GROUP BY userID, period, page HAVING page = 'NextSong') AS counts\").show()"
289 | ]
290 | }
291 | ],
292 | "metadata": {
293 | "kernelspec": {
294 | "display_name": "Python 3",
295 | "language": "python",
296 | "name": "python3"
297 | },
298 | "language_info": {
299 | "codemirror_mode": {
300 | "name": "ipython",
301 | "version": 3
302 | },
303 | "file_extension": ".py",
304 | "mimetype": "text/x-python",
305 | "name": "python",
306 | "nbconvert_exporter": "python",
307 | "pygments_lexer": "ipython3",
308 | "version": "3.6.3"
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 2
313 | }
314 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 1.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Define a function that uses the Python logger to log a message. Then finish filling in the details of the DAG below. Once you’ve done that, run the "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.operators.python_operator import PythonOperator
9 |
10 | def first_prog():
11 | logging.info("This is my very first program for airflow")
12 |
13 | dag = DAG(
14 | 'lesson1.exercise1',
15 | start_date=datetime.datetime.now())
16 |
17 | greet_task = PythonOperator(
18 | task_id="first_airflow_program",
19 | python_callable=first_prog,
20 | dag=dag
21 | )
22 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 2.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 |
4 | from airflow import DAG
5 | from airflow.operators.python_operator import PythonOperator
6 |
7 |
8 | def second_prog():
9 | logging.info("This is my second program for airflow")
10 |
11 | dag = DAG(
12 | "lesson1.exercise2",
13 | start_date=datetime.datetime.now() - datetime.timedelta(days=2),
14 | schedule_interval="@daily")
15 |
16 | task = PythonOperator(
17 | task_id="exercise_2",
18 | python_callable=second_prog,
19 | dag=dag)
20 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 3.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 |
4 | from airflow import DAG
5 | from airflow.operators.python_operator import PythonOperator
6 |
7 |
8 | def hello_world():
9 | logging.info("Hello World")
10 |
11 |
12 | def addition():
13 | logging.info(f"2 + 2 = {2+2}")
14 |
15 |
16 | def subtraction():
17 |     logging.info(f"6 - 2 = {6-2}")
18 |
19 |
20 | def division():
21 | logging.info(f"10 / 2 = {int(10/2)}")
22 |
23 | def completed_task():
24 | logging.info("All Tasks Completed")
25 |
26 |
27 | dag = DAG(
28 | "lesson1.exercise3",
29 | schedule_interval='@hourly',
30 | start_date=datetime.datetime.now() - datetime.timedelta(days=1))
31 |
32 | hello_world_task = PythonOperator(
33 | task_id="hello_world",
34 | python_callable=hello_world,
35 | dag=dag)
36 |
37 | addition_task = PythonOperator(
38 | task_id="addition",
39 | python_callable=addition,
40 | dag=dag)
41 |
42 | subtraction_task = PythonOperator(
43 | task_id="subtraction",
44 | python_callable=subtraction,
45 | dag=dag)
46 |
47 | division_task = PythonOperator(
48 | task_id="division",
49 | python_callable=division,
50 | dag=dag)
51 |
52 | completed_task = PythonOperator(
53 | task_id="completed_task",
54 | python_callable=completed_task,
55 | dag=dag)
56 | #
57 | # -> addition_task
58 | # / \
59 | # hello_world_task -> division_task-> completed_task
60 | # \ /
61 | # -> subtraction_task
62 |
63 | hello_world_task >> addition_task
64 | hello_world_task >> division_task
65 | hello_world_task >> subtraction_task
66 | addition_task >> completed_task
67 | division_task >> completed_task
68 | subtraction_task >> completed_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 4.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 |
4 | from airflow import DAG
5 | from airflow.models import Variable
6 | from airflow.operators.python_operator import PythonOperator
7 | from airflow.hooks.S3_hook import S3Hook
8 |
9 | #
10 | # TODO: There is no code to modify in this exercise. We're going to create a connection and a
11 | # variable.
12 | # 1. Open your browser to localhost:8080 and open Admin->Variables
13 | # 2. Click "Create"
14 | # 3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend"
15 | # 4. Click save
16 | # 5. Open Admin->Connections
17 | # 6. Click "Create"
18 | # 7. Set "Conn Id" to "aws_credentials", "Conn Type" to "Amazon Web Services"
19 | # Set "Login" to your aws_access_key_id and "Password" to your aws_secret_key
20 | # 8. Click save
21 | # 9. Run the DAG
22 |
23 | def list_keys():
24 | hook = S3Hook(aws_conn_id='aws_credentials')
25 | bucket = Variable.get('s3_bucket')
26 | logging.info(f"Listing Keys from {bucket}")
27 | keys = hook.list_keys(bucket)
28 | for key in keys:
29 | logging.info(f"- s3://{bucket}/{key}")
30 |
31 |
32 | dag = DAG(
33 | 'lesson1.exercise4',
34 | start_date=datetime.datetime.now())
35 |
36 | list_task = PythonOperator(
37 | task_id="list_keys",
38 | python_callable=list_keys,
39 | dag=dag
40 | )
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 5.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Use the Airflow context in the PythonOperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.models import Variable
9 | from airflow.operators.python_operator import PythonOperator
10 | from airflow.hooks.S3_hook import S3Hook
11 |
12 |
13 | def log_details(*args, **kwargs):
14 | #
15 | # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them
16 | # NOTE: Look here for context variables passed in on kwargs:
17 | # https://airflow.apache.org/macros.html
18 | #
19 |     ds = kwargs['ds']
20 |     run_id = kwargs['run_id']
21 |     previous_ds = kwargs.get('prev_ds')
22 |     next_ds = kwargs.get('next_ds')
23 |
24 | logging.info(f"Execution date is {ds}")
25 | logging.info(f"My run id is {run_id}")
26 | if previous_ds:
27 | logging.info(f"My previous run was on {previous_ds}")
28 | if next_ds:
29 | logging.info(f"My next run will be {next_ds}")
30 |
31 | dag = DAG(
32 | 'lesson1.exercise5',
33 | schedule_interval="@daily",
34 | start_date=datetime.datetime.now() - datetime.timedelta(days=2)
35 | )
36 |
37 | list_task = PythonOperator(
38 | task_id="log_details",
39 | python_callable=log_details,
40 | provide_context=True,
41 | dag=dag
42 | )
43 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Pipeline - Exercise 6.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Similar to what you saw in the demo, copy and populate the trips table. Then, add another operator which creates a traffic analysis table from the trips table you created. Note, in this class, we won’t be writing SQL -- all of the SQL statements we run against Redshift are predefined and included in your lesson.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.contrib.hooks.aws_hook import AwsHook
9 | from airflow.hooks.postgres_hook import PostgresHook
10 | from airflow.operators.postgres_operator import PostgresOperator
11 | from airflow.operators.python_operator import PythonOperator
12 |
13 | import sql_statements
14 |
15 |
16 | def load_data_to_redshift(*args, **kwargs):
17 | aws_hook = AwsHook("aws_credentials")
18 | credentials = aws_hook.get_credentials()
19 | redshift_hook = PostgresHook("redshift")
20 | redshift_hook.run(sql_statements.COPY_ALL_TRIPS_SQL.format(credentials.access_key, credentials.secret_key))
21 |
22 |
23 | dag = DAG(
24 | 'lesson1.exercise6',
25 | start_date=datetime.datetime.now()
26 | )
27 |
28 | create_table = PostgresOperator(
29 | task_id="create_table",
30 | dag=dag,
31 | postgres_conn_id="redshift",
32 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
33 | )
34 |
35 | copy_task = PythonOperator(
36 | task_id='load_from_s3_to_redshift',
37 | dag=dag,
38 | python_callable=load_data_to_redshift
39 | )
40 |
41 | location_traffic_task = PostgresOperator(
42 | task_id="calculate_location_traffic",
43 | dag=dag,
44 | postgres_conn_id="redshift",
45 | sql=sql_statements.LOCATION_TRAFFIC_SQL
46 | )
47 |
48 | create_table >> copy_task
49 | copy_task >> location_traffic_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Quality - Exercise 1.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Run the DAG as it is first, and observe the Airflow UI
3 | #2 - Next, open up the DAG and add the copy and load tasks as directed in the TODOs
4 | #3 - Reload the Airflow UI and run the DAG once more, observing the Airflow UI
5 |
6 | import datetime
7 | import logging
8 |
9 | from airflow import DAG
10 | from airflow.contrib.hooks.aws_hook import AwsHook
11 | from airflow.hooks.postgres_hook import PostgresHook
12 | from airflow.operators.postgres_operator import PostgresOperator
13 | from airflow.operators.python_operator import PythonOperator
14 |
15 | import sql_statements
16 |
17 |
18 | def load_trip_data_to_redshift(*args, **kwargs):
19 | aws_hook = AwsHook("aws_credentials")
20 | credentials = aws_hook.get_credentials()
21 | redshift_hook = PostgresHook("redshift")
22 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
23 | credentials.access_key,
24 | credentials.secret_key,
25 | )
26 | redshift_hook.run(sql_stmt)
27 |
28 |
29 | def load_station_data_to_redshift(*args, **kwargs):
30 | aws_hook = AwsHook("aws_credentials")
31 | credentials = aws_hook.get_credentials()
32 | redshift_hook = PostgresHook("redshift")
33 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
34 | credentials.access_key,
35 | credentials.secret_key,
36 | )
37 | redshift_hook.run(sql_stmt)
38 |
39 |
40 | dag = DAG(
41 | 'lesson2.exercise1',
42 | start_date=datetime.datetime.now()
43 | )
44 |
45 | create_trips_table = PostgresOperator(
46 | task_id="create_trips_table",
47 | dag=dag,
48 | postgres_conn_id="redshift",
49 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
50 | )
51 |
52 | copy_trips_task = PythonOperator(
53 | task_id='load_trips_from_s3_to_redshift',
54 | dag=dag,
55 | python_callable=load_trip_data_to_redshift,
56 | )
57 |
58 | create_stations_table = PostgresOperator(
59 | task_id="create_stations_table",
60 | dag=dag,
61 | postgres_conn_id="redshift",
62 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
63 | )
64 |
65 | copy_stations_task = PythonOperator(
66 | task_id='load_stations_from_s3_to_redshift',
67 | dag=dag,
68 | python_callable=load_station_data_to_redshift,
69 | )
70 |
71 | create_trips_table >> copy_trips_task
72 | create_stations_table >> copy_stations_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Quality - Exercise 2.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Revisit our bikeshare traffic
3 | #2 - Update our DAG with
4 | # a - @monthly schedule_interval
5 | # b - max_active_runs of 1
6 | # c - start_date of 2018/01/01
7 | # d - end_date of 2018/02/01
8 | # Use Airflow’s backfill capabilities to analyze our trip data on a monthly basis over 2 historical runs
9 |
10 | import datetime
11 | import logging
12 |
13 | from airflow import DAG
14 | from airflow.contrib.hooks.aws_hook import AwsHook
15 | from airflow.hooks.postgres_hook import PostgresHook
16 | from airflow.operators.postgres_operator import PostgresOperator
17 | from airflow.operators.python_operator import PythonOperator
18 |
19 | import sql_statements
20 |
21 |
22 | def load_trip_data_to_redshift(*args, **kwargs):
23 | aws_hook = AwsHook("aws_credentials")
24 | credentials = aws_hook.get_credentials()
25 | redshift_hook = PostgresHook("redshift")
26 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
27 | credentials.access_key,
28 | credentials.secret_key,
29 | )
30 | redshift_hook.run(sql_stmt)
31 |
32 |
33 | def load_station_data_to_redshift(*args, **kwargs):
34 | aws_hook = AwsHook("aws_credentials")
35 | credentials = aws_hook.get_credentials()
36 | redshift_hook = PostgresHook("redshift")
37 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
38 | credentials.access_key,
39 | credentials.secret_key,
40 | )
41 | redshift_hook.run(sql_stmt)
42 |
43 |
44 | dag = DAG(
45 | 'lesson2.exercise2',
46 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
47 | # TODO: Set the end date to February first
48 |     end_date=datetime.datetime(2018, 2, 1, 0, 0, 0, 0),
49 | # TODO: Set the schedule to be monthly
50 | schedule_interval='@monthly',
51 | # TODO: set the number of max active runs to 1
52 | max_active_runs=1
53 | )
54 |
55 | create_trips_table = PostgresOperator(
56 | task_id="create_trips_table",
57 | dag=dag,
58 | postgres_conn_id="redshift",
59 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
60 | )
61 |
62 | copy_trips_task = PythonOperator(
63 | task_id='load_trips_from_s3_to_redshift',
64 | dag=dag,
65 | python_callable=load_trip_data_to_redshift,
66 | provide_context=True,
67 | )
68 |
69 | create_stations_table = PostgresOperator(
70 | task_id="create_stations_table",
71 | dag=dag,
72 | postgres_conn_id="redshift",
73 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
74 | )
75 |
76 | copy_stations_task = PythonOperator(
77 | task_id='load_stations_from_s3_to_redshift',
78 | dag=dag,
79 | python_callable=load_station_data_to_redshift,
80 | )
81 |
82 | create_trips_table >> copy_trips_task
83 | create_stations_table >> copy_stations_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Quality - Exercise 3.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Modify the bikeshare DAG to load data month by month, instead of loading it all at once, every time.
3 | #2 - Use time partitioning to parallelize the execution of the DAG.
4 |
5 | import datetime
6 | import logging
7 |
8 | from airflow import DAG
9 | from airflow.contrib.hooks.aws_hook import AwsHook
10 | from airflow.hooks.postgres_hook import PostgresHook
11 | from airflow.operators.postgres_operator import PostgresOperator
12 | from airflow.operators.python_operator import PythonOperator
13 |
14 | import sql_statements
15 |
16 |
17 | def load_trip_data_to_redshift(*args, **kwargs):
18 | aws_hook = AwsHook("aws_credentials")
19 | credentials = aws_hook.get_credentials()
20 | redshift_hook = PostgresHook("redshift")
21 | execution_date = kwargs["execution_date"]
22 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
23 | credentials.access_key,
24 | credentials.secret_key,
25 | year=execution_date.year,
26 | month=execution_date.month
27 | )
28 | redshift_hook.run(sql_stmt)
29 |
30 |
31 | def load_station_data_to_redshift(*args, **kwargs):
32 | aws_hook = AwsHook("aws_credentials")
33 | credentials = aws_hook.get_credentials()
34 | redshift_hook = PostgresHook("redshift")
35 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
36 | credentials.access_key,
37 | credentials.secret_key,
38 | )
39 | redshift_hook.run(sql_stmt)
40 |
41 |
42 | dag = DAG(
43 | 'lesson2.exercise3',
44 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
45 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0),
46 | schedule_interval='@monthly',
47 | max_active_runs=1
48 | )
49 |
50 | create_trips_table = PostgresOperator(
51 | task_id="create_trips_table",
52 | dag=dag,
53 | postgres_conn_id="redshift",
54 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
55 | )
56 |
57 | copy_trips_task = PythonOperator(
58 | task_id='load_trips_from_s3_to_redshift',
59 | dag=dag,
60 | python_callable=load_trip_data_to_redshift,
61 | provide_context=True,
62 | )
63 |
64 | create_stations_table = PostgresOperator(
65 | task_id="create_stations_table",
66 | dag=dag,
67 | postgres_conn_id="redshift",
68 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
69 | )
70 |
71 | copy_stations_task = PythonOperator(
72 | task_id='load_stations_from_s3_to_redshift',
73 | dag=dag,
74 | python_callable=load_station_data_to_redshift,
75 | )
76 |
77 | create_trips_table >> copy_trips_task
78 | create_stations_table >> copy_stations_task
79 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Data Quality - Exercise 4.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Set an SLA on our bikeshare traffic calculation operator (see the sketch comment below)
3 | #2 - Add data verification step after the load step from s3 to redshift
4 | #3 - Add data verification step after we calculate our output table
5 |
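# Sketch for instruction #1 above (illustrative only, not part of this exercise file):
# an SLA is attached to an operator through the standard BaseOperator `sla` argument.
# For a hypothetical traffic-calculation operator it might look like:
#
#     calculate_traffic_task = PostgresOperator(      # hypothetical task, not defined here
#         task_id="calculate_location_traffic",
#         dag=dag,
#         postgres_conn_id="redshift",
#         sql=sql_statements.LOCATION_TRAFFIC_SQL,
#         sla=datetime.timedelta(hours=1),            # flag runs that exceed one hour
#     )
#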
6 | import datetime
7 | import logging
8 |
9 | from airflow import DAG
10 | from airflow.contrib.hooks.aws_hook import AwsHook
11 | from airflow.hooks.postgres_hook import PostgresHook
12 | from airflow.operators.postgres_operator import PostgresOperator
13 | from airflow.operators.python_operator import PythonOperator
14 |
15 | import sql_statements
16 |
17 |
18 | def load_trip_data_to_redshift(*args, **kwargs):
19 | aws_hook = AwsHook("aws_credentials")
20 | credentials = aws_hook.get_credentials()
21 | redshift_hook = PostgresHook("redshift")
22 | execution_date = kwargs["execution_date"]
23 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
24 | credentials.access_key,
25 | credentials.secret_key,
26 | year=execution_date.year,
27 | month=execution_date.month
28 | )
29 | redshift_hook.run(sql_stmt)
30 |
31 |
32 | def load_station_data_to_redshift(*args, **kwargs):
33 | aws_hook = AwsHook("aws_credentials")
34 | credentials = aws_hook.get_credentials()
35 | redshift_hook = PostgresHook("redshift")
36 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
37 | credentials.access_key,
38 | credentials.secret_key,
39 | )
40 | redshift_hook.run(sql_stmt)
41 |
42 |
43 | def check_greater_than_zero(*args, **kwargs):
44 | table = kwargs["params"]["table"]
45 | redshift_hook = PostgresHook("redshift")
46 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}")
47 | if len(records) < 1 or len(records[0]) < 1:
48 | raise ValueError(f"Data quality check failed. {table} returned no results")
49 | num_records = records[0][0]
50 | if num_records < 1:
51 | raise ValueError(f"Data quality check failed. {table} contained 0 rows")
52 | logging.info(f"Data quality on table {table} check passed with {records[0][0]} records")
53 |
54 |
55 | dag = DAG(
56 | 'lesson2.exercise4',
57 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
58 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0),
59 | schedule_interval='@monthly',
60 | max_active_runs=1
61 | )
62 |
63 | create_trips_table = PostgresOperator(
64 | task_id="create_trips_table",
65 | dag=dag,
66 | postgres_conn_id="redshift",
67 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
68 | )
69 |
70 | copy_trips_task = PythonOperator(
71 | task_id='load_trips_from_s3_to_redshift',
72 | dag=dag,
73 | python_callable=load_trip_data_to_redshift,
74 | provide_context=True,
75 | )
76 |
77 | check_trips = PythonOperator(
78 | task_id='check_trips_data',
79 | dag=dag,
80 | python_callable=check_greater_than_zero,
81 | provide_context=True,
82 | params={
83 | 'table': 'trips',
84 | }
85 | )
86 |
87 | create_stations_table = PostgresOperator(
88 | task_id="create_stations_table",
89 | dag=dag,
90 | postgres_conn_id="redshift",
91 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
92 | )
93 |
94 | copy_stations_task = PythonOperator(
95 | task_id='load_stations_from_s3_to_redshift',
96 | dag=dag,
97 | python_callable=load_station_data_to_redshift,
98 | )
99 |
100 | check_stations = PythonOperator(
101 | task_id='check_stations_data',
102 | dag=dag,
103 | python_callable=check_greater_than_zero,
104 | provide_context=True,
105 | params={
106 | 'table': 'stations',
107 | }
108 | )
109 |
110 | create_trips_table >> copy_trips_task
111 | create_stations_table >> copy_stations_task
112 | copy_stations_task >> check_stations
113 | copy_trips_task >> check_trips
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Production Data Pipelines - Exercise 1.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll consolidate repeated code into Operator Plugins
3 | #1 - Move the data quality check logic into a custom operator
4 | #2 - Replace the data quality check PythonOperators with our new custom operator
5 | #3 - Consolidate both the S3 to RedShift functions into a custom operator
6 | #4 - Replace the S3 to RedShift PythonOperators with our new custom operator
7 | #5 - Execute the DAG
8 |
9 | import datetime
10 | import logging
11 |
12 | from airflow import DAG
13 | from airflow.contrib.hooks.aws_hook import AwsHook
14 | from airflow.hooks.postgres_hook import PostgresHook
15 |
16 | from airflow.operators import (
17 | HasRowsOperator,
18 | PostgresOperator,
19 | PythonOperator,
20 | S3ToRedshiftOperator
21 | )
22 |
23 | import sql_statements
24 |
25 |
26 | #
27 | # TODO: Replace the data quality checks with the HasRowsOperator
28 | #
29 |
30 | dag = DAG(
31 | "lesson3.exercise1",
32 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
33 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0),
34 | schedule_interval="@monthly",
35 | max_active_runs=1
36 | )
37 |
38 | create_trips_table = PostgresOperator(
39 | task_id="create_trips_table",
40 | dag=dag,
41 | postgres_conn_id="redshift",
42 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
43 | )
44 |
45 | copy_trips_task = S3ToRedshiftOperator(
46 | task_id="load_trips_from_s3_to_redshift",
47 | dag=dag,
48 | table="trips",
49 | redshift_conn_id="redshift",
50 | aws_credentials_id="aws_credentials",
51 | s3_bucket="udac-data-pipelines",
52 | s3_key="divvy/partitioned/{execution_date.year}/{execution_date.month}/divvy_trips.csv"
53 | )
54 |
55 | #
56 | # TODO: Replace this data quality check with the HasRowsOperator
57 | #
58 | check_trips = HasRowsOperator(
59 | task_id='check_trips_data',
60 | dag=dag,
61 | redshift_conn_id="redshift",
62 | table="trips"
63 | )
64 |
65 | create_stations_table = PostgresOperator(
66 | task_id="create_stations_table",
67 | dag=dag,
68 | postgres_conn_id="redshift",
69 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
70 | )
71 |
72 | copy_stations_task = S3ToRedshiftOperator(
73 | task_id="load_stations_from_s3_to_redshift",
74 | dag=dag,
75 | redshift_conn_id="redshift",
76 | aws_credentials_id="aws_credentials",
77 | s3_bucket="udac-data-pipelines",
78 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv",
79 | table="stations"
80 | )
81 |
82 | #
83 | # TODO: Replace this data quality check with the HasRowsOperator
84 | #
85 | check_stations = HasRowsOperator(
86 | task_id='check_stations_data',
87 | dag=dag,
88 | redshift_conn_id="redshift",
89 | table="stations"
90 | )
91 |
92 | create_trips_table >> copy_trips_task
93 | create_stations_table >> copy_stations_task
94 | copy_stations_task >> check_stations
95 | copy_trips_task >> check_trips
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Production Data Pipelines - Exercise 2.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries
3 | #1 - Read through the DAG and identify points in the DAG that could be split apart
4 | #2 - Split the DAG into multiple PythonOperators
5 | #3 - Run the DAG
6 |
7 | import datetime
8 | import logging
9 |
10 | from airflow import DAG
11 | from airflow.hooks.postgres_hook import PostgresHook
12 |
13 | from airflow.operators.postgres_operator import PostgresOperator
14 | from airflow.operators.python_operator import PythonOperator
15 |
16 |
17 | #
18 | # TODO: Finish refactoring this function into the appropriate set of tasks,
19 | # instead of keeping this one large task.
20 | #
21 | def load_and_analyze(*args, **kwargs):
22 | redshift_hook = PostgresHook("redshift")
23 |
24 | def log_oldest():
25 |     redshift_hook = PostgresHook("redshift")
26 |     records = redshift_hook.get_records("""
27 |         SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1
28 |     """)
29 |     if len(records) > 0 and len(records[0]) > 0:
30 |         logging.info(f"Oldest rider was born in {records[0][0]}")
31 | 
32 | def log_younger():
33 |     redshift_hook = PostgresHook("redshift")
34 |     records = redshift_hook.get_records("""
35 |         SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1
36 |     """)
37 |     if len(records) > 0 and len(records[0]) > 0:
38 |         logging.info(f"Youngest rider was born in {records[0][0]}")
39 |
40 |
41 | dag = DAG(
42 | "lesson3.exercise2",
43 | start_date=datetime.datetime.utcnow()
44 | )
45 |
46 | load_and_analyze = PythonOperator(
47 | task_id='load_and_analyze',
48 | dag=dag,
49 | python_callable=load_and_analyze,
50 | provide_context=True,
51 | )
52 |
53 | create_oldest_task = PostgresOperator(
54 | task_id="create_oldest",
55 | dag=dag,
56 | sql="""
57 | BEGIN;
58 | DROP TABLE IF EXISTS older_riders;
59 | CREATE TABLE older_riders AS (
60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945
61 | );
62 | COMMIT;
63 | """,
64 | postgres_conn_id="redshift"
65 | )
66 |
67 | create_younger_task = PostgresOperator(
68 | task_id="create_younger",
69 | dag=dag,
70 | sql="""
71 | BEGIN;
72 | DROP TABLE IF EXISTS younger_riders;
73 | CREATE TABLE younger_riders AS (
74 | SELECT * FROM trips WHERE birthyear > 2000
75 | );
76 | COMMIT;
77 | """,
78 | postgres_conn_id="redshift"
79 | )
80 |
81 | create_lifetime_task = PostgresOperator(
82 | task_id="create_lifetime",
83 | dag=dag,
84 | sql="""
85 | BEGIN;
86 | DROP TABLE IF EXISTS lifetime_rides;
87 | CREATE TABLE lifetime_rides AS (
88 | SELECT bikeid, COUNT(bikeid)
89 | FROM trips
90 | GROUP BY bikeid
91 | );
92 | COMMIT;
93 | """,
94 | postgres_conn_id="redshift"
95 | )
96 |
97 | create_city_station_task = PostgresOperator(
98 | task_id="create_city_station",
99 | dag=dag,
100 | sql="""
101 | BEGIN;
102 | DROP TABLE IF EXISTS city_station_counts;
103 | CREATE TABLE city_station_counts AS(
104 | SELECT city, COUNT(city)
105 | FROM stations
106 | GROUP BY city
107 | );
108 | COMMIT;
109 | """,
110 | postgres_conn_id="redshift"
111 | )
112 |
113 | log_oldest_task = PythonOperator(
114 | task_id="log_oldest",
115 | dag=dag,
116 | python_callable=log_oldest
117 | )
118 |
119 | log_younger_task = PythonOperator(
120 | task_id="log_younger",
121 | dag=dag,
122 | python_callable=log_younger
123 | )
124 |
125 | load_and_analyze >> create_oldest_task
126 | create_oldest_task >> log_oldest_task
127 | create_younger_task >> log_younger_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Production Data Pipelines - Exercise 3.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries
3 | #1 - Read through the DAG and identify points in the DAG that could be split apart
4 | #2 - Split the DAG into multiple PythonOperators
5 | #3 - Run the DAG
6 |
7 | import datetime
8 | import logging
9 |
10 | from airflow import DAG
11 | from airflow.hooks.postgres_hook import PostgresHook
12 |
13 | from airflow.operators.postgres_operator import PostgresOperator
14 | from airflow.operators.python_operator import PythonOperator
15 |
16 |
17 | #
18 | # TODO: Finish refactoring this function into the appropriate set of tasks,
19 | # instead of keeping this one large task.
20 | #
21 | def load_and_analyze(*args, **kwargs):
22 | redshift_hook = PostgresHook("redshift")
23 |
24 | def log_oldest():
25 |     redshift_hook = PostgresHook("redshift")
26 |     records = redshift_hook.get_records("""
27 |         SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1
28 |     """)
29 |     if len(records) > 0 and len(records[0]) > 0:
30 |         logging.info(f"Oldest rider was born in {records[0][0]}")
31 | 
32 | def log_younger():
33 |     redshift_hook = PostgresHook("redshift")
34 |     records = redshift_hook.get_records("""
35 |         SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1
36 |     """)
37 |     if len(records) > 0 and len(records[0]) > 0:
38 |         logging.info(f"Youngest rider was born in {records[0][0]}")
39 |
40 |
41 | dag = DAG(
42 |     "lesson3.exercise3",
43 | start_date=datetime.datetime.utcnow()
44 | )
45 |
46 | load_and_analyze = PythonOperator(
47 | task_id='load_and_analyze',
48 | dag=dag,
49 | python_callable=load_and_analyze,
50 | provide_context=True,
51 | )
52 |
53 | create_oldest_task = PostgresOperator(
54 | task_id="create_oldest",
55 | dag=dag,
56 | sql="""
57 | BEGIN;
58 | DROP TABLE IF EXISTS older_riders;
59 | CREATE TABLE older_riders AS (
60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945
61 | );
62 | COMMIT;
63 | """,
64 | postgres_conn_id="redshift"
65 | )
66 |
67 | create_younger_task = PostgresOperator(
68 | task_id="create_younger",
69 | dag=dag,
70 | sql="""
71 | BEGIN;
72 | DROP TABLE IF EXISTS younger_riders;
73 | CREATE TABLE younger_riders AS (
74 | SELECT * FROM trips WHERE birthyear > 2000
75 | );
76 | COMMIT;
77 | """,
78 | postgres_conn_id="redshift"
79 | )
80 |
81 | create_lifetime_task = PostgresOperator(
82 | task_id="create_lifetime",
83 | dag=dag,
84 | sql="""
85 | BEGIN;
86 | DROP TABLE IF EXISTS lifetime_rides;
87 | CREATE TABLE lifetime_rides AS (
88 | SELECT bikeid, COUNT(bikeid)
89 | FROM trips
90 | GROUP BY bikeid
91 | );
92 | COMMIT;
93 | """,
94 | postgres_conn_id="redshift"
95 | )
96 |
97 | create_city_station_task = PostgresOperator(
98 | task_id="create_city_station",
99 | dag=dag,
100 | sql="""
101 | BEGIN;
102 | DROP TABLE IF EXISTS city_station_counts;
103 | CREATE TABLE city_station_counts AS(
104 | SELECT city, COUNT(city)
105 | FROM stations
106 | GROUP BY city
107 | );
108 | COMMIT;
109 | """,
110 | postgres_conn_id="redshift"
111 | )
112 |
113 | log_oldest_task = PythonOperator(
114 | task_id="log_oldest",
115 | dag=dag,
116 | python_callable=log_oldest
117 | )
118 |
119 | log_younger_task = PythonOperator(
120 | task_id="log_younger",
121 | dag=dag,
122 | python_callable=log_younger
123 | )
124 |
125 | load_and_analyze >> create_oldest_task
126 | create_oldest_task >> log_oldest_task
127 | create_younger_task >> log_younger_task
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Production Data Pipelines - Exercise 4.py:
--------------------------------------------------------------------------------
1 | import datetime
2 |
3 | from airflow import DAG
4 |
5 | from airflow.operators import (
6 | FactsCalculatorOperator,
7 | HasRowsOperator,
8 | S3ToRedshiftOperator
9 | )
10 |
11 | #
12 | # The following DAG performs the following functions:
13 | #
14 | # 1. Loads Trip data from S3 to RedShift
15 | # 2. Performs a data quality check on the Trips table in RedShift
16 | # 3. Uses the FactsCalculatorOperator to create a Facts table in Redshift
17 | #    a. **NOTE**: to complete this step you must complete the FactsCalculatorOperator
18 | # skeleton defined in plugins/operators/facts_calculator.py
19 | #
20 | dag = DAG("lesson3.exercise4", start_date=datetime.datetime.utcnow())
21 |
22 | #
23 | # The following code will load trips data from S3 to RedShift. Use the s3_key
24 | # "data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
25 | # and the s3_bucket "udacity-dend"
26 | #
27 | copy_trips_task = S3ToRedshiftOperator(
28 | task_id="load_trips_from_s3_to_redshift",
29 | dag=dag,
30 | table="trips",
31 | redshift_conn_id="redshift",
32 | aws_credentials_id="aws_credentials",
33 | s3_bucket="udacity-dend",
34 | s3_key="data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
35 | )
36 |
37 | #
38 | # Data quality check on the Trips table
39 | #
40 | check_trips = HasRowsOperator(
41 | task_id="check_trips_data",
42 | dag=dag,
43 | redshift_conn_id="redshift",
44 | table="trips"
45 | )
46 |
47 | #
48 | # We use the FactsCalculatorOperator to create a Facts table in RedShift. The fact column is
49 | # `tripduration` and the groupby_column is `bikeid`
50 | #
51 | calculate_facts = FactsCalculatorOperator(
52 | task_id="calculate_facts_trips",
53 | dag=dag,
54 | redshift_conn_id="redshift",
55 | origin_table="trips",
56 | destination_table="trips_facts",
57 | fact_column="tripduration",
58 | groupby_column="bikeid"
59 | )
60 |
61 | #
62 | # Task ordering for the DAG tasks
63 | #
64 | copy_trips_task >> check_trips
65 | check_trips >> calculate_facts
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/Readme.MD:
--------------------------------------------------------------------------------
1 | Project: Data Pipeline with Airflow
2 |
3 | Introduction
4 |
5 | A music streaming startup, Sparkify, has grown its user base and song database even more and wants to introduce more automation and monitoring to its data warehouse ETL pipelines. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
6 |
7 | Project Description
8 |
9 | Apply the knowledge of Apache Airflow to build an ETL pipeline that moves data from Amazon S3 into Amazon Redshift.
10 |
11 | In this project, we create our own custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step. We are provided with four empty operators that need to be implemented into functional pieces of a data pipeline.
12 |
13 | Project Datasets
14 |
15 | Song Data Path --> s3://udacity-dend/song_data
16 |
17 | Log Data Path --> s3://udacity-dend/log_data
18 |
19 | Project Template
20 |
21 | The project template package contains three major components for the project:
22 |
23 | The dag template has all the imports and task templates in place, but the task dependencies have not been set
24 | The operators folder with operator templates
25 | A helper class for the SQL transformations
26 |
27 | Configuring the DAG
28 |
29 | In the DAG, add default parameters according to these guidelines (a minimal sketch follows the list)
30 |
31 | 1. The DAG does not have dependencies on past runs
32 | 2. On failure, the tasks are retried 3 times
33 | 3. Retries happen every 5 minutes
34 | 4. Catchup is turned off
35 | 5. Do not email on retry
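
As a minimal sketch (parameter names follow the standard Airflow API; the owner name and dates here are placeholders, not the project's actual values), the guidelines above translate into a default_args dictionary and DAG definition roughly like this:

```python
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'sparkify',                  # placeholder owner
    'start_date': datetime(2019, 1, 12),  # placeholder start date
    'depends_on_past': False,             # no dependencies on past runs
    'retries': 3,                         # retry failed tasks 3 times
    'retry_delay': timedelta(minutes=5),  # retries happen every 5 minutes
    'email_on_retry': False,              # do not email on retry
}

dag = DAG('udac_example_dag',
          default_args=default_args,
          schedule_interval='@hourly',
          catchup=False)                  # catchup is turned off on the DAG itself
```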
36 |
37 |
38 | Building the Operators
39 |
40 | We need to build four different operators that will stage the data, transform the data, and run checks on data quality. All of the operators and task instances will run SQL statements against the Redshift database. Using parameters wisely will allow you to build flexible, reusable, and configurable operators that you can later apply to many kinds of data pipelines, with Redshift and with other databases.
41 |
42 | Stage Operator
43 |
44 | The stage operator is expected to be able to load any JSON- or CSV-formatted files from S3 to Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided. The operator's parameters should specify where in S3 the file resides and which table it is loaded into.
45 |
46 | The parameters should be used to distinguish between JSON and CSV files. Another important requirement of the stage operator is a templated field that allows it to load timestamped files from S3 based on the execution time and to run backfills.
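
As a rough sketch of the idea (not the graded operator code; parameter and template names here are assumptions), the stage operator renders a Redshift COPY statement from its parameters, and a templated s3_key lets Airflow substitute the execution date so timestamped files can be backfilled:

```python
# Illustrative JSON variant of the COPY template; a CSV variant would swap the FORMAT clause.
COPY_JSON_SQL = """
    COPY {table}
    FROM 's3://{bucket}/{key}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    FORMAT AS JSON 'auto'
"""

# A templated field such as
#   s3_key = "log_data/{execution_date.year}/{execution_date.month}/"
# is rendered per run, which is what makes backfills of timestamped files possible.
```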
47 |
48 | Fact and Dimension Operators
49 |
50 | The provided SQL helper class will help run the data transformations. Most of the logic is within the SQL transformations; the operator is expected to take as input a SQL statement and the target database on which to run the query. Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load. Fact tables are usually so massive that they should only allow append-type functionality.
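
A minimal sketch of the two load modes described above (the function and argument names are placeholders, not the project's operator interface):

```python
from airflow.hooks.postgres_hook import PostgresHook

def load_table(redshift_conn_id, table, select_sql, truncate=True):
    """Truncate-insert for dimension tables; append-only when truncate=False (fact tables)."""
    redshift = PostgresHook(postgres_conn_id=redshift_conn_id)
    if truncate:
        redshift.run(f"TRUNCATE TABLE {table}")
    redshift.run(f"INSERT INTO {table} {select_sql}")
```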
51 |
52 | Data Quality Operator
53 |
54 | The final operator to create is the data quality operator, which is used to run checks on the data itself. The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and to execute the tests. For each test, the actual result and the expected result need to be compared; if they do not match, the operator should raise an exception so the task retries and eventually fails.
55 |
56 | For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows that have NULL in that column. Since we do not want any NULLs, the expected result would be 0, and the test would compare the SQL statement's outcome to that expected result.
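
The NULL check described above can be expressed as a (test SQL, expected result) pair, roughly like this sketch (table and column names are placeholders):

```python
from airflow.hooks.postgres_hook import PostgresHook

def run_quality_check(redshift_conn_id, test_sql, expected_result):
    records = PostgresHook(postgres_conn_id=redshift_conn_id).get_records(test_sql)
    actual = records[0][0]
    if actual != expected_result:
        raise ValueError(f"Data quality check failed: got {actual}, expected {expected_result}")

# e.g. expect zero NULL user ids in the songplays table
# run_quality_check("redshift", "SELECT COUNT(*) FROM songplays WHERE userid IS NULL", 0)
```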
57 |
58 | Final Instructions
59 |
60 | When you are in the workspace, after completing the code, you can start Airflow by running the command: /opt/airflow/start.sh
61 |
62 | Once started, Airflow will automatically run the required DAGs and write the results to their respective tables.
63 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/create_tables.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE IF NOT EXISTS public.artists (
2 | artistid varchar(256) NOT NULL,
3 | name varchar(256),
4 | location varchar(256),
5 | 	latitude numeric(18,0),
6 | longitude numeric(18,0)
7 | );
8 |
9 | CREATE TABLE IF NOT EXISTS public.songplays (
10 | playid varchar(32) NOT NULL,
11 | start_time timestamp NOT NULL,
12 | userid int4 NOT NULL,
13 | "level" varchar(256),
14 | songid varchar(256),
15 | artistid varchar(256),
16 | sessionid int4,
17 | location varchar(256),
18 | user_agent varchar(256),
19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid)
20 | );
21 |
22 | CREATE TABLE IF NOT EXISTS public.songs (
23 | songid varchar(256) NOT NULL,
24 | title varchar(256),
25 | artistid varchar(256),
26 | "year" int4,
27 | duration numeric(18,0),
28 | CONSTRAINT songs_pkey PRIMARY KEY (songid)
29 | );
30 |
31 | CREATE TABLE IF NOT EXISTS public.staging_events (
32 | artist varchar(256),
33 | auth varchar(256),
34 | firstname varchar(256),
35 | gender varchar(256),
36 | iteminsession int4,
37 | lastname varchar(256),
38 | length numeric(18,0),
39 | "level" varchar(256),
40 | location varchar(256),
41 | "method" varchar(256),
42 | page varchar(256),
43 | registration numeric(18,0),
44 | sessionid int4,
45 | song varchar(256),
46 | status int4,
47 | ts int8,
48 | useragent varchar(256),
49 | userid int4
50 | );
51 |
52 | CREATE TABLE IF NOT EXISTS public.staging_songs (
53 | num_songs int4,
54 | artist_id varchar(256),
55 | artist_name varchar(256),
56 | artist_latitude numeric(18,0),
57 | artist_longitude numeric(18,0),
58 | artist_location varchar(256),
59 | song_id varchar(256),
60 | title varchar(256),
61 | duration numeric(18,0),
62 | "year" int4
63 | );
64 |
65 | CREATE TABLE IF NOT EXISTS public."time" (
66 | start_time timestamp NOT NULL,
67 | "hour" int4,
68 | "day" int4,
69 | week int4,
70 | "month" varchar(256),
71 | "year" int4,
72 | weekday varchar(256),
73 | CONSTRAINT time_pkey PRIMARY KEY (start_time)
74 | );
75 |
76 | CREATE TABLE IF NOT EXISTS public.users (
77 | userid int4 NOT NULL,
78 | first_name varchar(256),
79 | last_name varchar(256),
80 | gender varchar(256),
81 | "level" varchar(256),
82 | CONSTRAINT users_pkey PRIMARY KEY (userid)
83 | );
84 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/udac_example_dag.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 | import os
3 | from airflow import DAG
4 | from airflow.operators.dummy_operator import DummyOperator
5 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator,
6 | LoadDimensionOperator, DataQualityOperator)
7 | from helpers import SqlQueries
8 |
9 | # AWS_KEY = os.environ.get('AWS_KEY')
10 | # AWS_SECRET = os.environ.get('AWS_SECRET')
11 |
12 | default_args = {
13 | 'owner': 'nareshkumar',
14 | 'start_date': datetime(2018, 11, 1),
15 | 'end_date': datetime(2018, 11, 30),
16 | 'depends_on_past': False,
17 | 'retries': 3,
18 | 'retry_delay': timedelta(minutes=5),
19 |     # NOTE: catchup is a DAG-level argument and is set on the DAG object below
20 | 'email_on_retry': False
21 | }
22 |
23 | dag = DAG('udacity_airflow_project5',
24 | default_args=default_args,
25 | description='Load and transform data in Redshift with Airflow',
26 | schedule_interval='0 * * * *',
27 |           max_active_runs=3,
28 |           catchup=False)
29 |
30 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag)
31 |
32 | stage_events_to_redshift = StageToRedshiftOperator(
33 | task_id='Stage_events',
34 | dag=dag,
35 | provide_context=True,
36 | aws_credentials_id="aws_credentials",
37 | redshift_conn_id='redshift',
38 | s3_bucket="udacity-dend-airflow-test",
39 | s3_key="log_data",
40 | table="staging_events",
41 | create_stmt=SqlQueries.create_table_staging_events
42 | )
43 |
44 | stage_songs_to_redshift = StageToRedshiftOperator(
45 | task_id='Stage_songs',
46 | dag=dag,
47 | provide_context=True,
48 | aws_credentials_id="aws_credentials",
49 | redshift_conn_id='redshift',
50 | s3_bucket="udacity-dend-airflow-test",
51 | s3_key="song_data",
52 | table="staging_songs",
53 | create_stmt=SqlQueries.create_table_staging_songs
54 | )
55 |
56 | load_songplays_table = LoadFactOperator(
57 | task_id='Load_songplays_fact_table',
58 | dag=dag,
59 | provide_context=True,
60 | aws_credentials_id="aws_credentials",
61 | redshift_conn_id='redshift',
62 | create_stmt=SqlQueries.create_table_songplays,
63 | sql_query=SqlQueries.songplay_table_insert
64 | )
65 |
66 | load_user_dimension_table = LoadDimensionOperator(
67 | task_id='Load_user_dim_table',
68 | dag=dag,
69 | provide_context=True,
70 | aws_credentials_id="aws_credentials",
71 | redshift_conn_id='redshift',
72 | create_stmt=SqlQueries.create_table_users,
73 | sql_query=SqlQueries.user_table_insert
74 | )
75 |
76 | load_song_dimension_table = LoadDimensionOperator(
77 | task_id='Load_song_dim_table',
78 | dag=dag,
79 | provide_context=True,
80 | aws_credentials_id="aws_credentials",
81 | redshift_conn_id='redshift',
82 | create_stmt=SqlQueries.create_table_songs,
83 | sql_query=SqlQueries.song_table_insert
84 | )
85 |
86 | load_artist_dimension_table = LoadDimensionOperator(
87 | task_id='Load_artist_dim_table',
88 | dag=dag,
89 | provide_context=True,
90 | aws_credentials_id="aws_credentials",
91 | redshift_conn_id='redshift',
92 | create_stmt=SqlQueries.create_table_artist,
93 | sql_query=SqlQueries.artist_table_insert
94 | )
95 |
96 | load_time_dimension_table = LoadDimensionOperator(
97 | task_id='Load_time_dim_table',
98 | dag=dag,
99 | provide_context=True,
100 | aws_credentials_id="aws_credentials",
101 | redshift_conn_id='redshift',
102 | create_stmt=SqlQueries.create_table_time,
103 | sql_query=SqlQueries.time_table_insert
104 | )
105 |
106 | run_quality_checks = DataQualityOperator(
107 | task_id='Run_data_quality_checks',
108 | dag=dag,
109 | provide_context=True,
110 | aws_credentials_id="aws_credentials",
111 | redshift_conn_id='redshift',
112 | )
113 |
114 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag)
115 |
116 | start_operator >> [stage_events_to_redshift, stage_songs_to_redshift]
117 | [stage_events_to_redshift, stage_songs_to_redshift] >> load_songplays_table
118 | load_songplays_table >> [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table]
119 | [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table] >> run_quality_checks
120 | run_quality_checks >> end_operator
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__init__.py:
--------------------------------------------------------------------------------
1 | from __future__ import division, absolute_import, print_function
2 |
3 | from airflow.plugins_manager import AirflowPlugin
4 |
5 | import operators
6 | import helpers
7 |
8 | # Defining the plugin class
9 | class UdacityPlugin(AirflowPlugin):
10 | name = "udacity_plugin"
11 | operators = [
12 | operators.StageToRedshiftOperator,
13 | operators.LoadFactOperator,
14 | operators.LoadDimensionOperator,
15 | operators.DataQualityOperator
16 | ]
17 | helpers = [
18 | helpers.SqlQueries
19 | ]
20 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__init__.py:
--------------------------------------------------------------------------------
1 | from helpers.sql_queries import SqlQueries
2 |
3 | __all__ = [
4 | 'SqlQueries',
5 | ]
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/sql_queries.py:
--------------------------------------------------------------------------------
1 | class SqlQueries:
2 | create_table_artist = ("""
3 | CREATE TABLE IF NOT EXISTS public.artists (
4 | artistid varchar(256) NOT NULL,
5 | name varchar(256),
6 | location varchar(256),
7 |             latitude numeric(18,0),
8 | longitude numeric(18,0)
9 | );
10 | """)
11 |
12 | create_table_songplays = ("""
13 |         CREATE TABLE IF NOT EXISTS public.songplays (
14 | playid varchar(32) NOT NULL,
15 | start_time timestamp NOT NULL,
16 | userid int4 NOT NULL,
17 | "level" varchar(256),
18 | songid varchar(256),
19 | artistid varchar(256),
20 | sessionid int4,
21 | location varchar(256),
22 | user_agent varchar(256),
23 | CONSTRAINT songplays_pkey PRIMARY KEY (playid)
24 | );
25 | """)
26 |
27 | create_table_songs = ("""
28 | CREATE TABLE IF NOT EXISTS public.songs (
29 | songid varchar(256) NOT NULL,
30 | title varchar(256),
31 | artistid varchar(256),
32 | "year" int4,
33 | duration numeric(18,0),
34 | CONSTRAINT songs_pkey PRIMARY KEY (songid)
35 | );
36 | """)
37 |
38 | create_table_staging_events = ("""
39 | CREATE TABLE IF NOT EXISTS public.staging_events (
40 | artist varchar(256),
41 | auth varchar(256),
42 | firstname varchar(256),
43 | gender varchar(256),
44 | iteminsession int4,
45 | lastname varchar(256),
46 | length numeric(18,0),
47 | "level" varchar(256),
48 | location varchar(256),
49 | "method" varchar(256),
50 | page varchar(256),
51 | registration numeric(18,0),
52 | sessionid int4,
53 | song varchar(256),
54 | status int4,
55 | ts int8,
56 | useragent varchar(256),
57 | userid int4
58 | );
59 | """)
60 |
61 | create_table_staging_songs = ("""
62 | CREATE TABLE IF NOT EXISTS public.staging_songs (
63 | num_songs int4,
64 | artist_id varchar(256),
65 | artist_name varchar(256),
66 | artist_latitude numeric(18,0),
67 | artist_longitude numeric(18,0),
68 | artist_location varchar(256),
69 | song_id varchar(256),
70 | title varchar(256),
71 | duration numeric(18,0),
72 | "year" int4
73 | );
74 | """)
75 |
76 | create_table_time = ("""
77 | CREATE TABLE IF NOT EXISTS public."time" (
78 | start_time timestamp NOT NULL,
79 | "hour" int4,
80 | "day" int4,
81 | week int4,
82 | "month" varchar(256),
83 | "year" int4,
84 | weekday varchar(256),
85 | CONSTRAINT time_pkey PRIMARY KEY (start_time)
86 | );
87 | """)
88 |
89 | create_table_users = ("""
90 | CREATE TABLE IF NOT EXISTS public.users (
91 | userid int4 NOT NULL,
92 | first_name varchar(256),
93 | last_name varchar(256),
94 | gender varchar(256),
95 | "level" varchar(256),
96 | CONSTRAINT users_pkey PRIMARY KEY (userid)
97 | );
98 | """)
99 |
100 | songplay_table_insert = ("""
101 | SELECT
102 | md5(events.sessionid || events.start_time) songplay_id,
103 | events.start_time,
104 | events.userid,
105 | events.level,
106 | songs.song_id,
107 | songs.artist_id,
108 | events.sessionid,
109 | events.location,
110 | events.useragent
111 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, *
112 | FROM staging_events
113 | WHERE page='NextSong') events
114 | LEFT JOIN staging_songs songs
115 | ON events.song = songs.title
116 | AND events.artist = songs.artist_name
117 | AND events.length = songs.duration
118 | """)
119 |
120 | user_table_insert = ("""
121 | SELECT distinct userid, firstname, lastname, gender, level
122 | FROM staging_events
123 | WHERE page='NextSong'
124 | """)
125 |
126 | song_table_insert = ("""
127 | SELECT distinct song_id, title, artist_id, year, duration
128 | FROM staging_songs
129 | """)
130 |
131 | artist_table_insert = ("""
132 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude
133 | FROM staging_songs
134 | """)
135 |
136 | time_table_insert = ("""
137 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time),
138 | extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time)
139 | FROM songplays
140 | """)
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__init__.py:
--------------------------------------------------------------------------------
1 | from operators.stage_redshift import StageToRedshiftOperator
2 | from operators.load_fact import LoadFactOperator
3 | from operators.load_dimension import LoadDimensionOperator
4 | from operators.data_quality import DataQualityOperator
5 |
6 | __all__ = [
7 | 'StageToRedshiftOperator',
8 | 'LoadFactOperator',
9 | 'LoadDimensionOperator',
10 | 'DataQualityOperator'
11 | ]
12 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/data_quality.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class DataQualityOperator(BaseOperator):
6 |
7 | ui_color = '#89DA59'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | # Define your operators params (with defaults) here
12 | # Example:
13 | # conn_id = your-connection-name
14 | *args, **kwargs):
15 |
16 | super(DataQualityOperator, self).__init__(*args, **kwargs)
17 | # Map params here
18 | # Example:
19 | # self.conn_id = conn_id
20 |
21 | def execute(self, context):
22 | self.log.info('DataQualityOperator not implemented yet')
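23 | 
24 | 
25 | # --- Hedged editorial sketch (not the project's final operator) ---------------
26 | # One way the template above could be completed, modelled on HasRowsOperator in
27 | # /Data Pipeline with Airflow/has_rows.py. The parameter names `redshift_conn_id`
28 | # and `tables` are assumptions made for illustration.
29 | class ExampleDataQualityOperator(BaseOperator):
30 | 
31 |     ui_color = '#89DA59'
32 | 
33 |     @apply_defaults
34 |     def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
35 |         super(ExampleDataQualityOperator, self).__init__(*args, **kwargs)
36 |         self.redshift_conn_id = redshift_conn_id
37 |         self.tables = tables or []
38 | 
39 |     def execute(self, context):
40 |         redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
41 |         for table in self.tables:
42 |             # Fail the task if the table is missing or empty.
43 |             records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
44 |             if not records or not records[0] or records[0][0] < 1:
45 |                 raise ValueError("Data quality check failed: {} returned no rows".format(table))
46 |             self.log.info("Data quality check on %s passed with %s records", table, records[0][0])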
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_dimension.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadDimensionOperator(BaseOperator):
6 |
7 | ui_color = '#80BD9E'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | # Define your operators params (with defaults) here
12 | # Example:
13 | # conn_id = your-connection-name
14 | *args, **kwargs):
15 |
16 | super(LoadDimensionOperator, self).__init__(*args, **kwargs)
17 | # Map params here
18 | # Example:
19 | # self.conn_id = conn_id
20 |
21 | def execute(self, context):
22 | self.log.info('LoadDimensionOperator not implemented yet')
23 |
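24 | 
25 | # --- Hedged editorial sketch (not the project's final operator) ---------------
26 | # A possible completion of the template above. The parameters `redshift_conn_id`,
27 | # `table`, `select_sql` and `truncate` are assumptions for illustration; the
28 | # truncate-then-insert pattern is a common choice for dimension tables.
29 | class ExampleLoadDimensionOperator(BaseOperator):
30 | 
31 |     ui_color = '#80BD9E'
32 | 
33 |     @apply_defaults
34 |     def __init__(self, redshift_conn_id="", table="", select_sql="", truncate=True,
35 |                  *args, **kwargs):
36 |         super(ExampleLoadDimensionOperator, self).__init__(*args, **kwargs)
37 |         self.redshift_conn_id = redshift_conn_id
38 |         self.table = table
39 |         self.select_sql = select_sql
40 |         self.truncate = truncate
41 | 
42 |     def execute(self, context):
43 |         redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
44 |         if self.truncate:
45 |             # Dimension tables are typically reloaded wholesale on each run.
46 |             redshift.run("TRUNCATE TABLE {}".format(self.table))
47 |         redshift.run("INSERT INTO {} {}".format(self.table, self.select_sql))
48 |         self.log.info("Loaded dimension table %s", self.table)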
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_fact.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadFactOperator(BaseOperator):
6 |
7 | ui_color = '#F98866'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | # Define your operators params (with defaults) here
12 | # Example:
13 | # conn_id = your-connection-name
14 | *args, **kwargs):
15 |
16 | super(LoadFactOperator, self).__init__(*args, **kwargs)
17 | # Map params here
18 | # Example:
19 | # self.conn_id = conn_id
20 |
21 | def execute(self, context):
22 | self.log.info('LoadFactOperator not implemented yet')
23 |
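24 | 
25 | # --- Hedged editorial sketch (not the project's final operator) ---------------
26 | # A possible completion of the template above, assuming parameters
27 | # `redshift_conn_id`, `table` and `select_sql`. Fact tables are usually
28 | # append-only, so no truncate step is included here.
29 | class ExampleLoadFactOperator(BaseOperator):
30 | 
31 |     ui_color = '#F98866'
32 | 
33 |     @apply_defaults
34 |     def __init__(self, redshift_conn_id="", table="", select_sql="", *args, **kwargs):
35 |         super(ExampleLoadFactOperator, self).__init__(*args, **kwargs)
36 |         self.redshift_conn_id = redshift_conn_id
37 |         self.table = table
38 |         self.select_sql = select_sql
39 | 
40 |     def execute(self, context):
41 |         redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
42 |         # Append the result of the SELECT fragment into the fact table.
43 |         redshift.run("INSERT INTO {} {}".format(self.table, self.select_sql))
44 |         self.log.info("Loaded fact table %s", self.table)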
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/stage_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class StageToRedshiftOperator(BaseOperator):
6 | ui_color = '#358140'
7 |
8 | @apply_defaults
9 | def __init__(self,
10 | # Define your operators params (with defaults) here
11 | # Example:
12 | # redshift_conn_id=your-connection-name
13 | *args, **kwargs):
14 |
15 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
16 | # Map params here
17 | # Example:
18 |         # self.redshift_conn_id = redshift_conn_id
19 |
20 | def execute(self, context):
21 | self.log.info('StageToRedshiftOperator not implemented yet')
22 |
23 |
24 |
25 |
26 |
27 |
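28 | # --- Hedged editorial sketch (not the project's final operator) ---------------
29 | # A possible completion of the template above, modelled on S3ToRedshiftOperator
30 | # in /Data Pipeline with Airflow/s3_to_redshift.py. The parameter names and the
31 | # JSON COPY options are assumptions made for illustration.
32 | from airflow.contrib.hooks.aws_hook import AwsHook
33 | 
34 | 
35 | class ExampleStageToRedshiftOperator(BaseOperator):
36 | 
37 |     ui_color = '#358140'
38 |     template_fields = ("s3_key",)
39 |     copy_sql = """
40 |         COPY {}
41 |         FROM '{}'
42 |         ACCESS_KEY_ID '{}'
43 |         SECRET_ACCESS_KEY '{}'
44 |         FORMAT AS JSON '{}'
45 |     """
46 | 
47 |     @apply_defaults
48 |     def __init__(self, redshift_conn_id="", aws_credentials_id="", table="",
49 |                  s3_bucket="", s3_key="", json_path="auto", *args, **kwargs):
50 |         super(ExampleStageToRedshiftOperator, self).__init__(*args, **kwargs)
51 |         self.redshift_conn_id = redshift_conn_id
52 |         self.aws_credentials_id = aws_credentials_id
53 |         self.table = table
54 |         self.s3_bucket = s3_bucket
55 |         self.s3_key = s3_key
56 |         self.json_path = json_path
57 | 
58 |     def execute(self, context):
59 |         aws_hook = AwsHook(self.aws_credentials_id)
60 |         credentials = aws_hook.get_credentials()
61 |         redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
62 | 
63 |         # Render the templated S3 key and stage the data with a single COPY.
64 |         s3_path = "s3://{}/{}".format(self.s3_bucket, self.s3_key.format(**context))
65 |         redshift.run(ExampleStageToRedshiftOperator.copy_sql.format(
66 |             self.table, s3_path, credentials.access_key,
67 |             credentials.secret_key, self.json_path))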
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/Readme.MD:
--------------------------------------------------------------------------------
1 | Exercise Files
2 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/__init__.py:
--------------------------------------------------------------------------------
1 | from operators.facts_calculator import FactsCalculatorOperator
2 | from operators.has_rows import HasRowsOperator
3 | from operators.s3_to_redshift import S3ToRedshiftOperator
4 |
5 | __all__ = [
6 | 'FactsCalculatorOperator',
7 | 'HasRowsOperator',
8 | 'S3ToRedshiftOperator'
9 | ]
10 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/dag.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag.
3 | #1 - Consolidate HasRowsOperator into the SubDag
4 | #2 - Reorder the tasks to take advantage of the SubDag Operators
5 |
6 | import datetime
7 |
8 | from airflow import DAG
9 | from airflow.operators.postgres_operator import PostgresOperator
10 | from airflow.operators.subdag_operator import SubDagOperator
11 | from airflow.operators.udacity_plugin import HasRowsOperator
12 |
13 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag
14 | import sql_statements
15 |
16 |
17 | start_date = datetime.datetime.utcnow()
18 |
19 | dag = DAG(
20 | "lesson3.exercise3",
21 | start_date=start_date,
22 | )
23 |
24 | trips_task_id = "trips_subdag"
25 | trips_subdag_task = SubDagOperator(
26 | subdag=get_s3_to_redshift_dag(
27 | "lesson3.exercise3",
28 | trips_task_id,
29 | "redshift",
30 | "aws_credentials",
31 | "trips",
32 | sql_statements.CREATE_TRIPS_TABLE_SQL,
33 | s3_bucket="udac-data-pipelines",
34 | s3_key="divvy/unpartitioned/divvy_trips_2018.csv",
35 | start_date=start_date,
36 | ),
37 | task_id=trips_task_id,
38 | dag=dag,
39 | )
40 |
41 | stations_task_id = "stations_subdag"
42 | stations_subdag_task = SubDagOperator(
43 | subdag=get_s3_to_redshift_dag(
44 | "lesson3.exercise3",
45 | stations_task_id,
46 | "redshift",
47 | "aws_credentials",
48 | "stations",
49 | sql_statements.CREATE_STATIONS_TABLE_SQL,
50 | s3_bucket="udac-data-pipelines",
51 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv",
52 | start_date=start_date,
53 | ),
54 | task_id=stations_task_id,
55 | dag=dag,
56 | )
57 |
58 | #
59 | # TODO: Consolidate check_trips and check_stations into a single check in the subdag
60 | # as we did with the create and copy in the demo
61 | #
62 | check_trips = HasRowsOperator(
63 | task_id="check_trips_data",
64 | dag=dag,
65 | redshift_conn_id="redshift",
66 | table="trips"
67 | )
68 |
69 | check_stations = HasRowsOperator(
70 | task_id="check_stations_data",
71 | dag=dag,
72 | redshift_conn_id="redshift",
73 | table="stations"
74 | )
75 |
76 | location_traffic_task = PostgresOperator(
77 | task_id="calculate_location_traffic",
78 | dag=dag,
79 | postgres_conn_id="redshift",
80 | sql=sql_statements.LOCATION_TRAFFIC_SQL
81 | )
82 |
83 | #
84 | # TODO: Reorder the Graph once you have moved the checks
85 | #
86 | trips_subdag_task >> location_traffic_task
87 | stations_subdag_task >> location_traffic_task
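88 | 
89 | # Hedged editorial note: once the consolidation TODO above is addressed and the
90 | # row-count checks live inside the subdag (subdag.py already wires
91 | # create >> copy >> check), check_trips and check_stations can simply be deleted
92 | # and the two orderings above remain the whole top-level graph. If the checks
93 | # were instead kept at this level, one possible wiring would be:
94 | #
95 | #     trips_subdag_task >> check_trips >> location_traffic_task
96 | #     stations_subdag_task >> check_stations >> location_traffic_task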
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/facts_calculator.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | from airflow.hooks.postgres_hook import PostgresHook
4 | from airflow.models import BaseOperator
5 | from airflow.utils.decorators import apply_defaults
6 |
7 |
8 | class FactsCalculatorOperator(BaseOperator):
9 | facts_sql_template = """
10 | DROP TABLE IF EXISTS {destination_table};
11 | CREATE TABLE {destination_table} AS
12 | SELECT
13 | {groupby_column},
14 | MAX({fact_column}) AS max_{fact_column},
15 | MIN({fact_column}) AS min_{fact_column},
16 | AVG({fact_column}) AS average_{fact_column}
17 | FROM {origin_table}
18 | GROUP BY {groupby_column};
19 | """
20 |
21 | @apply_defaults
22 | def __init__(self,
23 | redshift_conn_id="",
24 | origin_table="",
25 | destination_table="",
26 | fact_column="",
27 | groupby_column="",
28 | *args, **kwargs):
29 |
30 | super(FactsCalculatorOperator, self).__init__(*args, **kwargs)
31 | #
32 | # TODO: Set attributes from __init__ instantiation arguments
33 | #
34 |
35 | def execute(self, context):
36 | #
37 | # TODO: Fetch the redshift hook
38 | #
39 |
40 | #
41 | # TODO: Format the `facts_sql_template` and run the query against redshift
42 | #
43 |
44 | pass
45 |
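46 | 
47 | # --- Hedged editorial sketch (not the exercise solution) ----------------------
48 | # One way the TODOs above could be filled in, reusing the SQL template defined on
49 | # FactsCalculatorOperator and the same hook usage as HasRowsOperator and
50 | # S3ToRedshiftOperator elsewhere in this folder.
51 | class ExampleFactsCalculatorOperator(BaseOperator):
52 | 
53 |     facts_sql_template = FactsCalculatorOperator.facts_sql_template
54 | 
55 |     @apply_defaults
56 |     def __init__(self, redshift_conn_id="", origin_table="", destination_table="",
57 |                  fact_column="", groupby_column="", *args, **kwargs):
58 |         super(ExampleFactsCalculatorOperator, self).__init__(*args, **kwargs)
59 |         self.redshift_conn_id = redshift_conn_id
60 |         self.origin_table = origin_table
61 |         self.destination_table = destination_table
62 |         self.fact_column = fact_column
63 |         self.groupby_column = groupby_column
64 | 
65 |     def execute(self, context):
66 |         # Fetch the Redshift hook, render the SQL template, and run the query.
67 |         redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
68 |         facts_sql = self.facts_sql_template.format(
69 |             origin_table=self.origin_table,
70 |             destination_table=self.destination_table,
71 |             fact_column=self.fact_column,
72 |             groupby_column=self.groupby_column)
73 |         redshift.run(facts_sql)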
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/has_rows.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | from airflow.hooks.postgres_hook import PostgresHook
4 | from airflow.models import BaseOperator
5 | from airflow.utils.decorators import apply_defaults
6 |
7 |
8 | class HasRowsOperator(BaseOperator):
9 |
10 | @apply_defaults
11 | def __init__(self,
12 | redshift_conn_id="",
13 | table="",
14 | *args, **kwargs):
15 |
16 | super(HasRowsOperator, self).__init__(*args, **kwargs)
17 | self.table = table
18 | self.redshift_conn_id = redshift_conn_id
19 |
20 | def execute(self, context):
21 | redshift_hook = PostgresHook(self.redshift_conn_id)
22 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
23 | if len(records) < 1 or len(records[0]) < 1:
24 | raise ValueError(f"Data quality check failed. {self.table} returned no results")
25 | num_records = records[0][0]
26 | if num_records < 1:
27 | raise ValueError(f"Data quality check failed. {self.table} contained 0 rows")
28 | logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records")
29 |
30 |
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/s3_to_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow.contrib.hooks.aws_hook import AwsHook
2 | from airflow.hooks.postgres_hook import PostgresHook
3 | from airflow.models import BaseOperator
4 | from airflow.utils.decorators import apply_defaults
5 |
6 |
7 | class S3ToRedshiftOperator(BaseOperator):
8 | template_fields = ("s3_key",)
9 | copy_sql = """
10 | COPY {}
11 | FROM '{}'
12 | ACCESS_KEY_ID '{}'
13 | SECRET_ACCESS_KEY '{}'
14 | IGNOREHEADER {}
15 | DELIMITER '{}'
16 | """
17 |
18 |
19 | @apply_defaults
20 | def __init__(self,
21 | redshift_conn_id="",
22 | aws_credentials_id="",
23 | table="",
24 | s3_bucket="",
25 | s3_key="",
26 | delimiter=",",
27 | ignore_headers=1,
28 | *args, **kwargs):
29 |
30 | super(S3ToRedshiftOperator, self).__init__(*args, **kwargs)
31 | self.table = table
32 | self.redshift_conn_id = redshift_conn_id
33 | self.s3_bucket = s3_bucket
34 | self.s3_key = s3_key
35 | self.delimiter = delimiter
36 | self.ignore_headers = ignore_headers
37 | self.aws_credentials_id = aws_credentials_id
38 |
39 | def execute(self, context):
40 | aws_hook = AwsHook(self.aws_credentials_id)
41 | credentials = aws_hook.get_credentials()
42 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
43 |
44 | self.log.info("Clearing data from destination Redshift table")
45 | redshift.run("DELETE FROM {}".format(self.table))
46 |
47 | self.log.info("Copying data from S3 to Redshift")
48 | rendered_key = self.s3_key.format(**context)
49 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key)
50 | formatted_sql = S3ToRedshiftOperator.copy_sql.format(
51 | self.table,
52 | s3_path,
53 | credentials.access_key,
54 | credentials.secret_key,
55 | self.ignore_headers,
56 | self.delimiter
57 | )
58 | redshift.run(formatted_sql)
59 |
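60 | # Hedged usage sketch (illustration only): because `s3_key` is a template field and
61 | # is rendered with `self.s3_key.format(**context)`, a partitioned key can pull values
62 | # from the task context, e.g. (task and connection ids here are assumptions):
63 | #
64 | #     copy_trips = S3ToRedshiftOperator(
65 | #         task_id="load_trips_from_s3_to_redshift",
66 | #         dag=dag,
67 | #         table="trips",
68 | #         redshift_conn_id="redshift",
69 | #         aws_credentials_id="aws_credentials",
70 | #         s3_bucket="udac-data-pipelines",
71 | #         s3_key="divvy/partitioned/{execution_date.year}/{execution_date.month}/divvy_trips.csv",
72 | #     )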
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/sql_statements.py:
--------------------------------------------------------------------------------
1 | CREATE_TRIPS_TABLE_SQL = """
2 | CREATE TABLE IF NOT EXISTS trips (
3 | trip_id INTEGER NOT NULL,
4 | start_time TIMESTAMP NOT NULL,
5 | end_time TIMESTAMP NOT NULL,
6 | bikeid INTEGER NOT NULL,
7 | tripduration DECIMAL(16,2) NOT NULL,
8 | from_station_id INTEGER NOT NULL,
9 | from_station_name VARCHAR(100) NOT NULL,
10 | to_station_id INTEGER NOT NULL,
11 | to_station_name VARCHAR(100) NOT NULL,
12 | usertype VARCHAR(20),
13 | gender VARCHAR(6),
14 | birthyear INTEGER,
15 | PRIMARY KEY(trip_id))
16 | DISTSTYLE ALL;
17 | """
18 |
19 | CREATE_STATIONS_TABLE_SQL = """
20 | CREATE TABLE IF NOT EXISTS stations (
21 | id INTEGER NOT NULL,
22 | name VARCHAR(250) NOT NULL,
23 | city VARCHAR(100) NOT NULL,
24 | latitude DECIMAL(9, 6) NOT NULL,
25 | longitude DECIMAL(9, 6) NOT NULL,
26 | dpcapacity INTEGER NOT NULL,
27 | online_date TIMESTAMP NOT NULL,
28 | PRIMARY KEY(id))
29 | DISTSTYLE ALL;
30 | """
31 |
32 | COPY_SQL = """
33 | COPY {}
34 | FROM '{}'
35 | ACCESS_KEY_ID '{{}}'
36 | SECRET_ACCESS_KEY '{{}}'
37 | IGNOREHEADER 1
38 | DELIMITER ','
39 | """
40 |
41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
42 | "trips",
43 | "s3://udac-data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
44 | )
45 |
46 | COPY_ALL_TRIPS_SQL = COPY_SQL.format(
47 | "trips",
48 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
49 | )
50 |
51 | COPY_STATIONS_SQL = COPY_SQL.format(
52 | "stations",
53 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
54 | )
55 |
56 | LOCATION_TRAFFIC_SQL = """
57 | BEGIN;
58 | DROP TABLE IF EXISTS station_traffic;
59 | CREATE TABLE station_traffic AS
60 | SELECT
61 | DISTINCT(t.from_station_id) AS station_id,
62 | t.from_station_name AS station_name,
63 | num_departures,
64 | num_arrivals
65 | FROM trips t
66 | JOIN (
67 | SELECT
68 | from_station_id,
69 | COUNT(from_station_id) AS num_departures
70 | FROM trips
71 | GROUP BY from_station_id
72 | ) AS fs ON t.from_station_id = fs.from_station_id
73 | JOIN (
74 | SELECT
75 | to_station_id,
76 | COUNT(to_station_id) AS num_arrivals
77 | FROM trips
78 | GROUP BY to_station_id
79 | ) AS ts ON t.from_station_id = ts.to_station_id
80 | """
81 |
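82 | # --- Hedged editorial sketch (not part of the original module) -----------------
83 | # COPY_SQL escapes its credential slots as '{{}}', so COPY_MONTHLY_TRIPS_SQL still
84 | # contains two positional '{}' slots plus the {year}/{month} keys embedded in the
85 | # S3 path. A PythonOperator callable could finish rendering it in a single format
86 | # call, roughly like this (function name and connection ids are assumptions):
87 | #
88 | # from airflow.contrib.hooks.aws_hook import AwsHook
89 | # from airflow.hooks.postgres_hook import PostgresHook
90 | #
91 | # def load_monthly_trips(execution_date, **kwargs):
92 | #     credentials = AwsHook("aws_credentials").get_credentials()
93 | #     sql = COPY_MONTHLY_TRIPS_SQL.format(
94 | #         credentials.access_key,
95 | #         credentials.secret_key,
96 | #         year=execution_date.year,
97 | #         month=execution_date.month,
98 | #     )
99 | #     PostgresHook(postgres_conn_id="redshift").run(sql)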
--------------------------------------------------------------------------------
/Data Pipeline with Airflow/subdag.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag.
3 | #1 - Consolidate HasRowsOperator into the SubDag
4 | #2 - Reorder the tasks to take advantage of the SubDag Operators
5 |
6 | import datetime
7 |
8 | from airflow import DAG
9 | from airflow.operators.postgres_operator import PostgresOperator
10 | from airflow.operators.udacity_plugin import HasRowsOperator
11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator
12 |
13 | import sql_statements
14 |
15 |
16 | # Returns a DAG which creates a table if it does not exist, and then proceeds
17 | # to load data into that table from S3. When the load is complete, a data
18 | # quality check is performed to assert that at least one row of data is
19 | # present.
20 | def get_s3_to_redshift_dag(
21 | parent_dag_name,
22 | task_id,
23 | redshift_conn_id,
24 | aws_credentials_id,
25 | table,
26 | create_sql_stmt,
27 | s3_bucket,
28 | s3_key,
29 | *args, **kwargs):
30 | dag = DAG(
31 | f"{parent_dag_name}.{task_id}",
32 | **kwargs
33 | )
34 |
35 | create_task = PostgresOperator(
36 | task_id=f"create_{table}_table",
37 | dag=dag,
38 | postgres_conn_id=redshift_conn_id,
39 | sql=create_sql_stmt
40 | )
41 |
42 | copy_task = S3ToRedshiftOperator(
43 | task_id=f"load_{table}_from_s3_to_redshift",
44 | dag=dag,
45 | table=table,
46 | redshift_conn_id=redshift_conn_id,
47 | aws_credentials_id=aws_credentials_id,
48 | s3_bucket=s3_bucket,
49 | s3_key=s3_key
50 | )
51 |
52 | #
53 | # TODO: Move the HasRowsOperator task here from the DAG
54 | #
55 |
56 | check_task = HasRowsOperator(
57 | task_id=f"check_{table}_data",
58 | dag=dag,
59 | redshift_conn_id=redshift_conn_id,
60 | table=table
61 | )
62 |
63 | create_task >> copy_task
64 | #
65 | # TODO: Use DAG ordering to place the check task
66 | #
67 | copy_task >> check_task
68 | return dag
69 |
--------------------------------------------------------------------------------
/Data-Modeling/L1 Exercise 1 Creating a Table with Postgres.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# L1 Exercise 1: Creating a Table with PostgreSQL\n",
8 | "\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "### Walk through the basics of PostgreSQL. You will need to complete the following tasks: