├── Cloud Data Warehouse ├── L1 E1 - Step 1_2.ipynb ├── L1 E1 - Step 3.ipynb ├── L1 E1 - Step 4.ipynb ├── L1 E1 - Step 5.ipynb ├── L1 E1 - Step 6.ipynb ├── L1 E2 - CUBE.ipynb ├── L1 E2 - Grouping Sets.ipynb ├── L1 E2 - Roll up and Drill Down.ipynb ├── L1 E2 - Slicing and Dicing.ipynb ├── L1 E3 - Columnar Vs Row Storage.ipynb ├── Project Data Warehouse with AWS │ ├── README.md │ ├── RedShift_Test_Cluster.ipynb │ ├── create_tables.py │ ├── dwh.cfg │ ├── etl.py │ └── sql_queries.py └── Readme.md ├── Data Lakes with Spark ├── Data_Inputs_Outputs.ipynb ├── Data_Wrangling.ipynb ├── Data_Wrangling_Sql.ipynb ├── Dataframe_Quiz.ipynb ├── Exercise 1 - Schema On Read.ipynb ├── Exercise 2 - Advanced Analytics NLP.ipynb ├── Exercise 3 - Data Lake on S3.ipynb ├── Mapreduce_Practice.ipynb ├── Procedural_vs_Functional_Python.ipynb ├── Project Data Lake with Spark │ ├── README.md │ ├── Readme.MD │ ├── dl.cfg │ └── etl.py ├── README.md ├── Spark_Maps_Lazy_Evaluation.ipynb └── Spark_Sql_Quiz.ipynb ├── Data Pipeline with Airflow ├── Data Pipeline - Exercise 1.py ├── Data Pipeline - Exercise 2.py ├── Data Pipeline - Exercise 3.py ├── Data Pipeline - Exercise 4.py ├── Data Pipeline - Exercise 5.py ├── Data Pipeline - Exercise 6.py ├── Data Quality - Exercise 1.py ├── Data Quality - Exercise 2.py ├── Data Quality - Exercise 3.py ├── Data Quality - Exercise 4.py ├── Production Data Pipelines - Exercise 1.py ├── Production Data Pipelines - Exercise 2.py ├── Production Data Pipelines - Exercise 3.py ├── Production Data Pipelines - Exercise 4.py ├── Project Data Pipeline with Airflow │ ├── DAG Graphview.png │ ├── DAG Treeview.PNG │ ├── Readme.MD │ ├── create_tables.sql │ ├── dags │ │ ├── __pycache__ │ │ │ └── udac_example_dag.cpython-36.pyc │ │ └── udac_example_dag.py │ └── plugins │ │ ├── __init__.py │ │ ├── __pycache__ │ │ └── __init__.cpython-36.pyc │ │ ├── helpers │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-36.pyc │ │ │ └── sql_queries.cpython-36.pyc │ │ └── sql_queries.py │ │ └── operators │ │ ├── __init__.py │ │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── data_quality.cpython-36.pyc │ │ ├── load_dimension.cpython-36.pyc │ │ ├── load_fact.cpython-36.pyc │ │ └── stage_redshift.cpython-36.pyc │ │ ├── data_quality.py │ │ ├── load_dimension.py │ │ ├── load_fact.py │ │ └── stage_redshift.py ├── Readme.MD ├── __init__.py ├── dag.py ├── facts_calculator.py ├── has_rows.py ├── s3_to_redshift.py ├── sql_statements.py └── subdag.py ├── Data-Modeling ├── L1 Exercise 1 Creating a Table with Postgres.ipynb ├── L1 Exercise 2 Creating a Table with Apache Cassandra.ipynb ├── L2 Exercise 1 Creating Normalized Tables.ipynb ├── L2 Exercise 2 Creating Denormalized Tables.ipynb ├── L2 Exercise 3 Creating Fact and Dimension Tables with Star Schema.ipynb ├── L3 Exercise 1 Three Queries Three Tables.ipynb ├── L3 Exercise 2 Primary Key.ipynb ├── L3 Exercise 3 Clustering Column.ipynb ├── L3 Exercise 4 Using the WHERE Clause.ipynb ├── Project 1 │ ├── Instructions 1.PNG │ ├── Instructions 2.PNG │ ├── Instructions 3.PNG │ ├── Instructions 4.PNG │ ├── Project 1 Introduction.PNG │ ├── README.md │ ├── create_tables.py │ ├── data.zip │ ├── etl.ipynb │ ├── etl.py │ ├── sql_queries.py │ └── test.ipynb ├── Project 2 │ ├── Project_1B.ipynb │ ├── Project_1B_ Project_Template.ipynb │ ├── README.md │ ├── event_data.rar │ ├── event_datafile_new.csv │ └── images.rar └── Readme.md └── README.md /Cloud Data Warehouse/Project Data Warehouse with AWS/README.md: 
-------------------------------------------------------------------------------- 1 | Introduction 2 | 3 | A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in its app. 4 | 5 | The task is to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of fact and dimension tables so the analytics team can continue finding insights into what songs their users are listening to. 6 | 7 | Project Description 8 | 9 | Apply data warehousing concepts and AWS to build an ETL pipeline for a database hosted on Redshift. The pipeline loads data from S3 into staging tables on Redshift and executes SQL statements that create the fact and dimension tables used for analytics. 10 | 11 | Project Datasets 12 | 13 | Song Data Path --> s3://udacity-dend/song_data 14 | Log Data Path --> s3://udacity-dend/log_data 15 | Log Data JSON Path --> s3://udacity-dend/log_json_path.json 16 | 17 | Song Dataset 18 | 19 | The first dataset is a subset of real data from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 20 | For example: 21 | 22 | song_data/A/B/C/TRABCEI128F424C983.json 23 | song_data/A/A/B/TRAABJL12903CDCF1A.json 24 | 25 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like. 26 | 27 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 28 | 29 | Log Dataset 30 | 31 | The second dataset consists of log files in JSON format. The log files in the dataset are partitioned by year and month. 32 | For example: 33 | 34 | log_data/2018/11/2018-11-12-events.json 35 | log_data/2018/11/2018-11-13-events.json 36 | 37 | And below is an example of what a single log file, 2018-11-13-events.json, looks like. 38 | 39 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender":"F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10} 40 | 41 | Schema for Song Play Analysis 42 | 43 | A star schema is used to optimize queries for song play analysis. 44 | 45 | Fact Table 46 | 47 | songplays - records in event data associated with song plays, i.e.
records with the page NextSong 48 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 49 | 50 | Dimension Tables 51 | 52 | users - users in the app 53 | user_id, first_name, last_name, gender, level 54 | 55 | songs - songs in the music database 56 | song_id, title, artist_id, year, duration 57 | 58 | artists - artists in the music database 59 | artist_id, name, location, latitude, longitude 60 | 61 | time - timestamps of records in songplays broken down into specific units 62 | start_time, hour, day, week, month, year, weekday 63 | 64 | Project Template 65 | 66 | The project template includes four files: 67 | 68 | 1. create_tables.py is where you'll create your fact and dimension tables for the star schema in Redshift. 69 | 70 | 2. etl.py is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift. 71 | 72 | 3. sql_queries.py is where you'll define your SQL statements, which will be imported into the two other files above. 73 | 74 | 4. README.md is where you'll provide discussion on your process and decisions for this ETL pipeline. 75 | 76 | Create Table Schema 77 | 78 | 1. Write a SQL CREATE statement for each of these tables in sql_queries.py 79 | 2. Complete the logic in create_tables.py to connect to the database and create these tables 80 | 3. Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline. 81 | 4. Launch a Redshift cluster and create an IAM role that has read access to S3. 82 | 5. Add the Redshift database and IAM role info to dwh.cfg. 83 | 6. Test by running create_tables.py and checking the table schemas in your Redshift database. 84 | 85 | Build ETL Pipeline 86 | 87 | 1. Implement the logic in etl.py to load data from S3 to staging tables on Redshift. 88 | 2. Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift. 89 | 3. Test by running etl.py after running create_tables.py, then run the analytic queries on your Redshift database to compare your results with the expected results. 90 | 4. Delete your Redshift cluster when finished. 91 | 92 | Final Instructions 93 | 94 | 1. Import all the necessary libraries 95 | 2. Write the configuration for the AWS cluster and store the important parameters in a separate file (dwh.cfg) 96 | 3. Configure boto3, the AWS SDK for Python 97 | 4. Using the bucket, check whether the log files and song data files are present 98 | 5. Create an IAM role, assign the appropriate permissions, and create the Redshift cluster 99 | 6. Get the values of the cluster endpoint and role ARN and put them into the main configuration file 100 | 7. Authorize the security group for inbound access on the default TCP port and IP address 101 | 8. Set up and test the database connection 102 | 9. Go to a terminal and run the command "python create_tables.py" and then "python etl.py" 103 | 10. The run should take around 4-10 minutes in total 104 | 11. Then go back to the Jupyter notebook to confirm everything is working fine 105 | 12. Count all the records in the tables to verify the load 106 | 13.
Now can delete the cluster, roles and assigned permission -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/create_tables.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import create_table_queries, drop_table_queries 4 | 5 | 6 | def drop_tables(cur, conn): 7 | for query in drop_table_queries: 8 | cur.execute(query) 9 | conn.commit() 10 | 11 | 12 | def create_tables(cur, conn): 13 | for query in create_table_queries: 14 | cur.execute(query) 15 | conn.commit() 16 | 17 | 18 | def main(): 19 | config = configparser.ConfigParser() 20 | config.read('dwh.cfg') 21 | 22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 23 | cur = conn.cursor() 24 | 25 | drop_tables(cur, conn) 26 | create_tables(cur, conn) 27 | 28 | conn.close() 29 | 30 | 31 | if __name__ == "__main__": 32 | main() -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/dwh.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | KEY= 3 | SECRET= 4 | 5 | [DWH] 6 | DWH_CLUSTER_TYPE=multi-node 7 | DWH_NUM_NODES=4 8 | DWH_NODE_TYPE=dc2.large 9 | 10 | DWH_IAM_ROLE_NAME=dwhRole 11 | DWH_CLUSTER_IDENTIFIER=dwhCluster 12 | DWH_DB=dwh 13 | DWH_DB_USER=dwhuser 14 | DWH_DB_PASSWORD=Passw0rd 15 | DWH_PORT=5439 16 | 17 | [CLUSTER] 18 | HOST= 19 | DB_NAME=dwh 20 | DB_USER=dwhuser 21 | DB_PASSWORD=Passw0rd 22 | DB_PORT=5439 23 | 24 | [IAM_ROLE] 25 | ARN= 26 | 27 | [S3] 28 | LOG_DATA='s3://udacity-dend/log_data' 29 | LOG_JSONPATH='s3://udacity-dend/log_json_path.json' 30 | SONG_DATA='s3://udacity-dend/song_data' -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import copy_table_queries, insert_table_queries 4 | 5 | 6 | def load_staging_tables(cur, conn): 7 | for query in copy_table_queries: 8 | cur.execute(query) 9 | conn.commit() 10 | 11 | 12 | def insert_tables(cur, conn): 13 | for query in insert_table_queries: 14 | cur.execute(query) 15 | conn.commit() 16 | 17 | 18 | def main(): 19 | config = configparser.ConfigParser() 20 | config.read('dwh.cfg') 21 | 22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 23 | cur = conn.cursor() 24 | 25 | load_staging_tables(cur, conn) 26 | insert_tables(cur, conn) 27 | 28 | conn.close() 29 | 30 | 31 | if __name__ == "__main__": 32 | main() -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/sql_queries.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | 3 | 4 | # CONFIG 5 | config = configparser.ConfigParser() 6 | config.read('dwh.cfg') 7 | 8 | # GLOBAL VARIABLES 9 | LOG_DATA = config.get("S3","LOG_DATA") 10 | LOG_PATH = config.get("S3", "LOG_JSONPATH") 11 | SONG_DATA = config.get("S3", "SONG_DATA") 12 | IAM_ROLE = config.get("IAM_ROLE","ARN") 13 | 14 | # DROP TABLES 15 | 16 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events" 17 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs" 
18 | songplay_table_drop = "DROP TABLE IF EXISTS fact_songplay" 19 | user_table_drop = "DROP TABLE IF EXISTS dim_user" 20 | song_table_drop = "DROP TABLE IF EXISTS dim_song" 21 | artist_table_drop = "DROP TABLE IF EXISTS dim_artist" 22 | time_table_drop = "DROP TABLE IF EXISTS dim_time" 23 | 24 | # CREATE TABLES 25 | 26 | staging_events_table_create= (""" 27 | CREATE TABLE IF NOT EXISTS staging_events 28 | ( 29 | artist VARCHAR, 30 | auth VARCHAR, 31 | firstName VARCHAR, 32 | gender VARCHAR, 33 | itemInSession INTEGER, 34 | lastName VARCHAR, 35 | length FLOAT, 36 | level VARCHAR, 37 | location VARCHAR, 38 | method VARCHAR, 39 | page VARCHAR, 40 | registration BIGINT, 41 | sessionId INTEGER, 42 | song VARCHAR, 43 | status INTEGER, 44 | ts TIMESTAMP, 45 | userAgent VARCHAR, 46 | userId INTEGER 47 | ); 48 | """) 49 | 50 | staging_songs_table_create = (""" 51 | CREATE TABLE IF NOT EXISTS staging_songs 52 | ( 53 | song_id VARCHAR, 54 | num_songs INTEGER, 55 | title VARCHAR, 56 | artist_name VARCHAR, 57 | artist_latitude FLOAT, 58 | year INTEGER, 59 | duration FLOAT, 60 | artist_id VARCHAR, 61 | artist_longitude FLOAT, 62 | artist_location VARCHAR 63 | ); 64 | """) 65 | 66 | songplay_table_create = (""" 67 | CREATE TABLE IF NOT EXISTS fact_songplay 68 | ( 69 | songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY sortkey, 70 | start_time TIMESTAMP, 71 | user_id INTEGER, 72 | level VARCHAR, 73 | song_id VARCHAR, 74 | artist_id VARCHAR, 75 | session_id INTEGER, 76 | location VARCHAR, 77 | user_agent VARCHAR 78 | ); 79 | """) 80 | 81 | user_table_create = (""" 82 | CREATE TABLE IF NOT EXISTS dim_user 83 | ( 84 | user_id INTEGER PRIMARY KEY distkey, 85 | first_name VARCHAR, 86 | last_name VARCHAR, 87 | gender VARCHAR, 88 | level VARCHAR 89 | ); 90 | """) 91 | 92 | song_table_create = (""" 93 | CREATE TABLE IF NOT EXISTS dim_song 94 | ( 95 | song_id VARCHAR PRIMARY KEY, 96 | title VARCHAR, 97 | artist_id VARCHAR distkey, 98 | year INTEGER, 99 | duration FLOAT 100 | ); 101 | """) 102 | 103 | artist_table_create = (""" 104 | CREATE TABLE IF NOT EXISTS dim_artist 105 | ( 106 | artist_id VARCHAR PRIMARY KEY distkey, 107 | name VARCHAR, 108 | location VARCHAR, 109 | latitude FLOAT, 110 | longitude FLOAT 111 | ); 112 | """) 113 | 114 | time_table_create = (""" 115 | CREATE TABLE IF NOT EXISTS dim_time 116 | ( 117 | start_time TIMESTAMP PRIMARY KEY sortkey distkey, 118 | hour INTEGER, 119 | day INTEGER, 120 | week INTEGER, 121 | month INTEGER, 122 | year INTEGER, 123 | weekday INTEGER 124 | ); 125 | """) 126 | 127 | # STAGING TABLES 128 | 129 | staging_events_copy = (""" 130 | COPY staging_events FROM {} 131 | CREDENTIALS 'aws_iam_role={}' 132 | COMPUPDATE OFF region 'us-west-2' 133 | TIMEFORMAT as 'epochmillisecs' 134 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL 135 | FORMAT AS JSON {}; 136 | """).format(LOG_DATA, IAM_ROLE, LOG_PATH) 137 | 138 | staging_songs_copy = (""" 139 | COPY staging_songs FROM {} 140 | CREDENTIALS 'aws_iam_role={}' 141 | COMPUPDATE OFF region 'us-west-2' 142 | FORMAT AS JSON 'auto' 143 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL; 144 | """).format(SONG_DATA, IAM_ROLE) 145 | 146 | # FINAL TABLES 147 | 148 | songplay_table_insert = (""" 149 | INSERT INTO fact_songplay(start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) 150 | SELECT DISTINCT to_timestamp(to_char(se.ts, '9999-99-99 99:99:99'),'YYYY-MM-DD HH24:MI:SS'), 151 | se.userId as user_id, 152 | se.level as level, 153 | ss.song_id as song_id, 154 | ss.artist_id as artist_id, 155 | se.sessionId as session_id, 156 
| se.location as location, 157 | se.userAgent as user_agent 158 | FROM staging_events se 159 | JOIN staging_songs ss ON se.song = ss.title AND se.artist = ss.artist_name; 160 | """) 161 | 162 | user_table_insert = (""" 163 | INSERT INTO dim_user(user_id, first_name, last_name, gender, level) 164 | SELECT DISTINCT userId as user_id, 165 | firstName as first_name, 166 | lastName as last_name, 167 | gender as gender, 168 | level as level 169 | FROM staging_events 170 | where userId IS NOT NULL; 171 | """) 172 | 173 | song_table_insert = (""" 174 | INSERT INTO dim_song(song_id, title, artist_id, year, duration) 175 | SELECT DISTINCT song_id as song_id, 176 | title as title, 177 | artist_id as artist_id, 178 | year as year, 179 | duration as duration 180 | FROM staging_songs 181 | WHERE song_id IS NOT NULL; 182 | """) 183 | 184 | artist_table_insert = (""" 185 | INSERT INTO dim_artist(artist_id, name, location, latitude, longitude) 186 | SELECT DISTINCT artist_id as artist_id, 187 | artist_name as name, 188 | artist_location as location, 189 | artist_latitude as latitude, 190 | artist_longitude as longitude 191 | FROM staging_songs 192 | where artist_id IS NOT NULL; 193 | """) 194 | 195 | time_table_insert = (""" 196 | INSERT INTO dim_time(start_time, hour, day, week, month, year, weekday) 197 | SELECT distinct ts, 198 | EXTRACT(hour from ts), 199 | EXTRACT(day from ts), 200 | EXTRACT(week from ts), 201 | EXTRACT(month from ts), 202 | EXTRACT(year from ts), 203 | EXTRACT(weekday from ts) 204 | FROM staging_events 205 | WHERE ts IS NOT NULL; 206 | """) 207 | 208 | # QUERY LISTS 209 | 210 | create_table_queries = [staging_events_table_create, staging_songs_table_create, songplay_table_create, user_table_create, song_table_create, artist_table_create, time_table_create] 211 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 212 | copy_table_queries = [staging_events_copy, staging_songs_copy] 213 | insert_table_queries = [songplay_table_insert, user_table_insert, song_table_insert, artist_table_insert, time_table_insert] 214 | -------------------------------------------------------------------------------- /Cloud Data Warehouse/Readme.md: -------------------------------------------------------------------------------- 1 | This folder will contain the exercise files and details for Udacity Module - Cloud Data Warehouse 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Dataframe_Quiz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Answer Key to the Data Frame Programming Quiz\n", 8 | "\n", 9 | "Helpful resources:\n", 10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from pyspark.sql import SparkSession\n", 20 | "from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n", 21 | "from pyspark.sql.functions import sum as Fsum\n", 22 | "from pyspark.sql.window import Window\n", 23 | "from pyspark.sql.types import IntegerType" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# 1) import any other libraries you 
might need\n", 33 | "# 2) instantiate a Spark session \n", 34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n", 35 | "# 4) write code to answer the quiz questions \n", 36 | "\n", 37 | "spark = SparkSession \\\n", 38 | " .builder \\\n", 39 | " .appName(\"Data Frames practice\") \\\n", 40 | " .getOrCreate()\n", 41 | "\n", 42 | "df = spark.read.json(\"data/sparkify_log_small.json\")" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "# Question 1\n", 50 | "\n", 51 | "Which page did user id \"\" (empty string) NOT visit?" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "root\n", 64 | " |-- artist: string (nullable = true)\n", 65 | " |-- auth: string (nullable = true)\n", 66 | " |-- firstName: string (nullable = true)\n", 67 | " |-- gender: string (nullable = true)\n", 68 | " |-- itemInSession: long (nullable = true)\n", 69 | " |-- lastName: string (nullable = true)\n", 70 | " |-- length: double (nullable = true)\n", 71 | " |-- level: string (nullable = true)\n", 72 | " |-- location: string (nullable = true)\n", 73 | " |-- method: string (nullable = true)\n", 74 | " |-- page: string (nullable = true)\n", 75 | " |-- registration: long (nullable = true)\n", 76 | " |-- sessionId: long (nullable = true)\n", 77 | " |-- song: string (nullable = true)\n", 78 | " |-- status: long (nullable = true)\n", 79 | " |-- ts: long (nullable = true)\n", 80 | " |-- userAgent: string (nullable = true)\n", 81 | " |-- userId: string (nullable = true)\n", 82 | "\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "df.printSchema()" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Settings\n", 100 | "Logout\n", 101 | "Submit Upgrade\n", 102 | "Error\n", 103 | "NextSong\n", 104 | "Submit Downgrade\n", 105 | "Downgrade\n", 106 | "Upgrade\n", 107 | "Save Settings\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "# filter for users with blank user id\n", 113 | "blank_pages = df.filter(df.userId == '') \\\n", 114 | " .select(col('page') \\\n", 115 | " .alias('blank_pages')) \\\n", 116 | " .dropDuplicates()\n", 117 | "\n", 118 | "# get a list of possible pages that could be visited\n", 119 | "all_pages = df.select('page').dropDuplicates()\n", 120 | "\n", 121 | "# find values in all_pages that are not in blank_pages\n", 122 | "# these are the pages that the blank user did not go to\n", 123 | "for row in set(all_pages.collect()) - set(blank_pages.collect()):\n", 124 | " print(row.page)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "# Question 2 - Reflect\n", 132 | "\n", 133 | "What type of user does the empty string user id most likely refer to?\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Perhaps it represents users who have not signed up yet or who are signed out and are about to log in." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "# Question 3\n", 148 | "\n", 149 | "How many female users do we have in the data set?" 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 5, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "462" 161 | ] 162 | }, 163 | "execution_count": 5, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "df.filter(df.gender == 'F') \\\n", 170 | " .select('userId', 'gender') \\\n", 171 | " .dropDuplicates() \\\n", 172 | " .count()" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "# Question 4\n", 180 | "\n", 181 | "How many songs were played from the most played artist?" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 6, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "+--------+-----------+\n", 194 | "| Artist|Artistcount|\n", 195 | "+--------+-----------+\n", 196 | "|Coldplay| 83|\n", 197 | "+--------+-----------+\n", 198 | "only showing top 1 row\n", 199 | "\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "df.filter(df.page == 'NextSong') \\\n", 205 | " .select('Artist') \\\n", 206 | " .groupBy('Artist') \\\n", 207 | " .agg({'Artist':'count'}) \\\n", 208 | " .withColumnRenamed('count(Artist)', 'Artistcount') \\\n", 209 | " .sort(desc('Artistcount')) \\\n", 210 | " .show(1)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "# Question 5 (challenge)\n", 218 | "\n", 219 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n", 220 | "\n" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 7, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "+------------------+\n", 233 | "|avg(count(period))|\n", 234 | "+------------------+\n", 235 | "| 6.898347107438017|\n", 236 | "+------------------+\n", 237 | "\n" 238 | ] 239 | } 240 | ], 241 | "source": [ 242 | "function = udf(lambda ishome : int(ishome == 'Home'), IntegerType())\n", 243 | "\n", 244 | "user_window = Window \\\n", 245 | " .partitionBy('userID') \\\n", 246 | " .orderBy(desc('ts')) \\\n", 247 | " .rangeBetween(Window.unboundedPreceding, 0)\n", 248 | "\n", 249 | "cusum = df.filter((df.page == 'NextSong') | (df.page == 'Home')) \\\n", 250 | " .select('userID', 'page', 'ts') \\\n", 251 | " .withColumn('homevisit', function(col('page'))) \\\n", 252 | " .withColumn('period', Fsum('homevisit').over(user_window))\n", 253 | "\n", 254 | "cusum.filter((cusum.page == 'NextSong')) \\\n", 255 | " .groupBy('userID', 'period') \\\n", 256 | " .agg({'period':'count'}) \\\n", 257 | " .agg({'count(period)':'avg'}).show()" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [] 266 | } 267 | ], 268 | "metadata": { 269 | "kernelspec": { 270 | "display_name": "Python 3", 271 | "language": "python", 272 | "name": "python3" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": { 276 | "name": "ipython", 277 | "version": 3 278 | }, 279 | "file_extension": ".py", 280 | "mimetype": "text/x-python", 281 | "name": "python", 282 | "nbconvert_exporter": "python", 283 | "pygments_lexer": "ipython3", 284 | "version": "3.6.3" 285 | } 286 | }, 287 | "nbformat": 4, 288 | "nbformat_minor": 2 289 | } 290 | 
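A minimal follow-up sketch, reusing the cusum DataFrame defined in the cell above: the quiz asks for the Question 5 answer rounded to the closest integer, while the last cell stops at the raw average, so the value can be collected into Python and rounded there.

avg_songs = cusum.filter(cusum.page == 'NextSong') \
    .groupBy('userID', 'period') \
    .agg({'period': 'count'}) \
    .agg({'count(period)': 'avg'}) \
    .collect()[0][0]

print(round(avg_songs))  # the average of roughly 6.898 rounds to 7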
-------------------------------------------------------------------------------- /Data Lakes with Spark/Procedural_vs_Functional_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Procedural Programming\n", 8 | "\n", 9 | "This notebook contains the code from the previous screencast. The code counts the number of times a song appears in the log_of_songs variable. \n", 10 | "\n", 11 | "You'll notice that the first time you run `count_plays(\"Despacito\")`, you get the correct count. However, when you run the same code again `count_plays(\"Despacito\")`, the results are no longer correct.This is because the global variable `play_count` stores the results outside of the count_plays function. \n", 12 | "\n", 13 | "\n", 14 | "# Instructions\n", 15 | "\n", 16 | "Run the code cells in this notebook to see the problem with " 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "log_of_songs = [\n", 26 | " \"Despacito\",\n", 27 | " \"Nice for what\",\n", 28 | " \"No tears left to cry\",\n", 29 | " \"Despacito\",\n", 30 | " \"Havana\",\n", 31 | " \"In my feelings\",\n", 32 | " \"Nice for what\",\n", 33 | " \"Despacito\",\n", 34 | " \"All the stars\"\n", 35 | "]" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "play_count = 0" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "def count_plays(song_title):\n", 54 | " global play_count\n", 55 | " for song in log_of_songs:\n", 56 | " if song == song_title:\n", 57 | " play_count = play_count + 1\n", 58 | " return play_count" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "3" 70 | ] 71 | }, 72 | "execution_count": 4, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "count_plays(\"Despacito\")" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "6" 90 | ] 91 | }, 92 | "execution_count": 5, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "count_plays(\"Despacito\")" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "# How to Solve the Issue\n", 106 | "\n", 107 | "How might you solve this issue? You could get rid of the global variable and instead use play_count as an input to the function:\n", 108 | "\n", 109 | "```python\n", 110 | "def count_plays(song_title, play_count):\n", 111 | " for song in log_of_songs:\n", 112 | " if song == song_title:\n", 113 | " play_count = play_count + 1\n", 114 | " return play_count\n", 115 | "\n", 116 | "```\n", 117 | "\n", 118 | "How would this work with parallel programming? Spark splits up data onto multiple machines. If your songs list were split onto two machines, Machine A would first need to finish counting, and then return its own result to Machine B. And then Machine B could use the output from Machine A and add to the count.\n", 119 | "\n", 120 | "However, that isn't parallel computing. Machine B would have to wait until Machine A finishes. 
You'll see in the next parts of the lesson how Spark solves this issue with a functional programming paradigm.\n", 121 | "\n", 122 | "In Spark, if your data is split onto two different machines, machine A will run a function to count how many times 'Despacito' appears on machine A. Machine B will simultaneously run a function to count how many times 'Despacito' appears on machine B. After they finish counting individually, they'll combine their results together. You'll see how this works in the next parts of the lesson." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 3", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.6.3" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 2 154 | } 155 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/README.md: -------------------------------------------------------------------------------- 1 | Project: Data Lake 2 | 3 | Introduction 4 | 5 | A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app 6 | 7 | 8 | Project Description 9 | 10 | Apply the knowledge of Spark and Data Lakes to build and ETL pipeline for a Data Lake hosted on Amazon S3 11 | 12 | In this task, we have to build an ETL Pipeline that extracts their data from S3 and process them using Spark and then load back into S3 in a set of Fact and Dimension Tables. This will allow their analytics team to continue finding insights in what songs their users are listening. Will have to deploy this Spark process on a Cluster using AWS 13 | 14 | Project Datasets 15 | 16 | Song Data Path --> s3://udacity-dend/song_data Log Data Path --> s3://udacity-dend/log_data Log Data JSON Path --> s3://udacity-dend/log_json_path.json 17 | 18 | Song Dataset 19 | 20 | The first dataset is a subset of real data from the Million Song Dataset(https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example: 21 | 22 | song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json 23 | 24 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like. 25 | 26 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 27 | 28 | Log Dataset 29 | 30 | The second dataset consists of log files in JSON format. The log files in the dataset with are partitioned by year and month. 
For example: 31 | 32 | log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json 33 | 34 | And below is an example of what a single log file, 2018-11-13-events.json, looks like. 35 | 36 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender":"F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10} 37 | 38 | Schema for Song Play Analysis 39 | 40 | A star schema is used to optimize queries for song play analysis. 41 | 42 | Fact Table 43 | 44 | songplays - records in event data associated with song plays, i.e. records with the page NextSong: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 45 | 46 | Dimension Tables 47 | 48 | users - users in the app: user_id, first_name, last_name, gender, level 49 | 50 | songs - songs in the music database: song_id, title, artist_id, year, duration 51 | 52 | artists - artists in the music database: artist_id, name, location, latitude, longitude 53 | 54 | time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday 55 | 56 | Project Template 57 | 58 | The project template includes three files: 59 | 60 | 1. etl.py reads data from S3, processes that data using Spark, and writes it back to S3 61 | 62 | 2. dl.cfg contains the AWS credentials 63 | 64 | 3. README.md provides discussion on the process and decisions 65 | 66 | ETL Pipeline 67 | 68 | 1. Load the credentials from dl.cfg 69 | 2. Load the data, which is stored in JSON files (song data and log data), from S3 70 | 3. Process these JSON files with Spark 71 | 4. Generate a set of fact and dimension tables 72 | 5. Load the resulting tables back to S3 73 | 74 | Final Instructions 75 | 76 | 1. Write the correct AWS keys in dl.cfg 77 | 2. Open a terminal and run the command "python etl.py" 78 | 3.
Should take about 2-4 mins in total 79 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/Readme.MD: -------------------------------------------------------------------------------- 1 | Hello Testing 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/dl.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | AWS_ACCESS_KEY_ID= 3 | AWS_SECRET_ACCESS_KEY= -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | from datetime import datetime 3 | import os 4 | from pyspark.sql import SparkSession 5 | from pyspark.sql.functions import udf, col 6 | from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format 7 | 8 | 9 | config = configparser.ConfigParser() 10 | config.read_file(open('dl.cfg')) 11 | 12 | os.environ['AWS_ACCESS_KEY_ID']=config.get('AWS','AWS_ACCESS_KEY_ID') 13 | os.environ['AWS_SECRET_ACCESS_KEY']=config.get('AWS','AWS_SECRET_ACCESS_KEY') 14 | 15 | 16 | def create_spark_session(): 17 | spark = SparkSession \ 18 | .builder \ 19 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \ 20 | .getOrCreate() 21 | return spark 22 | 23 | 24 | def process_song_data(spark, input_data, output_data): 25 | """ 26 | Description: This function loads song_data from S3 and processes it by extracting the songs and artist tables 27 | and then again loaded back to S3 28 | 29 | Parameters: 30 | spark : this is the Spark Session 31 | input_data : the location of song_data from where the file is load to process 32 | output_data : the location where after processing the results will be stored 33 | 34 | """ 35 | # get filepath to song data file 36 | song_data = input_data + 'song_data/*/*/*/*.json' 37 | 38 | # read song data file 39 | df = spark.read.json(song_data) 40 | 41 | # created song view to write SQL Queries 42 | df.createOrReplaceTempView("song_data_table") 43 | 44 | # extract columns to create songs table 45 | songs_table = spark.sql(""" 46 | SELECT sdtn.song_id, 47 | sdtn.title, 48 | sdtn.artist_id, 49 | sdtn.year, 50 | sdtn.duration 51 | FROM song_data_table sdtn 52 | WHERE song_id IS NOT NULL 53 | """) 54 | 55 | # write songs table to parquet files partitioned by year and artist 56 | songs_table.write.mode('overwrite').partitionBy("year", "artist_id").parquet(output_data+'songs_table/') 57 | 58 | # extract columns to create artists table 59 | artists_table = spark.sql(""" 60 | SELECT DISTINCT arti.artist_id, 61 | arti.artist_name, 62 | arti.artist_location, 63 | arti.artist_latitude, 64 | arti.artist_longitude 65 | FROM song_data_table arti 66 | WHERE arti.artist_id IS NOT NULL 67 | """) 68 | 69 | # write artists table to parquet files 70 | artists_table.write.mode('overwrite').parquet(output_data+'artists_table/') 71 | 72 | 73 | def process_log_data(spark, input_data, output_data): 74 | """ 75 | Description: This function loads log_data from S3 and processes it by extracting the songs and artist tables 76 | and then again loaded back to S3. 
Also output from previous function is used in by spark.read.json command 77 | 78 | Parameters: 79 | spark : this is the Spark Session 80 | input_data : the location of song_data from where the file is load to process 81 | output_data : the location where after processing the results will be stored 82 | 83 | """ 84 | # get filepath to log data file 85 | log_path = input_data + 'log_data/*.json' 86 | 87 | # read log data file 88 | df = spark.read.json(log_path) 89 | 90 | # filter by actions for song plays 91 | df = df.filter(df.page == 'NextSong') 92 | 93 | # created log view to write SQL Queries 94 | df.createOrReplaceTempView("log_data_table") 95 | 96 | # extract columns for users table 97 | users_table = spark.sql(""" 98 | SELECT DISTINCT userT.userId as user_id, 99 | userT.firstName as first_name, 100 | userT.lastName as last_name, 101 | userT.gender as gender, 102 | userT.level as level 103 | FROM log_data_table userT 104 | WHERE userT.userId IS NOT NULL 105 | """) 106 | 107 | # write users table to parquet files 108 | users_table.write.mode('overwrite').parquet(output_data+'users_table/') 109 | 110 | # create timestamp column from original timestamp column 111 | # get_timestamp = udf() 112 | # df = 113 | 114 | # create datetime column from original timestamp column 115 | # get_datetime = udf() 116 | # df = 117 | 118 | # extract columns to create time table 119 | time_table = spark.sql(""" 120 | SELECT 121 | A.start_time_sub as start_time, 122 | hour(A.start_time_sub) as hour, 123 | dayofmonth(A.start_time_sub) as day, 124 | weekofyear(A.start_time_sub) as week, 125 | month(A.start_time_sub) as month, 126 | year(A.start_time_sub) as year, 127 | dayofweek(A.start_time_sub) as weekday 128 | FROM 129 | (SELECT to_timestamp(timeSt.ts/1000) as start_time_sub 130 | FROM log_data_table timeSt 131 | WHERE timeSt.ts IS NOT NULL 132 | ) A 133 | """) 134 | 135 | # write time table to parquet files partitioned by year and month 136 | time_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'time_table/') 137 | 138 | # read in song data to use for songplays table 139 | song_df = spark.read.parquet(output_data+'songs_table/') 140 | 141 | # read song data file 142 | # song_df_upd = spark.read.json(input_data + 'song_data/*/*/*/*.json') 143 | # created song view to write SQL Queries 144 | # song_df_upd.createOrReplaceTempView("song_data_table") 145 | 146 | 147 | 148 | # extract columns from joined song and log datasets to create songplays table 149 | songplays_table = spark.sql(""" 150 | SELECT monotonically_increasing_id() as songplay_id, 151 | to_timestamp(logT.ts/1000) as start_time, 152 | month(to_timestamp(logT.ts/1000)) as month, 153 | year(to_timestamp(logT.ts/1000)) as year, 154 | logT.userId as user_id, 155 | logT.level as level, 156 | songT.song_id as song_id, 157 | songT.artist_id as artist_id, 158 | logT.sessionId as session_id, 159 | logT.location as location, 160 | logT.userAgent as user_agent 161 | 162 | FROM log_data_table logT 163 | JOIN song_data_table songT on logT.artist = songT.artist_name and logT.song = songT.title 164 | """) 165 | 166 | # write songplays table to parquet files partitioned by year and month 167 | songplays_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'songplays_table/') 168 | 169 | 170 | def main(): 171 | spark = create_spark_session() 172 | 173 | input_data = "s3a://udacity-dend/" 174 | output_data = "s3a://udacity-dend/dloutput/" 175 | 176 | #input_data = "./" 177 | #output_data = "./dloutput/" 178 | 
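    # Note: process_song_data() must run before process_log_data() -- the songplays
    # query in process_log_data() joins against the "song_data_table" temp view
    # registered while the song data is processed, and it also reads back the
    # songs_table parquet output written by process_song_data().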
179 | process_song_data(spark, input_data, output_data) 180 | process_log_data(spark, input_data, output_data) 181 | 182 | 183 | if __name__ == "__main__": 184 | main() 185 | -------------------------------------------------------------------------------- /Data Lakes with Spark/README.md: -------------------------------------------------------------------------------- 1 | Data Lakes with Spark Exercise Files 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Spark_Maps_Lazy_Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Maps\n", 8 | "\n", 9 | "In Spark, maps take data as input and then transform that data with whatever function you put in the map. They are like directions for the data telling how each input should get to the output.\n", 10 | "\n", 11 | "The first code cell creates a SparkContext object. With the SparkContext, you can input a dataset and parallelize the data across a cluster (since you are currently using Spark in local mode on a single machine, technically the dataset isn't distributed yet).\n", 12 | "\n", 13 | "Run the code cell below to instantiate a SparkContext object and then read in the log_of_songs list into Spark. " 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 1, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "### \n", 23 | "# You might have noticed this code in the screencast.\n", 24 | "#\n", 25 | "# import findspark\n", 26 | "# findspark.init('spark-2.3.2-bin-hadoop2.7')\n", 27 | "#\n", 28 | "# The findspark Python module makes it easier to install\n", 29 | "# Spark in local mode on your computer. This is convenient\n", 30 | "# for practicing Spark syntax locally. \n", 31 | "# However, the workspaces already have Spark installed and you do not\n", 32 | "# need to use the findspark module\n", 33 | "#\n", 34 | "###\n", 35 | "\n", 36 | "import pyspark\n", 37 | "sc = pyspark.SparkContext(appName=\"maps_and_lazy_evaluation_example\")\n", 38 | "\n", 39 | "log_of_songs = [\n", 40 | " \"Despacito\",\n", 41 | " \"Nice for what\",\n", 42 | " \"No tears left to cry\",\n", 43 | " \"Despacito\",\n", 44 | " \"Havana\",\n", 45 | " \"In my feelings\",\n", 46 | " \"Nice for what\",\n", 47 | " \"despacito\",\n", 48 | " \"All the stars\"\n", 49 | "]\n", 50 | "\n", 51 | "# parallelize the log_of_songs to use with Spark\n", 52 | "distributed_song_log = sc.parallelize(log_of_songs)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "This next code cell defines a function that converts a song title to lowercase. Then there is an example converting the word \"Havana\" to \"havana\"." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "'havana'" 71 | ] 72 | }, 73 | "execution_count": 2, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "def convert_song_to_lowercase(song):\n", 80 | " return song.lower()\n", 81 | "\n", 82 | "convert_song_to_lowercase(\"Havana\")" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "The following code cells demonstrate how to apply this function using a map step. The map step will go through each song in the list and apply the convert_song_to_lowercase() function. 
" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "PythonRDD[1] at RDD at PythonRDD.scala:53" 101 | ] 102 | }, 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "distributed_song_log.map(convert_song_to_lowercase)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "You'll notice that this code cell ran quite quickly. This is because of lazy evaluation. Spark does not actually execute the map step unless it needs to.\n", 117 | "\n", 118 | "\"RDD\" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data. \n", 119 | "\n", 120 | "To get Spark to actually run the map step, you need to use an \"action\". One available action is the collect method. The collect() method takes the results from all of the clusters and \"collects\" them into a single list on the master node." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 4, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "data": { 130 | "text/plain": [ 131 | "['despacito',\n", 132 | " 'nice for what',\n", 133 | " 'no tears left to cry',\n", 134 | " 'despacito',\n", 135 | " 'havana',\n", 136 | " 'in my feelings',\n", 137 | " 'nice for what',\n", 138 | " 'despacito',\n", 139 | " 'all the stars']" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "distributed_song_log.map(convert_song_to_lowercase).collect()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Note as well that Spark is not changing the original data set: Spark is merely making a copy. You can see this by running collect() on the original dataset." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "['Despacito',\n", 167 | " 'Nice for what',\n", 168 | " 'No tears left to cry',\n", 169 | " 'Despacito',\n", 170 | " 'Havana',\n", 171 | " 'In my feelings',\n", 172 | " 'Nice for what',\n", 173 | " 'despacito',\n", 174 | " 'All the stars']" 175 | ] 176 | }, 177 | "execution_count": 5, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "distributed_song_log.collect()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions as well as built-in Python functions like string.lower(). \n", 191 | "\n", 192 | "Anonymous functions are actually a Python feature for writing functional style programs." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 6, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "['despacito',\n", 204 | " 'nice for what',\n", 205 | " 'no tears left to cry',\n", 206 | " 'despacito',\n", 207 | " 'havana',\n", 208 | " 'in my feelings',\n", 209 | " 'nice for what',\n", 210 | " 'despacito',\n", 211 | " 'all the stars']" 212 | ] 213 | }, 214 | "execution_count": 6, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "distributed_song_log.map(lambda song: song.lower()).collect()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 7, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "['despacito',\n", 232 | " 'nice for what',\n", 233 | " 'no tears left to cry',\n", 234 | " 'despacito',\n", 235 | " 'havana',\n", 236 | " 'in my feelings',\n", 237 | " 'nice for what',\n", 238 | " 'despacito',\n", 239 | " 'all the stars']" 240 | ] 241 | }, 242 | "execution_count": 7, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "distributed_song_log.map(lambda x: x.lower()).collect()" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 9, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "9" 260 | ] 261 | }, 262 | "execution_count": 9, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | } 266 | ], 267 | "source": [ 268 | "distributed_song_log.map(lambda x: x.lower()).count()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python 3", 282 | "language": "python", 283 | "name": "python3" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.3" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Spark_Sql_Quiz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Answer Key to the Data Frame Programming Quiz\n", 8 | "\n", 9 | "Helpful resources:\n", 10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 4, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from pyspark.sql import SparkSession\n", 20 | "# from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n", 21 | "# from pyspark.sql.functions import sum as Fsum\n", 22 | "# from pyspark.sql.window import Window\n", 23 | "# from pyspark.sql.types import IntegerType" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 5, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# 1) import any other libraries you might need\n", 33 | "# 2) instantiate a Spark session \n", 34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n", 35 | "# 4) create a view to 
use with your SQL queries\n", 36 | "# 5) write code to answer the quiz questions \n", 37 | "\n", 38 | "spark = SparkSession \\\n", 39 | " .builder \\\n", 40 | " .appName(\"Spark SQL Quiz\") \\\n", 41 | " .getOrCreate()\n", 42 | "\n", 43 | "user_log = spark.read.json(\"data/sparkify_log_small.json\")\n", 44 | "\n", 45 | "user_log.createOrReplaceTempView(\"log_table\")\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# Question 1\n", 53 | "\n", 54 | "Which page did user id \"\" (empty string) NOT visit?" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 6, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "root\n", 67 | " |-- artist: string (nullable = true)\n", 68 | " |-- auth: string (nullable = true)\n", 69 | " |-- firstName: string (nullable = true)\n", 70 | " |-- gender: string (nullable = true)\n", 71 | " |-- itemInSession: long (nullable = true)\n", 72 | " |-- lastName: string (nullable = true)\n", 73 | " |-- length: double (nullable = true)\n", 74 | " |-- level: string (nullable = true)\n", 75 | " |-- location: string (nullable = true)\n", 76 | " |-- method: string (nullable = true)\n", 77 | " |-- page: string (nullable = true)\n", 78 | " |-- registration: long (nullable = true)\n", 79 | " |-- sessionId: long (nullable = true)\n", 80 | " |-- song: string (nullable = true)\n", 81 | " |-- status: long (nullable = true)\n", 82 | " |-- ts: long (nullable = true)\n", 83 | " |-- userAgent: string (nullable = true)\n", 84 | " |-- userId: string (nullable = true)\n", 85 | "\n" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "user_log.printSchema()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 7, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "name": "stdout", 100 | "output_type": "stream", 101 | "text": [ 102 | "+----+----------------+\n", 103 | "|page| page|\n", 104 | "+----+----------------+\n", 105 | "|null|Submit Downgrade|\n", 106 | "|null| Downgrade|\n", 107 | "|null| Logout|\n", 108 | "|null| Save Settings|\n", 109 | "|null| Settings|\n", 110 | "|null| NextSong|\n", 111 | "|null| Upgrade|\n", 112 | "|null| Error|\n", 113 | "|null| Submit Upgrade|\n", 114 | "+----+----------------+\n", 115 | "\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "# SELECT distinct pages for the blank user and distinc pages for all users\n", 121 | "# Right join the results to find pages that blank visitor did not visit\n", 122 | "spark.sql(\"SELECT * \\\n", 123 | " FROM ( \\\n", 124 | " SELECT DISTINCT page \\\n", 125 | " FROM log_table \\\n", 126 | " WHERE userID='') AS user_pages \\\n", 127 | " RIGHT JOIN ( \\\n", 128 | " SELECT DISTINCT page \\\n", 129 | " FROM log_table) AS all_pages \\\n", 130 | " ON user_pages.page = all_pages.page \\\n", 131 | " WHERE user_pages.page IS NULL\").show()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "# Question 2 - Reflect\n", 139 | "\n", 140 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?\n", 141 | "\n", 142 | "Both Spark SQL and Spark Data Frames are part of the Spark SQL library. Hence, they both use the Spark SQL Catalyst Optimizer to optimize queries. \n", 143 | "\n", 144 | "You might prefer SQL over data frames because the syntax is clearer especially for teams already experienced in SQL.\n", 145 | "\n", 146 | "Spark data frames give you more control. 
You can break down your queries into smaller steps, which can make debugging easier. You can also [cache](https://unraveldata.com/to-cache-or-not-to-cache/) intermediate results or [repartition](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) intermediate results." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "# Question 3\n", 154 | "\n", 155 | "How many female users do we have in the data set?" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 8, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "+----------------------+\n", 168 | "|count(DISTINCT userID)|\n", 169 | "+----------------------+\n", 170 | "| 462|\n", 171 | "+----------------------+\n", 172 | "\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "spark.sql(\"SELECT COUNT(DISTINCT userID) \\\n", 178 | " FROM log_table \\\n", 179 | " WHERE gender = 'F'\").show()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "# Question 4\n", 187 | "\n", 188 | "How many songs were played from the most played artist?" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 9, 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "+--------+-----+\n", 201 | "| Artist|plays|\n", 202 | "+--------+-----+\n", 203 | "|Coldplay| 83|\n", 204 | "+--------+-----+\n", 205 | "\n", 206 | "+--------+-----+\n", 207 | "| Artist|plays|\n", 208 | "+--------+-----+\n", 209 | "|Coldplay| 83|\n", 210 | "+--------+-----+\n", 211 | "\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "# Here is one solution\n", 217 | "spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n", 218 | " FROM log_table \\\n", 219 | " GROUP BY Artist \\\n", 220 | " ORDER BY plays DESC \\\n", 221 | " LIMIT 1\").show()\n", 222 | "\n", 223 | "# Here is an alternative solution\n", 224 | "# Get the artist play counts\n", 225 | "play_counts = spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n", 226 | " FROM log_table \\\n", 227 | " GROUP BY Artist\")\n", 228 | "\n", 229 | "# save the results in a new view\n", 230 | "play_counts.createOrReplaceTempView(\"artist_counts\")\n", 231 | "\n", 232 | "# use a self join to find where the max play equals the count value\n", 233 | "spark.sql(\"SELECT a2.Artist, a2.plays FROM \\\n", 234 | " (SELECT max(plays) AS max_plays FROM artist_counts) AS a1 \\\n", 235 | " JOIN artist_counts AS a2 \\\n", 236 | " ON a1.max_plays = a2.plays \\\n", 237 | " \").show()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "# Question 5 (challenge)\n", 245 | "\n", 246 | "How many songs do users listen to on average between visiting our home page? 
Please round your answer to the closest integer.\n", 247 | "\n" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 31, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "name": "stdout", 257 | "output_type": "stream", 258 | "text": [ 259 | "+------------------+\n", 260 | "|avg(count(period))|\n", 261 | "+------------------+\n", 262 | "| 6.898347107438017|\n", 263 | "+------------------+\n", 264 | "\n" 265 | ] 266 | } 267 | ], 268 | "source": [ 269 | "# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;\n", 270 | "is_home = spark.sql(\"SELECT userID, page, ts, CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home FROM log_table \\\n", 271 | " WHERE (page = 'NextSong') or (page = 'Home') \\\n", 272 | " \")\n", 273 | "\n", 274 | "# keep the results in a new view\n", 275 | "is_home.createOrReplaceTempView(\"is_home_table\")\n", 276 | "\n", 277 | "# find the cumulative sum over the is_home column\n", 278 | "cumulative_sum = spark.sql(\"SELECT *, SUM(is_home) OVER \\\n", 279 | " (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period \\\n", 280 | " FROM is_home_table\")\n", 281 | "\n", 282 | "# keep the results in a view\n", 283 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n", 284 | "\n", 285 | "# find the average count for NextSong\n", 286 | "spark.sql(\"SELECT AVG(count_results) FROM \\\n", 287 | " (SELECT COUNT(*) AS count_results FROM period_table \\\n", 288 | "GROUP BY userID, period, page HAVING page = 'NextSong') AS counts\").show()" 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.6.3" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 1.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Define a function that uses the python logger to log a function. Then finish filling in the details of the DAG down below. Once you’ve done that, run "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then Run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page. 
3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.operators.python_operator import PythonOperator 9 | 10 | def first_prog(): 11 | logging.info("This is my very first program for airflow") 12 | 13 | dag = DAG( 14 | 'lesson1.exercise1', 15 | start_date=datetime.datetime.now()) 16 | 17 | greet_task = PythonOperator( 18 | task_id="first_airflow_program", 19 | python_callable=first_prog, 20 | dag=dag 21 | ) 22 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 2.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.operators.python_operator import PythonOperator 6 | 7 | 8 | def second_prog(): 9 | logging.info("This is my second program for airflow") 10 | 11 | dag = DAG( 12 | "lesson1.exercise2", 13 | start_date=datetime.datetime.now() - datetime.timedelta(days=2), 14 | schedule_interval="@daily") 15 | 16 | task = PythonOperator( 17 | task_id="exercise_2", 18 | python_callable=second_prog, 19 | dag=dag) 20 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 3.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.operators.python_operator import PythonOperator 6 | 7 | 8 | def hello_world(): 9 | logging.info("Hello World") 10 | 11 | 12 | def addition(): 13 | logging.info(f"2 + 2 = {2+2}") 14 | 15 | 16 | def subtraction(): 17 | logging.info(f"6 - 2 = {6-2}") 18 | 19 | 20 | def division(): 21 | logging.info(f"10 / 2 = {int(10/2)}") 22 | 23 | def completed_task(): 24 | logging.info("All Tasks Completed") 25 | 26 | 27 | dag = DAG( 28 | "lesson1.exercise3", 29 | schedule_interval='@hourly', 30 | start_date=datetime.datetime.now() - datetime.timedelta(days=1)) 31 | 32 | hello_world_task = PythonOperator( 33 | task_id="hello_world", 34 | python_callable=hello_world, 35 | dag=dag) 36 | 37 | addition_task = PythonOperator( 38 | task_id="addition", 39 | python_callable=addition, 40 | dag=dag) 41 | 42 | subtraction_task = PythonOperator( 43 | task_id="subtraction", 44 | python_callable=subtraction, 45 | dag=dag) 46 | 47 | division_task = PythonOperator( 48 | task_id="division", 49 | python_callable=division, 50 | dag=dag) 51 | 52 | completed_task = PythonOperator( 53 | task_id="completed_task", 54 | python_callable=completed_task, 55 | dag=dag) 56 | # 57 | # -> addition_task 58 | # / \ 59 | # hello_world_task -> division_task-> completed_task 60 | # \ / 61 | # -> subtraction_task 62 | 63 | hello_world_task >> addition_task 64 | hello_world_task >> division_task 65 | hello_world_task >> subtraction_task 66 | addition_task >> completed_task 67 | division_task >> completed_task 68 | subtraction_task >> completed_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 4.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.models import Variable 6 | from airflow.operators.python_operator import PythonOperator 7 | from airflow.hooks.S3_hook import S3Hook 8 | 9 | # 10 | # TODO: There is no code to modify in this exercise.
We're going to create a connection and a 11 | # variable. 12 | # 1. Open your browser to localhost:8080 and open Admin->Variables 13 | # 2. Click "Create" 14 | # 3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend" 15 | # 4. Click save 16 | # 5. Open Admin->Connections 17 | # 6. Click "Create" 18 | # 7. Set "Conn Id" to "aws_credentials", "Conn Type" to "Amazon Web Services" 19 | # Set "Login" to your aws_access_key_id and "Password" to your aws_secret_key 20 | # 8. Click save 21 | # 9. Run the DAG 22 | 23 | def list_keys(): 24 | hook = S3Hook(aws_conn_id='aws_credentials') 25 | bucket = Variable.get('s3_bucket') 26 | logging.info(f"Listing Keys from {bucket}") 27 | keys = hook.list_keys(bucket) 28 | for key in keys: 29 | logging.info(f"- s3://{bucket}/{key}") 30 | 31 | 32 | dag = DAG( 33 | 'lesson1.exercise4', 34 | start_date=datetime.datetime.now()) 35 | 36 | list_task = PythonOperator( 37 | task_id="list_keys", 38 | python_callable=list_keys, 39 | dag=dag 40 | ) -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 5.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Use the Airflow context in the pythonoperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use. 3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.models import Variable 9 | from airflow.operators.python_operator import PythonOperator 10 | from airflow.hooks.S3_hook import S3Hook 11 | 12 | 13 | def log_details(*args, **kwargs): 14 | # 15 | # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them 16 | # NOTE: Look here for context variables passed in on kwargs: 17 | # https://airflow.apache.org/macros.html 18 | # 19 | ds = kwargs['ds'] # kwargs[] 20 | run_id = kwargs['run_id'] # kwargs[] 21 | previous_ds = kwargs.get('prev_ds') # kwargs.get('') 22 | next_ds = kwargs.get('next_ds') # kwargs.get('') 23 | 24 | logging.info(f"Execution date is {ds}") 25 | logging.info(f"My run id is {run_id}") 26 | if previous_ds: 27 | logging.info(f"My previous run was on {previous_ds}") 28 | if next_ds: 29 | logging.info(f"My next run will be {next_ds}") 30 | 31 | dag = DAG( 32 | 'lesson1.exercise5', 33 | schedule_interval="@daily", 34 | start_date=datetime.datetime.now() - datetime.timedelta(days=2) 35 | ) 36 | 37 | list_task = PythonOperator( 38 | task_id="log_details", 39 | python_callable=log_details, 40 | provide_context=True, 41 | dag=dag 42 | ) 43 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 6.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Similar to what you saw in the demo, copy and populate the trips table. Then, add another operator which creates a traffic analysis table from the trips table you created. Note, in this class, we won’t be writing SQL -- all of the SQL statements we run against Redshift are predefined and included in your lesson. 
3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.contrib.hooks.aws_hook import AwsHook 9 | from airflow.hooks.postgres_hook import PostgresHook 10 | from airflow.operators.postgres_operator import PostgresOperator 11 | from airflow.operators.python_operator import PythonOperator 12 | 13 | import sql_statements 14 | 15 | 16 | def load_data_to_redshift(*args, **kwargs): 17 | aws_hook = AwsHook("aws_credentials") 18 | credentials = aws_hook.get_credentials() 19 | redshift_hook = PostgresHook("redshift") 20 | redshift_hook.run(sql_statements.COPY_ALL_TRIPS_SQL.format(credentials.access_key, credentials.secret_key)) 21 | 22 | 23 | dag = DAG( 24 | 'lesson1.exercise6', 25 | start_date=datetime.datetime.now() 26 | ) 27 | 28 | create_table = PostgresOperator( 29 | task_id="create_table", 30 | dag=dag, 31 | postgres_conn_id="redshift", 32 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 33 | ) 34 | 35 | copy_task = PythonOperator( 36 | task_id='load_from_s3_to_redshift', 37 | dag=dag, 38 | python_callable=load_data_to_redshift 39 | ) 40 | 41 | location_traffic_task = PostgresOperator( 42 | task_id="calculate_location_traffic", 43 | dag=dag, 44 | postgres_conn_id="redshift", 45 | sql=sql_statements.LOCATION_TRAFFIC_SQL 46 | ) 47 | 48 | create_table >> copy_task 49 | copy_task >> location_traffic_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 1.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Run the DAG as it is first, and observe the Airflow UI 3 | #2 - Next, open up the DAG and add the copy and load tasks as directed in the TODOs 4 | #3 - Reload the Airflow UI and run the DAG once more, observing the Airflow UI 5 | 6 | import datetime 7 | import logging 8 | 9 | from airflow import DAG 10 | from airflow.contrib.hooks.aws_hook import AwsHook 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | from airflow.operators.postgres_operator import PostgresOperator 13 | from airflow.operators.python_operator import PythonOperator 14 | 15 | import sql_statements 16 | 17 | 18 | def load_trip_data_to_redshift(*args, **kwargs): 19 | aws_hook = AwsHook("aws_credentials") 20 | credentials = aws_hook.get_credentials() 21 | redshift_hook = PostgresHook("redshift") 22 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format( 23 | credentials.access_key, 24 | credentials.secret_key, 25 | ) 26 | redshift_hook.run(sql_stmt) 27 | 28 | 29 | def load_station_data_to_redshift(*args, **kwargs): 30 | aws_hook = AwsHook("aws_credentials") 31 | credentials = aws_hook.get_credentials() 32 | redshift_hook = PostgresHook("redshift") 33 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 34 | credentials.access_key, 35 | credentials.secret_key, 36 | ) 37 | redshift_hook.run(sql_stmt) 38 | 39 | 40 | dag = DAG( 41 | 'lesson2.exercise1', 42 | start_date=datetime.datetime.now() 43 | ) 44 | 45 | create_trips_table = PostgresOperator( 46 | task_id="create_trips_table", 47 | dag=dag, 48 | postgres_conn_id="redshift", 49 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 50 | ) 51 | 52 | copy_trips_task = PythonOperator( 53 | task_id='load_trips_from_s3_to_redshift', 54 | dag=dag, 55 | python_callable=load_trip_data_to_redshift, 56 | ) 57 | 58 | create_stations_table = PostgresOperator( 59 | task_id="create_stations_table", 60 | dag=dag, 61 | postgres_conn_id="redshift", 62 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 63 | ) 64 | 65 | 
copy_stations_task = PythonOperator( 66 | task_id='load_stations_from_s3_to_redshift', 67 | dag=dag, 68 | python_callable=load_station_data_to_redshift, 69 | ) 70 | 71 | create_trips_table >> copy_trips_task 72 | create_stations_table >> copy_stations_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 2.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Revisit our bikeshare traffic 3 | #2 - Update our DAG with 4 | # a - @monthly schedule_interval 5 | # b - max_active_runs of 1 6 | # c - start_date of 2018/01/01 7 | # d - end_date of 2018/02/01 8 | # Use Airflow’s backfill capabilities to analyze our trip data on a monthly basis over 2 historical runs 9 | 10 | import datetime 11 | import logging 12 | 13 | from airflow import DAG 14 | from airflow.contrib.hooks.aws_hook import AwsHook 15 | from airflow.hooks.postgres_hook import PostgresHook 16 | from airflow.operators.postgres_operator import PostgresOperator 17 | from airflow.operators.python_operator import PythonOperator 18 | 19 | import sql_statements 20 | 21 | 22 | def load_trip_data_to_redshift(*args, **kwargs): 23 | aws_hook = AwsHook("aws_credentials") 24 | credentials = aws_hook.get_credentials() 25 | redshift_hook = PostgresHook("redshift") 26 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format( 27 | credentials.access_key, 28 | credentials.secret_key, 29 | ) 30 | redshift_hook.run(sql_stmt) 31 | 32 | 33 | def load_station_data_to_redshift(*args, **kwargs): 34 | aws_hook = AwsHook("aws_credentials") 35 | credentials = aws_hook.get_credentials() 36 | redshift_hook = PostgresHook("redshift") 37 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 38 | credentials.access_key, 39 | credentials.secret_key, 40 | ) 41 | redshift_hook.run(sql_stmt) 42 | 43 | 44 | dag = DAG( 45 | 'lesson2.exercise2', 46 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 47 | # TODO: Set the end date to February first 48 | end_date=datetime.datetime(2018, 2, 1, 0, 0, 0 , 0), 49 | # TODO: Set the schedule to be monthly 50 | schedule_interval='@monthly', 51 | # TODO: set the number of max active runs to 1 52 | max_active_runs=1 53 | ) 54 | 55 | create_trips_table = PostgresOperator( 56 | task_id="create_trips_table", 57 | dag=dag, 58 | postgres_conn_id="redshift", 59 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 60 | ) 61 | 62 | copy_trips_task = PythonOperator( 63 | task_id='load_trips_from_s3_to_redshift', 64 | dag=dag, 65 | python_callable=load_trip_data_to_redshift, 66 | provide_context=True, 67 | ) 68 | 69 | create_stations_table = PostgresOperator( 70 | task_id="create_stations_table", 71 | dag=dag, 72 | postgres_conn_id="redshift", 73 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 74 | ) 75 | 76 | copy_stations_task = PythonOperator( 77 | task_id='load_stations_from_s3_to_redshift', 78 | dag=dag, 79 | python_callable=load_station_data_to_redshift, 80 | ) 81 | 82 | create_trips_table >> copy_trips_task 83 | create_stations_table >> copy_stations_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 3.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Modify the bikeshare DAG to load data month by month, instead of loading it all at once, every time. 3 | #2 - Use time partitioning to parallelize the execution of the DAG. 
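#
# The month-by-month load works because the COPY statement is parameterized by the execution
# date: load_trip_data_to_redshift() below formats sql_statements.COPY_MONTHLY_TRIPS_SQL with
# year=execution_date.year and month=execution_date.month, so each scheduled run copies only
# its own month of trip data. As a hedged sketch (the real constant lives in the provided
# sql_statements module and is not shown here), the template presumably combines two
# positional '{}' placeholders for the AWS credentials with named {year}/{month} placeholders
# in the S3 key:
EXAMPLE_COPY_MONTHLY_TRIPS_SQL = """
    COPY trips
    FROM 's3://udac-data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv'
    ACCESS_KEY_ID '{}'
    SECRET_ACCESS_KEY '{}'
    IGNOREHEADER 1
    DELIMITER ','
"""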
4 | 5 | import datetime 6 | import logging 7 | 8 | from airflow import DAG 9 | from airflow.contrib.hooks.aws_hook import AwsHook 10 | from airflow.hooks.postgres_hook import PostgresHook 11 | from airflow.operators.postgres_operator import PostgresOperator 12 | from airflow.operators.python_operator import PythonOperator 13 | 14 | import sql_statements 15 | 16 | 17 | def load_trip_data_to_redshift(*args, **kwargs): 18 | aws_hook = AwsHook("aws_credentials") 19 | credentials = aws_hook.get_credentials() 20 | redshift_hook = PostgresHook("redshift") 21 | execution_date = kwargs["execution_date"] 22 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format( 23 | credentials.access_key, 24 | credentials.secret_key, 25 | year=execution_date.year, 26 | month=execution_date.month 27 | ) 28 | redshift_hook.run(sql_stmt) 29 | 30 | 31 | def load_station_data_to_redshift(*args, **kwargs): 32 | aws_hook = AwsHook("aws_credentials") 33 | credentials = aws_hook.get_credentials() 34 | redshift_hook = PostgresHook("redshift") 35 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 36 | credentials.access_key, 37 | credentials.secret_key, 38 | ) 39 | redshift_hook.run(sql_stmt) 40 | 41 | 42 | dag = DAG( 43 | 'lesson2.exercise3', 44 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 45 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 46 | schedule_interval='@monthly', 47 | max_active_runs=1 48 | ) 49 | 50 | create_trips_table = PostgresOperator( 51 | task_id="create_trips_table", 52 | dag=dag, 53 | postgres_conn_id="redshift", 54 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 55 | ) 56 | 57 | copy_trips_task = PythonOperator( 58 | task_id='load_trips_from_s3_to_redshift', 59 | dag=dag, 60 | python_callable=load_trip_data_to_redshift, 61 | provide_context=True, 62 | ) 63 | 64 | create_stations_table = PostgresOperator( 65 | task_id="create_stations_table", 66 | dag=dag, 67 | postgres_conn_id="redshift", 68 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 69 | ) 70 | 71 | copy_stations_task = PythonOperator( 72 | task_id='load_stations_from_s3_to_redshift', 73 | dag=dag, 74 | python_callable=load_station_data_to_redshift, 75 | ) 76 | 77 | create_trips_table >> copy_trips_task 78 | create_stations_table >> copy_stations_task 79 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 4.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Set an SLA on our bikeshare traffic calculation operator 3 | #2 - Add data verification step after the load step from s3 to redshift 4 | #3 - Add data verification step after we calculate our output table 5 | 6 | import datetime 7 | import logging 8 | 9 | from airflow import DAG 10 | from airflow.contrib.hooks.aws_hook import AwsHook 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | from airflow.operators.postgres_operator import PostgresOperator 13 | from airflow.operators.python_operator import PythonOperator 14 | 15 | import sql_statements 16 | 17 | 18 | def load_trip_data_to_redshift(*args, **kwargs): 19 | aws_hook = AwsHook("aws_credentials") 20 | credentials = aws_hook.get_credentials() 21 | redshift_hook = PostgresHook("redshift") 22 | execution_date = kwargs["execution_date"] 23 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format( 24 | credentials.access_key, 25 | credentials.secret_key, 26 | year=execution_date.year, 27 | month=execution_date.month 28 | ) 29 | redshift_hook.run(sql_stmt) 30 | 31 | 32 | def 
load_station_data_to_redshift(*args, **kwargs): 33 | aws_hook = AwsHook("aws_credentials") 34 | credentials = aws_hook.get_credentials() 35 | redshift_hook = PostgresHook("redshift") 36 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 37 | credentials.access_key, 38 | credentials.secret_key, 39 | ) 40 | redshift_hook.run(sql_stmt) 41 | 42 | 43 | def check_greater_than_zero(*args, **kwargs): 44 | table = kwargs["params"]["table"] 45 | redshift_hook = PostgresHook("redshift") 46 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}") 47 | if len(records) < 1 or len(records[0]) < 1: 48 | raise ValueError(f"Data quality check failed. {table} returned no results") 49 | num_records = records[0][0] 50 | if num_records < 1: 51 | raise ValueError(f"Data quality check failed. {table} contained 0 rows") 52 | logging.info(f"Data quality on table {table} check passed with {records[0][0]} records") 53 | 54 | 55 | dag = DAG( 56 | 'lesson2.exercise4', 57 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 58 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 59 | schedule_interval='@monthly', 60 | max_active_runs=1 61 | ) 62 | 63 | create_trips_table = PostgresOperator( 64 | task_id="create_trips_table", 65 | dag=dag, 66 | postgres_conn_id="redshift", 67 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 68 | ) 69 | 70 | copy_trips_task = PythonOperator( 71 | task_id='load_trips_from_s3_to_redshift', 72 | dag=dag, 73 | python_callable=load_trip_data_to_redshift, 74 | provide_context=True, 75 | ) 76 | 77 | check_trips = PythonOperator( 78 | task_id='check_trips_data', 79 | dag=dag, 80 | python_callable=check_greater_than_zero, 81 | provide_context=True, 82 | params={ 83 | 'table': 'trips', 84 | } 85 | ) 86 | 87 | create_stations_table = PostgresOperator( 88 | task_id="create_stations_table", 89 | dag=dag, 90 | postgres_conn_id="redshift", 91 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 92 | ) 93 | 94 | copy_stations_task = PythonOperator( 95 | task_id='load_stations_from_s3_to_redshift', 96 | dag=dag, 97 | python_callable=load_station_data_to_redshift, 98 | ) 99 | 100 | check_stations = PythonOperator( 101 | task_id='check_stations_data', 102 | dag=dag, 103 | python_callable=check_greater_than_zero, 104 | provide_context=True, 105 | params={ 106 | 'table': 'stations', 107 | } 108 | ) 109 | 110 | create_trips_table >> copy_trips_task 111 | create_stations_table >> copy_stations_task 112 | copy_stations_task >> check_stations 113 | copy_trips_task >> check_trips -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 1.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll consolidate repeated code into Operator Plugins 3 | #1 - Move the data quality check logic into a custom operator 4 | #2 - Replace the data quality check PythonOperators with our new custom operator 5 | #3 - Consolidate both the S3 to RedShift functions into a custom operator 6 | #4 - Replace the S3 to RedShift PythonOperators with our new custom operator 7 | #5 - Execute the DAG 8 | 9 | import datetime 10 | import logging 11 | 12 | from airflow import DAG 13 | from airflow.contrib.hooks.aws_hook import AwsHook 14 | from airflow.hooks.postgres_hook import PostgresHook 15 | 16 | from airflow.operators import ( 17 | HasRowsOperator, 18 | PostgresOperator, 19 | PythonOperator, 20 | S3ToRedshiftOperator 21 | ) 22 | 23 | import sql_statements 24 | 25 | 26 | 
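#
# HasRowsOperator and S3ToRedshiftOperator imported above are custom operator plugins whose
# implementations (has_rows.py and s3_to_redshift.py elsewhere in this folder) are not shown
# at this point in the listing. As a hedged sketch under that assumption, the has-rows check
# essentially packages the check_greater_than_zero() logic from the previous exercise into a
# reusable, parameterized operator, roughly like the class below (given a different name here
# so it does not shadow the plugin import; details are assumptions, not the repo's code).
# PostgresHook and logging are already imported at the top of this file.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class HasRowsOperatorSketch(BaseOperator):
    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
        super(HasRowsOperatorSketch, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        # Same checks as check_greater_than_zero(), but driven by operator parameters
        redshift = PostgresHook(self.redshift_conn_id)
        records = redshift.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if len(records) < 1 or len(records[0]) < 1:
            raise ValueError(f"Data quality check failed. {self.table} returned no results")
        if records[0][0] < 1:
            raise ValueError(f"Data quality check failed. {self.table} contained 0 rows")
        logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records")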
# 27 | # TODO: Replace the data quality checks with the HasRowsOperator 28 | # 29 | 30 | dag = DAG( 31 | "lesson3.exercise1", 32 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 33 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 34 | schedule_interval="@monthly", 35 | max_active_runs=1 36 | ) 37 | 38 | create_trips_table = PostgresOperator( 39 | task_id="create_trips_table", 40 | dag=dag, 41 | postgres_conn_id="redshift", 42 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 43 | ) 44 | 45 | copy_trips_task = S3ToRedshiftOperator( 46 | task_id="load_trips_from_s3_to_redshift", 47 | dag=dag, 48 | table="trips", 49 | redshift_conn_id="redshift", 50 | aws_credentials_id="aws_credentials", 51 | s3_bucket="udac-data-pipelines", 52 | s3_key="divvy/partitioned/{execution_date.year}/{execution_date.month}/divvy_trips.csv" 53 | ) 54 | 55 | # 56 | # TODO: Replace this data quality check with the HasRowsOperator 57 | # 58 | check_trips = HasRowsOperator( 59 | task_id='check_trips_data', 60 | dag=dag, 61 | redshift_conn_id="redshift", 62 | table="trips" 63 | ) 64 | 65 | create_stations_table = PostgresOperator( 66 | task_id="create_stations_table", 67 | dag=dag, 68 | postgres_conn_id="redshift", 69 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 70 | ) 71 | 72 | copy_stations_task = S3ToRedshiftOperator( 73 | task_id="load_stations_from_s3_to_redshift", 74 | dag=dag, 75 | redshift_conn_id="redshift", 76 | aws_credentials_id="aws_credentials", 77 | s3_bucket="udac-data-pipelines", 78 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv", 79 | table="stations" 80 | ) 81 | 82 | # 83 | # TODO: Replace this data quality check with the HasRowsOperator 84 | # 85 | check_stations = HasRowsOperator( 86 | task_id='check_stations_data', 87 | dag=dag, 88 | redshift_conn_id="redshift", 89 | table="stations" 90 | ) 91 | 92 | create_trips_table >> copy_trips_task 93 | create_stations_table >> copy_stations_task 94 | copy_stations_task >> check_stations 95 | copy_trips_task >> check_trips -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 2.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries 3 | #1 - Read through the DAG and identify points in the DAG that could be split apart 4 | #2 - Split the DAG into multiple PythonOperators 5 | #3 - Run the DAG 6 | 7 | import datetime 8 | import logging 9 | 10 | from airflow import DAG 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | 13 | from airflow.operators.postgres_operator import PostgresOperator 14 | from airflow.operators.python_operator import PythonOperator 15 | 16 | 17 | # 18 | # TODO: Finish refactoring this function into the appropriate set of tasks, 19 | # instead of keeping this one large task. 
20 | # 21 | def load_and_analyze(*args, **kwargs): 22 | redshift_hook = PostgresHook("redshift") 23 | 24 | def log_oldest(): 25 | redshift_hook = PostgresHook("redshift") 26 | records = redshift_hook.get_records(""" 27 | SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1 28 | """) 29 | if len(records) > 0 and len(records[0]) > 0: 30 | logging.info(f"Oldest rider was born in {records[0][0]}") 31 | 32 | def log_younger(): 33 | redshift_hook = PostgresHook("redshift") 34 | records = redshift_hook.get_records(""" 35 | SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1 36 | """) 37 | if len(records) > 0 and len(records[0]) > 0: 38 | logging.info(f"Youngest rider was born in {records[0][0]}") 39 | 40 | 41 | dag = DAG( 42 | "lesson3.exercise2", 43 | start_date=datetime.datetime.utcnow() 44 | ) 45 | 46 | load_and_analyze = PythonOperator( 47 | task_id='load_and_analyze', 48 | dag=dag, 49 | python_callable=load_and_analyze, 50 | provide_context=True, 51 | ) 52 | 53 | create_oldest_task = PostgresOperator( 54 | task_id="create_oldest", 55 | dag=dag, 56 | sql=""" 57 | BEGIN; 58 | DROP TABLE IF EXISTS older_riders; 59 | CREATE TABLE older_riders AS ( 60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945 61 | ); 62 | COMMIT; 63 | """, 64 | postgres_conn_id="redshift" 65 | ) 66 | 67 | create_younger_task = PostgresOperator( 68 | task_id="create_younger", 69 | dag=dag, 70 | sql=""" 71 | BEGIN; 72 | DROP TABLE IF EXISTS younger_riders; 73 | CREATE TABLE younger_riders AS ( 74 | SELECT * FROM trips WHERE birthyear > 2000 75 | ); 76 | COMMIT; 77 | """, 78 | postgres_conn_id="redshift" 79 | ) 80 | 81 | create_lifetime_task = PostgresOperator( 82 | task_id="create_lifetime", 83 | dag=dag, 84 | sql=""" 85 | BEGIN; 86 | DROP TABLE IF EXISTS lifetime_rides; 87 | CREATE TABLE lifetime_rides AS ( 88 | SELECT bikeid, COUNT(bikeid) 89 | FROM trips 90 | GROUP BY bikeid 91 | ); 92 | COMMIT; 93 | """, 94 | postgres_conn_id="redshift" 95 | ) 96 | 97 | create_city_station_task = PostgresOperator( 98 | task_id="create_city_station", 99 | dag=dag, 100 | sql=""" 101 | BEGIN; 102 | DROP TABLE IF EXISTS city_station_counts; 103 | CREATE TABLE city_station_counts AS( 104 | SELECT city, COUNT(city) 105 | FROM stations 106 | GROUP BY city 107 | ); 108 | COMMIT; 109 | """, 110 | postgres_conn_id="redshift" 111 | ) 112 | 113 | log_oldest_task = PythonOperator( 114 | task_id="log_oldest", 115 | dag=dag, 116 | python_callable=log_oldest 117 | ) 118 | 119 | log_younger_task = PythonOperator( 120 | task_id="log_younger", 121 | dag=dag, 122 | python_callable=log_younger 123 | ) 124 | 125 | load_and_analyze >> create_oldest_task 126 | create_oldest_task >> log_oldest_task 127 | create_younger_task >> log_younger_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 3.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries 3 | #1 - Read through the DAG and identify points in the DAG that could be split apart 4 | #2 - Split the DAG into multiple PythonOperators 5 | #3 - Run the DAG 6 | 7 | import datetime 8 | import logging 9 | 10 | from airflow import DAG 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | 13 | from airflow.operators.postgres_operator import PostgresOperator 14 | from 
airflow.operators.python_operator import PythonOperator 15 | 16 | 17 | # 18 | # TODO: Finish refactoring this function into the appropriate set of tasks, 19 | # instead of keeping this one large task. 20 | # 21 | def load_and_analyze(*args, **kwargs): 22 | redshift_hook = PostgresHook("redshift") 23 | 24 | def log_oldest(): 25 | redshift_hook = PostgresHook("redshift") 26 | records = redshift_hook.get_records(""" 27 | SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1 28 | """) 29 | if len(records) > 0 and len(records[0]) > 0: 30 | logging.info(f"Oldest rider was born in {records[0][0]}") 31 | 32 | def log_younger(): 33 | redshift_hook = PostgresHook("redshift") 34 | records = redshift_hook.get_records(""" 35 | SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1 36 | """) 37 | if len(records) > 0 and len(records[0]) > 0: 38 | logging.info(f"Youngest rider was born in {records[0][0]}") 39 | 40 | 41 | dag = DAG( 42 | "lesson3.exercise2", 43 | start_date=datetime.datetime.utcnow() 44 | ) 45 | 46 | load_and_analyze = PythonOperator( 47 | task_id='load_and_analyze', 48 | dag=dag, 49 | python_callable=load_and_analyze, 50 | provide_context=True, 51 | ) 52 | 53 | create_oldest_task = PostgresOperator( 54 | task_id="create_oldest", 55 | dag=dag, 56 | sql=""" 57 | BEGIN; 58 | DROP TABLE IF EXISTS older_riders; 59 | CREATE TABLE older_riders AS ( 60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945 61 | ); 62 | COMMIT; 63 | """, 64 | postgres_conn_id="redshift" 65 | ) 66 | 67 | create_younger_task = PostgresOperator( 68 | task_id="create_younger", 69 | dag=dag, 70 | sql=""" 71 | BEGIN; 72 | DROP TABLE IF EXISTS younger_riders; 73 | CREATE TABLE younger_riders AS ( 74 | SELECT * FROM trips WHERE birthyear > 2000 75 | ); 76 | COMMIT; 77 | """, 78 | postgres_conn_id="redshift" 79 | ) 80 | 81 | create_lifetime_task = PostgresOperator( 82 | task_id="create_lifetime", 83 | dag=dag, 84 | sql=""" 85 | BEGIN; 86 | DROP TABLE IF EXISTS lifetime_rides; 87 | CREATE TABLE lifetime_rides AS ( 88 | SELECT bikeid, COUNT(bikeid) 89 | FROM trips 90 | GROUP BY bikeid 91 | ); 92 | COMMIT; 93 | """, 94 | postgres_conn_id="redshift" 95 | ) 96 | 97 | create_city_station_task = PostgresOperator( 98 | task_id="create_city_station", 99 | dag=dag, 100 | sql=""" 101 | BEGIN; 102 | DROP TABLE IF EXISTS city_station_counts; 103 | CREATE TABLE city_station_counts AS( 104 | SELECT city, COUNT(city) 105 | FROM stations 106 | GROUP BY city 107 | ); 108 | COMMIT; 109 | """, 110 | postgres_conn_id="redshift" 111 | ) 112 | 113 | log_oldest_task = PythonOperator( 114 | task_id="log_oldest", 115 | dag=dag, 116 | python_callable=log_oldest 117 | ) 118 | 119 | log_younger_task = PythonOperator( 120 | task_id="log_younger", 121 | dag=dag, 122 | python_callable=log_younger 123 | ) 124 | 125 | load_and_analyze >> create_oldest_task 126 | create_oldest_task >> log_oldest_task 127 | create_younger_task >> log_younger_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 4.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | from airflow import DAG 4 | 5 | from airflow.operators import ( 6 | FactsCalculatorOperator, 7 | HasRowsOperator, 8 | S3ToRedshiftOperator 9 | ) 10 | 11 | # 12 | # The following DAG performs the following functions: 13 | # 14 | # 1. Loads Trip data from S3 to RedShift 15 | # 2. 
Performs a data quality check on the Trips table in RedShift 16 | # 3. Uses the FactsCalculatorOperator to create a Facts table in Redshift 17 | # a. **NOTE**: to complete this step you must complete the FactsCalcuatorOperator 18 | # skeleton defined in plugins/operators/facts_calculator.py 19 | # 20 | dag = DAG("lesson3.exercise4", start_date=datetime.datetime.utcnow()) 21 | 22 | # 23 | # The following code will load trips data from S3 to RedShift. Use the s3_key 24 | # "data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 25 | # and the s3_bucket "udacity-dend" 26 | # 27 | copy_trips_task = S3ToRedshiftOperator( 28 | task_id="load_trips_from_s3_to_redshift", 29 | dag=dag, 30 | table="trips", 31 | redshift_conn_id="redshift", 32 | aws_credentials_id="aws_credentials", 33 | s3_bucket="udacity-dend", 34 | s3_key="data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 35 | ) 36 | 37 | # 38 | # Data quality check on the Trips table 39 | # 40 | check_trips = HasRowsOperator( 41 | task_id="check_trips_data", 42 | dag=dag, 43 | redshift_conn_id="redshift", 44 | table="trips" 45 | ) 46 | 47 | # 48 | # We use the FactsCalculatorOperator to create a Facts table in RedShift. The fact column is 49 | # `tripduration` and the groupby_column is `bikeid` 50 | # 51 | calculate_facts = FactsCalculatorOperator( 52 | task_id="calculate_facts_trips", 53 | dag=dag, 54 | redshift_conn_id="redshift", 55 | origin_table="trips", 56 | destination_table="trips_facts", 57 | fact_column="tripduration", 58 | groupby_column="bikeid" 59 | ) 60 | 61 | # 62 | # Task ordering for the DAG tasks 63 | # 64 | copy_trips_task >> check_trips 65 | check_trips >> calculate_facts -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/Readme.MD: -------------------------------------------------------------------------------- 1 | Project: Data Pipeline with Airflow 2 | 3 | Introduction 4 | 5 | A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app 6 | 7 | Project Description 8 | 9 | Apply the knowledge of Apache Airflow to build and ETL pipeline for a Data Lake hosted on Amazon S3. 10 | 11 | In this project, we would have to create our own custom operators to perform tasks such as staging the data, filling the data warehouse and running checks on the data as the final step. 
We have been provided with four empty operators that need to be implemented into functional pieces of a data pipeline. 12 | 13 | Project Datasets 14 | 15 | Song Data Path --> s3://udacity-dend/song_data 16 | 17 | Log Data Path --> s3://udacity-dend/log_data 18 | 19 | Project Template 20 | 21 | The project template package contains three major components for the project: 22 | 23 | The DAG template has all the imports and task templates in place, but the task dependencies have not been set 24 | The operators folder with operator templates 25 | A helper class for the SQL transformations 26 | 27 | Configuring the DAG 28 | 29 | In the DAG, add default parameters according to these guidelines: 30 | 31 | 1. The DAG does not have dependencies on past runs 32 | 2. On failure, the tasks are retried 3 times 33 | 3. Retries happen every 5 minutes 34 | 4. Catchup is turned off 35 | 5. Do not email on retry 36 | 37 | 38 | Building the Operators 39 | 40 | We need to build four different operators that will stage the data, transform the data, and run checks on data quality. All of the operators and task instances will run SQL statements against the Redshift database. However, using parameters wisely will allow you to build flexible, reusable, and configurable operators that you can later apply to many kinds of data pipelines with Redshift and with other databases. 41 | 42 | Stage Operator 43 | 44 | The stage operator is expected to be able to load any JSON- or CSV-formatted files from S3 to Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided. The operator's parameters should specify where in S3 the file is located and what the target table is. 45 | 46 | The parameters should also be used to distinguish between JSON and CSV files. Another important requirement of the stage operator is a templated field that allows it to load timestamped files from S3 based on the execution time and run backfills. 47 | 48 | Fact and Dimension Operators 49 | 50 | The provided SQL helper class will help to run the data transformations. Most of the logic is within the SQL transformations, and the operator is expected to take as input a SQL statement and the target database against which to run the query. Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load. Fact tables are usually so massive that they should only allow append-type functionality. 51 | 52 | Data Quality Operator 53 | 54 | The final operator to create is the data quality operator, which is used to run checks on the data itself. The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. For each test, the actual result and the expected result need to be compared, and if there is no match, the operator should raise an exception so that the task retries and eventually fails. 55 | 56 | For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows that have NULL in that column. Since we do not want any NULLs, the expected result would be 0, and the test would compare the SQL statement's outcome to that expected result.
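As an illustration only (a hedged sketch, not the project's actual implementation), such an operator might accept the connection id and a list of check/expected-result pairs; the parameter names (redshift_conn_id, dq_checks) and the dictionary keys below are assumptions:

    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class DataQualityOperator(BaseOperator):

        @apply_defaults
        def __init__(self, redshift_conn_id="", dq_checks=None, *args, **kwargs):
            super(DataQualityOperator, self).__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id
            self.dq_checks = dq_checks or []

        def execute(self, context):
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            for check in self.dq_checks:
                actual = redshift.get_records(check["check_sql"])[0][0]
                if actual != check["expected_result"]:
                    raise ValueError("Data quality check failed: {} returned {}, expected {}".format(
                        check["check_sql"], actual, check["expected_result"]))
                self.log.info("Data quality check passed: {}".format(check["check_sql"]))

The NULL example above could then be expressed as {"check_sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected_result": 0}.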
57 | 58 | Final Instructions 59 | 60 | When you are in the workspace, after completing the code, you can start by using the command : /opt/airflow/start.sh 61 | 62 | Once you done, it would automatically start all the dags required and outputting the result to its respective tables 63 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE IF NOT EXISTS public.artists ( 2 | artistid varchar(256) NOT NULL, 3 | name varchar(256), 4 | location varchar(256), 5 | lattitude numeric(18,0), 6 | longitude numeric(18,0) 7 | ); 8 | 9 | CREATE TABLE IF NOT EXISTS public.songplays ( 10 | playid varchar(32) NOT NULL, 11 | start_time timestamp NOT NULL, 12 | userid int4 NOT NULL, 13 | "level" varchar(256), 14 | songid varchar(256), 15 | artistid varchar(256), 16 | sessionid int4, 17 | location varchar(256), 18 | user_agent varchar(256), 19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 20 | ); 21 | 22 | CREATE TABLE IF NOT EXISTS public.songs ( 23 | songid varchar(256) NOT NULL, 24 | title varchar(256), 25 | artistid varchar(256), 26 | "year" int4, 27 | duration numeric(18,0), 28 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 29 | ); 30 | 31 | CREATE TABLE IF NOT EXISTS public.staging_events ( 32 | artist varchar(256), 33 | auth varchar(256), 34 | firstname varchar(256), 35 | gender varchar(256), 36 | iteminsession int4, 37 | lastname varchar(256), 38 | length numeric(18,0), 39 | "level" varchar(256), 40 | location varchar(256), 41 | "method" varchar(256), 42 | page varchar(256), 43 | registration numeric(18,0), 44 | sessionid int4, 45 | song varchar(256), 46 | status int4, 47 | ts int8, 48 | useragent varchar(256), 49 | userid int4 50 | ); 51 | 52 | CREATE TABLE IF NOT EXISTS public.staging_songs ( 53 | num_songs int4, 54 | artist_id varchar(256), 55 | artist_name varchar(256), 56 | artist_latitude numeric(18,0), 57 | artist_longitude numeric(18,0), 58 | artist_location varchar(256), 59 | song_id varchar(256), 60 | title varchar(256), 61 | duration numeric(18,0), 62 | "year" int4 63 | ); 64 | 65 | CREATE TABLE IF NOT EXISTS public."time" ( 66 | start_time timestamp NOT NULL, 67 | "hour" int4, 68 | "day" int4, 69 | week int4, 70 | "month" varchar(256), 71 | "year" int4, 72 | weekday varchar(256), 73 | CONSTRAINT time_pkey PRIMARY KEY (start_time) 74 | ); 75 | 76 | CREATE TABLE IF NOT EXISTS public.users ( 77 | userid int4 NOT NULL, 78 | first_name varchar(256), 79 | last_name varchar(256), 80 | gender varchar(256), 81 | "level" varchar(256), 82 | CONSTRAINT users_pkey PRIMARY KEY (userid) 83 | ); 84 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/udac_example_dag.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | import os 3 | 
from airflow import DAG 4 | from airflow.operators.dummy_operator import DummyOperator 5 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator, 6 | LoadDimensionOperator, DataQualityOperator) 7 | from helpers import SqlQueries 8 | 9 | # AWS_KEY = os.environ.get('AWS_KEY') 10 | # AWS_SECRET = os.environ.get('AWS_SECRET') 11 | 12 | default_args = { 13 | 'owner': 'nareshkumar', 14 | 'start_date': datetime(2018, 11, 1), 15 | 'end_date': datetime(2018, 11, 30), 16 | 'depends_on_past': False, 17 | 'retries': 3, 18 | 'retry_delay': timedelta(minutes=5), 19 | 'catchup': False, 20 | 'email_on_retry': False 21 | } 22 | 23 | dag = DAG('udacity_airflow_project5', 24 | default_args=default_args, 25 | description='Load and transform data in Redshift with Airflow', 26 | schedule_interval='0 * * * *', 27 | max_active_runs=3 28 | ) 29 | 30 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag) 31 | 32 | stage_events_to_redshift = StageToRedshiftOperator( 33 | task_id='Stage_events', 34 | dag=dag, 35 | provide_context=True, 36 | aws_credentials_id="aws_credentials", 37 | redshift_conn_id='redshift', 38 | s3_bucket="udacity-dend-airflow-test", 39 | s3_key="log_data", 40 | table="staging_events", 41 | create_stmt=SqlQueries.create_table_staging_events 42 | ) 43 | 44 | stage_songs_to_redshift = StageToRedshiftOperator( 45 | task_id='Stage_songs', 46 | dag=dag, 47 | provide_context=True, 48 | aws_credentials_id="aws_credentials", 49 | redshift_conn_id='redshift', 50 | s3_bucket="udacity-dend-airflow-test", 51 | s3_key="song_data", 52 | table="staging_songs", 53 | create_stmt=SqlQueries.create_table_staging_songs 54 | ) 55 | 56 | load_songplays_table = LoadFactOperator( 57 | task_id='Load_songplays_fact_table', 58 | dag=dag, 59 | provide_context=True, 60 | aws_credentials_id="aws_credentials", 61 | redshift_conn_id='redshift', 62 | create_stmt=SqlQueries.create_table_songplays, 63 | sql_query=SqlQueries.songplay_table_insert 64 | ) 65 | 66 | load_user_dimension_table = LoadDimensionOperator( 67 | task_id='Load_user_dim_table', 68 | dag=dag, 69 | provide_context=True, 70 | aws_credentials_id="aws_credentials", 71 | redshift_conn_id='redshift', 72 | create_stmt=SqlQueries.create_table_users, 73 | sql_query=SqlQueries.user_table_insert 74 | ) 75 | 76 | load_song_dimension_table = LoadDimensionOperator( 77 | task_id='Load_song_dim_table', 78 | dag=dag, 79 | provide_context=True, 80 | aws_credentials_id="aws_credentials", 81 | redshift_conn_id='redshift', 82 | create_stmt=SqlQueries.create_table_songs, 83 | sql_query=SqlQueries.song_table_insert 84 | ) 85 | 86 | load_artist_dimension_table = LoadDimensionOperator( 87 | task_id='Load_artist_dim_table', 88 | dag=dag, 89 | provide_context=True, 90 | aws_credentials_id="aws_credentials", 91 | redshift_conn_id='redshift', 92 | create_stmt=SqlQueries.create_table_artist, 93 | sql_query=SqlQueries.artist_table_insert 94 | ) 95 | 96 | load_time_dimension_table = LoadDimensionOperator( 97 | task_id='Load_time_dim_table', 98 | dag=dag, 99 | provide_context=True, 100 | aws_credentials_id="aws_credentials", 101 | redshift_conn_id='redshift', 102 | create_stmt=SqlQueries.create_table_time, 103 | sql_query=SqlQueries.time_table_insert 104 | ) 105 | 106 | run_quality_checks = DataQualityOperator( 107 | task_id='Run_data_quality_checks', 108 | dag=dag, 109 | provide_context=True, 110 | aws_credentials_id="aws_credentials", 111 | redshift_conn_id='redshift', 112 | ) 113 | 114 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 115 | 
116 | start_operator >> [stage_events_to_redshift, stage_songs_to_redshift] 117 | [stage_events_to_redshift, stage_songs_to_redshift] >> load_songplays_table 118 | load_songplays_table >> [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table] 119 | [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table] >> run_quality_checks 120 | run_quality_checks >> end_operator -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13 | operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | create_table_artist = (""" 3 | CREATE TABLE IF NOT EXISTS public.artists ( 4 | artistid varchar(256) NOT NULL, 5 | name varchar(256), 6 | location varchar(256), 7 | lattitude numeric(18,0), 8 | 
longitude numeric(18,0) 9 | ); 10 | """) 11 | 12 | create_table_songplays = (""" 13 | CREATE TABLE IF NOT EXISTS public.songplays ( 14 | playid varchar(32) NOT NULL, 15 | start_time timestamp NOT NULL, 16 | userid int4 NOT NULL, 17 | "level" varchar(256), 18 | songid varchar(256), 19 | artistid varchar(256), 20 | sessionid int4, 21 | location varchar(256), 22 | user_agent varchar(256), 23 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 24 | ); 25 | """) 26 | 27 | create_table_songs = (""" 28 | CREATE TABLE IF NOT EXISTS public.songs ( 29 | songid varchar(256) NOT NULL, 30 | title varchar(256), 31 | artistid varchar(256), 32 | "year" int4, 33 | duration numeric(18,0), 34 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 35 | ); 36 | """) 37 | 38 | create_table_staging_events = (""" 39 | CREATE TABLE IF NOT EXISTS public.staging_events ( 40 | artist varchar(256), 41 | auth varchar(256), 42 | firstname varchar(256), 43 | gender varchar(256), 44 | iteminsession int4, 45 | lastname varchar(256), 46 | length numeric(18,0), 47 | "level" varchar(256), 48 | location varchar(256), 49 | "method" varchar(256), 50 | page varchar(256), 51 | registration numeric(18,0), 52 | sessionid int4, 53 | song varchar(256), 54 | status int4, 55 | ts int8, 56 | useragent varchar(256), 57 | userid int4 58 | ); 59 | """) 60 | 61 | create_table_staging_songs = (""" 62 | CREATE TABLE IF NOT EXISTS public.staging_songs ( 63 | num_songs int4, 64 | artist_id varchar(256), 65 | artist_name varchar(256), 66 | artist_latitude numeric(18,0), 67 | artist_longitude numeric(18,0), 68 | artist_location varchar(256), 69 | song_id varchar(256), 70 | title varchar(256), 71 | duration numeric(18,0), 72 | "year" int4 73 | ); 74 | """) 75 | 76 | create_table_time = (""" 77 | CREATE TABLE IF NOT EXISTS public."time" ( 78 | start_time timestamp NOT NULL, 79 | "hour" int4, 80 | "day" int4, 81 | week int4, 82 | "month" varchar(256), 83 | "year" int4, 84 | weekday varchar(256), 85 | CONSTRAINT time_pkey PRIMARY KEY (start_time) 86 | ); 87 | """) 88 | 89 | create_table_users = (""" 90 | CREATE TABLE IF NOT EXISTS public.users ( 91 | userid int4 NOT NULL, 92 | first_name varchar(256), 93 | last_name varchar(256), 94 | gender varchar(256), 95 | "level" varchar(256), 96 | CONSTRAINT users_pkey PRIMARY KEY (userid) 97 | ); 98 | """) 99 | 100 | songplay_table_insert = (""" 101 | SELECT 102 | md5(events.sessionid || events.start_time) songplay_id, 103 | events.start_time, 104 | events.userid, 105 | events.level, 106 | songs.song_id, 107 | songs.artist_id, 108 | events.sessionid, 109 | events.location, 110 | events.useragent 111 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, * 112 | FROM staging_events 113 | WHERE page='NextSong') events 114 | LEFT JOIN staging_songs songs 115 | ON events.song = songs.title 116 | AND events.artist = songs.artist_name 117 | AND events.length = songs.duration 118 | """) 119 | 120 | user_table_insert = (""" 121 | SELECT distinct userid, firstname, lastname, gender, level 122 | FROM staging_events 123 | WHERE page='NextSong' 124 | """) 125 | 126 | song_table_insert = (""" 127 | SELECT distinct song_id, title, artist_id, year, duration 128 | FROM staging_songs 129 | """) 130 | 131 | artist_table_insert = (""" 132 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude 133 | FROM staging_songs 134 | """) 135 | 136 | time_table_insert = (""" 137 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time), 138
| extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time) 139 | FROM songplays 140 | """) -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.stage_redshift import StageToRedshiftOperator 2 | from operators.load_fact import LoadFactOperator 3 | from operators.load_dimension import LoadDimensionOperator 4 | from operators.data_quality import DataQualityOperator 5 | 6 | __all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/data_quality.py: 
-------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class DataQualityOperator(BaseOperator): 6 | 7 | ui_color = '#89DA59' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(DataQualityOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('DataQualityOperator not implemented yet') -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_dimension.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadDimensionOperator(BaseOperator): 6 | 7 | ui_color = '#80BD9E' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(LoadDimensionOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('LoadDimensionOperator not implemented yet') 23 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_fact.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadFactOperator(BaseOperator): 6 | 7 | ui_color = '#F98866' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(LoadFactOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('LoadFactOperator not implemented yet') 23 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/stage_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class StageToRedshiftOperator(BaseOperator): 6 | ui_color = '#358140' 7 | 8 | @apply_defaults 9 | def __init__(self, 10 | # Define your operators params (with defaults) here 11 | # Example: 12 | # redshift_conn_id=your-connection-name 13 | *args, **kwargs): 14 | 15 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 16 | # Map params here 17 | # Example: 18 | # self.conn_id = conn_id 19 | 20 | def execute(self, context): 21 | self.log.info('StageToRedshiftOperator not implemented yet') 22 | 23 | 24 | 25 | 26 | 27 | 
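Note that the four operator files above are still the unimplemented course templates. A minimal sketch of how the DataQualityOperator could be completed, following the same hook-and-row-count pattern as the HasRowsOperator exercise file further down in this repo, might look like the code below; the redshift_conn_id and tables parameters are assumed names for illustration, not part of the template.

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class DataQualityOperator(BaseOperator):
    """Sketch only: fail the task if any of the given tables is empty."""

    ui_color = '#89DA59'

    @apply_defaults
    def __init__(self,
                 redshift_conn_id="",   # assumed parameter: Airflow connection id for Redshift
                 tables=None,           # assumed parameter: list of table names to check
                 *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            # Run a simple row-count check against each table and fail loudly if it is empty
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0]:
                raise ValueError(f"Data quality check failed. {table} returned no results")
            if records[0][0] < 1:
                raise ValueError(f"Data quality check failed. {table} contained 0 rows")
            self.log.info(f"Data quality check on table {table} passed with {records[0][0]} records")

The other stubs would follow the same shape: store the constructor arguments, build the SQL (or the COPY statement, as in the S3ToRedshiftOperator exercise file below) and execute it through the PostgresHook.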
-------------------------------------------------------------------------------- /Data Pipeline with Airflow/Readme.MD: -------------------------------------------------------------------------------- 1 | Exercise Files 2 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.facts_calculator import FactsCalculatorOperator 2 | from operators.has_rows import HasRowsOperator 3 | from operators.s3_to_redshift import S3ToRedshiftOperator 4 | 5 | __all__ = [ 6 | 'FactsCalculatorOperator', 7 | 'HasRowsOperator', 8 | 'S3ToRedshiftOperator' 9 | ] 10 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/dag.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag. 3 | #1 - Consolidate HasRowsOperator into the SubDag 4 | #2 - Reorder the tasks to take advantage of the SubDag Operators 5 | 6 | import datetime 7 | 8 | from airflow import DAG 9 | from airflow.operators.postgres_operator import PostgresOperator 10 | from airflow.operators.subdag_operator import SubDagOperator 11 | from airflow.operators.udacity_plugin import HasRowsOperator 12 | 13 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag 14 | import sql_statements 15 | 16 | 17 | start_date = datetime.datetime.utcnow() 18 | 19 | dag = DAG( 20 | "lesson3.exercise3", 21 | start_date=start_date, 22 | ) 23 | 24 | trips_task_id = "trips_subdag" 25 | trips_subdag_task = SubDagOperator( 26 | subdag=get_s3_to_redshift_dag( 27 | "lesson3.exercise3", 28 | trips_task_id, 29 | "redshift", 30 | "aws_credentials", 31 | "trips", 32 | sql_statements.CREATE_TRIPS_TABLE_SQL, 33 | s3_bucket="udac-data-pipelines", 34 | s3_key="divvy/unpartitioned/divvy_trips_2018.csv", 35 | start_date=start_date, 36 | ), 37 | task_id=trips_task_id, 38 | dag=dag, 39 | ) 40 | 41 | stations_task_id = "stations_subdag" 42 | stations_subdag_task = SubDagOperator( 43 | subdag=get_s3_to_redshift_dag( 44 | "lesson3.exercise3", 45 | stations_task_id, 46 | "redshift", 47 | "aws_credentials", 48 | "stations", 49 | sql_statements.CREATE_STATIONS_TABLE_SQL, 50 | s3_bucket="udac-data-pipelines", 51 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv", 52 | start_date=start_date, 53 | ), 54 | task_id=stations_task_id, 55 | dag=dag, 56 | ) 57 | 58 | # 59 | # TODO: Consolidate check_trips and check_stations into a single check in the subdag 60 | # as we did with the create and copy in the demo 61 | # 62 | check_trips = HasRowsOperator( 63 | task_id="check_trips_data", 64 | dag=dag, 65 | redshift_conn_id="redshift", 66 | table="trips" 67 | ) 68 | 69 | check_stations = HasRowsOperator( 70 | task_id="check_stations_data", 71 | dag=dag, 72 | redshift_conn_id="redshift", 73 | table="stations" 74 | ) 75 | 76 | location_traffic_task = PostgresOperator( 77 | task_id="calculate_location_traffic", 78 | dag=dag, 79 | postgres_conn_id="redshift", 80 | sql=sql_statements.LOCATION_TRAFFIC_SQL 81 | ) 82 | 83 | # 84 | # TODO: Reorder the Graph once you have moved the checks 85 | # 86 | trips_subdag_task >> location_traffic_task 87 | stations_subdag_task >> location_traffic_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/facts_calculator.py: 
-------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from airflow.hooks.postgres_hook import PostgresHook 4 | from airflow.models import BaseOperator 5 | from airflow.utils.decorators import apply_defaults 6 | 7 | 8 | class FactsCalculatorOperator(BaseOperator): 9 | facts_sql_template = """ 10 | DROP TABLE IF EXISTS {destination_table}; 11 | CREATE TABLE {destination_table} AS 12 | SELECT 13 | {groupby_column}, 14 | MAX({fact_column}) AS max_{fact_column}, 15 | MIN({fact_column}) AS min_{fact_column}, 16 | AVG({fact_column}) AS average_{fact_column} 17 | FROM {origin_table} 18 | GROUP BY {groupby_column}; 19 | """ 20 | 21 | @apply_defaults 22 | def __init__(self, 23 | redshift_conn_id="", 24 | origin_table="", 25 | destination_table="", 26 | fact_column="", 27 | groupby_column="", 28 | *args, **kwargs): 29 | 30 | super(FactsCalculatorOperator, self).__init__(*args, **kwargs) 31 | # 32 | # TODO: Set attributes from __init__ instantiation arguments 33 | # 34 | 35 | def execute(self, context): 36 | # 37 | # TODO: Fetch the redshift hook 38 | # 39 | 40 | # 41 | # TODO: Format the `facts_sql_template` and run the query against redshift 42 | # 43 | 44 | pass 45 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/has_rows.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from airflow.hooks.postgres_hook import PostgresHook 4 | from airflow.models import BaseOperator 5 | from airflow.utils.decorators import apply_defaults 6 | 7 | 8 | class HasRowsOperator(BaseOperator): 9 | 10 | @apply_defaults 11 | def __init__(self, 12 | redshift_conn_id="", 13 | table="", 14 | *args, **kwargs): 15 | 16 | super(HasRowsOperator, self).__init__(*args, **kwargs) 17 | self.table = table 18 | self.redshift_conn_id = redshift_conn_id 19 | 20 | def execute(self, context): 21 | redshift_hook = PostgresHook(self.redshift_conn_id) 22 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {self.table}") 23 | if len(records) < 1 or len(records[0]) < 1: 24 | raise ValueError(f"Data quality check failed. {self.table} returned no results") 25 | num_records = records[0][0] 26 | if num_records < 1: 27 | raise ValueError(f"Data quality check failed. 
{self.table} contained 0 rows") 28 | logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records") 29 | 30 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/s3_to_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.contrib.hooks.aws_hook import AwsHook 2 | from airflow.hooks.postgres_hook import PostgresHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | 7 | class S3ToRedshiftOperator(BaseOperator): 8 | template_fields = ("s3_key",) 9 | copy_sql = """ 10 | COPY {} 11 | FROM '{}' 12 | ACCESS_KEY_ID '{}' 13 | SECRET_ACCESS_KEY '{}' 14 | IGNOREHEADER {} 15 | DELIMITER '{}' 16 | """ 17 | 18 | 19 | @apply_defaults 20 | def __init__(self, 21 | redshift_conn_id="", 22 | aws_credentials_id="", 23 | table="", 24 | s3_bucket="", 25 | s3_key="", 26 | delimiter=",", 27 | ignore_headers=1, 28 | *args, **kwargs): 29 | 30 | super(S3ToRedshiftOperator, self).__init__(*args, **kwargs) 31 | self.table = table 32 | self.redshift_conn_id = redshift_conn_id 33 | self.s3_bucket = s3_bucket 34 | self.s3_key = s3_key 35 | self.delimiter = delimiter 36 | self.ignore_headers = ignore_headers 37 | self.aws_credentials_id = aws_credentials_id 38 | 39 | def execute(self, context): 40 | aws_hook = AwsHook(self.aws_credentials_id) 41 | credentials = aws_hook.get_credentials() 42 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 43 | 44 | self.log.info("Clearing data from destination Redshift table") 45 | redshift.run("DELETE FROM {}".format(self.table)) 46 | 47 | self.log.info("Copying data from S3 to Redshift") 48 | rendered_key = self.s3_key.format(**context) 49 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key) 50 | formatted_sql = S3ToRedshiftOperator.copy_sql.format( 51 | self.table, 52 | s3_path, 53 | credentials.access_key, 54 | credentials.secret_key, 55 | self.ignore_headers, 56 | self.delimiter 57 | ) 58 | redshift.run(formatted_sql) 59 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/sql_statements.py: -------------------------------------------------------------------------------- 1 | CREATE_TRIPS_TABLE_SQL = """ 2 | CREATE TABLE IF NOT EXISTS trips ( 3 | trip_id INTEGER NOT NULL, 4 | start_time TIMESTAMP NOT NULL, 5 | end_time TIMESTAMP NOT NULL, 6 | bikeid INTEGER NOT NULL, 7 | tripduration DECIMAL(16,2) NOT NULL, 8 | from_station_id INTEGER NOT NULL, 9 | from_station_name VARCHAR(100) NOT NULL, 10 | to_station_id INTEGER NOT NULL, 11 | to_station_name VARCHAR(100) NOT NULL, 12 | usertype VARCHAR(20), 13 | gender VARCHAR(6), 14 | birthyear INTEGER, 15 | PRIMARY KEY(trip_id)) 16 | DISTSTYLE ALL; 17 | """ 18 | 19 | CREATE_STATIONS_TABLE_SQL = """ 20 | CREATE TABLE IF NOT EXISTS stations ( 21 | id INTEGER NOT NULL, 22 | name VARCHAR(250) NOT NULL, 23 | city VARCHAR(100) NOT NULL, 24 | latitude DECIMAL(9, 6) NOT NULL, 25 | longitude DECIMAL(9, 6) NOT NULL, 26 | dpcapacity INTEGER NOT NULL, 27 | online_date TIMESTAMP NOT NULL, 28 | PRIMARY KEY(id)) 29 | DISTSTYLE ALL; 30 | """ 31 | 32 | COPY_SQL = """ 33 | COPY {} 34 | FROM '{}' 35 | ACCESS_KEY_ID '{{}}' 36 | SECRET_ACCESS_KEY '{{}}' 37 | IGNOREHEADER 1 38 | DELIMITER ',' 39 | """ 40 | 41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format( 42 | "trips", 43 | "s3://udac-data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv" 44 | ) 45 | 46 | COPY_ALL_TRIPS_SQL = 
COPY_SQL.format( 47 | "trips", 48 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 49 | ) 50 | 51 | COPY_STATIONS_SQL = COPY_SQL.format( 52 | "stations", 53 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv" 54 | ) 55 | 56 | LOCATION_TRAFFIC_SQL = """ 57 | BEGIN; 58 | DROP TABLE IF EXISTS station_traffic; 59 | CREATE TABLE station_traffic AS 60 | SELECT 61 | DISTINCT(t.from_station_id) AS station_id, 62 | t.from_station_name AS station_name, 63 | num_departures, 64 | num_arrivals 65 | FROM trips t 66 | JOIN ( 67 | SELECT 68 | from_station_id, 69 | COUNT(from_station_id) AS num_departures 70 | FROM trips 71 | GROUP BY from_station_id 72 | ) AS fs ON t.from_station_id = fs.from_station_id 73 | JOIN ( 74 | SELECT 75 | to_station_id, 76 | COUNT(to_station_id) AS num_arrivals 77 | FROM trips 78 | GROUP BY to_station_id 79 | ) AS ts ON t.from_station_id = ts.to_station_id 80 | """ 81 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/subdag.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag. 3 | #1 - Consolidate HasRowsOperator into the SubDag 4 | #2 - Reorder the tasks to take advantage of the SubDag Operators 5 | 6 | import datetime 7 | 8 | from airflow import DAG 9 | from airflow.operators.postgres_operator import PostgresOperator 10 | from airflow.operators.udacity_plugin import HasRowsOperator 11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator 12 | 13 | import sql 14 | 15 | 16 | # Returns a DAG which creates a table if it does not exist, and then proceeds 17 | # to load data into that table from S3. When the load is complete, a data 18 | # quality check is performed to assert that at least one row of data is 19 | # present. 20 | def get_s3_to_redshift_dag( 21 | parent_dag_name, 22 | task_id, 23 | redshift_conn_id, 24 | aws_credentials_id, 25 | table, 26 | create_sql_stmt, 27 | s3_bucket, 28 | s3_key, 29 | *args, **kwargs): 30 | dag = DAG( 31 | f"{parent_dag_name}.{task_id}", 32 | **kwargs 33 | ) 34 | 35 | create_task = PostgresOperator( 36 | task_id=f"create_{table}_table", 37 | dag=dag, 38 | postgres_conn_id=redshift_conn_id, 39 | sql=create_sql_stmt 40 | ) 41 | 42 | copy_task = S3ToRedshiftOperator( 43 | task_id=f"load_{table}_from_s3_to_redshift", 44 | dag=dag, 45 | table=table, 46 | redshift_conn_id=redshift_conn_id, 47 | aws_credentials_id=aws_credentials_id, 48 | s3_bucket=s3_bucket, 49 | s3_key=s3_key 50 | ) 51 | 52 | # 53 | # TODO: Move the HasRowsOperator task here from the DAG 54 | # 55 | 56 | check_task = HasRowsOperator( 57 | task_id=f"check_{table}_data", 58 | dag=dag, 59 | redshift_conn_id=redshift_conn_id, 60 | table=table 61 | ) 62 | 63 | create_task >> copy_task 64 | # 65 | # TODO: Use DAG ordering to place the check task 66 | # 67 | copy_task >> check_task 68 | return dag 69 | -------------------------------------------------------------------------------- /Data-Modeling/L1 Exercise 1 Creating a Table with Postgres.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L1 Exercise 1: Creating a Table with PostgreSQL\n", 8 | "\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Walk through the basics of PostgreSQL. 
You will need to complete the following tasks:
  • Create a table in PostgreSQL,
  • Insert rows of data,<br>
  • Run a simple SQL query to validate the information.
    \n", 17 | "`#####` denotes where the code needs to be completed. \n", 18 | " \n", 19 | "Note: __Do not__ click the blue Preview button in the lower task bar" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "#### Import the library \n", 27 | "*Note:* An error might popup after this command has executed. If it does, read it carefully before ignoring. " 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 5, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "import psycopg2" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 6, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "name": "stdout", 46 | "output_type": "stream", 47 | "text": [ 48 | "ALTER ROLE\r\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "!echo \"alter user student createdb;\" | sudo -u postgres psql" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Create a connection to the database" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 7, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "try: \n", 70 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")\n", 71 | "except psycopg2.Error as e: \n", 72 | " print(\"Error: Could not make connection to the Postgres database\")\n", 73 | " print(e)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Use the connection to get a cursor that can be used to execute queries." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 8, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "try: \n", 90 | " cur = conn.cursor()\n", 91 | "except psycopg2.Error as e: \n", 92 | " print(\"Error: Could not get curser to the Database\")\n", 93 | " print(e)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### TO-DO: Set automatic commit to be true so that each action is committed without having to call conn.commit() after each command. " 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 9, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "conn.set_session(autocommit=True)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "### TO-DO: Create a database to do the work in. " 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 10, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "## TO-DO: Add the database name within the CREATE DATABASE statement. You can choose your own db name.\n", 126 | "try: \n", 127 | " cur.execute(\"create database student1\")\n", 128 | "except psycopg2.Error as e:\n", 129 | " print(e)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "#### TO-DO: Add the database name in the connect statement. Let's close our connection to the default database, reconnect to the Udacity database, and get a new cursor." 
137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 11, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "## TO-DO: Add the database name within the connect statement\n", 146 | "try: \n", 147 | " conn.close()\n", 148 | "except psycopg2.Error as e:\n", 149 | " print(e)\n", 150 | " \n", 151 | "try: \n", 152 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=student1 user=student password=student\")\n", 153 | "except psycopg2.Error as e: \n", 154 | " print(\"Error: Could not make connection to the Postgres database\")\n", 155 | " print(e)\n", 156 | " \n", 157 | "try: \n", 158 | " cur = conn.cursor()\n", 159 | "except psycopg2.Error as e: \n", 160 | " print(\"Error: Could not get curser to the Database\")\n", 161 | " print(e)\n", 162 | "\n", 163 | "conn.set_session(autocommit=True)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. \n", 171 | "\n", 172 | "`song_title\n", 173 | "artist_name\n", 174 | "year\n", 175 | "album_name\n", 176 | "single`\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 12, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "## TO-DO: Finish writing the CREATE TABLE statement with the correct arguments\n", 186 | "try: \n", 187 | " cur.execute(\"CREATE TABLE IF NOT EXISTS music_library_1(song_title varchar, artist_name varchar, year int, album_name varchar, single Boolean);\")\n", 188 | "except psycopg2.Error as e: \n", 189 | " print(\"Error: Issue creating table\")\n", 190 | " print (e)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "### TO-DO: Insert the following two rows in the table\n", 198 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n", 199 | "\n", 200 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 13, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "## TO-DO: Finish the INSERT INTO statement with the correct arguments\n", 210 | "\n", 211 | "try: \n", 212 | " cur.execute(\"INSERT INTO music_library_1 (song_title, artist_name, year, album_name, single) \\\n", 213 | " VALUES (%s, %s, %s, %s, %s)\", \\\n", 214 | " (\"The Beatles\", \"Across The Universe\", 1970, \"Across The Universe\", False))\n", 215 | "except psycopg2.Error as e: \n", 216 | " print(\"Error: Inserting Rows\")\n", 217 | " print (e)\n", 218 | " \n", 219 | "try: \n", 220 | " cur.execute(\"INSERT INTO music_library_1 (song_title, artist_name, year, album_name, single) \\\n", 221 | " VALUES (%s, %s, %s, %s, %s)\",\n", 222 | " (\"Rubber Soul\", \"The Beatles\", 1965, \"Think For Yourself\", False))\n", 223 | "except psycopg2.Error as e: \n", 224 | " print(\"Error: Inserting Rows\")\n", 225 | " print (e)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "### TO-DO: Validate your data was inserted into the table. 
\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 14, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "('The Beatles', 'Across The Universe', 1970, 'Across The Universe', False)\n", 245 | "('Rubber Soul', 'The Beatles', 1965, 'Think For Yourself', False)\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "## TO-DO: Finish the SELECT * Statement \n", 251 | "try: \n", 252 | " cur.execute(\"SELECT * FROM music_library_1;\")\n", 253 | "except psycopg2.Error as e: \n", 254 | " print(\"Error: select *\")\n", 255 | " print (e)\n", 256 | "\n", 257 | "row = cur.fetchone()\n", 258 | "while row:\n", 259 | " print(row)\n", 260 | " row = cur.fetchone()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "### And finally close your cursor and connection. " 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 15, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "cur.close()\n", 277 | "conn.close()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [] 286 | } 287 | ], 288 | "metadata": { 289 | "kernelspec": { 290 | "display_name": "Python 3", 291 | "language": "python", 292 | "name": "python3" 293 | }, 294 | "language_info": { 295 | "codemirror_mode": { 296 | "name": "ipython", 297 | "version": 3 298 | }, 299 | "file_extension": ".py", 300 | "mimetype": "text/x-python", 301 | "name": "python", 302 | "nbconvert_exporter": "python", 303 | "pygments_lexer": "ipython3", 304 | "version": "3.6.3" 305 | } 306 | }, 307 | "nbformat": 4, 308 | "nbformat_minor": 2 309 | } 310 | -------------------------------------------------------------------------------- /Data-Modeling/L1 Exercise 2 Creating a Table with Apache Cassandra.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L1 Exercise 2: Creating a Table with Apache Cassandra\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of Apache Cassandra. Complete the following tasks:
  • Create a table in Apache Cassandra,
  • Insert rows of data,
  • Run a simple CQL query to validate the information.<br>
    \n", 16 | "`#####` denotes where the code needs to be completed.\n", 17 | " \n", 18 | "Note: __Do not__ click the blue Preview button in the lower taskbar" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "#### Import Apache Cassandra python package" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import cassandra" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Create a connection to the database" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "from cassandra.cluster import Cluster\n", 51 | "try: \n", 52 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 53 | " session = cluster.connect()\n", 54 | "except Exception as e:\n", 55 | " print(e)\n", 56 | " " 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### TO-DO: Create a keyspace to do the work in " 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 3, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "## TO-DO: Create the keyspace\n", 73 | "try:\n", 74 | " session.execute(\"\"\"\n", 75 | " CREATE KEYSPACE IF NOT EXISTS music_library_1 \n", 76 | " WITH REPLICATION = \n", 77 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 78 | ")\n", 79 | "\n", 80 | "except Exception as e:\n", 81 | " print(e)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### TO-DO: Connect to the Keyspace" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 4, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "## To-Do: Add in the keyspace you created\n", 98 | "try:\n", 99 | " session.set_keyspace('music_library_1')\n", 100 | "except Exception as e:\n", 101 | " print(e)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 
\n", 109 | "\n", 110 | "`song_title\n", 111 | "artist_name\n", 112 | "year\n", 113 | "album_name\n", 114 | "single`" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### TO-DO: You need to create a table to be able to run the following query: \n", 122 | "`select * from songs WHERE year=1970 AND artist_name=\"The Beatles\"`" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 6, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "## TO-DO: Complete the query below\n", 132 | "query = \"CREATE TABLE IF NOT EXISTS music_library_table_1 \"\n", 133 | "query = query + \"(song_title text, artist_name text, year int, album_name text, single Boolean, PRIMARY KEY (year, artist_name))\"\n", 134 | "try:\n", 135 | " session.execute(query)\n", 136 | "except Exception as e:\n", 137 | " print(e)\n" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### TO-DO: Insert the following two rows in your table\n", 145 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n", 146 | "\n", 147 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 7, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "## Add in query and then run the insert statement\n", 157 | "query = \"INSERT INTO music_library_table_1 (song_title, artist_name, year, album_name, single)\" \n", 158 | "query = query + \" VALUES (%s, %s, %s, %s, %s)\"\n", 159 | "\n", 160 | "try:\n", 161 | " session.execute(query, (\"Across The Universe\", \"The Beatles\", 1970, \"Let It Be\", False))\n", 162 | "except Exception as e:\n", 163 | " print(e)\n", 164 | " \n", 165 | "try:\n", 166 | " session.execute(query, (\"Think For Yourself\", \"The Beatles\", 1965, \"Rubber Soul\", False))\n", 167 | "except Exception as e:\n", 168 | " print(e)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### TO-DO: Validate your data was inserted into the table." 
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 8, 181 | "metadata": { 182 | "scrolled": true 183 | }, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "1965 Rubber Soul The Beatles\n", 190 | "1970 Let It Be The Beatles\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "## TO-DO: Complete and then run the select statement to validate the data was inserted into the table\n", 196 | "query = 'SELECT * FROM music_library_table_1'\n", 197 | "try:\n", 198 | " rows = session.execute(query)\n", 199 | "except Exception as e:\n", 200 | " print(e)\n", 201 | " \n", 202 | "for row in rows:\n", 203 | " print (row.year, row.album_name, row.artist_name)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### TO-DO: Validate the Data Model with the original query.\n", 211 | "\n", 212 | "`select * from songs WHERE YEAR=1970 AND artist_name=\"The Beatles\"`" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 9, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "1970 Let It Be The Beatles\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "##TO-DO: Complete the select statement to run the query \n", 230 | "query = \"SELECT * from music_library_table_1 where YEAR=1970 and artist_name = 'The Beatles'\"\n", 231 | "try:\n", 232 | " rows = session.execute(query)\n", 233 | "except Exception as e:\n", 234 | " print(e)\n", 235 | " \n", 236 | "for row in rows:\n", 237 | " print (row.year, row.album_name, row.artist_name)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "### And Finally close the session and cluster connection" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "session.shutdown()\n", 254 | "cluster.shutdown()" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [] 263 | } 264 | ], 265 | "metadata": { 266 | "kernelspec": { 267 | "display_name": "Python 3", 268 | "language": "python", 269 | "name": "python3" 270 | }, 271 | "language_info": { 272 | "codemirror_mode": { 273 | "name": "ipython", 274 | "version": 3 275 | }, 276 | "file_extension": ".py", 277 | "mimetype": "text/x-python", 278 | "name": "python", 279 | "nbconvert_exporter": "python", 280 | "pygments_lexer": "ipython3", 281 | "version": "3.6.3" 282 | } 283 | }, 284 | "nbformat": 4, 285 | "nbformat_minor": 2 286 | } 287 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 2 Primary Key.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L3 Exercise 2: Focus on Primary Key\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of creating a table with a good Primary Key in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. 
This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 23 | "! pip install cassandra-driver\n", 24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Import Apache Cassandra python package" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import cassandra" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Create a connection to the database" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "from cassandra.cluster import Cluster\n", 57 | "try: \n", 58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 59 | " session = cluster.connect()\n", 60 | "except Exception as e:\n", 61 | " print(e)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a keyspace to work in " 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "try:\n", 78 | " session.execute(\"\"\"\n", 79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 80 | " WITH REPLICATION = \n", 81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 82 | ")\n", 83 | "\n", 84 | "except Exception as e:\n", 85 | " print(e)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "#### Connect to the Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "try:\n", 102 | " session.set_keyspace('udacity')\n", 103 | "except Exception as e:\n", 104 | " print(e)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Imagine you need to create a new Music Library of albums \n", 112 | "\n", 113 | "### Here is the information asked of the data:\n", 114 | "### 1. Give every album in the music library that was created by a given artist\n", 115 | "select * from music_library WHERE artist_name=\"The Beatles\"\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Here is the Collection of Data\n", 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### How should we model these data? \n", 131 | "\n", 132 | "#### What should be our Primary Key and Partition Key? Since the data are looking for the ARTIST, let's start with that. Is Partitioning our data by artist a good idea? In this case our data is very small. If we had a larger dataset of albums, partitions by artist might be a fine choice. But we would need to validate the dataset to make sure there is an equal spread of the data. 
\n", 133 | "\n", 134 | "`Table Name: music_library\n", 135 | "column 1: Year\n", 136 | "column 2: Artist Name\n", 137 | "column 3: Album Name\n", 138 | "Column 4: City\n", 139 | "PRIMARY KEY(artist_name)`" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "query = \"CREATE TABLE IF NOT EXISTS music_library\"\n", 149 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (artist_name))\"\n", 150 | "try:\n", 151 | " session.execute(query)\n", 152 | "except Exception as e:\n", 153 | " print(e)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Insert the data into the tables" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n", 170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 171 | "\n", 172 | "try:\n", 173 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n", 174 | "except Exception as e:\n", 175 | " print(e)\n", 176 | " \n", 177 | "try:\n", 178 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n", 179 | "except Exception as e:\n", 180 | " print(e)\n", 181 | " \n", 182 | "try:\n", 183 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n", 184 | "except Exception as e:\n", 185 | " print(e)\n", 186 | "\n", 187 | "try:\n", 188 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n", 189 | "except Exception as e:\n", 190 | " print(e)\n", 191 | "\n", 192 | "try:\n", 193 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n", 194 | "except Exception as e:\n", 195 | " print(e)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Let's Validate our Data Model -- Did it work?? If we look for Albums from The Beatles we should expect to see 2 rows.\n", 203 | "\n", 204 | "`select * from music_library WHERE artist_name=\"The Beatles\"`" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 7, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "name": "stdout", 214 | "output_type": "stream", 215 | "text": [ 216 | "1965 The Beatles Rubber Soul Oxford\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "query = \"select * from music_library WHERE artist_name='The Beatles'\"\n", 222 | "try:\n", 223 | " rows = session.execute(query)\n", 224 | "except Exception as e:\n", 225 | " print(e)\n", 226 | " \n", 227 | "for row in rows:\n", 228 | " print (row.year, row.artist_name, row.album_name, row.city)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "### That didn't work out as planned! Why is that? Because we did not create a unique primary key. " 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### Let's try again. This time focus on making the PRIMARY KEY unique.\n", 243 | "### Looking at the dataset, what makes each row unique?\n", 244 | "\n", 245 | "### We have a couple of options (City and Album Name) but that will not get us the query we need which is looking for album's in a particular artist. Let's make a composite key of the `ARTIST NAME` and `ALBUM NAME`. 
This is assuming that an album name is unique to the artist it was created by (not a bad bet). --But remember this is just an exercise, you will need to understand your dataset fully (no betting!)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 8, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "query = \"CREATE TABLE IF NOT EXISTS music_library1 \"\n", 255 | "query = query + \"(artist_name text, album_name text, year int, city text, PRIMARY KEY (artist_name, album_name))\"\n", 256 | "try:\n", 257 | " session.execute(query)\n", 258 | "except Exception as e:\n", 259 | " print(e)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 9, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "query = \"INSERT INTO music_library1 (artist_name, album_name, year, city)\"\n", 269 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 270 | "\n", 271 | "try:\n", 272 | " session.execute(query, (\"The Beatles\", \"Let it Be\", 1970, \"Liverpool\"))\n", 273 | "except Exception as e:\n", 274 | " print(e)\n", 275 | " \n", 276 | "try:\n", 277 | " session.execute(query, (\"The Beatles\", \"Rubber Soul\", 1965, \"Oxford\"))\n", 278 | "except Exception as e:\n", 279 | " print(e)\n", 280 | " \n", 281 | "try:\n", 282 | " session.execute(query, (\"The Who\", \"My Generation\", 1965, \"London\"))\n", 283 | "except Exception as e:\n", 284 | " print(e)\n", 285 | "\n", 286 | "try:\n", 287 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n", 288 | "except Exception as e:\n", 289 | " print(e)\n", 290 | "\n", 291 | "try:\n", 292 | " session.execute(query, (\"The Carpenters\", \"Close To You\", 1970, \"San Diego\"))\n", 293 | "except Exception as e:\n", 294 | " print(e)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "### Validate the Data Model -- Did it work? If we look for Albums from The Beatles we should expect to see 2 rows.\n", 302 | "\n", 303 | "`select * from music_library WHERE artist_name=\"The Beatles\"`" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 10, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "1970 The Beatles Let it Be Liverpool\n", 316 | "1965 The Beatles Rubber Soul Oxford\n" 317 | ] 318 | } 319 | ], 320 | "source": [ 321 | "query = \"select * from music_library1 WHERE artist_name='The Beatles'\"\n", 322 | "try:\n", 323 | " rows = session.execute(query)\n", 324 | "except Exception as e:\n", 325 | " print(e)\n", 326 | " \n", 327 | "for row in rows:\n", 328 | " print (row.year, row.artist_name, row.album_name, row.city)" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "### Success it worked! We created a unique Primary key that evenly distributed our data. 
" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "### Drop the tables" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 11, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "query = \"drop table music_library\"\n", 352 | "try:\n", 353 | " rows = session.execute(query)\n", 354 | "except Exception as e:\n", 355 | " print(e)\n", 356 | "\n", 357 | "query = \"drop table music_library1\"\n", 358 | "try:\n", 359 | " rows = session.execute(query)\n", 360 | "except Exception as e:\n", 361 | " print(e)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "### Close the session and cluster connection" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": 12, 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "session.shutdown()\n", 378 | "cluster.shutdown()" 379 | ] 380 | } 381 | ], 382 | "metadata": { 383 | "kernelspec": { 384 | "display_name": "Python 3", 385 | "language": "python", 386 | "name": "python3" 387 | }, 388 | "language_info": { 389 | "codemirror_mode": { 390 | "name": "ipython", 391 | "version": 3 392 | }, 393 | "file_extension": ".py", 394 | "mimetype": "text/x-python", 395 | "name": "python", 396 | "nbconvert_exporter": "python", 397 | "pygments_lexer": "ipython3", 398 | "version": "3.6.3" 399 | } 400 | }, 401 | "nbformat": 4, 402 | "nbformat_minor": 2 403 | } 404 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 3 Clustering Column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L3 Exercise 3: Focus on Clustering Columns\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 23 | "! 
pip install cassandra-driver\n", 24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Import Apache Cassandra python package" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import cassandra" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Create a connection to the database" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "from cassandra.cluster import Cluster\n", 57 | "try: \n", 58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 59 | " session = cluster.connect()\n", 60 | "except Exception as e:\n", 61 | " print(e)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a keyspace to work in " 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "try:\n", 78 | " session.execute(\"\"\"\n", 79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 80 | " WITH REPLICATION = \n", 81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 82 | ")\n", 83 | "\n", 84 | "except Exception as e:\n", 85 | " print(e)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "try:\n", 102 | " session.set_keyspace('udacity')\n", 103 | "except Exception as e:\n", 104 | " print(e)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Imagine we would like to start creating a new Music Library of albums. \n", 112 | "\n", 113 | "### We want to ask 1 question of our data:\n", 114 | "#### 1. Give me all the information from the music library about a given album\n", 115 | "`select * from album_library WHERE album_name=\"Close To You\"`\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Here is the Data:\n", 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### How should we model this data? What should be our Primary Key and Partition Key? \n", 131 | "\n", 132 | "### Since the data is looking for the `ALBUM_NAME` let's start with that. From there we will need to add other elements to make sure the Key is unique. We also need to add the `ARTIST_NAME` as Clustering Columns to make the data unique. 
That should be enough to make the row key unique.\n", 133 | "\n", 134 | "`Table Name: music_library\n", 135 | "column 1: Year\n", 136 | "column 2: Artist Name\n", 137 | "column 3: Album Name\n", 138 | "Column 4: City\n", 139 | "PRIMARY KEY(album name, artist name)`" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 149 | "query = query + \"(album_name text, artist_name text, year int, city text, PRIMARY KEY (album_name, artist_name))\"\n", 150 | "try:\n", 151 | " session.execute(query)\n", 152 | "except Exception as e:\n", 153 | " print(e)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Insert the data into the table" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "query = \"INSERT INTO music_library (album_name, artist_name, year, city)\"\n", 170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 171 | "\n", 172 | "try:\n", 173 | " session.execute(query, (\"Let it Be\", \"The Beatles\", 1970, \"Liverpool\"))\n", 174 | "except Exception as e:\n", 175 | " print(e)\n", 176 | " \n", 177 | "try:\n", 178 | " session.execute(query, (\"Rubber Soul\", \"The Beatles\", 1965, \"Oxford\"))\n", 179 | "except Exception as e:\n", 180 | " print(e)\n", 181 | " \n", 182 | "try:\n", 183 | " session.execute(query, (\"Beatles For Sale\", \"The Beatles\", 1964, \"London\"))\n", 184 | "except Exception as e:\n", 185 | " print(e)\n", 186 | "\n", 187 | "try:\n", 188 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n", 189 | "except Exception as e:\n", 190 | " print(e)\n", 191 | "\n", 192 | "try:\n", 193 | " session.execute(query, (\"Close To You\", \"The Carpenters\", 1970, \"San Diego\"))\n", 194 | "except Exception as e:\n", 195 | " print(e)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Validate the Data Model -- Did it work?\n", 203 | "`select * from album_library WHERE album_name=\"Close To You\"`" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 7, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "The Carpenters Close To You San Diego 1970\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "query = \"select * from music_library WHERE album_NAME='Close To You'\"\n", 221 | "try:\n", 222 | " rows = session.execute(query)\n", 223 | "except Exception as e:\n", 224 | " print(e)\n", 225 | " \n", 226 | "for row in rows:\n", 227 | " print (row.artist_name, row.album_name, row.city, row.year)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### Success it worked! 
We created a unique Primary key that evenly distributed our data, with clustering columns" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### For the sake of the demo, drop the table" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 8, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "query = \"drop table music_library\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Close the session and cluster connection" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 9, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "session.shutdown()\n", 271 | "cluster.shutdown()" 272 | ] 273 | } 274 | ], 275 | "metadata": { 276 | "kernelspec": { 277 | "display_name": "Python 3", 278 | "language": "python", 279 | "name": "python3" 280 | }, 281 | "language_info": { 282 | "codemirror_mode": { 283 | "name": "ipython", 284 | "version": 3 285 | }, 286 | "file_extension": ".py", 287 | "mimetype": "text/x-python", 288 | "name": "python", 289 | "nbconvert_exporter": "python", 290 | "pygments_lexer": "ipython3", 291 | "version": "3.6.3" 292 | } 293 | }, 294 | "nbformat": 4, 295 | "nbformat_minor": 2 296 | } 297 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 4 Using the WHERE Clause.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 3 Demo 4: Using the WHERE Clause\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.\n", 16 | "\n", 17 | "##### denotes where the code needs to be completed.\n", 18 | "\n", 19 | "Note: __Do not__ click the blue Preview button in the lower task bar" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 27 | "! 
pip install cassandra-driver\n", 28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "#### Import Apache Cassandra python package" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import cassandra" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### First let's create a connection to the database" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from cassandra.cluster import Cluster\n", 61 | "try: \n", 62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 63 | " session = cluster.connect()\n", 64 | "except Exception as e:\n", 65 | " print(e)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Let's create a keyspace to do our work in " 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "try:\n", 82 | " session.execute(\"\"\"\n", 83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 84 | " WITH REPLICATION = \n", 85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 86 | ")\n", 87 | "\n", 88 | "except Exception as e:\n", 89 | " print(e)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 4, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "try:\n", 106 | " session.set_keyspace('udacity')\n", 107 | "except Exception as e:\n", 108 | " print(e)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Let's imagine we would like to start creating a new Music Library of albums. \n", 116 | "### We want to ask 4 question of our data\n", 117 | "#### 1. Give me every album in my music library that was released in a 1965 year\n", 118 | "#### 2. Give me the album that is in my music library that was released in 1965 by \"The Beatles\"\n", 119 | "#### 3. Give me all the albums released in a given year that was made in London \n", 120 | "#### 4. Give me the city that the album \"Rubber Soul\" was recorded" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Here is our Collection of Data\n", 128 | "" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the YEAR let's start with that. From there we will add clustering columns on Artist Name and Album Name." 
136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 145 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))\"\n", 146 | "try:\n", 147 | " session.execute(query)\n", 148 | "except Exception as e:\n", 149 | " print(e)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### Let's insert our data into of table" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n", 166 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 167 | "\n", 168 | "try:\n", 169 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n", 170 | "except Exception as e:\n", 171 | " print(e)\n", 172 | " \n", 173 | "try:\n", 174 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n", 175 | "except Exception as e:\n", 176 | " print(e)\n", 177 | " \n", 178 | "try:\n", 179 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n", 180 | "except Exception as e:\n", 181 | " print(e)\n", 182 | "\n", 183 | "try:\n", 184 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n", 185 | "except Exception as e:\n", 186 | " print(e)\n", 187 | "\n", 188 | "try:\n", 189 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n", 190 | "except Exception as e:\n", 191 | " print(e)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Let's Validate our Data Model with our 4 queries.\n", 199 | "\n", 200 | "Query 1: " 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 9, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "1965 The Beatles Rubber Soul Oxford\n", 213 | "1965 The Who My Generation London\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "query = \"SELECT * from music_library where year = 1965\"\n", 219 | "try:\n", 220 | " rows = session.execute(query)\n", 221 | "except Exception as e:\n", 222 | " print(e)\n", 223 | " \n", 224 | "for row in rows:\n", 225 | " print (row.year, row.artist_name, row.album_name, row.city)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | " Let's try the 2nd query.\n", 233 | " Query 2: " 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 10, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "1965 The Beatles Rubber Soul Oxford\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "query = \"SELECT * from music_library where year = 1965 and artist_name = 'The Beatles'\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n", 255 | " \n", 256 | "for row in rows:\n", 257 | " print (row.year, row.artist_name, row.album_name, row.city)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### Let's try the 3rd query.\n", 265 | "Query 3: " 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 
12, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "Error from server: code=2200 [Invalid query] message=\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\"\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "query = \"select * from music_library where city = 'London'\"\n", 283 | "try:\n", 284 | " rows = session.execute(query)\n", 285 | "except Exception as e:\n", 286 | " print(e)\n", 287 | " \n", 288 | "for row in rows:\n", 289 | " print (row.year, row.artist_name, row.album_name, row.city)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "### Did you get an error? You can not try to access a column or a clustering column if you have not used the other defined clustering column. Let's see if we can try it a different way. \n", 297 | "Try Query 4: \n", 298 | "\n" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 14, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "London\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "query = \"select city from music_library where year = 1965 and artist_name = 'The Who' and album_name = 'My Generation'\"\n", 316 | "try:\n", 317 | " rows = session.execute(query)\n", 318 | "except Exception as e:\n", 319 | " print(e)\n", 320 | " \n", 321 | "for row in rows:\n", 322 | " print (row.city)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 16, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "Error from server: code=2200 [Invalid query] message=\"PRIMARY KEY column \"album_name\" cannot be restricted as preceding column \"artist_name\" is not restricted\"\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "query = \"select city from music_library where album_name = 'Rubber Soul'\"\n", 340 | "try:\n", 341 | " rows = session.execute(query)\n", 342 | "except Exception as e:\n", 343 | " print(e)\n", 344 | " \n", 345 | "for row in rows:\n", 346 | " print (row.city)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 18, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "Oxford\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "query = \"select city from music_library where year = 1965 and artist_name = 'The Beatles' and album_name = 'Rubber Soul'\"\n", 364 | "try:\n", 365 | " rows = session.execute(query)\n", 366 | "except Exception as e:\n", 367 | " print(e)\n", 368 | " \n", 369 | "for row in rows:\n", 370 | " print (row.city)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### And Finally close the session and cluster connection" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 19, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "session.shutdown()\n", 387 | "cluster.shutdown()" 388 | ] 389 | } 390 | ], 391 | "metadata": { 392 | "kernelspec": { 393 | "display_name": "Python 3", 394 | "language": "python", 395 | "name": "python3" 396 | }, 397 | "language_info": { 398 | "codemirror_mode": { 399 | "name": "ipython", 400 | "version": 3 401 | }, 402 | "file_extension": 
".py", 403 | "mimetype": "text/x-python", 404 | "name": "python", 405 | "nbconvert_exporter": "python", 406 | "pygments_lexer": "ipython3", 407 | "version": "3.6.3" 408 | } 409 | }, 410 | "nbformat": 4, 411 | "nbformat_minor": 2 412 | } 413 | -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 1.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 2.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 3.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 4.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Project 1 Introduction.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Project 1 Introduction.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/README.md: -------------------------------------------------------------------------------- 1 | Introduction 2 | 3 | A startup called Sparkify want to analyze the data they have been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. 4 | 5 | The aim is to create a Postgres Database Schema and ETL pipeline to optimize queries for song play analysis. 6 | 7 | Project Description 8 | 9 | In this project, I have to model data with Postgres and build and ETL pipeline using Python. On the database side, I have to define fact and dimension tables for a Star Schema for a specific focus. 
The ETL pipeline, in turn, transfers data from files located in two local directories into these tables in Postgres using Python and SQL
10 | 
11 | Schema for Song Play Analysis
12 | 
13 | Fact Table
14 | 
15 | songplays - records in log data associated with song plays
16 | 
17 | Dimension Tables
18 | 
19 | users - users in the app
20 | 
21 | songs - songs in the music database
22 | 
23 | artists - artists in the music database
24 | 
25 | time - timestamps of records in songplays broken down into specific units
26 | 
27 | Project Design
28 | 
29 | The database design is optimized: with a small number of tables and a few specific joins, we can get the most information and perform the analysis
30 | 
31 | The ETL design is likewise simple: read the JSON files, parse them, and store the values in the proper columns with the proper formatting
32 | 
33 | Database Script
34 | 
35 | Running "python create_tables.py" in the terminal makes it easy to create and recreate the tables
36 | 
37 | Jupyter Notebook
38 | 
39 | etl.ipynb is a Jupyter notebook for verifying each statement and the resulting data; the statements are then copied into etl.py, which is run from the terminal with "python etl.py", after which test.ipynb is run to check whether the data has been loaded into all the tables
40 | 
41 | Relevant Files Provided
42 | 
43 | test.ipynb displays the first few rows of each table to let you check your database
44 | 
45 | create_tables.py drops and creates your tables
46 | 
47 | etl.ipynb reads and processes a single file from song_data and log_data and loads the records into your tables from the Jupyter notebook
48 | 
49 | etl.py reads and processes the files from song_data and log_data and loads the records into your tables; it is the ETL script run from the terminal
50 | 
51 | sql_queries.py contains all your SQL queries and is imported into the last three files above
--------------------------------------------------------------------------------
/Data-Modeling/Project 1/create_tables.py:
--------------------------------------------------------------------------------
1 | import psycopg2
2 | from sql_queries import create_table_queries, drop_table_queries
3 | 
4 | 
5 | def create_database():
6 |     # connect to default database
7 |     conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
8 |     conn.set_session(autocommit=True)
9 |     cur = conn.cursor()
10 | 
11 |     # create sparkify database with UTF8 encoding
12 |     cur.execute("DROP DATABASE IF EXISTS sparkifydb")
13 |     cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
14 | 
15 |     # close connection to default database
16 |     conn.close()
17 | 
18 |     # connect to sparkify database
19 |     conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
20 |     cur = conn.cursor()
21 | 
22 |     return cur, conn
23 | 
24 | 
25 | def drop_tables(cur, conn):
26 |     for query in drop_table_queries:
27 |         cur.execute(query)
28 |         conn.commit()
29 | 
30 | 
31 | def create_tables(cur, conn):
32 |     for query in create_table_queries:
33 |         cur.execute(query)
34 |         conn.commit()
35 | 
36 | 
37 | def main():
38 |     cur, conn = create_database()
39 | 
40 |     drop_tables(cur, conn)
41 |     create_tables(cur, conn)
42 | 
43 |     conn.close()
44 | 
45 | 
46 | if __name__ == "__main__":
47 |     main()
--------------------------------------------------------------------------------
/Data-Modeling/Project 1/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/data.zip
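A quick way to confirm the load described in the README above (what test.ipynb does in more detail) is a small row-count check against the sparkifydb tables. The following is only an illustrative sketch that reuses the connection string and table names from this project's scripts; it is not a file from the repository:

```python
import psycopg2

# Connect to the database created by create_tables.py
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Count the rows in each table loaded by etl.py
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute("SELECT COUNT(*) FROM {}".format(table))
    print("{}: {} rows".format(table, cur.fetchone()[0]))

conn.close()
```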
-------------------------------------------------------------------------------- /Data-Modeling/Project 1/etl.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import psycopg2 4 | import pandas as pd 5 | import numpy as np 6 | from sql_queries import * 7 | 8 | 9 | def process_song_file(cur, filepath): 10 | """ 11 | This function reads JSON files and read information of song and artist data and saves into song_data and artist_data 12 | Arguments: 13 | cur: Database Cursor 14 | filepath: location of JSON files 15 | Return: None 16 | """ 17 | # open song file 18 | df = pd.read_json(filepath, lines=True) 19 | 20 | # insert song record 21 | song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist() 22 | cur.execute(song_table_insert, song_data) 23 | 24 | # insert artist record 25 | artist_data = df[["artist_id", "artist_name", "artist_location", "artist_latitude", "artist_longitude"]].values[0].tolist() 26 | cur.execute(artist_table_insert, artist_data) 27 | 28 | 29 | def process_log_file(cur, filepath): 30 | """ 31 | This function reads Log files and reads information of time, user and songplay data and saves into time, user, songplay 32 | Arguments: 33 | cur: Database Cursor 34 | filepath: location of Log files 35 | Return: None 36 | """ 37 | 38 | # open log file 39 | df = pd.read_json(filepath, lines=True) 40 | 41 | # filter by NextSong action 42 | df = df[(df['page'] == 'NextSong')] 43 | 44 | # convert timestamp column to datetime 45 | t = pd.to_datetime(df['ts'], unit='ms') 46 | df['ts'] = pd.to_datetime(df['ts'], unit='ms') 47 | 48 | # insert time data records 49 | time_data = list((t, t.dt.hour, t.dt.day, t.dt.weekofyear, t.dt.month, t.dt.year, t.dt.weekday)) 50 | column_labels = list(('start_time', 'hour', 'day', 'week', 'month', 'year', 'weekday')) 51 | time_df = pd.DataFrame.from_dict(dict(zip(column_labels, time_data))) 52 | 53 | for i, row in time_df.iterrows(): 54 | cur.execute(time_table_insert, list(row)) 55 | 56 | # load user table 57 | user_df = df[["userId", "firstName", "lastName", "gender", "level"]] 58 | 59 | # insert user records 60 | for i, row in user_df.iterrows(): 61 | cur.execute(user_table_insert, row) 62 | 63 | # insert songplay records 64 | for index, row in df.iterrows(): 65 | 66 | # get songid and artistid from song and artist tables 67 | cur.execute(song_select, (row.song, row.artist, row.length)) 68 | results = cur.fetchone() 69 | 70 | if results: 71 | songid, artistid = results 72 | else: 73 | songid, artistid = None, None 74 | 75 | # insert songplay record 76 | songplay_data = (index, row.ts, row.userId, row.level, songid, artistid, row.sessionId,\ 77 | row.location, row.userAgent) 78 | cur.execute(songplay_table_insert, songplay_data) 79 | 80 | 81 | def process_data(cur, conn, filepath, func): 82 | # get all files matching extension from directory 83 | all_files = [] 84 | for root, dirs, files in os.walk(filepath): 85 | files = glob.glob(os.path.join(root,'*.json')) 86 | for f in files : 87 | all_files.append(os.path.abspath(f)) 88 | 89 | # get total number of files found 90 | num_files = len(all_files) 91 | print('{} files found in {}'.format(num_files, filepath)) 92 | 93 | # iterate over files and process 94 | for i, datafile in enumerate(all_files, 1): 95 | func(cur, datafile) 96 | conn.commit() 97 | print('{}/{} files processed.'.format(i, num_files)) 98 | 99 | 100 | def main(): 101 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student 
password=student") 102 | cur = conn.cursor() 103 | 104 | process_data(cur, conn, filepath='data/song_data', func=process_song_file) 105 | process_data(cur, conn, filepath='data/log_data', func=process_log_file) 106 | 107 | conn.close() 108 | 109 | 110 | if __name__ == "__main__": 111 | main() -------------------------------------------------------------------------------- /Data-Modeling/Project 1/sql_queries.py: -------------------------------------------------------------------------------- 1 | # DROP TABLES 2 | 3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays" 4 | user_table_drop = "DROP TABLE IF EXISTS users" 5 | song_table_drop = "DROP TABLE IF EXISTS songs" 6 | artist_table_drop = "DROP TABLE IF EXISTS artists" 7 | time_table_drop = "DROP TABLE IF EXISTS time" 8 | 9 | # CREATE TABLES 10 | 11 | songplay_table_create = (""" 12 | CREATE TABLE IF NOT EXISTS songplays ( 13 | songplay_id SERIAL PRIMARY KEY, 14 | start_time TIMESTAMP, 15 | user_id INTEGER, 16 | level VARCHAR(10), 17 | song_id VARCHAR(20), 18 | artist_id VARCHAR(20), 19 | session_id INTEGER, 20 | location VARCHAR(50), 21 | user_agent VARCHAR(150) 22 | ); 23 | """) 24 | 25 | user_table_create = (""" 26 | CREATE TABLE IF NOT EXISTS users ( 27 | user_id INTEGER PRIMARY KEY, 28 | first_name VARCHAR(50), 29 | last_name VARCHAR(50), 30 | gender CHAR(1), 31 | level VARCHAR(10) 32 | ); 33 | """) 34 | 35 | song_table_create = (""" 36 | CREATE TABLE IF NOT EXISTS songs ( 37 | song_id VARCHAR(20) PRIMARY KEY, 38 | title VARCHAR(100), 39 | artist_id VARCHAR(20) NOT NULL, 40 | year INTEGER, 41 | duration FLOAT(5) 42 | ); 43 | """) 44 | 45 | artist_table_create = (""" 46 | CREATE TABLE IF NOT EXISTS artists ( 47 | artist_id VARCHAR(20) PRIMARY KEY, 48 | name VARCHAR(100), 49 | location VARCHAR(100), 50 | lattitude FLOAT(5), 51 | longitude FLOAT(5) 52 | ); 53 | """) 54 | 55 | time_table_create = (""" 56 | CREATE TABLE IF NOT EXISTS time ( 57 | start_time TIMESTAMP PRIMARY KEY, 58 | hour INTEGER, 59 | day INTEGER, 60 | week INTEGER, 61 | month INTEGER, 62 | year INTEGER, 63 | weekday INTEGER 64 | ); 65 | """) 66 | 67 | # INSERT RECORDS 68 | 69 | songplay_table_insert = (""" 70 | INSERT INTO songplays (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) 71 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s) 72 | ON CONFLICT(songplay_id) DO NOTHING; 73 | """) 74 | 75 | user_table_insert = (""" 76 | INSERT INTO users (user_id, first_name, last_name, gender, level) 77 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level; 78 | """) 79 | 80 | song_table_insert = (""" 81 | INSERT INTO songs (song_id, title, artist_id, year, duration) 82 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 83 | """) 84 | 85 | artist_table_insert = (""" 86 | INSERT INTO artists (artist_id, name, location, lattitude, longitude) 87 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 88 | """) 89 | 90 | 91 | time_table_insert = (""" 92 | INSERT INTO time (start_time, hour, day, week, month, year, weekday) 93 | VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 94 | """) 95 | 96 | # FIND SONGS 97 | 98 | song_select = (""" 99 | SELECT ss.song_id, ss.artist_id FROM songs ss 100 | JOIN artists ars on ss.artist_id = ars.artist_id 101 | WHERE ss.title = %s 102 | AND ars.name = %s 103 | AND ss.duration = %s 104 | ; 105 | """) 106 | 107 | # QUERY LISTS 108 | 109 | create_table_queries = [songplay_table_create, user_table_create, song_table_create, artist_table_create, 
time_table_create] 110 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] -------------------------------------------------------------------------------- /Data-Modeling/Project 2/Project_1B_ Project_Template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part I. ETL Pipeline for Pre-Processing the Files" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## PLEASE RUN THE FOLLOWING CODE FOR PRE-PROCESSING THE FILES" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "#### Import Python packages " 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# Import Python packages \n", 31 | "import pandas as pd\n", 32 | "import cassandra\n", 33 | "import re\n", 34 | "import os\n", 35 | "import glob\n", 36 | "import numpy as np\n", 37 | "import json\n", 38 | "import csv" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "#### Creating list of filepaths to process original event csv data files" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# checking your current working directory\n", 55 | "print(os.getcwd())\n", 56 | "\n", 57 | "# Get your current folder and subfolder event data\n", 58 | "filepath = os.getcwd() + '/event_data'\n", 59 | "\n", 60 | "# Create a for loop to create a list of files and collect each filepath\n", 61 | "for root, dirs, files in os.walk(filepath):\n", 62 | " \n", 63 | "# join the file path and roots with the subdirectories using glob\n", 64 | " file_path_list = glob.glob(os.path.join(root,'*'))\n", 65 | " #print(file_path_list)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "#### Processing the files to create the data file csv that will be used for Apache Casssandra tables" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "# initiating an empty list of rows that will be generated from each file\n", 82 | "full_data_rows_list = [] \n", 83 | " \n", 84 | "# for every filepath in the file path list \n", 85 | "for f in file_path_list:\n", 86 | "\n", 87 | "# reading csv file \n", 88 | " with open(f, 'r', encoding = 'utf8', newline='') as csvfile: \n", 89 | " # creating a csv reader object \n", 90 | " csvreader = csv.reader(csvfile) \n", 91 | " next(csvreader)\n", 92 | " \n", 93 | " # extracting each data row one by one and append it \n", 94 | " for line in csvreader:\n", 95 | " #print(line)\n", 96 | " full_data_rows_list.append(line) \n", 97 | " \n", 98 | "# uncomment the code below if you would like to get total number of rows \n", 99 | "#print(len(full_data_rows_list))\n", 100 | "# uncomment the code below if you would like to check to see what the list of event data rows will look like\n", 101 | "#print(full_data_rows_list)\n", 102 | "\n", 103 | "# creating a smaller event data csv file called event_datafile_full csv that will be used to insert data into the \\\n", 104 | "# Apache Cassandra tables\n", 105 | "csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, skipinitialspace=True)\n", 106 | "\n", 107 | "with 
open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:\n", 108 | " writer = csv.writer(f, dialect='myDialect')\n", 109 | " writer.writerow(['artist','firstName','gender','itemInSession','lastName','length',\\\n", 110 | " 'level','location','sessionId','song','userId'])\n", 111 | " for row in full_data_rows_list:\n", 112 | " if (row[0] == ''):\n", 113 | " continue\n", 114 | " writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "# check the number of rows in your csv file\n", 124 | "with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:\n", 125 | " print(sum(1 for line in f))" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "# Part II. Complete the Apache Cassandra coding portion of your project. \n", 133 | "\n", 134 | "## Now you are ready to work with the CSV file titled event_datafile_new.csv, located within the Workspace directory. The event_datafile_new.csv contains the following columns: \n", 135 | "- artist \n", 136 | "- firstName of user\n", 137 | "- gender of user\n", 138 | "- item number in session\n", 139 | "- last name of user\n", 140 | "- length of the song\n", 141 | "- level (paid or free song)\n", 142 | "- location of the user\n", 143 | "- sessionId\n", 144 | "- song title\n", 145 | "- userId\n", 146 | "\n", 147 | "The image below is a screenshot of what the denormalized data should appear like in the **event_datafile_new.csv** after the code above is run:
    \n", 148 | "\n", 149 | "" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Begin writing your Apache Cassandra code in the cells below" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "#### Creating a Cluster" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "# This should make a connection to a Cassandra instance your local machine \n", 173 | "# (127.0.0.1)\n", 174 | "\n", 175 | "from cassandra.cluster import Cluster\n", 176 | "cluster = Cluster()\n", 177 | "\n", 178 | "# To establish connection and begin executing queries, need a session\n", 179 | "session = cluster.connect()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "#### Create Keyspace" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "# TO-DO: Create a Keyspace " 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "#### Set Keyspace" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# TO-DO: Set KEYSPACE to the keyspace specified above\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "### Now we need to create tables to run the following queries. Remember, with Apache Cassandra you model the database tables on the queries you want to run." 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "## Create queries to ask the following three questions of the data\n", 226 | "\n", 227 | "### 1. Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4\n", 228 | "\n", 229 | "\n", 230 | "### 2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182\n", 231 | " \n", 232 | "\n", 233 | "### 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n", 234 | "\n", 235 | "\n" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 1, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "## TO-DO: Query 1: Give me the artist, song title and song's length in the music app history that was heard during \\\n", 245 | "## sessionId = 338, and itemInSession = 4\n", 246 | "\n", 247 | "\n", 248 | " " 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "scrolled": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "# We have provided part of the code to set up the CSV file. 
Please complete the Apache Cassandra code below#\n", 260 | "file = 'event_datafile_new.csv'\n", 261 | "\n", 262 | "with open(file, encoding = 'utf8') as f:\n", 263 | " csvreader = csv.reader(f)\n", 264 | " next(csvreader) # skip header\n", 265 | " for line in csvreader:\n", 266 | "## TO-DO: Assign the INSERT statements into the `query` variable\n", 267 | " query = \"\"\n", 268 | " query = query + \"\"\n", 269 | " ## TO-DO: Assign which column element should be assigned for each column in the INSERT statement.\n", 270 | " ## For e.g., to INSERT artist_name and user first_name, you would change the code below to `line[0], line[1]`\n", 271 | " session.execute(query, (line[#], line[#]))" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "#### Do a SELECT to verify that the data have been inserted into each table" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "scrolled": true 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "## TO-DO: Add in the SELECT statement to verify the data was entered into the table" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "### COPY AND REPEAT THE ABOVE THREE CELLS FOR EACH OF THE THREE QUESTIONS" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "## TO-DO: Query 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name)\\\n", 306 | "## for userid = 10, sessionid = 182\n", 307 | "\n", 308 | "\n", 309 | " " 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "## TO-DO: Query 3: Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n", 319 | "\n", 320 | "\n", 321 | " " 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "### Drop the tables before closing out the sessions" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 4, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "## TO-DO: Drop the table before closing out the sessions" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "### Close the session and cluster connection¶" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "session.shutdown()\n", 375 | "cluster.shutdown()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [] 391 | } 392 | ], 393 | "metadata": { 394 | "kernelspec": { 395 | "display_name": "Python 3", 396 | "language": "python", 397 | "name": 
"python3" 398 | }, 399 | "language_info": { 400 | "codemirror_mode": { 401 | "name": "ipython", 402 | "version": 3 403 | }, 404 | "file_extension": ".py", 405 | "mimetype": "text/x-python", 406 | "name": "python", 407 | "nbconvert_exporter": "python", 408 | "pygments_lexer": "ipython3", 409 | "version": "3.6.3" 410 | } 411 | }, 412 | "nbformat": 4, 413 | "nbformat_minor": 2 414 | } 415 | -------------------------------------------------------------------------------- /Data-Modeling/Project 2/README.md: -------------------------------------------------------------------------------- 1 | Project: Data Modeling with Cassandra 2 | 3 | Introduction: 4 | 5 | A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. There is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app. My role is to create an Apache Cassandra database which can create queries on song play data to answer the questions. 6 | 7 | Project Overview: 8 | 9 | In this project, I would be applying Data Modeling with Apache Cassandra and complete an ETL pipeline using Python. I am provided with part of the ETL pipeline that transfers data from a set of CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables. 10 | 11 | Datasets: 12 | 13 | For this project, you'll be working with one dataset: event_data. The directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset: 14 | event_data/2018-11-08-events.csv 15 | event_data/2018-11-09-events.csv 16 | 17 | Project Template: 18 | 19 | The project template includes one Jupyter Notebook file, in which: 20 | • you will process the event_datafile_new.csv dataset to create a denormalized dataset 21 | • you will model the data tables keeping in mind the queries you need to run 22 | • you have been provided queries that you will need to model your data tables for 23 | • you will load the data into tables you create in Apache Cassandra and run your queries 24 | 25 | Project Steps: 26 | 27 | Below are steps you can follow to complete each component of this project. 28 | 29 | Modelling your NoSQL Database or Apache Cassandra Database: 30 | 31 | 1. Design tables to answer the queries outlined in the project template 32 | 2. Write Apache Cassandra CREATE KEYSPACE and SET KEYSPACE statements 33 | 3. Develop your CREATE statement for each of the tables to address each question 34 | 4. Load the data with INSERT statement for each of the tables 35 | 5. Include IF NOT EXISTS clauses in your CREATE statements to create tables only if the tables do not already exist. We recommend you also include DROP TABLE statement for each table, this way you can run drop and create tables whenever you want to reset your database and test your ETL pipeline 36 | 6. Test by running the proper select statements with the correct WHERE clause 37 | 38 | Build ETL Pipeline: 39 | 1. Implement the logic in section Part I of the notebook template to iterate through each event file in event_data to process and create a new CSV file in Python 40 | 2. Make necessary edits to Part II of the notebook template to include Apache Cassandra CREATE and INSERT three statements to load processed records into relevant tables in your data model 41 | 3. Test by running three SELECT statements after running the queries on your database 42 | 4. 
Finally, drop the tables and shut down the cluster
43 | 
44 | Files:
45 | 
46 | Project_1B_Project_Template.ipynb: This is the template file provided, to be filled in with the details and the Python code
47 | 
48 | Project_1B.ipynb: This is the final notebook, in which all the queries have been written: it imports the event files, generates a new CSV file by combining all of them into one, and verifies that all the tables have been loaded as required
49 | 
50 | Event_datafile_new.csv: This is the final combination of all the files in the event_data folder
51 | 
52 | Event_Data Folder: Each event file is present separately; all of these files are combined into event_datafile_new.csv
53 | 
54 | 
--------------------------------------------------------------------------------
/Data-Modeling/Project 2/event_data.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 2/event_data.rar
--------------------------------------------------------------------------------
/Data-Modeling/Project 2/images.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 2/images.rar
--------------------------------------------------------------------------------
/Data-Modeling/Readme.md:
--------------------------------------------------------------------------------
1 | Data Modeling with Postgres and Apache Cassandra
2 | 
3 | Exercises and Projects
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineering Nanodegree
2 | 
3 | Projects and resources developed in the [DEND Nanodegree](https://www.udacity.com/course/data-engineer-nanodegree--nd027) from Udacity.
4 | 
5 | ## Project 1: [Relational Databases - Data Modeling with PostgreSQL](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data-Modeling/Project%201).
6 | Developed a relational database using PostgreSQL to model user activity data for a music streaming app. Skills include:
7 | * Created a relational database using PostgreSQL
8 | * Developed a Star Schema database using optimized definitions of Fact and Dimension tables, with normalization of tables.
9 | * Built out an ETL pipeline to optimize queries in order to understand what songs users listen to.
10 | 
11 | Proficiencies include: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization
12 | 
13 | 
14 | ## Project 2: [NoSQL Databases - Data Modeling with Apache Cassandra](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data-Modeling/Project%202).
15 | Designed a NoSQL database using Apache Cassandra based on the original schema outlined in project one. Skills include:
16 | * Created a NoSQL database using Apache Cassandra (both locally and with Docker containers)
17 | * Developed denormalized tables optimized for a specific set of queries and business needs
18 | 
19 | Proficiencies used: Python, Apache Cassandra, Denormalization
20 | 
21 | 
22 | ## Project 3: [Data Warehouse - Amazon Redshift](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Cloud%20Data%20Warehouse/Project%20Data%20Warehouse%20with%20AWS).
23 | Created a data warehouse utilizing Amazon Redshift. Skills include:
24 | * Created a Redshift cluster, IAM roles, and security groups.
25 | * Developed an ETL pipeline that copies data from S3 buckets into staging tables, which are then processed into a star schema.
26 | * Developed a star schema optimized for the specific queries required by the data analytics team.
27 | 
28 | Proficiencies used: Python, Amazon Redshift, aws cli, Amazon SDK, SQL, PostgreSQL
29 | 
30 | ## Project 4: [Data Lake - Spark](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data%20Lakes%20with%20Spark/Project%20Data%20Lake%20with%20Spark)
31 | Scaled up the existing ETL pipeline by moving the data warehouse to a data lake. Skills include:
32 | * Created an EMR Hadoop cluster.
33 | * Further developed the ETL pipeline, copying datasets from S3 buckets, processing the data with Spark, and writing back to S3 using efficient partitioning and Parquet formatting.
34 | * Fast-tracked the data lake buildout using (serverless) AWS Lambda and cataloged tables with AWS Glue Crawler.
35 | 
36 | Technologies used: Spark, S3, EMR, Athena, Amazon Glue, Parquet.
37 | 
38 | ## Project 5: [Data Pipelines - Airflow](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data%20Pipeline%20with%20Airflow/Project%20Data%20Pipeline%20with%20Airflow)
39 | Automated the ETL pipeline and the creation of the data warehouse using Apache Airflow. Skills include:
40 | * Used Airflow to automate ETL pipelines with Python and Amazon Redshift.
41 | * Wrote custom operators to perform tasks such as staging data, filling the data warehouse, and validating the results through data quality checks.
42 | * Transformed data from various sources into a star schema optimized for the analytics team's use cases.
43 | 
44 | Technologies used: Apache Airflow, S3, Amazon Redshift, Python.
45 | 
--------------------------------------------------------------------------------
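The custom operators mentioned under Project 5 include a data quality check. Below is a minimal sketch of how such a check is commonly written as an Airflow 1.x custom operator; the class name, default connection id, and table list are illustrative assumptions, not the repository's actual code:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fail the task if any of the given Redshift tables is empty (illustrative sketch)."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        # Connection id is an assumption; it must exist in the Airflow connections list
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} returned no rows".format(table))
            self.log.info("Data quality check on %s passed with %s records", table, records[0][0])
```

In a DAG, this operator would typically run after the fact and dimension tables are loaded, with `tables` set to the names of the tables to validate.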