├── Cloud Data Warehouse ├── L1 E1 - Step 1_2.ipynb ├── L1 E1 - Step 3.ipynb ├── L1 E1 - Step 4.ipynb ├── L1 E1 - Step 5.ipynb ├── L1 E1 - Step 6.ipynb ├── L1 E2 - CUBE.ipynb ├── L1 E2 - Grouping Sets.ipynb ├── L1 E2 - Roll up and Drill Down.ipynb ├── L1 E2 - Slicing and Dicing.ipynb ├── L1 E3 - Columnar Vs Row Storage.ipynb ├── Project Data Warehouse with AWS │ ├── README.md │ ├── RedShift_Test_Cluster.ipynb │ ├── create_tables.py │ ├── dwh.cfg │ ├── etl.py │ └── sql_queries.py └── Readme.md ├── Data Lakes with Spark ├── Data_Inputs_Outputs.ipynb ├── Data_Wrangling.ipynb ├── Data_Wrangling_Sql.ipynb ├── Dataframe_Quiz.ipynb ├── Exercise 1 - Schema On Read.ipynb ├── Exercise 2 - Advanced Analytics NLP.ipynb ├── Exercise 3 - Data Lake on S3.ipynb ├── Mapreduce_Practice.ipynb ├── Procedural_vs_Functional_Python.ipynb ├── Project Data Lake with Spark │ ├── README.md │ ├── Readme.MD │ ├── dl.cfg │ └── etl.py ├── README.md ├── Spark_Maps_Lazy_Evaluation.ipynb └── Spark_Sql_Quiz.ipynb ├── Data Pipeline with Airflow ├── Data Pipeline - Exercise 1.py ├── Data Pipeline - Exercise 2.py ├── Data Pipeline - Exercise 3.py ├── Data Pipeline - Exercise 4.py ├── Data Pipeline - Exercise 5.py ├── Data Pipeline - Exercise 6.py ├── Data Quality - Exercise 1.py ├── Data Quality - Exercise 2.py ├── Data Quality - Exercise 3.py ├── Data Quality - Exercise 4.py ├── Production Data Pipelines - Exercise 1.py ├── Production Data Pipelines - Exercise 2.py ├── Production Data Pipelines - Exercise 3.py ├── Production Data Pipelines - Exercise 4.py ├── Project Data Pipeline with Airflow │ ├── DAG Graphview.png │ ├── DAG Treeview.PNG │ ├── Readme.MD │ ├── create_tables.sql │ ├── dags │ │ ├── __pycache__ │ │ │ └── udac_example_dag.cpython-36.pyc │ │ └── udac_example_dag.py │ └── plugins │ │ ├── __init__.py │ │ ├── __pycache__ │ │ └── __init__.cpython-36.pyc │ │ ├── helpers │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-36.pyc │ │ │ └── sql_queries.cpython-36.pyc │ │ └── sql_queries.py │ │ └── operators │ │ ├── __init__.py │ │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── data_quality.cpython-36.pyc │ │ ├── load_dimension.cpython-36.pyc │ │ ├── load_fact.cpython-36.pyc │ │ └── stage_redshift.cpython-36.pyc │ │ ├── data_quality.py │ │ ├── load_dimension.py │ │ ├── load_fact.py │ │ └── stage_redshift.py ├── Readme.MD ├── __init__.py ├── dag.py ├── facts_calculator.py ├── has_rows.py ├── s3_to_redshift.py ├── sql_statements.py └── subdag.py ├── Data-Modeling ├── L1 Exercise 1 Creating a Table with Postgres.ipynb ├── L1 Exercise 2 Creating a Table with Apache Cassandra.ipynb ├── L2 Exercise 1 Creating Normalized Tables.ipynb ├── L2 Exercise 2 Creating Denormalized Tables.ipynb ├── L2 Exercise 3 Creating Fact and Dimension Tables with Star Schema.ipynb ├── L3 Exercise 1 Three Queries Three Tables.ipynb ├── L3 Exercise 2 Primary Key.ipynb ├── L3 Exercise 3 Clustering Column.ipynb ├── L3 Exercise 4 Using the WHERE Clause.ipynb ├── Project 1 │ ├── Instructions 1.PNG │ ├── Instructions 2.PNG │ ├── Instructions 3.PNG │ ├── Instructions 4.PNG │ ├── Project 1 Introduction.PNG │ ├── README.md │ ├── create_tables.py │ ├── data.zip │ ├── etl.ipynb │ ├── etl.py │ ├── sql_queries.py │ └── test.ipynb ├── Project 2 │ ├── Project_1B.ipynb │ ├── Project_1B_ Project_Template.ipynb │ ├── README.md │ ├── event_data.rar │ ├── event_datafile_new.csv │ └── images.rar └── Readme.md └── README.md /Cloud Data Warehouse/Project Data Warehouse with AWS/README.md: 
-------------------------------------------------------------------------------- 1 | Introduction 2 | 3 | A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in its app. 4 | 5 | The task is to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of fact and dimension tables so the analytics team can continue finding insights into what songs their users are listening to. 6 | 7 | Project Description 8 | 9 | Apply data warehousing concepts and AWS to build an ETL pipeline for a database hosted on Redshift. The pipeline loads data from S3 into staging tables on Redshift and executes SQL statements that create the fact and dimension tables used for analytics. 10 | 11 | Project Datasets 12 | 13 | Song Data Path --> s3://udacity-dend/song_data 14 | Log Data Path --> s3://udacity-dend/log_data 15 | Log Data JSON Path --> s3://udacity-dend/log_json_path.json 16 | 17 | Song Dataset 18 | 19 | The first dataset is a subset of real data from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 20 | For example: 21 | 22 | song_data/A/B/C/TRABCEI128F424C983.json 23 | song_data/A/A/B/TRAABJL12903CDCF1A.json 24 | 25 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like. 26 | 27 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 28 | 29 | Log Dataset 30 | 31 | The second dataset consists of log files in JSON format. The log files in the dataset are partitioned by year and month. 32 | For example: 33 | 34 | log_data/2018/11/2018-11-12-events.json 35 | log_data/2018/11/2018-11-13-events.json 36 | 37 | And below is an example of what a single log file, 2018-11-13-events.json, looks like. 38 | 39 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender":"F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10} 40 | 41 | Schema for Song Play Analysis 42 | 43 | A star schema is used to optimize queries for song play analysis. 44 | 45 | Fact Table 46 | 47 | songplays - records in event data associated with song plays, i.e.
records with the page NextSong 48 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 49 | 50 | Dimension Tables 51 | 52 | users - users in the app 53 | user_id, first_name, last_name, gender, level 54 | 55 | songs - songs in the music database 56 | song_id, title, artist_id, year, duration 57 | 58 | artists - artists in the music database 59 | artist_id, name, location, latitude, longitude 60 | 61 | time - timestamps of records in songplays broken down into specific units 62 | start_time, hour, day, week, month, year, weekday 63 | 64 | Project Template 65 | 66 | The project template includes four files: 67 | 68 | 1. create_tables.py is where you'll create your fact and dimension tables for the star schema in Redshift. 69 | 70 | 2. etl.py is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift. 71 | 72 | 3. sql_queries.py is where you'll define your SQL statements, which will be imported into the two other files above. 73 | 74 | 4. README.md is where you'll provide discussion on your process and decisions for this ETL pipeline. 75 | 76 | Create Table Schema 77 | 78 | 1. Write a SQL CREATE statement for each of these tables in sql_queries.py 79 | 2. Complete the logic in create_tables.py to connect to the database and create these tables 80 | 3. Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline. 81 | 4. Launch a Redshift cluster and create an IAM role that has read access to S3. 82 | 5. Add the Redshift database and IAM role info to dwh.cfg. 83 | 6. Test by running create_tables.py and checking the table schemas in your Redshift database. 84 | 85 | Build ETL Pipeline 86 | 87 | 1. Implement the logic in etl.py to load data from S3 to staging tables on Redshift. 88 | 2. Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift. 89 | 3. Test by running etl.py after running create_tables.py, then run the analytic queries on your Redshift database to compare your results with the expected results. 90 | 4. Delete your Redshift cluster when finished. 91 | 92 | Final Instructions 93 | 94 | 1. Import all the necessary libraries 95 | 2. Write the configuration for the AWS cluster and store the important parameters in a separate file (dwh.cfg) 96 | 3. Configure boto3, the AWS SDK for Python 97 | 4. Using the bucket, check whether the log files and song data files are present 98 | 5. Create an IAM role, assign the appropriate permissions, and create the Redshift cluster 99 | 6. Get the values of the cluster endpoint and role ARN and put them into the main configuration file 100 | 7. Authorize the security group for inbound access on the default TCP port and IP address 101 | 8. Set up and test the database connection 102 | 9. Go to a terminal and run the command "python create_tables.py" and then "python etl.py" 103 | 10. The run should take around 4-10 minutes in total 104 | 11. Then go back to the Jupyter notebook to confirm everything is working fine 105 | 12. Count all the records in the tables to verify the load 106 | 13.
Now can delete the cluster, roles and assigned permission -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/create_tables.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import create_table_queries, drop_table_queries 4 | 5 | 6 | def drop_tables(cur, conn): 7 | for query in drop_table_queries: 8 | cur.execute(query) 9 | conn.commit() 10 | 11 | 12 | def create_tables(cur, conn): 13 | for query in create_table_queries: 14 | cur.execute(query) 15 | conn.commit() 16 | 17 | 18 | def main(): 19 | config = configparser.ConfigParser() 20 | config.read('dwh.cfg') 21 | 22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 23 | cur = conn.cursor() 24 | 25 | drop_tables(cur, conn) 26 | create_tables(cur, conn) 27 | 28 | conn.close() 29 | 30 | 31 | if __name__ == "__main__": 32 | main() -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/dwh.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | KEY= 3 | SECRET= 4 | 5 | [DWH] 6 | DWH_CLUSTER_TYPE=multi-node 7 | DWH_NUM_NODES=4 8 | DWH_NODE_TYPE=dc2.large 9 | 10 | DWH_IAM_ROLE_NAME=dwhRole 11 | DWH_CLUSTER_IDENTIFIER=dwhCluster 12 | DWH_DB=dwh 13 | DWH_DB_USER=dwhuser 14 | DWH_DB_PASSWORD=Passw0rd 15 | DWH_PORT=5439 16 | 17 | [CLUSTER] 18 | HOST= 19 | DB_NAME=dwh 20 | DB_USER=dwhuser 21 | DB_PASSWORD=Passw0rd 22 | DB_PORT=5439 23 | 24 | [IAM_ROLE] 25 | ARN= 26 | 27 | [S3] 28 | LOG_DATA='s3://udacity-dend/log_data' 29 | LOG_JSONPATH='s3://udacity-dend/log_json_path.json' 30 | SONG_DATA='s3://udacity-dend/song_data' -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import copy_table_queries, insert_table_queries 4 | 5 | 6 | def load_staging_tables(cur, conn): 7 | for query in copy_table_queries: 8 | cur.execute(query) 9 | conn.commit() 10 | 11 | 12 | def insert_tables(cur, conn): 13 | for query in insert_table_queries: 14 | cur.execute(query) 15 | conn.commit() 16 | 17 | 18 | def main(): 19 | config = configparser.ConfigParser() 20 | config.read('dwh.cfg') 21 | 22 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 23 | cur = conn.cursor() 24 | 25 | load_staging_tables(cur, conn) 26 | insert_tables(cur, conn) 27 | 28 | conn.close() 29 | 30 | 31 | if __name__ == "__main__": 32 | main() -------------------------------------------------------------------------------- /Cloud Data Warehouse/Project Data Warehouse with AWS/sql_queries.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | 3 | 4 | # CONFIG 5 | config = configparser.ConfigParser() 6 | config.read('dwh.cfg') 7 | 8 | # GLOBAL VARIABLES 9 | LOG_DATA = config.get("S3","LOG_DATA") 10 | LOG_PATH = config.get("S3", "LOG_JSONPATH") 11 | SONG_DATA = config.get("S3", "SONG_DATA") 12 | IAM_ROLE = config.get("IAM_ROLE","ARN") 13 | 14 | # DROP TABLES 15 | 16 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events" 17 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs" 
18 | songplay_table_drop = "DROP TABLE IF EXISTS fact_songplay" 19 | user_table_drop = "DROP TABLE IF EXISTS dim_user" 20 | song_table_drop = "DROP TABLE IF EXISTS dim_song" 21 | artist_table_drop = "DROP TABLE IF EXISTS dim_artist" 22 | time_table_drop = "DROP TABLE IF EXISTS dim_time" 23 | 24 | # CREATE TABLES 25 | 26 | staging_events_table_create= (""" 27 | CREATE TABLE IF NOT EXISTS staging_events 28 | ( 29 | artist VARCHAR, 30 | auth VARCHAR, 31 | firstName VARCHAR, 32 | gender VARCHAR, 33 | itemInSession INTEGER, 34 | lastName VARCHAR, 35 | length FLOAT, 36 | level VARCHAR, 37 | location VARCHAR, 38 | method VARCHAR, 39 | page VARCHAR, 40 | registration BIGINT, 41 | sessionId INTEGER, 42 | song VARCHAR, 43 | status INTEGER, 44 | ts TIMESTAMP, 45 | userAgent VARCHAR, 46 | userId INTEGER 47 | ); 48 | """) 49 | 50 | staging_songs_table_create = (""" 51 | CREATE TABLE IF NOT EXISTS staging_songs 52 | ( 53 | song_id VARCHAR, 54 | num_songs INTEGER, 55 | title VARCHAR, 56 | artist_name VARCHAR, 57 | artist_latitude FLOAT, 58 | year INTEGER, 59 | duration FLOAT, 60 | artist_id VARCHAR, 61 | artist_longitude FLOAT, 62 | artist_location VARCHAR 63 | ); 64 | """) 65 | 66 | songplay_table_create = (""" 67 | CREATE TABLE IF NOT EXISTS fact_songplay 68 | ( 69 | songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY sortkey, 70 | start_time TIMESTAMP, 71 | user_id INTEGER, 72 | level VARCHAR, 73 | song_id VARCHAR, 74 | artist_id VARCHAR, 75 | session_id INTEGER, 76 | location VARCHAR, 77 | user_agent VARCHAR 78 | ); 79 | """) 80 | 81 | user_table_create = (""" 82 | CREATE TABLE IF NOT EXISTS dim_user 83 | ( 84 | user_id INTEGER PRIMARY KEY distkey, 85 | first_name VARCHAR, 86 | last_name VARCHAR, 87 | gender VARCHAR, 88 | level VARCHAR 89 | ); 90 | """) 91 | 92 | song_table_create = (""" 93 | CREATE TABLE IF NOT EXISTS dim_song 94 | ( 95 | song_id VARCHAR PRIMARY KEY, 96 | title VARCHAR, 97 | artist_id VARCHAR distkey, 98 | year INTEGER, 99 | duration FLOAT 100 | ); 101 | """) 102 | 103 | artist_table_create = (""" 104 | CREATE TABLE IF NOT EXISTS dim_artist 105 | ( 106 | artist_id VARCHAR PRIMARY KEY distkey, 107 | name VARCHAR, 108 | location VARCHAR, 109 | latitude FLOAT, 110 | longitude FLOAT 111 | ); 112 | """) 113 | 114 | time_table_create = (""" 115 | CREATE TABLE IF NOT EXISTS dim_time 116 | ( 117 | start_time TIMESTAMP PRIMARY KEY sortkey distkey, 118 | hour INTEGER, 119 | day INTEGER, 120 | week INTEGER, 121 | month INTEGER, 122 | year INTEGER, 123 | weekday INTEGER 124 | ); 125 | """) 126 | 127 | # STAGING TABLES 128 | 129 | staging_events_copy = (""" 130 | COPY staging_events FROM {} 131 | CREDENTIALS 'aws_iam_role={}' 132 | COMPUPDATE OFF region 'us-west-2' 133 | TIMEFORMAT as 'epochmillisecs' 134 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL 135 | FORMAT AS JSON {}; 136 | """).format(LOG_DATA, IAM_ROLE, LOG_PATH) 137 | 138 | staging_songs_copy = (""" 139 | COPY staging_songs FROM {} 140 | CREDENTIALS 'aws_iam_role={}' 141 | COMPUPDATE OFF region 'us-west-2' 142 | FORMAT AS JSON 'auto' 143 | TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL; 144 | """).format(SONG_DATA, IAM_ROLE) 145 | 146 | # FINAL TABLES 147 | 148 | songplay_table_insert = (""" 149 | INSERT INTO fact_songplay(start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) 150 | SELECT DISTINCT to_timestamp(to_char(se.ts, '9999-99-99 99:99:99'),'YYYY-MM-DD HH24:MI:SS'), 151 | se.userId as user_id, 152 | se.level as level, 153 | ss.song_id as song_id, 154 | ss.artist_id as artist_id, 155 | se.sessionId as session_id, 156 
| se.location as location, 157 | se.userAgent as user_agent 158 | FROM staging_events se 159 | JOIN staging_songs ss ON se.song = ss.title AND se.artist = ss.artist_name; 160 | """) 161 | 162 | user_table_insert = (""" 163 | INSERT INTO dim_user(user_id, first_name, last_name, gender, level) 164 | SELECT DISTINCT userId as user_id, 165 | firstName as first_name, 166 | lastName as last_name, 167 | gender as gender, 168 | level as level 169 | FROM staging_events 170 | where userId IS NOT NULL; 171 | """) 172 | 173 | song_table_insert = (""" 174 | INSERT INTO dim_song(song_id, title, artist_id, year, duration) 175 | SELECT DISTINCT song_id as song_id, 176 | title as title, 177 | artist_id as artist_id, 178 | year as year, 179 | duration as duration 180 | FROM staging_songs 181 | WHERE song_id IS NOT NULL; 182 | """) 183 | 184 | artist_table_insert = (""" 185 | INSERT INTO dim_artist(artist_id, name, location, latitude, longitude) 186 | SELECT DISTINCT artist_id as artist_id, 187 | artist_name as name, 188 | artist_location as location, 189 | artist_latitude as latitude, 190 | artist_longitude as longitude 191 | FROM staging_songs 192 | where artist_id IS NOT NULL; 193 | """) 194 | 195 | time_table_insert = (""" 196 | INSERT INTO dim_time(start_time, hour, day, week, month, year, weekday) 197 | SELECT distinct ts, 198 | EXTRACT(hour from ts), 199 | EXTRACT(day from ts), 200 | EXTRACT(week from ts), 201 | EXTRACT(month from ts), 202 | EXTRACT(year from ts), 203 | EXTRACT(weekday from ts) 204 | FROM staging_events 205 | WHERE ts IS NOT NULL; 206 | """) 207 | 208 | # QUERY LISTS 209 | 210 | create_table_queries = [staging_events_table_create, staging_songs_table_create, songplay_table_create, user_table_create, song_table_create, artist_table_create, time_table_create] 211 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 212 | copy_table_queries = [staging_events_copy, staging_songs_copy] 213 | insert_table_queries = [songplay_table_insert, user_table_insert, song_table_insert, artist_table_insert, time_table_insert] 214 | -------------------------------------------------------------------------------- /Cloud Data Warehouse/Readme.md: -------------------------------------------------------------------------------- 1 | This folder will contain the exercise files and details for Udacity Module - Cloud Data Warehouse 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Dataframe_Quiz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Answer Key to the Data Frame Programming Quiz\n", 8 | "\n", 9 | "Helpful resources:\n", 10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from pyspark.sql import SparkSession\n", 20 | "from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n", 21 | "from pyspark.sql.functions import sum as Fsum\n", 22 | "from pyspark.sql.window import Window\n", 23 | "from pyspark.sql.types import IntegerType" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# 1) import any other libraries you 
might need\n", 33 | "# 2) instantiate a Spark session \n", 34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n", 35 | "# 4) write code to answer the quiz questions \n", 36 | "\n", 37 | "spark = SparkSession \\\n", 38 | " .builder \\\n", 39 | " .appName(\"Data Frames practice\") \\\n", 40 | " .getOrCreate()\n", 41 | "\n", 42 | "df = spark.read.json(\"data/sparkify_log_small.json\")" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "# Question 1\n", 50 | "\n", 51 | "Which page did user id \"\" (empty string) NOT visit?" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "root\n", 64 | " |-- artist: string (nullable = true)\n", 65 | " |-- auth: string (nullable = true)\n", 66 | " |-- firstName: string (nullable = true)\n", 67 | " |-- gender: string (nullable = true)\n", 68 | " |-- itemInSession: long (nullable = true)\n", 69 | " |-- lastName: string (nullable = true)\n", 70 | " |-- length: double (nullable = true)\n", 71 | " |-- level: string (nullable = true)\n", 72 | " |-- location: string (nullable = true)\n", 73 | " |-- method: string (nullable = true)\n", 74 | " |-- page: string (nullable = true)\n", 75 | " |-- registration: long (nullable = true)\n", 76 | " |-- sessionId: long (nullable = true)\n", 77 | " |-- song: string (nullable = true)\n", 78 | " |-- status: long (nullable = true)\n", 79 | " |-- ts: long (nullable = true)\n", 80 | " |-- userAgent: string (nullable = true)\n", 81 | " |-- userId: string (nullable = true)\n", 82 | "\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "df.printSchema()" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Settings\n", 100 | "Logout\n", 101 | "Submit Upgrade\n", 102 | "Error\n", 103 | "NextSong\n", 104 | "Submit Downgrade\n", 105 | "Downgrade\n", 106 | "Upgrade\n", 107 | "Save Settings\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "# filter for users with blank user id\n", 113 | "blank_pages = df.filter(df.userId == '') \\\n", 114 | " .select(col('page') \\\n", 115 | " .alias('blank_pages')) \\\n", 116 | " .dropDuplicates()\n", 117 | "\n", 118 | "# get a list of possible pages that could be visited\n", 119 | "all_pages = df.select('page').dropDuplicates()\n", 120 | "\n", 121 | "# find values in all_pages that are not in blank_pages\n", 122 | "# these are the pages that the blank user did not go to\n", 123 | "for row in set(all_pages.collect()) - set(blank_pages.collect()):\n", 124 | " print(row.page)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "# Question 2 - Reflect\n", 132 | "\n", 133 | "What type of user does the empty string user id most likely refer to?\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Perhaps it represents users who have not signed up yet or who are signed out and are about to log in." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "# Question 3\n", 148 | "\n", 149 | "How many female users do we have in the data set?" 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 5, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "462" 161 | ] 162 | }, 163 | "execution_count": 5, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "df.filter(df.gender == 'F') \\\n", 170 | " .select('userId', 'gender') \\\n", 171 | " .dropDuplicates() \\\n", 172 | " .count()" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "# Question 4\n", 180 | "\n", 181 | "How many songs were played from the most played artist?" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 6, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "+--------+-----------+\n", 194 | "| Artist|Artistcount|\n", 195 | "+--------+-----------+\n", 196 | "|Coldplay| 83|\n", 197 | "+--------+-----------+\n", 198 | "only showing top 1 row\n", 199 | "\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "df.filter(df.page == 'NextSong') \\\n", 205 | " .select('Artist') \\\n", 206 | " .groupBy('Artist') \\\n", 207 | " .agg({'Artist':'count'}) \\\n", 208 | " .withColumnRenamed('count(Artist)', 'Artistcount') \\\n", 209 | " .sort(desc('Artistcount')) \\\n", 210 | " .show(1)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "# Question 5 (challenge)\n", 218 | "\n", 219 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n", 220 | "\n" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 7, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "+------------------+\n", 233 | "|avg(count(period))|\n", 234 | "+------------------+\n", 235 | "| 6.898347107438017|\n", 236 | "+------------------+\n", 237 | "\n" 238 | ] 239 | } 240 | ], 241 | "source": [ 242 | "function = udf(lambda ishome : int(ishome == 'Home'), IntegerType())\n", 243 | "\n", 244 | "user_window = Window \\\n", 245 | " .partitionBy('userID') \\\n", 246 | " .orderBy(desc('ts')) \\\n", 247 | " .rangeBetween(Window.unboundedPreceding, 0)\n", 248 | "\n", 249 | "cusum = df.filter((df.page == 'NextSong') | (df.page == 'Home')) \\\n", 250 | " .select('userID', 'page', 'ts') \\\n", 251 | " .withColumn('homevisit', function(col('page'))) \\\n", 252 | " .withColumn('period', Fsum('homevisit').over(user_window))\n", 253 | "\n", 254 | "cusum.filter((cusum.page == 'NextSong')) \\\n", 255 | " .groupBy('userID', 'period') \\\n", 256 | " .agg({'period':'count'}) \\\n", 257 | " .agg({'count(period)':'avg'}).show()" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [] 266 | } 267 | ], 268 | "metadata": { 269 | "kernelspec": { 270 | "display_name": "Python 3", 271 | "language": "python", 272 | "name": "python3" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": { 276 | "name": "ipython", 277 | "version": 3 278 | }, 279 | "file_extension": ".py", 280 | "mimetype": "text/x-python", 281 | "name": "python", 282 | "nbconvert_exporter": "python", 283 | "pygments_lexer": "ipython3", 284 | "version": "3.6.3" 285 | } 286 | }, 287 | "nbformat": 4, 288 | "nbformat_minor": 2 289 | } 290 | 
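A minimal follow-up sketch, reusing the cusum DataFrame defined in the cell above: the quiz asks for the Question 5 answer rounded to the closest integer, while the last cell stops at the raw average, so the value can be collected into Python and rounded there.

avg_songs = cusum.filter(cusum.page == 'NextSong') \
    .groupBy('userID', 'period') \
    .agg({'period': 'count'}) \
    .agg({'count(period)': 'avg'}) \
    .collect()[0][0]

print(round(avg_songs))  # the average of roughly 6.898 rounds to 7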
-------------------------------------------------------------------------------- /Data Lakes with Spark/Procedural_vs_Functional_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Procedural Programming\n", 8 | "\n", 9 | "This notebook contains the code from the previous screencast. The code counts the number of times a song appears in the log_of_songs variable. \n", 10 | "\n", 11 | "You'll notice that the first time you run `count_plays(\"Despacito\")`, you get the correct count. However, when you run the same code again `count_plays(\"Despacito\")`, the results are no longer correct.This is because the global variable `play_count` stores the results outside of the count_plays function. \n", 12 | "\n", 13 | "\n", 14 | "# Instructions\n", 15 | "\n", 16 | "Run the code cells in this notebook to see the problem with " 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "log_of_songs = [\n", 26 | " \"Despacito\",\n", 27 | " \"Nice for what\",\n", 28 | " \"No tears left to cry\",\n", 29 | " \"Despacito\",\n", 30 | " \"Havana\",\n", 31 | " \"In my feelings\",\n", 32 | " \"Nice for what\",\n", 33 | " \"Despacito\",\n", 34 | " \"All the stars\"\n", 35 | "]" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "play_count = 0" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "def count_plays(song_title):\n", 54 | " global play_count\n", 55 | " for song in log_of_songs:\n", 56 | " if song == song_title:\n", 57 | " play_count = play_count + 1\n", 58 | " return play_count" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "3" 70 | ] 71 | }, 72 | "execution_count": 4, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "count_plays(\"Despacito\")" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "6" 90 | ] 91 | }, 92 | "execution_count": 5, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "count_plays(\"Despacito\")" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "# How to Solve the Issue\n", 106 | "\n", 107 | "How might you solve this issue? You could get rid of the global variable and instead use play_count as an input to the function:\n", 108 | "\n", 109 | "```python\n", 110 | "def count_plays(song_title, play_count):\n", 111 | " for song in log_of_songs:\n", 112 | " if song == song_title:\n", 113 | " play_count = play_count + 1\n", 114 | " return play_count\n", 115 | "\n", 116 | "```\n", 117 | "\n", 118 | "How would this work with parallel programming? Spark splits up data onto multiple machines. If your songs list were split onto two machines, Machine A would first need to finish counting, and then return its own result to Machine B. And then Machine B could use the output from Machine A and add to the count.\n", 119 | "\n", 120 | "However, that isn't parallel computing. Machine B would have to wait until Machine A finishes. 
You'll see in the next parts of the lesson how Spark solves this issue with a functional programming paradigm.\n", 121 | "\n", 122 | "In Spark, if your data is split onto two different machines, machine A will run a function to count how many times 'Despacito' appears on machine A. Machine B will simultaneously run a function to count how many times 'Despacito' appears on machine B. After they finish counting individually, they'll combine their results together. You'll see how this works in the next parts of the lesson." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 3", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.6.3" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 2 154 | } 155 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/README.md: -------------------------------------------------------------------------------- 1 | Project: Data Lake 2 | 3 | Introduction 4 | 5 | A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app 6 | 7 | 8 | Project Description 9 | 10 | Apply the knowledge of Spark and Data Lakes to build and ETL pipeline for a Data Lake hosted on Amazon S3 11 | 12 | In this task, we have to build an ETL Pipeline that extracts their data from S3 and process them using Spark and then load back into S3 in a set of Fact and Dimension Tables. This will allow their analytics team to continue finding insights in what songs their users are listening. Will have to deploy this Spark process on a Cluster using AWS 13 | 14 | Project Datasets 15 | 16 | Song Data Path --> s3://udacity-dend/song_data Log Data Path --> s3://udacity-dend/log_data Log Data JSON Path --> s3://udacity-dend/log_json_path.json 17 | 18 | Song Dataset 19 | 20 | The first dataset is a subset of real data from the Million Song Dataset(https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example: 21 | 22 | song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json 23 | 24 | And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like. 25 | 26 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 27 | 28 | Log Dataset 29 | 30 | The second dataset consists of log files in JSON format. The log files in the dataset with are partitioned by year and month. 
For example: 31 | 32 | log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json 33 | 34 | And below is an example of what a single log file, 2018-11-13-events.json, looks like. 35 | 36 | {"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender":"F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10} 37 | 38 | Schema for Song Play Analysis 39 | 40 | A star schema is used to optimize queries for song play analysis. 41 | 42 | Fact Table 43 | 44 | songplays - records in event data associated with song plays, i.e. records with the page NextSong: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 45 | 46 | Dimension Tables 47 | 48 | users - users in the app: user_id, first_name, last_name, gender, level 49 | 50 | songs - songs in the music database: song_id, title, artist_id, year, duration 51 | 52 | artists - artists in the music database: artist_id, name, location, latitude, longitude 53 | 54 | time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday 55 | 56 | Project Template 57 | 58 | The project template includes three files: 59 | 60 | 1. etl.py reads data from S3, processes that data using Spark, and writes it back to S3 61 | 62 | 2. dl.cfg contains the AWS credentials 63 | 64 | 3. README.md provides discussion on the process and decisions 65 | 66 | ETL Pipeline 67 | 68 | 1. Load the credentials from dl.cfg 69 | 2. Load the data, which is stored in JSON files (song data and log data), from S3 70 | 3. Process these JSON files with Spark 71 | 4. Generate a set of fact and dimension tables 72 | 5. Load the resulting tables back to S3 73 | 74 | Final Instructions 75 | 76 | 1. Write the correct AWS keys in dl.cfg 77 | 2. Open a terminal and run the command "python etl.py" 78 | 3.
Should take about 2-4 mins in total 79 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/Readme.MD: -------------------------------------------------------------------------------- 1 | Hello Testing 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/dl.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | AWS_ACCESS_KEY_ID= 3 | AWS_SECRET_ACCESS_KEY= -------------------------------------------------------------------------------- /Data Lakes with Spark/Project Data Lake with Spark/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | from datetime import datetime 3 | import os 4 | from pyspark.sql import SparkSession 5 | from pyspark.sql.functions import udf, col 6 | from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format 7 | 8 | 9 | config = configparser.ConfigParser() 10 | config.read_file(open('dl.cfg')) 11 | 12 | os.environ['AWS_ACCESS_KEY_ID']=config.get('AWS','AWS_ACCESS_KEY_ID') 13 | os.environ['AWS_SECRET_ACCESS_KEY']=config.get('AWS','AWS_SECRET_ACCESS_KEY') 14 | 15 | 16 | def create_spark_session(): 17 | spark = SparkSession \ 18 | .builder \ 19 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \ 20 | .getOrCreate() 21 | return spark 22 | 23 | 24 | def process_song_data(spark, input_data, output_data): 25 | """ 26 | Description: This function loads song_data from S3 and processes it by extracting the songs and artist tables 27 | and then again loaded back to S3 28 | 29 | Parameters: 30 | spark : this is the Spark Session 31 | input_data : the location of song_data from where the file is load to process 32 | output_data : the location where after processing the results will be stored 33 | 34 | """ 35 | # get filepath to song data file 36 | song_data = input_data + 'song_data/*/*/*/*.json' 37 | 38 | # read song data file 39 | df = spark.read.json(song_data) 40 | 41 | # created song view to write SQL Queries 42 | df.createOrReplaceTempView("song_data_table") 43 | 44 | # extract columns to create songs table 45 | songs_table = spark.sql(""" 46 | SELECT sdtn.song_id, 47 | sdtn.title, 48 | sdtn.artist_id, 49 | sdtn.year, 50 | sdtn.duration 51 | FROM song_data_table sdtn 52 | WHERE song_id IS NOT NULL 53 | """) 54 | 55 | # write songs table to parquet files partitioned by year and artist 56 | songs_table.write.mode('overwrite').partitionBy("year", "artist_id").parquet(output_data+'songs_table/') 57 | 58 | # extract columns to create artists table 59 | artists_table = spark.sql(""" 60 | SELECT DISTINCT arti.artist_id, 61 | arti.artist_name, 62 | arti.artist_location, 63 | arti.artist_latitude, 64 | arti.artist_longitude 65 | FROM song_data_table arti 66 | WHERE arti.artist_id IS NOT NULL 67 | """) 68 | 69 | # write artists table to parquet files 70 | artists_table.write.mode('overwrite').parquet(output_data+'artists_table/') 71 | 72 | 73 | def process_log_data(spark, input_data, output_data): 74 | """ 75 | Description: This function loads log_data from S3 and processes it by extracting the songs and artist tables 76 | and then again loaded back to S3. 
Also output from previous function is used in by spark.read.json command 77 | 78 | Parameters: 79 | spark : this is the Spark Session 80 | input_data : the location of song_data from where the file is load to process 81 | output_data : the location where after processing the results will be stored 82 | 83 | """ 84 | # get filepath to log data file 85 | log_path = input_data + 'log_data/*.json' 86 | 87 | # read log data file 88 | df = spark.read.json(log_path) 89 | 90 | # filter by actions for song plays 91 | df = df.filter(df.page == 'NextSong') 92 | 93 | # created log view to write SQL Queries 94 | df.createOrReplaceTempView("log_data_table") 95 | 96 | # extract columns for users table 97 | users_table = spark.sql(""" 98 | SELECT DISTINCT userT.userId as user_id, 99 | userT.firstName as first_name, 100 | userT.lastName as last_name, 101 | userT.gender as gender, 102 | userT.level as level 103 | FROM log_data_table userT 104 | WHERE userT.userId IS NOT NULL 105 | """) 106 | 107 | # write users table to parquet files 108 | users_table.write.mode('overwrite').parquet(output_data+'users_table/') 109 | 110 | # create timestamp column from original timestamp column 111 | # get_timestamp = udf() 112 | # df = 113 | 114 | # create datetime column from original timestamp column 115 | # get_datetime = udf() 116 | # df = 117 | 118 | # extract columns to create time table 119 | time_table = spark.sql(""" 120 | SELECT 121 | A.start_time_sub as start_time, 122 | hour(A.start_time_sub) as hour, 123 | dayofmonth(A.start_time_sub) as day, 124 | weekofyear(A.start_time_sub) as week, 125 | month(A.start_time_sub) as month, 126 | year(A.start_time_sub) as year, 127 | dayofweek(A.start_time_sub) as weekday 128 | FROM 129 | (SELECT to_timestamp(timeSt.ts/1000) as start_time_sub 130 | FROM log_data_table timeSt 131 | WHERE timeSt.ts IS NOT NULL 132 | ) A 133 | """) 134 | 135 | # write time table to parquet files partitioned by year and month 136 | time_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'time_table/') 137 | 138 | # read in song data to use for songplays table 139 | song_df = spark.read.parquet(output_data+'songs_table/') 140 | 141 | # read song data file 142 | # song_df_upd = spark.read.json(input_data + 'song_data/*/*/*/*.json') 143 | # created song view to write SQL Queries 144 | # song_df_upd.createOrReplaceTempView("song_data_table") 145 | 146 | 147 | 148 | # extract columns from joined song and log datasets to create songplays table 149 | songplays_table = spark.sql(""" 150 | SELECT monotonically_increasing_id() as songplay_id, 151 | to_timestamp(logT.ts/1000) as start_time, 152 | month(to_timestamp(logT.ts/1000)) as month, 153 | year(to_timestamp(logT.ts/1000)) as year, 154 | logT.userId as user_id, 155 | logT.level as level, 156 | songT.song_id as song_id, 157 | songT.artist_id as artist_id, 158 | logT.sessionId as session_id, 159 | logT.location as location, 160 | logT.userAgent as user_agent 161 | 162 | FROM log_data_table logT 163 | JOIN song_data_table songT on logT.artist = songT.artist_name and logT.song = songT.title 164 | """) 165 | 166 | # write songplays table to parquet files partitioned by year and month 167 | songplays_table.write.mode('overwrite').partitionBy("year", "month").parquet(output_data+'songplays_table/') 168 | 169 | 170 | def main(): 171 | spark = create_spark_session() 172 | 173 | input_data = "s3a://udacity-dend/" 174 | output_data = "s3a://udacity-dend/dloutput/" 175 | 176 | #input_data = "./" 177 | #output_data = "./dloutput/" 178 | 
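    # Note: process_song_data() must run before process_log_data() -- the songplays
    # query in process_log_data() joins against the "song_data_table" temp view
    # registered while the song data is processed, and it also reads back the
    # songs_table parquet output written by process_song_data().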
179 | process_song_data(spark, input_data, output_data) 180 | process_log_data(spark, input_data, output_data) 181 | 182 | 183 | if __name__ == "__main__": 184 | main() 185 | -------------------------------------------------------------------------------- /Data Lakes with Spark/README.md: -------------------------------------------------------------------------------- 1 | Data Lakes with Spark Exercise Files 2 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Spark_Maps_Lazy_Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Maps\n", 8 | "\n", 9 | "In Spark, maps take data as input and then transform that data with whatever function you put in the map. They are like directions for the data telling how each input should get to the output.\n", 10 | "\n", 11 | "The first code cell creates a SparkContext object. With the SparkContext, you can input a dataset and parallelize the data across a cluster (since you are currently using Spark in local mode on a single machine, technically the dataset isn't distributed yet).\n", 12 | "\n", 13 | "Run the code cell below to instantiate a SparkContext object and then read in the log_of_songs list into Spark. " 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 1, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "### \n", 23 | "# You might have noticed this code in the screencast.\n", 24 | "#\n", 25 | "# import findspark\n", 26 | "# findspark.init('spark-2.3.2-bin-hadoop2.7')\n", 27 | "#\n", 28 | "# The findspark Python module makes it easier to install\n", 29 | "# Spark in local mode on your computer. This is convenient\n", 30 | "# for practicing Spark syntax locally. \n", 31 | "# However, the workspaces already have Spark installed and you do not\n", 32 | "# need to use the findspark module\n", 33 | "#\n", 34 | "###\n", 35 | "\n", 36 | "import pyspark\n", 37 | "sc = pyspark.SparkContext(appName=\"maps_and_lazy_evaluation_example\")\n", 38 | "\n", 39 | "log_of_songs = [\n", 40 | " \"Despacito\",\n", 41 | " \"Nice for what\",\n", 42 | " \"No tears left to cry\",\n", 43 | " \"Despacito\",\n", 44 | " \"Havana\",\n", 45 | " \"In my feelings\",\n", 46 | " \"Nice for what\",\n", 47 | " \"despacito\",\n", 48 | " \"All the stars\"\n", 49 | "]\n", 50 | "\n", 51 | "# parallelize the log_of_songs to use with Spark\n", 52 | "distributed_song_log = sc.parallelize(log_of_songs)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "This next code cell defines a function that converts a song title to lowercase. Then there is an example converting the word \"Havana\" to \"havana\"." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "'havana'" 71 | ] 72 | }, 73 | "execution_count": 2, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "def convert_song_to_lowercase(song):\n", 80 | " return song.lower()\n", 81 | "\n", 82 | "convert_song_to_lowercase(\"Havana\")" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "The following code cells demonstrate how to apply this function using a map step. The map step will go through each song in the list and apply the convert_song_to_lowercase() function. 
" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "PythonRDD[1] at RDD at PythonRDD.scala:53" 101 | ] 102 | }, 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "distributed_song_log.map(convert_song_to_lowercase)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "You'll notice that this code cell ran quite quickly. This is because of lazy evaluation. Spark does not actually execute the map step unless it needs to.\n", 117 | "\n", 118 | "\"RDD\" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data. \n", 119 | "\n", 120 | "To get Spark to actually run the map step, you need to use an \"action\". One available action is the collect method. The collect() method takes the results from all of the clusters and \"collects\" them into a single list on the master node." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 4, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "data": { 130 | "text/plain": [ 131 | "['despacito',\n", 132 | " 'nice for what',\n", 133 | " 'no tears left to cry',\n", 134 | " 'despacito',\n", 135 | " 'havana',\n", 136 | " 'in my feelings',\n", 137 | " 'nice for what',\n", 138 | " 'despacito',\n", 139 | " 'all the stars']" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "distributed_song_log.map(convert_song_to_lowercase).collect()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Note as well that Spark is not changing the original data set: Spark is merely making a copy. You can see this by running collect() on the original dataset." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "['Despacito',\n", 167 | " 'Nice for what',\n", 168 | " 'No tears left to cry',\n", 169 | " 'Despacito',\n", 170 | " 'Havana',\n", 171 | " 'In my feelings',\n", 172 | " 'Nice for what',\n", 173 | " 'despacito',\n", 174 | " 'All the stars']" 175 | ] 176 | }, 177 | "execution_count": 5, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "distributed_song_log.collect()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions as well as built-in Python functions like string.lower(). \n", 191 | "\n", 192 | "Anonymous functions are actually a Python feature for writing functional style programs." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 6, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "['despacito',\n", 204 | " 'nice for what',\n", 205 | " 'no tears left to cry',\n", 206 | " 'despacito',\n", 207 | " 'havana',\n", 208 | " 'in my feelings',\n", 209 | " 'nice for what',\n", 210 | " 'despacito',\n", 211 | " 'all the stars']" 212 | ] 213 | }, 214 | "execution_count": 6, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "distributed_song_log.map(lambda song: song.lower()).collect()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 7, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "['despacito',\n", 232 | " 'nice for what',\n", 233 | " 'no tears left to cry',\n", 234 | " 'despacito',\n", 235 | " 'havana',\n", 236 | " 'in my feelings',\n", 237 | " 'nice for what',\n", 238 | " 'despacito',\n", 239 | " 'all the stars']" 240 | ] 241 | }, 242 | "execution_count": 7, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "distributed_song_log.map(lambda x: x.lower()).collect()" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 9, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "9" 260 | ] 261 | }, 262 | "execution_count": 9, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | } 266 | ], 267 | "source": [ 268 | "distributed_song_log.map(lambda x: x.lower()).count()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python 3", 282 | "language": "python", 283 | "name": "python3" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.3" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /Data Lakes with Spark/Spark_Sql_Quiz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Answer Key to the Data Frame Programming Quiz\n", 8 | "\n", 9 | "Helpful resources:\n", 10 | "http://spark.apache.org/docs/latest/api/python/pyspark.sql.html" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 4, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from pyspark.sql import SparkSession\n", 20 | "# from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n", 21 | "# from pyspark.sql.functions import sum as Fsum\n", 22 | "# from pyspark.sql.window import Window\n", 23 | "# from pyspark.sql.types import IntegerType" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 5, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# 1) import any other libraries you might need\n", 33 | "# 2) instantiate a Spark session \n", 34 | "# 3) read in the data set located at the path \"data/sparkify_log_small.json\"\n", 35 | "# 4) create a view to 
use with your SQL queries\n", 36 | "# 5) write code to answer the quiz questions \n", 37 | "\n", 38 | "spark = SparkSession \\\n", 39 | " .builder \\\n", 40 | " .appName(\"Spark SQL Quiz\") \\\n", 41 | " .getOrCreate()\n", 42 | "\n", 43 | "user_log = spark.read.json(\"data/sparkify_log_small.json\")\n", 44 | "\n", 45 | "user_log.createOrReplaceTempView(\"log_table\")\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# Question 1\n", 53 | "\n", 54 | "Which page did user id \"\" (empty string) NOT visit?" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 6, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "root\n", 67 | " |-- artist: string (nullable = true)\n", 68 | " |-- auth: string (nullable = true)\n", 69 | " |-- firstName: string (nullable = true)\n", 70 | " |-- gender: string (nullable = true)\n", 71 | " |-- itemInSession: long (nullable = true)\n", 72 | " |-- lastName: string (nullable = true)\n", 73 | " |-- length: double (nullable = true)\n", 74 | " |-- level: string (nullable = true)\n", 75 | " |-- location: string (nullable = true)\n", 76 | " |-- method: string (nullable = true)\n", 77 | " |-- page: string (nullable = true)\n", 78 | " |-- registration: long (nullable = true)\n", 79 | " |-- sessionId: long (nullable = true)\n", 80 | " |-- song: string (nullable = true)\n", 81 | " |-- status: long (nullable = true)\n", 82 | " |-- ts: long (nullable = true)\n", 83 | " |-- userAgent: string (nullable = true)\n", 84 | " |-- userId: string (nullable = true)\n", 85 | "\n" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "user_log.printSchema()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 7, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "name": "stdout", 100 | "output_type": "stream", 101 | "text": [ 102 | "+----+----------------+\n", 103 | "|page| page|\n", 104 | "+----+----------------+\n", 105 | "|null|Submit Downgrade|\n", 106 | "|null| Downgrade|\n", 107 | "|null| Logout|\n", 108 | "|null| Save Settings|\n", 109 | "|null| Settings|\n", 110 | "|null| NextSong|\n", 111 | "|null| Upgrade|\n", 112 | "|null| Error|\n", 113 | "|null| Submit Upgrade|\n", 114 | "+----+----------------+\n", 115 | "\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "# SELECT distinct pages for the blank user and distinc pages for all users\n", 121 | "# Right join the results to find pages that blank visitor did not visit\n", 122 | "spark.sql(\"SELECT * \\\n", 123 | " FROM ( \\\n", 124 | " SELECT DISTINCT page \\\n", 125 | " FROM log_table \\\n", 126 | " WHERE userID='') AS user_pages \\\n", 127 | " RIGHT JOIN ( \\\n", 128 | " SELECT DISTINCT page \\\n", 129 | " FROM log_table) AS all_pages \\\n", 130 | " ON user_pages.page = all_pages.page \\\n", 131 | " WHERE user_pages.page IS NULL\").show()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "# Question 2 - Reflect\n", 139 | "\n", 140 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?\n", 141 | "\n", 142 | "Both Spark SQL and Spark Data Frames are part of the Spark SQL library. Hence, they both use the Spark SQL Catalyst Optimizer to optimize queries. \n", 143 | "\n", 144 | "You might prefer SQL over data frames because the syntax is clearer especially for teams already experienced in SQL.\n", 145 | "\n", 146 | "Spark data frames give you more control. 
You can break down your queries into smaller steps, which can make debugging easier. You can also [cache](https://unraveldata.com/to-cache-or-not-to-cache/) intermediate results or [repartition](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) intermediate results." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "# Question 3\n", 154 | "\n", 155 | "How many female users do we have in the data set?" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 8, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "+----------------------+\n", 168 | "|count(DISTINCT userID)|\n", 169 | "+----------------------+\n", 170 | "| 462|\n", 171 | "+----------------------+\n", 172 | "\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "spark.sql(\"SELECT COUNT(DISTINCT userID) \\\n", 178 | " FROM log_table \\\n", 179 | " WHERE gender = 'F'\").show()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "# Question 4\n", 187 | "\n", 188 | "How many songs were played from the most played artist?" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 9, 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "+--------+-----+\n", 201 | "| Artist|plays|\n", 202 | "+--------+-----+\n", 203 | "|Coldplay| 83|\n", 204 | "+--------+-----+\n", 205 | "\n", 206 | "+--------+-----+\n", 207 | "| Artist|plays|\n", 208 | "+--------+-----+\n", 209 | "|Coldplay| 83|\n", 210 | "+--------+-----+\n", 211 | "\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "# Here is one solution\n", 217 | "spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n", 218 | " FROM log_table \\\n", 219 | " GROUP BY Artist \\\n", 220 | " ORDER BY plays DESC \\\n", 221 | " LIMIT 1\").show()\n", 222 | "\n", 223 | "# Here is an alternative solution\n", 224 | "# Get the artist play counts\n", 225 | "play_counts = spark.sql(\"SELECT Artist, COUNT(Artist) AS plays \\\n", 226 | " FROM log_table \\\n", 227 | " GROUP BY Artist\")\n", 228 | "\n", 229 | "# save the results in a new view\n", 230 | "play_counts.createOrReplaceTempView(\"artist_counts\")\n", 231 | "\n", 232 | "# use a self join to find where the max play equals the count value\n", 233 | "spark.sql(\"SELECT a2.Artist, a2.plays FROM \\\n", 234 | " (SELECT max(plays) AS max_plays FROM artist_counts) AS a1 \\\n", 235 | " JOIN artist_counts AS a2 \\\n", 236 | " ON a1.max_plays = a2.plays \\\n", 237 | " \").show()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "# Question 5 (challenge)\n", 245 | "\n", 246 | "How many songs do users listen to on average between visiting our home page? 
Please round your answer to the closest integer.\n", 247 | "\n" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 31, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "name": "stdout", 257 | "output_type": "stream", 258 | "text": [ 259 | "+------------------+\n", 260 | "|avg(count(period))|\n", 261 | "+------------------+\n", 262 | "| 6.898347107438017|\n", 263 | "+------------------+\n", 264 | "\n" 265 | ] 266 | } 267 | ], 268 | "source": [ 269 | "# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;\n", 270 | "is_home = spark.sql(\"SELECT userID, page, ts, CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home FROM log_table \\\n", 271 | " WHERE (page = 'NextSong') or (page = 'Home') \\\n", 272 | " \")\n", 273 | "\n", 274 | "# keep the results in a new view\n", 275 | "is_home.createOrReplaceTempView(\"is_home_table\")\n", 276 | "\n", 277 | "# find the cumulative sum over the is_home column\n", 278 | "cumulative_sum = spark.sql(\"SELECT *, SUM(is_home) OVER \\\n", 279 | " (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period \\\n", 280 | " FROM is_home_table\")\n", 281 | "\n", 282 | "# keep the results in a view\n", 283 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n", 284 | "\n", 285 | "# find the average count for NextSong\n", 286 | "spark.sql(\"SELECT AVG(count_results) FROM \\\n", 287 | " (SELECT COUNT(*) AS count_results FROM period_table \\\n", 288 | "GROUP BY userID, period, page HAVING page = 'NextSong') AS counts\").show()" 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.6.3" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 1.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Define a function that uses the python logger to log a function. Then finish filling in the details of the DAG down below. Once you’ve done that, run "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then Run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page. 
3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.operators.python_operator import PythonOperator 9 | 10 | def first_prog(): 11 | logging.info("This is my very first program for airflow") 12 | 13 | dag = DAG( 14 | 'lesson1.exercise1', 15 | start_date=datetime.datetime.now()) 16 | 17 | greet_task = PythonOperator( 18 | task_id="first_airflow_program", 19 | python_callable=first_prog, 20 | dag=dag 21 | ) 22 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 2.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.operators.python_operator import PythonOperator 6 | 7 | 8 | def second_prog(): 9 | logging.info("This is my second program for airflow") 10 | 11 | dag = DAG( 12 | "lesson1.exercise2", 13 | start_date=datetime.datetime.now() - datetime.timedelta(days=2), 14 | schedule_interval="@daily") 15 | 16 | task = PythonOperator( 17 | task_id="exercise_2", 18 | python_callable=second_prog, 19 | dag=dag) 20 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 3.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.operators.python_operator import PythonOperator 6 | 7 | 8 | def hello_world(): 9 | logging.info("Hello World") 10 | 11 | 12 | def addition(): 13 | logging.info(f"2 + 2 = {2+2}") 14 | 15 | 16 | def subtraction(): 17 | logging.info(f"6 - 2 = {6-2}") 18 | 19 | 20 | def division(): 21 | logging.info(f"10 / 2 = {int(10/2)}") 22 | 23 | def completed_task(): 24 | logging.info("All Tasks Completed") 25 | 26 | 27 | dag = DAG( 28 | "lesson1.exercise3", 29 | schedule_interval='@hourly', 30 | start_date=datetime.datetime.now() - datetime.timedelta(days=1)) 31 | 32 | hello_world_task = PythonOperator( 33 | task_id="hello_world", 34 | python_callable=hello_world, 35 | dag=dag) 36 | 37 | addition_task = PythonOperator( 38 | task_id="addition", 39 | python_callable=addition, 40 | dag=dag) 41 | 42 | subtraction_task = PythonOperator( 43 | task_id="subtraction", 44 | python_callable=subtraction, 45 | dag=dag) 46 | 47 | division_task = PythonOperator( 48 | task_id="division", 49 | python_callable=division, 50 | dag=dag) 51 | 52 | completed_task = PythonOperator( 53 | task_id="completed_task", 54 | python_callable=completed_task, 55 | dag=dag) 56 | # 57 | # -> addition_task 58 | # / \ 59 | # hello_world_task -> division_task-> completed_task 60 | # \ / 61 | # -> subtraction_task 62 | 63 | hello_world_task >> addition_task 64 | hello_world_task >> division_task 65 | hello_world_task >> subtraction_task 66 | addition_task >> completed_task 67 | division_task >> completed_task 68 | subtraction_task >> completed_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 4.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | from airflow import DAG 5 | from airflow.models import Variable 6 | from airflow.operators.python_operator import PythonOperator 7 | from airflow.hooks.S3_hook import S3Hook 8 | 9 | # 10 | # TODO: There is no code to modify in this exercise.
We're going to create a connection and a 11 | # variable. 12 | # 1. Open your browser to localhost:8080 and open Admin->Variables 13 | # 2. Click "Create" 14 | # 3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend" 15 | # 4. Click save 16 | # 5. Open Admin->Connections 17 | # 6. Click "Create" 18 | # 7. Set "Conn Id" to "aws_credentials", "Conn Type" to "Amazon Web Services" 19 | # Set "Login" to your aws_access_key_id and "Password" to your aws_secret_key 20 | # 8. Click save 21 | # 9. Run the DAG 22 | 23 | def list_keys(): 24 | hook = S3Hook(aws_conn_id='aws_credentials') 25 | bucket = Variable.get('s3_bucket') 26 | logging.info(f"Listing Keys from {bucket}") 27 | keys = hook.list_keys(bucket) 28 | for key in keys: 29 | logging.info(f"- s3://{bucket}/{key}") 30 | 31 | 32 | dag = DAG( 33 | 'lesson1.exercise4', 34 | start_date=datetime.datetime.now()) 35 | 36 | list_task = PythonOperator( 37 | task_id="list_keys", 38 | python_callable=list_keys, 39 | dag=dag 40 | ) -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 5.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Use the Airflow context in the pythonoperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use. 3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.models import Variable 9 | from airflow.operators.python_operator import PythonOperator 10 | from airflow.hooks.S3_hook import S3Hook 11 | 12 | 13 | def log_details(*args, **kwargs): 14 | # 15 | # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them 16 | # NOTE: Look here for context variables passed in on kwargs: 17 | # https://airflow.apache.org/macros.html 18 | # 19 | ds = kwargs['ds'] # kwargs[] 20 | run_id = kwargs['run_id'] # kwargs[] 21 | previous_ds = kwargs.get('prev_ds') # kwargs.get('') 22 | next_ds = kwargs.get('next_ds') # kwargs.get('') 23 | 24 | logging.info(f"Execution date is {ds}") 25 | logging.info(f"My run id is {run_id}") 26 | if previous_ds: 27 | logging.info(f"My previous run was on {previous_ds}") 28 | if next_ds: 29 | logging.info(f"My next run will be {next_ds}") 30 | 31 | dag = DAG( 32 | 'lesson1.exercise5', 33 | schedule_interval="@daily", 34 | start_date=datetime.datetime.now() - datetime.timedelta(days=2) 35 | ) 36 | 37 | list_task = PythonOperator( 38 | task_id="log_details", 39 | python_callable=log_details, 40 | provide_context=True, 41 | dag=dag 42 | ) 43 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Pipeline - Exercise 6.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Similar to what you saw in the demo, copy and populate the trips table. Then, add another operator which creates a traffic analysis table from the trips table you created. Note, in this class, we won’t be writing SQL -- all of the SQL statements we run against Redshift are predefined and included in your lesson. 
3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.contrib.hooks.aws_hook import AwsHook 9 | from airflow.hooks.postgres_hook import PostgresHook 10 | from airflow.operators.postgres_operator import PostgresOperator 11 | from airflow.operators.python_operator import PythonOperator 12 | 13 | import sql_statements 14 | 15 | 16 | def load_data_to_redshift(*args, **kwargs): 17 | aws_hook = AwsHook("aws_credentials") 18 | credentials = aws_hook.get_credentials() 19 | redshift_hook = PostgresHook("redshift") 20 | redshift_hook.run(sql_statements.COPY_ALL_TRIPS_SQL.format(credentials.access_key, credentials.secret_key)) 21 | 22 | 23 | dag = DAG( 24 | 'lesson1.exercise6', 25 | start_date=datetime.datetime.now() 26 | ) 27 | 28 | create_table = PostgresOperator( 29 | task_id="create_table", 30 | dag=dag, 31 | postgres_conn_id="redshift", 32 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 33 | ) 34 | 35 | copy_task = PythonOperator( 36 | task_id='load_from_s3_to_redshift', 37 | dag=dag, 38 | python_callable=load_data_to_redshift 39 | ) 40 | 41 | location_traffic_task = PostgresOperator( 42 | task_id="calculate_location_traffic", 43 | dag=dag, 44 | postgres_conn_id="redshift", 45 | sql=sql_statements.LOCATION_TRAFFIC_SQL 46 | ) 47 | 48 | create_table >> copy_task 49 | copy_task >> location_traffic_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 1.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Run the DAG as it is first, and observe the Airflow UI 3 | #2 - Next, open up the DAG and add the copy and load tasks as directed in the TODOs 4 | #3 - Reload the Airflow UI and run the DAG once more, observing the Airflow UI 5 | 6 | import datetime 7 | import logging 8 | 9 | from airflow import DAG 10 | from airflow.contrib.hooks.aws_hook import AwsHook 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | from airflow.operators.postgres_operator import PostgresOperator 13 | from airflow.operators.python_operator import PythonOperator 14 | 15 | import sql_statements 16 | 17 | 18 | def load_trip_data_to_redshift(*args, **kwargs): 19 | aws_hook = AwsHook("aws_credentials") 20 | credentials = aws_hook.get_credentials() 21 | redshift_hook = PostgresHook("redshift") 22 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format( 23 | credentials.access_key, 24 | credentials.secret_key, 25 | ) 26 | redshift_hook.run(sql_stmt) 27 | 28 | 29 | def load_station_data_to_redshift(*args, **kwargs): 30 | aws_hook = AwsHook("aws_credentials") 31 | credentials = aws_hook.get_credentials() 32 | redshift_hook = PostgresHook("redshift") 33 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 34 | credentials.access_key, 35 | credentials.secret_key, 36 | ) 37 | redshift_hook.run(sql_stmt) 38 | 39 | 40 | dag = DAG( 41 | 'lesson2.exercise1', 42 | start_date=datetime.datetime.now() 43 | ) 44 | 45 | create_trips_table = PostgresOperator( 46 | task_id="create_trips_table", 47 | dag=dag, 48 | postgres_conn_id="redshift", 49 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 50 | ) 51 | 52 | copy_trips_task = PythonOperator( 53 | task_id='load_trips_from_s3_to_redshift', 54 | dag=dag, 55 | python_callable=load_trip_data_to_redshift, 56 | ) 57 | 58 | create_stations_table = PostgresOperator( 59 | task_id="create_stations_table", 60 | dag=dag, 61 | postgres_conn_id="redshift", 62 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 63 | ) 64 | 65 | 
copy_stations_task = PythonOperator( 66 | task_id='load_stations_from_s3_to_redshift', 67 | dag=dag, 68 | python_callable=load_station_data_to_redshift, 69 | ) 70 | 71 | create_trips_table >> copy_trips_task 72 | create_stations_table >> copy_stations_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 2.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Revisit our bikeshare traffic 3 | #2 - Update our DAG with 4 | # a - @monthly schedule_interval 5 | # b - max_active_runs of 1 6 | # c - start_date of 2018/01/01 7 | # d - end_date of 2018/02/01 8 | # Use Airflow’s backfill capabilities to analyze our trip data on a monthly basis over 2 historical runs 9 | 10 | import datetime 11 | import logging 12 | 13 | from airflow import DAG 14 | from airflow.contrib.hooks.aws_hook import AwsHook 15 | from airflow.hooks.postgres_hook import PostgresHook 16 | from airflow.operators.postgres_operator import PostgresOperator 17 | from airflow.operators.python_operator import PythonOperator 18 | 19 | import sql_statements 20 | 21 | 22 | def load_trip_data_to_redshift(*args, **kwargs): 23 | aws_hook = AwsHook("aws_credentials") 24 | credentials = aws_hook.get_credentials() 25 | redshift_hook = PostgresHook("redshift") 26 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format( 27 | credentials.access_key, 28 | credentials.secret_key, 29 | ) 30 | redshift_hook.run(sql_stmt) 31 | 32 | 33 | def load_station_data_to_redshift(*args, **kwargs): 34 | aws_hook = AwsHook("aws_credentials") 35 | credentials = aws_hook.get_credentials() 36 | redshift_hook = PostgresHook("redshift") 37 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 38 | credentials.access_key, 39 | credentials.secret_key, 40 | ) 41 | redshift_hook.run(sql_stmt) 42 | 43 | 44 | dag = DAG( 45 | 'lesson2.exercise2', 46 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 47 | # TODO: Set the end date to February first 48 | end_date=datetime.datetime(2018, 2, 1, 0, 0, 0 , 0), 49 | # TODO: Set the schedule to be monthly 50 | schedule_interval='@monthly', 51 | # TODO: set the number of max active runs to 1 52 | max_active_runs=1 53 | ) 54 | 55 | create_trips_table = PostgresOperator( 56 | task_id="create_trips_table", 57 | dag=dag, 58 | postgres_conn_id="redshift", 59 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 60 | ) 61 | 62 | copy_trips_task = PythonOperator( 63 | task_id='load_trips_from_s3_to_redshift', 64 | dag=dag, 65 | python_callable=load_trip_data_to_redshift, 66 | provide_context=True, 67 | ) 68 | 69 | create_stations_table = PostgresOperator( 70 | task_id="create_stations_table", 71 | dag=dag, 72 | postgres_conn_id="redshift", 73 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 74 | ) 75 | 76 | copy_stations_task = PythonOperator( 77 | task_id='load_stations_from_s3_to_redshift', 78 | dag=dag, 79 | python_callable=load_station_data_to_redshift, 80 | ) 81 | 82 | create_trips_table >> copy_trips_task 83 | create_stations_table >> copy_stations_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 3.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Modify the bikeshare DAG to load data month by month, instead of loading it all at once, every time. 3 | #2 - Use time partitioning to parallelize the execution of the DAG. 
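#
# The month-by-month load works because the COPY statement is parameterized by the execution
# date: load_trip_data_to_redshift() below formats sql_statements.COPY_MONTHLY_TRIPS_SQL with
# year=execution_date.year and month=execution_date.month, so each scheduled run copies only
# its own month of trip data. As a hedged sketch (the real constant lives in the provided
# sql_statements module and is not shown here), the template presumably combines two
# positional '{}' placeholders for the AWS credentials with named {year}/{month} placeholders
# in the S3 key:
EXAMPLE_COPY_MONTHLY_TRIPS_SQL = """
    COPY trips
    FROM 's3://udac-data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv'
    ACCESS_KEY_ID '{}'
    SECRET_ACCESS_KEY '{}'
    IGNOREHEADER 1
    DELIMITER ','
"""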
4 | 5 | import datetime 6 | import logging 7 | 8 | from airflow import DAG 9 | from airflow.contrib.hooks.aws_hook import AwsHook 10 | from airflow.hooks.postgres_hook import PostgresHook 11 | from airflow.operators.postgres_operator import PostgresOperator 12 | from airflow.operators.python_operator import PythonOperator 13 | 14 | import sql_statements 15 | 16 | 17 | def load_trip_data_to_redshift(*args, **kwargs): 18 | aws_hook = AwsHook("aws_credentials") 19 | credentials = aws_hook.get_credentials() 20 | redshift_hook = PostgresHook("redshift") 21 | execution_date = kwargs["execution_date"] 22 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format( 23 | credentials.access_key, 24 | credentials.secret_key, 25 | year=execution_date.year, 26 | month=execution_date.month 27 | ) 28 | redshift_hook.run(sql_stmt) 29 | 30 | 31 | def load_station_data_to_redshift(*args, **kwargs): 32 | aws_hook = AwsHook("aws_credentials") 33 | credentials = aws_hook.get_credentials() 34 | redshift_hook = PostgresHook("redshift") 35 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 36 | credentials.access_key, 37 | credentials.secret_key, 38 | ) 39 | redshift_hook.run(sql_stmt) 40 | 41 | 42 | dag = DAG( 43 | 'lesson2.exercise3', 44 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 45 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 46 | schedule_interval='@monthly', 47 | max_active_runs=1 48 | ) 49 | 50 | create_trips_table = PostgresOperator( 51 | task_id="create_trips_table", 52 | dag=dag, 53 | postgres_conn_id="redshift", 54 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 55 | ) 56 | 57 | copy_trips_task = PythonOperator( 58 | task_id='load_trips_from_s3_to_redshift', 59 | dag=dag, 60 | python_callable=load_trip_data_to_redshift, 61 | provide_context=True, 62 | ) 63 | 64 | create_stations_table = PostgresOperator( 65 | task_id="create_stations_table", 66 | dag=dag, 67 | postgres_conn_id="redshift", 68 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 69 | ) 70 | 71 | copy_stations_task = PythonOperator( 72 | task_id='load_stations_from_s3_to_redshift', 73 | dag=dag, 74 | python_callable=load_station_data_to_redshift, 75 | ) 76 | 77 | create_trips_table >> copy_trips_task 78 | create_stations_table >> copy_stations_task 79 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Data Quality - Exercise 4.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #1 - Set an SLA on our bikeshare traffic calculation operator 3 | #2 - Add data verification step after the load step from s3 to redshift 4 | #3 - Add data verification step after we calculate our output table 5 | 6 | import datetime 7 | import logging 8 | 9 | from airflow import DAG 10 | from airflow.contrib.hooks.aws_hook import AwsHook 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | from airflow.operators.postgres_operator import PostgresOperator 13 | from airflow.operators.python_operator import PythonOperator 14 | 15 | import sql_statements 16 | 17 | 18 | def load_trip_data_to_redshift(*args, **kwargs): 19 | aws_hook = AwsHook("aws_credentials") 20 | credentials = aws_hook.get_credentials() 21 | redshift_hook = PostgresHook("redshift") 22 | execution_date = kwargs["execution_date"] 23 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format( 24 | credentials.access_key, 25 | credentials.secret_key, 26 | year=execution_date.year, 27 | month=execution_date.month 28 | ) 29 | redshift_hook.run(sql_stmt) 30 | 31 | 32 | def 
load_station_data_to_redshift(*args, **kwargs): 33 | aws_hook = AwsHook("aws_credentials") 34 | credentials = aws_hook.get_credentials() 35 | redshift_hook = PostgresHook("redshift") 36 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format( 37 | credentials.access_key, 38 | credentials.secret_key, 39 | ) 40 | redshift_hook.run(sql_stmt) 41 | 42 | 43 | def check_greater_than_zero(*args, **kwargs): 44 | table = kwargs["params"]["table"] 45 | redshift_hook = PostgresHook("redshift") 46 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}") 47 | if len(records) < 1 or len(records[0]) < 1: 48 | raise ValueError(f"Data quality check failed. {table} returned no results") 49 | num_records = records[0][0] 50 | if num_records < 1: 51 | raise ValueError(f"Data quality check failed. {table} contained 0 rows") 52 | logging.info(f"Data quality on table {table} check passed with {records[0][0]} records") 53 | 54 | 55 | dag = DAG( 56 | 'lesson2.exercise4', 57 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 58 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 59 | schedule_interval='@monthly', 60 | max_active_runs=1 61 | ) 62 | 63 | create_trips_table = PostgresOperator( 64 | task_id="create_trips_table", 65 | dag=dag, 66 | postgres_conn_id="redshift", 67 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 68 | ) 69 | 70 | copy_trips_task = PythonOperator( 71 | task_id='load_trips_from_s3_to_redshift', 72 | dag=dag, 73 | python_callable=load_trip_data_to_redshift, 74 | provide_context=True, 75 | ) 76 | 77 | check_trips = PythonOperator( 78 | task_id='check_trips_data', 79 | dag=dag, 80 | python_callable=check_greater_than_zero, 81 | provide_context=True, 82 | params={ 83 | 'table': 'trips', 84 | } 85 | ) 86 | 87 | create_stations_table = PostgresOperator( 88 | task_id="create_stations_table", 89 | dag=dag, 90 | postgres_conn_id="redshift", 91 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 92 | ) 93 | 94 | copy_stations_task = PythonOperator( 95 | task_id='load_stations_from_s3_to_redshift', 96 | dag=dag, 97 | python_callable=load_station_data_to_redshift, 98 | ) 99 | 100 | check_stations = PythonOperator( 101 | task_id='check_stations_data', 102 | dag=dag, 103 | python_callable=check_greater_than_zero, 104 | provide_context=True, 105 | params={ 106 | 'table': 'stations', 107 | } 108 | ) 109 | 110 | create_trips_table >> copy_trips_task 111 | create_stations_table >> copy_stations_task 112 | copy_stations_task >> check_stations 113 | copy_trips_task >> check_trips -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 1.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll consolidate repeated code into Operator Plugins 3 | #1 - Move the data quality check logic into a custom operator 4 | #2 - Replace the data quality check PythonOperators with our new custom operator 5 | #3 - Consolidate both the S3 to RedShift functions into a custom operator 6 | #4 - Replace the S3 to RedShift PythonOperators with our new custom operator 7 | #5 - Execute the DAG 8 | 9 | import datetime 10 | import logging 11 | 12 | from airflow import DAG 13 | from airflow.contrib.hooks.aws_hook import AwsHook 14 | from airflow.hooks.postgres_hook import PostgresHook 15 | 16 | from airflow.operators import ( 17 | HasRowsOperator, 18 | PostgresOperator, 19 | PythonOperator, 20 | S3ToRedshiftOperator 21 | ) 22 | 23 | import sql_statements 24 | 25 | 26 | 
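#
# HasRowsOperator and S3ToRedshiftOperator imported above are custom operator plugins whose
# implementations (has_rows.py and s3_to_redshift.py elsewhere in this folder) are not shown
# at this point in the listing. As a hedged sketch under that assumption, the has-rows check
# essentially packages the check_greater_than_zero() logic from the previous exercise into a
# reusable, parameterized operator, roughly like the class below (given a different name here
# so it does not shadow the plugin import; details are assumptions, not the repo's code).
# PostgresHook and logging are already imported at the top of this file.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class HasRowsOperatorSketch(BaseOperator):
    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
        super(HasRowsOperatorSketch, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        # Same checks as check_greater_than_zero(), but driven by operator parameters
        redshift = PostgresHook(self.redshift_conn_id)
        records = redshift.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if len(records) < 1 or len(records[0]) < 1:
            raise ValueError(f"Data quality check failed. {self.table} returned no results")
        if records[0][0] < 1:
            raise ValueError(f"Data quality check failed. {self.table} contained 0 rows")
        logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records")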
# 27 | # TODO: Replace the data quality checks with the HasRowsOperator 28 | # 29 | 30 | dag = DAG( 31 | "lesson3.exercise1", 32 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0), 33 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0), 34 | schedule_interval="@monthly", 35 | max_active_runs=1 36 | ) 37 | 38 | create_trips_table = PostgresOperator( 39 | task_id="create_trips_table", 40 | dag=dag, 41 | postgres_conn_id="redshift", 42 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL 43 | ) 44 | 45 | copy_trips_task = S3ToRedshiftOperator( 46 | task_id="load_trips_from_s3_to_redshift", 47 | dag=dag, 48 | table="trips", 49 | redshift_conn_id="redshift", 50 | aws_credentials_id="aws_credentials", 51 | s3_bucket="udac-data-pipelines", 52 | s3_key="divvy/partitioned/{execution_date.year}/{execution_date.month}/divvy_trips.csv" 53 | ) 54 | 55 | # 56 | # TODO: Replace this data quality check with the HasRowsOperator 57 | # 58 | check_trips = HasRowsOperator( 59 | task_id='check_trips_data', 60 | dag=dag, 61 | redshift_conn_id="redshift", 62 | table="trips" 63 | ) 64 | 65 | create_stations_table = PostgresOperator( 66 | task_id="create_stations_table", 67 | dag=dag, 68 | postgres_conn_id="redshift", 69 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL, 70 | ) 71 | 72 | copy_stations_task = S3ToRedshiftOperator( 73 | task_id="load_stations_from_s3_to_redshift", 74 | dag=dag, 75 | redshift_conn_id="redshift", 76 | aws_credentials_id="aws_credentials", 77 | s3_bucket="udac-data-pipelines", 78 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv", 79 | table="stations" 80 | ) 81 | 82 | # 83 | # TODO: Replace this data quality check with the HasRowsOperator 84 | # 85 | check_stations = HasRowsOperator( 86 | task_id='check_stations_data', 87 | dag=dag, 88 | redshift_conn_id="redshift", 89 | table="stations" 90 | ) 91 | 92 | create_trips_table >> copy_trips_task 93 | create_stations_table >> copy_stations_task 94 | copy_stations_task >> check_stations 95 | copy_trips_task >> check_trips -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 2.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries 3 | #1 - Read through the DAG and identify points in the DAG that could be split apart 4 | #2 - Split the DAG into multiple PythonOperators 5 | #3 - Run the DAG 6 | 7 | import datetime 8 | import logging 9 | 10 | from airflow import DAG 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | 13 | from airflow.operators.postgres_operator import PostgresOperator 14 | from airflow.operators.python_operator import PythonOperator 15 | 16 | 17 | # 18 | # TODO: Finish refactoring this function into the appropriate set of tasks, 19 | # instead of keeping this one large task. 
20 | # 21 | def load_and_analyze(*args, **kwargs): 22 | redshift_hook = PostgresHook("redshift") 23 | 24 | def log_oldest(): 25 | redshift_hook = PostgresHook("redshift") 26 | records = redshift_hook.get_records(""" 27 | SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1 28 | """) 29 | if len(records) > 0 and len(records[0]) > 0: 30 | logging.info(f"Oldest rider was born in {records[0][0]}") 31 | 32 | def log_younger(): 33 | redshift_hook = PostgresHook("redshift") 34 | records = redshift_hook.get_records(""" 35 | SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1 36 | """) 37 | if len(records) > 0 and len(records[0]) > 0: 38 | logging.info(f"Youngest rider was born in {records[0][0]}") 39 | 40 | 41 | dag = DAG( 42 | "lesson3.exercise2", 43 | start_date=datetime.datetime.utcnow() 44 | ) 45 | 46 | load_and_analyze = PythonOperator( 47 | task_id='load_and_analyze', 48 | dag=dag, 49 | python_callable=load_and_analyze, 50 | provide_context=True, 51 | ) 52 | 53 | create_oldest_task = PostgresOperator( 54 | task_id="create_oldest", 55 | dag=dag, 56 | sql=""" 57 | BEGIN; 58 | DROP TABLE IF EXISTS older_riders; 59 | CREATE TABLE older_riders AS ( 60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945 61 | ); 62 | COMMIT; 63 | """, 64 | postgres_conn_id="redshift" 65 | ) 66 | 67 | create_younger_task = PostgresOperator( 68 | task_id="create_younger", 69 | dag=dag, 70 | sql=""" 71 | BEGIN; 72 | DROP TABLE IF EXISTS younger_riders; 73 | CREATE TABLE younger_riders AS ( 74 | SELECT * FROM trips WHERE birthyear > 2000 75 | ); 76 | COMMIT; 77 | """, 78 | postgres_conn_id="redshift" 79 | ) 80 | 81 | create_lifetime_task = PostgresOperator( 82 | task_id="create_lifetime", 83 | dag=dag, 84 | sql=""" 85 | BEGIN; 86 | DROP TABLE IF EXISTS lifetime_rides; 87 | CREATE TABLE lifetime_rides AS ( 88 | SELECT bikeid, COUNT(bikeid) 89 | FROM trips 90 | GROUP BY bikeid 91 | ); 92 | COMMIT; 93 | """, 94 | postgres_conn_id="redshift" 95 | ) 96 | 97 | create_city_station_task = PostgresOperator( 98 | task_id="create_city_station", 99 | dag=dag, 100 | sql=""" 101 | BEGIN; 102 | DROP TABLE IF EXISTS city_station_counts; 103 | CREATE TABLE city_station_counts AS( 104 | SELECT city, COUNT(city) 105 | FROM stations 106 | GROUP BY city 107 | ); 108 | COMMIT; 109 | """, 110 | postgres_conn_id="redshift" 111 | ) 112 | 113 | log_oldest_task = PythonOperator( 114 | task_id="log_oldest", 115 | dag=dag, 116 | python_callable=log_oldest 117 | ) 118 | 119 | log_younger_task = PythonOperator( 120 | task_id="log_younger", 121 | dag=dag, 122 | python_callable=log_younger 123 | ) 124 | 125 | load_and_analyze >> create_oldest_task 126 | create_oldest_task >> log_oldest_task 127 | create_younger_task >> log_younger_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 3.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries 3 | #1 - Read through the DAG and identify points in the DAG that could be split apart 4 | #2 - Split the DAG into multiple PythonOperators 5 | #3 - Run the DAG 6 | 7 | import datetime 8 | import logging 9 | 10 | from airflow import DAG 11 | from airflow.hooks.postgres_hook import PostgresHook 12 | 13 | from airflow.operators.postgres_operator import PostgresOperator 14 | from 
airflow.operators.python_operator import PythonOperator 15 | 16 | 17 | # 18 | # TODO: Finish refactoring this function into the appropriate set of tasks, 19 | # instead of keeping this one large task. 20 | # 21 | def load_and_analyze(*args, **kwargs): 22 | redshift_hook = PostgresHook("redshift") 23 | 24 | def log_oldest(): 25 | redshift_hook = PostgresHook("redshift") 26 | records = redshift_hook.get_records(""" 27 | SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1 28 | """) 29 | if len(records) > 0 and len(records[0]) > 0: 30 | logging.info(f"Oldest rider was born in {records[0][0]}") 31 | 32 | def log_younger(): 33 | redshift_hook = PostgresHook("redshift") 34 | records = redshift_hook.get_records(""" 35 | SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1 36 | """) 37 | if len(records) > 0 and len(records[0]) > 0: 38 | logging.info(f"Youngest rider was born in {records[0][0]}") 39 | 40 | 41 | dag = DAG( 42 | "lesson3.exercise2", 43 | start_date=datetime.datetime.utcnow() 44 | ) 45 | 46 | load_and_analyze = PythonOperator( 47 | task_id='load_and_analyze', 48 | dag=dag, 49 | python_callable=load_and_analyze, 50 | provide_context=True, 51 | ) 52 | 53 | create_oldest_task = PostgresOperator( 54 | task_id="create_oldest", 55 | dag=dag, 56 | sql=""" 57 | BEGIN; 58 | DROP TABLE IF EXISTS older_riders; 59 | CREATE TABLE older_riders AS ( 60 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945 61 | ); 62 | COMMIT; 63 | """, 64 | postgres_conn_id="redshift" 65 | ) 66 | 67 | create_younger_task = PostgresOperator( 68 | task_id="create_younger", 69 | dag=dag, 70 | sql=""" 71 | BEGIN; 72 | DROP TABLE IF EXISTS younger_riders; 73 | CREATE TABLE younger_riders AS ( 74 | SELECT * FROM trips WHERE birthyear > 2000 75 | ); 76 | COMMIT; 77 | """, 78 | postgres_conn_id="redshift" 79 | ) 80 | 81 | create_lifetime_task = PostgresOperator( 82 | task_id="create_lifetime", 83 | dag=dag, 84 | sql=""" 85 | BEGIN; 86 | DROP TABLE IF EXISTS lifetime_rides; 87 | CREATE TABLE lifetime_rides AS ( 88 | SELECT bikeid, COUNT(bikeid) 89 | FROM trips 90 | GROUP BY bikeid 91 | ); 92 | COMMIT; 93 | """, 94 | postgres_conn_id="redshift" 95 | ) 96 | 97 | create_city_station_task = PostgresOperator( 98 | task_id="create_city_station", 99 | dag=dag, 100 | sql=""" 101 | BEGIN; 102 | DROP TABLE IF EXISTS city_station_counts; 103 | CREATE TABLE city_station_counts AS( 104 | SELECT city, COUNT(city) 105 | FROM stations 106 | GROUP BY city 107 | ); 108 | COMMIT; 109 | """, 110 | postgres_conn_id="redshift" 111 | ) 112 | 113 | log_oldest_task = PythonOperator( 114 | task_id="log_oldest", 115 | dag=dag, 116 | python_callable=log_oldest 117 | ) 118 | 119 | log_younger_task = PythonOperator( 120 | task_id="log_younger", 121 | dag=dag, 122 | python_callable=log_younger 123 | ) 124 | 125 | load_and_analyze >> create_oldest_task 126 | create_oldest_task >> log_oldest_task 127 | create_younger_task >> log_younger_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Production Data Pipelines - Exercise 4.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | from airflow import DAG 4 | 5 | from airflow.operators import ( 6 | FactsCalculatorOperator, 7 | HasRowsOperator, 8 | S3ToRedshiftOperator 9 | ) 10 | 11 | # 12 | # The following DAG performs the following functions: 13 | # 14 | # 1. Loads Trip data from S3 to RedShift 15 | # 2. 
Performs a data quality check on the Trips table in RedShift 16 | # 3. Uses the FactsCalculatorOperator to create a Facts table in Redshift 17 | # a. **NOTE**: to complete this step you must complete the FactsCalcuatorOperator 18 | # skeleton defined in plugins/operators/facts_calculator.py 19 | # 20 | dag = DAG("lesson3.exercise4", start_date=datetime.datetime.utcnow()) 21 | 22 | # 23 | # The following code will load trips data from S3 to RedShift. Use the s3_key 24 | # "data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 25 | # and the s3_bucket "udacity-dend" 26 | # 27 | copy_trips_task = S3ToRedshiftOperator( 28 | task_id="load_trips_from_s3_to_redshift", 29 | dag=dag, 30 | table="trips", 31 | redshift_conn_id="redshift", 32 | aws_credentials_id="aws_credentials", 33 | s3_bucket="udacity-dend", 34 | s3_key="data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 35 | ) 36 | 37 | # 38 | # Data quality check on the Trips table 39 | # 40 | check_trips = HasRowsOperator( 41 | task_id="check_trips_data", 42 | dag=dag, 43 | redshift_conn_id="redshift", 44 | table="trips" 45 | ) 46 | 47 | # 48 | # We use the FactsCalculatorOperator to create a Facts table in RedShift. The fact column is 49 | # `tripduration` and the groupby_column is `bikeid` 50 | # 51 | calculate_facts = FactsCalculatorOperator( 52 | task_id="calculate_facts_trips", 53 | dag=dag, 54 | redshift_conn_id="redshift", 55 | origin_table="trips", 56 | destination_table="trips_facts", 57 | fact_column="tripduration", 58 | groupby_column="bikeid" 59 | ) 60 | 61 | # 62 | # Task ordering for the DAG tasks 63 | # 64 | copy_trips_task >> check_trips 65 | check_trips >> calculate_facts -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Graphview.png -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/DAG Treeview.PNG -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/Readme.MD: -------------------------------------------------------------------------------- 1 | Project: Data Pipeline with Airflow 2 | 3 | Introduction 4 | 5 | A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app 6 | 7 | Project Description 8 | 9 | Apply the knowledge of Apache Airflow to build and ETL pipeline for a Data Lake hosted on Amazon S3. 10 | 11 | In this project, we would have to create our own custom operators to perform tasks such as staging the data, filling the data warehouse and running checks on the data as the final step. 
We have been provided with four empty operators that need to be implemented into functional pieces of a data pipeline. 12 | 13 | Project Datasets 14 | 15 | Song Data Path --> s3://udacity-dend/song_data 16 | 17 | Log Data Path --> s3://udacity-dend/log_data 18 | 19 | Project Template 20 | 21 | The project template package contains three major components for the project: 22 | 23 | The DAG template has all the imports and task templates in place, but the task dependencies have not been set 24 | The operators folder with operator templates 25 | A helper class for the SQL transformations 26 | 27 | Configuring the DAG 28 | 29 | In the DAG, add default parameters according to these guidelines: 30 | 31 | 1. The DAG does not have dependencies on past runs 32 | 2. On failure, the tasks are retried 3 times 33 | 3. Retries happen every 5 minutes 34 | 4. Catchup is turned off 35 | 5. Do not email on retry 36 | 37 | 38 | Building the Operators 39 | 40 | We need to build four different operators that will stage the data, transform the data, and run checks on data quality. All of the operators and task instances will run SQL statements against the Redshift database. However, using parameters wisely will allow you to build flexible, reusable, and configurable operators that you can later apply to many kinds of data pipelines with Redshift and with other databases. 41 | 42 | Stage Operator 43 | 44 | The stage operator is expected to be able to load any JSON- or CSV-formatted files from S3 to Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided. The operator's parameters should specify where in S3 the file is located and what the target table is. 45 | 46 | The parameters should also be used to distinguish between JSON and CSV files. Another important requirement of the stage operator is a templated field that allows it to load timestamped files from S3 based on the execution time and run backfills. 47 | 48 | Fact and Dimension Operators 49 | 50 | The provided SQL helper class will help to run the data transformations. Most of the logic is within the SQL transformations, and the operator is expected to take as input a SQL statement and the target database against which to run the query. Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load. Fact tables are usually so massive that they should only allow append-type functionality. 51 | 52 | Data Quality Operator 53 | 54 | The final operator to create is the data quality operator, which is used to run checks on the data itself. The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. For each test, the actual result and the expected result need to be compared, and if there is no match, the operator should raise an exception so that the task retries and eventually fails. 55 | 56 | For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows that have NULL in that column. Since we do not want any NULLs, the expected result would be 0, and the test would compare the SQL statement's outcome to that expected result.
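As an illustration only (a hedged sketch, not the project's actual implementation), such an operator might accept the connection id and a list of check/expected-result pairs; the parameter names (redshift_conn_id, dq_checks) and the dictionary keys below are assumptions:

    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class DataQualityOperator(BaseOperator):

        @apply_defaults
        def __init__(self, redshift_conn_id="", dq_checks=None, *args, **kwargs):
            super(DataQualityOperator, self).__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id
            self.dq_checks = dq_checks or []

        def execute(self, context):
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            for check in self.dq_checks:
                actual = redshift.get_records(check["check_sql"])[0][0]
                if actual != check["expected_result"]:
                    raise ValueError("Data quality check failed: {} returned {}, expected {}".format(
                        check["check_sql"], actual, check["expected_result"]))
                self.log.info("Data quality check passed: {}".format(check["check_sql"]))

The NULL example above could then be expressed as {"check_sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected_result": 0}.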
57 | 58 | Final Instructions 59 | 60 | When you are in the workspace, after completing the code, you can start by using the command : /opt/airflow/start.sh 61 | 62 | Once you done, it would automatically start all the dags required and outputting the result to its respective tables 63 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE IF NOT EXISTS public.artists ( 2 | artistid varchar(256) NOT NULL, 3 | name varchar(256), 4 | location varchar(256), 5 | lattitude numeric(18,0), 6 | longitude numeric(18,0) 7 | ); 8 | 9 | CREATE TABLE IF NOT EXISTS public.songplays ( 10 | playid varchar(32) NOT NULL, 11 | start_time timestamp NOT NULL, 12 | userid int4 NOT NULL, 13 | "level" varchar(256), 14 | songid varchar(256), 15 | artistid varchar(256), 16 | sessionid int4, 17 | location varchar(256), 18 | user_agent varchar(256), 19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 20 | ); 21 | 22 | CREATE TABLE IF NOT EXISTS public.songs ( 23 | songid varchar(256) NOT NULL, 24 | title varchar(256), 25 | artistid varchar(256), 26 | "year" int4, 27 | duration numeric(18,0), 28 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 29 | ); 30 | 31 | CREATE TABLE IF NOT EXISTS public.staging_events ( 32 | artist varchar(256), 33 | auth varchar(256), 34 | firstname varchar(256), 35 | gender varchar(256), 36 | iteminsession int4, 37 | lastname varchar(256), 38 | length numeric(18,0), 39 | "level" varchar(256), 40 | location varchar(256), 41 | "method" varchar(256), 42 | page varchar(256), 43 | registration numeric(18,0), 44 | sessionid int4, 45 | song varchar(256), 46 | status int4, 47 | ts int8, 48 | useragent varchar(256), 49 | userid int4 50 | ); 51 | 52 | CREATE TABLE IF NOT EXISTS public.staging_songs ( 53 | num_songs int4, 54 | artist_id varchar(256), 55 | artist_name varchar(256), 56 | artist_latitude numeric(18,0), 57 | artist_longitude numeric(18,0), 58 | artist_location varchar(256), 59 | song_id varchar(256), 60 | title varchar(256), 61 | duration numeric(18,0), 62 | "year" int4 63 | ); 64 | 65 | CREATE TABLE IF NOT EXISTS public."time" ( 66 | start_time timestamp NOT NULL, 67 | "hour" int4, 68 | "day" int4, 69 | week int4, 70 | "month" varchar(256), 71 | "year" int4, 72 | weekday varchar(256), 73 | CONSTRAINT time_pkey PRIMARY KEY (start_time) 74 | ); 75 | 76 | CREATE TABLE IF NOT EXISTS public.users ( 77 | userid int4 NOT NULL, 78 | first_name varchar(256), 79 | last_name varchar(256), 80 | gender varchar(256), 81 | "level" varchar(256), 82 | CONSTRAINT users_pkey PRIMARY KEY (userid) 83 | ); 84 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/dags/udac_example_dag.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | import os 3 | 
from airflow import DAG 4 | from airflow.operators.dummy_operator import DummyOperator 5 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator, 6 | LoadDimensionOperator, DataQualityOperator) 7 | from helpers import SqlQueries 8 | 9 | # AWS_KEY = os.environ.get('AWS_KEY') 10 | # AWS_SECRET = os.environ.get('AWS_SECRET') 11 | 12 | default_args = { 13 | 'owner': 'nareshkumar', 14 | 'start_date': datetime(2018, 11, 1), 15 | 'end_date': datetime(2018, 11, 30), 16 | 'depends_on_past': False, 17 | 'retries': 3, 18 | 'retry_delay': timedelta(minutes=5), 19 | 'catchup': False, 20 | 'email_on_retry': False 21 | } 22 | 23 | dag = DAG('udacity_airflow_project5', 24 | default_args=default_args, 25 | description='Load and transform data in Redshift with Airflow', 26 | schedule_interval='0 * * * *', 27 | max_active_runs=3 28 | ) 29 | 30 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag) 31 | 32 | stage_events_to_redshift = StageToRedshiftOperator( 33 | task_id='Stage_events', 34 | dag=dag, 35 | provide_context=True, 36 | aws_credentials_id="aws_credentials", 37 | redshift_conn_id='redshift', 38 | s3_bucket="udacity-dend-airflow-test", 39 | s3_key="log_data", 40 | table="staging_events", 41 | create_stmt=SqlQueries.create_table_staging_events 42 | ) 43 | 44 | stage_songs_to_redshift = StageToRedshiftOperator( 45 | task_id='Stage_songs', 46 | dag=dag, 47 | provide_context=True, 48 | aws_credentials_id="aws_credentials", 49 | redshift_conn_id='redshift', 50 | s3_bucket="udacity-dend-airflow-test", 51 | s3_key="song_data", 52 | table="staging_songs", 53 | create_stmt=SqlQueries.create_table_staging_songs 54 | ) 55 | 56 | load_songplays_table = LoadFactOperator( 57 | task_id='Load_songplays_fact_table', 58 | dag=dag, 59 | provide_context=True, 60 | aws_credentials_id="aws_credentials", 61 | redshift_conn_id='redshift', 62 | create_stmt=SqlQueries.create_table_songplays, 63 | sql_query=SqlQueries.songplay_table_insert 64 | ) 65 | 66 | load_user_dimension_table = LoadDimensionOperator( 67 | task_id='Load_user_dim_table', 68 | dag=dag, 69 | provide_context=True, 70 | aws_credentials_id="aws_credentials", 71 | redshift_conn_id='redshift', 72 | create_stmt=SqlQueries.create_table_users, 73 | sql_query=SqlQueries.user_table_insert 74 | ) 75 | 76 | load_song_dimension_table = LoadDimensionOperator( 77 | task_id='Load_song_dim_table', 78 | dag=dag, 79 | provide_context=True, 80 | aws_credentials_id="aws_credentials", 81 | redshift_conn_id='redshift', 82 | create_stmt=SqlQueries.create_table_songs, 83 | sql_query=SqlQueries.song_table_insert 84 | ) 85 | 86 | load_artist_dimension_table = LoadDimensionOperator( 87 | task_id='Load_artist_dim_table', 88 | dag=dag, 89 | provide_context=True, 90 | aws_credentials_id="aws_credentials", 91 | redshift_conn_id='redshift', 92 | create_stmt=SqlQueries.create_table_artist, 93 | sql_query=SqlQueries.artist_table_insert 94 | ) 95 | 96 | load_time_dimension_table = LoadDimensionOperator( 97 | task_id='Load_time_dim_table', 98 | dag=dag, 99 | provide_context=True, 100 | aws_credentials_id="aws_credentials", 101 | redshift_conn_id='redshift', 102 | create_stmt=SqlQueries.create_table_time, 103 | sql_query=SqlQueries.time_table_insert 104 | ) 105 | 106 | run_quality_checks = DataQualityOperator( 107 | task_id='Run_data_quality_checks', 108 | dag=dag, 109 | provide_context=True, 110 | aws_credentials_id="aws_credentials", 111 | redshift_conn_id='redshift', 112 | ) 113 | 114 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 115 | 
116 | start_operator >> [stage_events_to_redshift, stage_songs_to_redshift] 117 | [stage_events_to_redshift, stage_songs_to_redshift] >> load_songplays_table 118 | load_songplays_table >> [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table] 119 | [load_song_dimension_table, load_user_dimension_table, load_artist_dimension_table, load_time_dimension_table] >> run_quality_checks 120 | run_quality_checks >> end_operator -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13 | operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | create_table_artist = (""" 3 | CREATE TABLE IF NOT EXISTS public.artists ( 4 | artistid varchar(256) NOT NULL, 5 | name varchar(256), 6 | location varchar(256), 7 | lattitude numeric(18,0), 8 | 
longitude numeric(18,0) 9 | ); 10 | """) 11 | 12 | create_table_songplays = (""" 13 | CREATE TABLE IF NOT EXISTS public.songplays ( 14 | playid varchar(32) NOT NULL, 15 | start_time timestamp NOT NULL, 16 | userid int4 NOT NULL, 17 | "level" varchar(256), 18 | songid varchar(256), 19 | artistid varchar(256), 20 | sessionid int4, 21 | location varchar(256), 22 | user_agent varchar(256), 23 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 24 | ); 25 | """) 26 | 27 | create_table_songs = (""" 28 | CREATE TABLE IF NOT EXISTS public.songs ( 29 | songid varchar(256) NOT NULL, 30 | title varchar(256), 31 | artistid varchar(256), 32 | "year" int4, 33 | duration numeric(18,0), 34 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 35 | ); 36 | """) 37 | 38 | create_table_staging_events = (""" 39 | CREATE TABLE IF NOT EXISTS public.staging_events ( 40 | artist varchar(256), 41 | auth varchar(256), 42 | firstname varchar(256), 43 | gender varchar(256), 44 | iteminsession int4, 45 | lastname varchar(256), 46 | length numeric(18,0), 47 | "level" varchar(256), 48 | location varchar(256), 49 | "method" varchar(256), 50 | page varchar(256), 51 | registration numeric(18,0), 52 | sessionid int4, 53 | song varchar(256), 54 | status int4, 55 | ts int8, 56 | useragent varchar(256), 57 | userid int4 58 | ); 59 | """) 60 | 61 | create_table_staging_songs = (""" 62 | CREATE TABLE IF NOT EXISTS public.staging_songs ( 63 | num_songs int4, 64 | artist_id varchar(256), 65 | artist_name varchar(256), 66 | artist_latitude numeric(18,0), 67 | artist_longitude numeric(18,0), 68 | artist_location varchar(256), 69 | song_id varchar(256), 70 | title varchar(256), 71 | duration numeric(18,0), 72 | "year" int4 73 | ); 74 | """) 75 | 76 | create_table_time = (""" 77 | CREATE TABLE IF NOT EXISTS public."time" ( 78 | start_time timestamp NOT NULL, 79 | "hour" int4, 80 | "day" int4, 81 | week int4, 82 | "month" varchar(256), 83 | "year" int4, 84 | weekday varchar(256), 85 | CONSTRAINT time_pkey PRIMARY KEY (start_time) 86 | ); 87 | """) 88 | 89 | create_table_users = (""" 90 | CREATE TABLE IF NOT EXISTS public.users ( 91 | userid int4 NOT NULL, 92 | first_name varchar(256), 93 | last_name varchar(256), 94 | gender varchar(256), 95 | "level" varchar(256), 96 | CONSTRAINT users_pkey PRIMARY KEY (userid) 97 | ); 98 | """) 99 | 100 | songplay_table_insert = (""" 101 | SELECT 102 | md5(events.sessionid || events.start_time) songplay_id, 103 | events.start_time, 104 | events.userid, 105 | events.level, 106 | songs.song_id, 107 | songs.artist_id, 108 | events.sessionid, 109 | events.location, 110 | events.useragent 111 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, * 112 | FROM staging_events 113 | WHERE page='NextSong') events 114 | LEFT JOIN staging_songs songs 115 | ON events.song = songs.title 116 | AND events.artist = songs.artist_name 117 | AND events.length = songs.duration 118 | """) 119 | 120 | user_table_insert = (""" 121 | SELECT distinct userid, firstname, lastname, gender, level 122 | FROM staging_events 123 | WHERE page='NextSong' 124 | """) 125 | 126 | song_table_insert = (""" 127 | SELECT distinct song_id, title, artist_id, year, duration 128 | FROM staging_songs 129 | """) 130 | 131 | artist_table_insert = (""" 132 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude 133 | FROM staging_songs 134 | """) 135 | 136 | time_table_insert = (""" 137 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time), 138
| extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time) 139 | FROM songplays 140 | """) -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.stage_redshift import StageToRedshiftOperator 2 | from operators.load_fact import LoadFactOperator 3 | from operators.load_dimension import LoadDimensionOperator 4 | from operators.data_quality import DataQualityOperator 5 | 6 | __all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/data_quality.py: 
-------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class DataQualityOperator(BaseOperator): 6 | 7 | ui_color = '#89DA59' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(DataQualityOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('DataQualityOperator not implemented yet') -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_dimension.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadDimensionOperator(BaseOperator): 6 | 7 | ui_color = '#80BD9E' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(LoadDimensionOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('LoadDimensionOperator not implemented yet') 23 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/load_fact.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadFactOperator(BaseOperator): 6 | 7 | ui_color = '#F98866' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | # Define your operators params (with defaults) here 12 | # Example: 13 | # conn_id = your-connection-name 14 | *args, **kwargs): 15 | 16 | super(LoadFactOperator, self).__init__(*args, **kwargs) 17 | # Map params here 18 | # Example: 19 | # self.conn_id = conn_id 20 | 21 | def execute(self, context): 22 | self.log.info('LoadFactOperator not implemented yet') 23 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/Project Data Pipeline with Airflow/plugins/operators/stage_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class StageToRedshiftOperator(BaseOperator): 6 | ui_color = '#358140' 7 | 8 | @apply_defaults 9 | def __init__(self, 10 | # Define your operators params (with defaults) here 11 | # Example: 12 | # redshift_conn_id=your-connection-name 13 | *args, **kwargs): 14 | 15 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 16 | # Map params here 17 | # Example: 18 | # self.conn_id = conn_id 19 | 20 | def execute(self, context): 21 | self.log.info('StageToRedshiftOperator not implemented yet') 22 | 23 | 24 | 25 | 26 | 27 | 
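Note that the four operator files above are still the unimplemented course templates. A minimal sketch of how the DataQualityOperator could be completed, following the same hook-and-row-count pattern as the HasRowsOperator exercise file further down in this repo, might look like the code below; the redshift_conn_id and tables parameters are assumed names for illustration, not part of the template.

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class DataQualityOperator(BaseOperator):
    """Sketch only: fail the task if any of the given tables is empty."""

    ui_color = '#89DA59'

    @apply_defaults
    def __init__(self,
                 redshift_conn_id="",   # assumed parameter: Airflow connection id for Redshift
                 tables=None,           # assumed parameter: list of table names to check
                 *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            # Run a simple row-count check against each table and fail loudly if it is empty
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0]:
                raise ValueError(f"Data quality check failed. {table} returned no results")
            if records[0][0] < 1:
                raise ValueError(f"Data quality check failed. {table} contained 0 rows")
            self.log.info(f"Data quality check on table {table} passed with {records[0][0]} records")

The other stubs would follow the same shape: store the constructor arguments, build the SQL (or the COPY statement, as in the S3ToRedshiftOperator exercise file below) and execute it through the PostgresHook.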
-------------------------------------------------------------------------------- /Data Pipeline with Airflow/Readme.MD: -------------------------------------------------------------------------------- 1 | Exercise Files 2 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.facts_calculator import FactsCalculatorOperator 2 | from operators.has_rows import HasRowsOperator 3 | from operators.s3_to_redshift import S3ToRedshiftOperator 4 | 5 | __all__ = [ 6 | 'FactsCalculatorOperator', 7 | 'HasRowsOperator', 8 | 'S3ToRedshiftOperator' 9 | ] 10 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/dag.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag. 3 | #1 - Consolidate HasRowsOperator into the SubDag 4 | #2 - Reorder the tasks to take advantage of the SubDag Operators 5 | 6 | import datetime 7 | 8 | from airflow import DAG 9 | from airflow.operators.postgres_operator import PostgresOperator 10 | from airflow.operators.subdag_operator import SubDagOperator 11 | from airflow.operators.udacity_plugin import HasRowsOperator 12 | 13 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag 14 | import sql_statements 15 | 16 | 17 | start_date = datetime.datetime.utcnow() 18 | 19 | dag = DAG( 20 | "lesson3.exercise3", 21 | start_date=start_date, 22 | ) 23 | 24 | trips_task_id = "trips_subdag" 25 | trips_subdag_task = SubDagOperator( 26 | subdag=get_s3_to_redshift_dag( 27 | "lesson3.exercise3", 28 | trips_task_id, 29 | "redshift", 30 | "aws_credentials", 31 | "trips", 32 | sql_statements.CREATE_TRIPS_TABLE_SQL, 33 | s3_bucket="udac-data-pipelines", 34 | s3_key="divvy/unpartitioned/divvy_trips_2018.csv", 35 | start_date=start_date, 36 | ), 37 | task_id=trips_task_id, 38 | dag=dag, 39 | ) 40 | 41 | stations_task_id = "stations_subdag" 42 | stations_subdag_task = SubDagOperator( 43 | subdag=get_s3_to_redshift_dag( 44 | "lesson3.exercise3", 45 | stations_task_id, 46 | "redshift", 47 | "aws_credentials", 48 | "stations", 49 | sql_statements.CREATE_STATIONS_TABLE_SQL, 50 | s3_bucket="udac-data-pipelines", 51 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv", 52 | start_date=start_date, 53 | ), 54 | task_id=stations_task_id, 55 | dag=dag, 56 | ) 57 | 58 | # 59 | # TODO: Consolidate check_trips and check_stations into a single check in the subdag 60 | # as we did with the create and copy in the demo 61 | # 62 | check_trips = HasRowsOperator( 63 | task_id="check_trips_data", 64 | dag=dag, 65 | redshift_conn_id="redshift", 66 | table="trips" 67 | ) 68 | 69 | check_stations = HasRowsOperator( 70 | task_id="check_stations_data", 71 | dag=dag, 72 | redshift_conn_id="redshift", 73 | table="stations" 74 | ) 75 | 76 | location_traffic_task = PostgresOperator( 77 | task_id="calculate_location_traffic", 78 | dag=dag, 79 | postgres_conn_id="redshift", 80 | sql=sql_statements.LOCATION_TRAFFIC_SQL 81 | ) 82 | 83 | # 84 | # TODO: Reorder the Graph once you have moved the checks 85 | # 86 | trips_subdag_task >> location_traffic_task 87 | stations_subdag_task >> location_traffic_task -------------------------------------------------------------------------------- /Data Pipeline with Airflow/facts_calculator.py: 
-------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from airflow.hooks.postgres_hook import PostgresHook 4 | from airflow.models import BaseOperator 5 | from airflow.utils.decorators import apply_defaults 6 | 7 | 8 | class FactsCalculatorOperator(BaseOperator): 9 | facts_sql_template = """ 10 | DROP TABLE IF EXISTS {destination_table}; 11 | CREATE TABLE {destination_table} AS 12 | SELECT 13 | {groupby_column}, 14 | MAX({fact_column}) AS max_{fact_column}, 15 | MIN({fact_column}) AS min_{fact_column}, 16 | AVG({fact_column}) AS average_{fact_column} 17 | FROM {origin_table} 18 | GROUP BY {groupby_column}; 19 | """ 20 | 21 | @apply_defaults 22 | def __init__(self, 23 | redshift_conn_id="", 24 | origin_table="", 25 | destination_table="", 26 | fact_column="", 27 | groupby_column="", 28 | *args, **kwargs): 29 | 30 | super(FactsCalculatorOperator, self).__init__(*args, **kwargs) 31 | # 32 | # TODO: Set attributes from __init__ instantiation arguments 33 | # 34 | 35 | def execute(self, context): 36 | # 37 | # TODO: Fetch the redshift hook 38 | # 39 | 40 | # 41 | # TODO: Format the `facts_sql_template` and run the query against redshift 42 | # 43 | 44 | pass 45 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/has_rows.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from airflow.hooks.postgres_hook import PostgresHook 4 | from airflow.models import BaseOperator 5 | from airflow.utils.decorators import apply_defaults 6 | 7 | 8 | class HasRowsOperator(BaseOperator): 9 | 10 | @apply_defaults 11 | def __init__(self, 12 | redshift_conn_id="", 13 | table="", 14 | *args, **kwargs): 15 | 16 | super(HasRowsOperator, self).__init__(*args, **kwargs) 17 | self.table = table 18 | self.redshift_conn_id = redshift_conn_id 19 | 20 | def execute(self, context): 21 | redshift_hook = PostgresHook(self.redshift_conn_id) 22 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {self.table}") 23 | if len(records) < 1 or len(records[0]) < 1: 24 | raise ValueError(f"Data quality check failed. {self.table} returned no results") 25 | num_records = records[0][0] 26 | if num_records < 1: 27 | raise ValueError(f"Data quality check failed. 
{self.table} contained 0 rows") 28 | logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records") 29 | 30 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/s3_to_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.contrib.hooks.aws_hook import AwsHook 2 | from airflow.hooks.postgres_hook import PostgresHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | 7 | class S3ToRedshiftOperator(BaseOperator): 8 | template_fields = ("s3_key",) 9 | copy_sql = """ 10 | COPY {} 11 | FROM '{}' 12 | ACCESS_KEY_ID '{}' 13 | SECRET_ACCESS_KEY '{}' 14 | IGNOREHEADER {} 15 | DELIMITER '{}' 16 | """ 17 | 18 | 19 | @apply_defaults 20 | def __init__(self, 21 | redshift_conn_id="", 22 | aws_credentials_id="", 23 | table="", 24 | s3_bucket="", 25 | s3_key="", 26 | delimiter=",", 27 | ignore_headers=1, 28 | *args, **kwargs): 29 | 30 | super(S3ToRedshiftOperator, self).__init__(*args, **kwargs) 31 | self.table = table 32 | self.redshift_conn_id = redshift_conn_id 33 | self.s3_bucket = s3_bucket 34 | self.s3_key = s3_key 35 | self.delimiter = delimiter 36 | self.ignore_headers = ignore_headers 37 | self.aws_credentials_id = aws_credentials_id 38 | 39 | def execute(self, context): 40 | aws_hook = AwsHook(self.aws_credentials_id) 41 | credentials = aws_hook.get_credentials() 42 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 43 | 44 | self.log.info("Clearing data from destination Redshift table") 45 | redshift.run("DELETE FROM {}".format(self.table)) 46 | 47 | self.log.info("Copying data from S3 to Redshift") 48 | rendered_key = self.s3_key.format(**context) 49 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key) 50 | formatted_sql = S3ToRedshiftOperator.copy_sql.format( 51 | self.table, 52 | s3_path, 53 | credentials.access_key, 54 | credentials.secret_key, 55 | self.ignore_headers, 56 | self.delimiter 57 | ) 58 | redshift.run(formatted_sql) 59 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/sql_statements.py: -------------------------------------------------------------------------------- 1 | CREATE_TRIPS_TABLE_SQL = """ 2 | CREATE TABLE IF NOT EXISTS trips ( 3 | trip_id INTEGER NOT NULL, 4 | start_time TIMESTAMP NOT NULL, 5 | end_time TIMESTAMP NOT NULL, 6 | bikeid INTEGER NOT NULL, 7 | tripduration DECIMAL(16,2) NOT NULL, 8 | from_station_id INTEGER NOT NULL, 9 | from_station_name VARCHAR(100) NOT NULL, 10 | to_station_id INTEGER NOT NULL, 11 | to_station_name VARCHAR(100) NOT NULL, 12 | usertype VARCHAR(20), 13 | gender VARCHAR(6), 14 | birthyear INTEGER, 15 | PRIMARY KEY(trip_id)) 16 | DISTSTYLE ALL; 17 | """ 18 | 19 | CREATE_STATIONS_TABLE_SQL = """ 20 | CREATE TABLE IF NOT EXISTS stations ( 21 | id INTEGER NOT NULL, 22 | name VARCHAR(250) NOT NULL, 23 | city VARCHAR(100) NOT NULL, 24 | latitude DECIMAL(9, 6) NOT NULL, 25 | longitude DECIMAL(9, 6) NOT NULL, 26 | dpcapacity INTEGER NOT NULL, 27 | online_date TIMESTAMP NOT NULL, 28 | PRIMARY KEY(id)) 29 | DISTSTYLE ALL; 30 | """ 31 | 32 | COPY_SQL = """ 33 | COPY {} 34 | FROM '{}' 35 | ACCESS_KEY_ID '{{}}' 36 | SECRET_ACCESS_KEY '{{}}' 37 | IGNOREHEADER 1 38 | DELIMITER ',' 39 | """ 40 | 41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format( 42 | "trips", 43 | "s3://udac-data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv" 44 | ) 45 | 46 | COPY_ALL_TRIPS_SQL = 
COPY_SQL.format( 47 | "trips", 48 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv" 49 | ) 50 | 51 | COPY_STATIONS_SQL = COPY_SQL.format( 52 | "stations", 53 | "s3://udac-data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv" 54 | ) 55 | 56 | LOCATION_TRAFFIC_SQL = """ 57 | BEGIN; 58 | DROP TABLE IF EXISTS station_traffic; 59 | CREATE TABLE station_traffic AS 60 | SELECT 61 | DISTINCT(t.from_station_id) AS station_id, 62 | t.from_station_name AS station_name, 63 | num_departures, 64 | num_arrivals 65 | FROM trips t 66 | JOIN ( 67 | SELECT 68 | from_station_id, 69 | COUNT(from_station_id) AS num_departures 70 | FROM trips 71 | GROUP BY from_station_id 72 | ) AS fs ON t.from_station_id = fs.from_station_id 73 | JOIN ( 74 | SELECT 75 | to_station_id, 76 | COUNT(to_station_id) AS num_arrivals 77 | FROM trips 78 | GROUP BY to_station_id 79 | ) AS ts ON t.from_station_id = ts.to_station_id 80 | """ 81 | -------------------------------------------------------------------------------- /Data Pipeline with Airflow/subdag.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag. 3 | #1 - Consolidate HasRowsOperator into the SubDag 4 | #2 - Reorder the tasks to take advantage of the SubDag Operators 5 | 6 | import datetime 7 | 8 | from airflow import DAG 9 | from airflow.operators.postgres_operator import PostgresOperator 10 | from airflow.operators.udacity_plugin import HasRowsOperator 11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator 12 | 13 | import sql 14 | 15 | 16 | # Returns a DAG which creates a table if it does not exist, and then proceeds 17 | # to load data into that table from S3. When the load is complete, a data 18 | # quality check is performed to assert that at least one row of data is 19 | # present. 20 | def get_s3_to_redshift_dag( 21 | parent_dag_name, 22 | task_id, 23 | redshift_conn_id, 24 | aws_credentials_id, 25 | table, 26 | create_sql_stmt, 27 | s3_bucket, 28 | s3_key, 29 | *args, **kwargs): 30 | dag = DAG( 31 | f"{parent_dag_name}.{task_id}", 32 | **kwargs 33 | ) 34 | 35 | create_task = PostgresOperator( 36 | task_id=f"create_{table}_table", 37 | dag=dag, 38 | postgres_conn_id=redshift_conn_id, 39 | sql=create_sql_stmt 40 | ) 41 | 42 | copy_task = S3ToRedshiftOperator( 43 | task_id=f"load_{table}_from_s3_to_redshift", 44 | dag=dag, 45 | table=table, 46 | redshift_conn_id=redshift_conn_id, 47 | aws_credentials_id=aws_credentials_id, 48 | s3_bucket=s3_bucket, 49 | s3_key=s3_key 50 | ) 51 | 52 | # 53 | # TODO: Move the HasRowsOperator task here from the DAG 54 | # 55 | 56 | check_task = HasRowsOperator( 57 | task_id=f"check_{table}_data", 58 | dag=dag, 59 | redshift_conn_id=redshift_conn_id, 60 | table=table 61 | ) 62 | 63 | create_task >> copy_task 64 | # 65 | # TODO: Use DAG ordering to place the check task 66 | # 67 | copy_task >> check_task 68 | return dag 69 | -------------------------------------------------------------------------------- /Data-Modeling/L1 Exercise 1 Creating a Table with Postgres.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L1 Exercise 1: Creating a Table with PostgreSQL\n", 8 | "\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Walk through the basics of PostgreSQL. 
You will need to complete the following tasks:
  • Create a table in PostgreSQL,
  • Insert rows of data,<br>
  • Run a simple SQL query to validate the information.
    \n", 17 | "`#####` denotes where the code needs to be completed. \n", 18 | " \n", 19 | "Note: __Do not__ click the blue Preview button in the lower task bar" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "#### Import the library \n", 27 | "*Note:* An error might popup after this command has executed. If it does, read it carefully before ignoring. " 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 5, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "import psycopg2" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 6, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "name": "stdout", 46 | "output_type": "stream", 47 | "text": [ 48 | "ALTER ROLE\r\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "!echo \"alter user student createdb;\" | sudo -u postgres psql" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Create a connection to the database" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 7, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "try: \n", 70 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")\n", 71 | "except psycopg2.Error as e: \n", 72 | " print(\"Error: Could not make connection to the Postgres database\")\n", 73 | " print(e)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Use the connection to get a cursor that can be used to execute queries." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 8, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "try: \n", 90 | " cur = conn.cursor()\n", 91 | "except psycopg2.Error as e: \n", 92 | " print(\"Error: Could not get curser to the Database\")\n", 93 | " print(e)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### TO-DO: Set automatic commit to be true so that each action is committed without having to call conn.commit() after each command. " 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 9, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "conn.set_session(autocommit=True)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "### TO-DO: Create a database to do the work in. " 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 10, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "## TO-DO: Add the database name within the CREATE DATABASE statement. You can choose your own db name.\n", 126 | "try: \n", 127 | " cur.execute(\"create database student1\")\n", 128 | "except psycopg2.Error as e:\n", 129 | " print(e)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "#### TO-DO: Add the database name in the connect statement. Let's close our connection to the default database, reconnect to the Udacity database, and get a new cursor." 
137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 11, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "## TO-DO: Add the database name within the connect statement\n", 146 | "try: \n", 147 | " conn.close()\n", 148 | "except psycopg2.Error as e:\n", 149 | " print(e)\n", 150 | " \n", 151 | "try: \n", 152 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=student1 user=student password=student\")\n", 153 | "except psycopg2.Error as e: \n", 154 | " print(\"Error: Could not make connection to the Postgres database\")\n", 155 | " print(e)\n", 156 | " \n", 157 | "try: \n", 158 | " cur = conn.cursor()\n", 159 | "except psycopg2.Error as e: \n", 160 | " print(\"Error: Could not get curser to the Database\")\n", 161 | " print(e)\n", 162 | "\n", 163 | "conn.set_session(autocommit=True)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. \n", 171 | "\n", 172 | "`song_title\n", 173 | "artist_name\n", 174 | "year\n", 175 | "album_name\n", 176 | "single`\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 12, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "## TO-DO: Finish writing the CREATE TABLE statement with the correct arguments\n", 186 | "try: \n", 187 | " cur.execute(\"CREATE TABLE IF NOT EXISTS music_library_1(song_title varchar, artist_name varchar, year int, album_name varchar, single Boolean);\")\n", 188 | "except psycopg2.Error as e: \n", 189 | " print(\"Error: Issue creating table\")\n", 190 | " print (e)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "### TO-DO: Insert the following two rows in the table\n", 198 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n", 199 | "\n", 200 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 13, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "## TO-DO: Finish the INSERT INTO statement with the correct arguments\n", 210 | "\n", 211 | "try: \n", 212 | " cur.execute(\"INSERT INTO music_library_1 (song_title, artist_name, year, album_name, single) \\\n", 213 | " VALUES (%s, %s, %s, %s, %s)\", \\\n", 214 | " (\"The Beatles\", \"Across The Universe\", 1970, \"Across The Universe\", False))\n", 215 | "except psycopg2.Error as e: \n", 216 | " print(\"Error: Inserting Rows\")\n", 217 | " print (e)\n", 218 | " \n", 219 | "try: \n", 220 | " cur.execute(\"INSERT INTO music_library_1 (song_title, artist_name, year, album_name, single) \\\n", 221 | " VALUES (%s, %s, %s, %s, %s)\",\n", 222 | " (\"Rubber Soul\", \"The Beatles\", 1965, \"Think For Yourself\", False))\n", 223 | "except psycopg2.Error as e: \n", 224 | " print(\"Error: Inserting Rows\")\n", 225 | " print (e)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "### TO-DO: Validate your data was inserted into the table. 
\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 14, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "('The Beatles', 'Across The Universe', 1970, 'Across The Universe', False)\n", 245 | "('Rubber Soul', 'The Beatles', 1965, 'Think For Yourself', False)\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "## TO-DO: Finish the SELECT * Statement \n", 251 | "try: \n", 252 | " cur.execute(\"SELECT * FROM music_library_1;\")\n", 253 | "except psycopg2.Error as e: \n", 254 | " print(\"Error: select *\")\n", 255 | " print (e)\n", 256 | "\n", 257 | "row = cur.fetchone()\n", 258 | "while row:\n", 259 | " print(row)\n", 260 | " row = cur.fetchone()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "### And finally close your cursor and connection. " 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 15, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "cur.close()\n", 277 | "conn.close()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [] 286 | } 287 | ], 288 | "metadata": { 289 | "kernelspec": { 290 | "display_name": "Python 3", 291 | "language": "python", 292 | "name": "python3" 293 | }, 294 | "language_info": { 295 | "codemirror_mode": { 296 | "name": "ipython", 297 | "version": 3 298 | }, 299 | "file_extension": ".py", 300 | "mimetype": "text/x-python", 301 | "name": "python", 302 | "nbconvert_exporter": "python", 303 | "pygments_lexer": "ipython3", 304 | "version": "3.6.3" 305 | } 306 | }, 307 | "nbformat": 4, 308 | "nbformat_minor": 2 309 | } 310 | -------------------------------------------------------------------------------- /Data-Modeling/L1 Exercise 2 Creating a Table with Apache Cassandra.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L1 Exercise 2: Creating a Table with Apache Cassandra\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of Apache Cassandra. Complete the following tasks:
  • Create a table in Apache Cassandra,
  • Insert rows of data,
  • Run a simple CQL query to validate the information.<br>
    \n", 16 | "`#####` denotes where the code needs to be completed.\n", 17 | " \n", 18 | "Note: __Do not__ click the blue Preview button in the lower taskbar" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "#### Import Apache Cassandra python package" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import cassandra" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Create a connection to the database" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "from cassandra.cluster import Cluster\n", 51 | "try: \n", 52 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 53 | " session = cluster.connect()\n", 54 | "except Exception as e:\n", 55 | " print(e)\n", 56 | " " 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### TO-DO: Create a keyspace to do the work in " 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 3, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "## TO-DO: Create the keyspace\n", 73 | "try:\n", 74 | " session.execute(\"\"\"\n", 75 | " CREATE KEYSPACE IF NOT EXISTS music_library_1 \n", 76 | " WITH REPLICATION = \n", 77 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 78 | ")\n", 79 | "\n", 80 | "except Exception as e:\n", 81 | " print(e)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### TO-DO: Connect to the Keyspace" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 4, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "## To-Do: Add in the keyspace you created\n", 98 | "try:\n", 99 | " session.set_keyspace('music_library_1')\n", 100 | "except Exception as e:\n", 101 | " print(e)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 
\n", 109 | "\n", 110 | "`song_title\n", 111 | "artist_name\n", 112 | "year\n", 113 | "album_name\n", 114 | "single`" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### TO-DO: You need to create a table to be able to run the following query: \n", 122 | "`select * from songs WHERE year=1970 AND artist_name=\"The Beatles\"`" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 6, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "## TO-DO: Complete the query below\n", 132 | "query = \"CREATE TABLE IF NOT EXISTS music_library_table_1 \"\n", 133 | "query = query + \"(song_title text, artist_name text, year int, album_name text, single Boolean, PRIMARY KEY (year, artist_name))\"\n", 134 | "try:\n", 135 | " session.execute(query)\n", 136 | "except Exception as e:\n", 137 | " print(e)\n" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### TO-DO: Insert the following two rows in your table\n", 145 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n", 146 | "\n", 147 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 7, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "## Add in query and then run the insert statement\n", 157 | "query = \"INSERT INTO music_library_table_1 (song_title, artist_name, year, album_name, single)\" \n", 158 | "query = query + \" VALUES (%s, %s, %s, %s, %s)\"\n", 159 | "\n", 160 | "try:\n", 161 | " session.execute(query, (\"Across The Universe\", \"The Beatles\", 1970, \"Let It Be\", False))\n", 162 | "except Exception as e:\n", 163 | " print(e)\n", 164 | " \n", 165 | "try:\n", 166 | " session.execute(query, (\"Think For Yourself\", \"The Beatles\", 1965, \"Rubber Soul\", False))\n", 167 | "except Exception as e:\n", 168 | " print(e)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### TO-DO: Validate your data was inserted into the table." 
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 8, 181 | "metadata": { 182 | "scrolled": true 183 | }, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "1965 Rubber Soul The Beatles\n", 190 | "1970 Let It Be The Beatles\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "## TO-DO: Complete and then run the select statement to validate the data was inserted into the table\n", 196 | "query = 'SELECT * FROM music_library_table_1'\n", 197 | "try:\n", 198 | " rows = session.execute(query)\n", 199 | "except Exception as e:\n", 200 | " print(e)\n", 201 | " \n", 202 | "for row in rows:\n", 203 | " print (row.year, row.album_name, row.artist_name)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### TO-DO: Validate the Data Model with the original query.\n", 211 | "\n", 212 | "`select * from songs WHERE YEAR=1970 AND artist_name=\"The Beatles\"`" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 9, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "1970 Let It Be The Beatles\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "##TO-DO: Complete the select statement to run the query \n", 230 | "query = \"SELECT * from music_library_table_1 where YEAR=1970 and artist_name = 'The Beatles'\"\n", 231 | "try:\n", 232 | " rows = session.execute(query)\n", 233 | "except Exception as e:\n", 234 | " print(e)\n", 235 | " \n", 236 | "for row in rows:\n", 237 | " print (row.year, row.album_name, row.artist_name)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "### And Finally close the session and cluster connection" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "session.shutdown()\n", 254 | "cluster.shutdown()" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [] 263 | } 264 | ], 265 | "metadata": { 266 | "kernelspec": { 267 | "display_name": "Python 3", 268 | "language": "python", 269 | "name": "python3" 270 | }, 271 | "language_info": { 272 | "codemirror_mode": { 273 | "name": "ipython", 274 | "version": 3 275 | }, 276 | "file_extension": ".py", 277 | "mimetype": "text/x-python", 278 | "name": "python", 279 | "nbconvert_exporter": "python", 280 | "pygments_lexer": "ipython3", 281 | "version": "3.6.3" 282 | } 283 | }, 284 | "nbformat": 4, 285 | "nbformat_minor": 2 286 | } 287 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 2 Primary Key.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L3 Exercise 2: Focus on Primary Key\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of creating a table with a good Primary Key in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. 
This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 23 | "! pip install cassandra-driver\n", 24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Import Apache Cassandra python package" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import cassandra" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Create a connection to the database" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "from cassandra.cluster import Cluster\n", 57 | "try: \n", 58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 59 | " session = cluster.connect()\n", 60 | "except Exception as e:\n", 61 | " print(e)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a keyspace to work in " 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "try:\n", 78 | " session.execute(\"\"\"\n", 79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 80 | " WITH REPLICATION = \n", 81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 82 | ")\n", 83 | "\n", 84 | "except Exception as e:\n", 85 | " print(e)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "#### Connect to the Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "try:\n", 102 | " session.set_keyspace('udacity')\n", 103 | "except Exception as e:\n", 104 | " print(e)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Imagine you need to create a new Music Library of albums \n", 112 | "\n", 113 | "### Here is the information asked of the data:\n", 114 | "### 1. Give every album in the music library that was created by a given artist\n", 115 | "select * from music_library WHERE artist_name=\"The Beatles\"\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Here is the Collection of Data\n", 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### How should we model these data? \n", 131 | "\n", 132 | "#### What should be our Primary Key and Partition Key? Since the data are looking for the ARTIST, let's start with that. Is Partitioning our data by artist a good idea? In this case our data is very small. If we had a larger dataset of albums, partitions by artist might be a fine choice. But we would need to validate the dataset to make sure there is an equal spread of the data. 
\n", 133 | "\n", 134 | "`Table Name: music_library\n", 135 | "column 1: Year\n", 136 | "column 2: Artist Name\n", 137 | "column 3: Album Name\n", 138 | "Column 4: City\n", 139 | "PRIMARY KEY(artist_name)`" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "query = \"CREATE TABLE IF NOT EXISTS music_library\"\n", 149 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (artist_name))\"\n", 150 | "try:\n", 151 | " session.execute(query)\n", 152 | "except Exception as e:\n", 153 | " print(e)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Insert the data into the tables" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n", 170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 171 | "\n", 172 | "try:\n", 173 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n", 174 | "except Exception as e:\n", 175 | " print(e)\n", 176 | " \n", 177 | "try:\n", 178 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n", 179 | "except Exception as e:\n", 180 | " print(e)\n", 181 | " \n", 182 | "try:\n", 183 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n", 184 | "except Exception as e:\n", 185 | " print(e)\n", 186 | "\n", 187 | "try:\n", 188 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n", 189 | "except Exception as e:\n", 190 | " print(e)\n", 191 | "\n", 192 | "try:\n", 193 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n", 194 | "except Exception as e:\n", 195 | " print(e)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Let's Validate our Data Model -- Did it work?? If we look for Albums from The Beatles we should expect to see 2 rows.\n", 203 | "\n", 204 | "`select * from music_library WHERE artist_name=\"The Beatles\"`" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 7, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "name": "stdout", 214 | "output_type": "stream", 215 | "text": [ 216 | "1965 The Beatles Rubber Soul Oxford\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "query = \"select * from music_library WHERE artist_name='The Beatles'\"\n", 222 | "try:\n", 223 | " rows = session.execute(query)\n", 224 | "except Exception as e:\n", 225 | " print(e)\n", 226 | " \n", 227 | "for row in rows:\n", 228 | " print (row.year, row.artist_name, row.album_name, row.city)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "### That didn't work out as planned! Why is that? Because we did not create a unique primary key. " 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### Let's try again. This time focus on making the PRIMARY KEY unique.\n", 243 | "### Looking at the dataset, what makes each row unique?\n", 244 | "\n", 245 | "### We have a couple of options (City and Album Name) but that will not get us the query we need which is looking for album's in a particular artist. Let's make a composite key of the `ARTIST NAME` and `ALBUM NAME`. 
This is assuming that an album name is unique to the artist it was created by (not a bad bet). --But remember this is just an exercise, you will need to understand your dataset fully (no betting!)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 8, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "query = \"CREATE TABLE IF NOT EXISTS music_library1 \"\n", 255 | "query = query + \"(artist_name text, album_name text, year int, city text, PRIMARY KEY (artist_name, album_name))\"\n", 256 | "try:\n", 257 | " session.execute(query)\n", 258 | "except Exception as e:\n", 259 | " print(e)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 9, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "query = \"INSERT INTO music_library1 (artist_name, album_name, year, city)\"\n", 269 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 270 | "\n", 271 | "try:\n", 272 | " session.execute(query, (\"The Beatles\", \"Let it Be\", 1970, \"Liverpool\"))\n", 273 | "except Exception as e:\n", 274 | " print(e)\n", 275 | " \n", 276 | "try:\n", 277 | " session.execute(query, (\"The Beatles\", \"Rubber Soul\", 1965, \"Oxford\"))\n", 278 | "except Exception as e:\n", 279 | " print(e)\n", 280 | " \n", 281 | "try:\n", 282 | " session.execute(query, (\"The Who\", \"My Generation\", 1965, \"London\"))\n", 283 | "except Exception as e:\n", 284 | " print(e)\n", 285 | "\n", 286 | "try:\n", 287 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n", 288 | "except Exception as e:\n", 289 | " print(e)\n", 290 | "\n", 291 | "try:\n", 292 | " session.execute(query, (\"The Carpenters\", \"Close To You\", 1970, \"San Diego\"))\n", 293 | "except Exception as e:\n", 294 | " print(e)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "### Validate the Data Model -- Did it work? If we look for Albums from The Beatles we should expect to see 2 rows.\n", 302 | "\n", 303 | "`select * from music_library WHERE artist_name=\"The Beatles\"`" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 10, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "1970 The Beatles Let it Be Liverpool\n", 316 | "1965 The Beatles Rubber Soul Oxford\n" 317 | ] 318 | } 319 | ], 320 | "source": [ 321 | "query = \"select * from music_library1 WHERE artist_name='The Beatles'\"\n", 322 | "try:\n", 323 | " rows = session.execute(query)\n", 324 | "except Exception as e:\n", 325 | " print(e)\n", 326 | " \n", 327 | "for row in rows:\n", 328 | " print (row.year, row.artist_name, row.album_name, row.city)" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "### Success it worked! We created a unique Primary key that evenly distributed our data. 
" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "### Drop the tables" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 11, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "query = \"drop table music_library\"\n", 352 | "try:\n", 353 | " rows = session.execute(query)\n", 354 | "except Exception as e:\n", 355 | " print(e)\n", 356 | "\n", 357 | "query = \"drop table music_library1\"\n", 358 | "try:\n", 359 | " rows = session.execute(query)\n", 360 | "except Exception as e:\n", 361 | " print(e)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "### Close the session and cluster connection" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": 12, 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "session.shutdown()\n", 378 | "cluster.shutdown()" 379 | ] 380 | } 381 | ], 382 | "metadata": { 383 | "kernelspec": { 384 | "display_name": "Python 3", 385 | "language": "python", 386 | "name": "python3" 387 | }, 388 | "language_info": { 389 | "codemirror_mode": { 390 | "name": "ipython", 391 | "version": 3 392 | }, 393 | "file_extension": ".py", 394 | "mimetype": "text/x-python", 395 | "name": "python", 396 | "nbconvert_exporter": "python", 397 | "pygments_lexer": "ipython3", 398 | "version": "3.6.3" 399 | } 400 | }, 401 | "nbformat": 4, 402 | "nbformat_minor": 2 403 | } 404 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 3 Clustering Column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# L3 Exercise 3: Focus on Clustering Columns\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 23 | "! 
pip install cassandra-driver\n", 24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Import Apache Cassandra python package" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import cassandra" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Create a connection to the database" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "from cassandra.cluster import Cluster\n", 57 | "try: \n", 58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 59 | " session = cluster.connect()\n", 60 | "except Exception as e:\n", 61 | " print(e)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a keyspace to work in " 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "try:\n", 78 | " session.execute(\"\"\"\n", 79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 80 | " WITH REPLICATION = \n", 81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 82 | ")\n", 83 | "\n", 84 | "except Exception as e:\n", 85 | " print(e)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "try:\n", 102 | " session.set_keyspace('udacity')\n", 103 | "except Exception as e:\n", 104 | " print(e)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Imagine we would like to start creating a new Music Library of albums. \n", 112 | "\n", 113 | "### We want to ask 1 question of our data:\n", 114 | "#### 1. Give me all the information from the music library about a given album\n", 115 | "`select * from album_library WHERE album_name=\"Close To You\"`\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Here is the Data:\n", 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### How should we model this data? What should be our Primary Key and Partition Key? \n", 131 | "\n", 132 | "### Since the data is looking for the `ALBUM_NAME` let's start with that. From there we will need to add other elements to make sure the Key is unique. We also need to add the `ARTIST_NAME` as Clustering Columns to make the data unique. 
That should be enough to make the row key unique.\n", 133 | "\n", 134 | "`Table Name: music_library\n", 135 | "column 1: Year\n", 136 | "column 2: Artist Name\n", 137 | "column 3: Album Name\n", 138 | "Column 4: City\n", 139 | "PRIMARY KEY(album name, artist name)`" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 149 | "query = query + \"(album_name text, artist_name text, year int, city text, PRIMARY KEY (album_name, artist_name))\"\n", 150 | "try:\n", 151 | " session.execute(query)\n", 152 | "except Exception as e:\n", 153 | " print(e)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Insert the data into the table" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "query = \"INSERT INTO music_library (album_name, artist_name, year, city)\"\n", 170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 171 | "\n", 172 | "try:\n", 173 | " session.execute(query, (\"Let it Be\", \"The Beatles\", 1970, \"Liverpool\"))\n", 174 | "except Exception as e:\n", 175 | " print(e)\n", 176 | " \n", 177 | "try:\n", 178 | " session.execute(query, (\"Rubber Soul\", \"The Beatles\", 1965, \"Oxford\"))\n", 179 | "except Exception as e:\n", 180 | " print(e)\n", 181 | " \n", 182 | "try:\n", 183 | " session.execute(query, (\"Beatles For Sale\", \"The Beatles\", 1964, \"London\"))\n", 184 | "except Exception as e:\n", 185 | " print(e)\n", 186 | "\n", 187 | "try:\n", 188 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n", 189 | "except Exception as e:\n", 190 | " print(e)\n", 191 | "\n", 192 | "try:\n", 193 | " session.execute(query, (\"Close To You\", \"The Carpenters\", 1970, \"San Diego\"))\n", 194 | "except Exception as e:\n", 195 | " print(e)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Validate the Data Model -- Did it work?\n", 203 | "`select * from album_library WHERE album_name=\"Close To You\"`" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 7, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "The Carpenters Close To You San Diego 1970\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "query = \"select * from music_library WHERE album_NAME='Close To You'\"\n", 221 | "try:\n", 222 | " rows = session.execute(query)\n", 223 | "except Exception as e:\n", 224 | " print(e)\n", 225 | " \n", 226 | "for row in rows:\n", 227 | " print (row.artist_name, row.album_name, row.city, row.year)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### Success it worked! 
We created a unique Primary key that evenly distributed our data, with clustering columns" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### For the sake of the demo, drop the table" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 8, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "query = \"drop table music_library\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Close the session and cluster connection" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 9, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "session.shutdown()\n", 271 | "cluster.shutdown()" 272 | ] 273 | } 274 | ], 275 | "metadata": { 276 | "kernelspec": { 277 | "display_name": "Python 3", 278 | "language": "python", 279 | "name": "python3" 280 | }, 281 | "language_info": { 282 | "codemirror_mode": { 283 | "name": "ipython", 284 | "version": 3 285 | }, 286 | "file_extension": ".py", 287 | "mimetype": "text/x-python", 288 | "name": "python", 289 | "nbconvert_exporter": "python", 290 | "pygments_lexer": "ipython3", 291 | "version": "3.6.3" 292 | } 293 | }, 294 | "nbformat": 4, 295 | "nbformat_minor": 2 296 | } 297 | -------------------------------------------------------------------------------- /Data-Modeling/L3 Exercise 4 Using the WHERE Clause.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 3 Demo 4: Using the WHERE Clause\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.\n", 16 | "\n", 17 | "##### denotes where the code needs to be completed.\n", 18 | "\n", 19 | "Note: __Do not__ click the blue Preview button in the lower task bar" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 27 | "! 
pip install cassandra-driver\n", 28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "#### Import Apache Cassandra python package" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import cassandra" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### First let's create a connection to the database" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from cassandra.cluster import Cluster\n", 61 | "try: \n", 62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 63 | " session = cluster.connect()\n", 64 | "except Exception as e:\n", 65 | " print(e)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Let's create a keyspace to do our work in " 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "try:\n", 82 | " session.execute(\"\"\"\n", 83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 84 | " WITH REPLICATION = \n", 85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 86 | ")\n", 87 | "\n", 88 | "except Exception as e:\n", 89 | " print(e)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 4, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "try:\n", 106 | " session.set_keyspace('udacity')\n", 107 | "except Exception as e:\n", 108 | " print(e)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Let's imagine we would like to start creating a new Music Library of albums. \n", 116 | "### We want to ask 4 question of our data\n", 117 | "#### 1. Give me every album in my music library that was released in a 1965 year\n", 118 | "#### 2. Give me the album that is in my music library that was released in 1965 by \"The Beatles\"\n", 119 | "#### 3. Give me all the albums released in a given year that was made in London \n", 120 | "#### 4. Give me the city that the album \"Rubber Soul\" was recorded" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Here is our Collection of Data\n", 128 | "" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the YEAR let's start with that. From there we will add clustering columns on Artist Name and Album Name." 
136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 145 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))\"\n", 146 | "try:\n", 147 | " session.execute(query)\n", 148 | "except Exception as e:\n", 149 | " print(e)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### Let's insert our data into of table" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n", 166 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 167 | "\n", 168 | "try:\n", 169 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n", 170 | "except Exception as e:\n", 171 | " print(e)\n", 172 | " \n", 173 | "try:\n", 174 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n", 175 | "except Exception as e:\n", 176 | " print(e)\n", 177 | " \n", 178 | "try:\n", 179 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n", 180 | "except Exception as e:\n", 181 | " print(e)\n", 182 | "\n", 183 | "try:\n", 184 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n", 185 | "except Exception as e:\n", 186 | " print(e)\n", 187 | "\n", 188 | "try:\n", 189 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n", 190 | "except Exception as e:\n", 191 | " print(e)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Let's Validate our Data Model with our 4 queries.\n", 199 | "\n", 200 | "Query 1: " 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 9, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "1965 The Beatles Rubber Soul Oxford\n", 213 | "1965 The Who My Generation London\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "query = \"SELECT * from music_library where year = 1965\"\n", 219 | "try:\n", 220 | " rows = session.execute(query)\n", 221 | "except Exception as e:\n", 222 | " print(e)\n", 223 | " \n", 224 | "for row in rows:\n", 225 | " print (row.year, row.artist_name, row.album_name, row.city)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | " Let's try the 2nd query.\n", 233 | " Query 2: " 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 10, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "1965 The Beatles Rubber Soul Oxford\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "query = \"SELECT * from music_library where year = 1965 and artist_name = 'The Beatles'\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n", 255 | " \n", 256 | "for row in rows:\n", 257 | " print (row.year, row.artist_name, row.album_name, row.city)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### Let's try the 3rd query.\n", 265 | "Query 3: " 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 
12, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "Error from server: code=2200 [Invalid query] message=\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\"\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "query = \"select * from music_library where city = 'London'\"\n", 283 | "try:\n", 284 | " rows = session.execute(query)\n", 285 | "except Exception as e:\n", 286 | " print(e)\n", 287 | " \n", 288 | "for row in rows:\n", 289 | " print (row.year, row.artist_name, row.album_name, row.city)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "### Did you get an error? You can not try to access a column or a clustering column if you have not used the other defined clustering column. Let's see if we can try it a different way. \n", 297 | "Try Query 4: \n", 298 | "\n" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 14, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "London\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "query = \"select city from music_library where year = 1965 and artist_name = 'The Who' and album_name = 'My Generation'\"\n", 316 | "try:\n", 317 | " rows = session.execute(query)\n", 318 | "except Exception as e:\n", 319 | " print(e)\n", 320 | " \n", 321 | "for row in rows:\n", 322 | " print (row.city)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 16, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "Error from server: code=2200 [Invalid query] message=\"PRIMARY KEY column \"album_name\" cannot be restricted as preceding column \"artist_name\" is not restricted\"\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "query = \"select city from music_library where album_name = 'Rubber Soul'\"\n", 340 | "try:\n", 341 | " rows = session.execute(query)\n", 342 | "except Exception as e:\n", 343 | " print(e)\n", 344 | " \n", 345 | "for row in rows:\n", 346 | " print (row.city)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 18, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "Oxford\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "query = \"select city from music_library where year = 1965 and artist_name = 'The Beatles' and album_name = 'Rubber Soul'\"\n", 364 | "try:\n", 365 | " rows = session.execute(query)\n", 366 | "except Exception as e:\n", 367 | " print(e)\n", 368 | " \n", 369 | "for row in rows:\n", 370 | " print (row.city)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### And Finally close the session and cluster connection" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 19, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "session.shutdown()\n", 387 | "cluster.shutdown()" 388 | ] 389 | } 390 | ], 391 | "metadata": { 392 | "kernelspec": { 393 | "display_name": "Python 3", 394 | "language": "python", 395 | "name": "python3" 396 | }, 397 | "language_info": { 398 | "codemirror_mode": { 399 | "name": "ipython", 400 | "version": 3 401 | }, 402 | "file_extension": 
".py", 403 | "mimetype": "text/x-python", 404 | "name": "python", 405 | "nbconvert_exporter": "python", 406 | "pygments_lexer": "ipython3", 407 | "version": "3.6.3" 408 | } 409 | }, 410 | "nbformat": 4, 411 | "nbformat_minor": 2 412 | } 413 | -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 1.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 2.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 3.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Instructions 4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Instructions 4.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/Project 1 Introduction.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/Project 1 Introduction.PNG -------------------------------------------------------------------------------- /Data-Modeling/Project 1/README.md: -------------------------------------------------------------------------------- 1 | Introduction 2 | 3 | A startup called Sparkify want to analyze the data they have been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. 4 | 5 | The aim is to create a Postgres Database Schema and ETL pipeline to optimize queries for song play analysis. 6 | 7 | Project Description 8 | 9 | In this project, I have to model data with Postgres and build and ETL pipeline using Python. On the database side, I have to define fact and dimension tables for a Star Schema for a specific focus. 
The ETL pipeline, in turn, transfers data from files located in two local directories into these tables in Postgres using Python and SQL
10 | 
11 | Schema for Song Play Analysis
12 | 
13 | Fact Table
14 | 
15 | songplays - records in log data associated with song plays
16 | 
17 | Dimension Tables
18 | 
19 | users - users in the app
20 | 
21 | songs - songs in the music database
22 | 
23 | artists - artists in the music database
24 | 
25 | time - timestamps of records in songplays broken down into specific units
26 | 
27 | Project Design
28 | 
29 | The database design is optimized: with a small number of tables and a few specific joins, we can get the most information and perform the analysis
30 | 
31 | The ETL design is likewise simple: read the JSON files, parse them, and store the values in the proper columns with the proper formatting
32 | 
33 | Database Script
34 | 
35 | Running "python create_tables.py" in the terminal makes it easy to create and recreate the tables
36 | 
37 | Jupyter Notebook
38 | 
39 | etl.ipynb is a Jupyter notebook for verifying each statement and the resulting data; the statements are then copied into etl.py, which is run from the terminal with "python etl.py", after which test.ipynb is run to check whether the data has been loaded into all the tables
40 | 
41 | Relevant Files Provided
42 | 
43 | test.ipynb displays the first few rows of each table to let you check your database
44 | 
45 | create_tables.py drops and creates your tables
46 | 
47 | etl.ipynb reads and processes a single file from song_data and log_data and loads the records into your tables from the Jupyter notebook
48 | 
49 | etl.py reads and processes the files from song_data and log_data and loads the records into your tables; it is the ETL script run from the terminal
50 | 
51 | sql_queries.py contains all your SQL queries and is imported into the last three files above
--------------------------------------------------------------------------------
/Data-Modeling/Project 1/create_tables.py:
--------------------------------------------------------------------------------
1 | import psycopg2
2 | from sql_queries import create_table_queries, drop_table_queries
3 | 
4 | 
5 | def create_database():
6 |     # connect to default database
7 |     conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
8 |     conn.set_session(autocommit=True)
9 |     cur = conn.cursor()
10 | 
11 |     # create sparkify database with UTF8 encoding
12 |     cur.execute("DROP DATABASE IF EXISTS sparkifydb")
13 |     cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
14 | 
15 |     # close connection to default database
16 |     conn.close()
17 | 
18 |     # connect to sparkify database
19 |     conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
20 |     cur = conn.cursor()
21 | 
22 |     return cur, conn
23 | 
24 | 
25 | def drop_tables(cur, conn):
26 |     for query in drop_table_queries:
27 |         cur.execute(query)
28 |         conn.commit()
29 | 
30 | 
31 | def create_tables(cur, conn):
32 |     for query in create_table_queries:
33 |         cur.execute(query)
34 |         conn.commit()
35 | 
36 | 
37 | def main():
38 |     cur, conn = create_database()
39 | 
40 |     drop_tables(cur, conn)
41 |     create_tables(cur, conn)
42 | 
43 |     conn.close()
44 | 
45 | 
46 | if __name__ == "__main__":
47 |     main()
--------------------------------------------------------------------------------
/Data-Modeling/Project 1/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 1/data.zip
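A quick way to confirm the load described in the README above (what test.ipynb does in more detail) is a small row-count check against the sparkifydb tables. The following is only an illustrative sketch that reuses the connection string and table names from this project's scripts; it is not a file from the repository:

```python
import psycopg2

# Connect to the database created by create_tables.py
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Count the rows in each table loaded by etl.py
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute("SELECT COUNT(*) FROM {}".format(table))
    print("{}: {} rows".format(table, cur.fetchone()[0]))

conn.close()
```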
-------------------------------------------------------------------------------- /Data-Modeling/Project 1/etl.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import psycopg2 4 | import pandas as pd 5 | import numpy as np 6 | from sql_queries import * 7 | 8 | 9 | def process_song_file(cur, filepath): 10 | """ 11 | This function reads JSON files and read information of song and artist data and saves into song_data and artist_data 12 | Arguments: 13 | cur: Database Cursor 14 | filepath: location of JSON files 15 | Return: None 16 | """ 17 | # open song file 18 | df = pd.read_json(filepath, lines=True) 19 | 20 | # insert song record 21 | song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist() 22 | cur.execute(song_table_insert, song_data) 23 | 24 | # insert artist record 25 | artist_data = df[["artist_id", "artist_name", "artist_location", "artist_latitude", "artist_longitude"]].values[0].tolist() 26 | cur.execute(artist_table_insert, artist_data) 27 | 28 | 29 | def process_log_file(cur, filepath): 30 | """ 31 | This function reads Log files and reads information of time, user and songplay data and saves into time, user, songplay 32 | Arguments: 33 | cur: Database Cursor 34 | filepath: location of Log files 35 | Return: None 36 | """ 37 | 38 | # open log file 39 | df = pd.read_json(filepath, lines=True) 40 | 41 | # filter by NextSong action 42 | df = df[(df['page'] == 'NextSong')] 43 | 44 | # convert timestamp column to datetime 45 | t = pd.to_datetime(df['ts'], unit='ms') 46 | df['ts'] = pd.to_datetime(df['ts'], unit='ms') 47 | 48 | # insert time data records 49 | time_data = list((t, t.dt.hour, t.dt.day, t.dt.weekofyear, t.dt.month, t.dt.year, t.dt.weekday)) 50 | column_labels = list(('start_time', 'hour', 'day', 'week', 'month', 'year', 'weekday')) 51 | time_df = pd.DataFrame.from_dict(dict(zip(column_labels, time_data))) 52 | 53 | for i, row in time_df.iterrows(): 54 | cur.execute(time_table_insert, list(row)) 55 | 56 | # load user table 57 | user_df = df[["userId", "firstName", "lastName", "gender", "level"]] 58 | 59 | # insert user records 60 | for i, row in user_df.iterrows(): 61 | cur.execute(user_table_insert, row) 62 | 63 | # insert songplay records 64 | for index, row in df.iterrows(): 65 | 66 | # get songid and artistid from song and artist tables 67 | cur.execute(song_select, (row.song, row.artist, row.length)) 68 | results = cur.fetchone() 69 | 70 | if results: 71 | songid, artistid = results 72 | else: 73 | songid, artistid = None, None 74 | 75 | # insert songplay record 76 | songplay_data = (index, row.ts, row.userId, row.level, songid, artistid, row.sessionId,\ 77 | row.location, row.userAgent) 78 | cur.execute(songplay_table_insert, songplay_data) 79 | 80 | 81 | def process_data(cur, conn, filepath, func): 82 | # get all files matching extension from directory 83 | all_files = [] 84 | for root, dirs, files in os.walk(filepath): 85 | files = glob.glob(os.path.join(root,'*.json')) 86 | for f in files : 87 | all_files.append(os.path.abspath(f)) 88 | 89 | # get total number of files found 90 | num_files = len(all_files) 91 | print('{} files found in {}'.format(num_files, filepath)) 92 | 93 | # iterate over files and process 94 | for i, datafile in enumerate(all_files, 1): 95 | func(cur, datafile) 96 | conn.commit() 97 | print('{}/{} files processed.'.format(i, num_files)) 98 | 99 | 100 | def main(): 101 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student 
password=student") 102 | cur = conn.cursor() 103 | 104 | process_data(cur, conn, filepath='data/song_data', func=process_song_file) 105 | process_data(cur, conn, filepath='data/log_data', func=process_log_file) 106 | 107 | conn.close() 108 | 109 | 110 | if __name__ == "__main__": 111 | main() -------------------------------------------------------------------------------- /Data-Modeling/Project 1/sql_queries.py: -------------------------------------------------------------------------------- 1 | # DROP TABLES 2 | 3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays" 4 | user_table_drop = "DROP TABLE IF EXISTS users" 5 | song_table_drop = "DROP TABLE IF EXISTS songs" 6 | artist_table_drop = "DROP TABLE IF EXISTS artists" 7 | time_table_drop = "DROP TABLE IF EXISTS time" 8 | 9 | # CREATE TABLES 10 | 11 | songplay_table_create = (""" 12 | CREATE TABLE IF NOT EXISTS songplays ( 13 | songplay_id SERIAL PRIMARY KEY, 14 | start_time TIMESTAMP, 15 | user_id INTEGER, 16 | level VARCHAR(10), 17 | song_id VARCHAR(20), 18 | artist_id VARCHAR(20), 19 | session_id INTEGER, 20 | location VARCHAR(50), 21 | user_agent VARCHAR(150) 22 | ); 23 | """) 24 | 25 | user_table_create = (""" 26 | CREATE TABLE IF NOT EXISTS users ( 27 | user_id INTEGER PRIMARY KEY, 28 | first_name VARCHAR(50), 29 | last_name VARCHAR(50), 30 | gender CHAR(1), 31 | level VARCHAR(10) 32 | ); 33 | """) 34 | 35 | song_table_create = (""" 36 | CREATE TABLE IF NOT EXISTS songs ( 37 | song_id VARCHAR(20) PRIMARY KEY, 38 | title VARCHAR(100), 39 | artist_id VARCHAR(20) NOT NULL, 40 | year INTEGER, 41 | duration FLOAT(5) 42 | ); 43 | """) 44 | 45 | artist_table_create = (""" 46 | CREATE TABLE IF NOT EXISTS artists ( 47 | artist_id VARCHAR(20) PRIMARY KEY, 48 | name VARCHAR(100), 49 | location VARCHAR(100), 50 | lattitude FLOAT(5), 51 | longitude FLOAT(5) 52 | ); 53 | """) 54 | 55 | time_table_create = (""" 56 | CREATE TABLE IF NOT EXISTS time ( 57 | start_time TIMESTAMP PRIMARY KEY, 58 | hour INTEGER, 59 | day INTEGER, 60 | week INTEGER, 61 | month INTEGER, 62 | year INTEGER, 63 | weekday INTEGER 64 | ); 65 | """) 66 | 67 | # INSERT RECORDS 68 | 69 | songplay_table_insert = (""" 70 | INSERT INTO songplays (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) 71 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s) 72 | ON CONFLICT(songplay_id) DO NOTHING; 73 | """) 74 | 75 | user_table_insert = (""" 76 | INSERT INTO users (user_id, first_name, last_name, gender, level) 77 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level; 78 | """) 79 | 80 | song_table_insert = (""" 81 | INSERT INTO songs (song_id, title, artist_id, year, duration) 82 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 83 | """) 84 | 85 | artist_table_insert = (""" 86 | INSERT INTO artists (artist_id, name, location, lattitude, longitude) 87 | VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 88 | """) 89 | 90 | 91 | time_table_insert = (""" 92 | INSERT INTO time (start_time, hour, day, week, month, year, weekday) 93 | VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING; 94 | """) 95 | 96 | # FIND SONGS 97 | 98 | song_select = (""" 99 | SELECT ss.song_id, ss.artist_id FROM songs ss 100 | JOIN artists ars on ss.artist_id = ars.artist_id 101 | WHERE ss.title = %s 102 | AND ars.name = %s 103 | AND ss.duration = %s 104 | ; 105 | """) 106 | 107 | # QUERY LISTS 108 | 109 | create_table_queries = [songplay_table_create, user_table_create, song_table_create, artist_table_create, 
time_table_create] 110 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] -------------------------------------------------------------------------------- /Data-Modeling/Project 2/Project_1B_ Project_Template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part I. ETL Pipeline for Pre-Processing the Files" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## PLEASE RUN THE FOLLOWING CODE FOR PRE-PROCESSING THE FILES" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "#### Import Python packages " 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# Import Python packages \n", 31 | "import pandas as pd\n", 32 | "import cassandra\n", 33 | "import re\n", 34 | "import os\n", 35 | "import glob\n", 36 | "import numpy as np\n", 37 | "import json\n", 38 | "import csv" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "#### Creating list of filepaths to process original event csv data files" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# checking your current working directory\n", 55 | "print(os.getcwd())\n", 56 | "\n", 57 | "# Get your current folder and subfolder event data\n", 58 | "filepath = os.getcwd() + '/event_data'\n", 59 | "\n", 60 | "# Create a for loop to create a list of files and collect each filepath\n", 61 | "for root, dirs, files in os.walk(filepath):\n", 62 | " \n", 63 | "# join the file path and roots with the subdirectories using glob\n", 64 | " file_path_list = glob.glob(os.path.join(root,'*'))\n", 65 | " #print(file_path_list)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "#### Processing the files to create the data file csv that will be used for Apache Casssandra tables" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "# initiating an empty list of rows that will be generated from each file\n", 82 | "full_data_rows_list = [] \n", 83 | " \n", 84 | "# for every filepath in the file path list \n", 85 | "for f in file_path_list:\n", 86 | "\n", 87 | "# reading csv file \n", 88 | " with open(f, 'r', encoding = 'utf8', newline='') as csvfile: \n", 89 | " # creating a csv reader object \n", 90 | " csvreader = csv.reader(csvfile) \n", 91 | " next(csvreader)\n", 92 | " \n", 93 | " # extracting each data row one by one and append it \n", 94 | " for line in csvreader:\n", 95 | " #print(line)\n", 96 | " full_data_rows_list.append(line) \n", 97 | " \n", 98 | "# uncomment the code below if you would like to get total number of rows \n", 99 | "#print(len(full_data_rows_list))\n", 100 | "# uncomment the code below if you would like to check to see what the list of event data rows will look like\n", 101 | "#print(full_data_rows_list)\n", 102 | "\n", 103 | "# creating a smaller event data csv file called event_datafile_full csv that will be used to insert data into the \\\n", 104 | "# Apache Cassandra tables\n", 105 | "csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, skipinitialspace=True)\n", 106 | "\n", 107 | "with 
open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:\n", 108 | " writer = csv.writer(f, dialect='myDialect')\n", 109 | " writer.writerow(['artist','firstName','gender','itemInSession','lastName','length',\\\n", 110 | " 'level','location','sessionId','song','userId'])\n", 111 | " for row in full_data_rows_list:\n", 112 | " if (row[0] == ''):\n", 113 | " continue\n", 114 | " writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "# check the number of rows in your csv file\n", 124 | "with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:\n", 125 | " print(sum(1 for line in f))" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "# Part II. Complete the Apache Cassandra coding portion of your project. \n", 133 | "\n", 134 | "## Now you are ready to work with the CSV file titled event_datafile_new.csv, located within the Workspace directory. The event_datafile_new.csv contains the following columns: \n", 135 | "- artist \n", 136 | "- firstName of user\n", 137 | "- gender of user\n", 138 | "- item number in session\n", 139 | "- last name of user\n", 140 | "- length of the song\n", 141 | "- level (paid or free song)\n", 142 | "- location of the user\n", 143 | "- sessionId\n", 144 | "- song title\n", 145 | "- userId\n", 146 | "\n", 147 | "The image below is a screenshot of what the denormalized data should appear like in the **event_datafile_new.csv** after the code above is run:
    \n", 148 | "\n", 149 | "" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Begin writing your Apache Cassandra code in the cells below" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "#### Creating a Cluster" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "# This should make a connection to a Cassandra instance your local machine \n", 173 | "# (127.0.0.1)\n", 174 | "\n", 175 | "from cassandra.cluster import Cluster\n", 176 | "cluster = Cluster()\n", 177 | "\n", 178 | "# To establish connection and begin executing queries, need a session\n", 179 | "session = cluster.connect()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "#### Create Keyspace" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "# TO-DO: Create a Keyspace " 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "#### Set Keyspace" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# TO-DO: Set KEYSPACE to the keyspace specified above\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "### Now we need to create tables to run the following queries. Remember, with Apache Cassandra you model the database tables on the queries you want to run." 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "## Create queries to ask the following three questions of the data\n", 226 | "\n", 227 | "### 1. Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4\n", 228 | "\n", 229 | "\n", 230 | "### 2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182\n", 231 | " \n", 232 | "\n", 233 | "### 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n", 234 | "\n", 235 | "\n" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 1, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "## TO-DO: Query 1: Give me the artist, song title and song's length in the music app history that was heard during \\\n", 245 | "## sessionId = 338, and itemInSession = 4\n", 246 | "\n", 247 | "\n", 248 | " " 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "scrolled": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "# We have provided part of the code to set up the CSV file. 
Please complete the Apache Cassandra code below#\n", 260 | "file = 'event_datafile_new.csv'\n", 261 | "\n", 262 | "with open(file, encoding = 'utf8') as f:\n", 263 | " csvreader = csv.reader(f)\n", 264 | " next(csvreader) # skip header\n", 265 | " for line in csvreader:\n", 266 | "## TO-DO: Assign the INSERT statements into the `query` variable\n", 267 | " query = \"\"\n", 268 | " query = query + \"\"\n", 269 | " ## TO-DO: Assign which column element should be assigned for each column in the INSERT statement.\n", 270 | " ## For e.g., to INSERT artist_name and user first_name, you would change the code below to `line[0], line[1]`\n", 271 | " session.execute(query, (line[#], line[#]))" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "#### Do a SELECT to verify that the data have been inserted into each table" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "scrolled": true 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "## TO-DO: Add in the SELECT statement to verify the data was entered into the table" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "### COPY AND REPEAT THE ABOVE THREE CELLS FOR EACH OF THE THREE QUESTIONS" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "## TO-DO: Query 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name)\\\n", 306 | "## for userid = 10, sessionid = 182\n", 307 | "\n", 308 | "\n", 309 | " " 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "## TO-DO: Query 3: Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n", 319 | "\n", 320 | "\n", 321 | " " 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "### Drop the tables before closing out the sessions" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 4, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "## TO-DO: Drop the table before closing out the sessions" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "### Close the session and cluster connection¶" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "session.shutdown()\n", 375 | "cluster.shutdown()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [] 391 | } 392 | ], 393 | "metadata": { 394 | "kernelspec": { 395 | "display_name": "Python 3", 396 | "language": "python", 397 | "name": 
"python3" 398 | }, 399 | "language_info": { 400 | "codemirror_mode": { 401 | "name": "ipython", 402 | "version": 3 403 | }, 404 | "file_extension": ".py", 405 | "mimetype": "text/x-python", 406 | "name": "python", 407 | "nbconvert_exporter": "python", 408 | "pygments_lexer": "ipython3", 409 | "version": "3.6.3" 410 | } 411 | }, 412 | "nbformat": 4, 413 | "nbformat_minor": 2 414 | } 415 | -------------------------------------------------------------------------------- /Data-Modeling/Project 2/README.md: -------------------------------------------------------------------------------- 1 | Project: Data Modeling with Cassandra 2 | 3 | Introduction: 4 | 5 | A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. There is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app. My role is to create an Apache Cassandra database which can create queries on song play data to answer the questions. 6 | 7 | Project Overview: 8 | 9 | In this project, I would be applying Data Modeling with Apache Cassandra and complete an ETL pipeline using Python. I am provided with part of the ETL pipeline that transfers data from a set of CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables. 10 | 11 | Datasets: 12 | 13 | For this project, you'll be working with one dataset: event_data. The directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset: 14 | event_data/2018-11-08-events.csv 15 | event_data/2018-11-09-events.csv 16 | 17 | Project Template: 18 | 19 | The project template includes one Jupyter Notebook file, in which: 20 | • you will process the event_datafile_new.csv dataset to create a denormalized dataset 21 | • you will model the data tables keeping in mind the queries you need to run 22 | • you have been provided queries that you will need to model your data tables for 23 | • you will load the data into tables you create in Apache Cassandra and run your queries 24 | 25 | Project Steps: 26 | 27 | Below are steps you can follow to complete each component of this project. 28 | 29 | Modelling your NoSQL Database or Apache Cassandra Database: 30 | 31 | 1. Design tables to answer the queries outlined in the project template 32 | 2. Write Apache Cassandra CREATE KEYSPACE and SET KEYSPACE statements 33 | 3. Develop your CREATE statement for each of the tables to address each question 34 | 4. Load the data with INSERT statement for each of the tables 35 | 5. Include IF NOT EXISTS clauses in your CREATE statements to create tables only if the tables do not already exist. We recommend you also include DROP TABLE statement for each table, this way you can run drop and create tables whenever you want to reset your database and test your ETL pipeline 36 | 6. Test by running the proper select statements with the correct WHERE clause 37 | 38 | Build ETL Pipeline: 39 | 1. Implement the logic in section Part I of the notebook template to iterate through each event file in event_data to process and create a new CSV file in Python 40 | 2. Make necessary edits to Part II of the notebook template to include Apache Cassandra CREATE and INSERT three statements to load processed records into relevant tables in your data model 41 | 3. Test by running three SELECT statements after running the queries on your database 42 | 4. 
Finally, drop the tables and shut down the cluster
43 | 
44 | Files:
45 | 
46 | Project_1B_Project_Template.ipynb: This is the template file provided, to be filled in with the details and the Python code
47 | 
48 | Project_1B.ipynb: This is the final notebook, in which all the queries have been written: it imports the event files, generates a new CSV file by combining all of them into one, and verifies that all the tables have been loaded as required
49 | 
50 | Event_datafile_new.csv: This is the final combination of all the files in the event_data folder
51 | 
52 | Event_Data Folder: Each event file is present separately; all of these files are combined into event_datafile_new.csv
53 | 
54 | 
--------------------------------------------------------------------------------
/Data-Modeling/Project 2/event_data.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 2/event_data.rar
--------------------------------------------------------------------------------
/Data-Modeling/Project 2/images.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nareshk1290/Udacity-Data-Engineering/a3d202ac5e1a74d16be8a64fc63ae841e7a7639d/Data-Modeling/Project 2/images.rar
--------------------------------------------------------------------------------
/Data-Modeling/Readme.md:
--------------------------------------------------------------------------------
1 | Data Modeling with Postgres and Apache Cassandra
2 | 
3 | Exercises and Projects
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineering Nanodegree
2 | 
3 | Projects and resources developed in the [DEND Nanodegree](https://www.udacity.com/course/data-engineer-nanodegree--nd027) from Udacity.
4 | 
5 | ## Project 1: [Relational Databases - Data Modeling with PostgreSQL](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data-Modeling/Project%201).
6 | Developed a relational database using PostgreSQL to model user activity data for a music streaming app. Skills include:
7 | * Created a relational database using PostgreSQL
8 | * Developed a Star Schema database using optimized definitions of Fact and Dimension tables, with normalization of tables.
9 | * Built out an ETL pipeline to optimize queries in order to understand what songs users listen to.
10 | 
11 | Proficiencies include: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization
12 | 
13 | 
14 | ## Project 2: [NoSQL Databases - Data Modeling with Apache Cassandra](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data-Modeling/Project%202).
15 | Designed a NoSQL database using Apache Cassandra based on the original schema outlined in project one. Skills include:
16 | * Created a NoSQL database using Apache Cassandra (both locally and with Docker containers)
17 | * Developed denormalized tables optimized for a specific set of queries and business needs
18 | 
19 | Proficiencies used: Python, Apache Cassandra, Denormalization
20 | 
21 | 
22 | ## Project 3: [Data Warehouse - Amazon Redshift](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Cloud%20Data%20Warehouse/Project%20Data%20Warehouse%20with%20AWS).
23 | Created a data warehouse utilizing Amazon Redshift. Skills include:
24 | * Created a Redshift cluster, IAM roles, and security groups.
25 | * Developed an ETL pipeline that copies data from S3 buckets into staging tables, which are then processed into a star schema.
26 | * Developed a star schema optimized for the specific queries required by the data analytics team.
27 | 
28 | Proficiencies used: Python, Amazon Redshift, aws cli, Amazon SDK, SQL, PostgreSQL
29 | 
30 | ## Project 4: [Data Lake - Spark](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data%20Lakes%20with%20Spark/Project%20Data%20Lake%20with%20Spark)
31 | Scaled up the existing ETL pipeline by moving the data warehouse to a data lake. Skills include:
32 | * Created an EMR Hadoop cluster.
33 | * Further developed the ETL pipeline, copying datasets from S3 buckets, processing the data with Spark, and writing back to S3 using efficient partitioning and Parquet formatting.
34 | * Fast-tracked the data lake buildout using (serverless) AWS Lambda and cataloged tables with AWS Glue Crawler.
35 | 
36 | Technologies used: Spark, S3, EMR, Athena, Amazon Glue, Parquet.
37 | 
38 | ## Project 5: [Data Pipelines - Airflow](https://github.com/nareshk1290/Udacity-Data-Engineering/tree/master/Data%20Pipeline%20with%20Airflow/Project%20Data%20Pipeline%20with%20Airflow)
39 | Automated the ETL pipeline and the creation of the data warehouse using Apache Airflow. Skills include:
40 | * Used Airflow to automate ETL pipelines with Python and Amazon Redshift.
41 | * Wrote custom operators to perform tasks such as staging data, filling the data warehouse, and validating the results through data quality checks.
42 | * Transformed data from various sources into a star schema optimized for the analytics team's use cases.
43 | 
44 | Technologies used: Apache Airflow, S3, Amazon Redshift, Python.
45 | 
--------------------------------------------------------------------------------
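The custom operators mentioned under Project 5 include a data quality check. Below is a minimal sketch of how such a check is commonly written as an Airflow 1.x custom operator; the class name, default connection id, and table list are illustrative assumptions, not the repository's actual code:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fail the task if any of the given Redshift tables is empty (illustrative sketch)."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        # Connection id is an assumption; it must exist in the Airflow connections list
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} returned no rows".format(table))
            self.log.info("Data quality check on %s passed with %s records", table, records[0][0])
```

In a DAG, this operator would typically run after the fact and dimension tables are loaded, with `tables` set to the names of the tables to validate.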