├── .gitignore
├── 1-data-modeling
│   ├── L1_Exercise_1_Creating_a_Table_with_Postgres.ipynb
│   ├── L1_Exercise_2_Creating_a_Table_with_Apache_Cassandra.ipynb
│   ├── L2_Exercise_1_Creating_Normalized_Tables.ipynb
│   ├── L2_Exercise_2_Creating_Denormalized_Tables.ipynb
│   ├── L2_Exercise_3_Creating_Fact_and_Dimension_Tables_with_Star_Schema.ipynb
│   ├── L3-Project_Data_Modeling_with_Postgres
│   │   ├── .gitignore
│   │   ├── README.md
│   │   ├── create_tables.py
│   │   ├── etl.ipynb
│   │   ├── etl.py
│   │   ├── sql_queries.py
│   │   └── test.ipynb
│   ├── L4-demo-1-2-queries-2-tables.ipynb
│   ├── L4-demo-2-primary-key.ipynb
│   ├── L4-demo-3-clustering-column.ipynb
│   ├── L4-demo-4-using-the-where-clause.ipynb
│   ├── L4_Exercise_1_Three_Queries_Three_Tables.ipynb
│   ├── L4_Exercise_2_Primary_Key.ipynb
│   ├── L4_Exercise_3_Clustering_Column.ipynb
│   ├── L4_Exercise_4_Using_the_WHERE_Clause.ipynb
│   └── L5-Project_Data_Modeling_with_Apache_Cassandra
│       ├── .gitignore
│       └── Project_1B_Project_Template.ipynb
├── 2-cloud-data-warehouses
│   ├── L1_E1_-_Step_1_&_2.ipynb
│   ├── L1_E1_-_Step_3.ipynb
│   ├── L1_E1_-_Step_4.ipynb
│   ├── L1_E1_-_Step_5.ipynb
│   ├── L1_E1_-_Step_6.ipynb
│   ├── L1_E2_-_1_-_Slicing_and_Dicing.ipynb
│   ├── L1_E2_-_2_-_Roll_up_and_Drill_Down.ipynb
│   ├── L1_E2_-_3_-_Grouping_Sets.ipynb
│   ├── L1_E2_-_4_-_CUBE.ipynb
│   ├── L1_E3_-_Columnar_Vs_Row_Storage.ipynb
│   ├── L3_Exercise_2_-_IaC.ipynb
│   ├── L3_Exercise_3_-_Parallel_ETL.ipynb
│   ├── L3_Exercise_4_-_Table_Design.ipynb
│   └── L4_Project_-_Data_Warehouse
│       ├── .gitignore
│       ├── README.md
│       ├── analyze.py
│       ├── aws_check_cluster_available.py
│       ├── aws_create_cluster.py
│       ├── aws_destroy_cluster.py
│       ├── create_tables.py
│       ├── data-warehouse-project-der-diagram.png
│       ├── dwh.cfg.example
│       ├── etl.py
│       └── sql_queries.py
├── 3-data-lakes-with-spark
│   ├── 10_L4_Exercise_2_-_Advanced_Analytics_NLP.ipynb
│   ├── 11_L4_Exercise_3_-_Data_Lake_on_S3.ipynb
│   ├── 1_procedural_vs_functional_in_python.ipynb
│   ├── 2_spark_maps_and_lazy_evaluation.ipynb
│   ├── 3_data_inputs_and_outputs.ipynb
│   ├── 4_data_wrangling.ipynb
│   ├── 5_dataframe_quiz.ipynb
│   ├── 7_data_wrangling-sql.ipynb
│   ├── 8_spark_sql_quiz.ipynb
│   ├── 9_L4_Exercise_1_-_Schema_On_Read.ipynb
│   ├── L4_Project
│   │   ├── .gitignore
│   │   ├── README.md
│   │   ├── dl.cfg.example
│   │   └── etl.py
│   ├── data
│   │   └── sparkify_log_small.json
│   └── mapreduce_practice.ipynb
├── 4-data-pipelines-with-airflow
│   ├── L1_exercises
│   │   ├── exercise1.py
│   │   ├── exercise2.py
│   │   ├── exercise3.py
│   │   ├── exercise4.py
│   │   ├── exercise5.py
│   │   ├── exercise6.py
│   │   └── sql_statements.py
│   ├── L2_exercises
│   │   ├── exercise1.py
│   │   ├── exercise2.py
│   │   ├── exercise3.py
│   │   ├── exercise4.py
│   │   └── sql_statements.py
│   ├── L3_exercises
│   │   ├── exercise1.py
│   │   ├── exercise2.py
│   │   ├── exercise3
│   │   │   ├── dag.py
│   │   │   └── subdag.py
│   │   ├── exercise4.py
│   │   ├── operators
│   │   │   ├── __init__.py
│   │   │   ├── facts_calculator.py
│   │   │   ├── has_rows.py
│   │   │   └── s3_to_redshift.py
│   │   └── sql_statements.py
│   └── L4_project
│       ├── README.md
│       ├── create_tables.sql
│       ├── dags
│       │   └── sparkify_analytical_tables_dag.py
│       ├── images
│       │   └── dag.png
│       └── plugins
│           ├── __init__.py
│           ├── helpers
│           │   ├── __init__.py
│           │   └── sql_queries.py
│           └── operators
│               ├── __init__.py
│               ├── data_quality.py
│               ├── load_dimension.py
│               ├── load_fact.py
│               └── stage_redshift.py
├── 5-capstone-project
│   ├── README.md
│   └── datasets-exploration.ipynb
├── README.md
└── explorations
    └── nyc-taxi-challenge.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | .DS_Store
3 | .ipynb_checkpoints
4 | __pycache__
5 | explorations/outputs/*
6 |
7 |
--------------------------------------------------------------------------------
/1-data-modeling/L1_Exercise_1_Creating_a_Table_with_Postgres.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 1 Exercise 1: Creating a Table with PostgreSQL\n",
8 | "\n",
9 | "
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "### Walk through the basics of PostgreSQL. You will need to complete the following tasks:
Create a table in PostgreSQL, Insert rows of data Run a simple SQL query to validate the information.
\n",
17 | "`#####` denotes where the code needs to be completed. \n",
18 | " \n",
19 | "Note: __Do not__ click the blue Preview button in the lower task bar"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "#### Import the library \n",
27 | "*Note:* An error might popup after this command has executed. If it does, read it carefully before ignoring. "
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 16,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "import psycopg2"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 17,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "name": "stdout",
46 | "output_type": "stream",
47 | "text": [
48 | "ALTER ROLE\r\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "!echo \"alter user student createdb;\" | sudo -u postgres psql"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "### Create a connection to the database"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 18,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "try: \n",
70 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")\n",
71 | "except psycopg2.Error as e: \n",
72 | " print(\"Error: Could not make connection to the Postgres database\")\n",
73 | " print(e)"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "### Use the connection to get a cursor that can be used to execute queries."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 19,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "try: \n",
90 | " cur = conn.cursor()\n",
91 | "except psycopg2.Error as e: \n",
92 | " print(\"Error: Could not get curser to the Database\")\n",
93 | " print(e)"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "### Set automatic commit to be true so that each action is committed without having to call conn.commit() after each command. "
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 20,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "conn.set_session(autocommit=True)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "### Create a database to do the work in. "
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 21,
122 | "metadata": {},
123 | "outputs": [
124 | {
125 | "name": "stdout",
126 | "output_type": "stream",
127 | "text": [
128 | "database \"music_library\" already exists\n",
129 | "\n"
130 | ]
131 | }
132 | ],
133 | "source": [
134 | "try: \n",
135 | " cur.execute(\"create database music_library\")\n",
136 | "except psycopg2.Error as e:\n",
137 | " print(e)"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "#### Add the database name in the connect statement. Let's close our connection to the default database, reconnect to the Udacity database, and get a new cursor."
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 22,
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "try: \n",
154 | " conn.close()\n",
155 | "except psycopg2.Error as e:\n",
156 | " print(e)\n",
157 | " \n",
158 | "try: \n",
159 | " conn = psycopg2.connect(\"host=127.0.0.1 dbname=music_library user=student password=student\")\n",
160 | "except psycopg2.Error as e: \n",
161 | " print(\"Error: Could not make connection to the Postgres database\")\n",
162 | " print(e)\n",
163 | " \n",
164 | "try: \n",
165 | " cur = conn.cursor()\n",
166 | "except psycopg2.Error as e: \n",
167 | " print(\"Error: Could not get curser to the Database\")\n",
168 | " print(e)\n",
169 | "\n",
170 | "conn.set_session(autocommit=True)"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. \n",
178 | "\n",
179 | "`song_title\n",
180 | "artist_name\n",
181 | "year\n",
182 | "album_name\n",
183 | "single`\n"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 23,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "## TO-DO: Finish writing the CREATE TABLE statement with the correct arguments\n",
193 | "try: \n",
194 | " cur.execute(\"CREATE TABLE IF NOT EXISTS songs (song_title varchar, artist_name varchar, year integer, album_name varchar, single boolean);\")\n",
195 | "except psycopg2.Error as e: \n",
196 | " print(\"Error: Issue creating table\")\n",
197 | " print (e)"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "### Insert the following two rows in the table\n",
205 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n",
206 | "\n",
207 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 24,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "## TO-DO: Finish the INSERT INTO statement with the correct arguments\n",
217 | "\n",
218 | "try: \n",
219 | " cur.execute(\"INSERT INTO songs (song_title, artist_name, year, album_name, single) \\\n",
220 | " VALUES (%s, %s, %s, %s, %s)\", \\\n",
221 | " (\"Across The Universe\", \"The Beatles\", 1970, \"Let It Be\", False))\n",
222 | "except psycopg2.Error as e: \n",
223 | " print(\"Error: Inserting Rows\")\n",
224 | " print (e)\n",
225 | " \n",
226 | "try: \n",
227 | " cur.execute(\"INSERT INTO songs (song_title, artist_name, year, album_name, single) \\\n",
228 | " VALUES (%s, %s, %s, %s, %s)\",\n",
229 | " (\"Think For Yourself\", \"The Beatles\", 1965, \"Rubber Soul\", False))\n",
230 | "except psycopg2.Error as e: \n",
231 | " print(\"Error: Inserting Rows\")\n",
232 | " print (e)"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "### Validate your data was inserted into the table. \n"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 25,
245 | "metadata": {},
246 | "outputs": [
247 | {
248 | "name": "stdout",
249 | "output_type": "stream",
250 | "text": [
251 | "('Across The Universe', 'The Beatles', 1970, 'Let It Be', 'False')\n",
252 | "('Think For Yourself', 'The Beatles', 1965, 'Rubber Soul', 'False')\n",
253 | "('Across The Universe', 'The Beatles', 1970, 'Let It Be', 'false')\n",
254 | "('Think For Yourself', 'The Beatles', 1965, 'Rubber Soul', 'false')\n"
255 | ]
256 | }
257 | ],
258 | "source": [
259 | "try: \n",
260 | " cur.execute(\"SELECT * FROM songs;\")\n",
261 | "except psycopg2.Error as e: \n",
262 | " print(\"Error: select *\")\n",
263 | " print (e)\n",
264 | "\n",
265 | "row = cur.fetchone()\n",
266 | "while row:\n",
267 | " print(row)\n",
268 | " row = cur.fetchone()"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "### And finally close your cursor and connection. "
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 26,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "cur.close()\n",
285 | "conn.close()"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": []
294 | }
295 | ],
296 | "metadata": {
297 | "kernelspec": {
298 | "display_name": "Python 3",
299 | "language": "python",
300 | "name": "python3"
301 | },
302 | "language_info": {
303 | "codemirror_mode": {
304 | "name": "ipython",
305 | "version": 3
306 | },
307 | "file_extension": ".py",
308 | "mimetype": "text/x-python",
309 | "name": "python",
310 | "nbconvert_exporter": "python",
311 | "pygments_lexer": "ipython3",
312 | "version": "3.6.3"
313 | }
314 | },
315 | "nbformat": 4,
316 | "nbformat_minor": 2
317 | }
318 |
--------------------------------------------------------------------------------
/1-data-modeling/L1_Exercise_2_Creating_a_Table_with_Apache_Cassandra.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 1 Exercise 2: Creating a Table with Apache Cassandra\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### Walk through the basics of Apache Cassandra. Complete the following tasks: Create a table in Apache Cassandra, Insert rows of data, Run a simple SQL query to validate the information.
\n",
16 | "`#####` denotes where the code needs to be completed.\n",
17 | " \n",
18 | "Note: __Do not__ click the blue Preview button in the lower taskbar"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "#### Import Apache Cassandra python package"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 1,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "import cassandra"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "### Create a connection to the database"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "from cassandra.cluster import Cluster\n",
51 | "try: \n",
52 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
53 | " session = cluster.connect()\n",
54 | "except Exception as e:\n",
55 | " print(e)\n",
56 | " "
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### Create a keyspace to do the work in "
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 3,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "## TO-DO: Create the keyspace\n",
73 | "try:\n",
74 | " session.execute(\"\"\"\n",
75 | " CREATE KEYSPACE IF NOT EXISTS music_library \n",
76 | " WITH REPLICATION = \n",
77 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
78 | ")\n",
79 | "\n",
80 | "except Exception as e:\n",
81 | " print(e)"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "### Connect to the Keyspace"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 4,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "## To-Do: Add in the keyspace you created\n",
98 | "try:\n",
99 | " session.set_keyspace('music_library')\n",
100 | "except Exception as e:\n",
101 | " print(e)"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. \n",
109 | "\n",
110 | "`song_title\n",
111 | "artist_name\n",
112 | "year\n",
113 | "album_name\n",
114 | "single`"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### You need to create a table to be able to run the following query: \n",
122 | "`select * from songs WHERE year=1970 AND artist_name=\"The Beatles\"`"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 6,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "## TO-DO: Complete the query below\n",
132 | "query = \"CREATE TABLE IF NOT EXISTS songs \"\n",
133 | "query = query + \"(song_title text, artist_name text, year int, album_name text, single boolean, PRIMARY KEY (year, artist_name))\"\n",
134 | "try:\n",
135 | " session.execute(query)\n",
136 | "except Exception as e:\n",
137 | " print(e)\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "### Insert the following two rows in your table\n",
145 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n",
146 | "\n",
147 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 7,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "## Add in query and then run the insert statement\n",
157 | "query = \"INSERT INTO songs (song_title, artist_name, year, album_name, single)\" \n",
158 | "query = query + \" VALUES (%s, %s, %s, %s, %s)\"\n",
159 | "\n",
160 | "try:\n",
161 | " session.execute(query, (\"Across The Universe\", \"The Beatles\", 1970, \"Let It Be\", False))\n",
162 | "except Exception as e:\n",
163 | " print(e)\n",
164 | " \n",
165 | "try:\n",
166 | " session.execute(query, (\"Think For Yourself\", \"The Beatles\", 1965, \"Rubber Soul\", False))\n",
167 | "except Exception as e:\n",
168 | " print(e)"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "### Validate your data was inserted into the table."
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 8,
181 | "metadata": {
182 | "scrolled": true
183 | },
184 | "outputs": [
185 | {
186 | "name": "stdout",
187 | "output_type": "stream",
188 | "text": [
189 | "1965 Rubber Soul The Beatles\n",
190 | "1970 Let It Be The Beatles\n"
191 | ]
192 | }
193 | ],
194 | "source": [
195 | "## TO-DO: Complete and then run the select statement to validate the data was inserted into the table\n",
196 | "query = 'SELECT * FROM songs'\n",
197 | "try:\n",
198 | " rows = session.execute(query)\n",
199 | "except Exception as e:\n",
200 | " print(e)\n",
201 | " \n",
202 | "for row in rows:\n",
203 | " print (row.year, row.album_name, row.artist_name)"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "### Validate the Data Model with the original query.\n",
211 | "\n",
212 | "`select * from songs WHERE YEAR=1970 AND artist_name=\"The Beatles\"`"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 9,
218 | "metadata": {},
219 | "outputs": [
220 | {
221 | "name": "stdout",
222 | "output_type": "stream",
223 | "text": [
224 | "1970 Let It Be The Beatles\n"
225 | ]
226 | }
227 | ],
228 | "source": [
229 | "##TO-DO: Complete the select statement to run the query \n",
230 | "query = \"SELECT * FROM songs WHERE year = 1970 AND artist_name = 'The Beatles'\"\n",
231 | "try:\n",
232 | " rows = session.execute(query)\n",
233 | "except Exception as e:\n",
234 | " print(e)\n",
235 | " \n",
236 | "for row in rows:\n",
237 | " print (row.year, row.album_name, row.artist_name)"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "### And Finally close the session and cluster connection"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 10,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "session.shutdown()\n",
254 | "cluster.shutdown()"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": []
263 | }
264 | ],
265 | "metadata": {
266 | "kernelspec": {
267 | "display_name": "Python 3",
268 | "language": "python",
269 | "name": "python3"
270 | },
271 | "language_info": {
272 | "codemirror_mode": {
273 | "name": "ipython",
274 | "version": 3
275 | },
276 | "file_extension": ".py",
277 | "mimetype": "text/x-python",
278 | "name": "python",
279 | "nbconvert_exporter": "python",
280 | "pygments_lexer": "ipython3",
281 | "version": "3.6.3"
282 | }
283 | },
284 | "nbformat": 4,
285 | "nbformat_minor": 2
286 | }
287 |
--------------------------------------------------------------------------------
/1-data-modeling/L3-Project_Data_Modeling_with_Postgres/.gitignore:
--------------------------------------------------------------------------------
1 | data
2 | .ipynb_checkpoints
3 | __pycache__
4 |
--------------------------------------------------------------------------------
/1-data-modeling/L3-Project_Data_Modeling_with_Postgres/README.md:
--------------------------------------------------------------------------------
1 | # Sparkify song play logs ETL process
2 |
 3 | This project extracts, transforms, and loads 5 main tables from the logs of the Sparkify app (an app for listening to your favorite music):
 4 | - `users`
 5 | - `songs`
 6 | - `artists`
 7 | - `songplays`
 8 | - `time` - an auxiliary table that helps us break timestamps into comprehensible time-chunk columns (like `day`, `weekday`)
 9 |
10 | With this structured database we can extract several insights into the way our users listen to music, learning their habits from the hidden patterns inside this large quantity of data.
11 |
12 | Below you will find instructions on how to create this database, followed by an explanation of how it is structured.
13 |
14 | ## Running the ETL
15 |
16 | First, create the PostgreSQL database structure:
17 |
18 | ```
19 | python create_tables.py
20 | ```
21 |
22 | Then parse the log files:
23 |
24 | ```
25 | python etl.py
26 | ```
27 |
28 | ## Database Schema Design
29 |
30 | To learn how and why the schema was designed this way, read the docs below:
31 |
32 | ### Song Plays table
33 |
34 | - *Name:* `songplays`
35 | - *Type:* Fact table
36 |
37 | | Column | Type | Description |
38 | | ------ | ---- | ----------- |
39 | | `songplay_id` | `INTEGER` | The main identification of the table |
40 | | `start_time` | `TIMESTAMP NOT NULL` | The timestamp that this song play log happened |
41 | | `user_id` | `INTEGER NOT NULL REFERENCES users (user_id)` | The user id that triggered this song play log. It cannot be null, as we don't have song play logs that aren't triggered by a user. |
42 | | `level` | `VARCHAR` | The level of the user that triggered this song play log |
43 | | `song_id` | `VARCHAR REFERENCES songs (song_id)` | The identification of the song that was played. It can be null. |
44 | | `artist_id` | `VARCHAR REFERENCES artists (artist_id)` | The identification of the artist of the song that was played. |
45 | | `session_id` | `INTEGER NOT NULL` | The session_id of the user on the app |
46 | | `location` | `VARCHAR` | The location where this song play log was triggered |
47 | | `user_agent` | `VARCHAR` | The user agent the user accessed our app with |
48 |
49 | ### Users table
50 |
51 | - *Name:* `users`
52 | - *Type:* Dimension table
53 |
54 | | Column | Type | Description |
55 | | ------ | ---- | ----------- |
56 | | `user_id` | `INTEGER PRIMARY KEY` | The main identification of a user |
57 | | `first_name` | `VARCHAR NOT NULL` | First name of the user. It cannot be null, as it is the basic information we have about the user |
58 | | `last_name` | `VARCHAR NOT NULL` | Last name of the user. |
59 | | `gender` | `CHAR(1)` | The gender is stated with just one character `M` (male) or `F` (female). Otherwise it can be stated as `NULL` |
60 | | `level` | `VARCHAR NOT NULL` | The level stands for the user app plans (`premium` or `free`) |
61 |
62 |
63 | ### Songs table
64 |
65 | - *Name:* `songs`
66 | - *Type:* Dimension table
67 |
68 | | Column | Type | Description |
69 | | ------ | ---- | ----------- |
70 | | `song_id` | `VARCHAR PRIMARY KEY` | The main identification of a song |
71 | | `title` | `VARCHAR NOT NULL` | The title of the song. It can not be null, as it is the basic information we have about a song. |
72 | | `artist_id` | `VARCHAR NOT NULL REFERENCES artists (artist_id)` | The artist id, it can not be null as we don't have songs without an artist, and this field also references the artists table. |
73 | | `year` | `INTEGER NOT NULL` | The year that this song was made |
74 | | `duration` | `NUMERIC (15, 5) NOT NULL` | The duration of the song |
75 |
76 |
77 | ### Artists table
78 |
79 | - *Name:* `artists`
80 | - *Type:* Dimension table
81 |
82 | | Column | Type | Description |
83 | | ------ | ---- | ----------- |
84 | | `artist_id` | `VARCHAR PRIMARY KEY` | The main identification of an artist |
85 | | `name` | `VARCHAR NOT NULL` | The name of the artist |
86 | | `location` | `VARCHAR` | The location the artist is from |
87 | | `latitude` | `NUMERIC` | The latitude of the location the artist is from |
88 | | `longitude` | `NUMERIC` | The longitude of the location the artist is from |
89 |
90 | ### Time table
91 |
92 | - *Name:* `time`
93 | - *Type:* Dimension table
94 |
95 | | Column | Type | Description |
96 | | ------ | ---- | ----------- |
97 | | `start_time` | `TIMESTAMP NOT NULL PRIMARY KEY` | The timestamp itself, serves as the main identification of this table |
98 | | `hour` | `NUMERIC NOT NULL` | The hour from the timestamp |
99 | | `day` | `NUMERIC NOT NULL` | The day of the month from the timestamp |
100 | | `week` | `NUMERIC NOT NULL` | The week of the year from the timestamp |
101 | | `month` | `NUMERIC NOT NULL` | The month of the year from the timestamp |
102 | | `year` | `NUMERIC NOT NULL` | The year from the timestamp |
103 | | `weekday` | `NUMERIC NOT NULL` | The day of the week from the timestamp |
104 |
105 | ## The project file structure
106 |
107 | We have a small list of files, easy to maintain and understand:
108 | - `sql_queries.py` - Where it all begins, this file is meant to be a query repository used throughout the ETL process
109 | - `create_tables.py` - The file responsible for creating the schema structure in the PostgreSQL database
110 | - `etl.py` - The file responsible for the main ETL process
111 | - `etl.ipynb` - The python notebook that was written to develop the logic behind the `etl.py` process
112 | - `test.ipynb` - And finally, this notebook was used to verify whether the ETL process was successful
--------------------------------------------------------------------------------
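
To illustrate how the star schema documented in the README above can be queried, here is a minimal sketch of an analytical query run with psycopg2. It assumes the local `sparkifydb` created by `create_tables.py` and populated by `etl.py`; the query itself is only an illustration and is not part of the project files.

```python
import psycopg2

# Example: the 5 most played songs, from the songplays fact table
# joined against the songs and artists dimension tables.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    SELECT songs.title, artists.name, COUNT(*) AS plays
    FROM songplays
    JOIN songs   ON songplays.song_id   = songs.song_id
    JOIN artists ON songplays.artist_id = artists.artist_id
    GROUP BY songs.title, artists.name
    ORDER BY plays DESC
    LIMIT 5
""")
for title, artist, plays in cur.fetchall():
    print(title, artist, plays)

cur.close()
conn.close()
```

Because `songplays` keeps foreign keys into every dimension table, most analytical questions reduce to a join plus a `GROUP BY` like this one.
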
/1-data-modeling/L3-Project_Data_Modeling_with_Postgres/create_tables.py:
--------------------------------------------------------------------------------
1 | import psycopg2
2 | from sql_queries import create_table_queries, drop_table_queries
3 |
4 |
5 | def create_database():
6 | # connect to default database
7 | conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
8 | conn.set_session(autocommit=True)
9 | cur = conn.cursor()
10 |
11 | # create sparkify database with UTF8 encoding
12 | cur.execute("DROP DATABASE IF EXISTS sparkifydb")
13 | cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
14 |
15 | # close connection to default database
16 | conn.close()
17 |
18 | # connect to sparkify database
19 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
20 | cur = conn.cursor()
21 |
22 | return cur, conn
23 |
24 |
25 | def drop_tables(cur, conn):
26 | for query in drop_table_queries:
27 | cur.execute(query)
28 | conn.commit()
29 |
30 |
31 | def create_tables(cur, conn):
32 | for query in create_table_queries:
33 | cur.execute(query)
34 | conn.commit()
35 |
36 |
37 | def main():
38 | cur, conn = create_database()
39 |
40 | drop_tables(cur, conn)
41 | create_tables(cur, conn)
42 |
43 | conn.close()
44 |
45 |
46 | if __name__ == "__main__":
47 | main()
--------------------------------------------------------------------------------
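
A quick way to confirm that `create_tables.py` did its job is to list the tables it created. This is a hedged sketch (the repository's own `test.ipynb` plays a similar role); it assumes the local `sparkifydb` is reachable with the same credentials used above.

```python
import psycopg2

# Connect to the freshly created sparkifydb and list its public tables.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
cur.execute("""
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'public'
    ORDER BY table_name
""")
print([row[0] for row in cur.fetchall()])
# expected: ['artists', 'songplays', 'songs', 'time', 'users']
cur.close()
conn.close()
```
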
/1-data-modeling/L3-Project_Data_Modeling_with_Postgres/etl.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import psycopg2
4 | import pandas as pd
5 | import numpy as np
6 | from sql_queries import *
7 |
8 |
9 | def insert_from_dataframe(cur, df, insert_query):
10 | """
11 | Insert a pandas dataframe with a given insert_query
12 | :param cur: The cursor object
13 | :param df: The pandas dataframe
14 | :param insert_query: The insert query
15 | :return: None
16 | """
17 | for i, row in df.iterrows():
18 | cur.execute(insert_query, list(row))
19 |
20 |
21 | def process_song_file(cur, filepath):
22 | """
 23 |     Process a song data file
 24 |     :param cur: the cursor object
 25 |     :param filepath: song data file path
26 | :return: None
27 | """
28 | # open song file
29 | df = pd.read_json(filepath, lines=True)
30 |
31 | # insert artist record
32 | artist_data = df[['artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']]
33 | artist_data = artist_data.drop_duplicates()
34 | artist_data = artist_data.replace(np.nan, None, regex=True)
35 |
36 | insert_from_dataframe(cur, artist_data, artist_table_insert)
37 |
38 | # insert song record
39 | song_data = df[['song_id','title', 'artist_id', 'year', 'duration']]
40 | song_data = song_data.drop_duplicates()
41 | song_data = song_data.replace(np.nan, None, regex=True)
42 |
43 | insert_from_dataframe(cur, song_data, song_table_insert)
44 |
45 |
46 | def process_log_file(cur, filepath):
47 | """
 48 |     Process a songplay log file
49 | :param cur: The cursor object
50 | :param filepath: The path to the log file
51 | :return:None
52 | """
53 | # open log file
54 | df = pd.read_json(filepath, lines=True)
55 |
56 | # filter by NextSong action
57 | df = df[df['page'] == 'NextSong']
58 |
 59 |     # Parsing the ts column as datetime into a pandas Series, then creating the DataFrame
60 | tf = pd.DataFrame({
61 | 'start_time': pd.to_datetime(df['ts'], unit='ms')
62 | })
63 |
64 | # Creating new columns
65 | tf['hour'] = tf['start_time'].dt.hour
66 | tf['day'] = tf['start_time'].dt.day
67 | tf['week'] = tf['start_time'].dt.week
68 | tf['month'] = tf['start_time'].dt.month
69 | tf['year'] = tf['start_time'].dt.year
70 | tf['weekday'] = tf['start_time'].dt.weekday
71 |
72 | tf = tf.drop_duplicates()
73 |
74 | # insert time data records
75 | insert_from_dataframe(cur, tf, time_table_insert)
76 |
77 | # load user table
78 | user_df = df[['userId', 'firstName', 'lastName', 'gender', 'level']]
79 | user_df = user_df.drop_duplicates()
80 | user_df = user_df.replace(np.nan, None, regex=True)
81 | user_df.columns = ['user_id', 'first_name', 'last_name', 'gender', 'level']
82 |
83 | # insert user records
84 | insert_from_dataframe(cur, user_df, user_table_insert)
85 |
86 | # insert songplay records
87 | for index, row in df.iterrows():
88 |
89 | # get songid and artistid from song and artist tables
90 | cur.execute(song_select, (row.song, row.artist, row.length))
91 | results = cur.fetchone()
92 |
93 | if results:
94 | songid, artistid = results
95 | else:
96 | songid, artistid = None, None
97 |
98 | # insert songplay record
99 | songplay_data = (
100 | index, pd.to_datetime(row.ts, unit='ms'),
101 | row.userId, row.level, songid, artistid,
102 | row.sessionId, row.location, row.userAgent
103 | )
104 | cur.execute(songplay_table_insert, songplay_data)
105 |
106 |
107 | def process_data(cur, conn, filepath, func):
108 | """
109 | Process all the data executing the given func for every *.json file of the given filepath
110 |     :param cur: The cursor object
111 |     :param conn: The connection to the PostgreSQL database
112 |     :param filepath: The logs folder path
113 |     :param func: The function to process one log file at a time
114 | :return:None
115 | """
116 | # get all files matching extension from directory
117 | all_files = []
118 | for root, dirs, files in os.walk(filepath):
119 | files = glob.glob(os.path.join(root, '*.json'))
120 | for f in files:
121 | all_files.append(os.path.abspath(f))
122 |
123 | # get total number of files found
124 | num_files = len(all_files)
125 | print('{} files found in {}'.format(num_files, filepath))
126 |
127 | # iterate over files and process
128 | for i, datafile in enumerate(all_files, 1):
129 | func(cur, datafile)
130 | conn.commit()
131 | print('{}/{} files processed.'.format(i, num_files))
132 |
133 |
134 | def main():
135 | """
136 | The main function
137 | :return:None
138 | """
139 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
140 | cur = conn.cursor()
141 |
142 | process_data(cur, conn, filepath='data/song_data', func=process_song_file)
143 | process_data(cur, conn, filepath='data/log_data', func=process_log_file)
144 |
145 | conn.close()
146 |
147 |
148 | if __name__ == "__main__":
149 | main()
--------------------------------------------------------------------------------
/1-data-modeling/L3-Project_Data_Modeling_with_Postgres/sql_queries.py:
--------------------------------------------------------------------------------
1 | # DROP TABLES
2 |
3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays"
4 | user_table_drop = "DROP TABLE IF EXISTS users"
5 | song_table_drop = "DROP TABLE IF EXISTS songs"
6 | artist_table_drop = "DROP TABLE IF EXISTS artists"
7 | time_table_drop = "DROP TABLE IF EXISTS time"
8 |
9 | # CREATE TABLES
10 |
11 | songplay_table_create = ("""
12 | CREATE TABLE IF NOT EXISTS songplays (
 13 | songplay_id INTEGER,
14 | start_time TIMESTAMP NOT NULL,
15 | user_id INTEGER NOT NULL REFERENCES users (user_id),
16 | level VARCHAR,
17 | song_id VARCHAR REFERENCES songs (song_id),
18 | artist_id VARCHAR REFERENCES artists (artist_id),
19 | session_id INTEGER NOT NULL,
20 | location VARCHAR,
21 | user_agent VARCHAR
22 | )
23 | """)
24 |
25 | user_table_create = ("""
26 | CREATE TABLE IF NOT EXISTS users (
27 | user_id INTEGER PRIMARY KEY,
28 | first_name VARCHAR NOT NULL,
29 | last_name VARCHAR NOT NULL,
30 | gender CHAR(1),
31 | level VARCHAR NOT NULL
32 | )
33 | """)
34 |
35 | song_table_create = ("""
36 | CREATE TABLE IF NOT EXISTS songs (
37 | song_id VARCHAR PRIMARY KEY,
38 | title VARCHAR NOT NULL,
39 | artist_id VARCHAR NOT NULL REFERENCES artists (artist_id),
40 | year INTEGER NOT NULL,
41 | duration NUMERIC (15, 5) NOT NULL
42 | )
43 | """)
44 |
45 | artist_table_create = ("""
46 | CREATE TABLE IF NOT EXISTS artists (
47 | artist_id VARCHAR PRIMARY KEY,
48 | name VARCHAR NOT NULL,
49 | location VARCHAR,
50 | latitude NUMERIC,
51 | longitude NUMERIC
52 | )
53 | """)
54 |
55 | time_table_create = ("""
56 | CREATE TABLE IF NOT EXISTS time (
57 | start_time TIMESTAMP NOT NULL PRIMARY KEY,
58 | hour NUMERIC NOT NULL,
59 | day NUMERIC NOT NULL,
60 | week NUMERIC NOT NULL,
61 | month NUMERIC NOT NULL,
62 | year NUMERIC NOT NULL,
63 | weekday NUMERIC NOT NULL
64 | )
65 | """)
66 |
67 | # INSERT RECORDS
68 |
69 | songplay_table_insert = ("""
70 | INSERT INTO songplays (
71 | songplay_id,
72 | start_time,
73 | user_id,
74 | level,
75 | song_id,
76 | artist_id,
77 | session_id,
78 | location,
79 | user_agent
80 | )
81 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
82 | """)
83 |
84 | user_table_insert = ("""
85 |
86 | INSERT INTO users (
87 | user_id,
88 | first_name,
89 | last_name,
90 | gender,
91 | level
92 | )
93 | VALUES (%s, %s, %s, %s, %s)
94 | ON CONFLICT (user_id)
95 | DO UPDATE
96 | SET level = EXCLUDED.level
97 | """)
98 |
99 | song_table_insert = ("""
100 | INSERT INTO songs (
101 | song_id,
102 | title,
103 | artist_id,
104 | year,
105 | duration
106 | )
107 | VALUES (%s, %s, %s, %s, %s)
108 | ON CONFLICT (song_id)
109 | DO NOTHING
110 | """)
111 |
112 | artist_table_insert = ("""
113 | INSERT INTO artists (
114 | artist_id,
115 | name,
116 | location,
117 | latitude,
118 | longitude
119 | )
120 | VALUES (%s, %s, %s, %s, %s)
121 | ON CONFLICT (artist_id)
122 | DO NOTHING
123 | """)
124 |
125 |
126 | time_table_insert = ("""
127 | INSERT INTO time (
128 | start_time,
129 | hour,
130 | day,
131 | week,
132 | month,
133 | year,
134 | weekday
135 | )
136 | VALUES (%s, %s, %s, %s, %s, %s, %s)
137 | ON CONFLICT (start_time)
138 | DO NOTHING
139 | """)
140 |
141 | # FIND SONGS
142 |
143 | song_select = ("""
144 | SELECT
145 | songs.song_id AS song_id,
146 | songs.artist_id AS artist_id
147 | FROM
148 | songs
149 | JOIN artists ON (songs.artist_id = artists.artist_id)
150 | WHERE
151 | songs.title = %s AND
152 | artists.name = %s AND
153 | songs.duration = %s
154 | """)
155 |
156 | # QUERY LISTS
157 |
158 | create_table_queries = [time_table_create, user_table_create, artist_table_create, song_table_create, songplay_table_create]
159 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
--------------------------------------------------------------------------------
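
One detail in the queries above is easy to miss: `user_table_insert` ends with `ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level`, so re-inserting a known user only refreshes their `level` (for example when a free user upgrades). A minimal sketch of that upsert behaviour, using a hypothetical user id and assuming the local `sparkifydb` is running:

```python
import psycopg2
from sql_queries import user_table_insert

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Insert the same (hypothetical) user twice with different levels:
# the second execute hits the ON CONFLICT clause and only updates `level`.
cur.execute(user_table_insert, (999, "Jane", "Doe", "F", "free"))
cur.execute(user_table_insert, (999, "Jane", "Doe", "F", "premium"))

cur.execute("SELECT user_id, level FROM users WHERE user_id = 999")
print(cur.fetchone())  # -> (999, 'premium')

cur.close()
conn.close()
```
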
/1-data-modeling/L4-demo-3-clustering-column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 3 Demo 3: Focus on Clustering Columns\n"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### In this demo we are going to walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple SQL query to validate the information."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n",
22 | "! pip install cassandra-driver\n",
23 | "#### More documentation can be found here: https://datastax.github.io/python-driver/"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "#### Import Apache Cassandra python package"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "import cassandra"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "### First let's create a connection to the database"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "from cassandra.cluster import Cluster\n",
56 | "try: \n",
57 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
58 | " session = cluster.connect()\n",
59 | "except Exception as e:\n",
60 | " print(e)"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "### Let's create a keyspace to do our work in "
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 3,
73 | "metadata": {},
74 | "outputs": [],
75 | "source": [
76 | "try:\n",
77 | " session.execute(\"\"\"\n",
78 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
79 | " WITH REPLICATION = \n",
80 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
81 | ")\n",
82 | "\n",
83 | "except Exception as e:\n",
84 | " print(e)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. "
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 4,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "try:\n",
101 | " session.set_keyspace('udacity')\n",
102 | "except Exception as e:\n",
103 | " print(e)"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "### Let's imagine we would like to start creating a new Music Library of albums. \n",
111 | "\n",
112 | "### We want to ask 1 question of our data\n",
113 | "#### 1. Give me every album in my music library that was released by an Artist with Albumn Name in `DESC` Order and City In `DESC` Order\n",
114 | "`select * from music_library WHERE ARTIST_NAME=\"The Beatles\"`\n"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### Here is our Collection of Data\n",
122 | "\n",
123 | "Please refer to Table 4 in the video"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the `ARTIST_NAME` let's start with that. From there we will need to add other elements to make sure the Key is unique. We also need to add the `CITY` and `ALBUM_NAME` as Clustering Columns to sort the data. That should be enough to make the row key unique\n",
131 | "\n",
132 | "`Table Name: music_library\n",
133 | "column 1: Year\n",
134 | "column 2: Artist Name\n",
135 | "column 3: Album Name\n",
136 | "Column 4: City\n",
137 | "PRIMARY KEY(artist name, album name, city)`"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 5,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n",
147 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (artist_name, album_name, city))\"\n",
148 | "try:\n",
149 | " session.execute(query)\n",
150 | "except Exception as e:\n",
151 | " print(e)"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "### Let's insert our data into of table"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 6,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n",
168 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n",
169 | "\n",
170 | "try:\n",
171 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n",
172 | "except Exception as e:\n",
173 | " print(e)\n",
174 | " \n",
175 | "try:\n",
176 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n",
177 | "except Exception as e:\n",
178 | " print(e)\n",
179 | " \n",
180 | "try:\n",
181 | " session.execute(query, (1964, \"The Beatles\", \"Beatles For Sale\", \"London\"))\n",
182 | "except Exception as e:\n",
183 | " print(e)\n",
184 | "\n",
185 | "try:\n",
186 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n",
187 | "except Exception as e:\n",
188 | " print(e)\n",
189 | "\n",
190 | "try:\n",
191 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n",
192 | "except Exception as e:\n",
193 | " print(e)"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "### Let's Validate our Data Model -- Did it work?? If we look for Albums from The Beatles we should expect to see 3 rows.\n",
201 | "\n",
202 | "`select * from music_library WHERE ARTIST_NAME=\"The Beatles\"`"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 7,
208 | "metadata": {},
209 | "outputs": [
210 | {
211 | "name": "stdout",
212 | "output_type": "stream",
213 | "text": [
214 | "The Beatles Beatles For Sale London 1964\n",
215 | "The Beatles Let it Be Liverpool 1970\n",
216 | "The Beatles Rubber Soul Oxford 1965\n"
217 | ]
218 | }
219 | ],
220 | "source": [
221 | "query = \"select * from music_library WHERE ARTIST_NAME='The Beatles'\"\n",
222 | "try:\n",
223 | " rows = session.execute(query)\n",
224 | "except Exception as e:\n",
225 | " print(e)\n",
226 | " \n",
227 | "for row in rows:\n",
228 | " print (row.artist_name, row.album_name, row.city, row.year)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "### Success it worked! We created a unique Primary key that evenly distributed our data, with clustering columns that sorted our data. "
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "### For the sake of the demo, I will drop the table. "
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 8,
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "query = \"drop table music_library\"\n",
252 | "try:\n",
253 | " rows = session.execute(query)\n",
254 | "except Exception as e:\n",
255 | " print(e)\n"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "### And Finally close the session and cluster connection"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 9,
268 | "metadata": {},
269 | "outputs": [],
270 | "source": [
271 | "session.shutdown()\n",
272 | "cluster.shutdown()"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": []
281 | }
282 | ],
283 | "metadata": {
284 | "kernelspec": {
285 | "display_name": "Python 3",
286 | "language": "python",
287 | "name": "python3"
288 | },
289 | "language_info": {
290 | "codemirror_mode": {
291 | "name": "ipython",
292 | "version": 3
293 | },
294 | "file_extension": ".py",
295 | "mimetype": "text/x-python",
296 | "name": "python",
297 | "nbconvert_exporter": "python",
298 | "pygments_lexer": "ipython3",
299 | "version": "3.7.2"
300 | }
301 | },
302 | "nbformat": 4,
303 | "nbformat_minor": 2
304 | }
305 |
--------------------------------------------------------------------------------
/1-data-modeling/L4_Exercise_3_Clustering_Column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 3 Exercise 3: Focus on Clustering Columns\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information. \n",
16 | "\n",
17 | "### Remember, replace ##### with your own code.\n",
18 | "\n",
19 | "Note: __Do not__ click the blue Preview button in the lower task bar"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n",
27 | "! pip install cassandra-driver\n",
28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "#### Import Apache Cassandra python package"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 1,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "import cassandra"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "### Create a connection to the database"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "from cassandra.cluster import Cluster\n",
61 | "try: \n",
62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
63 | " session = cluster.connect()\n",
64 | "except Exception as e:\n",
65 | " print(e)"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "### Create a keyspace to work in "
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 3,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "try:\n",
82 | " session.execute(\"\"\"\n",
83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
84 | " WITH REPLICATION = \n",
85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
86 | ")\n",
87 | "\n",
88 | "except Exception as e:\n",
89 | " print(e)"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "#### Connect to the Keyspace. Compare this to how we had to create a new session in PostgreSQL. "
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 4,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "try:\n",
106 | " session.set_keyspace('udacity')\n",
107 | "except Exception as e:\n",
108 | " print(e)"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "### Imagine we would like to start creating a new Music Library of albums. \n",
116 | "\n",
117 | "### We want to ask 1 question of our data:\n",
118 | "### 1. Give me all the information from the music library about a given album\n",
119 | "`select * from album_library WHERE album_name=\"Close To You\"`"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "### Here is the data:\n",
127 | "
"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "### How should we model this data? What should be our Primary Key and Partition Key? "
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 5,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n",
144 | "query = query + \"(artist_name text, album_name text, city text, year int, PRIMARY KEY (album_name, artist_name))\"\n",
145 | "try:\n",
146 | " session.execute(query)\n",
147 | "except Exception as e:\n",
148 | " print(e)"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "### Insert data into the table"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 6,
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "## You can opt to change the sequence of columns to match your composite key. \\ \n",
165 | "## If you do, make sure to match the values in the INSERT statement\n",
166 | "\n",
167 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n",
168 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n",
169 | "\n",
170 | "try:\n",
171 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n",
172 | "except Exception as e:\n",
173 | " print(e)\n",
174 | " \n",
175 | "try:\n",
176 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n",
177 | "except Exception as e:\n",
178 | " print(e)\n",
179 | " \n",
180 | "try:\n",
181 | " session.execute(query, (1964, \"The Beatles\", \"Beatles For Sale\", \"London\"))\n",
182 | "except Exception as e:\n",
183 | " print(e)\n",
184 | "\n",
185 | "try:\n",
186 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n",
187 | "except Exception as e:\n",
188 | " print(e)\n",
189 | "\n",
190 | "try:\n",
191 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n",
192 | "except Exception as e:\n",
193 | " print(e)"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "### Validate the Data Model -- Did it work? \n",
201 | "`select * from album_library WHERE album_name=\"Close To You\"`"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 7,
207 | "metadata": {},
208 | "outputs": [
209 | {
210 | "name": "stdout",
211 | "output_type": "stream",
212 | "text": [
213 | "The Carpenters Close To You San Diego 1970\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "query = \"select * from music_library WHERE album_name='Close To You'\"\n",
219 | "try:\n",
220 | " rows = session.execute(query)\n",
221 | "except Exception as e:\n",
222 | " print(e)\n",
223 | " \n",
224 | "for row in rows:\n",
225 | " print (row.artist_name, row.album_name, row.city, row.year)"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "### Your output should be:\n",
233 | "('The Carpenters', 'Close to You', 'San Diego', 1970)\n",
234 | "\n",
235 | "### OR\n",
236 | "('The Carpenters', 'Close to You', 1970, 'San Diego') "
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "### Drop the table"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 8,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "query = \"drop table music_library\"\n",
253 | "try:\n",
254 | " rows = session.execute(query)\n",
255 | "except Exception as e:\n",
256 | " print(e)\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "### Close the session and cluster connection"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 9,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "session.shutdown()\n",
273 | "cluster.shutdown()"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": []
282 | }
283 | ],
284 | "metadata": {
285 | "kernelspec": {
286 | "display_name": "Python 3",
287 | "language": "python",
288 | "name": "python3"
289 | },
290 | "language_info": {
291 | "codemirror_mode": {
292 | "name": "ipython",
293 | "version": 3
294 | },
295 | "file_extension": ".py",
296 | "mimetype": "text/x-python",
297 | "name": "python",
298 | "nbconvert_exporter": "python",
299 | "pygments_lexer": "ipython3",
300 | "version": "3.6.3"
301 | }
302 | },
303 | "nbformat": 4,
304 | "nbformat_minor": 2
305 | }
306 |
--------------------------------------------------------------------------------
/1-data-modeling/L4_Exercise_4_Using_the_WHERE_Clause.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 3 Demo 4: Using the WHERE Clause\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.\n",
16 | "\n",
17 | "##### denotes where the code needs to be completed.\n",
18 | "\n",
19 | "Note: __Do not__ click the blue Preview button in the lower task bar"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n",
27 | "! pip install cassandra-driver\n",
28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "#### Import Apache Cassandra python package"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 1,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "import cassandra"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "### First let's create a connection to the database"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "from cassandra.cluster import Cluster\n",
61 | "try: \n",
62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
63 | " session = cluster.connect()\n",
64 | "except Exception as e:\n",
65 | " print(e)"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "### Let's create a keyspace to do our work in "
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 3,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "try:\n",
82 | " session.execute(\"\"\"\n",
83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
84 | " WITH REPLICATION = \n",
85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
86 | ")\n",
87 | "\n",
88 | "except Exception as e:\n",
89 | " print(e)"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. "
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 4,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "try:\n",
106 | " session.set_keyspace('udacity')\n",
107 | "except Exception as e:\n",
108 | " print(e)"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "### Let's imagine we would like to start creating a new Music Library of albums. \n",
116 | "### We want to ask 4 question of our data\n",
117 | "#### 1. Give me every album in my music library that was released in a 1965 year\n",
118 | "#### 2. Give me the album that is in my music library that was released in 1965 by \"The Beatles\"\n",
119 | "#### 3. Give me all the albums released in a given year that was made in London \n",
120 | "#### 4. Give me the city that the album \"Rubber Soul\" was recorded"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "### Here is our Collection of Data\n",
128 | "
"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the YEAR let's start with that. From there we will add clustering columns on Artist Name and Album Name."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 5,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n",
145 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))\"\n",
146 | "try:\n",
147 | " session.execute(query)\n",
148 | "except Exception as e:\n",
149 | " print(e)"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "### Let's insert our data into of table"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 6,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n",
166 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n",
167 | "\n",
168 | "try:\n",
169 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n",
170 | "except Exception as e:\n",
171 | " print(e)\n",
172 | " \n",
173 | "try:\n",
174 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n",
175 | "except Exception as e:\n",
176 | " print(e)\n",
177 | " \n",
178 | "try:\n",
179 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n",
180 | "except Exception as e:\n",
181 | " print(e)\n",
182 | "\n",
183 | "try:\n",
184 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n",
185 | "except Exception as e:\n",
186 | " print(e)\n",
187 | "\n",
188 | "try:\n",
189 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n",
190 | "except Exception as e:\n",
191 | " print(e)"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "### Let's Validate our Data Model with our 4 queries.\n",
199 | "\n",
200 | "Query 1: "
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": 7,
206 | "metadata": {},
207 | "outputs": [
208 | {
209 | "name": "stdout",
210 | "output_type": "stream",
211 | "text": [
212 | "1970 The Beatles Let it Be Liverpool\n"
213 | ]
214 | }
215 | ],
216 | "source": [
217 | "query = \"select * from music_library WHERE YEAR=1970 AND ARTIST_NAME = 'The Beatles'\"\n",
218 | "try:\n",
219 | " rows = session.execute(query)\n",
220 | "except Exception as e:\n",
221 | " print(e)\n",
222 | " \n",
223 | "for row in rows:\n",
224 | " print (row.year, row.artist_name, row.album_name, row.city)"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | " Let's try the 2nd query.\n",
232 | " Query 2: "
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 8,
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "name": "stdout",
242 | "output_type": "stream",
243 | "text": [
244 | "1970 The Beatles Let it Be Liverpool\n"
245 | ]
246 | }
247 | ],
248 | "source": [
249 | "query = \"select * from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles' AND ALBUM_NAME='Let it Be'\"\n",
250 | "\n",
251 | "try:\n",
252 | " rows = session.execute(query)\n",
253 | "except Exception as e:\n",
254 | " print(e)\n",
255 | " \n",
256 | "for row in rows:\n",
257 | " print (row.year, row.artist_name, row.album_name, row.city)"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "### Let's try the 3rd query.\n",
265 | "Query 3: "
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 9,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "name": "stdout",
275 | "output_type": "stream",
276 | "text": [
277 | "Error from server: code=2200 [Invalid query] message=\"Undefined column name location\"\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "query = \"select * from music_library WHERE YEAR = 1970 AND LOCATION = 'Liverpool'\"\n",
283 | "try:\n",
284 | " rows = session.execute(query)\n",
285 | "except Exception as e:\n",
286 | " print(e)\n",
287 | " \n",
288 | "for row in rows:\n",
289 | " print (row.year, row.artist_name, row.album_name, row.city)"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "### Did you get an error? You can not try to access a column or a clustering column if you have not used the other defined clustering column. Let's see if we can try it a different way. \n",
297 | "Try Query 4: \n",
298 | "\n"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 10,
304 | "metadata": {},
305 | "outputs": [],
306 | "source": [
307 | "query = \"select * from music_library WHERE YEAR = 1970 AND ARTIST_NAME='The Who'\"\n",
308 | "try:\n",
309 | " rows = session.execute(query)\n",
310 | "except Exception as e:\n",
311 | " print(e)\n",
312 | " \n",
313 | "for row in rows:\n",
314 | " print (row.city)"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "### And Finally close the session and cluster connection"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 11,
327 | "metadata": {},
328 | "outputs": [],
329 | "source": [
330 | "session.shutdown()\n",
331 | "cluster.shutdown()"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": []
340 | }
341 | ],
342 | "metadata": {
343 | "kernelspec": {
344 | "display_name": "Python 3",
345 | "language": "python",
346 | "name": "python3"
347 | },
348 | "language_info": {
349 | "codemirror_mode": {
350 | "name": "ipython",
351 | "version": 3
352 | },
353 | "file_extension": ".py",
354 | "mimetype": "text/x-python",
355 | "name": "python",
356 | "nbconvert_exporter": "python",
357 | "pygments_lexer": "ipython3",
358 | "version": "3.6.3"
359 | }
360 | },
361 | "nbformat": 4,
362 | "nbformat_minor": 2
363 | }
364 |
--------------------------------------------------------------------------------
/1-data-modeling/L5-Project_Data_Modeling_with_Apache_Cassandra/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | event_datafile_new.csv
3 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L1_E1_-_Step_6.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# STEP 6: Repeat the computation from the facts & dimension table\n",
8 | "\n",
9 | "Note: You will not have to write any code in this notebook. It's purely to illustrate the performance difference between Star and 3NF schemas.\n",
10 | "\n",
11 | "Start by running the code in the cell below to connect to the database."
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": null,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "!PGPASSWORD=student createdb -h 127.0.0.1 -U student pagila_star\n",
21 | "!PGPASSWORD=student psql -q -h 127.0.0.1 -U student -d pagila_star -f Data/pagila-data.sql"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 2,
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "name": "stdout",
31 | "output_type": "stream",
32 | "text": [
33 | "postgresql://student:student@127.0.0.1:5432/pagila_star\n"
34 | ]
35 | },
36 | {
37 | "data": {
38 | "text/plain": [
39 | "'Connected: student@pagila_star'"
40 | ]
41 | },
42 | "execution_count": 2,
43 | "metadata": {},
44 | "output_type": "execute_result"
45 | }
46 | ],
47 | "source": [
48 | "%load_ext sql\n",
49 | "\n",
50 | "DB_ENDPOINT = \"127.0.0.1\"\n",
51 | "DB = 'pagila_star'\n",
52 | "DB_USER = 'student'\n",
53 | "DB_PASSWORD = 'student'\n",
54 | "DB_PORT = '5432'\n",
55 | "\n",
56 | "# postgresql://username:password@host:port/database\n",
57 | "conn_string = \"postgresql://{}:{}@{}:{}/{}\" \\\n",
58 | " .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)\n",
59 | "\n",
60 | "print(conn_string)\n",
61 | "%sql $conn_string"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## 6.1 Facts Table has all the needed dimensions, no need for deep joins"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {},
75 | "outputs": [
76 | {
77 | "name": "stdout",
78 | "output_type": "stream",
79 | "text": [
80 | " * postgresql://student:***@127.0.0.1:5432/pagila_star\n",
81 | "(psycopg2.ProgrammingError) relation \"factsales\" does not exist\n",
82 | "LINE 2: FROM factSales \n",
83 | " ^\n",
84 | " [SQL: 'SELECT movie_key, date_key, customer_key, sales_amount\\nFROM factSales \\nlimit 5;']\n",
85 | "CPU times: user 2.57 ms, sys: 397 µs, total: 2.97 ms\n",
86 | "Wall time: 4.63 ms\n"
87 | ]
88 | }
89 | ],
90 | "source": [
91 | "%%time\n",
92 | "%%sql\n",
93 | "SELECT movie_key, date_key, customer_key, sales_amount\n",
94 | "FROM factSales \n",
95 | "limit 5;\n"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "## 6.2 Join fact table with dimensions to replace keys with attributes\n",
103 | "\n",
104 | "As you run each cell, pay attention to the time that is printed. Which schema do you think will run faster?\n",
105 | "\n",
106 | "##### Star Schema"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 4,
112 | "metadata": {},
113 | "outputs": [
114 | {
115 | "name": "stdout",
116 | "output_type": "stream",
117 | "text": [
118 | " * postgresql://student:***@127.0.0.1:5432/pagila_star\n",
119 | "(psycopg2.ProgrammingError) relation \"factsales\" does not exist\n",
120 | "LINE 2: FROM factSales \n",
121 | " ^\n",
122 | " [SQL: 'SELECT dimMovie.title, dimDate.month, dimCustomer.city, sum(sales_amount) as revenue\\nFROM factSales \\nJOIN dimMovie on (dimMovie.movie_key = factSales.movie_key)\\nJOIN dimDate on (dimDate.date_key = factSales.date_key)\\nJOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)\\ngroup by (dimMovie.title, dimDate.month, dimCustomer.city)\\norder by dimMovie.title, dimDate.month, dimCustomer.city, revenue desc;']\n",
123 | "CPU times: user 4.97 ms, sys: 54 µs, total: 5.02 ms\n",
124 | "Wall time: 6.58 ms\n"
125 | ]
126 | }
127 | ],
128 | "source": [
129 | "%%time\n",
130 | "%%sql\n",
131 | "SELECT dimMovie.title, dimDate.month, dimCustomer.city, sum(sales_amount) as revenue\n",
132 | "FROM factSales \n",
133 | "JOIN dimMovie on (dimMovie.movie_key = factSales.movie_key)\n",
134 | "JOIN dimDate on (dimDate.date_key = factSales.date_key)\n",
135 | "JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)\n",
136 | "group by (dimMovie.title, dimDate.month, dimCustomer.city)\n",
137 | "order by dimMovie.title, dimDate.month, dimCustomer.city, revenue desc;"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "##### 3NF Schema"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 5,
150 | "metadata": {},
151 | "outputs": [
152 | {
153 | "name": "stdout",
154 | "output_type": "stream",
155 | "text": [
156 | " * postgresql://student:***@127.0.0.1:5432/pagila_star\n",
157 | "(psycopg2.ProgrammingError) relation \"payment\" does not exist\n",
158 | "LINE 2: FROM payment p\n",
159 | " ^\n",
160 | " [SQL: 'SELECT f.title, EXTRACT(month FROM p.payment_date) as month, ci.city, sum(p.amount) as revenue\\nFROM payment p\\nJOIN rental r ON ( p.rental_id = r.rental_id )\\nJOIN inventory i ON ( r.inventory_id = i.inventory_id )\\nJOIN film f ON ( i.film_id = f.film_id)\\nJOIN customer c ON ( p.customer_id = c.customer_id )\\nJOIN address a ON ( c.address_id = a.address_id )\\nJOIN city ci ON ( a.city_id = ci.city_id )\\ngroup by (f.title, month, ci.city)\\norder by f.title, month, ci.city, revenue desc;']\n",
161 | "CPU times: user 1.95 ms, sys: 4.39 ms, total: 6.34 ms\n",
162 | "Wall time: 7.86 ms\n"
163 | ]
164 | }
165 | ],
166 | "source": [
167 | "%%time\n",
168 | "%%sql\n",
169 | "SELECT f.title, EXTRACT(month FROM p.payment_date) as month, ci.city, sum(p.amount) as revenue\n",
170 | "FROM payment p\n",
171 | "JOIN rental r ON ( p.rental_id = r.rental_id )\n",
172 | "JOIN inventory i ON ( r.inventory_id = i.inventory_id )\n",
173 | "JOIN film f ON ( i.film_id = f.film_id)\n",
174 | "JOIN customer c ON ( p.customer_id = c.customer_id )\n",
175 | "JOIN address a ON ( c.address_id = a.address_id )\n",
176 | "JOIN city ci ON ( a.city_id = ci.city_id )\n",
177 | "group by (f.title, month, ci.city)\n",
178 | "order by f.title, month, ci.city, revenue desc;"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "# Conclusion"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "We were able to show that:\n",
193 | "* The star schema is easier to understand and write queries against.\n",
194 | "* Queries with a star schema are more performant."
195 | ]
196 | }
197 | ],
198 | "metadata": {
199 | "kernelspec": {
200 | "display_name": "Python 3",
201 | "language": "python",
202 | "name": "python3"
203 | },
204 | "language_info": {
205 | "codemirror_mode": {
206 | "name": "ipython",
207 | "version": 3
208 | },
209 | "file_extension": ".py",
210 | "mimetype": "text/x-python",
211 | "name": "python",
212 | "nbconvert_exporter": "python",
213 | "pygments_lexer": "ipython3",
214 | "version": "3.6.3"
215 | },
216 | "toc": {
217 | "base_numbering": 1,
218 | "nav_menu": {},
219 | "number_sections": true,
220 | "sideBar": true,
221 | "skip_h1_title": false,
222 | "title_cell": "Table of Contents",
223 | "title_sidebar": "Contents",
224 | "toc_cell": false,
225 | "toc_position": {},
226 | "toc_section_display": true,
227 | "toc_window_display": false
228 | }
229 | },
230 | "nbformat": 4,
231 | "nbformat_minor": 2
232 | }
233 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L3_Exercise_3_-_Parallel_ETL.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exercise 3: Parallel ETL"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "%load_ext sql"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 2,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "import boto3\n",
26 | "import configparser\n",
27 | "import matplotlib.pyplot as plt\n",
28 | "import pandas as pd\n",
29 | "from time import time"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "# STEP 1: Get the params of the created redshift cluster \n",
37 | "- We need:\n",
38 | " - The redshift cluster endpoint\n",
39 | " - The IAM role ARN that give access to Redshift to read from S3"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 3,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "config = configparser.ConfigParser()\n",
49 | "config.read_file(open('dwh.cfg'))\n",
50 | "KEY=config.get('AWS','key')\n",
51 | "SECRET= config.get('AWS','secret')\n",
52 | "\n",
53 | "DWH_DB= config.get(\"DWH\",\"DWH_DB\")\n",
54 | "DWH_DB_USER= config.get(\"DWH\",\"DWH_DB_USER\")\n",
55 | "DWH_DB_PASSWORD= config.get(\"DWH\",\"DWH_DB_PASSWORD\")\n",
56 | "DWH_PORT = config.get(\"DWH\",\"DWH_PORT\")"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 4,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "# FILL IN THE REDSHIFT ENPOINT HERE\n",
66 | "# e.g. DWH_ENDPOINT=\"redshift-cluster-1.csmamz5zxmle.us-west-2.redshift.amazonaws.com\" \n",
67 | "DWH_ENDPOINT=\"dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com\" \n",
68 | " \n",
69 | "#FILL IN THE IAM ROLE ARN you got in step 2.2 of the previous exercise\n",
70 | "#e.g DWH_ROLE_ARN=\"arn:aws:iam::988332130976:role/dwhRole\"\n",
71 | "DWH_ROLE_ARN=\"arn:aws:iam::430737919253:role/dwhRole\""
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "# STEP 2: Connect to the Redshift Cluster"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 5,
84 | "metadata": {},
85 | "outputs": [
86 | {
87 | "name": "stdout",
88 | "output_type": "stream",
89 | "text": [
90 | "postgresql://dwhuser:Passw0rd@dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com:5439/dwh\n"
91 | ]
92 | },
93 | {
94 | "data": {
95 | "text/plain": [
96 | "'Connected: dwhuser@dwh'"
97 | ]
98 | },
99 | "execution_count": 5,
100 | "metadata": {},
101 | "output_type": "execute_result"
102 | }
103 | ],
104 | "source": [
105 | "conn_string=\"postgresql://{}:{}@{}:{}/{}\".format(DWH_DB_USER, DWH_DB_PASSWORD, DWH_ENDPOINT, DWH_PORT,DWH_DB)\n",
106 | "print(conn_string)\n",
107 | "%sql $conn_string"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 6,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "s3 = boto3.resource('s3',\n",
117 | " region_name=\"us-west-2\",\n",
118 | " aws_access_key_id=KEY,\n",
119 | " aws_secret_access_key=SECRET\n",
120 | " )\n",
121 | "\n",
122 | "sampleDbBucket = s3.Bucket('udacity-labs')"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 7,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "name": "stdout",
132 | "output_type": "stream",
133 | "text": [
134 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/')\n",
135 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/')\n",
136 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/full.csv.gz')\n",
137 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/')\n",
138 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00000-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
139 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00001-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
140 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00002-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
141 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00003-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
142 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00004-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
143 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00005-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
144 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00006-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
145 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00007-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
146 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00008-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n",
147 | "s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00009-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')\n"
148 | ]
149 | }
150 | ],
151 | "source": [
152 | "for obj in sampleDbBucket.objects.filter(Prefix=\"tickets\"):\n",
153 | " print(obj)"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "# STEP 3: Create Tables"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 8,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "name": "stdout",
170 | "output_type": "stream",
171 | "text": [
172 | " * postgresql://dwhuser:***@dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com:5439/dwh\n",
173 | "Done.\n",
174 | "Done.\n"
175 | ]
176 | },
177 | {
178 | "data": {
179 | "text/plain": [
180 | "[]"
181 | ]
182 | },
183 | "execution_count": 8,
184 | "metadata": {},
185 | "output_type": "execute_result"
186 | }
187 | ],
188 | "source": [
189 | "%%sql \n",
190 | "DROP TABLE IF EXISTS \"sporting_event_ticket\";\n",
191 | "CREATE TABLE \"sporting_event_ticket\" (\n",
192 | " \"id\" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,\n",
193 | " \"sporting_event_id\" double precision NOT NULL,\n",
194 | " \"sport_location_id\" double precision NOT NULL,\n",
195 | " \"seat_level\" numeric(1,0) NOT NULL,\n",
196 | " \"seat_section\" character varying(15) NOT NULL,\n",
197 | " \"seat_row\" character varying(10) NOT NULL,\n",
198 | " \"seat\" character varying(10) NOT NULL,\n",
199 | " \"ticketholder_id\" double precision,\n",
200 | " \"ticket_price\" numeric(8,2) NOT NULL\n",
201 | ");"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "# STEP 4: Load Partitioned data into the cluster\n",
209 | "Use the COPY command to load data from `s3://udacity-labs/tickets/split/part` using your iam role credentials. Use gzip delimiter `;`."
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 9,
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | " * postgresql://dwhuser:***@dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com:5439/dwh\n",
222 | "Done.\n",
223 | "CPU times: user 1.56 ms, sys: 3.58 ms, total: 5.14 ms\n",
224 | "Wall time: 30 s\n"
225 | ]
226 | }
227 | ],
228 | "source": [
229 | "%%time\n",
230 | "qry = \"\"\"\n",
231 | "\n",
232 | " copy sporting_event_ticket from 's3://udacity-labs/tickets/split/part'\n",
233 | " credentials 'aws_iam_role={}'\n",
234 | " gzip delimiter ';' compupdate off region 'us-west-2';\n",
235 | "\n",
236 | "\"\"\".format(DWH_ROLE_ARN)\n",
237 | "\n",
238 | "%sql $qry"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "# STEP 5: Create Tables for the non-partitioned data"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 10,
251 | "metadata": {},
252 | "outputs": [
253 | {
254 | "name": "stdout",
255 | "output_type": "stream",
256 | "text": [
257 | " * postgresql://dwhuser:***@dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com:5439/dwh\n",
258 | "Done.\n",
259 | "Done.\n"
260 | ]
261 | },
262 | {
263 | "data": {
264 | "text/plain": [
265 | "[]"
266 | ]
267 | },
268 | "execution_count": 10,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "%%sql\n",
275 | "DROP TABLE IF EXISTS \"sporting_event_ticket_full\";\n",
276 | "CREATE TABLE \"sporting_event_ticket_full\" (\n",
277 | " \"id\" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,\n",
278 | " \"sporting_event_id\" double precision NOT NULL,\n",
279 | " \"sport_location_id\" double precision NOT NULL,\n",
280 | " \"seat_level\" numeric(1,0) NOT NULL,\n",
281 | " \"seat_section\" character varying(15) NOT NULL,\n",
282 | " \"seat_row\" character varying(10) NOT NULL,\n",
283 | " \"seat\" character varying(10) NOT NULL,\n",
284 | " \"ticketholder_id\" double precision,\n",
285 | " \"ticket_price\" numeric(8,2) NOT NULL\n",
286 | ");"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "# STEP 6: Load non-partitioned data into the cluster\n",
294 | "Use the COPY command to load data from `s3://udacity-labs/tickets/full/full.csv.gz` using your iam role credentials. Use gzip delimiter `;`.\n",
295 | "\n",
296 | "- Note how it's slower than loading partitioned data"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 11,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "name": "stdout",
306 | "output_type": "stream",
307 | "text": [
308 | " * postgresql://dwhuser:***@dwhcluster.con66cjapis6.us-east-2.redshift.amazonaws.com:5439/dwh\n",
309 | "Done.\n",
310 | "CPU times: user 4.42 ms, sys: 68 µs, total: 4.48 ms\n",
311 | "Wall time: 24.8 s\n"
312 | ]
313 | }
314 | ],
315 | "source": [
316 | "%%time\n",
317 | "\n",
318 | "qry = \"\"\"\n",
319 | " \n",
320 | " copy sporting_event_ticket_full from 's3://udacity-labs/tickets/full/full.csv.gz'\n",
321 | " credentials 'aws_iam_role={}'\n",
322 | " gzip delimiter ';' compupdate off region 'us-west-2'\n",
323 | " \n",
324 | "\"\"\".format(DWH_ROLE_ARN)\n",
325 | "\n",
326 | "%sql $qry"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": []
335 | }
336 | ],
337 | "metadata": {
338 | "kernelspec": {
339 | "display_name": "Python 3",
340 | "language": "python",
341 | "name": "python3"
342 | },
343 | "language_info": {
344 | "codemirror_mode": {
345 | "name": "ipython",
346 | "version": 3
347 | },
348 | "file_extension": ".py",
349 | "mimetype": "text/x-python",
350 | "name": "python",
351 | "nbconvert_exporter": "python",
352 | "pygments_lexer": "ipython3",
353 | "version": "3.6.3"
354 | }
355 | },
356 | "nbformat": 4,
357 | "nbformat_minor": 2
358 | }
359 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/.gitignore:
--------------------------------------------------------------------------------
1 | dwh.cfg
2 | .ipynb_checkpoints
3 | __pycache__
4 | data
5 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/README.md:
--------------------------------------------------------------------------------
1 | # Sparkify's Data Warehouse ETL process
2 |
3 | ## Summary
4 |
5 | - [Introduction](#introduction)
6 | - [Getting started](#getting-started)
7 | - [The ETL Process](#the-etl-process)
8 | - [Analyzing the results](#analyzing-the-results)
9 | - [The database structure](#the-database-structure)
10 |
11 | ## Introduction
12 |
13 | This project uses Amazon Web Services (S3 and Redshift) to run an ETL process that takes the raw log files of the Sparkify app and loads them into a database schema built for analytical queries.
14 |
15 | ## Getting started
16 |
17 | Read the sections below to get started:
18 |
19 | ### Configuration
20 |
21 | First of all, copy `dwh.cfg.example` to a version without the example suffix (`dwh.cfg`), then fill in all the configuration fields.
22 | 
23 | Leave only two fields empty, as they are written back automatically by the provisioning scripts (a filled-in example is sketched below):
24 | - `HOST` (inside the `CLUSTER` configuration section)
25 | - `ARN` (inside the `IAM_ROLE` configuration section)
26 |
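For reference, a filled-in `dwh.cfg` might look roughly like the sketch below. Every value here is an illustrative placeholder (node counts, names and S3 paths are assumptions, not real credentials or endpoints); adjust them to your own AWS account and datasets.

```ini
[CLUSTER]
; HOST is written back by aws_check_cluster_available.py
HOST=
DB_NAME=dwh
DB_USER=dwhuser
DB_PASSWORD=<choose-a-strong-password>
DB_PORT=5439

[IAM_ROLE]
; ARN is written back by aws_check_cluster_available.py
ARN=

[S3]
LOG_DATA=<s3 path of the event log dataset>
LOG_JSONPATH=<s3 path of the log JSONPaths file>
SONG_DATA=<s3 path of the song dataset>

[AWS]
KEY=<your AWS access key id>
SECRET=<your AWS secret access key>

[DWH]
DWH_CLUSTER_TYPE=multi-node
DWH_NUM_NODES=4
DWH_NODE_TYPE=dc2.large
DWH_IAM_ROLE_NAME=dwhRole
DWH_CLUSTER_IDENTIFIER=dwhCluster
```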
27 | ### Infrastructure provisioning
28 |
29 | There are **3 scripts** that ease the job of creating (and destroying) the data warehouse infrastructure (a sketch of the underlying `boto3` call is shown after the list):
30 | #### 1. Creating a new AWS Redshift Cluster
31 | ```sh
32 | python aws_create_cluster.py
33 | ```
34 |
35 | #### 2. Checking the cluster availability
36 |
37 | _Run this one several times until your cluster becomes available - it usually takes 3 to 6 minutes_
38 |
39 | ```sh
40 | python aws_check_cluster_available.py
41 | ```
42 |
43 | #### 3. Destroying the cluster
44 |
45 | _After the ETL process is done, or whenever you want, you can destroy the cluster with a single command:_
46 |
47 | ```sh
48 | python aws_destroy_cluster.py
49 | ```
50 |
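Under the hood, `aws_create_cluster.py` provisions the cluster with `boto3`. The sketch below shows only the essential call, with placeholder values; the real script reads everything from `dwh.cfg` and also creates and attaches the IAM role first.

```python
import boto3

# Minimal sketch of the provisioning call made by aws_create_cluster.py.
# All literal values are placeholders; the real script reads them from dwh.cfg.
redshift = boto3.client(
    'redshift',
    region_name='us-east-2',
    aws_access_key_id='<your-aws-key>',
    aws_secret_access_key='<your-aws-secret>',
)

response = redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    ClusterIdentifier='dwhCluster',
    MasterUsername='dwhuser',
    MasterUserPassword='<your-db-password>',
    IamRoles=['<arn-of-a-role-with-AmazonS3ReadOnlyAccess>'],
)
print(response['ResponseMetadata']['HTTPStatusCode'])  # 200 means the request was accepted
```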
51 | ## The ETL Process
52 |
53 | It consists of these two simple python scripts:
54 |
55 | - `python create_tables.py` - drops the tables if they exist, and then creates them (again);
56 | - `python etl.py` - this script does two main tasks:
57 |     - Copy (load) the logs from the dataset's S3 bucket into the staging tables;
58 |     - Translate all data from the staging tables into the analytical tables with `INSERT ... SELECT` statements (see the sketch below).
59 |
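To illustrate the translation step, here is the general shape of one of those statements, written in the same style as `sql_queries.py`. This is a simplified sketch of `user_table_insert`; the real query also filters out users that were already inserted.

```python
# Simplified sketch of user_table_insert from sql_queries.py:
# deduplicate the staging rows and load them into the users dimension.
user_table_insert = ("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT
        userId    AS user_id,
        firstName AS first_name,
        lastName  AS last_name,
        gender,
        level
    FROM staging_events
    WHERE page = 'NextSong'
""")
```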
60 | ## Analyzing the results
61 |
62 | After the ETL process completes, we can check whether everything went right by running `python analyze.py`.
63 |
64 | It is a simple script that prints the row count of each analytical table (see the sketch below).
65 |
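Conceptually, each check is just a row count, along these lines (a minimal sketch; the real query list lives in `analytical_queries` inside `sql_queries.py`):

```python
import configparser
import psycopg2

# Minimal sketch of the kind of check analyze.py performs for one table.
config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

cur.execute("SELECT COUNT(*) AS total FROM songplays")
print("Song plays table count:", cur.fetchone()[0])

conn.close()
```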
66 | ## The database structure
67 |
68 | Below you can dive into the database structure (_a simple star schema_) created to run the analytical queries.
69 |
70 | ### Entity-relationship diagram
71 |
72 | 
73 |
74 | ### Analytical Tables specifications
75 |
76 | #### Song Plays table
77 |
78 | - *Name:* `songplays`
79 | - *Type:* Fact table
80 |
81 | | Column | Type | Description |
82 | | ------ | ---- | ----------- |
83 | | `songplay_id` | `INTEGER IDENTITY(0,1) SORTKEY` | The main identification of the table |
84 | | `start_time` | `TIMESTAMP NOT NULL` | The timestamp that this song play log happened |
85 | | `user_id` | `INTEGER NOT NULL REFERENCES users (user_id)` | The user id that triggered this song play log. It cannot be null, as we don't have song play logs that were not triggered by a user. |
86 | | `level` | `VARCHAR(10)` | The level of the user that triggered this song play log |
87 | | `song_id` | `VARCHAR(20) REFERENCES songs (song_id)` | The identification of the song that was played. It can be null. |
88 | | `artist_id` | `VARCHAR(20) REFERENCES artists (artist_id)` | The identification of the artist of the song that was played. |
89 | | `session_id` | `INTEGER NOT NULL` | The session_id of the user on the app |
90 | | `location` | `VARCHAR(500)` | The location where this song play log was triggered |
91 | | `user_agent` | `VARCHAR(500)` | The user agent used to access our app |
92 |
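For reference, this table is created by `songplay_table_create` in `sql_queries.py`, roughly as follows:

```python
# Redshift DDL for the songplays fact table (see songplay_table_create in sql_queries.py).
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INTEGER IDENTITY(0,1) SORTKEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INTEGER NOT NULL REFERENCES users (user_id),
        level       VARCHAR(10),
        song_id     VARCHAR(20) REFERENCES songs (song_id),
        artist_id   VARCHAR(20) REFERENCES artists (artist_id),
        session_id  INTEGER NOT NULL,
        location    VARCHAR(500),
        user_agent  VARCHAR(500)
    )
""")
```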
93 | #### Users table
94 |
95 | - *Name:* `users`
96 | - *Type:* Dimension table
97 |
98 | | Column | Type | Description |
99 | | ------ | ---- | ----------- |
100 | | `user_id` | `INTEGER PRIMARY KEY` | The main identification of a user |
101 | | `first_name` | `VARCHAR(500) NOT NULL` | First name of the user; it cannot be null, as it is the basic information we have about the user |
102 | | `last_name` | `VARCHAR(500) NOT NULL` | Last name of the user. |
103 | | `gender` | `CHAR(1)` | The gender is stated with just one character `M` (male) or `F` (female). Otherwise it can be stated as `NULL` |
104 | | `level` | `VARCHAR(10) NOT NULL` | The level stands for the user app plans (`premium` or `free`) |
105 |
106 |
107 | #### Songs table
108 |
109 | - *Name:* `songs`
110 | - *Type:* Dimension table
111 |
112 | | Column | Type | Description |
113 | | ------ | ---- | ----------- |
114 | | `song_id` | `VARCHAR(20) PRIMARY KEY` | The main identification of a song |
115 | | `title` | `VARCHAR(500) NOT NULL SORTKEY` | The title of the song. It cannot be null, as it is the basic information we have about a song. |
116 | | `artist_id` | `VARCHAR NOT NULL DISTKEY REFERENCES artists (artist_id)` | The artist id; it cannot be null, as we don't have songs without an artist. This field also references the artists table. |
117 | | `year` | `INTEGER NOT NULL` | The year that this song was made |
118 | | `duration` | `NUMERIC (15, 5) NOT NULL` | The duration of the song |
119 |
120 |
121 | #### Artists table
122 |
123 | - *Name:* `artists`
124 | - *Type:* Dimension table
125 |
126 | | Column | Type | Description |
127 | | ------ | ---- | ----------- |
128 | | `artist_id` | `VARCHAR(20) PRIMARY KEY` | The main identification of an artist |
129 | | `name` | `VARCHAR(500) NOT NULL` | The name of the artist |
130 | | `location` | `VARCHAR(500)` | The location the artist is from |
131 | | `latitude` | `DECIMAL(12,6)` | The latitude of the location the artist is from |
132 | | `longitude` | `DECIMAL(12,6)` | The longitude of the location the artist is from |
133 |
134 | #### Time table
135 |
136 | - *Name:* `time`
137 | - *Type:* Dimension table
138 |
139 | | Column | Type | Description |
140 | | ------ | ---- | ----------- |
141 | | `start_time` | `TIMESTAMP NOT NULL PRIMARY KEY` | The timestamp itself; it serves as the main identification of this table |
142 | | `hour` | `NUMERIC NOT NULL` | The hour from the timestamp |
143 | | `day` | `NUMERIC NOT NULL` | The day of the month from the timestamp |
144 | | `week` | `NUMERIC NOT NULL` | The week of the year from the timestamp |
145 | | `month` | `NUMERIC NOT NULL` | The month of the year from the timestamp |
146 | | `year` | `NUMERIC NOT NULL` | The year from the timestamp |
147 | | `weekday` | `NUMERIC NOT NULL` | The day of the week from the timestamp |
148 |
149 | ### Staging Tables specifications
150 |
151 | The ETL process uses staging tables to copy the semi-structured log files into plain database tables before any transformation, as sketched below.
152 |
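The load itself uses Redshift's `COPY` command. Below is a trimmed sketch of `staging_events_copy` from `sql_queries.py`; the S3 paths and IAM role ARN shown here are placeholders that the real code reads from `dwh.cfg`.

```python
# Trimmed sketch of the staging load (see staging_events_copy in sql_queries.py).
# The three values below are placeholders; the real script reads them from dwh.cfg.
S3_LOG_DATA = "'s3://<bucket>/log_data'"
S3_LOG_JSONPATH = "'s3://<bucket>/log_json_path.json'"
DWH_IAM_ROLE_ARN = "<arn-of-the-redshift-role>"

staging_events_copy = ("""
    COPY staging_events
    FROM {}
    REGION 'us-west-2'
    IAM_ROLE '{}'
    COMPUPDATE OFF STATUPDATE OFF
    FORMAT AS JSON {}
    TIMEFORMAT AS 'epochmillisecs'
""").format(S3_LOG_DATA, DWH_IAM_ROLE_ARN, S3_LOG_JSONPATH)
```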
153 | #### Events table
154 |
155 | - *Name:* `staging_events`
156 | - *Type:* Staging table
157 |
158 | | Column | Type | Description |
159 | | ------ | ---- | ----------- |
160 | | `artist` | `VARCHAR(500)` | The artist name |
161 | | `auth` | `VARCHAR(20)` | The authentication status |
162 | | `firstName` | `VARCHAR(500)` | The first name of the user |
163 | | `gender` | `CHAR(1)` | The gender of the user |
164 | | `itemInSession` | `INTEGER` | The sequence number of the item inside a given session |
165 | | `lastName` | `VARCHAR(500)` | The last name of the user |
166 | | `length` | `DECIMAL(12, 5)` | The duration of the song |
167 | | `level` | `VARCHAR(10)` | The level of the user´s plan (free or premium) |
168 | | `location` | `VARCHAR(500)` | The location of the user |
169 | | `method` | `VARCHAR(20)` | The method of the http request |
170 | | `page` | `VARCHAR(500)` | The page that the event occurred |
171 | | `registration` | `FLOAT` | The time that the user registered |
172 | | `sessionId` | `INTEGER` | The session id |
173 | | `song` | `VARCHAR(500)` | The song name |
174 | | `status` | `INTEGER` | The status |
175 | | `ts` | `VARCHAR(50)` | The timestamp that this event occurred |
176 | | `userAgent` | `VARCHAR(500)` | The user agent he was using |
177 | | `userId` | `INTEGER` | The user id |
178 |
179 | #### Songs table
180 |
181 | - *Name:* `staging_songs`
182 | - *Type:* Staging table
183 |
184 | | Column | Type | Description |
185 | | ------ | ---- | ----------- |
186 | | `num_songs` | `INTEGER` | The number of songs of this artist |
187 | | `artist_id` | `VARCHAR(20)` | The artist id |
188 | | `artist_latitude` | `DECIMAL(12, 5)` | The artist latitude location |
189 | | `artist_longitude` | `DECIMAL(12, 5)` | The artist longitude location |
190 | | `artist_location` | `VARCHAR(500)` | The artist descriptive location |
191 | | `artist_name` | `VARCHAR(500)` | The artist name |
192 | | `song_id` | `VARCHAR(20)` | The song id |
193 | | `title` | `VARCHAR(500)` | The title |
194 | | `duration` | `DECIMAL(15, 5)` | The duration of the song |
195 | | `year` | `INTEGER` | The year of the song |
196 |
197 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/analyze.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import analytical_queries, analytical_query_titles
4 |
5 |
6 | def run_analytical_queries(cur):
7 | """
8 | Runs all analytical queries written in the sql_queries script
9 | :param cur:
10 | :return:
11 | """
12 | idx = 0
13 | for query in analytical_queries:
14 | print("{}... ".format(analytical_query_titles[idx]))
15 |         cur.execute(query)  # cursor.execute() returns None, so fetch the count explicitly
16 |         print(cur.fetchone()[0])
17 | idx = idx + 1
18 | print(" [DONE] ")
19 |
20 |
21 | def main():
22 | config = configparser.ConfigParser()
23 | config.read('dwh.cfg')
24 |
25 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
26 | cur = conn.cursor()
27 |
28 | run_analytical_queries(cur)
29 |
30 | conn.close()
31 |
32 |
33 | if __name__ == "__main__":
34 | main()
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/aws_check_cluster_available.py:
--------------------------------------------------------------------------------
1 | from aws_create_cluster import config_parse_file, aws_client, aws_open_redshift_port, check_cluster_creation, aws_resource, config_persist_cluster_infos
2 |
3 |
4 | def main():
5 | config_parse_file()
6 |
7 | redshift = aws_client('redshift', "us-east-2")
8 |
9 | if check_cluster_creation(redshift):
10 | print('available')
11 | ec2 = aws_resource('ec2', 'us-east-2')
12 | config_persist_cluster_infos(redshift)
13 | aws_open_redshift_port(ec2, redshift)
14 | else:
15 | print('notyet')
16 |
17 |
18 | if __name__ == '__main__':
19 | main()
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/aws_create_cluster.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import pandas as pd
3 | import boto3
4 | import json
5 | import time
6 |
7 | KEY = None
8 | SECRET = None
9 |
10 | DWH_CLUSTER_TYPE = None
11 | DWH_NUM_NODES = None
12 | DWH_NODE_TYPE = None
13 |
14 | DWH_CLUSTER_IDENTIFIER = None
15 | DWH_DB = None
16 | DWH_DB_USER = None
17 | DWH_DB_PASSWORD = None
18 | DWH_PORT = None
19 |
20 | DWH_IAM_ROLE_NAME = None
21 |
22 |
23 | def config_parse_file():
24 | """
25 | Parse the dwh.cfg configuration file
26 | :return:
27 | """
28 | global KEY, SECRET, DWH_CLUSTER_TYPE, DWH_NUM_NODES, \
29 | DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB, \
30 | DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME
31 |
32 | print("Parsing the config file...")
33 | config = configparser.ConfigParser()
34 | with open('dwh.cfg') as configfile:
35 | config.read_file(configfile)
36 |
37 | KEY = config.get('AWS', 'KEY')
38 | SECRET = config.get('AWS', 'SECRET')
39 |
40 | DWH_CLUSTER_TYPE = config.get("DWH", "DWH_CLUSTER_TYPE")
41 | DWH_NUM_NODES = config.get("DWH", "DWH_NUM_NODES")
42 | DWH_NODE_TYPE = config.get("DWH", "DWH_NODE_TYPE")
43 |
44 | DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME")
45 | DWH_CLUSTER_IDENTIFIER = config.get("DWH", "DWH_CLUSTER_IDENTIFIER")
46 |
47 | DWH_DB = config.get("CLUSTER", "DB_NAME")
48 | DWH_DB_USER = config.get("CLUSTER", "DB_USER")
49 | DWH_DB_PASSWORD = config.get("CLUSTER", "DB_PASSWORD")
50 | DWH_PORT = config.get("CLUSTER", "DB_PORT")
51 |
52 |
53 | def create_iam_role(iam):
54 | """
55 | Create the AWS IAM role
56 | :param iam:
57 | :return:
58 | """
59 | global DWH_IAM_ROLE_NAME
60 | dwhRole = None
61 | try:
62 | print('1.1 Creating a new IAM Role')
63 | dwhRole = iam.create_role(
64 | Path='/',
65 | RoleName=DWH_IAM_ROLE_NAME,
66 | Description="Allows Redshift clusters to call AWS services on your behalf.",
67 | AssumeRolePolicyDocument=json.dumps(
68 | {'Statement': [{'Action': 'sts:AssumeRole',
69 | 'Effect': 'Allow',
70 | 'Principal': {'Service': 'redshift.amazonaws.com'}}],
71 | 'Version': '2012-10-17'})
72 | )
73 | except Exception as e:
74 | print(e)
75 | dwhRole = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)
76 | return dwhRole
77 |
78 |
79 | def attach_iam_role_policy(iam):
80 | """
81 | Attach the AmazonS3ReadOnlyAccess role policy to the created IAM
82 | :param iam:
83 | :return:
84 | """
85 | global DWH_IAM_ROLE_NAME
86 | print('1.2 Attaching Policy')
87 | return iam.attach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")['ResponseMetadata']['HTTPStatusCode'] == 200
88 |
89 |
90 | def get_iam_role_arn(iam):
91 | """
92 | Get the IAM role ARN string
93 | :param iam: The IAM resource client
94 | :return:string
95 | """
96 | global DWH_IAM_ROLE_NAME
97 | return iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']
98 |
99 |
100 | def start_cluster_creation(redshift, roleArn):
101 | """
102 | Start the Redshift cluster creation
103 | :param redshift: The redshift resource client
104 | :param roleArn: The created role ARN
105 | :return:
106 | """
107 | global DWH_CLUSTER_TYPE, DWH_NODE_TYPE, DWH_NUM_NODES, \
108 | DWH_DB, DWH_CLUSTER_IDENTIFIER, DWH_DB_USER, DWH_DB_PASSWORD
109 | print("2. Starting redshift cluster creation")
110 | try:
111 | response = redshift.create_cluster(
112 | # HW
113 | ClusterType=DWH_CLUSTER_TYPE,
114 | NodeType=DWH_NODE_TYPE,
115 | NumberOfNodes=int(DWH_NUM_NODES),
116 |
117 | # Identifiers & Credentials
118 | DBName=DWH_DB,
119 | ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,
120 | MasterUsername=DWH_DB_USER,
121 | MasterUserPassword=DWH_DB_PASSWORD,
122 |
123 | # Roles (for s3 access)
124 | IamRoles=[roleArn]
125 | )
126 | print("Redshift cluster creation http response status code: ")
127 | print(response['ResponseMetadata']['HTTPStatusCode'])
128 | return response['ResponseMetadata']['HTTPStatusCode'] == 200
129 | except Exception as e:
130 | print(e)
131 | return False
132 |
133 |
134 | def config_persist_cluster_infos(redshift):
135 | """
136 | Write back to the dwh.cfg configuration file the cluster endpoint and IAM ARN
137 | :param redshift: The redshift resource client
138 | :return:
139 | """
140 | global DWH_CLUSTER_IDENTIFIER
141 | print("Writing the cluster address and IamRoleArn to the config file...")
142 |
143 | cluster_props = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
144 |
145 | config = configparser.ConfigParser()
146 |
147 | with open('dwh.cfg') as configfile:
148 | config.read_file(configfile)
149 |
150 | config.set("CLUSTER", "HOST", cluster_props['Endpoint']['Address'])
151 | config.set("IAM_ROLE", "ARN", cluster_props['IamRoles'][0]['IamRoleArn'])
152 |
153 | with open('dwh.cfg', 'w+') as configfile:
154 | config.write(configfile)
155 |
156 | config_parse_file()
157 |
158 |
159 | def get_redshift_cluster_status(redshift):
160 | """
161 | Retrieves the Redshift cluster status
162 | :param redshift: The Redshift resource client
163 | :return: The cluster status
164 | """
165 | global DWH_CLUSTER_IDENTIFIER
166 | cluster_props = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
167 | cluster_status = cluster_props['ClusterStatus']
168 | return cluster_status.lower()
169 |
170 |
171 | def check_cluster_creation(redshift):
172 | """
173 | Check if the cluster status is available, if it is returns True. Otherwise, false.
174 | :param redshift: The Redshift client resource
175 | :return:bool
176 | """
177 | if get_redshift_cluster_status(redshift) == 'available':
178 | return True
179 | return False
180 |
181 |
182 | def destroy_redshift_cluster(redshift):
183 | """
184 | Destroy the Redshift cluster (request deletion)
185 | :param redshift: The Redshift client resource
186 | :return:None
187 | """
188 | global DWH_CLUSTER_IDENTIFIER
189 | redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)
190 |
191 |
192 | def aws_open_redshift_port(ec2, redshift):
193 | """
194 | Opens the Redshift port on the VPC security group.
195 | :param ec2: The EC2 client resource
196 | :param redshift: The Redshift client resource
197 | :return:None
198 | """
199 | global DWH_CLUSTER_IDENTIFIER, DWH_PORT
200 | cluster_props = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
201 | try:
202 | vpc = ec2.Vpc(id=cluster_props['VpcId'])
203 | all_security_groups = list(vpc.security_groups.all())
204 | print(all_security_groups)
205 |         defaultSg = all_security_groups[1]  # assumes the cluster's security group is the second one listed in the VPC
206 | print(defaultSg)
207 |
208 | defaultSg.authorize_ingress(
209 | GroupName=defaultSg.group_name,
210 | CidrIp='0.0.0.0/0',
211 | IpProtocol='TCP',
212 | FromPort=int(DWH_PORT),
213 | ToPort=int(DWH_PORT)
214 | )
215 | except Exception as e:
216 | print(e)
217 |
218 |
219 | def aws_resource(name, region):
220 | """
221 | Creates an AWS client resource
222 | :param name: The name of the resource
223 | :param region: The region of the resource
224 | :return:
225 | """
226 | global KEY, SECRET
227 | return boto3.resource(name, region_name=region, aws_access_key_id=KEY, aws_secret_access_key=SECRET)
228 |
229 |
230 | def aws_client(service, region):
231 | """
232 | Creates an AWS client
233 | :param service: The service
234 | :param region: The region of the service
235 | :return:
236 | """
237 | global KEY, SECRET
238 | return boto3.client(service, aws_access_key_id=KEY, aws_secret_access_key=SECRET, region_name=region)
239 |
240 | def main():
241 | config_parse_file()
242 |
243 | # ec2 = aws_resource('ec2', 'us-east-2')
244 | # s3 = aws_resource('s3', 'us-west-2')
245 | iam = aws_client('iam', "us-east-2")
246 | redshift = aws_client('redshift', "us-east-2")
247 |
248 | create_iam_role(iam)
249 | attach_iam_role_policy(iam)
250 | roleArn = get_iam_role_arn(iam)
251 |
252 | clusterCreationStarted = start_cluster_creation(redshift, roleArn)
253 |
254 | if clusterCreationStarted:
255 | print("The cluster is being created.")
256 | # while True:
257 | # print("Gonna check if the cluster was created...")
258 | # if check_cluster_creation(redshift):
259 | # config_persist_cluster_infos(redshift)
260 | # aws_open_redshift_port(ec2, redshift)
261 | # break
262 | # else:
263 | # print("Not yet. Waiting 30s before next check.")
264 | # time.sleep(30)
265 | # print("DONE!!")
266 |
267 | # wait until becomes true?
268 |
269 | if __name__ == '__main__':
270 | main()
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/aws_destroy_cluster.py:
--------------------------------------------------------------------------------
1 | from aws_create_cluster import config_parse_file, aws_client, check_cluster_creation, \
2 | config_persist_cluster_infos, destroy_redshift_cluster, get_redshift_cluster_status
3 |
4 |
5 | def main():
6 | config_parse_file()
7 |
8 | redshift = aws_client('redshift', "us-east-2")
9 |
10 | if check_cluster_creation(redshift):
11 | print('available')
12 | destroy_redshift_cluster(redshift)
13 | print('New redshift cluster status: ')
14 | print(get_redshift_cluster_status(redshift))
15 | else:
16 | print('notyet')
17 |
18 |
19 | if __name__ == '__main__':
20 | main()
21 |
22 |
23 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/create_tables.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import create_table_queries, drop_table_queries
4 |
5 |
6 | def drop_tables(cur, conn):
7 | """
8 | Drop all tables
9 | :param cur:
10 | :param conn:
11 | :return:
12 | """
13 | for query in drop_table_queries:
14 | cur.execute(query)
15 | conn.commit()
16 |
17 |
18 | def create_tables(cur, conn):
19 | """
20 | Create all tables
21 | :param cur:
22 | :param conn:
23 | :return:
24 | """
25 | for query in create_table_queries:
26 | cur.execute(query)
27 | conn.commit()
28 |
29 |
30 | def main():
31 | config = configparser.ConfigParser()
32 | config.read('dwh.cfg')
33 |
34 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
35 | cur = conn.cursor()
36 |
37 | drop_tables(cur, conn)
38 | create_tables(cur, conn)
39 |
40 | conn.close()
41 |
42 |
43 | if __name__ == "__main__":
44 | main()
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/data-warehouse-project-der-diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabfr/data-engineering-nanodegree/61e6934bee8238d1b45beed124a9c778b83b366a/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/data-warehouse-project-der-diagram.png
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/dwh.cfg.example:
--------------------------------------------------------------------------------
1 | [CLUSTER]
2 | HOST=
3 | DB_NAME=
4 | DB_USER=
5 | DB_PASSWORD=
6 | DB_PORT=
7 |
8 | [IAM_ROLE]
9 | ARN=
10 |
11 | [S3]
12 | LOG_DATA=
13 | LOG_JSONPATH=
14 | SONG_DATA=
15 |
16 | [AWS]
17 | KEY=
18 | SECRET=
19 |
20 | [DWH]
21 | DWH_CLUSTER_TYPE=
22 | DWH_NUM_NODES=
23 | DWH_NODE_TYPE=
24 |
25 | DWH_IAM_ROLE_NAME=
26 | DWH_CLUSTER_IDENTIFIER=
27 |
28 |
29 |
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import copy_table_order, copy_table_queries, insert_table_order, insert_table_queries
4 |
5 |
6 | def load_staging_tables(cur, conn):
7 | """
8 | Load data from the logs to the staging tables
9 | :param cur: The cursor of the connection
10 | :param conn: The connection itself
11 | :return:None
12 | """
13 | idx = 0
14 | for query in copy_table_queries:
15 | print("Copying data into {}...".format(copy_table_order[idx]))
16 | cur.execute(query)
17 | conn.commit()
18 | idx = idx + 1
19 | print(" [DONE] ")
20 |
21 |
22 | def insert_tables(cur, conn):
23 | """
24 | Translate/insert data from the staging tables to the analytical tables
25 | :param cur: The cursor of the connection
26 | :param conn: The connection itself
27 | :return:None
28 | """
29 | idx = 0
30 | for query in insert_table_queries:
31 | print("Inserting data into {}...".format(insert_table_order[idx]))
32 | cur.execute(query)
33 | conn.commit()
34 | idx = idx + 1
35 | print(" [DONE] ")
36 |
37 |
38 | def main():
39 | config = configparser.ConfigParser()
40 | config.read('dwh.cfg')
41 |
42 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
43 | cur = conn.cursor()
44 |
45 | load_staging_tables(cur, conn)
46 | insert_tables(cur, conn)
47 |
48 | conn.close()
49 |
50 |
51 | if __name__ == "__main__":
52 | main()
--------------------------------------------------------------------------------
/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse/sql_queries.py:
--------------------------------------------------------------------------------
1 | import configparser
2 |
3 |
4 | # CONFIG
5 | config = configparser.ConfigParser()
6 | config.read('dwh.cfg')
7 |
8 | S3_LOG_DATA = config.get('S3', 'LOG_DATA')
9 | S3_LOG_JSONPATH = config.get('S3', 'LOG_JSONPATH')
10 | S3_SONG_DATA = config.get('S3', 'SONG_DATA')
11 | DWH_IAM_ROLE_ARN = config.get("IAM_ROLE", "ARN")
12 |
13 | # DROP TABLES
14 |
15 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events;"
16 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs;"
17 | songplay_table_drop = "DROP TABLE IF EXISTS songplays;"
18 | user_table_drop = "DROP TABLE IF EXISTS users;"
19 | song_table_drop = "DROP TABLE IF EXISTS songs;"
20 | artist_table_drop = "DROP TABLE IF EXISTS artists;"
21 | time_table_drop = "DROP TABLE IF EXISTS time;"
22 |
23 | # CREATE TABLES
24 |
25 | staging_events_table_create= ("""
26 | CREATE TABLE staging_events (
27 | artist VARCHAR(500),
28 | auth VARCHAR(20),
29 | firstName VARCHAR(500),
30 | gender CHAR(1),
31 | itemInSession INTEGER,
32 | lastName VARCHAR(500),
33 | length DECIMAL(12, 5),
34 | level VARCHAR(10),
35 | location VARCHAR(500),
36 | method VARCHAR(20),
37 | page VARCHAR(500),
38 | registration FLOAT,
39 | sessionId INTEGER,
40 | song VARCHAR(500),
41 | status INTEGER,
42 | ts VARCHAR(50),
43 | userAgent VARCHAR(500),
44 | userId INTEGER
45 | );
46 | """)
47 |
48 | staging_songs_table_create = ("""
49 | CREATE TABLE staging_songs (
50 | num_songs INTEGER,
51 | artist_id VARCHAR(20),
52 | artist_latitude DECIMAL(12, 5),
53 | artist_longitude DECIMAL(12, 5),
54 | artist_location VARCHAR(500),
55 | artist_name VARCHAR(500),
56 | song_id VARCHAR(20),
57 | title VARCHAR(500),
58 | duration DECIMAL(15, 5),
59 | year INTEGER
60 | );
61 | """)
62 |
63 | songplay_table_create = ("""
64 | CREATE TABLE IF NOT EXISTS songplays (
65 | songplay_id INTEGER IDENTITY(0,1) SORTKEY,
66 | start_time TIMESTAMP NOT NULL,
67 | user_id INTEGER NOT NULL REFERENCES users (user_id),
68 | level VARCHAR(10),
69 | song_id VARCHAR(20) REFERENCES songs (song_id),
70 | artist_id VARCHAR(20) REFERENCES artists (artist_id),
71 | session_id INTEGER NOT NULL,
72 | location VARCHAR(500),
73 | user_agent VARCHAR(500)
74 | )
75 | """)
76 |
77 | user_table_create = ("""
78 | CREATE TABLE IF NOT EXISTS users (
79 | user_id INTEGER PRIMARY KEY,
80 | first_name VARCHAR(500) NOT NULL,
81 | last_name VARCHAR(500) NOT NULL,
82 | gender CHAR(1),
83 | level VARCHAR(10) NOT NULL
84 | )
85 | """)
86 |
87 | song_table_create = ("""
88 | CREATE TABLE IF NOT EXISTS songs (
89 | song_id VARCHAR(20) PRIMARY KEY,
90 | title VARCHAR(500) NOT NULL SORTKEY,
91 | artist_id VARCHAR NOT NULL DISTKEY REFERENCES artists (artist_id),
92 | year INTEGER NOT NULL,
93 | duration DECIMAL (15, 5) NOT NULL
94 | )
95 | """)
96 |
97 | artist_table_create = ("""
98 | CREATE TABLE IF NOT EXISTS artists (
99 | artist_id VARCHAR(20) PRIMARY KEY,
100 | name VARCHAR(500) NOT NULL SORTKEY,
101 | location VARCHAR(500),
102 | latitude DECIMAL(12,6),
103 | longitude DECIMAL(12,6)
104 | )
105 | """)
106 |
107 | time_table_create = ("""
108 | CREATE TABLE IF NOT EXISTS time (
109 | start_time TIMESTAMP NOT NULL PRIMARY KEY SORTKEY,
110 | hour NUMERIC NOT NULL,
111 | day NUMERIC NOT NULL,
112 | week NUMERIC NOT NULL,
113 | month NUMERIC NOT NULL,
114 | year NUMERIC NOT NULL,
115 | weekday NUMERIC NOT NULL
116 | )
117 | """)
118 |
119 | # STAGING TABLES
120 |
121 | staging_events_copy = ("""
122 |
123 | copy staging_events
124 | from {}
125 | region 'us-west-2'
126 | iam_role '{}'
127 | compupdate off statupdate off
128 | format as json {}
129 | timeformat as 'epochmillisecs'
130 |
131 | """).format(S3_LOG_DATA, DWH_IAM_ROLE_ARN, S3_LOG_JSONPATH)
132 |
133 | staging_songs_copy = ("""
134 |
135 | copy staging_songs
136 | from {}
137 | region 'us-west-2'
138 | iam_role '{}'
139 | compupdate off statupdate off
140 | format as json 'auto'
141 |
142 | """).format(S3_SONG_DATA, DWH_IAM_ROLE_ARN)
143 |
144 | # FINAL TABLES
145 |
146 | songplay_table_insert = ("""
147 | INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
148 | SELECT DISTINCT
149 | TIMESTAMP 'epoch' + ts/1000 * INTERVAL '1 second' AS start_time,
150 | e.userId as user_id,
151 | e.level,
152 | s.song_id AS song_id,
153 | s.artist_id AS artist_id,
154 | e.sessionId AS session_id,
155 | e.location AS location,
156 | e.userAgent AS user_agent
157 | FROM
158 | staging_events e, staging_songs s
159 | WHERE
160 | e.page = 'NextSong' AND
161 | e.song = s.title AND
162 | e.userId NOT IN (
163 | SELECT DISTINCT
164 | s2.user_id
165 | FROM
166 | songplays s2
167 | WHERE
168 | s2.user_id = e.userId AND
169 | s2.session_id = e.sessionId
170 | )
171 | """)
172 |
173 | user_table_insert = ("""
174 | INSERT INTO users (user_id, first_name, last_name, gender, level)
175 | SELECT DISTINCT
176 | userId AS user_id,
177 | firstName AS first_name,
178 | lastName AS last_name,
179 | gender,
180 | level
181 | FROM
182 | staging_events
183 | WHERE
184 | page = 'NextSong' AND
185 | user_id NOT IN (SELECT DISTINCT user_id FROM users)
186 | """)
187 |
188 | song_table_insert = ("""
189 | INSERT INTO songs (song_id, title, artist_id, year, duration)
190 | SELECT DISTINCT
191 | song_id,
192 | title,
193 | artist_id,
194 | year,
195 | duration
196 | FROM
197 | staging_songs
198 | WHERE
199 | song_id NOT IN (SELECT DISTINCT song_id FROM songs)
200 | """)
201 |
202 | artist_table_insert = ("""
203 | INSERT INTO artists (artist_id, name, location, latitude, longitude)
204 | SELECT DISTINCT
205 | artist_id,
206 | artist_name AS name,
207 | artist_location AS location,
208 | artist_latitude AS latitude,
209 | artist_longitude AS longitude
210 | FROM
211 | staging_songs
212 | WHERE
213 | artist_id NOT IN (SELECT DISTINCT artist_id FROM artists)
214 | """)
215 |
216 | time_table_insert = ("""
217 | INSERT INTO time (start_time, hour, day, week, month, year, weekday)
218 | SELECT
219 | ts AS start_time,
220 | EXTRACT(hr FROM ts) AS hour,
221 | EXTRACT(d FROM ts) AS day,
222 | EXTRACT(w FROM ts) AS week,
223 | EXTRACT(mon FROM ts) AS month,
224 | EXTRACT(yr FROM ts) AS year,
225 | EXTRACT(weekday FROM ts) AS weekday
226 | FROM (
227 |         SELECT DISTINCT TIMESTAMP 'epoch' + ts/1000 * INTERVAL '1 second' AS ts
228 |         FROM staging_events s
229 |     ) AS distinct_ts
230 | WHERE
231 | start_time NOT IN (SELECT DISTINCT start_time FROM time)
232 | """)
233 |
234 | analytical_queries = [
235 | 'SELECT COUNT(*) AS total FROM artists',
236 | 'SELECT COUNT(*) AS total FROM songs',
237 | 'SELECT COUNT(*) AS total FROM time',
238 | 'SELECT COUNT(*) AS total FROM users',
239 | 'SELECT COUNT(*) AS total FROM songplays'
240 | ]
241 | analytical_query_titles = [
242 | 'Artists table count',
243 | 'Songs table count',
244 | 'Time table count',
245 | 'Users table count',
246 | 'Song plays table count'
247 | ]
248 |
249 | # QUERY LISTS
250 |
251 | create_table_queries = [
252 | staging_events_table_create,
253 | staging_songs_table_create,
254 | time_table_create,
255 | user_table_create,
256 | artist_table_create,
257 | song_table_create,
258 | songplay_table_create
259 | ]
260 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
261 | copy_table_order = ['staging_events', 'staging_songs']
262 | copy_table_queries = [staging_events_copy, staging_songs_copy]
263 | insert_table_order = ['artists', 'songs', 'time', 'users', 'songplays']
264 | insert_table_queries = [artist_table_insert, song_table_insert, time_table_insert, user_table_insert, songplay_table_insert]
265 |
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/1_procedural_vs_functional_in_python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Procedural Programming\n",
8 | "\n",
9 | "This notebook contains the code from the previous screencast. The code counts the number of times a song appears in the log_of_songs variable. \n",
10 | "\n",
11 | "You'll notice that the first time you run `count_plays(\"Despacito\")`, you get the correct count. However, when you run `count_plays(\"Despacito\")` again, the results are no longer correct. This is because the global variable `play_count` stores the results outside of the count_plays function. \n",
12 | "\n",
13 | "\n",
14 | "# Instructions\n",
15 | "\n",
16 | "Run the code cells in this notebook to see the problem with using a global variable this way."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "log_of_songs = [\n",
26 | " \"Despacito\",\n",
27 | " \"Nice for what\",\n",
28 | " \"No tears left to cry\",\n",
29 | " \"Despacito\",\n",
30 | " \"Havana\",\n",
31 | " \"In my feelings\",\n",
32 | " \"Nice for what\",\n",
33 | " \"Despacito\",\n",
34 | " \"All the stars\"\n",
35 | "]"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 2,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "play_count = 0"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "def count_plays(song_title):\n",
54 | " global play_count\n",
55 | " for song in log_of_songs:\n",
56 | " if song == song_title:\n",
57 | " play_count = play_count + 1\n",
58 | " return play_count"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 4,
64 | "metadata": {},
65 | "outputs": [
66 | {
67 | "data": {
68 | "text/plain": [
69 | "3"
70 | ]
71 | },
72 | "execution_count": 4,
73 | "metadata": {},
74 | "output_type": "execute_result"
75 | }
76 | ],
77 | "source": [
78 | "count_plays(\"Despacito\")"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 5,
84 | "metadata": {},
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "6"
90 | ]
91 | },
92 | "execution_count": 5,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "count_plays(\"Despacito\")"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "# How to Solve the Issue\n",
106 | "\n",
107 | "How might you solve this issue? You could get rid of the global variable and instead use play_count as an input to the function:\n",
108 | "\n",
109 | "```python\n",
110 | "def count_plays(song_title, play_count):\n",
111 | " for song in log_of_songs:\n",
112 | " if song == song_title:\n",
113 | " play_count = play_count + 1\n",
114 | " return play_count\n",
115 | "\n",
116 | "```\n",
117 | "\n",
118 | "How would this work with parallel programming? Spark splits up data onto multiple machines. If your songs list were split onto two machines, Machine A would first need to finish counting, and then return its own result to Machine B. And then Machine B could use the output from Machine A and add to the count.\n",
119 | "\n",
120 | "However, that isn't parallel computing. Machine B would have to wait until Machine A finishes. You'll see in the next parts of the lesson how Spark solves this issue with a functional programming paradigm.\n",
121 | "\n",
122 | "In Spark, if your data is split onto two different machines, machine A will run a function to count how many times 'Despacito' appears on machine A. Machine B will simultaneously run a function to count how many times 'Despacito' appears on machine B. After they finish counting individually, they'll combine their results together. You'll see how this works in the next parts of the lesson."
123 | ]
124 | }
125 | ],
126 | "metadata": {
127 | "kernelspec": {
128 | "display_name": "Python 3",
129 | "language": "python",
130 | "name": "python3"
131 | },
132 | "language_info": {
133 | "codemirror_mode": {
134 | "name": "ipython",
135 | "version": 3
136 | },
137 | "file_extension": ".py",
138 | "mimetype": "text/x-python",
139 | "name": "python",
140 | "nbconvert_exporter": "python",
141 | "pygments_lexer": "ipython3",
142 | "version": "3.6.3"
143 | }
144 | },
145 | "nbformat": 4,
146 | "nbformat_minor": 2
147 | }
148 |
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/2_spark_maps_and_lazy_evaluation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Maps\n",
8 | "\n",
9 | "In Spark, maps take data as input and then transform that data with whatever function you put in the map. They are like directions for the data, telling how each input should be transformed to produce the output.\n",
10 | "\n",
11 | "The first code cell creates a SparkContext object. With the SparkContext, you can input a dataset and parallelize the data across a cluster (since you are currently using Spark in local mode on a single machine, technically the dataset isn't distributed yet).\n",
12 | "\n",
13 | "Run the code cell below to instantiate a SparkContext object and then read in the log_of_songs list into Spark. "
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "### \n",
23 | "# You might have noticed this code in the screencast.\n",
24 | "#\n",
25 | "# import findspark\n",
26 | "# findspark.init('spark-2.3.2-bin-hadoop2.7')\n",
27 | "#\n",
28 | "# The findspark Python module makes it easier to install\n",
29 | "# Spark in local mode on your computer. This is convenient\n",
30 | "# for practicing Spark syntax locally. \n",
31 | "# However, the workspaces already have Spark installed and you do not\n",
32 | "# need to use the findspark module\n",
33 | "#\n",
34 | "###\n",
35 | "\n",
36 | "import pyspark\n",
37 | "sc = pyspark.SparkContext(appName=\"maps_and_lazy_evaluation_example\")\n",
38 | "\n",
39 | "log_of_songs = [\n",
40 | " \"Despacito\",\n",
41 | " \"Nice for what\",\n",
42 | " \"No tears left to cry\",\n",
43 | " \"Despacito\",\n",
44 | " \"Havana\",\n",
45 | " \"In my feelings\",\n",
46 | " \"Nice for what\",\n",
47 | " \"despacito\",\n",
48 | " \"All the stars\"\n",
49 | "]\n",
50 | "\n",
51 | "# parallelize the log_of_songs to use with Spark\n",
52 | "distributed_song_log = sc.parallelize(log_of_songs)"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "This next code cell defines a function that converts a song title to lowercase. Then there is an example converting the word \"Havana\" to \"havana\"."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 2,
65 | "metadata": {},
66 | "outputs": [
67 | {
68 | "data": {
69 | "text/plain": [
70 | "'havana'"
71 | ]
72 | },
73 | "execution_count": 2,
74 | "metadata": {},
75 | "output_type": "execute_result"
76 | }
77 | ],
78 | "source": [
79 | "def convert_song_to_lowercase(song):\n",
80 | " return song.lower()\n",
81 | "\n",
82 | "convert_song_to_lowercase(\"Havana\")"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "The following code cells demonstrate how to apply this function using a map step. The map step will go through each song in the list and apply the convert_song_to_lowercase() function. "
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 3,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "PythonRDD[1] at RDD at PythonRDD.scala:53"
101 | ]
102 | },
103 | "execution_count": 3,
104 | "metadata": {},
105 | "output_type": "execute_result"
106 | }
107 | ],
108 | "source": [
109 | "distributed_song_log.map(convert_song_to_lowercase)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "You'll notice that this code cell ran quite quickly. This is because of lazy evaluation. Spark does not actually execute the map step unless it needs to.\n",
117 | "\n",
118 | "\"RDD\" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data. \n",
119 | "\n",
120 | "To get Spark to actually run the map step, you need to use an \"action\". One available action is the collect method. The collect() method takes the results from all of the worker nodes and \"collects\" them into a single list on the driver (master) node."
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 4,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "data": {
130 | "text/plain": [
131 | "['despacito',\n",
132 | " 'nice for what',\n",
133 | " 'no tears left to cry',\n",
134 | " 'despacito',\n",
135 | " 'havana',\n",
136 | " 'in my feelings',\n",
137 | " 'nice for what',\n",
138 | " 'despacito',\n",
139 | " 'all the stars']"
140 | ]
141 | },
142 | "execution_count": 4,
143 | "metadata": {},
144 | "output_type": "execute_result"
145 | }
146 | ],
147 | "source": [
148 | "distributed_song_log.map(convert_song_to_lowercase).collect()"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "Note as well that Spark is not changing the original data set: Spark is merely making a copy. You can see this by running collect() on the original dataset."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 5,
161 | "metadata": {},
162 | "outputs": [
163 | {
164 | "data": {
165 | "text/plain": [
166 | "['Despacito',\n",
167 | " 'Nice for what',\n",
168 | " 'No tears left to cry',\n",
169 | " 'Despacito',\n",
170 | " 'Havana',\n",
171 | " 'In my feelings',\n",
172 | " 'Nice for what',\n",
173 | " 'despacito',\n",
174 | " 'All the stars']"
175 | ]
176 | },
177 | "execution_count": 5,
178 | "metadata": {},
179 | "output_type": "execute_result"
180 | }
181 | ],
182 | "source": [
183 | "distributed_song_log.collect()"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions as well as built-in string methods like str.lower(). \n",
191 | "\n",
192 | "Anonymous functions are actually a Python feature for writing functional style programs."
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 6,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "data": {
202 | "text/plain": [
203 | "['despacito',\n",
204 | " 'nice for what',\n",
205 | " 'no tears left to cry',\n",
206 | " 'despacito',\n",
207 | " 'havana',\n",
208 | " 'in my feelings',\n",
209 | " 'nice for what',\n",
210 | " 'despacito',\n",
211 | " 'all the stars']"
212 | ]
213 | },
214 | "execution_count": 6,
215 | "metadata": {},
216 | "output_type": "execute_result"
217 | }
218 | ],
219 | "source": [
220 | "distributed_song_log.map(lambda song: song.lower()).collect()"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 7,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "data": {
230 | "text/plain": [
231 | "['despacito',\n",
232 | " 'nice for what',\n",
233 | " 'no tears left to cry',\n",
234 | " 'despacito',\n",
235 | " 'havana',\n",
236 | " 'in my feelings',\n",
237 | " 'nice for what',\n",
238 | " 'despacito',\n",
239 | " 'all the stars']"
240 | ]
241 | },
242 | "execution_count": 7,
243 | "metadata": {},
244 | "output_type": "execute_result"
245 | }
246 | ],
247 | "source": [
248 | "distributed_song_log.map(lambda x: x.lower()).collect()"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": []
257 | }
258 | ],
259 | "metadata": {
260 | "kernelspec": {
261 | "display_name": "Python 3",
262 | "language": "python",
263 | "name": "python3"
264 | },
265 | "language_info": {
266 | "codemirror_mode": {
267 | "name": "ipython",
268 | "version": 3
269 | },
270 | "file_extension": ".py",
271 | "mimetype": "text/x-python",
272 | "name": "python",
273 | "nbconvert_exporter": "python",
274 | "pygments_lexer": "ipython3",
275 | "version": "3.6.3"
276 | }
277 | },
278 | "nbformat": 4,
279 | "nbformat_minor": 2
280 | }
281 |
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/5_dataframe_quiz.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Wrangling with DataFrames Coding Quiz\n",
8 | "\n",
9 | "Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 7,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "data": {
19 | "text/plain": [
20 | "DataFrame[summary: string, artist: string, auth: string, firstName: string, gender: string, itemInSession: string, lastName: string, length: string, level: string, location: string, method: string, page: string, registration: string, sessionId: string, song: string, status: string, ts: string, userAgent: string, userId: string]"
21 | ]
22 | },
23 | "execution_count": 7,
24 | "metadata": {},
25 | "output_type": "execute_result"
26 | }
27 | ],
28 | "source": [
29 | "from pyspark.sql import SparkSession\n",
30 | "import numpy as np\n",
31 | "import pandas as pd\n",
32 | "\n",
33 | "spark = SparkSession.builder.appName(\"Data Wrangling\").getOrCreate()\n",
34 | "\n",
35 | "user_log = spark.read.json(\"data/sparkify_log_small.json\")\n",
36 | "\n",
37 | "user_log.describe()"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "# Question 1\n",
45 | "\n",
46 | "Which page did user id \"\" (empty string) NOT visit?"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 14,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "name": "stdout",
56 | "output_type": "stream",
57 | "text": [
58 | "The user empty did not visit the following pages: \n",
59 | "NextSong\n",
60 | "Downgrade\n",
61 | "Logout\n",
62 | "Settings\n",
63 | "Save Settings\n",
64 | "Error\n",
65 | "Submit Downgrade\n",
66 | "Upgrade\n",
67 | "Submit Upgrade\n"
68 | ]
69 | }
70 | ],
71 | "source": [
72 | "pages = user_log.select(\"page\").dropDuplicates().collect()\n",
73 | "\n",
74 | "empty_user_pages = user_log.select(\"page\").where(user_log.userId == \"\").dropDuplicates().collect()\n",
75 | "\n",
76 | "empty_user_not_visited_pages = list(set(pages) - set(empty_user_pages))\n",
77 | "\n",
78 | "print('The user empty did not visit the following pages: ')\n",
79 | "for row in empty_user_not_visited_pages:\n",
80 | " print(row['page'])"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "# Question 2 - Reflect\n",
88 | "\n",
89 | "What type of user does the empty string user id most likely refer to?\n"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 15,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "[Row(page='Home'), Row(page='About'), Row(page='Login'), Row(page='Help')]"
101 | ]
102 | },
103 | "execution_count": 15,
104 | "metadata": {},
105 | "output_type": "execute_result"
106 | }
107 | ],
108 | "source": [
109 | "empty_user_pages = user_log.select(\"page\").where(user_log.userId == \"\").dropDuplicates().collect()\n",
110 | "\n",
111 | "empty_user_pages"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "Looking at the pages above that the empty-string user id visited, we can conclude that **these pages do not require the user to be registered or logged in.**"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "# Question 3\n",
126 | "\n",
127 | "How many female users do we have in the data set?"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 18,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "name": "stdout",
137 | "output_type": "stream",
138 | "text": [
139 | "+-------+------+\n",
140 | "|summary|gender|\n",
141 | "+-------+------+\n",
142 | "| count| 462|\n",
143 | "| mean| null|\n",
144 | "| stddev| null|\n",
145 | "| min| F|\n",
146 | "| max| F|\n",
147 | "+-------+------+\n",
148 | "\n"
149 | ]
150 | }
151 | ],
152 | "source": [
153 | "user_log.select(['userId', 'gender']).where(user_log.gender == 'F').dropDuplicates().describe('gender').show()"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "# Question 4\n",
161 | "\n",
162 | "How many songs were played from the most played artist?"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 27,
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "name": "stdout",
172 | "output_type": "stream",
173 | "text": [
174 | "Most played artist: \n",
175 | "Coldplay\n",
176 | "+-------+--------------------+\n",
177 | "|summary| song|\n",
178 | "+-------+--------------------+\n",
179 | "| count| 83|\n",
180 | "| mean| null|\n",
181 | "| stddev| null|\n",
182 | "| min|A Rush Of Blood T...|\n",
183 | "| max| Yes|\n",
184 | "+-------+--------------------+\n",
185 | "\n"
186 | ]
187 | }
188 | ],
189 | "source": [
190 | "top_artists = user_log.where(user_log.page == \"NextSong\").groupBy('artist').count().sort('count', ascending=False)\n",
191 | "top_artists = top_artists.collect()\n",
192 | "\n",
193 | "most_played_artist = top_artists[0]['artist']\n",
194 | "print('Most played artist: ')\n",
195 | "print(most_played_artist)\n",
196 | "\n",
197 | "song_count = user_log.select('song').where(user_log.page == \"NextSong\") \\\n",
198 | " .where(user_log.artist == most_played_artist)\n",
199 | "\n",
200 | "song_count.describe('song').show()"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "# Question 5 (challenge)\n",
208 | "\n",
209 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n",
210 | "\n"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 51,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "name": "stdout",
220 | "output_type": "stream",
221 | "text": [
222 | "user_log_valid count\n",
223 | "9473\n",
224 | "+-----------------+\n",
225 | "|avg(count(phase))|\n",
226 | "+-----------------+\n",
227 | "|6.898347107438017|\n",
228 | "+-----------------+\n",
229 | "\n"
230 | ]
231 | }
232 | ],
233 | "source": [
234 | "from pyspark.sql.types import IntegerType\n",
235 | "from pyspark.sql.functions import udf\n",
236 | "from pyspark.sql import Window\n",
237 | "from pyspark.sql.functions import sum as Fsum\n",
238 | "from pyspark.sql.functions import desc\n",
239 | "from pyspark.sql.functions import col\n",
240 | "\n",
241 | "# create a numerical flag for whether the user is on the home page\n",
242 | "flag_homepage_visit = udf(lambda x: 1 if x == \"Home\" else 0, IntegerType())\n",
243 | "\n",
244 | "# after flagging all home visits, we use that new column to create a window between home visits\n",
245 | "windowval = Window.partitionBy(\"userId\").orderBy(desc(\"ts\")).rangeBetween(Window.unboundedPreceding, 0)\n",
246 | "\n",
247 | "# add a new column called phase with the cumulative sum of home visits over that window\n",
248 | "user_log_valid = user_log.filter((user_log.page == 'NextSong') | (user_log.page == 'Home')) \\\n",
249 | " .select('userId', 'page', 'ts') \\\n",
250 | " .withColumn(\"visited_home\", flag_homepage_visit(col(\"page\"))) \\\n",
251 | " .withColumn(\"phase\", Fsum(\"visited_home\").over(windowval))\n",
252 | "\n",
253 | "# with the new column, keep only NextSong events, count songs per userId and phase, then average those counts\n",
254 | "user_log_valid.where(user_log_valid.page == 'NextSong') \\\n",
255 | " .groupBy('userId', 'phase') \\\n",
256 | " .agg({'phase':'count'}) \\\n",
257 | " .agg({'count(phase)':'avg'}).show()\n"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": []
266 | }
267 | ],
268 | "metadata": {
269 | "kernelspec": {
270 | "display_name": "Python 3",
271 | "language": "python",
272 | "name": "python3"
273 | },
274 | "language_info": {
275 | "codemirror_mode": {
276 | "name": "ipython",
277 | "version": 3
278 | },
279 | "file_extension": ".py",
280 | "mimetype": "text/x-python",
281 | "name": "python",
282 | "nbconvert_exporter": "python",
283 | "pygments_lexer": "ipython3",
284 | "version": "3.7.3"
285 | }
286 | },
287 | "nbformat": 4,
288 | "nbformat_minor": 2
289 | }
290 |
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/8_spark_sql_quiz.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Wrangling with Spark SQL Quiz\n",
8 | "\n",
9 | "This quiz uses the same dataset and most of the same questions from the earlier \"Quiz - Data Wrangling with Data Frames Jupyter Notebook.\" For this quiz, however, use Spark SQL instead of Spark Data Frames."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 17,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "from pyspark.sql import SparkSession\n",
19 | "from pyspark.sql.functions import udf\n",
20 | "from pyspark.sql.types import StringType\n",
21 | "from pyspark.sql.types import IntegerType\n",
22 | "from pyspark.sql.functions import desc\n",
23 | "from pyspark.sql.functions import asc\n",
24 | "from pyspark.sql.functions import sum as Fsum\n",
25 | "\n",
26 | "import datetime\n",
27 | "\n",
28 | "import numpy as np\n",
29 | "import pandas as pd\n",
30 | "\n",
31 | "spark = SparkSession.builder.appName(\"Data wrangling with Spark SQL\").getOrCreate()\n",
32 | "\n",
33 | "user_log = spark.read.json(\"data/sparkify_log_small.json\")\n",
34 | "\n",
35 | "user_log.createOrReplaceTempView(\"user_log\")\n"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "# Question 1\n",
43 | "\n",
44 | "Which page did user id \"\"(empty string) NOT visit?"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 5,
50 | "metadata": {},
51 | "outputs": [
52 | {
53 | "name": "stdout",
54 | "output_type": "stream",
55 | "text": [
56 | "+----------------+\n",
57 | "| page|\n",
58 | "+----------------+\n",
59 | "|Submit Downgrade|\n",
60 | "| Downgrade|\n",
61 | "| Logout|\n",
62 | "| Save Settings|\n",
63 | "| Settings|\n",
64 | "| NextSong|\n",
65 | "| Upgrade|\n",
66 | "| Error|\n",
67 | "| Submit Upgrade|\n",
68 | "+----------------+\n",
69 | "\n"
70 | ]
71 | }
72 | ],
73 | "source": [
74 | "spark.sql(\"\"\"\n",
75 | " SELECT DISTINCT page \n",
76 | " FROM user_log \n",
77 | " WHERE page NOT IN (\n",
78 | " SELECT DISTINCT page \n",
79 | " FROM user_log \n",
80 | " WHERE userId = ''\n",
81 | " )\n",
82 | "\"\"\").show()"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "# Question 2 - Reflect\n",
90 | "\n",
91 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?\n",
92 | "\n",
93 | "## Response\n",
94 | "\n",
95 | "We might prefer SQL over data frames because:\n",
96 | " - It is straightforward: you simply declare in plain SQL which data you need.\n",
97 | " - SQL is a common language for working with databases, so the learning curve is usually smaller.\n",
98 | " \n",
99 | "We might prefer data frames over SQL because:\n",
100 | " - They make it easier to use more advanced features, such as adding new columns programmatically or windowing results."
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "# Question 3\n",
108 | "\n",
109 | "How many female users do we have in the data set?"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 6,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "name": "stdout",
119 | "output_type": "stream",
120 | "text": [
121 | "+------+-----+\n",
122 | "|gender|count|\n",
123 | "+------+-----+\n",
124 | "| F| 462|\n",
125 | "+------+-----+\n",
126 | "\n"
127 | ]
128 | }
129 | ],
130 | "source": [
131 | "spark.sql(\"\"\"\n",
132 | " SELECT gender, COUNT(DISTINCT userId) AS count\n",
133 | " FROM user_log \n",
134 | " WHERE gender = 'F'\n",
135 | " GROUP BY gender\n",
136 | "\"\"\").show()"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "# Question 4\n",
144 | "\n",
145 | "How many songs were played from the most played artist?"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": 11,
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "name": "stdout",
155 | "output_type": "stream",
156 | "text": [
157 | "+--------+-----------+\n",
158 | "| artist|plays_count|\n",
159 | "+--------+-----------+\n",
160 | "|Coldplay| 83|\n",
161 | "+--------+-----------+\n",
162 | "\n"
163 | ]
164 | }
165 | ],
166 | "source": [
167 | "spark.sql(\"\"\"\n",
168 | " SELECT artist, COUNT(ts) AS plays_count\n",
169 | " FROM user_log\n",
170 | " WHERE page = 'NextSong'\n",
171 | " GROUP BY artist\n",
172 | " ORDER BY plays_count DESC\n",
173 | " LIMIT 1\n",
174 | "\"\"\").show()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "# Question 5 (challenge)\n",
182 | "\n",
183 | "How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 18,
189 | "metadata": {},
190 | "outputs": [
191 | {
192 | "name": "stdout",
193 | "output_type": "stream",
194 | "text": [
195 | "+------------------+\n",
196 | "|avg(count_results)|\n",
197 | "+------------------+\n",
198 | "| 6.898347107438017|\n",
199 | "+------------------+\n",
200 | "\n"
201 | ]
202 | }
203 | ],
204 | "source": [
205 | "is_home = spark.sql(\"\"\"\n",
206 | " SELECT \n",
207 | " userID, \n",
208 | " page, \n",
209 | " ts, \n",
210 | " (CASE WHEN page = 'Home' THEN 1 ELSE 0 END) AS is_home \n",
211 | " FROM \n",
212 | " user_log\n",
213 | " WHERE \n",
214 | " (page = 'NextSong') or (page = 'Home')\n",
215 | "\"\"\")\n",
216 | "\n",
217 | "is_home.createOrReplaceTempView(\"is_home_table\")\n",
218 | "\n",
219 | "cumulative_sum = spark.sql(\"\"\"\n",
220 | " SELECT \n",
221 | " *, \n",
222 | " SUM(is_home) OVER (\n",
223 | " PARTITION BY userID \n",
224 | " ORDER BY ts DESC \n",
225 | " ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW\n",
226 | " ) AS period\n",
227 | " FROM is_home_table\n",
228 | "\"\"\")\n",
229 | "\n",
230 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n",
231 | "\n",
232 | "spark.sql(\"\"\"\n",
233 | " SELECT \n",
234 | " AVG(count_results) \n",
235 | " FROM (\n",
236 | " SELECT \n",
237 | " COUNT(*) AS count_results \n",
238 | " FROM \n",
239 | " period_table \n",
240 | " GROUP BY \n",
241 | " userID, period, page \n",
242 | " HAVING \n",
243 | " page = 'NextSong'\n",
244 | " ) AS counts\n",
245 | "\"\"\").show()\n"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": []
254 | }
255 | ],
256 | "metadata": {
257 | "kernelspec": {
258 | "display_name": "Python 3",
259 | "language": "python",
260 | "name": "python3"
261 | },
262 | "language_info": {
263 | "codemirror_mode": {
264 | "name": "ipython",
265 | "version": 3
266 | },
267 | "file_extension": ".py",
268 | "mimetype": "text/x-python",
269 | "name": "python",
270 | "nbconvert_exporter": "python",
271 | "pygments_lexer": "ipython3",
272 | "version": "3.7.3"
273 | }
274 | },
275 | "nbformat": 4,
276 | "nbformat_minor": 2
277 | }
278 |
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/L4_Project/.gitignore:
--------------------------------------------------------------------------------
1 | dl.cfg
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/L4_Project/README.md:
--------------------------------------------------------------------------------
1 | # Sparkify's Data Lake ELT process
2 |
3 | ## Summary
4 |
5 | - [Introduction](#introduction)
6 | - [Getting started](#getting-started)
7 | - [Data sources](#data-sources)
8 | - [Parquet data schema](#parquet-data-schema)
9 |
10 | ## Introduction
11 |
12 | This project creates analytical _parquet_ tables on Amazon S3, using AWS Elastic MapReduce (EMR) and Spark to extract,
13 | load and transform song data and event logs produced by the Sparkify app.
14 |
15 | ## Getting started
16 |
17 | This ELT process is fairly simple. If it's your first time running this project, make a copy of the `dl.cfg.example` file, fill in your AWS credentials and save it as `dl.cfg`.
18 | 
19 | Then, on your Spark master machine, simply run: `python etl.py`
20 |
21 | ## Data sources
22 |
23 | We read from two main data sources:
24 |
25 | - `s3a://udacity-dend/song_data/*/*/*` - JSON files containing meta information about song/artists data
26 | - `s3a://udacity-dend/log_data/*/*` - JSON files containing log events from the Sparkify app
27 |
28 | ## Parquet data schema
29 |
30 | After reading from these two data sources, we will transform it to the schema described below:
31 |
32 | #### Song Plays table
33 |
34 | - *Location:* `s3a://social-wiki-datalake/songplays.parquet`
35 | - *Type:* Fact table
36 |
37 | | Column | Type | Description |
38 | | ------ | ---- | ----------- |
39 | | `songplay_id` | `INTEGER` | The main identification of the table |
40 | | `start_time` | `TIMESTAMP` | The timestamp that this song play log happened |
41 | | `user_id` | `INTEGER` | The user id that triggered this song play log. It cannot be null, as we don't have song play logs without being triggered by a user. |
42 | | `level` | `STRING` | The level of the user that triggered this song play log |
43 | | `song_id` | `STRING` | The identification of the song that was played. It can be null. |
44 | | `artist_id` | `STRING` | The identification of the artist of the song that was played. |
45 | | `session_id` | `INTEGER` | The session_id of the user on the app |
46 | | `location` | `STRING` | The location where this song play log was triggered |
47 | | `user_agent` | `STRING` | The user agent of our app |
48 |
49 | #### Users table
50 |
51 | - *Location:* `s3a://social-wiki-datalake/users.parquet`
52 | - *Type:* Dimension table
53 |
54 | | Column | Type | Description |
55 | | ------ | ---- | ----------- |
56 | | `user_id` | `INTEGER` | The main identification of an user |
57 | | `first_name` | `STRING` | First name of the user; it cannot be null, as it is part of the basic information we have about the user |
58 | | `last_name` | `STRING` | Last name of the user. |
59 | | `gender` | `STRING` | The gender is stored as a single character, `M` (male) or `F` (female); otherwise it is `NULL` |
60 | | `level` | `STRING` | The level indicates the user's app plan (`premium` or `free`) |
61 |
62 |
63 | #### Songs table
64 |
65 | - *Location:* `s3a://social-wiki-datalake/songs.parquet`
66 | - *Type:* Dimension table
67 |
68 | | Column | Type | Description |
69 | | ------ | ---- | ----------- |
70 | | `song_id` | `STRING` | The main identification of a song |
71 | | `title` | `STRING` | The title of the song. It cannot be null, as it is the basic information we have about a song. |
72 | | `artist_id` | `STRING` | The artist id; it cannot be null, as we don't have songs without an artist. This field also references the artists table. |
73 | | `year` | `INTEGER` | The year that this song was made |
74 | | `duration` | `DOUBLE` | The duration of the song |
75 |
76 |
77 | #### Artists table
78 |
79 | - *Location:* `s3a://social-wiki-datalake/artists.parquet`
80 | - *Type:* Dimension table
81 |
82 | | Column | Type | Description |
83 | | ------ | ---- | ----------- |
84 | | `artist_id` | `STRING` | The main identification of an artist |
85 | | `name` | `STRING` | The name of the artist |
86 | | `location` | `STRING` | The location where the artist is from |
87 | | `latitude` | `DOUBLE` | The latitude of the location that the artist is from |
88 | | `longitude` | `DOUBLE` | The longitude of the location that the artist is from |
89 |
90 | #### Time table
91 |
92 | - *Location:* `s3a://social-wiki-datalake/time.parquet`
93 | - *Type:* Dimension table
94 |
95 | | Column | Type | Description |
96 | | ------ | ---- | ----------- |
97 | | `start_time` | `TIMESTAMP` | The timestamp itself, serves as the main identification of this table |
98 | | `hour` | `INTEGER` | The hour from the timestamp |
99 | | `day` | `INTEGER` | The day of the month from the timestamp |
100 | | `week` | `INTEGER` | The week of the year from the timestamp |
101 | | `month` | `INTEGER` | The month of the year from the timestamp |
102 | | `year` | `INTEGER` | The year from the timestamp |
103 | | `weekday` | `STRING` | The week day from the timestamp (Monday to Sunday) |
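104 | 
105 | #### Reading the tables back (example)
106 | 
107 | As a quick sanity check after the ETL finishes, you can read any of the parquet tables back with Spark. The snippet below is only an illustrative sketch (it is not part of `etl.py`); the app name and the aggregation are arbitrary choices, and it assumes the same S3 access configuration used by the ETL job.
108 | 
109 | ```python
110 | from pyspark.sql import SparkSession
111 | 
112 | spark = SparkSession.builder.appName("sparkify-analysis").getOrCreate()
113 | 
114 | # load the fact table written by etl.py and count song plays per subscription level
115 | songplays = spark.read.parquet("s3a://social-wiki-datalake/songplays.parquet")
116 | songplays.groupBy("level").count().show()
117 | ```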
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/L4_Project/dl.cfg.example:
--------------------------------------------------------------------------------
1 | [AWS]
2 | AWS_ACCESS_KEY_ID=''
3 | AWS_SECRET_ACCESS_KEY=''
--------------------------------------------------------------------------------
/3-data-lakes-with-spark/L4_Project/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | from datetime import datetime
3 | import os
4 | from pyspark.sql import SparkSession
5 | from pyspark.sql import functions as F
6 | from pyspark.sql import types as T
7 | from pyspark.sql.functions import udf, col
8 | from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
9 |
10 |
11 | config = configparser.ConfigParser()
12 | config.read('dl.cfg')
13 |
14 | os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']
15 | os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']
16 |
17 |
18 | def create_spark_session():
19 | """
20 | Creates the Spark Session
21 | :return spark:
22 | """
23 | spark = SparkSession \
24 | .builder \
25 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
26 | .getOrCreate()
27 | return spark
28 |
29 |
30 | def process_song_data(spark, input_data, output_data):
31 | """
32 |     Process all the song data from the JSON files under input_data
33 | :param spark:
34 | :param input_data:
35 | :param output_data:
36 | :return:
37 | """
38 | # get filepath to song data file
39 | song_data = input_data + "song_data/*/*/*"
40 |
41 | # read song data file
42 | df = spark.read.json(song_data)
43 |
44 | # extract columns to create songs table
45 | songs_table = (
46 | df.select(
47 | 'song_id', 'title', 'artist_id',
48 | 'year', 'duration'
49 | ).distinct()
50 | )
51 |
52 |     # write songs table to parquet files partitioned by year and artist
53 |     songs_table.write.partitionBy("year", "artist_id").parquet(output_data + "songs.parquet", mode="overwrite")
54 |
55 | # extract columns to create artists table
56 | artists_table = (
57 | df.select(
58 | 'artist_id',
59 | col('artist_name').alias('name'),
60 | col('artist_location').alias('location'),
61 | col('artist_latitude').alias('latitude'),
62 | col('artist_longitude').alias('longitude'),
63 | ).distinct()
64 | )
65 |
66 | # write artists table to parquet files
67 | artists_table.write.parquet(output_data + "artists.parquet", mode="overwrite")
68 |
69 |
70 | def process_log_data(spark, input_data, output_data):
71 | """
72 | Process all event logs of the Sparkify app usage, specifically the 'NextSong' event.
73 | :param spark:
74 | :param input_data:
75 | :param output_data:
76 | :return:
77 | """
78 | # get filepath to log data file
79 | log_data = input_data + "log_data/*/*"
80 |
81 | # read log data file
82 | df = spark.read.json(log_data)
83 |
84 | # filter by actions for song plays
85 | df = df.where(df.page == 'NextSong')
86 |
87 | # extract columns for users table
88 | users_table = (
89 | df.select(
90 | col('userId').alias('user_id'),
91 | col('firstName').alias('first_name'),
92 | col('lastName').alias('last_name'),
93 | col('gender').alias('gender'),
94 | col('level').alias('level')
95 | ).distinct()
96 | )
97 |
98 | # write users table to parquet files
99 | users_table.write.parquet(output_data + "users.parquet", mode="overwrite")
100 |
101 | # create timestamp column from original timestamp column
102 | df = df.withColumn(
103 | "ts_timestamp",
104 | F.to_timestamp(F.from_unixtime((col("ts") / 1000) , 'yyyy-MM-dd HH:mm:ss.SSS')).cast("Timestamp")
105 | )
106 |
107 | def get_weekday(date):
108 | import datetime
109 | import calendar
110 |         date = date.strftime("%m-%d-%Y")
111 | month, day, year = (int(x) for x in date.split('-'))
112 | weekday = datetime.date(year, month, day)
113 | return calendar.day_name[weekday.weekday()]
114 |
115 | udf_week_day = udf(get_weekday, T.StringType())
116 |
117 | # extract columns to create time table
118 | time_table = (
119 | df.withColumn("hour", hour(col("ts_timestamp")))
120 | .withColumn("day", dayofmonth(col("ts_timestamp")))
121 | .withColumn("week", weekofyear(col("ts_timestamp")))
122 | .withColumn("month", month(col("ts_timestamp")))
123 | .withColumn("year", year(col("ts_timestamp")))
124 | .withColumn("weekday", udf_week_day(col("ts_timestamp")))
125 | .select(
126 | col("ts_timestamp").alias("start_time"),
127 | col("hour"),
128 | col("day"),
129 | col("week"),
130 | col("month"),
131 | col("year"),
132 | col("weekday")
133 | )
134 | )
135 |
136 |     # write time table to parquet files partitioned by year and month
137 |     time_table.write.partitionBy("year", "month").parquet(output_data + "time.parquet", mode="overwrite")
138 |
139 | # read in song data to use for songplays table
140 | song_df = spark.read.parquet(output_data + "songs.parquet")
141 |
142 | # extract columns from joined song and log datasets to create songplays table
143 | songplays_table = (
144 | df.withColumn("songplay_id", F.monotonically_increasing_id())
145 | .join(song_df, song_df.title == df.song)
146 | .select(
147 | "songplay_id",
148 | col("ts_timestamp").alias("start_time"),
149 | col("userId").alias("user_id"),
150 | "level",
151 | "song_id",
152 | "artist_id",
153 | col("sessionId").alias("session_id"),
154 | "location",
155 | col("userAgent").alias("user_agent")
156 | )
157 | )
158 |
159 |     # write songplays table to parquet files partitioned by year and month
160 |     songplays_table.withColumn("year", year(col("start_time"))).withColumn("month", month(col("start_time"))).write.partitionBy("year", "month").parquet(output_data + "songplays.parquet", mode="overwrite")
161 |
162 |
163 | def main():
164 | """
165 |     Defines the input and output S3 locations and runs the processing functions above.
166 | :param:
167 | :return:
168 | """
169 | spark = create_spark_session()
170 | input_data = "s3a://udacity-dend/"
171 | output_data = "s3a://social-wiki-datalake/"
172 |
173 | process_song_data(spark, input_data, output_data)
174 | process_log_data(spark, input_data, output_data)
175 |
176 |
177 | if __name__ == "__main__":
178 | main()
179 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise1.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Define a function that uses the python logger to log a message. Then finish filling in the details of the DAG below. Once you’ve done that, run the "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then Run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.operators.python_operator import PythonOperator
9 |
10 |
11 | #
12 | # TODO: Define a function for the PythonOperator to call and have it log something
13 | #
14 | def greet_world():
15 | logging.info("Hello world!")
16 |
17 |
18 | dag = DAG(
19 | 'lesson1.exercise1',
20 | start_date=datetime.datetime.now())
21 |
22 | #
23 | # TODO: Uncomment the operator below and replace the arguments labeled below
24 | #
25 |
26 | greet_task = PythonOperator(
27 | task_id="greet_world",
28 | python_callable=greet_world,
29 | dag=dag
30 | )
31 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise2.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Complete the TODOs in this DAG so that it runs once a day. Once you’ve done that, open the Airflow UI using the "Access Airflow" button, turn the last exercise off, then turn this exercise on. Wait a moment and refresh the UI to see Airflow automatically run your DAG.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.operators.python_operator import PythonOperator
9 |
10 |
11 | def hello_world():
12 | logging.info("Hello World")
13 |
14 | dag = DAG(
15 | "lesson1.exercise2",
16 | start_date=datetime.datetime.now() - datetime.timedelta(days=2),
17 | schedule_interval="@daily")
18 |
19 | task = PythonOperator(
20 | task_id="hello_world_task",
21 | python_callable=hello_world,
22 | dag=dag)
23 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise3.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 |
4 | from airflow import DAG
5 | from airflow.operators.python_operator import PythonOperator
6 |
7 |
8 | def hello_world():
9 | logging.info("Hello World")
10 |
11 |
12 | def addition():
13 | logging.info(f"2 + 2 = {2+2}")
14 |
15 |
16 | def subtraction():
17 | logging.info(f"6 -2 = {6-2}")
18 |
19 |
20 | def division():
21 | logging.info(f"10 / 2 = {int(10/2)}")
22 |
23 |
24 | dag = DAG(
25 | "lesson1.exercise3",
26 | schedule_interval='@hourly',
27 | start_date=datetime.datetime.now() - datetime.timedelta(days=1))
28 |
29 | hello_world_task = PythonOperator(
30 | task_id="hello_world",
31 | python_callable=hello_world,
32 | dag=dag)
33 |
34 | addition_task = PythonOperator(
35 | task_id="addition",
36 | python_callable=addition,
37 | dag=dag)
38 |
39 | subtraction_task = PythonOperator(
40 | task_id="subtraction",
41 | python_callable=subtraction,
42 | dag=dag)
43 |
44 | division_task = PythonOperator(
45 | task_id="division",
46 | python_callable=division,
47 | dag=dag)
48 |
49 | # -> addition_task
50 | # / \
51 | # hello_world_task -> division_task
52 | # \ /
53 | # ->subtraction_task
54 |
55 | hello_world_task >> addition_task
56 | hello_world_task >> subtraction_task
57 | addition_task >> division_task
58 | subtraction_task >> division_task
59 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise4.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 |
4 | from airflow import DAG
5 | from airflow.models import Variable
6 | from airflow.operators.python_operator import PythonOperator
7 | from airflow.hooks.S3_hook import S3Hook
8 |
9 |
10 | def list_keys():
11 | hook = S3Hook(aws_conn_id='aws_credentials')
12 | bucket = Variable.get('s3_bucket')
13 | prefix = Variable.get('s3_prefix')
14 | logging.info(f"Listing Keys from {bucket}/{prefix}")
15 | keys = hook.list_keys(bucket, prefix=prefix)
16 | for key in keys:
17 | logging.info(f"- s3://{bucket}/{key}")
18 |
19 |
20 | dag = DAG(
21 | 'lesson1.exercise4',
22 | start_date=datetime.datetime.now())
23 |
24 | list_task = PythonOperator(
25 | task_id="list_keys",
26 | python_callable=list_keys,
27 | dag=dag
28 | )
29 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise5.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Use the Airflow context in the pythonoperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.models import Variable
9 | from airflow.operators.python_operator import PythonOperator
10 | from airflow.hooks.S3_hook import S3Hook
11 |
12 |
13 | def log_details(*args, **kwargs):
14 | #
15 | # NOTE: Look here for context variables passed in on kwargs:
16 | # https://airflow.apache.org/macros.html
17 | #
18 | ds = kwargs['ds']
19 | run_id = kwargs['run_id']
20 | previous_ds = kwargs.get('previous_ds')
21 | next_ds = kwargs.get('next_ds')
22 |
23 | logging.info(f"Execution date is {ds}")
24 | logging.info(f"My run id is {run_id}")
25 | if previous_ds:
26 | logging.info(f"My previous run was on {previous_ds}")
27 | if next_ds:
28 | logging.info(f"My next run will be {next_ds}")
29 |
30 | dag = DAG(
31 | 'lesson1.exercise5',
32 | schedule_interval="@daily",
33 | start_date=datetime.datetime.now() - datetime.timedelta(days=2)
34 | )
35 |
36 | list_task = PythonOperator(
37 | task_id="log_details",
38 | python_callable=log_details,
39 | provide_context=True,
40 | dag=dag
41 | )
42 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/exercise6.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Similar to what you saw in the demo, copy and populate the trips table. Then, add another operator which creates a traffic analysis table from the trips table you created. Note, in this class, we won’t be writing SQL -- all of the SQL statements we run against Redshift are predefined and included in your lesson.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.contrib.hooks.aws_hook import AwsHook
9 | from airflow.hooks.postgres_hook import PostgresHook
10 | from airflow.operators.postgres_operator import PostgresOperator
11 | from airflow.operators.python_operator import PythonOperator
12 |
13 | import sql_statements
14 |
15 |
16 | def load_data_to_redshift(*args, **kwargs):
17 | aws_hook = AwsHook("aws_credentials")
18 | credentials = aws_hook.get_credentials()
19 | redshift_hook = PostgresHook("redshift")
20 | redshift_hook.run(sql_statements.COPY_ALL_TRIPS_SQL.format(credentials.access_key, credentials.secret_key))
21 |
22 |
23 | dag = DAG(
24 | 'lesson1.exercise6',
25 | start_date=datetime.datetime.now()
26 | )
27 |
28 | create_table = PostgresOperator(
29 | task_id="create_table",
30 | dag=dag,
31 | postgres_conn_id="redshift",
32 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
33 | )
34 |
35 | copy_task = PythonOperator(
36 | task_id='load_from_s3_to_redshift',
37 | dag=dag,
38 | python_callable=load_data_to_redshift
39 | )
40 |
41 | location_traffic_task = PostgresOperator(
42 | task_id="calculate_location_traffic",
43 | dag=dag,
44 | postgres_conn_id="redshift",
45 | sql=sql_statements.LOCATION_TRAFFIC_SQL
46 | )
47 |
48 | create_table >> copy_task
49 | copy_task >> location_traffic_task
50 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L1_exercises/sql_statements.py:
--------------------------------------------------------------------------------
1 | CREATE_TRIPS_TABLE_SQL = """
2 | CREATE TABLE IF NOT EXISTS trips (
3 | trip_id INTEGER NOT NULL,
4 | start_time TIMESTAMP NOT NULL,
5 | end_time TIMESTAMP NOT NULL,
6 | bikeid INTEGER NOT NULL,
7 | tripduration DECIMAL(16,2) NOT NULL,
8 | from_station_id INTEGER NOT NULL,
9 | from_station_name VARCHAR(100) NOT NULL,
10 | to_station_id INTEGER NOT NULL,
11 | to_station_name VARCHAR(100) NOT NULL,
12 | usertype VARCHAR(20),
13 | gender VARCHAR(6),
14 | birthyear INTEGER,
15 | PRIMARY KEY(trip_id))
16 | DISTSTYLE ALL;
17 | """
18 |
19 | CREATE_STATIONS_TABLE_SQL = """
20 | CREATE TABLE IF NOT EXISTS stations (
21 | id INTEGER NOT NULL,
22 | name VARCHAR(250) NOT NULL,
23 | city VARCHAR(100) NOT NULL,
24 | latitude DECIMAL(9, 6) NOT NULL,
25 | longitude DECIMAL(9, 6) NOT NULL,
26 | dpcapacity INTEGER NOT NULL,
27 | online_date TIMESTAMP NOT NULL,
28 | PRIMARY KEY(id))
29 | DISTSTYLE ALL;
30 | """
31 |
32 | COPY_SQL = """
33 | COPY {}
34 | FROM '{}'
35 | ACCESS_KEY_ID '{{}}'
36 | SECRET_ACCESS_KEY '{{}}'
37 | IGNOREHEADER 1
38 | DELIMITER ','
39 | """
40 |
41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
42 | "trips",
43 | "s3://udacity-dend/data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
44 | )
45 |
46 | COPY_ALL_TRIPS_SQL = COPY_SQL.format(
47 | "trips",
48 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
49 | )
50 |
51 | COPY_STATIONS_SQL = COPY_SQL.format(
52 | "stations",
53 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
54 | )
55 |
56 | LOCATION_TRAFFIC_SQL = """
57 | BEGIN;
58 | DROP TABLE IF EXISTS station_traffic;
59 | CREATE TABLE station_traffic AS
60 | SELECT
61 | DISTINCT(t.from_station_id) AS station_id,
62 | t.from_station_name AS station_name,
63 | num_departures,
64 | num_arrivals
65 | FROM trips t
66 | JOIN (
67 | SELECT
68 | from_station_id,
69 | COUNT(from_station_id) AS num_departures
70 | FROM trips
71 | GROUP BY from_station_id
72 | ) AS fs ON t.from_station_id = fs.from_station_id
73 | JOIN (
74 | SELECT
75 | to_station_id,
76 | COUNT(to_station_id) AS num_arrivals
77 | FROM trips
78 | GROUP BY to_station_id
79 | ) AS ts ON t.from_station_id = ts.to_station_id
80 | """
81 |
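82 | 
83 | # Note (illustrative addition, not used by the exercise DAGs): COPY_SQL is formatted
84 | # in two stages. The doubled braces '{{}}' survive the first .format() call as
85 | # literal '{}' placeholders, so the table name and S3 path are filled in above,
86 | # while the AWS credentials (and, for the monthly copy, the year and month) are
87 | # filled in later inside the DAG task. A minimal sketch of that second stage,
88 | # using obviously fake credential placeholders:
89 | if __name__ == "__main__":
90 |     print(COPY_MONTHLY_TRIPS_SQL.format("<ACCESS_KEY>", "<SECRET_KEY>", year=2018, month=1))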
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L2_exercises/exercise1.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Run the DAG as it is first, and observe the Airflow UI
3 | #2 - Next, open up the DAG and add the copy and load tasks as directed in the TODOs
4 | #3 - Reload the Airflow UI and run the DAG once more, observing the Airflow UI
5 |
6 | import datetime
7 | import logging
8 |
9 | from airflow import DAG
10 | from airflow.contrib.hooks.aws_hook import AwsHook
11 | from airflow.hooks.postgres_hook import PostgresHook
12 | from airflow.operators.postgres_operator import PostgresOperator
13 | from airflow.operators.python_operator import PythonOperator
14 |
15 | import sql_statements
16 |
17 |
18 | def load_trip_data_to_redshift(*args, **kwargs):
19 | aws_hook = AwsHook("aws_credentials")
20 | credentials = aws_hook.get_credentials()
21 | redshift_hook = PostgresHook("redshift")
22 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
23 | credentials.access_key,
24 | credentials.secret_key,
25 | )
26 | redshift_hook.run(sql_stmt)
27 |
28 |
29 | def load_station_data_to_redshift(*args, **kwargs):
30 | aws_hook = AwsHook("aws_credentials")
31 | credentials = aws_hook.get_credentials()
32 | redshift_hook = PostgresHook("redshift")
33 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
34 | credentials.access_key,
35 | credentials.secret_key,
36 | )
37 | redshift_hook.run(sql_stmt)
38 |
39 |
40 | dag = DAG(
41 | 'lesson2.exercise1',
42 | start_date=datetime.datetime.now()
43 | )
44 |
45 | create_trips_table = PostgresOperator(
46 | task_id="create_trips_table",
47 | dag=dag,
48 | postgres_conn_id="redshift",
49 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
50 | )
51 |
52 | copy_trips_task = PythonOperator(
53 | task_id='load_trips_from_s3_to_redshift',
54 | dag=dag,
55 | python_callable=load_trip_data_to_redshift,
56 | )
57 |
58 | create_stations_table = PostgresOperator(
59 | task_id="create_stations_table",
60 | dag=dag,
61 | postgres_conn_id="redshift",
62 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
63 | )
64 |
65 | copy_stations_task = PythonOperator(
66 | task_id='load_stations_from_s3_to_redshift',
67 | dag=dag,
68 | python_callable=load_station_data_to_redshift,
69 | )
70 |
71 | create_trips_table >> copy_trips_task
72 | create_stations_table >> copy_stations_task
73 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L2_exercises/exercise2.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Revisit our bikeshare traffic
3 | #2 - Update our DAG with
4 | # a - @monthly schedule_interval
5 | # b - max_active_runs of 1
6 | # c - start_date of 2018/01/01
7 | # d - end_date of 2018/02/01
8 | # Use Airflow’s backfill capabilities to analyze our trip data on a monthly basis over 2 historical runs
9 |
10 | import datetime
11 | import logging
12 |
13 | from airflow import DAG
14 | from airflow.contrib.hooks.aws_hook import AwsHook
15 | from airflow.hooks.postgres_hook import PostgresHook
16 | from airflow.operators.postgres_operator import PostgresOperator
17 | from airflow.operators.python_operator import PythonOperator
18 |
19 | import sql_statements
20 |
21 |
22 | def load_trip_data_to_redshift(*args, **kwargs):
23 | aws_hook = AwsHook("aws_credentials")
24 | credentials = aws_hook.get_credentials()
25 | redshift_hook = PostgresHook("redshift")
26 | sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
27 | credentials.access_key,
28 | credentials.secret_key,
29 | )
30 | redshift_hook.run(sql_stmt)
31 |
32 |
33 | def load_station_data_to_redshift(*args, **kwargs):
34 | aws_hook = AwsHook("aws_credentials")
35 | credentials = aws_hook.get_credentials()
36 | redshift_hook = PostgresHook("redshift")
37 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
38 | credentials.access_key,
39 | credentials.secret_key,
40 | )
41 | redshift_hook.run(sql_stmt)
42 |
43 |
44 | dag = DAG(
45 | 'lesson2.exercise2',
46 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
47 | # TODO: Set the end date to February first
48 | end_date=datetime.datetime(2018, 2, 1, 0, 0, 0, 0),
49 | # TODO: Set the schedule to be monthly
50 | schedule_interval='@monthly',
51 | # TODO: set the number of max active runs to 1
52 | max_active_runs=1
53 | )
54 |
55 | create_trips_table = PostgresOperator(
56 | task_id="create_trips_table",
57 | dag=dag,
58 | postgres_conn_id="redshift",
59 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
60 | )
61 |
62 | copy_trips_task = PythonOperator(
63 | task_id='load_trips_from_s3_to_redshift',
64 | dag=dag,
65 | python_callable=load_trip_data_to_redshift,
66 | provide_context=True,
67 | )
68 |
69 | create_stations_table = PostgresOperator(
70 | task_id="create_stations_table",
71 | dag=dag,
72 | postgres_conn_id="redshift",
73 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
74 | )
75 |
76 | copy_stations_task = PythonOperator(
77 | task_id='load_stations_from_s3_to_redshift',
78 | dag=dag,
79 | python_callable=load_station_data_to_redshift,
80 | )
81 |
82 | create_trips_table >> copy_trips_task
83 | create_stations_table >> copy_stations_task
84 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L2_exercises/exercise3.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Modify the bikeshare DAG to load data month by month, instead of loading it all at once, every time.
3 | #2 - Use time partitioning to parallelize the execution of the DAG.
4 |
5 | import datetime
6 | import logging
7 |
8 | from airflow import DAG
9 | from airflow.contrib.hooks.aws_hook import AwsHook
10 | from airflow.hooks.postgres_hook import PostgresHook
11 | from airflow.operators.postgres_operator import PostgresOperator
12 | from airflow.operators.python_operator import PythonOperator
13 |
14 | import sql_statements
15 |
16 |
17 | def load_trip_data_to_redshift(*args, **kwargs):
18 | aws_hook = AwsHook("aws_credentials")
19 | credentials = aws_hook.get_credentials()
20 | redshift_hook = PostgresHook("redshift")
21 |
22 |     # "ds" is the execution date as a 'YYYY-MM-DD' string, injected by Airflow because
23 |     # this task is created with provide_context=True; it selects the monthly partition to copy.
24 |     execution_date = datetime.datetime.strptime(kwargs["ds"], '%Y-%m-%d')
25 |
26 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
27 | credentials.access_key,
28 | credentials.secret_key,
29 | year=execution_date.year,
30 | month=execution_date.month
31 | )
32 | redshift_hook.run(sql_stmt)
33 |
34 |
35 | def load_station_data_to_redshift(*args, **kwargs):
36 | aws_hook = AwsHook("aws_credentials")
37 | credentials = aws_hook.get_credentials()
38 | redshift_hook = PostgresHook("redshift")
39 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
40 | credentials.access_key,
41 | credentials.secret_key,
42 | )
43 | redshift_hook.run(sql_stmt)
44 |
45 |
46 | dag = DAG(
47 | 'lesson2.exercise3',
48 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
49 | end_date=datetime.datetime(2019, 1, 1, 0, 0, 0, 0),
50 | schedule_interval='@monthly',
51 | max_active_runs=1
52 | )
53 |
54 | create_trips_table = PostgresOperator(
55 | task_id="create_trips_table",
56 | dag=dag,
57 | postgres_conn_id="redshift",
58 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
59 | )
60 |
61 | copy_trips_task = PythonOperator(
62 | task_id='load_trips_from_s3_to_redshift',
63 | dag=dag,
64 | python_callable=load_trip_data_to_redshift,
65 | provide_context=True
66 | )
67 |
68 | create_stations_table = PostgresOperator(
69 | task_id="create_stations_table",
70 | dag=dag,
71 | postgres_conn_id="redshift",
72 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
73 | )
74 |
75 | copy_stations_task = PythonOperator(
76 | task_id='load_stations_from_s3_to_redshift',
77 | dag=dag,
78 | python_callable=load_station_data_to_redshift,
79 | )
80 |
81 | create_trips_table >> copy_trips_task
82 | create_stations_table >> copy_stations_task
83 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L2_exercises/exercise4.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #1 - Set an SLA on our bikeshare traffic calculation operator
3 | #2 - Add data verification step after the load step from s3 to redshift
4 | #3 - Add data verification step after we calculate our output table
5 |
6 | import datetime
7 | import logging
8 |
9 | from airflow import DAG
10 | from airflow.contrib.hooks.aws_hook import AwsHook
11 | from airflow.hooks.postgres_hook import PostgresHook
12 | from airflow.operators.postgres_operator import PostgresOperator
13 | from airflow.operators.python_operator import PythonOperator
14 |
15 | import sql_statements
16 |
17 |
18 | def load_trip_data_to_redshift(*args, **kwargs):
19 | aws_hook = AwsHook("aws_credentials")
20 | credentials = aws_hook.get_credentials()
21 | redshift_hook = PostgresHook("redshift")
22 | execution_date = kwargs["execution_date"]
23 | sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
24 | credentials.access_key,
25 | credentials.secret_key,
26 | year=execution_date.year,
27 | month=execution_date.month
28 | )
29 | redshift_hook.run(sql_stmt)
30 |
31 |
32 | def load_station_data_to_redshift(*args, **kwargs):
33 | aws_hook = AwsHook("aws_credentials")
34 | credentials = aws_hook.get_credentials()
35 | redshift_hook = PostgresHook("redshift")
36 | sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
37 | credentials.access_key,
38 | credentials.secret_key,
39 | )
40 | redshift_hook.run(sql_stmt)
41 |
42 |
43 | def check_greater_than_zero(*args, **kwargs):
44 | table = kwargs["params"]["table"]
45 | redshift_hook = PostgresHook("redshift")
46 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}")
47 | if len(records) < 1 or len(records[0]) < 1:
48 | raise ValueError(f"Data quality check failed. {table} returned no results")
49 | num_records = records[0][0]
50 |
51 | if num_records < 1:
52 |         raise ValueError(f"Data quality check failed. {table} contained 0 rows")
53 |
54 | logging.info(f"Data quality on table {table} check passed with {records[0][0]} records")
55 |
56 |
57 | dag = DAG(
58 | 'lesson2.exercise4',
59 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
60 | end_date=datetime.datetime(2019, 1, 1, 0, 0, 0, 0),
61 | schedule_interval='@monthly',
62 | max_active_runs=1
63 | )
64 |
65 | create_trips_table = PostgresOperator(
66 | task_id="create_trips_table",
67 | dag=dag,
68 | postgres_conn_id="redshift",
69 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
70 | )
71 |
72 | copy_trips_task = PythonOperator(
73 | task_id='load_trips_from_s3_to_redshift',
74 | dag=dag,
75 | python_callable=load_trip_data_to_redshift,
76 | provide_context=True,
77 | sla=datetime.timedelta(hours=1)
78 | )
79 |
80 | check_trips = PythonOperator(
81 | task_id='check_trips_data',
82 | dag=dag,
83 | python_callable=check_greater_than_zero,
84 | provide_context=True,
85 | params={
86 | 'table': 'trips',
87 | }
88 | )
89 |
90 | create_stations_table = PostgresOperator(
91 | task_id="create_stations_table",
92 | dag=dag,
93 | postgres_conn_id="redshift",
94 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
95 | )
96 |
97 | copy_stations_task = PythonOperator(
98 | task_id='load_stations_from_s3_to_redshift',
99 | dag=dag,
100 | python_callable=load_station_data_to_redshift,
101 | sla=datetime.timedelta(hours=1)
102 | )
103 |
104 | check_stations = PythonOperator(
105 | task_id='check_stations_data',
106 | dag=dag,
107 | python_callable=check_greater_than_zero,
108 | provide_context=True,
109 | params={
110 | 'table': 'stations',
111 | }
112 | )
113 |
114 | create_trips_table >> copy_trips_task
115 | create_stations_table >> copy_stations_task
116 | copy_stations_task >> check_stations
117 | copy_trips_task >> check_trips
118 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L2_exercises/sql_statements.py:
--------------------------------------------------------------------------------
1 | CREATE_TRIPS_TABLE_SQL = """
2 | CREATE TABLE IF NOT EXISTS trips (
3 | trip_id INTEGER NOT NULL,
4 | start_time TIMESTAMP NOT NULL,
5 | end_time TIMESTAMP NOT NULL,
6 | bikeid INTEGER NOT NULL,
7 | tripduration DECIMAL(16,2) NOT NULL,
8 | from_station_id INTEGER NOT NULL,
9 | from_station_name VARCHAR(100) NOT NULL,
10 | to_station_id INTEGER NOT NULL,
11 | to_station_name VARCHAR(100) NOT NULL,
12 | usertype VARCHAR(20),
13 | gender VARCHAR(6),
14 | birthyear INTEGER,
15 | PRIMARY KEY(trip_id))
16 | DISTSTYLE ALL;
17 | """
18 |
19 | CREATE_STATIONS_TABLE_SQL = """
20 | CREATE TABLE IF NOT EXISTS stations (
21 | id INTEGER NOT NULL,
22 | name VARCHAR(250) NOT NULL,
23 | city VARCHAR(100) NOT NULL,
24 | latitude DECIMAL(9, 6) NOT NULL,
25 | longitude DECIMAL(9, 6) NOT NULL,
26 | dpcapacity INTEGER NOT NULL,
27 | online_date TIMESTAMP NOT NULL,
28 | PRIMARY KEY(id))
29 | DISTSTYLE ALL;
30 | """
31 |
32 | COPY_SQL = """
33 | COPY {}
34 | FROM '{}'
35 | ACCESS_KEY_ID '{{}}'
36 | SECRET_ACCESS_KEY '{{}}'
37 | IGNOREHEADER 1
38 | DELIMITER ','
39 | """
40 |
41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
42 | "trips",
43 | "s3://udacity-dend/data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
44 | )
45 |
46 | COPY_ALL_TRIPS_SQL = COPY_SQL.format(
47 | "trips",
48 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
49 | )
50 |
51 | COPY_STATIONS_SQL = COPY_SQL.format(
52 | "stations",
53 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
54 | )
55 |
56 | LOCATION_TRAFFIC_SQL = """
57 | BEGIN;
58 | DROP TABLE IF EXISTS station_traffic;
59 | CREATE TABLE station_traffic AS
60 | SELECT
61 | DISTINCT(t.from_station_id) AS station_id,
62 | t.from_station_name AS station_name,
63 | num_departures,
64 | num_arrivals
65 | FROM trips t
66 | JOIN (
67 | SELECT
68 | from_station_id,
69 | COUNT(from_station_id) AS num_departures
70 | FROM trips
71 | GROUP BY from_station_id
72 | ) AS fs ON t.from_station_id = fs.from_station_id
73 | JOIN (
74 | SELECT
75 | to_station_id,
76 | COUNT(to_station_id) AS num_arrivals
77 | FROM trips
78 | GROUP BY to_station_id
79 | ) AS ts ON t.from_station_id = ts.to_station_id
80 | """
81 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/exercise1.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll consolidate repeated code into Operator Plugins
3 | #1 - Move the data quality check logic into a custom operator
4 | #2 - Replace the data quality check PythonOperators with our new custom operator
5 | #3 - Consolidate both the S3 to RedShift functions into a custom operator
6 | #4 - Replace the S3 to RedShift PythonOperators with our new custom operator
7 | #5 - Execute the DAG
8 |
9 | import datetime
10 | import logging
11 |
12 | from airflow import DAG
13 | from airflow.contrib.hooks.aws_hook import AwsHook
14 | from airflow.hooks.postgres_hook import PostgresHook
15 |
16 | from airflow.operators import (
17 | HasRowsOperator,
18 | PostgresOperator,
19 | PythonOperator,
20 | S3ToRedshiftOperator
21 | )
22 |
23 | import sql_statements
24 |
25 | dag = DAG(
26 | "lesson3.exercise1",
27 | start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
28 | end_date=datetime.datetime(2018, 12, 1, 0, 0, 0, 0),
29 | schedule_interval="@monthly",
30 | max_active_runs=1
31 | )
32 |
33 | create_trips_table = PostgresOperator(
34 | task_id="create_trips_table",
35 | dag=dag,
36 | postgres_conn_id="redshift",
37 | sql=sql_statements.CREATE_TRIPS_TABLE_SQL
38 | )
39 |
40 | copy_trips_task = S3ToRedshiftOperator(
41 | task_id="load_trips_from_s3_to_redshift",
42 | dag=dag,
43 | table="trips",
44 | redshift_conn_id="redshift",
45 | aws_credentials_id="aws_credentials",
46 | s3_bucket="udacity-dend",
47 | s3_key="data-pipelines/divvy/partitioned/{execution_date.year}/{execution_date.month}/divvy_trips.csv"
48 | )
49 |
50 | check_trips = HasRowsOperator(
51 | task_id='check_trips_data',
52 | dag=dag,
53 | redshift_conn_id="redshift",
54 | table="trips"
55 | )
56 |
57 | create_stations_table = PostgresOperator(
58 | task_id="create_stations_table",
59 | dag=dag,
60 | postgres_conn_id="redshift",
61 | sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
62 | )
63 |
64 | copy_stations_task = S3ToRedshiftOperator(
65 | task_id="load_stations_from_s3_to_redshift",
66 | dag=dag,
67 | redshift_conn_id="redshift",
68 | aws_credentials_id="aws_credentials",
69 | s3_bucket="udacity-dend",
70 | s3_key="data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv",
71 | table="stations"
72 | )
73 |
74 | check_stations = HasRowsOperator(
75 | task_id='check_stations_data',
76 | dag=dag,
77 | redshift_conn_id="redshift",
78 | table="stations"
79 | )
80 |
81 | create_trips_table >> copy_trips_task
82 | create_stations_table >> copy_stations_task
83 | copy_stations_task >> check_stations
84 | copy_trips_task >> check_trips
85 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/exercise2.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll refactor a DAG with a single overloaded task into a DAG with several tasks with well-defined boundaries
3 | #1 - Read through the DAG and identify points in the DAG that could be split apart
4 | #2 - Split the DAG into multiple PythonOperators
5 | #3 - Run the DAG
6 |
7 | import datetime
8 | import logging
9 |
10 | from airflow import DAG
11 | from airflow.hooks.postgres_hook import PostgresHook
12 |
13 | from airflow.operators.postgres_operator import PostgresOperator
14 | from airflow.operators.python_operator import PythonOperator
15 |
16 | def log_oldest():
17 | redshift_hook = PostgresHook("redshift")
18 | records = redshift_hook.get_records("""
19 | SELECT birthyear FROM older_riders ORDER BY birthyear ASC LIMIT 1
20 | """)
21 | if len(records) > 0 and len(records[0]) > 0:
22 | logging.info(f"Oldest rider was born in {records[0][0]}")
23 |
24 | def log_youngest():
25 | redshift_hook = PostgresHook("redshift")
26 |
27 | records = redshift_hook.get_records("""
28 | SELECT birthyear FROM younger_riders ORDER BY birthyear DESC LIMIT 1
29 | """)
30 | if len(records) > 0 and len(records[0]) > 0:
31 | logging.info(f"Youngest rider was born in {records[0][0]}")
32 |
33 | dag = DAG(
34 | "lesson3.exercise2",
35 | start_date=datetime.datetime.utcnow()
36 | )
37 |
38 | create_oldest_task = PostgresOperator(
39 | task_id="create_oldest",
40 | dag=dag,
41 | sql="""
42 | BEGIN;
43 | DROP TABLE IF EXISTS older_riders;
44 | CREATE TABLE older_riders AS (
45 | SELECT * FROM trips WHERE birthyear > 0 AND birthyear <= 1945
46 | );
47 | COMMIT;
48 | """,
49 | postgres_conn_id="redshift"
50 | )
51 |
52 | log_oldest_task = PythonOperator(
53 | task_id="log_oldest",
54 | dag=dag,
55 | python_callable=log_oldest
56 | )
57 |
58 | create_youngest_task = PostgresOperator(
59 | task_id="create_youngest",
60 | dag=dag,
61 | sql="""
62 | BEGIN;
63 | DROP TABLE IF EXISTS younger_riders;
64 | CREATE TABLE younger_riders AS (
65 | SELECT * FROM trips WHERE birthyear > 2000
66 | );
67 | COMMIT;
68 | """,
69 | postgres_conn_id="redshift"
70 | )
71 |
72 | log_youngest_task = PythonOperator(
73 | task_id="log_youngest",
74 | dag=dag,
75 | python_callable=log_youngest
76 | )
77 |
78 | create_lifetime_rides_task = PostgresOperator(
79 | task_id="create_lifetime_rides",
80 | dag=dag,
81 | sql="""
82 | BEGIN;
83 | DROP TABLE IF EXISTS lifetime_rides;
84 | CREATE TABLE lifetime_rides AS (
85 | SELECT bikeid, COUNT(bikeid)
86 | FROM trips
87 | GROUP BY bikeid
88 | );
89 | COMMIT;
90 | """,
91 | postgres_conn_id="redshift"
92 | )
93 |
94 | create_city_station_counts_task = PostgresOperator(
95 | task_id="create_city_station_counts",
96 | dag=dag,
97 | sql="""
98 | BEGIN;
99 | DROP TABLE IF EXISTS city_station_counts;
100 | CREATE TABLE city_station_counts AS(
101 | SELECT city, COUNT(city)
102 | FROM stations
103 | GROUP BY city
104 | );
105 | COMMIT;
106 | """,
107 | postgres_conn_id="redshift"
108 | )
109 |
110 | create_oldest_task >> log_oldest_task
111 | create_youngest_task >> log_youngest_task
112 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/exercise3/dag.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag.
3 | #1 - Consolidate HasRowsOperator into the SubDag
4 | #2 - Reorder the tasks to take advantage of the SubDag Operators
5 |
6 | import datetime
7 |
8 | from airflow import DAG
9 | from airflow.operators.postgres_operator import PostgresOperator
10 | from airflow.operators.subdag_operator import SubDagOperator
11 | from airflow.operators.udacity_plugin import HasRowsOperator
12 |
13 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag
14 | import sql_statements
15 |
16 |
17 | start_date = datetime.datetime.utcnow()
18 |
19 | dag = DAG(
20 | "lesson3.exercise3",
21 | start_date=start_date,
22 | )
23 |
24 | trips_task_id = "trips_subdag"
25 | trips_subdag_task = SubDagOperator(
26 | subdag=get_s3_to_redshift_dag(
27 | "lesson3.exercise3",
28 | trips_task_id,
29 | "redshift",
30 | "aws_credentials",
31 | "trips",
32 | sql_statements.CREATE_TRIPS_TABLE_SQL,
33 | s3_bucket="udacity-dend",
34 | s3_key="data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv",
35 | start_date=start_date,
36 | ),
37 | task_id=trips_task_id,
38 | dag=dag,
39 | )
40 |
41 | stations_task_id = "stations_subdag"
42 | stations_subdag_task = SubDagOperator(
43 | subdag=get_s3_to_redshift_dag(
44 | "lesson3.exercise3",
45 | stations_task_id,
46 | "redshift",
47 | "aws_credentials",
48 | "stations",
49 | sql_statements.CREATE_STATIONS_TABLE_SQL,
50 | s3_bucket="udacity-dend",
51 | s3_key="data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv",
52 | start_date=start_date,
53 | ),
54 | task_id=stations_task_id,
55 | dag=dag,
56 | )
57 |
58 | location_traffic_task = PostgresOperator(
59 | task_id="calculate_location_traffic",
60 | dag=dag,
61 | postgres_conn_id="redshift",
62 | sql=sql_statements.LOCATION_TRAFFIC_SQL
63 | )
64 |
65 | trips_subdag_task >> location_traffic_task
66 | stations_subdag_task >> location_traffic_task
67 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/exercise3/subdag.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag.
3 | #1 - Consolidate HasRowsOperator into the SubDag
4 | #2 - Reorder the tasks to take advantage of the SubDag Operators
5 |
6 | import datetime
7 |
8 | from airflow import DAG
9 | from airflow.operators.postgres_operator import PostgresOperator
10 | from airflow.operators.udacity_plugin import HasRowsOperator
11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator
12 |
13 | import sql_statements
14 |
15 |
16 | # Returns a DAG which creates a table if it does not exist, and then proceeds
17 | # to load data into that table from S3. When the load is complete, a data
18 | # quality check is performed to assert that at least one row of data is
19 | # present.
20 | def get_s3_to_redshift_dag(
21 | parent_dag_name,
22 | task_id,
23 | redshift_conn_id,
24 | aws_credentials_id,
25 | table,
26 | create_sql_stmt,
27 | s3_bucket,
28 | s3_key,
29 | *args, **kwargs):
30 | dag = DAG(
31 | f"{parent_dag_name}.{task_id}",
32 | **kwargs
33 | )
34 |
35 | create_task = PostgresOperator(
36 | task_id=f"create_{table}_table",
37 | dag=dag,
38 | postgres_conn_id=redshift_conn_id,
39 | sql=create_sql_stmt
40 | )
41 |
42 | copy_task = S3ToRedshiftOperator(
43 | task_id=f"load_{table}_from_s3_to_redshift",
44 | dag=dag,
45 | table=table,
46 | redshift_conn_id=redshift_conn_id,
47 | aws_credentials_id=aws_credentials_id,
48 | s3_bucket=s3_bucket,
49 | s3_key=s3_key
50 | )
51 |
52 | check_task = HasRowsOperator(
53 | task_id=f"check_{table}_data",
54 | dag=dag,
55 | redshift_conn_id=redshift_conn_id,
56 | table=table
57 | )
58 |
59 | create_task >> copy_task
60 | copy_task >> check_task
61 |
62 | return dag
63 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/exercise4.py:
--------------------------------------------------------------------------------
1 | import datetime
2 |
3 | from airflow import DAG
4 |
5 | from airflow.operators import (
6 | FactsCalculatorOperator,
7 | HasRowsOperator,
8 | S3ToRedshiftOperator
9 | )
10 |
11 |
12 | dag = DAG("lesson3.exercise4", start_date=datetime.datetime.utcnow())
13 |
14 | copy_trips_task = S3ToRedshiftOperator(
15 | task_id="load_trips_from_s3_to_redshift",
16 | dag=dag,
17 | table="trips",
18 | redshift_conn_id="redshift",
19 | aws_credentials_id="aws_credentials",
20 | s3_bucket="udacity-dend",
21 | s3_key="data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
22 | )
23 |
24 | check_trips = HasRowsOperator(
25 | task_id='check_trips_data',
26 | dag=dag,
27 | redshift_conn_id="redshift",
28 | table="trips"
29 | )
30 |
31 | calculate_facts = FactsCalculatorOperator(
32 | task_id='calculate_facts',
33 | dag=dag,
34 | redshift_conn_id="redshift",
35 | origin_table="trips",
36 | destination_table="trip_facts",
37 | fact_column="tripduration",
38 | groupby_column="bikeid",
39 | )
40 |
41 | copy_trips_task >> check_trips
42 | check_trips >> calculate_facts
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/operators/__init__.py:
--------------------------------------------------------------------------------
1 | from airflow.plugins_manager import AirflowPlugin
2 |
3 | from operators.facts_calculator import FactsCalculatorOperator
4 | from operators.has_rows import HasRowsOperator
5 | from operators.s3_to_redshift import S3ToRedshiftOperator
6 |
7 |
8 | # Defining the plugin class
9 | class UdacityPlugin(AirflowPlugin):
10 | name = "udacity_plugin"
11 | operators = [
12 | FactsCalculatorOperator,
13 | HasRowsOperator,
14 | S3ToRedshiftOperator
15 | ]
16 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/operators/facts_calculator.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | from airflow.hooks.postgres_hook import PostgresHook
4 | from airflow.models import BaseOperator
5 | from airflow.utils.decorators import apply_defaults
6 |
7 |
8 | class FactsCalculatorOperator(BaseOperator):
9 | facts_sql_template = """
10 | DROP TABLE IF EXISTS {destination_table};
11 | CREATE TABLE {destination_table} AS
12 | SELECT
13 | {groupby_column},
14 | MAX({fact_column}) AS max_{fact_column},
15 | MIN({fact_column}) AS min_{fact_column},
16 | AVG({fact_column}) AS average_{fact_column}
17 | FROM {origin_table}
18 | GROUP BY {groupby_column};
19 | """
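    # For example, with the arguments used in exercise4.py (origin_table="trips",
    # fact_column="tripduration", groupby_column="bikeid", destination_table="trip_facts"),
    # this template builds a trip_facts table holding the max, min and average trip duration
    # per bikeid.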
20 |
21 | @apply_defaults
22 | def __init__(self,
23 | redshift_conn_id="",
24 | origin_table="",
25 | destination_table="",
26 | fact_column="",
27 | groupby_column="",
28 | *args, **kwargs):
29 |
30 | super(FactsCalculatorOperator, self).__init__(*args, **kwargs)
31 | self.redshift_conn_id = redshift_conn_id
32 | self.origin_table = origin_table
33 | self.destination_table = destination_table
34 | self.fact_column = fact_column
35 | self.groupby_column = groupby_column
36 |
37 | def execute(self, context):
38 | redshift_hook = PostgresHook(self.redshift_conn_id)
39 |
40 | redshift_hook.run(self.facts_sql_template.format(
41 | destination_table=self.destination_table,
42 | groupby_column=self.groupby_column,
43 | fact_column=self.fact_column,
44 | origin_table=self.origin_table
45 | ))
46 |
47 | pass
48 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/operators/has_rows.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | from airflow.hooks.postgres_hook import PostgresHook
4 | from airflow.models import BaseOperator
5 | from airflow.utils.decorators import apply_defaults
6 |
7 |
8 | class HasRowsOperator(BaseOperator):
9 |
10 | @apply_defaults
11 | def __init__(self,
12 | redshift_conn_id="",
13 | table="",
14 | *args, **kwargs):
15 |
16 | super(HasRowsOperator, self).__init__(*args, **kwargs)
17 | self.table = table
18 | self.redshift_conn_id = redshift_conn_id
19 |
20 | def execute(self, context):
21 | redshift_hook = PostgresHook(self.redshift_conn_id)
22 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
23 | if len(records) < 1 or len(records[0]) < 1:
24 | raise ValueError(f"Data quality check failed. {self.table} returned no results")
25 | num_records = records[0][0]
26 | if num_records < 1:
27 | raise ValueError(f"Data quality check failed. {self.table} contained 0 rows")
28 | logging.info(f"Data quality on table {self.table} check passed with {records[0][0]} records")
29 |
30 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/operators/s3_to_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow.contrib.hooks.aws_hook import AwsHook
2 | from airflow.hooks.postgres_hook import PostgresHook
3 | from airflow.models import BaseOperator
4 | from airflow.utils.decorators import apply_defaults
5 |
6 |
7 | class S3ToRedshiftOperator(BaseOperator):
8 | template_fields = ("s3_key",)
9 | copy_sql = """
10 | COPY {}
11 | FROM '{}'
12 | ACCESS_KEY_ID '{}'
13 | SECRET_ACCESS_KEY '{}'
14 | IGNOREHEADER {}
15 | DELIMITER '{}'
16 | """
17 |
18 |
19 | @apply_defaults
20 | def __init__(self,
21 | redshift_conn_id="",
22 | aws_credentials_id="",
23 | table="",
24 | s3_bucket="",
25 | s3_key="",
26 | delimiter=",",
27 | ignore_headers=1,
28 | *args, **kwargs):
29 |
30 | super(S3ToRedshiftOperator, self).__init__(*args, **kwargs)
31 | self.table = table
32 | self.redshift_conn_id = redshift_conn_id
33 | self.s3_bucket = s3_bucket
34 | self.s3_key = s3_key
35 | self.delimiter = delimiter
36 | self.ignore_headers = ignore_headers
37 | self.aws_credentials_id = aws_credentials_id
38 |
39 | def execute(self, context):
40 | aws_hook = AwsHook(self.aws_credentials_id)
41 | credentials = aws_hook.get_credentials()
42 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
43 |
44 | self.log.info("Clearing data from destination Redshift table")
45 | redshift.run("DELETE FROM {}".format(self.table))
46 |
47 | self.log.info("Copying data from S3 to Redshift")
48 | rendered_key = self.s3_key.format(**context)
49 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key)
50 | formatted_sql = S3ToRedshiftOperator.copy_sql.format(
51 | self.table,
52 | s3_path,
53 | credentials.access_key,
54 | credentials.secret_key,
55 | self.ignore_headers,
56 | self.delimiter
57 | )
58 | redshift.run(formatted_sql)
59 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L3_exercises/sql_statements.py:
--------------------------------------------------------------------------------
1 | CREATE_TRIPS_TABLE_SQL = """
2 | CREATE TABLE IF NOT EXISTS trips (
3 | trip_id INTEGER NOT NULL,
4 | start_time TIMESTAMP NOT NULL,
5 | end_time TIMESTAMP NOT NULL,
6 | bikeid INTEGER NOT NULL,
7 | tripduration DECIMAL(16,2) NOT NULL,
8 | from_station_id INTEGER NOT NULL,
9 | from_station_name VARCHAR(100) NOT NULL,
10 | to_station_id INTEGER NOT NULL,
11 | to_station_name VARCHAR(100) NOT NULL,
12 | usertype VARCHAR(20),
13 | gender VARCHAR(6),
14 | birthyear INTEGER,
15 | PRIMARY KEY(trip_id))
16 | DISTSTYLE ALL;
17 | """
18 |
19 | CREATE_STATIONS_TABLE_SQL = """
20 | CREATE TABLE IF NOT EXISTS stations (
21 | id INTEGER NOT NULL,
22 | name VARCHAR(250) NOT NULL,
23 | city VARCHAR(100) NOT NULL,
24 | latitude DECIMAL(9, 6) NOT NULL,
25 | longitude DECIMAL(9, 6) NOT NULL,
26 | dpcapacity INTEGER NOT NULL,
27 | online_date TIMESTAMP NOT NULL,
28 | PRIMARY KEY(id))
29 | DISTSTYLE ALL;
30 | """
31 |
32 | COPY_SQL = """
33 | COPY {}
34 | FROM '{}'
35 | ACCESS_KEY_ID '{{}}'
36 | SECRET_ACCESS_KEY '{{}}'
37 | IGNOREHEADER 1
38 | DELIMITER ','
39 | """
40 |
41 | COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
42 | "trips",
43 | "s3://udacity-dend/data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
44 | )
45 |
46 | COPY_ALL_TRIPS_SQL = COPY_SQL.format(
47 | "trips",
48 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
49 | )
50 |
51 | COPY_STATIONS_SQL = COPY_SQL.format(
52 | "stations",
53 | "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
54 | )
55 |
56 | LOCATION_TRAFFIC_SQL = """
57 | BEGIN;
58 | DROP TABLE IF EXISTS station_traffic;
59 | CREATE TABLE station_traffic AS
60 | SELECT
61 | DISTINCT(t.from_station_id) AS station_id,
62 | t.from_station_name AS station_name,
63 | num_departures,
64 | num_arrivals
65 | FROM trips t
66 | JOIN (
67 | SELECT
68 | from_station_id,
69 | COUNT(from_station_id) AS num_departures
70 | FROM trips
71 | GROUP BY from_station_id
72 | ) AS fs ON t.from_station_id = fs.from_station_id
73 | JOIN (
74 | SELECT
75 | to_station_id,
76 | COUNT(to_station_id) AS num_arrivals
77 | FROM trips
78 | GROUP BY to_station_id
79 | ) AS ts ON t.from_station_id = ts.to_station_id
80 | """
81 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/README.md:
--------------------------------------------------------------------------------
1 | # Sparkify's Event Logs Data Pipeline
2 |
3 | ## Introduction
4 |
5 | This project consists of one Directed Acyclic Graph that implements the data pipeline responsible for reading all of Sparkify's event
6 | logs, processing them, and creating the fact/dimension tables described in the data schema below.
7 |
8 | For illustration purposes you can check out the graph that represents this pipeline's flow:
9 |
10 | 
11 |
12 | In short, this ELT process:
13 | - stages the raw data;
14 | - then transforms the staged data into the `songplays` fact table;
15 | - then transforms the staged data into the dimension tables;
16 | - finally, checks that each fact/dimension table has at least one row (the task wiring is shown right below).
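
For reference, here is how those steps are wired together in `dags/sparkify_analytical_tables_dag.py` (a condensed excerpt of that file):

```python
# The two staging tasks run first
start_operator >> stage_events_to_redshift
start_operator >> stage_songs_to_redshift

# Both staging tables feed the songplays fact table
stage_events_to_redshift >> load_songplays_table
stage_songs_to_redshift >> load_songplays_table

# The dimension tables are loaded after the fact table
load_songplays_table >> load_song_dimension_table
load_songplays_table >> load_user_dimension_table
load_songplays_table >> load_artist_dimension_table
load_songplays_table >> load_time_dimension_table

# Every dimension load is checked before the pipeline ends
load_song_dimension_table >> run_quality_checks
load_user_dimension_table >> run_quality_checks
load_artist_dimension_table >> run_quality_checks
load_time_dimension_table >> run_quality_checks
run_quality_checks >> end_operator
```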
17 |
18 | ## Data sources
19 |
20 | We read two main data sources on Amazon S3:
21 |
22 | - `s3://udacity-dend/song_data/` - JSON files containing meta information about song/artists data
23 | - `s3://udacity-dend/log_data/` - JSON files containing log events from the Sparkify app
24 |
25 | ## Data Schema
26 |
27 | Besides the staging tables, we have 1 fact table and 4 dimension tables, detailed below:
28 |
29 | #### Song Plays table
30 |
31 | - *Table:* `songplays`
32 | - *Type:* Fact table
33 |
34 | | Column | Type | Description |
35 | | ------ | ---- | ----------- |
36 | | `playid` | `varchar(32) NOT NULL` | The main identification of the table |
37 | | `start_time` | `timestamp NOT NULL` | The timestamp that this song play log happened |
38 | | `userid` | `int4 NOT NULL` | The user id that triggered this song play log. It cannot be null, as we don't have song play logs that were not triggered by a user. |
39 | | `level` | `varchar(256)` | The level of the user that triggered this song play log |
40 | | `songid` | `varchar(256)` | The identification of the song that was played. It can be null. |
41 | | `artistid` | `varchar(256)` | The identification of the artist of the song that was played. |
42 | | `sessionid` | `int4` | The session_id of the user on the app |
43 | | `location` | `varchar(256)` | The location where this song play log was triggered |
44 | | `user_agent` | `varchar(256)` | The user agent (browser/device) the user accessed the app with |
45 |
46 | #### Users table
47 |
48 | - *Table:* `users`
49 | - *Type:* Dimension table
50 |
51 | | Column | Type | Description |
52 | | ------ | ---- | ----------- |
53 | | `userid` | `int4 NOT NULL` | The main identification of a user |
54 | | `first_name` | `varchar(256)` | First name of the user; part of the basic information we have about the user |
55 | | `last_name` | `varchar(256)` | Last name of the user |
56 | | `gender` | `varchar(256)` | The gender, stored as a single character: `M` (male) or `F` (female); it can also be `NULL` |
57 | | `level` | `varchar(256)` | The level indicates the user's app plan (`premium` or `free`) |
58 |
59 |
60 | #### Songs table
61 |
62 | - *Table:* `songs`
63 | - *Type:* Dimension table
64 |
65 | | Column | Type | Description |
66 | | ------ | ---- | ----------- |
67 | | `songid` | `varchar(256) NOT NULL` | The main identification of a song |
68 | | `title` | `varchar(256)` | The title of the song; the basic information we have about a song |
69 | | `artistid` | `varchar(256)` | The artist id; we don't have songs without an artist, and this field also references the artists table |
70 | | `year` | `int4` | The year that this song was made |
71 | | `duration` | `numeric(18,0)` | The duration of the song |
72 |
73 |
74 | #### Artists table
75 |
76 | - *Table:* `artists`
77 | - *Type:* Dimension table
78 |
79 | | Column | Type | Description |
80 | | ------ | ---- | ----------- |
81 | | `artistid` | `varchar(256) NOT NULL` | The main identification of an artist |
82 | | `name` | `varchar(256)` | The name of the artist |
83 | | `location` | `varchar(256)` | The location the artist is from |
84 | | `lattitude` | `numeric(18,0)` | The latitude of the location the artist is from |
85 | | `longitude` | `numeric(18,0)` | The longitude of the location the artist is from |
86 |
87 | #### Time table
88 |
89 | - *Table:* `time`
90 | - *Type:* Dimension table
91 |
92 | | Column | Type | Description |
93 | | ------ | ---- | ----------- |
94 | | `start_time` | `timestamp NOT NULL` | The timestamp itself, serves as the main identification of this table |
95 | | `hour` | `int4` | The hour from the timestamp |
96 | | `day` | `int4` | The day of the month from the timestamp |
97 | | `week` | `int4` | The week of the year from the timestamp |
98 | | `month` | `varchar(255)` | The month of the year from the timestamp |
99 | | `year` | `int4` | The year from the timestamp |
100 | | `weekday` | `varchar(255)` | The day of the week from the timestamp |
101 |
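As a quick illustration of how this star schema can be queried (a minimal sketch, not part of the pipeline, assuming the same `redshift` Airflow connection used by the DAG's operators):

```python
import logging

from airflow.hooks.postgres_hook import PostgresHook


def log_top_artists():
    """Log the five most played artists by joining the fact table with a dimension."""
    redshift_hook = PostgresHook("redshift")
    records = redshift_hook.get_records("""
        SELECT a.name, COUNT(*) AS plays
        FROM songplays sp
        JOIN artists a ON a.artistid = sp.artistid
        GROUP BY a.name
        ORDER BY plays DESC
        LIMIT 5
    """)
    for name, plays in records:
        logging.info(f"{name}: {plays} plays")
```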
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/create_tables.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE public.artists (
2 | artistid varchar(256) NOT NULL,
3 | name varchar(256),
4 | location varchar(256),
5 | lattitude numeric(18,0),
6 | longitude numeric(18,0)
7 | );
8 |
9 | CREATE TABLE public.songplays (
10 | playid varchar(32) NOT NULL,
11 | start_time timestamp NOT NULL,
12 | userid int4 NOT NULL,
13 | "level" varchar(256),
14 | songid varchar(256),
15 | artistid varchar(256),
16 | sessionid int4,
17 | location varchar(256),
18 | user_agent varchar(256),
19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid)
20 | );
21 |
22 | CREATE TABLE public.songs (
23 | songid varchar(256) NOT NULL,
24 | title varchar(256),
25 | artistid varchar(256),
26 | "year" int4,
27 | duration numeric(18,0),
28 | CONSTRAINT songs_pkey PRIMARY KEY (songid)
29 | );
30 |
31 | CREATE TABLE public.staging_events (
32 | artist varchar(256),
33 | auth varchar(256),
34 | firstname varchar(256),
35 | gender varchar(256),
36 | iteminsession int4,
37 | lastname varchar(256),
38 | length numeric(18,0),
39 | "level" varchar(256),
40 | location varchar(256),
41 | "method" varchar(256),
42 | page varchar(256),
43 | registration numeric(18,0),
44 | sessionid int4,
45 | song varchar(256),
46 | status int4,
47 | ts int8,
48 | useragent varchar(256),
49 | userid int4
50 | );
51 |
52 | CREATE TABLE public.staging_songs (
53 | num_songs int4,
54 | artist_id varchar(256),
55 | artist_name varchar(256),
56 | artist_latitude numeric(18,0),
57 | artist_longitude numeric(18,0),
58 | artist_location varchar(256),
59 | song_id varchar(256),
60 | title varchar(256),
61 | duration numeric(18,0),
62 | "year" int4
63 | );
64 |
65 | CREATE TABLE public.time (
66 | start_time timestamp NOT NULL,
67 | hour int4,
68 | day int4,
69 | week int4,
70 | month varchar(255),
71 | year int4,
72 | weekday varchar(255),
73 | CONSTRAINT time_pkey PRIMARY KEY (start_time)
74 | );
75 |
76 | CREATE TABLE public.users (
77 | userid int4 NOT NULL,
78 | first_name varchar(256),
79 | last_name varchar(256),
80 | gender varchar(256),
81 | "level" varchar(256),
82 | CONSTRAINT users_pkey PRIMARY KEY (userid)
83 | );
84 |
85 |
86 |
87 |
88 |
89 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/dags/sparkify_analytical_tables_dag.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 | import os
3 | from airflow import DAG
4 | from airflow.operators.dummy_operator import DummyOperator
5 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator,
6 | LoadDimensionOperator, DataQualityOperator)
7 | from helpers import SqlQueries
8 |
9 | # AWS_KEY = os.environ.get('AWS_KEY')
10 | # AWS_SECRET = os.environ.get('AWS_SECRET')
11 |
12 | default_args = {
13 | 'owner': 'udacity',
14 | 'start_date': datetime(2019, 1, 12),
15 | 'depends_on_past': False,
16 | 'retries': 1,
17 | 'retry_delay': timedelta(seconds=300),
18 | 'catchup': False
19 | }
20 |
21 | dag = DAG('sparkify_analytical_tables_dag',
22 | default_args=default_args,
23 | description='Load and transform data in Redshift with Airflow',
24 | schedule_interval='0 * * * *'
25 | )
26 |
27 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag)
28 |
29 | stage_events_to_redshift = StageToRedshiftOperator(
30 | task_id='Stage_events',
31 | dag=dag,
32 | table="staging_events",
33 | redshift_conn_id="redshift",
34 | aws_credentials_id="aws_credentials",
35 | s3_bucket="udacity-dend",
36 | s3_key="log_data",
37 | json_path="s3://udacity-dend/log_json_path.json"
38 | )
39 |
40 | stage_songs_to_redshift = StageToRedshiftOperator(
41 | task_id='Stage_songs',
42 | dag=dag,
43 |
44 | table="staging_songs",
45 | redshift_conn_id="redshift",
46 | aws_credentials_id="aws_credentials",
47 | s3_bucket="udacity-dend",
48 | s3_key="song_data"
49 | )
50 |
51 | load_songplays_table = LoadFactOperator(
52 | task_id='Load_songplays_fact_table',
53 | dag=dag,
54 | redshift_conn_id="redshift",
55 | table="songplays",
56 | select_query=SqlQueries.songplay_table_insert
57 | )
58 |
59 | load_user_dimension_table = LoadDimensionOperator(
60 | task_id='Load_user_dim_table',
61 | dag=dag,
62 | redshift_conn_id="redshift",
63 | table="users",
64 | truncate_table=True,
65 | select_query=SqlQueries.user_table_insert
66 | )
67 |
68 | load_song_dimension_table = LoadDimensionOperator(
69 | task_id='Load_song_dim_table',
70 | dag=dag,
71 | redshift_conn_id="redshift",
72 | table="songs",
73 | truncate_table=True,
74 | select_query=SqlQueries.song_table_insert
75 | )
76 |
77 | load_artist_dimension_table = LoadDimensionOperator(
78 | task_id='Load_artist_dim_table',
79 | dag=dag,
80 | redshift_conn_id="redshift",
81 | table="artists",
82 | truncate_table=True,
83 | select_query=SqlQueries.artist_table_insert
84 | )
85 |
86 | load_time_dimension_table = LoadDimensionOperator(
87 | task_id='Load_time_dim_table',
88 | dag=dag,
89 | redshift_conn_id="redshift",
90 | table="time",
91 | truncate_table=True,
92 | select_query=SqlQueries.time_table_insert
93 | )
94 |
95 | run_quality_checks = DataQualityOperator(
96 | task_id='Run_data_quality_checks',
97 | dag=dag,
98 | redshift_conn_id="redshift",
99 | tables=[
100 | "songplays",
101 | "users",
102 | "songs",
103 | "artists",
104 | "time"
105 | ],
106 | )
107 |
108 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag)
109 |
110 | # Step 1
111 | start_operator >> stage_events_to_redshift
112 | start_operator >> stage_songs_to_redshift
113 |
114 | # Step 2
115 | stage_events_to_redshift >> load_songplays_table
116 | stage_songs_to_redshift >> load_songplays_table
117 |
118 | # Step 3
119 | load_songplays_table >> load_song_dimension_table
120 | load_songplays_table >> load_user_dimension_table
121 | load_songplays_table >> load_artist_dimension_table
122 | load_songplays_table >> load_time_dimension_table
123 |
124 | # Step 4
125 | load_song_dimension_table >> run_quality_checks
126 | load_user_dimension_table >> run_quality_checks
127 | load_artist_dimension_table >> run_quality_checks
128 | load_time_dimension_table >> run_quality_checks
129 |
130 | # Step 5 - end
131 | run_quality_checks >> end_operator
132 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/images/dag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabfr/data-engineering-nanodegree/61e6934bee8238d1b45beed124a9c778b83b366a/4-data-pipelines-with-airflow/L4_project/images/dag.png
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/__init__.py:
--------------------------------------------------------------------------------
1 | from __future__ import division, absolute_import, print_function
2 |
3 | from airflow.plugins_manager import AirflowPlugin
4 |
5 | import operators
6 | import helpers
7 |
8 | # Defining the plugin class
9 | class UdacityPlugin(AirflowPlugin):
10 | name = "udacity_plugin"
11 | operators = [
12 | operators.StageToRedshiftOperator,
13 | operators.LoadFactOperator,
14 | operators.LoadDimensionOperator,
15 | operators.DataQualityOperator
16 | ]
17 | helpers = [
18 | helpers.SqlQueries
19 | ]
20 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/helpers/__init__.py:
--------------------------------------------------------------------------------
1 | from helpers.sql_queries import SqlQueries
2 |
3 | __all__ = [
4 | 'SqlQueries',
5 | ]
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/helpers/sql_queries.py:
--------------------------------------------------------------------------------
1 | class SqlQueries:
2 | songplay_table_insert = ("""
3 | SELECT
4 | md5(events.sessionid || events.start_time) songplay_id,
5 | events.start_time,
6 | events.userid,
7 | events.level,
8 | songs.song_id,
9 | songs.artist_id,
10 | events.sessionid,
11 | events.location,
12 | events.useragent
13 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, *
14 | FROM staging_events
15 | WHERE page='NextSong') events
16 | LEFT JOIN staging_songs songs
17 | ON events.song = songs.title
18 | AND events.artist = songs.artist_name
19 | AND events.length = songs.duration
20 | """)
21 |
22 | user_table_insert = ("""
23 | SELECT distinct userid, firstname, lastname, gender, level
24 | FROM staging_events
25 | WHERE page='NextSong'
26 | """)
27 |
28 | song_table_insert = ("""
29 | SELECT distinct song_id, title, artist_id, year, duration
30 | FROM staging_songs
31 | """)
32 |
33 | artist_table_insert = ("""
34 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude
35 | FROM staging_songs
36 | """)
37 |
38 | time_table_insert = ("""
39 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time),
40 | extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time)
41 | FROM songplays
42 | """)
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/operators/__init__.py:
--------------------------------------------------------------------------------
1 | from operators.stage_redshift import StageToRedshiftOperator
2 | from operators.load_fact import LoadFactOperator
3 | from operators.load_dimension import LoadDimensionOperator
4 | from operators.data_quality import DataQualityOperator
5 |
6 | __all__ = [
7 | 'StageToRedshiftOperator',
8 | 'LoadFactOperator',
9 | 'LoadDimensionOperator',
10 | 'DataQualityOperator'
11 | ]
12 |
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/operators/data_quality.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class DataQualityOperator(BaseOperator):
6 |
7 | ui_color = '#89DA59'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | redshift_conn_id="",
12 | tables=[],
13 | *args, **kwargs):
14 |
15 | super(DataQualityOperator, self).__init__(*args, **kwargs)
16 | self.redshift_conn_id = redshift_conn_id
17 | self.tables = tables
18 |
19 | def execute(self, context):
20 | redshift_hook = PostgresHook(self.redshift_conn_id)
21 | for table in self.tables:
22 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}")
23 | if len(records) < 1 or len(records[0]) < 1:
24 | raise ValueError(f"Data quality check failed. {table} returned no results")
25 | num_records = records[0][0]
26 | if num_records < 1:
27 | raise ValueError(f"Data quality check failed. {table} contained 0 rows")
28 | self.log.info(f"Data quality on table {table} check passed with {records[0][0]} records")
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/operators/load_dimension.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadDimensionOperator(BaseOperator):
6 |
7 | ui_color = '#80BD9E'
8 | truncate_stmt = """
9 | TRUNCATE TABLE {table}
10 | """
11 | insert_into_stmt = """
12 | INSERT INTO {table}
13 | {select_query}
14 | """
15 |
16 | @apply_defaults
17 | def __init__(self,
18 | # Define your operators params (with defaults) here
19 | # Example:
20 | # conn_id = your-connection-name
21 | redshift_conn_id,
22 | table,
23 | select_query,
24 | truncate_table=False,
25 | *args, **kwargs):
26 |
27 | super(LoadDimensionOperator, self).__init__(*args, **kwargs)
28 |
29 | self.redshift_conn_id = redshift_conn_id
30 | self.table = table
31 | self.select_query = select_query
32 | self.truncate_table = truncate_table
33 |
34 | def execute(self, context):
35 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
36 |
37 | if self.truncate_table:
38 | self.log.info("Will truncate table before inserting new data...")
39 | redshift.run(LoadDimensionOperator.truncate_stmt.format(
40 | table=self.table
41 | ))
42 |
43 | self.log.info("Inserting dimension table data...")
44 | redshift.run(LoadDimensionOperator.insert_into_stmt.format(
45 | table=self.table,
46 | select_query=self.select_query
47 | ))
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/operators/load_fact.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadFactOperator(BaseOperator):
6 |
7 | ui_color = '#F98866'
8 | insert_into_stmt = """
9 | INSERT INTO {table}
10 | {select_query}
11 | """
12 |
13 | @apply_defaults
14 | def __init__(self,
15 | # Define your operators params (with defaults) here
16 | # Example:
17 | # conn_id = your-connection-name
18 | redshift_conn_id,
19 | table,
20 | select_query,
21 | *args, **kwargs):
22 |
23 | super(LoadFactOperator, self).__init__(*args, **kwargs)
24 |
25 | self.redshift_conn_id = redshift_conn_id
26 | self.table = table
27 | self.select_query = select_query
28 |
29 | def execute(self, context):
30 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
31 |
32 | redshift.run(LoadFactOperator.insert_into_stmt.format(
33 | table=self.table,
34 | select_query=self.select_query
35 | ))
--------------------------------------------------------------------------------
/4-data-pipelines-with-airflow/L4_project/plugins/operators/stage_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow.contrib.hooks.aws_hook import AwsHook
2 | from airflow.hooks.postgres_hook import PostgresHook
3 | from airflow.models import BaseOperator
4 | from airflow.utils.decorators import apply_defaults
5 |
6 | class StageToRedshiftOperator(BaseOperator):
7 | ui_color = '#358140'
8 | template_fields = ("s3_key",)
9 | copy_sql = """
10 | COPY {}
11 | FROM '{}'
12 | ACCESS_KEY_ID '{}'
13 | SECRET_ACCESS_KEY '{}'
14 | COMPUPDATE OFF STATUPDATE OFF
15 | FORMAT AS JSON '{}'
16 | """
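    # The placeholders are filled in execute(): target table, S3 path, AWS access key, AWS
    # secret key, and finally the JSON path ('auto' tells Redshift's COPY to map JSON keys to
    # column names; a JSONPaths file on S3, as used for the events data, can be given instead).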
17 |
18 | @apply_defaults
19 | def __init__(self,
20 | redshift_conn_id="",
21 | aws_credentials_id="",
22 | table="",
23 | s3_bucket="",
24 | s3_key="",
25 | json_path="auto",
26 | *args, **kwargs):
27 |
28 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
29 | self.table = table
30 | self.redshift_conn_id = redshift_conn_id
31 | self.s3_bucket = s3_bucket
32 | self.s3_key = s3_key
33 | self.aws_credentials_id = aws_credentials_id
34 | self.json_path = json_path
35 |
36 | def execute(self, context):
37 | aws_hook = AwsHook(self.aws_credentials_id)
38 | credentials = aws_hook.get_credentials()
39 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
40 |
41 | self.log.info("Clearing data from destination Redshift table")
42 | redshift.run("DELETE FROM {}".format(self.table))
43 |
44 | self.log.info("Copying data from S3 to Redshift")
45 | rendered_key = self.s3_key.format(**context)
46 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key)
47 | formatted_sql = StageToRedshiftOperator.copy_sql.format(
48 | self.table,
49 | s3_path,
50 | credentials.access_key,
51 | credentials.secret_key,
52 | self.json_path
53 | )
54 | redshift.run(formatted_sql)
55 |
56 |
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
/5-capstone-project/README.md:
--------------------------------------------------------------------------------
1 | # Work around the world
2 |
3 | This is meant to be my capstone project. [The project itself was moved to another repository,
4 | click here to check it out.](https://github.com/gabfr/work-around-the-world)
5 |
6 | Here you can find only the data sources exploration notebook.
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineering Nanodegree
2 |
3 | You can find out more about the nanodegree program here: https://www.udacity.com/course/data-engineer-nanodegree--nd027
4 |
5 | ## Purpose of this repository
6 | Here you can take a look at all the exercise notebooks I made throughout the nanodegree courses.
7 |
8 | [You can also find the list of projects developed throughout this course down below.](#courses-projects)
9 |
10 | ## Courses Projects
11 |
12 | ### 1. Data Modeling Course
13 |
14 | - Project 1: [Data Modeling with Postgres: *Sparkify song play logs ETL process*](https://github.com/gabfr/data-engineering-nanodegree/tree/master/1-data-modeling/L3-Project_Data_Modeling_with_Postgres#sparkify-song-play-logs-etl-process)
15 | - Project 2: [Data Modeling with Apache Cassandra: *Sparkify song play logs ETL process*](https://github.com/gabfr/data-engineering-nanodegree/blob/master/1-data-modeling/L5-Project_Data_Modeling_with_Apache_Cassandra/Project_1B_Project_Template.ipynb)
16 |
17 | ### 2. Cloud Data Warehouses
18 |
19 | - Project 3: [Data Warehouse with AWS Redshift: *Sparkify - ETL process of song play events*](https://github.com/gabfr/data-engineering-nanodegree/tree/master/2-cloud-data-warehouses/L4_Project_-_Data_Warehouse)
20 |
21 | ### 3. Data Lakes with Spark
22 |
23 | - Project 4: [Sparkify's Data Lake ELT process](https://github.com/gabfr/data-engineering-nanodegree/tree/master/3-data-lakes-with-spark/L4_Project)
24 |
25 | ### 4. Data Pipelines with Airflow
26 |
27 | - Project 5: [Sparkify's Event Logs Data Pipeline](https://github.com/gabfr/data-engineering-nanodegree/tree/master/4-data-pipelines-with-airflow/L4_project)
28 |
29 | ### 5. Capstone Project
30 |
31 | - Work around the world: a simple and unified dataset with jobs from major tech jobs lists
32 | - [Click here to check out the data sources exploration notebook](https://github.com/gabfr/data-engineering-nanodegree/tree/master/5-capstone-project)
33 | - [Click here to check out the implementation](https://github.com/gabfr/work-around-the-world)
--------------------------------------------------------------------------------