├── .gitignore
├── 0. Back to Basics
├── 1. Intro to Data Modelling
│ ├── Creating Table with Cassandra
│ │ └── Creating_a_Table_with_Apache_Cassandra.ipynb
│ ├── Creating Table with Postgres
│ │ ├── creating-a-table-with-postgres-0.ipynb
│ │ └── creating-a-table-with-postgres-1.ipynb
│ └── README.md
├── 2. Relational Data Models
│ ├── 1. Creating Normalized Tables.ipynb
│ ├── 2. Creating Denormalized Tables.ipynb
│ ├── 3. Creating Fact and Dimension Tables with Star Schema.ipynb
│ └── README.md
├── 3. NoSQL Data Models
│ ├── 1. Creating Tables Based on Queries.ipynb
│ ├── 2. Primary Key.ipynb
│ ├── 3. Clustering Column.ipynb
│ ├── 4. Using the WHERE Clause.ipynb
│ └── README.md
├── 4. Data Warehouses
│ ├── 1. ETL from 3NF to Star Schema using SQL.ipynb
│ ├── 2. OLAP Cubes.ipynb
│ ├── 3. Columnar Vs Row Storage.ipynb
│ ├── README.md
│ └── snapshots
│ │ ├── cif.PNG
│ │ ├── datamart.PNG
│ │ ├── hybrid.PNG
│ │ └── kimball.PNG
├── 5. Implementing Data Warehouse on AWS
│ ├── 1. AWS RedShift Setup Using Code.ipynb
│ ├── 2. Parallel ETL.ipynb
│ ├── 3. Optimizing Redshift Table Design.ipynb
│ ├── README.md
│ ├── dwh.cfg
│ └── redshift_dwh.PNG
├── 6. Intro to Spark
│ ├── PySpark Schema on Read & UDFs.ipynb
│ ├── Pyspark Data Wrangling.ipynb
│ ├── README.md
│ └── Spark SQL.ipynb
├── 7. Data Lakes
│ ├── Data Lake on S3.ipynb
│ ├── README.md
│ └── dlvdwh.PNG
└── 8. Data Pipelines with Airflow
│ ├── README.md
│ ├── context_and_templating.py
│ ├── dag_for_subdag.py
│ ├── hello_airflow.py
│ └── subdag.py
├── 1. Postgres ETL
├── README.md
├── create_tables.py
├── data
│ ├── log_data
│ │ └── 2018
│ │ │ └── 11
│ │ │ ├── 2018-11-01-events.json
│ │ │ ├── 2018-11-02-events.json
│ │ │ ├── 2018-11-03-events.json
│ │ │ ├── 2018-11-04-events.json
│ │ │ ├── 2018-11-05-events.json
│ │ │ ├── 2018-11-06-events.json
│ │ │ ├── 2018-11-07-events.json
│ │ │ ├── 2018-11-08-events.json
│ │ │ ├── 2018-11-09-events.json
│ │ │ ├── 2018-11-10-events.json
│ │ │ ├── 2018-11-11-events.json
│ │ │ ├── 2018-11-12-events.json
│ │ │ ├── 2018-11-13-events.json
│ │ │ ├── 2018-11-14-events.json
│ │ │ ├── 2018-11-15-events.json
│ │ │ ├── 2018-11-16-events.json
│ │ │ ├── 2018-11-17-events.json
│ │ │ ├── 2018-11-18-events.json
│ │ │ ├── 2018-11-19-events.json
│ │ │ ├── 2018-11-20-events.json
│ │ │ ├── 2018-11-21-events.json
│ │ │ ├── 2018-11-22-events.json
│ │ │ ├── 2018-11-23-events.json
│ │ │ ├── 2018-11-24-events.json
│ │ │ ├── 2018-11-25-events.json
│ │ │ ├── 2018-11-26-events.json
│ │ │ ├── 2018-11-27-events.json
│ │ │ ├── 2018-11-28-events.json
│ │ │ ├── 2018-11-29-events.json
│ │ │ └── 2018-11-30-events.json
│ └── song_data
│ │ └── A
│ │ ├── A
│ │ ├── A
│ │ │ ├── TRAAAAW128F429D538.json
│ │ │ ├── TRAAABD128F429CF47.json
│ │ │ ├── TRAAADZ128F9348C2E.json
│ │ │ ├── TRAAAEF128F4273421.json
│ │ │ ├── TRAAAFD128F92F423A.json
│ │ │ ├── TRAAAMO128F1481E7F.json
│ │ │ ├── TRAAAMQ128F1460CD3.json
│ │ │ ├── TRAAAPK128E0786D96.json
│ │ │ ├── TRAAARJ128F9320760.json
│ │ │ ├── TRAAAVG12903CFA543.json
│ │ │ └── TRAAAVO128F93133D4.json
│ │ ├── B
│ │ │ ├── TRAABCL128F4286650.json
│ │ │ ├── TRAABDL12903CAABBA.json
│ │ │ ├── TRAABJL12903CDCF1A.json
│ │ │ ├── TRAABJV128F1460C49.json
│ │ │ ├── TRAABLR128F423B7E3.json
│ │ │ ├── TRAABNV128F425CEE1.json
│ │ │ ├── TRAABRB128F9306DD5.json
│ │ │ ├── TRAABVM128F92CA9DC.json
│ │ │ ├── TRAABXG128F9318EBD.json
│ │ │ ├── TRAABYN12903CFD305.json
│ │ │ └── TRAABYW128F4244559.json
│ │ └── C
│ │ │ ├── TRAACCG128F92E8A55.json
│ │ │ ├── TRAACER128F4290F96.json
│ │ │ ├── TRAACFV128F935E50B.json
│ │ │ ├── TRAACHN128F1489601.json
│ │ │ ├── TRAACIW12903CC0F6D.json
│ │ │ ├── TRAACLV128F427E123.json
│ │ │ ├── TRAACNS128F14A2DF5.json
│ │ │ ├── TRAACOW128F933E35F.json
│ │ │ ├── TRAACPE128F421C1B9.json
│ │ │ ├── TRAACQT128F9331780.json
│ │ │ ├── TRAACSL128F93462F4.json
│ │ │ ├── TRAACTB12903CAAF15.json
│ │ │ ├── TRAACVS128E078BE39.json
│ │ │ └── TRAACZK128F4243829.json
│ │ └── B
│ │ ├── A
│ │ ├── TRABACN128F425B784.json
│ │ ├── TRABAFJ128F42AF24E.json
│ │ ├── TRABAFP128F931E9A1.json
│ │ ├── TRABAIO128F42938F9.json
│ │ ├── TRABATO128F42627E9.json
│ │ ├── TRABAVQ12903CBF7E0.json
│ │ ├── TRABAWW128F4250A31.json
│ │ ├── TRABAXL128F424FC50.json
│ │ ├── TRABAXR128F426515F.json
│ │ ├── TRABAXV128F92F6AE3.json
│ │ └── TRABAZH128F930419A.json
│ │ ├── B
│ │ ├── TRABBAM128F429D223.json
│ │ ├── TRABBBV128F42967D7.json
│ │ ├── TRABBJE12903CDB442.json
│ │ ├── TRABBKX128F4285205.json
│ │ ├── TRABBLU128F93349CF.json
│ │ ├── TRABBNP128F932546F.json
│ │ ├── TRABBOP128F931B50D.json
│ │ ├── TRABBOR128F4286200.json
│ │ ├── TRABBTA128F933D304.json
│ │ ├── TRABBVJ128F92F7EAA.json
│ │ ├── TRABBXU128F92FEF48.json
│ │ └── TRABBZN12903CD9297.json
│ │ └── C
│ │ ├── TRABCAJ12903CDFCC2.json
│ │ ├── TRABCEC128F426456E.json
│ │ ├── TRABCEI128F424C983.json
│ │ ├── TRABCFL128F149BB0D.json
│ │ ├── TRABCIX128F4265903.json
│ │ ├── TRABCKL128F423A778.json
│ │ ├── TRABCPZ128F4275C32.json
│ │ ├── TRABCRU128F423F449.json
│ │ ├── TRABCTK128F934B224.json
│ │ ├── TRABCUQ128E0783E2B.json
│ │ ├── TRABCXB128F4286BD3.json
│ │ └── TRABCYE128F934CE1D.json
├── etl.ipynb
├── etl.py
├── schema.PNG
├── sql_queries.py
└── test.ipynb
├── 2. Cassandra ETL
├── ETL using Cassandra.ipynb
├── event_data
│ ├── 2018-11-01-events.csv
│ ├── 2018-11-02-events.csv
│ ├── 2018-11-03-events.csv
│ ├── 2018-11-04-events.csv
│ ├── 2018-11-05-events.csv
│ ├── 2018-11-06-events.csv
│ ├── 2018-11-07-events.csv
│ ├── 2018-11-08-events.csv
│ ├── 2018-11-09-events.csv
│ ├── 2018-11-10-events.csv
│ ├── 2018-11-11-events.csv
│ ├── 2018-11-12-events.csv
│ ├── 2018-11-13-events.csv
│ ├── 2018-11-14-events.csv
│ ├── 2018-11-15-events.csv
│ ├── 2018-11-16-events.csv
│ ├── 2018-11-17-events.csv
│ ├── 2018-11-18-events.csv
│ ├── 2018-11-19-events.csv
│ ├── 2018-11-20-events.csv
│ ├── 2018-11-21-events.csv
│ ├── 2018-11-22-events.csv
│ ├── 2018-11-23-events.csv
│ ├── 2018-11-24-events.csv
│ ├── 2018-11-25-events.csv
│ ├── 2018-11-26-events.csv
│ ├── 2018-11-27-events.csv
│ ├── 2018-11-28-events.csv
│ ├── 2018-11-29-events.csv
│ └── 2018-11-30-events.csv
├── event_datafile_new.csv
└── images
│ └── image_event_datafile_new.jpg
├── 3. Web Scraping using Scrapy, Mongo ETL
├── README.md
├── books.PNG
├── books
│ ├── books
│ │ ├── __init__.py
│ │ ├── items.py
│ │ ├── middlewares.py
│ │ ├── pipelines.py
│ │ ├── settings.py
│ │ └── spiders
│ │ │ ├── __init__.py
│ │ │ └── books_spider.py
│ └── scrapy.cfg
└── requirements.txt
├── 4. Data Warehousing with AWS Redshift
├── README.md
├── create_tables.py
├── dwh.cfg
├── etl.py
├── redshift_cluster_setup.py
├── redshift_cluster_teardown.py
├── screenshots
│ ├── architecture.PNG
│ ├── redshift.PNG
│ └── schema.PNG
└── sql_queries.py
├── 5. Data Lake with Spark & AWS S3
├── README.md
├── etl.py
└── screenshots
│ ├── s3.PNG
│ ├── schema.PNG
│ └── spark.PNG
├── 6. Data Pipelining with Airflow
├── README.md
├── airflow
│ ├── dags
│ │ ├── create_tables.sql
│ │ └── sparkify_dwh_dag.py
│ └── plugins
│ │ ├── __init__.py
│ │ ├── helpers
│ │ ├── __init__.py
│ │ └── sql_queries.py
│ │ └── operators
│ │ ├── __init__.py
│ │ ├── data_quality.py
│ │ ├── load_dimension.py
│ │ ├── load_fact.py
│ │ └── stage_redshift.py
└── screenshots
│ ├── airflow.png
│ ├── dag.PNG
│ └── schema.PNG
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | env/
2 | __pycache__/
3 | .ipynb_checkpoints/
4 | *.zip
--------------------------------------------------------------------------------
/0. Back to Basics/1. Intro to Data Modelling/Creating Table with Cassandra/Creating_a_Table_with_Apache_Cassandra.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 1 Exercise 2: Creating a Table with Apache Cassandra\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### Walk through the basics of Apache Cassandra. Complete the following tasks:
Create a table in Apache Cassandra, Insert rows of data, Run a simple SQL query to validate the information.
\n",
16 | "`#####` denotes where the code needs to be completed.\n",
17 | " \n",
18 | "Note: __Do not__ click the blue Preview button in the lower taskbar"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "#### Import Apache Cassandra python package"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 1,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "import cassandra"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "### Create a connection to the database"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "from cassandra.cluster import Cluster\n",
51 | "try: \n",
52 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
53 | " session = cluster.connect()\n",
54 | "except Exception as e:\n",
55 | " print(e)\n",
56 | " "
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### TO-DO: Create a keyspace to do the work in "
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 4,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "## TO-DO: Create the keyspace\n",
73 | "try:\n",
74 | " session.execute(\"\"\"\n",
75 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
76 | " WITH REPLICATION = \n",
77 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
78 | ")\n",
79 | "\n",
80 | "except Exception as e:\n",
81 | " print(e)"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "### TO-DO: Connect to the Keyspace"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 5,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "## To-Do: Add in the keyspace you created\n",
98 | "try:\n",
99 | " session.set_keyspace('udacity')\n",
100 | "except Exception as e:\n",
101 | " print(e)"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. \n",
109 | "\n",
110 | "`song_title\n",
111 | "artist_name\n",
112 | "year\n",
113 | "album_name\n",
114 | "single`"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### TO-DO: You need to create a table to be able to run the following query: \n",
122 | "`select * from songs WHERE year=1970 AND artist_name=\"The Beatles\"`"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 20,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "## TO-DO: Complete the query below\n",
132 | "query = \"CREATE TABLE IF NOT EXISTS songs \"\n",
133 | "query = query + \"(year int, artist_name text, song_title text, album_name text, single boolean, PRIMARY KEY (year, artist_name))\"\n",
134 | "try:\n",
135 | " session.execute(query)\n",
136 | "except Exception as e:\n",
137 | " print(e)\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "### TO-DO: Insert the following two rows in your table\n",
145 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n",
146 | "\n",
147 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 22,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "## Add in query and then run the insert statement\n",
157 | "query = \"INSERT INTO songs (album_name, artist_name, year, single, song_title)\" \n",
158 | "query = query + \" VALUES (%s, %s, %s, %s, %s)\"\n",
159 | "\n",
160 | "try:\n",
161 | " session.execute(query, (\"Across The Universe\", \"The Beatles\", 1970, False, \"Let It Be\"))\n",
162 | "except Exception as e:\n",
163 | " print(e)\n",
164 | " \n",
165 | "try:\n",
166 | " session.execute(query, (\"The Beatles\", \"Think For Yourself\", 1965, False, \"Rubber Soul\"))\n",
167 | "except Exception as e:\n",
168 | " print(e)"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "### TO-DO: Validate your data was inserted into the table."
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 23,
181 | "metadata": {
182 | "scrolled": true
183 | },
184 | "outputs": [
185 | {
186 | "name": "stdout",
187 | "output_type": "stream",
188 | "text": [
189 | "1965 The Beatles Think For Yourself\n",
190 | "1970 Across The Universe The Beatles\n"
191 | ]
192 | }
193 | ],
194 | "source": [
195 | "## TO-DO: Complete and then run the select statement to validate the data was inserted into the table\n",
196 | "query = 'SELECT * FROM songs'\n",
197 | "try:\n",
198 | " rows = session.execute(query)\n",
199 | "except Exception as e:\n",
200 | " print(e)\n",
201 | " \n",
202 | "for row in rows:\n",
203 | " print (row.year, row.album_name, row.artist_name)"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "### TO-DO: Validate the Data Model with the original query.\n",
211 | "\n",
212 | "`select * from songs WHERE YEAR=1970 AND artist_name=\"The Beatles\"`"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 24,
218 | "metadata": {},
219 | "outputs": [
220 | {
221 | "name": "stdout",
222 | "output_type": "stream",
223 | "text": [
224 | "1970 Across The Universe The Beatles\n"
225 | ]
226 | }
227 | ],
228 | "source": [
229 | "##TO-DO: Complete the select statement to run the query \n",
230 | "query = \"select * from songs WHERE YEAR=1970 AND artist_name='The Beatles'\"\n",
231 | "try:\n",
232 | " rows = session.execute(query)\n",
233 | "except Exception as e:\n",
234 | " print(e)\n",
235 | " \n",
236 | "for row in rows:\n",
237 | " print (row.year, row.album_name, row.artist_name)"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "### And Finally close the session and cluster connection"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 25,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "session.shutdown()\n",
254 | "cluster.shutdown()"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": []
263 | }
264 | ],
265 | "metadata": {
266 | "kernelspec": {
267 | "display_name": "Python 3",
268 | "language": "python",
269 | "name": "python3"
270 | },
271 | "language_info": {
272 | "codemirror_mode": {
273 | "name": "ipython",
274 | "version": 3
275 | },
276 | "file_extension": ".py",
277 | "mimetype": "text/x-python",
278 | "name": "python",
279 | "nbconvert_exporter": "python",
280 | "pygments_lexer": "ipython3",
281 | "version": "3.7.0"
282 | }
283 | },
284 | "nbformat": 4,
285 | "nbformat_minor": 2
286 | }
287 |
--------------------------------------------------------------------------------
/0. Back to Basics/1. Intro to Data Modelling/Creating Table with Postgres/creating-a-table-with-postgres-0.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 1 Demo 0: PostgreSQL and AutoCommits\n",
8 | "\n",
9 | "
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Walk through the basics of PostgreSQL autocommits "
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "## import postgreSQL adapter for the Python\n",
26 | "import psycopg2"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "### Create a connection to the database\n",
34 | "1. Connect to the local instance of PostgreSQL (*127.0.0.1*)\n",
35 | "2. Use the database/schema from the instance. \n",
36 | "3. The connection reaches out to the database (*studentdb*) and use the correct privilages to connect to the database (*user and password = student*)."
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "### Use the connection to get a cursor that will be used to execute queries."
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {},
59 | "outputs": [],
60 | "source": [
61 | "cur = conn.cursor()"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "### Create a database to work in"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "cur.execute(\"select * from test\")"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "### Error occurs, but it was to be expected because table has not been created as yet. To fix the error, create the table. "
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "cur.execute(\"CREATE TABLE test (col1 int, col2 int, col3 int);\")"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "### Error indicates we cannot execute this query. Since we have not committed the transaction and had an error in the transaction block, we are blocked until we restart the connection."
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")\n",
110 | "cur = conn.cursor()"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "In our exercises instead of worrying about commiting each transaction or getting a strange error when we hit something unexpected, let's set autocommit to true. **This says after each call during the session commit that one action and do not hold open the transaction for any other actions. One action = one transaction.**"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "In this demo we will use automatic commit so each action is commited without having to call `conn.commit()` after each command. **The ability to rollback and commit transactions are a feature of Relational Databases.**"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "conn.set_session(autocommit=True)"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "cur.execute(\"select * from test\")"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": null,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "cur.execute(\"CREATE TABLE test (col1 int, col2 int, col3 int);\")"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "### Once autocommit is set to true, we execute this code successfully. There were no issues with transaction blocks and we did not need to restart our connection. "
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "cur.execute(\"select * from test\")"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "cur.execute(\"select count(*) from test\")\n",
177 | "print(cur.fetchall())"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {},
184 | "outputs": [],
185 | "source": []
186 | }
187 | ],
188 | "metadata": {
189 | "kernelspec": {
190 | "display_name": "Python 3",
191 | "language": "python",
192 | "name": "python3"
193 | },
194 | "language_info": {
195 | "codemirror_mode": {
196 | "name": "ipython",
197 | "version": 3
198 | },
199 | "file_extension": ".py",
200 | "mimetype": "text/x-python",
201 | "name": "python",
202 | "nbconvert_exporter": "python",
203 | "pygments_lexer": "ipython3",
204 | "version": "3.7.2"
205 | }
206 | },
207 | "nbformat": 4,
208 | "nbformat_minor": 2
209 | }
210 |
--------------------------------------------------------------------------------
/0. Back to Basics/1. Intro to Data Modelling/README.md:
--------------------------------------------------------------------------------
1 | ## The Data Modelling Process:
2 | 1. Gather requirements
3 | 2. Conceptual Data Modelling
4 | 3. Logical Data Modelling
5 |
6 | ## Important Features of an RDBMS (ACID):
7 | **Atomicity** - The whole transaction is processed or nothing is processed
8 | **Consistency** - Only transactions abiding by constraints & rules are written into the database
9 | **Isolation** - Transactions proceed independently and securely
10 | **Durability** - Once transactions are committed, they remain committed
11 |
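As a concrete illustration of atomicity, here is a minimal sketch reusing the psycopg2/studentdb connection pattern from the notebooks in this repo; the `accounts` table and its duplicate-key failure are hypothetical:

```python
import psycopg2

# Minimal sketch of atomicity; the accounts table (with a primary key on id) is hypothetical.
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
try:
    with conn:  # one transaction: commits on success, rolls back on any error
        with conn.cursor() as cur:
            cur.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
            cur.execute("INSERT INTO accounts (id, balance) VALUES (1, 200)")  # violates the primary key
except psycopg2.Error as e:
    # Atomicity: the whole transaction was rolled back, so neither row was written
    print("transaction rolled back:", e)
finally:
    conn.close()
```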
12 | ## When to use Relational Database?
13 | ### Advantages of Using a Relational Database
14 | * Flexibility for writing SQL queries: SQL is the most common database query language.
15 | * Model the data, not the queries
16 | * Ability to do JOINs
17 | * Ability to do aggregations and analytics
18 | * Secondary indexes available: you have the advantage of being able to add another index to help with quick searching.
19 | * Smaller data volumes: if you have a smaller data volume (and not big data) you can use a relational database for its simplicity.
20 | * ACID transactions: a set of properties of database transactions intended to guarantee validity even in the event of errors or power failures, and thus maintain data integrity.
21 | * Easier to adapt to changing business requirements
22 |
23 | ## When to use NoSQL Database?
24 | ### Advantages of Using a NoSQL Database
25 | * Need to be able to store different data type formats: NoSQL was also created to handle different data configurations: structured, semi-structured, and unstructured data. JSON, XML documents can all be handled easily with NoSQL.
26 | * Large amounts of data: Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be able to be horizontally scalable. The more servers/systems you add to the database the more data that can be hosted with high availability and low latency (fast reads and writes).
27 | * Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data
28 | * Need high throughput: While ACID transactions bring benefits they also slow down the process of reading and writing data. If you need very fast reads and writes using a relational database may not suit your needs.
29 | * Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
30 | * Need high availability: Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time.
--------------------------------------------------------------------------------
/0. Back to Basics/2. Relational Data Models/README.md:
--------------------------------------------------------------------------------
1 | ## Importance of Relational Databases:
2 | ---
3 | * Standardization of data model: Once your data is transformed into the rows and columns format, it is standardized and you can query it with SQL
4 | * Flexibility in adding and altering tables: Relational databases give you the flexibility to add tables, alter tables, and add and remove data.
5 | * Data Integrity: Data integrity is the backbone of using a relational database.
6 | * Structured Query Language (SQL): A standard, predefined language can be used to access the data.
7 | * Simplicity: Data is systematically stored and modeled in tabular format.
8 | * Intuitive Organization: The spreadsheet format is intuitive to data modeling in a relational database.
9 |
10 | ## OLAP vs OLTP:
11 | ---
12 | * Online Analytical Processing (OLAP):
13 | Databases optimized for these workloads allow for complex analytical and ad hoc queries, including aggregations. These types of databases are optimized for reads.
14 |
15 | * Online Transactional Processing (OLTP):
16 | Databases optimized for these workloads allow for less complex queries in large volume. The types of queries for these databases are read, insert, update, and delete.
17 |
18 | * The key to remembering the difference between OLAP and OLTP is analytics (A) vs transactions (T). If you want to get the price of a shoe, you are using OLTP (this has very little or no aggregation). If you want to know the total stock of shoes a particular store sold, then this requires OLAP (since it requires aggregation).
19 |
20 | ## Normal Forms:
21 | ---
22 | ### Objectives:
23 | 1. To free the database from unwanted insertions, updates, & deletion dependencies
24 | 2. To reduce the need for refactoring the database as new types of data are introduced
25 | 3. To make the relational model more informative to users
26 | 4. To make the database neutral to the query statistics
27 |
28 | ### Types of Normal Forms:
29 | #### First Normal Form (1NF):
30 | * Atomic values: each cell contains unique and single values
31 | * Be able to add data without altering tables
32 | * Separate different relations into different tables
33 | * Keep relationships between tables together with foreign keys
34 |
35 | #### Second Normal Form (2NF):
36 | * Have reached 1NF
37 | * All columns in the table must rely on the Primary Key
38 |
39 | #### Third Normal Form (3NF):
40 | * Must be in 2nd Normal Form
41 | * No transitive dependencies
42 | * Remember, with transitive dependencies: to get from A -> C, you want to avoid going through B (a non-key column should not depend on another non-key column).
43 |
44 | When to use 3NF:
45 | When you want to update data, you want to be able to do it in just one place.
46 |
47 | ## Denormalization:
48 | ---
49 | JOINS on the database allow for outstanding flexibility but are extremely slow. If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables. You get your data into normalized form, and then you proceed with denormalization. So, denormalization comes after normalization.
50 |
51 | ## Normalize vs Denormalize:
52 | ---
53 | Normalization is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible.
54 |
55 | Denormalization is trying to increase performance by reducing the number of joins between tables (as joins can be slow). Data integrity will take a bit of a potential hit, as there will be more copies of the data (to reduce JOINS).
56 |
57 | ## Star Schema:
58 | ---
59 | * Simplest style of data mart schema
60 | * Consist of 1 or more fact tables referencing multiple dimension tables
61 |
62 | ### Benefits:
63 | * Denormalize tables, simplify queries and provide fast aggregations
64 |
65 | ### Drawbacks:
66 | * Issues that come with denormalization
67 | * Data Integrity
68 | * Decrease Query Flexibility
69 | * Many to many relationship -- simplified
70 |
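As a minimal sketch of the star schema described above (table and column names are hypothetical), here is a fact table referencing two dimension tables, created through psycopg2 as in the notebooks:

```python
import psycopg2

# Hypothetical star schema: one fact table (fact_sales) and two dimension tables.
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Dimension tables hold descriptive attributes.
cur.execute("CREATE TABLE IF NOT EXISTS dim_store (store_id int PRIMARY KEY, city varchar, state varchar);")
cur.execute("CREATE TABLE IF NOT EXISTS dim_date (date_id date PRIMARY KEY, month int, year int);")

# The fact table holds the measures and a foreign key to each dimension.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        sale_id  serial PRIMARY KEY,
        store_id int REFERENCES dim_store (store_id),
        date_id  date REFERENCES dim_date (date_id),
        amount   numeric
    );
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
cur.execute("""
    SELECT d.year, s.state, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date  d ON f.date_id  = d.date_id
    JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY d.year, s.state;
""")
print(cur.fetchall())
conn.close()
```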
71 | ## Snowflake Schema:
72 | ---
73 | * Logical arrangement of tables in a multidimensional database
74 | * Represented by centralized fact tables that are connected to multiple dimensions
75 | * Dimensions of snowflake schema are elaborated, having multiple levels of relationships, child tables having multiple parents
76 | * Star schema is a special, simplified case of snowflake schema
77 | * Star schema does not allow one-to-many relationships between dimension tables, while snowflake schema does (dimensions can be normalized into parent-child tables)
--------------------------------------------------------------------------------
/0. Back to Basics/3. NoSQL Data Models/3. Clustering Column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 3 Exercise 3 Solution: Focus on Clustering Columns\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n",
23 | "! pip install cassandra-driver\n",
24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "#### Import Apache Cassandra python package"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import cassandra"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "### Create a connection to the database"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 2,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "from cassandra.cluster import Cluster\n",
57 | "try: \n",
58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
59 | " session = cluster.connect()\n",
60 | "except Exception as e:\n",
61 | " print(e)"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "### Create a keyspace to work in "
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "try:\n",
78 | " session.execute(\"\"\"\n",
79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
80 | " WITH REPLICATION = \n",
81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
82 | ")\n",
83 | "\n",
84 | "except Exception as e:\n",
85 | " print(e)"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. "
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 4,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "try:\n",
102 | " session.set_keyspace('udacity')\n",
103 | "except Exception as e:\n",
104 | " print(e)"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "### Imagine we would like to start creating a new Music Library of albums. \n",
112 | "\n",
113 | "### We want to ask 1 question of our data:\n",
114 | "#### 1. Give me all the information from the music library about a given album\n",
115 | "`select * from album_library WHERE album_name=\"Close To You\"`\n"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "### Here is the Data:\n",
123 | "
"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "### How should we model this data? What should be our Primary Key and Partition Key? \n",
131 | "\n",
132 | "### Since the data is looking for the `ALBUM_NAME` let's start with that. From there we will need to add other elements to make sure the Key is unique. We also need to add the `ARTIST_NAME` as Clustering Columns to make the data unique. That should be enough to make the row key unique.\n",
133 | "\n",
134 | "`Table Name: music_library\n",
135 | "column 1: Year\n",
136 | "column 2: Artist Name\n",
137 | "column 3: Album Name\n",
138 | "Column 4: City\n",
139 | "PRIMARY KEY(album name, artist name)`"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 5,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n",
149 | "query = query + \"(album_name text, artist_name text, year int, city text, PRIMARY KEY (album_name, artist_name))\"\n",
150 | "try:\n",
151 | " session.execute(query)\n",
152 | "except Exception as e:\n",
153 | " print(e)"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "### Insert the data into the table"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 6,
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "query = \"INSERT INTO music_library (album_name, artist_name, year, city)\"\n",
170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n",
171 | "\n",
172 | "try:\n",
173 | " session.execute(query, (\"Let it Be\", \"The Beatles\", 1970, \"Liverpool\"))\n",
174 | "except Exception as e:\n",
175 | " print(e)\n",
176 | " \n",
177 | "try:\n",
178 | " session.execute(query, (\"Rubber Soul\", \"The Beatles\", 1965, \"Oxford\"))\n",
179 | "except Exception as e:\n",
180 | " print(e)\n",
181 | " \n",
182 | "try:\n",
183 | " session.execute(query, (\"Beatles For Sale\", \"The Beatles\", 1964, \"London\"))\n",
184 | "except Exception as e:\n",
185 | " print(e)\n",
186 | "\n",
187 | "try:\n",
188 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n",
189 | "except Exception as e:\n",
190 | " print(e)\n",
191 | "\n",
192 | "try:\n",
193 | " session.execute(query, (\"Close To You\", \"The Carpenters\", 1970, \"San Diego\"))\n",
194 | "except Exception as e:\n",
195 | " print(e)"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "### Validate the Data Model -- Did it work?\n",
203 | "`select * from album_library WHERE album_name=\"Close To You\"`"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 7,
209 | "metadata": {},
210 | "outputs": [
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "The Carpenters Close To You San Diego 1970\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "query = \"select * from music_library WHERE album_NAME='Close To You'\"\n",
221 | "try:\n",
222 | " rows = session.execute(query)\n",
223 | "except Exception as e:\n",
224 | " print(e)\n",
225 | " \n",
226 | "for row in rows:\n",
227 | " print (row.artist_name, row.album_name, row.city, row.year)"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "### Success it worked! We created a unique Primary key that evenly distributed our data, with clustering columns"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "### For the sake of the demo, drop the table"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 8,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "query = \"drop table music_library\"\n",
251 | "try:\n",
252 | " rows = session.execute(query)\n",
253 | "except Exception as e:\n",
254 | " print(e)\n"
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "### Close the session and cluster connection"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 9,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "session.shutdown()\n",
271 | "cluster.shutdown()"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 1,
277 | "metadata": {},
278 | "outputs": [
279 | {
280 | "name": "stderr",
281 | "output_type": "stream",
282 | "text": [
283 | "'zip' is not recognized as an internal or external command,\n",
284 | "operable program or batch file.\n"
285 | ]
286 | }
287 | ],
288 | "source": [
289 | "! zip ."
290 | ]
291 | }
292 | ],
293 | "metadata": {
294 | "kernelspec": {
295 | "display_name": "Python 3",
296 | "language": "python",
297 | "name": "python3"
298 | },
299 | "language_info": {
300 | "codemirror_mode": {
301 | "name": "ipython",
302 | "version": 3
303 | },
304 | "file_extension": ".py",
305 | "mimetype": "text/x-python",
306 | "name": "python",
307 | "nbconvert_exporter": "python",
308 | "pygments_lexer": "ipython3",
309 | "version": "3.7.0"
310 | }
311 | },
312 | "nbformat": 4,
313 | "nbformat_minor": 2
314 | }
315 |
--------------------------------------------------------------------------------
/0. Back to Basics/3. NoSQL Data Models/4. Using the WHERE Clause.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Lesson 3 Demo 4: Using the WHERE Clause\n",
8 | "
"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.\n",
16 | "\n",
17 | "##### denotes where the code needs to be completed.\n",
18 | "\n",
19 | "Note: __Do not__ click the blue Preview button in the lower task bar"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n",
27 | "! pip install cassandra-driver\n",
28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "#### Import Apache Cassandra python package"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 5,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "import cassandra"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "### First let's create a connection to the database"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 6,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "from cassandra.cluster import Cluster\n",
61 | "try: \n",
62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n",
63 | " session = cluster.connect()\n",
64 | "except Exception as e:\n",
65 | " print(e)"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "### Let's create a keyspace to do our work in "
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 7,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "try:\n",
82 | " session.execute(\"\"\"\n",
83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n",
84 | " WITH REPLICATION = \n",
85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n",
86 | ")\n",
87 | "\n",
88 | "except Exception as e:\n",
89 | " print(e)"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. "
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 8,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "try:\n",
106 | " session.set_keyspace('udacity')\n",
107 | "except Exception as e:\n",
108 | " print(e)"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "### Let's imagine we would like to start creating a new Music Library of albums. \n",
116 | "### We want to ask 4 question of our data\n",
117 | "#### 1. Give me every album in my music library that was released in a 1965 year\n",
118 | "#### 2. Give me the album that is in my music library that was released in 1965 by \"The Beatles\"\n",
119 | "#### 3. Give me all the albums released in a given year that was made in London \n",
120 | "#### 4. Give me the city that the album \"Rubber Soul\" was recorded"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "### Here is our Collection of Data\n",
128 | "
"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the YEAR let's start with that. From there we will add clustering columns on Artist Name and Album Name."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 9,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n",
145 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))\"\n",
146 | "try:\n",
147 | " session.execute(query)\n",
148 | "except Exception as e:\n",
149 | " print(e)"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "### Let's insert our data into of table"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 10,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n",
166 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n",
167 | "\n",
168 | "try:\n",
169 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n",
170 | "except Exception as e:\n",
171 | " print(e)\n",
172 | " \n",
173 | "try:\n",
174 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n",
175 | "except Exception as e:\n",
176 | " print(e)\n",
177 | " \n",
178 | "try:\n",
179 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n",
180 | "except Exception as e:\n",
181 | " print(e)\n",
182 | "\n",
183 | "try:\n",
184 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n",
185 | "except Exception as e:\n",
186 | " print(e)\n",
187 | "\n",
188 | "try:\n",
189 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n",
190 | "except Exception as e:\n",
191 | " print(e)"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "### Let's Validate our Data Model with our 4 queries.\n",
199 | "\n",
200 | "Query 1: "
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": 13,
206 | "metadata": {},
207 | "outputs": [
208 | {
209 | "name": "stdout",
210 | "output_type": "stream",
211 | "text": [
212 | "1965 The Beatles Rubber Soul Oxford\n",
213 | "1965 The Who My Generation London\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "query = \"SELECT * FROM music_library WHERE year=1965\"\n",
219 | "try:\n",
220 | " rows = session.execute(query)\n",
221 | "except Exception as e:\n",
222 | " print(e)\n",
223 | " \n",
224 | "for row in rows:\n",
225 | " print (row.year, row.artist_name, row.album_name, row.city)"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | " Let's try the 2nd query.\n",
233 | " Query 2: "
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 14,
239 | "metadata": {},
240 | "outputs": [
241 | {
242 | "name": "stdout",
243 | "output_type": "stream",
244 | "text": [
245 | "1965 The Beatles Rubber Soul Oxford\n"
246 | ]
247 | }
248 | ],
249 | "source": [
250 | "query = \"SELECT * FROM music_library WHERE year=1965 AND artist_name='The Beatles'\"\n",
251 | "try:\n",
252 | " rows = session.execute(query)\n",
253 | "except Exception as e:\n",
254 | " print(e)\n",
255 | " \n",
256 | "for row in rows:\n",
257 | " print (row.year, row.artist_name, row.album_name, row.city)"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "### Let's try the 3rd query.\n",
265 | "Query 3: "
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 15,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "name": "stdout",
275 | "output_type": "stream",
276 | "text": [
277 | "Error from server: code=2200 [Invalid query] message=\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\"\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "query = \"SELECT * FROM music_library WHERE city='London'\"\n",
283 | "try:\n",
284 | " rows = session.execute(query)\n",
285 | "except Exception as e:\n",
286 | " print(e)\n",
287 | " \n",
288 | "for row in rows:\n",
289 | " print (row.year, row.artist_name, row.album_name, row.city)"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "### Did you get an error? You can not try to access a column or a clustering column if you have not used the other defined clustering column. Let's see if we can try it a different way. \n",
297 | "Try Query 4: \n",
298 | "\n"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 17,
304 | "metadata": {},
305 | "outputs": [
306 | {
307 | "name": "stdout",
308 | "output_type": "stream",
309 | "text": [
310 | "Oxford\n"
311 | ]
312 | }
313 | ],
314 | "source": [
315 | "query = \"SELECT city FROM music_library WHERE year=1965 AND artist_name='The Beatles' AND album_name='Rubber Soul'\"\n",
316 | "try:\n",
317 | " rows = session.execute(query)\n",
318 | "except Exception as e:\n",
319 | " print(e)\n",
320 | " \n",
321 | "for row in rows:\n",
322 | " print (row.city)"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "### And Finally close the session and cluster connection"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 18,
335 | "metadata": {},
336 | "outputs": [],
337 | "source": [
338 | "session.shutdown()\n",
339 | "cluster.shutdown()"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": []
348 | }
349 | ],
350 | "metadata": {
351 | "kernelspec": {
352 | "display_name": "Python 3",
353 | "language": "python",
354 | "name": "python3"
355 | },
356 | "language_info": {
357 | "codemirror_mode": {
358 | "name": "ipython",
359 | "version": 3
360 | },
361 | "file_extension": ".py",
362 | "mimetype": "text/x-python",
363 | "name": "python",
364 | "nbconvert_exporter": "python",
365 | "pygments_lexer": "ipython3",
366 | "version": "3.7.0"
367 | }
368 | },
369 | "nbformat": 4,
370 | "nbformat_minor": 2
371 | }
372 |
--------------------------------------------------------------------------------
/0. Back to Basics/3. NoSQL Data Models/README.md:
--------------------------------------------------------------------------------
1 | ## Data Modelling using NoSQL Databases
2 | ---
3 | * NoSQL stands for "Not only SQL"
4 | * When Not to Use SQL?
5 | * Need high Availability in the data: Indicates the system is always up and there is no downtime
6 | * Have Large Amounts of Data
7 | * Need Linear Scalability: The need to add more nodes to the system so performance will increase linearly
8 | * Low Latency: Shorter delay before the data is transferred once the instruction for the transfer has been received.
9 | * Need fast reads and writes
10 |
11 | ## Distributed Databases
12 | ---
13 | * Data is stored on multiple machines
14 | * Eventual Consistency:
15 | Over time (if no new changes are made) each copy of the data will be the same, but if there are new changes, the data may be different in different locations. The data may be inconsistent for only milliseconds. There are workarounds in place to prevent getting stale data.
16 | * CAP Theorem:
17 | * It is impossible for a distributed data store to simultaneously provide more than 2 out of 3 guarantees of CAP
18 | * **Consistency**: Every read from the database gets the latest (and correct) piece of data or an error
19 | * **Availability**: Every request is received and a response is given -- without a guarantee that the data is the latest update
20 | * **Partition Tolerance**: The system continues to work regardless of losing network connectivity between nodes
21 | * Which of these combinations is desirable for a production system - Consistency and Availability, Consistency and Partition Tolerance, or Availability and Partition Tolerance?
22 | * As the CAP Theorem Wikipedia entry says, "The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability." So there is no such thing as Consistency and Availability in a distributed database since it must always tolerate network issues. You can only have Consistency and Partition Tolerance (CP) or Availability and Partition Tolerance (AP). Supporting Availability and Partition Tolerance makes sense, since Availability and Partition Tolerance are the biggest requirements.
23 | * Data Modeling in Apache Cassandra:
24 | * Denormalization is not just okay -- it's a must, for fast reads
25 | * Apache Cassandra has been optimized for fast writes
26 | * ALWAYS think Queries first, one table per query is a great strategy
27 | * Apache Cassandra does not allow for JOINs between tables
28 | * Primary Key must be unique
29 | * The PRIMARY KEY is made up of either just the PARTITION KEY or may also include additional CLUSTERING COLUMNS
30 | * A Simple PRIMARY KEY is just one column that is also the PARTITION KEY. A Composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value and in your retrieval queries
31 | * The PARTITION KEY will determine the distribution of data across the system
32 | * WHERE clause
33 | * Data Modeling in Apache Cassandra is query focused, and that focus needs to be on the WHERE clause
34 | * A WHERE clause that does not restrict the partition key (or that filters on a non-key column) will result in an error unless ALLOW FILTERING is used, which is discouraged
35 |
36 |
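A minimal sketch of these ideas, using the cassandra-driver pattern from the notebooks (a local cluster and the `udacity` keyspace created there are assumed); the outer parentheses in the PRIMARY KEY mark the partition key, and the remaining columns are clustering columns:

```python
from cassandra.cluster import Cluster

# Local cluster and the udacity keyspace (created in the notebooks) are assumed.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('udacity')

# Partition key: year (decides which node stores the row).
# Clustering columns: artist_name, album_name (decide sort order within a partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS music_library (
        year int,
        artist_name text,
        album_name text,
        city text,
        PRIMARY KEY ((year), artist_name, album_name)
    )
""")

# Valid: the WHERE clause restricts the partition key first, then clustering
# columns in their declared order.
rows = session.execute(
    "SELECT * FROM music_library WHERE year = 1965 AND artist_name = 'The Beatles'"
)

# Invalid without ALLOW FILTERING: city is neither a partition key nor a clustering column.
# session.execute("SELECT * FROM music_library WHERE city = 'London'")

cluster.shutdown()
```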
--------------------------------------------------------------------------------
/0. Back to Basics/4. Data Warehouses/README.md:
--------------------------------------------------------------------------------
1 | ## What is a Data Warehouse?
2 | ---
3 | * A Data Warehouse is a system (including processes, technologies & data representations) that enables support for analytical processing
4 |
5 | * Goals of a Data Warehouse:
6 | * Simple to understand
7 | * Performant
8 | * Quality Assured
9 | * Handles new business questions well
10 | * Secure
11 |
12 | ## Architecture
13 | ---
14 | * Several possible architectures to building a Data Warehouse
15 | 1. **Kimball's Bus Architecture**:
16 | 
17 | * Results in common dimension data models shared by different business departments
18 | * Data is not kept at an aggregated level, rather they are at the atomic level
19 | * Organized by business processes, used by different departments
20 | 2. **Independent Data Marts**:
21 | 
22 | * Independent Data Marts have ETL processes that are designed by specific business departments to meet their analytical needs
23 | * Different fact tables for the same events, no conformed dimensions
24 | * Uncoordinated efforts can lead to inconsistent views
25 | * Generally discouraged
26 | 3. **Inmon's Corporate Information Factory**:
27 | 
28 | * The Enterprise Data Warehouse provides a normalized data architecture before individual departments build on it
29 | * 2 ETL Processes:
30 | * Source systems -> 3NF DB
31 | * 3NF DB -> Departmental Data Marts
32 | * The Data Marts use a source 3NF model (single integrated source of truth) and add denormalization based on department needs
33 | * Data marts are dimensionally modelled and, unlike Kimball's dimensional models, they are mostly aggregated
34 | 4. **Hybrid Kimball Bus & Inmon CIF**:
35 | 
36 |
37 | ## OLAP Cubes
38 | ---
39 | * An OLAP Cube is an aggregation of a fact metric on a number of dimensions
40 |
41 | * OLAP cubes need to store the finest grain of data in case drill-down is needed
42 |
43 | * Operations:
44 | 1. Roll-up & Drill-Down
45 | * Roll-Up: eg, from sales at city level, sum up sales of each city by country
46 | * Drill-Down: eg, decompose the sales of each city into smaller districts
47 | 2. Slice & Dice
48 | * Slice: Reduce N dimensions to N-1 dimensions by restricting one dimension to a single value
49 | * Dice: Same dimensions but computing a sub-cube by restricting some of the values of the dimensions
50 | Eg month in ['Feb', 'Mar'] and movie in ['Avatar', 'Batman']
51 |
52 | * Query Optimization
53 | * Business users typically want to slice, dice, rollup and drill-down
54 | * Without precomputation, each sub-combination of dimensions requires another pass through the whole fact table
55 | * Using the CUBE operation ("GROUP BY CUBE") and saving the output is usually enough to answer forthcoming aggregations from business users without having to process the whole fact table again
56 |
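A minimal sketch of the CUBE operation (the `fact_sales` table with month/movie/revenue columns is hypothetical, echoing the slice-and-dice example above), run through psycopg2 against PostgreSQL, which supports `GROUP BY CUBE` natively:

```python
import psycopg2

# Hypothetical fact table: fact_sales(month, movie, revenue).
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
cur = conn.cursor()

# One pass over the fact table produces every aggregation combination:
# (month, movie), (month), (movie), and the grand total.
cur.execute("""
    SELECT month, movie, SUM(revenue)
    FROM fact_sales
    GROUP BY CUBE (month, movie);
""")
for row in cur.fetchall():
    print(row)
conn.close()
```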
57 | * Serving OLAP Cubes
58 | * Approach 1: Pre-aggregate the OLAP cubes and save them on a special purpose non-relational database (MOLAP)
59 | * Approach 2: Compute the OLAP Cubes on the fly from existing relational databases where the dimensional model resides (ROLAP)
--------------------------------------------------------------------------------
/0. Back to Basics/4. Data Warehouses/snapshots/cif.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/cif.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/4. Data Warehouses/snapshots/datamart.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/datamart.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/4. Data Warehouses/snapshots/hybrid.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/hybrid.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/4. Data Warehouses/snapshots/kimball.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/kimball.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/5. Implementing Data Warehouse on AWS/README.md:
--------------------------------------------------------------------------------
1 | ## Choices for implementing Data Warehouse:
2 | ---
3 | 1. On-Premise
4 | * Need for diverse IT skills & multiple locations
5 | * Cost of ownership (capital and operational costs)
6 |
7 | 2. Cloud
8 | * Lower barriers to entry (time and money)
9 | * Scalability and elasticity out of the box
10 | * Within cloud, there are 2 ways to manage infrastructure
11 | 1. Cloud-Managed (Amazon RDS, Amazon DynamoDB, Amazon S3)
12 | * Reuse of expertise (Infrastructure as Code)
13 | * Less operational expense
14 | 2. Self-Managed (EC2 + Postgres, EC2 + Cassandra, EC2 + Unix FS)
15 |
16 | ## Amazon Redshift
17 | ---
18 | 1. Properties
19 | * Column-oriented storage; internally it is a modified PostgreSQL
20 | * Best suited for OLAP workloads
21 | * Is a Massively Parallel Processing (MPP) database
22 | * Parallelizes one query across multiple CPUs/machines
23 | * A table is partitioned and partitions are processed in parallel
24 |
25 | 2. Architecture
26 | * Leader Node:
27 | * Coordinates compute nodes
28 | * Handles external communication
29 | * Optimizes query execution
30 | * Compute Node:
31 | * Each with CPU, memory, disk and a number of slices
32 | * A node with n slices can process n partitions of a table simultaneously
33 | * Scale-up: get more powerful nodes
34 | * Scale-out: get more nodes
35 | * Example of setting up a Data Warehouse in Redshift:
36 | 
37 | Source: Udacity DE ND Lesson 3: Implementing Data Warehouses on AWS
38 |
39 | 3. Ingesting at Scale
40 | * Use the COPY command to transfer data from an S3 staging area (a sketch follows below)
41 | * If the file is large, it is better to break it up into multiple files
42 | * Either use a common prefix or a manifest file
43 | * Ingest from the same AWS region
44 | * Compress all CSV files
45 |
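A minimal sketch of such a load, assuming the CSV files were gzipped and split under a common `split/trips_part` prefix; the bucket, IAM role ARN, `trips` table and cluster endpoint are placeholders:

```python
import psycopg2

# Placeholder endpoint and credentials; in this repo they would come from dwh.cfg.
conn = psycopg2.connect("host=<cluster-endpoint> dbname=dwh user=<user> password=<password> port=5439")
cur = conn.cursor()

# All files sharing the prefix 'split/trips_part' are loaded in parallel,
# ideally one file per slice.
cur.execute("""
    COPY trips
    FROM 's3://<your-bucket>/split/trips_part'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/dwhRole'
    GZIP DELIMITER ',' REGION 'us-west-2';
""")
conn.commit()
```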
46 | 4. Optimizing Table Design
47 | * 2 possible strategies: distribution style and sorting key (a combined DDL sketch appears at the end of this section)
48 |
49 | 1. Distribution style
50 | * Even:
51 | * Round-robin over all slices for load-balancing
52 | * High cost of joining (Shuffling)
53 | * All:
54 | * Small (dimension) tables can be replicated on all slices to speed up joins
55 | * Auto:
56 | * Leave the decision to Redshift: "small enough" tables are distributed with an ALL strategy, while large tables are distributed with an EVEN strategy
57 | * KEY:
58 | * Rows with the same value of the distribution key column are placed on the same slice
59 | * Can lead to a skewed distribution if some values of the dist key are more frequent than others
60 | * Very useful for large dimension tables
61 |
62 | 2. Sorting key
63 | * Rows are sorted before distribution to slices
64 | * Minimizes query time, since each node already holds contiguous ranges of rows based on the sorting key
65 | * Useful for columns that are frequently used in sorting and filtering, such as the date dimension key and its corresponding foreign key in the fact table
66 |
67 |
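A minimal combined sketch of both strategies; the tables, columns and connection details are illustrative only:

```python
import psycopg2

# Placeholder endpoint and credentials; in this repo they would come from dwh.cfg.
conn = psycopg2.connect("host=<cluster-endpoint> dbname=dwh user=<user> password=<password> port=5439")
cur = conn.cursor()

# A small dimension is replicated to every slice (ALL), while the fact table is
# distributed and sorted on the date key it is most often joined and filtered on.
cur.execute("""
    CREATE TABLE dim_date (
        date_key INTEGER NOT NULL SORTKEY,
        date     DATE    NOT NULL
    ) DISTSTYLE ALL;

    CREATE TABLE fact_sales (
        sale_id  BIGINT IDENTITY(0,1),
        date_key INTEGER NOT NULL DISTKEY SORTKEY,
        amount   DECIMAL(10,2)
    );
""")
conn.commit()
```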
--------------------------------------------------------------------------------
/0. Back to Basics/5. Implementing Data Warehouse on AWS/dwh.cfg:
--------------------------------------------------------------------------------
1 | [AWS]
2 | KEY=
3 | SECRET=
4 |
5 | [DWH]
6 | DWH_CLUSTER_TYPE=multi-node
7 | DWH_NUM_NODES=4
8 | DWH_NODE_TYPE=dc2.large
9 |
10 | DWH_IAM_ROLE_NAME=dwhRole
11 | DWH_CLUSTER_IDENTIFIER=dwhCluster
12 | DWH_DB=dwh
13 | DWH_DB_USER=
14 | DWH_DB_PASSWORD=
15 | DWH_PORT=5439
16 |
17 |
--------------------------------------------------------------------------------
/0. Back to Basics/5. Implementing Data Warehouse on AWS/redshift_dwh.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/5. Implementing Data Warehouse on AWS/redshift_dwh.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/6. Intro to Spark/Pyspark Data Wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Wrangling with PySpark DataFrames "
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "from pyspark.sql import SparkSession\n",
17 | "from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n",
18 | "from pyspark.sql.functions import sum as Fsum\n",
19 | "from pyspark.sql.window import Window\n",
20 | "from pyspark.sql.types import IntegerType\n",
21 | "\n",
22 | "spark = SparkSession \\\n",
23 | " .builder \\\n",
24 | " .appName(\"Wrangling Data\") \\\n",
25 | " .getOrCreate()\n",
26 | "path = \"data/sparkify_log_small.json\"\n",
27 | "user_log = spark.read.json(path)"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "# Which page did user id \"\" (empty string) NOT visit?"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 40,
40 | "metadata": {},
41 | "outputs": [
42 | {
43 | "name": "stdout",
44 | "output_type": "stream",
45 | "text": [
46 | "Pages not visited by empty string user id: ['Submit Downgrade', 'Downgrade', 'Logout', 'Save Settings', 'Settings', 'NextSong', 'Upgrade', 'Error', 'Submit Upgrade']\n"
47 | ]
48 | }
49 | ],
50 | "source": [
51 | "ul1 = user_log.alias('ul1')\n",
52 | "ul2 = user_log.filter(user_log.userId == \"\").alias('ul2')\n",
53 | "\n",
54 | "pages = ul1.join(ul2, ul1.page == ul2.page, how='left_anti').select('page') \\\n",
55 | " .distinct() \\\n",
56 | " .collect()\n",
57 | "pages = [x['page'] + for x in pages]\n",
58 | "\n",
59 | "print(\"Pages not visited by empty string user id: {}\".format(pages))"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "# What type of user does the empty string user id most likely refer to?\n"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 39,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "name": "stdout",
76 | "output_type": "stream",
77 | "text": [
78 | "Pages visited by empty string user id: ['Home', 'About', 'Login', 'Help']\n"
79 | ]
80 | }
81 | ],
82 | "source": [
83 | "all_pages = ul1.select('page').distinct().collect()\n",
84 | "\n",
85 | "all_pages = [x['page'] for x in all_pages]\n",
86 | "\n",
87 | "other_user_pages = [x for x in all_pages if x not in pages]\n",
88 | "\n",
89 | "print(\"Pages visited by empty string user id: {}\".format(other_user_pages))"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "Since ['Home', 'About', 'Login', 'Help'] are pages that empty string user ids visit, they are likely users who have not yet registered"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "# How many female users do we have in the data set?"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 38,
109 | "metadata": {},
110 | "outputs": [
111 | {
112 | "name": "stdout",
113 | "output_type": "stream",
114 | "text": [
115 | "Number of female users: 462\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "female_no = ul1.filter(ul1.gender == 'F').select(\"userId\").distinct().count()\n",
121 | "print(\"Number of female users: {}\".format(female_no))"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "# How many songs were played from the most played artist?"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 37,
134 | "metadata": {},
135 | "outputs": [
136 | {
137 | "name": "stdout",
138 | "output_type": "stream",
139 | "text": [
140 | "Number of songs played by top artist Coldplay: 83\n"
141 | ]
142 | }
143 | ],
144 | "source": [
145 | "artist_counts = ul1.where(col(\"artist\").isNotNull()).groupby(\"artist\") \\\n",
146 | " .count().sort(col(\"count\").desc()).collect()\n",
147 | "\n",
148 | "top_artist = artist_counts[0]['artist']\n",
149 | "\n",
150 | "number_of_songs = ul1.filter(ul1.artist == top_artist).count()\n",
151 | "\n",
152 | "print(\"Number of songs played by top artist {}: {}\".format(top_artist,\n",
153 | " number_of_songs))"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "# How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.\n",
161 | "\n"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 43,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "name": "stdout",
171 | "output_type": "stream",
172 | "text": [
173 | "+------------------+\n",
174 | "|avg(count(period))|\n",
175 | "+------------------+\n",
176 | "| 6.898347107438017|\n",
177 | "+------------------+\n",
178 | "\n"
179 | ]
180 | }
181 | ],
182 | "source": [
183 | "function = udf(lambda ishome : int(ishome == 'Home'), IntegerType())\n",
184 | "\n",
185 | "user_window = Window \\\n",
186 | " .partitionBy('userID') \\\n",
187 | " .orderBy(desc('ts')) \\\n",
188 | " .rangeBetween(Window.unboundedPreceding, 0)\n",
189 | "\n",
190 | "cusum = ul1.filter((ul1.page == 'NextSong') | (ul1.page == 'Home')) \\\n",
191 | " .select('userID', 'page', 'ts') \\\n",
192 | " .withColumn('homevisit', function(col('page'))) \\\n",
193 | " .withColumn('period', Fsum('homevisit').over(user_window))\n",
194 | "\n",
195 | "cusum.filter((cusum.page == 'NextSong')) \\\n",
196 | " .groupBy('userID', 'period') \\\n",
197 | " .agg({'period':'count'}) \\\n",
198 | " .agg({'count(period)':'avg'}).show()"
199 | ]
200 | }
201 | ],
202 | "metadata": {
203 | "kernelspec": {
204 | "display_name": "Python 3",
205 | "language": "python",
206 | "name": "python3"
207 | },
208 | "language_info": {
209 | "codemirror_mode": {
210 | "name": "ipython",
211 | "version": 3
212 | },
213 | "file_extension": ".py",
214 | "mimetype": "text/x-python",
215 | "name": "python",
216 | "nbconvert_exporter": "python",
217 | "pygments_lexer": "ipython3",
218 | "version": "3.7.0"
219 | }
220 | },
221 | "nbformat": 4,
222 | "nbformat_minor": 2
223 | }
224 |
--------------------------------------------------------------------------------
/0. Back to Basics/6. Intro to Spark/README.md:
--------------------------------------------------------------------------------
1 | ## What is Spark?
2 | ---
3 | * Spark is a general-purpose distributed data processing engine.
4 | * On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.
5 | * Spark is often used with distributed data stores such as Hadoop's HDFS, and Amazon's S3, with popular NoSQL databases such as Apache HBase, Apache Cassandra, and MongoDB, and with distributed messaging stores such as MapR Event Store and Apache Kafka.
6 | * PySpark API
7 | * PySpark supports both imperative (Spark DataFrames) and declarative (Spark SQL) syntax (see the sketch below)
8 |
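A minimal sketch of the two styles answering the same question, using the `sparkify_log_small.json` file from the notebooks in this folder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Imperative vs Declarative").getOrCreate()
user_log = spark.read.json("data/sparkify_log_small.json")

# Imperative: chain DataFrame transformations.
user_log.filter(user_log.page == "NextSong").select("song").show(5)

# Declarative: register a view and express the same query in SQL.
user_log.createOrReplaceTempView("user_log_table")
spark.sql("SELECT song FROM user_log_table WHERE page = 'NextSong' LIMIT 5").show()
```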
9 | ## How a Spark Application Runs on a Cluster
10 | ---
11 | * A Spark application runs as independent processes, coordinated by the SparkSession object in the driver program.
12 | * The resource or cluster manager assigns tasks to workers, one task per partition.
13 | * A task applies its unit of work to the dataset in its partition and outputs a new partition dataset.
14 | * Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations.
15 | * Results are sent back to the driver application or can be saved to disk.
16 | * Spark supports the following resource/cluster managers:
17 | * Spark Standalone – a simple cluster manager included with Spark
18 | * Apache Mesos – a general cluster manager that can also run Hadoop applications
19 | * Apache Hadoop YARN – the resource manager in Hadoop 2
20 | * Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications
21 |
22 | ## Spark's Limitations
23 | ---
24 | * Spark Streaming’s latency is at least 500 milliseconds since it operates on micro-batches of records, instead of processing one record at a time. Native streaming tools such as Storm, Apex, or Flink can push down this latency value and might be more suitable for low-latency applications. Flink and Apex can be used for batch computation as well, so if you're already using them for stream processing, there's no need to add Spark to your stack of technologies.
25 |
26 | * Another limitation of Spark is its selection of machine learning algorithms. Currently, Spark only supports algorithms that scale linearly with the input data size. In general, deep learning is not available either, though there are many projects that integrate Spark with TensorFlow and other deep learning tools.
--------------------------------------------------------------------------------
/0. Back to Basics/6. Intro to Spark/Spark SQL.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Wrangling with Spark SQL"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "from pyspark.sql import SparkSession\n",
17 | "\n",
18 | "spark = SparkSession \\\n",
19 | " .builder \\\n",
20 | " .appName(\"Data wrangling with Spark SQL\") \\\n",
21 | " .getOrCreate()\n",
22 | "path = \"data/sparkify_log_small.json\"\n",
23 | "user_log = spark.read.json(path)\n",
24 | "user_log.createOrReplaceTempView(\"user_log_table\")"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "Which page did user id \"\"(empty string) NOT visit?"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 38,
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "name": "stdout",
41 | "output_type": "stream",
42 | "text": [
43 | "userId \"\" did not visit pages: ['Submit Downgrade', 'Downgrade', 'Logout', 'Save Settings', 'Settings', 'NextSong', 'Upgrade', 'Error', 'Submit Upgrade']\n"
44 | ]
45 | }
46 | ],
47 | "source": [
48 | "rows = spark.sql(\"\"\"\n",
49 | " SELECT DISTINCT ul1.page FROM user_log_table ul1\n",
50 | " LEFT ANTI JOIN (\n",
51 | " SELECT DISTINCT page FROM user_log_table\n",
52 | " WHERE user_log_table.userId = ''\n",
53 | " ) ul2 ON ul1.page = ul2.page\n",
54 | " \"\"\").collect()\n",
55 | "pages = [row.page for row in rows]\n",
56 | "print('userId \"\" did not visit pages: {}'.format(pages))"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "# How many female users do we have in the data set?"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 29,
76 | "metadata": {},
77 | "outputs": [
78 | {
79 | "name": "stdout",
80 | "output_type": "stream",
81 | "text": [
82 | "There are 462 female users\n"
83 | ]
84 | }
85 | ],
86 | "source": [
87 | "row = spark.sql(\"\"\"\n",
88 | " SELECT COUNT(DISTINCT(userId)) AS count FROM user_log_table\n",
89 | " WHERE gender='F'\n",
90 | " \"\"\").collect()\n",
91 | "count = row[0][0]\n",
92 | "print('There are {} female users'.format(count))"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "# How many songs were played from the most played artist?"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 30,
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "83 songs were played from the most played artist: Coldplay\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "row = spark.sql(\"\"\"\n",
117 | " (SELECT artist, COUNT(song) AS count FROM user_log_table\n",
118 | " GROUP BY artist\n",
119 | " ORDER BY count DESC LIMIT 1)\n",
120 | " \"\"\").collect()\n",
121 | "count = row[0][0]\n",
122 | "print('{} songs were played from the most played artist: {}'.format(row[0][1], row[0][0]))"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "# How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer."
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": 32,
135 | "metadata": {},
136 | "outputs": [
137 | {
138 | "name": "stdout",
139 | "output_type": "stream",
140 | "text": [
141 | "+------------------+\n",
142 | "|avg(count_results)|\n",
143 | "+------------------+\n",
144 | "| 6.898347107438017|\n",
145 | "+------------------+\n",
146 | "\n"
147 | ]
148 | }
149 | ],
150 | "source": [
151 | "# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;\n",
152 | "is_home = spark.sql(\"SELECT userID, page, ts, CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home FROM user_log_table \\\n",
153 | " WHERE (page = 'NextSong') or (page = 'Home') \\\n",
154 | " \")\n",
155 | "\n",
156 | "# keep the results in a new view\n",
157 | "is_home.createOrReplaceTempView(\"is_home_table\")\n",
158 | "\n",
159 | "# find the cumulative sum over the is_home column\n",
160 | "cumulative_sum = spark.sql(\"SELECT *, SUM(is_home) OVER \\\n",
161 | " (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period \\\n",
162 | " FROM is_home_table\")\n",
163 | "\n",
164 | "# keep the results in a view\n",
165 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n",
166 | "\n",
167 | "# find the average count for NextSong\n",
168 | "spark.sql(\"SELECT AVG(count_results) FROM \\\n",
169 | " (SELECT COUNT(*) AS count_results FROM period_table \\\n",
170 | "GROUP BY userID, period, page HAVING page = 'NextSong') AS counts\").show()"
171 | ]
172 | }
173 | ],
174 | "metadata": {
175 | "kernelspec": {
176 | "display_name": "Python 3",
177 | "language": "python",
178 | "name": "python3"
179 | },
180 | "language_info": {
181 | "codemirror_mode": {
182 | "name": "ipython",
183 | "version": 3
184 | },
185 | "file_extension": ".py",
186 | "mimetype": "text/x-python",
187 | "name": "python",
188 | "nbconvert_exporter": "python",
189 | "pygments_lexer": "ipython3",
190 | "version": "3.7.0"
191 | }
192 | },
193 | "nbformat": 4,
194 | "nbformat_minor": 2
195 | }
196 |
--------------------------------------------------------------------------------
/0. Back to Basics/7. Data Lakes/README.md:
--------------------------------------------------------------------------------
1 | ## What is a Data Lake?
2 | ---
3 | * A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files (from Wikipedia).
4 |
5 | ## Why Data Lakes?
6 | ---
7 | * Some data is difficult to put in tabular format, like deep json structures.
8 | * Text/Image data can be stored as blobs of data, and extracted easily for analytics later on.
9 | * Analytics such as machine learning and natural language processing may require accessing raw data in forms totally different from a star schema.
10 |
11 | ## Difference between Data Lake and Data Warehouse
12 | ---
13 | 
14 | Source: Udacity DE ND
15 |
16 | * A data warehouse is like a producer of water, where users are handed bottled water in a particular size and shape of bottle.
17 | * A data lake is like a natural lake with many streams flowing into it, and it is up to each user to get the water the way they want.
18 |
19 | ## Data Lake Issues
20 | ---
21 | * A data lake is prone to becoming a "chaotic garbage dump".
22 | * Since a data lake is widely accessible across business departments, data governance is sometimes difficult to implement
23 | * It is still unclear, per given case, whether a data lake should replace, offload or work in parallel with a data warehouse or data marts. In all cases, dimensional modelling, even in the context of a data lake, continues to be a valuable practice.
24 | * A data lake remains an important complement to a data warehouse in many businesses.
25 |
--------------------------------------------------------------------------------
/0. Back to Basics/7. Data Lakes/dlvdwh.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/7. Data Lakes/dlvdwh.PNG
--------------------------------------------------------------------------------
/0. Back to Basics/8. Data Pipelines with Airflow/README.md:
--------------------------------------------------------------------------------
1 | ## What is a Data Pipeline?
2 | ---
3 | * A data pipeline is simply a series of steps in which data is processed
4 |
5 | ## Data Partitioning
6 | ---
7 | * Pipeline data partitioning is the process of isolating data to be analyzed by one or more attributes, such as time, logical type or data size
8 | * Data partitioning often leads to faster and more reliable pipelines
9 | * Types of Data Partitioning:
10 | 1. Schedule partitioning
11 | * Not only are schedules great for reducing the amount of data our pipelines have to process, but they also help us meet the timing guarantees that our data consumers may need (a templated sketch follows after this list)
12 | 2. Logical partitioning
13 | * Conceptually related data can be partitioned into discrete segments and processed separately. This process of separating data based on its conceptual relationship is called logical partitioning.
14 | * With logical partitioning, unrelated things belong in separate steps. Consider your dependencies and separate processing around those boundaries
15 | * Examples of such partitioning are by date and time
16 | 3. Size partitioning
17 | * Size partitioning separates data for processing based on desired or required storage limits
18 | * This essentially sets the amount of data included in a data pipeline run
19 | * Why partition data?
20 | * Pipelines designed to work with partitioned data fail more gracefully. Smaller datasets, smaller time periods, and related concepts are easier to debug than big datasets, large time periods, and unrelated concepts
21 | * If data is partitioned appropriately, tasks will naturally have fewer dependencies on each other
22 | * Airflow will be able to parallelize execution of DAGs to produce results even faster
23 |
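A minimal sketch of schedule partitioning with a templated query; the `redshift` connection, `trips` table and `start_time` column are placeholders borrowed from the exercise files in this folder:

```python
import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    "schedule_partitioning_sketch",
    start_date=datetime.datetime(2018, 11, 1),
    schedule_interval="@daily",
)

# {{ ds }} is rendered to the run's execution date, so each daily run
# only touches that day's partition of the data.
count_daily_trips = PostgresOperator(
    task_id="count_daily_trips",
    dag=dag,
    postgres_conn_id="redshift",
    sql="SELECT COUNT(*) FROM trips WHERE start_time::date = '{{ ds }}'",
)
```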
24 | ## Data Validation
25 | ---
26 | * Data Validation is the process of ensuring that data is present, correct & meaningful. Ensuring the quality of data through automated validation checks is a critical step in building data pipelines at any organization
27 |
28 | ## Data Quality
29 | ---
30 | * Data Quality is a measure of how well a dataset satisfies its intended use
31 | * Examples of Data Quality Requirements
32 | * Data must be a certain size
33 | * Data must be accurate to some margin of error
34 | * Data must arrive within a given timeframe from the start of execution
35 | * Pipelines must run on a particular schedule
36 | * Data must not contain any sensitive information
37 |
38 | ## Directed Acyclic Graphs
39 | ---
40 | * Directed Acyclic Graphs (DAGs): DAGs are a special subset of graphs in which the edges between nodes have a specific direction, and no cycles exist.
41 |
42 | ## Apache Airflow
43 | ---
44 | * What is Airflow?
45 | * Airflow is a platform to programmatically author, schedule and monitor workflows
46 | * Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks
47 | * The airflow scheduler executes your tasks on an array of workers while following the specified dependencies
48 | * When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative
49 |
50 | * Airflow concepts (Taken from Airflow documentation)
51 | * Operators
52 | * Operators determine what actually gets done by a task. An operator describes a single task in a workflow. Operators are usually (but not always) atomic. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently
53 | * Tasks
54 | * Once an operator is instantiated, it is referred to as a "task". The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG
55 | * A task instance represents a specific run of a task and is characterized as the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc
56 | * DAGs
57 | * In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies
58 | * A DAG run is a physical instance of a DAG, containing task instances that run for a specific execution_date. A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger
59 | * Hooks
60 | * Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators (see the usage sketch after this list)
61 | * They also use the airflow.models.connection.Connection model to retrieve hostnames and authentication information
62 | * Hooks keep authentication code and information out of pipelines, centralized in the metadata database
63 | * Connections
64 | * The information needed to connect to external systems is stored in the Airflow metastore database and can be managed in the UI (Menu -> Admin -> Connections)
65 | * A conn_id is defined there, and hostname / login / password / schema information attached to it
66 | * Airflow pipelines retrieve centrally-managed connections information by specifying the relevant conn_id
67 | * Variables
68 | * Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow
69 | * Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI. In addition, json settings files can be bulk uploaded through the UI
70 | * Context & Templating
71 | * Airflow leverages templating to allow users to "fill in the blank" with important runtime variables for tasks
72 | * See: https://airflow.apache.org/docs/stable/macros-ref for a list of context variables
73 |
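A minimal usage sketch of a hook inside a Python callable; the `redshift` conn_id matches the exercise files below, while the `trips` table is a placeholder:

```python
from airflow.hooks.postgres_hook import PostgresHook


def count_trips(*args, **kwargs):
    # The hook looks up host and credentials for the 'redshift' conn_id in the
    # Airflow metadata database, so nothing sensitive is hard-coded here.
    redshift_hook = PostgresHook(postgres_conn_id="redshift")
    records = redshift_hook.get_records("SELECT COUNT(*) FROM trips")
    return records[0][0]
```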
74 | * Airflow functionalities
75 | * Airflow Plugins
76 | * Airflow was built with the intention of allowing its users to extend and customize its functionality through plugins.
77 | * The most common types of user-created plugins for Airflow are Operators and Hooks. These plugins make DAGs reusable and simpler to maintain (a minimal operator sketch follows after this list)
78 | * To create a custom operator, follow these steps:
79 | 1. Identify Operators that perform similar functions and can be consolidated
80 | 2. Define a new Operator in the plugins folder
81 | 3. Replace the original Operators with your new custom one, re-parameterize, and instantiate them
82 | * Airflow subdags
83 | * Commonly repeated series of tasks within DAGs can be captured as reusable SubDAGs
84 | * Benefits include:
85 | * Decrease the amount of code we need to write and maintain to create a new DAG
86 | * Easier to understand the high level goals of a DAG
87 | * Bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG
88 | * Drawbacks of Using SubDAGs:
89 | * Limit the visibility within the Airflow UI
90 | * Abstraction makes understanding what the DAG is doing more difficult
91 | * Encourages premature optimization
92 |
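A minimal sketch of what such a custom operator might look like, loosely modelled on the row-count check used in the exercise files below (it is not the actual `udacity_plugin` implementation, and it would still need to be registered through an `AirflowPlugin` subclass in the plugins folder):

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class HasRowsOperator(BaseOperator):
    """Fails the task if the given table is empty."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
        super(HasRowsOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info(f"Data quality check passed: {self.table} has {records[0][0]} rows")
```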
93 | * Monitoring
94 | * Airflow can surface metrics and emails to help you stay on top of pipeline issues (a configuration sketch follows after this list)
95 | * SLAs
96 | * Airflow DAGs may optionally specify an SLA, or “Service Level Agreement”, which is defined as a time by which a DAG must complete
97 | * For time-sensitive applications these features are critical for developing trust amongst pipeline customers and ensuring that data is delivered while it is still meaningful
98 | * Emails and Alerts
99 | * Airflow can be configured to send emails on DAG and task state changes
100 | * These state changes may include successes, failures, or retries
101 | * Failure emails can easily trigger alerts
102 | * Metrics
103 | * Airflow comes out of the box with the ability to send system metrics using a metrics aggregator called statsd
104 | * Statsd can be coupled with metrics visualization tools like Grafana to provide high level insights into the overall performance of DAGs, jobs, and tasks
105 |
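A minimal sketch of wiring retries, failure emails and an SLA through `default_args`; the alert address and timings are placeholders:

```python
import datetime

from airflow import DAG

default_args = {
    "owner": "sparkify",
    "email": ["data-alerts@example.com"],   # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 3,
    "retry_delay": datetime.timedelta(minutes=5),
    "sla": datetime.timedelta(hours=1),     # each task should finish within an hour
}

dag = DAG(
    "monitored_pipeline_sketch",
    default_args=default_args,
    start_date=datetime.datetime(2018, 11, 1),
    schedule_interval="@daily",
)
```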
106 | * Best practices for data pipelining
107 | * Task Boundaries
108 | DAG tasks should be designed such that they are:
109 | * Atomic and have a single purpose
110 | * Maximize parallelism
111 | * Make failure states obvious
--------------------------------------------------------------------------------
/0. Back to Basics/8. Data Pipelines with Airflow/context_and_templating.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Use the Airflow context in the PythonOperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.models import Variable
9 | from airflow.operators.python_operator import PythonOperator
10 | from airflow.hooks.S3_hook import S3Hook
11 |
12 |
13 | def log_details(*args, **kwargs):
14 | #
15 | # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them
16 | # NOTE: Look here for context variables passed in on kwargs:
17 | # https://airflow.apache.org/macros.html
18 | #
19 | ds = kwargs['ds']
20 | run_id = kwargs['run_id']
21 | previous_ds = kwargs['prev_ds']
22 | next_ds = kwargs['next_ds']
23 |
24 | logging.info(f"Execution date is {ds}")
25 | logging.info(f"My run id is {run_id}")
26 | if previous_ds:
27 | logging.info(f"My previous run was on {previous_ds}")
28 | if next_ds:
29 | logging.info(f"My next run will be {next_ds}")
30 |
31 | dag = DAG(
32 | 'lesson1.exercise5',
33 | schedule_interval="@daily",
34 | start_date=datetime.datetime.now() - datetime.timedelta(days=2)
35 | )
36 |
37 | list_task = PythonOperator(
38 | task_id="log_details",
39 | python_callable=log_details,
40 | provide_context=True,
41 | dag=dag
42 | )
43 |
--------------------------------------------------------------------------------
/0. Back to Basics/8. Data Pipelines with Airflow/dag_for_subdag.py:
--------------------------------------------------------------------------------
1 | import datetime
2 |
3 | from airflow import DAG
4 | from airflow.operators.postgres_operator import PostgresOperator
5 | from airflow.operators.subdag_operator import SubDagOperator
6 | from airflow.operators.udacity_plugin import HasRowsOperator
7 |
8 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag
9 | import sql_statements
10 |
11 |
12 | start_date = datetime.datetime.utcnow()
13 |
14 | dag = DAG(
15 | "lesson3.exercise3",
16 | start_date=start_date,
17 | )
18 |
19 | trips_task_id = "trips_subdag"
20 | trips_subdag_task = SubDagOperator(
21 | subdag=get_s3_to_redshift_dag(
22 | "lesson3.exercise3",
23 | trips_task_id,
24 | "redshift",
25 | "aws_credentials",
26 | "trips",
27 | sql_statements.CREATE_TRIPS_TABLE_SQL,
28 | s3_bucket="udac-data-pipelines",
29 | s3_key="divvy/unpartitioned/divvy_trips_2018.csv",
30 | start_date=start_date,
31 | ),
32 | task_id=trips_task_id,
33 | dag=dag,
34 | )
35 |
36 | stations_task_id = "stations_subdag"
37 | stations_subdag_task = SubDagOperator(
38 | subdag=get_s3_to_redshift_dag(
39 | "lesson3.exercise3",
40 | stations_task_id,
41 | "redshift",
42 | "aws_credentials",
43 | "stations",
44 | sql_statements.CREATE_STATIONS_TABLE_SQL,
45 | s3_bucket="udac-data-pipelines",
46 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv",
47 | start_date=start_date,
48 | ),
49 | task_id=stations_task_id,
50 | dag=dag,
51 | )
52 |
53 | #
54 | # TODO: Consolidate check_trips and check_stations into a single check in the subdag
55 | # as we did with the create and copy in the demo
56 | #
57 | check_trips = HasRowsOperator(
58 | task_id="check_trips_data",
59 | dag=dag,
60 | redshift_conn_id="redshift",
61 | table="trips"
62 | )
63 |
64 | check_stations = HasRowsOperator(
65 | task_id="check_stations_data",
66 | dag=dag,
67 | redshift_conn_id="redshift",
68 | table="stations"
69 | )
70 |
71 | location_traffic_task = PostgresOperator(
72 | task_id="calculate_location_traffic",
73 | dag=dag,
74 | postgres_conn_id="redshift",
75 | sql=sql_statements.LOCATION_TRAFFIC_SQL
76 | )
77 |
78 | #
79 | # TODO: Reorder the Graph once you have moved the checks
80 | #
81 | trips_subdag_task >> check_trips
82 | stations_subdag_task >> check_stations
83 | check_stations >> location_traffic_task
84 | check_trips >> location_traffic_task
85 |
--------------------------------------------------------------------------------
/0. Back to Basics/8. Data Pipelines with Airflow/hello_airflow.py:
--------------------------------------------------------------------------------
1 | # Instructions
2 | # Define a function that uses the python logger to log a message. Then finish filling in the details of the DAG down below. Once you’ve done that, run the "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then Run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page.
3 |
4 | import datetime
5 | import logging
6 |
7 | from airflow import DAG
8 | from airflow.operators.python_operator import PythonOperator
9 |
10 | def my_function():
11 | logging.info("hello airflow")
12 |
13 |
14 | dag = DAG(
15 | 'mock_airflow_dag',
16 | start_date=datetime.datetime.now())
17 |
18 | greet_task = PythonOperator(
19 | task_id="hello_airflow_task",
20 | python_callable=my_function,
21 | dag=dag
22 | )
--------------------------------------------------------------------------------
/0. Back to Basics/8. Data Pipelines with Airflow/subdag.py:
--------------------------------------------------------------------------------
1 | #Instructions
2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag.
3 | #1 - Consolidate HasRowsOperator into the SubDag
4 | #2 - Reorder the tasks to take advantage of the SubDag Operators
5 |
6 | import datetime
7 |
8 | from airflow import DAG
9 | from airflow.operators.postgres_operator import PostgresOperator
10 | from airflow.operators.udacity_plugin import HasRowsOperator
11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator
12 |
13 | import sql
14 |
15 | def get_s3_to_redshift_dag(
16 | parent_dag_name,
17 | task_id,
18 | redshift_conn_id,
19 | aws_credentials_id,
20 | table,
21 | create_sql_stmt,
22 | s3_bucket,
23 | s3_key,
24 | *args, **kwargs):
25 | dag = DAG(
26 | f"{parent_dag_name}.{task_id}",
27 | **kwargs
28 | )
29 |
30 | create_task = PostgresOperator(
31 | task_id=f"create_{table}_table",
32 | dag=dag,
33 | postgres_conn_id=redshift_conn_id,
34 | sql=create_sql_stmt
35 | )
36 |
37 | copy_task = S3ToRedshiftOperator(
38 | task_id=f"load_{table}_from_s3_to_redshift",
39 | dag=dag,
40 | table=table,
41 | redshift_conn_id=redshift_conn_id,
42 | aws_credentials_id=aws_credentials_id,
43 | s3_bucket=s3_bucket,
44 | s3_key=s3_key
45 | )
46 |
47 | create_task >> copy_task
48 | return dag
49 |
--------------------------------------------------------------------------------
/1. Postgres ETL/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | This repo provides the ETL pipeline to populate the sparkifydb database.
4 | * The purpose of this database is to enable Sparkify to answer business questions it may have of its users, the types of songs they listen to and the artists of those songs using the data that it has in logs and files. The database provides a consistent and reliable source to store this data.
5 |
6 | * This source of data will be useful in helping Sparkify reach some of its analytical goals, for example, finding the songs with the highest popularity or the times of day with the highest traffic.
7 |
8 | ## Database Design and ETL Pipeline
9 | ---
10 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data.
11 |
12 | 
13 |
14 | * For the ETL pipeline, Python is used as it contains libraries such as pandas, that simplifies data manipulation. It also allows connection to Postgres Database.
15 |
16 | * There are 2 types of data involved: song data and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables
17 |
18 | * Log data gives the information of each user session. From log data, we extract and load into the time and users dimension tables and the songplays fact table (a minimal extraction sketch follows below).
19 |
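A minimal sketch of that extraction for a single song file (not the actual etl.py; the INSERT statement stands in for the query assumed to be defined in sql_queries.py):

```python
import pandas as pd
import psycopg2

# Connection string matches create_tables.py; the songs table is assumed to exist.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Each song file holds a single JSON record.
df = pd.read_json("data/song_data/A/A/A/TRAAAAW128F429D538.json", lines=True)
row = df.iloc[0]

# Cast to plain Python types before inserting into the songs dimension table.
song_data = (row["song_id"], row["title"], row["artist_id"], int(row["year"]), float(row["duration"]))
cur.execute(
    "INSERT INTO songs (song_id, title, artist_id, year, duration) "
    "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
    song_data,
)
conn.commit()
conn.close()
```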
20 | ## Running the ETL Pipeline
21 | ---
22 | * First, run create_tables.py to create the data tables using the schema design specified. If tables were created previously, they will be dropped and recreated.
23 |
24 | * Next, run etl.py to populate the data tables created.
--------------------------------------------------------------------------------
/1. Postgres ETL/create_tables.py:
--------------------------------------------------------------------------------
1 | import psycopg2
2 | from sql_queries import create_table_queries, drop_table_queries
3 |
4 |
5 | def create_database():
6 | """
7 | - Creates and connects to the sparkifydb
8 | - Returns the connection and cursor to sparkifydb
9 | """
10 |
11 | # connect to default database
12 | conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
13 | conn.set_session(autocommit=True)
14 | cur = conn.cursor()
15 |
16 | # create sparkify database with UTF8 encoding
17 | cur.execute("DROP DATABASE IF EXISTS sparkifydb")
18 | cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
19 |
20 | # close connection to default database
21 | conn.close()
22 |
23 | # connect to sparkify database
24 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
25 | cur = conn.cursor()
26 |
27 | return cur, conn
28 |
29 |
30 | def drop_tables(cur, conn):
31 | """
32 | Drops each table using the queries in `drop_table_queries` list.
33 | """
34 | for query in drop_table_queries:
35 | cur.execute(query)
36 | conn.commit()
37 |
38 |
39 | def create_tables(cur, conn):
40 | """
41 | Creates each table using the queries in `create_table_queries` list.
42 | """
43 | for query in create_table_queries:
44 | cur.execute(query)
45 | conn.commit()
46 |
47 |
48 | def main():
49 | """
50 | - Drops (if exists) and Creates the sparkify database.
51 |
52 | - Establishes connection with the sparkify database and gets
53 | cursor to it.
54 |
55 | - Drops all the tables.
56 |
57 | - Creates all tables needed.
58 |
59 | - Finally, closes the connection.
60 | """
61 | cur, conn = create_database()
62 |
63 | drop_tables(cur, conn)
64 | create_tables(cur, conn)
65 |
66 | conn.close()
67 |
68 |
69 | if __name__ == "__main__":
70 | main()
--------------------------------------------------------------------------------
/1. Postgres ETL/data/log_data/2018/11/2018-11-01-events.json:
--------------------------------------------------------------------------------
1 | {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
2 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
3 | {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
4 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
5 | {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
6 | {"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
7 | {"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
8 | {"artist":"Infected Mushroom","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
9 | {"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
10 | {"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
11 | {"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"}
12 | {"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
13 | {"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
14 | {"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
15 | {"artist":"Survivor","auth":"Logged In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAAW128F429D538.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAABD128F429CF47.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARMJAGH1187FB546F3", "artist_latitude": 35.14968, "artist_longitude": -90.04892, "artist_location": "Memphis, TN", "artist_name": "The Box Tops", "song_id": "SOCIWDW12A8C13D406", "title": "Soul Deep", "duration": 148.03546, "year": 1969}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAADZ128F9348C2E.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARKRRTF1187B9984DA", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sonora Santanera", "song_id": "SOXVLOJ12AB0189215", "title": "Amor De Cabaret", "duration": 177.47546, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAEF128F4273421.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR7G5I41187FB4CE6C", "artist_latitude": null, "artist_longitude": null, "artist_location": "London, England", "artist_name": "Adam Ant", "song_id": "SONHOTT12A8C13493C", "title": "Something Girls", "duration": 233.40363, "year": 1982}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAFD128F92F423A.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARXR32B1187FB57099", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gob", "song_id": "SOFSOCN12A8C143F5D", "title": "Face the Ashes", "duration": 209.60608, "year": 2007}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAMO128F1481E7F.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARKFYS91187B98E58F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Jeff And Sheri Easter", "song_id": "SOYMRWW12A6D4FAB14", "title": "The Moon And I (Ordinary Day Album Version)", "duration": 267.7024, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAMQ128F1460CD3.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARD0S291187B9B7BF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "Ohio", "artist_name": "Rated R", "song_id": "SOMJBYD12A6D4F8557", "title": "Keepin It Real (Skit)", "duration": 114.78159, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAPK128E0786D96.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR10USD1187B99F3F1", "artist_latitude": null, "artist_longitude": null, "artist_location": "Burlington, Ontario, Canada", "artist_name": "Tweeterfriendly Music", "song_id": "SOHKNRJ12A6701D1F8", "title": "Drop of Rain", "duration": 189.57016, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAARJ128F9320760.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR8ZCNI1187B9A069B", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Planet P Project", "song_id": "SOIAZJW12AB01853F1", "title": "Pink World", "duration": 269.81832, "year": 1984}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAVG12903CFA543.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOUDSGM12AC9618304", "title": "Insatiable (Instrumental Version)", "duration": 266.39628, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/A/TRAAAVO128F93133D4.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARGSJW91187B9B1D6B", "artist_latitude": 35.21962, "artist_longitude": -80.01955, "artist_location": "North Carolina", "artist_name": "JennyAnyKind", "song_id": "SOQHXMF12AB0182363", "title": "Young Boy Blues", "duration": 218.77506, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABCL128F4286650.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARC43071187B990240", "artist_latitude": null, "artist_longitude": null, "artist_location": "Wisner, LA", "artist_name": "Wayne Watson", "song_id": "SOKEJEJ12A8C13E0D0", "title": "The Urgency (LP Version)", "duration": 245.21098, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABDL12903CAABBA.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARL7K851187B99ACD2", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Andy Andy", "song_id": "SOMUYGI12AB0188633", "title": "La Culpa", "duration": 226.35057, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABJL12903CDCF1A.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARHHO3O1187B989413", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Bob Azzam", "song_id": "SORAMLE12AB017C8B0", "title": "Auguri Cha Cha", "duration": 191.84281, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABJV128F1460C49.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARIK43K1187B9AE54C", "artist_latitude": null, "artist_longitude": null, "artist_location": "Beverly Hills, CA", "artist_name": "Lionel Richie", "song_id": "SOBONFF12A6D4F84D8", "title": "Tonight Will Be Alright", "duration": 307.3824, "year": 1986}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABLR128F423B7E3.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARD842G1187B997376", "artist_latitude": 43.64856, "artist_longitude": -79.38533, "artist_location": "Toronto, Ontario, Canada", "artist_name": "Blue Rodeo", "song_id": "SOHUOAP12A8AE488E9", "title": "Floating", "duration": 491.12771, "year": 1987}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABNV128F425CEE1.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARIG6O41187B988BDD", "artist_latitude": 37.16793, "artist_longitude": -95.84502, "artist_location": "United States", "artist_name": "Richard Souther", "song_id": "SOUQQEA12A8C134B1B", "title": "High Tide", "duration": 228.5971, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABRB128F9306DD5.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR1ZHYZ1187FB3C717", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Faiz Ali Faiz", "song_id": "SOILPQQ12AB017E82A", "title": "Sohna Nee Sohna Data", "duration": 599.24853, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABVM128F92CA9DC.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARYKCQI1187FB3B18F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Tesla", "song_id": "SOXLBJT12A8C140925", "title": "Caught In A Dream", "duration": 290.29832, "year": 2004}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABXG128F9318EBD.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARNPAGP1241B9C7FD4", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "lextrical", "song_id": "SOZVMJI12AB01808AF", "title": "Synthetic Dream", "duration": 165.69424, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABYN12903CFD305.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARQGYP71187FB44566", "artist_latitude": 34.31109, "artist_longitude": -94.02978, "artist_location": "Mineola, AR", "artist_name": "Jimmy Wakely", "song_id": "SOWTBJW12AC468AC6E", "title": "Broken-Down Merry-Go-Round", "duration": 151.84934, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/B/TRAABYW128F4244559.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARI3BMM1187FB4255E", "artist_latitude": 38.8991, "artist_longitude": -77.029, "artist_location": "Washington", "artist_name": "Alice Stuart", "song_id": "SOBEBDG12A58A76D60", "title": "Kassie Jones", "duration": 220.78649, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACCG128F92E8A55.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR5KOSW1187FB35FF4", "artist_latitude": 49.80388, "artist_longitude": 15.47491, "artist_location": "Dubai UAE", "artist_name": "Elena", "song_id": "SOZCTXZ12AB0182364", "title": "Setanta matins", "duration": 269.58322, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACER128F4290F96.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARMAC4T1187FB3FA4C", "artist_latitude": 40.82624, "artist_longitude": -74.47995, "artist_location": "Morris Plains, NJ", "artist_name": "The Dillinger Escape Plan", "song_id": "SOBBUGU12A8C13E95D", "title": "Setting Fire to Sleeping Giants", "duration": 207.77751, "year": 2004}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACFV128F935E50B.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR47JEX1187B995D81", "artist_latitude": 37.83721, "artist_longitude": -94.35868, "artist_location": "Nevada, MO", "artist_name": "SUE THOMPSON", "song_id": "SOBLGCN12AB0183212", "title": "James (Hold The Ladder Steady)", "duration": 124.86485, "year": 1985}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACHN128F1489601.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARGIWFO1187B9B55B7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Five Bolt Main", "song_id": "SOPSWQW12A6D4F8781", "title": "Made Like This (Live)", "duration": 225.09669, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACIW12903CC0F6D.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOZQDIU12A58A7BCF6", "title": "Superconfidential", "duration": 338.31138, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACLV128F427E123.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARDNS031187B9924F0", "artist_latitude": 32.67828, "artist_longitude": -83.22295, "artist_location": "Georgia", "artist_name": "Tim Wilson", "song_id": "SONYPOM12A8C13B2D7", "title": "I Think My Wife Is Running Around On Me (Taco Hell)", "duration": 186.48771, "year": 2005}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACNS128F14A2DF5.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AROUOZZ1187B9ABE51", "artist_latitude": 40.79195, "artist_longitude": -73.94512, "artist_location": "New York, NY [Spanish Harlem]", "artist_name": "Willie Bobo", "song_id": "SOBZBAZ12A6D4F8742", "title": "Spanish Grease", "duration": 168.25424, "year": 1997}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACOW128F933E35F.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARH4Z031187B9A71F2", "artist_latitude": 40.73197, "artist_longitude": -74.17418, "artist_location": "Newark, NJ", "artist_name": "Faye Adams", "song_id": "SOVYKGO12AB0187199", "title": "Crazy Mixed Up World", "duration": 156.39465, "year": 1961}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACPE128F421C1B9.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARB29H41187B98F0EF", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago", "artist_name": "Terry Callier", "song_id": "SOGNCJP12A58A80271", "title": "Do You Finally Need A Friend", "duration": 342.56934, "year": 1972}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACQT128F9331780.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR1Y2PT1187FB5B9CE", "artist_latitude": 27.94017, "artist_longitude": -82.32547, "artist_location": "Brandon", "artist_name": "John Wesley", "song_id": "SOLLHMX12AB01846DC", "title": "The Emperor Falls", "duration": 484.62322, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACSL128F93462F4.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARAJPHH1187FB5566A", "artist_latitude": 40.7038, "artist_longitude": -73.83168, "artist_location": "Queens, NY", "artist_name": "The Shangri-Las", "song_id": "SOYTPEP12AB0180E7B", "title": "Twist and Shout", "duration": 164.80608, "year": 1964}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACTB12903CAAF15.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR0RCMP1187FB3F427", "artist_latitude": 30.08615, "artist_longitude": -94.10158, "artist_location": "Beaumont, TX", "artist_name": "Billie Jo Spears", "song_id": "SOGXHEG12AB018653E", "title": "It Makes No Difference Now", "duration": 133.32853, "year": 1992}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACVS128E078BE39.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AREBBGV1187FB523D2", "artist_latitude": null, "artist_longitude": null, "artist_location": "Houston, TX", "artist_name": "Mike Jones (Featuring CJ_ Mello & Lil' Bran)", "song_id": "SOOLYAZ12A6701F4A6", "title": "Laws Patrolling (Album Version)", "duration": 173.66159, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/A/C/TRAACZK128F4243829.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARGUVEV1187B98BA17", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sierra Maestra", "song_id": "SOGOSOV12AF72A285E", "title": "\u00bfD\u00f3nde va Chichi?", "duration": 313.12934, "year": 1997}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABACN128F425B784.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOQLGFP12A58A7800E", "title": "OAKtown", "duration": 259.44771, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAFJ128F42AF24E.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR3JMC51187B9AE49D", "artist_latitude": 28.53823, "artist_longitude": -81.37739, "artist_location": "Orlando, FL", "artist_name": "Backstreet Boys", "song_id": "SOPVXLX12A8C1402D5", "title": "Larger Than Life", "duration": 236.25098, "year": 1999}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAFP128F931E9A1.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARPBNLO1187FB3D52F", "artist_latitude": 40.71455, "artist_longitude": -74.00712, "artist_location": "New York, NY", "artist_name": "Tiny Tim", "song_id": "SOAOIBZ12AB01815BE", "title": "I Hold Your Hand In Mine [Live At Royal Albert Hall]", "duration": 43.36281, "year": 2000}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAIO128F42938F9.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR9AWNF1187B9AB0B4", "artist_latitude": null, "artist_longitude": null, "artist_location": "Seattle, Washington USA", "artist_name": "Kenny G featuring Daryl Hall", "song_id": "SOZHPGD12A8C1394FE", "title": "Baby Come To Me", "duration": 236.93016, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABATO128F42627E9.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AROGWRA122988FEE45", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Christos Dantis", "song_id": "SOSLAVG12A8C13397F", "title": "Den Pai Alo", "duration": 243.82649, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAVQ12903CBF7E0.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARMBR4Y1187B9990EB", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "David Martin", "song_id": "SOTTDKS12AB018D69B", "title": "It Wont Be Christmas", "duration": 241.47546, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAWW128F4250A31.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARQ9BO41187FB5CF1F", "artist_latitude": 40.99471, "artist_longitude": -77.60454, "artist_location": "Pennsylvania", "artist_name": "John Davis", "song_id": "SOMVWWT12A58A7AE05", "title": "Knocked Out Of The Park", "duration": 183.17016, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAXL128F424FC50.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARKULSX1187FB45F84", "artist_latitude": 39.49974, "artist_longitude": -111.54732, "artist_location": "Utah", "artist_name": "Trafik", "song_id": "SOQVMXR12A81C21483", "title": "Salt In NYC", "duration": 424.12363, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAXR128F426515F.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARI2JSK1187FB496EF", "artist_latitude": 51.50632, "artist_longitude": -0.12714, "artist_location": "London, England", "artist_name": "Nick Ingman;Gavyn Wright", "song_id": "SODUJBS12A8C132150", "title": "Wessex Loses a Bride", "duration": 111.62077, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAXV128F92F6AE3.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AREDBBQ1187B98AFF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Eddie Calvert", "song_id": "SOBBXLX12A58A79DDA", "title": "Erica (2005 Digital Remaster)", "duration": 138.63138, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/A/TRABAZH128F930419A.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR7ZKHQ1187B98DD73", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Glad", "song_id": "SOTUKVB12AB0181477", "title": "Blessed Assurance", "duration": 270.602, "year": 1993}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBAM128F429D223.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARBGXIG122988F409D", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "Steel Rain", "song_id": "SOOJPRH12A8C141995", "title": "Loaded Like A Gun", "duration": 173.19138, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBBV128F42967D7.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR7SMBG1187B9B9066", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Los Manolos", "song_id": "SOBCOSW12A8C13D398", "title": "Rumba De Barcelona", "duration": 218.38322, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBJE12903CDB442.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARGCY1Y1187B9A4FA5", "artist_latitude": 36.16778, "artist_longitude": -86.77836, "artist_location": "Nashville, TN.", "artist_name": "Gloriana", "song_id": "SOQOTLQ12AB01868D0", "title": "Clementina Santaf\u00e8", "duration": 153.33832, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBKX128F4285205.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR36F9J1187FB406F1", "artist_latitude": 56.27609, "artist_longitude": 9.51695, "artist_location": "Denmark", "artist_name": "Bombay Rockers", "song_id": "SOBKWDJ12A8C13B2F3", "title": "Wild Rose (Back 2 Basics Mix)", "duration": 230.71302, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBLU128F93349CF.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARNNKDK1187B98BBD5", "artist_latitude": 45.80726, "artist_longitude": 15.9676, "artist_location": "Zagreb Croatia", "artist_name": "Jinx", "song_id": "SOFNOQK12AB01840FC", "title": "Kutt Free (DJ Volume Remix)", "duration": 407.37914, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBNP128F932546F.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR62SOJ1187FB47BB5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Chase & Status", "song_id": "SOGVQGJ12AB017F169", "title": "Ten Tonne", "duration": 337.68444, "year": 2005}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBOP128F931B50D.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARBEBBY1187B9B43DB", "artist_latitude": null, "artist_longitude": null, "artist_location": "Gainesville, FL", "artist_name": "Tom Petty", "song_id": "SOFFKZS12AB017F194", "title": "A Higher Place (Album Version)", "duration": 236.17261, "year": 1994}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBOR128F4286200.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARDR4AC1187FB371A1", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Montserrat Caball\u00e9;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti", "song_id": "SOBAYLL12A8C138AF9", "title": "Sono andati? Fingevo di dormire", "duration": 511.16363, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBTA128F933D304.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARAGB2O1187FB3A161", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Pucho & His Latin Soul Brothers", "song_id": "SOLEYHO12AB0188A85", "title": "Got My Mojo Workin", "duration": 338.23302, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBVJ128F92F7EAA.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AREDL271187FB40F44", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Soul Mekanik", "song_id": "SOPEGZN12AB0181B3D", "title": "Get Your Head Stuck On Your Neck", "duration": 45.66159, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBXU128F92FEF48.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARP6N5A1187B99D1A3", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamtramck, MI", "artist_name": "Mitch Ryder", "song_id": "SOXILUQ12A58A7C72A", "title": "Jenny Take a Ride", "duration": 207.43791, "year": 2004}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/B/TRABBZN12903CD9297.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARGSAFR1269FB35070", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Blingtones", "song_id": "SOTCKKY12AB018A141", "title": "Sonnerie lalaleul\u00e9 hi houuu", "duration": 29.54404, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCAJ12903CDFCC2.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCEC128F426456E.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR0IAWL1187B9A96D0", "artist_latitude": 8.4177, "artist_longitude": -80.11278, "artist_location": "Panama", "artist_name": "Danilo Perez", "song_id": "SONSKXP12A8C13A2C9", "title": "Native Soul", "duration": 197.19791, "year": 2003}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCEI128F424C983.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCFL128F149BB0D.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARLTWXK1187FB5A3F8", "artist_latitude": 32.74863, "artist_longitude": -97.32925, "artist_location": "Fort Worth, TX", "artist_name": "King Curtis", "song_id": "SODREIN12A58A7F2E5", "title": "A Whiter Shade Of Pale (Live @ Fillmore West)", "duration": 326.00771, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCIX128F4265903.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARNF6401187FB57032", "artist_latitude": 40.79086, "artist_longitude": -73.96644, "artist_location": "New York, NY [Manhattan]", "artist_name": "Sophie B. Hawkins", "song_id": "SONWXQJ12A8C134D94", "title": "The Ballad Of Sleeping Beauty", "duration": 305.162, "year": 1994}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCKL128F423A778.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARPFHN61187FB575F6", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago, IL", "artist_name": "Lupe Fiasco", "song_id": "SOWQTQZ12A58A7B63E", "title": "Streets On Fire (Explicit Album Version)", "duration": 279.97995, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCPZ128F4275C32.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR051KA1187B98B2FF", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Wilks", "song_id": "SOLYIBD12A8C135045", "title": "Music is what we love", "duration": 261.51138, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCRU128F423F449.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Marc Shaiman", "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "duration": 149.86404, "year": 2008}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCTK128F934B224.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AR558FS1187FB45658", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "40 Grit", "song_id": "SOGDBUF12A8C140FAA", "title": "Intro", "duration": 75.67628, "year": 2003}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCUQ128E0783E2B.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARVBRGZ1187FB4675A", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gwen Stefani", "song_id": "SORRZGD12A6310DBC3", "title": "Harajuku Girls", "duration": 290.55955, "year": 2004}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCXB128F4286BD3.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "ARWB3G61187FB49404", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamilton, Ohio", "artist_name": "Steve Morse", "song_id": "SODAUVL12A8C13D184", "title": "Prognosis", "duration": 363.85914, "year": 2000}
--------------------------------------------------------------------------------
/1. Postgres ETL/data/song_data/A/B/C/TRABCYE128F934CE1D.json:
--------------------------------------------------------------------------------
1 | {"num_songs": 1, "artist_id": "AREVWGE1187B9B890A", "artist_latitude": -13.442, "artist_longitude": -41.9952, "artist_location": "Noci (BA)", "artist_name": "Bitter End", "song_id": "SOFCHDR12AB01866EF", "title": "Living Hell", "duration": 282.43546, "year": 0}
--------------------------------------------------------------------------------
/1. Postgres ETL/etl.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import psycopg2
4 | import pandas as pd
5 | from sql_queries import *
6 |
7 |
8 | def process_song_file(cur, filepath):
9 | """
10 | - Load data from a song file to the song and artist data tables
11 | """
12 | # open song file
13 | df = pd.read_json(filepath, lines=True)
14 |
15 | # insert song record
16 | song_data = list(df[['song_id', 'title', 'artist_id', 'year', 'duration']].values[0])
17 | cur.execute(song_table_insert, song_data)
18 |
19 | # insert artist record
20 | artist_data = list(df[['artist_id', 'artist_name', 'artist_location',
21 | 'artist_latitude', 'artist_longitude']].values[0])
22 | cur.execute(artist_table_insert, artist_data)
23 |
24 |
25 | def process_log_file(cur, filepath):
26 | """
27 | - Load data from a log file to the time, user and songplay data tables
28 | """
29 | # open log file
30 | df = pd.read_json(filepath, lines=True)
31 |
32 | # filter by NextSong action
33 | df = df[df['page'] == 'NextSong']
34 |
35 | # convert timestamp column (ms since epoch) to datetime
36 | t = pd.to_datetime(df['ts'], unit='ms')
37 |
38 | # insert time data records, keeping the original ms value as start_time
39 | time_data = [(int(ts), tt.hour, tt.day, tt.week, tt.month, tt.year, tt.weekday()) for ts, tt in zip(df['ts'], t)]
40 | column_labels = ('timestamp', 'hour', 'day', 'week', 'month', 'year', 'weekday')
41 | time_df = pd.DataFrame(data=time_data, columns=column_labels)
42 |
43 | for i, row in time_df.iterrows():
44 | cur.execute(time_table_insert, list(row))
45 |
46 | # load user table
47 | user_df = df[['userId', 'firstName', 'lastName', 'gender', 'level']]
48 |
49 | # insert user records
50 | for i, row in user_df.iterrows():
51 | cur.execute(user_table_insert, row)
52 |
53 | # insert songplay records
54 | for index, row in df.iterrows():
55 |
56 | # get songid and artistid from song and artist tables
57 | cur.execute(song_select, (row.song, row.artist, row.length))
58 | results = cur.fetchone()
59 |
60 | if results:
61 | songid, artistid = results
62 | else:
63 | songid, artistid = None, None
64 |
65 | # insert songplay record (songplay_id is generated by the database)
66 | songplay_data = (row['ts'], row['userId'], row['level'], songid, artistid, row['sessionId'],
67 | row['location'], row['userAgent'])
68 | cur.execute(songplay_table_insert, songplay_data)
69 |
70 |
71 | def process_data(cur, conn, filepath, func):
72 | """
73 | - Iterate over all files and populate data tables in sparkifydb
74 | """
75 | # get all files matching extension from directory
76 | all_files = []
77 | for root, dirs, files in os.walk(filepath):
78 | files = glob.glob(os.path.join(root,'*.json'))
79 | for f in files :
80 | all_files.append(os.path.abspath(f))
81 |
82 | # get total number of files found
83 | num_files = len(all_files)
84 | print('{} files found in {}'.format(num_files, filepath))
85 |
86 | # iterate over files and process
87 | for i, datafile in enumerate(all_files, 1):
88 | func(cur, datafile)
89 | conn.commit()
90 | print('{}/{} files processed.'.format(i, num_files))
91 |
92 |
93 | def main():
94 | """
95 | - Establishes connection with the sparkify database and gets
96 | cursor to it.
97 |
98 | - Runs ETL pipelines
99 | """
100 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
101 | cur = conn.cursor()
102 |
103 | process_data(cur, conn, filepath='data/song_data', func=process_song_file)
104 | process_data(cur, conn, filepath='data/log_data', func=process_log_file)
105 |
106 | conn.close()
107 |
108 |
109 | if __name__ == "__main__":
110 | main()
--------------------------------------------------------------------------------
/1. Postgres ETL/schema.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/1. Postgres ETL/schema.PNG
--------------------------------------------------------------------------------
/1. Postgres ETL/sql_queries.py:
--------------------------------------------------------------------------------
1 | # DROP TABLES
2 |
3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays;"
4 | user_table_drop = "DROP TABLE IF EXISTS users;"
5 | song_table_drop = "DROP TABLE IF EXISTS songs;"
6 | artist_table_drop = "DROP TABLE IF EXISTS artists;"
7 | time_table_drop = "DROP TABLE IF EXISTS time;"
8 |
9 | # CREATE TABLES
10 |
11 | songplay_table_create = ("""
12 | CREATE TABLE songplays
13 | (songplay_id SERIAL PRIMARY KEY,
14 | start_time bigint REFERENCES time(start_time) ON DELETE RESTRICT,
15 | user_id int REFERENCES users(user_id) ON DELETE RESTRICT,
16 | level varchar,
17 | song_id varchar REFERENCES songs(song_id) ON DELETE RESTRICT,
18 | artist_id varchar REFERENCES artists(artist_id) ON DELETE RESTRICT,
19 | session_id int,
20 | location varchar,
21 | user_agent varchar);
22 | """)
23 |
24 | user_table_create = ("""
25 | CREATE TABLE users
26 | (user_id int PRIMARY KEY,
27 | first_name varchar,
28 | last_name varchar,
29 | gender varchar,
30 | level varchar);
31 | """)
32 |
33 | song_table_create = ("""
34 | CREATE TABLE songs
35 | (song_id varchar PRIMARY KEY,
36 | title varchar,
37 | artist_id varchar,
38 | year int,
39 | duration float);
40 | """)
41 |
42 | artist_table_create = ("""
43 | CREATE TABLE artists
44 | (artist_id varchar PRIMARY KEY,
45 | name varchar,
46 | location varchar,
47 | latitude float,
48 | longitude float);
49 | """)
50 |
51 | time_table_create = ("""
52 | CREATE TABLE time
53 | (start_time bigint PRIMARY KEY,
54 | hour int,
55 | day int,
56 | week int,
57 | month int,
58 | year int,
59 | weekday int);
60 | """)
61 |
62 | # INSERT RECORDS
63 |
64 | songplay_table_insert = ("""
65 | INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
66 | session_id, location, user_agent)
67 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
70 | """)
71 |
72 | user_table_insert = ("""
73 | INSERT INTO users (user_id, first_name, last_name, gender, level)
74 | VALUES (%s, %s, %s, %s, %s)
75 | ON CONFLICT (user_id) DO UPDATE
76 | SET level=excluded.level;
77 | """)
78 |
79 | song_table_insert = ("""
80 | INSERT INTO songs (song_id, title, artist_id, year, duration)
81 | VALUES (%s, %s, %s, %s, %s)
82 | ON CONFLICT (song_id)
83 | DO NOTHING;
84 | """)
85 |
86 | artist_table_insert = ("""
87 | INSERT INTO artists (artist_id, name, location, latitude, longitude)
88 | VALUES (%s, %s, %s, %s, %s)
89 | ON CONFLICT (artist_id)
90 | DO NOTHING;
91 | """)
92 |
93 |
94 | time_table_insert = ("""
95 | INSERT INTO time (start_time, hour, day, week, month, year, weekday)
96 | VALUES (%s, %s, %s, %s, %s, %s, %s)
97 | ON CONFLICT (start_time)
98 | DO NOTHING;
99 | """)
100 |
101 | # FIND SONGS
102 |
103 | song_select = ("""
104 | SELECT songs.song_id, artists.artist_id FROM songs
105 | JOIN artists ON songs.artist_id=artists.artist_id
106 | WHERE songs.title=%s AND artists.name=%s AND songs.duration=%s;
107 | """)
108 |
109 | # QUERY LISTS
110 |
111 | create_table_queries = [user_table_create, song_table_create, artist_table_create, time_table_create, songplay_table_create]
112 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
--------------------------------------------------------------------------------
/2. Cassandra ETL/event_data/2018-11-01-events.csv:
--------------------------------------------------------------------------------
1 | artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userId
2 | ,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54092E+12,38,,200,1.54111E+12,39
3 | ,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1.54034E+12,139,,200,1.54111E+12,8
4 | Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,You Gotta Be,200,1.54111E+12,8
5 | ,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1.54034E+12,139,,200,1.54111E+12,8
6 | Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Flat 55,200,1.54111E+12,8
7 | Tamba Trio,Logged In,Kaylee,F,4,Summers,177.18812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Quem Quiser Encontrar O Amor,200,1.54111E+12,8
8 | The Mars Volta,Logged In,Kaylee,F,5,Summers,380.42077,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Eriatarka,200,1.54111E+12,8
9 | Infected Mushroom,Logged In,Kaylee,F,6,Summers,440.2673,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Becoming Insane,200,1.54111E+12,8
10 | Blue October / Imogen Heap,Logged In,Kaylee,F,7,Summers,241.3971,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Congratulations,200,1.54111E+12,8
11 | Girl Talk,Logged In,Kaylee,F,8,Summers,160.15628,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Once again,200,1.54111E+12,8
12 | Black Eyed Peas,Logged In,Sylvie,F,0,Cruz,214.93506,free,"Washington-Arlington-Alexandria, DC-VA-MD-WV",PUT,NextSong,1.54027E+12,9,Pump It,200,1.54111E+12,10
13 | ,Logged In,Ryan,M,0,Smith,,free,"San Jose-Sunnyvale-Santa Clara, CA",GET,Home,1.54102E+12,169,,200,1.54111E+12,26
14 | Fall Out Boy,Logged In,Ryan,M,1,Smith,200.72444,free,"San Jose-Sunnyvale-Santa Clara, CA",PUT,NextSong,1.54102E+12,169,Nobody Puts Baby In The Corner,200,1.54111E+12,26
15 | M.I.A.,Logged In,Ryan,M,2,Smith,233.7171,free,"San Jose-Sunnyvale-Santa Clara, CA",PUT,NextSong,1.54102E+12,169,Mango Pickle Down River (With The Wilcannia Mob),200,1.54111E+12,26
16 | Survivor,Logged In,Jayden,M,0,Fox,245.36771,free,"New Orleans-Metairie, LA",PUT,NextSong,1.54103E+12,100,Eye Of The Tiger,200,1.54111E+12,101
17 |
--------------------------------------------------------------------------------
/2. Cassandra ETL/event_data/2018-11-25-events.csv:
--------------------------------------------------------------------------------
1 | artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userId
2 | matchbox twenty,Logged In,Jayden,F,0,Duffy,177.65832,free,"Seattle-Tacoma-Bellevue, WA",PUT,NextSong,1.54015E+12,846,Argue (LP Version),200,1.54311E+12,76
3 | The Lonely Island / T-Pain,Logged In,Jayden,F,1,Duffy,156.23791,free,"Seattle-Tacoma-Bellevue, WA",PUT,NextSong,1.54015E+12,846,I'm On A Boat,200,1.54311E+12,76
4 | ,Logged In,Jayden,F,2,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",GET,Home,1.54015E+12,846,,200,1.54311E+12,76
5 | ,Logged In,Jayden,F,3,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",GET,Settings,1.54015E+12,846,,200,1.54311E+12,76
6 | ,Logged In,Jayden,F,4,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",PUT,Save Settings,1.54015E+12,846,,307,1.54311E+12,76
7 | John Mayer,Logged In,Wyatt,M,0,Scott,275.27791,free,"Eureka-Arcata-Fortuna, CA",PUT,NextSong,1.54087E+12,856,All We Ever Do Is Say Goodbye,200,1.54311E+12,9
8 | ,Logged In,Wyatt,M,1,Scott,,free,"Eureka-Arcata-Fortuna, CA",GET,Home,1.54087E+12,856,,200,1.54311E+12,9
9 | 10_000 Maniacs,Logged In,Wyatt,M,2,Scott,251.8722,free,"Eureka-Arcata-Fortuna, CA",PUT,NextSong,1.54087E+12,856,Gun Shy (LP Version),200,1.54311E+12,9
10 | Leona Lewis,Logged In,Chloe,F,0,Cuevas,203.88526,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Forgive Me,200,1.54312E+12,49
11 | Nine Inch Nails,Logged In,Chloe,F,1,Cuevas,277.83791,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,La Mer,200,1.54312E+12,49
12 | Audioslave,Logged In,Chloe,F,2,Cuevas,334.91546,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Am The Highway,200,1.54312E+12,49
13 | Kid Rock,Logged In,Chloe,F,3,Cuevas,296.95955,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,All Summer Long (Album Version),200,1.54312E+12,49
14 | The Jets,Logged In,Chloe,F,4,Cuevas,220.89098,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Do You,200,1.54312E+12,49
15 | The Gerbils,Logged In,Chloe,F,5,Cuevas,27.01016,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,(iii),200,1.54312E+12,49
16 | Damian Marley / Stephen Marley / Yami Bolo,Logged In,Chloe,F,6,Cuevas,304.69179,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Still Searching,200,1.54312E+12,49
17 | ,Logged In,Chloe,F,7,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54312E+12,49
18 | The Bloody Beetroots,Logged In,Chloe,F,8,Cuevas,201.97832,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Warp 1.9 (feat. Steve Aoki),200,1.54312E+12,49
19 | ,Logged In,Chloe,F,9,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54313E+12,49
20 | The Specials,Logged In,Chloe,F,10,Cuevas,188.81261,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Rat Race,200,1.54313E+12,49
21 | The Lively Ones,Logged In,Chloe,F,11,Cuevas,142.52363,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Walkin' The Board (LP Version),200,1.54313E+12,49
22 | Katie Melua,Logged In,Chloe,F,12,Cuevas,252.78649,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Blues In The Night,200,1.54313E+12,49
23 | Jason Mraz,Logged In,Chloe,F,13,Cuevas,243.48689,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I'm Yours (Album Version),200,1.54313E+12,49
24 | Fisher,Logged In,Chloe,F,14,Cuevas,133.98159,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Rianna,200,1.54313E+12,49
25 | Zee Avi,Logged In,Chloe,F,15,Cuevas,160.62649,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,No Christmas For Me,200,1.54313E+12,49
26 | Black Eyed Peas,Logged In,Chloe,F,16,Cuevas,289.12281,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Gotta Feeling,200,1.54313E+12,49
27 | Emiliana Torrini,Logged In,Chloe,F,17,Cuevas,184.29342,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Sunny Road,200,1.54313E+12,49
28 | ,Logged In,Chloe,F,18,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54313E+12,49
29 | Days Of The New,Logged In,Chloe,F,19,Cuevas,258.5073,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,The Down Town,200,1.54313E+12,49
30 | Julio Iglesias duet with Willie Nelson,Logged In,Chloe,F,20,Cuevas,212.16608,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,To All The Girls I've Loved Before (With Julio Iglesias),200,1.54313E+12,49
31 | ,Logged In,Jacqueline,F,0,Lynch,,paid,"Atlanta-Sandy Springs-Roswell, GA",GET,Home,1.54022E+12,914,,200,1.54313E+12,29
32 | Jason Mraz & Colbie Caillat,Logged In,Chloe,F,0,Roth,189.6224,free,"Indianapolis-Carmel-Anderson, IN",PUT,NextSong,1.5407E+12,704,Lucky (Album Version),200,1.54314E+12,78
33 | ,Logged In,Anabelle,F,0,Simpson,,free,"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD",GET,Home,1.54104E+12,901,,200,1.54315E+12,69
34 | R. Kelly,Logged In,Anabelle,F,1,Simpson,234.39628,free,"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD",PUT,NextSong,1.54104E+12,901,The World's Greatest,200,1.54315E+12,69
35 | ,Logged In,Kynnedi,F,0,Sanchez,,free,"Cedar Rapids, IA",GET,Home,1.54108E+12,804,,200,1.54315E+12,89
36 | Jacky Terrasson,Logged In,Marina,F,0,Sutton,342.7522,free,"Salinas, CA",PUT,NextSong,1.54106E+12,373,Le Jardin d'Hiver,200,1.54315E+12,48
37 | Papa Roach,Logged In,Theodore,M,0,Harris,202.1873,free,"Red Bluff, CA",PUT,NextSong,1.5411E+12,813,Alive,200,1.54316E+12,14
38 | Burt Bacharach,Logged In,Theodore,M,1,Harris,156.96934,free,"Red Bluff, CA",PUT,NextSong,1.5411E+12,813,Casino Royale Theme (Main Title),200,1.54316E+12,14
39 | ,Logged In,Chloe,F,0,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,923,,200,1.54316E+12,49
40 | Floetry,Logged In,Chloe,F,1,Cuevas,254.48444,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Sunshine,200,1.54316E+12,49
41 | The Rakes,Logged In,Chloe,F,2,Cuevas,225.2273,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Leave The City And Come Home,200,1.54316E+12,49
42 | Dwight Yoakam,Logged In,Chloe,F,3,Cuevas,239.3073,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,You're The One,200,1.54316E+12,49
43 | Ween,Logged In,Chloe,F,4,Cuevas,228.10077,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Voodoo Lady,200,1.54316E+12,49
44 | Café Quijano,Logged In,Chloe,F,5,Cuevas,197.32853,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,La Lola,200,1.54316E+12,49
45 | ,Logged In,Chloe,F,0,Roth,,free,"Indianapolis-Carmel-Anderson, IN",GET,Home,1.5407E+12,925,,200,1.54317E+12,78
46 | Parov Stelar,Logged In,Chloe,F,1,Roth,203.65016,free,"Indianapolis-Carmel-Anderson, IN",PUT,NextSong,1.5407E+12,925,Good Bye Emily (feat. Gabriella Hanninen),200,1.54317E+12,78
47 | ,Logged In,Chloe,F,2,Roth,,free,"Indianapolis-Carmel-Anderson, IN",GET,Home,1.5407E+12,925,,200,1.54317E+12,78
48 | ,Logged In,Tegan,F,0,Levine,,paid,"Portland-South Portland, ME",GET,Home,1.54079E+12,915,,200,1.54317E+12,80
49 | Bryan Adams,Logged In,Tegan,F,1,Levine,166.29506,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,I Will Always Return,200,1.54317E+12,80
50 | KT Tunstall,Logged In,Tegan,F,2,Levine,192.31302,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,White Bird,200,1.54317E+12,80
51 | Technicolour,Logged In,Tegan,F,3,Levine,235.12771,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Turn Away,200,1.54317E+12,80
52 | The Dears,Logged In,Tegan,F,4,Levine,289.95873,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Lost In The Plot,200,1.54317E+12,80
53 | Go West,Logged In,Tegan,F,5,Levine,259.49995,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Never Let Them See You Sweat,200,1.54317E+12,80
54 | ,Logged In,Tegan,F,6,Levine,,paid,"Portland-South Portland, ME",PUT,Logout,1.54079E+12,915,,307,1.54317E+12,80
55 | ,Logged In,Sylvie,F,0,Cruz,,free,"Washington-Arlington-Alexandria, DC-VA-MD-WV",GET,Home,1.54027E+12,912,,200,1.54317E+12,10
56 | ,Logged Out,,,7,,,paid,,GET,Home,,915,,200,1.54317E+12,
57 | Gondwana,Logged In,Jordan,F,0,Hicks,262.5824,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,Mi Princesa,200,1.54319E+12,37
58 | ,Logged In,Kevin,M,0,Arellano,,free,"Harrisburg-Carlisle, PA",GET,Home,1.54001E+12,855,,200,1.54319E+12,66
59 | Ella Fitzgerald,Logged In,Jordan,F,1,Hicks,427.15383,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,On Green Dolphin Street (Medley) (1999 Digital Remaster),200,1.54319E+12,37
60 | Creedence Clearwater Revival,Logged In,Jordan,F,2,Hicks,184.73751,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,Run Through The Jungle,200,1.54319E+12,37
61 |
--------------------------------------------------------------------------------
/2. Cassandra ETL/images/image_event_datafile_new.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/2. Cassandra ETL/images/image_event_datafile_new.jpg
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | * This repo provides an ETL pipeline that populates the books database, specifically its titles collection.
4 | * It contains the code to scrape a book-listing website, extracting each book's title, price and rating, along with whether it is in stock and its URL.
5 | * The code in this repository scrapes "http://books.toscrape.com/".
6 | * It then ingests the scraped data into MongoDB running on localhost, port 27017, in a database called "books" and a collection called "titles".
7 |
8 | ## Running the ETL Pipeline
9 | ---
10 | * First, make sure MongoDB is running on localhost, port 27017
11 | * Next, run ```scrapy crawl books``` from the Scrapy project folder "books"
12 | * You can now confirm that the data was stored in the books database using MongoDB Compass, or with the short pymongo check sketched below
13 |
14 | 
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/3. Web Scraping using Scrapy, Mongo ETL/books.PNG
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/3. Web Scraping using Scrapy, Mongo ETL/books/books/__init__.py
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://docs.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class BooksItem(scrapy.Item):
12 | # define the fields for your item here like:
13 | title = scrapy.Field()
14 | price = scrapy.Field()
15 | in_stock = scrapy.Field()
16 | rating = scrapy.Field()
17 | url = scrapy.Field()
18 |
19 |
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your spider middleware
4 | #
5 | # See documentation in:
6 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
7 |
8 | from scrapy import signals
9 |
10 |
11 | class BooksSpiderMiddleware(object):
12 | # Not all methods need to be defined. If a method is not defined,
13 | # scrapy acts as if the spider middleware does not modify the
14 | # passed objects.
15 |
16 | @classmethod
17 | def from_crawler(cls, crawler):
18 | # This method is used by Scrapy to create your spiders.
19 | s = cls()
20 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
21 | return s
22 |
23 | def process_spider_input(self, response, spider):
24 | # Called for each response that goes through the spider
25 | # middleware and into the spider.
26 |
27 | # Should return None or raise an exception.
28 | return None
29 |
30 | def process_spider_output(self, response, result, spider):
31 | # Called with the results returned from the Spider, after
32 | # it has processed the response.
33 |
34 | # Must return an iterable of Request, dict or Item objects.
35 | for i in result:
36 | yield i
37 |
38 | def process_spider_exception(self, response, exception, spider):
39 | # Called when a spider or process_spider_input() method
40 | # (from other spider middleware) raises an exception.
41 |
42 | # Should return either None or an iterable of Request, dict
43 | # or Item objects.
44 | pass
45 |
46 | def process_start_requests(self, start_requests, spider):
47 | # Called with the start requests of the spider, and works
48 | # similarly to the process_spider_output() method, except
49 | # that it doesn’t have a response associated.
50 |
51 | # Must return only requests (not items).
52 | for r in start_requests:
53 | yield r
54 |
55 | def spider_opened(self, spider):
56 | spider.logger.info('Spider opened: %s' % spider.name)
57 |
58 |
59 | class BooksDownloaderMiddleware(object):
60 | # Not all methods need to be defined. If a method is not defined,
61 | # scrapy acts as if the downloader middleware does not modify the
62 | # passed objects.
63 |
64 | @classmethod
65 | def from_crawler(cls, crawler):
66 | # This method is used by Scrapy to create your spiders.
67 | s = cls()
68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
69 | return s
70 |
71 | def process_request(self, request, spider):
72 | # Called for each request that goes through the downloader
73 | # middleware.
74 |
75 | # Must either:
76 | # - return None: continue processing this request
77 | # - or return a Response object
78 | # - or return a Request object
79 | # - or raise IgnoreRequest: process_exception() methods of
80 | # installed downloader middleware will be called
81 | return None
82 |
83 | def process_response(self, request, response, spider):
84 | # Called with the response returned from the downloader.
85 |
86 | # Must either;
87 | # - return a Response object
88 | # - return a Request object
89 | # - or raise IgnoreRequest
90 | return response
91 |
92 | def process_exception(self, request, exception, spider):
93 | # Called when a download handler or a process_request()
94 | # (from other downloader middleware) raises an exception.
95 |
96 | # Must either:
97 | # - return None: continue processing this exception
98 | # - return a Response object: stops process_exception() chain
99 | # - return a Request object: stops process_exception() chain
100 | pass
101 |
102 | def spider_opened(self, spider):
103 | spider.logger.info('Spider opened: %s' % spider.name)
104 |
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
7 | import logging
8 | import pymongo
9 |
10 | class MongoDBPipeline(object):
11 |
12 | def __init__(self, mongo_uri, mongo_db, collection_name):
13 | self.mongo_uri = mongo_uri
14 | self.mongo_db = mongo_db
15 | self.collection_name = collection_name
16 |
17 | @classmethod
18 | def from_crawler(cls, crawler):
19 | # pull info from settings.py
20 | return cls(
21 | mongo_uri = crawler.settings.get('MONGO_URI'),
22 | mongo_db = crawler.settings.get('MONGO_DB'),
23 | collection_name = crawler.settings.get('MONGO_COLLECTION')
24 | )
25 |
26 | def open_spider(self, spider):
27 | # initialize spider
28 | # open db connection
29 | self.client = pymongo.MongoClient(self.mongo_uri)
30 | self.db = self.client[self.mongo_db]
31 |
32 | def close_spider(self, spider):
33 | # clean up when spider is closed
34 | self.client.close()
35 |
36 | def process_item(self, item, spider):
37 | print('collection:', self.collection_name)
38 | self.db[self.collection_name].insert_one(dict(item))
39 | logging.debug("Title added to MongoDB")
40 | return item
41 |
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Scrapy settings for books project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | # https://docs.scrapy.org/en/latest/topics/settings.html
9 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
10 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
11 |
12 | BOT_NAME = 'books'
13 |
14 | SPIDER_MODULES = ['books.spiders']
15 | NEWSPIDER_MODULE = 'books.spiders'
16 |
17 |
18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 | #USER_AGENT = 'books (+http://www.yourdomain.com)'
20 |
21 | # Obey robots.txt rules
22 | ROBOTSTXT_OBEY = True
23 |
24 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 | #CONCURRENT_REQUESTS = 32
26 |
27 | # Configure a delay for requests for the same website (default: 0)
28 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
29 | # See also autothrottle settings and docs
30 | #DOWNLOAD_DELAY = 3
31 | # The download delay setting will honor only one of:
32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 | #CONCURRENT_REQUESTS_PER_IP = 16
34 |
35 | # Disable cookies (enabled by default)
36 | #COOKIES_ENABLED = False
37 |
38 | # Disable Telnet Console (enabled by default)
39 | #TELNETCONSOLE_ENABLED = False
40 |
41 | # Override the default request headers:
42 | #DEFAULT_REQUEST_HEADERS = {
43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
44 | # 'Accept-Language': 'en',
45 | #}
46 |
47 | # Enable or disable spider middlewares
48 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
49 | #SPIDER_MIDDLEWARES = {
50 | # 'books.middlewares.BooksSpiderMiddleware': 543,
51 | #}
52 |
53 | # Enable or disable downloader middlewares
54 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
55 | #DOWNLOADER_MIDDLEWARES = {
56 | # 'books.middlewares.BooksDownloaderMiddleware': 543,
57 | #}
58 |
59 | # Enable or disable extensions
60 | # See https://docs.scrapy.org/en/latest/topics/extensions.html
61 | #EXTENSIONS = {
62 | # 'scrapy.extensions.telnet.TelnetConsole': None,
63 | #}
64 |
65 | # Configure item pipelines
66 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
67 | #ITEM_PIPELINES = {
68 | # 'books.pipelines.BooksPipeline': 300,
69 | #}
70 |
71 | # Enable and configure the AutoThrottle extension (disabled by default)
72 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
73 | #AUTOTHROTTLE_ENABLED = True
74 | # The initial download delay
75 | #AUTOTHROTTLE_START_DELAY = 5
76 | # The maximum download delay to be set in case of high latencies
77 | #AUTOTHROTTLE_MAX_DELAY = 60
78 | # The average number of requests Scrapy should be sending in parallel to
79 | # each remote server
80 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 | # Enable showing throttling stats for every response received:
82 | #AUTOTHROTTLE_DEBUG = False
83 |
84 | # Enable and configure HTTP caching (disabled by default)
85 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 | #HTTPCACHE_ENABLED = True
87 | #HTTPCACHE_EXPIRATION_SECS = 0
88 | #HTTPCACHE_DIR = 'httpcache'
89 | #HTTPCACHE_IGNORE_HTTP_CODES = []
90 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 |
92 | ITEM_PIPELINES = {'books.pipelines.MongoDBPipeline': 300}
93 | MONGO_URI = 'mongodb://localhost:27017'
94 | MONGO_DB = "books"
95 | MONGO_COLLECTION = 'titles'
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/books/spiders/books_spider.py:
--------------------------------------------------------------------------------
1 | from scrapy import Spider
2 | from scrapy.selector import Selector
3 | from books.items import BooksItem
4 |
5 | class BooksSpider(Spider):
6 | name = 'books' # name of spider
7 | allowed_domains = ['books.toscrape.com']  # domains the spider may crawl (domains only, no scheme or path)
8 | start_urls = [
9 | "http://books.toscrape.com/",
10 | ]
11 |
12 | def parse(self, response):
13 | books = Selector(response).xpath('//article[@class="product_pod"]')
14 | for book in books:
15 | item = BooksItem()
16 | item['title'] = book.xpath(
17 | 'div/a/img/@alt').extract()[0]
18 | item['price'] = book.xpath(
19 | 'div/p[@class="price_color"]/text()').extract()[0]
20 | instock_status = "".join(book.xpath(
21 | 'div/p[@class="instock availability"]/text()').extract())
22 | instock_status = instock_status.strip('\n')
23 | instock_status = instock_status.strip()
24 | item['in_stock'] = instock_status
25 | rating = book.xpath(
26 | 'p[contains(@class, "star-rating")]/@class').extract()[0]
27 | rating = rating.replace("star-rating ", "")
28 | item['rating'] = rating
29 | item['url'] = book.xpath(
30 | 'div[@class="image_container"]/a/@href').extract()[0]
31 | yield item
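The spider is normally launched with `scrapy crawl books` from the directory containing scrapy.cfg. As an alternative, the minimal sketch below (not part of the repo) runs it programmatically through Scrapy's CrawlerProcess, which loads the project settings and therefore also the MongoDB pipeline:

```python
# Minimal sketch: run the books spider from a Python script.
# Assumes it is executed from the project root, where scrapy.cfg lives.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from books.spiders.books_spider import BooksSpider

process = CrawlerProcess(get_project_settings())  # loads books/settings.py
process.crawl(BooksSpider)
process.start()  # blocks until the crawl is finished
```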
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/books/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html
5 |
6 | [settings]
7 | default = books.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = books
12 |
--------------------------------------------------------------------------------
/3. Web Scraping using Scrapy, Mongo ETL/requirements.txt:
--------------------------------------------------------------------------------
1 | attrs==19.3.0
2 | Automat==20.2.0
3 | cffi==1.14.0
4 | constantly==15.1.0
5 | cryptography==3.3.2
6 | cssselect==1.1.0
7 | hyperlink==19.0.0
8 | idna==2.9
9 | incremental==17.5.0
10 | lxml==4.6.5
11 | parsel==1.5.2
12 | Protego==0.1.16
13 | pyasn1==0.4.8
14 | pyasn1-modules==0.2.8
15 | pycparser==2.20
16 | PyDispatcher==2.0.5
17 | PyHamcrest==2.0.2
18 | pymongo==3.10.1
19 | pyOpenSSL==19.1.0
20 | queuelib==1.5.0
21 | Scrapy==2.6.1
22 | service-identity==18.1.0
23 | six==1.14.0
24 | Twisted==22.2.0
25 | w3lib==1.21.0
26 | zope.interface==5.1.0
27 |
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | This repo provides the ETL pipeline to populate the sparkifydb database in AWS Redshift.
4 | * The purpose of this database is to enable Sparkify to answer business questions it may have about its users, the types of songs they listen to, and the artists of those songs, using the data it has in logs and files. The database provides a consistent and reliable source to store this data.
5 |
6 | * This data will be useful in helping Sparkify reach some of its analytical goals, for example, finding the most popular songs or the times of day with the highest traffic.
7 |
8 | ## Why Redshift?
9 | ---
10 | * Redshift is a fully managed, cloud-based, petabyte-scale data warehouse service by Amazon Web Services (AWS). It is an efficient solution to collect and store all data and enables analysis using various business intelligence tools to acquire new insights for businesses and their customers.
11 | 
12 |
13 | ## Database Design
14 | ---
15 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data.
16 | 
17 |
18 | * songplays is our fact table, with the rest being our dimension tables.
19 |
20 | ## Data Pipeline design
21 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation, as well as psycopg2 for connecting to the Redshift (Postgres-compatible) database.
22 |
23 | * There are 2 types of data involved, song and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables.
24 |
25 | * First, we load song and log data from JSON format in S3 into our staging tables (staging_songs_table and staging_events_table)
26 |
27 | * Next, we perform ETL using SQL, from the staging tables to our fact and dimension tables. The diagram below shows the architectural design of this pipeline:
28 | 
29 |
30 | ## Files
31 | ---
32 | * create_tables.py is the Python script that drops and recreates all tables (including staging tables)
33 |
34 | * sql_queries.py is the Python file containing all SQL queries. It is imported by create_tables.py and etl.py
35 |
36 | * etl.py is the Python script that loads data into the staging tables, then loads data from the staging tables into the fact and dimension tables
37 |
38 | * redshift_cluster_setup.py sets up the redshift cluster and creates an IAM role for redshift to access other AWS services
39 |
40 | * redshift_cluster_teardown.py removes the redshift cluster and IAM role created
41 |
42 | * dwh.cfg contains configurations for Redshift database. Please edit according to the Redshift cluster and database created on AWS
43 |
44 | ## Running the ETL Pipeline
45 | ---
46 | * First, run create_tables.py to create the data tables using the schema design specified. If tables were created previously, they will be dropped and recreated.
47 |
48 | * Next, run etl.py to populate the data tables created.
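* Once both scripts have completed, the warehouse can be sanity-checked with a quick query. The snippet below is a minimal sketch (not part of the repo) that reuses dwh.cfg and psycopg2 in the same way create_tables.py does, listing the five most-played songs:

```python
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')

# Connect using the same [CLUSTER] settings as create_tables.py and etl.py
conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
cur = conn.cursor()

cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)
conn.close()
```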
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/create_tables.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import create_table_queries, drop_table_queries
4 |
5 |
6 | def drop_tables(cur, conn):
7 | """
8 | Description: Drops each table using the queries in `drop_table_queries` list in sql_queries.
9 |
10 | Arguments:
11 | cur: the cursor object.
12 | conn: connection object to redshift.
13 |
14 | Returns:
15 | None
16 | """
17 | for query in drop_table_queries:
18 | cur.execute(query)
19 | conn.commit()
20 |
21 |
22 | def create_tables(cur, conn):
23 | """
24 | Description: Creates each table using the queries in `create_table_queries` list in sql_queries.
25 |
26 | Arguments:
27 | cur: the cursor object.
28 | conn: connection object to redshift.
29 |
30 | Returns:
31 | None
32 | """
33 | for query in create_table_queries:
34 | cur.execute(query)
35 | conn.commit()
36 |
37 |
38 | def main():
39 | """
40 | Description:
41 | - Establishes connection with the sparkify database and gets
42 | cursor to it (on AWS redshift cluster created earlier).
43 |
44 | - Drops all the tables.
45 |
46 | - Creates all tables needed.
47 |
48 | - Finally, closes the connection.
49 |
50 | Returns:
51 | None
52 | """
53 | config = configparser.ConfigParser()
54 | config.read('dwh.cfg')
55 |
56 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
57 | cur = conn.cursor()
58 |
59 | drop_tables(cur, conn)
60 | create_tables(cur, conn)
61 |
62 | conn.close()
63 |
64 |
65 | if __name__ == "__main__":
66 | main()
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/dwh.cfg:
--------------------------------------------------------------------------------
1 | [AWS]
2 | KEY=
3 | SECRET=
4 |
5 | [DWH]
6 | DWH_CLUSTER_TYPE=multi-node
7 | DWH_NUM_NODES=4
8 | DWH_NODE_TYPE=dc2.large
9 |
10 | DWH_IAM_ROLE_NAME=dwhRole
11 | DWH_CLUSTER_IDENTIFIER=dwhCluster
12 | DWH_DB=
13 | DWH_DB_USER=
14 | DWH_DB_PASSWORD=
15 | DWH_PORT=5439
16 |
17 | [CLUSTER]
18 | HOST=
19 | DB_NAME=
20 | DB_USER=
21 | DB_PASSWORD=
22 | DB_PORT=5439
23 |
24 | [IAM_ROLE]
25 | ARN=
26 |
27 | [S3]
28 | LOG_DATA='s3://udacity-dend/log_data'
29 | LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
30 | SONG_DATA='s3://udacity-dend/song_data'
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import psycopg2
3 | from sql_queries import (copy_table_queries, insert_table_queries, copy_staging_order,
4 |                          count_staging_queries, insert_table_order, count_fact_dim_queries)
5 |
6 |
7 | def load_staging_tables(cur, conn):
8 | """
9 | Description: Copies data in json format in S3 to staging tables in redshift.
10 |
11 | Arguments:
12 | cur: the cursor object.
13 | conn: connection object to redshift.
14 |
15 | Returns:
16 | None
17 | """
18 | for idx, query in enumerate(copy_table_queries):
19 | cur.execute(query)
20 | conn.commit()
21 |         cur.execute(count_staging_queries[idx])
22 |         print('No. of rows copied into {}: {}'.format(copy_staging_order[idx], cur.fetchone()[0]))
23 |
24 |
25 | def insert_tables(cur, conn):
26 | """
27 | Description: ETL from staging tables to songplays fact and its dimension
28 | tables in redshift.
29 |
30 | Arguments:
31 | cur: the cursor object.
32 | conn: connection object to redshift.
33 |
34 | Returns:
35 | None
36 | """
37 | for idx, query in enumerate(insert_table_queries):
38 | cur.execute(query)
39 | conn.commit()
40 |         cur.execute(count_fact_dim_queries[idx])
41 |         print('No. of rows inserted into {}: {}'.format(insert_table_order[idx], cur.fetchone()[0]))
42 |
43 |
44 | def main():
45 | """
46 | Description:
47 | - Establishes connection with the sparkify database and gets
48 | cursor to it (on AWS redshift cluster created earlier).
49 |
50 | - Loads staging tables from raw log and song files to redshift database
51 |
52 | - From staging tables, perform ETL to songplays fact and its dimension
53 | tables in redshift using SQL
54 |
55 | Returns:
56 | None
57 | """
58 | config = configparser.ConfigParser()
59 | config.read('dwh.cfg')
60 |
61 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
62 | cur = conn.cursor()
63 |
64 | load_staging_tables(cur, conn)
65 | insert_tables(cur, conn)
66 |
67 | conn.close()
68 |
69 |
70 | if __name__ == "__main__":
71 | main()
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/redshift_cluster_setup.py:
--------------------------------------------------------------------------------
1 | import boto3
2 | import json
3 | import configparser
4 |
5 |
6 | def create_iam_role(iam, DWH_IAM_ROLE_NAME):
7 | """
8 | Description:
9 | - Creates an IAM role that allows Redshift to call on
10 | other AWS services
11 |
12 | Returns:
13 | - Role Arn
14 | """
15 | # Create the IAM role
16 | try:
17 | print('1.1 Creating a new IAM Role')
18 | dwh_role = iam.create_role(
19 | Path = '/',
20 | RoleName = DWH_IAM_ROLE_NAME,
21 | Description = 'Allows Redshift cluster to call AWS service on your behalf.',
22 | AssumeRolePolicyDocument = json.dumps(
23 | {'Statement': [{'Action': 'sts:AssumeRole',
24 | 'Effect': 'Allow',
25 | 'Principal': {'Service': 'redshift.amazonaws.com'}}],
26 | 'Version': '2012-10-17'})
27 | )
28 | # Attach Policy
29 | iam.attach_role_policy(RoleName=DWH_IAM_ROLE_NAME,
30 | PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
31 | )['ResponseMetadata']['HTTPStatusCode']
32 | role_arn = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']
33 | return role_arn
34 | except Exception as e:
35 | print(e)
36 |
37 |
38 |
39 | def main():
40 | """
41 | Description:
42 | - Sets up a Redshift cluster on AWS
43 |
44 | Returns:
45 | None
46 | """
47 | # Load DWH parameters from a file
48 | config = configparser.ConfigParser()
49 | config.read_file(open('dwh.cfg'))
50 | KEY = config.get('AWS','KEY')
51 | SECRET = config.get('AWS','SECRET')
52 | DWH_CLUSTER_TYPE = config.get("DWH","DWH_CLUSTER_TYPE")
53 | DWH_NUM_NODES = config.get("DWH","DWH_NUM_NODES")
54 | DWH_NODE_TYPE = config.get("DWH","DWH_NODE_TYPE")
55 | DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
56 | DWH_DB = config.get("DWH","DWH_DB")
57 | DWH_DB_USER = config.get("DWH","DWH_DB_USER")
58 | DWH_DB_PASSWORD = config.get("DWH","DWH_DB_PASSWORD")
59 | DWH_PORT = config.get("DWH","DWH_PORT")
60 | DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME")
61 |
62 | # Create clients for EC2, S3, IAM, and Redshift
63 | ec2 = boto3.resource('ec2',
64 | region_name='us-west-2',
65 | aws_access_key_id=KEY,
66 | aws_secret_access_key=SECRET)
67 |
68 | iam = boto3.client('iam',
69 | region_name='us-west-2',
70 | aws_access_key_id=KEY,
71 | aws_secret_access_key=SECRET)
72 |
73 | redshift = boto3.client('redshift',
74 | region_name="us-west-2",
75 | aws_access_key_id=KEY,
76 | aws_secret_access_key=SECRET)
77 |
78 |     role_arn = create_iam_role(iam, DWH_IAM_ROLE_NAME)
79 |
80 | # Create the cluster
81 | try:
82 | response = redshift.create_cluster(
83 | #HW
84 | ClusterType=DWH_CLUSTER_TYPE,
85 | NodeType=DWH_NODE_TYPE,
86 | NumberOfNodes=int(DWH_NUM_NODES),
87 |
88 | #Identifiers & Credentials
89 | DBName=DWH_DB,
90 | ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,
91 | MasterUsername=DWH_DB_USER,
92 | MasterUserPassword=DWH_DB_PASSWORD,
93 |
94 | #Roles (for s3 access)
95 | IamRoles=[role_arn]
96 | )
97 |         # Look up the new cluster's VPC, then open an incoming TCP port to access the cluster endpoint
98 |         myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
99 |         vpc = ec2.Vpc(id=myClusterProps['VpcId'])
100 | default_sg = list(vpc.security_groups.all())[0]
101 | default_sg.authorize_ingress(
102 | GroupName=default_sg.group_name,
103 | CidrIp='0.0.0.0/0',
104 | IpProtocol='TCP',
105 | FromPort=int(DWH_PORT),
106 | ToPort=int(DWH_PORT)
107 | )
108 | except Exception as e:
109 | print(e)
110 |
111 |     print("Cluster creation has been initiated, check the status and details of the cluster on AWS")
112 |
113 |
114 | if __name__ == "__main__":
115 | main()
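Cluster creation is asynchronous, so the endpoint and role ARN needed to fill in the [CLUSTER] and [IAM_ROLE] sections of dwh.cfg only become available once the cluster reaches the `available` state. A minimal sketch for checking this (not part of the repo, reusing the same dwh.cfg values):

```python
import configparser
import boto3

config = configparser.ConfigParser()
config.read('dwh.cfg')

redshift = boto3.client('redshift', region_name='us-west-2',
                        aws_access_key_id=config.get('AWS', 'KEY'),
                        aws_secret_access_key=config.get('AWS', 'SECRET'))

props = redshift.describe_clusters(
    ClusterIdentifier=config.get('DWH', 'DWH_CLUSTER_IDENTIFIER'))['Clusters'][0]

print('Cluster status:', props['ClusterStatus'])
if props['ClusterStatus'] == 'available':
    # Copy these into the [CLUSTER] HOST and [IAM_ROLE] ARN fields of dwh.cfg
    print('HOST:', props['Endpoint']['Address'])
    print('ARN :', props['IamRoles'][0]['IamRoleArn'])
```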
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/redshift_cluster_teardown.py:
--------------------------------------------------------------------------------
1 | import boto3
2 | import configparser
3 |
4 | def main():
5 | """
6 | Description:
7 |     - Tears down the Redshift cluster on AWS and removes the IAM role created
8 |
9 | Returns:
10 | None
11 | """
12 |     config = configparser.ConfigParser()
13 |     config.read('dwh.cfg')
14 |     KEY = config.get('AWS','KEY')
13 | SECRET = config.get('AWS','SECRET')
14 | DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
15 | DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME")
16 |
17 | redshift = boto3.client('redshift',
18 | region_name="us-west-2",
19 | aws_access_key_id=KEY,
20 | aws_secret_access_key=SECRET)
21 |
22 | iam = boto3.client('iam',
23 | region_name='us-west-2',
24 | aws_access_key_id=KEY,
25 | aws_secret_access_key=SECRET)
26 |
27 | redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,
28 | SkipFinalClusterSnapshot=True)
29 |
30 | # Remove role:
31 | iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME,
32 | PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
33 | iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)
34 |     print("Cluster deletion has been initiated and the IAM role has been removed")
35 |
36 | if __name__ == "__main__":
37 | main()
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/screenshots/architecture.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/architecture.PNG
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/screenshots/redshift.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/redshift.PNG
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/screenshots/schema.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/schema.PNG
--------------------------------------------------------------------------------
/4. Data Warehousing with AWS Redshift/sql_queries.py:
--------------------------------------------------------------------------------
1 | import configparser
2 |
3 |
4 | # CONFIG
5 | config = configparser.ConfigParser()
6 | config.read('dwh.cfg')
7 |
8 | # DROP TABLES
9 |
10 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events_table"
11 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs_table"
12 | songplay_table_drop = "DROP TABLE IF EXISTS songplays"
13 | user_table_drop = "DROP TABLE IF EXISTS users"
14 | song_table_drop = "DROP TABLE IF EXISTS songs"
15 | artist_table_drop = "DROP TABLE IF EXISTS artists"
16 | time_table_drop = "DROP TABLE IF EXISTS time"
17 |
18 | # CREATE TABLES
19 |
20 | staging_events_table_create= (
21 | """
22 | CREATE TABLE staging_events_table (
23 | stagingEventId bigint IDENTITY(0,1) PRIMARY KEY,
24 | artist VARCHAR(500),
25 | auth VARCHAR(20),
26 | firstName VARCHAR(500),
27 | gender CHAR(1),
28 | itemInSession SMALLINT,
29 | lastName VARCHAR(500),
30 | length NUMERIC,
31 | level VARCHAR(10),
32 | location VARCHAR(500),
33 | method VARCHAR(20),
34 | page VARCHAR(500),
35 | registration NUMERIC,
36 | sessionId SMALLINT,
37 | song VARCHAR,
38 | status SMALLINT,
39 | ts BIGINT,
40 | userAgent VARCHAR(500),
41 | userId SMALLINT
42 | )
43 | """
44 | )
45 |
46 | staging_songs_table_create = (
47 | """
48 | CREATE TABLE staging_songs_table (
49 | staging_song_id bigint IDENTITY(0,1) PRIMARY KEY,
50 | num_songs INTEGER NOT NULL,
51 | artist_id VARCHAR(20) NOT NULL,
52 | artist_latitude NUMERIC,
53 | artist_longitude NUMERIC,
54 | artist_location VARCHAR(500),
55 | artist_name VARCHAR(500) NOT NULL,
56 | song_id VARCHAR(20) NOT NULL,
57 | title VARCHAR(500) NOT NULL,
58 | duration NUMERIC NOT NULL,
59 | year SMALLINT NOT NULL
60 | );
61 | """
62 | )
63 |
64 | songplay_table_create = (
65 | """
66 | CREATE TABLE songplays (
67 | songplay_id BIGINT IDENTITY(0,1) PRIMARY KEY,
68 |         start_time TIMESTAMP REFERENCES time(start_time) distkey,
69 | user_id SMALLINT REFERENCES users(user_id),
70 | level VARCHAR(10),
71 | song_id VARCHAR(20) REFERENCES songs(song_id),
72 | artist_id VARCHAR(20) REFERENCES artists(artist_id),
73 | session_id SMALLINT,
74 | location VARCHAR(500),
75 | user_agent VARCHAR(500)
76 | )
77 | sortkey(level, start_time);
78 | """
79 | )
80 |
81 | user_table_create = (
82 | """
83 | CREATE TABLE users (
84 | user_id INT PRIMARY KEY,
85 | first_name VARCHAR(500),
86 | last_name VARCHAR(500),
87 | gender CHAR(1),
88 | level VARCHAR(10) NOT NULL
89 | )
90 | diststyle all
91 | sortkey(level, gender, first_name, last_name);
92 | """
93 | )
94 |
95 | song_table_create = (
96 | """
97 | CREATE TABLE songs (
98 | song_id VARCHAR(20) PRIMARY KEY,
99 | title VARCHAR(500) NOT NULL,
100 | artist_id VARCHAR(20) NOT NULL,
101 | year SMALLINT NOT NULL,
102 | duration NUMERIC NOT NULL
103 | )
104 | diststyle all
105 | sortkey(year, title, duration);
106 | """
107 | )
108 |
109 | artist_table_create = (
110 | """
111 | CREATE TABLE artists (
112 | artist_id VARCHAR(20) PRIMARY KEY,
113 | name VARCHAR(500) NOT NULL,
114 | location VARCHAR(500),
115 | latitude NUMERIC,
116 | longitude NUMERIC
117 | )
118 | diststyle all
119 | sortkey(name, location);
120 | """
121 | )
122 |
123 | time_table_create = (
124 | """
125 | CREATE TABLE time (
126 | start_time timestamp PRIMARY KEY distkey,
127 | hour SMALLINT NOT NULL,
128 | day SMALLINT NOT NULL,
129 | week SMALLINT NOT NULL,
130 | month SMALLINT NOT NULL,
131 | year SMALLINT NOT NULL,
132 | weekday SMALLINT NOT NULL
133 | )
134 | sortkey(year, month, day);
135 | """
136 | )
137 |
138 | # STAGING TABLES
139 |
140 | staging_events_copy = (
141 | """
142 | copy staging_events_table (
143 | artist, auth, firstName, gender,itemInSession, lastName,
144 | length, level, location, method, page, registration,
145 | sessionId, song, status, ts, userAgent, userId
146 | )
147 | from {}
148 | iam_role {}
149 | json {} region 'us-west-2';
150 | """
151 | ).format(config['S3']['log_data'], config['IAM_ROLE']['arn'], config['S3']['log_jsonpath'])
152 |
153 | staging_songs_copy = (
154 | """
155 | copy staging_songs_table
156 | from {}
157 | iam_role {}
158 | json 'auto' region 'us-west-2';
159 | """
160 | ).format(config['S3']['song_data'], config['IAM_ROLE']['arn'])
161 |
162 | # FINAL TABLES
163 |
164 | songplay_table_insert = (
165 | """
166 | INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
167 | session_id, location, user_agent)
168 |     SELECT TIMESTAMP 'epoch' + se.ts/1000 * interval '1 second' AS start_time,
169 |            se.userId, se.level, sa.song_id, sa.artist_id, se.sessionId, se.location, se.userAgent
170 | FROM staging_events_table se
171 | JOIN (
172 | SELECT s.song_id AS song_id, a.artist_id AS artist_id, s.title AS song,
173 | a.name AS artist, s.duration AS length
174 | FROM songs s
175 | JOIN artists a ON s.artist_id=a.artist_id
176 | ) sa
177 | ON se.song=sa.song AND se.artist=sa.artist AND se.length=sa.length;
178 | """
179 | )
180 |
181 | user_table_insert = (
182 | """
183 | INSERT INTO users (user_id, first_name, last_name, gender, level)
184 | SELECT userId, firstName, lastName, gender, level
185 | FROM (
186 | SELECT userId, firstName, lastName, gender, level,
187 | ROW_NUMBER() OVER (PARTITION BY userId
188 | ORDER BY firstName, lastName,
189 | gender, level) AS user_id_ranked
190 | FROM staging_events_table
191 | WHERE userId IS NOT NULL
192 | ) AS ranked
193 | WHERE ranked.user_id_ranked = 1;
194 | """
195 | )
196 |
197 | song_table_insert = (
198 | """
199 | INSERT INTO songs (song_id, title, artist_id, year, duration)
200 | SELECT song_id, title, artist_id, year, duration
201 | FROM (
202 | SELECT song_id, title, artist_id, year, duration,
203 | ROW_NUMBER() OVER (PARTITION BY song_id
204 | ORDER BY title, artist_id,
205 | year, duration) AS song_id_ranked
206 | FROM staging_songs_table
207 | WHERE song_id IS NOT NULL
208 | ) AS ranked
209 | WHERE ranked.song_id_ranked = 1;
210 | """
211 | )
212 |
213 | artist_table_insert = (
214 | """
215 | INSERT INTO artists (artist_id, name, location, latitude, longitude)
216 | SELECT artist_id, artist_name, artist_location, artist_latitude, artist_longitude
217 | FROM (
218 | SELECT artist_id, artist_name, artist_location, artist_latitude, artist_longitude,
219 | ROW_NUMBER() OVER (PARTITION BY artist_id
220 | ORDER BY artist_name, artist_location,
221 | artist_latitude, artist_longitude) AS artist_id_ranked
222 | FROM staging_songs_table
223 | WHERE artist_id IS NOT NULL
224 | ) AS ranked
225 | WHERE ranked.artist_id_ranked = 1;
226 | """
227 | )
228 |
229 |
230 | time_table_insert = (
231 | """
232 | INSERT INTO time (start_time, hour, day, week, month, year, weekday)
233 | SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time,
234 | EXTRACT(HOUR FROM start_time) AS hour,
235 | EXTRACT(DAY FROM start_time) AS day,
236 | EXTRACT(WEEK FROM start_time) AS week,
237 | EXTRACT(MONTH FROM start_time) AS month,
238 | EXTRACT(YEAR FROM start_time) AS year,
239 | EXTRACT(DOW FROM start_time) AS weekday
240 | FROM staging_events_table
241 | WHERE ts IS NOT NULL;
242 | """
243 | )
244 |
245 |
246 | count_staging_rows = "SELECT COUNT(*) AS count FROM {}"
247 |
248 | # QUERY LISTS
249 | create_table_queries = [staging_events_table_create, staging_songs_table_create,
250 | user_table_create, song_table_create, artist_table_create,
251 | time_table_create,songplay_table_create]
252 |
253 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop,
254 | songplay_table_drop, user_table_drop, song_table_drop,
255 | artist_table_drop, time_table_drop]
256 |
257 | copy_table_queries = [staging_events_copy, staging_songs_copy]
258 |
259 | copy_staging_order = ['staging_events_table', 'staging_songs_table']
260 |
261 | count_staging_queries = [count_staging_rows.format(copy_staging_order[0]),
262 | count_staging_rows.format(copy_staging_order[1])]
263 |
264 | insert_table_queries = [user_table_insert, song_table_insert, artist_table_insert,
265 | time_table_insert, songplay_table_insert]
266 |
267 | insert_table_order = ['users', 'songs', 'artists', 'time', 'songplays']
268 |
269 | count_fact_dim_queries = [count_staging_rows.format(insert_table_order[0]),
270 | count_staging_rows.format(insert_table_order[1]),
271 | count_staging_rows.format(insert_table_order[2]),
272 | count_staging_rows.format(insert_table_order[3]),
273 | count_staging_rows.format(insert_table_order[4])]
--------------------------------------------------------------------------------
/5. Data Lake with Spark & AWS S3/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | This repo provides the ETL pipeline to populate the sparkifydb AWS S3 Data Lake using Spark.
4 |
5 |  
6 | * The purpose of this database is to enable Sparkify to answer business questions it may have about its users, the types of songs they listen to, and the artists of those songs, using the data it has in logs and files. The database provides a consistent and reliable source to store this data.
7 |
8 | * This data will be useful in helping Sparkify reach some of its analytical goals, for example, finding the most popular songs or the times of day with the highest traffic.
9 |
10 | ## Dependencies
11 | ---
12 | * Note that you will need to have the pyspark library installed. Also, you should have a spark cluster running, either locally or on AWS EMR.
13 |
14 | ## Database Design and ETL Pipeline
15 | ---
16 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data.
17 |
18 | 
19 |
20 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation. It also enables reading files from S3 and processing data with PySpark.
21 |
22 | * There are 2 types of data involved, song and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables.
23 |
24 | * Log data gives the information of each user session. From log data, we extract and load into time, users dimension tables and songplays fact table.
25 |
26 | ## Running the ETL Pipeline
27 | ---
28 | * Run etl.py to read the song and log json files, denormalize the data into fact and dimension tables, and write these tables to S3 as parquet files.
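* After the job finishes, the parquet output can be read back with Spark to confirm the tables were written. Below is a minimal sketch (not part of the repo), assuming the same output bucket and S3 credential configuration used in etl.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Output path used in etl.py (the s3a scheme is assumed for the hadoop-aws connector)
songplays = spark.read.parquet("s3a://alanchn31-datalake/songplays")
songplays.printSchema()
print("songplay rows:", songplays.count())
```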
--------------------------------------------------------------------------------
/5. Data Lake with Spark & AWS S3/etl.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import os
3 | from datetime import datetime
4 | from pyspark.sql import SparkSession
5 | from pyspark.sql.functions import udf, col, from_unixtime
6 | from pyspark.sql.functions import (year, month, dayofmonth, hour,
7 |                                    weekofyear, dayofweek, date_format)
8 | from pyspark.sql.types import (StructType, StructField as Fld, DoubleType as Dbl,
9 |                                StringType as Str, IntegerType as Int, DateType as Date,
10 |                                TimestampType as Ts)
11 |
12 |
13 | config = configparser.ConfigParser()
14 | config.read('dl.cfg')
15 |
16 | os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']          # assumes an [AWS] section in dl.cfg
17 | os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']
18 |
19 |
20 | def create_spark_session():
21 | """
22 | Description: Creates spark session.
23 |
24 | Returns:
25 | spark session object
26 | """
27 | AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
28 | AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
29 |
30 | spark = SparkSession \
31 | .builder \
32 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
33 | .getOrCreate()
34 |
35 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
36 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
37 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
38 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
39 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
40 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.endpoint", "s3.amazonaws.com")
41 | return spark
42 |
43 |
44 | def song_schema():
45 | """
46 | Description: Provides the schema for the staging_songs table.
47 |
48 | Returns:
49 | spark dataframe schema object
50 | """
51 | return StructType([
52 | Fld("num_songs", Int()),
53 | Fld("artist_id", Str()),
54 | Fld("artist_latitude", Dbl()),
55 | Fld("artist_longitude", Dbl()),
56 | Fld("artist_location", Str()),
57 | Fld("artist_name", Str()),
58 | Fld("song_id", Str()),
59 | Fld("title", Str()),
60 | Fld("duration", Dbl()),
61 | Fld("year", Int())
62 | ])
63 |
64 |
65 | def process_song_data(spark, input_data, output_data):
66 | """
67 | Description: Read in songs data from json files.
68 | Outputs songs and artists dimension tables in parquet files in S3.
69 |
70 | Arguments:
71 | spark: the spark session object.
72 | input_data: path to the S3 bucket containing input json files.
73 | output_data: path to S3 bucket that will contain output parquet files.
74 |
75 | Returns:
76 | None
77 | """
78 | # get filepath to song data file
79 | song_data = input_data + 'song_data/*/*/*/*.json'
80 |
81 | # read song data file
82 | df = spark.read.json(song_data, schema=song_schema())
83 |
84 | # extract columns to create songs table
85 | songs_table = df.select(['song_id', 'title', 'artist_id',
86 | 'year', 'duration']).distinct().where(
87 | col('song_id').isNotNull())
88 |
89 | # write songs table to parquet files partitioned by year and artist
90 | songs_path = output_data + 'songs'
91 | songs_table.write.partitionBy('year', 'artist_id').parquet(songs_path)
92 |
93 | # extract columns to create artists table
94 | artists_table = df.select(['artist_id', 'artist_name', 'artist_location',
95 | 'artist_latitude', 'artist_longitude']).distinct().where(
96 | col('artist_id').isNotNull())
97 |
98 | # write artists table to parquet files
99 | artists_path = output_data + 'artists'
100 | artists_table.write.parquet(artists_path)
101 |
102 |
103 | def process_log_data(spark, input_data, output_data):
104 | """
105 | Description: Read in logs data from json files.
106 | Outputs time and users dimension tables, songplays fact table
107 | in parquet files in S3.
108 |
109 | Arguments:
110 | spark: the spark session object.
111 | input_data: path to the S3 bucket containing input json files.
112 | output_data: path to S3 bucket that will contain output parquet files.
113 |
114 | Returns:
115 | None
116 | """
117 | # get filepath to log data file
118 | log_data = input_data + 'log_data/*/*/*.json'
119 |
120 | # read log data file
121 | df = spark.read.json(log_data)
122 |
123 | # filter by actions for song plays
124 | df = df.filter(df.page == 'NextSong')
125 |
126 | # extract columns for users table
127 | users_table = df.select(['userId', 'firstName', 'lastName',
128 | 'gender', 'level']).distinct().where(
129 | col('userId').isNotNull())
130 |
131 | # write users table to parquet files
132 | users_path = output_data + 'users'
133 | users_table.write.parquet(users_path)
134 |
135 | def format_datetime(ts):
136 | """
137 | Description: converts numeric timestamp to datetime format.
138 |
139 | Returns:
140 | timestamp with type datetime
141 | """
142 | return datetime.fromtimestamp(ts/1000.0)
143 |
144 | # create timestamp column from original timestamp column
145 | get_timestamp = udf(lambda x: format_datetime(int(x)), Ts())
146 | df = df.withColumn("start_time", get_timestamp(df.ts))
147 |
148 | # create datetime column from original timestamp column
149 | get_datetime = udf(lambda x: format_datetime(int(x)), Date())
150 | df = df.withColumn("datetime", get_datetime(df.ts))
151 |
152 | # extract columns to create time table
153 |     time_table = df.select('ts', 'start_time', 'datetime',
154 | hour("datetime").alias('hour'),
155 | dayofmonth("datetime").alias('day'),
156 | weekofyear("datetime").alias('week'),
157 | year("datetime").alias('year'),
158 | month("datetime").alias('month'),
159 | dayofweek("datetime").alias('weekday')
160 | ).dropDuplicates()
161 |
162 | # write time table to parquet files partitioned by year and month
163 | time_table_path = output_data + 'time'
164 | time_table.write.partitionBy('year', 'month').parquet(time_table_path)
165 |
166 | # read in song data to use for songplays table
167 | songs_path = input_data + 'song_data/*/*/*/*.json'
168 | song_df = spark.read.json(songs_path, schema=song_schema())
169 |
170 | # extract columns from joined song and log datasets to create songplays table
171 | df = df.drop_duplicates(subset=['start_time'])
172 | songplays_table = song_df.alias('s').join(df.alias('l'),
173 | (song_df.title == df.song) & \
174 | (song_df.artist_name == df.artist)).where(
175 | df.page == 'NextSong').select([
176 | col('l.start_time'),
177 | year("l.datetime").alias('year'),
178 | month("l.datetime").alias('month'),
179 | col('l.userId'),
180 | col('l.level'),
181 | col('s.song_id'),
182 | col('s.artist_id'),
183 | col('l.sessionID'),
184 | col('l.location'),
185 | col('l.userAgent')
186 | ])
187 |
188 | # write songplays table to parquet files partitioned by year and month
189 | songplays_path = output_data + 'songplays'
190 | songplays_table.write.partitionBy('year', 'month').parquet(songplays_path)
191 |
192 |
193 | def main():
194 | """
195 | Description: Calls functions to create spark session, read from S3
196 | and perform ETL to S3 Data Lake.
197 |
198 | Returns:
199 | None
200 | """
201 | spark = create_spark_session()
202 | input_data = "s3a://udacity-dend/"
203 | output_data = "s3://alanchn31-datalake/"
204 |
205 | process_song_data(spark, input_data, output_data)
206 | process_log_data(spark, input_data, output_data)
207 |
208 |
209 | if __name__ == "__main__":
210 | main()
211 |
--------------------------------------------------------------------------------
/5. Data Lake with Spark & AWS S3/screenshots/s3.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/s3.PNG
--------------------------------------------------------------------------------
/5. Data Lake with Spark & AWS S3/screenshots/schema.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/schema.PNG
--------------------------------------------------------------------------------
/5. Data Lake with Spark & AWS S3/screenshots/spark.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/spark.PNG
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | This repo provides the ETL pipeline to ingest Sparkify's music data into an AWS Redshift Data Warehouse. The ETL pipeline runs on an hourly basis, scheduled using Airflow.
4 |
5 | 
6 |
7 | * Why Airflow? Airflow allows workflows to be defined as code, making them more maintainable, versionable, testable, and collaborative.
8 |
9 | * The purpose of this database is to enable Sparkify to answer business questions it may have about its users, the types of songs they listen to, and the artists of those songs, using the data it has in logs and files. The database provides a consistent and reliable source to store this data.
10 |
11 | * This data will be useful in helping Sparkify reach some of its analytical goals, for example, finding the most popular songs or the times of day with the highest traffic.
12 |
13 | ## Dependencies
14 | ---
15 | * Note that you will need to have Airflow installed. To do so, run `pip install apache-airflow`
16 |
17 | * To use Postgres to store metadata from Airflow jobs, edit the airflow.cfg file under the AIRFLOW_HOME dir. Refer to https://gist.github.com/rosiehoyem/9e111067fe4373eb701daf9e7abcc423 for setup instructions
18 |
19 | * Run: `airflow webserver -p 8080`. Refer to https://airflow.apache.org/docs/stable/start.html for more details on how to get started.
20 |
21 | * Configure aws_credentials in Airflow using access and secret access keys. (under Airflow UI >> Admin >> Connections)
22 |
23 | * Configure redshift connection in Airflow (under Airflow UI >> Admin >> Connections)
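* As an alternative to clicking through the UI, both connections can also be created in code. The snippet below is a minimal sketch (not part of the repo), assuming Airflow 1.10.x and placeholder credentials that must be replaced:

```python
from airflow import settings
from airflow.models import Connection

session = settings.Session()

# AWS credentials used by the StageToRedshiftOperator (placeholders, not real keys)
session.add(Connection(conn_id='aws_credentials', conn_type='aws',
                       login='<AWS_ACCESS_KEY_ID>', password='<AWS_SECRET_ACCESS_KEY>'))

# Redshift connection used by the other operators in the DAG
session.add(Connection(conn_id='redshift', conn_type='postgres',
                       host='<cluster-endpoint>', schema='<database-name>',
                       login='<db-user>', password='<db-password>', port=5439))
session.commit()
```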
24 |
25 | ## Database Design and ETL Pipeline
26 | ---
27 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data.
28 |
29 | 
30 |
31 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation. It also enables reading files from S3.
32 |
33 | * There are 2 types of data involved, song and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables.
34 |
35 | * Log data gives the information of each user session. From log data, we extract and load into time, users dimension tables and songplays fact table.
36 |
37 | ## Running the ETL Pipeline
38 | ---
39 | * Turning on the sparkify_music_dwh_dag DAG in Airflow UI will automatically trigger the ETL pipelines to run.
40 | * DAG is as such (from graph view):
41 |
42 | 
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/dags/create_tables.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE IF NOT EXISTS public.artists (
2 | artistid varchar(256) NOT NULL,
3 | name varchar(256),
4 | location varchar(256),
5 | lattitude numeric(18,0),
6 | longitude numeric(18,0)
7 | );
8 |
9 | CREATE TABLE IF NOT EXISTS public.songplays (
10 | playid varchar(32) NOT NULL,
11 | start_time timestamp NOT NULL,
12 | userid int4 NOT NULL,
13 | "level" varchar(256),
14 | songid varchar(256),
15 | artistid varchar(256),
16 | sessionid int4,
17 | location varchar(256),
18 | user_agent varchar(256),
19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid)
20 | );
21 |
22 | CREATE TABLE IF NOT EXISTS public.songs (
23 | songid varchar(256) NOT NULL,
24 | title varchar(256),
25 | artistid varchar(256),
26 | "year" int4,
27 | duration numeric(18,0),
28 | CONSTRAINT songs_pkey PRIMARY KEY (songid)
29 | );
30 |
31 | CREATE TABLE IF NOT EXISTS public.staging_events (
32 | artist varchar(256),
33 | auth varchar(256),
34 | firstname varchar(256),
35 | gender varchar(256),
36 | iteminsession int4,
37 | lastname varchar(256),
38 | length numeric(18,0),
39 | "level" varchar(256),
40 | location varchar(256),
41 | "method" varchar(256),
42 | page varchar(256),
43 | registration numeric(18,0),
44 | sessionid int4,
45 | song varchar(256),
46 | status int4,
47 | ts int8,
48 | useragent varchar(256),
49 | userid int4
50 | );
51 |
52 | CREATE TABLE IF NOT EXISTS public.staging_songs (
53 | num_songs int4,
54 | artist_id varchar(256),
55 | artist_name varchar(256),
56 | artist_latitude numeric(18,0),
57 | artist_longitude numeric(18,0),
58 | artist_location varchar(256),
59 | song_id varchar(256),
60 | title varchar(256),
61 | duration numeric(18,0),
62 | "year" int4
63 | );
64 |
65 | CREATE TABLE IF NOT EXISTS public.time (
66 | start_time timestamp NOT NULL,
67 | hour int4 NOT NULL,
68 | day int4 NOT NULL,
69 | week int4 NOT NULL,
70 | month int4 NOT NULL,
71 | year int4 NOT NULL,
72 | dayofweek int4 NOT NULL
73 | );
74 |
75 | CREATE TABLE IF NOT EXISTS public.users (
76 | userid int4 NOT NULL,
77 | first_name varchar(256),
78 | last_name varchar(256),
79 | gender varchar(256),
80 | "level" varchar(256),
81 | CONSTRAINT users_pkey PRIMARY KEY (userid)
82 | );
83 |
84 |
85 |
86 |
87 |
88 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/dags/sparkify_dwh_dag.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 | import os
3 | from airflow import DAG
4 | from airflow.operators.sparkify_plugin import (StageToRedshiftOperator, LoadFactOperator,
5 | LoadDimensionOperator, DataQualityOperator)
6 | from airflow.operators.postgres_operator import PostgresOperator
7 | from airflow.operators.dummy_operator import DummyOperator
8 | from helpers import SqlQueries
9 |
10 | # /opt/airflow/start.sh
11 |
12 | default_args = {
13 | 'owner': 'udacity',
14 | 'start_date': datetime(2019, 1, 12),
15 | 'depends_on_past': False,
16 |     'retries': 1,
17 |     'retry_delay': timedelta(minutes=5)
18 | }
19 | # catchup is a DAG-level setting, so it is passed to the DAG constructor below
20 |
21 | with DAG(dag_id='sparkify_music_dwh_dag', default_args=default_args,
22 | description='Load and transform data in Redshift \
23 | Data Warehouse with Airflow',
24 |          schedule_interval='@hourly', catchup=True) as dag:
25 |
26 | start_operator = DummyOperator(task_id='begin_execution', dag=dag)
27 |
28 | create_tables = PostgresOperator(
29 | task_id='create_tables',
30 | postgres_conn_id="redshift",
31 | sql="create_tables.sql"
32 | )
33 |
34 | stage_events_to_redshift = StageToRedshiftOperator(
35 | task_id='load_stage_events',
36 | redshift_conn_id="redshift",
37 | aws_credentials_id="aws_credentials",
38 | s3_bucket="udacity-dend",
39 | s3_key="log_data",
40 | jsonpath="log_json_path.json",
41 | table_name="public.staging_events",
42 | ignore_headers=1
43 | )
44 |
45 | stage_songs_to_redshift = StageToRedshiftOperator(
46 | task_id='load_stage_songs',
47 | redshift_conn_id="redshift",
48 | aws_credentials_id="aws_credentials",
49 | s3_bucket="udacity-dend",
50 | s3_key="song_data",
51 | table_name="public.staging_songs",
52 | ignore_headers=1
53 | )
54 |
55 | load_songplays_table = LoadFactOperator(
56 | task_id='load_songplays_fact_table',
57 | redshift_conn_id="redshift",
58 | load_sql=SqlQueries.songplay_table_insert,
59 | table_name="public.songplays"
60 | )
61 |
62 | load_user_dimension_table = LoadDimensionOperator(
63 | task_id='load_user_dim_table',
64 | redshift_conn_id="redshift",
65 | load_sql=SqlQueries.user_table_insert,
66 | table_name="public.users",
67 | append_only=False
68 | )
69 |
70 | load_song_dimension_table = LoadDimensionOperator(
71 | task_id='load_song_dim_table',
72 | redshift_conn_id="redshift",
73 | load_sql=SqlQueries.song_table_insert,
74 | table_name="public.songs",
75 | append_only=False
76 | )
77 |
78 | load_artist_dimension_table = LoadDimensionOperator(
79 | task_id='load_artist_dim_table',
80 | redshift_conn_id="redshift",
81 | load_sql=SqlQueries.artist_table_insert,
82 | table_name="public.artists",
83 | append_only=False
84 | )
85 |
86 | load_time_dimension_table = LoadDimensionOperator(
87 | task_id='load_time_dim_table',
88 | redshift_conn_id="redshift",
89 | load_sql=SqlQueries.time_table_insert,
90 | table_name="public.time",
91 | append_only=False
92 | )
93 |
94 | run_quality_checks = DataQualityOperator(
95 | task_id='run_data_quality_checks',
96 | redshift_conn_id="redshift",
97 | table_names=["public.staging_events", "public.staging_songs",
98 | "public.songplays", "public.artists",
99 | "public.songs", "public.time", "public.users"]
100 | )
101 |
102 | end_operator = DummyOperator(task_id='stop_execution', dag=dag)
103 |
104 | start_operator >> create_tables
105 | create_tables >> [stage_events_to_redshift,
106 | stage_songs_to_redshift]
107 |
108 | [stage_events_to_redshift,
109 | stage_songs_to_redshift] >> load_songplays_table
110 |
111 | load_songplays_table >> [load_user_dimension_table,
112 | load_song_dimension_table,
113 | load_artist_dimension_table,
114 | load_time_dimension_table]
115 | [load_user_dimension_table,
116 | load_song_dimension_table,
117 | load_artist_dimension_table,
118 | load_time_dimension_table] >> run_quality_checks
119 |
120 | run_quality_checks >> end_operator
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/__init__.py:
--------------------------------------------------------------------------------
1 | from __future__ import division, absolute_import, print_function
2 |
3 | from airflow.plugins_manager import AirflowPlugin
4 |
5 | import operators
6 | import helpers
7 |
8 | # Defining the plugin class
9 | class SparkifyPlugin(AirflowPlugin):
10 | name = "sparkify_plugin"
11 | operators = [
12 | operators.StageToRedshiftOperator,
13 | operators.LoadFactOperator,
14 | operators.LoadDimensionOperator,
15 | operators.DataQualityOperator
16 | ]
17 | helpers = [
18 | helpers.SqlQueries
19 | ]
20 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/helpers/__init__.py:
--------------------------------------------------------------------------------
1 | from helpers.sql_queries import SqlQueries
2 |
3 | __all__ = [
4 | 'SqlQueries',
5 | ]
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/helpers/sql_queries.py:
--------------------------------------------------------------------------------
1 | class SqlQueries:
2 | songplay_table_insert = ("""
3 | SELECT
4 | md5(events.sessionid || events.start_time) songplay_id,
5 | events.start_time,
6 | events.userid,
7 | events.level,
8 | songs.song_id,
9 | songs.artist_id,
10 | events.sessionid,
11 | events.location,
12 | events.useragent
13 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, *
14 | FROM staging_events
15 | WHERE page='NextSong') events
16 | LEFT JOIN staging_songs songs
17 | ON events.song = songs.title
18 | AND events.artist = songs.artist_name
19 | AND events.length = songs.duration
20 | """)
21 |
22 | user_table_insert = ("""
23 | SELECT distinct userid, firstname, lastname, gender, level
24 | FROM staging_events
25 | WHERE page='NextSong'
26 | """)
27 |
28 | song_table_insert = ("""
29 | SELECT distinct song_id, title, artist_id, year, duration
30 | FROM staging_songs
31 | """)
32 |
33 | artist_table_insert = ("""
34 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude
35 | FROM staging_songs
36 | """)
37 |
38 | time_table_insert = ("""
39 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time),
40 | extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time)
41 | FROM songplays
42 | """)
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/operators/__init__.py:
--------------------------------------------------------------------------------
1 | from operators.stage_redshift import StageToRedshiftOperator
2 | from operators.load_fact import LoadFactOperator
3 | from operators.load_dimension import LoadDimensionOperator
4 | from operators.data_quality import DataQualityOperator
5 |
6 | __all__ = [
7 | 'StageToRedshiftOperator',
8 | 'LoadFactOperator',
9 | 'LoadDimensionOperator',
10 | 'DataQualityOperator'
11 | ]
12 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/operators/data_quality.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class DataQualityOperator(BaseOperator):
6 |
7 | ui_color = '#89DA59'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | redshift_conn_id="",
12 | table_names=[""],
13 | *args, **kwargs):
14 |
15 | super(DataQualityOperator, self).__init__(*args, **kwargs)
16 | self.redshift_conn_id = redshift_conn_id
17 | self.table_names = table_names
18 |
19 | def execute(self, context):
20 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
21 | for table in self.table_names:
22 | # Check that entries are being copied to table
23 | records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
24 |             if len(records) < 1 or len(records[0]) < 1 or records[0][0] < 1:
25 | raise ValueError(f"Data quality check failed. {table} returned no results")
26 |
27 | # Check that there are no rows with null ids
28 | dq_checks=[
29 | {'table': 'users',
30 | 'check_sql': "SELECT COUNT(*) FROM users WHERE userid is null",
31 | 'expected_result': 0},
32 | {'table': 'songs',
33 | 'check_sql': "SELECT COUNT(*) FROM songs WHERE songid is null",
34 | 'expected_result': 0}
35 | ]
36 | for check in dq_checks:
37 | records = redshift.get_records(check['check_sql'])
38 |                 if records[0][0] != check['expected_result']:
39 | raise ValueError(f"Data quality check failed. {check['table']} \
40 | contains null in id column")
41 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/operators/load_dimension.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadDimensionOperator(BaseOperator):
6 |
7 | ui_color = '#80BD9E'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | redshift_conn_id="",
12 | load_sql="",
13 | table_name="",
14 | append_only=False,
15 | *args, **kwargs):
16 |
17 | super(LoadDimensionOperator, self).__init__(*args, **kwargs)
18 | self.redshift_conn_id = redshift_conn_id
19 | self.load_sql = load_sql
20 | self.table_name = table_name
21 | self.append_only = append_only
22 |
23 | def execute(self, context):
24 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
25 | self.log.info("Loading into {} dimension table".format(self.table_name))
26 | self.log.info("Append only mode: {}".format(self.append_only))
27 | if self.append_only:
28 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql)
29 | redshift.run(sql_stmt)
30 | else:
31 | sql_del_stmt = 'DELETE FROM %s' % (self.table_name)
32 | redshift.run(sql_del_stmt)
33 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql)
34 | redshift.run(sql_stmt)
35 |
36 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/operators/load_fact.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.models import BaseOperator
3 | from airflow.utils.decorators import apply_defaults
4 |
5 | class LoadFactOperator(BaseOperator):
6 |
7 | ui_color = '#F98866'
8 |
9 | @apply_defaults
10 | def __init__(self,
11 | redshift_conn_id="",
12 | load_sql="",
13 | table_name="",
14 | *args, **kwargs):
15 |
16 | super(LoadFactOperator, self).__init__(*args, **kwargs)
17 | self.redshift_conn_id = redshift_conn_id
18 | self.load_sql = load_sql
19 | self.table_name = table_name
20 |
21 | def execute(self, context):
22 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
23 | self.log.info("Loading into {} fact table".format(self.table_name))
24 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql)
25 | redshift.run(sql_stmt)
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/airflow/plugins/operators/stage_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow.hooks.postgres_hook import PostgresHook
2 | from airflow.contrib.hooks.aws_hook import AwsHook
3 | from airflow.models import BaseOperator
4 | from airflow.utils.decorators import apply_defaults
5 |
6 | class StageToRedshiftOperator(BaseOperator):
7 | ui_color = '#358140'
8 | copy_sql = """
9 | COPY {}
10 | FROM '{}'
11 | ACCESS_KEY_ID '{}'
12 | SECRET_ACCESS_KEY '{}'
13 | IGNOREHEADER {}
14 | JSON '{}'
15 | """
16 |
17 | @apply_defaults
18 | def __init__(self,
19 | redshift_conn_id="",
20 | aws_credentials_id="",
21 | s3_bucket="",
22 | s3_key="",
23 | jsonpath="auto",
24 | table_name="",
25 | ignore_headers=1,
26 | *args, **kwargs):
27 |
28 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
29 | self.redshift_conn_id = redshift_conn_id
30 | self.aws_credentials_id = aws_credentials_id
31 | self.s3_bucket = s3_bucket
32 | self.ignore_headers = ignore_headers
33 | self.s3_key = s3_key
34 | self.jsonpath = jsonpath
35 | self.table = table_name
36 |
37 | def execute(self, context):
38 | aws_hook = AwsHook(self.aws_credentials_id)
39 | credentials = aws_hook.get_credentials()
40 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
41 | self.log.info("Clearing data from destination Redshift table")
42 | redshift.run("DELETE FROM {}".format(self.table))
43 | self.log.info("Copying data from S3 to Redshift")
44 | s3_path = "s3://{}/{}".format(self.s3_bucket, self.s3_key)
45 | if self.jsonpath != "auto":
46 | jsonpath = "s3://{}/{}".format(self.s3_bucket, self.jsonpath)
47 | else:
48 | jsonpath = self.jsonpath
49 | formatted_sql = StageToRedshiftOperator.copy_sql.format(
50 | self.table,
51 | s3_path,
52 | credentials.access_key,
53 | credentials.secret_key,
54 | self.ignore_headers,
55 | jsonpath
56 | )
57 | redshift.run(formatted_sql)
58 |
59 |
60 |
61 |
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/screenshots/airflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/airflow.png
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/screenshots/dag.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/dag.PNG
--------------------------------------------------------------------------------
/6. Data Pipelining with Airflow/screenshots/schema.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/schema.PNG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | ---
3 | * This repo contains projects done which applies principles in data engineering.
4 | * Notes taken during the course can be found in folder `0. Back to Basics`
5 |
6 | ## Projects
7 | ---
8 | 1. Postgres ETL :heavy_check_mark:
9 | * This project looks at data modelling for a fictitious music startup Sparkify, applying the STAR schema to the ingested data to simplify queries that answer business questions the product owner may have
10 |
11 | 2. Cassandra ETL :heavy_check_mark:
12 | * Looking at the realm of big data, Cassandra helps to ingest large amounts of data in a NoSQL context. This project adopts a query-centric approach to ingesting data into tables in Cassandra, to answer business questions about a music app
13 |
14 | 3. Web Scraping using Scrapy, MongoDB ETL :heavy_check_mark:
15 | * One way to store semi-structured data is as documents. MongoDB makes this possible, with each collection containing related documents. Each document contains fields of data which can be queried.
16 | * In this project, data is scraped from a book listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored in a document in the books collection in MongoDB.
17 |
18 | 4. Data Warehousing with AWS Redshift :heavy_check_mark:
19 | * This project creates a data warehouse, in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer some business questions based on requirements.
20 |
21 | 5. Data Lake with Spark & AWS S3 :heavy_check_mark:
22 | * This project creates a data lake, in AWS S3 using Spark.
23 | * Why create a data lake? A data lake provides a reliable store for large amounts of data, from unstructured to semi-structured and even structured data. In this project, we ingest json files, denormalize them into fact and dimension tables, and upload them into an AWS S3 data lake in the form of parquet files.
24 |
25 | 6. Data Pipelining with Airflow :heavy_check_mark:
26 | * This project schedules data pipelines to perform ETL from json files in S3 to Redshift using Airflow.
27 | * Why use Airflow? Airflow allows workflows to be defined as code, making them more maintainable, versionable, testable, and collaborative.
28 |
29 | 7. Capstone Project :heavy_check_mark:
30 | * This project is the finale to Udacity's data engineering nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.
31 | * My project is on building a movies data warehouse, which can be used to build a movies recommendation system, as well as predicting box-office earnings. View the project here: [Movies Data Warehouse](https://github.com/alanchn31/Udacity-Data-Engineering-Capstone)
--------------------------------------------------------------------------------