├── .gitignore ├── 0. Back to Basics ├── 1. Intro to Data Modelling │ ├── Creating Table with Cassandra │ │ └── Creating_a_Table_with_Apache_Cassandra.ipynb │ ├── Creating Table with Postgres │ │ ├── creating-a-table-with-postgres-0.ipynb │ │ └── creating-a-table-with-postgres-1.ipynb │ └── README.md ├── 2. Relational Data Models │ ├── 1. Creating Normalized Tables.ipynb │ ├── 2. Creating Denormalized Tables.ipynb │ ├── 3. Creating Fact and Dimension Tables with Star Schema.ipynb │ └── README.md ├── 3. NoSQL Data Models │ ├── 1. Creating Tables Based on Queries.ipynb │ ├── 2. Primary Key.ipynb │ ├── 3. Clustering Column.ipynb │ ├── 4. Using the WHERE Clause.ipynb │ └── README.md ├── 4. Data Warehouses │ ├── 1. ETL from 3NF to Star Schema using SQL.ipynb │ ├── 2. OLAP Cubes.ipynb │ ├── 3. Columnar Vs Row Storage.ipynb │ ├── README.md │ └── snapshots │ │ ├── cif.PNG │ │ ├── datamart.PNG │ │ ├── hybrid.PNG │ │ └── kimball.PNG ├── 5. Implementing Data Warehouse on AWS │ ├── 1. AWS RedShift Setup Using Code.ipynb │ ├── 2. Parallel ETL.ipynb │ ├── 3. Optimizing Redshift Table Design.ipynb │ ├── README.md │ ├── dwh.cfg │ └── redshift_dwh.PNG ├── 6. Intro to Spark │ ├── PySpark Schema on Read & UDFs.ipynb │ ├── Pyspark Data Wrangling.ipynb │ ├── README.md │ └── Spark SQL.ipynb ├── 7. Data Lakes │ ├── Data Lake on S3.ipynb │ ├── README.md │ └── dlvdwh.PNG └── 8. Data Pipelines with Airflow │ ├── README.md │ ├── context_and_templating.py │ ├── dag_for_subdag.py │ ├── hello_airflow.py │ └── subdag.py ├── 1. Postgres ETL ├── README.md ├── create_tables.py ├── data │ ├── log_data │ │ └── 2018 │ │ │ └── 11 │ │ │ ├── 2018-11-01-events.json │ │ │ ├── 2018-11-02-events.json │ │ │ ├── 2018-11-03-events.json │ │ │ ├── 2018-11-04-events.json │ │ │ ├── 2018-11-05-events.json │ │ │ ├── 2018-11-06-events.json │ │ │ ├── 2018-11-07-events.json │ │ │ ├── 2018-11-08-events.json │ │ │ ├── 2018-11-09-events.json │ │ │ ├── 2018-11-10-events.json │ │ │ ├── 2018-11-11-events.json │ │ │ ├── 2018-11-12-events.json │ │ │ ├── 2018-11-13-events.json │ │ │ ├── 2018-11-14-events.json │ │ │ ├── 2018-11-15-events.json │ │ │ ├── 2018-11-16-events.json │ │ │ ├── 2018-11-17-events.json │ │ │ ├── 2018-11-18-events.json │ │ │ ├── 2018-11-19-events.json │ │ │ ├── 2018-11-20-events.json │ │ │ ├── 2018-11-21-events.json │ │ │ ├── 2018-11-22-events.json │ │ │ ├── 2018-11-23-events.json │ │ │ ├── 2018-11-24-events.json │ │ │ ├── 2018-11-25-events.json │ │ │ ├── 2018-11-26-events.json │ │ │ ├── 2018-11-27-events.json │ │ │ ├── 2018-11-28-events.json │ │ │ ├── 2018-11-29-events.json │ │ │ └── 2018-11-30-events.json │ └── song_data │ │ └── A │ │ ├── A │ │ ├── A │ │ │ ├── TRAAAAW128F429D538.json │ │ │ ├── TRAAABD128F429CF47.json │ │ │ ├── TRAAADZ128F9348C2E.json │ │ │ ├── TRAAAEF128F4273421.json │ │ │ ├── TRAAAFD128F92F423A.json │ │ │ ├── TRAAAMO128F1481E7F.json │ │ │ ├── TRAAAMQ128F1460CD3.json │ │ │ ├── TRAAAPK128E0786D96.json │ │ │ ├── TRAAARJ128F9320760.json │ │ │ ├── TRAAAVG12903CFA543.json │ │ │ └── TRAAAVO128F93133D4.json │ │ ├── B │ │ │ ├── TRAABCL128F4286650.json │ │ │ ├── TRAABDL12903CAABBA.json │ │ │ ├── TRAABJL12903CDCF1A.json │ │ │ ├── TRAABJV128F1460C49.json │ │ │ ├── TRAABLR128F423B7E3.json │ │ │ ├── TRAABNV128F425CEE1.json │ │ │ ├── TRAABRB128F9306DD5.json │ │ │ ├── TRAABVM128F92CA9DC.json │ │ │ ├── TRAABXG128F9318EBD.json │ │ │ ├── TRAABYN12903CFD305.json │ │ │ └── TRAABYW128F4244559.json │ │ └── C │ │ │ ├── TRAACCG128F92E8A55.json │ │ │ ├── TRAACER128F4290F96.json │ │ │ ├── TRAACFV128F935E50B.json │ │ │ ├── 
TRAACHN128F1489601.json │ │ │ ├── TRAACIW12903CC0F6D.json │ │ │ ├── TRAACLV128F427E123.json │ │ │ ├── TRAACNS128F14A2DF5.json │ │ │ ├── TRAACOW128F933E35F.json │ │ │ ├── TRAACPE128F421C1B9.json │ │ │ ├── TRAACQT128F9331780.json │ │ │ ├── TRAACSL128F93462F4.json │ │ │ ├── TRAACTB12903CAAF15.json │ │ │ ├── TRAACVS128E078BE39.json │ │ │ └── TRAACZK128F4243829.json │ │ └── B │ │ ├── A │ │ ├── TRABACN128F425B784.json │ │ ├── TRABAFJ128F42AF24E.json │ │ ├── TRABAFP128F931E9A1.json │ │ ├── TRABAIO128F42938F9.json │ │ ├── TRABATO128F42627E9.json │ │ ├── TRABAVQ12903CBF7E0.json │ │ ├── TRABAWW128F4250A31.json │ │ ├── TRABAXL128F424FC50.json │ │ ├── TRABAXR128F426515F.json │ │ ├── TRABAXV128F92F6AE3.json │ │ └── TRABAZH128F930419A.json │ │ ├── B │ │ ├── TRABBAM128F429D223.json │ │ ├── TRABBBV128F42967D7.json │ │ ├── TRABBJE12903CDB442.json │ │ ├── TRABBKX128F4285205.json │ │ ├── TRABBLU128F93349CF.json │ │ ├── TRABBNP128F932546F.json │ │ ├── TRABBOP128F931B50D.json │ │ ├── TRABBOR128F4286200.json │ │ ├── TRABBTA128F933D304.json │ │ ├── TRABBVJ128F92F7EAA.json │ │ ├── TRABBXU128F92FEF48.json │ │ └── TRABBZN12903CD9297.json │ │ └── C │ │ ├── TRABCAJ12903CDFCC2.json │ │ ├── TRABCEC128F426456E.json │ │ ├── TRABCEI128F424C983.json │ │ ├── TRABCFL128F149BB0D.json │ │ ├── TRABCIX128F4265903.json │ │ ├── TRABCKL128F423A778.json │ │ ├── TRABCPZ128F4275C32.json │ │ ├── TRABCRU128F423F449.json │ │ ├── TRABCTK128F934B224.json │ │ ├── TRABCUQ128E0783E2B.json │ │ ├── TRABCXB128F4286BD3.json │ │ └── TRABCYE128F934CE1D.json ├── etl.ipynb ├── etl.py ├── schema.PNG ├── sql_queries.py └── test.ipynb ├── 2. Cassandra ETL ├── ETL using Cassandra.ipynb ├── event_data │ ├── 2018-11-01-events.csv │ ├── 2018-11-02-events.csv │ ├── 2018-11-03-events.csv │ ├── 2018-11-04-events.csv │ ├── 2018-11-05-events.csv │ ├── 2018-11-06-events.csv │ ├── 2018-11-07-events.csv │ ├── 2018-11-08-events.csv │ ├── 2018-11-09-events.csv │ ├── 2018-11-10-events.csv │ ├── 2018-11-11-events.csv │ ├── 2018-11-12-events.csv │ ├── 2018-11-13-events.csv │ ├── 2018-11-14-events.csv │ ├── 2018-11-15-events.csv │ ├── 2018-11-16-events.csv │ ├── 2018-11-17-events.csv │ ├── 2018-11-18-events.csv │ ├── 2018-11-19-events.csv │ ├── 2018-11-20-events.csv │ ├── 2018-11-21-events.csv │ ├── 2018-11-22-events.csv │ ├── 2018-11-23-events.csv │ ├── 2018-11-24-events.csv │ ├── 2018-11-25-events.csv │ ├── 2018-11-26-events.csv │ ├── 2018-11-27-events.csv │ ├── 2018-11-28-events.csv │ ├── 2018-11-29-events.csv │ └── 2018-11-30-events.csv ├── event_datafile_new.csv └── images │ └── image_event_datafile_new.jpg ├── 3. Web Scraping using Scrapy, Mongo ETL ├── README.md ├── books.PNG ├── books │ ├── books │ │ ├── __init__.py │ │ ├── items.py │ │ ├── middlewares.py │ │ ├── pipelines.py │ │ ├── settings.py │ │ └── spiders │ │ │ ├── __init__.py │ │ │ └── books_spider.py │ └── scrapy.cfg └── requirements.txt ├── 4. Data Warehousing with AWS Redshift ├── README.md ├── create_tables.py ├── dwh.cfg ├── etl.py ├── redshift_cluster_setup.py ├── redshift_cluster_teardown.py ├── screenshots │ ├── architecture.PNG │ ├── redshift.PNG │ └── schema.PNG └── sql_queries.py ├── 5. Data Lake with Spark & AWS S3 ├── README.md ├── etl.py └── screenshots │ ├── s3.PNG │ ├── schema.PNG │ └── spark.PNG ├── 6. 
Data Pipelining with Airflow ├── README.md ├── airflow │ ├── dags │ │ ├── create_tables.sql │ │ └── sparkify_dwh_dag.py │ └── plugins │ │ ├── __init__.py │ │ ├── helpers │ │ ├── __init__.py │ │ └── sql_queries.py │ │ └── operators │ │ ├── __init__.py │ │ ├── data_quality.py │ │ ├── load_dimension.py │ │ ├── load_fact.py │ │ └── stage_redshift.py └── screenshots │ ├── airflow.png │ ├── dag.PNG │ └── schema.PNG └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | env/ 2 | __pycache__/ 3 | .ipynb_checkpoints/ 4 | *.zip -------------------------------------------------------------------------------- /0. Back to Basics/1. Intro to Data Modelling/Creating Table with Cassandra/Creating_a_Table_with_Apache_Cassandra.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 1 Exercise 2: Creating a Table with Apache Cassandra\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of Apache Cassandra. Complete the following tasks:
  • Create a table in Apache Cassandra,
  • Insert rows of data,
  • Run a simple CQL query to validate the information.
    \n", 16 | "`#####` denotes where the code needs to be completed.\n", 17 | " \n", 18 | "Note: __Do not__ click the blue Preview button in the lower taskbar" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "#### Import Apache Cassandra python package" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import cassandra" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Create a connection to the database" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "from cassandra.cluster import Cluster\n", 51 | "try: \n", 52 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 53 | " session = cluster.connect()\n", 54 | "except Exception as e:\n", 55 | " print(e)\n", 56 | " " 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### TO-DO: Create a keyspace to do the work in " 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "## TO-DO: Create the keyspace\n", 73 | "try:\n", 74 | " session.execute(\"\"\"\n", 75 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 76 | " WITH REPLICATION = \n", 77 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 78 | ")\n", 79 | "\n", 80 | "except Exception as e:\n", 81 | " print(e)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### TO-DO: Connect to the Keyspace" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 5, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "## To-Do: Add in the keyspace you created\n", 98 | "try:\n", 99 | " session.set_keyspace('udacity')\n", 100 | "except Exception as e:\n", 101 | " print(e)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 
\n", 109 | "\n", 110 | "`song_title\n", 111 | "artist_name\n", 112 | "year\n", 113 | "album_name\n", 114 | "single`" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### TO-DO: You need to create a table to be able to run the following query: \n", 122 | "`select * from songs WHERE year=1970 AND artist_name=\"The Beatles\"`" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 20, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "## TO-DO: Complete the query below\n", 132 | "query = \"CREATE TABLE IF NOT EXISTS songs \"\n", 133 | "query = query + \"(year int, artist_name text, song_title text, album_name text, single boolean, PRIMARY KEY (year, artist_name))\"\n", 134 | "try:\n", 135 | " session.execute(query)\n", 136 | "except Exception as e:\n", 137 | " print(e)\n" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### TO-DO: Insert the following two rows in your table\n", 145 | "`First Row: \"Across The Universe\", \"The Beatles\", \"1970\", \"False\", \"Let It Be\"`\n", 146 | "\n", 147 | "`Second Row: \"The Beatles\", \"Think For Yourself\", \"False\", \"1965\", \"Rubber Soul\"`" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 22, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "## Add in query and then run the insert statement\n", 157 | "query = \"INSERT INTO songs (album_name, artist_name, year, single, song_title)\" \n", 158 | "query = query + \" VALUES (%s, %s, %s, %s, %s)\"\n", 159 | "\n", 160 | "try:\n", 161 | " session.execute(query, (\"Across The Universe\", \"The Beatles\", 1970, False, \"Let It Be\"))\n", 162 | "except Exception as e:\n", 163 | " print(e)\n", 164 | " \n", 165 | "try:\n", 166 | " session.execute(query, (\"The Beatles\", \"Think For Yourself\", 1965, False, \"Rubber Soul\"))\n", 167 | "except Exception as e:\n", 168 | " print(e)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### TO-DO: Validate your data was inserted into the table." 
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 23, 181 | "metadata": { 182 | "scrolled": true 183 | }, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "1965 The Beatles Think For Yourself\n", 190 | "1970 Across The Universe The Beatles\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "## TO-DO: Complete and then run the select statement to validate the data was inserted into the table\n", 196 | "query = 'SELECT * FROM songs'\n", 197 | "try:\n", 198 | " rows = session.execute(query)\n", 199 | "except Exception as e:\n", 200 | " print(e)\n", 201 | " \n", 202 | "for row in rows:\n", 203 | " print (row.year, row.album_name, row.artist_name)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### TO-DO: Validate the Data Model with the original query.\n", 211 | "\n", 212 | "`select * from songs WHERE YEAR=1970 AND artist_name=\"The Beatles\"`" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 24, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "1970 Across The Universe The Beatles\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "##TO-DO: Complete the select statement to run the query \n", 230 | "query = \"select * from songs WHERE YEAR=1970 AND artist_name='The Beatles'\"\n", 231 | "try:\n", 232 | " rows = session.execute(query)\n", 233 | "except Exception as e:\n", 234 | " print(e)\n", 235 | " \n", 236 | "for row in rows:\n", 237 | " print (row.year, row.album_name, row.artist_name)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "### And Finally close the session and cluster connection" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 25, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "session.shutdown()\n", 254 | "cluster.shutdown()" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [] 263 | } 264 | ], 265 | "metadata": { 266 | "kernelspec": { 267 | "display_name": "Python 3", 268 | "language": "python", 269 | "name": "python3" 270 | }, 271 | "language_info": { 272 | "codemirror_mode": { 273 | "name": "ipython", 274 | "version": 3 275 | }, 276 | "file_extension": ".py", 277 | "mimetype": "text/x-python", 278 | "name": "python", 279 | "nbconvert_exporter": "python", 280 | "pygments_lexer": "ipython3", 281 | "version": "3.7.0" 282 | } 283 | }, 284 | "nbformat": 4, 285 | "nbformat_minor": 2 286 | } 287 | -------------------------------------------------------------------------------- /0. Back to Basics/1. 
Intro to Data Modelling/Creating Table with Postgres/creating-a-table-with-postgres-0.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 1 Demo 0: PostgreSQL and AutoCommits\n", 8 | "\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Walk through the basics of PostgreSQL autocommits " 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "## import postgreSQL adapter for the Python\n", 26 | "import psycopg2" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Create a connection to the database\n", 34 | "1. Connect to the local instance of PostgreSQL (*127.0.0.1*)\n", 35 | "2. Use the database/schema from the instance. \n", 36 | "3. The connection reaches out to the database (*studentdb*) and use the correct privilages to connect to the database (*user and password = student*)." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Use the connection to get a cursor that will be used to execute queries." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "cur = conn.cursor()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a database to work in" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "cur.execute(\"select * from test\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Error occurs, but it was to be expected because table has not been created as yet. To fix the error, create the table. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "cur.execute(\"CREATE TABLE test (col1 int, col2 int, col3 int);\")" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### Error indicates we cannot execute this query. Since we have not committed the transaction and had an error in the transaction block, we are blocked until we restart the connection." 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "conn = psycopg2.connect(\"host=127.0.0.1 dbname=studentdb user=student password=student\")\n", 110 | "cur = conn.cursor()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "In our exercises instead of worrying about commiting each transaction or getting a strange error when we hit something unexpected, let's set autocommit to true. **This says after each call during the session commit that one action and do not hold open the transaction for any other actions. 
One action = one transaction.**" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "In this demo we will use automatic commit so each action is commited without having to call `conn.commit()` after each command. **The ability to rollback and commit transactions are a feature of Relational Databases.**" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "conn.set_session(autocommit=True)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "cur.execute(\"select * from test\")" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "cur.execute(\"CREATE TABLE test (col1 int, col2 int, col3 int);\")" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Once autocommit is set to true, we execute this code successfully. There were no issues with transaction blocks and we did not need to restart our connection. " 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "cur.execute(\"select * from test\")" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "cur.execute(\"select count(*) from test\")\n", 177 | "print(cur.fetchall())" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [] 186 | } 187 | ], 188 | "metadata": { 189 | "kernelspec": { 190 | "display_name": "Python 3", 191 | "language": "python", 192 | "name": "python3" 193 | }, 194 | "language_info": { 195 | "codemirror_mode": { 196 | "name": "ipython", 197 | "version": 3 198 | }, 199 | "file_extension": ".py", 200 | "mimetype": "text/x-python", 201 | "name": "python", 202 | "nbconvert_exporter": "python", 203 | "pygments_lexer": "ipython3", 204 | "version": "3.7.2" 205 | } 206 | }, 207 | "nbformat": 4, 208 | "nbformat_minor": 2 209 | } 210 | -------------------------------------------------------------------------------- /0. Back to Basics/1. Intro to Data Modelling/README.md: -------------------------------------------------------------------------------- 1 | ## The Data Modelling Process: 2 | 1. Gather requirements 3 | 2. Conceptual Data Modelling 4 | 3. Logical Data Modelling 5 | 6 | ## Important Feature of RDBMS: 7 | **Atomicity** - Whole transaction or nothing is processed 8 | **Consistency** - Only transactions abiding by constraints & rules is written into database 9 | **Isolation** - Transactions proceed independently and securely 10 | **Durability** - Once transactions are committed, they remain committed 11 | 12 | ## When to use Relational Database? 13 | ### Advantages of Using a Relational Database 14 | * Flexibility for writing in SQL queries: With SQL being the most common database query language. 15 | * Modeling the data not modeling queries 16 | * Ability to do JOINS 17 | * Ability to do aggregations and analytics 18 | * Secondary Indexes available : You have the advantage of being able to add another index to help with quick searching. 
19 | * Smaller data volumes: If you have a smaller data volume (and not big data) you can use a relational database for its simplicity. 20 | * ACID Transactions: Allows you to meet a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, and thus maintain data integrity. 21 | * Easier to change to business requirements 22 | 23 | ## When to use NoSQL Database? 24 | ### Advantages of Using a NoSQL Database 25 | * Need to be able to store different data type formats: NoSQL was also created to handle different data configurations: structured, semi-structured, and unstructured data. JSON, XML documents can all be handled easily with NoSQL. 26 | * Large amounts of data: Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be able to be horizontally scalable. The more servers/systems you add to the database the more data that can be hosted with high availability and low latency (fast reads and writes). 27 | * Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data 28 | * Need high throughput: While ACID transactions bring benefits they also slow down the process of reading and writing data. If you need very fast reads and writes using a relational database may not suit your needs. 29 | * Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space. 30 | * Need high availability: Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time. -------------------------------------------------------------------------------- /0. Back to Basics/2. Relational Data Models/README.md: -------------------------------------------------------------------------------- 1 | ## Importance of Relational Databases: 2 | --- 3 | * Standardization of data model: Once your data is transformed into the rows and columns format, your data is standardized and you can query it with SQL 4 | * Flexibility in adding and altering tables: Relational databases gives you flexibility to add tables, alter tables, add and remove data. 5 | * Data Integrity: Data Integrity is the backbone of using a relational database. 6 | * Structured Query Language (SQL): A standard language can be used to access the data with a predefined language. 7 | * Simplicity : Data is systematically stored and modeled in tabular format. 8 | * Intuitive Organization: The spreadsheet format is intuitive but intuitive to data modeling in relational databases. 9 | 10 | ## OLAP vs OLTP: 11 | --- 12 | * Online Analytical Processing (OLAP): 13 | Databases optimized for these workloads allow for complex analytical and ad hoc queries, including aggregations. These type of databases are optimized for reads. 14 | 15 | * Online Transactional Processing (OLTP): 16 | Databases optimized for these workloads allow for less complex queries in large volume. The types of queries for these databases are read, insert, update, and delete. 17 | 18 | * The key to remember the difference between OLAP and OLTP is analytics (A) vs transactions (T). If you want to get the price of a shoe then you are using OLTP (this has very little or no aggregations). 
If you want to know the total stock of shoes a particular store sold, then this requires using OLAP (since this will require aggregations). 19 | 20 | ## Normal Forms: 21 | --- 22 | ### Objectives: 23 | 1. To free the database from unwanted insertions, updates, & deletion dependencies 24 | 2. To reduce the need for refactoring the database as new types of data are introduced 25 | 3. To make the relational model more informative to users 26 | 4. To make the database neutral to the query statistics 27 | 28 | ### Types of Normal Forms: 29 | #### First Normal Form (1NF): 30 | * Atomic values: each cell contains unique and single values 31 | * Be able to add data without altering tables 32 | * Separate different relations into different tables 33 | * Keep relationships between tables together with foreign keys 34 | 35 | #### Second Normal Form (2NF): 36 | * Have reached 1NF 37 | * All columns in the table must rely on the Primary Key 38 | 39 | #### Third Normal Form (3NF): 40 | * Must be in 2nd Normal Form 41 | * No transitive dependencies 42 | * Remember, transitive dependencies you are trying to maintain is that to get from A-> C, you want to avoid going through B. 43 | 44 | When to use 3NF: 45 | When you want to update data, we want to be able to do in just 1 place. 46 | 47 | ## Denormalization: 48 | --- 49 | JOINS on the database allow for outstanding flexibility but are extremely slow. If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables. You get your data into normalized form, and then you proceed with denormalization. So, denormalization comes after normalization. 50 | 51 | ## Normalize vs Denormalize: 52 | --- 53 | Normalization is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible. 54 | 55 | Denormalization is trying to increase performance by reducing the number of joins between tables (as joins can be slow). Data integrity will take a bit of a potential hit, as there will be more copies of the data (to reduce JOINS). 56 | 57 | ## Star Schema: 58 | --- 59 | * Simplest style of data mart schema 60 | * Consist of 1 or more fact tables referencing multiple dimension tables 61 | 62 | ### Benefits: 63 | * Denormalize tables, simplify queries and provide fast aggregations 64 | 65 | ### Drawbacks: 66 | * Issues that come with denormalization 67 | * Data Integrity 68 | * Decrease Query Flexibility 69 | * Many to many relationship -- simplified 70 | 71 | ## Snowflake Schema: 72 | --- 73 | * Logical arrangement of tables in a multidimensional database 74 | * Represented by centralized fact tables that are connected to multiple dimensions 75 | * Dimensions of snowflake schema are elaborated, having multiple levels of relationships, child tables having multiple parents 76 | * Star schema is a special, simplified case of snowflake schema 77 | * Star schema does not allow for one to many relationships while snowflake schema does -------------------------------------------------------------------------------- /0. Back to Basics/3. NoSQL Data Models/3. 
Clustering Column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 3 Exercise 3 Solution: Focus on Clustering Columns\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 23 | "! pip install cassandra-driver\n", 24 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Import Apache Cassandra python package" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import cassandra" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Create a connection to the database" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "from cassandra.cluster import Cluster\n", 57 | "try: \n", 58 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 59 | " session = cluster.connect()\n", 60 | "except Exception as e:\n", 61 | " print(e)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### Create a keyspace to work in " 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "try:\n", 78 | " session.execute(\"\"\"\n", 79 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 80 | " WITH REPLICATION = \n", 81 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 82 | ")\n", 83 | "\n", 84 | "except Exception as e:\n", 85 | " print(e)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "try:\n", 102 | " session.set_keyspace('udacity')\n", 103 | "except Exception as e:\n", 104 | " print(e)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Imagine we would like to start creating a new Music Library of albums. \n", 112 | "\n", 113 | "### We want to ask 1 question of our data:\n", 114 | "#### 1. Give me all the information from the music library about a given album\n", 115 | "`select * from album_library WHERE album_name=\"Close To You\"`\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Here is the Data:\n", 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### How should we model this data? 
What should be our Primary Key and Partition Key? \n", 131 | "\n", 132 | "### Since the data is looking for the `ALBUM_NAME` let's start with that. From there we will need to add other elements to make sure the Key is unique. We also need to add the `ARTIST_NAME` as Clustering Columns to make the data unique. That should be enough to make the row key unique.\n", 133 | "\n", 134 | "`Table Name: music_library\n", 135 | "column 1: Year\n", 136 | "column 2: Artist Name\n", 137 | "column 3: Album Name\n", 138 | "Column 4: City\n", 139 | "PRIMARY KEY(album name, artist name)`" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 149 | "query = query + \"(album_name text, artist_name text, year int, city text, PRIMARY KEY (album_name, artist_name))\"\n", 150 | "try:\n", 151 | " session.execute(query)\n", 152 | "except Exception as e:\n", 153 | " print(e)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Insert the data into the table" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "query = \"INSERT INTO music_library (album_name, artist_name, year, city)\"\n", 170 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 171 | "\n", 172 | "try:\n", 173 | " session.execute(query, (\"Let it Be\", \"The Beatles\", 1970, \"Liverpool\"))\n", 174 | "except Exception as e:\n", 175 | " print(e)\n", 176 | " \n", 177 | "try:\n", 178 | " session.execute(query, (\"Rubber Soul\", \"The Beatles\", 1965, \"Oxford\"))\n", 179 | "except Exception as e:\n", 180 | " print(e)\n", 181 | " \n", 182 | "try:\n", 183 | " session.execute(query, (\"Beatles For Sale\", \"The Beatles\", 1964, \"London\"))\n", 184 | "except Exception as e:\n", 185 | " print(e)\n", 186 | "\n", 187 | "try:\n", 188 | " session.execute(query, (\"The Monkees\", \"The Monkees\", 1966, \"Los Angeles\"))\n", 189 | "except Exception as e:\n", 190 | " print(e)\n", 191 | "\n", 192 | "try:\n", 193 | " session.execute(query, (\"Close To You\", \"The Carpenters\", 1970, \"San Diego\"))\n", 194 | "except Exception as e:\n", 195 | " print(e)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Validate the Data Model -- Did it work?\n", 203 | "`select * from album_library WHERE album_name=\"Close To You\"`" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 7, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "The Carpenters Close To You San Diego 1970\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "query = \"select * from music_library WHERE album_NAME='Close To You'\"\n", 221 | "try:\n", 222 | " rows = session.execute(query)\n", 223 | "except Exception as e:\n", 224 | " print(e)\n", 225 | " \n", 226 | "for row in rows:\n", 227 | " print (row.artist_name, row.album_name, row.city, row.year)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### Success it worked! 
We created a unique Primary key that evenly distributed our data, with clustering columns" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### For the sake of the demo, drop the table" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 8, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "query = \"drop table music_library\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Close the session and cluster connection" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 9, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "session.shutdown()\n", 271 | "cluster.shutdown()" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 1, 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "name": "stderr", 281 | "output_type": "stream", 282 | "text": [ 283 | "'zip' is not recognized as an internal or external command,\n", 284 | "operable program or batch file.\n" 285 | ] 286 | } 287 | ], 288 | "source": [ 289 | "! zip ." 290 | ] 291 | } 292 | ], 293 | "metadata": { 294 | "kernelspec": { 295 | "display_name": "Python 3", 296 | "language": "python", 297 | "name": "python3" 298 | }, 299 | "language_info": { 300 | "codemirror_mode": { 301 | "name": "ipython", 302 | "version": 3 303 | }, 304 | "file_extension": ".py", 305 | "mimetype": "text/x-python", 306 | "name": "python", 307 | "nbconvert_exporter": "python", 308 | "pygments_lexer": "ipython3", 309 | "version": "3.7.0" 310 | } 311 | }, 312 | "nbformat": 4, 313 | "nbformat_minor": 2 314 | } 315 | -------------------------------------------------------------------------------- /0. Back to Basics/3. NoSQL Data Models/4. Using the WHERE Clause.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lesson 3 Demo 4: Using the WHERE Clause\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.\n", 16 | "\n", 17 | "##### denotes where the code needs to be completed.\n", 18 | "\n", 19 | "Note: __Do not__ click the blue Preview button in the lower task bar" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: \n", 27 | "! 
pip install cassandra-driver\n", 28 | "#### More documentation can be found here: https://datastax.github.io/python-driver/" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "#### Import Apache Cassandra python package" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 5, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import cassandra" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### First let's create a connection to the database" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 6, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from cassandra.cluster import Cluster\n", 61 | "try: \n", 62 | " cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance\n", 63 | " session = cluster.connect()\n", 64 | "except Exception as e:\n", 65 | " print(e)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Let's create a keyspace to do our work in " 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 7, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "try:\n", 82 | " session.execute(\"\"\"\n", 83 | " CREATE KEYSPACE IF NOT EXISTS udacity \n", 84 | " WITH REPLICATION = \n", 85 | " { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }\"\"\"\n", 86 | ")\n", 87 | "\n", 88 | "except Exception as e:\n", 89 | " print(e)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL. " 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 8, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "try:\n", 106 | " session.set_keyspace('udacity')\n", 107 | "except Exception as e:\n", 108 | " print(e)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Let's imagine we would like to start creating a new Music Library of albums. \n", 116 | "### We want to ask 4 question of our data\n", 117 | "#### 1. Give me every album in my music library that was released in a 1965 year\n", 118 | "#### 2. Give me the album that is in my music library that was released in 1965 by \"The Beatles\"\n", 119 | "#### 3. Give me all the albums released in a given year that was made in London \n", 120 | "#### 4. Give me the city that the album \"Rubber Soul\" was recorded" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Here is our Collection of Data\n", 128 | "" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### How should we model this data? What should be our Primary Key and Partition Key? Since our data is looking for the YEAR let's start with that. From there we will add clustering columns on Artist Name and Album Name." 
136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 9, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "query = \"CREATE TABLE IF NOT EXISTS music_library \"\n", 145 | "query = query + \"(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))\"\n", 146 | "try:\n", 147 | " session.execute(query)\n", 148 | "except Exception as e:\n", 149 | " print(e)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### Let's insert our data into of table" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 10, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "query = \"INSERT INTO music_library (year, artist_name, album_name, city)\"\n", 166 | "query = query + \" VALUES (%s, %s, %s, %s)\"\n", 167 | "\n", 168 | "try:\n", 169 | " session.execute(query, (1970, \"The Beatles\", \"Let it Be\", \"Liverpool\"))\n", 170 | "except Exception as e:\n", 171 | " print(e)\n", 172 | " \n", 173 | "try:\n", 174 | " session.execute(query, (1965, \"The Beatles\", \"Rubber Soul\", \"Oxford\"))\n", 175 | "except Exception as e:\n", 176 | " print(e)\n", 177 | " \n", 178 | "try:\n", 179 | " session.execute(query, (1965, \"The Who\", \"My Generation\", \"London\"))\n", 180 | "except Exception as e:\n", 181 | " print(e)\n", 182 | "\n", 183 | "try:\n", 184 | " session.execute(query, (1966, \"The Monkees\", \"The Monkees\", \"Los Angeles\"))\n", 185 | "except Exception as e:\n", 186 | " print(e)\n", 187 | "\n", 188 | "try:\n", 189 | " session.execute(query, (1970, \"The Carpenters\", \"Close To You\", \"San Diego\"))\n", 190 | "except Exception as e:\n", 191 | " print(e)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Let's Validate our Data Model with our 4 queries.\n", 199 | "\n", 200 | "Query 1: " 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 13, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "1965 The Beatles Rubber Soul Oxford\n", 213 | "1965 The Who My Generation London\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "query = \"SELECT * FROM music_library WHERE year=1965\"\n", 219 | "try:\n", 220 | " rows = session.execute(query)\n", 221 | "except Exception as e:\n", 222 | " print(e)\n", 223 | " \n", 224 | "for row in rows:\n", 225 | " print (row.year, row.artist_name, row.album_name, row.city)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | " Let's try the 2nd query.\n", 233 | " Query 2: " 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 14, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "1965 The Beatles Rubber Soul Oxford\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "query = \"SELECT * FROM music_library WHERE year=1965 AND artist_name='The Beatles'\"\n", 251 | "try:\n", 252 | " rows = session.execute(query)\n", 253 | "except Exception as e:\n", 254 | " print(e)\n", 255 | " \n", 256 | "for row in rows:\n", 257 | " print (row.year, row.artist_name, row.album_name, row.city)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### Let's try the 3rd query.\n", 265 | "Query 3: " 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 15, 
271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "Error from server: code=2200 [Invalid query] message=\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\"\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "query = \"SELECT * FROM music_library WHERE city='London'\"\n", 283 | "try:\n", 284 | " rows = session.execute(query)\n", 285 | "except Exception as e:\n", 286 | " print(e)\n", 287 | " \n", 288 | "for row in rows:\n", 289 | " print (row.year, row.artist_name, row.album_name, row.city)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "### Did you get an error? You can not try to access a column or a clustering column if you have not used the other defined clustering column. Let's see if we can try it a different way. \n", 297 | "Try Query 4: \n", 298 | "\n" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 17, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "Oxford\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "query = \"SELECT city FROM music_library WHERE year=1965 AND artist_name='The Beatles' AND album_name='Rubber Soul'\"\n", 316 | "try:\n", 317 | " rows = session.execute(query)\n", 318 | "except Exception as e:\n", 319 | " print(e)\n", 320 | " \n", 321 | "for row in rows:\n", 322 | " print (row.city)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "### And Finally close the session and cluster connection" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 18, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "session.shutdown()\n", 339 | "cluster.shutdown()" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python 3", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.7.0" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 2 371 | } 372 | -------------------------------------------------------------------------------- /0. Back to Basics/3. NoSQL Data Models/README.md: -------------------------------------------------------------------------------- 1 | ## Data Modelling using NoSQL Databases 2 | --- 3 | * NoSQL Databases stand for not only SQL databases 4 | * When Not to Use SQL? 5 | * Need high Availability in the data: Indicates the system is always up and there is no downtime 6 | * Have Large Amounts of Data 7 | * Need Linear Scalability: The need to add more nodes to the system so performance will increase linearly 8 | * Low Latency: Shorter delay before the data is transferred once the instruction for the transfer has been received. 
9 | * Need fast reads and write 10 | 11 | ## Distributed Databases 12 | --- 13 | * Data is stored on multiple machines 14 | * Eventual Consistency: 15 | Over time (if no new changes are made) each copy of the data will be the same, but if there are new changes, the data may be different in different locations. The data may be inconsistent for only milliseconds. There are workarounds in place to prevent getting stale data. 16 | * CAP Theorem: 17 | * It is impossible for a distributed data store to simultaneously provide more than 2 out of 3 guarantees of CAP 18 | * **Consistency**: Every read from the database gets the latest (and correct) piece of data or an error 19 | * **Availability**: Every request is received and a response is given -- without a guarantee that the data is the latest update 20 | * **Partition Tolerance**: The system continues to work regardless of losing network connectivity between nodes 21 | * Which of these combinations is desirable for a production system - Consistency and Availability, Consistency and Partition Tolerance, or Availability and Partition Tolerance? 22 | * As the CAP Theorem Wikipedia entry says, "The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability." So there is no such thing as Consistency and Availability in a distributed database since it must always tolerate network issues. You can only have Consistency and Partition Tolerance (CP) or Availability and Partition Tolerance (AP). Supporting Availability and Partition Tolerance makes sense, since Availability and Partition Tolerance are the biggest requirements. 23 | * Data Modeling in Apache Cassandra: 24 | * Denormalization is not just okay -- it's a must, for fast reads 25 | * Apache Cassandra has been optimized for fast writes 26 | * ALWAYS think Queries first, one table per query is a great strategy 27 | * Apache Cassandra does not allow for JOINs between tables 28 | * Primary Key must be unique 29 | * The PRIMARY KEY is made up of either just the PARTITION KEY or may also include additional CLUSTERING COLUMNS 30 | * A Simple PRIMARY KEY is just one column that is also the PARTITION KEY. A Composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value and in your retrieval queries 31 | * The PARTITION KEY will determine the distribution of data across the system 32 | * WHERE clause 33 | * Data Modeling in Apache Cassandra is query focused, and that focus needs to be on the WHERE clause 34 | * Failure to include a WHERE clause will result in an error 35 | 36 | -------------------------------------------------------------------------------- /0. Back to Basics/4. Data Warehouses/README.md: -------------------------------------------------------------------------------- 1 | ## What is a Data Warehouse? 2 | --- 3 | * Data Warehouse is a system (including processes, technologies & data representations that enables support for analytical processing) 4 | 5 | * Goals of a Data Warehouse: 6 | * Simple to understand 7 | * Performant 8 | * Quality Assured 9 | * Handles new business questions well 10 | * Secure 11 | 12 | ## Architecture 13 | --- 14 | * Several possible architectures to building a Data Warehouse 15 | 1. 
**Kimball's Bus Architecture**: 16 | ![Kimball's Bus Architecture](snapshots/kimball.PNG) 17 | * Results in common dimension data models shared by different business departments 18 | * Data is not kept at an aggregated level, rather they are at the atomic level 19 | * Organized by business processes, used by different departments 20 | 2. **Independent Data Marts**: 21 | ![Independent Data Marts](snapshots/datamart.PNG) 22 | * Independent Data Marts have ETL processes that are designed by specific business departments to meet their analytical needs 23 | * Different fact tables for the same events, no conformed dimensions 24 | * Uncoordinated efforts can lead to inconsistent views 25 | * Generally discouraged 26 | 3. **Inmon's Corporate Information Factory**: 27 | ![Inmon's Corporate Information Factory](snapshots/cif.PNG) 28 | * The Enterprise Data Warehouse provides a normalized data architecture before individual departments build on it 29 | * 2 ETL Process 30 | * Source systems -> 3NF DB 31 | * 3NF DB -> Departmental Data Marts 32 | * The Data Marts use a source 3NF model (single integrated source of truth) and add denormalization based on department needs 33 | * Data marts dimensionally modelled & unlike Kimball's dimensional models, they are mostly aggregated 34 | 4. **Hybrid Kimball Bus & Inmon CIF**: 35 | ![Hybrid Kimball Bus & Inmon CIF](snapshots/hybrid.PNG) 36 | 37 | ## OLAP Cubes 38 | --- 39 | * An OLAP Cube is an aggregation of a fact metric on a number of dimensions 40 | 41 | * OLAP cubes need to store the finest grain of data in case drill-down is needed 42 | 43 | * Operations: 44 | 1. Roll-up & Drill-Down 45 | * Roll-Up: eg, from sales at city level, sum up sales of each city by country 46 | * Drill-Down: eg, decompose the sales of each city into smaller districts 47 | 2. Slice & Dice 48 | * Slice: Reduce N dimensions to N-1 dimensions by restricting one dimension to a single value 49 | * Dice: Same dimensions but computing a sub-cube by restricting some of the values of the dimensions 50 | Eg month in ['Feb', 'Mar'] and movie in ['Avatar', 'Batman'] 51 | 52 | * Query Optimization 53 | * Business users typically want to slice, dice, rollup and drill-down 54 | * Each sub-combination goes through all the facts table 55 | * Using CUBE operation "GROUP by CUBE" and saving the output is usually enough to answer forthcoming aggregations from business users without having to process the whole facts table again 56 | 57 | * Serving OLAP Cubes 58 | * Approach 1: Pre-aggregate the OLAP cubes and save them on a special purpose non-relational database (MOLAP) 59 | * Approach 2: Compute the OLAP Cubes on the fly from existing relational databases where the dimensional model resides (ROLAP) -------------------------------------------------------------------------------- /0. Back to Basics/4. Data Warehouses/snapshots/cif.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/cif.PNG -------------------------------------------------------------------------------- /0. Back to Basics/4. Data Warehouses/snapshots/datamart.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. 
Data Warehouses/snapshots/datamart.PNG -------------------------------------------------------------------------------- /0. Back to Basics/4. Data Warehouses/snapshots/hybrid.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/hybrid.PNG -------------------------------------------------------------------------------- /0. Back to Basics/4. Data Warehouses/snapshots/kimball.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/4. Data Warehouses/snapshots/kimball.PNG -------------------------------------------------------------------------------- /0. Back to Basics/5. Implementing Data Warehouse on AWS/README.md: -------------------------------------------------------------------------------- 1 | ## Choices for implementing Data Warehouse: 2 | --- 3 | 1. On-Premise 4 | * Need for diverse IT skills & multiple locations 5 | * Cost of ownership (capital and operational costs) 6 | 7 | 2. Cloud 8 | * Lower barriers to entry (time and money) 9 | * Scalability and elasticity out of the box 10 | * Within cloud, there are 2 ways to manage infrastructure 11 | 1. Cloud-Managed (Amazon RDS, Amazon DynamoDB, Amazon S3) 12 | * Reuse of expertise (Infrastructure as Code) 13 | * Less operational expense 14 | 2. Self-Managed (EC2 + Postgres, EC2 + Cassandra, EC2 + Unix FS) 15 | 16 | ## Amazon Redshift 17 | --- 18 | 1. Properties 19 | * Column-oriented storage, internally it is modified Postgresql 20 | * Best suited for storing OLAP workloads 21 | * is a Massively Parellel Processing Database 22 | * Parallelizes one query on multiple CPUS/machines 23 | * A table is partitioned and partitions are processed in parallel 24 | 25 | 2. Architecture 26 | * Leader Node: 27 | * Coordinates compute nodes 28 | * Handles external communication 29 | * Optimizes query execution 30 | * Compute Node: 31 | * Each with CPU, memory, disk and a number of slices 32 | * A node with n slices can process n partitions of a table simultaneously 33 | * Scale-up: get more powerful nodes 34 | * Scale-out: get more nodes 35 | * Example of setting up a Data Warehouse in Redshift: 36 | ![Example of Data Warehouse](redshift_dwh.PNG) 37 | Source: Udacity DE ND Lesson 3: Implementing Data Warehouses on AWS 38 | 39 | 3. Ingesting at Scale 40 | * Use COPY command to transfer from S3 staging area 41 | * If the file is large, better to break it up into multiple files 42 | * Either use a common prefix or a manifest file 43 | * Ingest from the same AWS region 44 | * Compress all csv files 45 | 46 | 4. Optimizing Table Design 47 | * 2 possible strategies: distribution style and sorting key 48 | 49 | 1. Distribution style 50 | * Even: 51 | * Round-robin over all slices for load-balancing 52 | * High cost of joining (Shuffling) 53 | * All: 54 | * Small (dimension) tables can be replicated on all slices to speed up joins 55 | * Auto: 56 | * Leave decision with Redshift, "small enough" tables are distributed with an ALL strategy. 
Large tables are distributed with an EVEN strategy 57 | * KEY: 58 | * Rows with the same value of the distribution key column are placed in the same slice 59 | * Can lead to a skewed distribution if some values of the dist key are more frequent than others 60 | * Very useful for large dimension tables 61 | 62 | 2. Sorting key 63 | * Rows are sorted before distribution to slices 64 | * Minimizes query time, since each node already has contiguous ranges of rows based on the sorting key 65 | * Useful for columns that are frequently used in sorting, like the date dimension and its corresponding foreign key in the fact table 66 | 67 | 
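As a rough sketch tying points 3 (Ingesting at Scale) and 4 (Optimizing Table Design) together (the table, bucket and account identifiers below are hypothetical, not taken from this project): a large fact table might combine a KEY distribution with a sort key, a small dimension table might be replicated with DISTSTYLE ALL, and the load would come from the S3 staging area via COPY:

```sql
-- Hypothetical Redshift DDL: distribute the large fact table by its join key,
-- sort it by date, and replicate the small dimension table to every node.
CREATE TABLE fact_sales (
    sale_id    BIGINT,
    date_key   INTEGER SORTKEY,
    store_key  INTEGER DISTKEY,
    amount     DECIMAL(12,2)
);

CREATE TABLE dim_store (
    store_key  INTEGER,
    store_name VARCHAR(100)
)
DISTSTYLE ALL;

-- Hypothetical ingestion from an S3 staging area: gzipped CSV parts under a
-- common prefix, loaded from the same AWS region via an IAM role.
COPY fact_sales
FROM 's3://my-staging-bucket/sales/part'
IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
REGION 'us-west-2'
CSV GZIP;
```

Splitting the source into multiple compressed files under the common prefix lets each slice ingest a part in parallel.
-------------------------------------------------------------------------------- /0. Back to Basics/5. Implementing Data Warehouse on AWS/dwh.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | KEY= 3 | SECRET= 4 | 5 | [DWH] 6 | DWH_CLUSTER_TYPE=multi-node 7 | DWH_NUM_NODES=4 8 | DWH_NODE_TYPE=dc2.large 9 | 10 | DWH_IAM_ROLE_NAME=dwhRole 11 | DWH_CLUSTER_IDENTIFIER=dwhCluster 12 | DWH_DB=dwh 13 | DWH_DB_USER= 14 | DWH_DB_PASSWORD= 15 | DWH_PORT=5439 16 | 17 | -------------------------------------------------------------------------------- /0. Back to Basics/5. Implementing Data Warehouse on AWS/redshift_dwh.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/5. Implementing Data Warehouse on AWS/redshift_dwh.PNG -------------------------------------------------------------------------------- /0. Back to Basics/6. Intro to Spark/Pyspark Data Wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Wrangling with PySpark DataFrames " 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from pyspark.sql import SparkSession\n", 17 | "from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, avg\n", 18 | "from pyspark.sql.functions import sum as Fsum\n", 19 | "from pyspark.sql.window import Window\n", 20 | "from pyspark.sql.types import IntegerType\n", 21 | "\n", 22 | "spark = SparkSession \\\n", 23 | " .builder \\\n", 24 | " .appName(\"Wrangling Data\") \\\n", 25 | " .getOrCreate()\n", 26 | "path = \"data/sparkify_log_small.json\"\n", 27 | "user_log = spark.read.json(path)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "# Which page did user id \"\" (empty string) NOT visit?"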
35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 40, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "Pages not visited by empty string user id: ['Submit Downgrade', 'Downgrade', 'Logout', 'Save Settings', 'Settings', 'NextSong', 'Upgrade', 'Error', 'Submit Upgrade']\n" 47 | ] 48 | } 49 | ], 50 | "source": [ 51 | "ul1 = user_log.alias('ul1')\n", 52 | "ul2 = user_log.filter(user_log.userId == \"\").alias('ul2')\n", 53 | "\n", 54 | "pages = ul1.join(ul2, ul1.page == ul2.page, how='left_anti').select('page') \\\n", 55 | " .distinct() \\\n", 56 | " .collect()\n", 57 | "pages = [x['page'] + for x in pages]\n", 58 | "\n", 59 | "print(\"Pages not visited by empty string user id: {}\".format(pages))" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "# What type of user does the empty string user id most likely refer to?\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 39, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "Pages visited by empty string user id: ['Home', 'About', 'Login', 'Help']\n" 79 | ] 80 | } 81 | ], 82 | "source": [ 83 | "all_pages = ul1.select('page').distinct().collect()\n", 84 | "\n", 85 | "all_pages = [x['page'] for x in all_pages]\n", 86 | "\n", 87 | "other_user_pages = [x for x in all_pages if x not in pages]\n", 88 | "\n", 89 | "print(\"Pages visited by empty string user id: {}\".format(other_user_pages))" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "Since ['Home', 'About', 'Login', 'Help'] are pages that empty string user ids visit, they are likely users who have not yet registered" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "# How many female users do we have in the data set?" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 38, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "Number of female users: 462\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "female_no = ul1.filter(ul1.gender == 'F').select(\"userId\").distinct().count()\n", 121 | "print(\"Number of female users: {}\".format(female_no))" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "# How many songs were played from the most played artist?" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 37, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "Number of songs played by top artist Coldplay: 83\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "artist_counts = ul1.where(col(\"artist\").isNotNull()).groupby(\"artist\") \\\n", 146 | " .count().sort(col(\"count\").desc()).collect()\n", 147 | "\n", 148 | "top_artist = artist_counts[0]['artist']\n", 149 | "\n", 150 | "number_of_songs = ul1.filter(ul1.artist == top_artist).count()\n", 151 | "\n", 152 | "print(\"Number of songs played by top artist {}: {}\".format(top_artist,\n", 153 | " number_of_songs))" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "# How many songs do users listen to on average between visiting our home page? 
Please round your answer to the closest integer.\n", 161 | "\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 43, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "+------------------+\n", 174 | "|avg(count(period))|\n", 175 | "+------------------+\n", 176 | "| 6.898347107438017|\n", 177 | "+------------------+\n", 178 | "\n" 179 | ] 180 | } 181 | ], 182 | "source": [ 183 | "function = udf(lambda ishome : int(ishome == 'Home'), IntegerType())\n", 184 | "\n", 185 | "user_window = Window \\\n", 186 | " .partitionBy('userID') \\\n", 187 | " .orderBy(desc('ts')) \\\n", 188 | " .rangeBetween(Window.unboundedPreceding, 0)\n", 189 | "\n", 190 | "cusum = ul1.filter((ul1.page == 'NextSong') | (ul1.page == 'Home')) \\\n", 191 | " .select('userID', 'page', 'ts') \\\n", 192 | " .withColumn('homevisit', function(col('page'))) \\\n", 193 | " .withColumn('period', Fsum('homevisit').over(user_window))\n", 194 | "\n", 195 | "cusum.filter((cusum.page == 'NextSong')) \\\n", 196 | " .groupBy('userID', 'period') \\\n", 197 | " .agg({'period':'count'}) \\\n", 198 | " .agg({'count(period)':'avg'}).show()" 199 | ] 200 | } 201 | ], 202 | "metadata": { 203 | "kernelspec": { 204 | "display_name": "Python 3", 205 | "language": "python", 206 | "name": "python3" 207 | }, 208 | "language_info": { 209 | "codemirror_mode": { 210 | "name": "ipython", 211 | "version": 3 212 | }, 213 | "file_extension": ".py", 214 | "mimetype": "text/x-python", 215 | "name": "python", 216 | "nbconvert_exporter": "python", 217 | "pygments_lexer": "ipython3", 218 | "version": "3.7.0" 219 | } 220 | }, 221 | "nbformat": 4, 222 | "nbformat_minor": 2 223 | } 224 | -------------------------------------------------------------------------------- /0. Back to Basics/6. Intro to Spark/README.md: -------------------------------------------------------------------------------- 1 | ## What is Spark? 2 | --- 3 | * Spark is a general-purpose distributed data processing engine. 4 | * On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application. 5 | * Spark is often used with distributed data stores such as Hadoop's HDFS, and Amazon's S3, with popular NoSQL databases such as Apache HBase, Apache Cassandra, and MongoDB, and with distributed messaging stores such as MapR Event Store and Apache Kafka. 6 | * Pyspark API 7 | * Pyspark supports imperative (Spark Dataframes) and declarative syntax (Spark SQL) 8 | 9 | ## How a Spark Application Runs on a Cluster 10 | --- 11 | * A Spark application runs as independent processes, coordinated by the SparkSession object in the driver program. 12 | * The resource or cluster manager assigns tasks to workers, one task per partition. 13 | * A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. 14 | * Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations. 15 | * Results are sent back to the driver application or can be saved to disk. 
16 | * Spark supports the following resource/cluster managers: 17 | * Spark Standalone – a simple cluster manager included with Spark 18 | * Apache Mesos – a general cluster manager that can also run Hadoop applications 19 | * Apache Hadoop YARN – the resource manager in Hadoop 2 20 | * Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications 21 | 22 | ## Spark's Limitations 23 | --- 24 | * Spark Streaming’s latency is at least 500 milliseconds since it operates on micro-batches of records, instead of processing one record at a time. Native streaming tools such as Storm, Apex, or Flink can push down this latency value and might be more suitable for low-latency applications. Flink and Apex can be used for batch computation as well, so if you're already using them for stream processing, there's no need to add Spark to your stack of technologies. 25 | 26 | * Another limitation of Spark is its selection of machine learning algorithms. Currently, Spark only supports algorithms that scale linearly with the input data size. In general, deep learning is not available either, though there are many projects integrate Spark with Tensorflow and other deep learning tools. -------------------------------------------------------------------------------- /0. Back to Basics/6. Intro to Spark/Spark SQL.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Wrangling with Spark SQL" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from pyspark.sql import SparkSession\n", 17 | "\n", 18 | "spark = SparkSession \\\n", 19 | " .builder \\\n", 20 | " .appName(\"Data wrangling with Spark SQL\") \\\n", 21 | " .getOrCreate()\n", 22 | "path = \"data/sparkify_log_small.json\"\n", 23 | "user_log = spark.read.json(path)\n", 24 | "user_log.createOrReplaceTempView(\"user_log_table\")" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "Which page did user id \"\"(empty string) NOT visit?" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 38, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "userId \"\" did not visit pages: ['Submit Downgrade', 'Downgrade', 'Logout', 'Save Settings', 'Settings', 'NextSong', 'Upgrade', 'Error', 'Submit Upgrade']\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "rows = spark.sql(\"\"\"\n", 49 | " SELECT DISTINCT ul1.page FROM user_log_table ul1\n", 50 | " LEFT ANTI JOIN (\n", 51 | " SELECT DISTINCT page FROM user_log_table\n", 52 | " WHERE user_log_table.userId = ''\n", 53 | " ) ul2 ON ul1.page = ul2.page\n", 54 | " \"\"\").collect()\n", 55 | "pages = [row.page for row in rows]\n", 56 | "print('userId \"\" did not visit pages: {}'.format(pages))" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "# How many female users do we have in the data set?" 
71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 29, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | "There are 462 female users\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "row = spark.sql(\"\"\"\n", 88 | " SELECT COUNT(DISTINCT(userId)) AS count FROM user_log_table\n", 89 | " WHERE gender='F'\n", 90 | " \"\"\").collect()\n", 91 | "count = row[0][0]\n", 92 | "print('There are {} female users'.format(count))" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "# How many songs were played from the most played artist?" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 30, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "83 songs were played from the most played artist: Coldplay\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "row = spark.sql(\"\"\"\n", 117 | " (SELECT artist, COUNT(song) AS count FROM user_log_table\n", 118 | " GROUP BY artist\n", 119 | " ORDER BY count DESC LIMIT 1)\n", 120 | " \"\"\").collect()\n", 121 | "count = row[0][0]\n", 122 | "print('{} songs were played from the most played artist: {}'.format(row[0][1], row[0][0]))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "# How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer." 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 32, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "name": "stdout", 139 | "output_type": "stream", 140 | "text": [ 141 | "+------------------+\n", 142 | "|avg(count_results)|\n", 143 | "+------------------+\n", 144 | "| 6.898347107438017|\n", 145 | "+------------------+\n", 146 | "\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;\n", 152 | "is_home = spark.sql(\"SELECT userID, page, ts, CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home FROM user_log_table \\\n", 153 | " WHERE (page = 'NextSong') or (page = 'Home') \\\n", 154 | " \")\n", 155 | "\n", 156 | "# keep the results in a new view\n", 157 | "is_home.createOrReplaceTempView(\"is_home_table\")\n", 158 | "\n", 159 | "# find the cumulative sum over the is_home column\n", 160 | "cumulative_sum = spark.sql(\"SELECT *, SUM(is_home) OVER \\\n", 161 | " (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period \\\n", 162 | " FROM is_home_table\")\n", 163 | "\n", 164 | "# keep the results in a view\n", 165 | "cumulative_sum.createOrReplaceTempView(\"period_table\")\n", 166 | "\n", 167 | "# find the average count for NextSong\n", 168 | "spark.sql(\"SELECT AVG(count_results) FROM \\\n", 169 | " (SELECT COUNT(*) AS count_results FROM period_table \\\n", 170 | "GROUP BY userID, period, page HAVING page = 'NextSong') AS counts\").show()" 171 | ] 172 | } 173 | ], 174 | "metadata": { 175 | "kernelspec": { 176 | "display_name": "Python 3", 177 | "language": "python", 178 | "name": "python3" 179 | }, 180 | "language_info": { 181 | "codemirror_mode": { 182 | "name": "ipython", 183 | "version": 3 184 | }, 185 | "file_extension": ".py", 186 | "mimetype": "text/x-python", 187 | "name": "python", 188 | "nbconvert_exporter": "python", 189 | "pygments_lexer": "ipython3", 190 | "version": "3.7.0" 191 | } 192 | }, 193 | "nbformat": 4, 194 | 
"nbformat_minor": 2 195 | } 196 | -------------------------------------------------------------------------------- /0. Back to Basics/7. Data Lakes/README.md: -------------------------------------------------------------------------------- 1 | ## What is a Data Lake? 2 | --- 3 | * A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files (from Wikipedia). 4 | 5 | ## Why Data Lakes? 6 | --- 7 | * Some data is difficult to put in tabular format, like deep json structures. 8 | * Text/Image data can be stored as blobs of data, and extracted easily for analytics later on. 9 | * Analytics such as machine learning and natural language processing may require accessing raw data in forms totally different from a star schema. 10 | 11 | ## Difference between Data Lake and Data Warehouse 12 | --- 13 | ![Lake vs Warehouse](dlvdwh.PNG) 14 | Source: Udacity DE ND 15 | 16 | * A data warehouse is like a producer of water, where users are handled bottled water in a particular size and shape of the bottle. 17 | * A data lake is like a water lake with many streams flowing into it and its up to users to get the water the way he/she wants 18 | 19 | ## Data Lake Issues 20 | --- 21 | * Data Lake is prone to being a "chaotic garbage dump". 22 | * Since a data lake is widely accessible across business departments, sometimes data governance is difficult to implement 23 | * It is still unclear, per given case, whether a data lake should replace, offload or work in parallel with a data warehouse or data marts. In all cases, dimensional modelling, even in the context of a data lake, continue to remain a valuable practice. 24 | * Data Lake remains an important complement to a Data Warehouse in many businesses. 25 | -------------------------------------------------------------------------------- /0. Back to Basics/7. Data Lakes/dlvdwh.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/0. Back to Basics/7. Data Lakes/dlvdwh.PNG -------------------------------------------------------------------------------- /0. Back to Basics/8. Data Pipelines with Airflow/README.md: -------------------------------------------------------------------------------- 1 | ## What is a Data Pipeline? 2 | --- 3 | * A data pipeline is simply, a series of steps in which data is processed 4 | 5 | ## Data Partitioning 6 | --- 7 | * Pipeline data partitioning is the process of isolating data to be analyzed by one or more attributes, such as time, logical type or data size 8 | * Data partitioning often leads to faster and more reliable pipelines 9 | * Types of Data Partitioning: 10 | 1. Schedule partitioning 11 | * Not only are schedules great for reducing the amount of data our pipelines have to process, but they also help us guarantee that we can meet timing guarantees that our data consumers may need 12 | 2. Logical partitioning 13 | * Conceptually related data can be partitioned into discrete segments and processed separately. This process of separating data based on its conceptual relationship is called logical partitioning. 14 | * With logical partitioning, unrelated things belong in separate steps. Consider your dependencies and separate processing around those boundaries 15 | * Examples of such partitioning are by date and time 16 | 3. 
Size partitioning 17 | * Size partitioning separates data for processing based on desired or required storage limits 18 | * This essentially sets the amount of data included in a data pipeline run 19 | * Why partition data? 20 | * Pipelines designed to work with partitioned data fail more gracefully. Smaller datasets, smaller time periods, and related concepts are easier to debug than big datasets, large time periods, and unrelated concepts 21 | * If data is partitioned appropriately, tasks will naturally have fewer dependencies on each other 22 | * Airflow will be able to parallelize execution of DAGs to produce results even faster 23 | 24 | ## Data Validation 25 | --- 26 | * Data Validation is the process of ensuring that data is present, correct & meaningful. Ensuring the quality of data through automated validation checks is a critical step in building data pipelines at any organization 27 | 28 | ## Data Quality 29 | --- 30 | * Data Quality is a measure of how well a dataset satisfies its intended use 31 | * Examples of Data Quality Requirements 32 | * Data must be a certain size 33 | * Data must be accurate to some margin of error 34 | * Data must arrive within a given timeframe from the start of execution 35 | * Pipelines must run on a particular schedule 36 | * Data must not contain any sensitive information 37 | 38 | ## Directed Acyclic Graphs 39 | --- 40 | * Directed Acyclic Graphs (DAGs): DAGs are a special subset of graphs in which the edges between nodes have a specific direction, and no cycles exist. 41 | 42 | ## Apache Airflow 43 | --- 44 | * What is Airflow? 45 | * Airflow is a platform to programmatically author, schedule and monitor workflows 46 | * Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks 47 | * The airflow scheduler executes your tasks on an array of workers while following the specified dependencies 48 | * When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative 49 | 50 | * Airflow concepts (Taken from Airflow documentation) 51 | * Operators 52 | * Operators determine what actually gets done by a task. An operator describes a single task in a workflow. Operators are usually (but not always) atomic. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently 53 | * Tasks 54 | * Once an operator is instantiated, it is referred to as a "task" The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG 55 | * A task instance represents a specific run of a task and is characterized as the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc 56 | * DAGs 57 | * In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies 58 | * A DAG run is a physical instance of a DAG, containing task instances that run for a specific execution_date. A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger 59 | * Hooks 60 | * Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. 
Hooks implement a common interface when possible, and act as a building block for operators 61 | * They also use the airflow.models.connection.Connection model to retrieve hostnames and authentication information 62 | * Hooks keep authentication code and information out of pipelines, centralized in the metadata database 63 | * Connections 64 | * The information needed to connect to external systems is stored in the Airflow metastore database and can be managed in the UI (Menu -> Admin -> Connections) 65 | * A conn_id is defined there, and hostname / login / password / schema information attached to it 66 | * Airflow pipelines retrieve centrally-managed connections information by specifying the relevant conn_id 67 | * Variables 68 | * Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow 69 | * Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI. In addition, json settings files can be bulk uploaded through the UI 70 | * Context & Templating 71 | * Airflow leverages templating to allow users to "fill in the blank" with important runtime variables for tasks 72 | * See: https://airflow.apache.org/docs/stable/macros-ref for a list of context variables 73 | 74 | * Airflow functionalities 75 | * Airflow Plugins 76 | * Airflow was built with the intention of allowing its users to extend and customize its functionality through plugins. 77 | * The most common types of user-created plugins for Airflow are Operators and Hooks. These plugins make DAGs reusable and simpler to maintain 78 | * To create custom operator, follow the steps 79 | 1. Identify Operators that perform similar functions and can be consolidated 80 | 2. Define a new Operator in the plugins folder 81 | 3. 
Replace the original Operators with your new custom one, re-parameterize, and instantiate them 82 | * Airflow subdags 83 | * Commonly repeated series of tasks within DAGs can be captured as reusable SubDAGs 84 | * Benefits include: 85 | * Decrease the amount of code we need to write and maintain to create a new DAG 86 | * Easier to understand the high level goals of a DAG 87 | * Bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG 88 | * Drawbacks of Using SubDAGs: 89 | * Limit the visibility within the Airflow UI 90 | * Abstraction makes understanding what the DAG is doing more difficult 91 | * Encourages premature optimization 92 | 93 | * Monitoring 94 | * Airflow can surface metrics and emails to help you stay on top of pipeline issues 95 | * SLAs 96 | * Airflow DAGs may optionally specify an SLA, or “Service Level Agreement”, which is defined as a time by which a DAG must complete 97 | * For time-sensitive applications these features are critical for developing trust amongst pipeline customers and ensuring that data is delivered while it is still meaningful 98 | * Emails and Alerts 99 | * Airflow can be configured to send emails on DAG and task state changes 100 | * These state changes may include successes, failures, or retries 101 | * Failure emails can easily trigger alerts 102 | * Metrics 103 | * Airflow comes out of the box with the ability to send system metrics using a metrics aggregator called statsd 104 | * Statsd can be coupled with metrics visualization tools like Grafana to provide high level insights into the overall performance of DAGs, jobs, and tasks 105 | 106 | * Best practices for data pipelining 107 | * Task Boundaries 108 | DAG tasks should be designed such that they are: 109 | * Atomic and have a single purpose 110 | * Maximize parallelism 111 | * Make failure states obvious -------------------------------------------------------------------------------- /0. Back to Basics/8. Data Pipelines with Airflow/context_and_templating.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Use the Airflow context in the pythonoperator to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use. 
3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.models import Variable 9 | from airflow.operators.python_operator import PythonOperator 10 | from airflow.hooks.S3_hook import S3Hook 11 | 12 | 13 | def log_details(*args, **kwargs): 14 | # 15 | # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them 16 | # NOTE: Look here for context variables passed in on kwargs: 17 | # https://airflow.apache.org/macros.html 18 | # 19 | ds = kwargs['ds'] 20 | run_id = kwargs['run_id'] 21 | previous_ds = kwargs['prev_ds'] 22 | next_ds = kwargs['next_ds'] 23 | 24 | logging.info(f"Execution date is {ds}") 25 | logging.info(f"My run id is {run_id}") 26 | if previous_ds: 27 | logging.info(f"My previous run was on {previous_ds}") 28 | if next_ds: 29 | logging.info(f"My next run will be {next_ds}") 30 | 31 | dag = DAG( 32 | 'lesson1.exercise5', 33 | schedule_interval="@daily", 34 | start_date=datetime.datetime.now() - datetime.timedelta(days=2) 35 | ) 36 | 37 | list_task = PythonOperator( 38 | task_id="log_details", 39 | python_callable=log_details, 40 | provide_context=True, 41 | dag=dag 42 | ) 43 | -------------------------------------------------------------------------------- /0. Back to Basics/8. Data Pipelines with Airflow/dag_for_subdag.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | from airflow import DAG 4 | from airflow.operators.postgres_operator import PostgresOperator 5 | from airflow.operators.subdag_operator import SubDagOperator 6 | from airflow.operators.udacity_plugin import HasRowsOperator 7 | 8 | from lesson3.exercise3.subdag import get_s3_to_redshift_dag 9 | import sql_statements 10 | 11 | 12 | start_date = datetime.datetime.utcnow() 13 | 14 | dag = DAG( 15 | "lesson3.exercise3", 16 | start_date=start_date, 17 | ) 18 | 19 | trips_task_id = "trips_subdag" 20 | trips_subdag_task = SubDagOperator( 21 | subdag=get_s3_to_redshift_dag( 22 | "lesson3.exercise3", 23 | trips_task_id, 24 | "redshift", 25 | "aws_credentials", 26 | "trips", 27 | sql_statements.CREATE_TRIPS_TABLE_SQL, 28 | s3_bucket="udac-data-pipelines", 29 | s3_key="divvy/unpartitioned/divvy_trips_2018.csv", 30 | start_date=start_date, 31 | ), 32 | task_id=trips_task_id, 33 | dag=dag, 34 | ) 35 | 36 | stations_task_id = "stations_subdag" 37 | stations_subdag_task = SubDagOperator( 38 | subdag=get_s3_to_redshift_dag( 39 | "lesson3.exercise3", 40 | stations_task_id, 41 | "redshift", 42 | "aws_credentials", 43 | "stations", 44 | sql_statements.CREATE_STATIONS_TABLE_SQL, 45 | s3_bucket="udac-data-pipelines", 46 | s3_key="divvy/unpartitioned/divvy_stations_2017.csv", 47 | start_date=start_date, 48 | ), 49 | task_id=stations_task_id, 50 | dag=dag, 51 | ) 52 | 53 | # 54 | # TODO: Consolidate check_trips and check_stations into a single check in the subdag 55 | # as we did with the create and copy in the demo 56 | # 57 | check_trips = HasRowsOperator( 58 | task_id="check_trips_data", 59 | dag=dag, 60 | redshift_conn_id="redshift", 61 | table="trips" 62 | ) 63 | 64 | check_stations = HasRowsOperator( 65 | task_id="check_stations_data", 66 | dag=dag, 67 | redshift_conn_id="redshift", 68 | table="stations" 69 | ) 70 | 71 | location_traffic_task = PostgresOperator( 72 | task_id="calculate_location_traffic", 73 | dag=dag, 74 | postgres_conn_id="redshift", 75 | sql=sql_statements.LOCATION_TRAFFIC_SQL 76 | ) 77 | 78 | # 79 | # TODO: Reorder the Graph once you have moved the checks 80 | # 81 | trips_subdag_task >> 
check_trips 82 | stations_subdag_task >> check_stations 83 | check_stations >> location_traffic_task 84 | check_trips >> location_traffic_task 85 | -------------------------------------------------------------------------------- /0. Back to Basics/8. Data Pipelines with Airflow/hello_airflow.py: -------------------------------------------------------------------------------- 1 | # Instructions 2 | # Define a function that uses the python logger to log a function. Then finish filling in the details of the DAG down below. Once you’ve done that, run "/opt/airflow/start.sh" command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then Run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page. 3 | 4 | import datetime 5 | import logging 6 | 7 | from airflow import DAG 8 | from airflow.operators.python_operator import PythonOperator 9 | 10 | def my_function(): 11 | logging.info("hello airflow") 12 | 13 | 14 | dag = DAG( 15 | 'mock_airflow_dag', 16 | start_date=datetime.datetime.now()) 17 | 18 | greet_task = PythonOperator( 19 | task_id="hello_airflow_task", 20 | python_callable=my_function, 21 | dag=dag 22 | ) -------------------------------------------------------------------------------- /0. Back to Basics/8. Data Pipelines with Airflow/subdag.py: -------------------------------------------------------------------------------- 1 | #Instructions 2 | #In this exercise, we’ll place our S3 to RedShift Copy operations into a SubDag. 3 | #1 - Consolidate HasRowsOperator into the SubDag 4 | #2 - Reorder the tasks to take advantage of the SubDag Operators 5 | 6 | import datetime 7 | 8 | from airflow import DAG 9 | from airflow.operators.postgres_operator import PostgresOperator 10 | from airflow.operators.udacity_plugin import HasRowsOperator 11 | from airflow.operators.udacity_plugin import S3ToRedshiftOperator 12 | 13 | import sql 14 | 15 | def get_s3_to_redshift_dag( 16 | parent_dag_name, 17 | task_id, 18 | redshift_conn_id, 19 | aws_credentials_id, 20 | table, 21 | create_sql_stmt, 22 | s3_bucket, 23 | s3_key, 24 | *args, **kwargs): 25 | dag = DAG( 26 | f"{parent_dag_name}.{task_id}", 27 | **kwargs 28 | ) 29 | 30 | create_task = PostgresOperator( 31 | task_id=f"create_{table}_table", 32 | dag=dag, 33 | postgres_conn_id=redshift_conn_id, 34 | sql=create_sql_stmt 35 | ) 36 | 37 | copy_task = S3ToRedshiftOperator( 38 | task_id=f"load_{table}_from_s3_to_redshift", 39 | dag=dag, 40 | table=table, 41 | redshift_conn_id=redshift_conn_id, 42 | aws_credentials_id=aws_credentials_id, 43 | s3_bucket=s3_bucket, 44 | s3_key=s3_key 45 | ) 46 | 47 | create_task >> copy_task 48 | return dag 49 | -------------------------------------------------------------------------------- /1. Postgres ETL/README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | This repo provides the ETL pipeline, to populate the sparkifydb database. 4 | * The purpose of this database is to enable Sparkify to answer business questions it may have of its users, the types of songs they listen to and the artists of those songs using the data that it has in logs and files. The database provides a consistent and reliable source to store this data. 
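* For example, a question such as "which songs are played most often?" becomes a single join against the fact table. The query below is a hypothetical sketch; the table and column names assume the star schema shown in schema.PNG below.

```sql
-- Hypothetical analytical query against the sparkifydb star schema:
-- the ten most-played songs, joining the songplays fact table to the songs dimension.
SELECT s.title, COUNT(*) AS play_count
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title
ORDER BY play_count DESC
LIMIT 10;
```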
5 | 6 | * This source of data will be useful in helping Sparkify reach some of its analytical goals, for example, finding out songs that have highest popularity or times of the day which is high in traffic. 7 | 8 | ## Database Design and ETL Pipeline 9 | --- 10 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data. 11 | 12 | ![Schema](schema.PNG) 13 | 14 | * For the ETL pipeline, Python is used as it contains libraries such as pandas, that simplifies data manipulation. It also allows connection to Postgres Database. 15 | 16 | * There are 2 types of data involved, song and log data. For song data, it contains information about songs and artists, which we extract from and load into users and artists dimension tables 17 | 18 | * Log data gives the information of each user session. From log data, we extract and load into time, users dimension tables and songplays fact table. 19 | 20 | ## Running the ETL Pipeline 21 | --- 22 | * First, run create_tables.py to create the data tables using the schema design specified. If tables were created previously, they will be dropped and recreated. 23 | 24 | * Next, run etl.py to populate the data tables created. -------------------------------------------------------------------------------- /1. Postgres ETL/create_tables.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | from sql_queries import create_table_queries, drop_table_queries 3 | 4 | 5 | def create_database(): 6 | """ 7 | - Creates and connects to the sparkifydb 8 | - Returns the connection and cursor to sparkifydb 9 | """ 10 | 11 | # connect to default database 12 | conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student") 13 | conn.set_session(autocommit=True) 14 | cur = conn.cursor() 15 | 16 | # create sparkify database with UTF8 encoding 17 | cur.execute("DROP DATABASE IF EXISTS sparkifydb") 18 | cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0") 19 | 20 | # close connection to default database 21 | conn.close() 22 | 23 | # connect to sparkify database 24 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 25 | cur = conn.cursor() 26 | 27 | return cur, conn 28 | 29 | 30 | def drop_tables(cur, conn): 31 | """ 32 | Drops each table using the queries in `drop_table_queries` list. 33 | """ 34 | for query in drop_table_queries: 35 | cur.execute(query) 36 | conn.commit() 37 | 38 | 39 | def create_tables(cur, conn): 40 | """ 41 | Creates each table using the queries in `create_table_queries` list. 42 | """ 43 | for query in create_table_queries: 44 | cur.execute(query) 45 | conn.commit() 46 | 47 | 48 | def main(): 49 | """ 50 | - Drops (if exists) and Creates the sparkify database. 51 | 52 | - Establishes connection with the sparkify database and gets 53 | cursor to it. 54 | 55 | - Drops all the tables. 56 | 57 | - Creates all tables needed. 58 | 59 | - Finally, closes the connection. 60 | """ 61 | cur, conn = create_database() 62 | 63 | drop_tables(cur, conn) 64 | create_tables(cur, conn) 65 | 66 | conn.close() 67 | 68 | 69 | if __name__ == "__main__": 70 | main() -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/log_data/2018/11/2018-11-01-events.json: -------------------------------------------------------------------------------- 1 | {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"} 2 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 3 | {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 4 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 5 | {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 6 | {"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 7 | {"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 8 | {"artist":"Infected Mushroom","auth":"Logged 
In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 9 | {"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 10 | {"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 11 | {"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"} 12 | {"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 13 | {"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 14 | {"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 15 | {"artist":"Survivor","auth":"Logged 
In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAAW128F429D538.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAABD128F429CF47.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMJAGH1187FB546F3", "artist_latitude": 35.14968, "artist_longitude": -90.04892, "artist_location": "Memphis, TN", "artist_name": "The Box Tops", "song_id": "SOCIWDW12A8C13D406", "title": "Soul Deep", "duration": 148.03546, "year": 1969} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAADZ128F9348C2E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKRRTF1187B9984DA", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sonora Santanera", "song_id": "SOXVLOJ12AB0189215", "title": "Amor De Cabaret", "duration": 177.47546, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAEF128F4273421.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7G5I41187FB4CE6C", "artist_latitude": null, "artist_longitude": null, "artist_location": "London, England", "artist_name": "Adam Ant", "song_id": "SONHOTT12A8C13493C", "title": "Something Girls", "duration": 233.40363, "year": 1982} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAFD128F92F423A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARXR32B1187FB57099", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gob", "song_id": "SOFSOCN12A8C143F5D", "title": "Face the Ashes", "duration": 209.60608, "year": 2007} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAMO128F1481E7F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKFYS91187B98E58F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Jeff And Sheri Easter", "song_id": "SOYMRWW12A6D4FAB14", "title": "The Moon And I (Ordinary Day Album Version)", "duration": 267.7024, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/A/A/TRAAAMQ128F1460CD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD0S291187B9B7BF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "Ohio", "artist_name": "Rated R", "song_id": "SOMJBYD12A6D4F8557", "title": "Keepin It Real (Skit)", "duration": 114.78159, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAPK128E0786D96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR10USD1187B99F3F1", "artist_latitude": null, "artist_longitude": null, "artist_location": "Burlington, Ontario, Canada", "artist_name": "Tweeterfriendly Music", "song_id": "SOHKNRJ12A6701D1F8", "title": "Drop of Rain", "duration": 189.57016, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAARJ128F9320760.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8ZCNI1187B9A069B", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Planet P Project", "song_id": "SOIAZJW12AB01853F1", "title": "Pink World", "duration": 269.81832, "year": 1984} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAVG12903CFA543.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOUDSGM12AC9618304", "title": "Insatiable (Instrumental Version)", "duration": 266.39628, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/A/TRAAAVO128F93133D4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSJW91187B9B1D6B", "artist_latitude": 35.21962, "artist_longitude": -80.01955, "artist_location": "North Carolina", "artist_name": "JennyAnyKind", "song_id": "SOQHXMF12AB0182363", "title": "Young Boy Blues", "duration": 218.77506, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABCL128F4286650.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARC43071187B990240", "artist_latitude": null, "artist_longitude": null, "artist_location": "Wisner, LA", "artist_name": "Wayne Watson", "song_id": "SOKEJEJ12A8C13E0D0", "title": "The Urgency (LP Version)", "duration": 245.21098, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABDL12903CAABBA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARL7K851187B99ACD2", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Andy Andy", "song_id": "SOMUYGI12AB0188633", "title": "La Culpa", "duration": 226.35057, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/A/B/TRAABJL12903CDCF1A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARHHO3O1187B989413", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Bob Azzam", "song_id": "SORAMLE12AB017C8B0", "title": "Auguri Cha Cha", "duration": 191.84281, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABJV128F1460C49.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIK43K1187B9AE54C", "artist_latitude": null, "artist_longitude": null, "artist_location": "Beverly Hills, CA", "artist_name": "Lionel Richie", "song_id": "SOBONFF12A6D4F84D8", "title": "Tonight Will Be Alright", "duration": 307.3824, "year": 1986} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABLR128F423B7E3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD842G1187B997376", "artist_latitude": 43.64856, "artist_longitude": -79.38533, "artist_location": "Toronto, Ontario, Canada", "artist_name": "Blue Rodeo", "song_id": "SOHUOAP12A8AE488E9", "title": "Floating", "duration": 491.12771, "year": 1987} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABNV128F425CEE1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIG6O41187B988BDD", "artist_latitude": 37.16793, "artist_longitude": -95.84502, "artist_location": "United States", "artist_name": "Richard Souther", "song_id": "SOUQQEA12A8C134B1B", "title": "High Tide", "duration": 228.5971, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABRB128F9306DD5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1ZHYZ1187FB3C717", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Faiz Ali Faiz", "song_id": "SOILPQQ12AB017E82A", "title": "Sohna Nee Sohna Data", "duration": 599.24853, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABVM128F92CA9DC.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARYKCQI1187FB3B18F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Tesla", "song_id": "SOXLBJT12A8C140925", "title": "Caught In A Dream", "duration": 290.29832, "year": 2004} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABXG128F9318EBD.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNPAGP1241B9C7FD4", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "lextrical", "song_id": "SOZVMJI12AB01808AF", "title": "Synthetic Dream", "duration": 165.69424, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/A/B/TRAABYN12903CFD305.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQGYP71187FB44566", "artist_latitude": 34.31109, "artist_longitude": -94.02978, "artist_location": "Mineola, AR", "artist_name": "Jimmy Wakely", "song_id": "SOWTBJW12AC468AC6E", "title": "Broken-Down Merry-Go-Round", "duration": 151.84934, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/B/TRAABYW128F4244559.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI3BMM1187FB4255E", "artist_latitude": 38.8991, "artist_longitude": -77.029, "artist_location": "Washington", "artist_name": "Alice Stuart", "song_id": "SOBEBDG12A58A76D60", "title": "Kassie Jones", "duration": 220.78649, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACCG128F92E8A55.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR5KOSW1187FB35FF4", "artist_latitude": 49.80388, "artist_longitude": 15.47491, "artist_location": "Dubai UAE", "artist_name": "Elena", "song_id": "SOZCTXZ12AB0182364", "title": "Setanta matins", "duration": 269.58322, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACER128F4290F96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMAC4T1187FB3FA4C", "artist_latitude": 40.82624, "artist_longitude": -74.47995, "artist_location": "Morris Plains, NJ", "artist_name": "The Dillinger Escape Plan", "song_id": "SOBBUGU12A8C13E95D", "title": "Setting Fire to Sleeping Giants", "duration": 207.77751, "year": 2004} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACFV128F935E50B.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR47JEX1187B995D81", "artist_latitude": 37.83721, "artist_longitude": -94.35868, "artist_location": "Nevada, MO", "artist_name": "SUE THOMPSON", "song_id": "SOBLGCN12AB0183212", "title": "James (Hold The Ladder Steady)", "duration": 124.86485, "year": 1985} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACHN128F1489601.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGIWFO1187B9B55B7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Five Bolt Main", "song_id": "SOPSWQW12A6D4F8781", "title": "Made Like This (Live)", "duration": 225.09669, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACIW12903CC0F6D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOZQDIU12A58A7BCF6", "title": "Superconfidential", "duration": 338.31138, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/A/C/TRAACLV128F427E123.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDNS031187B9924F0", "artist_latitude": 32.67828, "artist_longitude": -83.22295, "artist_location": "Georgia", "artist_name": "Tim Wilson", "song_id": "SONYPOM12A8C13B2D7", "title": "I Think My Wife Is Running Around On Me (Taco Hell)", "duration": 186.48771, "year": 2005} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACNS128F14A2DF5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROUOZZ1187B9ABE51", "artist_latitude": 40.79195, "artist_longitude": -73.94512, "artist_location": "New York, NY [Spanish Harlem]", "artist_name": "Willie Bobo", "song_id": "SOBZBAZ12A6D4F8742", "title": "Spanish Grease", "duration": 168.25424, "year": 1997} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACOW128F933E35F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARH4Z031187B9A71F2", "artist_latitude": 40.73197, "artist_longitude": -74.17418, "artist_location": "Newark, NJ", "artist_name": "Faye Adams", "song_id": "SOVYKGO12AB0187199", "title": "Crazy Mixed Up World", "duration": 156.39465, "year": 1961} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACPE128F421C1B9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARB29H41187B98F0EF", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago", "artist_name": "Terry Callier", "song_id": "SOGNCJP12A58A80271", "title": "Do You Finally Need A Friend", "duration": 342.56934, "year": 1972} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACQT128F9331780.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1Y2PT1187FB5B9CE", "artist_latitude": 27.94017, "artist_longitude": -82.32547, "artist_location": "Brandon", "artist_name": "John Wesley", "song_id": "SOLLHMX12AB01846DC", "title": "The Emperor Falls", "duration": 484.62322, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACSL128F93462F4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAJPHH1187FB5566A", "artist_latitude": 40.7038, "artist_longitude": -73.83168, "artist_location": "Queens, NY", "artist_name": "The Shangri-Las", "song_id": "SOYTPEP12AB0180E7B", "title": "Twist and Shout", "duration": 164.80608, "year": 1964} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/A/C/TRAACTB12903CAAF15.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0RCMP1187FB3F427", "artist_latitude": 30.08615, "artist_longitude": -94.10158, "artist_location": "Beaumont, TX", "artist_name": "Billie Jo Spears", "song_id": "SOGXHEG12AB018653E", "title": "It Makes No Difference Now", "duration": 133.32853, "year": 1992} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACVS128E078BE39.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREBBGV1187FB523D2", "artist_latitude": null, "artist_longitude": null, "artist_location": "Houston, TX", "artist_name": "Mike Jones (Featuring CJ_ Mello & Lil' Bran)", "song_id": "SOOLYAZ12A6701F4A6", "title": "Laws Patrolling (Album Version)", "duration": 173.66159, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/A/C/TRAACZK128F4243829.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGUVEV1187B98BA17", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sierra Maestra", "song_id": "SOGOSOV12AF72A285E", "title": "\u00bfD\u00f3nde va Chichi?", "duration": 313.12934, "year": 1997} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABACN128F425B784.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOQLGFP12A58A7800E", "title": "OAKtown", "duration": 259.44771, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAFJ128F42AF24E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR3JMC51187B9AE49D", "artist_latitude": 28.53823, "artist_longitude": -81.37739, "artist_location": "Orlando, FL", "artist_name": "Backstreet Boys", "song_id": "SOPVXLX12A8C1402D5", "title": "Larger Than Life", "duration": 236.25098, "year": 1999} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAFP128F931E9A1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPBNLO1187FB3D52F", "artist_latitude": 40.71455, "artist_longitude": -74.00712, "artist_location": "New York, NY", "artist_name": "Tiny Tim", "song_id": "SOAOIBZ12AB01815BE", "title": "I Hold Your Hand In Mine [Live At Royal Albert Hall]", "duration": 43.36281, "year": 2000} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/B/A/TRABAIO128F42938F9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR9AWNF1187B9AB0B4", "artist_latitude": null, "artist_longitude": null, "artist_location": "Seattle, Washington USA", "artist_name": "Kenny G featuring Daryl Hall", "song_id": "SOZHPGD12A8C1394FE", "title": "Baby Come To Me", "duration": 236.93016, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABATO128F42627E9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROGWRA122988FEE45", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Christos Dantis", "song_id": "SOSLAVG12A8C13397F", "title": "Den Pai Alo", "duration": 243.82649, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAVQ12903CBF7E0.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMBR4Y1187B9990EB", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "David Martin", "song_id": "SOTTDKS12AB018D69B", "title": "It Wont Be Christmas", "duration": 241.47546, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAWW128F4250A31.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQ9BO41187FB5CF1F", "artist_latitude": 40.99471, "artist_longitude": -77.60454, "artist_location": "Pennsylvania", "artist_name": "John Davis", "song_id": "SOMVWWT12A58A7AE05", "title": "Knocked Out Of The Park", "duration": 183.17016, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAXL128F424FC50.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKULSX1187FB45F84", "artist_latitude": 39.49974, "artist_longitude": -111.54732, "artist_location": "Utah", "artist_name": "Trafik", "song_id": "SOQVMXR12A81C21483", "title": "Salt In NYC", "duration": 424.12363, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAXR128F426515F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI2JSK1187FB496EF", "artist_latitude": 51.50632, "artist_longitude": -0.12714, "artist_location": "London, England", "artist_name": "Nick Ingman;Gavyn Wright", "song_id": "SODUJBS12A8C132150", "title": "Wessex Loses a Bride", "duration": 111.62077, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/A/TRABAXV128F92F6AE3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDBBQ1187B98AFF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Eddie Calvert", "song_id": "SOBBXLX12A58A79DDA", "title": "Erica (2005 Digital Remaster)", "duration": 138.63138, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/B/A/TRABAZH128F930419A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7ZKHQ1187B98DD73", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Glad", "song_id": "SOTUKVB12AB0181477", "title": "Blessed Assurance", "duration": 270.602, "year": 1993} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBAM128F429D223.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBGXIG122988F409D", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "Steel Rain", "song_id": "SOOJPRH12A8C141995", "title": "Loaded Like A Gun", "duration": 173.19138, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBBV128F42967D7.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7SMBG1187B9B9066", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Los Manolos", "song_id": "SOBCOSW12A8C13D398", "title": "Rumba De Barcelona", "duration": 218.38322, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBJE12903CDB442.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGCY1Y1187B9A4FA5", "artist_latitude": 36.16778, "artist_longitude": -86.77836, "artist_location": "Nashville, TN.", "artist_name": "Gloriana", "song_id": "SOQOTLQ12AB01868D0", "title": "Clementina Santaf\u00e8", "duration": 153.33832, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBKX128F4285205.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR36F9J1187FB406F1", "artist_latitude": 56.27609, "artist_longitude": 9.51695, "artist_location": "Denmark", "artist_name": "Bombay Rockers", "song_id": "SOBKWDJ12A8C13B2F3", "title": "Wild Rose (Back 2 Basics Mix)", "duration": 230.71302, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBLU128F93349CF.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNNKDK1187B98BBD5", "artist_latitude": 45.80726, "artist_longitude": 15.9676, "artist_location": "Zagreb Croatia", "artist_name": "Jinx", "song_id": "SOFNOQK12AB01840FC", "title": "Kutt Free (DJ Volume Remix)", "duration": 407.37914, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBNP128F932546F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR62SOJ1187FB47BB5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Chase & Status", "song_id": "SOGVQGJ12AB017F169", "title": "Ten Tonne", "duration": 337.68444, "year": 2005} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/B/B/TRABBOP128F931B50D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBEBBY1187B9B43DB", "artist_latitude": null, "artist_longitude": null, "artist_location": "Gainesville, FL", "artist_name": "Tom Petty", "song_id": "SOFFKZS12AB017F194", "title": "A Higher Place (Album Version)", "duration": 236.17261, "year": 1994} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBOR128F4286200.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDR4AC1187FB371A1", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Montserrat Caball\u00e9;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti", "song_id": "SOBAYLL12A8C138AF9", "title": "Sono andati? Fingevo di dormire", "duration": 511.16363, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBTA128F933D304.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAGB2O1187FB3A161", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Pucho & His Latin Soul Brothers", "song_id": "SOLEYHO12AB0188A85", "title": "Got My Mojo Workin", "duration": 338.23302, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBVJ128F92F7EAA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDL271187FB40F44", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Soul Mekanik", "song_id": "SOPEGZN12AB0181B3D", "title": "Get Your Head Stuck On Your Neck", "duration": 45.66159, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBXU128F92FEF48.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARP6N5A1187B99D1A3", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamtramck, MI", "artist_name": "Mitch Ryder", "song_id": "SOXILUQ12A58A7C72A", "title": "Jenny Take a Ride", "duration": 207.43791, "year": 2004} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/B/TRABBZN12903CD9297.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSAFR1269FB35070", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Blingtones", "song_id": "SOTCKKY12AB018A141", "title": "Sonnerie lalaleul\u00e9 hi houuu", "duration": 29.54404, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/B/C/TRABCAJ12903CDFCC2.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCEC128F426456E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0IAWL1187B9A96D0", "artist_latitude": 8.4177, "artist_longitude": -80.11278, "artist_location": "Panama", "artist_name": "Danilo Perez", "song_id": "SONSKXP12A8C13A2C9", "title": "Native Soul", "duration": 197.19791, "year": 2003} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCEI128F424C983.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCFL128F149BB0D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARLTWXK1187FB5A3F8", "artist_latitude": 32.74863, "artist_longitude": -97.32925, "artist_location": "Fort Worth, TX", "artist_name": "King Curtis", "song_id": "SODREIN12A58A7F2E5", "title": "A Whiter Shade Of Pale (Live @ Fillmore West)", "duration": 326.00771, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCIX128F4265903.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNF6401187FB57032", "artist_latitude": 40.79086, "artist_longitude": -73.96644, "artist_location": "New York, NY [Manhattan]", "artist_name": "Sophie B. Hawkins", "song_id": "SONWXQJ12A8C134D94", "title": "The Ballad Of Sleeping Beauty", "duration": 305.162, "year": 1994} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCKL128F423A778.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPFHN61187FB575F6", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago, IL", "artist_name": "Lupe Fiasco", "song_id": "SOWQTQZ12A58A7B63E", "title": "Streets On Fire (Explicit Album Version)", "duration": 279.97995, "year": 0} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCPZ128F4275C32.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR051KA1187B98B2FF", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Wilks", "song_id": "SOLYIBD12A8C135045", "title": "Music is what we love", "duration": 261.51138, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/data/song_data/A/B/C/TRABCRU128F423F449.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Marc Shaiman", "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "duration": 149.86404, "year": 2008} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCTK128F934B224.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR558FS1187FB45658", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "40 Grit", "song_id": "SOGDBUF12A8C140FAA", "title": "Intro", "duration": 75.67628, "year": 2003} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCUQ128E0783E2B.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARVBRGZ1187FB4675A", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gwen Stefani", "song_id": "SORRZGD12A6310DBC3", "title": "Harajuku Girls", "duration": 290.55955, "year": 2004} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCXB128F4286BD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARWB3G61187FB49404", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamilton, Ohio", "artist_name": "Steve Morse", "song_id": "SODAUVL12A8C13D184", "title": "Prognosis", "duration": 363.85914, "year": 2000} -------------------------------------------------------------------------------- /1. Postgres ETL/data/song_data/A/B/C/TRABCYE128F934CE1D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREVWGE1187B9B890A", "artist_latitude": -13.442, "artist_longitude": -41.9952, "artist_location": "Noci (BA)", "artist_name": "Bitter End", "song_id": "SOFCHDR12AB01866EF", "title": "Living Hell", "duration": 282.43546, "year": 0} -------------------------------------------------------------------------------- /1. 
Postgres ETL/etl.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import psycopg2 4 | import pandas as pd 5 | from sql_queries import * 6 | 7 | 8 | def process_song_file(cur, filepath): 9 | """ 10 | - Load data from a song file into the songs and artists tables 11 | """ 12 | # open song file 13 | df = pd.read_json(filepath, lines=True) 14 | 15 | # insert song record 16 | song_data = list(df[['song_id', 'title', 'artist_id', 'year', 'duration']].values[0]) 17 | cur.execute(song_table_insert, song_data) 18 | 19 | # insert artist record 20 | artist_data = list(df[['artist_id', 'artist_name', 'artist_location', 21 | 'artist_latitude', 'artist_longitude']].values[0]) 22 | cur.execute(artist_table_insert, artist_data) 23 | 24 | 25 | def process_log_file(cur, filepath): 26 | """ 27 | - Load data from a log file into the time, users and songplays tables 28 | """ 29 | # open log file 30 | df = pd.read_json(filepath, lines=True) 31 | 32 | # filter by NextSong action 33 | df = df[df['page'] == 'NextSong'] 34 | 35 | # convert timestamp column (milliseconds since epoch) to datetime 36 | t = pd.to_datetime(df['ts'], unit='ms') 37 | 38 | # insert time data records; keep the raw ms value as start_time so it matches songplays.start_time 39 | time_data = [(int(ts), tt.hour, tt.day, tt.week, tt.month, tt.year, tt.weekday()) for ts, tt in zip(df['ts'], t)] 40 | column_labels = ('start_time', 'hour', 'day', 'week', 'month', 'year', 'weekday') 41 | time_df = pd.DataFrame(data=time_data, columns=column_labels) 42 | 43 | for i, row in time_df.iterrows(): 44 | cur.execute(time_table_insert, list(row)) 45 | 46 | # load user table 47 | user_df = df[['userId', 'firstName', 'lastName', 'gender', 'level']] 48 | 49 | # insert user records 50 | for i, row in user_df.iterrows(): 51 | cur.execute(user_table_insert, row) 52 | 53 | # insert songplay records 54 | for index, row in df.iterrows(): 55 | 56 | # get songid and artistid from song and artist tables 57 | cur.execute(song_select, (row.song, row.artist, row.length)) 58 | results = cur.fetchone() 59 | 60 | if results: 61 | songid, artistid = results 62 | else: 63 | songid, artistid = None, None 64 | 65 | # insert songplay record 66 | songplay_data = (index, row['ts'], row['userId'], row['level'], songid, artistid, row['sessionId'], 67 | row['location'], row['userAgent']) 68 | cur.execute(songplay_table_insert, songplay_data) 69 | 70 | 71 | def process_data(cur, conn, filepath, func): 72 | """ 73 | - Iterate over all files and populate data tables in sparkifydb 74 | """ 75 | # get all files matching extension from directory 76 | all_files = [] 77 | for root, dirs, files in os.walk(filepath): 78 | files = glob.glob(os.path.join(root, '*.json')) 79 | for f in files: 80 | all_files.append(os.path.abspath(f)) 81 | 82 | # get total number of files found 83 | num_files = len(all_files) 84 | print('{} files found in {}'.format(num_files, filepath)) 85 | 86 | # iterate over files and process 87 | for i, datafile in enumerate(all_files, 1): 88 | func(cur, datafile) 89 | conn.commit() 90 | print('{}/{} files processed.'.format(i, num_files)) 91 | 92 | 93 | def main(): 94 | """ 95 | - Establishes connection with the sparkify database and gets 96 | cursor to it.
97 | 98 | - Runs ETL pipelines 99 | """ 100 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 101 | cur = conn.cursor() 102 | 103 | process_data(cur, conn, filepath='data/song_data', func=process_song_file) 104 | process_data(cur, conn, filepath='data/log_data', func=process_log_file) 105 | 106 | conn.close() 107 | 108 | 109 | if __name__ == "__main__": 110 | main() -------------------------------------------------------------------------------- /1. Postgres ETL/schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/1. Postgres ETL/schema.PNG -------------------------------------------------------------------------------- /1. Postgres ETL/sql_queries.py: -------------------------------------------------------------------------------- 1 | # DROP TABLES 2 | 3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays;" 4 | user_table_drop = "DROP TABLE IF EXISTS users;" 5 | song_table_drop = "DROP TABLE IF EXISTS songs;" 6 | artist_table_drop = "DROP TABLE IF EXISTS artists;" 7 | time_table_drop = "DROP TABLE IF EXISTS time;" 8 | 9 | # CREATE TABLES 10 | 11 | songplay_table_create = (""" 12 | CREATE TABLE songplays 13 | (songplay_id int PRIMARY KEY, 14 | start_time bigint REFERENCES time(start_time) ON DELETE RESTRICT, 15 | user_id int REFERENCES users(user_id) ON DELETE RESTRICT, 16 | level varchar, 17 | song_id varchar REFERENCES songs(song_id) ON DELETE RESTRICT, 18 | artist_id varchar REFERENCES artists(artist_id) ON DELETE RESTRICT, 19 | session_id int, 20 | location varchar, 21 | user_agent varchar); 22 | """) 23 | 24 | user_table_create = (""" 25 | CREATE TABLE users 26 | (user_id int PRIMARY KEY, 27 | first_name varchar, 28 | last_name varchar, 29 | gender varchar, 30 | level varchar); 31 | """) 32 | 33 | song_table_create = (""" 34 | CREATE TABLE songs 35 | (song_id varchar PRIMARY KEY, 36 | title varchar, 37 | artist_id varchar, 38 | year int, 39 | duration float); 40 | """) 41 | 42 | artist_table_create = (""" 43 | CREATE TABLE artists 44 | (artist_id varchar PRIMARY KEY, 45 | name varchar, 46 | location varchar, 47 | latitude float, 48 | longitude float); 49 | """) 50 | 51 | time_table_create = (""" 52 | CREATE TABLE time 53 | (start_time bigint PRIMARY KEY, 54 | hour int, 55 | day int, 56 | week int, 57 | month int, 58 | year int, 59 | weekday int); 60 | """) 61 | 62 | # INSERT RECORDS 63 | 64 | songplay_table_insert = (""" 65 | INSERT INTO songplays (songplay_id, start_time, user_id, level, song_id, artist_id, 66 | session_id, location, user_agent) 67 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s) 68 | ON CONFLICT (songplay_id) 69 | DO NOTHING; 70 | """) 71 | 72 | user_table_insert = (""" 73 | INSERT INTO users (user_id, first_name, last_name, gender, level) 74 | VALUES (%s, %s, %s, %s, %s) 75 | ON CONFLICT (user_id) DO UPDATE 76 | SET level=excluded.level; 77 | """) 78 | 79 | song_table_insert = (""" 80 | INSERT INTO songs (song_id, title, artist_id, year, duration) 81 | VALUES (%s, %s, %s, %s, %s) 82 | ON CONFLICT (song_id) 83 | DO NOTHING; 84 | """) 85 | 86 | artist_table_insert = (""" 87 | INSERT INTO artists (artist_id, name, location, latitude, longitude) 88 | VALUES (%s, %s, %s, %s, %s) 89 | ON CONFLICT (artist_id) 90 | DO NOTHING; 91 | """) 92 | 93 | 94 | time_table_insert = (""" 95 | INSERT INTO time (start_time, hour, day, week, month, year, weekday) 96 | VALUES (%s, %s, %s, %s, %s, 
%s, %s) 97 | ON CONFLICT (start_time) 98 | DO NOTHING; 99 | """) 100 | 101 | # FIND SONGS 102 | 103 | song_select = (""" 104 | SELECT songs.song_id, artists.artist_id FROM songs 105 | JOIN artists ON songs.artist_id=artists.artist_id 106 | WHERE songs.title=%s AND artists.name=%s AND songs.duration=%s; 107 | """) 108 | 109 | # QUERY LISTS 110 | 111 | create_table_queries = [user_table_create, song_table_create, artist_table_create, time_table_create, songplay_table_create] 112 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] -------------------------------------------------------------------------------- /2. Cassandra ETL/event_data/2018-11-01-events.csv: -------------------------------------------------------------------------------- 1 | artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userId 2 | ,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54092E+12,38,,200,1.54111E+12,39 3 | ,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1.54034E+12,139,,200,1.54111E+12,8 4 | Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,You Gotta Be,200,1.54111E+12,8 5 | ,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1.54034E+12,139,,200,1.54111E+12,8 6 | Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Flat 55,200,1.54111E+12,8 7 | Tamba Trio,Logged In,Kaylee,F,4,Summers,177.18812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Quem Quiser Encontrar O Amor,200,1.54111E+12,8 8 | The Mars Volta,Logged In,Kaylee,F,5,Summers,380.42077,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Eriatarka,200,1.54111E+12,8 9 | Infected Mushroom,Logged In,Kaylee,F,6,Summers,440.2673,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Becoming Insane,200,1.54111E+12,8 10 | Blue October / Imogen Heap,Logged In,Kaylee,F,7,Summers,241.3971,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Congratulations,200,1.54111E+12,8 11 | Girl Talk,Logged In,Kaylee,F,8,Summers,160.15628,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1.54034E+12,139,Once again,200,1.54111E+12,8 12 | Black Eyed Peas,Logged In,Sylvie,F,0,Cruz,214.93506,free,"Washington-Arlington-Alexandria, DC-VA-MD-WV",PUT,NextSong,1.54027E+12,9,Pump It,200,1.54111E+12,10 13 | ,Logged In,Ryan,M,0,Smith,,free,"San Jose-Sunnyvale-Santa Clara, CA",GET,Home,1.54102E+12,169,,200,1.54111E+12,26 14 | Fall Out Boy,Logged In,Ryan,M,1,Smith,200.72444,free,"San Jose-Sunnyvale-Santa Clara, CA",PUT,NextSong,1.54102E+12,169,Nobody Puts Baby In The Corner,200,1.54111E+12,26 15 | M.I.A.,Logged In,Ryan,M,2,Smith,233.7171,free,"San Jose-Sunnyvale-Santa Clara, CA",PUT,NextSong,1.54102E+12,169,Mango Pickle Down River (With The Wilcannia Mob),200,1.54111E+12,26 16 | Survivor,Logged In,Jayden,M,0,Fox,245.36771,free,"New Orleans-Metairie, LA",PUT,NextSong,1.54103E+12,100,Eye Of The Tiger,200,1.54111E+12,101 17 | -------------------------------------------------------------------------------- /2. 
Cassandra ETL/event_data/2018-11-25-events.csv: -------------------------------------------------------------------------------- 1 | artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userId 2 | matchbox twenty,Logged In,Jayden,F,0,Duffy,177.65832,free,"Seattle-Tacoma-Bellevue, WA",PUT,NextSong,1.54015E+12,846,Argue (LP Version),200,1.54311E+12,76 3 | The Lonely Island / T-Pain,Logged In,Jayden,F,1,Duffy,156.23791,free,"Seattle-Tacoma-Bellevue, WA",PUT,NextSong,1.54015E+12,846,I'm On A Boat,200,1.54311E+12,76 4 | ,Logged In,Jayden,F,2,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",GET,Home,1.54015E+12,846,,200,1.54311E+12,76 5 | ,Logged In,Jayden,F,3,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",GET,Settings,1.54015E+12,846,,200,1.54311E+12,76 6 | ,Logged In,Jayden,F,4,Duffy,,free,"Seattle-Tacoma-Bellevue, WA",PUT,Save Settings,1.54015E+12,846,,307,1.54311E+12,76 7 | John Mayer,Logged In,Wyatt,M,0,Scott,275.27791,free,"Eureka-Arcata-Fortuna, CA",PUT,NextSong,1.54087E+12,856,All We Ever Do Is Say Goodbye,200,1.54311E+12,9 8 | ,Logged In,Wyatt,M,1,Scott,,free,"Eureka-Arcata-Fortuna, CA",GET,Home,1.54087E+12,856,,200,1.54311E+12,9 9 | 10_000 Maniacs,Logged In,Wyatt,M,2,Scott,251.8722,free,"Eureka-Arcata-Fortuna, CA",PUT,NextSong,1.54087E+12,856,Gun Shy (LP Version),200,1.54311E+12,9 10 | Leona Lewis,Logged In,Chloe,F,0,Cuevas,203.88526,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Forgive Me,200,1.54312E+12,49 11 | Nine Inch Nails,Logged In,Chloe,F,1,Cuevas,277.83791,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,La Mer,200,1.54312E+12,49 12 | Audioslave,Logged In,Chloe,F,2,Cuevas,334.91546,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Am The Highway,200,1.54312E+12,49 13 | Kid Rock,Logged In,Chloe,F,3,Cuevas,296.95955,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,All Summer Long (Album Version),200,1.54312E+12,49 14 | The Jets,Logged In,Chloe,F,4,Cuevas,220.89098,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Do You,200,1.54312E+12,49 15 | The Gerbils,Logged In,Chloe,F,5,Cuevas,27.01016,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,(iii),200,1.54312E+12,49 16 | Damian Marley / Stephen Marley / Yami Bolo,Logged In,Chloe,F,6,Cuevas,304.69179,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Still Searching,200,1.54312E+12,49 17 | ,Logged In,Chloe,F,7,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54312E+12,49 18 | The Bloody Beetroots,Logged In,Chloe,F,8,Cuevas,201.97832,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Warp 1.9 (feat. 
Steve Aoki),200,1.54312E+12,49 19 | ,Logged In,Chloe,F,9,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54313E+12,49 20 | The Specials,Logged In,Chloe,F,10,Cuevas,188.81261,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Rat Race,200,1.54313E+12,49 21 | The Lively Ones,Logged In,Chloe,F,11,Cuevas,142.52363,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Walkin' The Board (LP Version),200,1.54313E+12,49 22 | Katie Melua,Logged In,Chloe,F,12,Cuevas,252.78649,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Blues In The Night,200,1.54313E+12,49 23 | Jason Mraz,Logged In,Chloe,F,13,Cuevas,243.48689,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I'm Yours (Album Version),200,1.54313E+12,49 24 | Fisher,Logged In,Chloe,F,14,Cuevas,133.98159,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Rianna,200,1.54313E+12,49 25 | Zee Avi,Logged In,Chloe,F,15,Cuevas,160.62649,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,No Christmas For Me,200,1.54313E+12,49 26 | Black Eyed Peas,Logged In,Chloe,F,16,Cuevas,289.12281,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,I Gotta Feeling,200,1.54313E+12,49 27 | Emiliana Torrini,Logged In,Chloe,F,17,Cuevas,184.29342,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,Sunny Road,200,1.54313E+12,49 28 | ,Logged In,Chloe,F,18,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,916,,200,1.54313E+12,49 29 | Days Of The New,Logged In,Chloe,F,19,Cuevas,258.5073,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,The Down Town,200,1.54313E+12,49 30 | Julio Iglesias duet with Willie Nelson,Logged In,Chloe,F,20,Cuevas,212.16608,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,916,To All The Girls I've Loved Before (With Julio Iglesias),200,1.54313E+12,49 31 | ,Logged In,Jacqueline,F,0,Lynch,,paid,"Atlanta-Sandy Springs-Roswell, GA",GET,Home,1.54022E+12,914,,200,1.54313E+12,29 32 | Jason Mraz & Colbie Caillat,Logged In,Chloe,F,0,Roth,189.6224,free,"Indianapolis-Carmel-Anderson, IN",PUT,NextSong,1.5407E+12,704,Lucky (Album Version),200,1.54314E+12,78 33 | ,Logged In,Anabelle,F,0,Simpson,,free,"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD",GET,Home,1.54104E+12,901,,200,1.54315E+12,69 34 | R. 
Kelly,Logged In,Anabelle,F,1,Simpson,234.39628,free,"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD",PUT,NextSong,1.54104E+12,901,The World's Greatest,200,1.54315E+12,69 35 | ,Logged In,Kynnedi,F,0,Sanchez,,free,"Cedar Rapids, IA",GET,Home,1.54108E+12,804,,200,1.54315E+12,89 36 | Jacky Terrasson,Logged In,Marina,F,0,Sutton,342.7522,free,"Salinas, CA",PUT,NextSong,1.54106E+12,373,Le Jardin d'Hiver,200,1.54315E+12,48 37 | Papa Roach,Logged In,Theodore,M,0,Harris,202.1873,free,"Red Bluff, CA",PUT,NextSong,1.5411E+12,813,Alive,200,1.54316E+12,14 38 | Burt Bacharach,Logged In,Theodore,M,1,Harris,156.96934,free,"Red Bluff, CA",PUT,NextSong,1.5411E+12,813,Casino Royale Theme (Main Title),200,1.54316E+12,14 39 | ,Logged In,Chloe,F,0,Cuevas,,paid,"San Francisco-Oakland-Hayward, CA",GET,Home,1.54094E+12,923,,200,1.54316E+12,49 40 | Floetry,Logged In,Chloe,F,1,Cuevas,254.48444,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Sunshine,200,1.54316E+12,49 41 | The Rakes,Logged In,Chloe,F,2,Cuevas,225.2273,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Leave The City And Come Home,200,1.54316E+12,49 42 | Dwight Yoakam,Logged In,Chloe,F,3,Cuevas,239.3073,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,You're The One,200,1.54316E+12,49 43 | Ween,Logged In,Chloe,F,4,Cuevas,228.10077,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,Voodoo Lady,200,1.54316E+12,49 44 | Café Quijano,Logged In,Chloe,F,5,Cuevas,197.32853,paid,"San Francisco-Oakland-Hayward, CA",PUT,NextSong,1.54094E+12,923,La Lola,200,1.54316E+12,49 45 | ,Logged In,Chloe,F,0,Roth,,free,"Indianapolis-Carmel-Anderson, IN",GET,Home,1.5407E+12,925,,200,1.54317E+12,78 46 | Parov Stelar,Logged In,Chloe,F,1,Roth,203.65016,free,"Indianapolis-Carmel-Anderson, IN",PUT,NextSong,1.5407E+12,925,Good Bye Emily (feat. 
Gabriella Hanninen),200,1.54317E+12,78 47 | ,Logged In,Chloe,F,2,Roth,,free,"Indianapolis-Carmel-Anderson, IN",GET,Home,1.5407E+12,925,,200,1.54317E+12,78 48 | ,Logged In,Tegan,F,0,Levine,,paid,"Portland-South Portland, ME",GET,Home,1.54079E+12,915,,200,1.54317E+12,80 49 | Bryan Adams,Logged In,Tegan,F,1,Levine,166.29506,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,I Will Always Return,200,1.54317E+12,80 50 | KT Tunstall,Logged In,Tegan,F,2,Levine,192.31302,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,White Bird,200,1.54317E+12,80 51 | Technicolour,Logged In,Tegan,F,3,Levine,235.12771,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Turn Away,200,1.54317E+12,80 52 | The Dears,Logged In,Tegan,F,4,Levine,289.95873,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Lost In The Plot,200,1.54317E+12,80 53 | Go West,Logged In,Tegan,F,5,Levine,259.49995,paid,"Portland-South Portland, ME",PUT,NextSong,1.54079E+12,915,Never Let Them See You Sweat,200,1.54317E+12,80 54 | ,Logged In,Tegan,F,6,Levine,,paid,"Portland-South Portland, ME",PUT,Logout,1.54079E+12,915,,307,1.54317E+12,80 55 | ,Logged In,Sylvie,F,0,Cruz,,free,"Washington-Arlington-Alexandria, DC-VA-MD-WV",GET,Home,1.54027E+12,912,,200,1.54317E+12,10 56 | ,Logged Out,,,7,,,paid,,GET,Home,,915,,200,1.54317E+12, 57 | Gondwana,Logged In,Jordan,F,0,Hicks,262.5824,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,Mi Princesa,200,1.54319E+12,37 58 | ,Logged In,Kevin,M,0,Arellano,,free,"Harrisburg-Carlisle, PA",GET,Home,1.54001E+12,855,,200,1.54319E+12,66 59 | Ella Fitzgerald,Logged In,Jordan,F,1,Hicks,427.15383,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,On Green Dolphin Street (Medley) (1999 Digital Remaster),200,1.54319E+12,37 60 | Creedence Clearwater Revival,Logged In,Jordan,F,2,Hicks,184.73751,free,"Salinas, CA",PUT,NextSong,1.54001E+12,814,Run Through The Jungle,200,1.54319E+12,37 61 | -------------------------------------------------------------------------------- /2. Cassandra ETL/images/image_event_datafile_new.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/2. Cassandra ETL/images/image_event_datafile_new.jpg -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | * This repo provides the ETL pipeline, to populate the books database, in collection:titles. 4 | * It provides the code to scrape from a books listing website, providing the title, price, rating of a book, along with whether it is still in stock and its url. 5 | * The code in this repository scrapes from: "http://books.toscrape.com/". 6 | * It then ingests the data into a MongoDB database hosted on localhost, port 27017, into a database called "books" and collection called "titles". 7 | 8 | ## Running the ETL Pipeline 9 | --- 10 | * First, make sure MongoDB is running on port 27017, on localhost 11 | * Next, run ```scrapy crawl books``` from the scrapy project folder "books" 12 | * You can now confirm that the data was stored on MongoDB in books database using MongoDB Compass 13 | 14 | ![Books](books.PNG) -------------------------------------------------------------------------------- /3. 
Web Scraping using Scrapy, Mongo ETL/books.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/3. Web Scraping using Scrapy, Mongo ETL/books.PNG -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/3. Web Scraping using Scrapy, Mongo ETL/books/books/__init__.py -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # https://docs.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class BooksItem(scrapy.Item): 12 | # define the fields for your item here like: 13 | title = scrapy.Field() 14 | price = scrapy.Field() 15 | in_stock = scrapy.Field() 16 | rating = scrapy.Field() 17 | url = scrapy.Field() 18 | 19 | -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/middlewares.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy import signals 9 | 10 | 11 | class BooksSpiderMiddleware(object): 12 | # Not all methods need to be defined. If a method is not defined, 13 | # scrapy acts as if the spider middleware does not modify the 14 | # passed objects. 15 | 16 | @classmethod 17 | def from_crawler(cls, crawler): 18 | # This method is used by Scrapy to create your spiders. 19 | s = cls() 20 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 21 | return s 22 | 23 | def process_spider_input(self, response, spider): 24 | # Called for each response that goes through the spider 25 | # middleware and into the spider. 26 | 27 | # Should return None or raise an exception. 28 | return None 29 | 30 | def process_spider_output(self, response, result, spider): 31 | # Called with the results returned from the Spider, after 32 | # it has processed the response. 33 | 34 | # Must return an iterable of Request, dict or Item objects. 35 | for i in result: 36 | yield i 37 | 38 | def process_spider_exception(self, response, exception, spider): 39 | # Called when a spider or process_spider_input() method 40 | # (from other spider middleware) raises an exception. 41 | 42 | # Should return either None or an iterable of Request, dict 43 | # or Item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 
52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class BooksDownloaderMiddleware(object): 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 7 | import logging 8 | import pymongo 9 | 10 | class MongoDBPipeline(object): 11 | 12 | def __init__(self, mongo_uri, mongo_db, collection_name): 13 | self.mongo_uri = mongo_uri 14 | self.mongo_db = mongo_db 15 | self.collection_name = collection_name 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # pull info from settings.py 20 | return cls( 21 | mongo_uri = crawler.settings.get('MONGO_URI'), 22 | mongo_db = crawler.settings.get('MONGO_DB'), 23 | collection_name = crawler.settings.get('MONGO_COLLECTION') 24 | ) 25 | 26 | def open_spider(self, spider): 27 | # initialize spider 28 | # open db connection 29 | self.client = pymongo.MongoClient(self.mongo_uri) 30 | self.db = self.client[self.mongo_db] 31 | 32 | def close_spider(self, spider): 33 | # clean up when spider is closed 34 | self.client.close() 35 | 36 | def process_item(self, item, spider): 37 | print('collection:', self.collection_name) 38 | self.db[self.collection_name].insert(dict(item)) 39 | logging.debug("Title added to MongoDB") 40 | return item 41 | -------------------------------------------------------------------------------- /3. 
Web Scraping using Scrapy, Mongo ETL/books/books/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for books project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # https://docs.scrapy.org/en/latest/topics/settings.html 9 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 10 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'books' 13 | 14 | SPIDER_MODULES = ['books.spiders'] 15 | NEWSPIDER_MODULE = 'books.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | #USER_AGENT = 'books (+http://www.yourdomain.com)' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = True 23 | 24 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 25 | #CONCURRENT_REQUESTS = 32 26 | 27 | # Configure a delay for requests for the same website (default: 0) 28 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 29 | # See also autothrottle settings and docs 30 | #DOWNLOAD_DELAY = 3 31 | # The download delay setting will honor only one of: 32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 33 | #CONCURRENT_REQUESTS_PER_IP = 16 34 | 35 | # Disable cookies (enabled by default) 36 | #COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | #DEFAULT_REQUEST_HEADERS = { 43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 44 | # 'Accept-Language': 'en', 45 | #} 46 | 47 | # Enable or disable spider middlewares 48 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 49 | #SPIDER_MIDDLEWARES = { 50 | # 'books.middlewares.BooksSpiderMiddleware': 543, 51 | #} 52 | 53 | # Enable or disable downloader middlewares 54 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 55 | #DOWNLOADER_MIDDLEWARES = { 56 | # 'books.middlewares.BooksDownloaderMiddleware': 543, 57 | #} 58 | 59 | # Enable or disable extensions 60 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 61 | #EXTENSIONS = { 62 | # 'scrapy.extensions.telnet.TelnetConsole': None, 63 | #} 64 | 65 | # Configure item pipelines 66 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 67 | #ITEM_PIPELINES = { 68 | # 'books.pipelines.BooksPipeline': 300, 69 | #} 70 | 71 | # Enable and configure the AutoThrottle extension (disabled by default) 72 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 73 | #AUTOTHROTTLE_ENABLED = True 74 | # The initial download delay 75 | #AUTOTHROTTLE_START_DELAY = 5 76 | # The maximum download delay to be set in case of high latencies 77 | #AUTOTHROTTLE_MAX_DELAY = 60 78 | # The average number of requests Scrapy should be sending in parallel to 79 | # each remote server 80 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 81 | # Enable showing throttling stats for every response received: 82 | #AUTOTHROTTLE_DEBUG = False 83 | 84 | # Enable and configure HTTP caching (disabled by default) 85 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 86 | #HTTPCACHE_ENABLED = True 87 | #HTTPCACHE_EXPIRATION_SECS = 0 88 | #HTTPCACHE_DIR = 'httpcache' 89 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 90 | #HTTPCACHE_STORAGE 
= 'scrapy.extensions.httpcache.FilesystemCacheStorage' 91 | 92 | ITEM_PIPELINES = {'books.pipelines.MongoDBPipeline': 300} 93 | MONGO_URI = 'mongodb://localhost:27017' 94 | MONGO_DB = "books" 95 | MONGO_COLLECTION = 'titles' -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/books/spiders/books_spider.py: -------------------------------------------------------------------------------- 1 | from scrapy import Spider 2 | from scrapy.selector import Selector 3 | from books.items import BooksItem 4 | 5 | class BooksSpider(Spider): 6 | name = 'books' # name of spider 7 | allowed_domains = ['http://books.toscrape.com/'] #base urls of allowed domains, for spider to crawl 8 | start_urls = [ 9 | "http://books.toscrape.com/", 10 | ] 11 | 12 | def parse(self, response): 13 | books = Selector(response).xpath('//article[@class="product_pod"]') 14 | for book in books: 15 | item = BooksItem() 16 | item['title'] = book.xpath( 17 | 'div/a/img/@alt').extract()[0] 18 | item['price'] = book.xpath( 19 | 'div/p[@class="price_color"]/text()').extract()[0] 20 | instock_status = "".join(book.xpath( 21 | 'div/p[@class="instock availability"]/text()').extract()) 22 | instock_status = instock_status.strip('\n') 23 | instock_status = instock_status.strip() 24 | item['in_stock'] = instock_status 25 | rating = book.xpath( 26 | 'p[contains(@class, "star-rating")]/@class').extract()[0] 27 | rating = rating.replace("star-rating ", "") 28 | item['rating'] = rating 29 | item['url'] = book.xpath( 30 | 'div[@class="image_container"]/a/@href').extract()[0] 31 | yield item -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/books/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = books.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = books 12 | -------------------------------------------------------------------------------- /3. Web Scraping using Scrapy, Mongo ETL/requirements.txt: -------------------------------------------------------------------------------- 1 | attrs==19.3.0 2 | Automat==20.2.0 3 | cffi==1.14.0 4 | constantly==15.1.0 5 | cryptography==3.3.2 6 | cssselect==1.1.0 7 | hyperlink==19.0.0 8 | idna==2.9 9 | incremental==17.5.0 10 | lxml==4.6.5 11 | parsel==1.5.2 12 | Protego==0.1.16 13 | pyasn1==0.4.8 14 | pyasn1-modules==0.2.8 15 | pycparser==2.20 16 | PyDispatcher==2.0.5 17 | PyHamcrest==2.0.2 18 | pymongo==3.10.1 19 | pyOpenSSL==19.1.0 20 | queuelib==1.5.0 21 | Scrapy==2.6.1 22 | service-identity==18.1.0 23 | six==1.14.0 24 | Twisted==22.2.0 25 | w3lib==1.21.0 26 | zope.interface==5.1.0 27 | -------------------------------------------------------------------------------- /4. 
Data Warehousing with AWS Redshift/README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | This repo provides the ETL pipeline that populates the sparkifydb database in AWS Redshift. 4 | * The purpose of this database is to enable Sparkify to answer business questions about its users, the types of songs they listen to and the artists of those songs, using the data it collects in logs and song files. The database provides a consistent and reliable source for this data. 5 | 6 | * This data will be useful in helping Sparkify reach some of its analytical goals, for example, finding the most popular songs or the times of day with the heaviest traffic. 7 | 8 | ## Why Redshift? 9 | --- 10 | * Redshift is a fully managed, cloud-based, petabyte-scale data warehouse service from Amazon Web Services (AWS). It is an efficient solution for collecting and storing all of this data, and it enables analysis with various business intelligence tools to acquire new insights for businesses and their customers. 11 | ![Redshift](screenshots/redshift.PNG) 12 | 13 | ## Database Design 14 | --- 15 | * For the schema design, a star schema is used as it simplifies queries and provides fast aggregations of data. 16 | ![Schema](screenshots/schema.PNG) 17 | 18 | * songplays is our fact table, with the rest being our dimension tables. 19 | 20 | ## Data Pipeline design 21 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation, and psycopg2, which lets us connect to Redshift through its Postgres-compatible interface. 22 | 23 | * There are 2 types of data involved: song data and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables. 24 | 25 | * First, we load the song and log data in JSON format from S3 into our staging tables (staging_songs_table and staging_events_table). 26 | 27 | * Next, we perform ETL using SQL, from the staging tables to our fact and dimension tables. The diagram below shows the architecture of this pipeline: 28 | ![architecture](screenshots/architecture.PNG) 29 | 30 | ## Files 31 | --- 32 | * create_tables.py is the Python script that drops all tables and creates all tables (including staging tables) 33 | 34 | * sql_queries.py is the Python file containing all SQL queries. It is called by create_tables.py and etl.py 35 | 36 | * etl.py is the Python script that loads data into the staging tables, then loads data into the fact and dimension tables from the staging tables 37 | 38 | * redshift_cluster_setup.py sets up the Redshift cluster and creates an IAM role for Redshift to access other AWS services 39 | 40 | * redshift_cluster_teardown.py removes the Redshift cluster and the IAM role created 41 | 42 | * dwh.cfg contains the configuration for the Redshift database. Please edit it to match the Redshift cluster and database created on AWS 43 | 44 | ## Running the ETL Pipeline 45 | --- 46 | * First, run create_tables.py to create the data tables using the schema design specified. If the tables were created previously, they will be dropped and recreated. 47 | 48 | * Next, run etl.py to populate the data tables created. -------------------------------------------------------------------------------- /4.
Data Warehousing with AWS Redshift/create_tables.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import create_table_queries, drop_table_queries 4 | 5 | 6 | def drop_tables(cur, conn): 7 | """ 8 | Description: Drops each table using the queries in `drop_table_queries` list in sql_queries. 9 | 10 | Arguments: 11 | cur: the cursor object. 12 | conn: connection object to redshift. 13 | 14 | Returns: 15 | None 16 | """ 17 | for query in drop_table_queries: 18 | cur.execute(query) 19 | conn.commit() 20 | 21 | 22 | def create_tables(cur, conn): 23 | """ 24 | Description: Creates each table using the queries in `create_table_queries` list in sql_queries. 25 | 26 | Arguments: 27 | cur: the cursor object. 28 | conn: connection object to redshift. 29 | 30 | Returns: 31 | None 32 | """ 33 | for query in create_table_queries: 34 | cur.execute(query) 35 | conn.commit() 36 | 37 | 38 | def main(): 39 | """ 40 | Description: 41 | - Establishes connection with the sparkify database and gets 42 | cursor to it (on AWS redshift cluster created earlier). 43 | 44 | - Drops all the tables. 45 | 46 | - Creates all tables needed. 47 | 48 | - Finally, closes the connection. 49 | 50 | Returns: 51 | None 52 | """ 53 | config = configparser.ConfigParser() 54 | config.read('dwh.cfg') 55 | 56 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 57 | cur = conn.cursor() 58 | 59 | drop_tables(cur, conn) 60 | create_tables(cur, conn) 61 | 62 | conn.close() 63 | 64 | 65 | if __name__ == "__main__": 66 | main() -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/dwh.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | KEY= 3 | SECRET= 4 | 5 | [DWH] 6 | DWH_CLUSTER_TYPE=multi-node 7 | DWH_NUM_NODES=4 8 | DWH_NODE_TYPE=dc2.large 9 | 10 | DWH_IAM_ROLE_NAME=dwhRole 11 | DWH_CLUSTER_IDENTIFIER=dwhCluster 12 | DWH_DB= 13 | DWH_DB_USER= 14 | DWH_DB_PASSWORD= 15 | DWH_PORT=5439 16 | 17 | [CLUSTER] 18 | HOST= 19 | DB_NAME= 20 | DB_USER= 21 | DB_PASSWORD= 22 | DB_PORT=5439 23 | 24 | [IAM_ROLE] 25 | ARN= 26 | 27 | [S3] 28 | LOG_DATA='s3://udacity-dend/log_data' 29 | LOG_JSONPATH='s3://udacity-dend/log_json_path.json' 30 | SONG_DATA='s3://udacity-dend/song_data' -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import (copy_table_queries, insert_table_queries, copy_staging_order, 4 | count_staging_queries, insert_table_order, count_fact_dim_queries) 5 | 6 | 7 | def load_staging_tables(cur, conn): 8 | """ 9 | Description: Copies data in json format from S3 to staging tables in redshift. 10 | 11 | Arguments: 12 | cur: the cursor object. 13 | conn: connection object to redshift. 14 | 15 | Returns: 16 | None 17 | """ 18 | for idx, query in enumerate(copy_table_queries): 19 | cur.execute(query) 20 | conn.commit() 21 | cur.execute(count_staging_queries[idx]) 22 | print('No. of rows copied into {}: {}'.format(copy_staging_order[idx], cur.fetchone()[0])) 23 | 24 | 25 | def insert_tables(cur, conn): 26 | """ 27 | Description: ETL from staging tables to songplays fact and its dimension 28 | tables in redshift. 29 | 30 | Arguments: 31 | cur: the cursor object.
32 | conn: connection object to redshift. 33 | 34 | Returns: 35 | None 36 | """ 37 | for idx, query in enumerate(insert_table_queries): 38 | cur.execute(query) 39 | conn.commit() 40 | cur.execute(count_fact_dim_queries[idx]) 41 | print('No. of rows inserted into {}: {}'.format(insert_table_order[idx], cur.fetchone()[0])) 42 | 43 | 44 | def main(): 45 | """ 46 | Description: 47 | - Establishes connection with the sparkify database and gets 48 | cursor to it (on AWS redshift cluster created earlier). 49 | 50 | - Loads staging tables from raw log and song files to redshift database 51 | 52 | - From staging tables, performs ETL to songplays fact and its dimension 53 | tables in redshift using SQL 54 | 55 | Returns: 56 | None 57 | """ 58 | config = configparser.ConfigParser() 59 | config.read('dwh.cfg') 60 | 61 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 62 | cur = conn.cursor() 63 | 64 | load_staging_tables(cur, conn) 65 | insert_tables(cur, conn) 66 | 67 | conn.close() 68 | 69 | 70 | if __name__ == "__main__": 71 | main() -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/redshift_cluster_setup.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | import configparser 4 | 5 | 6 | def create_iam_role(iam, DWH_IAM_ROLE_NAME): 7 | """ 8 | Description: 9 | - Creates an IAM role (using the boto3 IAM client passed in) that allows Redshift to call on 10 | other AWS services 11 | 12 | Returns: 13 | - Role Arn 14 | """ 15 | # Create the IAM role 16 | try: 17 | print('1.1 Creating a new IAM Role') 18 | dwh_role = iam.create_role( 19 | Path = '/', 20 | RoleName = DWH_IAM_ROLE_NAME, 21 | Description = 'Allows Redshift cluster to call AWS service on your behalf.', 22 | AssumeRolePolicyDocument = json.dumps( 23 | {'Statement': [{'Action': 'sts:AssumeRole', 24 | 'Effect': 'Allow', 25 | 'Principal': {'Service': 'redshift.amazonaws.com'}}], 26 | 'Version': '2012-10-17'}) 27 | ) 28 | # Attach Policy 29 | iam.attach_role_policy(RoleName=DWH_IAM_ROLE_NAME, 30 | PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess" 31 | )['ResponseMetadata']['HTTPStatusCode'] 32 | role_arn = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn'] 33 | return role_arn 34 | except Exception as e: 35 | print(e) 36 | 37 | 38 | 39 | def main(): 40 | """ 41 | Description: 42 | - Sets up a Redshift cluster on AWS 43 | 44 | Returns: 45 | None 46 | """ 47 | # Load DWH parameters from a file 48 | config = configparser.ConfigParser() 49 | config.read_file(open('dwh.cfg')) 50 | KEY = config.get('AWS','KEY') 51 | SECRET = config.get('AWS','SECRET') 52 | DWH_CLUSTER_TYPE = config.get("DWH","DWH_CLUSTER_TYPE") 53 | DWH_NUM_NODES = config.get("DWH","DWH_NUM_NODES") 54 | DWH_NODE_TYPE = config.get("DWH","DWH_NODE_TYPE") 55 | DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER") 56 | DWH_DB = config.get("DWH","DWH_DB") 57 | DWH_DB_USER = config.get("DWH","DWH_DB_USER") 58 | DWH_DB_PASSWORD = config.get("DWH","DWH_DB_PASSWORD") 59 | DWH_PORT = config.get("DWH","DWH_PORT") 60 | DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME") 61 | 62 | # Create clients for EC2, S3, IAM, and Redshift 63 | ec2 = boto3.resource('ec2', 64 | region_name='us-west-2', 65 | aws_access_key_id=KEY, 66 | aws_secret_access_key=SECRET) 67 | 68 | iam = boto3.client('iam', 69 | region_name='us-west-2', 70 | aws_access_key_id=KEY, 71 | aws_secret_access_key=SECRET) 72 | 73 | redshift = 
boto3.client('redshift', 74 | region_name="us-west-2", 75 | aws_access_key_id=KEY, 76 | aws_secret_access_key=SECRET) 77 | 78 | role_arn = create_iam_role(iam, DWH_IAM_ROLE_NAME) 79 | 80 | # Create the cluster 81 | try: 82 | response = redshift.create_cluster( 83 | #HW 84 | ClusterType=DWH_CLUSTER_TYPE, 85 | NodeType=DWH_NODE_TYPE, 86 | NumberOfNodes=int(DWH_NUM_NODES), 87 | 88 | #Identifiers & Credentials 89 | DBName=DWH_DB, 90 | ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, 91 | MasterUsername=DWH_DB_USER, 92 | MasterUserPassword=DWH_DB_PASSWORD, 93 | 94 | #Roles (for s3 access) 95 | IamRoles=[role_arn] 96 | ) 97 | 98 | # Describe the new cluster to find its VPC, then open an incoming TCP port to access the cluster endpoint 99 | myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0] 100 | vpc = ec2.Vpc(id=myClusterProps['VpcId']) 101 | default_sg = list(vpc.security_groups.all())[0] 102 | default_sg.authorize_ingress( 103 | GroupName=default_sg.group_name, 104 | CidrIp='0.0.0.0/0', 105 | IpProtocol='TCP', 106 | FromPort=int(DWH_PORT), 107 | ToPort=int(DWH_PORT) 108 | ) 109 | except Exception as e: 110 | print(e) 111 | 112 | print("Cluster has been created, check details of cluster on AWS") 113 | 114 | 115 | if __name__ == "__main__": 116 | main() -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/redshift_cluster_teardown.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import configparser 3 | 4 | def main(): 5 | """ 6 | Description: 7 | - Tears down the Redshift cluster and IAM role on AWS 8 | 9 | Returns: 10 | None 11 | """ 12 | config = configparser.ConfigParser() 13 | config.read('dwh.cfg') 14 | KEY = config.get('AWS','KEY') 15 | SECRET = config.get('AWS','SECRET') 16 | DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER") 17 | DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME") 18 | 19 | redshift = boto3.client('redshift', 20 | region_name="us-west-2", 21 | aws_access_key_id=KEY, 22 | aws_secret_access_key=SECRET) 23 | 24 | iam = boto3.client('iam', 25 | region_name='us-west-2', 26 | aws_access_key_id=KEY, 27 | aws_secret_access_key=SECRET) 28 | 29 | redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, 30 | SkipFinalClusterSnapshot=True) 31 | 32 | # Remove role: 33 | iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, 34 | PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess") 35 | iam.delete_role(RoleName=DWH_IAM_ROLE_NAME) 36 | print("Cluster and IAM role have been deleted") 37 | 38 | if __name__ == "__main__": 39 | main() -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/screenshots/architecture.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/architecture.PNG -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/screenshots/redshift.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/redshift.PNG -------------------------------------------------------------------------------- /4.
Data Warehousing with AWS Redshift/screenshots/schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/4. Data Warehousing with AWS Redshift/screenshots/schema.PNG -------------------------------------------------------------------------------- /4. Data Warehousing with AWS Redshift/sql_queries.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | 3 | 4 | # CONFIG 5 | config = configparser.ConfigParser() 6 | config.read('dwh.cfg') 7 | 8 | # DROP TABLES 9 | 10 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events_table" 11 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs_table" 12 | songplay_table_drop = "DROP TABLE IF EXISTS songplays" 13 | user_table_drop = "DROP TABLE IF EXISTS users" 14 | song_table_drop = "DROP TABLE IF EXISTS songs" 15 | artist_table_drop = "DROP TABLE IF EXISTS artists" 16 | time_table_drop = "DROP TABLE IF EXISTS time" 17 | 18 | # CREATE TABLES 19 | 20 | staging_events_table_create= ( 21 | """ 22 | CREATE TABLE staging_events_table ( 23 | stagingEventId bigint IDENTITY(0,1) PRIMARY KEY, 24 | artist VARCHAR(500), 25 | auth VARCHAR(20), 26 | firstName VARCHAR(500), 27 | gender CHAR(1), 28 | itemInSession SMALLINT, 29 | lastName VARCHAR(500), 30 | length NUMERIC, 31 | level VARCHAR(10), 32 | location VARCHAR(500), 33 | method VARCHAR(20), 34 | page VARCHAR(500), 35 | registration NUMERIC, 36 | sessionId SMALLINT, 37 | song VARCHAR, 38 | status SMALLINT, 39 | ts BIGINT, 40 | userAgent VARCHAR(500), 41 | userId SMALLINT 42 | ) 43 | """ 44 | ) 45 | 46 | staging_songs_table_create = ( 47 | """ 48 | CREATE TABLE staging_songs_table ( 49 | staging_song_id bigint IDENTITY(0,1) PRIMARY KEY, 50 | num_songs INTEGER NOT NULL, 51 | artist_id VARCHAR(20) NOT NULL, 52 | artist_latitude NUMERIC, 53 | artist_longitude NUMERIC, 54 | artist_location VARCHAR(500), 55 | artist_name VARCHAR(500) NOT NULL, 56 | song_id VARCHAR(20) NOT NULL, 57 | title VARCHAR(500) NOT NULL, 58 | duration NUMERIC NOT NULL, 59 | year SMALLINT NOT NULL 60 | ); 61 | """ 62 | ) 63 | 64 | songplay_table_create = ( 65 | """ 66 | CREATE TABLE songplays ( 67 | songplay_id BIGINT IDENTITY(0,1) PRIMARY KEY, 68 | start_time BIGINT REFERENCES time(start_time) distkey, 69 | user_id SMALLINT REFERENCES users(user_id), 70 | level VARCHAR(10), 71 | song_id VARCHAR(20) REFERENCES songs(song_id), 72 | artist_id VARCHAR(20) REFERENCES artists(artist_id), 73 | session_id SMALLINT, 74 | location VARCHAR(500), 75 | user_agent VARCHAR(500) 76 | ) 77 | sortkey(level, start_time); 78 | """ 79 | ) 80 | 81 | user_table_create = ( 82 | """ 83 | CREATE TABLE users ( 84 | user_id INT PRIMARY KEY, 85 | first_name VARCHAR(500), 86 | last_name VARCHAR(500), 87 | gender CHAR(1), 88 | level VARCHAR(10) NOT NULL 89 | ) 90 | diststyle all 91 | sortkey(level, gender, first_name, last_name); 92 | """ 93 | ) 94 | 95 | song_table_create = ( 96 | """ 97 | CREATE TABLE songs ( 98 | song_id VARCHAR(20) PRIMARY KEY, 99 | title VARCHAR(500) NOT NULL, 100 | artist_id VARCHAR(20) NOT NULL, 101 | year SMALLINT NOT NULL, 102 | duration NUMERIC NOT NULL 103 | ) 104 | diststyle all 105 | sortkey(year, title, duration); 106 | """ 107 | ) 108 | 109 | artist_table_create = ( 110 | """ 111 | CREATE TABLE artists ( 112 | artist_id VARCHAR(20) PRIMARY KEY, 113 | name VARCHAR(500) NOT NULL, 114 | location VARCHAR(500), 115 | latitude 
NUMERIC, 116 | longitude NUMERIC 117 | ) 118 | diststyle all 119 | sortkey(name, location); 120 | """ 121 | ) 122 | 123 | time_table_create = ( 124 | """ 125 | CREATE TABLE time ( 126 | start_time timestamp PRIMARY KEY distkey, 127 | hour SMALLINT NOT NULL, 128 | day SMALLINT NOT NULL, 129 | week SMALLINT NOT NULL, 130 | month SMALLINT NOT NULL, 131 | year SMALLINT NOT NULL, 132 | weekday SMALLINT NOT NULL 133 | ) 134 | sortkey(year, month, day); 135 | """ 136 | ) 137 | 138 | # STAGING TABLES 139 | 140 | staging_events_copy = ( 141 | """ 142 | copy staging_events_table ( 143 | artist, auth, firstName, gender,itemInSession, lastName, 144 | length, level, location, method, page, registration, 145 | sessionId, song, status, ts, userAgent, userId 146 | ) 147 | from {} 148 | iam_role {} 149 | json {} region 'us-west-2'; 150 | """ 151 | ).format(config['S3']['log_data'], config['IAM_ROLE']['arn'], config['S3']['log_jsonpath']) 152 | 153 | staging_songs_copy = ( 154 | """ 155 | copy staging_songs_table 156 | from {} 157 | iam_role {} 158 | json 'auto' region 'us-west-2'; 159 | """ 160 | ).format(config['S3']['song_data'], config['IAM_ROLE']['arn']) 161 | 162 | # FINAL TABLES 163 | 164 | songplay_table_insert = ( 165 | """ 166 | INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, 167 | session_id, location, user_agent) 168 | SELECT se.ts, se.userId, se.level, sa.song_id, sa.artist_id, se.sessionId, 169 | se.location, se.userAgent 170 | FROM staging_events_table se 171 | JOIN ( 172 | SELECT s.song_id AS song_id, a.artist_id AS artist_id, s.title AS song, 173 | a.name AS artist, s.duration AS length 174 | FROM songs s 175 | JOIN artists a ON s.artist_id=a.artist_id 176 | ) sa 177 | ON se.song=sa.song AND se.artist=sa.artist AND se.length=sa.length; 178 | """ 179 | ) 180 | 181 | user_table_insert = ( 182 | """ 183 | INSERT INTO users (user_id, first_name, last_name, gender, level) 184 | SELECT userId, firstName, lastName, gender, level 185 | FROM ( 186 | SELECT userId, firstName, lastName, gender, level, 187 | ROW_NUMBER() OVER (PARTITION BY userId 188 | ORDER BY firstName, lastName, 189 | gender, level) AS user_id_ranked 190 | FROM staging_events_table 191 | WHERE userId IS NOT NULL 192 | ) AS ranked 193 | WHERE ranked.user_id_ranked = 1; 194 | """ 195 | ) 196 | 197 | song_table_insert = ( 198 | """ 199 | INSERT INTO songs (song_id, title, artist_id, year, duration) 200 | SELECT song_id, title, artist_id, year, duration 201 | FROM ( 202 | SELECT song_id, title, artist_id, year, duration, 203 | ROW_NUMBER() OVER (PARTITION BY song_id 204 | ORDER BY title, artist_id, 205 | year, duration) AS song_id_ranked 206 | FROM staging_songs_table 207 | WHERE song_id IS NOT NULL 208 | ) AS ranked 209 | WHERE ranked.song_id_ranked = 1; 210 | """ 211 | ) 212 | 213 | artist_table_insert = ( 214 | """ 215 | INSERT INTO artists (artist_id, name, location, latitude, longitude) 216 | SELECT artist_id, artist_name, artist_location, artist_latitude, artist_longitude 217 | FROM ( 218 | SELECT artist_id, artist_name, artist_location, artist_latitude, artist_longitude, 219 | ROW_NUMBER() OVER (PARTITION BY artist_id 220 | ORDER BY artist_name, artist_location, 221 | artist_latitude, artist_longitude) AS artist_id_ranked 222 | FROM staging_songs_table 223 | WHERE artist_id IS NOT NULL 224 | ) AS ranked 225 | WHERE ranked.artist_id_ranked = 1; 226 | """ 227 | ) 228 | 229 | 230 | time_table_insert = ( 231 | """ 232 | INSERT INTO time (start_time, hour, day, week, month, year, weekday) 233 | SELECT TIMESTAMP 
'epoch' + ts/1000 * interval '1 second' AS start_time, 234 | EXTRACT(HOUR FROM start_time) AS hour, 235 | EXTRACT(DAY FROM start_time) AS day, 236 | EXTRACT(WEEK FROM start_time) AS week, 237 | EXTRACT(MONTH FROM start_time) AS month, 238 | EXTRACT(YEAR FROM start_time) AS year, 239 | EXTRACT(DOW FROM start_time) AS weekday 240 | FROM staging_events_table 241 | WHERE ts IS NOT NULL; 242 | """ 243 | ) 244 | 245 | 246 | count_staging_rows = "SELECT COUNT(*) AS count FROM {}" 247 | 248 | # QUERY LISTS 249 | create_table_queries = [staging_events_table_create, staging_songs_table_create, 250 | user_table_create, song_table_create, artist_table_create, 251 | time_table_create, songplay_table_create] 252 | 253 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, 254 | songplay_table_drop, user_table_drop, song_table_drop, 255 | artist_table_drop, time_table_drop] 256 | 257 | copy_table_queries = [staging_events_copy, staging_songs_copy] 258 | 259 | copy_staging_order = ['staging_events_table', 'staging_songs_table'] 260 | 261 | count_staging_queries = [count_staging_rows.format(copy_staging_order[0]), 262 | count_staging_rows.format(copy_staging_order[1])] 263 | 264 | insert_table_queries = [user_table_insert, song_table_insert, artist_table_insert, 265 | time_table_insert, songplay_table_insert] 266 | 267 | insert_table_order = ['users', 'songs', 'artists', 'time', 'songplays'] 268 | 269 | count_fact_dim_queries = [count_staging_rows.format(insert_table_order[0]), 270 | count_staging_rows.format(insert_table_order[1]), 271 | count_staging_rows.format(insert_table_order[2]), 272 | count_staging_rows.format(insert_table_order[3]), 273 | count_staging_rows.format(insert_table_order[4])] -------------------------------------------------------------------------------- /5. Data Lake with Spark & AWS S3/README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | This repo provides the ETL pipeline to populate the sparkifydb AWS S3 Data Lake using Spark. 4 | 5 | ![S3](screenshots/s3.PNG)                ![Spark](screenshots/spark.PNG) 6 | * The purpose of this database is to enable Sparkify to answer business questions it may have about its users, the types of songs they listen to and the artists of those songs, using the data that it has in logs and files. The database provides a consistent and reliable source to store this data. 7 | 8 | * This source of data will be useful in helping Sparkify reach some of its analytical goals, for example, finding out which songs are the most popular or which times of the day have the highest traffic. 9 | 10 | ## Dependencies 11 | --- 12 | * Note that you will need to have the pyspark library installed. Also, you should have a spark cluster running, either locally or on AWS EMR. 13 | 14 | ## Database Design and ETL Pipeline 15 | --- 16 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data. 17 | 18 | ![Schema](screenshots/schema.PNG) 19 | 20 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation. It enables reading files from S3 and processing data using PySpark. 21 | 22 | * There are 2 types of data involved, song and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables 23 | 24 | * Log data records information about each user session.
From log data, we extract and load data into the time and users dimension tables and the songplays fact table. 25 | 26 | ## Running the ETL Pipeline 27 | --- 28 | * Run etl.py to read the song and log json files, denormalize the data into fact and dimension tables and write these tables to S3 in the form of parquet files. -------------------------------------------------------------------------------- /5. Data Lake with Spark & AWS S3/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import os 3 | from datetime import datetime 4 | from pyspark.sql import SparkSession 5 | from pyspark.sql.functions import udf, col, from_unixtime 6 | from pyspark.sql.functions import (year, month, dayofmonth, hour, 7 | weekofyear, dayofweek, date_format) 8 | from pyspark.sql.types import (StructType, StructField as Fld, DoubleType as Dbl, 9 | StringType as Str, IntegerType as Int, DateType as Date, 10 | TimestampType as Ts) 11 | 12 | 13 | config = configparser.ConfigParser() 14 | config.read('dl.cfg') 15 | 16 | os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']  # assumes dl.cfg has an [AWS] section with these keys 17 | os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY'] 18 | 19 | 20 | def create_spark_session(): 21 | """ 22 | Description: Creates spark session. 23 | 24 | Returns: 25 | spark session object 26 | """ 27 | AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID'] 28 | AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY'] 29 | 30 | spark = SparkSession \ 31 | .builder \ 32 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \ 33 | .getOrCreate() 34 | 35 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID) 36 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) 37 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID) 38 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY) 39 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 40 | spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.endpoint", "s3.amazonaws.com") 41 | return spark 42 | 43 | 44 | def song_schema(): 45 | """ 46 | Description: Provides the schema for the staging_songs table. 47 | 48 | Returns: 49 | spark dataframe schema object 50 | """ 51 | return StructType([ 52 | Fld("num_songs", Int()), 53 | Fld("artist_id", Str()), 54 | Fld("artist_latitude", Dbl()), 55 | Fld("artist_longitude", Dbl()), 56 | Fld("artist_location", Str()), 57 | Fld("artist_name", Str()), 58 | Fld("song_id", Str()), 59 | Fld("title", Str()), 60 | Fld("duration", Dbl()), 61 | Fld("year", Int()) 62 | ]) 63 | 64 | 65 | def process_song_data(spark, input_data, output_data): 66 | """ 67 | Description: Read in songs data from json files. 68 | Outputs songs and artists dimension tables in parquet files in S3. 69 | 70 | Arguments: 71 | spark: the spark session object. 72 | input_data: path to the S3 bucket containing input json files. 73 | output_data: path to S3 bucket that will contain output parquet files.
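(both input_data and output_data are expected to end with a trailing slash, as passed in by main() below)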
74 | 75 | Returns: 76 | None 77 | """ 78 | # get filepath to song data file 79 | song_data = input_data + 'song_data/*/*/*/*.json' 80 | 81 | # read song data file 82 | df = spark.read.json(song_data, schema=song_schema()) 83 | 84 | # extract columns to create songs table 85 | songs_table = df.select(['song_id', 'title', 'artist_id', 86 | 'year', 'duration']).distinct().where( 87 | col('song_id').isNotNull()) 88 | 89 | # write songs table to parquet files partitioned by year and artist 90 | songs_path = output_data + 'songs' 91 | songs_table.write.partitionBy('year', 'artist_id').parquet(songs_path) 92 | 93 | # extract columns to create artists table 94 | artists_table = df.select(['artist_id', 'artist_name', 'artist_location', 95 | 'artist_latitude', 'artist_longitude']).distinct().where( 96 | col('artist_id').isNotNull()) 97 | 98 | # write artists table to parquet files 99 | artists_path = output_data + 'artists' 100 | artists_table.write.parquet(artists_path) 101 | 102 | 103 | def process_log_data(spark, input_data, output_data): 104 | """ 105 | Description: Read in logs data from json files. 106 | Outputs time and users dimension tables, songplays fact table 107 | in parquet files in S3. 108 | 109 | Arguments: 110 | spark: the spark session object. 111 | input_data: path to the S3 bucket containing input json files. 112 | output_data: path to S3 bucket that will contain output parquet files. 113 | 114 | Returns: 115 | None 116 | """ 117 | # get filepath to log data file 118 | log_data = input_data + 'log_data/*/*/*.json' 119 | 120 | # read log data file 121 | df = spark.read.json(log_data) 122 | 123 | # filter by actions for song plays 124 | df = df.filter(df.page == 'NextSong') 125 | 126 | # extract columns for users table 127 | users_table = df.select(['userId', 'firstName', 'lastName', 128 | 'gender', 'level']).distinct().where( 129 | col('userId').isNotNull()) 130 | 131 | # write users table to parquet files 132 | users_path = output_data + 'users' 133 | users_table.write.parquet(users_path) 134 | 135 | def format_datetime(ts): 136 | """ 137 | Description: converts numeric timestamp to datetime format. 
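(the ts field is a Unix timestamp in milliseconds, hence the division by 1000 below)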
138 | 139 | Returns: 140 | timestamp with type datetime 141 | """ 142 | return datetime.fromtimestamp(ts/1000.0) 143 | 144 | # create timestamp column from original timestamp column 145 | get_timestamp = udf(lambda x: format_datetime(int(x)), Ts()) 146 | df = df.withColumn("start_time", get_timestamp(df.ts)) 147 | 148 | # create datetime column from original timestamp column 149 | get_datetime = udf(lambda x: format_datetime(int(x)), Date()) 150 | df = df.withColumn("datetime", get_datetime(df.ts)) 151 | 152 | # extract columns to create time table 153 | time_table = df.select('ts', 'start_time', 'datetime', 154 | hour("datetime").alias('hour'), 155 | dayofmonth("datetime").alias('day'), 156 | weekofyear("datetime").alias('week'), 157 | year("datetime").alias('year'), 158 | month("datetime").alias('month'), 159 | dayofweek("datetime").alias('weekday') 160 | ).dropDuplicates() 161 | 162 | # write time table to parquet files partitioned by year and month 163 | time_table_path = output_data + 'time' 164 | time_table.write.partitionBy('year', 'month').parquet(time_table_path) 165 | 166 | # read in song data to use for songplays table 167 | songs_path = input_data + 'song_data/*/*/*/*.json' 168 | song_df = spark.read.json(songs_path, schema=song_schema()) 169 | 170 | # extract columns from joined song and log datasets to create songplays table 171 | df = df.drop_duplicates(subset=['start_time']) 172 | songplays_table = song_df.alias('s').join(df.alias('l'), 173 | (song_df.title == df.song) & \ 174 | (song_df.artist_name == df.artist)).where( 175 | df.page == 'NextSong').select([ 176 | col('l.start_time'), 177 | year("l.datetime").alias('year'), 178 | month("l.datetime").alias('month'), 179 | col('l.userId'), 180 | col('l.level'), 181 | col('s.song_id'), 182 | col('s.artist_id'), 183 | col('l.sessionID'), 184 | col('l.location'), 185 | col('l.userAgent') 186 | ]) 187 | 188 | # write songplays table to parquet files partitioned by year and month 189 | songplays_path = output_data + 'songplays' 190 | songplays_table.write.partitionBy('year', 'month').parquet(songplays_path) 191 | 192 | 193 | def main(): 194 | """ 195 | Description: Calls functions to create spark session, read from S3 196 | and perform ETL to S3 Data Lake. 197 | 198 | Returns: 199 | None 200 | """ 201 | spark = create_spark_session() 202 | input_data = "s3a://udacity-dend/" 203 | output_data = "s3://alanchn31-datalake/" 204 | 205 | process_song_data(spark, input_data, output_data) 206 | process_log_data(spark, input_data, output_data) 207 | 208 | 209 | if __name__ == "__main__": 210 | main() 211 | -------------------------------------------------------------------------------- /5. Data Lake with Spark & AWS S3/screenshots/s3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/s3.PNG -------------------------------------------------------------------------------- /5. Data Lake with Spark & AWS S3/screenshots/schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/schema.PNG -------------------------------------------------------------------------------- /5.
Data Lake with Spark & AWS S3/screenshots/spark.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/5. Data Lake with Spark & AWS S3/screenshots/spark.PNG -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | This repo provides the ETL pipeline to ingest sparkify's music data into an AWS Redshift Data Warehouse. The ETL pipeline will be run on an hourly basis, scheduled using Airflow. 4 | 5 | ![Airflow](screenshots/airflow.png) 6 | 7 | * Why Airflow? Airflow allows workflows to be defined as code, making them more maintainable, versionable, testable, and collaborative 8 | 9 | * The purpose of this database is to enable Sparkify to answer business questions it may have about its users, the types of songs they listen to and the artists of those songs, using the data that it has in logs and files. The database provides a consistent and reliable source to store this data. 10 | 11 | * This source of data will be useful in helping Sparkify reach some of its analytical goals, for example, finding out which songs are the most popular or which times of the day have the highest traffic. 12 | 13 | ## Dependencies 14 | --- 15 | * Note that you will need to have Airflow installed. To do so, run `pip install apache-airflow` 16 | 17 | * To use postgres to store metadata from airflow jobs, edit the airflow.cfg file under the AIRFLOW_HOME dir. Refer to https://gist.github.com/rosiehoyem/9e111067fe4373eb701daf9e7abcc423 for setup instructions 18 | 19 | * Run: `airflow webserver -p 8080`. Refer to https://airflow.apache.org/docs/stable/start.html for more details on how to get started. 20 | 21 | * Configure aws_credentials in Airflow using access and secret access keys. (under Airflow UI >> Admin >> Connections) 22 | 23 | * Configure redshift connection in Airflow (under Airflow UI >> Admin >> Connections) 24 | 25 | ## Database Design and ETL Pipeline 26 | --- 27 | * For the schema design, the STAR schema is used as it simplifies queries and provides fast aggregations of data. 28 | 29 | ![Schema](screenshots/schema.PNG) 30 | 31 | * For the ETL pipeline, Python is used as it provides libraries such as pandas that simplify data manipulation. It enables reading files from S3. 32 | 33 | * There are 2 types of data involved, song and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables 34 | 35 | * Log data records information about each user session. From log data, we extract and load data into the time and users dimension tables and the songplays fact table. 36 | 37 | ## Running the ETL Pipeline 38 | --- 39 | * Turning on the sparkify_music_dwh_dag DAG in the Airflow UI will automatically trigger the ETL pipelines to run. 40 | * The DAG is as follows (from the graph view): 41 | 42 | ![DAG](screenshots/dag.PNG) -------------------------------------------------------------------------------- /6.
Data Pipelining with Airflow/airflow/dags/create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE IF NOT EXISTS public.artists ( 2 | artistid varchar(256) NOT NULL, 3 | name varchar(256), 4 | location varchar(256), 5 | lattitude numeric(18,0), 6 | longitude numeric(18,0) 7 | ); 8 | 9 | CREATE TABLE IF NOT EXISTS public.songplays ( 10 | playid varchar(32) NOT NULL, 11 | start_time timestamp NOT NULL, 12 | userid int4 NOT NULL, 13 | "level" varchar(256), 14 | songid varchar(256), 15 | artistid varchar(256), 16 | sessionid int4, 17 | location varchar(256), 18 | user_agent varchar(256), 19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 20 | ); 21 | 22 | CREATE TABLE IF NOT EXISTS public.songs ( 23 | songid varchar(256) NOT NULL, 24 | title varchar(256), 25 | artistid varchar(256), 26 | "year" int4, 27 | duration numeric(18,0), 28 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 29 | ); 30 | 31 | CREATE TABLE IF NOT EXISTS public.staging_events ( 32 | artist varchar(256), 33 | auth varchar(256), 34 | firstname varchar(256), 35 | gender varchar(256), 36 | iteminsession int4, 37 | lastname varchar(256), 38 | length numeric(18,0), 39 | "level" varchar(256), 40 | location varchar(256), 41 | "method" varchar(256), 42 | page varchar(256), 43 | registration numeric(18,0), 44 | sessionid int4, 45 | song varchar(256), 46 | status int4, 47 | ts int8, 48 | useragent varchar(256), 49 | userid int4 50 | ); 51 | 52 | CREATE TABLE IF NOT EXISTS public.staging_songs ( 53 | num_songs int4, 54 | artist_id varchar(256), 55 | artist_name varchar(256), 56 | artist_latitude numeric(18,0), 57 | artist_longitude numeric(18,0), 58 | artist_location varchar(256), 59 | song_id varchar(256), 60 | title varchar(256), 61 | duration numeric(18,0), 62 | "year" int4 63 | ); 64 | 65 | CREATE TABLE IF NOT EXISTS public.time ( 66 | start_time timestamp NOT NULL, 67 | hour int4 NOT NULL, 68 | day int4 NOT NULL, 69 | week int4 NOT NULL, 70 | month int4 NOT NULL, 71 | year int4 NOT NULL, 72 | dayofweek int4 NOT NULL 73 | ); 74 | 75 | CREATE TABLE IF NOT EXISTS public.users ( 76 | userid int4 NOT NULL, 77 | first_name varchar(256), 78 | last_name varchar(256), 79 | gender varchar(256), 80 | "level" varchar(256), 81 | CONSTRAINT users_pkey PRIMARY KEY (userid) 82 | ); 83 | 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /6. 
Data Pipelining with Airflow/airflow/dags/sparkify_dwh_dag.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | import os 3 | from airflow import DAG 4 | from airflow.operators.sparkify_plugin import (StageToRedshiftOperator, LoadFactOperator, 5 | LoadDimensionOperator, DataQualityOperator) 6 | from airflow.operators.postgres_operator import PostgresOperator 7 | from airflow.operators.dummy_operator import DummyOperator 8 | from helpers import SqlQueries 9 | 10 | # /opt/airflow/start.sh 11 | 12 | default_args = { 13 | 'owner': 'udacity', 14 | 'start_date': datetime(2019, 1, 12), 15 | 'depends_on_past': False, 16 | 'retries': 1, 17 | 'retry_delay': timedelta(minutes=5), 18 | 'catchup': True 19 | } 20 | 21 | with DAG(dag_id='sparkify_music_dwh_dag', default_args=default_args, 22 | description='Load and transform data in Redshift \ 23 | Data Warehouse with Airflow', 24 | schedule_interval='@hourly') as dag: 25 | 26 | start_operator = DummyOperator(task_id='begin_execution', dag=dag) 27 | 28 | create_tables = PostgresOperator( 29 | task_id='create_tables', 30 | postgres_conn_id="redshift", 31 | sql="create_tables.sql" 32 | ) 33 | 34 | stage_events_to_redshift = StageToRedshiftOperator( 35 | task_id='load_stage_events', 36 | redshift_conn_id="redshift", 37 | aws_credentials_id="aws_credentials", 38 | s3_bucket="udacity-dend", 39 | s3_key="log_data", 40 | jsonpath="log_json_path.json", 41 | table_name="public.staging_events", 42 | ignore_headers=1 43 | ) 44 | 45 | stage_songs_to_redshift = StageToRedshiftOperator( 46 | task_id='load_stage_songs', 47 | redshift_conn_id="redshift", 48 | aws_credentials_id="aws_credentials", 49 | s3_bucket="udacity-dend", 50 | s3_key="song_data", 51 | table_name="public.staging_songs", 52 | ignore_headers=1 53 | ) 54 | 55 | load_songplays_table = LoadFactOperator( 56 | task_id='load_songplays_fact_table', 57 | redshift_conn_id="redshift", 58 | load_sql=SqlQueries.songplay_table_insert, 59 | table_name="public.songplays" 60 | ) 61 | 62 | load_user_dimension_table = LoadDimensionOperator( 63 | task_id='load_user_dim_table', 64 | redshift_conn_id="redshift", 65 | load_sql=SqlQueries.user_table_insert, 66 | table_name="public.users", 67 | append_only=False 68 | ) 69 | 70 | load_song_dimension_table = LoadDimensionOperator( 71 | task_id='load_song_dim_table', 72 | redshift_conn_id="redshift", 73 | load_sql=SqlQueries.song_table_insert, 74 | table_name="public.songs", 75 | append_only=False 76 | ) 77 | 78 | load_artist_dimension_table = LoadDimensionOperator( 79 | task_id='load_artist_dim_table', 80 | redshift_conn_id="redshift", 81 | load_sql=SqlQueries.artist_table_insert, 82 | table_name="public.artists", 83 | append_only=False 84 | ) 85 | 86 | load_time_dimension_table = LoadDimensionOperator( 87 | task_id='load_time_dim_table', 88 | redshift_conn_id="redshift", 89 | load_sql=SqlQueries.time_table_insert, 90 | table_name="public.time", 91 | append_only=False 92 | ) 93 | 94 | run_quality_checks = DataQualityOperator( 95 | task_id='run_data_quality_checks', 96 | redshift_conn_id="redshift", 97 | table_names=["public.staging_events", "public.staging_songs", 98 | "public.songplays", "public.artists", 99 | "public.songs", "public.time", "public.users"] 100 | ) 101 | 102 | end_operator = DummyOperator(task_id='stop_execution', dag=dag) 103 | 104 | start_operator >> create_tables 105 | create_tables >> [stage_events_to_redshift, 106 | stage_songs_to_redshift] 107 | 108 | 
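# both staging loads must complete before the songplays fact table is loaded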
[stage_events_to_redshift, 109 | stage_songs_to_redshift] >> load_songplays_table 110 | 111 | load_songplays_table >> [load_user_dimension_table, 112 | load_song_dimension_table, 113 | load_artist_dimension_table, 114 | load_time_dimension_table] 115 | [load_user_dimension_table, 116 | load_song_dimension_table, 117 | load_artist_dimension_table, 118 | load_time_dimension_table] >> run_quality_checks 119 | 120 | run_quality_checks >> end_operator -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class SparkifyPlugin(AirflowPlugin): 10 | name = "sparkify_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13 | operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | songplay_table_insert = (""" 3 | SELECT 4 | md5(events.sessionid || events.start_time) songplay_id, 5 | events.start_time, 6 | events.userid, 7 | events.level, 8 | songs.song_id, 9 | songs.artist_id, 10 | events.sessionid, 11 | events.location, 12 | events.useragent 13 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, * 14 | FROM staging_events 15 | WHERE page='NextSong') events 16 | LEFT JOIN staging_songs songs 17 | ON events.song = songs.title 18 | AND events.artist = songs.artist_name 19 | AND events.length = songs.duration 20 | """) 21 | 22 | user_table_insert = (""" 23 | SELECT distinct userid, firstname, lastname, gender, level 24 | FROM staging_events 25 | WHERE page='NextSong' 26 | """) 27 | 28 | song_table_insert = (""" 29 | SELECT distinct song_id, title, artist_id, year, duration 30 | FROM staging_songs 31 | """) 32 | 33 | artist_table_insert = (""" 34 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude 35 | FROM staging_songs 36 | """) 37 | 38 | time_table_insert = (""" 39 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time), 40 | extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time) 41 | FROM songplays 42 | """) -------------------------------------------------------------------------------- /6. 
Data Pipelining with Airflow/airflow/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.stage_redshift import StageToRedshiftOperator 2 | from operators.load_fact import LoadFactOperator 3 | from operators.load_dimension import LoadDimensionOperator 4 | from operators.data_quality import DataQualityOperator 5 | 6 | __all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/airflow/plugins/operators/data_quality.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class DataQualityOperator(BaseOperator): 6 | 7 | ui_color = '#89DA59' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | redshift_conn_id="", 12 | table_names=[""], 13 | *args, **kwargs): 14 | 15 | super(DataQualityOperator, self).__init__(*args, **kwargs) 16 | self.redshift_conn_id = redshift_conn_id 17 | self.table_names = table_names 18 | 19 | def execute(self, context): 20 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 21 | for table in self.table_names: 22 | # Check that entries are being copied to table 23 | records = redshift.get_records(f"SELECT COUNT(*) FROM {table}") 24 | if len(records) < 1 or len(records[0]) < 1 or records[0][0] < 1: 25 | raise ValueError(f"Data quality check failed. {table} returned no results") 26 | 27 | # Check that there are no rows with null ids 28 | dq_checks=[ 29 | {'table': 'users', 30 | 'check_sql': "SELECT COUNT(*) FROM users WHERE userid is null", 31 | 'expected_result': 0}, 32 | {'table': 'songs', 33 | 'check_sql': "SELECT COUNT(*) FROM songs WHERE songid is null", 34 | 'expected_result': 0} 35 | ] 36 | for check in dq_checks: 37 | records = redshift.get_records(check['check_sql']) 38 | if records[0][0] != check['expected_result']: 39 | raise ValueError(f"Data quality check failed. {check['table']} \ 40 | contains null in id column") 41 | -------------------------------------------------------------------------------- /6.
Data Pipelining with Airflow/airflow/plugins/operators/load_dimension.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadDimensionOperator(BaseOperator): 6 | 7 | ui_color = '#80BD9E' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | redshift_conn_id="", 12 | load_sql="", 13 | table_name="", 14 | append_only=False, 15 | *args, **kwargs): 16 | 17 | super(LoadDimensionOperator, self).__init__(*args, **kwargs) 18 | self.redshift_conn_id = redshift_conn_id 19 | self.load_sql = load_sql 20 | self.table_name = table_name 21 | self.append_only = append_only 22 | 23 | def execute(self, context): 24 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 25 | self.log.info("Loading into {} dimension table".format(self.table_name)) 26 | self.log.info("Append only mode: {}".format(self.append_only)) 27 | if self.append_only: 28 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql) 29 | redshift.run(sql_stmt) 30 | else: 31 | sql_del_stmt = 'DELETE FROM %s' % (self.table_name) 32 | redshift.run(sql_del_stmt) 33 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql) 34 | redshift.run(sql_stmt) 35 | 36 | -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/airflow/plugins/operators/load_fact.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadFactOperator(BaseOperator): 6 | 7 | ui_color = '#F98866' 8 | 9 | @apply_defaults 10 | def __init__(self, 11 | redshift_conn_id="", 12 | load_sql="", 13 | table_name="", 14 | *args, **kwargs): 15 | 16 | super(LoadFactOperator, self).__init__(*args, **kwargs) 17 | self.redshift_conn_id = redshift_conn_id 18 | self.load_sql = load_sql 19 | self.table_name = table_name 20 | 21 | def execute(self, context): 22 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 23 | self.log.info("Loading into {} fact table".format(self.table_name)) 24 | sql_stmt = 'INSERT INTO %s %s' % (self.table_name, self.load_sql) 25 | redshift.run(sql_stmt) -------------------------------------------------------------------------------- /6. 
Data Pipelining with Airflow/airflow/plugins/operators/stage_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.contrib.hooks.aws_hook import AwsHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | class StageToRedshiftOperator(BaseOperator): 7 | ui_color = '#358140' 8 | copy_sql = """ 9 | COPY {} 10 | FROM '{}' 11 | ACCESS_KEY_ID '{}' 12 | SECRET_ACCESS_KEY '{}' 13 | IGNOREHEADER {} 14 | JSON '{}' 15 | """ 16 | 17 | @apply_defaults 18 | def __init__(self, 19 | redshift_conn_id="", 20 | aws_credentials_id="", 21 | s3_bucket="", 22 | s3_key="", 23 | jsonpath="auto", 24 | table_name="", 25 | ignore_headers=1, 26 | *args, **kwargs): 27 | 28 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 29 | self.redshift_conn_id = redshift_conn_id 30 | self.aws_credentials_id = aws_credentials_id 31 | self.s3_bucket = s3_bucket 32 | self.ignore_headers = ignore_headers 33 | self.s3_key = s3_key 34 | self.jsonpath = jsonpath 35 | self.table = table_name 36 | 37 | def execute(self, context): 38 | aws_hook = AwsHook(self.aws_credentials_id) 39 | credentials = aws_hook.get_credentials() 40 | redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id) 41 | self.log.info("Clearing data from destination Redshift table") 42 | redshift.run("DELETE FROM {}".format(self.table)) 43 | self.log.info("Copying data from S3 to Redshift") 44 | s3_path = "s3://{}/{}".format(self.s3_bucket, self.s3_key) 45 | if self.jsonpath != "auto": 46 | jsonpath = "s3://{}/{}".format(self.s3_bucket, self.jsonpath) 47 | else: 48 | jsonpath = self.jsonpath 49 | formatted_sql = StageToRedshiftOperator.copy_sql.format( 50 | self.table, 51 | s3_path, 52 | credentials.access_key, 53 | credentials.secret_key, 54 | self.ignore_headers, 55 | jsonpath 56 | ) 57 | redshift.run(formatted_sql) 58 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/screenshots/airflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/airflow.png -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/screenshots/dag.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/dag.PNG -------------------------------------------------------------------------------- /6. Data Pipelining with Airflow/screenshots/schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alanchn31/Data-Engineering-Projects/4cd0a0e12b3ab2e2dd5fa128985288e773076b45/6. Data Pipelining with Airflow/screenshots/schema.PNG -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | --- 3 | * This repo contains projects done which applies principles in data engineering. 4 | * Notes taken during the course can be found in folder `0. Back to Basics` 5 | 6 | ## Projects 7 | --- 8 | 1. 
Postgres ETL :heavy_check_mark: 9 | * This project looks at data modelling for a fictitious music startup Sparkify, applying the STAR schema to ingest data and simplify queries that answer business questions the product owner may have 10 | 11 | 2. Cassandra ETL :heavy_check_mark: 12 | * Looking at the realm of big data, Cassandra helps to ingest large amounts of data in a NoSQL context. This project adopts a query-centric approach in ingesting data into data tables in Cassandra, to answer business questions about a music app 13 | 14 | 3. Web Scraping using Scrapy, MongoDB ETL :heavy_check_mark: 15 | * One way to store semi-structured data is in the form of documents. MongoDB makes this possible, with a specific collection containing related documents. Each document contains fields of data which can be queried. 16 | * In this project, data is scraped from a books listing website using Scrapy. The fields of each book, such as its price, rating and availability, are stored in a document in the books collection in MongoDB. 17 | 18 | 4. Data Warehousing with AWS Redshift :heavy_check_mark: 19 | * This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer some business questions based on requirements. 20 | 21 | 5. Data Lake with Spark & AWS S3 :heavy_check_mark: 22 | * This project creates a data lake in AWS S3 using Spark. 23 | * Why create a data lake? A data lake provides a reliable store for large amounts of data, from unstructured to semi-structured and even structured data. In this project, we ingest json files, denormalize them into fact and dimension tables and upload them into an AWS S3 data lake in the form of parquet files. 24 | 25 | 6. Data Pipelining with Airflow :heavy_check_mark: 26 | * This project schedules data pipelines to perform ETL from json files in S3 to Redshift using Airflow. 27 | * Why use Airflow? Airflow allows workflows to be defined as code, making them more maintainable, versionable, testable, and collaborative 28 | 29 | 7. Capstone Project :heavy_check_mark: 30 | * This project is the finale to Udacity's data engineering nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project. 31 | * My project builds a movies data warehouse, which can be used to build a movies recommendation system, as well as to predict box-office earnings. View the project here: [Movies Data Warehouse](https://github.com/alanchn31/Udacity-Data-Engineering-Capstone) --------------------------------------------------------------------------------