├── A brief introduction of parquet files.docx ├── Capstone_Project ├── .ipynb_checkpoints │ └── Capstone_Project_with_Immigration_Data-checkpoint.ipynb ├── Capstone_Project_Immigration_Data.ipynb ├── I94_SAS_Labels_Descriptions.SAS ├── airport_codes.csv ├── dl.cfg ├── immigration_data_sample.csv ├── notes for uploading data ro s3.txt ├── readme.md ├── ref │ ├── dictionary for star schema 1.PNG │ ├── model for star schema 1 .PNG │ ├── star schema 2.1.PNG │ ├── star schema 2.3.PNG │ ├── star schema 2.4.PNG │ └── star shema 2.2.PNG ├── us_cities_demographics.csv └── valid_i94port.txt ├── Data Sources.docx ├── Project_1A_Data_Modeling_with_Postgres ├── README.md ├── create_tables.py ├── etl.ipynb ├── etl.py ├── sql_queries.py └── test.ipynb ├── Project_1B_Data_Modeling_with_Cassandra ├── Project_1B_ ETL_for_NoSQL_Modeling_with_Cassandra.ipynb ├── Readme.md └── event_datafile_new.csv ├── Project_3_Data_Warehouse_with_Redshift ├── README.md ├── create_tables.py ├── dwh.cfg ├── etl.py ├── project3_star_schema.PNG └── sql_queries.py ├── Project_4_Data_Lake_with_Spark ├── Data │ ├── log-data │ │ ├── 2018-11-01-events.json │ │ ├── 2018-11-02-events.json │ │ ├── 2018-11-03-events.json │ │ ├── 2018-11-04-events.json │ │ ├── 2018-11-05-events.json │ │ ├── 2018-11-06-events.json │ │ ├── 2018-11-07-events.json │ │ ├── 2018-11-08-events.json │ │ ├── 2018-11-09-events.json │ │ ├── 2018-11-10-events.json │ │ ├── 2018-11-11-events.json │ │ ├── 2018-11-12-events.json │ │ ├── 2018-11-13-events.json │ │ ├── 2018-11-14-events.json │ │ ├── 2018-11-15-events.json │ │ ├── 2018-11-16-events.json │ │ ├── 2018-11-17-events.json │ │ ├── 2018-11-18-events.json │ │ ├── 2018-11-19-events.json │ │ ├── 2018-11-20-events.json │ │ ├── 2018-11-21-events.json │ │ ├── 2018-11-22-events.json │ │ ├── 2018-11-23-events.json │ │ ├── 2018-11-24-events.json │ │ ├── 2018-11-25-events.json │ │ ├── 2018-11-26-events.json │ │ ├── 2018-11-27-events.json │ │ ├── 2018-11-28-events.json │ │ ├── 2018-11-29-events.json │ │ └── 2018-11-30-events.json │ └── song-data │ │ └── song_data │ │ ├── .DS_Store │ │ └── A │ │ ├── .DS_Store │ │ ├── A │ │ ├── .DS_Store │ │ ├── A │ │ │ ├── TRAAAAW128F429D538.json │ │ │ ├── TRAAABD128F429CF47.json │ │ │ ├── TRAAADZ128F9348C2E.json │ │ │ ├── TRAAAEF128F4273421.json │ │ │ ├── TRAAAFD128F92F423A.json │ │ │ ├── TRAAAMO128F1481E7F.json │ │ │ ├── TRAAAMQ128F1460CD3.json │ │ │ ├── TRAAAPK128E0786D96.json │ │ │ ├── TRAAARJ128F9320760.json │ │ │ ├── TRAAAVG12903CFA543.json │ │ │ └── TRAAAVO128F93133D4.json │ │ ├── B │ │ │ ├── TRAABCL128F4286650.json │ │ │ ├── TRAABDL12903CAABBA.json │ │ │ ├── TRAABJL12903CDCF1A.json │ │ │ ├── TRAABJV128F1460C49.json │ │ │ ├── TRAABLR128F423B7E3.json │ │ │ ├── TRAABNV128F425CEE1.json │ │ │ ├── TRAABRB128F9306DD5.json │ │ │ ├── TRAABVM128F92CA9DC.json │ │ │ ├── TRAABXG128F9318EBD.json │ │ │ ├── TRAABYN12903CFD305.json │ │ │ └── TRAABYW128F4244559.json │ │ └── C │ │ │ ├── TRAACCG128F92E8A55.json │ │ │ ├── TRAACER128F4290F96.json │ │ │ ├── TRAACFV128F935E50B.json │ │ │ ├── TRAACHN128F1489601.json │ │ │ ├── TRAACIW12903CC0F6D.json │ │ │ ├── TRAACLV128F427E123.json │ │ │ ├── TRAACNS128F14A2DF5.json │ │ │ ├── TRAACOW128F933E35F.json │ │ │ ├── TRAACPE128F421C1B9.json │ │ │ ├── TRAACQT128F9331780.json │ │ │ ├── TRAACSL128F93462F4.json │ │ │ ├── TRAACTB12903CAAF15.json │ │ │ ├── TRAACVS128E078BE39.json │ │ │ └── TRAACZK128F4243829.json │ │ └── B │ │ ├── .DS_Store │ │ ├── A │ │ ├── TRABACN128F425B784.json │ │ ├── TRABAFJ128F42AF24E.json │ │ ├── TRABAFP128F931E9A1.json │ │ ├── 
TRABAIO128F42938F9.json │ │ ├── TRABATO128F42627E9.json │ │ ├── TRABAVQ12903CBF7E0.json │ │ ├── TRABAWW128F4250A31.json │ │ ├── TRABAXL128F424FC50.json │ │ ├── TRABAXR128F426515F.json │ │ ├── TRABAXV128F92F6AE3.json │ │ └── TRABAZH128F930419A.json │ │ ├── B │ │ ├── TRABBAM128F429D223.json │ │ ├── TRABBBV128F42967D7.json │ │ ├── TRABBJE12903CDB442.json │ │ ├── TRABBKX128F4285205.json │ │ ├── TRABBLU128F93349CF.json │ │ ├── TRABBNP128F932546F.json │ │ ├── TRABBOP128F931B50D.json │ │ ├── TRABBOR128F4286200.json │ │ ├── TRABBTA128F933D304.json │ │ ├── TRABBVJ128F92F7EAA.json │ │ ├── TRABBXU128F92FEF48.json │ │ └── TRABBZN12903CD9297.json │ │ └── C │ │ ├── TRABCAJ12903CDFCC2.json │ │ ├── TRABCEC128F426456E.json │ │ ├── TRABCEI128F424C983.json │ │ ├── TRABCFL128F149BB0D.json │ │ ├── TRABCIX128F4265903.json │ │ ├── TRABCKL128F423A778.json │ │ ├── TRABCPZ128F4275C32.json │ │ ├── TRABCRU128F423F449.json │ │ ├── TRABCTK128F934B224.json │ │ ├── TRABCUQ128E0783E2B.json │ │ ├── TRABCXB128F4286BD3.json │ │ └── TRABCYE128F934CE1D.json ├── README.md ├── __MACOSX │ ├── ._README.md │ ├── ._dl.cfg │ └── ._etl.py ├── dl.cfg ├── etl.py └── star_schema.PNG ├── Project_5_Data_Pipeline_with_Airflow ├── airflow │ ├── create_tables.sql │ ├── dags │ │ ├── __pycache__ │ │ │ └── udac_example_dag.cpython-36.pyc │ │ └── udac_example_dag.py │ └── plugins │ │ ├── __init__.py │ │ ├── __pycache__ │ │ └── __init__.cpython-36.pyc │ │ ├── helpers │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-36.pyc │ │ │ └── sql_queries.cpython-36.pyc │ │ └── sql_queries.py │ │ └── operators │ │ ├── __init__.py │ │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── data_quality.cpython-36.pyc │ │ ├── load_dimension.cpython-36.pyc │ │ ├── load_fact.cpython-36.pyc │ │ └── stage_redshift.cpython-36.pyc │ │ ├── data_quality.py │ │ ├── load_dimension.py │ │ ├── load_fact.py │ │ └── stage_redshift.py └── readme.md └── README.md /A brief introduction of parquet files.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/A brief introduction of parquet files.docx -------------------------------------------------------------------------------- /Capstone_Project/dl.cfg: -------------------------------------------------------------------------------- 1 | AWS_ACCESS_KEY_ID= 2 | AWS_SECRET_ACCESS_KEY= -------------------------------------------------------------------------------- /Capstone_Project/notes for uploading data ro s3.txt: -------------------------------------------------------------------------------- 1 | *********************** process ************************************ 2 | upload the files into S3 and use Redshift to directly process the data? 
Why use PySpark as part of the process? 3 | 4 | You are free to use any technologies you like in the project 5 | 6 | 7 | 8 | *********************** load data from workspace to s3 ************************************ 9 | -- install the aws cli in the workspace first 10 | * to upload the sas data to S3 directly, you have to install awscli in the workspace first 11 | * it must be the workspace, not your local machine, if you do your project in the workspace 12 | 13 | -- installing locally did not work for me 14 | https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html 15 | Steps to Take after Installation 16 | Setting the Path to Include the AWS CLI 17 | 18 | Configure the AWS CLI with Your Credentials 19 | 20 | Upgrading to the Latest Version of the AWS CLI 21 | 22 | Uninstalling the AWS CLI 23 | 24 | Setting the Path to Include the AWS CLI 25 | After you install the AWS CLI, you might need to add the path to the executable file to your PATH variable. For platform-specific instructions, see the following topics: 26 | 27 | Linux – Add the AWS CLI Executable to Your Command Line Path 28 | 29 | Windows – Add the AWS CLI Executable to Your Command Line Path 30 | 31 | 32 | 33 | Verify that the AWS CLI installed correctly by running aws --version. 34 | 35 | $ aws --version 36 | aws-cli/1.16.116 Python/3.6.8 Linux/4.14.77-81.59-amzn2.x86_64 botocore/1.12.106 37 | Configure the AWS CLI with Your Credentials 38 | Before you can run a CLI command, you must configure the AWS CLI with your credentials. 39 | 40 | You store credential information locally by defining profiles in the AWS CLI configuration files, which are stored by default in your user's home directory. For more information, see Configuring the AWS CLI. 41 | 42 | Note 43 | 44 | If you are running on an Amazon EC2 instance, credentials can be retrieved automatically from the instance metadata. For more information, see Instance Metadata. 45 | 46 | Upgrading to the Latest Version of the AWS CLI 47 | The AWS CLI is updated regularly to add support for new services and commands. To update to the latest version of the AWS CLI, run the installation command again. For details about the latest version of the AWS CLI, see the AWS CLI release notes. 48 | 49 | $ pip3 install awscli --upgrade --user 50 | 51 | 52 | 53 | ********************************** read data into spark from s3 ************************* 54 | 55 | from pyspark.sql import SparkSession 56 | 57 | spark = SparkSession.builder \ 58 | .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \ 59 | .getOrCreate() 60 | 61 | immigration_data_dir = 's3://udacity-capstone-project-12138/immigration-data/i94_feb16_sub.sas7bdat' 62 | 63 | df_immigration = spark.read.format('com.github.saurfang.sas.spark').load(immigration_data_dir) 64 | 65 | 66 | To my knowledge, the com.github.saurfang.sas.spark package does not support reading a whole directory directly. 67 | 68 | You should read the files one by one 69 | 70 | 71 | 72 | ******************************** Convert SAS dates to Python datetime in Spark rather than converting the Spark DataFrame to pandas?
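A short note before the snippet below: an I94 arrdate value is a SAS numeric date, i.e. a count of days since 1960-01-01, which is why the UDF adds arrdate to that epoch as a timedelta. Keeping the conversion inside a Spark UDF keeps the work distributed and avoids collecting the large DataFrame to pandas on the driver. A minimal sanity check of the arithmetic (the arrdate value 20574 is made up for illustration, not taken from the dataset):

import datetime as dt
sas_epoch = dt.date(1960, 1, 1)                               # SAS stores dates as days since this epoch
print((sas_epoch + dt.timedelta(days=20574)).isoformat())     # prints 2016-04-30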
73 | import datetime as dt 74 | 75 | from pyspark.sql import SparkSession 76 | 77 | from pyspark.sql.functions import udf 78 | 79 | spark = SparkSession.builder.appName("capstone_project").getOrCreate() 80 | 81 | get_date = udf(lambda x: (dt.datetime(1960, 1, 1).date() + dt.timedelta(x)).isoformat() if x else None) 82 | 83 | sp_sas_data = spark.read.parquet("sas_data") 84 | 85 | sp_sas_data = sp_sas_data.withColumn("arrival_date", get_date(sp_sas_data.arrdate)) 86 | 87 | sp_sas_data.createOrReplaceTempView("i94") 88 | 89 | i94_df = spark.sql("select distinct arrival_date from i94") 90 | 91 | 92 | 93 | 94 | *****************read the file "I94_SAS_Labels_Descriptions.SAS"***************************** 95 | 96 | You can copy the labels into different files, such as saving I94CIT & I94RES into a file called code.txt 97 | 98 | 582 = 'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)' 99 | 100 | 236 = 'AFGHANISTAN' 101 | 102 | 101 = 'ALBANIA' 103 | 104 | 316 = 'ALGERIA' 105 | 106 | ... And then, you can read the codes this way (with pandas imported as pd): 107 | 108 | df = pd.read_csv('code.txt', sep=" = ", header=None) -------------------------------------------------------------------------------- /Capstone_Project/readme.md: -------------------------------------------------------------------------------- 1 | 2 | ## Project Instruction 3 | 4 | * a very important step is to compress the datasets in the workspace into a single zip file and upload it to the s3 bucket: 5 | * this is the command : 6 | zip -r /home/workspace/data.zip /data/18-83510-I94-Data-2016 7 | 8 | 9 | 10 | ### Step 1: Scope the Project and Gather Data 11 | 12 | Since the scope of the project will be highly dependent on the data, these two things happen simultaneously. 13 | 14 | In this step, I'll identify and gather the data I'll be using for my project (at least two sources and more than 1 million rows). 15 | I picked two of the provided datasets: the I94 immigration data and the us-cities-demographics data. 16 | 17 | 18 | The main purpose is analytics. The following questions are what I try to answer through this project: 19 | * which country sent the most visitors to the US in 2016? 20 | * which US cities attracted the most visitors in 2016? 21 | * which visa type was the most popular in 2016? 22 | * which race accounts for the largest number of immigrants to the US? 23 | 24 | ### Step 2: Explore and Assess the Data 25 | 26 | Explore the data to identify data quality issues, like missing values, duplicate data, etc. 27 | Document the steps necessary to clean the data 28 | 29 | ### Step 3: Define the Data Model 30 | 31 | Map out the conceptual data model and explain why you chose that model 32 | List the steps necessary to pipeline the data into the chosen data model 33 | 34 | ### Step 4: Run ETL to Model the Data 35 | 36 | * Create the data pipelines and the data model 37 | * Include a data dictionary 38 | * Run data quality checks to ensure the pipeline ran as expected 39 | * Integrity constraints on the relational database (e.g., unique key, data type, etc.) 40 | * Unit tests for the scripts to ensure they are doing the right thing 41 | * Source/count checks to ensure completeness 42 | 43 | ### Step 5: Complete Project Write Up 44 | 45 | What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose? 46 | Clearly state the rationale for the choice of tools and technologies for the project. 47 | Document the steps of the process. 48 | Propose how often the data should be updated and why.
49 | Post your write-up and final data model in a GitHub repo. 50 | Include a description of how you would approach the problem differently under the following 51 | 52 | ## scenarios: 53 | If the data was increased by 100x. 54 | If the pipelines were run on a daily basis by 7am. 55 | If the database needed to be accessed by 100+ people. 56 | 57 | 58 | 59 | 60 | ## Datasets 61 | 62 | ### Brief Introduction of the dataset 63 | 64 | The following 4 datasets are included in the project.If something about the data is unclear, make an assumption, document it, and move on. 65 | (1). I94 Immigration Data: This data comes from the US National Tourism and Trade Office. 66 | A data dictionary is included in the workspace. This is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project. 67 | (2). World Temperature Data: This dataset came from Kaggle. 68 | (3). U.S. City Demographic Data: This data comes from OpenSoft. 69 | (4).Airport Code Table: This is a simple table of airport codes and corresponding cities. 70 | 71 | ### Accessing the Data 72 | The immigration data and the global temperate data is in an attached disk. 73 | (1). Immigration Data 74 | You can access the immigration data in a folder with the following path: 75 | ../../data/18-83510-I94-Data-2016/. 76 | There's a file for each month of the year. 77 | 78 | An example file name is i94_apr16_sub.sas7bdat. Each file has a three-letter abbreviation 79 | for the month name. 80 | So a full file path for June would look like this: 81 | ../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat. 82 | 83 | Below is what it would look like to import this file into pandas. (Note: these files are 84 | large, so you'll have to think about how to process and aggregate them efficiently.) 85 | fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat' 86 | df = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1") 87 | 88 | ### The most important decision for modeling with this data is thinking about the level of aggregation. Do you want to aggregate by airport by month? Or by city by year? This level of aggregation will influence how you join the data with other datasets. There isn't a right answer, it all depends on what you want your final dataset to look like. 89 | 90 | (2). Temperature Data 91 | You can access the temperature data in a folder with the following path: 92 | ../../data2/. 93 | There's just one file in that folder, called GlobalLandTemperaturesByCity.csv. 94 | 95 | Below is how you would read the file into a pandas dataframe. 96 | fname = '../../data2/GlobalLandTemperaturesByCity.csv' 97 | df = pd.read_csv(fname) 98 | 99 | ## Reference: 100 | ##### Part 1: 101 | ##### https://github.com/cheuklau/udacity-capstone 102 | ##### https://github.com/shalgrim/dend_capstone 103 | 104 | 105 | ##### Part 2: 106 | ##### https://github.com/saurfang/spark-sas7bdat (reading SAS files from local and distributed filesystems, into Spark DataFrames.) 
107 | ##### https://github.com/FedericoSerini/DEND-Capstone-RomeTransportNetwork 108 | ##### https://github.com/sariabod/dend-capstone 109 | ##### https://github.com/akessela/Sparkify-Capstone-Project 110 | ##### https://github.com/Mikemraz/Capstone-Project-Big-Data-Sparkify -------------------------------------------------------------------------------- /Capstone_Project/ref/dictionary for star schema 1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/dictionary for star schema 1.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/model for star schema 1 .PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/model for star schema 1 .PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.1.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.3.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.4.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star shema 2.2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star shema 2.2.PNG -------------------------------------------------------------------------------- /Capstone_Project/valid_i94port.txt: -------------------------------------------------------------------------------- 1 | 'ALC' = 'ALCAN, AK ' 2 | 'ANC' = 'ANCHORAGE, AK ' 3 | 'BAR' = 'BAKER AAF - BAKER ISLAND, AK' 4 | 'DAC' = 'DALTONS CACHE, AK ' 5 | 'PIZ' = 'DEW STATION PT LAY DEW, AK' 6 | 'DTH' = 'DUTCH HARBOR, AK ' 7 | 'EGL' = 'EAGLE, AK ' 8 | 'FRB' = 'FAIRBANKS, AK ' 9 | 'HOM' = 'HOMER, AK ' 10 | 'HYD' = 'HYDER, AK ' 11 | 'JUN' = 'JUNEAU, AK ' 12 | '5KE' = 'KETCHIKAN, AK' 13 | 'KET' = 'KETCHIKAN, AK ' 14 | 'MOS' = 'MOSES POINT INTERMEDIATE, AK' 15 | 'NIK' = 'NIKISKI, AK ' 16 | 'NOM' = 'NOM, AK ' 17 | 'PKC' = 'POKER CREEK, AK ' 18 | 'ORI' = 'PORT LIONS SPB, AK' 19 | 'SKA' = 'SKAGWAY, AK ' 20 | 'SNP' = 'ST. 
PAUL ISLAND, AK' 21 | 'TKI' = 'TOKEEN, AK' 22 | 'WRA' = 'WRANGELL, AK ' 23 | 'HSV' = 'MADISON COUNTY - HUNTSVILLE, AL' 24 | 'MOB' = 'MOBILE, AL ' 25 | 'LIA' = 'LITTLE ROCK, AR (BPS)' 26 | 'ROG' = 'ROGERS ARPT, AR' 27 | 'DOU' = 'DOUGLAS, AZ ' 28 | 'LUK' = 'LUKEVILLE, AZ ' 29 | 'MAP' = 'MARIPOSA AZ ' 30 | 'NAC' = 'NACO, AZ ' 31 | 'NOG' = 'NOGALES, AZ ' 32 | 'PHO' = 'PHOENIX, AZ ' 33 | 'POR' = 'PORTAL, AZ' 34 | 'SLU' = 'SAN LUIS, AZ ' 35 | 'SAS' = 'SASABE, AZ ' 36 | 'TUC' = 'TUCSON, AZ ' 37 | 'YUI' = 'YUMA, AZ ' 38 | 'AND' = 'ANDRADE, CA ' 39 | 'BUR' = 'BURBANK, CA' 40 | 'CAL' = 'CALEXICO, CA ' 41 | 'CAO' = 'CAMPO, CA ' 42 | 'FRE' = 'FRESNO, CA ' 43 | 'ICP' = 'IMPERIAL COUNTY, CA ' 44 | 'LNB' = 'LONG BEACH, CA ' 45 | 'LOS' = 'LOS ANGELES, CA ' 46 | 'BFL' = 'MEADOWS FIELD - BAKERSFIELD, CA' 47 | 'OAK' = 'OAKLAND, CA ' 48 | 'ONT' = 'ONTARIO, CA' 49 | 'OTM' = 'OTAY MESA, CA ' 50 | 'BLT' = 'PACIFIC, HWY. STATION, CA ' 51 | 'PSP' = 'PALM SPRINGS, CA' 52 | 'SAC' = 'SACRAMENTO, CA ' 53 | 'SLS' = 'SALINAS, CA (BPS)' 54 | 'SDP' = 'SAN DIEGO, CA' 55 | 'SFR' = 'SAN FRANCISCO, CA ' 56 | 'SNJ' = 'SAN JOSE, CA ' 57 | 'SLO' = 'SAN LUIS OBISPO, CA ' 58 | 'SLI' = 'SAN LUIS OBISPO, CA (BPS)' 59 | 'SPC' = 'SAN PEDRO, CA ' 60 | 'SYS' = 'SAN YSIDRO, CA ' 61 | 'SAA' = 'SANTA ANA, CA ' 62 | 'STO' = 'STOCKTON, CA (BPS)' 63 | 'TEC' = 'TECATE, CA ' 64 | 'TRV' = 'TRAVIS-AFB, CA ' 65 | 'APA' = 'ARAPAHOE COUNTY, CO' 66 | 'ASE' = 'ASPEN, CO #ARPT' 67 | 'COS' = 'COLORADO SPRINGS, CO' 68 | 'DEN' = 'DENVER, CO ' 69 | 'DRO' = 'LA PLATA - DURANGO, CO' 70 | 'BDL' = 'BRADLEY INTERNATIONAL, CT' 71 | 'BGC' = 'BRIDGEPORT, CT ' 72 | 'GRT' = 'GROTON, CT ' 73 | 'HAR' = 'HARTFORD, CT ' 74 | 'NWH' = 'NEW HAVEN, CT ' 75 | 'NWL' = 'NEW LONDON, CT ' 76 | 'TST' = 'NEWINGTON DATA CENTER TEST, CT' 77 | 'WAS' = 'WASHINGTON DC ' 78 | 'DOV' = 'DOVER AFB, DE' 79 | 'DVD' = 'DOVER-AFB, DE ' 80 | 'WLL' = 'WILMINGTON, DE ' 81 | 'BOC' = 'BOCAGRANDE, FL ' 82 | 'SRQ' = 'BRADENTON - SARASOTA, FL' 83 | 'CAN' = 'CAPE CANAVERAL, FL ' 84 | 'DAB' = 'DAYTONA BEACH INTERNATIONAL, FL' 85 | 'FRN' = 'FERNANDINA, FL ' 86 | 'FTL' = 'FORT LAUDERDALE, FL ' 87 | 'FMY' = 'FORT MYERS, FL ' 88 | 'FPF' = 'FORT PIERCE, FL ' 89 | 'HUR' = 'HURLBURT FIELD, FL' 90 | 'GNV' = 'J R ALISON MUNI - GAINESVILLE, FL' 91 | 'JAC' = 'JACKSONVILLE, FL ' 92 | 'KEY' = 'KEY WEST, FL ' 93 | 'LEE' = 'LEESBURG MUNICIPAL AIRPORT, FL' 94 | 'MLB' = 'MELBOURNE, FL' 95 | 'MIA' = 'MIAMI, FL ' 96 | 'APF' = 'NAPLES, FL #ARPT' 97 | 'OPF' = 'OPA LOCKA, FL' 98 | 'ORL' = 'ORLANDO, FL ' 99 | 'PAN' = 'PANAMA CITY, FL ' 100 | 'PEN' = 'PENSACOLA, FL ' 101 | 'PCF' = 'PORT CANAVERAL, FL ' 102 | 'PEV' = 'PORT EVERGLADES, FL ' 103 | 'PSJ' = 'PORT ST JOE, FL ' 104 | 'SFB' = 'SANFORD, FL ' 105 | 'SGJ' = 'ST AUGUSTINE ARPT, FL' 106 | 'SAU' = 'ST AUGUSTINE, FL ' 107 | 'FPR' = 'ST LUCIE COUNTY, FL' 108 | 'SPE' = 'ST PETERSBURG, FL ' 109 | 'TAM' = 'TAMPA, FL ' 110 | 'WPB' = 'WEST PALM BEACH, FL ' 111 | 'ATL' = 'ATLANTA, GA ' 112 | 'BRU' = 'BRUNSWICK, GA ' 113 | 'AGS' = 'BUSH FIELD - AUGUSTA, GA' 114 | 'SAV' = 'SAVANNAH, GA ' 115 | 'AGA' = 'AGANA, GU ' 116 | 'HHW' = 'HONOLULU, HI ' 117 | 'OGG' = 'KAHULUI - MAUI, HI' 118 | 'KOA' = 'KEAHOLE-KONA, HI ' 119 | 'LIH' = 'LIHUE, HI ' 120 | 'CID' = 'CEDAR RAPIDS/IOWA CITY, IA' 121 | 'DSM' = 'DES MOINES, IA' 122 | 'BOI' = 'AIR TERM. 
(GOWEN FLD) BOISE, ID' 123 | 'EPI' = 'EASTPORT, ID ' 124 | 'IDA' = 'FANNING FIELD - IDAHO FALLS, ID' 125 | 'PTL' = 'PORTHILL, ID ' 126 | 'SPI' = 'CAPITAL - SPRINGFIELD, IL' 127 | 'CHI' = 'CHICAGO, IL ' 128 | 'DPA' = 'DUPAGE COUNTY, IL' 129 | 'PIA' = 'GREATER PEORIA, IL' 130 | 'RFD' = 'GREATER ROCKFORD, IL' 131 | 'UGN' = 'MEMORIAL - WAUKEGAN, IL' 132 | 'GAR' = 'GARY, IN ' 133 | 'HMM' = 'HAMMOND, IN ' 134 | 'INP' = 'INDIANAPOLIS, IN ' 135 | 'MRL' = 'MERRILLVILLE, IN ' 136 | 'SBN' = 'SOUTH BEND, IN' 137 | 'ICT' = 'MID-CONTINENT - WITCHITA, KS' 138 | 'LEX' = 'BLUE GRASS - LEXINGTON, KY' 139 | 'LOU' = 'LOUISVILLE, KY ' 140 | 'BTN' = 'BATON ROUGE, LA ' 141 | 'LKC' = 'LAKE CHARLES, LA ' 142 | 'LAK' = 'LAKE CHARLES, LA (BPS)' 143 | 'MLU' = 'MONROE, LA' 144 | 'MGC' = 'MORGAN CITY, LA ' 145 | 'NOL' = 'NEW ORLEANS, LA ' 146 | 'BOS' = 'BOSTON, MA ' 147 | 'GLO' = 'GLOUCESTER, MA ' 148 | 'BED' = 'HANSCOM FIELD - BEDFORD, MA' 149 | 'LYN' = 'LYNDEN, WA ' 150 | 'ADW' = 'ANDREWS AFB, MD' 151 | 'BAL' = 'BALTIMORE, MD ' 152 | 'MKG' = 'MUSKEGON, MD' 153 | 'PAX' = 'PATUXENT RIVER, MD ' 154 | 'BGM' = 'BANGOR, ME ' 155 | 'BOO' = 'BOOTHBAY HARBOR, ME ' 156 | 'BWM' = 'BRIDGEWATER, ME ' 157 | 'BCK' = 'BUCKPORT, ME ' 158 | 'CLS' = 'CALAIS, ME ' 159 | 'CRB' = 'CARIBOU, ME ' 160 | 'COB' = 'COBURN GORE, ME ' 161 | 'EST' = 'EASTCOURT, ME ' 162 | 'EPT' = 'EASTPORT MUNICIPAL, ME' 163 | 'EPM' = 'EASTPORT, ME ' 164 | 'FOR' = 'FOREST CITY, ME ' 165 | 'FTF' = 'FORT FAIRFIELD, ME ' 166 | 'FTK' = 'FORT KENT, ME ' 167 | 'HML' = 'HAMIIN, ME ' 168 | 'HTM' = 'HOULTON, ME ' 169 | 'JKM' = 'JACKMAN, ME ' 170 | 'KAL' = 'KALISPEL, MT ' 171 | 'LIM' = 'LIMESTONE, ME ' 172 | 'LUB' = 'LUBEC, ME ' 173 | 'MAD' = 'MADAWASKA, ME ' 174 | 'POM' = 'PORTLAND, ME ' 175 | 'RGM' = 'RANGELEY, ME (BPS)' 176 | 'SBR' = 'SOUTH BREWER, ME ' 177 | 'SRL' = 'ST AURELIE, ME ' 178 | 'SPA' = 'ST PAMPILE, ME ' 179 | 'VNB' = 'VAN BUREN, ME ' 180 | 'VCB' = 'VANCEBORO, ME ' 181 | 'AGN' = 'ALGONAC, MI ' 182 | 'ALP' = 'ALPENA, MI ' 183 | 'BCY' = 'BAY CITY, MI ' 184 | 'DET' = 'DETROIT, MI ' 185 | 'GRP' = 'GRAND RAPIDS, MI' 186 | 'GRO' = 'GROSSE ISLE, MI ' 187 | 'ISL' = 'ISLE ROYALE, MI ' 188 | 'MRC' = 'MARINE CITY, MI ' 189 | 'MRY' = 'MARYSVILLE, MI ' 190 | 'PTK' = 'OAKLAND COUNTY - PONTIAC, MI' 191 | 'PHU' = 'PORT HURON, MI ' 192 | 'RBT' = 'ROBERTS LANDING, MI ' 193 | 'SAG' = 'SAGINAW, MI ' 194 | 'SSM' = 'SAULT STE. 
MARIE, MI ' 195 | 'SCL' = 'ST CLAIR, MI ' 196 | 'YIP' = 'WILLOW RUN - YPSILANTI, MI' 197 | 'BAU' = 'BAUDETTE, MN ' 198 | 'CAR' = 'CARIBOU MUNICIPAL AIRPORT, MN' 199 | 'GTF' = 'Collapsed into INT, MN' 200 | 'INL' = 'Collapsed into INT, MN' 201 | 'CRA' = 'CRANE LAKE, MN ' 202 | 'MIC' = 'CRYSTAL MUNICIPAL AIRPORT, MN' 203 | 'DUL' = 'DULUTH, MN ' 204 | 'ELY' = 'ELY, MN ' 205 | 'GPM' = 'GRAND PORTAGE, MN ' 206 | 'SVC' = 'GRANT COUNTY - SILVER CITY, MN' 207 | 'INT' = 'INT''L FALLS, MN ' 208 | 'LAN' = 'LANCASTER, MN ' 209 | 'MSP' = 'MINN./ST PAUL, MN ' 210 | 'LIN' = 'NORTHERN SVC CENTER, MN ' 211 | 'NOY' = 'NOYES, MN ' 212 | 'PIN' = 'PINE CREEK, MN ' 213 | '48Y' = 'PINECREEK BORDER ARPT, MN' 214 | 'RAN' = 'RAINER, MN ' 215 | 'RST' = 'ROCHESTER, MN' 216 | 'ROS' = 'ROSEAU, MN ' 217 | 'SPM' = 'ST PAUL, MN ' 218 | 'WSB' = 'WARROAD INTL, SPB, MN' 219 | 'WAR' = 'WARROAD, MN ' 220 | 'KAN' = 'KANSAS CITY, MO ' 221 | 'SGF' = 'SPRINGFIELD-BRANSON, MO' 222 | 'STL' = 'ST LOUIS, MO ' 223 | 'WHI' = 'WHITETAIL, MT ' 224 | 'WHM' = 'WILD HORSE, MT ' 225 | 'GPT' = 'BILOXI REGIONAL, MS' 226 | 'GTR' = 'GOLDEN TRIANGLE LOWNDES CNTY, MS' 227 | 'GUL' = 'GULFPORT, MS ' 228 | 'PAS' = 'PASCAGOULA, MS ' 229 | 'JAN' = 'THOMPSON FIELD - JACKSON, MS' 230 | 'BIL' = 'BILLINGS, MT ' 231 | 'BTM' = 'BUTTE, MT ' 232 | 'CHF' = 'CHIEF MT, MT ' 233 | 'CTB' = 'CUT BANK MUNICIPAL, MT' 234 | 'CUT' = 'CUT BANK, MT ' 235 | 'DLB' = 'DEL BONITA, MT ' 236 | 'EUR' = 'EUREKA, MT (BPS)' 237 | 'BZN' = 'GALLATIN FIELD - BOZEMAN, MT' 238 | 'FCA' = 'GLACIER NATIONAL PARK, MT' 239 | 'GGW' = 'GLASGOW, MT ' 240 | 'GRE' = 'GREAT FALLS, MT ' 241 | 'HVR' = 'HAVRE, MT ' 242 | 'HEL' = 'HELENA, MT ' 243 | 'LWT' = 'LEWISTON, MT ' 244 | 'MGM' = 'MORGAN, MT ' 245 | 'OPH' = 'OPHEIM, MT ' 246 | 'PIE' = 'PIEGAN, MT ' 247 | 'RAY' = 'RAYMOND, MT ' 248 | 'ROO' = 'ROOSVILLE, MT ' 249 | 'SCO' = 'SCOBEY, MT ' 250 | 'SWE' = 'SWEETGTASS, MT ' 251 | 'TRL' = 'TRIAL CREEK, MT ' 252 | 'TUR' = 'TURNER, MT ' 253 | 'WCM' = 'WILLOW CREEK, MT ' 254 | 'CLT' = 'CHARLOTTE, NC ' 255 | 'FAY' = 'FAYETTEVILLE, NC' 256 | 'MRH' = 'MOREHEAD CITY, NC ' 257 | 'FOP' = 'MORRIS FIELDS AAF, NC' 258 | 'GSO' = 'PIEDMONT TRIAD INTL AIRPORT, NC' 259 | 'RDU' = 'RALEIGH/DURHAM, NC ' 260 | 'SSC' = 'SHAW AFB - SUMTER, NC' 261 | 'WIL' = 'WILMINGTON, NC ' 262 | 'AMB' = 'AMBROSE, ND ' 263 | 'ANT' = 'ANTLER, ND ' 264 | 'CRY' = 'CARBURY, ND ' 265 | 'DNS' = 'DUNSEITH, ND ' 266 | 'FAR' = 'FARGO, ND ' 267 | 'FRT' = 'FORTUNA, ND ' 268 | 'GRF' = 'GRAND FORKS, ND ' 269 | 'HNN' = 'HANNAH, ND ' 270 | 'HNS' = 'HANSBORO, ND ' 271 | 'MAI' = 'MAIDA, ND ' 272 | 'MND' = 'MINOT, ND ' 273 | 'NEC' = 'NECHE, ND ' 274 | 'NOO' = 'NOONAN, ND ' 275 | 'NRG' = 'NORTHGATE, ND ' 276 | 'PEM' = 'PEMBINA, ND ' 277 | 'SAR' = 'SARLES, ND ' 278 | 'SHR' = 'SHERWOOD, ND ' 279 | 'SJO' = 'ST JOHN, ND ' 280 | 'WAL' = 'WALHALLA, ND ' 281 | 'WHO' = 'WESTHOPE, ND ' 282 | 'WND' = 'WILLISTON, ND ' 283 | 'OMA' = 'OMAHA, NE ' 284 | 'LEB' = 'LEBANON, NH ' 285 | 'MHT' = 'MANCHESTER, NH' 286 | 'PNH' = 'PITTSBURG, NH ' 287 | 'PSM' = 'PORTSMOUTH, NH ' 288 | 'BYO' = 'BAYONNE, NJ ' 289 | 'CNJ' = 'CAMDEN, NJ ' 290 | 'HOB' = 'HOBOKEN, NJ ' 291 | 'JER' = 'JERSEY CITY, NJ ' 292 | 'WRI' = 'MC GUIRE AFB - WRIGHTSOWN, NJ' 293 | 'MMU' = 'MORRISTOWN, NJ' 294 | 'NEW' = 'NEWARK/TETERBORO, NJ ' 295 | 'PER' = 'PERTH AMBOY, NJ ' 296 | 'ACY' = 'POMONA FIELD - ATLANTIC CITY, NJ' 297 | 'ALA' = 'ALAMAGORDO, NM (BPS)' 298 | 'ABQ' = 'ALBUQUERQUE, NM ' 299 | 'ANP' = 'ANTELOPE WELLS, NM ' 300 | 'CRL' = 'CARLSBAD, NM ' 301 | 'COL' = 'COLUMBUS, NM ' 302 | 'CDD' = 'CRANE LAKE - ST. 
LOUIS CNTY, NM' 303 | 'DNM' = 'DEMING, NM (BPS)' 304 | 'LAS' = 'LAS CRUCES, NM ' 305 | 'LOB' = 'LORDSBURG, NM (BPS)' 306 | 'RUI' = 'RUIDOSO, NM' 307 | 'STR' = 'SANTA TERESA, NM ' 308 | 'RNO' = 'CANNON INTL - RENO/TAHOE, NV' 309 | 'FLX' = 'FALLON MUNICIPAL AIRPORT, NV' 310 | 'LVG' = 'LAS VEGAS, NV ' 311 | 'REN' = 'RENO, NV ' 312 | 'ALB' = 'ALBANY, NY ' 313 | 'AXB' = 'ALEXANDRIA BAY, NY ' 314 | 'BUF' = 'BUFFALO, NY ' 315 | 'CNH' = 'CANNON CORNERS, NY' 316 | 'CAP' = 'CAPE VINCENT, NY ' 317 | 'CHM' = 'CHAMPLAIN, NY ' 318 | 'CHT' = 'CHATEAUGAY, NY ' 319 | 'CLA' = 'CLAYTON, NY ' 320 | 'FTC' = 'FORT COVINGTON, NY ' 321 | 'LAG' = 'LA GUARDIA, NY ' 322 | 'LEW' = 'LEWISTON, NY ' 323 | 'MAS' = 'MASSENA, NY ' 324 | 'MAG' = 'MCGUIRE AFB, NY ' 325 | 'MOO' = 'MOORES, NY ' 326 | 'MRR' = 'MORRISTOWN, NY ' 327 | 'NYC' = 'NEW YORK, NY ' 328 | 'NIA' = 'NIAGARA FALLS, NY ' 329 | 'OGD' = 'OGDENSBURG, NY ' 330 | 'OSW' = 'OSWEGO, NY ' 331 | 'ELM' = 'REGIONAL ARPT - HORSEHEAD, NY' 332 | 'ROC' = 'ROCHESTER, NY ' 333 | 'ROU' = 'ROUSES POINT, NY ' 334 | 'SWF' = 'STEWART - ORANGE CNTY, NY' 335 | 'SYR' = 'SYRACUSE, NY ' 336 | 'THO' = 'THOUSAND ISLAND BRIDGE, NY' 337 | 'TRO' = 'TROUT RIVER, NY ' 338 | 'WAT' = 'WATERTOWN, NY ' 339 | 'HPN' = 'WESTCHESTER - WHITE PLAINS, NY' 340 | 'WRB' = 'WHIRLPOOL BRIDGE, NY' 341 | 'YOU' = 'YOUNGSTOWN, NY ' 342 | 'AKR' = 'AKRON, OH ' 343 | 'ATB' = 'ASHTABULA, OH ' 344 | 'CIN' = 'CINCINNATI, OH ' 345 | 'CLE' = 'CLEVELAND, OH ' 346 | 'CLM' = 'COLUMBUS, OH ' 347 | 'LOR' = 'LORAIN, OH ' 348 | 'MBO' = 'MARBLE HEADS, OH ' 349 | 'SDY' = 'SANDUSKY, OH ' 350 | 'TOL' = 'TOLEDO, OH ' 351 | 'OKC' = 'OKLAHOMA CITY, OK ' 352 | 'TUL' = 'TULSA, OK' 353 | 'AST' = 'ASTORIA, OR ' 354 | 'COO' = 'COOS BAY, OR ' 355 | 'HIO' = 'HILLSBORO, OR' 356 | 'MED' = 'MEDFORD, OR ' 357 | 'NPT' = 'NEWPORT, OR ' 358 | 'POO' = 'PORTLAND, OR ' 359 | 'PUT' = 'PUT-IN-BAY, OH ' 360 | 'RDM' = 'ROBERTS FIELDS - REDMOND, OR' 361 | 'ERI' = 'ERIE, PA ' 362 | 'MDT' = 'HARRISBURG, PA' 363 | 'HSB' = 'HARRISONBURG, PA ' 364 | 'PHI' = 'PHILADELPHIA, PA ' 365 | 'PIT' = 'PITTSBURG, PA ' 366 | 'AGU' = 'AGUADILLA, PR ' 367 | 'BQN' = 'BORINQUEN - AGUADILLO, PR' 368 | 'JCP' = 'CULEBRA - BENJAMIN RIVERA, PR' 369 | 'ENS' = 'ENSENADA, PR ' 370 | 'FAJ' = 'FAJARDO, PR ' 371 | 'HUM' = 'HUMACAO, PR ' 372 | 'JOB' = 'JOBOS, PR ' 373 | 'MAY' = 'MAYAGUEZ, PR ' 374 | 'PON' = 'PONCE, PR ' 375 | 'PSE' = 'PONCE-MERCEDITA, PR' 376 | 'SAJ' = 'SAN JUAN, PR ' 377 | 'VQS' = 'VIEQUES-ARPT, PR' 378 | 'PRO' = 'PROVIDENCE, RI ' 379 | 'PVD' = 'THEODORE FRANCIS - WARWICK, RI' 380 | 'CHL' = 'CHARLESTON, SC ' 381 | 'CAE' = 'COLUMBIA, SC #ARPT' 382 | 'GEO' = 'GEORGETOWN, SC ' 383 | 'GSP' = 'GREENVILLE, SC' 384 | 'GRR' = 'GREER, SC' 385 | 'MYR' = 'MYRTLE BEACH, SC' 386 | 'SPF' = 'BLACK HILLS, SPEARFISH, SD' 387 | 'HON' = 'HOWES REGIONAL ARPT - HURON, SD' 388 | 'SAI' = 'SAIPAN, SPN ' 389 | 'TYS' = 'MC GHEE TYSON - ALCOA, TN' 390 | 'MEM' = 'MEMPHIS, TN ' 391 | 'NSV' = 'NASHVILLE, TN ' 392 | 'TRI' = 'TRI CITY ARPT, TN' 393 | 'ADS' = 'ADDISON AIRPORT- ADDISON, TX' 394 | 'ADT' = 'AMISTAD DAM, TX ' 395 | 'ANZ' = 'ANZALDUAS, TX' 396 | 'AUS' = 'AUSTIN, TX ' 397 | 'BEA' = 'BEAUMONT, TX ' 398 | 'BBP' = 'BIG BEND PARK, TX (BPS)' 399 | 'SCC' = 'BP SPEC COORD. 
CTR, TX' 400 | 'BTC' = 'BP TACTICAL UNIT, TX ' 401 | 'BOA' = 'BRIDGE OF AMERICAS, TX' 402 | 'BRO' = 'BROWNSVILLE, TX ' 403 | 'CRP' = 'CORPUS CHRISTI, TX ' 404 | 'DAL' = 'DALLAS, TX ' 405 | 'DLR' = 'DEL RIO, TX ' 406 | 'DNA' = 'DONNA, TX' 407 | 'EGP' = 'EAGLE PASS, TX ' 408 | 'ELP' = 'EL PASO, TX ' 409 | 'FAB' = 'FABENS, TX ' 410 | 'FAL' = 'FALCON HEIGHTS, TX ' 411 | 'FTH' = 'FORT HANCOCK, TX ' 412 | 'AFW' = 'FORT WORTH ALLIANCE, TX' 413 | 'FPT' = 'FREEPORT, TX ' 414 | 'GAL' = 'GALVESTON, TX ' 415 | 'HLG' = 'HARLINGEN, TX ' 416 | 'HID' = 'HIDALGO, TX ' 417 | 'HOU' = 'HOUSTON, TX ' 418 | 'SGR' = 'HULL FIELD, SUGAR LAND ARPT, TX' 419 | 'LLB' = 'JUAREZ-LINCOLN BRIDGE, TX' 420 | 'LCB' = 'LAREDO COLUMBIA BRIDGE, TX' 421 | 'LRN' = 'LAREDO NORTH, TX ' 422 | 'LAR' = 'LAREDO, TX ' 423 | 'LSE' = 'LOS EBANOS, TX ' 424 | 'IND' = 'LOS INDIOS, TX' 425 | 'LOI' = 'LOS INDIOS, TX ' 426 | 'MRS' = 'MARFA, TX (BPS)' 427 | 'MCA' = 'MCALLEN, TX ' 428 | 'MAF' = 'ODESSA REGIONAL, TX' 429 | 'PDN' = 'PASO DEL NORTE,TX ' 430 | 'PBB' = 'PEACE BRIDGE, NY ' 431 | 'PHR' = 'PHARR, TX ' 432 | 'PAR' = 'PORT ARTHUR, TX ' 433 | 'ISB' = 'PORT ISABEL, TX ' 434 | 'POE' = 'PORT OF EL PASO, TX ' 435 | 'PRE' = 'PRESIDIO, TX ' 436 | 'PGR' = 'PROGRESO, TX ' 437 | 'RIO' = 'RIO GRANDE CITY, TX ' 438 | 'ROM' = 'ROMA, TX ' 439 | 'SNA' = 'SAN ANTONIO, TX ' 440 | 'SNN' = 'SANDERSON, TX ' 441 | 'VIB' = 'VETERAN INTL BRIDGE, TX' 442 | 'YSL' = 'YSLETA, TX ' 443 | 'CHA' = 'CHARLOTTE AMALIE, VI ' 444 | 'CHR' = 'CHRISTIANSTED, VI ' 445 | 'CRU' = 'CRUZ BAY, ST JOHN, VI ' 446 | 'FRK' = 'FREDERIKSTED, VI ' 447 | 'STT' = 'ST THOMAS, VI ' 448 | 'LGU' = 'CACHE AIRPORT - LOGAN, UT' 449 | 'SLC' = 'SALT LAKE CITY, UT ' 450 | 'CHO' = 'ALBEMARLE CHARLOTTESVILLE, VA' 451 | 'DAA' = 'DAVISON AAF - FAIRFAX CNTY, VA' 452 | 'HOP' = 'HOPEWELL, VA ' 453 | 'HEF' = 'MANASSAS, VA #ARPT' 454 | 'NWN' = 'NEWPORT, VA ' 455 | 'NOR' = 'NORFOLK, VA ' 456 | 'RCM' = 'RICHMOND, VA ' 457 | 'ABS' = 'ALBURG SPRINGS, VT ' 458 | 'ABG' = 'ALBURG, VT ' 459 | 'BEB' = 'BEEBE PLAIN, VT ' 460 | 'BEE' = 'BEECHER FALLS, VT ' 461 | 'BRG' = 'BURLINGTON, VT ' 462 | 'CNA' = 'CANAAN, VT ' 463 | 'DER' = 'DERBY LINE, VT (I-91) ' 464 | 'DLV' = 'DERBY LINE, VT (RT. 
5)' 465 | 'ERC' = 'EAST RICHFORD, VT ' 466 | 'HIG' = 'HIGHGATE SPRINGS, VT ' 467 | 'MOR' = 'MORSES LINE, VT ' 468 | 'NPV' = 'NEWPORT, VT ' 469 | 'NRT' = 'NORTH TROY, VT ' 470 | 'NRN' = 'NORTON, VT ' 471 | 'PIV' = 'PINNACLE ROAD, VT ' 472 | 'RIF' = 'RICHFORT, VT ' 473 | 'STA' = 'ST ALBANS, VT ' 474 | 'SWB' = 'SWANTON, VT (BP - SECTOR HQ)' 475 | 'WBE' = 'WEST BERKSHIRE, VT ' 476 | 'ABE' = 'ABERDEEN, WA ' 477 | 'ANA' = 'ANACORTES, WA ' 478 | 'BEL' = 'BELLINGHAM, WA ' 479 | 'BLI' = 'BELLINGHAM, WASHINGTON #INTL' 480 | 'BLA' = 'BLAINE, WA ' 481 | 'BWA' = 'BOUNDARY, WA ' 482 | 'CUR' = 'CURLEW, WA (BPS)' 483 | 'DVL' = 'DANVILLE, WA ' 484 | 'EVE' = 'EVERETT, WA ' 485 | 'FER' = 'FERRY, WA ' 486 | 'FRI' = 'FRIDAY HARBOR, WA ' 487 | 'FWA' = 'FRONTIER, WA ' 488 | 'KLM' = 'KALAMA, WA ' 489 | 'LAU' = 'LAURIER, WA ' 490 | 'LON' = 'LONGVIEW, WA ' 491 | 'MET' = 'METALINE FALLS, WA ' 492 | 'MWH' = 'MOSES LAKE GRANT COUNTY ARPT, WA' 493 | 'NEA' = 'NEAH BAY, WA ' 494 | 'NIG' = 'NIGHTHAWK, WA ' 495 | 'OLY' = 'OLYMPIA, WA ' 496 | 'ORO' = 'OROVILLE, WA ' 497 | 'PWB' = 'PASCO, WA ' 498 | 'PIR' = 'POINT ROBERTS, WA ' 499 | 'PNG' = 'PORT ANGELES, WA ' 500 | 'PTO' = 'PORT TOWNSEND, WA ' 501 | 'SEA' = 'SEATTLE, WA ' 502 | 'SPO' = 'SPOKANE, WA ' 503 | 'SUM' = 'SUMAS, WA ' 504 | 'TAC' = 'TACOMA, WA ' 505 | 'PSC' = 'TRI-CITIES - PASCO, WA' 506 | 'VAN' = 'VANCOUVER, WA ' 507 | 'AGM' = 'ALGOMA, WI ' 508 | 'BAY' = 'BAYFIELD, WI ' 509 | 'GRB' = 'GREEN BAY, WI ' 510 | 'MNW' = 'MANITOWOC, WI ' 511 | 'MIL' = 'MILWAUKEE, WI ' 512 | 'MSN' = 'TRUAX FIELD - DANE COUNTY, WI' 513 | 'CHS' = 'CHARLESTON, WV ' 514 | 'CLK' = 'CLARKSBURG, WV ' 515 | 'BLF' = 'MERCER COUNTY, WV' 516 | 'CSP' = 'CASPER, WY ' 517 | 'CLG' = 'CALGARY, CANADA ' 518 | 'EDA' = 'EDMONTON, CANADA ' 519 | 'YHC' = 'HAKAI PASS, CANADA' 520 | 'HAL' = 'Halifax, NS, Canada ' 521 | 'MON' = 'MONTREAL, CANADA ' 522 | 'OTT' = 'OTTAWA, CANADA ' 523 | 'YXE' = 'SASKATOON, CANADA' 524 | 'TOR' = 'TORONTO, CANADA ' 525 | 'VCV' = 'VANCOUVER, CANADA ' 526 | 'VIC' = 'VICTORIA, CANADA ' 527 | 'WIN' = 'WINNIPEG, CANADA ' 528 | 'AMS' = 'AMSTERDAM-SCHIPHOL, NETHERLANDS' 529 | 'ARB' = 'ARUBA, NETH ANTILLES ' 530 | 'BAN' = 'BANKOK, THAILAND ' 531 | 'BEI' = 'BEICA #ARPT, ETHIOPIA' 532 | 'PEK' = 'BEIJING CAPITAL INTL, PRC' 533 | 'BDA' = 'KINDLEY FIELD, BERMUDA' 534 | 'BOG' = 'BOGOTA, EL DORADO #ARPT, COLOMBIA' 535 | 'EZE' = 'BUENOS AIRES, MINISTRO PIST, ARGENTINA' 536 | 'CUN' = 'CANCUN, MEXICO' 537 | 'CRQ' = 'CARAVELAS, BA #ARPT, BRAZIL' 538 | 'MVD' = 'CARRASCO, URUGUAY' 539 | 'DUB' = 'DUBLIN, IRELAND ' 540 | 'FOU' = 'FOUGAMOU #ARPT, GABON' 541 | 'FBA' = 'FREEPORT, BAHAMAS ' 542 | 'MTY' = 'GEN M. 
ESCOBEDO, Monterrey, MX' 543 | 'HMO' = 'GEN PESQUEIRA GARCIA, MX' 544 | 'GCM' = 'GRAND CAYMAN, CAYMAN ISLAND' 545 | 'GDL' = 'GUADALAJARA, MIGUEL HIDAL, MX' 546 | 'HAM' = 'HAMILTON, BERMUDA ' 547 | 'ICN' = 'INCHON, SEOUL KOREA' 548 | 'IWA' = 'INVALID - IWAKUNI, JAPAN' 549 | 'CND' = 'KOGALNICEANU, ROMANIA' 550 | 'LAH' = 'LABUHA ARPT, INDONESIA' 551 | 'DUR' = 'LOUIS BOTHA, SOUTH AFRICA' 552 | 'MAL' = 'MANGOLE ARPT, INDONESIA' 553 | 'MDE' = 'MEDELLIN, COLOMBIA' 554 | 'MEX' = 'JUAREZ INTL, MEXICO CITY, MX' 555 | 'LHR' = 'MIDDLESEX, ENGLAND' 556 | 'NBO' = 'NAIROBI, KENYA ' 557 | 'NAS' = 'NASSAU, BAHAMAS ' 558 | 'NCA' = 'NORTH CAICOS, TURK & CAIMAN' 559 | 'PTY' = 'OMAR TORRIJOS, PANAMA' 560 | 'SPV' = 'PAPUA, NEW GUINEA' 561 | 'UIO' = 'QUITO (MARISCAL SUCR), ECUADOR' 562 | 'RIT' = 'ROME, ITALY ' 563 | 'SNO' = 'SAKON NAKHON #ARPT, THAILAND' 564 | 'SLP' = 'SAN LUIS POTOSI #ARPT, MEXICO' 565 | 'SAN' = 'SAN SALVADOR, EL SALVADOR' 566 | 'SRO' = 'SANTANA RAMOS #ARPT, COLOMBIA' 567 | 'GRU' = 'GUARULHOS INTL, SAO PAULO, BRAZIL' 568 | 'SHA' = 'SHANNON, IRELAND ' 569 | 'HIL' = 'SHILLAVO, ETHIOPIA' 570 | 'TOK' = 'TOROKINA #ARPT, PAPUA, NEW GUINEA' 571 | 'VER' = 'VERACRUZ, MEXICO' 572 | 'LGW' = 'WEST SUSSEX, ENGLAND ' 573 | 'ZZZ' = 'MEXICO Land (Banco de Mexico) ' 574 | 'CHN' = 'No PORT Code (CHN)' 575 | 'CNC' = 'CANNON CORNERS, NY' 576 | 'MAA' = 'Abu Dhabi' 577 | 'AG0' = 'MAGNOLIA, AR' 578 | 'BHM' = 'BAR HARBOR, ME' 579 | 'BHX' = 'BIRMINGHAM, AL' 580 | 'CAK' = 'AKRON, OH' 581 | 'FOK' = 'SUFFOLK COUNTY, NY' 582 | 'LND' = 'LANDER, WY' 583 | 'MAR' = 'MARFA, TX' 584 | 'MLI' = 'MOLINE, IL' 585 | 'RIV' = 'RIVERSIDE, CA' 586 | 'RME' = 'ROME, NY' 587 | 'VNY' = 'VAN NUYS, CA' 588 | 'YUM' = 'YUMA, AZ' -------------------------------------------------------------------------------- /Data Sources.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Data Sources.docx -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/README.md: -------------------------------------------------------------------------------- 1 | # Files in My Project Repository 2 | (1). There are 6 files and 1 data folder including two data sets in my repository. 3 | (2). For the two .ipynb files, you can just open and run it in the workspace. 4 | But for 3 .py files, you have to open a new terminal to execute them by typing "python filename.py". 5 | (3). The steps for this project: 6 | * sql_queries.py ---> create_tables.py ----> etl.ipynb----> etl.py ----> test.ipynb 7 | ---> readme.md 8 | (4). Attention: each time you run create_table.py, you have to restart test.ipynb and etl.ipynb first to close the connection to 9 | database. 10 | 11 | # Purpose of this Database 12 | 13 | (1). The analytics team at Sparkify wanted to know what songs users were listening to but did not have easy assess to the data. 14 | Both datasets were on Sparkify's new music streaming app. One resided in a directory of JSON logs on user activity while the other 15 | was in the directory with JSON metadata on the songs. 16 | 17 | (2). My goal was to create a Postgres database with tables designed to optimize queries on song play analysis. 18 | 19 | 20 | # Explanation of Database Schema Design and ETL Pipeline 21 | 22 | (1). Database schema: I created a star schema optimized for queries on song analysis. 
23 | * It contained 5 tables : 24 | * one fact table:songplays with columns songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agen 25 | * four dimension tables 26 | * users - users in the app with columns user_id, first_name, last_name, gender, level 27 | * songs - songs in music database with columns song_id, title, artist_id, year, duration 28 | * artists - artists in music database with columns artist_id, name, location, lattitude, longitude 29 | * time - timestamps of records in songplays broken down into specific units with columns start_time, hour, day, week, month, year, weekday 30 | 31 | (2). ETL Pipeline 32 | * Process Song data to insert records into songs and artists tables 33 | * Process log data to create time and users tables 34 | * Extract songs table, artists table, and original log file to create songplays table (join songs table, artists table) 35 | 36 | # Example Queries and Results for Song Play Analysis 37 | 38 | query 1: show the first five rows in songplays table 39 | * query: 40 | %sql SELECT * FROM songplays LIMIT 5; 41 | 42 | * result: 43 | songplay_id start_time user_id level song_id artist_id session_id location user_agent 44 | 1 2018-11-29 00:00:57.796000 73 paid None None 954 Tampa-St. Petersburg-Clearwater, FL "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" 45 | 2 2018-11-29 00:01:30.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 46 | 3 2018-11-29 00:04:01.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 47 | 4 2018-11-29 00:04:55.796000 73 paid None None 954 Tampa-St. Petersburg-Clearwater, FL "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" 48 | 5 2018-11-29 00:07:13.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 49 | 50 | query 2: show the first five rows in users table 51 | * query: 52 | %sql SELECT * FROM users LIMIT 5; 53 | 54 | * result: 55 | user_id first_name last_name gender level 56 | 73 Jacob Klein M paid 57 | 24 Layla Griffin F paid 58 | 24 Layla Griffin F paid 59 | 73 Jacob Klein M paid 60 | 24 Layla Griffin F paid 61 | 62 | query 3: show the first five rows in songs table 63 | * query: 64 | %sql SELECT * FROM songs LIMIT 5; 65 | 66 | * result: 67 | song_id title artist_id year duration 68 | SOFNOQK12AB01840FC Kutt Free (DJ Volume Remix) ARNNKDK1187B98BBD5 0 407.37914 69 | SOBAYLL12A8C138AF9 Sono andati? Fingevo di dormire ARDR4AC1187FB371A1 0 511.16363 70 | SOFFKZS12AB017F194 A Higher Place (Album Version) ARBEBBY1187B9B43DB 1994 236.17261 71 | SOGVQGJ12AB017F169 Ten Tonne AR62SOJ1187FB47BB5 2005 337.68444 72 | SOXILUQ12A58A7C72A Jenny Take a Ride ARP6N5A1187B99D1A3 2004 207.43791 73 | 74 | query 4: What is the average duration of songs ? 
75 | * query: 76 | %sql SELECT AVG(duration) AS avg_duration FROM songs; 77 | 78 | * result: 79 | avg_duration 80 | 239.729676056338 81 | 82 | 83 | query 5: The proportion of male and female users 84 | * query: 85 | # total users 86 | %sql SELECT COUNT(gender) FROM users; 87 | # number of female users 88 | %sql SELECT COUNT(gender) AS female_count \ 89 | FROM users \ 90 | WHERE gender = 'F'; 91 | # number of male users 92 | %sql SELECT COUNT(gender) AS male_count \ 93 | FROM users \ 94 | WHERE gender = 'M'; 95 | # print proportions 96 | print(4887/6820 , 1933/6820) 97 | 98 | * result: 99 | 0.7165689149560117 0.2834310850439883 100 | 101 | # conclusion: 102 | According to the data given, there are far more female users than male users of the Sparkify music streaming app. 103 | 104 | 105 | -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/create_tables.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | from sql_queries import create_table_queries, drop_table_queries 3 | 4 | 5 | def create_database(): 6 | """ 7 | This function creates the sparkifydb database and connects to it. 8 | 9 | argument: none 10 | 11 | return: 12 | cur: the cursor object 13 | conn : the connection object 14 | """ 15 | 16 | # connect to default database 17 | conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student") 18 | conn.set_session(autocommit=True) 19 | cur = conn.cursor() 20 | 21 | # create sparkify database with UTF8 encoding 22 | cur.execute("DROP DATABASE IF EXISTS sparkifydb") 23 | cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0") 24 | 25 | # close connection to default database 26 | conn.close() 27 | 28 | # connect to sparkify database 29 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 30 | cur = conn.cursor() 31 | 32 | return cur, conn 33 | 34 | 35 | def drop_tables(cur, conn): 36 | """ 37 | This function drops all tables. 38 | 39 | argument: 40 | cur: the cursor object 41 | conn : the connection object 42 | 43 | return : 44 | none 45 | 46 | """ 47 | for query in drop_table_queries: 48 | cur.execute(query) 49 | conn.commit() 50 | 51 | 52 | def create_tables(cur, conn): 53 | """ 54 | This function creates all tables. 55 | 56 | argument: 57 | cur: the cursor object 58 | conn : the connection object 59 | 60 | return : 61 | none 62 | """ 63 | for query in create_table_queries: 64 | cur.execute(query) 65 | conn.commit() 66 | 67 | 68 | def main(): 69 | """ 70 | Creates the sparkifydb database, drops any existing tables, creates fresh tables, 71 | and then closes the connection. 72 | """ 73 | cur, conn = create_database() 74 | 75 | drop_tables(cur, conn) 76 | create_tables(cur, conn) 77 | 78 | conn.close() 79 | 80 | 81 | if __name__ == "__main__": 82 | """ 83 | The if __name__ == '__main__' guard ensures main() only runs when this script is executed directly.
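Per the project README, run this script from a terminal with python create_tables.py before running etl.py, so that a fresh database and empty tables exist for the ETL to load.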
84 | """ 85 | main() -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/etl.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import psycopg2 4 | import pandas as pd 5 | from sql_queries import * 6 | 7 | # two tables, songs and artists, are extracted from each song file 8 | def process_song_file(cur, filepath): 9 | """ 10 | This function reads in a song file and inserts song records and artist records into the songs and artists tables respectively. 11 | 12 | argument: 13 | cur: the cursor object 14 | filepath: the path to the song files 15 | 16 | return: none 17 | 18 | """ 19 | # open song file 20 | df = pd.read_json(filepath, lines=True) 21 | 22 | # etl.ipynb selects only the first (only) record in the dataframe 23 | # here, we need to process all records in all files 24 | for i in range(len(df)): 25 | # insert song record 26 | song_data = df[['song_id','title','artist_id','year','duration']].values[i].tolist() 27 | cur.execute(song_table_insert, song_data) 28 | 29 | # insert artist record 30 | artist_data = df[['artist_id','artist_name','artist_location','artist_latitude','artist_longitude']].values[i].tolist() 31 | cur.execute(artist_table_insert, artist_data) 32 | 33 | 34 | def process_log_file(cur, filepath): 35 | """ 36 | This function reads in a log file and inserts time records, user records and songplay records into the time, users and 37 | songplays tables respectively. 38 | 39 | argument: 40 | cur: the cursor object 41 | filepath: the path to the log files 42 | 43 | return: none 44 | """ 45 | # open log file 46 | df = pd.read_json(filepath, lines=True) 47 | 48 | # filter by NextSong action 49 | df = df[df['page']=='NextSong'] 50 | 51 | # convert timestamp column to datetime 52 | t = pd.to_datetime( df['ts'] , unit = 'ms') 53 | 54 | # insert time data records 55 | time_data = list((t, t.dt.hour, t.dt.day, t.dt.week,t.dt.month, t.dt.year, t.dt.dayofweek)) 56 | column_labels = ('timestamp', 'hour', 'day', 'week', 'month', 'year','weekday') 57 | time_df = pd.DataFrame(dict(zip(column_labels,time_data)),columns=column_labels).reset_index(drop=True) 58 | 59 | for i, row in time_df.iterrows(): 60 | cur.execute(time_table_insert, list(row)) 61 | 62 | # load user table 63 | user_df = df[['userId','firstName','lastName','gender','level']] 64 | 65 | # insert user records 66 | for i, row in user_df.iterrows(): 67 | cur.execute(user_table_insert, row) 68 | 69 | # insert songplay records 70 | for index, row in df.iterrows(): 71 | 72 | # get songid and artistid from song and artist tables 73 | cur.execute(song_select, (row.song, row.artist, row.length)) 74 | # row.length is the duration; row.song, row.artist and row.length match the WHERE conditions in song_select 75 | results = cur.fetchone() 76 | 77 | if results: 78 | songid, artistid = results 79 | else: 80 | songid, artistid = None, None 81 | 82 | # insert songplay record 83 | songplay_data = ( pd.to_datetime(row.ts, unit = 'ms'), row.userId, row.level, songid , artistid, row.sessionId, row.location, row.userAgent ) 84 | cur.execute(songplay_table_insert, songplay_data) 85 | 86 | 87 | def process_data(cur, conn, filepath, func): 88 | """ 89 | This function gets all JSON files under the given directory and applies func to each of them. 90 | 91 | arguments: 92 | cur: the cursor object 93 | conn: the connection object 94 | filepath : path to the files 95 | func : the processing function to call for each file 96 | 97 | return: none 98 | """ 99 | # get all files matching extension from directory 100 | all_files = [] 101 | for root, dirs, files in os.walk(filepath): 102 | files = glob.glob(os.path.join(root,'*.json')) 103 | for f in files : 104 | all_files.append(os.path.abspath(f)) 105 | 106 | # get total number of files found 107 | num_files = len(all_files) 108 | print('{} files found in {}'.format(num_files, filepath)) 109 | 110 | # iterate over files and process 111 | for i, datafile in enumerate(all_files, 1): 112 | func(cur, datafile) 113 | conn.commit() 114 | print('{}/{} files processed.'.format(i, num_files)) 115 | 116 | 117 | def main(): 118 | """ 119 | This function connects to sparkifydb and processes the song data and log data directories. 120 | 121 | argument: none 122 | 123 | return: none 124 | """ 125 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 126 | cur = conn.cursor() 127 | 128 | process_data(cur, conn, filepath='data/song_data', func=process_song_file) 129 | process_data(cur, conn, filepath='data/log_data', func=process_log_file) 130 | 131 | conn.close() 132 | 133 | 134 | if __name__ == "__main__": 135 | """ 136 | The if __name__ == '__main__' guard is the necessary condition for running main() only when this script is executed directly. 137 | """ 138 | main() -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/sql_queries.py: -------------------------------------------------------------------------------- 1 | # DROP TABLES 2 | # remember to write the full DROP TABLE IF EXISTS statement 3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays;" # all table names use plural forms, consistent with test.ipynb 4 | user_table_drop = "DROP TABLE IF EXISTS users;" # user is a reserved word 5 | song_table_drop = "DROP TABLE IF EXISTS songs;" 6 | artist_table_drop = "DROP TABLE IF EXISTS artists;" 7 | time_table_drop = "DROP TABLE IF EXISTS time;" 8 | 9 | 10 | 11 | # CREATE TABLES 12 | # each table requires a primary key.
when a table does not have such a column, create an automatically incrementing one, like songplay_id serial primary key or id serial primary key 13 | # A primary key is required to be NOT NULL 14 | # foreign keys must reference a unique key in the parent table: pay attention to who references whom 15 | # in other words, a foreign key must reference columns that either are a primary key or form a unique constraint 16 | 17 | songplay_table_create = (""" CREATE TABLE IF NOT EXISTS songplays (songplay_id serial, start_time timestamp NOT NULL, user_id int NOT NULL, level varchar, song_id varchar, artist_id varchar, session_id int, location varchar, user_agent varchar, 18 | primary key(songplay_id), 19 | foreign key (user_id) references users(user_id), 20 | foreign key(song_id) references songs(song_id), 21 | foreign key(artist_id) references artists(artist_id), 22 | foreign key(start_time) references time(start_time))""" ) 23 | 24 | user_table_create = (""" CREATE TABLE IF NOT EXISTS users (user_id int primary key, first_name varchar, last_name varchar, gender varchar, level varchar) """) 25 | 26 | song_table_create = (""" CREATE TABLE IF NOT EXISTS songs (song_id varchar primary key, title varchar, artist_id varchar, year int, duration float) """) 27 | 28 | artist_table_create = (""" CREATE TABLE IF NOT EXISTS artists (artist_id varchar primary key, artist_name varchar, artist_location varchar, artist_lattitude float, artist_longitude float ) """) 29 | 30 | # start_time time without time zone 31 | time_table_create = (""" CREATE TABLE IF NOT EXISTS time ( start_time timestamp primary key, hour int NOT NULL, day int NOT NULL, week int NOT NULL, month int NOT NULL, year int NOT NULL, weekday int NOT NULL ) """) 32 | 33 | 34 | 35 | # INSERT RECORDS 36 | # user_id, song_id and artist_id are primary keys, so they must be unique; in reality there will be conflicts, so you need to handle them 37 | 38 | # three tables require conflict handling because we set user_id, song_id and artist_id as primary keys; the users table is special because level (status) can change from free to paid 39 | 40 | # songplay_id increases automatically, so it is not listed as an insert column 41 | songplay_table_insert = (""" INSERT INTO songplays ( start_time, user_id, level, song_id, artist_id, session_id, location , user_agent) VALUES ( %s, %s, %s, %s, %s, %s, %s, %s) """) 42 | 43 | user_table_insert = (""" INSERT INTO users (user_id, first_name, last_name, gender, level) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (user_id) DO UPDATE SET level= EXCLUDED.level """) 44 | 45 | song_table_insert = (""" INSERT INTO songs (song_id, title, artist_id, year, duration) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING """) 46 | 47 | artist_table_insert = (""" INSERT INTO artists (artist_id, artist_name, artist_location, artist_lattitude, artist_longitude) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (artist_id) DO NOTHING """) 48 | 49 | time_table_insert = (""" INSERT INTO time (start_time, hour, day, week, month, year, weekday) VALUES (%s, %s, %s, %s, %s, %s, %s) 50 | ON CONFLICT (start_time) DO NOTHING """) 51 | 52 | 53 | # FIND SONGS 54 | # this is required in etl.ipynb 55 | # Implement the song_select query in sql_queries.py to find the song ID and artist ID based on the title, artist name, 56 | # and duration of a song.
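# A quick usage sketch (not part of the original file): this is how etl.py calls the query below.
# The literal title, artist and duration values here are made up purely for illustration.
#     cur.execute(song_select, (row.song, row.artist, row.length))
#     cur.execute(song_select, ('Some Song Title', 'Some Artist Name', 236.17261))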
57 | #should be left join , some songs may not have artist in record 58 | song_select = (""" SELECT s.song_id, a.artist_id 59 | FROM songs s 60 | LEFT JOIN artists a 61 | ON s.artist_id = a.artist_id 62 | WHERE title = (%s) AND artist_name = (%s) AND duration = (%s) """) 63 | # this where statement is based on what is required in etl.ipynb 64 | 65 | 66 | # QUERY LISTS 67 | # the order is important, you should execute the dimension tables first, then fact table. Because there are fks in fact table pointing to dimension tables. So dimension tables must exist first 68 | # dimension tables in the front 69 | create_table_queries = [user_table_create, song_table_create, artist_table_create, time_table_create, songplay_table_create] 70 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 71 | 72 | # reference: 73 | # upsert example to handle conflict : http://www.postgresqltutorial.com/postgresql-upsert/ -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext sql" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/plain": [ 20 | "'Connected: student@sparkifydb'" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "%sql postgresql://student:student@127.0.0.1/sparkifydb" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 42 | "5 rows affected.\n" 43 | ] 44 | }, 45 | { 46 | "data": { 47 | "text/html": [ 48 | "\n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | "
songplay_idstart_timeuser_idlevelsong_idartist_idsession_idlocationuser_agent
12018-11-29 00:00:57.79600073paidNoneNone954Tampa-St. Petersburg-Clearwater, FL"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"
22018-11-29 00:01:30.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
32018-11-29 00:04:01.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
42018-11-29 00:04:55.79600073paidNoneNone954Tampa-St. Petersburg-Clearwater, FL"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"
52018-11-29 00:07:13.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
" 116 | ], 117 | "text/plain": [ 118 | "[(1, datetime.datetime(2018, 11, 29, 0, 0, 57, 796000), 73, 'paid', None, None, 954, 'Tampa-St. Petersburg-Clearwater, FL', '\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"'),\n", 119 | " (2, datetime.datetime(2018, 11, 29, 0, 1, 30, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"'),\n", 120 | " (3, datetime.datetime(2018, 11, 29, 0, 4, 1, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"'),\n", 121 | " (4, datetime.datetime(2018, 11, 29, 0, 4, 55, 796000), 73, 'paid', None, None, 954, 'Tampa-St. Petersburg-Clearwater, FL', '\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"'),\n", 122 | " (5, datetime.datetime(2018, 11, 29, 0, 7, 13, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"')]" 123 | ] 124 | }, 125 | "execution_count": 3, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "%sql SELECT * FROM songplays LIMIT 5;" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 144 | "5 rows affected.\n" 145 | ] 146 | }, 147 | { 148 | "data": { 149 | "text/html": [ 150 | "\n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | "
user_idfirst_namelast_namegenderlevel
73JacobKleinMpaid
24LaylaGriffinFpaid
50AvaRobinsonFfree
54KalebCookMfree
32LilyBurnsFfree
" 194 | ], 195 | "text/plain": [ 196 | "[(73, 'Jacob', 'Klein', 'M', 'paid'),\n", 197 | " (24, 'Layla', 'Griffin', 'F', 'paid'),\n", 198 | " (50, 'Ava', 'Robinson', 'F', 'free'),\n", 199 | " (54, 'Kaleb', 'Cook', 'M', 'free'),\n", 200 | " (32, 'Lily', 'Burns', 'F', 'free')]" 201 | ] 202 | }, 203 | "execution_count": 4, 204 | "metadata": {}, 205 | "output_type": "execute_result" 206 | } 207 | ], 208 | "source": [ 209 | "%sql SELECT * FROM users LIMIT 5;" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 5, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "name": "stdout", 219 | "output_type": "stream", 220 | "text": [ 221 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 222 | "1 rows affected.\n" 223 | ] 224 | }, 225 | { 226 | "data": { 227 | "text/html": [ 228 | "\n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | "
song_idtitleartist_idyearduration
SOFNOQK12AB01840FCKutt Free (DJ Volume Remix)ARNNKDK1187B98BBD50407.37914
" 244 | ], 245 | "text/plain": [ 246 | "[('SOFNOQK12AB01840FC', 'Kutt Free (DJ Volume Remix)', 'ARNNKDK1187B98BBD5', 0, 407.37914)]" 247 | ] 248 | }, 249 | "execution_count": 5, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "%sql SELECT * FROM songs LIMIT 5;" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 6, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "name": "stdout", 265 | "output_type": "stream", 266 | "text": [ 267 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 268 | "1 rows affected.\n" 269 | ] 270 | }, 271 | { 272 | "data": { 273 | "text/html": [ 274 | "\n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | "
artist_idartist_nameartist_locationartist_lattitudeartist_longitude
ARNNKDK1187B98BBD5JinxZagreb Croatia45.8072615.9676
" 290 | ], 291 | "text/plain": [ 292 | "[('ARNNKDK1187B98BBD5', 'Jinx', 'Zagreb Croatia', 45.80726, 15.9676)]" 293 | ] 294 | }, 295 | "execution_count": 6, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | } 299 | ], 300 | "source": [ 301 | "%sql SELECT * FROM artists LIMIT 5;" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 7, 307 | "metadata": {}, 308 | "outputs": [ 309 | { 310 | "name": "stdout", 311 | "output_type": "stream", 312 | "text": [ 313 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 314 | "5 rows affected.\n" 315 | ] 316 | }, 317 | { 318 | "data": { 319 | "text/html": [ 320 | "\n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | "
idstart_timehourdayweekmonthyearweekday
100:00:57.796000029481120183
200:01:30.796000029481120183
300:04:01.796000029481120183
400:04:55.796000029481120183
500:07:13.796000029481120183
" 382 | ], 383 | "text/plain": [ 384 | "[(1, datetime.time(0, 0, 57, 796000), 0, 29, 48, 11, 2018, 3),\n", 385 | " (2, datetime.time(0, 1, 30, 796000), 0, 29, 48, 11, 2018, 3),\n", 386 | " (3, datetime.time(0, 4, 1, 796000), 0, 29, 48, 11, 2018, 3),\n", 387 | " (4, datetime.time(0, 4, 55, 796000), 0, 29, 48, 11, 2018, 3),\n", 388 | " (5, datetime.time(0, 7, 13, 796000), 0, 29, 48, 11, 2018, 3)]" 389 | ] 390 | }, 391 | "execution_count": 7, 392 | "metadata": {}, 393 | "output_type": "execute_result" 394 | } 395 | ], 396 | "source": [ 397 | "%sql SELECT * FROM time LIMIT 5;" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "## REMEMBER: Restart this notebook to close connection to `sparkifydb`\n", 405 | "Each time you run the cells above, remember to restart this notebook to close the connection to your database. Otherwise, you won't be able to run your code in `create_tables.py`, `etl.py`, or `etl.ipynb` files since you can't make multiple connections to the same database (in this case, sparkifydb)." 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "### Example Queries:" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 8, 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "name": "stdout", 422 | "output_type": "stream", 423 | "text": [ 424 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 425 | "(psycopg2.ProgrammingError) column \"title\" does not exist\n", 426 | "LINE 1: SELECT song_id, artist_id, title FROM songplays Order by so...\n", 427 | " ^\n", 428 | " [SQL: 'SELECT song_id, artist_id, title FROM songplays Order by song_id LIMIT 5;']\n" 429 | ] 430 | } 431 | ], 432 | "source": [ 433 | "# 1. Which song is the most popular one ?\n", 434 | "%sql SELECT song_id, artist_id, title \\\n", 435 | "FROM songplays Order by song_id LIMIT 5;" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 9, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 448 | "1 rows affected.\n" 449 | ] 450 | }, 451 | { 452 | "data": { 453 | "text/html": [ 454 | "\n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | "
avg_duration
239.729676056338
" 462 | ], 463 | "text/plain": [ 464 | "[(239.729676056338,)]" 465 | ] 466 | }, 467 | "execution_count": 9, 468 | "metadata": {}, 469 | "output_type": "execute_result" 470 | } 471 | ], 472 | "source": [ 473 | "# 2. What is the average duration of songs ?\n", 474 | "%sql SELECT AVG(duration) AS avg_duration FROM songs;" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 10, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "name": "stdout", 484 | "output_type": "stream", 485 | "text": [ 486 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 487 | "1 rows affected.\n" 488 | ] 489 | }, 490 | { 491 | "data": { 492 | "text/html": [ 493 | "\n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | "
count
6820
" 501 | ], 502 | "text/plain": [ 503 | "[(6820,)]" 504 | ] 505 | }, 506 | "execution_count": 10, 507 | "metadata": {}, 508 | "output_type": "execute_result" 509 | } 510 | ], 511 | "source": [ 512 | "# 3. The proportion of male users and female users\n", 513 | "# total user\n", 514 | "%sql SELECT COUNT(gender) FROM users;" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 11, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 527 | "1 rows affected.\n" 528 | ] 529 | }, 530 | { 531 | "data": { 532 | "text/html": [ 533 | "\n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | "
female_count
4887
" 541 | ], 542 | "text/plain": [ 543 | "[(4887,)]" 544 | ] 545 | }, 546 | "execution_count": 11, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "# number of female users\n", 553 | "%sql SELECT COUNT(gender) AS female_count \\\n", 554 | "FROM users \\\n", 555 | "WHERE gender = 'F'; \n", 556 | "# here must use single quotes. Double quotes will be treated as column names" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 12, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "name": "stdout", 566 | "output_type": "stream", 567 | "text": [ 568 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 569 | "1 rows affected.\n" 570 | ] 571 | }, 572 | { 573 | "data": { 574 | "text/html": [ 575 | "\n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | "
male_count
1933
" 583 | ], 584 | "text/plain": [ 585 | "[(1933,)]" 586 | ] 587 | }, 588 | "execution_count": 12, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "# number of male users\n", 595 | "%sql SELECT COUNT(gender) AS male_count \\\n", 596 | "FROM users \\\n", 597 | "WHERE gender = 'M';\n", 598 | "\n", 599 | "\n" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 19, 605 | "metadata": {}, 606 | "outputs": [ 607 | { 608 | "name": "stdout", 609 | "output_type": "stream", 610 | "text": [ 611 | "0.7165689149560117 0.2834310850439883\n" 612 | ] 613 | } 614 | ], 615 | "source": [ 616 | "print(4887/6820 , 1933/6820)\n" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "# references:\n", 624 | "* 1.https://towardsdatascience.com/jupyter-magics-with-sql-921370099589" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [] 654 | } 655 | ], 656 | "metadata": { 657 | "kernelspec": { 658 | "display_name": "Python 3", 659 | "language": "python", 660 | "name": "python3" 661 | }, 662 | "language_info": { 663 | "codemirror_mode": { 664 | "name": "ipython", 665 | "version": 3 666 | }, 667 | "file_extension": ".py", 668 | "mimetype": "text/x-python", 669 | "name": "python", 670 | "nbconvert_exporter": "python", 671 | "pygments_lexer": "ipython3", 672 | "version": "3.6.3" 673 | } 674 | }, 675 | "nbformat": 4, 676 | "nbformat_minor": 2 677 | } 678 | -------------------------------------------------------------------------------- /Project_1B_Data_Modeling_with_Cassandra/Readme.md: -------------------------------------------------------------------------------- 1 | Project: Data Modeling with Cassandra 2 | 3 | ## Objective: 4 | A startup called Sparkify wanted to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team was particularly interested in understanding what songs users were listening to. But there was no easy way to query the data to generate the results, since the data resided in a directory of CSV files on user activity on the app. 5 | 6 | My role was to create an Apache Cassandra database which could create queries on song play data to answer the questions. I also need to test my database by running queries given by the analytics team from Sparkify to create the results. 7 | 8 | 9 | 10 | ## Datasets 11 | For this project, there is one dataset: event_data. The directory of CSV files partitioned by date. 
12 | Here are examples of filepaths to two files in the dataset: 13 | 14 | event_data/2018-11-08-events.csv 15 | event_data/2018-11-09-events.csv 16 | link to an image of the dataset with all the columns: 17 | https://r766469c826649xjupyterlrhraafil.udacity-student-workspaces.com/files/images 18 | link to download the event data: 19 | https://r766469c826649xjupyterlrhraafil.udacity-student-workspaces.com/files/event_data 20 | 21 | 22 | Before inserting the data into the Apache Cassandra tables, I generated a smaller data set called event_datafile_new.csv from event_data by extracting the most relevant information. It contains the following columns: 23 | 24 | artist 25 | firstName of user 26 | gender of user 27 | item number in session 28 | last name of user 29 | length of the song 30 | level (paid or free song) 31 | location of the user 32 | sessionId 33 | song title 34 | userId 35 | 36 | 37 | 38 | ## Import Python packages 39 | import pandas as pd 40 | import cassandra 41 | import re 42 | import os 43 | import glob 44 | import numpy as np 45 | import json 46 | import csv 47 | 48 | 49 | 50 | ## How to run the project? 51 | All the libraries above should be installed 52 | Open Jupyter Notebook from your terminal by typing "jupyter notebook" 53 | Open the files you want to run 54 | 55 | 56 | ## Two Major Parts of the Project: 57 | 58 | Part I. Build ETL Pipeline for Pre-Processing the Files 59 | Part II. Complete the Apache Cassandra coding portion 60 | 61 | 62 | 63 | ## Project Steps 64 | 65 | 1. Model the NoSQL database (Apache Cassandra database) 66 | 2. Design tables to answer the queries outlined in the project .ipynb file 67 | 3. Write Apache Cassandra CREATE KEYSPACE and SET KEYSPACE statements 68 | 4. Develop the CREATE statement for each of the tables to address each question 69 | 5. Load the data with an INSERT statement for each of the tables 70 | 6. Include IF NOT EXISTS clauses in the CREATE statements to create tables only if the tables do not already exist. 71 | (Include a DROP TABLE statement for each table. This way you can drop and recreate the tables whenever you want to reset your 72 | database and test your ETL pipeline ) 73 | 7. Test by running the proper SELECT statements with the correct WHERE clause 74 | 75 | 76 | 77 | ## Key Point: Primary Keys in Creating Tables 78 | 79 | There are 3 queries, which means I need to create 3 corresponding tables. 80 | Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4 81 | Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182 82 | Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own' 83 | 84 | Each query requires three basic steps: create a table, shape an incoming row of data, insert that row into the table 85 | 86 | It is crucial to create a proper primary key, because it directly affects query execution later. 87 | 88 | A primary key is composed of a partition key and clustering columns. The number of clustering columns varies from case to case: some queries may require multiple 89 | while others may require none. The point to keep in mind is that the partition key is for filtering while clustering columns are for sorting.
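For illustration, here is a minimal sketch of what this looks like in code for the second query above (the table name, keyspace name and column list below are assumptions, not the exact statements from the notebook). The columns inside the inner parentheses form the composite partition key used for filtering, and itemInSession is the clustering column that orders rows within each partition:

    from cassandra.cluster import Cluster

    # keyspace name 'sparkify' is assumed for this sketch
    session = Cluster(['127.0.0.1']).connect('sparkify')
    session.execute("""
        CREATE TABLE IF NOT EXISTS songs_by_user_session (
            userid int, sessionid int, iteminsession int, artist text, song text, firstname text, lastname text,
            PRIMARY KEY ((userid, sessionid), iteminsession))""")
    # filter on the full partition key; rows come back ordered by the clustering column iteminsession
    rows = session.execute("SELECT artist, song, firstname, lastname FROM songs_by_user_session WHERE userid = 10 AND sessionid = 182")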
90 | 91 | In query 1: Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4 92 | we have two filtering conditions, sessionId and itemInSession, so the partition key is a combination of sessionId and itemInSession. 93 | Thus PRIMARY KEY((sessionId, itemInSession)) 94 | 95 | In query 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182 96 | we have two filtering conditions, userid and sessionId, so the partition key is a combination of userid and sessionId. Also, we are required to sort by itemInSession, so itemInSession should be the clustering column. 97 | Thus PRIMARY KEY((userid, sessionId), itemInSession) 98 | 99 | In query 3: Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own' 100 | We have only one filtering condition, which is the song title, so the song title should be the partition key. The partition key alone may not be 101 | enough to make the primary key unique. Since we need user information, I add userid as the clustering column. 102 | Thus PRIMARY KEY(song, userid) 103 | 104 | 105 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/README.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3 (a directory of JSON logs on the app's user activity & a directory of JSON metadata on the app's songs) 3 | 4 | 5 | ## My Task 6 | Build an ETL pipeline to extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables for the analytics team to find insights about what songs their users are listening to. 7 | 8 | 9 | ## The Datasets 10 | There are two datasets that reside in S3. 11 | 12 | (1). Link to Song data: s3://udacity-dend/song_data 13 | 14 | The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata 15 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 16 | 17 | For example, here is a filepath to a file in this dataset. 18 | song_data/A/B/C/TRABCEI128F424C983.json 19 | 20 | Single song file example : 21 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", 22 | "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 23 | 24 | (2). Link to Log data: s3://udacity-dend/log_data 25 | 26 | It consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These 27 | simulate app activity logs from an imaginary music streaming app based on configuration settings. 28 | The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset. 29 | 30 | log_data/2018/11/2018-11-12-events.json 31 | log_data/2018/11/2018-11-13-events.json 32 | 33 | 34 | ## Files in the Repository 35 | (1).sql_queries.py 36 | Defines SQL statements for creating tables, dropping tables, copying data from S3 to staging tables, and inserting data from staging 37 | tables into dimensional tables.
38 | 39 | The statements will be imported into create_tables.py and etl.py 40 | 41 | (2).create_tables.py 42 | Creates the dimension and fact tables for the star schema in Redshift 43 | 44 | (3).etl.py 45 | Executes the SQL statements defined in sql_queries.py: loads data from S3 into staging tables on Redshift and then processes that data into the analytics tables on Redshift. 46 | 47 | (4).README.md 48 | A summary and supporting material for the whole project. 49 | 50 | 51 | ## Steps to Run the Program 52 | (1). Finish sql_queries.py 53 | (2). Finish create_tables.py 54 | (3). Create an IAM role with read-only access and a security group 55 | (4). Launch a Redshift cluster (do not let your cluster run overnight or over the weekend, to prevent unexpectedly high charges) 56 | (5). Run create_tables.py in the terminal by typing "python create_tables.py". 57 | A new terminal is available if you click the "file" tab on the top left, move your cursor to "new" and select 'terminal' 58 | (6). In the query editor under your cluster you will see your database and all your tables (you must run "python create_tables.py" first) 59 | To use the query editor in your cluster, you need to make sure 60 | a. the node type meets the requirement (DC1.8xlarge, DC2.large, DC2.8xlarge or DS2.8xlarge; any one is fine) 61 | b. IAM policies are attached to enable access to the query editor 62 | make sure to select 'public' in the dropdown button under schema 63 | (7). Run etl.py in the terminal by typing "python etl.py". 64 | 65 | 66 | ## Star Schema for Song Play Analysis 67 | (1).Fact Table 68 | songplays - records in event data associated with song plays, i.e. records with page NextSong 69 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 70 | (2).Dimension Tables 71 | users - users in the app: 72 | user_id, first_name, last_name, gender, level 73 | songs - songs in the music database: 74 | song_id, title, artist_id, year, duration 75 | artists - artists in the music database: 76 | artist_id, name, location, latitude, longitude 77 | time - timestamps of records in songplays broken down into specific units: 78 | start_time, hour, day, week, month, year, weekday 79 | 80 | 81 | 82 | ## Example Queries 83 | (1).Find the popular artists 84 | 85 | (2).Find the popular songs 86 | select song_title, count(song_title) as song_played 87 | from staging_events 88 | group by song_title 89 | order by song_played desc; 90 | 91 | (3).Song plays by hour 92 | 93 | (4).Song plays by level and gender 94 | 95 | 96 | ## References: 97 | * https://www.klipfolio.com/blog/create-sql-dashboard 98 | * https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-authorization.html 99 | * https://github.com/FedericoSerini/DEND-Project-3-Data-Warehouse-AWS/blob/master/sql_queries.py 100 | * https://github.com/janjagusch/dend_03_data_warehouse 101 | * https://www.postgresql.org/docs/9.4/functions-datetime.html 102 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/create_tables.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import create_table_queries, drop_table_queries 4 | 5 | 6 | def drop_tables(cur, conn): 7 | """ 8 | This function drops tables.
9 | 10 | argument: 11 | cur: the cursor object 12 | conn : the connection object 13 | 14 | return : 15 | none 16 | """ 17 | for query in drop_table_queries: 18 | cur.execute(query) 19 | conn.commit() 20 | 21 | 22 | def create_tables(cur, conn): 23 | """ 24 | This function creates tables 25 | 26 | argument: 27 | cur: the cursor object 28 | conn : the connection object 29 | 30 | return : 31 | none 32 | """ 33 | for query in create_table_queries: 34 | cur.execute(query) 35 | conn.commit() 36 | 37 | 38 | def main(): 39 | """ 40 | When the program is executed, the Python interpreter starts executing the code inside the main function. 41 | The two functions above are executed and then the connection is closed 42 | """ 43 | config = configparser.ConfigParser() 44 | config.read('dwh.cfg') 45 | 46 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 47 | conn.set_session(autocommit=True) 48 | cur = conn.cursor() 49 | 50 | drop_tables(cur, conn) 51 | create_tables(cur, conn) 52 | 53 | conn.close() 54 | 55 | 56 | if __name__ == "__main__": 57 | """ 58 | For the Python main function, we have to use the if __name__ == '__main__' condition to execute it. 59 | """ 60 | main() -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/dwh.cfg: -------------------------------------------------------------------------------- 1 | [CLUSTER] 2 | # get the cluster info from exercise 2 IaC 3 | # this needs to be changed if it's in a different region. You can find the endpoint in the cluster details; remove the port 4 | HOST = redshift-cluster.cpynd2akmpcr.us-west-2.redshift.amazonaws.com 5 | DB_NAME = sparkify 6 | DB_USER = sparkifyuser 7 | DB_PASSWORD = Passw0rd 8 | DB_PORT = 5439 9 | # these four do not require single quotes 10 | 11 | [IAM_ROLE] 12 | ARN = 'arn:aws:iam::162431663335:role/myRedshiftRole' 13 | 14 | [S3] 15 | LOG_DATA = 's3://udacity-dend/log_data' 16 | LOG_JSONPATH = 's3://udacity-dend/log_json_path.json' 17 | SONG_DATA = 's3://udacity-dend/song_data' 18 | # these 4 require single quotes 19 | 20 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import copy_table_queries, insert_table_queries 4 | 5 | 6 | def load_staging_tables(cur, conn): 7 | """ 8 | This function loads data from S3 into staging tables on Redshift 9 | 10 | argument: 11 | cur: the cursor object 12 | conn : the connection object 13 | 14 | return : 15 | none 16 | """ 17 | for query in copy_table_queries: 18 | cur.execute(query) 19 | conn.commit() 20 | 21 | 22 | def insert_tables(cur, conn): 23 | """ 24 | This function loads data from staging tables into analytics tables on Redshift 25 | 26 | argument: 27 | cur: the cursor object 28 | conn : the connection object 29 | 30 | return : 31 | none 32 | """ 33 | for query in insert_table_queries: 34 | cur.execute(query) 35 | conn.commit() 36 | 37 | 38 | def main(): 39 | """ 40 | This main function extracts all the required information to connect to the database, then 41 | calls the load_staging_tables function to load data into the staging tables and the insert_tables function to load data into 42 | the analytics tables 43 | """ 44 | config = configparser.ConfigParser() 45 | config.read('dwh.cfg') 46 | 47 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
48 | cur = conn.cursor() 49 | 50 | load_staging_tables(cur, conn) 51 | insert_tables(cur, conn) 52 | 53 | conn.close() 54 | 55 | 56 | if __name__ == "__main__": 57 | main() 58 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/project3_star_schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_3_Data_Warehouse_with_Redshift/project3_star_schema.PNG -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/sql_queries.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | 3 | 4 | # CONFIG 5 | config = configparser.ConfigParser() 6 | config.read('dwh.cfg') 7 | 8 | 9 | 10 | # DROP TABLES 11 | # no quote characters are needed around the staging table names 12 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events" 13 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs" 14 | songplay_table_drop = "DROP TABLE IF EXISTS songplays" 15 | user_table_drop = "DROP TABLE IF EXISTS users" 16 | song_table_drop = "DROP TABLE IF EXISTS songs" 17 | artist_table_drop = "DROP TABLE IF EXISTS artists" 18 | time_table_drop = "DROP TABLE IF EXISTS times" 19 | 20 | 21 | 22 | # CREATE TABLES 23 | # every table should have a primary key 24 | # I deleted all the 'if not exists' in the table creation statements 25 | 26 | staging_events_table_create= ("""CREATE TABLE staging_events ( 27 | artist_name varchar, 28 | auth varchar, 29 | user_first_name varchar, 30 | gender varchar, 31 | itemInSession int, 32 | user_last_name varchar, 33 | length numeric, 34 | level varchar, 35 | user_location varchar, 36 | method varchar, 37 | page varchar, 38 | registration bigint, 39 | sessionId int, 40 | song_title varchar, 41 | status int, 42 | ts bigint, 43 | userAgent varchar, 44 | user_id int)"""); 45 | # ts (epoch milliseconds) is kept here and converted to start_time later, in the insert queries, to stay consistent 46 | 47 | 48 | 49 | staging_songs_table_create = ("""CREATE TABLE staging_songs( 50 | num_songs int, 51 | artist_id varchar, 52 | artist_latitude numeric, 53 | artist_longitude numeric, 54 | artist_location varchar, 55 | artist_name varchar, 56 | song_id varchar, 57 | song_title varchar, 58 | duration numeric, 59 | year int) 60 | """); 61 | 62 | 63 | 64 | songplay_table_create = ("""CREATE TABLE songplays ( 65 | songplay_id INT IDENTITY(0,1) PRIMARY KEY , 66 | start_time TIMESTAMP NOT NULL REFERENCES times(start_time), 67 | user_id INT NOT NULL REFERENCES users(user_id), 68 | level VARCHAR, 69 | song_id VARCHAR NOT NULL REFERENCES songs(song_id), 70 | artist_id VARCHAR NOT NULL REFERENCES artists(artist_id), 71 | sessionId INT , 72 | user_location VARCHAR, 73 | userAgent VARCHAR 74 | ) 75 | """); 76 | 77 | # four references not null removed 78 | 79 | 80 | user_table_create = ("""CREATE TABLE users ( 81 | user_id INT PRIMARY KEY, 82 | user_first_name VARCHAR, 83 | user_last_name VARCHAR, 84 | gender VARCHAR, 85 | level VARCHAR 86 | ) 87 | """); 88 | 89 | 90 | 91 | song_table_create = ("""CREATE TABLE songs ( 92 | song_id VARCHAR PRIMARY KEY, 93 | song_title VARCHAR, 94 | artist_id VARCHAR NOT NULL, 95 | year INTEGER, 96 | duration FLOAT 97 | ) 98 | """); 99 | 100 | 101 | 102 | artist_table_create = ("""CREATE TABLE artists ( 103 | artist_id VARCHAR PRIMARY KEY, 104 | artist_name
VARCHAR, 105 | artist_location VARCHAR, 106 | artist_latitude FLOAT, 107 | artist_longitude FLOAT 108 | ) 109 | """); 110 | 111 | 112 | 113 | time_table_create = ("""CREATE TABLE times ( 114 | start_time TIMESTAMP PRIMARY KEY, 115 | hour INTEGER , 116 | day INTEGER, 117 | week INTEGER, 118 | month VARCHAR, 119 | year INTEGER, 120 | weekday VARCHAR 121 | ) 122 | """); 123 | # changed INT to INTEGER 124 | 125 | ARN = config.get('IAM_ROLE', 'ARN') 126 | LOG_DATA = config.get('S3','LOG_DATA') 127 | LOG_JSONPATH = config.get('S3','LOG_JSONPATH') 128 | SONG_DATA = config.get('S3','SONG_DATA') 129 | 130 | # To load the sample data, you must provide authentication for your cluster to access Amazon S3 on your behalf. 131 | # You can provide either role-based authentication or key-based authentication. In this case it is role-based authentication 132 | 133 | # setting COMPUPDATE, STATUPDATE to speed up COPY 134 | staging_events_copy = (""" copy staging_events from {} 135 | iam_role {} 136 | region 'us-west-2' 137 | COMPUPDATE OFF STATUPDATE OFF 138 | JSON {} timeformat 'epochmillisecs'; """).format(LOG_DATA, ARN, LOG_JSONPATH ) 139 | # completed: 8056 rows for events 140 | 141 | 142 | 143 | staging_songs_copy = (""" copy staging_songs from {} 144 | iam_role {} 145 | region 'us-west-2' 146 | COMPUPDATE OFF STATUPDATE OFF 147 | JSON 'auto' """).format(SONG_DATA,ARN) 148 | # completed : 14896 rows for songs 149 | 150 | 151 | 152 | # FINAL TABLES 153 | # make sure when selecting columns, the orders of the columns should be the same as previous order 154 | 155 | # remember songplay_id is auto incremented so do not need to add here 156 | songplay_table_insert = (""" INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, sessionId, user_location, userAgent) 157 | SELECT DISTINCT 158 | TIMESTAMP 'epoch' + e.ts/1000 *INTERVAL '1 second' as start_time, 159 | e.user_id, 160 | e.level, 161 | s.song_id, 162 | s.artist_id, 163 | e.sessionId, 164 | e.user_location, 165 | e.userAgent 166 | FROM staging_events e 167 | JOIN staging_songs s 168 | ON e.song_title = s.song_title 169 | WHERE e.page = 'NextSong' AND e.user_id NOT IN (SELECT DISTINCT sp.user_id FROM songplays sp WHERE sp.user_id = e.user_id AND sp.start_time = e.ts AND sp.sessionId = e.sessionId) 170 | """) 171 | # here pay attention to the alias. Do not make two s . 
The program will be confused 172 | 173 | 174 | user_table_insert = (""" INSERT INTO users (user_id, user_first_name, user_last_name, gender, level) 175 | SELECT DISTINCT 176 | user_id, 177 | user_first_name, 178 | user_last_name, 179 | gender, 180 | level 181 | FROM staging_events 182 | WHERE page = 'NextSong' AND user_id NOT IN (SELECT DISTINCT user_id FROM users) 183 | """) 184 | 185 | 186 | 187 | # do not need WHERE paging = 'NextSong' coz it has nothing to do with staging_events table 188 | song_table_insert = (""" INSERT INTO songs (song_id, song_title, artist_id, year, duration) 189 | SELECT DISTINCT 190 | song_id, 191 | song_title, 192 | artist_id, 193 | year, 194 | duration 195 | FROM staging_songs 196 | WHERE song_id NOT IN ( SELECT DISTINCT song_id FROM songs) 197 | """) 198 | 199 | 200 | 201 | artist_table_insert = (""" INSERT INTO artists (artist_id, artist_name, artist_location, artist_latitude, artist_longitude) 202 | SELECT DISTINCT 203 | artist_id, 204 | artist_name, 205 | artist_location, 206 | artist_latitude, 207 | artist_longitude 208 | FROM staging_songs 209 | WHERE artist_id NOT IN (SELECT DISTINCT artist_id FROM artists) 210 | """) 211 | 212 | 213 | 214 | # do not need WHERE paging = 'NextSong' coz it has nothing to do with staging_events table 215 | 216 | # in order to use extarct(), we have to have ts as varchar 217 | 218 | time_table_insert = (""" 219 | INSERT INTO times \ 220 | (start_time, hour, day, week, month, year, weekday) \ 221 | SELECT start_time, \ 222 | EXTRACT(hr from start_time), \ 223 | EXTRACT(d from start_time), \ 224 | EXTRACT(w from start_time), \ 225 | EXTRACT(mon from start_time), \ 226 | EXTRACT(yr from start_time), \ 227 | EXTRACT(weekday from start_time) \ 228 | FROM (SELECT DISTINCT TIMESTAMP 'epoch' + ts/1000 *INTERVAL '1 second' as start_time \ 229 | FROM staging_events s)""") 230 | 231 | 232 | 233 | # QUERY LISTS 234 | # When creating tables, we should create dimension tables first, then fact table 235 | create_table_queries = [staging_events_table_create, staging_songs_table_create,user_table_create, song_table_create, artist_table_create,time_table_create,songplay_table_create] 236 | 237 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 238 | 239 | copy_table_queries = [staging_events_copy, staging_songs_copy] 240 | 241 | insert_table_queries = [user_table_insert, song_table_insert, artist_table_insert, time_table_insert, songplay_table_insert] 242 | -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/log-data/2018-11-01-events.json: -------------------------------------------------------------------------------- 1 | {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"} 2 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, 
AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 3 | {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 4 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 5 | {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 6 | {"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 7 | {"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 8 | {"artist":"Infected Mushroom","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 9 | {"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) 
Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 10 | {"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 11 | {"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"} 12 | {"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 13 | {"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 14 | {"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 15 | {"artist":"Survivor","auth":"Logged In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/log-data/2018-11-25-events.json: -------------------------------------------------------------------------------- 1 | {"artist":"matchbox twenty","auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":0,"lastName":"Duffy","length":177.65832,"level":"free","location":"Seattle-Tacoma-Bellevue, 
WA","method":"PUT","page":"NextSong","registration":1540146037796.0,"sessionId":846,"song":"Argue (LP Version)","status":200,"ts":1543109954796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 2 | {"artist":"The Lonely Island \/ T-Pain","auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":1,"lastName":"Duffy","length":156.23791,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"PUT","page":"NextSong","registration":1540146037796.0,"sessionId":846,"song":"I'm On A Boat","status":200,"ts":1543110131796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 3 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":2,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"GET","page":"Home","registration":1540146037796.0,"sessionId":846,"song":null,"status":200,"ts":1543110132796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 4 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":3,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"GET","page":"Settings","registration":1540146037796.0,"sessionId":846,"song":null,"status":200,"ts":1543110168796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 5 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":4,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"PUT","page":"Save Settings","registration":1540146037796.0,"sessionId":846,"song":null,"status":307,"ts":1543110169796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 6 | {"artist":"John Mayer","auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":0,"lastName":"Scott","length":275.27791,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"PUT","page":"NextSong","registration":1540872073796.0,"sessionId":856,"song":"All We Ever Do Is Say Goodbye","status":200,"ts":1543113347796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 7 | {"artist":null,"auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":1,"lastName":"Scott","length":null,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"GET","page":"Home","registration":1540872073796.0,"sessionId":856,"song":null,"status":200,"ts":1543113365796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 8 | {"artist":"10_000 Maniacs","auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":2,"lastName":"Scott","length":251.8722,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"PUT","page":"NextSong","registration":1540872073796.0,"sessionId":856,"song":"Gun Shy (LP Version)","status":200,"ts":1543113622796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 9 | 
{"artist":"Leona Lewis","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Cuevas","length":203.88526,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Forgive Me","status":200,"ts":1543122348796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 10 | {"artist":"Nine Inch Nails","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Cuevas","length":277.83791,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"La Mer","status":200,"ts":1543122551796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 11 | {"artist":"Audioslave","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Cuevas","length":334.91546,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Am The Highway","status":200,"ts":1543122828796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 12 | {"artist":"Kid Rock","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":3,"lastName":"Cuevas","length":296.95955,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"All Summer Long (Album Version)","status":200,"ts":1543123162796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 13 | {"artist":"The Jets","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":4,"lastName":"Cuevas","length":220.89098,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Do You","status":200,"ts":1543123458796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 14 | {"artist":"The Gerbils","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":5,"lastName":"Cuevas","length":27.01016,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"(iii)","status":200,"ts":1543123678796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 15 | {"artist":"Damian Marley \/ Stephen Marley \/ Yami Bolo","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":6,"lastName":"Cuevas","length":304.69179,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Still Searching","status":200,"ts":1543123705796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 16 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":7,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543124166796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 17 | {"artist":"The Bloody Beetroots","auth":"Logged 
In","firstName":"Chloe","gender":"F","itemInSession":8,"lastName":"Cuevas","length":201.97832,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Warp 1.9 (feat. Steve Aoki)","status":200,"ts":1543124951796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 18 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":9,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543125120796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 19 | {"artist":"The Specials","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":10,"lastName":"Cuevas","length":188.81261,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Rat Race","status":200,"ts":1543125152796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 20 | {"artist":"The Lively Ones","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":11,"lastName":"Cuevas","length":142.52363,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Walkin' The Board (LP Version)","status":200,"ts":1543125340796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 21 | {"artist":"Katie Melua","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":12,"lastName":"Cuevas","length":252.78649,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Blues In The Night","status":200,"ts":1543125482796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 22 | {"artist":"Jason Mraz","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":13,"lastName":"Cuevas","length":243.48689,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I'm Yours (Album Version)","status":200,"ts":1543125734796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 23 | {"artist":"Fisher","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":14,"lastName":"Cuevas","length":133.98159,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Rianna","status":200,"ts":1543125977796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 24 | {"artist":"Zee Avi","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":15,"lastName":"Cuevas","length":160.62649,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"No Christmas For Me","status":200,"ts":1543126110796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 25 | {"artist":"Black Eyed Peas","auth":"Logged 
In","firstName":"Chloe","gender":"F","itemInSession":16,"lastName":"Cuevas","length":289.12281,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Gotta Feeling","status":200,"ts":1543126270796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 26 | {"artist":"Emiliana Torrini","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":17,"lastName":"Cuevas","length":184.29342,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Sunny Road","status":200,"ts":1543126559796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 27 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":18,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543126820796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 28 | {"artist":"Days Of The New","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":19,"lastName":"Cuevas","length":258.5073,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"The Down Town","status":200,"ts":1543128418796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 29 | {"artist":"Julio Iglesias duet with Willie Nelson","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":20,"lastName":"Cuevas","length":212.16608,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"To All The Girls I've Loved Before (With Julio Iglesias)","status":200,"ts":1543128676796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 30 | {"artist":null,"auth":"Logged In","firstName":"Jacqueline","gender":"F","itemInSession":0,"lastName":"Lynch","length":null,"level":"paid","location":"Atlanta-Sandy Springs-Roswell, GA","method":"GET","page":"Home","registration":1540223723796.0,"sessionId":914,"song":null,"status":200,"ts":1543133256796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"29"} 31 | {"artist":"Jason Mraz & Colbie Caillat","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Roth","length":189.6224,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"PUT","page":"NextSong","registration":1540699429796.0,"sessionId":704,"song":"Lucky (Album Version)","status":200,"ts":1543135208796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 32 | {"artist":null,"auth":"Logged In","firstName":"Anabelle","gender":"F","itemInSession":0,"lastName":"Simpson","length":null,"level":"free","location":"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD","method":"GET","page":"Home","registration":1541044398796.0,"sessionId":901,"song":null,"status":200,"ts":1543145299796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 
Safari\/537.36\"","userId":"69"} 33 | {"artist":"R. Kelly","auth":"Logged In","firstName":"Anabelle","gender":"F","itemInSession":1,"lastName":"Simpson","length":234.39628,"level":"free","location":"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD","method":"PUT","page":"NextSong","registration":1541044398796.0,"sessionId":901,"song":"The World's Greatest","status":200,"ts":1543145305796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"69"} 34 | {"artist":null,"auth":"Logged In","firstName":"Kynnedi","gender":"F","itemInSession":0,"lastName":"Sanchez","length":null,"level":"free","location":"Cedar Rapids, IA","method":"GET","page":"Home","registration":1541079034796.0,"sessionId":804,"song":null,"status":200,"ts":1543151224796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"89"} 35 | {"artist":"Jacky Terrasson","auth":"Logged In","firstName":"Marina","gender":"F","itemInSession":0,"lastName":"Sutton","length":342.7522,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1541064343796.0,"sessionId":373,"song":"Le Jardin d'Hiver","status":200,"ts":1543154200796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"48"} 36 | {"artist":"Papa Roach","auth":"Logged In","firstName":"Theodore","gender":"M","itemInSession":0,"lastName":"Harris","length":202.1873,"level":"free","location":"Red Bluff, CA","method":"PUT","page":"NextSong","registration":1541097374796.0,"sessionId":813,"song":"Alive","status":200,"ts":1543155228796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"14"} 37 | {"artist":"Burt Bacharach","auth":"Logged In","firstName":"Theodore","gender":"M","itemInSession":1,"lastName":"Harris","length":156.96934,"level":"free","location":"Red Bluff, CA","method":"PUT","page":"NextSong","registration":1541097374796.0,"sessionId":813,"song":"Casino Royale Theme (Main Title)","status":200,"ts":1543155430796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"14"} 38 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":923,"song":null,"status":200,"ts":1543161483796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 39 | {"artist":"Floetry","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Cuevas","length":254.48444,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Sunshine","status":200,"ts":1543161985796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 40 | {"artist":"The Rakes","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Cuevas","length":225.2273,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Leave The City And Come Home","status":200,"ts":1543162239796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; 
rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 41 | {"artist":"Dwight Yoakam","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":3,"lastName":"Cuevas","length":239.3073,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"You're The One","status":200,"ts":1543162464796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 42 | {"artist":"Ween","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":4,"lastName":"Cuevas","length":228.10077,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Voodoo Lady","status":200,"ts":1543162703796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 43 | {"artist":"Caf\u00c3\u0083\u00c2\u00a9 Quijano","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":5,"lastName":"Cuevas","length":197.32853,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"La Lola","status":200,"ts":1543162931796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 44 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Roth","length":null,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"GET","page":"Home","registration":1540699429796.0,"sessionId":925,"song":null,"status":200,"ts":1543166945796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 45 | {"artist":"Parov Stelar","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Roth","length":203.65016,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"PUT","page":"NextSong","registration":1540699429796.0,"sessionId":925,"song":"Good Bye Emily (feat. 
Gabriella Hanninen)","status":200,"ts":1543166945796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 46 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Roth","length":null,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"GET","page":"Home","registration":1540699429796.0,"sessionId":925,"song":null,"status":200,"ts":1543166953796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 47 | {"artist":null,"auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":0,"lastName":"Levine","length":null,"level":"paid","location":"Portland-South Portland, ME","method":"GET","page":"Home","registration":1540794356796.0,"sessionId":915,"song":null,"status":200,"ts":1543169974796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 48 | {"artist":"Bryan Adams","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":1,"lastName":"Levine","length":166.29506,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"I Will Always Return","status":200,"ts":1543170092796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 49 | {"artist":"KT Tunstall","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":2,"lastName":"Levine","length":192.31302,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"White Bird","status":200,"ts":1543170258796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 50 | {"artist":"Technicolour","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":3,"lastName":"Levine","length":235.12771,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Turn Away","status":200,"ts":1543170450796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 51 | {"artist":"The Dears","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":4,"lastName":"Levine","length":289.95873,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Lost In The Plot","status":200,"ts":1543170685796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 52 | {"artist":"Go West","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":5,"lastName":"Levine","length":259.49995,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Never Let Them See You Sweat","status":200,"ts":1543170974796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 53 | {"artist":null,"auth":"Logged 
In","firstName":"Tegan","gender":"F","itemInSession":6,"lastName":"Levine","length":null,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"Logout","registration":1540794356796.0,"sessionId":915,"song":null,"status":307,"ts":1543170975796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 54 | {"artist":null,"auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":null,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"GET","page":"Home","registration":1540266185796.0,"sessionId":912,"song":null,"status":200,"ts":1543170999796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"} 55 | {"artist":null,"auth":"Logged Out","firstName":null,"gender":null,"itemInSession":7,"lastName":null,"length":null,"level":"paid","location":null,"method":"GET","page":"Home","registration":null,"sessionId":915,"song":null,"status":200,"ts":1543171228796,"userAgent":null,"userId":""} 56 | {"artist":"Gondwana","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":0,"lastName":"Hicks","length":262.5824,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"Mi Princesa","status":200,"ts":1543189690796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} 57 | {"artist":null,"auth":"Logged In","firstName":"Kevin","gender":"M","itemInSession":0,"lastName":"Arellano","length":null,"level":"free","location":"Harrisburg-Carlisle, PA","method":"GET","page":"Home","registration":1540006905796.0,"sessionId":855,"song":null,"status":200,"ts":1543189812796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"66"} 58 | {"artist":"Ella Fitzgerald","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":1,"lastName":"Hicks","length":427.15383,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"On Green Dolphin Street (Medley) (1999 Digital Remaster)","status":200,"ts":1543189952796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} 59 | {"artist":"Creedence Clearwater Revival","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":2,"lastName":"Hicks","length":184.73751,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"Run Through The Jungle","status":200,"ts":1543190379796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/.DS_Store 
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAAW128F429D538.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAABD128F429CF47.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMJAGH1187FB546F3", "artist_latitude": 35.14968, "artist_longitude": -90.04892, "artist_location": "Memphis, TN", "artist_name": "The Box Tops", "song_id": "SOCIWDW12A8C13D406", "title": "Soul Deep", "duration": 148.03546, "year": 1969} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAADZ128F9348C2E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKRRTF1187B9984DA", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sonora Santanera", "song_id": "SOXVLOJ12AB0189215", "title": "Amor De Cabaret", "duration": 177.47546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAEF128F4273421.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7G5I41187FB4CE6C", "artist_latitude": null, "artist_longitude": null, "artist_location": "London, England", "artist_name": "Adam Ant", "song_id": "SONHOTT12A8C13493C", "title": "Something Girls", "duration": 233.40363, "year": 1982} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAFD128F92F423A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARXR32B1187FB57099", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gob", "song_id": "SOFSOCN12A8C143F5D", "title": "Face the Ashes", "duration": 209.60608, "year": 2007} -------------------------------------------------------------------------------- 
/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAMO128F1481E7F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKFYS91187B98E58F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Jeff And Sheri Easter", "song_id": "SOYMRWW12A6D4FAB14", "title": "The Moon And I (Ordinary Day Album Version)", "duration": 267.7024, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAMQ128F1460CD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD0S291187B9B7BF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "Ohio", "artist_name": "Rated R", "song_id": "SOMJBYD12A6D4F8557", "title": "Keepin It Real (Skit)", "duration": 114.78159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAPK128E0786D96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR10USD1187B99F3F1", "artist_latitude": null, "artist_longitude": null, "artist_location": "Burlington, Ontario, Canada", "artist_name": "Tweeterfriendly Music", "song_id": "SOHKNRJ12A6701D1F8", "title": "Drop of Rain", "duration": 189.57016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAARJ128F9320760.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8ZCNI1187B9A069B", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Planet P Project", "song_id": "SOIAZJW12AB01853F1", "title": "Pink World", "duration": 269.81832, "year": 1984} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAVG12903CFA543.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOUDSGM12AC9618304", "title": "Insatiable (Instrumental Version)", "duration": 266.39628, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAVO128F93133D4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSJW91187B9B1D6B", "artist_latitude": 35.21962, "artist_longitude": -80.01955, "artist_location": "North Carolina", "artist_name": "JennyAnyKind", "song_id": "SOQHXMF12AB0182363", "title": "Young Boy Blues", "duration": 218.77506, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABCL128F4286650.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARC43071187B990240", "artist_latitude": null, "artist_longitude": null, "artist_location": "Wisner, LA", "artist_name": "Wayne Watson", "song_id": "SOKEJEJ12A8C13E0D0", "title": "The Urgency (LP Version)", 
"duration": 245.21098, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABDL12903CAABBA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARL7K851187B99ACD2", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Andy Andy", "song_id": "SOMUYGI12AB0188633", "title": "La Culpa", "duration": 226.35057, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABJL12903CDCF1A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARHHO3O1187B989413", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Bob Azzam", "song_id": "SORAMLE12AB017C8B0", "title": "Auguri Cha Cha", "duration": 191.84281, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABJV128F1460C49.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIK43K1187B9AE54C", "artist_latitude": null, "artist_longitude": null, "artist_location": "Beverly Hills, CA", "artist_name": "Lionel Richie", "song_id": "SOBONFF12A6D4F84D8", "title": "Tonight Will Be Alright", "duration": 307.3824, "year": 1986} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABLR128F423B7E3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD842G1187B997376", "artist_latitude": 43.64856, "artist_longitude": -79.38533, "artist_location": "Toronto, Ontario, Canada", "artist_name": "Blue Rodeo", "song_id": "SOHUOAP12A8AE488E9", "title": "Floating", "duration": 491.12771, "year": 1987} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABNV128F425CEE1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIG6O41187B988BDD", "artist_latitude": 37.16793, "artist_longitude": -95.84502, "artist_location": "United States", "artist_name": "Richard Souther", "song_id": "SOUQQEA12A8C134B1B", "title": "High Tide", "duration": 228.5971, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABRB128F9306DD5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1ZHYZ1187FB3C717", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Faiz Ali Faiz", "song_id": "SOILPQQ12AB017E82A", "title": "Sohna Nee Sohna Data", "duration": 599.24853, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABVM128F92CA9DC.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARYKCQI1187FB3B18F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Tesla", "song_id": 
"SOXLBJT12A8C140925", "title": "Caught In A Dream", "duration": 290.29832, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABXG128F9318EBD.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNPAGP1241B9C7FD4", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "lextrical", "song_id": "SOZVMJI12AB01808AF", "title": "Synthetic Dream", "duration": 165.69424, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABYN12903CFD305.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQGYP71187FB44566", "artist_latitude": 34.31109, "artist_longitude": -94.02978, "artist_location": "Mineola, AR", "artist_name": "Jimmy Wakely", "song_id": "SOWTBJW12AC468AC6E", "title": "Broken-Down Merry-Go-Round", "duration": 151.84934, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABYW128F4244559.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI3BMM1187FB4255E", "artist_latitude": 38.8991, "artist_longitude": -77.029, "artist_location": "Washington", "artist_name": "Alice Stuart", "song_id": "SOBEBDG12A58A76D60", "title": "Kassie Jones", "duration": 220.78649, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACCG128F92E8A55.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR5KOSW1187FB35FF4", "artist_latitude": 49.80388, "artist_longitude": 15.47491, "artist_location": "Dubai UAE", "artist_name": "Elena", "song_id": "SOZCTXZ12AB0182364", "title": "Setanta matins", "duration": 269.58322, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACER128F4290F96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMAC4T1187FB3FA4C", "artist_latitude": 40.82624, "artist_longitude": -74.47995, "artist_location": "Morris Plains, NJ", "artist_name": "The Dillinger Escape Plan", "song_id": "SOBBUGU12A8C13E95D", "title": "Setting Fire to Sleeping Giants", "duration": 207.77751, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACFV128F935E50B.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR47JEX1187B995D81", "artist_latitude": 37.83721, "artist_longitude": -94.35868, "artist_location": "Nevada, MO", "artist_name": "SUE THOMPSON", "song_id": "SOBLGCN12AB0183212", "title": "James (Hold The Ladder Steady)", "duration": 124.86485, "year": 1985} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACHN128F1489601.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": 
"ARGIWFO1187B9B55B7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Five Bolt Main", "song_id": "SOPSWQW12A6D4F8781", "title": "Made Like This (Live)", "duration": 225.09669, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACIW12903CC0F6D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOZQDIU12A58A7BCF6", "title": "Superconfidential", "duration": 338.31138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACLV128F427E123.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDNS031187B9924F0", "artist_latitude": 32.67828, "artist_longitude": -83.22295, "artist_location": "Georgia", "artist_name": "Tim Wilson", "song_id": "SONYPOM12A8C13B2D7", "title": "I Think My Wife Is Running Around On Me (Taco Hell)", "duration": 186.48771, "year": 2005} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACNS128F14A2DF5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROUOZZ1187B9ABE51", "artist_latitude": 40.79195, "artist_longitude": -73.94512, "artist_location": "New York, NY [Spanish Harlem]", "artist_name": "Willie Bobo", "song_id": "SOBZBAZ12A6D4F8742", "title": "Spanish Grease", "duration": 168.25424, "year": 1997} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACOW128F933E35F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARH4Z031187B9A71F2", "artist_latitude": 40.73197, "artist_longitude": -74.17418, "artist_location": "Newark, NJ", "artist_name": "Faye Adams", "song_id": "SOVYKGO12AB0187199", "title": "Crazy Mixed Up World", "duration": 156.39465, "year": 1961} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACPE128F421C1B9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARB29H41187B98F0EF", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago", "artist_name": "Terry Callier", "song_id": "SOGNCJP12A58A80271", "title": "Do You Finally Need A Friend", "duration": 342.56934, "year": 1972} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACQT128F9331780.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1Y2PT1187FB5B9CE", "artist_latitude": 27.94017, "artist_longitude": -82.32547, "artist_location": "Brandon", "artist_name": "John Wesley", "song_id": "SOLLHMX12AB01846DC", "title": "The Emperor Falls", "duration": 484.62322, "year": 0} -------------------------------------------------------------------------------- 
/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACSL128F93462F4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAJPHH1187FB5566A", "artist_latitude": 40.7038, "artist_longitude": -73.83168, "artist_location": "Queens, NY", "artist_name": "The Shangri-Las", "song_id": "SOYTPEP12AB0180E7B", "title": "Twist and Shout", "duration": 164.80608, "year": 1964} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACTB12903CAAF15.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0RCMP1187FB3F427", "artist_latitude": 30.08615, "artist_longitude": -94.10158, "artist_location": "Beaumont, TX", "artist_name": "Billie Jo Spears", "song_id": "SOGXHEG12AB018653E", "title": "It Makes No Difference Now", "duration": 133.32853, "year": 1992} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACVS128E078BE39.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREBBGV1187FB523D2", "artist_latitude": null, "artist_longitude": null, "artist_location": "Houston, TX", "artist_name": "Mike Jones (Featuring CJ_ Mello & Lil' Bran)", "song_id": "SOOLYAZ12A6701F4A6", "title": "Laws Patrolling (Album Version)", "duration": 173.66159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACZK128F4243829.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGUVEV1187B98BA17", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sierra Maestra", "song_id": "SOGOSOV12AF72A285E", "title": "\u00bfD\u00f3nde va Chichi?", "duration": 313.12934, "year": 1997} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABACN128F425B784.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOQLGFP12A58A7800E", "title": "OAKtown", "duration": 259.44771, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAFJ128F42AF24E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR3JMC51187B9AE49D", "artist_latitude": 28.53823, "artist_longitude": -81.37739, "artist_location": "Orlando, FL", "artist_name": "Backstreet Boys", "song_id": "SOPVXLX12A8C1402D5", "title": "Larger Than Life", "duration": 236.25098, "year": 1999} 
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAFP128F931E9A1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPBNLO1187FB3D52F", "artist_latitude": 40.71455, "artist_longitude": -74.00712, "artist_location": "New York, NY", "artist_name": "Tiny Tim", "song_id": "SOAOIBZ12AB01815BE", "title": "I Hold Your Hand In Mine [Live At Royal Albert Hall]", "duration": 43.36281, "year": 2000} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAIO128F42938F9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR9AWNF1187B9AB0B4", "artist_latitude": null, "artist_longitude": null, "artist_location": "Seattle, Washington USA", "artist_name": "Kenny G featuring Daryl Hall", "song_id": "SOZHPGD12A8C1394FE", "title": "Baby Come To Me", "duration": 236.93016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABATO128F42627E9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROGWRA122988FEE45", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Christos Dantis", "song_id": "SOSLAVG12A8C13397F", "title": "Den Pai Alo", "duration": 243.82649, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAVQ12903CBF7E0.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMBR4Y1187B9990EB", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "David Martin", "song_id": "SOTTDKS12AB018D69B", "title": "It Wont Be Christmas", "duration": 241.47546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAWW128F4250A31.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQ9BO41187FB5CF1F", "artist_latitude": 40.99471, "artist_longitude": -77.60454, "artist_location": "Pennsylvania", "artist_name": "John Davis", "song_id": "SOMVWWT12A58A7AE05", "title": "Knocked Out Of The Park", "duration": 183.17016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXL128F424FC50.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKULSX1187FB45F84", "artist_latitude": 39.49974, "artist_longitude": -111.54732, "artist_location": "Utah", "artist_name": "Trafik", "song_id": "SOQVMXR12A81C21483", "title": "Salt In NYC", "duration": 424.12363, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXR128F426515F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI2JSK1187FB496EF", "artist_latitude": 51.50632, "artist_longitude": 
-0.12714, "artist_location": "London, England", "artist_name": "Nick Ingman;Gavyn Wright", "song_id": "SODUJBS12A8C132150", "title": "Wessex Loses a Bride", "duration": 111.62077, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXV128F92F6AE3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDBBQ1187B98AFF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Eddie Calvert", "song_id": "SOBBXLX12A58A79DDA", "title": "Erica (2005 Digital Remaster)", "duration": 138.63138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAZH128F930419A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7ZKHQ1187B98DD73", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Glad", "song_id": "SOTUKVB12AB0181477", "title": "Blessed Assurance", "duration": 270.602, "year": 1993} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBAM128F429D223.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBGXIG122988F409D", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "Steel Rain", "song_id": "SOOJPRH12A8C141995", "title": "Loaded Like A Gun", "duration": 173.19138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBBV128F42967D7.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7SMBG1187B9B9066", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Los Manolos", "song_id": "SOBCOSW12A8C13D398", "title": "Rumba De Barcelona", "duration": 218.38322, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBJE12903CDB442.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGCY1Y1187B9A4FA5", "artist_latitude": 36.16778, "artist_longitude": -86.77836, "artist_location": "Nashville, TN.", "artist_name": "Gloriana", "song_id": "SOQOTLQ12AB01868D0", "title": "Clementina Santaf\u00e8", "duration": 153.33832, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBKX128F4285205.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR36F9J1187FB406F1", "artist_latitude": 56.27609, "artist_longitude": 9.51695, "artist_location": "Denmark", "artist_name": "Bombay Rockers", "song_id": "SOBKWDJ12A8C13B2F3", "title": "Wild Rose (Back 2 Basics Mix)", "duration": 230.71302, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBLU128F93349CF.json: 
-------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNNKDK1187B98BBD5", "artist_latitude": 45.80726, "artist_longitude": 15.9676, "artist_location": "Zagreb Croatia", "artist_name": "Jinx", "song_id": "SOFNOQK12AB01840FC", "title": "Kutt Free (DJ Volume Remix)", "duration": 407.37914, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBNP128F932546F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR62SOJ1187FB47BB5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Chase & Status", "song_id": "SOGVQGJ12AB017F169", "title": "Ten Tonne", "duration": 337.68444, "year": 2005} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBOP128F931B50D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBEBBY1187B9B43DB", "artist_latitude": null, "artist_longitude": null, "artist_location": "Gainesville, FL", "artist_name": "Tom Petty", "song_id": "SOFFKZS12AB017F194", "title": "A Higher Place (Album Version)", "duration": 236.17261, "year": 1994} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBOR128F4286200.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDR4AC1187FB371A1", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Montserrat Caball\u00e9;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti", "song_id": "SOBAYLL12A8C138AF9", "title": "Sono andati? 
Fingevo di dormire", "duration": 511.16363, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBTA128F933D304.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAGB2O1187FB3A161", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Pucho & His Latin Soul Brothers", "song_id": "SOLEYHO12AB0188A85", "title": "Got My Mojo Workin", "duration": 338.23302, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBVJ128F92F7EAA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDL271187FB40F44", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Soul Mekanik", "song_id": "SOPEGZN12AB0181B3D", "title": "Get Your Head Stuck On Your Neck", "duration": 45.66159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBXU128F92FEF48.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARP6N5A1187B99D1A3", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamtramck, MI", "artist_name": "Mitch Ryder", "song_id": "SOXILUQ12A58A7C72A", "title": "Jenny Take a Ride", "duration": 207.43791, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBZN12903CD9297.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSAFR1269FB35070", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Blingtones", "song_id": "SOTCKKY12AB018A141", "title": "Sonnerie lalaleul\u00e9 hi houuu", "duration": 29.54404, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCAJ12903CDFCC2.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCEC128F426456E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0IAWL1187B9A96D0", "artist_latitude": 8.4177, "artist_longitude": -80.11278, "artist_location": "Panama", "artist_name": "Danilo Perez", "song_id": "SONSKXP12A8C13A2C9", "title": "Native Soul", "duration": 197.19791, "year": 2003} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCEI128F424C983.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": 
"", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCFL128F149BB0D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARLTWXK1187FB5A3F8", "artist_latitude": 32.74863, "artist_longitude": -97.32925, "artist_location": "Fort Worth, TX", "artist_name": "King Curtis", "song_id": "SODREIN12A58A7F2E5", "title": "A Whiter Shade Of Pale (Live @ Fillmore West)", "duration": 326.00771, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCIX128F4265903.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNF6401187FB57032", "artist_latitude": 40.79086, "artist_longitude": -73.96644, "artist_location": "New York, NY [Manhattan]", "artist_name": "Sophie B. Hawkins", "song_id": "SONWXQJ12A8C134D94", "title": "The Ballad Of Sleeping Beauty", "duration": 305.162, "year": 1994} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCKL128F423A778.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPFHN61187FB575F6", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago, IL", "artist_name": "Lupe Fiasco", "song_id": "SOWQTQZ12A58A7B63E", "title": "Streets On Fire (Explicit Album Version)", "duration": 279.97995, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCPZ128F4275C32.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR051KA1187B98B2FF", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Wilks", "song_id": "SOLYIBD12A8C135045", "title": "Music is what we love", "duration": 261.51138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCRU128F423F449.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Marc Shaiman", "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "duration": 149.86404, "year": 2008} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCTK128F934B224.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR558FS1187FB45658", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "40 Grit", "song_id": "SOGDBUF12A8C140FAA", "title": "Intro", "duration": 75.67628, "year": 2003} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCUQ128E0783E2B.json: -------------------------------------------------------------------------------- 
1 | {"num_songs": 1, "artist_id": "ARVBRGZ1187FB4675A", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gwen Stefani", "song_id": "SORRZGD12A6310DBC3", "title": "Harajuku Girls", "duration": 290.55955, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCXB128F4286BD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARWB3G61187FB49404", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamilton, Ohio", "artist_name": "Steve Morse", "song_id": "SODAUVL12A8C13D184", "title": "Prognosis", "duration": 363.85914, "year": 2000} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCYE128F934CE1D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREVWGE1187B9B890A", "artist_latitude": -13.442, "artist_longitude": -41.9952, "artist_location": "Noci (BA)", "artist_name": "Bitter End", "song_id": "SOFCHDR12AB01866EF", "title": "Living Hell", "duration": 282.43546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Purpose of this Project 3 | A music streaming startup called Sparkify. It has grown their user base and song database and want to move their 4 | data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app 5 | as well as a directory of JSON metadata on the songs in their app. 6 | 7 | ## My Task 8 | Build an ETL pipeline that extracts data from S3, process them using spark, and load the data back into S3 as a set 9 | of dimensional tables. This would allow the data analytics team to continue finding insights in what songs their users 10 | are listening to. 11 | 12 | 13 | ## Datasets 14 | The two datasets reside in S3. Here is the links for each: 15 | * Song data: s3://udacity-dend/song_data 16 | * Log data: s3://udacity-dend/log_data 17 | 18 | The first data set is a subset of real data from Million Song Dataset. Each file is in JSON format and contains metadata 19 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 20 | eg: 21 | song_data/A/B/C/TRABCEI128F424C983.json 22 | song_data/A/A/B/TRAABJL12903CDCF1A.json 23 | 24 | The second data consists of log files in JSON format generated by event simulator based on the songs above. The activity 25 | log data is simulated from an imaginary music streaming app based on configuration settings. 26 | 27 | 28 | You can download the dataset on your desktop and do it locally. After all the process completed, load the output parquet data back to S3. 29 | 30 | 31 | 32 | ## This project includes 3 files 33 | etl.py : reads data from S3, processes the data using Spark and writes them back to S3 34 | dl.cfg: contains your aws credential ( do not expose them in public) 35 | README.md: explanation of the project process and decisions 36 | 37 | ## Schema for Song Play Analysis 38 | 39 | Using the song and log datasets to create a star schema optimized for queries on song play analysis. 40 | This includes the following tables. 
41 | 42 | Fact Table: 43 | songplays - records in the log data associated with song plays, i.e. records with page NextSong 44 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 45 | 46 | Dimension Tables: 47 | users - users in the app 48 | user_id, first_name, last_name, gender, level 49 | songs - songs in the music database 50 | song_id, title, artist_id, year, duration 51 | artists - artists in the music database 52 | artist_id, name, location, latitude, longitude 53 | time - timestamps of records in songplays broken down into specific units 54 | start_time, hour, day, week, month, year, weekday 55 | 56 | ## Steps for this project 57 | (1) Complete the code in etl.py. 58 | (2) Create a new IAM user with S3 full access 59 | (check your existing users to see whether you have one already; normally you would create 3 types of users: 60 | user 1 with read-only access, 61 | user 2 with full access, 62 | user 3 with admin access). 63 | (3) Launch a cluster (set the master user name to the one with S3 full access). 64 | (4) In the terminal, run python etl.py. 65 | 66 | 67 | ## Notes: 68 | A new user was created on AWS IAM with S3 full access. Use its access key and secret key to fill in the credentials in dl.cfg (no quotation marks needed). 69 | Cluster type: multi-node; set the number of nodes to 4. 70 | A sample analytic query against the finished star schema is sketched after the references below. 71 | 72 | 73 | ## Reference: 74 | * https://github.com/ashleyadrias/udacity-data-engineer-nanodegree-project3-data-lakes-with-spark 75 | * https://github.com/FedericoSerini/DEND-Project-4-Apache-Spark-And-Data-Lake 76 | * https://github.com/gfkw/dend-project-4 77 | * https://github.com/jukkakansanaho/udacity-dend-project-4 78 |
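The following is a minimal, hypothetical usage sketch (not part of the original project code) showing how an analyst might query the star schema once etl.py has written its parquet tables. The output bucket path and app name are assumptions; the parquet file names match the ones etl.py writes in this project.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify_analysis").getOrCreate()

# assumption: point this at the S3 prefix (or local path) where etl.py wrote its parquet output
output_data = "s3a://<your-output-bucket>"

# read the fact table and the songs dimension back from parquet
songplays = spark.read.parquet(output_data + "/songplays_table.parquet")
songs = spark.read.parquet(output_data + "/songs_table.parquet")

# top 10 most-played songs: join the fact table to the songs dimension and count plays
(songplays.join(songs, "song_id")
    .groupBy("title")
    .count()
    .orderBy("count", ascending=False)
    .show(10))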
8 | 9 | # set up the configuration 10 | config = configparser.ConfigParser() 11 | 12 | # read in credentials for S3 full access 13 | config.read('dl.cfg') 14 | os.environ['AWS_ACCESS_KEY_ID'] = config.get('CREDENTIALS', 'AWS_ACCESS_KEY_ID') 15 | os.environ['AWS_SECRET_ACCESS_KEY'] = config.get('CREDENTIALS', 'AWS_SECRET_ACCESS_KEY') 16 | 17 | 18 | def create_spark_session(): 19 | # initialize (or reuse) a spark session configured with the AWS Hadoop package and return it 20 | spark = SparkSession \ 21 | .builder \ 22 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \ 23 | .getOrCreate() 24 | return spark 25 | 26 | 27 | def process_song_data(spark, input_data, output_data): 28 | ''' 29 | Read in song data from S3, transform it into the songs and artists dimension tables, 30 | and write the tables back to S3 as parquet files. 31 | 32 | input: spark session, S3 path of the input JSON data, S3 path for the output tables 33 | ''' 34 | 35 | # get filepath to song data file 36 | # to test with a single file, use: 37 | #song_data = input_data + "song_data/A/B/C/TRABCEI128F424C983.json" 38 | song_data = input_data + "song_data/*/*/*/*.json" 39 | 40 | # read song data file 41 | song_df = spark.read.json(song_data) 42 | print("Song data schema:") 43 | song_df.printSchema() 44 | print("Total records in song data: ") 45 | print(song_df.count()) 46 | 47 | # extract columns to create the songs table (the song files name the title column 'title') 48 | songs_table = song_df.select(['song_id', 'title', 'artist_id', 'year', 'duration']).dropDuplicates() 49 | 50 | # write songs table to parquet files partitioned by year and artist 51 | songs_table.write.partitionBy('year', 'artist_id').parquet(output_data + "/songs_table.parquet") 52 | 53 | # extract columns to create the artists table 54 | artists_table = song_df.select(['artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']) \ 55 | .dropDuplicates() 56 | 57 | # write artists table to parquet files 58 | artists_table.write.parquet(output_data + "/artists_table.parquet") 59 | 60 | 61 | def process_log_data(spark, input_data, output_data): 62 | ''' 63 | Read in log data from S3, transform it into the users and time dimension tables and the 64 | songplays fact table, and write the tables back to S3 as parquet files. 65 | ''' 66 | 67 | # get filepath to log data file 68 | # to test with a single file, use: 69 | #log_data = input_data + "log_data/2018/11/2018-11-12-events.json" 70 | log_data = input_data + "log_data/*/*/*.json" 71 | 72 | # read log data file 73 | log_df = spark.read.json(log_data) 74 | 75 | # filter by actions for song plays 76 | log_df = log_df.filter(log_df.page == 'NextSong') 77 | 78 | # extract columns for the users table (the log files use camelCase column names) 79 | users_table = log_df.select(col('userId').alias('user_id'), col('firstName').alias('first_name'), 80 | col('lastName').alias('last_name'), 'gender', 'level').dropDuplicates() 81 | 82 | # write users table to parquet files 83 | users_table.write.parquet(output_data + "/users_table.parquet") 84 | 85 | # create a timestamp column from the original epoch-millisecond ts column 86 | log_df = log_df.withColumn('start_time', (col('ts') / 1000).cast('timestamp')) 87 | 88 | # break the timestamp down into the units needed for the time table 89 | log_df = log_df.withColumn('hour', hour('start_time')) \ 90 | .withColumn('day', dayofmonth('start_time')) \ 91 | .withColumn('week', weekofyear('start_time')) \ 92 | .withColumn('month', month('start_time')) \ 93 | .withColumn('year', year('start_time')) \ 94 | .withColumn('weekday', dayofweek('start_time')) 95 | log_df.printSchema() 96 | log_df.head(1) 97 | 98 | # extract columns to create the time table 99 | time_table = log_df.select(['start_time', 'hour', 'day', 'week', 'month', 'year', 'weekday']).dropDuplicates() 100 | 101 | # write time table to parquet files partitioned by year and month 102 | time_table.write.partitionBy('year', 'month').parquet(output_data + "/time_table.parquet") 103 | 104 | # read in song data again to build the songplays table 105 | #song_data = input_data + "song_data/A/B/C/TRABCEI128F424C983.json" 106 | song_data = input_data + "song_data/*/*/*/*.json" 107 | song_df = spark.read.json(song_data) 108 | # keep only the song columns needed for the join, to avoid ambiguous column names 109 | song_fields = song_df.select(['song_id', 'title', 'artist_id', 'artist_name', 'duration']) 110 | 111 | # join the log and song datasets on song title, artist name and duration to create the songplays table 112 | songplays_table = log_df.join(song_fields, (log_df.song == song_fields.title) & 113 | (log_df.artist == song_fields.artist_name) & 114 | (log_df.length == song_fields.duration), 'left') 115 | songplays_table = songplays_table.select('start_time', col('userId').alias('user_id'), 'level', 'song_id', 116 | 'artist_id', col('sessionId').alias('session_id'), 'location', 117 | col('userAgent').alias('user_agent'), 'year', 'month').dropDuplicates() 118 | songplays_table = songplays_table.withColumn('songplay_id', monotonically_increasing_id()) 119 | songplays_table.printSchema() 120 | songplays_table.head(1) 121 | 122 | # write songplays table to parquet files partitioned by year and month 123 | songplays_table.write.partitionBy('year', 'month').parquet(output_data + "/songplays_table.parquet") 124 | 125 | 126 | def main(): 127 | spark = create_spark_session() 128 | input_data = "s3a://udacity-dend/" 129 | output_data = "s3a://udacity-dend/" 130 | # to write the output to a different bucket, create that bucket first, e.g. "s3a://project4-output/" 131 | 132 | process_song_data(spark, input_data, output_data) 133 | process_log_data(spark, input_data, output_data) 134 | 135 | 136 | if __name__ == "__main__": 137 | main() 138 |
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/star_schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/star_schema.PNG
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE public.artists ( 2 | artistid varchar(256) NOT NULL, 3 | name varchar(256), 4 | location varchar(256), 5 | lattitude numeric(18,0), 6 | longitude numeric(18,0) 7 | ); 8 | 9 | CREATE TABLE public.songplays ( 10 | playid varchar(32) NOT NULL, 11 | start_time timestamp NOT NULL, 12 | userid int4 NOT NULL, 13 | "level" varchar(256), 14 | songid varchar(256), 15 | artistid varchar(256), 16 | sessionid int4, 17 | location varchar(256), 18 | user_agent varchar(256), 19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 20 | ); 21 | 22 | CREATE TABLE public.songs ( 23 | songid varchar(256) NOT NULL, 24 | title varchar(256), 25 | artistid varchar(256), 26 | "year" int4, 27 | duration numeric(18,0), 28 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 29 | ); 30 | 31 | CREATE TABLE public.staging_events ( 32 | artist varchar(256), 33 | auth varchar(256), 34 | firstname varchar(256), 35 |
gender varchar(256), 36 | iteminsession int4, 37 | lastname varchar(256), 38 | length numeric(18,0), 39 | "level" varchar(256), 40 | location varchar(256), 41 | "method" varchar(256), 42 | page varchar(256), 43 | registration numeric(18,0), 44 | sessionid int4, 45 | song varchar(256), 46 | status int4, 47 | ts int8, 48 | useragent varchar(256), 49 | userid int4 50 | ); 51 | 52 | CREATE TABLE public.staging_songs ( 53 | num_songs int4, 54 | artist_id varchar(256), 55 | artist_name varchar(256), 56 | artist_latitude numeric(18,0), 57 | artist_longitude numeric(18,0), 58 | artist_location varchar(256), 59 | song_id varchar(256), 60 | title varchar(256), 61 | duration numeric(18,0), 62 | "year" int4 63 | ); 64 | 65 | -- time dimension table used by the Load_time_dim_table task 66 | CREATE TABLE public.times ( 67 | start_time timestamp NOT NULL, 68 | "hour" int4, 69 | "day" int4, 70 | week int4, 71 | "month" int4, 72 | "year" int4, 73 | weekday int4 74 | ); 75 | 76 | CREATE TABLE public.users ( 77 | userid int4 NOT NULL, 78 | first_name varchar(256), 79 | last_name varchar(256), 80 | gender varchar(256), 81 | "level" varchar(256), 82 | CONSTRAINT users_pkey PRIMARY KEY (userid) 83 | ); 84 | 85 | 86 | 87 | 88 | 89 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/dags/udac_example_dag.py: -------------------------------------------------------------------------------- 1 | """ 2 | This DAG does the following: 3 | # copy data from AWS S3 to the AWS Redshift staging tables: staging_events and staging_songs 4 | # load data from the staging tables into the star schema tables (fact table and dimension tables) 5 | # run a data quality check to confirm that data has been populated into the star schema 6 | 7 | # in case of failure, each task is retried 3 times, with a 5-minute delay 8 | """ 9 | 10 | from datetime import datetime, timedelta 11 | import os 12 | from airflow import DAG 13 | from airflow.operators.dummy_operator import DummyOperator 14 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator, 15 | LoadDimensionOperator, DataQualityOperator) 16 | 17 | from helpers import SqlQueries 18 | 19 | # AWS_KEY = os.environ.get('AWS_KEY') 20 | # AWS_SECRET = os.environ.get('AWS_SECRET') 21 | 22 | # these are set based on the project guidelines 23 | default_args = { 24 | 'owner' : 'udacity', 25 | 'depends_on_past' : False, 26 | 'retries' : 3, 27 | 'retry_delay': timedelta(minutes = 5), 28 | 'start_date': datetime(2019, 8, 4), 29 | 'end_date' : datetime(2019, 8, 15), 30 | 'email_on_retry' : False 31 | } 32 | 33 | dag = DAG('udac_example_dag', 34 | default_args=default_args, 35 | description='Load and transform data in Redshift with Airflow', 36 | schedule_interval='@hourly', # given how the data is laid out, a daily schedule would be a better choice 37 | catchup=False # catchup is a DAG argument; 'catchup_by_default' is an airflow.cfg setting, not a default_arg 38 | ) 39 | 40 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag) 41 | 42 | 43 | stage_events_to_redshift =
StageToRedshiftOperator( 44 | task_id='Stage_events', 45 | dag=dag, 46 | target_table = 'staging_events', 47 | redshift_conn_id = "redshift", 48 | aws_credentials_id = "aws_credentials", 49 | s3_bucket = "udacity-dend", 50 | s3_key = "log_data/2018/11/2018-11-12-events.json", 51 | json_path = "s3://udacity-dend/log_json_path.json", 52 | file_type="json" 53 | ) 54 | 55 | stage_songs_to_redshift = StageToRedshiftOperator( 56 | task_id='Stage_songs', 57 | dag=dag, 58 | target_table = 'staging_songs', 59 | redshift_conn_id = "redshift", 60 | aws_credentials_id = "aws_credentials", 61 | s3_bucket = "udacity-dend", 62 | s3_key = "song_data/A/B/C/TRABCEI128F424C983.json", 63 | json_path = "auto", 64 | file_type="json" 65 | ) 66 | 67 | load_songplays_table = LoadFactOperator( 68 | task_id='Load_songplays_fact_table', 69 | dag=dag, 70 | target_table = 'songplays', 71 | redshift_conn_id = "redshift", 72 | query = SqlQueries.songplay_table_insert 73 | ) 74 | 75 | load_user_dimension_table = LoadDimensionOperator( 76 | task_id='Load_user_dim_table', 77 | dag=dag, 78 | target_table = 'users', 79 | redshift_conn_id = "redshift", 80 | query = SqlQueries.user_table_insert 81 | ) 82 | 83 | load_song_dimension_table = LoadDimensionOperator( 84 | task_id='Load_song_dim_table', 85 | dag=dag, 86 | target_table = 'songs', 87 | redshift_conn_id = "redshift", 88 | query = SqlQueries.song_table_insert 89 | ) 90 | 91 | load_artist_dimension_table = LoadDimensionOperator( 92 | task_id='Load_artist_dim_table', 93 | dag=dag, 94 | target_table = 'artists', 95 | redshift_conn_id = "redshift", 96 | query = SqlQueries.artist_table_insert 97 | ) 98 | 99 | load_time_dimension_table = LoadDimensionOperator( 100 | task_id='Load_time_dim_table', 101 | dag=dag, 102 | target_table = 'times', 103 | redshift_conn_id = "redshift", 104 | query = SqlQueries.time_table_insert 105 | ) 106 | 107 | run_quality_checks = DataQualityOperator( 108 | task_id='Run_data_quality_checks', 109 | dag=dag, 110 | tables = ['songplays', 'users', 'songs', 'artists', 'times'], 111 | redshift_conn_id = "redshift" 112 | ) 113 | 114 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 115 | 116 | # set up task dependencies 117 | # use the task variables, not the task_id strings 118 | start_operator >> stage_events_to_redshift 119 | start_operator >> stage_songs_to_redshift 120 | 121 | stage_events_to_redshift >> load_songplays_table 122 | stage_songs_to_redshift >> load_songplays_table 123 | 124 | load_songplays_table >> load_song_dimension_table 125 | load_song_dimension_table >> run_quality_checks 126 | 127 | load_songplays_table >> load_user_dimension_table 128 | load_user_dimension_table >> run_quality_checks 129 | 130 | load_songplays_table >> load_artist_dimension_table 131 | load_artist_dimension_table >> run_quality_checks 132 | 133 | load_songplays_table >> load_time_dimension_table 134 | load_time_dimension_table >> run_quality_checks 135 | 136 | run_quality_checks >> end_operator 137 | 138 | 139 | 140 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13
| operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | ''' 3 | This class is to set up the tables for inserting later 4 | ''' 5 | songplay_table_insert = (""" 6 | SELECT 7 | md5(events.sessionid || events.start_time) songplay_id, 8 | events.start_time, 9 | events.userid, 10 | events.level, 11 | songs.song_id, 12 | songs.artist_id, 13 | events.sessionid, 14 | events.location, 15 | events.useragent 16 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, * 17 | FROM staging_events 18 | WHERE page='NextSong') events 19 | LEFT JOIN staging_songs songs 20 | ON events.song = songs.title 21 | AND events.artist = songs.artist_name 22 | AND events.length = songs.duration 23 | """) 24 | 25 | user_table_insert = (""" 26 | SELECT distinct userid, firstname, lastname, gender, level 27 | FROM staging_events 28 | WHERE page='NextSong' 29 | """) 30 | 31 | song_table_insert = (""" 32 | SELECT distinct song_id, title, artist_id, year, duration 33 | FROM staging_songs 34 | """) 35 | 36 | artist_table_insert = (""" 37 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude 38 | FROM staging_songs 39 | """) 40 | 41 | time_table_insert = (""" 42 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time), extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time) 43 | FROM 
songplays 44 | """) -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.stage_redshift import StageToRedshiftOperator 2 | from operators.load_fact import LoadFactOperator 3 | from operators.load_dimension import LoadDimensionOperator 4 | from operators.data_quality import DataQualityOperator 5 | 6 | __all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/data_quality.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class DataQualityOperator(BaseOperator): 6 | """ 7 | This airflow 
operator checks the number of records in each table 8 | """ 9 | 10 | ui_color = '#89DA59' 11 | 12 | @apply_defaults 13 | def __init__(self, 14 | # Define your operators params (with defaults) here 15 | redshift_conn_id = "", 16 | tables = [], 17 | *args, **kwargs): 18 | 19 | super(DataQualityOperator, self).__init__(*args, **kwargs) 20 | # Map params here 21 | self.redshift_conn_id = redshift_conn_id 22 | self.tables = tables 23 | 24 | def execute(self, context): 25 | redshift_hook = PostgresHook( self.redshift_conn_id ) 26 | for table in self.tables: 27 | self.log.info(f"Data quality check for {table} table") 28 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}") 29 | if len(records) < 1 or len(records[0]) < 1: 30 | raise ValueError(f"Data quality check failed. {table} returned no results") 31 | 32 | num_records = records[0][0] 33 | if num_records < 1: 34 | raise ValueError(f"Data quality check failed. {table} contains zero records") 35 | self.log.info(f"Data quality check passed: {table} loaded with {num_records} records") 36 | 37 | 38 | 39 | 40 | 41 | 42 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/load_dimension.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadDimensionOperator(BaseOperator): 6 | """ 7 | This airflow operator class loads data from the AWS Redshift staging tables into the dimension tables 8 | 9 | keyword arguments: 10 | redshift_conn_id 11 | target_table 12 | query : query name from SqlQueries 13 | 14 | output: populate data into dimension tables 15 | """ 16 | 17 | ui_color = '#80BD9E' 18 | insert_sql = """ 19 | INSERT INTO {} 20 | {}; 21 | COMMIT; 22 | """ 23 | 24 | @apply_defaults 25 | def __init__(self, 26 | # Define your operators params (with defaults) here 27 | redshift_conn_id = "", 28 | target_table = "", 29 | query = "", 30 | delete_load = False, 31 | *args, **kwargs): 32 | 33 | super(LoadDimensionOperator, self).__init__(*args, **kwargs) 34 | # Map params here 35 | self.redshift_conn_id = redshift_conn_id 36 | self.target_table = target_table 37 | self.query = query 38 | self.delete_load = delete_load 39 | 40 | def execute(self, context): 41 | redshift = PostgresHook(postgres_conn_id = self.redshift_conn_id) 42 | if self.delete_load: 43 | self.log.info('Clear data from the dimension table') 44 | redshift.run("DELETE FROM {}".format(self.target_table)) 45 | 46 | self.log.info('Load data from the staging tables into the dimension table...') 47 | formatted_sql = LoadDimensionOperator.insert_sql.format(self.target_table, 48 | self.query) 49 | redshift.run(formatted_sql)
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/load_fact.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadFactOperator(BaseOperator): 6 | """ 7 | This airflow operator class loads data from the AWS Redshift staging tables into the fact table 8 | 9 | keyword arguments: 10 | redshift_conn_id 11 | target_table 12 | query : query name from SqlQueries 13 | 14 | output: populate data into fact table 15 | """ 16 | ui_color =
'#F98866' 17 | insert_sql = """ 18 | INSERT INTO {} 19 | {}; 20 | COMMIT; 21 | """ 22 | 23 | @apply_defaults 24 | # Define operators params (with defaults) here 25 | def __init__(self, 26 | redshift_conn_id = "", 27 | target_table = "", 28 | query = "", 29 | delete_load = False, 30 | *args, **kwargs): 31 | 32 | super(LoadFactOperator, self).__init__(*args, **kwargs) 33 | # Map params here 34 | self.redshift_conn_id = redshift_conn_id 35 | self.target_table = target_table 36 | self.query = query 37 | self.delete_load = delete_load 38 | 39 | def execute(self, context): 40 | redshift = PostgresHook(postgres_conn_id = self.redshift_conn_id) 41 | if self.delete_load: 42 | self.log.info('Clear data from the fact table') 43 | redshift.run("DELETE FROM {}".format(self.target_table)) 44 | 45 | self.log.info('Load data from the staging tables into the fact table...') 46 | formatted_sql = LoadFactOperator.insert_sql.format(self.target_table, 47 | self.query) 48 | redshift.run(formatted_sql) 49 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/stage_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.contrib.hooks.aws_hook import AwsHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | class StageToRedshiftOperator(BaseOperator): 7 | """ 8 | This airflow operator class loads data from an AWS S3 bucket into the Redshift staging tables 9 | 10 | keyword arguments: 11 | redshift_conn_id, 12 | aws_credentials_id , 13 | target_table , 14 | s3_bucket , 15 | s3_key , 16 | json_path , 17 | file_type , 18 | delimiter , 19 | ignore_headers 20 | 21 | output: populate data into staging tables 22 | """ 23 | ui_color = '#358140' 24 | 25 | # the copy sql statement is different for csv files and json files 26 | copy_sql_json = """ 27 | COPY {} 28 | FROM '{}' 29 | ACCESS_KEY_ID '{}' 30 | SECRET_ACCESS_KEY '{}' 31 | JSON '{}' 32 | COMPUPDATE OFF 33 | """ 34 | 35 | copy_sql_csv = """ 36 | COPY {} 37 | FROM '{}' 38 | ACCESS_KEY_ID '{}' 39 | SECRET_ACCESS_KEY '{}' 40 | IGNOREHEADER {} 41 | DELIMITER '{}' 42 | 43 | """ 44 | 45 | @apply_defaults 46 | def __init__(self, 47 | # Define your operators params (with defaults) here 48 | redshift_conn_id = "", 49 | aws_credentials_id = "", 50 | target_table = "", 51 | s3_bucket = "", 52 | s3_key = "", 53 | json_path = "" , 54 | file_type = "" , 55 | delimiter = "," , 56 | ignore_headers = 1, 57 | *args, **kwargs): 58 | 59 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 60 | # Map params here 61 | self.redshift_conn_id = redshift_conn_id 62 | self.aws_credentials_id = aws_credentials_id 63 | self.target_table = target_table 64 | self.s3_bucket = s3_bucket 65 | self.s3_key = s3_key 66 | self.json_path = json_path 67 | self.file_type = file_type 68 | self.delimiter = delimiter 69 | self.ignore_headers = ignore_headers 70 | 71 | def execute(self, context): 72 | aws_hook = AwsHook( self.aws_credentials_id ) 73 | credentials = aws_hook.get_credentials() 74 | redshift = PostgresHook( postgres_conn_id = self.redshift_conn_id) 75 | 76 | self.log.info('copy data from s3 to the redshift staging table') 77 | rendered_key = self.s3_key.format(**context) 78 | s3_path = "s3://{}/{}".format( self.s3_bucket, rendered_key) 79 | 80 | if self.file_type == "json": 81 | formatted_sql = StageToRedshiftOperator.copy_sql_json.format( 82 | self.target_table, 83 | s3_path, 84 |
credentials.access_key, 85 | credentials.secret_key, 86 | self.json_path 87 | ) 88 | redshift.run(formatted_sql) 89 | 90 | if self.file_type == "csv": 91 | formatted_sql = StageToRedshiftOperator.copy_sql_csv.format( 92 | self.target_table, 93 | s3_path, 94 | credentials.access_key, 95 | credentials.secret_key, 96 | self.ignore_headers, 97 | self.delimiter 98 | 99 | ) 100 | redshift.run(formatted_sql) 101 | 102 | 103 | 104 | 105 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/readme.md: -------------------------------------------------------------------------------- 1 | ### Introduction: 2 | A music streaming company, Sparkify, wants to introduce more automation and monitoring to their data warehouse ETL pipelines. 3 | Apache Airflow is the tool they decided to use. As a data engineer brought into this project, I need to create custom operators 4 | to perform tasks such as staging the data, filling the data warehouse and running checks on the data as the final step. 5 | 6 | 7 | ### Data sets: 8 | The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. 9 | The source datasets consist of JSON logs that describe user activity in the application and 10 | JSON metadata about the songs the users listen to. 11 | (1). Link to Song data: s3://udacity-dend/song_data 12 | The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata 13 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 14 | 15 | For example, here is the filepath to one file in this dataset: 16 | song_data/A/B/C/TRABCEI128F424C983.json 17 | 18 | Single song file example : 19 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", 20 | "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 21 | 22 | (2). Link to Log data: s3://udacity-dend/log_data 23 | It consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These 24 | simulate app activity logs from an imaginary music streaming app based on configuration settings. 25 | The log files are partitioned by year and month. For example, here are the filepaths to two files in this dataset: 26 | 27 | log_data/2018/11/2018-11-12-events.json 28 | log_data/2018/11/2018-11-13-events.json 29 | 30 | 31 | ### Star Schema for Song Play Analysis (a star schema picture can be found in this project) 32 | (1). Fact Table 33 | songplays (records in event data associated with song plays i.e. records with page NextSong): songplay_id, start_time, 34 | user_id, level, song_id, artist_id, session_id, location, user_agent 35 | 36 | (2). Dimension Tables 37 | users (users in the app): user_id, first_name, last_name, gender, level 38 | songs (songs in music database): song_id, title, artist_id, year, duration 39 | artists (artists in music database): artist_id, name, location, lattitude, longitude 40 | time (timestamps of records in songplays broken down into specific units): start_time, hour, day, week, month, year, weekday 41 | 42 | 43 | ### Files included in the Repository 44 | 2 folders and 1 file: the dags and plugins folders and the create_tables.sql file 45 | 46 | (1).
The DAG template: 47 | It has all the imports and task templates in place, but the task dependencies have not been set. 48 | 49 | (2). Plugins folder: 50 | The operators folder with operator templates: 51 | data_quality.py 52 | load_dimension.py 53 | load_fact.py 54 | stage_redshift.py 55 | A helper class for the SQL transformations: sql_queries.py 56 | (3). create_tables.sql: this file should be converted to a python file with the create- and drop-table sql statements. 57 | Please refer to the earlier projects. 58 | 59 | 60 | ### Prerequisites 61 | An AWS Redshift cluster must be running successfully. 62 | Read and write access to the S3 and AWS Redshift services. 63 | Tables must be created in Redshift before executing the DAG workflow. The create table statements can be found in create_tables.sql (refer to project 3). 64 | 65 | ### Steps to run the project 66 | (1). Launching a Redshift cluster 67 | 68 | (2). Creating tables 69 | method 1: create the tables by using the query editor and the statements from create_tables.sql 70 | method 2: refer to project 3, using the two python files 71 | 72 | (3). Adding Airflow connections 73 | On the create variable page, enter the following values: 74 | Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend" 75 | Set "Key" equal to "s3_prefix" and set "Val" equal to "data-pipelines" 76 | 77 | On the create connection page, enter the following values: 78 | Conn Id: Enter aws_credentials. 79 | Conn Type: Enter Amazon Web Services. 80 | Login: Enter your Access key ID from the IAM User credentials you downloaded earlier. 81 | Password: Enter your Secret access key from the IAM User credentials you downloaded earlier. 82 | 83 | Once you've entered these values, select Save and Add Another. 84 | On the next create connection page, enter the following values: 85 | Conn Id: Enter redshift. 86 | Conn Type: Enter Postgres. 87 | Host: Enter the endpoint of your Redshift cluster, excluding the port at the end. 88 | Schema: Enter dev. This is the Redshift database you want to connect to. 89 | Login: Enter awsuser. 90 | Password: Enter the password you created when launching your Redshift cluster. 91 | Port: Enter 5439. 92 | Once you've entered these values, select Save. 93 | 94 | Every time you open the Udacity workspace to use the Airflow UI, you have to re-create the connections and variables. 95 | 96 | (4). Running the template DAG and comparing it with the graph provided in the project instructions. 97 | Check the logs to see if they contain "operator is not implemented" messages. 98 | 99 | (5). Configuring the DAG 100 | In the DAG, add default parameters according to these guidelines: 101 | The DAG does not have dependencies on past runs 102 | On failure, the tasks are retried 3 times 103 | Retries happen every 5 minutes 104 | Catchup is turned off 105 | Do not email on retry 106 | In the DAG, configure the task dependencies and compare the graph view with what is provided in the 107 | project instructions. 108 | 109 | (6). Building the operators 110 | (The separate, functional operators for the dimension, fact and quality-control steps keep the SQL statements dynamic.) 111 | 112 | Stage Operator 113 | The stage operator is expected to be able to load any JSON formatted files from S3 to Amazon Redshift. The operator creates and 114 | runs a SQL COPY statement based on the parameters provided. The operator's parameters should specify where in S3 the file is 115 | located and what the target table is. 116 | The parameters should be used to distinguish between JSON files. Another important requirement of the stage operator is 117 | a templated field that allows it to load timestamped files from S3 based on the execution time and run backfills, as sketched below. 118 |
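As a rough illustration of that templated field (this instantiation is only an assumed example, not part of the original template; the key layout simply mirrors how the log files above are partitioned), the operator's s3_key is rendered with .format(**context) at run time, so a run-date-dependent key can be passed straight from the DAG:

    stage_events_to_redshift = StageToRedshiftOperator(
        task_id='Stage_events',
        dag=dag,
        target_table='staging_events',
        redshift_conn_id='redshift',
        aws_credentials_id='aws_credentials',
        s3_bucket='udacity-dend',
        # rendered per run, e.g. log_data/2018/11/2018-11-12-events.json
        s3_key='log_data/{execution_date.year}/{execution_date.month:02d}/{ds}-events.json',
        json_path='s3://udacity-dend/log_json_path.json',
        file_type='json'
    )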
119 | Fact and Dimension Operators 120 | With the dimension and fact operators, you can utilize the provided SQL helper class to run data transformations. Most of the logic 121 | is within the SQL transformations, and the operator is expected to take as input a SQL statement and the target database on which 122 | to run the query. You can also define a target table that will contain the results of the transformation. 123 | Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load. Thus, you 124 | could also have a parameter that allows switching between insert modes when loading dimensions. Fact tables are usually so 125 | massive that they should only allow append-type functionality. 126 | 127 | Data Quality Operator 128 | The final operator to create is the data quality operator, which is used to run checks on the data itself. The operator's 129 | main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. 130 | For each test, the test result and the expected result need to be compared, and if they do not match, the operator should 131 | raise an exception and the task should retry and eventually fail. 132 | 133 | For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows 134 | that have NULL in that column. We do not want any NULLs, so the expected result would be 0 and the test would compare the 135 | SQL statement's outcome to the expected result, as in the sketch below. 136 |
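A minimal sketch of how such test cases might be passed in (the dq_checks parameter and this loop are an assumed extension; the DataQualityOperator shipped in this repository only performs record counts):

    # assumed extension: each check pairs a SQL test with its expected result
    dq_checks = [
        {'check_sql': "SELECT COUNT(*) FROM users WHERE userid IS NULL", 'expected_result': 0},
        {'check_sql': "SELECT COUNT(*) FROM songs WHERE songid IS NULL", 'expected_result': 0}
    ]

    # inside execute(), run each check and compare the outcome with the expected value
    for check in dq_checks:
        records = redshift_hook.get_records(check['check_sql'])
        if records[0][0] != check['expected_result']:
            raise ValueError("Data quality check failed: {}".format(check['check_sql']))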
137 | (7). Running the create_tables.py script to create the tables in AWS Redshift 138 | 139 | 140 | ### Notes about Workspace 141 | After you have updated the DAG, you will need to run the /opt/airflow/start.sh command to start the Airflow web server. Once the Airflow web server is ready, you can access the Airflow UI by clicking on the blue Access Airflow button. 142 | 143 | A quality-control step must be implemented to make sure the workflow works properly, by counting the number of records in each table. 144 | 145 | Command to kill the running DAG or reset Airflow: run "pkill -f airflow" (this one did not work for me; I had to save all files and reset the data in the workspace to refresh it). 146 | 147 | To extract compressed files: 148 | tar xvzf file_name.tar.gz (by default, it will be extracted into the current working directory) 149 | 150 | ### References : 151 | (1). https://github.com/gfkw/dend-project-5 152 | 153 | (2). https://github.com/jukkakansanaho/udacity-dend-project-5 154 | 155 | (3). https://github.com/FedericoSerini/DEND-Project-5-Data-Pipelines 156 |
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Data_Engineering_Project_Portfolio 2 | 3 | Link to my certificate : https://graduation.udacity.com/nd027 4 | 5 | ## Introduction 6 | The projects in the Data Engineer Nanodegree program were designed in collaboration with a group of highly talented industry professionals. The projects put me in the role of a Data Engineer at a fictional data streaming company called “Sparkify” as it scaled its data engineering in both size and sophistication. I worked with simulated data of listening behavior, as well as a wealth of metadata related to songs and artists. At first, I started working with a small amount of data, with low complexity, processed and stored on a single machine. By the end, I developed a sophisticated set of data pipelines to work with massive amounts of data processed and stored on the cloud. There are five projects in the program. Below is a description of each. 7 | 8 | ## Udacity Data Engineering Course Content 9 | ### Course 1: Data Modeling with Postgres and Cassandra 10 | Week # Lesson 11 | * 1 Introduction to Data Modeling 12 | * 2 Relational Data Modeling with Postgres 13 | * 2 Project: Data Modeling with Postgres 14 | * 3 Non-Relational Modeling with Cassandra 15 | * 3 Project: Data Modeling with Cassandra 16 | 17 | ### Course 2: Cloud Data Warehouses 18 | Week # Lesson 19 | * 4 Introduction to the Data Warehouses 20 | * 5 Introduction to the Cloud and AWS 21 | * 6 Implementing Data Warehouses on AWS 22 | * 7 Project: Create a Cloud Data Warehouse 23 | 24 | ### Course 3: Data Lakes with S3 and Spark 25 | Week # Lesson 26 | * 8 The Power of Spark 27 | * 9 Data Wrangling with Spark 28 | * 10 Debugging and Optimizing 29 | * 11 Introduction to Data Lakes 30 | * 12 Project: Create a Data Lake 31 | 32 | ### Course 4: Data Pipelines with Airflow 33 | Week # Lesson 34 | * 13 Data Pipelines 35 | * 14 Data Quality 36 | * 15 Production Data Pipelines 37 | * 16 Project: Create Data Pipelines 38 | 39 | ### Capstone Project 40 | Week # Lesson 41 | * 17 Project: Capstone 42 | 43 | ## Projects 44 | 45 | ### Project 1 - Data Modeling 46 | In this project, you’ll model user activity data for a music streaming app called Sparkify. The project is done in two parts. You’ll create a database and import data stored in CSV and JSON files, and model the data. You’ll do this first with a relational model in Postgres, then with a NoSQL data model with Apache Cassandra. You’ll design the data models to optimize queries for understanding what songs users are listening to. For PostgreSQL, you will also define Fact and Dimension tables and insert data into your new tables. For Apache Cassandra, you will model your data to help the data team at Sparkify answer queries about app usage. You will set up your Apache Cassandra database tables in ways to optimize writes of transactional data on user sessions. 47 | 48 | ### Project 2 - Cloud Data Warehousing 49 | In this project, you’ll move to the cloud as you work with larger amounts of data. You are tasked with building an ELT pipeline that extracts Sparkify’s data from S3, Amazon’s popular storage system. From there, you’ll stage the data in Amazon Redshift and transform it into a set of fact and dimensional tables for the Sparkify analytics team to continue finding insights in what songs their users are listening to. 50 | 51 | ### Project 3 - Data Lakes with Apache Spark 52 | In this project, you'll build an ETL pipeline for a data lake. The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app. You will load data from S3, process the data into analytics tables using Spark, and load them back into S3. You'll deploy this Spark process on a cluster using AWS. 53 | 54 | ### Project 4 - Data Pipelines with Apache Airflow 55 | In this project, you’ll continue your work on Sparkify’s data infrastructure by creating and automating a set of data pipelines.
You’ll use the up-and-coming tool Apache Airflow, developed and open-sourced by Airbnb and the Apache Foundation. You’ll configure and schedule data pipelines with Airflow, setting dependencies, triggers, and quality checks as you would in a production setting. 56 | 57 | ### Project 5 - Data Engineering Capstone 58 | The capstone project is an opportunity for you to combine what you've learned throughout the program into a more self-driven project. In this project, you'll define the scope of the project and the data you'll be working with. We'll provide guidelines, suggestions, tips, and resources to help you be successful, but your project will be unique to you. You'll gather data from several different data sources; transform, combine, and summarize it; and create a clean database for others to analyze. 59 | --------------------------------------------------------------------------------