├── A brief introduction of parquet files.docx ├── Capstone_Project ├── .ipynb_checkpoints │ └── Capstone_Project_with_Immigration_Data-checkpoint.ipynb ├── Capstone_Project_Immigration_Data.ipynb ├── I94_SAS_Labels_Descriptions.SAS ├── airport_codes.csv ├── dl.cfg ├── immigration_data_sample.csv ├── notes for uploading data ro s3.txt ├── readme.md ├── ref │ ├── dictionary for star schema 1.PNG │ ├── model for star schema 1 .PNG │ ├── star schema 2.1.PNG │ ├── star schema 2.3.PNG │ ├── star schema 2.4.PNG │ └── star shema 2.2.PNG ├── us_cities_demographics.csv └── valid_i94port.txt ├── Data Sources.docx ├── Project_1A_Data_Modeling_with_Postgres ├── README.md ├── create_tables.py ├── etl.ipynb ├── etl.py ├── sql_queries.py └── test.ipynb ├── Project_1B_Data_Modeling_with_Cassandra ├── Project_1B_ ETL_for_NoSQL_Modeling_with_Cassandra.ipynb ├── Readme.md └── event_datafile_new.csv ├── Project_3_Data_Warehouse_with_Redshift ├── README.md ├── create_tables.py ├── dwh.cfg ├── etl.py ├── project3_star_schema.PNG └── sql_queries.py ├── Project_4_Data_Lake_with_Spark ├── Data │ ├── log-data │ │ ├── 2018-11-01-events.json │ │ ├── 2018-11-02-events.json │ │ ├── 2018-11-03-events.json │ │ ├── 2018-11-04-events.json │ │ ├── 2018-11-05-events.json │ │ ├── 2018-11-06-events.json │ │ ├── 2018-11-07-events.json │ │ ├── 2018-11-08-events.json │ │ ├── 2018-11-09-events.json │ │ ├── 2018-11-10-events.json │ │ ├── 2018-11-11-events.json │ │ ├── 2018-11-12-events.json │ │ ├── 2018-11-13-events.json │ │ ├── 2018-11-14-events.json │ │ ├── 2018-11-15-events.json │ │ ├── 2018-11-16-events.json │ │ ├── 2018-11-17-events.json │ │ ├── 2018-11-18-events.json │ │ ├── 2018-11-19-events.json │ │ ├── 2018-11-20-events.json │ │ ├── 2018-11-21-events.json │ │ ├── 2018-11-22-events.json │ │ ├── 2018-11-23-events.json │ │ ├── 2018-11-24-events.json │ │ ├── 2018-11-25-events.json │ │ ├── 2018-11-26-events.json │ │ ├── 2018-11-27-events.json │ │ ├── 2018-11-28-events.json │ │ ├── 2018-11-29-events.json │ │ └── 2018-11-30-events.json │ └── song-data │ │ └── song_data │ │ ├── .DS_Store │ │ └── A │ │ ├── .DS_Store │ │ ├── A │ │ ├── .DS_Store │ │ ├── A │ │ │ ├── TRAAAAW128F429D538.json │ │ │ ├── TRAAABD128F429CF47.json │ │ │ ├── TRAAADZ128F9348C2E.json │ │ │ ├── TRAAAEF128F4273421.json │ │ │ ├── TRAAAFD128F92F423A.json │ │ │ ├── TRAAAMO128F1481E7F.json │ │ │ ├── TRAAAMQ128F1460CD3.json │ │ │ ├── TRAAAPK128E0786D96.json │ │ │ ├── TRAAARJ128F9320760.json │ │ │ ├── TRAAAVG12903CFA543.json │ │ │ └── TRAAAVO128F93133D4.json │ │ ├── B │ │ │ ├── TRAABCL128F4286650.json │ │ │ ├── TRAABDL12903CAABBA.json │ │ │ ├── TRAABJL12903CDCF1A.json │ │ │ ├── TRAABJV128F1460C49.json │ │ │ ├── TRAABLR128F423B7E3.json │ │ │ ├── TRAABNV128F425CEE1.json │ │ │ ├── TRAABRB128F9306DD5.json │ │ │ ├── TRAABVM128F92CA9DC.json │ │ │ ├── TRAABXG128F9318EBD.json │ │ │ ├── TRAABYN12903CFD305.json │ │ │ └── TRAABYW128F4244559.json │ │ └── C │ │ │ ├── TRAACCG128F92E8A55.json │ │ │ ├── TRAACER128F4290F96.json │ │ │ ├── TRAACFV128F935E50B.json │ │ │ ├── TRAACHN128F1489601.json │ │ │ ├── TRAACIW12903CC0F6D.json │ │ │ ├── TRAACLV128F427E123.json │ │ │ ├── TRAACNS128F14A2DF5.json │ │ │ ├── TRAACOW128F933E35F.json │ │ │ ├── TRAACPE128F421C1B9.json │ │ │ ├── TRAACQT128F9331780.json │ │ │ ├── TRAACSL128F93462F4.json │ │ │ ├── TRAACTB12903CAAF15.json │ │ │ ├── TRAACVS128E078BE39.json │ │ │ └── TRAACZK128F4243829.json │ │ └── B │ │ ├── .DS_Store │ │ ├── A │ │ ├── TRABACN128F425B784.json │ │ ├── TRABAFJ128F42AF24E.json │ │ ├── TRABAFP128F931E9A1.json │ │ ├── 
TRABAIO128F42938F9.json │ │ ├── TRABATO128F42627E9.json │ │ ├── TRABAVQ12903CBF7E0.json │ │ ├── TRABAWW128F4250A31.json │ │ ├── TRABAXL128F424FC50.json │ │ ├── TRABAXR128F426515F.json │ │ ├── TRABAXV128F92F6AE3.json │ │ └── TRABAZH128F930419A.json │ │ ├── B │ │ ├── TRABBAM128F429D223.json │ │ ├── TRABBBV128F42967D7.json │ │ ├── TRABBJE12903CDB442.json │ │ ├── TRABBKX128F4285205.json │ │ ├── TRABBLU128F93349CF.json │ │ ├── TRABBNP128F932546F.json │ │ ├── TRABBOP128F931B50D.json │ │ ├── TRABBOR128F4286200.json │ │ ├── TRABBTA128F933D304.json │ │ ├── TRABBVJ128F92F7EAA.json │ │ ├── TRABBXU128F92FEF48.json │ │ └── TRABBZN12903CD9297.json │ │ └── C │ │ ├── TRABCAJ12903CDFCC2.json │ │ ├── TRABCEC128F426456E.json │ │ ├── TRABCEI128F424C983.json │ │ ├── TRABCFL128F149BB0D.json │ │ ├── TRABCIX128F4265903.json │ │ ├── TRABCKL128F423A778.json │ │ ├── TRABCPZ128F4275C32.json │ │ ├── TRABCRU128F423F449.json │ │ ├── TRABCTK128F934B224.json │ │ ├── TRABCUQ128E0783E2B.json │ │ ├── TRABCXB128F4286BD3.json │ │ └── TRABCYE128F934CE1D.json ├── README.md ├── __MACOSX │ ├── ._README.md │ ├── ._dl.cfg │ └── ._etl.py ├── dl.cfg ├── etl.py └── star_schema.PNG ├── Project_5_Data_Pipeline_with_Airflow ├── airflow │ ├── create_tables.sql │ ├── dags │ │ ├── __pycache__ │ │ │ └── udac_example_dag.cpython-36.pyc │ │ └── udac_example_dag.py │ └── plugins │ │ ├── __init__.py │ │ ├── __pycache__ │ │ └── __init__.cpython-36.pyc │ │ ├── helpers │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-36.pyc │ │ │ └── sql_queries.cpython-36.pyc │ │ └── sql_queries.py │ │ └── operators │ │ ├── __init__.py │ │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── data_quality.cpython-36.pyc │ │ ├── load_dimension.cpython-36.pyc │ │ ├── load_fact.cpython-36.pyc │ │ └── stage_redshift.cpython-36.pyc │ │ ├── data_quality.py │ │ ├── load_dimension.py │ │ ├── load_fact.py │ │ └── stage_redshift.py └── readme.md └── README.md /A brief introduction of parquet files.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/A brief introduction of parquet files.docx -------------------------------------------------------------------------------- /Capstone_Project/dl.cfg: -------------------------------------------------------------------------------- 1 | AWS_ACCESS_KEY_ID= 2 | AWS_SECRET_ACCESS_KEY= -------------------------------------------------------------------------------- /Capstone_Project/notes for uploading data ro s3.txt: -------------------------------------------------------------------------------- 1 | *********************** process ************************************ 2 | upload the files into S3 and use Redshift to directly process the data? 
Why use PySpark as part of the process? 3 | 4 | You are free to use any technologies you like in the project 5 | 6 | 7 | 8 | *********************** load data from workspace to s3 ************************************ 9 | -- install the aws cli in the workspace first 10 | * to upload the sas data to S3 directly, you have to install awscli in the workspace first 11 | * it must be the workspace, not your local machine, if you do your project in the workspace 12 | 13 | -- installing locally did not work for me 14 | https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html 15 | Steps to Take after Installation 16 | Setting the Path to Include the AWS CLI 17 | 18 | Configure the AWS CLI with Your Credentials 19 | 20 | Upgrading to the Latest Version of the AWS CLI 21 | 22 | Uninstalling the AWS CLI 23 | 24 | Setting the Path to Include the AWS CLI 25 | After you install the AWS CLI, you might need to add the path to the executable file to your PATH variable. For platform-specific instructions, see the following topics: 26 | 27 | Linux – Add the AWS CLI Executable to Your Command Line Path 28 | 29 | Windows – Add the AWS CLI Executable to Your Command Line Path 30 | 31 | 32 | 33 | Verify that the AWS CLI installed correctly by running aws --version. 34 | 35 | $ aws --version 36 | aws-cli/1.16.116 Python/3.6.8 Linux/4.14.77-81.59-amzn2.x86_64 botocore/1.12.106 37 | Configure the AWS CLI with Your Credentials 38 | Before you can run a CLI command, you must configure the AWS CLI with your credentials. 39 | 40 | You store credential information locally by defining profiles in the AWS CLI configuration files, which are stored by default in your user's home directory. For more information, see Configuring the AWS CLI. 41 | 42 | Note 43 | 44 | If you are running on an Amazon EC2 instance, credentials can be retrieved automatically from the instance metadata. For more information, see Instance Metadata. 45 | 46 | Upgrading to the Latest Version of the AWS CLI 47 | The AWS CLI is updated regularly to add support for new services and commands. To update to the latest version of the AWS CLI, run the installation command again. For details about the latest version of the AWS CLI, see the AWS CLI release notes. 48 | 49 | $ pip3 install awscli --upgrade --user 50 | 51 | 52 | 53 | ********************************** read data into spark from s3 ************************* 54 | 55 | from pyspark.sql import SparkSession 56 | 57 | spark = SparkSession.builder \ 58 | .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \ 59 | .getOrCreate() 60 | 61 | immigration_data_dir = 's3://udacity-capstone-project-12138/immigration-data/i94_feb16_sub.sas7bdat' 62 | 63 | df_immigration = spark.read.format('com.github.saurfang.sas.spark').load(immigration_data_dir) 64 | 65 | 66 | To my knowledge, the com.github.saurfang.sas.spark package does not support reading a whole directory directly. 67 | 68 | You should read the files one by one 69 | 70 | 71 | 72 | ******************************** Convert SAS dates to Python datetime in Spark rather than converting the Spark DataFrame to pandas?
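A short note before the snippet below: an I94 arrdate value is a SAS numeric date, i.e. a count of days since 1960-01-01, which is why the UDF adds arrdate to that epoch as a timedelta. Keeping the conversion inside a Spark UDF keeps the work distributed and avoids collecting the large DataFrame to pandas on the driver. A minimal sanity check of the arithmetic (the arrdate value 20574 is made up for illustration, not taken from the dataset):

import datetime as dt
sas_epoch = dt.date(1960, 1, 1)                               # SAS stores dates as days since this epoch
print((sas_epoch + dt.timedelta(days=20574)).isoformat())     # prints 2016-04-30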
73 | import datetime as dt 74 | 75 | from pyspark.sql import SparkSession 76 | 77 | from pyspark.sql.functions import udf 78 | 79 | spark = SparkSession.builder.appName("capstone_project").getOrCreate() 80 | 81 | get_date = udf(lambda x: (dt.datetime(1960, 1, 1).date() + dt.timedelta(x)).isoformat() if x else None) 82 | 83 | sp_sas_data = spark.read.parquet("sas_data") 84 | 85 | sp_sas_data = sp_sas_data.withColumn("arrival_date", get_date(sp_sas_data.arrdate)) 86 | 87 | sp_sas_data.createOrReplaceTempView("i94") 88 | 89 | i94_df = spark.sql("select distinct arrival_date from i94") 90 | 91 | 92 | 93 | 94 | *****************read the file "I94_SAS_Labels_Descriptions.SAS"***************************** 95 | 96 | You can copy the labels into different files, such as saving I94CIT & I94RES into a file called code.txt 97 | 98 | 582 = 'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)' 99 | 100 | 236 = 'AFGHANISTAN' 101 | 102 | 101 = 'ALBANIA' 103 | 104 | 316 = 'ALGERIA' 105 | 106 | ... And then, you can read the codes this way (with pandas imported as pd): 107 | 108 | df = pd.read_csv('code.txt', sep=" = ", header=None) -------------------------------------------------------------------------------- /Capstone_Project/readme.md: -------------------------------------------------------------------------------- 1 | 2 | ## Project Instruction 3 | 4 | * a very important step is to compress the datasets in the workspace into a single zip file and upload it to the s3 bucket: 5 | * this is the command : 6 | zip -r /home/workspace/data.zip /data/18-83510-I94-Data-2016 7 | 8 | 9 | 10 | ### Step 1: Scope the Project and Gather Data 11 | 12 | Since the scope of the project will be highly dependent on the data, these two things happen simultaneously. 13 | 14 | In this step, I'll identify and gather the data I'll be using for my project (at least two sources and more than 1 million rows). 15 | I picked two of the provided datasets: the I94 immigration data and the us-cities-demographics data. 16 | 17 | 18 | The main purpose is analytics. The following questions are what I try to answer through this project: 19 | * which country sent the most visitors to the US in 2016? 20 | * which US cities attracted the most visitors in 2016? 21 | * which visa type was the most popular in 2016? 22 | * which race accounts for the largest number of immigrants to the US? 23 | 24 | ### Step 2: Explore and Assess the Data 25 | 26 | Explore the data to identify data quality issues, like missing values, duplicate data, etc. 27 | Document the steps necessary to clean the data 28 | 29 | ### Step 3: Define the Data Model 30 | 31 | Map out the conceptual data model and explain why you chose that model 32 | List the steps necessary to pipeline the data into the chosen data model 33 | 34 | ### Step 4: Run ETL to Model the Data 35 | 36 | * Create the data pipelines and the data model 37 | * Include a data dictionary 38 | * Run data quality checks to ensure the pipeline ran as expected 39 | * Integrity constraints on the relational database (e.g., unique key, data type, etc.) 40 | * Unit tests for the scripts to ensure they are doing the right thing 41 | * Source/count checks to ensure completeness 42 | 43 | ### Step 5: Complete Project Write Up 44 | 45 | What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose? 46 | Clearly state the rationale for the choice of tools and technologies for the project. 47 | Document the steps of the process. 48 | Propose how often the data should be updated and why.
49 | Post your write-up and final data model in a GitHub repo. 50 | Include a description of how you would approach the problem differently under the following 51 | 52 | ## scenarios: 53 | If the data was increased by 100x. 54 | If the pipelines were run on a daily basis by 7am. 55 | If the database needed to be accessed by 100+ people. 56 | 57 | 58 | 59 | 60 | ## Datasets 61 | 62 | ### Brief Introduction of the dataset 63 | 64 | The following 4 datasets are included in the project.If something about the data is unclear, make an assumption, document it, and move on. 65 | (1). I94 Immigration Data: This data comes from the US National Tourism and Trade Office. 66 | A data dictionary is included in the workspace. This is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project. 67 | (2). World Temperature Data: This dataset came from Kaggle. 68 | (3). U.S. City Demographic Data: This data comes from OpenSoft. 69 | (4).Airport Code Table: This is a simple table of airport codes and corresponding cities. 70 | 71 | ### Accessing the Data 72 | The immigration data and the global temperate data is in an attached disk. 73 | (1). Immigration Data 74 | You can access the immigration data in a folder with the following path: 75 | ../../data/18-83510-I94-Data-2016/. 76 | There's a file for each month of the year. 77 | 78 | An example file name is i94_apr16_sub.sas7bdat. Each file has a three-letter abbreviation 79 | for the month name. 80 | So a full file path for June would look like this: 81 | ../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat. 82 | 83 | Below is what it would look like to import this file into pandas. (Note: these files are 84 | large, so you'll have to think about how to process and aggregate them efficiently.) 85 | fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat' 86 | df = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1") 87 | 88 | ### The most important decision for modeling with this data is thinking about the level of aggregation. Do you want to aggregate by airport by month? Or by city by year? This level of aggregation will influence how you join the data with other datasets. There isn't a right answer, it all depends on what you want your final dataset to look like. 89 | 90 | (2). Temperature Data 91 | You can access the temperature data in a folder with the following path: 92 | ../../data2/. 93 | There's just one file in that folder, called GlobalLandTemperaturesByCity.csv. 94 | 95 | Below is how you would read the file into a pandas dataframe. 96 | fname = '../../data2/GlobalLandTemperaturesByCity.csv' 97 | df = pd.read_csv(fname) 98 | 99 | ## Reference: 100 | ##### Part 1: 101 | ##### https://github.com/cheuklau/udacity-capstone 102 | ##### https://github.com/shalgrim/dend_capstone 103 | 104 | 105 | ##### Part 2: 106 | ##### https://github.com/saurfang/spark-sas7bdat (reading SAS files from local and distributed filesystems, into Spark DataFrames.) 
107 | ##### https://github.com/FedericoSerini/DEND-Capstone-RomeTransportNetwork 108 | ##### https://github.com/sariabod/dend-capstone 109 | ##### https://github.com/akessela/Sparkify-Capstone-Project 110 | ##### https://github.com/Mikemraz/Capstone-Project-Big-Data-Sparkify -------------------------------------------------------------------------------- /Capstone_Project/ref/dictionary for star schema 1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/dictionary for star schema 1.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/model for star schema 1 .PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/model for star schema 1 .PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.1.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.3.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star schema 2.4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star schema 2.4.PNG -------------------------------------------------------------------------------- /Capstone_Project/ref/star shema 2.2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Capstone_Project/ref/star shema 2.2.PNG -------------------------------------------------------------------------------- /Capstone_Project/valid_i94port.txt: -------------------------------------------------------------------------------- 1 | 'ALC' = 'ALCAN, AK ' 2 | 'ANC' = 'ANCHORAGE, AK ' 3 | 'BAR' = 'BAKER AAF - BAKER ISLAND, AK' 4 | 'DAC' = 'DALTONS CACHE, AK ' 5 | 'PIZ' = 'DEW STATION PT LAY DEW, AK' 6 | 'DTH' = 'DUTCH HARBOR, AK ' 7 | 'EGL' = 'EAGLE, AK ' 8 | 'FRB' = 'FAIRBANKS, AK ' 9 | 'HOM' = 'HOMER, AK ' 10 | 'HYD' = 'HYDER, AK ' 11 | 'JUN' = 'JUNEAU, AK ' 12 | '5KE' = 'KETCHIKAN, AK' 13 | 'KET' = 'KETCHIKAN, AK ' 14 | 'MOS' = 'MOSES POINT INTERMEDIATE, AK' 15 | 'NIK' = 'NIKISKI, AK ' 16 | 'NOM' = 'NOM, AK ' 17 | 'PKC' = 'POKER CREEK, AK ' 18 | 'ORI' = 'PORT LIONS SPB, AK' 19 | 'SKA' = 'SKAGWAY, AK ' 20 | 'SNP' = 'ST. 
PAUL ISLAND, AK' 21 | 'TKI' = 'TOKEEN, AK' 22 | 'WRA' = 'WRANGELL, AK ' 23 | 'HSV' = 'MADISON COUNTY - HUNTSVILLE, AL' 24 | 'MOB' = 'MOBILE, AL ' 25 | 'LIA' = 'LITTLE ROCK, AR (BPS)' 26 | 'ROG' = 'ROGERS ARPT, AR' 27 | 'DOU' = 'DOUGLAS, AZ ' 28 | 'LUK' = 'LUKEVILLE, AZ ' 29 | 'MAP' = 'MARIPOSA AZ ' 30 | 'NAC' = 'NACO, AZ ' 31 | 'NOG' = 'NOGALES, AZ ' 32 | 'PHO' = 'PHOENIX, AZ ' 33 | 'POR' = 'PORTAL, AZ' 34 | 'SLU' = 'SAN LUIS, AZ ' 35 | 'SAS' = 'SASABE, AZ ' 36 | 'TUC' = 'TUCSON, AZ ' 37 | 'YUI' = 'YUMA, AZ ' 38 | 'AND' = 'ANDRADE, CA ' 39 | 'BUR' = 'BURBANK, CA' 40 | 'CAL' = 'CALEXICO, CA ' 41 | 'CAO' = 'CAMPO, CA ' 42 | 'FRE' = 'FRESNO, CA ' 43 | 'ICP' = 'IMPERIAL COUNTY, CA ' 44 | 'LNB' = 'LONG BEACH, CA ' 45 | 'LOS' = 'LOS ANGELES, CA ' 46 | 'BFL' = 'MEADOWS FIELD - BAKERSFIELD, CA' 47 | 'OAK' = 'OAKLAND, CA ' 48 | 'ONT' = 'ONTARIO, CA' 49 | 'OTM' = 'OTAY MESA, CA ' 50 | 'BLT' = 'PACIFIC, HWY. STATION, CA ' 51 | 'PSP' = 'PALM SPRINGS, CA' 52 | 'SAC' = 'SACRAMENTO, CA ' 53 | 'SLS' = 'SALINAS, CA (BPS)' 54 | 'SDP' = 'SAN DIEGO, CA' 55 | 'SFR' = 'SAN FRANCISCO, CA ' 56 | 'SNJ' = 'SAN JOSE, CA ' 57 | 'SLO' = 'SAN LUIS OBISPO, CA ' 58 | 'SLI' = 'SAN LUIS OBISPO, CA (BPS)' 59 | 'SPC' = 'SAN PEDRO, CA ' 60 | 'SYS' = 'SAN YSIDRO, CA ' 61 | 'SAA' = 'SANTA ANA, CA ' 62 | 'STO' = 'STOCKTON, CA (BPS)' 63 | 'TEC' = 'TECATE, CA ' 64 | 'TRV' = 'TRAVIS-AFB, CA ' 65 | 'APA' = 'ARAPAHOE COUNTY, CO' 66 | 'ASE' = 'ASPEN, CO #ARPT' 67 | 'COS' = 'COLORADO SPRINGS, CO' 68 | 'DEN' = 'DENVER, CO ' 69 | 'DRO' = 'LA PLATA - DURANGO, CO' 70 | 'BDL' = 'BRADLEY INTERNATIONAL, CT' 71 | 'BGC' = 'BRIDGEPORT, CT ' 72 | 'GRT' = 'GROTON, CT ' 73 | 'HAR' = 'HARTFORD, CT ' 74 | 'NWH' = 'NEW HAVEN, CT ' 75 | 'NWL' = 'NEW LONDON, CT ' 76 | 'TST' = 'NEWINGTON DATA CENTER TEST, CT' 77 | 'WAS' = 'WASHINGTON DC ' 78 | 'DOV' = 'DOVER AFB, DE' 79 | 'DVD' = 'DOVER-AFB, DE ' 80 | 'WLL' = 'WILMINGTON, DE ' 81 | 'BOC' = 'BOCAGRANDE, FL ' 82 | 'SRQ' = 'BRADENTON - SARASOTA, FL' 83 | 'CAN' = 'CAPE CANAVERAL, FL ' 84 | 'DAB' = 'DAYTONA BEACH INTERNATIONAL, FL' 85 | 'FRN' = 'FERNANDINA, FL ' 86 | 'FTL' = 'FORT LAUDERDALE, FL ' 87 | 'FMY' = 'FORT MYERS, FL ' 88 | 'FPF' = 'FORT PIERCE, FL ' 89 | 'HUR' = 'HURLBURT FIELD, FL' 90 | 'GNV' = 'J R ALISON MUNI - GAINESVILLE, FL' 91 | 'JAC' = 'JACKSONVILLE, FL ' 92 | 'KEY' = 'KEY WEST, FL ' 93 | 'LEE' = 'LEESBURG MUNICIPAL AIRPORT, FL' 94 | 'MLB' = 'MELBOURNE, FL' 95 | 'MIA' = 'MIAMI, FL ' 96 | 'APF' = 'NAPLES, FL #ARPT' 97 | 'OPF' = 'OPA LOCKA, FL' 98 | 'ORL' = 'ORLANDO, FL ' 99 | 'PAN' = 'PANAMA CITY, FL ' 100 | 'PEN' = 'PENSACOLA, FL ' 101 | 'PCF' = 'PORT CANAVERAL, FL ' 102 | 'PEV' = 'PORT EVERGLADES, FL ' 103 | 'PSJ' = 'PORT ST JOE, FL ' 104 | 'SFB' = 'SANFORD, FL ' 105 | 'SGJ' = 'ST AUGUSTINE ARPT, FL' 106 | 'SAU' = 'ST AUGUSTINE, FL ' 107 | 'FPR' = 'ST LUCIE COUNTY, FL' 108 | 'SPE' = 'ST PETERSBURG, FL ' 109 | 'TAM' = 'TAMPA, FL ' 110 | 'WPB' = 'WEST PALM BEACH, FL ' 111 | 'ATL' = 'ATLANTA, GA ' 112 | 'BRU' = 'BRUNSWICK, GA ' 113 | 'AGS' = 'BUSH FIELD - AUGUSTA, GA' 114 | 'SAV' = 'SAVANNAH, GA ' 115 | 'AGA' = 'AGANA, GU ' 116 | 'HHW' = 'HONOLULU, HI ' 117 | 'OGG' = 'KAHULUI - MAUI, HI' 118 | 'KOA' = 'KEAHOLE-KONA, HI ' 119 | 'LIH' = 'LIHUE, HI ' 120 | 'CID' = 'CEDAR RAPIDS/IOWA CITY, IA' 121 | 'DSM' = 'DES MOINES, IA' 122 | 'BOI' = 'AIR TERM. 
(GOWEN FLD) BOISE, ID' 123 | 'EPI' = 'EASTPORT, ID ' 124 | 'IDA' = 'FANNING FIELD - IDAHO FALLS, ID' 125 | 'PTL' = 'PORTHILL, ID ' 126 | 'SPI' = 'CAPITAL - SPRINGFIELD, IL' 127 | 'CHI' = 'CHICAGO, IL ' 128 | 'DPA' = 'DUPAGE COUNTY, IL' 129 | 'PIA' = 'GREATER PEORIA, IL' 130 | 'RFD' = 'GREATER ROCKFORD, IL' 131 | 'UGN' = 'MEMORIAL - WAUKEGAN, IL' 132 | 'GAR' = 'GARY, IN ' 133 | 'HMM' = 'HAMMOND, IN ' 134 | 'INP' = 'INDIANAPOLIS, IN ' 135 | 'MRL' = 'MERRILLVILLE, IN ' 136 | 'SBN' = 'SOUTH BEND, IN' 137 | 'ICT' = 'MID-CONTINENT - WITCHITA, KS' 138 | 'LEX' = 'BLUE GRASS - LEXINGTON, KY' 139 | 'LOU' = 'LOUISVILLE, KY ' 140 | 'BTN' = 'BATON ROUGE, LA ' 141 | 'LKC' = 'LAKE CHARLES, LA ' 142 | 'LAK' = 'LAKE CHARLES, LA (BPS)' 143 | 'MLU' = 'MONROE, LA' 144 | 'MGC' = 'MORGAN CITY, LA ' 145 | 'NOL' = 'NEW ORLEANS, LA ' 146 | 'BOS' = 'BOSTON, MA ' 147 | 'GLO' = 'GLOUCESTER, MA ' 148 | 'BED' = 'HANSCOM FIELD - BEDFORD, MA' 149 | 'LYN' = 'LYNDEN, WA ' 150 | 'ADW' = 'ANDREWS AFB, MD' 151 | 'BAL' = 'BALTIMORE, MD ' 152 | 'MKG' = 'MUSKEGON, MD' 153 | 'PAX' = 'PATUXENT RIVER, MD ' 154 | 'BGM' = 'BANGOR, ME ' 155 | 'BOO' = 'BOOTHBAY HARBOR, ME ' 156 | 'BWM' = 'BRIDGEWATER, ME ' 157 | 'BCK' = 'BUCKPORT, ME ' 158 | 'CLS' = 'CALAIS, ME ' 159 | 'CRB' = 'CARIBOU, ME ' 160 | 'COB' = 'COBURN GORE, ME ' 161 | 'EST' = 'EASTCOURT, ME ' 162 | 'EPT' = 'EASTPORT MUNICIPAL, ME' 163 | 'EPM' = 'EASTPORT, ME ' 164 | 'FOR' = 'FOREST CITY, ME ' 165 | 'FTF' = 'FORT FAIRFIELD, ME ' 166 | 'FTK' = 'FORT KENT, ME ' 167 | 'HML' = 'HAMIIN, ME ' 168 | 'HTM' = 'HOULTON, ME ' 169 | 'JKM' = 'JACKMAN, ME ' 170 | 'KAL' = 'KALISPEL, MT ' 171 | 'LIM' = 'LIMESTONE, ME ' 172 | 'LUB' = 'LUBEC, ME ' 173 | 'MAD' = 'MADAWASKA, ME ' 174 | 'POM' = 'PORTLAND, ME ' 175 | 'RGM' = 'RANGELEY, ME (BPS)' 176 | 'SBR' = 'SOUTH BREWER, ME ' 177 | 'SRL' = 'ST AURELIE, ME ' 178 | 'SPA' = 'ST PAMPILE, ME ' 179 | 'VNB' = 'VAN BUREN, ME ' 180 | 'VCB' = 'VANCEBORO, ME ' 181 | 'AGN' = 'ALGONAC, MI ' 182 | 'ALP' = 'ALPENA, MI ' 183 | 'BCY' = 'BAY CITY, MI ' 184 | 'DET' = 'DETROIT, MI ' 185 | 'GRP' = 'GRAND RAPIDS, MI' 186 | 'GRO' = 'GROSSE ISLE, MI ' 187 | 'ISL' = 'ISLE ROYALE, MI ' 188 | 'MRC' = 'MARINE CITY, MI ' 189 | 'MRY' = 'MARYSVILLE, MI ' 190 | 'PTK' = 'OAKLAND COUNTY - PONTIAC, MI' 191 | 'PHU' = 'PORT HURON, MI ' 192 | 'RBT' = 'ROBERTS LANDING, MI ' 193 | 'SAG' = 'SAGINAW, MI ' 194 | 'SSM' = 'SAULT STE. 
MARIE, MI ' 195 | 'SCL' = 'ST CLAIR, MI ' 196 | 'YIP' = 'WILLOW RUN - YPSILANTI, MI' 197 | 'BAU' = 'BAUDETTE, MN ' 198 | 'CAR' = 'CARIBOU MUNICIPAL AIRPORT, MN' 199 | 'GTF' = 'Collapsed into INT, MN' 200 | 'INL' = 'Collapsed into INT, MN' 201 | 'CRA' = 'CRANE LAKE, MN ' 202 | 'MIC' = 'CRYSTAL MUNICIPAL AIRPORT, MN' 203 | 'DUL' = 'DULUTH, MN ' 204 | 'ELY' = 'ELY, MN ' 205 | 'GPM' = 'GRAND PORTAGE, MN ' 206 | 'SVC' = 'GRANT COUNTY - SILVER CITY, MN' 207 | 'INT' = 'INT''L FALLS, MN ' 208 | 'LAN' = 'LANCASTER, MN ' 209 | 'MSP' = 'MINN./ST PAUL, MN ' 210 | 'LIN' = 'NORTHERN SVC CENTER, MN ' 211 | 'NOY' = 'NOYES, MN ' 212 | 'PIN' = 'PINE CREEK, MN ' 213 | '48Y' = 'PINECREEK BORDER ARPT, MN' 214 | 'RAN' = 'RAINER, MN ' 215 | 'RST' = 'ROCHESTER, MN' 216 | 'ROS' = 'ROSEAU, MN ' 217 | 'SPM' = 'ST PAUL, MN ' 218 | 'WSB' = 'WARROAD INTL, SPB, MN' 219 | 'WAR' = 'WARROAD, MN ' 220 | 'KAN' = 'KANSAS CITY, MO ' 221 | 'SGF' = 'SPRINGFIELD-BRANSON, MO' 222 | 'STL' = 'ST LOUIS, MO ' 223 | 'WHI' = 'WHITETAIL, MT ' 224 | 'WHM' = 'WILD HORSE, MT ' 225 | 'GPT' = 'BILOXI REGIONAL, MS' 226 | 'GTR' = 'GOLDEN TRIANGLE LOWNDES CNTY, MS' 227 | 'GUL' = 'GULFPORT, MS ' 228 | 'PAS' = 'PASCAGOULA, MS ' 229 | 'JAN' = 'THOMPSON FIELD - JACKSON, MS' 230 | 'BIL' = 'BILLINGS, MT ' 231 | 'BTM' = 'BUTTE, MT ' 232 | 'CHF' = 'CHIEF MT, MT ' 233 | 'CTB' = 'CUT BANK MUNICIPAL, MT' 234 | 'CUT' = 'CUT BANK, MT ' 235 | 'DLB' = 'DEL BONITA, MT ' 236 | 'EUR' = 'EUREKA, MT (BPS)' 237 | 'BZN' = 'GALLATIN FIELD - BOZEMAN, MT' 238 | 'FCA' = 'GLACIER NATIONAL PARK, MT' 239 | 'GGW' = 'GLASGOW, MT ' 240 | 'GRE' = 'GREAT FALLS, MT ' 241 | 'HVR' = 'HAVRE, MT ' 242 | 'HEL' = 'HELENA, MT ' 243 | 'LWT' = 'LEWISTON, MT ' 244 | 'MGM' = 'MORGAN, MT ' 245 | 'OPH' = 'OPHEIM, MT ' 246 | 'PIE' = 'PIEGAN, MT ' 247 | 'RAY' = 'RAYMOND, MT ' 248 | 'ROO' = 'ROOSVILLE, MT ' 249 | 'SCO' = 'SCOBEY, MT ' 250 | 'SWE' = 'SWEETGTASS, MT ' 251 | 'TRL' = 'TRIAL CREEK, MT ' 252 | 'TUR' = 'TURNER, MT ' 253 | 'WCM' = 'WILLOW CREEK, MT ' 254 | 'CLT' = 'CHARLOTTE, NC ' 255 | 'FAY' = 'FAYETTEVILLE, NC' 256 | 'MRH' = 'MOREHEAD CITY, NC ' 257 | 'FOP' = 'MORRIS FIELDS AAF, NC' 258 | 'GSO' = 'PIEDMONT TRIAD INTL AIRPORT, NC' 259 | 'RDU' = 'RALEIGH/DURHAM, NC ' 260 | 'SSC' = 'SHAW AFB - SUMTER, NC' 261 | 'WIL' = 'WILMINGTON, NC ' 262 | 'AMB' = 'AMBROSE, ND ' 263 | 'ANT' = 'ANTLER, ND ' 264 | 'CRY' = 'CARBURY, ND ' 265 | 'DNS' = 'DUNSEITH, ND ' 266 | 'FAR' = 'FARGO, ND ' 267 | 'FRT' = 'FORTUNA, ND ' 268 | 'GRF' = 'GRAND FORKS, ND ' 269 | 'HNN' = 'HANNAH, ND ' 270 | 'HNS' = 'HANSBORO, ND ' 271 | 'MAI' = 'MAIDA, ND ' 272 | 'MND' = 'MINOT, ND ' 273 | 'NEC' = 'NECHE, ND ' 274 | 'NOO' = 'NOONAN, ND ' 275 | 'NRG' = 'NORTHGATE, ND ' 276 | 'PEM' = 'PEMBINA, ND ' 277 | 'SAR' = 'SARLES, ND ' 278 | 'SHR' = 'SHERWOOD, ND ' 279 | 'SJO' = 'ST JOHN, ND ' 280 | 'WAL' = 'WALHALLA, ND ' 281 | 'WHO' = 'WESTHOPE, ND ' 282 | 'WND' = 'WILLISTON, ND ' 283 | 'OMA' = 'OMAHA, NE ' 284 | 'LEB' = 'LEBANON, NH ' 285 | 'MHT' = 'MANCHESTER, NH' 286 | 'PNH' = 'PITTSBURG, NH ' 287 | 'PSM' = 'PORTSMOUTH, NH ' 288 | 'BYO' = 'BAYONNE, NJ ' 289 | 'CNJ' = 'CAMDEN, NJ ' 290 | 'HOB' = 'HOBOKEN, NJ ' 291 | 'JER' = 'JERSEY CITY, NJ ' 292 | 'WRI' = 'MC GUIRE AFB - WRIGHTSOWN, NJ' 293 | 'MMU' = 'MORRISTOWN, NJ' 294 | 'NEW' = 'NEWARK/TETERBORO, NJ ' 295 | 'PER' = 'PERTH AMBOY, NJ ' 296 | 'ACY' = 'POMONA FIELD - ATLANTIC CITY, NJ' 297 | 'ALA' = 'ALAMAGORDO, NM (BPS)' 298 | 'ABQ' = 'ALBUQUERQUE, NM ' 299 | 'ANP' = 'ANTELOPE WELLS, NM ' 300 | 'CRL' = 'CARLSBAD, NM ' 301 | 'COL' = 'COLUMBUS, NM ' 302 | 'CDD' = 'CRANE LAKE - ST. 
LOUIS CNTY, NM' 303 | 'DNM' = 'DEMING, NM (BPS)' 304 | 'LAS' = 'LAS CRUCES, NM ' 305 | 'LOB' = 'LORDSBURG, NM (BPS)' 306 | 'RUI' = 'RUIDOSO, NM' 307 | 'STR' = 'SANTA TERESA, NM ' 308 | 'RNO' = 'CANNON INTL - RENO/TAHOE, NV' 309 | 'FLX' = 'FALLON MUNICIPAL AIRPORT, NV' 310 | 'LVG' = 'LAS VEGAS, NV ' 311 | 'REN' = 'RENO, NV ' 312 | 'ALB' = 'ALBANY, NY ' 313 | 'AXB' = 'ALEXANDRIA BAY, NY ' 314 | 'BUF' = 'BUFFALO, NY ' 315 | 'CNH' = 'CANNON CORNERS, NY' 316 | 'CAP' = 'CAPE VINCENT, NY ' 317 | 'CHM' = 'CHAMPLAIN, NY ' 318 | 'CHT' = 'CHATEAUGAY, NY ' 319 | 'CLA' = 'CLAYTON, NY ' 320 | 'FTC' = 'FORT COVINGTON, NY ' 321 | 'LAG' = 'LA GUARDIA, NY ' 322 | 'LEW' = 'LEWISTON, NY ' 323 | 'MAS' = 'MASSENA, NY ' 324 | 'MAG' = 'MCGUIRE AFB, NY ' 325 | 'MOO' = 'MOORES, NY ' 326 | 'MRR' = 'MORRISTOWN, NY ' 327 | 'NYC' = 'NEW YORK, NY ' 328 | 'NIA' = 'NIAGARA FALLS, NY ' 329 | 'OGD' = 'OGDENSBURG, NY ' 330 | 'OSW' = 'OSWEGO, NY ' 331 | 'ELM' = 'REGIONAL ARPT - HORSEHEAD, NY' 332 | 'ROC' = 'ROCHESTER, NY ' 333 | 'ROU' = 'ROUSES POINT, NY ' 334 | 'SWF' = 'STEWART - ORANGE CNTY, NY' 335 | 'SYR' = 'SYRACUSE, NY ' 336 | 'THO' = 'THOUSAND ISLAND BRIDGE, NY' 337 | 'TRO' = 'TROUT RIVER, NY ' 338 | 'WAT' = 'WATERTOWN, NY ' 339 | 'HPN' = 'WESTCHESTER - WHITE PLAINS, NY' 340 | 'WRB' = 'WHIRLPOOL BRIDGE, NY' 341 | 'YOU' = 'YOUNGSTOWN, NY ' 342 | 'AKR' = 'AKRON, OH ' 343 | 'ATB' = 'ASHTABULA, OH ' 344 | 'CIN' = 'CINCINNATI, OH ' 345 | 'CLE' = 'CLEVELAND, OH ' 346 | 'CLM' = 'COLUMBUS, OH ' 347 | 'LOR' = 'LORAIN, OH ' 348 | 'MBO' = 'MARBLE HEADS, OH ' 349 | 'SDY' = 'SANDUSKY, OH ' 350 | 'TOL' = 'TOLEDO, OH ' 351 | 'OKC' = 'OKLAHOMA CITY, OK ' 352 | 'TUL' = 'TULSA, OK' 353 | 'AST' = 'ASTORIA, OR ' 354 | 'COO' = 'COOS BAY, OR ' 355 | 'HIO' = 'HILLSBORO, OR' 356 | 'MED' = 'MEDFORD, OR ' 357 | 'NPT' = 'NEWPORT, OR ' 358 | 'POO' = 'PORTLAND, OR ' 359 | 'PUT' = 'PUT-IN-BAY, OH ' 360 | 'RDM' = 'ROBERTS FIELDS - REDMOND, OR' 361 | 'ERI' = 'ERIE, PA ' 362 | 'MDT' = 'HARRISBURG, PA' 363 | 'HSB' = 'HARRISONBURG, PA ' 364 | 'PHI' = 'PHILADELPHIA, PA ' 365 | 'PIT' = 'PITTSBURG, PA ' 366 | 'AGU' = 'AGUADILLA, PR ' 367 | 'BQN' = 'BORINQUEN - AGUADILLO, PR' 368 | 'JCP' = 'CULEBRA - BENJAMIN RIVERA, PR' 369 | 'ENS' = 'ENSENADA, PR ' 370 | 'FAJ' = 'FAJARDO, PR ' 371 | 'HUM' = 'HUMACAO, PR ' 372 | 'JOB' = 'JOBOS, PR ' 373 | 'MAY' = 'MAYAGUEZ, PR ' 374 | 'PON' = 'PONCE, PR ' 375 | 'PSE' = 'PONCE-MERCEDITA, PR' 376 | 'SAJ' = 'SAN JUAN, PR ' 377 | 'VQS' = 'VIEQUES-ARPT, PR' 378 | 'PRO' = 'PROVIDENCE, RI ' 379 | 'PVD' = 'THEODORE FRANCIS - WARWICK, RI' 380 | 'CHL' = 'CHARLESTON, SC ' 381 | 'CAE' = 'COLUMBIA, SC #ARPT' 382 | 'GEO' = 'GEORGETOWN, SC ' 383 | 'GSP' = 'GREENVILLE, SC' 384 | 'GRR' = 'GREER, SC' 385 | 'MYR' = 'MYRTLE BEACH, SC' 386 | 'SPF' = 'BLACK HILLS, SPEARFISH, SD' 387 | 'HON' = 'HOWES REGIONAL ARPT - HURON, SD' 388 | 'SAI' = 'SAIPAN, SPN ' 389 | 'TYS' = 'MC GHEE TYSON - ALCOA, TN' 390 | 'MEM' = 'MEMPHIS, TN ' 391 | 'NSV' = 'NASHVILLE, TN ' 392 | 'TRI' = 'TRI CITY ARPT, TN' 393 | 'ADS' = 'ADDISON AIRPORT- ADDISON, TX' 394 | 'ADT' = 'AMISTAD DAM, TX ' 395 | 'ANZ' = 'ANZALDUAS, TX' 396 | 'AUS' = 'AUSTIN, TX ' 397 | 'BEA' = 'BEAUMONT, TX ' 398 | 'BBP' = 'BIG BEND PARK, TX (BPS)' 399 | 'SCC' = 'BP SPEC COORD. 
CTR, TX' 400 | 'BTC' = 'BP TACTICAL UNIT, TX ' 401 | 'BOA' = 'BRIDGE OF AMERICAS, TX' 402 | 'BRO' = 'BROWNSVILLE, TX ' 403 | 'CRP' = 'CORPUS CHRISTI, TX ' 404 | 'DAL' = 'DALLAS, TX ' 405 | 'DLR' = 'DEL RIO, TX ' 406 | 'DNA' = 'DONNA, TX' 407 | 'EGP' = 'EAGLE PASS, TX ' 408 | 'ELP' = 'EL PASO, TX ' 409 | 'FAB' = 'FABENS, TX ' 410 | 'FAL' = 'FALCON HEIGHTS, TX ' 411 | 'FTH' = 'FORT HANCOCK, TX ' 412 | 'AFW' = 'FORT WORTH ALLIANCE, TX' 413 | 'FPT' = 'FREEPORT, TX ' 414 | 'GAL' = 'GALVESTON, TX ' 415 | 'HLG' = 'HARLINGEN, TX ' 416 | 'HID' = 'HIDALGO, TX ' 417 | 'HOU' = 'HOUSTON, TX ' 418 | 'SGR' = 'HULL FIELD, SUGAR LAND ARPT, TX' 419 | 'LLB' = 'JUAREZ-LINCOLN BRIDGE, TX' 420 | 'LCB' = 'LAREDO COLUMBIA BRIDGE, TX' 421 | 'LRN' = 'LAREDO NORTH, TX ' 422 | 'LAR' = 'LAREDO, TX ' 423 | 'LSE' = 'LOS EBANOS, TX ' 424 | 'IND' = 'LOS INDIOS, TX' 425 | 'LOI' = 'LOS INDIOS, TX ' 426 | 'MRS' = 'MARFA, TX (BPS)' 427 | 'MCA' = 'MCALLEN, TX ' 428 | 'MAF' = 'ODESSA REGIONAL, TX' 429 | 'PDN' = 'PASO DEL NORTE,TX ' 430 | 'PBB' = 'PEACE BRIDGE, NY ' 431 | 'PHR' = 'PHARR, TX ' 432 | 'PAR' = 'PORT ARTHUR, TX ' 433 | 'ISB' = 'PORT ISABEL, TX ' 434 | 'POE' = 'PORT OF EL PASO, TX ' 435 | 'PRE' = 'PRESIDIO, TX ' 436 | 'PGR' = 'PROGRESO, TX ' 437 | 'RIO' = 'RIO GRANDE CITY, TX ' 438 | 'ROM' = 'ROMA, TX ' 439 | 'SNA' = 'SAN ANTONIO, TX ' 440 | 'SNN' = 'SANDERSON, TX ' 441 | 'VIB' = 'VETERAN INTL BRIDGE, TX' 442 | 'YSL' = 'YSLETA, TX ' 443 | 'CHA' = 'CHARLOTTE AMALIE, VI ' 444 | 'CHR' = 'CHRISTIANSTED, VI ' 445 | 'CRU' = 'CRUZ BAY, ST JOHN, VI ' 446 | 'FRK' = 'FREDERIKSTED, VI ' 447 | 'STT' = 'ST THOMAS, VI ' 448 | 'LGU' = 'CACHE AIRPORT - LOGAN, UT' 449 | 'SLC' = 'SALT LAKE CITY, UT ' 450 | 'CHO' = 'ALBEMARLE CHARLOTTESVILLE, VA' 451 | 'DAA' = 'DAVISON AAF - FAIRFAX CNTY, VA' 452 | 'HOP' = 'HOPEWELL, VA ' 453 | 'HEF' = 'MANASSAS, VA #ARPT' 454 | 'NWN' = 'NEWPORT, VA ' 455 | 'NOR' = 'NORFOLK, VA ' 456 | 'RCM' = 'RICHMOND, VA ' 457 | 'ABS' = 'ALBURG SPRINGS, VT ' 458 | 'ABG' = 'ALBURG, VT ' 459 | 'BEB' = 'BEEBE PLAIN, VT ' 460 | 'BEE' = 'BEECHER FALLS, VT ' 461 | 'BRG' = 'BURLINGTON, VT ' 462 | 'CNA' = 'CANAAN, VT ' 463 | 'DER' = 'DERBY LINE, VT (I-91) ' 464 | 'DLV' = 'DERBY LINE, VT (RT. 
5)' 465 | 'ERC' = 'EAST RICHFORD, VT ' 466 | 'HIG' = 'HIGHGATE SPRINGS, VT ' 467 | 'MOR' = 'MORSES LINE, VT ' 468 | 'NPV' = 'NEWPORT, VT ' 469 | 'NRT' = 'NORTH TROY, VT ' 470 | 'NRN' = 'NORTON, VT ' 471 | 'PIV' = 'PINNACLE ROAD, VT ' 472 | 'RIF' = 'RICHFORT, VT ' 473 | 'STA' = 'ST ALBANS, VT ' 474 | 'SWB' = 'SWANTON, VT (BP - SECTOR HQ)' 475 | 'WBE' = 'WEST BERKSHIRE, VT ' 476 | 'ABE' = 'ABERDEEN, WA ' 477 | 'ANA' = 'ANACORTES, WA ' 478 | 'BEL' = 'BELLINGHAM, WA ' 479 | 'BLI' = 'BELLINGHAM, WASHINGTON #INTL' 480 | 'BLA' = 'BLAINE, WA ' 481 | 'BWA' = 'BOUNDARY, WA ' 482 | 'CUR' = 'CURLEW, WA (BPS)' 483 | 'DVL' = 'DANVILLE, WA ' 484 | 'EVE' = 'EVERETT, WA ' 485 | 'FER' = 'FERRY, WA ' 486 | 'FRI' = 'FRIDAY HARBOR, WA ' 487 | 'FWA' = 'FRONTIER, WA ' 488 | 'KLM' = 'KALAMA, WA ' 489 | 'LAU' = 'LAURIER, WA ' 490 | 'LON' = 'LONGVIEW, WA ' 491 | 'MET' = 'METALINE FALLS, WA ' 492 | 'MWH' = 'MOSES LAKE GRANT COUNTY ARPT, WA' 493 | 'NEA' = 'NEAH BAY, WA ' 494 | 'NIG' = 'NIGHTHAWK, WA ' 495 | 'OLY' = 'OLYMPIA, WA ' 496 | 'ORO' = 'OROVILLE, WA ' 497 | 'PWB' = 'PASCO, WA ' 498 | 'PIR' = 'POINT ROBERTS, WA ' 499 | 'PNG' = 'PORT ANGELES, WA ' 500 | 'PTO' = 'PORT TOWNSEND, WA ' 501 | 'SEA' = 'SEATTLE, WA ' 502 | 'SPO' = 'SPOKANE, WA ' 503 | 'SUM' = 'SUMAS, WA ' 504 | 'TAC' = 'TACOMA, WA ' 505 | 'PSC' = 'TRI-CITIES - PASCO, WA' 506 | 'VAN' = 'VANCOUVER, WA ' 507 | 'AGM' = 'ALGOMA, WI ' 508 | 'BAY' = 'BAYFIELD, WI ' 509 | 'GRB' = 'GREEN BAY, WI ' 510 | 'MNW' = 'MANITOWOC, WI ' 511 | 'MIL' = 'MILWAUKEE, WI ' 512 | 'MSN' = 'TRUAX FIELD - DANE COUNTY, WI' 513 | 'CHS' = 'CHARLESTON, WV ' 514 | 'CLK' = 'CLARKSBURG, WV ' 515 | 'BLF' = 'MERCER COUNTY, WV' 516 | 'CSP' = 'CASPER, WY ' 517 | 'CLG' = 'CALGARY, CANADA ' 518 | 'EDA' = 'EDMONTON, CANADA ' 519 | 'YHC' = 'HAKAI PASS, CANADA' 520 | 'HAL' = 'Halifax, NS, Canada ' 521 | 'MON' = 'MONTREAL, CANADA ' 522 | 'OTT' = 'OTTAWA, CANADA ' 523 | 'YXE' = 'SASKATOON, CANADA' 524 | 'TOR' = 'TORONTO, CANADA ' 525 | 'VCV' = 'VANCOUVER, CANADA ' 526 | 'VIC' = 'VICTORIA, CANADA ' 527 | 'WIN' = 'WINNIPEG, CANADA ' 528 | 'AMS' = 'AMSTERDAM-SCHIPHOL, NETHERLANDS' 529 | 'ARB' = 'ARUBA, NETH ANTILLES ' 530 | 'BAN' = 'BANKOK, THAILAND ' 531 | 'BEI' = 'BEICA #ARPT, ETHIOPIA' 532 | 'PEK' = 'BEIJING CAPITAL INTL, PRC' 533 | 'BDA' = 'KINDLEY FIELD, BERMUDA' 534 | 'BOG' = 'BOGOTA, EL DORADO #ARPT, COLOMBIA' 535 | 'EZE' = 'BUENOS AIRES, MINISTRO PIST, ARGENTINA' 536 | 'CUN' = 'CANCUN, MEXICO' 537 | 'CRQ' = 'CARAVELAS, BA #ARPT, BRAZIL' 538 | 'MVD' = 'CARRASCO, URUGUAY' 539 | 'DUB' = 'DUBLIN, IRELAND ' 540 | 'FOU' = 'FOUGAMOU #ARPT, GABON' 541 | 'FBA' = 'FREEPORT, BAHAMAS ' 542 | 'MTY' = 'GEN M. 
ESCOBEDO, Monterrey, MX' 543 | 'HMO' = 'GEN PESQUEIRA GARCIA, MX' 544 | 'GCM' = 'GRAND CAYMAN, CAYMAN ISLAND' 545 | 'GDL' = 'GUADALAJARA, MIGUEL HIDAL, MX' 546 | 'HAM' = 'HAMILTON, BERMUDA ' 547 | 'ICN' = 'INCHON, SEOUL KOREA' 548 | 'IWA' = 'INVALID - IWAKUNI, JAPAN' 549 | 'CND' = 'KOGALNICEANU, ROMANIA' 550 | 'LAH' = 'LABUHA ARPT, INDONESIA' 551 | 'DUR' = 'LOUIS BOTHA, SOUTH AFRICA' 552 | 'MAL' = 'MANGOLE ARPT, INDONESIA' 553 | 'MDE' = 'MEDELLIN, COLOMBIA' 554 | 'MEX' = 'JUAREZ INTL, MEXICO CITY, MX' 555 | 'LHR' = 'MIDDLESEX, ENGLAND' 556 | 'NBO' = 'NAIROBI, KENYA ' 557 | 'NAS' = 'NASSAU, BAHAMAS ' 558 | 'NCA' = 'NORTH CAICOS, TURK & CAIMAN' 559 | 'PTY' = 'OMAR TORRIJOS, PANAMA' 560 | 'SPV' = 'PAPUA, NEW GUINEA' 561 | 'UIO' = 'QUITO (MARISCAL SUCR), ECUADOR' 562 | 'RIT' = 'ROME, ITALY ' 563 | 'SNO' = 'SAKON NAKHON #ARPT, THAILAND' 564 | 'SLP' = 'SAN LUIS POTOSI #ARPT, MEXICO' 565 | 'SAN' = 'SAN SALVADOR, EL SALVADOR' 566 | 'SRO' = 'SANTANA RAMOS #ARPT, COLOMBIA' 567 | 'GRU' = 'GUARULHOS INTL, SAO PAULO, BRAZIL' 568 | 'SHA' = 'SHANNON, IRELAND ' 569 | 'HIL' = 'SHILLAVO, ETHIOPIA' 570 | 'TOK' = 'TOROKINA #ARPT, PAPUA, NEW GUINEA' 571 | 'VER' = 'VERACRUZ, MEXICO' 572 | 'LGW' = 'WEST SUSSEX, ENGLAND ' 573 | 'ZZZ' = 'MEXICO Land (Banco de Mexico) ' 574 | 'CHN' = 'No PORT Code (CHN)' 575 | 'CNC' = 'CANNON CORNERS, NY' 576 | 'MAA' = 'Abu Dhabi' 577 | 'AG0' = 'MAGNOLIA, AR' 578 | 'BHM' = 'BAR HARBOR, ME' 579 | 'BHX' = 'BIRMINGHAM, AL' 580 | 'CAK' = 'AKRON, OH' 581 | 'FOK' = 'SUFFOLK COUNTY, NY' 582 | 'LND' = 'LANDER, WY' 583 | 'MAR' = 'MARFA, TX' 584 | 'MLI' = 'MOLINE, IL' 585 | 'RIV' = 'RIVERSIDE, CA' 586 | 'RME' = 'ROME, NY' 587 | 'VNY' = 'VAN NUYS, CA' 588 | 'YUM' = 'YUMA, AZ' -------------------------------------------------------------------------------- /Data Sources.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Data Sources.docx -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/README.md: -------------------------------------------------------------------------------- 1 | # Files in My Project Repository 2 | (1). There are 6 files and 1 data folder including two data sets in my repository. 3 | (2). For the two .ipynb files, you can just open and run it in the workspace. 4 | But for 3 .py files, you have to open a new terminal to execute them by typing "python filename.py". 5 | (3). The steps for this project: 6 | * sql_queries.py ---> create_tables.py ----> etl.ipynb----> etl.py ----> test.ipynb 7 | ---> readme.md 8 | (4). Attention: each time you run create_table.py, you have to restart test.ipynb and etl.ipynb first to close the connection to 9 | database. 10 | 11 | # Purpose of this Database 12 | 13 | (1). The analytics team at Sparkify wanted to know what songs users were listening to but did not have easy assess to the data. 14 | Both datasets were on Sparkify's new music streaming app. One resided in a directory of JSON logs on user activity while the other 15 | was in the directory with JSON metadata on the songs. 16 | 17 | (2). My goal was to create a Postgres database with tables designed to optimize queries on song play analysis. 18 | 19 | 20 | # Explanation of Database Schema Design and ETL Pipeline 21 | 22 | (1). Database schema: I created a star schema optimized for queries on song analysis. 
23 | * It contained 5 tables : 24 | * one fact table:songplays with columns songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agen 25 | * four dimension tables 26 | * users - users in the app with columns user_id, first_name, last_name, gender, level 27 | * songs - songs in music database with columns song_id, title, artist_id, year, duration 28 | * artists - artists in music database with columns artist_id, name, location, lattitude, longitude 29 | * time - timestamps of records in songplays broken down into specific units with columns start_time, hour, day, week, month, year, weekday 30 | 31 | (2). ETL Pipeline 32 | * Process Song data to insert records into songs and artists tables 33 | * Process log data to create time and users tables 34 | * Extract songs table, artists table, and original log file to create songplays table (join songs table, artists table) 35 | 36 | # Example Queries and Results for Song Play Analysis 37 | 38 | query 1: show the first five rows in songplays table 39 | * query: 40 | %sql SELECT * FROM songplays LIMIT 5; 41 | 42 | * result: 43 | songplay_id start_time user_id level song_id artist_id session_id location user_agent 44 | 1 2018-11-29 00:00:57.796000 73 paid None None 954 Tampa-St. Petersburg-Clearwater, FL "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" 45 | 2 2018-11-29 00:01:30.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 46 | 3 2018-11-29 00:04:01.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 47 | 4 2018-11-29 00:04:55.796000 73 paid None None 954 Tampa-St. Petersburg-Clearwater, FL "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" 48 | 5 2018-11-29 00:07:13.796000 24 paid None None 984 Lake Havasu City-Kingman, AZ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" 49 | 50 | query 2: show the first five rows in users table 51 | * query: 52 | %sql SELECT * FROM users LIMIT 5; 53 | 54 | * result: 55 | user_id first_name last_name gender level 56 | 73 Jacob Klein M paid 57 | 24 Layla Griffin F paid 58 | 24 Layla Griffin F paid 59 | 73 Jacob Klein M paid 60 | 24 Layla Griffin F paid 61 | 62 | query 3: show the first five rows in songs table 63 | * query: 64 | %sql SELECT * FROM songs LIMIT 5; 65 | 66 | * result: 67 | song_id title artist_id year duration 68 | SOFNOQK12AB01840FC Kutt Free (DJ Volume Remix) ARNNKDK1187B98BBD5 0 407.37914 69 | SOBAYLL12A8C138AF9 Sono andati? Fingevo di dormire ARDR4AC1187FB371A1 0 511.16363 70 | SOFFKZS12AB017F194 A Higher Place (Album Version) ARBEBBY1187B9B43DB 1994 236.17261 71 | SOGVQGJ12AB017F169 Ten Tonne AR62SOJ1187FB47BB5 2005 337.68444 72 | SOXILUQ12A58A7C72A Jenny Take a Ride ARP6N5A1187B99D1A3 2004 207.43791 73 | 74 | query 4: What is the average duration of songs ? 
75 | * query: 76 | %sql SELECT AVG(duration) AS avg_duration FROM songs; 77 | 78 | * result: 79 | avg_duration 80 | 239.729676056338 81 | 82 | 83 | query 5: The proportion of male and female users 84 | * query: 85 | # total users 86 | %sql SELECT COUNT(gender) FROM users; 87 | # number of female users 88 | %sql SELECT COUNT(gender) AS female_count \ 89 | FROM users \ 90 | WHERE gender = 'F'; 91 | # number of male users 92 | %sql SELECT COUNT(gender) AS male_count \ 93 | FROM users \ 94 | WHERE gender = 'M'; 95 | # print proportions 96 | print(4887/6820 , 1933/6820) 97 | 98 | * result: 99 | 0.7165689149560117 0.2834310850439883 100 | 101 | # conclusion: 102 | According to the data given, there are far more female users than male users of the Sparkify music streaming app. 103 | 104 | 105 | -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/create_tables.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | from sql_queries import create_table_queries, drop_table_queries 3 | 4 | 5 | def create_database(): 6 | """ 7 | This function creates the sparkifydb database and connects to it. 8 | 9 | argument: none 10 | 11 | return: 12 | cur: the cursor object 13 | conn : the connection object 14 | """ 15 | 16 | # connect to default database 17 | conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student") 18 | conn.set_session(autocommit=True) 19 | cur = conn.cursor() 20 | 21 | # create sparkify database with UTF8 encoding 22 | cur.execute("DROP DATABASE IF EXISTS sparkifydb") 23 | cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0") 24 | 25 | # close connection to default database 26 | conn.close() 27 | 28 | # connect to sparkify database 29 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 30 | cur = conn.cursor() 31 | 32 | return cur, conn 33 | 34 | 35 | def drop_tables(cur, conn): 36 | """ 37 | This function drops all tables. 38 | 39 | argument: 40 | cur: the cursor object 41 | conn : the connection object 42 | 43 | return : 44 | none 45 | 46 | """ 47 | for query in drop_table_queries: 48 | cur.execute(query) 49 | conn.commit() 50 | 51 | 52 | def create_tables(cur, conn): 53 | """ 54 | This function creates all tables. 55 | 56 | argument: 57 | cur: the cursor object 58 | conn : the connection object 59 | 60 | return : 61 | none 62 | """ 63 | for query in create_table_queries: 64 | cur.execute(query) 65 | conn.commit() 66 | 67 | 68 | def main(): 69 | """ 70 | Creates the sparkifydb database, drops any existing tables, creates fresh tables, 71 | and then closes the connection. 72 | """ 73 | cur, conn = create_database() 74 | 75 | drop_tables(cur, conn) 76 | create_tables(cur, conn) 77 | 78 | conn.close() 79 | 80 | 81 | if __name__ == "__main__": 82 | """ 83 | The if __name__ == '__main__' guard ensures main() only runs when this script is executed directly.
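Per the project README, run this script from a terminal with python create_tables.py before running etl.py, so that a fresh database and empty tables exist for the ETL to load.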
84 | """ 85 | main() -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/etl.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import psycopg2 4 | import pandas as pd 5 | from sql_queries import * 6 | 7 | # two tables, songs and artists, are extracted from each song file 8 | def process_song_file(cur, filepath): 9 | """ 10 | This function reads in a song file and inserts song records and artist records into the songs and artists tables respectively. 11 | 12 | argument: 13 | cur: the cursor object 14 | filepath: the path to the song files 15 | 16 | return: none 17 | 18 | """ 19 | # open song file 20 | df = pd.read_json(filepath, lines=True) 21 | 22 | # etl.ipynb selects only the first (only) record in the dataframe 23 | # here, we need to process all records in all files 24 | for i in range(len(df)): 25 | # insert song record 26 | song_data = df[['song_id','title','artist_id','year','duration']].values[i].tolist() 27 | cur.execute(song_table_insert, song_data) 28 | 29 | # insert artist record 30 | artist_data = df[['artist_id','artist_name','artist_location','artist_latitude','artist_longitude']].values[i].tolist() 31 | cur.execute(artist_table_insert, artist_data) 32 | 33 | 34 | def process_log_file(cur, filepath): 35 | """ 36 | This function reads in a log file and inserts time records, user records and songplay records into the time, users and 37 | songplays tables respectively. 38 | 39 | argument: 40 | cur: the cursor object 41 | filepath: the path to the log files 42 | 43 | return: none 44 | """ 45 | # open log file 46 | df = pd.read_json(filepath, lines=True) 47 | 48 | # filter by NextSong action 49 | df = df[df['page']=='NextSong'] 50 | 51 | # convert timestamp column to datetime 52 | t = pd.to_datetime( df['ts'] , unit = 'ms') 53 | 54 | # insert time data records 55 | time_data = list((t, t.dt.hour, t.dt.day, t.dt.week,t.dt.month, t.dt.year, t.dt.dayofweek)) 56 | column_labels = ('timestamp', 'hour', 'day', 'week', 'month', 'year','weekday') 57 | time_df = pd.DataFrame(dict(zip(column_labels,time_data)),columns=column_labels).reset_index(drop=True) 58 | 59 | for i, row in time_df.iterrows(): 60 | cur.execute(time_table_insert, list(row)) 61 | 62 | # load user table 63 | user_df = df[['userId','firstName','lastName','gender','level']] 64 | 65 | # insert user records 66 | for i, row in user_df.iterrows(): 67 | cur.execute(user_table_insert, row) 68 | 69 | # insert songplay records 70 | for index, row in df.iterrows(): 71 | 72 | # get songid and artistid from song and artist tables 73 | cur.execute(song_select, (row.song, row.artist, row.length)) 74 | # row.length is the duration; row.song, row.artist and row.length match the WHERE conditions in song_select 75 | results = cur.fetchone() 76 | 77 | if results: 78 | songid, artistid = results 79 | else: 80 | songid, artistid = None, None 81 | 82 | # insert songplay record 83 | songplay_data = ( pd.to_datetime(row.ts, unit = 'ms'), row.userId, row.level, songid , artistid, row.sessionId, row.location, row.userAgent ) 84 | cur.execute(songplay_table_insert, songplay_data) 85 | 86 | 87 | def process_data(cur, conn, filepath, func): 88 | """ 89 | This function gets all JSON files under the given directory and applies func to each of them. 90 | 91 | arguments: 92 | cur: the cursor object 93 | conn: the connection object 94 | filepath : path to the files 95 | func : the processing function to call for each file 96 | 97 | return: none 98 | """ 99 | # get all files matching extension from directory 100 | all_files = [] 101 | for root, dirs, files in os.walk(filepath): 102 | files = glob.glob(os.path.join(root,'*.json')) 103 | for f in files : 104 | all_files.append(os.path.abspath(f)) 105 | 106 | # get total number of files found 107 | num_files = len(all_files) 108 | print('{} files found in {}'.format(num_files, filepath)) 109 | 110 | # iterate over files and process 111 | for i, datafile in enumerate(all_files, 1): 112 | func(cur, datafile) 113 | conn.commit() 114 | print('{}/{} files processed.'.format(i, num_files)) 115 | 116 | 117 | def main(): 118 | """ 119 | This function connects to sparkifydb and processes the song data and log data directories. 120 | 121 | argument: none 122 | 123 | return: none 124 | """ 125 | conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student") 126 | cur = conn.cursor() 127 | 128 | process_data(cur, conn, filepath='data/song_data', func=process_song_file) 129 | process_data(cur, conn, filepath='data/log_data', func=process_log_file) 130 | 131 | conn.close() 132 | 133 | 134 | if __name__ == "__main__": 135 | """ 136 | The if __name__ == '__main__' guard is the necessary condition for running main() only when this script is executed directly. 137 | """ 138 | main() -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/sql_queries.py: -------------------------------------------------------------------------------- 1 | # DROP TABLES 2 | # remember to write the full DROP TABLE IF EXISTS statement 3 | songplay_table_drop = "DROP TABLE IF EXISTS songplays;" # all table names use plural forms, consistent with test.ipynb 4 | user_table_drop = "DROP TABLE IF EXISTS users;" # user is a reserved word 5 | song_table_drop = "DROP TABLE IF EXISTS songs;" 6 | artist_table_drop = "DROP TABLE IF EXISTS artists;" 7 | time_table_drop = "DROP TABLE IF EXISTS time;" 8 | 9 | 10 | 11 | # CREATE TABLES 12 | # each table requires a primary key.
when a table does not have such a column, create an automatically incrementing one, like songplay_id serial primary key or id serial primary key 13 | # A primary key is required to be NOT NULL 14 | # foreign keys must reference a unique key in the parent table: pay attention to who references whom 15 | # in other words, a foreign key must reference columns that either are a primary key or form a unique constraint 16 | 17 | songplay_table_create = (""" CREATE TABLE IF NOT EXISTS songplays (songplay_id serial, start_time timestamp NOT NULL, user_id int NOT NULL, level varchar, song_id varchar, artist_id varchar, session_id int, location varchar, user_agent varchar, 18 | primary key(songplay_id), 19 | foreign key (user_id) references users(user_id), 20 | foreign key(song_id) references songs(song_id), 21 | foreign key(artist_id) references artists(artist_id), 22 | foreign key(start_time) references time(start_time))""" ) 23 | 24 | user_table_create = (""" CREATE TABLE IF NOT EXISTS users (user_id int primary key, first_name varchar, last_name varchar, gender varchar, level varchar) """) 25 | 26 | song_table_create = (""" CREATE TABLE IF NOT EXISTS songs (song_id varchar primary key, title varchar, artist_id varchar, year int, duration float) """) 27 | 28 | artist_table_create = (""" CREATE TABLE IF NOT EXISTS artists (artist_id varchar primary key, artist_name varchar, artist_location varchar, artist_lattitude float, artist_longitude float ) """) 29 | 30 | # start_time time without time zone 31 | time_table_create = (""" CREATE TABLE IF NOT EXISTS time ( start_time timestamp primary key, hour int NOT NULL, day int NOT NULL, week int NOT NULL, month int NOT NULL, year int NOT NULL, weekday int NOT NULL ) """) 32 | 33 | 34 | 35 | # INSERT RECORDS 36 | # user_id, song_id and artist_id are primary keys, so they must be unique; in reality there will be conflicts, so you need to handle them 37 | 38 | # three tables require conflict handling because we set user_id, song_id and artist_id as primary keys; the users table is special because level (status) can change from free to paid 39 | 40 | # songplay_id increases automatically, so it is not listed as an insert column 41 | songplay_table_insert = (""" INSERT INTO songplays ( start_time, user_id, level, song_id, artist_id, session_id, location , user_agent) VALUES ( %s, %s, %s, %s, %s, %s, %s, %s) """) 42 | 43 | user_table_insert = (""" INSERT INTO users (user_id, first_name, last_name, gender, level) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (user_id) DO UPDATE SET level= EXCLUDED.level """) 44 | 45 | song_table_insert = (""" INSERT INTO songs (song_id, title, artist_id, year, duration) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING """) 46 | 47 | artist_table_insert = (""" INSERT INTO artists (artist_id, artist_name, artist_location, artist_lattitude, artist_longitude) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (artist_id) DO NOTHING """) 48 | 49 | time_table_insert = (""" INSERT INTO time (start_time, hour, day, week, month, year, weekday) VALUES (%s, %s, %s, %s, %s, %s, %s) 50 | ON CONFLICT (start_time) DO NOTHING """) 51 | 52 | 53 | # FIND SONGS 54 | # this is required in etl.ipynb 55 | # Implement the song_select query in sql_queries.py to find the song ID and artist ID based on the title, artist name, 56 | # and duration of a song.
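# A quick usage sketch (not part of the original file): this is how etl.py calls the query below.
# The literal title, artist and duration values here are made up purely for illustration.
#     cur.execute(song_select, (row.song, row.artist, row.length))
#     cur.execute(song_select, ('Some Song Title', 'Some Artist Name', 236.17261))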
57 | #should be left join , some songs may not have artist in record 58 | song_select = (""" SELECT s.song_id, a.artist_id 59 | FROM songs s 60 | LEFT JOIN artists a 61 | ON s.artist_id = a.artist_id 62 | WHERE title = (%s) AND artist_name = (%s) AND duration = (%s) """) 63 | # this where statement is based on what is required in etl.ipynb 64 | 65 | 66 | # QUERY LISTS 67 | # the order is important, you should execute the dimension tables first, then fact table. Because there are fks in fact table pointing to dimension tables. So dimension tables must exist first 68 | # dimension tables in the front 69 | create_table_queries = [user_table_create, song_table_create, artist_table_create, time_table_create, songplay_table_create] 70 | drop_table_queries = [songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 71 | 72 | # reference: 73 | # upsert example to handle conflict : http://www.postgresqltutorial.com/postgresql-upsert/ -------------------------------------------------------------------------------- /Project_1A_Data_Modeling_with_Postgres/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext sql" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/plain": [ 20 | "'Connected: student@sparkifydb'" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "%sql postgresql://student:student@127.0.0.1/sparkifydb" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 42 | "5 rows affected.\n" 43 | ] 44 | }, 45 | { 46 | "data": { 47 | "text/html": [ 48 | "\n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | "
songplay_idstart_timeuser_idlevelsong_idartist_idsession_idlocationuser_agent
12018-11-29 00:00:57.79600073paidNoneNone954Tampa-St. Petersburg-Clearwater, FL"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"
22018-11-29 00:01:30.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
32018-11-29 00:04:01.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
42018-11-29 00:04:55.79600073paidNoneNone954Tampa-St. Petersburg-Clearwater, FL"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"
52018-11-29 00:07:13.79600024paidNoneNone984Lake Havasu City-Kingman, AZ"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
" 116 | ], 117 | "text/plain": [ 118 | "[(1, datetime.datetime(2018, 11, 29, 0, 0, 57, 796000), 73, 'paid', None, None, 954, 'Tampa-St. Petersburg-Clearwater, FL', '\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"'),\n", 119 | " (2, datetime.datetime(2018, 11, 29, 0, 1, 30, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"'),\n", 120 | " (3, datetime.datetime(2018, 11, 29, 0, 4, 1, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"'),\n", 121 | " (4, datetime.datetime(2018, 11, 29, 0, 4, 55, 796000), 73, 'paid', None, None, 954, 'Tampa-St. Petersburg-Clearwater, FL', '\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"'),\n", 122 | " (5, datetime.datetime(2018, 11, 29, 0, 7, 13, 796000), 24, 'paid', None, None, 984, 'Lake Havasu City-Kingman, AZ', '\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"')]" 123 | ] 124 | }, 125 | "execution_count": 3, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "%sql SELECT * FROM songplays LIMIT 5;" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 144 | "5 rows affected.\n" 145 | ] 146 | }, 147 | { 148 | "data": { 149 | "text/html": [ 150 | "\n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | "
user_idfirst_namelast_namegenderlevel
73JacobKleinMpaid
24LaylaGriffinFpaid
50AvaRobinsonFfree
54KalebCookMfree
32LilyBurnsFfree
" 194 | ], 195 | "text/plain": [ 196 | "[(73, 'Jacob', 'Klein', 'M', 'paid'),\n", 197 | " (24, 'Layla', 'Griffin', 'F', 'paid'),\n", 198 | " (50, 'Ava', 'Robinson', 'F', 'free'),\n", 199 | " (54, 'Kaleb', 'Cook', 'M', 'free'),\n", 200 | " (32, 'Lily', 'Burns', 'F', 'free')]" 201 | ] 202 | }, 203 | "execution_count": 4, 204 | "metadata": {}, 205 | "output_type": "execute_result" 206 | } 207 | ], 208 | "source": [ 209 | "%sql SELECT * FROM users LIMIT 5;" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 5, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "name": "stdout", 219 | "output_type": "stream", 220 | "text": [ 221 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 222 | "1 rows affected.\n" 223 | ] 224 | }, 225 | { 226 | "data": { 227 | "text/html": [ 228 | "\n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | "
song_idtitleartist_idyearduration
SOFNOQK12AB01840FCKutt Free (DJ Volume Remix)ARNNKDK1187B98BBD50407.37914
" 244 | ], 245 | "text/plain": [ 246 | "[('SOFNOQK12AB01840FC', 'Kutt Free (DJ Volume Remix)', 'ARNNKDK1187B98BBD5', 0, 407.37914)]" 247 | ] 248 | }, 249 | "execution_count": 5, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "%sql SELECT * FROM songs LIMIT 5;" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 6, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "name": "stdout", 265 | "output_type": "stream", 266 | "text": [ 267 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 268 | "1 rows affected.\n" 269 | ] 270 | }, 271 | { 272 | "data": { 273 | "text/html": [ 274 | "\n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | "
artist_idartist_nameartist_locationartist_lattitudeartist_longitude
ARNNKDK1187B98BBD5JinxZagreb Croatia45.8072615.9676
" 290 | ], 291 | "text/plain": [ 292 | "[('ARNNKDK1187B98BBD5', 'Jinx', 'Zagreb Croatia', 45.80726, 15.9676)]" 293 | ] 294 | }, 295 | "execution_count": 6, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | } 299 | ], 300 | "source": [ 301 | "%sql SELECT * FROM artists LIMIT 5;" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 7, 307 | "metadata": {}, 308 | "outputs": [ 309 | { 310 | "name": "stdout", 311 | "output_type": "stream", 312 | "text": [ 313 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 314 | "5 rows affected.\n" 315 | ] 316 | }, 317 | { 318 | "data": { 319 | "text/html": [ 320 | "\n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | "
idstart_timehourdayweekmonthyearweekday
100:00:57.796000029481120183
200:01:30.796000029481120183
300:04:01.796000029481120183
400:04:55.796000029481120183
500:07:13.796000029481120183
" 382 | ], 383 | "text/plain": [ 384 | "[(1, datetime.time(0, 0, 57, 796000), 0, 29, 48, 11, 2018, 3),\n", 385 | " (2, datetime.time(0, 1, 30, 796000), 0, 29, 48, 11, 2018, 3),\n", 386 | " (3, datetime.time(0, 4, 1, 796000), 0, 29, 48, 11, 2018, 3),\n", 387 | " (4, datetime.time(0, 4, 55, 796000), 0, 29, 48, 11, 2018, 3),\n", 388 | " (5, datetime.time(0, 7, 13, 796000), 0, 29, 48, 11, 2018, 3)]" 389 | ] 390 | }, 391 | "execution_count": 7, 392 | "metadata": {}, 393 | "output_type": "execute_result" 394 | } 395 | ], 396 | "source": [ 397 | "%sql SELECT * FROM time LIMIT 5;" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "## REMEMBER: Restart this notebook to close connection to `sparkifydb`\n", 405 | "Each time you run the cells above, remember to restart this notebook to close the connection to your database. Otherwise, you won't be able to run your code in `create_tables.py`, `etl.py`, or `etl.ipynb` files since you can't make multiple connections to the same database (in this case, sparkifydb)." 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "### Example Queries:" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 8, 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "name": "stdout", 422 | "output_type": "stream", 423 | "text": [ 424 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 425 | "(psycopg2.ProgrammingError) column \"title\" does not exist\n", 426 | "LINE 1: SELECT song_id, artist_id, title FROM songplays Order by so...\n", 427 | " ^\n", 428 | " [SQL: 'SELECT song_id, artist_id, title FROM songplays Order by song_id LIMIT 5;']\n" 429 | ] 430 | } 431 | ], 432 | "source": [ 433 | "# 1. Which song is the most popular one ?\n", 434 | "%sql SELECT song_id, artist_id, title \\\n", 435 | "FROM songplays Order by song_id LIMIT 5;" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 9, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 448 | "1 rows affected.\n" 449 | ] 450 | }, 451 | { 452 | "data": { 453 | "text/html": [ 454 | "\n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | "
avg_duration
239.729676056338
" 462 | ], 463 | "text/plain": [ 464 | "[(239.729676056338,)]" 465 | ] 466 | }, 467 | "execution_count": 9, 468 | "metadata": {}, 469 | "output_type": "execute_result" 470 | } 471 | ], 472 | "source": [ 473 | "# 2. What is the average duration of songs ?\n", 474 | "%sql SELECT AVG(duration) AS avg_duration FROM songs;" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 10, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "name": "stdout", 484 | "output_type": "stream", 485 | "text": [ 486 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 487 | "1 rows affected.\n" 488 | ] 489 | }, 490 | { 491 | "data": { 492 | "text/html": [ 493 | "\n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | "
count
6820
" 501 | ], 502 | "text/plain": [ 503 | "[(6820,)]" 504 | ] 505 | }, 506 | "execution_count": 10, 507 | "metadata": {}, 508 | "output_type": "execute_result" 509 | } 510 | ], 511 | "source": [ 512 | "# 3. The proportion of male users and female users\n", 513 | "# total user\n", 514 | "%sql SELECT COUNT(gender) FROM users;" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 11, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 527 | "1 rows affected.\n" 528 | ] 529 | }, 530 | { 531 | "data": { 532 | "text/html": [ 533 | "\n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | "
female_count
4887
" 541 | ], 542 | "text/plain": [ 543 | "[(4887,)]" 544 | ] 545 | }, 546 | "execution_count": 11, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "# number of female users\n", 553 | "%sql SELECT COUNT(gender) AS female_count \\\n", 554 | "FROM users \\\n", 555 | "WHERE gender = 'F'; \n", 556 | "# here must use single quotes. Double quotes will be treated as column names" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 12, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "name": "stdout", 566 | "output_type": "stream", 567 | "text": [ 568 | " * postgresql://student:***@127.0.0.1/sparkifydb\n", 569 | "1 rows affected.\n" 570 | ] 571 | }, 572 | { 573 | "data": { 574 | "text/html": [ 575 | "\n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | "
male_count
1933
" 583 | ], 584 | "text/plain": [ 585 | "[(1933,)]" 586 | ] 587 | }, 588 | "execution_count": 12, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "# number of male users\n", 595 | "%sql SELECT COUNT(gender) AS male_count \\\n", 596 | "FROM users \\\n", 597 | "WHERE gender = 'M';\n", 598 | "\n", 599 | "\n" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 19, 605 | "metadata": {}, 606 | "outputs": [ 607 | { 608 | "name": "stdout", 609 | "output_type": "stream", 610 | "text": [ 611 | "0.7165689149560117 0.2834310850439883\n" 612 | ] 613 | } 614 | ], 615 | "source": [ 616 | "print(4887/6820 , 1933/6820)\n" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "# references:\n", 624 | "* 1.https://towardsdatascience.com/jupyter-magics-with-sql-921370099589" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [] 654 | } 655 | ], 656 | "metadata": { 657 | "kernelspec": { 658 | "display_name": "Python 3", 659 | "language": "python", 660 | "name": "python3" 661 | }, 662 | "language_info": { 663 | "codemirror_mode": { 664 | "name": "ipython", 665 | "version": 3 666 | }, 667 | "file_extension": ".py", 668 | "mimetype": "text/x-python", 669 | "name": "python", 670 | "nbconvert_exporter": "python", 671 | "pygments_lexer": "ipython3", 672 | "version": "3.6.3" 673 | } 674 | }, 675 | "nbformat": 4, 676 | "nbformat_minor": 2 677 | } 678 | -------------------------------------------------------------------------------- /Project_1B_Data_Modeling_with_Cassandra/Readme.md: -------------------------------------------------------------------------------- 1 | Project: Data Modeling with Cassandra 2 | 3 | ## Objective: 4 | A startup called Sparkify wanted to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team was particularly interested in understanding what songs users were listening to. But there was no easy way to query the data to generate the results, since the data resided in a directory of CSV files on user activity on the app. 5 | 6 | My role was to create an Apache Cassandra database which could create queries on song play data to answer the questions. I also need to test my database by running queries given by the analytics team from Sparkify to create the results. 7 | 8 | 9 | 10 | ## Datasets 11 | For this project, there is one dataset: event_data. The directory of CSV files partitioned by date. 
12 | Here are examples of filepaths to two files in the dataset: 13 | 14 | event_data/2018-11-08-events.csv 15 | event_data/2018-11-09-events.csv 16 | link to an image of the dataset with all the columns: 17 | https://r766469c826649xjupyterlrhraafil.udacity-student-workspaces.com/files/images 18 | link to download the event data: 19 | https://r766469c826649xjupyterlrhraafil.udacity-student-workspaces.com/files/event_data 20 | 21 | 22 | Before inserting the data into the Apache Cassandra tables, I generated a smaller data set called event_datafile_new.csv from event_data by extracting the most relevant information. It contains the following columns: 23 | 24 | artist 25 | firstName of user 26 | gender of user 27 | item number in session 28 | last name of user 29 | length of the song 30 | level (paid or free song) 31 | location of the user 32 | sessionId 33 | song title 34 | userId 35 | 36 | 37 | 38 | ## Import Python packages 39 | import pandas as pd 40 | import cassandra 41 | import re 42 | import os 43 | import glob 44 | import numpy as np 45 | import json 46 | import csv 47 | 48 | 49 | 50 | ## How to run the project? 51 | All the libraries above should be installed 52 | Open Jupyter Notebook from your terminal by typing "jupyter notebook" 53 | Open the files you want to run 54 | 55 | 56 | ## Two Major Parts of the Project: 57 | 58 | Part I. Build ETL Pipeline for Pre-Processing the Files 59 | Part II. Complete the Apache Cassandra coding portion 60 | 61 | 62 | 63 | ## Project Steps 64 | 65 | 1. Model the NoSQL database (Apache Cassandra database) 66 | 2. Design tables to answer the queries outlined in the project .ipynb file 67 | 3. Write Apache Cassandra CREATE KEYSPACE and SET KEYSPACE statements 68 | 4. Develop the CREATE statement for each of the tables to address each question 69 | 5. Load the data with an INSERT statement for each of the tables 70 | 6. Include IF NOT EXISTS clauses in the CREATE statements to create tables only if the tables do not already exist. 71 | (Include a DROP TABLE statement for each table. This way you can drop and recreate the tables whenever you want to reset your 72 | database and test your ETL pipeline ) 73 | 7. Test by running the proper SELECT statements with the correct WHERE clause 74 | 75 | 76 | 77 | ## Key Point: Primary Keys in Creating Tables 78 | 79 | There are 3 queries, which means I need to create 3 corresponding tables. 80 | Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4 81 | Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182 82 | Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own' 83 | 84 | Each query requires three basic steps: create a table, shape an incoming row of data, insert that row into the table 85 | 86 | It is crucial to create a proper primary key, because it directly affects query execution later. 87 | 88 | A primary key is composed of a partition key and clustering columns. The number of clustering columns varies from case to case: some queries may require multiple 89 | while others may require none. The point to keep in mind is that the partition key is for filtering while clustering columns are for sorting.
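For illustration, here is a minimal sketch of what this looks like in code for the second query above (the table name, keyspace name and column list below are assumptions, not the exact statements from the notebook). The columns inside the inner parentheses form the composite partition key used for filtering, and itemInSession is the clustering column that orders rows within each partition:

    from cassandra.cluster import Cluster

    # keyspace name 'sparkify' is assumed for this sketch
    session = Cluster(['127.0.0.1']).connect('sparkify')
    session.execute("""
        CREATE TABLE IF NOT EXISTS songs_by_user_session (
            userid int, sessionid int, iteminsession int, artist text, song text, firstname text, lastname text,
            PRIMARY KEY ((userid, sessionid), iteminsession))""")
    # filter on the full partition key; rows come back ordered by the clustering column iteminsession
    rows = session.execute("SELECT artist, song, firstname, lastname FROM songs_by_user_session WHERE userid = 10 AND sessionid = 182")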
90 | 91 | In query 1: Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4 92 | we have two filtering conditions, sessionId and itemInSession, so the partition key is a combination of sessionId and itemInSession. 93 | Thus PRIMARY KEY((sessionId, itemInSession)) 94 | 95 | In query 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182 96 | we have two filtering conditions, userid and sessionId, so the partition key is a combination of userid and sessionId. Also, we are required to sort by itemInSession, so itemInSession should be the clustering column. 97 | Thus PRIMARY KEY((userid, sessionId), itemInSession) 98 | 99 | In query 3: Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own' 100 | We have only one filtering condition, which is the song title, so the song title should be the partition key. The partition key alone may not be 101 | enough to make the primary key unique. Since we need user information, I add userid as the clustering column. 102 | Thus PRIMARY KEY(song, userid) 103 | 104 | 105 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/README.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3 (a directory of JSON logs on the app's user activity & a directory of JSON metadata on the app's songs) 3 | 4 | 5 | ## My Task 6 | Build an ETL pipeline to extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables for the analytics team to find insights about what songs their users are listening to. 7 | 8 | 9 | ## The Datasets 10 | There are two datasets that reside in S3. 11 | 12 | (1). Link to Song data: s3://udacity-dend/song_data 13 | 14 | The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata 15 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 16 | 17 | For example, here is a filepath to a file in this dataset. 18 | song_data/A/B/C/TRABCEI128F424C983.json 19 | 20 | Single song file example : 21 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", 22 | "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 23 | 24 | (2). Link to Log data: s3://udacity-dend/log_data 25 | 26 | It consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These 27 | simulate app activity logs from an imaginary music streaming app based on configuration settings. 28 | The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset. 29 | 30 | log_data/2018/11/2018-11-12-events.json 31 | log_data/2018/11/2018-11-13-events.json 32 | 33 | 34 | ## Files in the Repository 35 | (1).sql_queries.py 36 | Defines SQL statements for creating tables, dropping tables, copying data from S3 to staging tables, and inserting data from staging 37 | tables into dimensional tables.
38 | 39 | The statements will be imported into create_tables.py and etl.py 40 | 41 | (2).create_tables.py 42 | Creates the dimension and fact tables for the star schema in Redshift 43 | 44 | (3).etl.py 45 | Executes the SQL statements defined in sql_queries.py: loads data from S3 into staging tables on Redshift and then processes that data into the analytics tables on Redshift. 46 | 47 | (4).README.md 48 | A summary and supporting material for the whole project. 49 | 50 | 51 | ## Steps to Run the Program 52 | (1). Finish sql_queries.py 53 | (2). Finish create_tables.py 54 | (3). Create an IAM role with read-only access and a security group 55 | (4). Launch a Redshift cluster (do not let your cluster run overnight or over the weekend, to prevent unexpectedly high charges) 56 | (5). Run create_tables.py in the terminal by typing "python create_tables.py". 57 | A new terminal is available if you click the "file" tab on the top left, move your cursor to "new" and select 'terminal' 58 | (6). In the query editor under your cluster you will see your database and all your tables (you must run "python create_tables.py" first) 59 | To use the query editor in your cluster, you need to make sure 60 | a. the node type meets the requirement (DC1.8xlarge, DC2.large, DC2.8xlarge or DS2.8xlarge; any one is fine) 61 | b. IAM policies are attached to enable access to the query editor 62 | make sure to select 'public' in the dropdown button under schema 63 | (7). Run etl.py in the terminal by typing "python etl.py". 64 | 65 | 66 | ## Star Schema for Song Play Analysis 67 | (1).Fact Table 68 | songplays - records in event data associated with song plays, i.e. records with page NextSong 69 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 70 | (2).Dimension Tables 71 | users - users in the app: 72 | user_id, first_name, last_name, gender, level 73 | songs - songs in the music database: 74 | song_id, title, artist_id, year, duration 75 | artists - artists in the music database: 76 | artist_id, name, location, latitude, longitude 77 | time - timestamps of records in songplays broken down into specific units: 78 | start_time, hour, day, week, month, year, weekday 79 | 80 | 81 | 82 | ## Example Queries 83 | (1).Find the popular artists 84 | 85 | (2).Find the popular songs 86 | select song_title, count(song_title) as song_played 87 | from staging_events 88 | group by song_title 89 | order by song_played desc; 90 | 91 | (3).Song plays by hour 92 | 93 | (4).Song plays by level and gender 94 | 95 | 96 | ## References: 97 | * https://www.klipfolio.com/blog/create-sql-dashboard 98 | * https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-authorization.html 99 | * https://github.com/FedericoSerini/DEND-Project-3-Data-Warehouse-AWS/blob/master/sql_queries.py 100 | * https://github.com/janjagusch/dend_03_data_warehouse 101 | * https://www.postgresql.org/docs/9.4/functions-datetime.html 102 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/create_tables.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import create_table_queries, drop_table_queries 4 | 5 | 6 | def drop_tables(cur, conn): 7 | """ 8 | This function drops tables.
9 | 10 | argument: 11 | cur: the cursor object 12 | conn : the connection object 13 | 14 | return : 15 | none 16 | """ 17 | for query in drop_table_queries: 18 | cur.execute(query) 19 | conn.commit() 20 | 21 | 22 | def create_tables(cur, conn): 23 | """ 24 | This function creates tables 25 | 26 | argument: 27 | cur: the cursor object 28 | conn : the connection object 29 | 30 | return : 31 | none 32 | """ 33 | for query in create_table_queries: 34 | cur.execute(query) 35 | conn.commit() 36 | 37 | 38 | def main(): 39 | """ 40 | When the program is executed, the Python interpreter starts executing the code inside the main function. 41 | The two functions above are executed and then the connection is closed 42 | """ 43 | config = configparser.ConfigParser() 44 | config.read('dwh.cfg') 45 | 46 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 47 | conn.set_session(autocommit=True) 48 | cur = conn.cursor() 49 | 50 | drop_tables(cur, conn) 51 | create_tables(cur, conn) 52 | 53 | conn.close() 54 | 55 | 56 | if __name__ == "__main__": 57 | """ 58 | For the Python main function, we have to use the if __name__ == '__main__' condition to execute it. 59 | """ 60 | main() -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/dwh.cfg: -------------------------------------------------------------------------------- 1 | [CLUSTER] 2 | # get the cluster info from exercise 2 IaC 3 | # this needs to be changed if it's in a different region. You can find the endpoint in the cluster details; remove the port 4 | HOST = redshift-cluster.cpynd2akmpcr.us-west-2.redshift.amazonaws.com 5 | DB_NAME = sparkify 6 | DB_USER = sparkifyuser 7 | DB_PASSWORD = Passw0rd 8 | DB_PORT = 5439 9 | # these four do not require single quotes 10 | 11 | [IAM_ROLE] 12 | ARN = 'arn:aws:iam::162431663335:role/myRedshiftRole' 13 | 14 | [S3] 15 | LOG_DATA = 's3://udacity-dend/log_data' 16 | LOG_JSONPATH = 's3://udacity-dend/log_json_path.json' 17 | SONG_DATA = 's3://udacity-dend/song_data' 18 | # these 4 require single quotes 19 | 20 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/etl.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import psycopg2 3 | from sql_queries import copy_table_queries, insert_table_queries 4 | 5 | 6 | def load_staging_tables(cur, conn): 7 | """ 8 | This function loads data from S3 into staging tables on Redshift 9 | 10 | argument: 11 | cur: the cursor object 12 | conn : the connection object 13 | 14 | return : 15 | none 16 | """ 17 | for query in copy_table_queries: 18 | cur.execute(query) 19 | conn.commit() 20 | 21 | 22 | def insert_tables(cur, conn): 23 | """ 24 | This function loads data from staging tables into analytics tables on Redshift 25 | 26 | argument: 27 | cur: the cursor object 28 | conn : the connection object 29 | 30 | return : 31 | none 32 | """ 33 | for query in insert_table_queries: 34 | cur.execute(query) 35 | conn.commit() 36 | 37 | 38 | def main(): 39 | """ 40 | This main function extracts all the required information to connect to the database, then 41 | calls the load_staging_tables function to load data into the staging tables and the insert_tables function to load data into 42 | the analytics tables 43 | """ 44 | config = configparser.ConfigParser() 45 | config.read('dwh.cfg') 46 | 47 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
48 | cur = conn.cursor() 49 | 50 | load_staging_tables(cur, conn) 51 | insert_tables(cur, conn) 52 | 53 | conn.close() 54 | 55 | 56 | if __name__ == "__main__": 57 | main() 58 | -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/project3_star_schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_3_Data_Warehouse_with_Redshift/project3_star_schema.PNG -------------------------------------------------------------------------------- /Project_3_Data_Warehouse_with_Redshift/sql_queries.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | 3 | 4 | # CONFIG 5 | config = configparser.ConfigParser() 6 | config.read('dwh.cfg') 7 | 8 | 9 | 10 | # DROP TABLES 11 | # no quote characters are needed around the staging table names 12 | staging_events_table_drop = "DROP TABLE IF EXISTS staging_events" 13 | staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs" 14 | songplay_table_drop = "DROP TABLE IF EXISTS songplays" 15 | user_table_drop = "DROP TABLE IF EXISTS users" 16 | song_table_drop = "DROP TABLE IF EXISTS songs" 17 | artist_table_drop = "DROP TABLE IF EXISTS artists" 18 | time_table_drop = "DROP TABLE IF EXISTS times" 19 | 20 | 21 | 22 | # CREATE TABLES 23 | # every table should have a primary key 24 | # I deleted all the 'if not exists' in the table creation statements 25 | 26 | staging_events_table_create= ("""CREATE TABLE staging_events ( 27 | artist_name varchar, 28 | auth varchar, 29 | user_first_name varchar, 30 | gender varchar, 31 | itemInSession int, 32 | user_last_name varchar, 33 | length numeric, 34 | level varchar, 35 | user_location varchar, 36 | method varchar, 37 | page varchar, 38 | registration bigint, 39 | sessionId int, 40 | song_title varchar, 41 | status int, 42 | ts bigint, 43 | userAgent varchar, 44 | user_id int)"""); 45 | # ts (epoch milliseconds) is kept here and converted to start_time later, in the insert queries, to stay consistent 46 | 47 | 48 | 49 | staging_songs_table_create = ("""CREATE TABLE staging_songs( 50 | num_songs int, 51 | artist_id varchar, 52 | artist_latitude numeric, 53 | artist_longitude numeric, 54 | artist_location varchar, 55 | artist_name varchar, 56 | song_id varchar, 57 | song_title varchar, 58 | duration numeric, 59 | year int) 60 | """); 61 | 62 | 63 | 64 | songplay_table_create = ("""CREATE TABLE songplays ( 65 | songplay_id INT IDENTITY(0,1) PRIMARY KEY , 66 | start_time TIMESTAMP NOT NULL REFERENCES times(start_time), 67 | user_id INT NOT NULL REFERENCES users(user_id), 68 | level VARCHAR, 69 | song_id VARCHAR NOT NULL REFERENCES songs(song_id), 70 | artist_id VARCHAR NOT NULL REFERENCES artists(artist_id), 71 | sessionId INT , 72 | user_location VARCHAR, 73 | userAgent VARCHAR 74 | ) 75 | """); 76 | 77 | # four references not null removed 78 | 79 | 80 | user_table_create = ("""CREATE TABLE users ( 81 | user_id INT PRIMARY KEY, 82 | user_first_name VARCHAR, 83 | user_last_name VARCHAR, 84 | gender VARCHAR, 85 | level VARCHAR 86 | ) 87 | """); 88 | 89 | 90 | 91 | song_table_create = ("""CREATE TABLE songs ( 92 | song_id VARCHAR PRIMARY KEY, 93 | song_title VARCHAR, 94 | artist_id VARCHAR NOT NULL, 95 | year INTEGER, 96 | duration FLOAT 97 | ) 98 | """); 99 | 100 | 101 | 102 | artist_table_create = ("""CREATE TABLE artists ( 103 | artist_id VARCHAR PRIMARY KEY, 104 | artist_name
VARCHAR, 105 | artist_location VARCHAR, 106 | artist_latitude FLOAT, 107 | artist_longitude FLOAT 108 | ) 109 | """); 110 | 111 | 112 | 113 | time_table_create = ("""CREATE TABLE times ( 114 | start_time TIMESTAMP PRIMARY KEY, 115 | hour INTEGER , 116 | day INTEGER, 117 | week INTEGER, 118 | month VARCHAR, 119 | year INTEGER, 120 | weekday VARCHAR 121 | ) 122 | """); 123 | # changed INT to INTEGER 124 | 125 | ARN = config.get('IAM_ROLE', 'ARN') 126 | LOG_DATA = config.get('S3','LOG_DATA') 127 | LOG_JSONPATH = config.get('S3','LOG_JSONPATH') 128 | SONG_DATA = config.get('S3','SONG_DATA') 129 | 130 | # To load the sample data, you must provide authentication for your cluster to access Amazon S3 on your behalf. 131 | # You can provide either role-based authentication or key-based authentication. In this case it is role-based authentication 132 | 133 | # setting COMPUPDATE, STATUPDATE to speed up COPY 134 | staging_events_copy = (""" copy staging_events from {} 135 | iam_role {} 136 | region 'us-west-2' 137 | COMPUPDATE OFF STATUPDATE OFF 138 | JSON {} timeformat 'epochmillisecs'; """).format(LOG_DATA, ARN, LOG_JSONPATH ) 139 | # completed: 8056 rows for events 140 | 141 | 142 | 143 | staging_songs_copy = (""" copy staging_songs from {} 144 | iam_role {} 145 | region 'us-west-2' 146 | COMPUPDATE OFF STATUPDATE OFF 147 | JSON 'auto' """).format(SONG_DATA,ARN) 148 | # completed : 14896 rows for songs 149 | 150 | 151 | 152 | # FINAL TABLES 153 | # make sure when selecting columns, the orders of the columns should be the same as previous order 154 | 155 | # remember songplay_id is auto incremented so do not need to add here 156 | songplay_table_insert = (""" INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, sessionId, user_location, userAgent) 157 | SELECT DISTINCT 158 | TIMESTAMP 'epoch' + e.ts/1000 *INTERVAL '1 second' as start_time, 159 | e.user_id, 160 | e.level, 161 | s.song_id, 162 | s.artist_id, 163 | e.sessionId, 164 | e.user_location, 165 | e.userAgent 166 | FROM staging_events e 167 | JOIN staging_songs s 168 | ON e.song_title = s.song_title 169 | WHERE e.page = 'NextSong' AND e.user_id NOT IN (SELECT DISTINCT sp.user_id FROM songplays sp WHERE sp.user_id = e.user_id AND sp.start_time = e.ts AND sp.sessionId = e.sessionId) 170 | """) 171 | # here pay attention to the alias. Do not make two s . 
The program will be confused 172 | 173 | 174 | user_table_insert = (""" INSERT INTO users (user_id, user_first_name, user_last_name, gender, level) 175 | SELECT DISTINCT 176 | user_id, 177 | user_first_name, 178 | user_last_name, 179 | gender, 180 | level 181 | FROM staging_events 182 | WHERE page = 'NextSong' AND user_id NOT IN (SELECT DISTINCT user_id FROM users) 183 | """) 184 | 185 | 186 | 187 | # do not need WHERE paging = 'NextSong' coz it has nothing to do with staging_events table 188 | song_table_insert = (""" INSERT INTO songs (song_id, song_title, artist_id, year, duration) 189 | SELECT DISTINCT 190 | song_id, 191 | song_title, 192 | artist_id, 193 | year, 194 | duration 195 | FROM staging_songs 196 | WHERE song_id NOT IN ( SELECT DISTINCT song_id FROM songs) 197 | """) 198 | 199 | 200 | 201 | artist_table_insert = (""" INSERT INTO artists (artist_id, artist_name, artist_location, artist_latitude, artist_longitude) 202 | SELECT DISTINCT 203 | artist_id, 204 | artist_name, 205 | artist_location, 206 | artist_latitude, 207 | artist_longitude 208 | FROM staging_songs 209 | WHERE artist_id NOT IN (SELECT DISTINCT artist_id FROM artists) 210 | """) 211 | 212 | 213 | 214 | # do not need WHERE paging = 'NextSong' coz it has nothing to do with staging_events table 215 | 216 | # in order to use extarct(), we have to have ts as varchar 217 | 218 | time_table_insert = (""" 219 | INSERT INTO times \ 220 | (start_time, hour, day, week, month, year, weekday) \ 221 | SELECT start_time, \ 222 | EXTRACT(hr from start_time), \ 223 | EXTRACT(d from start_time), \ 224 | EXTRACT(w from start_time), \ 225 | EXTRACT(mon from start_time), \ 226 | EXTRACT(yr from start_time), \ 227 | EXTRACT(weekday from start_time) \ 228 | FROM (SELECT DISTINCT TIMESTAMP 'epoch' + ts/1000 *INTERVAL '1 second' as start_time \ 229 | FROM staging_events s)""") 230 | 231 | 232 | 233 | # QUERY LISTS 234 | # When creating tables, we should create dimension tables first, then fact table 235 | create_table_queries = [staging_events_table_create, staging_songs_table_create,user_table_create, song_table_create, artist_table_create,time_table_create,songplay_table_create] 236 | 237 | drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop] 238 | 239 | copy_table_queries = [staging_events_copy, staging_songs_copy] 240 | 241 | insert_table_queries = [user_table_insert, song_table_insert, artist_table_insert, time_table_insert, songplay_table_insert] 242 | -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/log-data/2018-11-01-events.json: -------------------------------------------------------------------------------- 1 | {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"} 2 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, 
AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 3 | {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 4 | {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 5 | {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 6 | {"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 7 | {"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 8 | {"artist":"Infected Mushroom","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 9 | {"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) 
Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 10 | {"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"} 11 | {"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"} 12 | {"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 13 | {"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 14 | {"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"} 15 | {"artist":"Survivor","auth":"Logged In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/log-data/2018-11-25-events.json: -------------------------------------------------------------------------------- 1 | {"artist":"matchbox twenty","auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":0,"lastName":"Duffy","length":177.65832,"level":"free","location":"Seattle-Tacoma-Bellevue, 
WA","method":"PUT","page":"NextSong","registration":1540146037796.0,"sessionId":846,"song":"Argue (LP Version)","status":200,"ts":1543109954796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 2 | {"artist":"The Lonely Island \/ T-Pain","auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":1,"lastName":"Duffy","length":156.23791,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"PUT","page":"NextSong","registration":1540146037796.0,"sessionId":846,"song":"I'm On A Boat","status":200,"ts":1543110131796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 3 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":2,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"GET","page":"Home","registration":1540146037796.0,"sessionId":846,"song":null,"status":200,"ts":1543110132796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 4 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":3,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"GET","page":"Settings","registration":1540146037796.0,"sessionId":846,"song":null,"status":200,"ts":1543110168796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 5 | {"artist":null,"auth":"Logged In","firstName":"Jayden","gender":"F","itemInSession":4,"lastName":"Duffy","length":null,"level":"free","location":"Seattle-Tacoma-Bellevue, WA","method":"PUT","page":"Save Settings","registration":1540146037796.0,"sessionId":846,"song":null,"status":307,"ts":1543110169796,"userAgent":"\"Mozilla\/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit\/537.51.2 (KHTML, like Gecko) Version\/7.0 Mobile\/11D257 Safari\/9537.53\"","userId":"76"} 6 | {"artist":"John Mayer","auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":0,"lastName":"Scott","length":275.27791,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"PUT","page":"NextSong","registration":1540872073796.0,"sessionId":856,"song":"All We Ever Do Is Say Goodbye","status":200,"ts":1543113347796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 7 | {"artist":null,"auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":1,"lastName":"Scott","length":null,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"GET","page":"Home","registration":1540872073796.0,"sessionId":856,"song":null,"status":200,"ts":1543113365796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 8 | {"artist":"10_000 Maniacs","auth":"Logged In","firstName":"Wyatt","gender":"M","itemInSession":2,"lastName":"Scott","length":251.8722,"level":"free","location":"Eureka-Arcata-Fortuna, CA","method":"PUT","page":"NextSong","registration":1540872073796.0,"sessionId":856,"song":"Gun Shy (LP Version)","status":200,"ts":1543113622796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; Trident\/7.0; rv:11.0) like Gecko","userId":"9"} 9 | 
{"artist":"Leona Lewis","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Cuevas","length":203.88526,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Forgive Me","status":200,"ts":1543122348796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 10 | {"artist":"Nine Inch Nails","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Cuevas","length":277.83791,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"La Mer","status":200,"ts":1543122551796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 11 | {"artist":"Audioslave","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Cuevas","length":334.91546,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Am The Highway","status":200,"ts":1543122828796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 12 | {"artist":"Kid Rock","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":3,"lastName":"Cuevas","length":296.95955,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"All Summer Long (Album Version)","status":200,"ts":1543123162796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 13 | {"artist":"The Jets","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":4,"lastName":"Cuevas","length":220.89098,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Do You","status":200,"ts":1543123458796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 14 | {"artist":"The Gerbils","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":5,"lastName":"Cuevas","length":27.01016,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"(iii)","status":200,"ts":1543123678796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 15 | {"artist":"Damian Marley \/ Stephen Marley \/ Yami Bolo","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":6,"lastName":"Cuevas","length":304.69179,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Still Searching","status":200,"ts":1543123705796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 16 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":7,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543124166796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 17 | {"artist":"The Bloody Beetroots","auth":"Logged 
In","firstName":"Chloe","gender":"F","itemInSession":8,"lastName":"Cuevas","length":201.97832,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Warp 1.9 (feat. Steve Aoki)","status":200,"ts":1543124951796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 18 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":9,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543125120796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 19 | {"artist":"The Specials","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":10,"lastName":"Cuevas","length":188.81261,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Rat Race","status":200,"ts":1543125152796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 20 | {"artist":"The Lively Ones","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":11,"lastName":"Cuevas","length":142.52363,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Walkin' The Board (LP Version)","status":200,"ts":1543125340796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 21 | {"artist":"Katie Melua","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":12,"lastName":"Cuevas","length":252.78649,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Blues In The Night","status":200,"ts":1543125482796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 22 | {"artist":"Jason Mraz","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":13,"lastName":"Cuevas","length":243.48689,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I'm Yours (Album Version)","status":200,"ts":1543125734796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 23 | {"artist":"Fisher","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":14,"lastName":"Cuevas","length":133.98159,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Rianna","status":200,"ts":1543125977796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 24 | {"artist":"Zee Avi","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":15,"lastName":"Cuevas","length":160.62649,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"No Christmas For Me","status":200,"ts":1543126110796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 25 | {"artist":"Black Eyed Peas","auth":"Logged 
In","firstName":"Chloe","gender":"F","itemInSession":16,"lastName":"Cuevas","length":289.12281,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"I Gotta Feeling","status":200,"ts":1543126270796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 26 | {"artist":"Emiliana Torrini","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":17,"lastName":"Cuevas","length":184.29342,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"Sunny Road","status":200,"ts":1543126559796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 27 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":18,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":916,"song":null,"status":200,"ts":1543126820796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 28 | {"artist":"Days Of The New","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":19,"lastName":"Cuevas","length":258.5073,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"The Down Town","status":200,"ts":1543128418796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 29 | {"artist":"Julio Iglesias duet with Willie Nelson","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":20,"lastName":"Cuevas","length":212.16608,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":916,"song":"To All The Girls I've Loved Before (With Julio Iglesias)","status":200,"ts":1543128676796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 30 | {"artist":null,"auth":"Logged In","firstName":"Jacqueline","gender":"F","itemInSession":0,"lastName":"Lynch","length":null,"level":"paid","location":"Atlanta-Sandy Springs-Roswell, GA","method":"GET","page":"Home","registration":1540223723796.0,"sessionId":914,"song":null,"status":200,"ts":1543133256796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"29"} 31 | {"artist":"Jason Mraz & Colbie Caillat","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Roth","length":189.6224,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"PUT","page":"NextSong","registration":1540699429796.0,"sessionId":704,"song":"Lucky (Album Version)","status":200,"ts":1543135208796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 32 | {"artist":null,"auth":"Logged In","firstName":"Anabelle","gender":"F","itemInSession":0,"lastName":"Simpson","length":null,"level":"free","location":"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD","method":"GET","page":"Home","registration":1541044398796.0,"sessionId":901,"song":null,"status":200,"ts":1543145299796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 
Safari\/537.36\"","userId":"69"} 33 | {"artist":"R. Kelly","auth":"Logged In","firstName":"Anabelle","gender":"F","itemInSession":1,"lastName":"Simpson","length":234.39628,"level":"free","location":"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD","method":"PUT","page":"NextSong","registration":1541044398796.0,"sessionId":901,"song":"The World's Greatest","status":200,"ts":1543145305796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"69"} 34 | {"artist":null,"auth":"Logged In","firstName":"Kynnedi","gender":"F","itemInSession":0,"lastName":"Sanchez","length":null,"level":"free","location":"Cedar Rapids, IA","method":"GET","page":"Home","registration":1541079034796.0,"sessionId":804,"song":null,"status":200,"ts":1543151224796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"89"} 35 | {"artist":"Jacky Terrasson","auth":"Logged In","firstName":"Marina","gender":"F","itemInSession":0,"lastName":"Sutton","length":342.7522,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1541064343796.0,"sessionId":373,"song":"Le Jardin d'Hiver","status":200,"ts":1543154200796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"48"} 36 | {"artist":"Papa Roach","auth":"Logged In","firstName":"Theodore","gender":"M","itemInSession":0,"lastName":"Harris","length":202.1873,"level":"free","location":"Red Bluff, CA","method":"PUT","page":"NextSong","registration":1541097374796.0,"sessionId":813,"song":"Alive","status":200,"ts":1543155228796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"14"} 37 | {"artist":"Burt Bacharach","auth":"Logged In","firstName":"Theodore","gender":"M","itemInSession":1,"lastName":"Harris","length":156.96934,"level":"free","location":"Red Bluff, CA","method":"PUT","page":"NextSong","registration":1541097374796.0,"sessionId":813,"song":"Casino Royale Theme (Main Title)","status":200,"ts":1543155430796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"14"} 38 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Cuevas","length":null,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540940782796.0,"sessionId":923,"song":null,"status":200,"ts":1543161483796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 39 | {"artist":"Floetry","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Cuevas","length":254.48444,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Sunshine","status":200,"ts":1543161985796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 40 | {"artist":"The Rakes","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Cuevas","length":225.2273,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Leave The City And Come Home","status":200,"ts":1543162239796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; 
rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 41 | {"artist":"Dwight Yoakam","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":3,"lastName":"Cuevas","length":239.3073,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"You're The One","status":200,"ts":1543162464796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 42 | {"artist":"Ween","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":4,"lastName":"Cuevas","length":228.10077,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"Voodoo Lady","status":200,"ts":1543162703796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 43 | {"artist":"Caf\u00c3\u0083\u00c2\u00a9 Quijano","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":5,"lastName":"Cuevas","length":197.32853,"level":"paid","location":"San Francisco-Oakland-Hayward, CA","method":"PUT","page":"NextSong","registration":1540940782796.0,"sessionId":923,"song":"La Lola","status":200,"ts":1543162931796,"userAgent":"Mozilla\/5.0 (Windows NT 5.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"49"} 44 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":0,"lastName":"Roth","length":null,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"GET","page":"Home","registration":1540699429796.0,"sessionId":925,"song":null,"status":200,"ts":1543166945796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 45 | {"artist":"Parov Stelar","auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":1,"lastName":"Roth","length":203.65016,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"PUT","page":"NextSong","registration":1540699429796.0,"sessionId":925,"song":"Good Bye Emily (feat. 
Gabriella Hanninen)","status":200,"ts":1543166945796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 46 | {"artist":null,"auth":"Logged In","firstName":"Chloe","gender":"F","itemInSession":2,"lastName":"Roth","length":null,"level":"free","location":"Indianapolis-Carmel-Anderson, IN","method":"GET","page":"Home","registration":1540699429796.0,"sessionId":925,"song":null,"status":200,"ts":1543166953796,"userAgent":"Mozilla\/5.0 (Windows NT 6.1; rv:31.0) Gecko\/20100101 Firefox\/31.0","userId":"78"} 47 | {"artist":null,"auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":0,"lastName":"Levine","length":null,"level":"paid","location":"Portland-South Portland, ME","method":"GET","page":"Home","registration":1540794356796.0,"sessionId":915,"song":null,"status":200,"ts":1543169974796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 48 | {"artist":"Bryan Adams","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":1,"lastName":"Levine","length":166.29506,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"I Will Always Return","status":200,"ts":1543170092796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 49 | {"artist":"KT Tunstall","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":2,"lastName":"Levine","length":192.31302,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"White Bird","status":200,"ts":1543170258796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 50 | {"artist":"Technicolour","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":3,"lastName":"Levine","length":235.12771,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Turn Away","status":200,"ts":1543170450796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 51 | {"artist":"The Dears","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":4,"lastName":"Levine","length":289.95873,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Lost In The Plot","status":200,"ts":1543170685796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 52 | {"artist":"Go West","auth":"Logged In","firstName":"Tegan","gender":"F","itemInSession":5,"lastName":"Levine","length":259.49995,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"NextSong","registration":1540794356796.0,"sessionId":915,"song":"Never Let Them See You Sweat","status":200,"ts":1543170974796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 53 | {"artist":null,"auth":"Logged 
In","firstName":"Tegan","gender":"F","itemInSession":6,"lastName":"Levine","length":null,"level":"paid","location":"Portland-South Portland, ME","method":"PUT","page":"Logout","registration":1540794356796.0,"sessionId":915,"song":null,"status":307,"ts":1543170975796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"80"} 54 | {"artist":null,"auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":null,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"GET","page":"Home","registration":1540266185796.0,"sessionId":912,"song":null,"status":200,"ts":1543170999796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"} 55 | {"artist":null,"auth":"Logged Out","firstName":null,"gender":null,"itemInSession":7,"lastName":null,"length":null,"level":"paid","location":null,"method":"GET","page":"Home","registration":null,"sessionId":915,"song":null,"status":200,"ts":1543171228796,"userAgent":null,"userId":""} 56 | {"artist":"Gondwana","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":0,"lastName":"Hicks","length":262.5824,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"Mi Princesa","status":200,"ts":1543189690796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} 57 | {"artist":null,"auth":"Logged In","firstName":"Kevin","gender":"M","itemInSession":0,"lastName":"Arellano","length":null,"level":"free","location":"Harrisburg-Carlisle, PA","method":"GET","page":"Home","registration":1540006905796.0,"sessionId":855,"song":null,"status":200,"ts":1543189812796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"66"} 58 | {"artist":"Ella Fitzgerald","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":1,"lastName":"Hicks","length":427.15383,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"On Green Dolphin Street (Medley) (1999 Digital Remaster)","status":200,"ts":1543189952796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} 59 | {"artist":"Creedence Clearwater Revival","auth":"Logged In","firstName":"Jordan","gender":"F","itemInSession":2,"lastName":"Hicks","length":184.73751,"level":"free","location":"Salinas, CA","method":"PUT","page":"NextSong","registration":1540008898796.0,"sessionId":814,"song":"Run Through The Jungle","status":200,"ts":1543190379796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.78.2 (KHTML, like Gecko) Version\/7.0.6 Safari\/537.78.2\"","userId":"37"} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/.DS_Store 
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAAW128F429D538.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAABD128F429CF47.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMJAGH1187FB546F3", "artist_latitude": 35.14968, "artist_longitude": -90.04892, "artist_location": "Memphis, TN", "artist_name": "The Box Tops", "song_id": "SOCIWDW12A8C13D406", "title": "Soul Deep", "duration": 148.03546, "year": 1969} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAADZ128F9348C2E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKRRTF1187B9984DA", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sonora Santanera", "song_id": "SOXVLOJ12AB0189215", "title": "Amor De Cabaret", "duration": 177.47546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAEF128F4273421.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7G5I41187FB4CE6C", "artist_latitude": null, "artist_longitude": null, "artist_location": "London, England", "artist_name": "Adam Ant", "song_id": "SONHOTT12A8C13493C", "title": "Something Girls", "duration": 233.40363, "year": 1982} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAFD128F92F423A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARXR32B1187FB57099", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gob", "song_id": "SOFSOCN12A8C143F5D", "title": "Face the Ashes", "duration": 209.60608, "year": 2007} -------------------------------------------------------------------------------- 
/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAMO128F1481E7F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKFYS91187B98E58F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Jeff And Sheri Easter", "song_id": "SOYMRWW12A6D4FAB14", "title": "The Moon And I (Ordinary Day Album Version)", "duration": 267.7024, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAMQ128F1460CD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD0S291187B9B7BF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "Ohio", "artist_name": "Rated R", "song_id": "SOMJBYD12A6D4F8557", "title": "Keepin It Real (Skit)", "duration": 114.78159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAPK128E0786D96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR10USD1187B99F3F1", "artist_latitude": null, "artist_longitude": null, "artist_location": "Burlington, Ontario, Canada", "artist_name": "Tweeterfriendly Music", "song_id": "SOHKNRJ12A6701D1F8", "title": "Drop of Rain", "duration": 189.57016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAARJ128F9320760.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8ZCNI1187B9A069B", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Planet P Project", "song_id": "SOIAZJW12AB01853F1", "title": "Pink World", "duration": 269.81832, "year": 1984} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAVG12903CFA543.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOUDSGM12AC9618304", "title": "Insatiable (Instrumental Version)", "duration": 266.39628, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/A/TRAAAVO128F93133D4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSJW91187B9B1D6B", "artist_latitude": 35.21962, "artist_longitude": -80.01955, "artist_location": "North Carolina", "artist_name": "JennyAnyKind", "song_id": "SOQHXMF12AB0182363", "title": "Young Boy Blues", "duration": 218.77506, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABCL128F4286650.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARC43071187B990240", "artist_latitude": null, "artist_longitude": null, "artist_location": "Wisner, LA", "artist_name": "Wayne Watson", "song_id": "SOKEJEJ12A8C13E0D0", "title": "The Urgency (LP Version)", 
"duration": 245.21098, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABDL12903CAABBA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARL7K851187B99ACD2", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Andy Andy", "song_id": "SOMUYGI12AB0188633", "title": "La Culpa", "duration": 226.35057, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABJL12903CDCF1A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARHHO3O1187B989413", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Bob Azzam", "song_id": "SORAMLE12AB017C8B0", "title": "Auguri Cha Cha", "duration": 191.84281, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABJV128F1460C49.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIK43K1187B9AE54C", "artist_latitude": null, "artist_longitude": null, "artist_location": "Beverly Hills, CA", "artist_name": "Lionel Richie", "song_id": "SOBONFF12A6D4F84D8", "title": "Tonight Will Be Alright", "duration": 307.3824, "year": 1986} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABLR128F423B7E3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD842G1187B997376", "artist_latitude": 43.64856, "artist_longitude": -79.38533, "artist_location": "Toronto, Ontario, Canada", "artist_name": "Blue Rodeo", "song_id": "SOHUOAP12A8AE488E9", "title": "Floating", "duration": 491.12771, "year": 1987} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABNV128F425CEE1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARIG6O41187B988BDD", "artist_latitude": 37.16793, "artist_longitude": -95.84502, "artist_location": "United States", "artist_name": "Richard Souther", "song_id": "SOUQQEA12A8C134B1B", "title": "High Tide", "duration": 228.5971, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABRB128F9306DD5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1ZHYZ1187FB3C717", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Faiz Ali Faiz", "song_id": "SOILPQQ12AB017E82A", "title": "Sohna Nee Sohna Data", "duration": 599.24853, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABVM128F92CA9DC.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARYKCQI1187FB3B18F", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Tesla", "song_id": 
"SOXLBJT12A8C140925", "title": "Caught In A Dream", "duration": 290.29832, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABXG128F9318EBD.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNPAGP1241B9C7FD4", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "lextrical", "song_id": "SOZVMJI12AB01808AF", "title": "Synthetic Dream", "duration": 165.69424, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABYN12903CFD305.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQGYP71187FB44566", "artist_latitude": 34.31109, "artist_longitude": -94.02978, "artist_location": "Mineola, AR", "artist_name": "Jimmy Wakely", "song_id": "SOWTBJW12AC468AC6E", "title": "Broken-Down Merry-Go-Round", "duration": 151.84934, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/B/TRAABYW128F4244559.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI3BMM1187FB4255E", "artist_latitude": 38.8991, "artist_longitude": -77.029, "artist_location": "Washington", "artist_name": "Alice Stuart", "song_id": "SOBEBDG12A58A76D60", "title": "Kassie Jones", "duration": 220.78649, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACCG128F92E8A55.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR5KOSW1187FB35FF4", "artist_latitude": 49.80388, "artist_longitude": 15.47491, "artist_location": "Dubai UAE", "artist_name": "Elena", "song_id": "SOZCTXZ12AB0182364", "title": "Setanta matins", "duration": 269.58322, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACER128F4290F96.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMAC4T1187FB3FA4C", "artist_latitude": 40.82624, "artist_longitude": -74.47995, "artist_location": "Morris Plains, NJ", "artist_name": "The Dillinger Escape Plan", "song_id": "SOBBUGU12A8C13E95D", "title": "Setting Fire to Sleeping Giants", "duration": 207.77751, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACFV128F935E50B.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR47JEX1187B995D81", "artist_latitude": 37.83721, "artist_longitude": -94.35868, "artist_location": "Nevada, MO", "artist_name": "SUE THOMPSON", "song_id": "SOBLGCN12AB0183212", "title": "James (Hold The Ladder Steady)", "duration": 124.86485, "year": 1985} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACHN128F1489601.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": 
"ARGIWFO1187B9B55B7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Five Bolt Main", "song_id": "SOPSWQW12A6D4F8781", "title": "Made Like This (Live)", "duration": 225.09669, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACIW12903CC0F6D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNTLGG11E2835DDB9", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Clp", "song_id": "SOZQDIU12A58A7BCF6", "title": "Superconfidential", "duration": 338.31138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACLV128F427E123.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDNS031187B9924F0", "artist_latitude": 32.67828, "artist_longitude": -83.22295, "artist_location": "Georgia", "artist_name": "Tim Wilson", "song_id": "SONYPOM12A8C13B2D7", "title": "I Think My Wife Is Running Around On Me (Taco Hell)", "duration": 186.48771, "year": 2005} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACNS128F14A2DF5.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROUOZZ1187B9ABE51", "artist_latitude": 40.79195, "artist_longitude": -73.94512, "artist_location": "New York, NY [Spanish Harlem]", "artist_name": "Willie Bobo", "song_id": "SOBZBAZ12A6D4F8742", "title": "Spanish Grease", "duration": 168.25424, "year": 1997} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACOW128F933E35F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARH4Z031187B9A71F2", "artist_latitude": 40.73197, "artist_longitude": -74.17418, "artist_location": "Newark, NJ", "artist_name": "Faye Adams", "song_id": "SOVYKGO12AB0187199", "title": "Crazy Mixed Up World", "duration": 156.39465, "year": 1961} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACPE128F421C1B9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARB29H41187B98F0EF", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago", "artist_name": "Terry Callier", "song_id": "SOGNCJP12A58A80271", "title": "Do You Finally Need A Friend", "duration": 342.56934, "year": 1972} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACQT128F9331780.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR1Y2PT1187FB5B9CE", "artist_latitude": 27.94017, "artist_longitude": -82.32547, "artist_location": "Brandon", "artist_name": "John Wesley", "song_id": "SOLLHMX12AB01846DC", "title": "The Emperor Falls", "duration": 484.62322, "year": 0} -------------------------------------------------------------------------------- 
/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACSL128F93462F4.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAJPHH1187FB5566A", "artist_latitude": 40.7038, "artist_longitude": -73.83168, "artist_location": "Queens, NY", "artist_name": "The Shangri-Las", "song_id": "SOYTPEP12AB0180E7B", "title": "Twist and Shout", "duration": 164.80608, "year": 1964} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACTB12903CAAF15.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0RCMP1187FB3F427", "artist_latitude": 30.08615, "artist_longitude": -94.10158, "artist_location": "Beaumont, TX", "artist_name": "Billie Jo Spears", "song_id": "SOGXHEG12AB018653E", "title": "It Makes No Difference Now", "duration": 133.32853, "year": 1992} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACVS128E078BE39.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREBBGV1187FB523D2", "artist_latitude": null, "artist_longitude": null, "artist_location": "Houston, TX", "artist_name": "Mike Jones (Featuring CJ_ Mello & Lil' Bran)", "song_id": "SOOLYAZ12A6701F4A6", "title": "Laws Patrolling (Album Version)", "duration": 173.66159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/A/C/TRAACZK128F4243829.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGUVEV1187B98BA17", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Sierra Maestra", "song_id": "SOGOSOV12AF72A285E", "title": "\u00bfD\u00f3nde va Chichi?", "duration": 313.12934, "year": 1997} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/.DS_Store -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABACN128F425B784.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOQLGFP12A58A7800E", "title": "OAKtown", "duration": 259.44771, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAFJ128F42AF24E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR3JMC51187B9AE49D", "artist_latitude": 28.53823, "artist_longitude": -81.37739, "artist_location": "Orlando, FL", "artist_name": "Backstreet Boys", "song_id": "SOPVXLX12A8C1402D5", "title": "Larger Than Life", "duration": 236.25098, "year": 1999} 
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAFP128F931E9A1.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPBNLO1187FB3D52F", "artist_latitude": 40.71455, "artist_longitude": -74.00712, "artist_location": "New York, NY", "artist_name": "Tiny Tim", "song_id": "SOAOIBZ12AB01815BE", "title": "I Hold Your Hand In Mine [Live At Royal Albert Hall]", "duration": 43.36281, "year": 2000} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAIO128F42938F9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR9AWNF1187B9AB0B4", "artist_latitude": null, "artist_longitude": null, "artist_location": "Seattle, Washington USA", "artist_name": "Kenny G featuring Daryl Hall", "song_id": "SOZHPGD12A8C1394FE", "title": "Baby Come To Me", "duration": 236.93016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABATO128F42627E9.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AROGWRA122988FEE45", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Christos Dantis", "song_id": "SOSLAVG12A8C13397F", "title": "Den Pai Alo", "duration": 243.82649, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAVQ12903CBF7E0.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARMBR4Y1187B9990EB", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "David Martin", "song_id": "SOTTDKS12AB018D69B", "title": "It Wont Be Christmas", "duration": 241.47546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAWW128F4250A31.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARQ9BO41187FB5CF1F", "artist_latitude": 40.99471, "artist_longitude": -77.60454, "artist_location": "Pennsylvania", "artist_name": "John Davis", "song_id": "SOMVWWT12A58A7AE05", "title": "Knocked Out Of The Park", "duration": 183.17016, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXL128F424FC50.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARKULSX1187FB45F84", "artist_latitude": 39.49974, "artist_longitude": -111.54732, "artist_location": "Utah", "artist_name": "Trafik", "song_id": "SOQVMXR12A81C21483", "title": "Salt In NYC", "duration": 424.12363, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXR128F426515F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARI2JSK1187FB496EF", "artist_latitude": 51.50632, "artist_longitude": 
-0.12714, "artist_location": "London, England", "artist_name": "Nick Ingman;Gavyn Wright", "song_id": "SODUJBS12A8C132150", "title": "Wessex Loses a Bride", "duration": 111.62077, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAXV128F92F6AE3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDBBQ1187B98AFF5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Eddie Calvert", "song_id": "SOBBXLX12A58A79DDA", "title": "Erica (2005 Digital Remaster)", "duration": 138.63138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/A/TRABAZH128F930419A.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7ZKHQ1187B98DD73", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Glad", "song_id": "SOTUKVB12AB0181477", "title": "Blessed Assurance", "duration": 270.602, "year": 1993} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBAM128F429D223.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBGXIG122988F409D", "artist_latitude": 37.77916, "artist_longitude": -122.42005, "artist_location": "California - SF", "artist_name": "Steel Rain", "song_id": "SOOJPRH12A8C141995", "title": "Loaded Like A Gun", "duration": 173.19138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBBV128F42967D7.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR7SMBG1187B9B9066", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Los Manolos", "song_id": "SOBCOSW12A8C13D398", "title": "Rumba De Barcelona", "duration": 218.38322, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBJE12903CDB442.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGCY1Y1187B9A4FA5", "artist_latitude": 36.16778, "artist_longitude": -86.77836, "artist_location": "Nashville, TN.", "artist_name": "Gloriana", "song_id": "SOQOTLQ12AB01868D0", "title": "Clementina Santaf\u00e8", "duration": 153.33832, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBKX128F4285205.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR36F9J1187FB406F1", "artist_latitude": 56.27609, "artist_longitude": 9.51695, "artist_location": "Denmark", "artist_name": "Bombay Rockers", "song_id": "SOBKWDJ12A8C13B2F3", "title": "Wild Rose (Back 2 Basics Mix)", "duration": 230.71302, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBLU128F93349CF.json: 
-------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNNKDK1187B98BBD5", "artist_latitude": 45.80726, "artist_longitude": 15.9676, "artist_location": "Zagreb Croatia", "artist_name": "Jinx", "song_id": "SOFNOQK12AB01840FC", "title": "Kutt Free (DJ Volume Remix)", "duration": 407.37914, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBNP128F932546F.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR62SOJ1187FB47BB5", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Chase & Status", "song_id": "SOGVQGJ12AB017F169", "title": "Ten Tonne", "duration": 337.68444, "year": 2005} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBOP128F931B50D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARBEBBY1187B9B43DB", "artist_latitude": null, "artist_longitude": null, "artist_location": "Gainesville, FL", "artist_name": "Tom Petty", "song_id": "SOFFKZS12AB017F194", "title": "A Higher Place (Album Version)", "duration": 236.17261, "year": 1994} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBOR128F4286200.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARDR4AC1187FB371A1", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Montserrat Caball\u00e9;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti", "song_id": "SOBAYLL12A8C138AF9", "title": "Sono andati? 
Fingevo di dormire", "duration": 511.16363, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBTA128F933D304.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARAGB2O1187FB3A161", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Pucho & His Latin Soul Brothers", "song_id": "SOLEYHO12AB0188A85", "title": "Got My Mojo Workin", "duration": 338.23302, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBVJ128F92F7EAA.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREDL271187FB40F44", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Soul Mekanik", "song_id": "SOPEGZN12AB0181B3D", "title": "Get Your Head Stuck On Your Neck", "duration": 45.66159, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBXU128F92FEF48.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARP6N5A1187B99D1A3", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamtramck, MI", "artist_name": "Mitch Ryder", "song_id": "SOXILUQ12A58A7C72A", "title": "Jenny Take a Ride", "duration": 207.43791, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/B/TRABBZN12903CD9297.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARGSAFR1269FB35070", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Blingtones", "song_id": "SOTCKKY12AB018A141", "title": "Sonnerie lalaleul\u00e9 hi houuu", "duration": 29.54404, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCAJ12903CDFCC2.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCEC128F426456E.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR0IAWL1187B9A96D0", "artist_latitude": 8.4177, "artist_longitude": -80.11278, "artist_location": "Panama", "artist_name": "Danilo Perez", "song_id": "SONSKXP12A8C13A2C9", "title": "Native Soul", "duration": 197.19791, "year": 2003} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCEI128F424C983.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": 
"", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCFL128F149BB0D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARLTWXK1187FB5A3F8", "artist_latitude": 32.74863, "artist_longitude": -97.32925, "artist_location": "Fort Worth, TX", "artist_name": "King Curtis", "song_id": "SODREIN12A58A7F2E5", "title": "A Whiter Shade Of Pale (Live @ Fillmore West)", "duration": 326.00771, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCIX128F4265903.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARNF6401187FB57032", "artist_latitude": 40.79086, "artist_longitude": -73.96644, "artist_location": "New York, NY [Manhattan]", "artist_name": "Sophie B. Hawkins", "song_id": "SONWXQJ12A8C134D94", "title": "The Ballad Of Sleeping Beauty", "duration": 305.162, "year": 1994} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCKL128F423A778.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARPFHN61187FB575F6", "artist_latitude": 41.88415, "artist_longitude": -87.63241, "artist_location": "Chicago, IL", "artist_name": "Lupe Fiasco", "song_id": "SOWQTQZ12A58A7B63E", "title": "Streets On Fire (Explicit Album Version)", "duration": 279.97995, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCPZ128F4275C32.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR051KA1187B98B2FF", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Wilks", "song_id": "SOLYIBD12A8C135045", "title": "Music is what we love", "duration": 261.51138, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCRU128F423F449.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Marc Shaiman", "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "duration": 149.86404, "year": 2008} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCTK128F934B224.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AR558FS1187FB45658", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "40 Grit", "song_id": "SOGDBUF12A8C140FAA", "title": "Intro", "duration": 75.67628, "year": 2003} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCUQ128E0783E2B.json: -------------------------------------------------------------------------------- 
1 | {"num_songs": 1, "artist_id": "ARVBRGZ1187FB4675A", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Gwen Stefani", "song_id": "SORRZGD12A6310DBC3", "title": "Harajuku Girls", "duration": 290.55955, "year": 2004} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCXB128F4286BD3.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "ARWB3G61187FB49404", "artist_latitude": null, "artist_longitude": null, "artist_location": "Hamilton, Ohio", "artist_name": "Steve Morse", "song_id": "SODAUVL12A8C13D184", "title": "Prognosis", "duration": 363.85914, "year": 2000} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/Data/song-data/song_data/A/B/C/TRABCYE128F934CE1D.json: -------------------------------------------------------------------------------- 1 | {"num_songs": 1, "artist_id": "AREVWGE1187B9B890A", "artist_latitude": -13.442, "artist_longitude": -41.9952, "artist_location": "Noci (BA)", "artist_name": "Bitter End", "song_id": "SOFCHDR12AB01866EF", "title": "Living Hell", "duration": 282.43546, "year": 0} -------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Purpose of this Project 3 | A music streaming startup called Sparkify. It has grown their user base and song database and want to move their 4 | data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app 5 | as well as a directory of JSON metadata on the songs in their app. 6 | 7 | ## My Task 8 | Build an ETL pipeline that extracts data from S3, process them using spark, and load the data back into S3 as a set 9 | of dimensional tables. This would allow the data analytics team to continue finding insights in what songs their users 10 | are listening to. 11 | 12 | 13 | ## Datasets 14 | The two datasets reside in S3. Here is the links for each: 15 | * Song data: s3://udacity-dend/song_data 16 | * Log data: s3://udacity-dend/log_data 17 | 18 | The first data set is a subset of real data from Million Song Dataset. Each file is in JSON format and contains metadata 19 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 20 | eg: 21 | song_data/A/B/C/TRABCEI128F424C983.json 22 | song_data/A/A/B/TRAABJL12903CDCF1A.json 23 | 24 | The second data consists of log files in JSON format generated by event simulator based on the songs above. The activity 25 | log data is simulated from an imaginary music streaming app based on configuration settings. 26 | 27 | 28 | You can download the dataset on your desktop and do it locally. After all the process completed, load the output parquet data back to S3. 29 | 30 | 31 | 32 | ## This project includes 3 files 33 | etl.py : reads data from S3, processes the data using Spark and writes them back to S3 34 | dl.cfg: contains your aws credential ( do not expose them in public) 35 | README.md: explanation of the project process and decisions 36 | 37 | ## Schema for Song Play Analysis 38 | 39 | Using the song and log datasets to create a star schema optimized for queries on song play analysis. 40 | This includes the following tables. 
41 | 42 | Fact Table: 43 | songplays - records in the log data associated with song plays, i.e. records with page NextSong 44 | songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent 45 | 46 | Dimension Tables: 47 | users - users in the app 48 | user_id, first_name, last_name, gender, level 49 | songs - songs in the music database 50 | song_id, title, artist_id, year, duration 51 | artists - artists in the music database 52 | artist_id, name, location, latitude, longitude 53 | time - timestamps of records in songplays broken down into specific units 54 | start_time, hour, day, week, month, year, weekday 55 | 56 | ## Steps for this project 57 | (1) Complete the code in etl.py. 58 | (2) Create a new IAM user with S3 full access 59 | (check your existing users to see whether you have one already; normally you would create 3 types of users: 60 | user 1 with read-only access, 61 | user 2 with full access, 62 | user 3 with admin access). 63 | (3) Launch a cluster (set the master user name to the one with S3 full access). 64 | (4) In the terminal, run python etl.py. 65 | 66 | 67 | ## Notes: 68 | A new user was created on AWS IAM with S3 full access. Use its access key and secret key to fill in the credentials in dl.cfg (no quotation marks needed). 69 | Cluster type: multi-node; set the number of nodes to 4. 70 | A sample analytic query against the finished star schema is sketched after the references below. 71 | 72 | 73 | ## Reference: 74 | * https://github.com/ashleyadrias/udacity-data-engineer-nanodegree-project3-data-lakes-with-spark 75 | * https://github.com/FedericoSerini/DEND-Project-4-Apache-Spark-And-Data-Lake 76 | * https://github.com/gfkw/dend-project-4 77 | * https://github.com/jukkakansanaho/udacity-dend-project-4 78 |
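The following is a minimal, hypothetical usage sketch (not part of the original project code) showing how an analyst might query the star schema once etl.py has written its parquet tables. The output bucket path and app name are assumptions; the parquet file names match the ones etl.py writes in this project.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify_analysis").getOrCreate()

# assumption: point this at the S3 prefix (or local path) where etl.py wrote its parquet output
output_data = "s3a://<your-output-bucket>"

# read the fact table and the songs dimension back from parquet
songplays = spark.read.parquet(output_data + "/songplays_table.parquet")
songs = spark.read.parquet(output_data + "/songs_table.parquet")

# top 10 most-played songs: join the fact table to the songs dimension and count plays
(songplays.join(songs, "song_id")
    .groupBy("title")
    .count()
    .orderBy("count", ascending=False)
    .show(10))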
8 | 9 | # set up the configuration 10 | config = configparser.ConfigParser() 11 | 12 | # read in credentials for S3 full access 13 | config.read('dl.cfg') 14 | os.environ['AWS_ACCESS_KEY_ID'] = config.get('CREDENTIALS', 'AWS_ACCESS_KEY_ID') 15 | os.environ['AWS_SECRET_ACCESS_KEY'] = config.get('CREDENTIALS', 'AWS_SECRET_ACCESS_KEY') 16 | 17 | 18 | def create_spark_session(): 19 | # initialize (or reuse) a spark session configured with the AWS Hadoop package and return it 20 | spark = SparkSession \ 21 | .builder \ 22 | .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \ 23 | .getOrCreate() 24 | return spark 25 | 26 | 27 | def process_song_data(spark, input_data, output_data): 28 | ''' 29 | Read in song data from S3, transform it into the songs and artists dimension tables, 30 | and write the tables back to S3 as parquet files. 31 | 32 | input: spark session, S3 path of the input JSON data, S3 path for the output tables 33 | ''' 34 | 35 | # get filepath to song data file 36 | # to test with a single file, use: 37 | #song_data = input_data + "song_data/A/B/C/TRABCEI128F424C983.json" 38 | song_data = input_data + "song_data/*/*/*/*.json" 39 | 40 | # read song data file 41 | song_df = spark.read.json(song_data) 42 | print("Song data schema:") 43 | song_df.printSchema() 44 | print("Total records in song data: ") 45 | print(song_df.count()) 46 | 47 | # extract columns to create the songs table (the song files name the title column 'title') 48 | songs_table = song_df.select(['song_id', 'title', 'artist_id', 'year', 'duration']).dropDuplicates() 49 | 50 | # write songs table to parquet files partitioned by year and artist 51 | songs_table.write.partitionBy('year', 'artist_id').parquet(output_data + "/songs_table.parquet") 52 | 53 | # extract columns to create the artists table 54 | artists_table = song_df.select(['artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']) \ 55 | .dropDuplicates() 56 | 57 | # write artists table to parquet files 58 | artists_table.write.parquet(output_data + "/artists_table.parquet") 59 | 60 | 61 | def process_log_data(spark, input_data, output_data): 62 | ''' 63 | Read in log data from S3, transform it into the users and time dimension tables and the 64 | songplays fact table, and write the tables back to S3 as parquet files. 65 | ''' 66 | 67 | # get filepath to log data file 68 | # to test with a single file, use: 69 | #log_data = input_data + "log_data/2018/11/2018-11-12-events.json" 70 | log_data = input_data + "log_data/*/*/*.json" 71 | 72 | # read log data file 73 | log_df = spark.read.json(log_data) 74 | 75 | # filter by actions for song plays 76 | log_df = log_df.filter(log_df.page == 'NextSong') 77 | 78 | # extract columns for the users table (the log files use camelCase column names) 79 | users_table = log_df.select(col('userId').alias('user_id'), col('firstName').alias('first_name'), 80 | col('lastName').alias('last_name'), 'gender', 'level').dropDuplicates() 81 | 82 | # write users table to parquet files 83 | users_table.write.parquet(output_data + "/users_table.parquet") 84 | 85 | # create a timestamp column from the original epoch-millisecond ts column 86 | log_df = log_df.withColumn('start_time', (col('ts') / 1000).cast('timestamp')) 87 | 88 | # break the timestamp down into the units needed for the time table 89 | log_df = log_df.withColumn('hour', hour('start_time')) \ 90 | .withColumn('day', dayofmonth('start_time')) \ 91 | .withColumn('week', weekofyear('start_time')) \ 92 | .withColumn('month', month('start_time')) \ 93 | .withColumn('year', year('start_time')) \ 94 | .withColumn('weekday', dayofweek('start_time')) 95 | log_df.printSchema() 96 | log_df.head(1) 97 | 98 | # extract columns to create the time table 99 | time_table = log_df.select(['start_time', 'hour', 'day', 'week', 'month', 'year', 'weekday']).dropDuplicates() 100 | 101 | # write time table to parquet files partitioned by year and month 102 | time_table.write.partitionBy('year', 'month').parquet(output_data + "/time_table.parquet") 103 | 104 | # read in song data again to build the songplays table 105 | #song_data = input_data + "song_data/A/B/C/TRABCEI128F424C983.json" 106 | song_data = input_data + "song_data/*/*/*/*.json" 107 | song_df = spark.read.json(song_data) 108 | # keep only the song columns needed for the join, to avoid ambiguous column names 109 | song_fields = song_df.select(['song_id', 'title', 'artist_id', 'artist_name', 'duration']) 110 | 111 | # join the log and song datasets on song title, artist name and duration to create the songplays table 112 | songplays_table = log_df.join(song_fields, (log_df.song == song_fields.title) & 113 | (log_df.artist == song_fields.artist_name) & 114 | (log_df.length == song_fields.duration), 'left') 115 | songplays_table = songplays_table.select('start_time', col('userId').alias('user_id'), 'level', 'song_id', 116 | 'artist_id', col('sessionId').alias('session_id'), 'location', 117 | col('userAgent').alias('user_agent'), 'year', 'month').dropDuplicates() 118 | songplays_table = songplays_table.withColumn('songplay_id', monotonically_increasing_id()) 119 | songplays_table.printSchema() 120 | songplays_table.head(1) 121 | 122 | # write songplays table to parquet files partitioned by year and month 123 | songplays_table.write.partitionBy('year', 'month').parquet(output_data + "/songplays_table.parquet") 124 | 125 | 126 | def main(): 127 | spark = create_spark_session() 128 | input_data = "s3a://udacity-dend/" 129 | output_data = "s3a://udacity-dend/" 130 | # to write the output to a different bucket, create that bucket first, e.g. "s3a://project4-output/" 131 | 132 | process_song_data(spark, input_data, output_data) 133 | process_log_data(spark, input_data, output_data) 134 | 135 | 136 | if __name__ == "__main__": 137 | main() 138 |
-------------------------------------------------------------------------------- /Project_4_Data_Lake_with_Spark/star_schema.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_4_Data_Lake_with_Spark/star_schema.PNG
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE public.artists ( 2 | artistid varchar(256) NOT NULL, 3 | name varchar(256), 4 | location varchar(256), 5 | lattitude numeric(18,0), 6 | longitude numeric(18,0) 7 | ); 8 | 9 | CREATE TABLE public.songplays ( 10 | playid varchar(32) NOT NULL, 11 | start_time timestamp NOT NULL, 12 | userid int4 NOT NULL, 13 | "level" varchar(256), 14 | songid varchar(256), 15 | artistid varchar(256), 16 | sessionid int4, 17 | location varchar(256), 18 | user_agent varchar(256), 19 | CONSTRAINT songplays_pkey PRIMARY KEY (playid) 20 | ); 21 | 22 | CREATE TABLE public.songs ( 23 | songid varchar(256) NOT NULL, 24 | title varchar(256), 25 | artistid varchar(256), 26 | "year" int4, 27 | duration numeric(18,0), 28 | CONSTRAINT songs_pkey PRIMARY KEY (songid) 29 | ); 30 | 31 | CREATE TABLE public.staging_events ( 32 | artist varchar(256), 33 | auth varchar(256), 34 | firstname varchar(256), 35 |
gender varchar(256), 36 | iteminsession int4, 37 | lastname varchar(256), 38 | length numeric(18,0), 39 | "level" varchar(256), 40 | location varchar(256), 41 | "method" varchar(256), 42 | page varchar(256), 43 | registration numeric(18,0), 44 | sessionid int4, 45 | song varchar(256), 46 | status int4, 47 | ts int8, 48 | useragent varchar(256), 49 | userid int4 50 | ); 51 | 52 | CREATE TABLE public.staging_songs ( 53 | num_songs int4, 54 | artist_id varchar(256), 55 | artist_name varchar(256), 56 | artist_latitude numeric(18,0), 57 | artist_longitude numeric(18,0), 58 | artist_location varchar(256), 59 | song_id varchar(256), 60 | title varchar(256), 61 | duration numeric(18,0), 62 | "year" int4 63 | ); 64 | 65 | -- time dimension table used by the Load_time_dim_table task 66 | CREATE TABLE public.times ( 67 | start_time timestamp NOT NULL, 68 | "hour" int4, 69 | "day" int4, 70 | week int4, 71 | "month" int4, 72 | "year" int4, 73 | weekday int4 74 | ); 75 | 76 | CREATE TABLE public.users ( 77 | userid int4 NOT NULL, 78 | first_name varchar(256), 79 | last_name varchar(256), 80 | gender varchar(256), 81 | "level" varchar(256), 82 | CONSTRAINT users_pkey PRIMARY KEY (userid) 83 | ); 84 | 85 | 86 | 87 | 88 | 89 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/dags/udac_example_dag.py: -------------------------------------------------------------------------------- 1 | """ 2 | This DAG does the following: 3 | # copy data from AWS S3 to the AWS Redshift staging tables: staging_events and staging_songs 4 | # load data from the staging tables into the star schema tables (fact table and dimension tables) 5 | # run a data quality check to confirm that data has been populated into the star schema 6 | 7 | # in case of failure, each task is retried 3 times, with a 5-minute delay 8 | """ 9 | 10 | from datetime import datetime, timedelta 11 | import os 12 | from airflow import DAG 13 | from airflow.operators.dummy_operator import DummyOperator 14 | from airflow.operators import (StageToRedshiftOperator, LoadFactOperator, 15 | LoadDimensionOperator, DataQualityOperator) 16 | 17 | from helpers import SqlQueries 18 | 19 | # AWS_KEY = os.environ.get('AWS_KEY') 20 | # AWS_SECRET = os.environ.get('AWS_SECRET') 21 | 22 | # these are set based on the project guidelines 23 | default_args = { 24 | 'owner' : 'udacity', 25 | 'depends_on_past' : False, 26 | 'retries' : 3, 27 | 'retry_delay': timedelta(minutes = 5), 28 | 'start_date': datetime(2019, 8, 4), 29 | 'end_date' : datetime(2019, 8, 15), 30 | 'email_on_retry' : False 31 | } 32 | 33 | dag = DAG('udac_example_dag', 34 | default_args=default_args, 35 | description='Load and transform data in Redshift with Airflow', 36 | schedule_interval='@hourly', # given how the data is laid out, a daily schedule would be a better choice 37 | catchup=False # catchup is a DAG argument; 'catchup_by_default' is an airflow.cfg setting, not a default_arg 38 | ) 39 | 40 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag) 41 | 42 | 43 | stage_events_to_redshift =
StageToRedshiftOperator( 44 | task_id='Stage_events', 45 | dag=dag, 46 | target_table = 'staging_events', 47 | redshift_conn_id = "redshift", 48 | aws_credentials_id = "aws_credentials", 49 | s3_bucket = "udacity-dend", 50 | s3_key = "log_data/2018/11/2018-11-12-events.json", 51 | json_path = "s3://udacity-dend/log_json_path.json", 52 | file_type="json" 53 | ) 54 | 55 | stage_songs_to_redshift = StageToRedshiftOperator( 56 | task_id='Stage_songs', 57 | dag=dag, 58 | target_table = 'staging_songs', 59 | redshift_conn_id = "redshift", 60 | aws_credentials_id = "aws_credentials", 61 | s3_bucket = "udacity-dend", 62 | s3_key = "song_data/A/B/C/TRABCEI128F424C983.json", 63 | json_path = "auto", 64 | file_type="json" 65 | ) 66 | 67 | load_songplays_table = LoadFactOperator( 68 | task_id='Load_songplays_fact_table', 69 | dag=dag, 70 | target_table = 'songplays', 71 | redshift_conn_id = "redshift", 72 | query = SqlQueries.songplay_table_insert 73 | ) 74 | 75 | load_user_dimension_table = LoadDimensionOperator( 76 | task_id='Load_user_dim_table', 77 | dag=dag, 78 | target_table = 'users', 79 | redshift_conn_id = "redshift", 80 | query = SqlQueries.user_table_insert 81 | ) 82 | 83 | load_song_dimension_table = LoadDimensionOperator( 84 | task_id='Load_song_dim_table', 85 | dag=dag, 86 | target_table = 'songs', 87 | redshift_conn_id = "redshift", 88 | query = SqlQueries.song_table_insert 89 | ) 90 | 91 | load_artist_dimension_table = LoadDimensionOperator( 92 | task_id='Load_artist_dim_table', 93 | dag=dag, 94 | target_table = 'artists', 95 | redshift_conn_id = "redshift", 96 | query = SqlQueries.artist_table_insert 97 | ) 98 | 99 | load_time_dimension_table = LoadDimensionOperator( 100 | task_id='Load_time_dim_table', 101 | dag=dag, 102 | target_table = 'times', 103 | redshift_conn_id = "redshift", 104 | query = SqlQueries.time_table_insert 105 | ) 106 | 107 | run_quality_checks = DataQualityOperator( 108 | task_id='Run_data_quality_checks', 109 | dag=dag, 110 | tables = ['songplays', 'users', 'songs', 'artists', 'times'], 111 | redshift_conn_id = "redshift" 112 | ) 113 | 114 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 115 | 116 | # set up task dependencies 117 | # use the task variables, not the task_id strings 118 | start_operator >> stage_events_to_redshift 119 | start_operator >> stage_songs_to_redshift 120 | 121 | stage_events_to_redshift >> load_songplays_table 122 | stage_songs_to_redshift >> load_songplays_table 123 | 124 | load_songplays_table >> load_song_dimension_table 125 | load_song_dimension_table >> run_quality_checks 126 | 127 | load_songplays_table >> load_user_dimension_table 128 | load_user_dimension_table >> run_quality_checks 129 | 130 | load_songplays_table >> load_artist_dimension_table 131 | load_artist_dimension_table >> run_quality_checks 132 | 133 | load_songplays_table >> load_time_dimension_table 134 | load_time_dimension_table >> run_quality_checks 135 | 136 | run_quality_checks >> end_operator 137 | 138 | 139 | 140 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13
| operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | ''' 3 | This class is to set up the tables for inserting later 4 | ''' 5 | songplay_table_insert = (""" 6 | SELECT 7 | md5(events.sessionid || events.start_time) songplay_id, 8 | events.start_time, 9 | events.userid, 10 | events.level, 11 | songs.song_id, 12 | songs.artist_id, 13 | events.sessionid, 14 | events.location, 15 | events.useragent 16 | FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, * 17 | FROM staging_events 18 | WHERE page='NextSong') events 19 | LEFT JOIN staging_songs songs 20 | ON events.song = songs.title 21 | AND events.artist = songs.artist_name 22 | AND events.length = songs.duration 23 | """) 24 | 25 | user_table_insert = (""" 26 | SELECT distinct userid, firstname, lastname, gender, level 27 | FROM staging_events 28 | WHERE page='NextSong' 29 | """) 30 | 31 | song_table_insert = (""" 32 | SELECT distinct song_id, title, artist_id, year, duration 33 | FROM staging_songs 34 | """) 35 | 36 | artist_table_insert = (""" 37 | SELECT distinct artist_id, artist_name, artist_location, artist_latitude, artist_longitude 38 | FROM staging_songs 39 | """) 40 | 41 | time_table_insert = (""" 42 | SELECT start_time, extract(hour from start_time), extract(day from start_time), extract(week from start_time), extract(month from start_time), extract(year from start_time), extract(dayofweek from start_time) 43 | FROM 
songplays 44 | """) -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.stage_redshift import StageToRedshiftOperator 2 | from operators.load_fact import LoadFactOperator 3 | from operators.load_dimension import LoadDimensionOperator 4 | from operators.data_quality import DataQualityOperator 5 | 6 | __all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CICIFLY/Data_Engineering_Project_Portfolio/c6ed6f46b2b5b3c06c1b6c7f17f6229a6d02e7ec/Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/data_quality.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class DataQualityOperator(BaseOperator): 6 | """ 7 | This airflow 
operator checks the number of records in each table 8 | """ 9 | 10 | ui_color = '#89DA59' 11 | 12 | @apply_defaults 13 | def __init__(self, 14 | # Define your operators params (with defaults) here 15 | redshift_conn_id = "", 16 | tables = [], 17 | *args, **kwargs): 18 | 19 | super(DataQualityOperator, self).__init__(*args, **kwargs) 20 | # Map params here 21 | self.redshift_conn_id = redshift_conn_id 22 | self.tables = tables 23 | 24 | def execute(self, context): 25 | redshift_hook = PostgresHook( self.redshift_conn_id ) 26 | for table in self.tables: 27 | self.log.info(f"Data quality check for {table} table") 28 | records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}") 29 | if len(records) < 1 or len(records[0]) < 1: 30 | raise ValueError(f"Data quality check failed. {table} returned no results") 31 | 32 | num_records = records[0][0] 33 | if num_records < 1: 34 | raise ValueError(f"Data quality check failed. {table} contains zero records") 35 | self.log.info(f"Data quality check passed: {table} loaded with {num_records} records") 36 | 37 | 38 | 39 | 40 | 41 | 42 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/load_dimension.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadDimensionOperator(BaseOperator): 6 | """ 7 | This airflow operator class loads data from the AWS Redshift staging tables into the dimension tables 8 | 9 | keyword arguments: 10 | redshift_conn_id 11 | target_table 12 | query : query name from SqlQueries 13 | 14 | output: populate data into dimension tables 15 | """ 16 | 17 | ui_color = '#80BD9E' 18 | insert_sql = """ 19 | INSERT INTO {} 20 | {}; 21 | COMMIT; 22 | """ 23 | 24 | @apply_defaults 25 | def __init__(self, 26 | # Define your operators params (with defaults) here 27 | redshift_conn_id = "", 28 | target_table = "", 29 | query = "", 30 | delete_load = False, 31 | *args, **kwargs): 32 | 33 | super(LoadDimensionOperator, self).__init__(*args, **kwargs) 34 | # Map params here 35 | self.redshift_conn_id = redshift_conn_id 36 | self.target_table = target_table 37 | self.query = query 38 | self.delete_load = delete_load 39 | 40 | def execute(self, context): 41 | redshift = PostgresHook(postgres_conn_id = self.redshift_conn_id) 42 | if self.delete_load: 43 | self.log.info('Clear data from the dimension table') 44 | redshift.run("DELETE FROM {}".format(self.target_table)) 45 | 46 | self.log.info('Load data from the staging tables into the dimension table...') 47 | formatted_sql = LoadDimensionOperator.insert_sql.format(self.target_table, 48 | self.query) 49 | redshift.run(formatted_sql)
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/load_fact.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | 5 | class LoadFactOperator(BaseOperator): 6 | """ 7 | This airflow operator class loads data from the AWS Redshift staging tables into the fact table 8 | 9 | keyword arguments: 10 | redshift_conn_id 11 | target_table 12 | query : query name from SqlQueries 13 | 14 | output: populate data into fact table 15 | """ 16 | ui_color =
'#F98866' 17 | insert_sql = """ 18 | INSERT INTO {} 19 | {}; 20 | COMMIT; 21 | """ 22 | 23 | @apply_defaults 24 | # Define operators params (with defaults) here 25 | def __init__(self, 26 | redshift_conn_id = "", 27 | target_table = "", 28 | query = "", 29 | delete_load = False, 30 | *args, **kwargs): 31 | 32 | super(LoadFactOperator, self).__init__(*args, **kwargs) 33 | # Map params here 34 | self.redshift_conn_id = redshift_conn_id 35 | self.target_table = target_table 36 | self.query = query 37 | self.delete_load = delete_load 38 | 39 | def execute(self, context): 40 | redshift = PostgresHook(postgres_conn_id = self.redshift_conn_id) 41 | if self.delete_load: 42 | self.log.info('Clear data from the fact table') 43 | redshift.run("DELETE FROM {}".format(self.target_table)) 44 | 45 | self.log.info('Load data from the staging tables into the fact table...') 46 | formatted_sql = LoadFactOperator.insert_sql.format(self.target_table, 47 | self.query) 48 | redshift.run(formatted_sql) 49 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/airflow/plugins/operators/stage_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.contrib.hooks.aws_hook import AwsHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | class StageToRedshiftOperator(BaseOperator): 7 | """ 8 | This airflow operator class loads data from an AWS S3 bucket into the Redshift staging tables 9 | 10 | keyword arguments: 11 | redshift_conn_id, 12 | aws_credentials_id , 13 | target_table , 14 | s3_bucket , 15 | s3_key , 16 | json_path , 17 | file_type , 18 | delimiter , 19 | ignore_headers 20 | 21 | output: populate data into staging tables 22 | """ 23 | ui_color = '#358140' 24 | 25 | # the copy sql statement is different for csv files and json files 26 | copy_sql_json = """ 27 | COPY {} 28 | FROM '{}' 29 | ACCESS_KEY_ID '{}' 30 | SECRET_ACCESS_KEY '{}' 31 | JSON '{}' 32 | COMPUPDATE OFF 33 | """ 34 | 35 | copy_sql_csv = """ 36 | COPY {} 37 | FROM '{}' 38 | ACCESS_KEY_ID '{}' 39 | SECRET_ACCESS_KEY '{}' 40 | IGNOREHEADER {} 41 | DELIMITER '{}' 42 | 43 | """ 44 | 45 | @apply_defaults 46 | def __init__(self, 47 | # Define your operators params (with defaults) here 48 | redshift_conn_id = "", 49 | aws_credentials_id = "", 50 | target_table = "", 51 | s3_bucket = "", 52 | s3_key = "", 53 | json_path = "" , 54 | file_type = "" , 55 | delimiter = "," , 56 | ignore_headers = 1, 57 | *args, **kwargs): 58 | 59 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 60 | # Map params here 61 | self.redshift_conn_id = redshift_conn_id 62 | self.aws_credentials_id = aws_credentials_id 63 | self.target_table = target_table 64 | self.s3_bucket = s3_bucket 65 | self.s3_key = s3_key 66 | self.json_path = json_path 67 | self.file_type = file_type 68 | self.delimiter = delimiter 69 | self.ignore_headers = ignore_headers 70 | 71 | def execute(self, context): 72 | aws_hook = AwsHook( self.aws_credentials_id ) 73 | credentials = aws_hook.get_credentials() 74 | redshift = PostgresHook( postgres_conn_id = self.redshift_conn_id) 75 | 76 | self.log.info('copy data from s3 to the redshift staging table') 77 | rendered_key = self.s3_key.format(**context) 78 | s3_path = "s3://{}/{}".format( self.s3_bucket, rendered_key) 79 | 80 | if self.file_type == "json": 81 | formatted_sql = StageToRedshiftOperator.copy_sql_json.format( 82 | self.target_table, 83 | s3_path, 84 |
credentials.access_key, 85 | credentials.secret_key, 86 | self.json_path 87 | ) 88 | redshift.run(formatted_sql) 89 | 90 | if self.file_type == "csv": 91 | formatted_sql = StageToRedshiftOperator.copy_sql_csv.format( 92 | self.target_table, 93 | s3_path, 94 | credentials.access_key, 95 | credentials.secret_key, 96 | self.ignore_headers, 97 | self.delimiter 98 | 99 | ) 100 | redshift.run(formatted_sql) 101 | 102 | 103 | 104 | 105 |
-------------------------------------------------------------------------------- /Project_5_Data_Pipeline_with_Airflow/readme.md: -------------------------------------------------------------------------------- 1 | ### Introduction: 2 | A music streaming company, Sparkify, wants to introduce more automation and monitoring to their data warehouse ETL pipelines. 3 | Apache Airflow is the tool they decided to use. As a data engineer brought into this project, I need to create custom operators 4 | to perform tasks such as staging the data, filling the data warehouse and running checks on the data as the final step. 5 | 6 | 7 | ### Data sets: 8 | The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. 9 | The source datasets consist of JSON logs that describe user activity in the application and 10 | JSON metadata about the songs the users listen to. 11 | (1). Link to Song data: s3://udacity-dend/song_data 12 | The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata 13 | about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 14 | 15 | For example, here is the filepath to one file in this dataset: 16 | song_data/A/B/C/TRABCEI128F424C983.json 17 | 18 | Single song file example : 19 | {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", 20 | "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0} 21 | 22 | (2). Link to Log data: s3://udacity-dend/log_data 23 | It consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These 24 | simulate app activity logs from an imaginary music streaming app based on configuration settings. 25 | The log files are partitioned by year and month. For example, here are the filepaths to two files in this dataset: 26 | 27 | log_data/2018/11/2018-11-12-events.json 28 | log_data/2018/11/2018-11-13-events.json 29 | 30 | 31 | ### Star Schema for Song Play Analysis (a star schema picture can be found in this project) 32 | (1). Fact Table 33 | songplays (records in event data associated with song plays i.e. records with page NextSong): songplay_id, start_time, 34 | user_id, level, song_id, artist_id, session_id, location, user_agent 35 | 36 | (2). Dimension Tables 37 | users (users in the app): user_id, first_name, last_name, gender, level 38 | songs (songs in music database): song_id, title, artist_id, year, duration 39 | artists (artists in music database): artist_id, name, location, lattitude, longitude 40 | time (timestamps of records in songplays broken down into specific units): start_time, hour, day, week, month, year, weekday 41 | 42 | 43 | ### Files included in the Repository 44 | 2 folders and 1 file: the dags and plugins folders and the create_tables.sql file 45 | 46 | (1).
The DAG template: 47 | It has all the imports and task templates in place, but the task dependencies have not been set. 48 | 49 | (2). Plugins folder: 50 | The operators folder with operator templates: 51 | data_quality.py 52 | load_dimension.py 53 | load_fact.py 54 | stage_redshift.py 55 | A helper class for the SQL transformations: sql_queries.py 56 | (3). create_tables.sql: this file should be converted to a python file with the create- and drop-table sql statements. 57 | Please refer to the earlier projects. 58 | 59 | 60 | ### Prerequisites 61 | An AWS Redshift cluster must be running successfully. 62 | Read and write access to the S3 and AWS Redshift services. 63 | Tables must be created in Redshift before executing the DAG workflow. The create table statements can be found in create_tables.sql (refer to project 3). 64 | 65 | ### Steps to run the project 66 | (1). Launching a Redshift cluster 67 | 68 | (2). Creating tables 69 | method 1: create the tables by using the query editor and the statements from create_tables.sql 70 | method 2: refer to project 3, using the two python files 71 | 72 | (3). Adding Airflow connections 73 | On the create variable page, enter the following values: 74 | Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend" 75 | Set "Key" equal to "s3_prefix" and set "Val" equal to "data-pipelines" 76 | 77 | On the create connection page, enter the following values: 78 | Conn Id: Enter aws_credentials. 79 | Conn Type: Enter Amazon Web Services. 80 | Login: Enter your Access key ID from the IAM User credentials you downloaded earlier. 81 | Password: Enter your Secret access key from the IAM User credentials you downloaded earlier. 82 | 83 | Once you've entered these values, select Save and Add Another. 84 | On the next create connection page, enter the following values: 85 | Conn Id: Enter redshift. 86 | Conn Type: Enter Postgres. 87 | Host: Enter the endpoint of your Redshift cluster, excluding the port at the end. 88 | Schema: Enter dev. This is the Redshift database you want to connect to. 89 | Login: Enter awsuser. 90 | Password: Enter the password you created when launching your Redshift cluster. 91 | Port: Enter 5439. 92 | Once you've entered these values, select Save. 93 | 94 | Every time you open the Udacity workspace to use the Airflow UI, you have to re-create the connections and variables. 95 | 96 | (4). Running the template DAG and comparing it with the graph provided in the project instructions. 97 | Check the logs to see if they contain "operator is not implemented" messages. 98 | 99 | (5). Configuring the DAG 100 | In the DAG, add default parameters according to these guidelines: 101 | The DAG does not have dependencies on past runs 102 | On failure, the tasks are retried 3 times 103 | Retries happen every 5 minutes 104 | Catchup is turned off 105 | Do not email on retry 106 | In the DAG, configure the task dependencies and compare the graph view with what is provided in the 107 | project instructions. 108 | 109 | (6). Building the operators 110 | (The separate, functional operators for the dimension, fact and quality-control steps keep the SQL statements dynamic.) 111 | 112 | Stage Operator 113 | The stage operator is expected to be able to load any JSON formatted files from S3 to Amazon Redshift. The operator creates and 114 | runs a SQL COPY statement based on the parameters provided. The operator's parameters should specify where in S3 the file is 115 | located and what the target table is. 116 | The parameters should be used to distinguish between JSON files. Another important requirement of the stage operator is 117 | a templated field that allows it to load timestamped files from S3 based on the execution time and run backfills, as sketched below. 118 |
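As a rough illustration of that templated field (this instantiation is only an assumed example, not part of the original template; the key layout simply mirrors how the log files above are partitioned), the operator's s3_key is rendered with .format(**context) at run time, so a run-date-dependent key can be passed straight from the DAG:

    stage_events_to_redshift = StageToRedshiftOperator(
        task_id='Stage_events',
        dag=dag,
        target_table='staging_events',
        redshift_conn_id='redshift',
        aws_credentials_id='aws_credentials',
        s3_bucket='udacity-dend',
        # rendered per run, e.g. log_data/2018/11/2018-11-12-events.json
        s3_key='log_data/{execution_date.year}/{execution_date.month:02d}/{ds}-events.json',
        json_path='s3://udacity-dend/log_json_path.json',
        file_type='json'
    )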
119 | Fact and Dimension Operators 120 | With the dimension and fact operators, you can utilize the provided SQL helper class to run data transformations. Most of the logic 121 | is within the SQL transformations, and the operator is expected to take as input a SQL statement and the target database on which 122 | to run the query. You can also define a target table that will contain the results of the transformation. 123 | Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load. Thus, you 124 | could also have a parameter that allows switching between insert modes when loading dimensions. Fact tables are usually so 125 | massive that they should only allow append-type functionality. 126 | 127 | Data Quality Operator 128 | The final operator to create is the data quality operator, which is used to run checks on the data itself. The operator's 129 | main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. 130 | For each test, the test result and the expected result need to be compared, and if they do not match, the operator should 131 | raise an exception and the task should retry and eventually fail. 132 | 133 | For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows 134 | that have NULL in that column. We do not want any NULLs, so the expected result would be 0 and the test would compare the 135 | SQL statement's outcome to the expected result, as in the sketch below. 136 |
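A minimal sketch of how such test cases might be passed in (the dq_checks parameter and this loop are an assumed extension; the DataQualityOperator shipped in this repository only performs record counts):

    # assumed extension: each check pairs a SQL test with its expected result
    dq_checks = [
        {'check_sql': "SELECT COUNT(*) FROM users WHERE userid IS NULL", 'expected_result': 0},
        {'check_sql': "SELECT COUNT(*) FROM songs WHERE songid IS NULL", 'expected_result': 0}
    ]

    # inside execute(), run each check and compare the outcome with the expected value
    for check in dq_checks:
        records = redshift_hook.get_records(check['check_sql'])
        if records[0][0] != check['expected_result']:
            raise ValueError("Data quality check failed: {}".format(check['check_sql']))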
137 | (7). Running the create_tables.py script to create the tables in AWS Redshift 138 | 139 | 140 | ### Notes about Workspace 141 | After you have updated the DAG, you will need to run the /opt/airflow/start.sh command to start the Airflow web server. Once the Airflow web server is ready, you can access the Airflow UI by clicking on the blue Access Airflow button. 142 | 143 | A quality-control step must be implemented to make sure the workflow works properly, by counting the number of records in each table. 144 | 145 | Command to kill the running DAG or reset Airflow: run "pkill -f airflow" (this one did not work for me; I had to save all files and reset the data in the workspace to refresh it). 146 | 147 | To extract compressed files: 148 | tar xvzf file_name.tar.gz (by default, it will be extracted into the current working directory) 149 | 150 | ### References : 151 | (1). https://github.com/gfkw/dend-project-5 152 | 153 | (2). https://github.com/jukkakansanaho/udacity-dend-project-5 154 | 155 | (3). https://github.com/FedericoSerini/DEND-Project-5-Data-Pipelines 156 |
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Data_Engineering_Project_Portfolio 2 | 3 | Link to my certificate : https://graduation.udacity.com/nd027 4 | 5 | ## Introduction 6 | The projects in the Data Engineer Nanodegree program were designed in collaboration with a group of highly talented industry professionals. The projects put me in the role of a Data Engineer at a fictional data streaming company called “Sparkify” as it scaled its data engineering in both size and sophistication. I worked with simulated data of listening behavior, as well as a wealth of metadata related to songs and artists. At first, I started working with a small amount of data, with low complexity, processed and stored on a single machine. By the end, I developed a sophisticated set of data pipelines to work with massive amounts of data processed and stored on the cloud. There are five projects in the program. Below is a description of each. 7 | 8 | ## Udacity Data Engineering Course Content 9 | ### Course 1: Data Modeling with Postgres and Cassandra 10 | Week # Lesson 11 | * 1 Introduction to Data Modeling 12 | * 2 Relational Data Modeling with Postgres 13 | * 2 Project: Data Modeling with Postgres 14 | * 3 Non-Relational Modeling with Cassandra 15 | * 3 Project: Data Modeling with Cassandra 16 | 17 | ### Course 2: Cloud Data Warehouses 18 | Week # Lesson 19 | * 4 Introduction to the Data Warehouses 20 | * 5 Introduction to the Cloud and AWS 21 | * 6 Implementing Data Warehouses on AWS 22 | * 7 Project: Create a Cloud Data Warehouse 23 | 24 | ### Course 3: Data Lakes with S3 and Spark 25 | Week # Lesson 26 | * 8 The Power of Spark 27 | * 9 Data Wrangling with Spark 28 | * 10 Debugging and Optimizing 29 | * 11 Introduction to Data Lakes 30 | * 12 Project: Create a Data Lake 31 | 32 | ### Course 4: Data Pipelines with Airflow 33 | Week # Lesson 34 | * 13 Data Pipelines 35 | * 14 Data Quality 36 | * 15 Production Data Pipelines 37 | * 16 Project: Create Data Pipelines 38 | 39 | ### Capstone Project 40 | Week # Lesson 41 | * 17 Project: Capstone 42 | 43 | ## Projects 44 | 45 | ### Project 1 - Data Modeling 46 | In this project, you’ll model user activity data for a music streaming app called Sparkify. The project is done in two parts. You’ll create a database and import data stored in CSV and JSON files, and model the data. You’ll do this first with a relational model in Postgres, then with a NoSQL data model with Apache Cassandra. You’ll design the data models to optimize queries for understanding what songs users are listening to. For PostgreSQL, you will also define Fact and Dimension tables and insert data into your new tables. For Apache Cassandra, you will model your data to help the data team at Sparkify answer queries about app usage. You will set up your Apache Cassandra database tables in ways to optimize writes of transactional data on user sessions. 47 | 48 | ### Project 2 - Cloud Data Warehousing 49 | In this project, you’ll move to the cloud as you work with larger amounts of data. You are tasked with building an ELT pipeline that extracts Sparkify’s data from S3, Amazon’s popular storage system. From there, you’ll stage the data in Amazon Redshift and transform it into a set of fact and dimensional tables for the Sparkify analytics team to continue finding insights in what songs their users are listening to. 50 | 51 | ### Project 3 - Data Lakes with Apache Spark 52 | In this project, you'll build an ETL pipeline for a data lake. The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app. You will load data from S3, process the data into analytics tables using Spark, and load them back into S3. You'll deploy this Spark process on a cluster using AWS. 53 | 54 | ### Project 4 - Data Pipelines with Apache Airflow 55 | In this project, you’ll continue your work on Sparkify’s data infrastructure by creating and automating a set of data pipelines.
You’ll use the up-and-coming tool Apache Airflow, developed and open-sourced by Airbnb and the Apache Foundation. You’ll configure and schedule data pipelines with Airflow, setting dependencies, triggers, and quality checks as you would in a production setting. 56 | 57 | ### Project 5 - Data Engineering Capstone 58 | The capstone project is an opportunity for you to combine what you've learned throughout the program into a more self-driven project. In this project, you'll define the scope of the project and the data you'll be working with. We'll provide guidelines, suggestions, tips, and resources to help you be successful, but your project will be unique to you. You'll gather data from several different data sources; transform, combine, and summarize it; and create a clean database for others to analyze. 59 | --------------------------------------------------------------------------------