├── .env ├── .gitignore ├── data_pipeline.jpg ├── etl_process.py ├── readme.md ├── serverless.yml ├── snowflake.sql └── table_design.ipynb /.env: -------------------------------------------------------------------------------- 1 | # This will be automatically added to the Lambda runtime if you list it under custom.dotenv.include in serverless.yml 2 | TZ_LOCAL=Pacific/Auckland -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .serverless/ 2 | .DS_Store 3 | FullScript.sql 4 | FullScript.py 5 | FullScript.ipynb 6 | data.ctl 7 | note.txt 8 | rawfile.csv -------------------------------------------------------------------------------- /data_pipeline.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AWS-Serverless-Projects/Snowflake-Data-Cloud-ETL-Integration-with-Serverless/92b890d0a93cd226b8059d0e4ec3130c649746fb/data_pipeline.jpg -------------------------------------------------------------------------------- /etl_process.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import boto3 3 | import io 4 | from io import StringIO 5 | 6 | def lambda_handler(event, context): 7 | s3_file_key = event['Records'][0]['s3']['object']['key'] 8 | bucket = 'src-bucket-datapipeline' 9 | s3 = boto3.client('s3') 10 | obj = s3.get_object(Bucket=bucket, Key=s3_file_key) 11 | s3_resource = boto3.resource('s3') 12 | df = pd.read_csv(io.BytesIO(obj['Body'].read())) 13 | 14 | bucket='dst-bucket-snowpipeline' 15 | 16 | # ----------------------ETL PROCESS---------------------- 17 | pd.options.mode.chained_assignment = None # Disable SettingWithCopyWarning 18 | df1 = df[df['events'].str.contains('206', na=False)] # Filter rows to the required event value 19 | 20 | # Convert DateTime for DimDate Table and Fact Table 21 | df1['DATETIME_SKEY'] = pd.to_datetime(df1['DateTime']).dt.tz_convert(None).dt.strftime('%Y-%m-%d %H:%M') 22 | 23 | # Split by '|' for tables DimVideo, DimPlatform, and DimSite 24 | df1[['0','1','2','3','4']] = df1['VideoTitle'].str.split("|",expand=True,n=4).reindex(range(5), axis=1) 25 | 26 | # note: column '0' contains one of ['news', 'App Web', 'App Android', 'App iPhone', 'App iPad'] 27 | # Build DimVideo Table 28 | df1['Video_Title'] = df1.iloc[:, 1:].ffill(axis=1).iloc[:, -1] # Create Video_Title column 29 | df1['Video_SKEY'] = df1.groupby(['Video_Title']).ngroup() # create surrogate key 30 | dfvideo= df1.drop_duplicates(subset = ["Video_Title","Video_SKEY"]) 31 | 32 | # Build DimPlatform Table 33 | df1.loc[df1['0'].str.contains('Android'), 'Platform_Type'] = 'Platform' 34 | df1.loc[df1['0'].str.contains('iPad'), 'Platform_Type'] = 'Platform' 35 | df1.loc[df1['0'].str.contains('iPhone'), 'Platform_Type'] = 'Platform' 36 | df1.loc[df1['0'].str.contains('Web'), 'Platform_Type'] = 'Desktop' 37 | df1.loc[df1['0'].str.contains('news'), 'Platform_Type'] = 'Desktop' 38 | df1['Platform_SKEY'] = df1.groupby(['Platform_Type']).ngroup() # create surrogate key 39 | dfplatform= df1.drop_duplicates(subset = ["Platform_Type","Platform_SKEY"]) 40 | 41 | # Build DimSite Table 42 | df1.loc[df1['0'].str.contains('news'), 'Site'] = 'news' # Create Column 'Site' by news 43 | df1.loc[df1['0'].str.contains('Web'), 'Site'] = 'App Web' # Create Column 'Site' by web 44 | df1.loc[df1['Site'].isnull(), 'Site'] = 'Not Applicable' # fill NaN value 45 | df1['Site_SKEY'] = df1.groupby(['Site']).ngroup() # create surrogate key 46 
| dfsite= df1.drop_duplicates(subset = ["Site","Site_SKEY"]) 47 | 48 | # Export target tables to S3 Dst Bucket 49 | s3_resource.Object(bucket, 'dimdate/dimdate.csv').put(Body=df1[['DATETIME_SKEY']].to_csv(index=False)) 50 | s3_resource.Object(bucket, 'dimvideo/dimvideo.csv').put(Body=dfvideo[['Video_SKEY','Video_Title']].to_csv(index = False)) 51 | s3_resource.Object(bucket, 'dimplatform/dimplatform.csv').put(Body=dfplatform[['Platform_SKEY','Platform_Type']].to_csv(index = False)) 52 | s3_resource.Object(bucket, 'dimsite/dimsite.csv').put(Body=dfsite[['Site_SKEY','Site']].to_csv(index = False)) 53 | s3_resource.Object(bucket, 'facttable/facttable.csv').put(Body=df1[['DATETIME_SKEY','Platform_SKEY','Site_SKEY','Video_SKEY','events']].to_csv(index = False)) -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Overview of the solution 2 | The project includes a local environment to inspect the data and deploys the stack using AWS S3, AWS Lambda (via the Serverless Framework), AWS SQS, and Snowflake Snowpipe. 3 | 4 | The deployment includes an object-created event that triggers an AWS Lambda function. The function ingests the raw data from an external source, transforms the content, and saves the cleaned output. 5 | 6 | The cleaned data is stored in an S3 destination bucket; SQS event notifications on that bucket tell Snowpipe when new files are ready to load. 7 | 8 | The following diagram illustrates my architecture: 9 | 10 | ![alt text](https://github.com/miaaaalu/AWS-Lambda-to-Snowflake-Data-Cloud-ETL-Integration-with-Serverless/blob/master/data_pipeline.jpg?raw=true) 11 | 12 | The solution includes the following high-level steps: 13 | 14 | 1. Design the star schema for the source file 15 | 2. Configure Snowflake Snowpipe 16 | 3. Build and test the ETL process 17 | 4. Inspect the function and the AWS CloudFormation template, then deploy with the Serverless Framework 18 | 5. Create an event notification for the S3 destination bucket 19 | 20 | # Test Environment 21 | ```powershell 22 | * System: macOS Big Sur Version 11.5.2 23 | 24 | * Programming: Python 3.8, Snowflake SQL 25 | 26 | * Tools: Snowflake, Cloud Service Provider: AWS (S3, Lambda, and SQS), Serverless Framework 27 | ``` 28 | 29 | # Process Workflow 30 | ## Preparation 31 | ### AWS 32 | ```powershell 33 | 1. Create a source bucket in S3 34 | 2. Create a deployment bucket for the Serverless package in S3 35 | 3. Create an IAM user for Snowflake with S3 full-access policies 36 | ``` 37 | ### Python 38 | ```powershell 39 | 3.8 preferred 40 | ``` 41 | ## Step 1 — Table Design 42 | ```powershell 43 | For details, check table_design.ipynb 44 | ``` 45 | 46 | ## Step 2 — Snowflake Snowpipe Configuration 47 | ```powershell 48 | 1. Database Creation 49 | 2. Tables Creation (5 tables in this project) 50 | 3. File Format Creation with AWS Credentials 51 | 4. External Stage Creation 52 | 5. Pipes Creation 53 | 54 | For details, check snowflake.sql 55 | ``` 56 | 57 | ## Step 3 — Serverless Framework Deploy 58 | 59 | *Serverless Framework is open-source software that builds, compiles, and packages code for serverless deployment, and then deploys the package to the cloud. With Python on AWS, for example, Serverless Framework creates the self-contained Python environment, including all dependencies.* 60 | 61 | ### 1. 
Serverless Project Initialization 62 | ```powershell 63 | % serverless create --template aws-python3 --path {project_folder} 64 | ``` 65 | 66 | ### 2. Open the project in VS Code 67 | ```powershell 68 | % open -a "Visual Studio Code" ${project_folder} 69 | ``` 70 | 71 | ### 3. Serverless Plugin Installation 72 | ```powershell 73 | # Install Serverless Prune Plugin 74 | % sudo sls plugin install -n serverless-prune-plugin 75 | 76 | # Install serverless python requirements (https://github.com/UnitedIncome/serverless-python-requirements) 77 | % sudo sls plugin install -n serverless-python-requirements 78 | 79 | # Install serverless dotenv plugin 80 | % sudo npm i -D serverless-dotenv-plugin 81 | ``` 82 | ### 4. Modify the Python File for the ETL Process 83 | ```powershell 84 | # Rename python file 85 | % mv handler.py ${project_handle}.py 86 | 87 | # Handle your Python packaging 88 | By default, the pandas library is not available in the AWS Lambda Python runtime. 89 | To use pandas in the Lambda function, either attach a requirements.txt 90 | OR add a pandas layer to the function in serverless.yml. 91 | 92 | # option 1: attach a requirements.txt with the needed libraries 93 | % touch requirements.txt 94 | % echo "pandas" >> requirements.txt 95 | % pip install -r requirements.txt 96 | 97 | # option 2: add the pandas layer from Klayers in serverless.yml (recommended) 98 | source from Klayers: https://github.com/keithrozario/Klayers/tree/master/deployments/python3.8/arns 99 | 100 | # ETL Process 101 | a. Load the raw file 102 | b. Data cleaning 103 | c. Data washing 104 | d. Data transformation for tables DIMDATE, DIMPLATFORM, DIMSITE, DIMVIDEO, and FACTTABLE 105 | e. Load data into staging folders - DIMDATE, DIMPLATFORM, DIMSITE, DIMVIDEO, and FACTTABLE 106 | 107 | For details, check etl_process.py 108 | ``` 109 | ### 5. Create a .env file and set environment variables if needed 110 | ```env 111 | APPLICATION = ${your project name} 112 | STAGE = ${your stage} 113 | REGION = ${your region} 114 | TZ_LOCAL = ${your timezone} 115 | ``` 116 | ### 6. Modify the serverless.yml file 117 | ```powershell 118 | For details, check serverless.yml 119 | ``` 120 | 121 | ### 7. Deploy 122 | ```powershell 123 | # Deploy to AWS 124 | % sls deploy 125 | ``` 126 | 127 | ## Step 4 — Add Event Notification for S3 Bucket 128 | 129 | ``` 130 | This notification informs Snowpipe via an SQS queue when files are ready to load. 131 | 132 | Note the SQS queue ARN from the notification_channel column after you execute the "show pipes" command. 133 | 134 | Copy the ARN to a notepad. 135 | 136 | Then paste it into the event notification configuration for the destination S3 bucket, as illustrated below.
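
A rough sketch of this step (the queue ARN and the console path are illustrative; use the notification_channel value returned in your own account):

    -- in Snowflake
    show pipes;
    -- notification_channel looks like: arn:aws:sqs:ap-southeast-2:<account_id>:sf-snowpipe-xxxxxxxx

    -- in the AWS console
    S3 > dst-bucket-snowpipeline > Properties > Event notifications > Create event notification
        Event types: All object create events
        Destination: SQS queue -> paste the ARN copied above

Auto-ingest pipes in one Snowflake account usually share the same SQS queue, so a single event notification on the destination bucket should cover all five pipes.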
137 | ``` -------------------------------------------------------------------------------- /serverless.yml: -------------------------------------------------------------------------------- 1 | service: my-etl-pipeline 2 | 3 | plugins: 4 | - serverless-prune-plugin 5 | - serverless-python-requirements 6 | - serverless-dotenv-plugin 7 | 8 | custom: 9 | pythonRequirements: 10 | dockerizePip: false 11 | prune: 12 | automatic: true 13 | number: 3 14 | dotenv: 15 | include: 16 | - TZ_LOCAL 17 | 18 | provider: 19 | name: aws 20 | runtime: python3.8 21 | memorySize: 256 22 | timeout: 300 23 | deploymentBucket: 24 | name: serverless-deploy-bucket-mia 25 | maxPreviousDeploymentArtifacts: 3 26 | stage: dev 27 | region: ap-southeast-2 28 | versionFunctions: true 29 | layers: arn:aws:lambda:ap-southeast-2:770693421928:layer:Klayers-python38-pandas:48 30 | 31 | # Lambda IAM Role Configuration 32 | iamRoleStatements: 33 | - Effect: Allow 34 | Action: 35 | - s3:HeadObject 36 | - s3:GetObject 37 | - s3:PutObject 38 | Resource: 39 | - arn:aws:s3:::src-bucket-datapipeline/* 40 | - arn:aws:s3:::dst-bucket-snowpipeline/* 41 | - Effect: Allow 42 | Action: 43 | - logs:* 44 | Resource: "*" 45 | 46 | # exclude packaging information 47 | package: 48 | exclude: 49 | - __cache__/** 50 | - __pycache__/** 51 | - node_modules/** 52 | - 'package.json' 53 | - 'package-lock.json' 54 | - 'snowflake.sql' 55 | - 'readme.md' 56 | 57 | # Lambda Function 58 | functions: 59 | etl_process: 60 | handler: etl_process.lambda_handler 61 | name: my_etl_lambda 62 | description: "AWS-Lambda-to-Snowflake-Data-Cloud-ETL-Integration" 63 | events: 64 | - s3: 65 | bucket: src-bucket-datapipeline 66 | event: s3:ObjectCreated:* 67 | existing: true 68 | 69 | # Build Destination Bucket 70 | resources: 71 | Resources: 72 | SnowflakeStagingOutbucket: 73 | Type: AWS::S3::Bucket 74 | Properties: 75 | AccessControl: BucketOwnerFullControl 76 | BucketName: dst-bucket-snowpipeline -------------------------------------------------------------------------------- /snowflake.sql: -------------------------------------------------------------------------------- 1 | // create virtual warehouse 2 | CREATE OR REPLACE WAREHOUSE mywarehouse WITH 3 | WAREHOUSE_SIZE='X-SMALL' 4 | AUTO_SUSPEND = 120 5 | AUTO_RESUME = TRUE 6 | INITIALLY_SUSPENDED = TRUE; 7 | 8 | // create database 9 | CREATE OR REPLACE DATABASE ETL_PIPELINE; 10 | 11 | // create DIMDATE table 12 | CREATE OR REPLACE TABLE DIMDATE ( 13 | DATETIME_SKEY TIMESTAMP, 14 | PRIMARY KEY (DATETIME_SKEY) 15 | ); 16 | 17 | // create DIMPLATFORM table 18 | CREATE OR REPLACE TABLE DIMPLATFORM ( 19 | PLATFORM_SKEY INTEGER NOT NULL, 20 | PLATFORM_TYPE VARCHAR(200) NOT NULL, 21 | PRIMARY KEY (PLATFORM_SKEY) 22 | ); 23 | 24 | // create DIMSITE table 25 | CREATE OR REPLACE TABLE DIMSITE ( 26 | Site_SKEY INTEGER NOT NULL, 27 | Site VARCHAR(200) NOT NULL, 28 | PRIMARY KEY (Site_SKEY) 29 | ); 30 | 31 | // create DIMVIDEO table 32 | CREATE OR REPLACE TABLE DIMVIDEO ( 33 | Video_SKEY INTEGER NOT NULL, 34 | Video_Title TEXT NOT NULL, 35 | PRIMARY KEY (Video_SKEY) 36 | ); 37 | 38 | // create FACTTABLE 39 | CREATE OR REPLACE TABLE FACTTABLE ( 40 | DATETIME_SKEY TIMESTAMP REFERENCES DIMDATE(DATETIME_SKEY), 41 | Platform_SKEY INTEGER REFERENCES DIMPLATFORM(Platform_SKEY), 42 | Site_SKEY INTEGER REFERENCES DIMSITE(Site_SKEY), 43 | Video_SKEY INTEGER REFERENCES DIMVIDEO(Video_SKEY), 44 | events VARCHAR(150) NOT NULL 45 | ); 46 | 47 | // Create a file format 48 | CREATE OR REPLACE FILE FORMAT DataPipeline_CSV_Format 49 | TYPE = 
'CSV' 50 | TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI' 51 | skip_header = 1 52 | field_delimiter = ',' 53 | record_delimiter = '\\n' 54 | FIELD_OPTIONALLY_ENCLOSED_BY = '"'; 55 | 56 | // create a external stage 57 | CREATE OR REPLACE STAGE S3_to_Snowflake_Stage 58 | URL="S3://dst-bucket-snowpipeline" 59 | CREDENTIALS = (AWS_KEY_ID = '**************' AWS_SECRET_KEY = '**************') 60 | file_format = DataPipeline_CSV_Format; 61 | 62 | // create pipes 63 | CREATE OR REPLACE PIPE DimDate_Pipe 64 | AUTO_INGEST = TRUE 65 | AS COPY INTO DIMDATE 66 | FROM @S3_to_Snowflake_Stage/dimdate/ 67 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 68 | 69 | CREATE OR REPLACE PIPE Dimplatform_Pipe 70 | AUTO_INGEST = TRUE 71 | AS COPY INTO DIMPLATFORM 72 | FROM @S3_to_Snowflake_Stage/dimplatform/ 73 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 74 | 75 | CREATE OR REPLACE PIPE DimSite_Pipe 76 | AUTO_INGEST = TRUE 77 | AS COPY INTO DIMSITE 78 | FROM @S3_to_Snowflake_Stage/dimsite/ 79 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 80 | 81 | CREATE OR REPLACE PIPE DimVideo_Pipe 82 | AUTO_INGEST = TRUE 83 | AS COPY INTO DIMVIDEO 84 | FROM @S3_to_Snowflake_Stage/dimvideo/ 85 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 86 | 87 | CREATE OR REPLACE PIPE FactTable_Pipe 88 | AUTO_INGEST = TRUE 89 | AS COPY INTO FACTTABLE 90 | FROM @S3_to_Snowflake_Stage/facttable/ 91 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 92 | 93 | // PIPES COMMAND 94 | SHOW PIPES; -- check pipes to get notification_channel url 95 | SELECT SYSTEM$PIPE_STATUS(''); -- Check Pipe Status if need 96 | SELECT * FROM table(information_schema.copy_history(table_name=>'', start_time=> dateadd(hours, -1, current_timestamp()))); -- Show PIPE COPY history in specific table 97 | ALTER PIPE REFRESH; -- REFRESH PIPE 98 | 99 | // EXTERNAL STAGE COMMAND 100 | LIST @S3_to_Snowflake_Stage; -- Check if files exists in external stage 101 | REMOVE '@S3_to_Snowflake_Stage/dimdate/date.csv'; -- remove single file from external stage 102 | REMOVE @S3_to_Snowflake_Stage pattern='.*.csv'; -- remove all files from external stage 103 | 104 | // save notification_channel url for S3 Event Notification 105 | arn:aws:sqs:ap-southeast-2:123456789012:sf-snowpipe-**************-************** -------------------------------------------------------------------------------- /table_design.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# **Table Design**\n", 8 | "## RAW TABLE\n" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 54, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "data": { 18 | "text/html": [ 19 | "\n", 21 | "
\n", 22 | " \n", 23 | " \n", 24 | " \n", 25 | " \n", 26 | " \n", 27 | " \n", 28 | " \n", 29 | " \n", 30 | " \n", 31 | " \n", 32 | " \n", 33 | " \n", 34 | " \n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIMEVARCHAR2(30 BYTE)NYesnull1Data from raw file
VIDEOTITLEVARCHAR2(200 BYTE)NYesnull2Data from raw file
EVENTSVARCHAR2(150 BYTE)NYesnull3Data from raw file
\n" 63 | ], 64 | "text/plain": [ 65 | "" 66 | ] 67 | }, 68 | "execution_count": 54, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "import pandas as pd\n", 75 | "raw_table = pd.DataFrame([['DATETIME', 'VARCHAR2(30 BYTE)', 'N', 'Yes', 'null', 1, 'Data from raw file'],\n", 76 | " ['VIDEOTITLE', 'VARCHAR2(200 BYTE)', 'N', 'Yes', 'null', 2, 'Data from raw file'],\n", 77 | " ['EVENTS', 'VARCHAR2(150 BYTE)', 'N', 'Yes', 'null', 3, 'Data from raw file']],\n", 78 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 79 | "raw_table.style.hide_index()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Dimension Table DIMDATE " 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 53, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "data": { 96 | "text/html": [ 97 | "\n", 99 | "\n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIME_SKEYTIMESTAMPYNoNOT NULL1Data derived from DIMDATE.DATETIME_SKEY
\n" 123 | ], 124 | "text/plain": [ 125 | "" 126 | ] 127 | }, 128 | "execution_count": 53, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "import pandas as pd\n", 135 | "\n", 136 | "dimdate = pd.DataFrame([['DATETIME_SKEY', 'TIMESTAMP', 'Y', 'No', 'NOT NULL', 1, 'Data derived from DIMDATE.DATETIME_SKEY']],\n", 137 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 138 | "\n", 139 | "dimdate.style.hide_index()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Dimension Table DIMPLATFORM" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 52, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/html": [ 157 | "\n", 159 | "\n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
PLATFORM_SKEYINTEGERYNoNOT NULL1Data derived DIMPLATFORM.PLATFORM_SKEY
PLATFORM_TYPEVARCHAR2(200 BYTE)NNoNOT NULL2Data derived DIMPLATFORM.PLATFORM
\n" 192 | ], 193 | "text/plain": [ 194 | "" 195 | ] 196 | }, 197 | "execution_count": 52, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "import pandas as pd\n", 204 | "\n", 205 | "dimplatform = pd.DataFrame([['PLATFORM_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived DIMPLATFORM.PLATFORM_SKEY'],\n", 206 | " ['PLATFORM_TYPE', 'VARCHAR2(200 BYTE)', 'N', 'No', 'NOT NULL', 2, 'Data derived DIMPLATFORM.PLATFORM']],\n", 207 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 208 | " \n", 209 | "dimplatform.style.hide_index()" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "## Dimension Table DIMSITE\n" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 55, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/html": [ 227 | "\n", 229 | "\n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
SITE_SKEYINTEGERYNoNOT NULL1Data derived DIMSITE.SITE_SKEY
SITEVARCHAR(200)NNoNOT NULL2Data derived DIMSITE.SITE
\n" 262 | ], 263 | "text/plain": [ 264 | "" 265 | ] 266 | }, 267 | "execution_count": 55, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "import pandas as pd\n", 274 | "\n", 275 | "dimsite = pd.DataFrame([['SITE_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived DIMSITE.SITE_SKEY'],\n", 276 | " ['SITE', 'VARCHAR(200)', 'N', 'No', 'NOT NULL', 2, 'Data derived DIMSITE.SITE']],\n", 277 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 278 | " \n", 279 | "dimsite.style.hide_index()" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "## Dimension Table DIMVIDEO\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 56, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/html": [ 297 | "\n", 299 | "\n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
VIDEO_SKEYINTEGERYNoNOT NULL1Data derived from DIMVIDEO.VIDEO_SKEY
VIDEO_TITLETEXTNNoNOT NULL2Data derived from DIMVIDEO.VIDEO_TITLE
\n" 332 | ], 333 | "text/plain": [ 334 | "" 335 | ] 336 | }, 337 | "execution_count": 56, 338 | "metadata": {}, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "import pandas as pd\n", 344 | "\n", 345 | "dimvideo = pd.DataFrame([['VIDEO_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived from DIMVIDEO.VIDEO_SKEY'],\n", 346 | " ['VIDEO_TITLE', 'TEXT', 'N', 'No', 'NOT NULL', 2, 'Data derived from DIMVIDEO.VIDEO_TITLE']],\n", 347 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 348 | " \n", 349 | "dimvideo.style.hide_index()" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Fact Table FACTTABLE\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 57, 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/html": [ 367 | "\n", 369 | "\n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIME_SKEYTIMESTAMPNNoNOT NULL1Data derived from FACTTABLE.DATETIME_SKEY
Platform_SKEYINTEGERNNoNOT NULL2Data derived from FACTTABLE.PLATFORM_SKEY
Site_SKEYINTEGERNNoNOT NULL3Data derived from FACTTABLE.SITE_SKEY
Video_SKEYINTEGERNNoNOT NULL4Data derived from FACTTABLE.VIDEO_SKEY
EventsVARCHAR2(150 BYTE)NNoNOT NULL5Data derived from FACTTABLE.EVENTS
\n" 429 | ], 430 | "text/plain": [ 431 | "" 432 | ] 433 | }, 434 | "execution_count": 57, 435 | "metadata": {}, 436 | "output_type": "execute_result" 437 | } 438 | ], 439 | "source": [ 440 | "import pandas as pd\n", 441 | "facttable = pd.DataFrame([['DATETIME_SKEY', 'TIMESTAMP', 'N', 'No', 'NOT NULL', 1, 'Data derived from FACTTABLE.DATETIME_SKEY'],\n", 442 | " ['Platform_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 2, 'Data derived from FACTTABLE.PLATFORM_SKEY'],\n", 443 | " ['Site_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 3, 'Data derived from FACTTABLE.SITE_SKEY'],\n", 444 | " ['Video_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 4, 'Data derived from FACTTABLE.VIDEO_SKEY'],\n", 445 | " ['Events', 'VARCHAR2(150 BYTE)', 'N', 'No', 'NOT NULL', 5, 'Data derived from FACTTABLE.EVENTS']],\n", 446 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 447 | " \n", 448 | "facttable.style.hide_index()" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "# **Pre-process the Raw Data**" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "## 1. Data auditing" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 117, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "DateTime 24\n", 475 | "VideoTitle 157\n", 476 | "events 95\n", 477 | "dtype: int64\n" 478 | ] 479 | } 480 | ], 481 | "source": [ 482 | "import pandas as pd\n", 483 | "\n", 484 | "df = pd.read_csv('source.csv', usecols=[0,1,2])\n", 485 | "df.columns = ['DateTime', 'VideoTitle', 'events']\n", 486 | "\n", 487 | "print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}))" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "## 2. Identify Objects for ETL Process" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 121, 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "data": { 504 | "text/plain": [ 505 | "array(['App Web', 'news', 'App iPhone', 'App Android', 'App iPad'],\n", 506 | " dtype=object)" 507 | ] 508 | }, 509 | "execution_count": 121, 510 | "metadata": {}, 511 | "output_type": "execute_result" 512 | } 513 | ], 514 | "source": [ 515 | "# Identify the type of PLATFORM and SITE\n", 516 | "import pandas as pd\n", 517 | "\n", 518 | "df = pd.read_csv('source.csv', usecols=[0,1,2])\n", 519 | "df.columns = ['DateTime', 'VideoTitle', 'events']\n", 520 | "df[['0','1','2','3','4']] = df['VideoTitle'].str.split(\"|\",expand=True,)\n", 521 | "\n", 522 | "df['0'].unique()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "## 3. The sql script to create the table\n", 530 | "\n", 531 | "check snowflake.sql" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "# **NOTE**" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "1. SKEY stands for surrogate key. \n", 546 | "2. The current design is Dimension Type One. \n", 547 | "3. If the source dimension data contains not only the PK but also some attributes, and we want to track the changes of attributes, we should use Dimension Type Two. 
" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "### **One sample of Dimension Type Two**" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 207, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "data": { 564 | "text/html": [ 565 | "\n", 567 | "\n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | "
Data from current date 04/03/2022
Product_IDProductPriceLocation
P01iPhone 131399Noel Leeming
P03iPhone 13 Pro Max1999Noel Leeming
\n" 592 | ], 593 | "text/plain": [ 594 | "" 595 | ] 596 | }, 597 | "execution_count": 207, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "import pandas as pd\n", 604 | "sample_table = pd.DataFrame([['P01', 'iPhone 13', '1399', 'Noel Leeming'],['P03', 'iPhone 13 Pro Max', '1999', 'Noel Leeming']],columns=['Product_ID', 'Product', 'Price', 'Location'])\n", 605 | "sample_table.style.hide_index().set_caption('Data from current date 04/03/2022')" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "### **Data in dimension table**" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 204, 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/html": [ 623 | "\n", 625 | "\n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | "
Data in dimension table
Product_SKEYProduct_IDProductPriceLocationCurrent_FlagStart_DateEnd_Date
1P01iPhone 131499Noel LeemingY31/12/202131/12/9999
2P02iPhone 13 Pro1799Noel LeemingY20/01/202231/12/9999
\n" 662 | ], 663 | "text/plain": [ 664 | "" 665 | ] 666 | }, 667 | "execution_count": 204, 668 | "metadata": {}, 669 | "output_type": "execute_result" 670 | } 671 | ], 672 | "source": [ 673 | "import pandas as pd\n", 674 | "\n", 675 | "dim_table = pd.DataFrame([['1', 'P01', 'iPhone 13', '1499', 'Noel Leeming', 'Y', '31/12/2021', '31/12/9999'], ['2', 'P02', 'iPhone 13 Pro', '1799', 'Noel Leeming', 'Y', '20/01/2022', '31/12/9999']], columns=['Product_SKEY', 'Product_ID', 'Product', 'Price', 'Location','Current_Flag', 'Start_Date', 'End_Date'])\n", 676 | "dim_table.style.hide_index().set_caption('Data in dimension table')" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "### Update data in dimension table:\n", 684 | "* new product **(P03)** \n", 685 | "* update product **(P01)**" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 205, 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "data": { 695 | "text/html": [ 696 | "\n", 698 | "\n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | "
Update dimension table
Product_SKEYProduct_IDProductPriceLocationCurrent_FlagStart_DateEnd_Date
1P01iPhone 131499Noel LeemingN31/12/202103/03/2022
2P02iPhone 13 Pro1799Noel LeemingY20/01/202231/12/9999
3P03iPhone 13 Pro Max1999Noel LeemingY04/03/202231/12/9999
4P01iPhone 131399Noel LeemingY04/03/202231/12/9999
\n" 755 | ], 756 | "text/plain": [ 757 | "" 758 | ] 759 | }, 760 | "execution_count": 205, 761 | "metadata": {}, 762 | "output_type": "execute_result" 763 | } 764 | ], 765 | "source": [ 766 | "import pandas as pd\n", 767 | "dim_table = pd.DataFrame([['1', 'P01', 'iPhone 13', '1499', 'Noel Leeming', 'N', '31/12/2021', '03/03/2022'],\n", 768 | " ['2', 'P02', 'iPhone 13 Pro', '1799', 'Noel Leeming', 'Y', '20/01/2022', '31/12/9999'],\n", 769 | " ['3', 'P03', 'iPhone 13 Pro Max', '1999', 'Noel Leeming', 'Y', '04/03/2022', '31/12/9999'],\n", 770 | " ['4', 'P01', 'iPhone 13', '1399', 'Noel Leeming', 'Y', '04/03/2022', '31/12/9999']],\n", 771 | " columns=['Product_SKEY', 'Product_ID', 'Product', 'Price', 'Location','Current_Flag', 'Start_Date', 'End_Date'])\n", 772 | " \n", 773 | "dim_table.style.hide_index().set_caption('Update dimension table')" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "When there is a **new record** coming in, we generate a new record with new SKEY, \n", 781 | "* Current_Flag = ‘Y’\n", 782 | "* Start_Date = Current_Date\n", 783 | "* End_Date = 31/12/9999\n", 784 | "\n", 785 | "When there is a **updated record** coming in, we also generate a new record with new SKEY,\n", 786 | "* Current_Flag = ‘Y’\n", 787 | "* Start_Date = Current_Date\n", 788 | "* End_Date = 31/12/9999\n", 789 | "\n", 790 | "And at same time we need to update the **old record** in dimension table with \n", 791 | "* Current_Flag = ‘N’\n", 792 | "* End_Date = Current_Date – 1\n", 793 | "\n", 794 | "Therefore, when we populate new records into fact table, we need to put a filter such as **Current_Flag = ‘Y’** in order to get the correct SKEY; \n", 795 | "\n", 796 | "If we want to track the history data in dimension table for **certain days or certain period**, we need to put a time range filter such as **EVENT_DATE(or CONTACT_DATE)** between Start_Date and End_Date\n", 797 | "\n", 798 | "For example, if in fact table we see a transaction like customer purchased product(P01) on 01/02/22, by looking at product dimension table, we could find the price that customer paid at that moment was 1499 not 1399, although 1399 is the current price of P01." 799 | ] 800 | } 801 | ], 802 | "metadata": { 803 | "interpreter": { 804 | "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" 805 | }, 806 | "kernelspec": { 807 | "display_name": "Python 3.8.2 64-bit", 808 | "language": "python", 809 | "name": "python3" 810 | }, 811 | "language_info": { 812 | "codemirror_mode": { 813 | "name": "ipython", 814 | "version": 3 815 | }, 816 | "file_extension": ".py", 817 | "mimetype": "text/x-python", 818 | "name": "python", 819 | "nbconvert_exporter": "python", 820 | "pygments_lexer": "ipython3", 821 | "version": "3.8.2" 822 | }, 823 | "orig_nbformat": 4 824 | }, 825 | "nbformat": 4, 826 | "nbformat_minor": 2 827 | } 828 | --------------------------------------------------------------------------------