├── .env ├── .gitignore ├── data_pipeline.jpg ├── etl_process.py ├── readme.md ├── serverless.yml ├── snowflake.sql └── table_design.ipynb /.env: -------------------------------------------------------------------------------- 1 | # This will be automatically added to the Lambda runtime if you list it under custom.dotenv.include in serverless.yml 2 | TZ_LOCAL=Pacific/Auckland -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .serverless/ 2 | .DS_Store 3 | FullScript.sql 4 | FullScript.py 5 | FullScript.ipynb 6 | data.ctl 7 | note.txt 8 | rawfile.csv -------------------------------------------------------------------------------- /data_pipeline.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AWS-Serverless-Projects/Snowflake-Data-Cloud-ETL-Integration-with-Serverless/92b890d0a93cd226b8059d0e4ec3130c649746fb/data_pipeline.jpg -------------------------------------------------------------------------------- /etl_process.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import boto3 3 | import io 4 | from io import StringIO 5 | 6 | def lambda_handler(event, context): 7 | s3_file_key = event['Records'][0]['s3']['object']['key'] 8 | bucket = 'src-bucket-datapipeline' 9 | s3 = boto3.client('s3') 10 | obj = s3.get_object(Bucket=bucket, Key=s3_file_key) 11 | s3_resource = boto3.resource('s3') 12 | df = pd.read_csv(io.BytesIO(obj['Body'].read())) 13 | 14 | bucket='dst-bucket-snowpipeline' 15 | 16 | # ----------------------ETL PROCESS---------------------- 17 | pd.options.mode.chained_assignment = None # Disable SettingWithCopyWarning 18 | df1 = df[df['events'].str.contains('206', na=False)] # Filter rows to the required event value 19 | 20 | # Convert DateTime for DimDate Table and Fact Table 21 | df1['DATETIME_SKEY'] = pd.to_datetime(df1['DateTime']).dt.tz_convert(None).dt.strftime('%Y-%m-%d %H:%M') 22 | 23 | # Split by '|' for tables DimVideo, DimPlatform, and DimSite 24 | df1[['0','1','2','3','4']] = df1['VideoTitle'].str.split("|",expand=True,n=4).reindex(range(5), axis=1) 25 | 26 | # note: column '0' contains one of ['news', 'App Web', 'App Android', 'App iPhone', 'App iPad'] 27 | # Build DimVideo Table 28 | df1['Video_Title'] = df1.iloc[:, 1:].ffill(axis=1).iloc[:, -1] # Create Video_Title column 29 | df1['Video_SKEY'] = df1.groupby(['Video_Title']).ngroup() # create surrogate key 30 | dfvideo= df1.drop_duplicates(subset = ["Video_Title","Video_SKEY"]) 31 | 32 | # Build DimPlatform Table 33 | df1.loc[df1['0'].str.contains('Android'), 'Platform_Type'] = 'Platform' 34 | df1.loc[df1['0'].str.contains('iPad'), 'Platform_Type'] = 'Platform' 35 | df1.loc[df1['0'].str.contains('iPhone'), 'Platform_Type'] = 'Platform' 36 | df1.loc[df1['0'].str.contains('Web'), 'Platform_Type'] = 'Desktop' 37 | df1.loc[df1['0'].str.contains('news'), 'Platform_Type'] = 'Desktop' 38 | df1['Platform_SKEY'] = df1.groupby(['Platform_Type']).ngroup() # create surrogate key 39 | dfplatform= df1.drop_duplicates(subset = ["Platform_Type","Platform_SKEY"]) 40 | 41 | # Build DimSite Table 42 | df1.loc[df1['0'].str.contains('news'), 'Site'] = 'news' # Create Column 'Site' by news 43 | df1.loc[df1['0'].str.contains('Web'), 'Site'] = 'App Web' # Create Column 'Site' by web 44 | df1.loc[df1['Site'].isnull(), 'Site'] = 'Not Applicable' # fill NaN value 45 | df1['Site_SKEY'] = df1.groupby(['Site']).ngroup() # create surrogate key 46 
| dfsite= df1.drop_duplicates(subset = ["Site","Site_SKEY"]) 47 | 48 | # Export target tables to S3 Dst Bucket 49 | s3_resource.Object(bucket, 'dimdate/dimdate.csv').put(Body=df1[['DATETIME_SKEY']].to_csv(index=False)) 50 | s3_resource.Object(bucket, 'dimvideo/dimvideo.csv').put(Body=dfvideo[['Video_SKEY','Video_Title']].to_csv(index = False)) 51 | s3_resource.Object(bucket, 'dimplatform/dimplatform.csv').put(Body=dfplatform[['Platform_SKEY','Platform_Type']].to_csv(index = False)) 52 | s3_resource.Object(bucket, 'dimsite/dimsite.csv').put(Body=dfsite[['Site_SKEY','Site']].to_csv(index = False)) 53 | s3_resource.Object(bucket, 'facttable/facttable.csv').put(Body=df1[['DATETIME_SKEY','Platform_SKEY','Site_SKEY','Video_SKEY','events']].to_csv(index = False)) -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Overview of the solution 2 | The project includes a local environment to inspect the data and deploys the stack using AWS S3, AWS Lambda (via the Serverless Framework), AWS SQS, and Snowflake Snowpipe. 3 | 4 | The deployment includes an object-created event that triggers an AWS Lambda function. The function ingests the raw data from an external source, transforms the content, and saves the cleaned output. 5 | 6 | The cleaned data is stored in an S3 destination bucket; SQS event notifications on that bucket tell Snowpipe when new files are ready to load. 7 | 8 | The following diagram illustrates my architecture: 9 | 10 | ![alt text](https://github.com/miaaaalu/AWS-Lambda-to-Snowflake-Data-Cloud-ETL-Integration-with-Serverless/blob/master/data_pipeline.jpg?raw=true) 11 | 12 | The solution includes the following high-level steps: 13 | 14 | 1. Design the star schema for the source file 15 | 2. Configure Snowflake Snowpipe 16 | 3. Build and test the ETL process 17 | 4. Inspect the function and the AWS CloudFormation template, then deploy with the Serverless Framework 18 | 5. Create an event notification for the S3 destination bucket 19 | 20 | # Test Environment 21 | ```powershell 22 | * System: macOS Big Sur Version 11.5.2 23 | 24 | * Programming: Python 3.8, Snowflake SQL 25 | 26 | * Tools: Snowflake, Cloud Service Provider: AWS (S3, Lambda, and SQS), Serverless Framework 27 | ``` 28 | 29 | # Process Workflow 30 | ## Preparation 31 | ### AWS 32 | ```powershell 33 | 1. Create a source bucket in S3 34 | 2. Create a deployment bucket for the Serverless package in S3 35 | 3. Create an IAM user for Snowflake with S3 full-access policies 36 | ``` 37 | ### Python 38 | ```powershell 39 | 3.8 preferred 40 | ``` 41 | ## Step 1 — Table Design 42 | ```powershell 43 | For details, check table_design.ipynb 44 | ``` 45 | 46 | ## Step 2 — Snowflake Snowpipe Configuration 47 | ```powershell 48 | 1. Database Creation 49 | 2. Tables Creation (5 tables in this project) 50 | 3. File Format Creation with AWS Credentials 51 | 4. External Stage Creation 52 | 5. Pipes Creation 53 | 54 | For details, check snowflake.sql 55 | ``` 56 | 57 | ## Step 3 — Serverless Framework Deploy 58 | 59 | *Serverless Framework is open-source software that builds, compiles, and packages code for serverless deployment, and then deploys the package to the cloud. With Python on AWS, for example, Serverless Framework creates the self-contained Python environment, including all dependencies.* 60 | 61 | ### 1. 
Serverless Project Initialization 62 | ```powershell 63 | % serverless create --template aws-python3 --path {project_folder} 64 | ``` 65 | 66 | ### 2. Open the project in VS Code 67 | ```powershell 68 | % open -a "Visual Studio Code" ${project_folder} 69 | ``` 70 | 71 | ### 3. Serverless Plugin Installation 72 | ```powershell 73 | # Install Serverless Prune Plugin 74 | % sudo sls plugin install -n serverless-prune-plugin 75 | 76 | # Install serverless python requirements (https://github.com/UnitedIncome/serverless-python-requirements) 77 | % sudo sls plugin install -n serverless-python-requirements 78 | 79 | # Install serverless dotenv plugin 80 | % sudo npm i -D serverless-dotenv-plugin 81 | ``` 82 | ### 4. Modify the Python File for the ETL Process 83 | ```powershell 84 | # Rename python file 85 | % mv handler.py ${project_handle}.py 86 | 87 | # Handle your Python packaging 88 | By default, the pandas library is not available in the AWS Lambda Python runtime. 89 | To use pandas in the Lambda function, either attach a requirements.txt 90 | OR add a pandas layer to the function in serverless.yml. 91 | 92 | # option 1: attach a requirements.txt with the needed libraries 93 | % touch requirements.txt 94 | % echo "pandas" >> requirements.txt 95 | % pip install -r requirements.txt 96 | 97 | # option 2: add the pandas layer from Klayers in serverless.yml (recommended) 98 | source from Klayers: https://github.com/keithrozario/Klayers/tree/master/deployments/python3.8/arns 99 | 100 | # ETL Process 101 | a. Load the raw file 102 | b. Data cleaning 103 | c. Data washing 104 | d. Data transformation for tables DIMDATE, DIMPLATFORM, DIMSITE, DIMVIDEO, and FACTTABLE 105 | e. Load data into staging folders - DIMDATE, DIMPLATFORM, DIMSITE, DIMVIDEO, and FACTTABLE 106 | 107 | For details, check etl_process.py 108 | ``` 109 | ### 5. Create a .env file and set environment variables if needed 110 | ```env 111 | APPLICATION = ${your project name} 112 | STAGE = ${your stage} 113 | REGION = ${your region} 114 | TZ_LOCAL = ${your timezone} 115 | ``` 116 | ### 6. Modify the serverless.yml file 117 | ```powershell 118 | For details, check serverless.yml 119 | ``` 120 | 121 | ### 7. Deploy 122 | ```powershell 123 | # Deploy to AWS 124 | % sls deploy 125 | ``` 126 | 127 | ## Step 4 — Add Event Notification for S3 Bucket 128 | 129 | ``` 130 | This notification informs Snowpipe via an SQS queue when files are ready to load. 131 | 132 | Note the SQS queue ARN from the notification_channel column after you execute the "show pipes" command. 133 | 134 | Copy the ARN to a notepad. 135 | 136 | Then paste it into the event notification configuration for the destination S3 bucket, as illustrated below.
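
A rough sketch of this step (the queue ARN and the console path are illustrative; use the notification_channel value returned in your own account):

    -- in Snowflake
    show pipes;
    -- notification_channel looks like: arn:aws:sqs:ap-southeast-2:<account_id>:sf-snowpipe-xxxxxxxx

    -- in the AWS console
    S3 > dst-bucket-snowpipeline > Properties > Event notifications > Create event notification
        Event types: All object create events
        Destination: SQS queue -> paste the ARN copied above

Auto-ingest pipes in one Snowflake account usually share the same SQS queue, so a single event notification on the destination bucket should cover all five pipes.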
137 | ``` -------------------------------------------------------------------------------- /serverless.yml: -------------------------------------------------------------------------------- 1 | service: my-etl-pipeline 2 | 3 | plugins: 4 | - serverless-prune-plugin 5 | - serverless-python-requirements 6 | - serverless-dotenv-plugin 7 | 8 | custom: 9 | pythonRequirements: 10 | dockerizePip: false 11 | prune: 12 | automatic: true 13 | number: 3 14 | dotenv: 15 | include: 16 | - TZ_LOCAL 17 | 18 | provider: 19 | name: aws 20 | runtime: python3.8 21 | memorySize: 256 22 | timeout: 300 23 | deploymentBucket: 24 | name: serverless-deploy-bucket-mia 25 | maxPreviousDeploymentArtifacts: 3 26 | stage: dev 27 | region: ap-southeast-2 28 | versionFunctions: true 29 | layers: arn:aws:lambda:ap-southeast-2:770693421928:layer:Klayers-python38-pandas:48 30 | 31 | # Lambda IAM Role Configuration 32 | iamRoleStatements: 33 | - Effect: Allow 34 | Action: 35 | - s3:HeadObject 36 | - s3:GetObject 37 | - s3:PutObject 38 | Resource: 39 | - arn:aws:s3:::src-bucket-datapipeline/* 40 | - arn:aws:s3:::dst-bucket-snowpipeline/* 41 | - Effect: Allow 42 | Action: 43 | - logs:* 44 | Resource: "*" 45 | 46 | # exclude packaging information 47 | package: 48 | exclude: 49 | - __cache__/** 50 | - __pycache__/** 51 | - node_modules/** 52 | - 'package.json' 53 | - 'package-lock.json' 54 | - 'snowflake.sql' 55 | - 'readme.md' 56 | 57 | # Lambda Function 58 | functions: 59 | etl_process: 60 | handler: etl_process.lambda_handler 61 | name: my_etl_lambda 62 | description: "AWS-Lambda-to-Snowflake-Data-Cloud-ETL-Integration" 63 | events: 64 | - s3: 65 | bucket: src-bucket-datapipeline 66 | event: s3:ObjectCreated:* 67 | existing: true 68 | 69 | # Build Destination Bucket 70 | resources: 71 | Resources: 72 | SnowflakeStagingOutbucket: 73 | Type: AWS::S3::Bucket 74 | Properties: 75 | AccessControl: BucketOwnerFullControl 76 | BucketName: dst-bucket-snowpipeline -------------------------------------------------------------------------------- /snowflake.sql: -------------------------------------------------------------------------------- 1 | // create virtual warehouse 2 | CREATE OR REPLACE WAREHOUSE mywarehouse WITH 3 | WAREHOUSE_SIZE='X-SMALL' 4 | AUTO_SUSPEND = 120 5 | AUTO_RESUME = TRUE 6 | INITIALLY_SUSPENDED = TRUE; 7 | 8 | // create database 9 | CREATE OR REPLACE DATABASE ETL_PIPELINE; 10 | 11 | // create DIMDATE table 12 | CREATE OR REPLACE TABLE DIMDATE ( 13 | DATETIME_SKEY TIMESTAMP, 14 | PRIMARY KEY (DATETIME_SKEY) 15 | ); 16 | 17 | // create DIMPLATFORM table 18 | CREATE OR REPLACE TABLE DIMPLATFORM ( 19 | PLATFORM_SKEY INTEGER NOT NULL, 20 | PLATFORM_TYPE VARCHAR(200) NOT NULL, 21 | PRIMARY KEY (PLATFORM_SKEY) 22 | ); 23 | 24 | // create DIMSITE table 25 | CREATE OR REPLACE TABLE DIMSITE ( 26 | Site_SKEY INTEGER NOT NULL, 27 | Site VARCHAR(200) NOT NULL, 28 | PRIMARY KEY (Site_SKEY) 29 | ); 30 | 31 | // create DIMVIDEO table 32 | CREATE OR REPLACE TABLE DIMVIDEO ( 33 | Video_SKEY INTEGER NOT NULL, 34 | Video_Title TEXT NOT NULL, 35 | PRIMARY KEY (Video_SKEY) 36 | ); 37 | 38 | // create FACTTABLE 39 | CREATE OR REPLACE TABLE FACTTABLE ( 40 | DATETIME_SKEY TIMESTAMP REFERENCES DIMDATE(DATETIME_SKEY), 41 | Platform_SKEY INTEGER REFERENCES DIMPLATFORM(Platform_SKEY), 42 | Site_SKEY INTEGER REFERENCES DIMSITE(Site_SKEY), 43 | Video_SKEY INTEGER REFERENCES DIMVIDEO(Video_SKEY), 44 | events VARCHAR(150) NOT NULL 45 | ); 46 | 47 | // Create a file format 48 | CREATE OR REPLACE FILE FORMAT DataPipeline_CSV_Format 49 | TYPE = 
'CSV' 50 | TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI' 51 | skip_header = 1 52 | field_delimiter = ',' 53 | record_delimiter = '\\n' 54 | FIELD_OPTIONALLY_ENCLOSED_BY = '"'; 55 | 56 | // create a external stage 57 | CREATE OR REPLACE STAGE S3_to_Snowflake_Stage 58 | URL="S3://dst-bucket-snowpipeline" 59 | CREDENTIALS = (AWS_KEY_ID = '**************' AWS_SECRET_KEY = '**************') 60 | file_format = DataPipeline_CSV_Format; 61 | 62 | // create pipes 63 | CREATE OR REPLACE PIPE DimDate_Pipe 64 | AUTO_INGEST = TRUE 65 | AS COPY INTO DIMDATE 66 | FROM @S3_to_Snowflake_Stage/dimdate/ 67 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 68 | 69 | CREATE OR REPLACE PIPE Dimplatform_Pipe 70 | AUTO_INGEST = TRUE 71 | AS COPY INTO DIMPLATFORM 72 | FROM @S3_to_Snowflake_Stage/dimplatform/ 73 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 74 | 75 | CREATE OR REPLACE PIPE DimSite_Pipe 76 | AUTO_INGEST = TRUE 77 | AS COPY INTO DIMSITE 78 | FROM @S3_to_Snowflake_Stage/dimsite/ 79 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 80 | 81 | CREATE OR REPLACE PIPE DimVideo_Pipe 82 | AUTO_INGEST = TRUE 83 | AS COPY INTO DIMVIDEO 84 | FROM @S3_to_Snowflake_Stage/dimvideo/ 85 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 86 | 87 | CREATE OR REPLACE PIPE FactTable_Pipe 88 | AUTO_INGEST = TRUE 89 | AS COPY INTO FACTTABLE 90 | FROM @S3_to_Snowflake_Stage/facttable/ 91 | FILE_FORMAT = (FORMAT_NAME = DataPipeline_CSV_Format); 92 | 93 | // PIPES COMMAND 94 | SHOW PIPES; -- check pipes to get notification_channel url 95 | SELECT SYSTEM$PIPE_STATUS(''); -- Check Pipe Status if need 96 | SELECT * FROM table(information_schema.copy_history(table_name=>'', start_time=> dateadd(hours, -1, current_timestamp()))); -- Show PIPE COPY history in specific table 97 | ALTER PIPE REFRESH; -- REFRESH PIPE 98 | 99 | // EXTERNAL STAGE COMMAND 100 | LIST @S3_to_Snowflake_Stage; -- Check if files exists in external stage 101 | REMOVE '@S3_to_Snowflake_Stage/dimdate/date.csv'; -- remove single file from external stage 102 | REMOVE @S3_to_Snowflake_Stage pattern='.*.csv'; -- remove all files from external stage 103 | 104 | // save notification_channel url for S3 Event Notification 105 | arn:aws:sqs:ap-southeast-2:123456789012:sf-snowpipe-**************-************** -------------------------------------------------------------------------------- /table_design.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# **Table Design**\n", 8 | "## RAW TABLE\n" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 54, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "data": { 18 | "text/html": [ 19 | "\n", 21 | "
\n", 22 | " \n", 23 | " \n", 24 | " \n", 25 | " \n", 26 | " \n", 27 | " \n", 28 | " \n", 29 | " \n", 30 | " \n", 31 | " \n", 32 | " \n", 33 | " \n", 34 | " \n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIMEVARCHAR2(30 BYTE)NYesnull1Data from raw file
VIDEOTITLEVARCHAR2(200 BYTE)NYesnull2Data from raw file
EVENTSVARCHAR2(150 BYTE)NYesnull3Data from raw file
\n" 63 | ], 64 | "text/plain": [ 65 | "" 66 | ] 67 | }, 68 | "execution_count": 54, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "import pandas as pd\n", 75 | "raw_table = pd.DataFrame([['DATETIME', 'VARCHAR2(30 BYTE)', 'N', 'Yes', 'null', 1, 'Data from raw file'],\n", 76 | " ['VIDEOTITLE', 'VARCHAR2(200 BYTE)', 'N', 'Yes', 'null', 2, 'Data from raw file'],\n", 77 | " ['EVENTS', 'VARCHAR2(150 BYTE)', 'N', 'Yes', 'null', 3, 'Data from raw file']],\n", 78 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 79 | "raw_table.style.hide_index()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Dimension Table DIMDATE " 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 53, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "data": { 96 | "text/html": [ 97 | "\n", 99 | "\n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIME_SKEYTIMESTAMPYNoNOT NULL1Data derived from DIMDATE.DATETIME_SKEY
\n" 123 | ], 124 | "text/plain": [ 125 | "" 126 | ] 127 | }, 128 | "execution_count": 53, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "import pandas as pd\n", 135 | "\n", 136 | "dimdate = pd.DataFrame([['DATETIME_SKEY', 'TIMESTAMP', 'Y', 'No', 'NOT NULL', 1, 'Data derived from DIMDATE.DATETIME_SKEY']],\n", 137 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 138 | "\n", 139 | "dimdate.style.hide_index()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Dimension Table DIMPLATFORM" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 52, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/html": [ 157 | "\n", 159 | "\n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
PLATFORM_SKEYINTEGERYNoNOT NULL1Data derived DIMPLATFORM.PLATFORM_SKEY
PLATFORM_TYPEVARCHAR2(200 BYTE)NNoNOT NULL2Data derived DIMPLATFORM.PLATFORM
\n" 192 | ], 193 | "text/plain": [ 194 | "" 195 | ] 196 | }, 197 | "execution_count": 52, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "import pandas as pd\n", 204 | "\n", 205 | "dimplatform = pd.DataFrame([['PLATFORM_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived DIMPLATFORM.PLATFORM_SKEY'],\n", 206 | " ['PLATFORM_TYPE', 'VARCHAR2(200 BYTE)', 'N', 'No', 'NOT NULL', 2, 'Data derived DIMPLATFORM.PLATFORM']],\n", 207 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 208 | " \n", 209 | "dimplatform.style.hide_index()" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "## Dimension Table DIMSITE\n" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 55, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/html": [ 227 | "\n", 229 | "\n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
SITE_SKEYINTEGERYNoNOT NULL1Data derived DIMSITE.SITE_SKEY
SITEVARCHAR(200)NNoNOT NULL2Data derived DIMSITE.SITE
\n" 262 | ], 263 | "text/plain": [ 264 | "" 265 | ] 266 | }, 267 | "execution_count": 55, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "import pandas as pd\n", 274 | "\n", 275 | "dimsite = pd.DataFrame([['SITE_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived DIMSITE.SITE_SKEY'],\n", 276 | " ['SITE', 'VARCHAR(200)', 'N', 'No', 'NOT NULL', 2, 'Data derived DIMSITE.SITE']],\n", 277 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 278 | " \n", 279 | "dimsite.style.hide_index()" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "## Dimension Table DIMVIDEO\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 56, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/html": [ 297 | "\n", 299 | "\n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
VIDEO_SKEYINTEGERYNoNOT NULL1Data derived from DIMVIDEO.VIDEO_SKEY
VIDEO_TITLETEXTNNoNOT NULL2Data derived from DIMVIDEO.VIDEO_TITLE
\n" 332 | ], 333 | "text/plain": [ 334 | "" 335 | ] 336 | }, 337 | "execution_count": 56, 338 | "metadata": {}, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "import pandas as pd\n", 344 | "\n", 345 | "dimvideo = pd.DataFrame([['VIDEO_SKEY', 'INTEGER', 'Y', 'No', 'NOT NULL', 1, 'Data derived from DIMVIDEO.VIDEO_SKEY'],\n", 346 | " ['VIDEO_TITLE', 'TEXT', 'N', 'No', 'NOT NULL', 2, 'Data derived from DIMVIDEO.VIDEO_TITLE']],\n", 347 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 348 | " \n", 349 | "dimvideo.style.hide_index()" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Fact Table FACTTABLE\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 57, 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/html": [ 367 | "\n", 369 | "\n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | "
COLUMN_NAMEDATA_TYPEPKNULLABLEDATA_DEFAULTCOLUMN_IDCOMMENTS
DATETIME_SKEYTIMESTAMPNNoNOT NULL1Data derived from FACTTABLE.DATETIME_SKEY
Platform_SKEYINTEGERNNoNOT NULL2Data derived from FACTTABLE.PLATFORM_SKEY
Site_SKEYINTEGERNNoNOT NULL3Data derived from FACTTABLE.SITE_SKEY
Video_SKEYINTEGERNNoNOT NULL4Data derived from FACTTABLE.VIDEO_SKEY
EventsVARCHAR2(150 BYTE)NNoNOT NULL5Data derived from FACTTABLE.EVENTS
\n" 429 | ], 430 | "text/plain": [ 431 | "" 432 | ] 433 | }, 434 | "execution_count": 57, 435 | "metadata": {}, 436 | "output_type": "execute_result" 437 | } 438 | ], 439 | "source": [ 440 | "import pandas as pd\n", 441 | "facttable = pd.DataFrame([['DATETIME_SKEY', 'TIMESTAMP', 'N', 'No', 'NOT NULL', 1, 'Data derived from FACTTABLE.DATETIME_SKEY'],\n", 442 | " ['Platform_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 2, 'Data derived from FACTTABLE.PLATFORM_SKEY'],\n", 443 | " ['Site_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 3, 'Data derived from FACTTABLE.SITE_SKEY'],\n", 444 | " ['Video_SKEY', 'INTEGER', 'N', 'No', 'NOT NULL', 4, 'Data derived from FACTTABLE.VIDEO_SKEY'],\n", 445 | " ['Events', 'VARCHAR2(150 BYTE)', 'N', 'No', 'NOT NULL', 5, 'Data derived from FACTTABLE.EVENTS']],\n", 446 | " columns=['COLUMN_NAME', 'DATA_TYPE', 'PK', 'NULLABLE', 'DATA_DEFAULT', 'COLUMN_ID', 'COMMENTS'])\n", 447 | " \n", 448 | "facttable.style.hide_index()" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "# **Pre-process the Raw Data**" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "## 1. Data auditing" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 117, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "DateTime 24\n", 475 | "VideoTitle 157\n", 476 | "events 95\n", 477 | "dtype: int64\n" 478 | ] 479 | } 480 | ], 481 | "source": [ 482 | "import pandas as pd\n", 483 | "\n", 484 | "df = pd.read_csv('source.csv', usecols=[0,1,2])\n", 485 | "df.columns = ['DateTime', 'VideoTitle', 'events']\n", 486 | "\n", 487 | "print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}))" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "## 2. Identify Objects for ETL Process" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 121, 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "data": { 504 | "text/plain": [ 505 | "array(['App Web', 'news', 'App iPhone', 'App Android', 'App iPad'],\n", 506 | " dtype=object)" 507 | ] 508 | }, 509 | "execution_count": 121, 510 | "metadata": {}, 511 | "output_type": "execute_result" 512 | } 513 | ], 514 | "source": [ 515 | "# Identify the type of PLATFORM and SITE\n", 516 | "import pandas as pd\n", 517 | "\n", 518 | "df = pd.read_csv('source.csv', usecols=[0,1,2])\n", 519 | "df.columns = ['DateTime', 'VideoTitle', 'events']\n", 520 | "df[['0','1','2','3','4']] = df['VideoTitle'].str.split(\"|\",expand=True,)\n", 521 | "\n", 522 | "df['0'].unique()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "## 3. The sql script to create the table\n", 530 | "\n", 531 | "check snowflake.sql" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "# **NOTE**" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "1. SKEY stands for surrogate key. \n", 546 | "2. The current design is Dimension Type One. \n", 547 | "3. If the source dimension data contains not only the PK but also some attributes, and we want to track the changes of attributes, we should use Dimension Type Two. 
" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "### **One sample of Dimension Type Two**" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 207, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "data": { 564 | "text/html": [ 565 | "\n", 567 | "\n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | "
Data from current date 04/03/2022
Product_IDProductPriceLocation
P01iPhone 131399Noel Leeming
P03iPhone 13 Pro Max1999Noel Leeming
\n" 592 | ], 593 | "text/plain": [ 594 | "" 595 | ] 596 | }, 597 | "execution_count": 207, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "import pandas as pd\n", 604 | "sample_table = pd.DataFrame([['P01', 'iPhone 13', '1399', 'Noel Leeming'],['P03', 'iPhone 13 Pro Max', '1999', 'Noel Leeming']],columns=['Product_ID', 'Product', 'Price', 'Location'])\n", 605 | "sample_table.style.hide_index().set_caption('Data from current date 04/03/2022')" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "### **Data in dimension table**" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 204, 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/html": [ 623 | "\n", 625 | "\n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | "
Data in dimension table
Product_SKEYProduct_IDProductPriceLocationCurrent_FlagStart_DateEnd_Date
1P01iPhone 131499Noel LeemingY31/12/202131/12/9999
2P02iPhone 13 Pro1799Noel LeemingY20/01/202231/12/9999
\n" 662 | ], 663 | "text/plain": [ 664 | "" 665 | ] 666 | }, 667 | "execution_count": 204, 668 | "metadata": {}, 669 | "output_type": "execute_result" 670 | } 671 | ], 672 | "source": [ 673 | "import pandas as pd\n", 674 | "\n", 675 | "dim_table = pd.DataFrame([['1', 'P01', 'iPhone 13', '1499', 'Noel Leeming', 'Y', '31/12/2021', '31/12/9999'], ['2', 'P02', 'iPhone 13 Pro', '1799', 'Noel Leeming', 'Y', '20/01/2022', '31/12/9999']], columns=['Product_SKEY', 'Product_ID', 'Product', 'Price', 'Location','Current_Flag', 'Start_Date', 'End_Date'])\n", 676 | "dim_table.style.hide_index().set_caption('Data in dimension table')" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "### Update data in dimension table:\n", 684 | "* new product **(P03)** \n", 685 | "* update product **(P01)**" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 205, 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "data": { 695 | "text/html": [ 696 | "\n", 698 | "\n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | "
Update dimension table
Product_SKEYProduct_IDProductPriceLocationCurrent_FlagStart_DateEnd_Date
1P01iPhone 131499Noel LeemingN31/12/202103/03/2022
2P02iPhone 13 Pro1799Noel LeemingY20/01/202231/12/9999
3P03iPhone 13 Pro Max1999Noel LeemingY04/03/202231/12/9999
4P01iPhone 131399Noel LeemingY04/03/202231/12/9999
\n" 755 | ], 756 | "text/plain": [ 757 | "" 758 | ] 759 | }, 760 | "execution_count": 205, 761 | "metadata": {}, 762 | "output_type": "execute_result" 763 | } 764 | ], 765 | "source": [ 766 | "import pandas as pd\n", 767 | "dim_table = pd.DataFrame([['1', 'P01', 'iPhone 13', '1499', 'Noel Leeming', 'N', '31/12/2021', '03/03/2022'],\n", 768 | " ['2', 'P02', 'iPhone 13 Pro', '1799', 'Noel Leeming', 'Y', '20/01/2022', '31/12/9999'],\n", 769 | " ['3', 'P03', 'iPhone 13 Pro Max', '1999', 'Noel Leeming', 'Y', '04/03/2022', '31/12/9999'],\n", 770 | " ['4', 'P01', 'iPhone 13', '1399', 'Noel Leeming', 'Y', '04/03/2022', '31/12/9999']],\n", 771 | " columns=['Product_SKEY', 'Product_ID', 'Product', 'Price', 'Location','Current_Flag', 'Start_Date', 'End_Date'])\n", 772 | " \n", 773 | "dim_table.style.hide_index().set_caption('Update dimension table')" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "When there is a **new record** coming in, we generate a new record with new SKEY, \n", 781 | "* Current_Flag = ‘Y’\n", 782 | "* Start_Date = Current_Date\n", 783 | "* End_Date = 31/12/9999\n", 784 | "\n", 785 | "When there is a **updated record** coming in, we also generate a new record with new SKEY,\n", 786 | "* Current_Flag = ‘Y’\n", 787 | "* Start_Date = Current_Date\n", 788 | "* End_Date = 31/12/9999\n", 789 | "\n", 790 | "And at same time we need to update the **old record** in dimension table with \n", 791 | "* Current_Flag = ‘N’\n", 792 | "* End_Date = Current_Date – 1\n", 793 | "\n", 794 | "Therefore, when we populate new records into fact table, we need to put a filter such as **Current_Flag = ‘Y’** in order to get the correct SKEY; \n", 795 | "\n", 796 | "If we want to track the history data in dimension table for **certain days or certain period**, we need to put a time range filter such as **EVENT_DATE(or CONTACT_DATE)** between Start_Date and End_Date\n", 797 | "\n", 798 | "For example, if in fact table we see a transaction like customer purchased product(P01) on 01/02/22, by looking at product dimension table, we could find the price that customer paid at that moment was 1499 not 1399, although 1399 is the current price of P01." 799 | ] 800 | } 801 | ], 802 | "metadata": { 803 | "interpreter": { 804 | "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" 805 | }, 806 | "kernelspec": { 807 | "display_name": "Python 3.8.2 64-bit", 808 | "language": "python", 809 | "name": "python3" 810 | }, 811 | "language_info": { 812 | "codemirror_mode": { 813 | "name": "ipython", 814 | "version": 3 815 | }, 816 | "file_extension": ".py", 817 | "mimetype": "text/x-python", 818 | "name": "python", 819 | "nbconvert_exporter": "python", 820 | "pygments_lexer": "ipython3", 821 | "version": "3.8.2" 822 | }, 823 | "orig_nbformat": 4 824 | }, 825 | "nbformat": 4, 826 | "nbformat_minor": 2 827 | } 828 | --------------------------------------------------------------------------------