├── README.md
├── airflow
│   ├── create_tables.sql
│   ├── dags
│   │   ├── __pycache__
│   │   │   ├── dag.cpython-36.pyc
│   │   │   ├── dags.cpython-36.pyc
│   │   │   ├── udac_example_dag.cpython-36.pyc
│   │   │   └── udac_example_dag_backup.cpython-36.pyc
│   │   └── dag.py
│   └── plugins
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-36.pyc
│       ├── helpers
│       │   ├── __init__.py
│       │   ├── __pycache__
│       │   │   ├── __init__.cpython-36.pyc
│       │   │   └── sql_queries.cpython-36.pyc
│       │   └── sql_queries.py
│       └── operators
│           ├── __init__.py
│           ├── __pycache__
│           │   ├── __init__.cpython-36.pyc
│           │   ├── data_quality.cpython-36.pyc
│           │   ├── load_dimension.cpython-36.pyc
│           │   ├── load_fact.cpython-36.pyc
│           │   └── stage_redshift.cpython-36.pyc
│           ├── data_quality.py
│           ├── load_dimension.py
│           ├── load_fact.py
│           └── stage_redshift.py
└── dag.png
/README.md:
--------------------------------------------------------------------------------
# Creating Data Pipelines with Apache Airflow to manage ETL from Amazon S3 into Amazon Redshift
For this project, I used Apache Airflow to manage a workflow of data operators, scheduled according to their dependencies on each other and represented as a DAG (Directed Acyclic Graph). The pipeline extracts data stored in S3, stages it into tables in Amazon Redshift, loads the fact and dimension tables of the data warehouse, and checks the quality of the data after each ETL cycle completes.

## Analytics Project Scenario
A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has come to the conclusion that the best tool to achieve this is Apache Airflow.

The task is to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies.

The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

## Solution
I created custom operators that subclass the Apache Airflow base operator, each performing a specific step in the ETL process delineated below.
### Steps of the data pipeline
#### Loading data from S3 to staging tables in Redshift
* Reading user activity log data stored in JSON format in S3 into a staging table in Redshift.
* Reading song metadata stored in JSON format in S3 into a staging table in Redshift.

For these tasks I created **`StageToRedshiftOperator`**, which takes as arguments the S3 location, the type of file to be read (JSON or CSV) and the name of the target table into which the raw data is to be staged. It issues a COPY command to Redshift, which supports reading JSON and CSV files directly from S3 into the designated table. The operator uses `AwsHook` to retrieve the AWS credentials set as a connection in the Airflow admin interface and `PostgresHook`, which is compatible with Redshift, to execute the commands.
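
For example, the event log staging task is declared in `airflow/dags/dag.py` like this:

```python
stage_events_to_redshift = StageToRedshiftOperator(
    task_id='Stage_events',
    dag=dag,
    redshift_conn_id="redshift",        # Airflow connection pointing to the Redshift cluster
    aws_conn_id="aws_credentials",      # Airflow connection holding the AWS access key and secret
    source_location="s3://udacity-dend/log_data",
    target_table="staging_events",
    file_type="json",
    json_path="s3://udacity-dend/log_json_path.json"  # JSONPaths file describing the log record layout
)
```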

#### Loading data from staging tables to Fact table
In our case, the `songplays` table is the fact table, as it stores the timestamps and other details of users listening to songs on the music streaming app. I created **`LoadFactOperator`**, which takes as arguments the SQL statement for inserting data and the target table name. For loading `songplays`, the SQL statement passed as an argument joins the song and user activity staging tables on matching song attributes (title, length and artist) and selects the appropriate fields to be inserted into the `songplays` fact table. The operator executes the SQL statement against the target fact table using the Redshift-compatible `PostgresHook`. Because fact tables are usually very large and are only appended with new data during scheduled ETL runs, this operator does not drop or empty (truncate) the fact table before inserting new data into it.
#### Loading data from staging tables to Dimension tables
In this scenario, `songs`, `users`, `artists` and `time` are the dimension tables. To load data from the staging tables into them, I created **`LoadDimensionOperator`**. It takes as arguments the SQL statement that performs the selection and insertion of data and the name of the target table to be loaded. As it is common practice to empty a dimension table before inserting new data into it on every ETL cycle, this operator can also first run a TRUNCATE operation on the target table. This truncation behaviour is controlled through a `should_truncate` flag passed as an argument to the operator. After the optional truncation step, it executes the SQL statement passed as an argument to insert data into the target dimension table, again using the Redshift-compatible `PostgresHook`.
#### Performing data quality checks
As the ETL process is supposed to run automatically at a regular interval, it is essential to test the quality of the data after all the ETL steps of a given cycle are completed. Data quality checks help instill data consumers' trust in the data generated by the pipeline. They also notify the maintainers of the pipeline when the data produced by an ETL run does not meet the quality expectations set by data consumers such as analytics or management teams. I created **`DataQualityOperator`**, which takes as an argument a list of pairs, each consisting of an SQL statement that returns a single row with a single column named `result`, and the expected value of that `result`. The operator runs each SQL statement as a query on Redshift using the compatible `PostgresHook` and compares the retrieved `result` value with the expected value. If the `result` of any SQL statement does not match its expected value, an exception is raised to make the data quality task fail, so that the pipeline maintainer can take note and perform the necessary corrective actions.

After creating the operators, it is essential to organise them according to their inter-task dependencies. In Airflow, this is represented as a DAG (Directed Acyclic Graph), where the tasks are nodes and a directed edge indicates that one task must complete before another can start.
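
For example, here is an abridged excerpt from `airflow/dags/dag.py` showing how the data quality task is declared and how task ordering is expressed with Airflow's `>>` operator (the full DAG also stages the song data and loads the remaining dimension tables):

```python
run_quality_checks = DataQualityOperator(
    task_id='Run_data_quality_checks',
    dag=dag,
    redshift_conn_id="redshift",
    # Each pair is (SQL statement returning a single `result` column, expected value)
    sql_stats_tests=[
        (SqlQueries.count_of_nulls_in_songs_table, 0),
        (SqlQueries.count_of_nulls_in_users_table, 0),
    ]
)

# Dependencies are wired with the `>>` (set downstream) operator
start_operator >> stage_events_to_redshift >> load_songplays_table
load_songplays_table >> load_user_dimension_table >> run_quality_checks
run_quality_checks >> end_operator
```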
In our case, the full set of task dependencies is declared in Python using the appropriate classes from the `airflow` package, and the resulting DAG can be visualized in Airflow's web interface once it has successfully parsed our DAG file. Two dummy operators are included in the DAG to mark the beginning and the end of each ETL run.

![DAG Visualization](dag.png)
*Visualization of the DAG consisting of the operations performed in the data pipeline*

### How to run
An Amazon Redshift cluster should be created with the desired configuration, and the tables defined in `airflow/create_tables.sql` should be created on it (a minimal Python sketch for this step is included at the end of this README). The Airflow webserver can then be started by executing the following command on a Linux system.

    /opt/airflow/start.sh

The admin interface of Airflow can be accessed from a browser at port 8080 of the host system. In the interface, a new connection to the Redshift cluster, with appropriate values for its attributes, can be added from the Admin > Connections menu in the top navigation bar.

The DAG written in Python becomes visible in Airflow's DAGs section. It can be turned on with the switch located to its left, after which it starts executing according to the start time, schedule interval and end time set during its instantiation in the Python code.

The status of all DAGs can be seen in the Airflow interface. An individual DAG can be inspected in the Graph view, where the dependencies between the operators, as set in the Python code, become visible.

The status and logs of each operator's past executions can be accessed from the Tree view of the DAG. If an operator shows a yellow or red square in the Tree view, it can be debugged by going through the log file generated during that specific execution to understand the cause of the error and make the necessary corrections. Upon successful execution, the square representing the operator turns dark green, and the DAG can be considered to be running successfully when all of the squares in a column, which represents one execution cycle of the workflow, have turned dark green.
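
Below is a minimal sketch of the table-creation step mentioned under "How to run" above. It assumes the `psycopg2` package is installed; the endpoint, database, user and password are placeholders to be replaced with your cluster's details.

```python
# Minimal sketch: run create_tables.sql against the Redshift cluster with psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="my-redshift-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",  # placeholder password
)
with open("airflow/create_tables.sql") as f, conn.cursor() as cur:
    cur.execute(f.read())  # the file contains only CREATE TABLE statements
conn.commit()
conn.close()
```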
--------------------------------------------------------------------------------
/airflow/create_tables.sql:
--------------------------------------------------------------------------------
CREATE TABLE public.artists (
    artist_id varchar(256) NOT NULL,
    name varchar(256),
    location varchar(256),
    latitude numeric(18,0),
    longitude numeric(18,0)
);

CREATE TABLE public.users (
    user_id int4 NOT NULL,
    first_name varchar(256),
    last_name varchar(256),
    gender varchar(256),
    "level" varchar(256),
    CONSTRAINT users_pkey PRIMARY KEY (user_id)
);

CREATE TABLE public.time (
    start_time timestamp NOT NULL,
    "hour" integer NOT NULL,
    "day" integer NOT NULL,
    "week" integer NOT NULL,
    "month" integer NOT NULL,
    "year" integer NOT NULL,
    "weekday" integer NOT NULL
);

CREATE TABLE public.songplays (
    -- auto-generated surrogate key; IDENTITY requires an integer column type in Redshift
    songplay_id int8 IDENTITY(0,1) NOT NULL,
    start_time timestamp NOT NULL,
    user_id int4 NOT NULL,
    "level" varchar(256),
    song_id varchar(256),
    artist_id varchar(256),
    session_id int4,
    location varchar(256),
    user_agent varchar(256),
    CONSTRAINT songplays_pkey PRIMARY KEY (songplay_id)
);

CREATE TABLE public.songs (
    song_id varchar(256) NOT NULL,
    title varchar(256),
    artist_id varchar(256),
    "year" int4,
    duration numeric(18,0),
    CONSTRAINT songs_pkey PRIMARY KEY (song_id)
);

CREATE TABLE public.staging_events (
    artist varchar(256),
    auth varchar(256),
    firstname varchar(256),
    gender varchar(256),
    iteminsession int4,
    lastname varchar(256),
    length numeric(18,0),
    "level" varchar(256),
    location varchar(256),
    "method" varchar(256),
    page varchar(256),
    registration numeric(18,0),
    sessionid int4,
    song varchar(256),
    status int4,
    ts int8,
    useragent varchar(256),
    userid int4
);

CREATE TABLE public.staging_songs (
    num_songs int4,
    artist_id varchar(256),
    artist_name varchar(256),
    artist_latitude numeric(18,0),
    artist_longitude numeric(18,0),
    artist_location varchar(256),
    song_id varchar(256),
    title varchar(256),
    duration numeric(18,0),
    "year" int4
);
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/dag.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/dag.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/dags.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/dags.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/udac_example_dag_backup.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/udac_example_dag_backup.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/dag.py:
--------------------------------------------------------------------------------
from datetime import datetime, timedelta
import os
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators import (StageToRedshiftOperator, LoadFactOperator,
                               LoadDimensionOperator, DataQualityOperator)
from helpers import SqlQueries

default_args = {
    'owner': 'sparkify',
    'start_date': datetime(2019, 1, 12),
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_retry': False
}

# catchup is a DAG-level argument rather than a task default_arg,
# so it is set on the DAG itself to avoid backfilling past runs.
dag = DAG('process_song_plays_data',
          default_args=default_args,
          description='Load and transform data in Redshift with Airflow',
          schedule_interval='@hourly',
          catchup=False
          )

start_operator = DummyOperator(task_id='Begin_execution', dag=dag)

stage_events_to_redshift = StageToRedshiftOperator(
    task_id='Stage_events',
    dag=dag,
    redshift_conn_id="redshift",
    aws_conn_id="aws_credentials",
    source_location="s3://udacity-dend/log_data",
    target_table="staging_events",
    file_type="json",
    json_path="s3://udacity-dend/log_json_path.json"
)

stage_songs_to_redshift = StageToRedshiftOperator(
    task_id='Stage_songs',
    dag=dag,
    redshift_conn_id="redshift",
    aws_conn_id="aws_credentials",
    source_location="s3://udacity-dend/song_data",
    target_table="staging_songs",
    file_type="json"
)

load_songplays_table = LoadFactOperator(
    task_id='Load_songplays_fact_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.songplay_table_insert,
    target_table="songplays"
)

load_user_dimension_table = LoadDimensionOperator(
    task_id='Load_user_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.user_table_insert,
    target_table="users",
    should_truncate=True
)

load_song_dimension_table = LoadDimensionOperator(
    task_id='Load_song_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.song_table_insert,
    target_table="songs",
    should_truncate=True
)

load_artist_dimension_table = LoadDimensionOperator(
    task_id='Load_artist_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.artist_table_insert,
    target_table="artists",
    should_truncate=True
)

load_time_dimension_table = LoadDimensionOperator(
    task_id='Load_time_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.time_table_insert,
    target_table="time",
    should_truncate=True
)

run_quality_checks = DataQualityOperator(
    task_id='Run_data_quality_checks',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stats_tests=[
        (SqlQueries.count_of_nulls_in_songs_table, 0),
        (SqlQueries.count_of_nulls_in_users_table, 0),
| (SqlQueries.count_of_nulls_in_artists_table, 0), 100 | (SqlQueries.count_of_nulls_in_time_table, 0), 101 | (SqlQueries.count_of_nulls_in_songplays_table, 0), 102 | ] 103 | ) 104 | 105 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 106 | 107 | start_operator >> stage_songs_to_redshift 108 | start_operator >> stage_events_to_redshift 109 | stage_songs_to_redshift >> load_songplays_table 110 | stage_events_to_redshift >> load_songplays_table 111 | load_songplays_table >> load_song_dimension_table 112 | load_songplays_table >> load_artist_dimension_table 113 | load_songplays_table >> load_time_dimension_table 114 | load_songplays_table >> load_user_dimension_table 115 | load_song_dimension_table >> run_quality_checks 116 | load_user_dimension_table >> run_quality_checks 117 | load_time_dimension_table >> run_quality_checks 118 | load_artist_dimension_table >> run_quality_checks 119 | run_quality_checks >> end_operator -------------------------------------------------------------------------------- /airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13 | operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | 3 | truncate_table = (""" 4 | TRUNCATE {} 5 | """) 6 | 7 | copy_csv_to_redshift = (""" 8 | COPY {} 9 | FROM '{}' 10 | ACCESS_KEY_ID '{}' 11 | SECRET_ACCESS_KEY '{}' 12 | 
        IGNOREHEADER 1
        DELIMITER ','
        region 'us-west-2'
    """)
    copy_json_to_redshift = ("""
        COPY {}
        FROM '{}'
        ACCESS_KEY_ID '{}'
        SECRET_ACCESS_KEY '{}'
        format as json 'auto'
        region 'us-west-2'
    """)
    copy_json_with_json_path_to_redshift = ("""
        COPY {}
        FROM '{}'
        ACCESS_KEY_ID '{}'
        SECRET_ACCESS_KEY '{}'
        json '{}'
        region 'us-west-2'
    """)
    songplay_table_insert = ("""
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
        SELECT
            events.start_time,
            events.userid as user_id,
            events.level,
            songs.song_id,
            songs.artist_id,
            events.sessionid as session_id,
            events.location,
            events.useragent as user_agent
        FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, *
              FROM staging_events
              WHERE page='NextSong' AND userid IS NOT NULL) events
        LEFT JOIN staging_songs songs
            ON events.song = songs.title
            AND events.artist = songs.artist_name
            AND events.length = songs.duration
    """)

    user_table_insert = ("""
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        SELECT distinct userid as user_id,
            firstname as first_name,
            lastname as last_name,
            gender,
            level
        FROM staging_events
        WHERE page='NextSong' AND userid IS NOT NULL
    """)

    song_table_insert = ("""
        INSERT INTO songs (song_id, title, artist_id, year, duration)
        SELECT distinct song_id,
            title,
            artist_id,
            year,
            duration
        FROM staging_songs
    """)

    artist_table_insert = ("""
        INSERT INTO artists (artist_id, name, location, latitude, longitude)
        SELECT distinct artist_id,
            artist_name as name,
            artist_location as location,
            artist_latitude as latitude,
            artist_longitude as longitude
        FROM staging_songs
    """)

    time_table_insert = ("""
        INSERT INTO time (start_time, hour, day, week, month, year, weekday)
        SELECT start_time,
            extract(hour from start_time),
            extract(day from start_time),
            extract(week from start_time),
            extract(month from start_time),
            extract(year from start_time),
            extract(dayofweek from start_time) as weekday
        FROM songplays
    """)

    # Data quality check queries: each returns a single row with a single
    # column named "result" that counts rows where a key column is NULL.
    count_of_nulls_in_songs_table = ("""
        SELECT count(*) as result
        FROM songs
        WHERE song_id IS NULL
    """)

    count_of_nulls_in_artists_table = ("""
        SELECT count(*) as result
        FROM artists
        WHERE artist_id IS NULL
    """)

    count_of_nulls_in_users_table = ("""
        SELECT count(*) as result
        FROM users
        WHERE user_id IS NULL
    """)

    count_of_nulls_in_time_table = ("""
        SELECT count(*) as result
        FROM time
        WHERE start_time IS NULL OR "hour" IS NULL OR "month" IS NULL
            OR "year" IS NULL OR "day" IS NULL OR "weekday" IS NULL
    """)

    count_of_nulls_in_songplays_table = ("""
        SELECT count(*) as result
        FROM songplays
        WHERE songplay_id IS NULL
    """)
--------------------------------------------------------------------------------
/airflow/plugins/operators/__init__.py:
--------------------------------------------------------------------------------
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.load_dimension import LoadDimensionOperator
from operators.data_quality import DataQualityOperator

__all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/data_quality.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | """ 5 | This operator is able to perform quality checks of the data in tables. 6 | It accepts list of pairs of sql statement and expected value as arguments. 7 | For each SQL statement, it executes it on Redshift and 8 | compares the retrieved result with expected value. 9 | In case of mismatch, it raises exception to indicate failure of test. 
"""
class DataQualityOperator(BaseOperator):

    ui_color = '#89DA59'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID of the Airflow connection for Redshift
    * sql_stats_tests    list of pairs of an SQL statement and the expected value to be tested for equality after the query executes
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 sql_stats_tests=[],
                 *args, **kwargs):

        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql_stats_tests = sql_stats_tests

    def execute(self, context):
        redshift_hook = PostgresHook(self.redshift_conn_id)
        for (sql_stat, expected_result) in self.sql_stats_tests:
            row = redshift_hook.get_first(sql_stat)
            # A missing row means the check query produced no result at all,
            # which is treated as a failed test rather than silently passing.
            if row is None:
                raise ValueError("Failed Test: {}\nQuery returned no result\n===================================".format(sql_stat))
            if row[0] == expected_result:
                self.log.info("Passed Test: {}\nResult == {}\n===================================".format(sql_stat, expected_result))
            else:
                raise ValueError("Failed Test: {}\nResult != {}\n===================================".format(sql_stat, expected_result))
--------------------------------------------------------------------------------
/airflow/plugins/operators/load_dimension.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load data into a Dimension table
by executing the specified SQL statement for the specified target table.
It can also optionally truncate the dimension table before inserting new data into it.
"""
class LoadDimensionOperator(BaseOperator):

    ui_color = '#80BD9E'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * should_truncate    whether to TRUNCATE the target table before inserting new data
    * sql_stat           SQL statement for loading data into the target Dimension table
    * target_table       name of the target Dimension table to load data into
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 should_truncate=True,
                 sql_stat="",
                 target_table="",
                 *args, **kwargs):

        super(LoadDimensionOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.should_truncate = should_truncate
        self.sql_stat = sql_stat
        self.target_table = target_table

    def execute(self, context):
        self.log.info("Starting to insert data into dimension table: {}".format(self.target_table))
        redshift_hook = PostgresHook(self.redshift_conn_id)
        if self.should_truncate:
            redshift_hook.run(
                SqlQueries.truncate_table.format(self.target_table)
            )
        redshift_hook.run(self.sql_stat)
        self.log.info("Done inserting data into dimension table: {}".format(self.target_table))
--------------------------------------------------------------------------------
/airflow/plugins/operators/load_fact.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load data into a Fact table
by executing the specified SQL statement for the specified target table.
"""
class LoadFactOperator(BaseOperator):

    ui_color = '#F98866'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * sql_stat           SQL statement for loading data into the target Fact table
    * target_table       name of the target Fact table to load data into
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 sql_stat="",
                 target_table="",
                 *args, **kwargs):

        super(LoadFactOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql_stat = sql_stat
        self.target_table = target_table

    def execute(self, context):
        self.log.info("Starting to load data into fact table: {}".format(self.target_table))
        redshift_hook = PostgresHook(self.redshift_conn_id)
        redshift_hook.run(self.sql_stat)
        self.log.info("Done inserting data into fact table: {}".format(self.target_table))
--------------------------------------------------------------------------------
/airflow/plugins/operators/stage_redshift.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load a given dataset in JSON or CSV format
from a specified location on S3 into a target Redshift table.
"""
class StageToRedshiftOperator(BaseOperator):
    ui_color = '#358140'


    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * aws_conn_id        AWS credentials connection ID in Airflow connections
    * source_location    location of the dataset on S3
    * target_table       name of the Redshift table into which the dataset is to be loaded
    * file_type          format of the dataset files to be read from S3.
Supported values "json" or "csv" 21 | * json_path location of file representing the schema of dataset in JSON format 22 | """ 23 | @apply_defaults 24 | def __init__(self, 25 | redshift_conn_id="redshift", 26 | aws_conn_id="aws_credentials", 27 | source_location="", 28 | target_table="sample_table", 29 | file_type="json", 30 | json_path="", 31 | *args, **kwargs): 32 | 33 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 34 | self.redshift_conn_id = redshift_conn_id 35 | self.aws_conn_id = aws_conn_id 36 | self.source_location = source_location 37 | self.target_table = target_table 38 | self.file_type = file_type 39 | self.json_path = json_path 40 | def execute(self, context): 41 | if self.file_type in ["json", "csv"]: 42 | redshift_hook = PostgresHook(self.redshift_conn_id) 43 | aws_hook = AwsHook(self.aws_conn_id) 44 | credentials = aws_hook.get_credentials() 45 | self.log.info(f'StageToRedshiftOperator will start loading files at: {self.source_location} to staging table: {self.target_table}') 46 | redshift_hook.run(SqlQueries.truncate_table.format(self.target_table)) 47 | if self.file_type == "json": 48 | if self.json_path != "": 49 | redshift_hook.run ( 50 | SqlQueries.copy_json_with_json_path_to_redshift.format ( 51 | self.target_table, 52 | self.source_location, 53 | credentials.access_key, 54 | credentials.secret_key, 55 | self.json_path 56 | ) 57 | ) 58 | else: 59 | redshift_hook.run ( 60 | SqlQueries.copy_json_to_redshift.format ( 61 | self.target_table, 62 | self.source_location, 63 | credentials.access_key, 64 | credentials.secret_key 65 | ) 66 | ) 67 | elif self.file_type == "csv": 68 | redshift_hook.run ( 69 | SqlQueries.copy_csv_to_redshift.format ( 70 | self.target_table, 71 | self.source_location, 72 | credentials.access_key, 73 | credentials.secret_key) 74 | ) 75 | 76 | self.log.info(f'StageToRedshiftOperator has completed loading files at: {self.source_location} to staging table: {self.target_table}') 77 | else: 78 | raise ValueError("file_type param must be either json or csv") 79 | 80 | 81 | -------------------------------------------------------------------------------- /dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/dag.png --------------------------------------------------------------------------------