├── README.md
├── airflow
│   ├── create_tables.sql
│   ├── dags
│   │   ├── __pycache__
│   │   │   ├── dag.cpython-36.pyc
│   │   │   ├── dags.cpython-36.pyc
│   │   │   ├── udac_example_dag.cpython-36.pyc
│   │   │   └── udac_example_dag_backup.cpython-36.pyc
│   │   └── dag.py
│   └── plugins
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-36.pyc
│       ├── helpers
│       │   ├── __init__.py
│       │   ├── __pycache__
│       │   │   ├── __init__.cpython-36.pyc
│       │   │   └── sql_queries.cpython-36.pyc
│       │   └── sql_queries.py
│       └── operators
│           ├── __init__.py
│           ├── __pycache__
│           │   ├── __init__.cpython-36.pyc
│           │   ├── data_quality.cpython-36.pyc
│           │   ├── load_dimension.cpython-36.pyc
│           │   ├── load_fact.cpython-36.pyc
│           │   └── stage_redshift.cpython-36.pyc
│           ├── data_quality.py
│           ├── load_dimension.py
│           ├── load_fact.py
│           └── stage_redshift.py
└── dag.png
/README.md:
--------------------------------------------------------------------------------
# Creating Data Pipelines with Apache Airflow to manage ETL from Amazon S3 into Amazon Redshift
For this project, I used Apache Airflow to manage a workflow of data operators, scheduled according to their dependencies on each other and represented as a DAG (Directed Acyclic Graph). The pipeline extracts data stored in S3, stages it into tables in Amazon Redshift, loads the fact and dimension tables of the data warehouse, and checks the quality of the data after each ETL cycle completes.

## Analytics Project Scenario
A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has come to the conclusion that the best tool to achieve this is Apache Airflow.

The task is to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies.

The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

## Solution
I created custom operators that subclass the Apache Airflow base operator, each performing a specific step in the ETL process delineated below.
### Steps of the data pipeline
#### Loading data from S3 to staging tables in Redshift
* Reading user activity log data stored in JSON format in S3 into a staging table in Redshift.
* Reading song metadata stored in JSON format in S3 into a staging table in Redshift.

For these tasks I created **`StageToRedshiftOperator`**, which takes as arguments the S3 location, the type of file to be read (JSON or CSV) and the name of the target table into which the raw data is to be staged. It issues a COPY command to Redshift, which supports reading JSON and CSV files directly from S3 into the designated table. The operator uses `AwsHook` to retrieve the AWS credentials set as a connection in the Airflow admin interface and `PostgresHook`, which is compatible with Redshift, to execute the commands.
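
For example, the event log staging task is declared in `airflow/dags/dag.py` like this:

```python
stage_events_to_redshift = StageToRedshiftOperator(
    task_id='Stage_events',
    dag=dag,
    redshift_conn_id="redshift",        # Airflow connection pointing to the Redshift cluster
    aws_conn_id="aws_credentials",      # Airflow connection holding the AWS access key and secret
    source_location="s3://udacity-dend/log_data",
    target_table="staging_events",
    file_type="json",
    json_path="s3://udacity-dend/log_json_path.json"  # JSONPaths file describing the log record layout
)
```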

#### Loading data from staging tables to Fact table
In our case, the `songplays` table is the fact table, as it stores the timestamps and other details of users listening to songs on the music streaming app. I created **`LoadFactOperator`**, which takes as arguments the SQL statement for inserting data and the target table name. For loading `songplays`, the SQL statement passed as an argument joins the song and user activity staging tables on matching song attributes (title, length and artist) and selects the appropriate fields to be inserted into the `songplays` fact table. The operator executes the SQL statement against the target fact table using the Redshift-compatible `PostgresHook`. Because fact tables are usually very large and are only appended with new data during scheduled ETL runs, this operator does not drop or empty (truncate) the fact table before inserting new data into it.
#### Loading data from staging tables to Dimension tables
In this scenario, `songs`, `users`, `artists` and `time` are the dimension tables. To load data from the staging tables into them, I created **`LoadDimensionOperator`**. It takes as arguments the SQL statement that performs the selection and insertion of data and the name of the target table to be loaded. As it is common practice to empty a dimension table before inserting new data into it on every ETL cycle, this operator can also first run a TRUNCATE operation on the target table. This truncation behaviour is controlled through a `should_truncate` flag passed as an argument to the operator. After the optional truncation step, it executes the SQL statement passed as an argument to insert data into the target dimension table, again using the Redshift-compatible `PostgresHook`.
#### Performing data quality checks
As the ETL process is supposed to run automatically at a regular interval, it is essential to test the quality of the data after all the ETL steps of a given cycle are completed. Data quality checks help instill data consumers' trust in the data generated by the pipeline. They also notify the maintainers of the pipeline when the data produced by an ETL run does not meet the quality expectations set by data consumers such as analytics or management teams. I created **`DataQualityOperator`**, which takes as an argument a list of pairs, each consisting of an SQL statement that returns a single row with a single column named `result`, and the expected value of that `result`. The operator runs each SQL statement as a query on Redshift using the compatible `PostgresHook` and compares the retrieved `result` value with the expected value. If the `result` of any SQL statement does not match its expected value, an exception is raised to make the data quality task fail, so that the pipeline maintainer can take note and perform the necessary corrective actions.

After creating the operators, it is essential to organise them according to their inter-task dependencies. In Airflow, this is represented as a DAG (Directed Acyclic Graph), where the tasks are nodes and a directed edge indicates that one task must complete before another can start.
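
For example, here is an abridged excerpt from `airflow/dags/dag.py` showing how the data quality task is declared and how task ordering is expressed with Airflow's `>>` operator (the full DAG also stages the song data and loads the remaining dimension tables):

```python
run_quality_checks = DataQualityOperator(
    task_id='Run_data_quality_checks',
    dag=dag,
    redshift_conn_id="redshift",
    # Each pair is (SQL statement returning a single `result` column, expected value)
    sql_stats_tests=[
        (SqlQueries.count_of_nulls_in_songs_table, 0),
        (SqlQueries.count_of_nulls_in_users_table, 0),
    ]
)

# Dependencies are wired with the `>>` (set downstream) operator
start_operator >> stage_events_to_redshift >> load_songplays_table
load_songplays_table >> load_user_dimension_table >> run_quality_checks
run_quality_checks >> end_operator
```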
In our case, the full set of task dependencies is declared in Python using the appropriate classes from the `airflow` package, and the resulting DAG can be visualized in Airflow's web interface once it has successfully parsed our DAG file. Two dummy operators are included in the DAG to mark the beginning and the end of each ETL run.

![DAG Visualization](dag.png)
*Visualization of the DAG consisting of the operations performed in the data pipeline*

### How to run
An Amazon Redshift cluster should be created with the desired configuration, and the tables defined in `airflow/create_tables.sql` should be created on it (a minimal Python sketch for this step is included at the end of this README). The Airflow webserver can then be started by executing the following command on a Linux system.

    /opt/airflow/start.sh

The admin interface of Airflow can be accessed from a browser at port 8080 of the host system. In the interface, a new connection to the Redshift cluster, with appropriate values for its attributes, can be added from the Admin > Connections menu in the top navigation bar.

The DAG written in Python becomes visible in Airflow's DAGs section. It can be turned on with the switch located to its left, after which it starts executing according to the start time, schedule interval and end time set during its instantiation in the Python code.

The status of all DAGs can be seen in the Airflow interface. An individual DAG can be inspected in the Graph view, where the dependencies between the operators, as set in the Python code, become visible.

The status and logs of each operator's past executions can be accessed from the Tree view of the DAG. If an operator shows a yellow or red square in the Tree view, it can be debugged by going through the log file generated during that specific execution to understand the cause of the error and make the necessary corrections. Upon successful execution, the square representing the operator turns dark green, and the DAG can be considered to be running successfully when all of the squares in a column, which represents one execution cycle of the workflow, have turned dark green.
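
Below is a minimal sketch of the table-creation step mentioned under "How to run" above. It assumes the `psycopg2` package is installed; the endpoint, database, user and password are placeholders to be replaced with your cluster's details.

```python
# Minimal sketch: run create_tables.sql against the Redshift cluster with psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="my-redshift-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",  # placeholder password
)
with open("airflow/create_tables.sql") as f, conn.cursor() as cur:
    cur.execute(f.read())  # the file contains only CREATE TABLE statements
conn.commit()
conn.close()
```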
--------------------------------------------------------------------------------
/airflow/create_tables.sql:
--------------------------------------------------------------------------------
CREATE TABLE public.artists (
    artist_id varchar(256) NOT NULL,
    name varchar(256),
    location varchar(256),
    latitude numeric(18,0),
    longitude numeric(18,0)
);

CREATE TABLE public.users (
    user_id int4 NOT NULL,
    first_name varchar(256),
    last_name varchar(256),
    gender varchar(256),
    "level" varchar(256),
    CONSTRAINT users_pkey PRIMARY KEY (user_id)
);

CREATE TABLE public.time (
    start_time timestamp NOT NULL,
    "hour" integer NOT NULL,
    "day" integer NOT NULL,
    "week" integer NOT NULL,
    "month" integer NOT NULL,
    "year" integer NOT NULL,
    "weekday" integer NOT NULL
);

CREATE TABLE public.songplays (
    -- auto-generated surrogate key; IDENTITY requires an integer column type in Redshift
    songplay_id int8 IDENTITY(0,1) NOT NULL,
    start_time timestamp NOT NULL,
    user_id int4 NOT NULL,
    "level" varchar(256),
    song_id varchar(256),
    artist_id varchar(256),
    session_id int4,
    location varchar(256),
    user_agent varchar(256),
    CONSTRAINT songplays_pkey PRIMARY KEY (songplay_id)
);

CREATE TABLE public.songs (
    song_id varchar(256) NOT NULL,
    title varchar(256),
    artist_id varchar(256),
    "year" int4,
    duration numeric(18,0),
    CONSTRAINT songs_pkey PRIMARY KEY (song_id)
);

CREATE TABLE public.staging_events (
    artist varchar(256),
    auth varchar(256),
    firstname varchar(256),
    gender varchar(256),
    iteminsession int4,
    lastname varchar(256),
    length numeric(18,0),
    "level" varchar(256),
    location varchar(256),
    "method" varchar(256),
    page varchar(256),
    registration numeric(18,0),
    sessionid int4,
    song varchar(256),
    status int4,
    ts int8,
    useragent varchar(256),
    userid int4
);

CREATE TABLE public.staging_songs (
    num_songs int4,
    artist_id varchar(256),
    artist_name varchar(256),
    artist_latitude numeric(18,0),
    artist_longitude numeric(18,0),
    artist_location varchar(256),
    song_id varchar(256),
    title varchar(256),
    duration numeric(18,0),
    "year" int4
);
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/dag.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/dag.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/dags.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/dags.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/udac_example_dag.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/__pycache__/udac_example_dag_backup.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/dags/__pycache__/udac_example_dag_backup.cpython-36.pyc
--------------------------------------------------------------------------------
/airflow/dags/dag.py:
--------------------------------------------------------------------------------
from datetime import datetime, timedelta
import os
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators import (StageToRedshiftOperator, LoadFactOperator,
                               LoadDimensionOperator, DataQualityOperator)
from helpers import SqlQueries

default_args = {
    'owner': 'sparkify',
    'start_date': datetime(2019, 1, 12),
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_retry': False
}

# catchup is a DAG-level argument rather than a task default_arg,
# so it is set on the DAG itself to avoid backfilling past runs.
dag = DAG('process_song_plays_data',
          default_args=default_args,
          description='Load and transform data in Redshift with Airflow',
          schedule_interval='@hourly',
          catchup=False
          )

start_operator = DummyOperator(task_id='Begin_execution', dag=dag)

stage_events_to_redshift = StageToRedshiftOperator(
    task_id='Stage_events',
    dag=dag,
    redshift_conn_id="redshift",
    aws_conn_id="aws_credentials",
    source_location="s3://udacity-dend/log_data",
    target_table="staging_events",
    file_type="json",
    json_path="s3://udacity-dend/log_json_path.json"
)

stage_songs_to_redshift = StageToRedshiftOperator(
    task_id='Stage_songs',
    dag=dag,
    redshift_conn_id="redshift",
    aws_conn_id="aws_credentials",
    source_location="s3://udacity-dend/song_data",
    target_table="staging_songs",
    file_type="json"
)

load_songplays_table = LoadFactOperator(
    task_id='Load_songplays_fact_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.songplay_table_insert,
    target_table="songplays"
)

load_user_dimension_table = LoadDimensionOperator(
    task_id='Load_user_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.user_table_insert,
    target_table="users",
    should_truncate=True
)

load_song_dimension_table = LoadDimensionOperator(
    task_id='Load_song_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.song_table_insert,
    target_table="songs",
    should_truncate=True
)

load_artist_dimension_table = LoadDimensionOperator(
    task_id='Load_artist_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.artist_table_insert,
    target_table="artists",
    should_truncate=True
)

load_time_dimension_table = LoadDimensionOperator(
    task_id='Load_time_dim_table',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stat=SqlQueries.time_table_insert,
    target_table="time",
    should_truncate=True
)

run_quality_checks = DataQualityOperator(
    task_id='Run_data_quality_checks',
    dag=dag,
    redshift_conn_id="redshift",
    sql_stats_tests=[
        (SqlQueries.count_of_nulls_in_songs_table, 0),
        (SqlQueries.count_of_nulls_in_users_table, 0),
| (SqlQueries.count_of_nulls_in_artists_table, 0), 100 | (SqlQueries.count_of_nulls_in_time_table, 0), 101 | (SqlQueries.count_of_nulls_in_songplays_table, 0), 102 | ] 103 | ) 104 | 105 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 106 | 107 | start_operator >> stage_songs_to_redshift 108 | start_operator >> stage_events_to_redshift 109 | stage_songs_to_redshift >> load_songplays_table 110 | stage_events_to_redshift >> load_songplays_table 111 | load_songplays_table >> load_song_dimension_table 112 | load_songplays_table >> load_artist_dimension_table 113 | load_songplays_table >> load_time_dimension_table 114 | load_songplays_table >> load_user_dimension_table 115 | load_song_dimension_table >> run_quality_checks 116 | load_user_dimension_table >> run_quality_checks 117 | load_time_dimension_table >> run_quality_checks 118 | load_artist_dimension_table >> run_quality_checks 119 | run_quality_checks >> end_operator -------------------------------------------------------------------------------- /airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class UdacityPlugin(AirflowPlugin): 10 | name = "udacity_plugin" 11 | operators = [ 12 | operators.StageToRedshiftOperator, 13 | operators.LoadFactOperator, 14 | operators.LoadDimensionOperator, 15 | operators.DataQualityOperator 16 | ] 17 | helpers = [ 18 | helpers.SqlQueries 19 | ] 20 | -------------------------------------------------------------------------------- /airflow/plugins/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | from helpers.sql_queries import SqlQueries 2 | 3 | __all__ = [ 4 | 'SqlQueries', 5 | ] -------------------------------------------------------------------------------- /airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/helpers/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/helpers/__pycache__/sql_queries.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/helpers/sql_queries.py: -------------------------------------------------------------------------------- 1 | class SqlQueries: 2 | 3 | truncate_table = (""" 4 | TRUNCATE {} 5 | """) 6 | 7 | copy_csv_to_redshift = (""" 8 | COPY {} 9 | FROM '{}' 10 | ACCESS_KEY_ID '{}' 11 | SECRET_ACCESS_KEY '{}' 12 | 
        IGNOREHEADER 1
        DELIMITER ','
        region 'us-west-2'
    """)
    copy_json_to_redshift = ("""
        COPY {}
        FROM '{}'
        ACCESS_KEY_ID '{}'
        SECRET_ACCESS_KEY '{}'
        format as json 'auto'
        region 'us-west-2'
    """)
    copy_json_with_json_path_to_redshift = ("""
        COPY {}
        FROM '{}'
        ACCESS_KEY_ID '{}'
        SECRET_ACCESS_KEY '{}'
        json '{}'
        region 'us-west-2'
    """)
    songplay_table_insert = ("""
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
        SELECT
            events.start_time,
            events.userid as user_id,
            events.level,
            songs.song_id,
            songs.artist_id,
            events.sessionid as session_id,
            events.location,
            events.useragent as user_agent
        FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * interval '1 second' AS start_time, *
              FROM staging_events
              WHERE page='NextSong' AND userid IS NOT NULL) events
        LEFT JOIN staging_songs songs
            ON events.song = songs.title
            AND events.artist = songs.artist_name
            AND events.length = songs.duration
    """)

    user_table_insert = ("""
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        SELECT distinct userid as user_id,
            firstname as first_name,
            lastname as last_name,
            gender,
            level
        FROM staging_events
        WHERE page='NextSong' AND userid IS NOT NULL
    """)

    song_table_insert = ("""
        INSERT INTO songs (song_id, title, artist_id, year, duration)
        SELECT distinct song_id,
            title,
            artist_id,
            year,
            duration
        FROM staging_songs
    """)

    artist_table_insert = ("""
        INSERT INTO artists (artist_id, name, location, latitude, longitude)
        SELECT distinct artist_id,
            artist_name as name,
            artist_location as location,
            artist_latitude as latitude,
            artist_longitude as longitude
        FROM staging_songs
    """)

    time_table_insert = ("""
        INSERT INTO time (start_time, hour, day, week, month, year, weekday)
        SELECT start_time,
            extract(hour from start_time),
            extract(day from start_time),
            extract(week from start_time),
            extract(month from start_time),
            extract(year from start_time),
            extract(dayofweek from start_time) as weekday
        FROM songplays
    """)

    # Data quality check queries: each returns a single row with a single
    # column named "result" that counts rows where a key column is NULL.
    count_of_nulls_in_songs_table = ("""
        SELECT count(*) as result
        FROM songs
        WHERE song_id IS NULL
    """)

    count_of_nulls_in_artists_table = ("""
        SELECT count(*) as result
        FROM artists
        WHERE artist_id IS NULL
    """)

    count_of_nulls_in_users_table = ("""
        SELECT count(*) as result
        FROM users
        WHERE user_id IS NULL
    """)

    count_of_nulls_in_time_table = ("""
        SELECT count(*) as result
        FROM time
        WHERE start_time IS NULL OR "hour" IS NULL OR "month" IS NULL
            OR "year" IS NULL OR "day" IS NULL OR "weekday" IS NULL
    """)

    count_of_nulls_in_songplays_table = ("""
        SELECT count(*) as result
        FROM songplays
        WHERE songplay_id IS NULL
    """)
--------------------------------------------------------------------------------
/airflow/plugins/operators/__init__.py:
--------------------------------------------------------------------------------
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.load_dimension import LoadDimensionOperator
from operators.data_quality import DataQualityOperator

__all__ = [ 7 | 'StageToRedshiftOperator', 8 | 'LoadFactOperator', 9 | 'LoadDimensionOperator', 10 | 'DataQualityOperator' 11 | ] 12 | -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/data_quality.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/load_dimension.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/load_fact.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/airflow/plugins/operators/__pycache__/stage_redshift.cpython-36.pyc -------------------------------------------------------------------------------- /airflow/plugins/operators/data_quality.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | """ 5 | This operator is able to perform quality checks of the data in tables. 6 | It accepts list of pairs of sql statement and expected value as arguments. 7 | For each SQL statement, it executes it on Redshift and 8 | compares the retrieved result with expected value. 9 | In case of mismatch, it raises exception to indicate failure of test. 
"""
class DataQualityOperator(BaseOperator):

    ui_color = '#89DA59'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID of the Airflow connection for Redshift
    * sql_stats_tests    list of pairs of an SQL statement and the expected value to be tested for equality after the query executes
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 sql_stats_tests=[],
                 *args, **kwargs):

        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql_stats_tests = sql_stats_tests

    def execute(self, context):
        redshift_hook = PostgresHook(self.redshift_conn_id)
        for (sql_stat, expected_result) in self.sql_stats_tests:
            row = redshift_hook.get_first(sql_stat)
            # A missing row means the check query produced no result at all,
            # which is treated as a failed test rather than silently passing.
            if row is None:
                raise ValueError("Failed Test: {}\nQuery returned no result\n===================================".format(sql_stat))
            if row[0] == expected_result:
                self.log.info("Passed Test: {}\nResult == {}\n===================================".format(sql_stat, expected_result))
            else:
                raise ValueError("Failed Test: {}\nResult != {}\n===================================".format(sql_stat, expected_result))
--------------------------------------------------------------------------------
/airflow/plugins/operators/load_dimension.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load data into a Dimension table
by executing the specified SQL statement for the specified target table.
It can also optionally truncate the dimension table before inserting new data into it.
"""
class LoadDimensionOperator(BaseOperator):

    ui_color = '#80BD9E'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * should_truncate    whether to TRUNCATE the target table before inserting new data
    * sql_stat           SQL statement for loading data into the target Dimension table
    * target_table       name of the target Dimension table to load data into
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 should_truncate=True,
                 sql_stat="",
                 target_table="",
                 *args, **kwargs):

        super(LoadDimensionOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.should_truncate = should_truncate
        self.sql_stat = sql_stat
        self.target_table = target_table

    def execute(self, context):
        self.log.info("Starting to insert data into dimension table: {}".format(self.target_table))
        redshift_hook = PostgresHook(self.redshift_conn_id)
        if self.should_truncate:
            redshift_hook.run(
                SqlQueries.truncate_table.format(self.target_table)
            )
        redshift_hook.run(self.sql_stat)
        self.log.info("Done inserting data into dimension table: {}".format(self.target_table))
--------------------------------------------------------------------------------
/airflow/plugins/operators/load_fact.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load data into a Fact table
by executing the specified SQL statement for the specified target table.
"""
class LoadFactOperator(BaseOperator):

    ui_color = '#F98866'
    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * sql_stat           SQL statement for loading data into the target Fact table
    * target_table       name of the target Fact table to load data into
    """
    @apply_defaults
    def __init__(self,
                 redshift_conn_id="redshift",
                 sql_stat="",
                 target_table="",
                 *args, **kwargs):

        super(LoadFactOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql_stat = sql_stat
        self.target_table = target_table

    def execute(self, context):
        self.log.info("Starting to load data into fact table: {}".format(self.target_table))
        redshift_hook = PostgresHook(self.redshift_conn_id)
        redshift_hook.run(self.sql_stat)
        self.log.info("Done inserting data into fact table: {}".format(self.target_table))
--------------------------------------------------------------------------------
/airflow/plugins/operators/stage_redshift.py:
--------------------------------------------------------------------------------
from airflow.hooks.postgres_hook import PostgresHook
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from helpers import SqlQueries
"""
This operator is able to load a given dataset in JSON or CSV format
from a specified location on S3 into a target Redshift table.
"""
class StageToRedshiftOperator(BaseOperator):
    ui_color = '#358140'


    """
    Inputs:
    * redshift_conn_id   Redshift connection ID in Airflow connections
    * aws_conn_id        AWS credentials connection ID in Airflow connections
    * source_location    location of the dataset on S3
    * target_table       name of the Redshift table into which the dataset is to be loaded
    * file_type          format of the dataset files to be read from S3.
Supported values "json" or "csv" 21 | * json_path location of file representing the schema of dataset in JSON format 22 | """ 23 | @apply_defaults 24 | def __init__(self, 25 | redshift_conn_id="redshift", 26 | aws_conn_id="aws_credentials", 27 | source_location="", 28 | target_table="sample_table", 29 | file_type="json", 30 | json_path="", 31 | *args, **kwargs): 32 | 33 | super(StageToRedshiftOperator, self).__init__(*args, **kwargs) 34 | self.redshift_conn_id = redshift_conn_id 35 | self.aws_conn_id = aws_conn_id 36 | self.source_location = source_location 37 | self.target_table = target_table 38 | self.file_type = file_type 39 | self.json_path = json_path 40 | def execute(self, context): 41 | if self.file_type in ["json", "csv"]: 42 | redshift_hook = PostgresHook(self.redshift_conn_id) 43 | aws_hook = AwsHook(self.aws_conn_id) 44 | credentials = aws_hook.get_credentials() 45 | self.log.info(f'StageToRedshiftOperator will start loading files at: {self.source_location} to staging table: {self.target_table}') 46 | redshift_hook.run(SqlQueries.truncate_table.format(self.target_table)) 47 | if self.file_type == "json": 48 | if self.json_path != "": 49 | redshift_hook.run ( 50 | SqlQueries.copy_json_with_json_path_to_redshift.format ( 51 | self.target_table, 52 | self.source_location, 53 | credentials.access_key, 54 | credentials.secret_key, 55 | self.json_path 56 | ) 57 | ) 58 | else: 59 | redshift_hook.run ( 60 | SqlQueries.copy_json_to_redshift.format ( 61 | self.target_table, 62 | self.source_location, 63 | credentials.access_key, 64 | credentials.secret_key 65 | ) 66 | ) 67 | elif self.file_type == "csv": 68 | redshift_hook.run ( 69 | SqlQueries.copy_csv_to_redshift.format ( 70 | self.target_table, 71 | self.source_location, 72 | credentials.access_key, 73 | credentials.secret_key) 74 | ) 75 | 76 | self.log.info(f'StageToRedshiftOperator has completed loading files at: {self.source_location} to staging table: {self.target_table}') 77 | else: 78 | raise ValueError("file_type param must be either json or csv") 79 | 80 | 81 | -------------------------------------------------------------------------------- /dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patelatharva/Data_Pipelines_with_Apache_Airflow/eef412a595e2c22f6188e48bad3f4bed64c3fe62/dag.png --------------------------------------------------------------------------------