├── .gitignore ├── README.md ├── Report ├── Report_Shravan_Kuchkula.ipynb ├── Report_Shravan_Kuchkula.md ├── Report_Shravan_Kuchkula_files │ ├── Report_Shravan_Kuchkula_11_0.png │ ├── Report_Shravan_Kuchkula_13_0.png │ ├── Report_Shravan_Kuchkula_8_0.png │ ├── airflow_tree_view.png │ ├── clicks.png │ ├── dag.png │ ├── dataframe.png │ ├── operators.png │ ├── rawdata.png │ ├── redshift.png │ └── validsearches.png └── dwh-streeteasy.cfg ├── docker-compose.yml ├── images ├── Report_Shravan_Kuchkula_11_0.png ├── Report_Shravan_Kuchkula_13_0.png ├── Report_Shravan_Kuchkula_8_0.png ├── airflow_tree_view.png ├── clicks.png ├── connections.png ├── dag.png ├── dataframe.png ├── operators.png ├── rawdata.png ├── redshift.png ├── validsearches.png └── variables.png └── street-easy ├── dags ├── create_postgres_table.py └── street_easy.py ├── plugins ├── __init__.py ├── helpers │ ├── __init__.py │ └── transforms.py └── operators │ ├── __init__.py │ ├── extract_and_transform_streeteasy.py │ └── valid_search_stats.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## ETL data pipeline to process StreetEasy data 3 | 4 | **Project Description**: 5 | 6 | An online real-estate company is interested in understanding `user engagement` by analyzing user search patterns, so that it can send targeted emails to users with valid searches. A valid search is defined as one where the search metadata contains `enabled:true` and the number of clicks is at least `3`. 7 | 8 | A daily snapshot of user search history and related data is saved to S3. Each file represents a single date, as noted by the filename: `inferred_users.20180330.csv.gz`. Each line in each file represents a *unique user*, as identified by the `id` column. 
Information on each user's searches and engagement is stored in `searches` column. An example of this is shown below: 9 | 10 | ![rawdata](images/rawdata.png) 11 | 12 | 13 | **Data Description**: The source data resides in S3 `s3://` for each day from **2018-01-20** till **2018-03-30**, as shown: 14 | ```bash 15 | s3:/// 16 | inferred_users.20180120.csv.gz 17 | inferred_users.20180121.csv.gz 18 | inferred_users.20180122.csv.gz 19 | inferred_users.20180123.csv.gz 20 | inferred_users.20180124.csv.gz 21 | .. 22 | inferred_users.20180325.csv.gz 23 | inferred_users.20180326.csv.gz 24 | inferred_users.20180327.csv.gz 25 | inferred_users.20180328.csv.gz 26 | inferred_users.20180329.csv.gz 27 | inferred_users.20180330.csv.gz 28 | ``` 29 | 30 | All this data needs to be processed using a data pipeline to answer the following **business questions:** 31 | 1. Produce a list of **unique "valid searches"**. 32 | 2. Produce, for each date, the **total number of valid searches** that existed on that date. 33 | 3. Produce, for each date, the **total number of users** who had valid searches on that date. 34 | 4. Given this data, determine which is the **most engaging search.** 35 | 5. What would the email traffic look like if the definition of a valid search is changed from **3 clicks to 2 clicks**? 36 | 6. Report any interesting **trends over the timespan** of the data available. 37 | 38 | 39 | **Data Pipeline design**: 40 | The design of the pipeline can be summarized as: 41 | - Extract data from source S3 location. 42 | - Process and Transform it using python and custom **Airflow operators**. 43 | - Load a clean dataset and intermediate artifacts to **destination S3 location**. 44 | - Calculate summary statistics and load the summary stats into **Amazon Redshift**. 
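
As one illustration of the transform step in the design above, the valid-search rule from the project description (`enabled:true` and at least `3` clicks) could be written as a small predicate. This is only a sketch: it assumes each search has already been parsed into a mapping with hypothetical `enabled` and `clicks` keys, which is not necessarily how `transforms.py` represents a search.

```python
from typing import Any, Mapping


def is_valid_search(search: Mapping[str, Any], min_clicks: int = 3) -> bool:
    """Return True when the search is enabled and has at least `min_clicks` clicks."""
    # `enabled` and `clicks` are assumed key names, not taken from the raw data.
    return search.get("enabled") is True and int(search.get("clicks", 0)) >= min_clicks


# Hypothetical, already-parsed searches for a single user.
searches = [
    {"search_id": 1, "enabled": True, "clicks": 5},
    {"search_id": 2, "enabled": True, "clicks": 2},
    {"search_id": 3, "enabled": False, "clicks": 9},
]

# Default rule: at least 3 clicks.
valid_at_3 = [s["search_id"] for s in searches if is_valid_search(s)]
# Relaxed rule from business question 5: at least 2 clicks.
valid_at_2 = [s["search_id"] for s in searches if is_valid_search(s, min_clicks=2)]
```

Keeping the click threshold as a parameter makes it straightforward to rerun the pipeline with the relaxed definition and compare the resulting counts.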
45 | 46 | > The figure shows the structure of the data pipeline as represented by an Airflow DAG 47 | ![dag](images/dag.png) 48 | 49 | Finally, I have made use of `Jupyter Notebook` to connect to the `Redshift` cluster and answer the questions of interest. 50 | 51 | **Design Goals**: 52 | As the data is stored in S3, we need a way to incrementally load each file, then process it and store that particular day's results back into S3. Doing so will allow us to perform further analysis on the cleaned dataset later. Secondly, we need a way to aggregate the data and store it in a table to facilitate time-based analysis. Keeping these two goals in mind, the following tools were chosen: 53 | - Apache Airflow will incrementally extract the data from S3, process it *in-memory*, and store the results back into a destination S3 bucket. We process it in memory because we don't want to download the file from S3 to the Airflow worker's disk, as this might fill up the disk and crash the worker process. 54 | - Amazon Redshift is a simple cloud-managed data warehouse that can be integrated into pipelines without much effort. Airflow will then read the intermediate dataset created in the first step, aggregate the data per day, and store it in a Redshift table. 55 | 56 | **Pipeline Implementation**: 57 | Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation for the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A **DAG** describes *how* you want to carry out your workflow, and **Operators** determine *what* actually gets done. 
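
The in-memory processing described under Design Goals can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the operator's actual code: it uses a locally built byte string as a stand-in for the contents of one `inferred_users.<date>.csv.gz` object, and the function name `read_gzipped_csv` is hypothetical. In the real pipeline the bytes would come from S3 rather than from `gzip.compress`.

```python
import csv
import gzip
import io
from typing import Dict, List


def read_gzipped_csv(raw: bytes) -> List[Dict[str, str]]:
    """Decompress gzip-compressed CSV bytes and parse the rows without writing to disk."""
    # Wrap the bytes in a file-like object so gzip can stream from memory.
    with gzip.open(io.BytesIO(raw), mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


# Stand-in for one daily snapshot; the column values are illustrative only.
sample = gzip.compress(b"id,searches\n1,a\n2,b\n")
rows = read_gzipped_csv(sample)
```

Because nothing is written to the worker's filesystem, the only resource that grows with file size is memory, which is the trade-off the design accepts.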
58 | 59 | By default, Airflow comes with some simple built-in operators like `PythonOperator`, `BashOperator`, and `DummyOperator`. However, Airflow also lets you extend `BaseOperator` to create custom operators. For this project, I developed two custom operators: 60 | 61 | ![operators](images/operators.png) 62 | 63 | - **StreetEasyOperator**: Extracts data from the **source S3 bucket**, processes the data in-memory by applying a series of transformations found inside `transforms.py`, then loads it to the destination S3 bucket. Please see the code here: [StreetEasyOperator](https://github.com/nn/batch-etl/blob/master/street-easy/plugins/operators/extract_and_transform_streeteasy.py) 64 | - **ValidSearchStatsOperator**: Takes data from the **destination S3 bucket**, aggregates the data on a per-day basis, and uploads it to the Redshift table `search_stats`. Please see the code here: [ValidSearchStatsOperator](https://github.com/nn/batch-etl/blob/master/street-easy/plugins/operators/valid_search_stats.py) 65 | 66 | Here's the directory organization: 67 | 68 | ```bash 69 | ├── README.md 70 | ├── Report 71 | │   ├── Report.ipynb 72 | │   └── dwh-streeteasy.cfg 73 | ├── docker-compose.yml 74 | ├── images 75 | └── street-easy 76 | ├── dags 77 | │   ├── create_postgres_table.py 78 | │   └── street_easy.py 79 | ├── plugins 80 | │   ├── __init__.py 81 | │   ├── helpers 82 | │   │   ├── __init__.py 83 | │   │   └── transforms.py 84 | │   └── operators 85 | │   ├── __init__.py 86 | │   ├── extract_and_transform_streeteasy.py 87 | │   └── valid_search_stats.py 88 | └── requirements.txt 89 | ``` 90 | 91 | 92 | **Pipeline Schedule**: Our pipeline is required to adhere to the following guidelines: 93 | * The DAG should run *daily* from `2018-01-20` to `2018-03-30`. 94 | * The DAG should not have any dependencies on past runs. 95 | * On failure, the task is retried 3 times. 96 | * Retries happen every 5 minutes. 97 | * Do not email on retry. 
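
The schedule guidelines above map fairly directly onto Airflow's DAG settings. The sketch below only builds the settings as plain Python dictionaries so it runs without Airflow installed; the `owner` value and the name `dag_kwargs` are assumptions for illustration, and the actual `street_easy.py` may be organized differently.

```python
from datetime import datetime, timedelta

# Task-level defaults corresponding to the guidelines listed above.
default_args = {
    "owner": "airflow",                   # assumed owner name
    "depends_on_past": False,             # no dependency on past runs
    "retries": 3,                         # retry a failed task 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email_on_retry": False,              # do not email on retry
}

# DAG-level settings: one run per day across the available data.
dag_kwargs = {
    "dag_id": "street_easy",
    "default_args": default_args,
    "start_date": datetime(2018, 1, 20),
    "end_date": datetime(2018, 3, 30),
    "schedule_interval": "@daily",
}

# With Airflow installed, these could be passed along as: DAG(**dag_kwargs)
```

Putting the retry behaviour in `default_args` applies it to every task in the DAG, so both custom operators inherit the same policy.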
98 | 99 | > Shown below is the data pipeline (street_easy DAG) execution starting on **2018-01-20** and ending on **2018-03-30**. 100 | ![airflow_tree_view](images/airflow_tree_view.png) 101 | > Note: The data for *2018-01-29 and 2018-01-30* is not available, thus we are skipping over that. 102 | 103 | **Destination S3 datasets and Redshift Table**: 104 | After each successful run of the DAG, two files are stored in the destination bucket: 105 | * `s3://skuchkula-etl/unique_valid_searches_.csv`: Contains a list of unique valid searches for each day. 106 | * `s3://skuchkula-etl/valid_searches_.csv`: Contains a dataset with the following fields: 107 | * user_id: Unique id of the user 108 | * num_valid_searches: Number of valid searches 109 | * avg_listings: Avg number of listings for that user 110 | * type_of_search: Did the user search for: 111 | * Only Rental 112 | * Only Sale 113 | * Both Rental and Sale 114 | * Neither 115 | * list_of_valid_searches: A list of valid searches for that user 116 | 117 | 118 | **unique_valid_searches_{date}.csv** contains unique valid searches per day: 119 | ```bash 120 | s3://skuchkula-etl/ 121 | unique_valid_searches_20180120.csv 122 | unique_valid_searches_20180121.csv 123 | unique_valid_searches_20180122.csv 124 | unique_valid_searches_20180123.csv 125 | unique_valid_searches_20180214.csv 126 | ... 127 | ``` 128 | 129 | **valid_searches_{date}.csv** contains the valid searches dataset per day: 130 | ```bash 131 | s3://skuchkula-etl/ 132 | valid_searches_20180120.csv 133 | valid_searches_20180121.csv 134 | valid_searches_20180122.csv 135 | valid_searches_20180123.csv 136 | valid_searches_20180214.csv 137 | ... 
138 | ``` 139 | **Amazon Redshift table:** 140 | 141 | The `ValidSearchStatsOperator` then takes each `valid_searches_{date}.csv` dataset, calculates summary stats, and loads the results into the **search_stats** table, as shown: 142 | 143 | ![redshift](images/redshift.png) 144 | 145 | 146 | ## Answering business questions using data 147 | 148 | ### Business question: Produce a list of all unique "valid searches" given the above requirements. 149 | The list of all unique searches is stored in the destination S3 bucket: `s3://skuchkula-etl/unique_valid_searches_{date}.csv`. An example of the output is shown here: 150 | 151 | ```bash 152 | $ head -10 unique_valid_searches_20180330.csv 153 | searches 154 | 38436711 155 | 56011095 156 | 3161333 157 | 43841677 158 | 42719934 159 | 40166212 160 | 44847718 161 | 36981443 162 | 13923552 163 | ``` 164 | The code used to calculate the unique valid searches can be found here: [transforms.py](https://github.com/nn/batch-etl/blob/16986034763616f330d27febf22c92efa007d1db/street-easy/plugins/operators/extract_and_transform_streeteasy.py#L112) 165 | 166 | We will make use of `pandas`, `psycopg2` and `matplotlib` to answer the next set of business questions with the data we gathered. 167 | 168 | 169 | ```python 170 | import pandas as pd 171 | import pandas.io.sql as sqlio 172 | import configparser 173 | import psycopg2 174 | 175 | import matplotlib.pyplot as plt 176 | plt.style.use('fivethirtyeight') 177 | ``` 178 | 179 | ### Business question: Produce, for each date, the total number of valid searches that existed on that date. 180 | To answer this we need to connect to the Redshift cluster and query the `search_stats` table. First, we obtain a connection to the Redshift cluster. The secrets are stored in the `dwh-streeteasy.cfg` file. Next, we execute the SQL query and store the result as a pandas dataframe. 
181 | 182 | 183 | ```python 184 | config = configparser.ConfigParser() 185 | config.read('dwh-streeteasy.cfg') 186 | 187 | # connect to redshift cluster 188 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 189 | cur = conn.cursor() 190 | 191 | sql_query = "SELECT * FROM search_stats" 192 | df = sqlio.read_sql_query(sql_query, conn) 193 | df['day'] = pd.to_datetime(df['day']) 194 | df = df.set_index('day') 195 | print(df.shape) 196 | df.head() 197 | ``` 198 | ``` 199 | (68, 6) 200 | ``` 201 | ![dataframe](images/dataframe.png) 202 | 203 | For this question, we are interested in the **total number of valid searches** on a given day. This is captured in the `num_searches` column. Shown below is a plot of `num_searches` per day for the entire time period. 204 | 205 | 206 | ```python 207 | ax = df['num_searches'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--') 208 | ax.set_xlabel('Date', fontsize=16) 209 | ax.set_ylabel('Valid Searches', fontsize=16) 210 | ax.set_title('Total number of valid searches on each day') 211 | ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3) 212 | plt.show() 213 | ``` 214 | 215 | 216 | ![png](images/Report__8_0.png) 217 | 218 | 219 | **Observation**: The red band indicates a sharp drop in the number of valid searches on `2018-03-24`. 220 | 221 | ### Business question: Produce, for each date, the total number of users who had valid searches on that date. 222 | The **total number of users with valid searches per day** is captured in the `num_users` column of the dataframe. A similar trend can be observed for `num_users`, as indicated by the red band. 
223 | 224 | 225 | ```python 226 | ax = df['num_users'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--') 227 | ax.set_xlabel('Date', fontsize=16) 228 | ax.set_ylabel('Number of users', fontsize=16) 229 | ax.set_title('Total number of users on each day') 230 | ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3) 231 | plt.show() 232 | ``` 233 | 234 | 235 | ![png](images/Report__11_0.png) 236 | 237 | 238 | ### Business question: Most engaging search 239 | From the data that is available, it appears that `Rental` searches are the most engaging ones. I am assuming that the number of valid searches is a good indicator to gauge user engagement. The plot below shows that `Rental` searches consistently produce more valid searches than `Sale` searches. 240 | 241 | 242 | ```python 243 | ax = df[['num_rental_searches', 244 | 'num_sales_searches', 245 | 'num_rental_and_sales_searches', 246 | 'num_none_type_searches']].plot(figsize=(12, 8), fontsize=12, linewidth=2, linestyle='--') 247 | ax.set_xlabel('Date', fontsize=16) 248 | ax.set_ylabel('Valid Searches', fontsize=16) 249 | ax.set_title('Types of searches every day') 250 | ax.legend(fontsize=10) 251 | plt.show() 252 | ``` 253 | 254 | 255 | ![png](images/Report__13_0.png) 256 | 257 | 258 | ### Business question: What would the email traffic look like if we changed the definition of a valid search from 3 clicks to 2? 259 | When the definition of a valid search is changed from `clicks >= 3` to `clicks >= 2`, the number of valid searches and the corresponding stats increase. Shown below is a comparison for the first 3 days: 260 | 261 | ![clicks](images/clicks.png) 262 | 263 | This means that the **email traffic would increase**. 264 | 265 | ### Business question: Report any interesting trends over the timespan of the data available. 
266 | Two main trends are observed in this time series data: 267 | - The first is a steady increase in the number of searches made and also in the number of users. The stats for individual search types show that Rental searches are growing faster than Sales searches. 268 | - The second is a sharp dip in the number of searches and users on 2018-03-23, which would be worth investigating. 269 | 270 | ## Recommendations 271 | 272 | ### Recommendations in data storage: 273 | In terms of storing data, using CSV files comes with some problems down the line. Here are some difficulties with CSV files: 274 | - No defined schema: No data types are included, and no column names beyond a header row. 275 | - Nested data requires special handling. 276 | In addition to these issues with the CSV file format, **Spark** has some **specific problems** when working with CSV data: 277 | - CSV files are quite **slow to import** and parse. 278 | - The files cannot be shared between workers during the import process. 279 | - If no schema is defined, then all data must be read before a schema can be inferred. 280 | - Spark has a feature known as **predicate pushdown**, which is the idea of ordering tasks to do the least amount of work. For example, *filtering* data prior to processing is one of the primary optimizations of predicate pushdown; it drastically reduces the amount of information that must be processed in large data sets. Unfortunately, we cannot filter CSV data via predicate pushdown. 281 | - Finally, Spark processes are often multi-step and may utilize an intermediate file representation. These representations allow data to be used later without regenerating the data from source. 282 | 283 | Instead of using CSV, use the **parquet file format** when possible. 
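
The "no defined schema" point listed above is easy to reproduce with Python's standard `csv` module. The sketch below writes one row using values from the `search_stats` sample and reads it back; it is only an illustration of the CSV limitation, not part of the pipeline.

```python
import csv
import io

# Write a header and one row; num_searches goes in as an integer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["day", "num_searches"])
writer.writerow(["2018-01-20", 224487])

# Read the same data back: the type information is gone.
buffer.seek(0)
row = next(csv.DictReader(buffer))
restored_type = type(row["num_searches"]).__name__  # every field comes back as text
```

Every consumer of the file has to re-declare or re-infer the column types, which is the overhead a self-describing format avoids.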
284 | 285 | **Parquet Format**: Parquet is a compressed columnar data format and is structured with data accessible in chunks that allows efficient read/write operations without processing the entire file. This structured format supports Spark's predicate pushdown functionality, thus providing significant performance improvement. Finally, parquet files automatically include schema information and handle data encoding. This is perfect for intermediary or on-disk representation of processed data. Note that parquet files are binary file format and can only be used with proper tools. 286 | 287 | ### Recommendations for downstream processing: 288 | The search field coming through from the application appears to be `YAML` format. I found that writing regular expression to parse out the search field is prone to errors if the schema evolves. A better way to capture the search field is using JSON or AVRO, as this has some form of schema tied to it, so that downstream applications can know when the schema evolves. 289 | 290 | ## How to run this project? 291 | **pre-requisites**: 292 | - Docker and docker-compose must be running on your laptop. 293 | - You have credentials for source and destination S3 buckets. (Both are private buckets) 294 | - You need to have AWS Redshift cluster endpoint. [guide to create Redshift cluster using IaC](https://nn.github.io/create-aws-redshift-cluster/) 295 | 296 | **Step 1:** Once the requirements are met, launch Airflow on your laptop by running: `docker-compose up` from the location where `docker-compose.yml` is located. 297 | ```bash 298 | : batch-etl$ docker-compose up 299 | Creating network "batch-etl_default" with the default driver 300 | Creating batch-etl_postgres_1 ... done 301 | Creating batch-etl_webserver_1 ... 
done 302 | 303 | webserver_1 | ____________ _____________ 304 | webserver_1 | ____ |__( )_________ __/__ /________ __ 305 | webserver_1 | ____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / / 306 | webserver_1 | ___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ / 307 | webserver_1 | _/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/ 308 | ``` 309 | Inside `docker-compose.yml` we have the **volumes** section, which maps our dags directory to Airflow's dag-bag: `/usr/local/airflow/dags`. Next, we map the custom Airflow plugin, which extends Airflow with the two custom operators, to Airflow's plugin directory. Lastly, both operators make use of the `s3fs` Python package, which is essentially a wrapper around the `boto3` package but provides a simpler interface. Add `s3fs` to `requirements.txt` and map that file to `/requirements.txt`. We map it this way because the entrypoint docker script runs `pip install -r requirements.txt` from `/` within the docker container. 310 | 311 | ``` 312 | volumes: 313 | - ./street-easy/dags:/usr/local/airflow/dags 314 | # Uncomment to include custom plugins 315 | - ./street-easy/plugins:/usr/local/airflow/plugins 316 | # Additional python packages used inside airflow operators 317 | - ./street-easy/requirements.txt:/requirements.txt 318 | ``` 319 | 320 | **Step 2:** Configure Airflow Variables 321 | Log in to the Airflow Console: http://localhost:8080/admin , and create two `Variables`. Our code uses these variables to reference the source and destination buckets. 
322 | ![variables](images/variables.png) 323 | 324 | Next, create the following connections: 325 | - *aws_credentials*: (Type: Amazon Web Services, Login:, Password:) 326 | - *aws_dest_credentials*: (Type: Amazon Web Services, Login:, Password:) 327 | - *redshift*: Shown below is the configuration 328 | ![connections](images/connections.png) 329 | 330 | **Step 3**: There are two dags in our dag-bag: `create_postgres_table` and `street_easy`. The first is used to create a table in Redshift. Turn on the `create_postgres_table` DAG and trigger it manually. Once the dag finishes running, it will create the tables in Redshift. After that, turn on the `street_easy` dag. This will trigger the execution automatically since the start date is in the past. 331 | 332 | **Step 4**: Launch the jupyter notebook provided here: [notebook](https://github.com/nn/batch-etl/blob/16986034763616f330d27febf22c92efa007d1db/Report/Report.ipynb) . Navigate to "Answering Business questions using data" section. Run the code cells. 333 | -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## ETL data pipeline to process StreetEasy data\n", 8 | "### Author: Shravan Kuchkula (email: shravan.kuchkula@gmail.com)\n", 9 | "**Project Description**: \n", 10 | "\n", 11 | "An online real-estate company is interested in understanding `user enagagement` by analyzing user search patterns to send targeted emails to the users with valid searches. A valid search is termed as one where the search metadata contains `enabled:true` and number of clicks is atleast `3`.\n", 12 | "\n", 13 | " A daily snapshot of user search history and related data is saved to S3. Each file represents a single date, as noted by the filename: `inferred_users.20180330.csv.gz`. 
Each line in each file represents a *unique user*, as identified by `id` column. Information on each user's searches and engagement is stored in `searches` column. An example of this is shown below:\n", 14 | "\n", 15 | "![rawdata](streeteasy-images/rawdata.png)\n", 16 | " \n", 17 | "\n", 18 | "**Data Description**: The source data resides in S3 `s3://` for each day from **2018-01-20** till **2018-03-30**, as shown:\n", 19 | "```bash\n", 20 | "s3:///\n", 21 | "inferred_users.20180120.csv.gz\n", 22 | "inferred_users.20180121.csv.gz\n", 23 | "inferred_users.20180122.csv.gz\n", 24 | "inferred_users.20180123.csv.gz\n", 25 | "inferred_users.20180124.csv.gz\n", 26 | "..\n", 27 | "inferred_users.20180325.csv.gz\n", 28 | "inferred_users.20180326.csv.gz\n", 29 | "inferred_users.20180327.csv.gz\n", 30 | "inferred_users.20180328.csv.gz\n", 31 | "inferred_users.20180329.csv.gz\n", 32 | "inferred_users.20180330.csv.gz\n", 33 | "```\n", 34 | "\n", 35 | "All this data needs to be processed using a data pipeline to answer the following **business questions:**\n", 36 | "1. Produce a list of **unique \"valid searches\"**. \n", 37 | "2. Produce, for each date, the **total number of valid searches** that existed on that date.\n", 38 | "3. Produce, for each date, the **total number of users** who had valid searches on that date.\n", 39 | "4. Given this data, determine which is the **most engaging search.**\n", 40 | "5. What would the email traffic look like if the definition of a valid search is changed from **3 clicks to 2 clicks**?\n", 41 | "6. 
Report any interesting **trends over the timespan** of the data available.\n", 42 | "\n", 43 | "\n", 44 | "**Data Pipeline design**:\n", 45 | "The design of the pipeline can be summarized as:\n", 46 | "- Extract data from source S3 location.\n", 47 | "- Process and Transform it using python and custom **Airflow operators**.\n", 48 | "- Load a clean dataset and intermediate artifacts to **destination S3 location**.\n", 49 | "- Calculate summary statistics and load the summary stats into **Amazon Redshift**.\n", 50 | "\n", 51 | "> Figure showns the structure of the data pipeline as represented by a Airflow DAG\n", 52 | "![dag](streeteasy-images/dag.png)\n", 53 | "\n", 54 | "Finally, I have made use of `Jupyter Notebook` to connect to the `Redshift` cluster and answer the questions of interest.\n", 55 | "\n", 56 | "**Design Goals**:\n", 57 | "As the data is stored in S3, we need a way to incrementally load each file, then process it and store that particular day's results back into S3. Doing so will allow us to perform further analysis later-on, on the cleaned dataset. Secondly, we need a way to aggregate the data and store it in a table to facilitate time-based analysis. Keeping these two goals in mind, the following tools were chosen:\n", 58 | "- Apache Airflow will incrementally extract the data from S3 and process it *in-memory* and store the results back into a destination S3 bucket. The reason we need to process this in-memory is because, we don't want to download the file from S3 to airflow worker's disk, as this might fill-up the worker's disk and crash the worker process.\n", 59 | "- Amazon Redshift is a simple cloud-managed data warehouse that can be integrated into pipelines without much effort. 
Airflow will then read the intermediate dataset created in the first step and aggregate the data per day and store it into a Redshift table.\n", 60 | "\n", 61 | "**Pipeline Implementation**:\n", 62 | "Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation for the movement and transformation of data. A DAG simply is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A **DAG** describes *how* you want to carry out your workflow, and **Operators** determine *what* actually gets done.\n", 63 | "\n", 64 | "By default, airflow comes with some simple built-in operators like `PythonOperator`, `BashOperator`, `DummyOperator` etc., however, airflow lets you extend the features of a `BaseOperator` and create custom operators. For this project, I developed two custom operators:\n", 65 | "\n", 66 | "![operators](streeteasy-images/operators.png)\n", 67 | "\n", 68 | "- **StreetEasyOperator**: Extract data from **source S3 bucket**, processes the data in-memory by applying a series of transformations found inside `transforms.py`, then loads it to destination S3 bucket. Please see the code here: \n", 69 | "- **ValidSearchStatsOperator**: Takes data from **destination S3 bucket**, aggregates the data on a per-day basis, and uploads it to Reshift table `search_stats`. 
Please see the code here:\n", 70 | "\n", 71 | "Here's the directory organization:\n", 72 | "\n", 73 | "```bash\n", 74 | "├── README.md\n", 75 | "├── docker-compose.yml\n", 76 | "└── street-easy\n", 77 | " ├── dags\n", 78 | " │   ├── create_postgres_table.py\n", 79 | " │   └── street_easy.py\n", 80 | " ├── plugins\n", 81 | " │   ├── __init__.py\n", 82 | " │   ├── helpers\n", 83 | " │   │   ├── __init__.py\n", 84 | " │   │   └── transforms.py\n", 85 | " │   └── operators\n", 86 | " │   ├── __init__.py\n", 87 | " │   ├── extract_and_transform_streeteasy.py\n", 88 | " │   └── valid_search_stats.py\n", 89 | " └── requirements.txt\n", 90 | "```\n", 91 | "\n", 92 | "\n", 93 | "**Pipeline Schedule**: Our pipeline is required to adhere to the following guidelines:\n", 94 | "* The DAG should run *daily* from `2018-01-20` to `2018-03-30`\n", 95 | "* The DAG should not have any dependencies on past runs.\n", 96 | "* On failure, the task is retried for 3 times.\n", 97 | "* Retries happen every 5 minutes.\n", 98 | "* Do not email on retry.\n", 99 | "\n", 100 | "> Shown below is the data pipeline (street_easy DAG) execution starting on **2018-01-20** and ending on **2018-03-30**.\n", 101 | "![airflow_tree_view](streeteasy-images/airflow_tree_view.png)\n", 102 | "> Note: The data for *2018-01-29 and 2018-01-30* is not available, thus we are skipping over that.\n", 103 | "\n", 104 | "**Destination S3 datasets and Redshift Table**:\n", 105 | "After each successful run of the DAG, two files are stored in the destination bucket: \n", 106 | "* `s3://skuchkula-etl/unique_valid_searches_.csv`: Contains a list of unique valid searches for each day.\n", 107 | "* `s3://skuchkula-etl/valid_searches_.csv`: Contains a dataset with the following fields:\n", 108 | " * user_id: Unique id of the user\n", 109 | " * num_valid_searches: Number of valid searches\n", 110 | " * avg_listings: Avg number of listings for that user\n", 111 | " * type_of_search: Did the user search for:\n", 112 | " * Only 
Rental\n", 113 | " * Only Sale\n", 114 | " * Both Rental and Sale\n", 115 | " * Neither\n", 116 | " * list_of_valid_searches: A list of valid searches for that user\n", 117 | "\n", 118 | "\n", 119 | "**unique_valid_searches_{date}.csv** contains unique valid searches per day:\n", 120 | "```bash\n", 121 | "s3://skuchkula-etl/\n", 122 | "unique_valid_searches_20180120.csv\n", 123 | "unique_valid_searches_20180121.csv\n", 124 | "unique_valid_searches_20180122.csv\n", 125 | "unique_valid_searches_20180123.csv\n", 126 | "unique_valid_searches_20180214.csv\n", 127 | "...\n", 128 | "```\n", 129 | "\n", 130 | "**valid_searches_{date}.csv** contains the valid searches dataset per day:\n", 131 | "```bash\n", 132 | "s3://skuchkula-etl/\n", 133 | "valid_searches_20180120.csv\n", 134 | "valid_searches_20180121.csv\n", 135 | "valid_searches_20180122.csv\n", 136 | "valid_searches_20180123.csv\n", 137 | "valid_searches_20180214.csv\n", 138 | "...\n", 139 | "```\n", 140 | "**Amazon Redshift table:**\n", 141 | "\n", 142 | "The `ValidSearchesStatsOperator` then takes each of datasets `valid_searches_{date}.csv` and calcuates summary stats and loads the results to **search_stats** table, as shown:\n", 143 | "\n", 144 | "![redshift](streeteasy-images/redshift.png)\n" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "## Answering business questions using data" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Business question: Produce a list of all unique \"valid searches\" given the above requirements.\n", 159 | "The list of all unique searches is stored in the destination S3 bucket: `s3://skuchkula-etl/unique_valid_searches_{date}.csv`. 
An example of the output is shown here\n", 160 | "\n", 161 | "```bash\n", 162 | "$ head -10 unique_valid_searches_20180330.csv\n", 163 | "searches\n", 164 | "38436711\n", 165 | "56011095\n", 166 | "3161333\n", 167 | "43841677\n", 168 | "42719934\n", 169 | "40166212\n", 170 | "44847718\n", 171 | "36981443\n", 172 | "13923552\n", 173 | "```\n", 174 | "The code used to calculate the unique valid searches can be found here: TODO" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "We will be making using of `pandas`, `psycopg2` and `matplotlib` to use the data we gathered to answer the next set of business questions." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 13, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "import pandas as pd\n", 191 | "import pandas.io.sql as sqlio\n", 192 | "import configparser\n", 193 | "import psycopg2\n", 194 | "\n", 195 | "import matplotlib.pyplot as plt\n", 196 | "plt.style.use('fivethirtyeight')" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### Business question: Produce, for each date, the total number of valid searches that existed on that date.\n", 204 | "To answer this we need to connect to the Redshift cluster and query the `search_stats` table. First, we obtain a connection to Redshift cluster. The secrets are stored in the `dwh-streeteasy.cfg` file. Next, we execute the SQL query and store the result as a pandas dataframe." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 2, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "name": "stdout", 214 | "output_type": "stream", 215 | "text": [ 216 | "(68, 6)\n" 217 | ] 218 | }, 219 | { 220 | "data": { 221 | "text/html": [ 222 | "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>num_searches</th><th>num_users</th><th>num_rental_searches</th><th>num_sales_searches</th><th>num_rental_and_sales_searches</th><th>num_none_type_searches</th></tr>\n",
"    <tr><th>day</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2018-01-20</th><td>224487</td><td>128544</td><td>91839</td><td>15592</td><td>2840</td><td>18273</td></tr>\n",
"    <tr><th>2018-01-21</th><td>224945</td><td>128799</td><td>91805</td><td>15541</td><td>2852</td><td>18601</td></tr>\n",
"    <tr><th>2018-01-22</th><td>225577</td><td>129167</td><td>91749</td><td>15487</td><td>2836</td><td>19095</td></tr>\n",
"    <tr><th>2018-01-23</th><td>226306</td><td>129504</td><td>91775</td><td>15531</td><td>2842</td><td>19356</td></tr>\n",
"    <tr><th>2018-01-24</th><td>226962</td><td>129838</td><td>91795</td><td>15560</td><td>2848</td><td>19635</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" 306 | ], 307 | "text/plain": [ 308 | " num_searches num_users num_rental_searches num_sales_searches \\\n", 309 | "day \n", 310 | "2018-01-20 224487 128544 91839 15592 \n", 311 | "2018-01-21 224945 128799 91805 15541 \n", 312 | "2018-01-22 225577 129167 91749 15487 \n", 313 | "2018-01-23 226306 129504 91775 15531 \n", 314 | "2018-01-24 226962 129838 91795 15560 \n", 315 | "\n", 316 | " num_rental_and_sales_searches num_none_type_searches \n", 317 | "day \n", 318 | "2018-01-20 2840 18273 \n", 319 | "2018-01-21 2852 18601 \n", 320 | "2018-01-22 2836 19095 \n", 321 | "2018-01-23 2842 19356 \n", 322 | "2018-01-24 2848 19635 " 323 | ] 324 | }, 325 | "execution_count": 2, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "config = configparser.ConfigParser()\n", 332 | "config.read('dwh-streeteasy.cfg')\n", 333 | "\n", 334 | "# connect to redshift cluster\n", 335 | "conn = psycopg2.connect(\"host={} dbname={} user={} password={} port={}\".format(*config['CLUSTER'].values()))\n", 336 | "cur = conn.cursor()\n", 337 | "\n", 338 | "sql_query = \"SELECT * FROM search_stats\"\n", 339 | "df = sqlio.read_sql_query(sql_query, conn)\n", 340 | "df['day'] = pd.to_datetime(df['day'])\n", 341 | "df = df.set_index('day')\n", 342 | "print(df.shape)\n", 343 | "df.head()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "From this dataframe, for this question, we are interested in finding out the **total number of valid searches** on a given day. This is captured in the `num_searches` column. Shown below is a plot showing the num_searches per day for the entire time-period." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 4, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "image/png": "\n", 361 | "text/plain": [ 362 | "
" 363 | ] 364 | }, 365 | "metadata": {}, 366 | "output_type": "display_data" 367 | } 368 | ], 369 | "source": [ 370 | "ax = df['num_searches'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--')\n", 371 | "ax.set_xlabel('Date', fontsize=16)\n", 372 | "ax.set_ylabel('Valid Searches', fontsize=16)\n", 373 | "ax.set_title('Total number of valid searches on each day')\n", 374 | "ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3)\n", 375 | "plt.show()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "**Observation**: The red band indicates a sharp drop in the number of valid searches on `2018-03-24`." 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### Business Question: Produce, for each date, the total number of valid searches that existed on that date.\n", 390 | "The **total number of users with valid searches per day** is captured in the `num_users` column of the dataframe. A similar trend can be observed for the num_users indicated by the red band." 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 5, 396 | "metadata": {}, 397 | "outputs": [ 398 | { 399 | "data": { 400 | "image/png": "\n", 401 | "text/plain": [ 402 | "
" 403 | ] 404 | }, 405 | "metadata": {}, 406 | "output_type": "display_data" 407 | } 408 | ], 409 | "source": [ 410 | "ax = df['num_users'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--')\n", 411 | "ax.set_xlabel('Date', fontsize=16)\n", 412 | "ax.set_ylabel('Number of users', fontsize=16)\n", 413 | "ax.set_title('Total number of users on each day')\n", 414 | "ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3)\n", 415 | "plt.show()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### Business question: Most engaging search\n", 423 | "From the data that is available, it appears that `Rental` searches are the most engaging ones. I am assuming that the number of valid searches is a good indicator to guage user engagement. It is evident from the below plot, that Rental Searches are consistently producing more valid searches than `Sale` type searches." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 10, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "data": { 433 | "image/png": "\n", 434 | "text/plain": [ 435 | "
" 436 | ] 437 | }, 438 | "metadata": {}, 439 | "output_type": "display_data" 440 | } 441 | ], 442 | "source": [ 443 | "ax = df[['num_rental_searches', \n", 444 | " 'num_sales_searches', \n", 445 | " 'num_rental_and_sales_searches',\n", 446 | " 'num_none_type_searches']].plot(figsize=(12, 8), fontsize=12, linewidth=2, linestyle='--')\n", 447 | "ax.set_xlabel('Date', fontsize=16)\n", 448 | "ax.set_ylabel('Valid Searches', fontsize=16)\n", 449 | "ax.set_title('Types of searches every day')\n", 450 | "ax.legend(fontsize=10)\n", 451 | "plt.show()" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "### Business question: What would the email traffic look like if we changed the definition of a valid search from 3 clicks to 2?\n", 459 | "When the defintion of valid search is changed from `clicks >= 3` to `clicks >= 2` the number of searches and its corresponding stats increase in size. Shown below is a comparison for the first 3 days:\n", 460 | "\n", 461 | "![clicks](streeteasy-images/clicks.png)\n", 462 | "\n", 463 | "This means that the **email traffic would increase**." 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "### Business question: Report any interesting trends over the timespan of the data available.\n", 471 | "Mainly there are two trends observed with this timeseries data:\n", 472 | "- One is that there is a steady increase in the number of searches made and also in the number of users. The stats corresponding to individual search type shows that Rental searches are growing faster than Sales searches. \n", 473 | "- Second interesting thing that I found was a sharp dip in the number of searches and users on 2018-03-23, which could be something interesting to investigate." 
 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "## Recommendations\n", 481 | "\n", 482 | "### Recommendations in data storage: \n", 483 | "In terms of storing data, using CSV files comes with some problems down the line. Here are some difficulties with CSV files:\n", 484 | "- No defined schema: no data types are included, and no column names beyond a header row.\n", 485 | "- Nested data requires special handling.\n", 486 | "\n", 487 | "In addition to these issues with the CSV file format, **Spark** has some **specific problems** when working with CSV data:\n", 488 | "- CSV files are quite **slow to import** and parse.\n", 489 | "- The files cannot be shared between workers during the import process.\n", 490 | "- If no schema is defined, then all data must be read before a schema can be inferred.\n", 491 | "- Spark has a feature known as **predicate pushdown**, which orders tasks so that the least amount of work is done. For example, *filtering* data prior to processing is one of the primary optimizations of predicate pushdown; this drastically reduces the amount of information that must be processed in large data sets. Unfortunately, we cannot filter the CSV data via predicate pushdown.\n", 492 | "- Finally, Spark processes are often multi-step and may utilize an intermediate file representation. These representations allow data to be used later without regenerating the data from source.\n", 493 | "\n", 494 | "Instead of using CSV, use the **parquet file format** when possible.\n", 495 | "\n", 496 | "**Parquet Format**: Parquet is a compressed columnar data format, structured so that data is accessible in chunks, which allows efficient read/write operations without processing the entire file. This structured format supports Spark's predicate pushdown functionality, thus providing a significant performance improvement. Finally, parquet files automatically include schema information and handle data encoding. 
This is perfect for intermediary or on-disk representation of processed data. Note that parquet is a binary file format and can only be read with appropriate tools.\n", 497 | "\n", 498 | "### Recommendations for downstream processing:\n", 499 | "The search field coming through from the application appears to be in `YAML` format. I found that writing regular expressions to parse out the search field is prone to errors if the schema evolves. A better way to capture the search field is using JSON or Avro, as these have some form of schema tied to them, so that downstream applications can know when the schema evolves." 500 | ] 501 | } 502 | ], 503 | "metadata": { 504 | "kernelspec": { 505 | "display_name": "Python 3", 506 | "language": "python", 507 | "name": "python3" 508 | }, 509 | "language_info": { 510 | "codemirror_mode": { 511 | "name": "ipython", 512 | "version": 3 513 | }, 514 | "file_extension": ".py", 515 | "mimetype": "text/x-python", 516 | "name": "python", 517 | "nbconvert_exporter": "python", 518 | "pygments_lexer": "ipython3", 519 | "version": "3.6.6" 520 | } 521 | }, 522 | "nbformat": 4, 523 | "nbformat_minor": 2 524 | } 525 | -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula.md: -------------------------------------------------------------------------------- 1 | 2 | ## ETL data pipeline to process StreetEasy data 3 | ### Author: Shravan Kuchkula (email: shravan.kuchkula@gmail.com) 4 | **Project Description**: 5 | 6 | An online real-estate company is interested in understanding `user engagement` by analyzing user search patterns to send targeted emails to the users with valid searches. A valid search is defined as one where the search metadata contains `enabled:true` and the number of clicks is at least `3`. 7 | 8 | A daily snapshot of user search history and related data is saved to S3. Each file represents a single date, as noted by the filename: `inferred_users.20180330.csv.gz`. 
Each line in each file represents a *unique user*, as identified by `id` column. Information on each user's searches and engagement is stored in `searches` column. An example of this is shown below: 9 | 10 | ![rawdata](Report_Shravan_Kuchkula_files/rawdata.png) 11 | 12 | 13 | **Data Description**: The source data resides in S3 `s3://` for each day from **2018-01-20** till **2018-03-30**, as shown: 14 | ```bash 15 | s3:/// 16 | inferred_users.20180120.csv.gz 17 | inferred_users.20180121.csv.gz 18 | inferred_users.20180122.csv.gz 19 | inferred_users.20180123.csv.gz 20 | inferred_users.20180124.csv.gz 21 | .. 22 | inferred_users.20180325.csv.gz 23 | inferred_users.20180326.csv.gz 24 | inferred_users.20180327.csv.gz 25 | inferred_users.20180328.csv.gz 26 | inferred_users.20180329.csv.gz 27 | inferred_users.20180330.csv.gz 28 | ``` 29 | 30 | All this data needs to be processed using a data pipeline to answer the following **business questions:** 31 | 1. Produce a list of **unique "valid searches"**. 32 | 2. Produce, for each date, the **total number of valid searches** that existed on that date. 33 | 3. Produce, for each date, the **total number of users** who had valid searches on that date. 34 | 4. Given this data, determine which is the **most engaging search.** 35 | 5. What would the email traffic look like if the definition of a valid search is changed from **3 clicks to 2 clicks**? 36 | 6. Report any interesting **trends over the timespan** of the data available. 37 | 38 | 39 | **Data Pipeline design**: 40 | The design of the pipeline can be summarized as: 41 | - Extract data from source S3 location. 42 | - Process and Transform it using python and custom **Airflow operators**. 43 | - Load a clean dataset and intermediate artifacts to **destination S3 location**. 44 | - Calculate summary statistics and load the summary stats into **Amazon Redshift**. 
 45 | 46 | > Figure shows the structure of the data pipeline as represented by an Airflow DAG 47 | ![dag](Report_Shravan_Kuchkula_files/dag.png) 48 | 49 | Finally, I have made use of `Jupyter Notebook` to connect to the `Redshift` cluster and answer the questions of interest. 50 | 51 | **Design Goals**: 52 | As the data is stored in S3, we need a way to incrementally load each file, then process it and store that particular day's results back into S3. Doing so will allow us to perform further analysis on the cleaned dataset later. Secondly, we need a way to aggregate the data and store it in a table to facilitate time-based analysis. Keeping these two goals in mind, the following tools were chosen: 53 | - Apache Airflow will incrementally extract the data from S3, process it *in-memory*, and store the results back into a destination S3 bucket. We process the data in-memory because we don't want to download the file from S3 to the Airflow worker's disk, as this might fill up the worker's disk and crash the worker process. 54 | - Amazon Redshift is a simple cloud-managed data warehouse that can be integrated into pipelines without much effort. Airflow will then read the intermediate dataset created in the first step, aggregate the data per day, and store it in a Redshift table. 55 | 56 | **Pipeline Implementation**: 57 | Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation for the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A **DAG** describes *how* you want to carry out your workflow, and **Operators** determine *what* actually gets done. 
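
The in-memory requirement described under Design Goals can be sketched with the standard library alone. This is an illustration rather than the project's operator code: it assumes the gzipped object has already been fetched as `bytes` (for example through an S3 client), and the helper name `read_gzipped_csv` and the sample values are hypothetical. Only the `id` and `searches` column names come from the data description.

```python
import csv
import gzip
import io


def read_gzipped_csv(payload: bytes) -> list:
    """Decompress a gzipped CSV held in memory and return its rows as dicts."""
    # Wrap the raw bytes in a file-like object so nothing is written to local disk.
    with gzip.open(io.BytesIO(payload), mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


# Build a tiny gzipped CSV in memory to stand in for one daily snapshot.
sample = gzip.compress(b"id,searches\n1,example-a\n2,example-b\n")
rows = read_gzipped_csv(sample)
print(len(rows), rows[0]["id"])
```

Because the payload never leaves memory, the worker's disk usage stays flat regardless of how many daily files are processed; the trade-off is that each file must fit in the worker's RAM.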
 58 | 59 | By default, airflow comes with some simple built-in operators like `PythonOperator`, `BashOperator`, `DummyOperator` etc., however, airflow lets you extend the features of a `BaseOperator` and create custom operators. For this project, I developed two custom operators: 60 | 61 | ![operators](Report_Shravan_Kuchkula_files/operators.png) 62 | 63 | - **StreetEasyOperator**: Extracts data from the **source S3 bucket**, processes the data in-memory by applying a series of transformations found inside `transforms.py`, then loads it to the destination S3 bucket. Please see the code here: 64 | - **ValidSearchStatsOperator**: Takes data from the **destination S3 bucket**, aggregates the data on a per-day basis, and uploads it to the Redshift table `search_stats`. Please see the code here: 65 | 66 | Here's the directory organization: 67 | 68 | ```bash 69 | ├── README.md 70 | ├── docker-compose.yml 71 | └── street-easy 72 | ├── dags 73 | │   ├── create_postgres_table.py 74 | │   └── street_easy.py 75 | ├── plugins 76 | │   ├── __init__.py 77 | │   ├── helpers 78 | │   │   ├── __init__.py 79 | │   │   └── transforms.py 80 | │   └── operators 81 | │   ├── __init__.py 82 | │   ├── extract_and_transform_streeteasy.py 83 | │   └── valid_search_stats.py 84 | └── requirements.txt 85 | ``` 86 | 87 | 88 | **Pipeline Schedule**: Our pipeline is required to adhere to the following guidelines: 89 | * The DAG should run *daily* from `2018-01-20` to `2018-03-30` 90 | * The DAG should not have any dependencies on past runs. 91 | * On failure, the task is retried 3 times. 92 | * Retries happen every 5 minutes. 93 | * Do not email on retry. 94 | 95 | > Shown below is the data pipeline (street_easy DAG) execution starting on **2018-01-20** and ending on **2018-03-30**. 96 | ![airflow_tree_view](Report_Shravan_Kuchkula_files/airflow_tree_view.png) 97 | > Note: The data for *2018-01-29 and 2018-01-30* is not available, so those two dates are skipped. 
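
The schedule guidelines above map naturally onto Airflow's `default_args`. The sketch below is an assumption about how they might be expressed, not a copy of `street_easy.py`: the `owner` value is hypothetical, and only the standard library is imported so the snippet runs without Airflow installed.

```python
from datetime import datetime, timedelta

# Hypothetical settings mirroring the Pipeline Schedule guidelines.
default_args = {
    "owner": "airflow",                   # assumed owner name
    "depends_on_past": False,             # no dependency on past runs
    "retries": 3,                         # retry a failed task 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email_on_retry": False,              # do not email on retry
}
start_date = datetime(2018, 1, 20)
end_date = datetime(2018, 3, 30)

# In Airflow these would typically be handed to the DAG constructor, e.g.
# DAG("street_easy", default_args=default_args, start_date=start_date,
#     end_date=end_date, schedule_interval="@daily")

scheduled_days = (end_date - start_date).days + 1
missing_days = 2  # 2018-01-29 and 2018-01-30 have no source file
print(scheduled_days, scheduled_days - missing_days)
```

The window covers 70 calendar days; removing the two missing dates leaves 68, which is consistent with the `(68, 6)` shape reported for the `search_stats` dataframe in this report.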
 98 | 99 | **Destination S3 datasets and Redshift Table**: 100 | After each successful run of the DAG, two files are stored in the destination bucket: 101 | * `s3://skuchkula-etl/unique_valid_searches_.csv`: Contains a list of unique valid searches for each day. 102 | * `s3://skuchkula-etl/valid_searches_.csv`: Contains a dataset with the following fields: 103 | * user_id: Unique id of the user 104 | * num_valid_searches: Number of valid searches 105 | * avg_listings: Avg number of listings for that user 106 | * type_of_search: Did the user search for: 107 | * Only Rental 108 | * Only Sale 109 | * Both Rental and Sale 110 | * Neither 111 | * list_of_valid_searches: A list of valid searches for that user 112 | 113 | 114 | **unique_valid_searches_{date}.csv** contains unique valid searches per day: 115 | ```bash 116 | s3://skuchkula-etl/ 117 | unique_valid_searches_20180120.csv 118 | unique_valid_searches_20180121.csv 119 | unique_valid_searches_20180122.csv 120 | unique_valid_searches_20180123.csv 121 | unique_valid_searches_20180214.csv 122 | ... 123 | ``` 124 | 125 | **valid_searches_{date}.csv** contains the valid searches dataset per day: 126 | ```bash 127 | s3://skuchkula-etl/ 128 | valid_searches_20180120.csv 129 | valid_searches_20180121.csv 130 | valid_searches_20180122.csv 131 | valid_searches_20180123.csv 132 | valid_searches_20180214.csv 133 | ... 134 | ``` 135 | **Amazon Redshift table:** 136 | 137 | The `ValidSearchesStatsOperator` then takes each of the `valid_searches_{date}.csv` datasets, calculates summary stats, and loads the results into the **search_stats** table, as shown: 138 | 139 | ![redshift](Report_Shravan_Kuchkula_files/redshift.png) 140 | 141 | 142 | ## Answering business questions using data 143 | 144 | ### Business question: Produce a list of all unique "valid searches" given the above requirements. 145 | The list of all unique valid searches is stored in the destination S3 bucket: `s3://skuchkula-etl/unique_valid_searches_{date}.csv`. 
An example of the output is shown here: 146 | 147 | ```bash 148 | $ head -10 unique_valid_searches_20180330.csv 149 | searches 150 | 38436711 151 | 56011095 152 | 3161333 153 | 43841677 154 | 42719934 155 | 40166212 156 | 44847718 157 | 36981443 158 | 13923552 159 | ``` 160 | The code used to calculate the unique valid searches can be found here: TODO 161 | 162 | We will make use of `pandas`, `psycopg2` and `matplotlib` to answer the next set of business questions with the data we gathered. 163 | 164 | 165 | ```python 166 | import pandas as pd 167 | import pandas.io.sql as sqlio 168 | import configparser 169 | import psycopg2 170 | 171 | import matplotlib.pyplot as plt 172 | plt.style.use('fivethirtyeight') 173 | ``` 174 | 175 | ### Business question: Produce, for each date, the total number of valid searches that existed on that date. 176 | To answer this we need to connect to the Redshift cluster and query the `search_stats` table. First, we obtain a connection to the Redshift cluster. The secrets are stored in the `dwh-streeteasy.cfg` file. Next, we execute the SQL query and store the result as a pandas dataframe. 177 | 178 | 179 | ```python 180 | config = configparser.ConfigParser() 181 | config.read('dwh-streeteasy.cfg') 182 | 183 | # connect to redshift cluster 184 | conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())) 185 | cur = conn.cursor() 186 | 187 | sql_query = "SELECT * FROM search_stats" 188 | df = sqlio.read_sql_query(sql_query, conn) 189 | df['day'] = pd.to_datetime(df['day']) 190 | df = df.set_index('day') 191 | print(df.shape) 192 | df.head() 193 | ``` 194 | (68, 6) 195 | ``` 196 | ![dataframe](Report_Shravan_Kuchkula_files/dataframe.png) 197 | 198 | From this dataframe, for this question, we are interested in finding out the **total number of valid searches** on a given day. This is captured in the `num_searches` column. 
The plot below shows `num_searches` per day for the entire time period. 199 | 200 | 201 | ```python 202 | ax = df['num_searches'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--') 203 | ax.set_xlabel('Date', fontsize=16) 204 | ax.set_ylabel('Valid Searches', fontsize=16) 205 | ax.set_title('Total number of valid searches on each day') 206 | ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3) 207 | plt.show() 208 | ``` 209 | 210 | 211 | ![png](Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_8_0.png) 212 | 213 | 214 | **Observation**: The red band indicates a sharp drop in the number of valid searches on `2018-03-24`. 215 | 216 | ### Business Question: Produce, for each date, the total number of users who had valid searches on that date. 217 | The **total number of users with valid searches per day** is captured in the `num_users` column of the dataframe. A similar trend can be observed for `num_users`, as indicated by the red band. 218 | 219 | 220 | ```python 221 | ax = df['num_users'].plot(figsize=(12, 8), fontsize=12, linewidth=3, linestyle='--') 222 | ax.set_xlabel('Date', fontsize=16) 223 | ax.set_ylabel('Number of users', fontsize=16) 224 | ax.set_title('Total number of users on each day') 225 | ax.axvspan('2018-03-21', '2018-03-24', color='red', alpha=0.3) 226 | plt.show() 227 | ``` 228 | 229 | 230 | ![png](Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_11_0.png) 231 | 232 | 233 | ### Business question: Most engaging search 234 | From the data that is available, it appears that `Rental` searches are the most engaging ones. I am assuming that the number of valid searches is a good indicator to gauge user engagement. The plot below shows that `Rental` searches consistently produce more valid searches than `Sale` type searches. 
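
As a quick arithmetic check on this comparison, the per-type shares can be computed from a single row. The values below are copied from the 2018-01-20 row of `search_stats`. For that row, the four per-type columns add up to `num_users` rather than `num_searches`, which suggests they may be counting users by search type; this reading is an inference from one row, not something confirmed by the pipeline code.

```python
# Values copied from the 2018-01-20 row of the search_stats table.
row = {
    "num_users": 128544,
    "num_rental_searches": 91839,
    "num_sales_searches": 15592,
    "num_rental_and_sales_searches": 2840,
    "num_none_type_searches": 18273,
}

# Share of each search type relative to the sum of the four type columns.
type_columns = [name for name in row if name != "num_users"]
total_by_type = sum(row[name] for name in type_columns)
shares = {name: row[name] / total_by_type for name in type_columns}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# True when the type columns sum to the user count for this row.
print(total_by_type == row["num_users"])
```

On this row, rental-only searches account for roughly seven in ten of the typed entries, which supports the claim that `Rental` dominates.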
 235 | 236 | 237 | ```python 238 | ax = df[['num_rental_searches', 239 | 'num_sales_searches', 240 | 'num_rental_and_sales_searches', 241 | 'num_none_type_searches']].plot(figsize=(12, 8), fontsize=12, linewidth=2, linestyle='--') 242 | ax.set_xlabel('Date', fontsize=16) 243 | ax.set_ylabel('Valid Searches', fontsize=16) 244 | ax.set_title('Types of searches every day') 245 | ax.legend(fontsize=10) 246 | plt.show() 247 | ``` 248 | 249 | 250 | ![png](Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_13_0.png) 251 | 252 | 253 | ### Business question: What would the email traffic look like if we changed the definition of a valid search from 3 clicks to 2? 254 | When the definition of a valid search is changed from `clicks >= 3` to `clicks >= 2`, the number of valid searches and the corresponding stats increase. Shown below is a comparison for the first 3 days: 255 | 256 | ![clicks](Report_Shravan_Kuchkula_files/clicks.png) 257 | 258 | This means that the **email traffic would increase**. 259 | 260 | ### Business question: Report any interesting trends over the timespan of the data available. 261 | Mainly there are two trends observed with this timeseries data: 262 | - One is that there is a steady increase in the number of searches made and also in the number of users. The stats corresponding to individual search types show that Rental searches are growing faster than Sales searches. 263 | - The second is a sharp dip in the number of searches and users on 2018-03-23, which is worth investigating. 264 | 265 | ## Recommendations 266 | 267 | ### Recommendations in data storage: 268 | In terms of storing data, using CSV files comes with some problems down the line. Here are some difficulties with CSV files: 269 | - No defined schema: no data types are included, and no column names beyond a header row. 
 270 | - Nested data requires special handling. 271 | 272 | In addition to these issues with the CSV file format, **Spark** has some **specific problems** when working with CSV data: 273 | - CSV files are quite **slow to import** and parse. 274 | - The files cannot be shared between workers during the import process. 275 | - If no schema is defined, then all data must be read before a schema can be inferred. 276 | - Spark has a feature known as **predicate pushdown**, which orders tasks so that the least amount of work is done. For example, *filtering* data prior to processing is one of the primary optimizations of predicate pushdown; this drastically reduces the amount of information that must be processed in large data sets. Unfortunately, we cannot filter the CSV data via predicate pushdown. 277 | - Finally, Spark processes are often multi-step and may utilize an intermediate file representation. These representations allow data to be used later without regenerating the data from source. 278 | 279 | Instead of using CSV, use the **parquet file format** when possible. 280 | 281 | **Parquet Format**: Parquet is a compressed columnar data format, structured so that data is accessible in chunks, which allows efficient read/write operations without processing the entire file. This structured format supports Spark's predicate pushdown functionality, thus providing a significant performance improvement. Finally, parquet files automatically include schema information and handle data encoding. This is perfect for intermediary or on-disk representation of processed data. Note that parquet is a binary file format and can only be read with appropriate tools. 282 | 283 | ### Recommendations for downstream processing: 284 | The search field coming through from the application appears to be in `YAML` format. I found that writing regular expressions to parse out the search field is prone to errors if the schema evolves. 
A better way to capture the search field is using JSON or AVRO, as this has some form of schema tied to it, so that downstream applications can know when the schema evolves. 285 | -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_11_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_11_0.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_13_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_13_0.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_8_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/Report_Shravan_Kuchkula_8_0.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/airflow_tree_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/airflow_tree_view.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/clicks.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/clicks.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/dag.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/dataframe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/dataframe.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/operators.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/operators.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/rawdata.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/rawdata.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/redshift.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/redshift.png -------------------------------------------------------------------------------- /Report/Report_Shravan_Kuchkula_files/validsearches.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/Report/Report_Shravan_Kuchkula_files/validsearches.png -------------------------------------------------------------------------------- /Report/dwh-streeteasy.cfg: -------------------------------------------------------------------------------- 1 | [CLUSTER] 2 | HOST=dwhcluster.cpczrz48gy51.us-east-1.redshift.amazonaws.com 3 | DB_NAME=dwh 4 | DB_USER=dwhuser 5 | DB_PASSWORD= 6 | DB_PORT=5439 7 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3' 2 | services: 3 | postgres: 4 | image: postgres:9.6 5 | environment: 6 | - POSTGRES_USER=airflow 7 | - POSTGRES_PASSWORD=airflow 8 | - POSTGRES_DB=airflow 9 | ports: 10 | - "5432:5432" 11 | 12 | webserver: 13 | image: puckel/docker-airflow:1.10.4 14 | build: 15 | context: https://github.com/puckel/docker-airflow.git#1.10.4 16 | dockerfile: Dockerfile 17 | args: 18 | AIRFLOW_DEPS: gcp_api,s3 19 | restart: always 20 | depends_on: 21 | - postgres 22 | environment: 23 | - LOAD_EX=n 24 | - EXECUTOR=Local 25 | - FERNET_KEY=jsDPRErfv8Z_eVTnGfF8ywd19j4pyqE3NpdUBA_oRTo= 26 | volumes: 27 | - ./street-easy/dags:/usr/local/airflow/dags 28 | # Uncomment to include custom plugins 29 | - ./street-easy/plugins:/usr/local/airflow/plugins 30 | # Additional python packages used inside airflow operators 31 | - ./street-easy/requirements.txt:/requirements.txt 32 | ports: 33 | - "8080:8080" 34 | command: 
webserver 35 | healthcheck: 36 | test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"] 37 | interval: 30s 38 | timeout: 30s 39 | retries: 3 40 | -------------------------------------------------------------------------------- /images/Report_Shravan_Kuchkula_11_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/Report_Shravan_Kuchkula_11_0.png -------------------------------------------------------------------------------- /images/Report_Shravan_Kuchkula_13_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/Report_Shravan_Kuchkula_13_0.png -------------------------------------------------------------------------------- /images/Report_Shravan_Kuchkula_8_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/Report_Shravan_Kuchkula_8_0.png -------------------------------------------------------------------------------- /images/airflow_tree_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/airflow_tree_view.png -------------------------------------------------------------------------------- /images/clicks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/clicks.png -------------------------------------------------------------------------------- /images/connections.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/connections.png -------------------------------------------------------------------------------- /images/dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/dag.png -------------------------------------------------------------------------------- /images/dataframe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/dataframe.png -------------------------------------------------------------------------------- /images/operators.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/operators.png -------------------------------------------------------------------------------- /images/rawdata.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/rawdata.png -------------------------------------------------------------------------------- /images/redshift.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/redshift.png -------------------------------------------------------------------------------- /images/validsearches.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/validsearches.png -------------------------------------------------------------------------------- /images/variables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/images/variables.png -------------------------------------------------------------------------------- /street-easy/dags/create_postgres_table.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | import os 3 | from airflow import DAG 4 | from airflow.operators.dummy_operator import DummyOperator 5 | from airflow.operators.postgres_operator import PostgresOperator 6 | 7 | default_args = { 8 | 'owner': 'shravan', 9 | 'start_date': datetime.utcnow() - timedelta(hours=5), 10 | 'depends_on_past': False, 11 | 'email_on_retry': False, 12 | 'retries': 2, 13 | 'retry_delay': timedelta(minutes=1), 14 | 'catchup_by_default': False, 15 | } 16 | 17 | dag = DAG('create_search_tables_dag', 18 | default_args=default_args, 19 | description='Create tables in Redshift using Airflow', 20 | schedule_interval=None, 21 | max_active_runs=1 22 | ) 23 | 24 | start_operator = DummyOperator(task_id='Begin_execution', dag=dag) 25 | 26 | create_search_stats_table = PostgresOperator( 27 | task_id="create_search_stats_table", 28 | dag=dag, 29 | postgres_conn_id="redshift", 30 | sql=''' 31 | CREATE TABLE IF NOT EXISTS public.search_stats ( 32 | day date, 33 | num_searches int, 34 | num_users int, 35 | num_rental_searches int, 36 | num_sales_searches int, 37 | num_rental_and_sales_searches int, 38 | num_none_type_searches int 39 | ); 40 | ''' 41 | ) 42 | 43 | end_operator = DummyOperator(task_id='Stop_execution', dag=dag) 44 | 45 | # task dependencies 46 | start_operator >> create_search_stats_table 47 | 
create_search_stats_table >> end_operator 48 | -------------------------------------------------------------------------------- /street-easy/dags/street_easy.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | from datetime import datetime, timedelta 4 | 5 | from airflow import DAG 6 | from airflow.models import Variable 7 | from airflow.operators.python_operator import PythonOperator 8 | from airflow.operators.dummy_operator import DummyOperator 9 | from airflow.hooks.S3_hook import S3Hook 10 | from airflow.contrib.hooks.aws_hook import AwsHook 11 | from airflow.operators import (StreetEasyOperator, ValidSearchStatsOperator) 12 | 13 | # Default arguments for DAG: 14 | # start_date : date from which we start processing. 15 | # end_date : date until which we process the data. 16 | # depends_on_past : this DAG is independent of previous runs. 17 | # email_on_retry : We don't want emails on retry. 18 | # retries : Number of times a task is retried upon failure. 19 | # retry_delay : How long the scheduler waits before re-attempting a task. 20 | # provide_context : When provide_context=True is set on an operator, Airflow passes 21 | # along the context variables to be used inside the operator. 
22 | default_args = { 23 | 'owner': 'shravan', 24 | 'start_date': datetime(2018, 1, 20), 25 | 'end_date': datetime(2018, 3, 30), 26 | 'depends_on_past': False, 27 | 'email_on_retry': False, 28 | 'retries': 3, 29 | 'retry_delay': timedelta(minutes=5), 30 | 'provide_context': True, 31 | } 32 | 33 | # Dag will start automatically when turned on 34 | # This is because the start date is in the past 35 | dag = DAG( 36 | 'street_easy', 37 | default_args=default_args, 38 | description='Load and Transform street easy data', 39 | schedule_interval='@daily', 40 | max_active_runs=1 41 | ) 42 | 43 | def check_connectivity_to_s3(*args, **kwargs): 44 | hook = S3Hook(aws_conn_id='aws_credentials') 45 | bucket = Variable.get('s3_bucket') 46 | logging.info(f"Listing Keys from {bucket}") 47 | keys = hook.list_keys(bucket) 48 | for key in keys: 49 | logging.info(f"- s3://{bucket}/{key}") 50 | 51 | start_operator = DummyOperator(task_id='Begin_Execution', dag=dag) 52 | 53 | check_connectivity_to_s3 = PythonOperator( 54 | task_id="check_connectivity_to_s3", 55 | python_callable=check_connectivity_to_s3, 56 | dag=dag 57 | ) 58 | 59 | extract_and_transform_streeteasy_data = StreetEasyOperator( 60 | task_id = "extract_and_transform_streeteasy_data", 61 | dag=dag, 62 | aws_credentials_id = "aws_credentials", 63 | aws_credentials_dest_id = "aws_credentials_dest", 64 | s3_bucket = Variable.get('s3_bucket'), 65 | s3_dest_bucket = Variable.get('s3_dest_bucket'), 66 | s3_key = "inferred_users.{ds}.csv.gz", 67 | s3_dest_key = "unique_valid_searches_{ds}.csv", 68 | s3_dest_df_key = "valid_searches_{ds}.csv", 69 | ) 70 | 71 | calculate_valid_search_stats = ValidSearchStatsOperator( 72 | task_id = "calculate_valid_search_stats", 73 | dag=dag, 74 | aws_credentials_id = "aws_credentials_dest", 75 | redshift_conn_id = "redshift", 76 | table = "search_stats", 77 | columns = """ 78 | day, 79 | num_searches, 80 | num_users, 81 | num_rental_searches, 82 | num_sales_searches, 83 | 
num_rental_and_sales_searches, 84 | num_none_type_searches 85 | """, 86 | s3_bucket = Variable.get('s3_dest_bucket'), 87 | s3_key = "valid_searches_{ds}.csv", 88 | today = "{ds}", 89 | ) 90 | 91 | end_operator = DummyOperator(task_id='End_Execution', dag=dag) 92 | 93 | # DAG layout 94 | start_operator >> check_connectivity_to_s3 95 | check_connectivity_to_s3 >> extract_and_transform_streeteasy_data 96 | extract_and_transform_streeteasy_data >> calculate_valid_search_stats 97 | calculate_valid_search_stats >> end_operator 98 | -------------------------------------------------------------------------------- /street-easy/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, absolute_import, print_function 2 | 3 | from airflow.plugins_manager import AirflowPlugin 4 | 5 | import operators 6 | import helpers 7 | 8 | # Defining the plugin class 9 | class SEPlugin(AirflowPlugin): 10 | name = "se_plugin" 11 | operators = [ 12 | operators.StreetEasyOperator, 13 | operators.ValidSearchStatsOperator 14 | ] 15 | helpers = [ 16 | ] 17 | -------------------------------------------------------------------------------- /street-easy/plugins/helpers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shravan-kuchkula/udacity-data-eng-proj2/a8bb5ae1298695b01581c38395bc9508cb1e3eba/street-easy/plugins/helpers/__init__.py -------------------------------------------------------------------------------- /street-easy/plugins/helpers/transforms.py: -------------------------------------------------------------------------------- 1 | import re 2 | import pandas as pd 3 | import numpy as np 4 | 5 | def valid_searches(searches): 6 | ''' Parses the search string and returns only valid searches. 7 | Additional details: intended to be applied to a pandas series. 
8 | 9 | :param searches: raw unparsed search string 10 | :type str 11 | :return valid_searches: list of valid searches 12 | :type list 13 | ''' 14 | # each search is delimited by \\n- 15 | searches = searches.split('\\n-') 16 | 17 | # drop items that start with '---' 18 | searches = [item for item in searches if not item.startswith('---')] 19 | 20 | # if no searches then return empty list otherwise keep parsing 21 | if len(searches) == 0: 22 | return [] 23 | else: 24 | # parse the searches and make a list of searches 25 | # (items starting with '---' were already filtered out above) 26 | searches = [re.sub(r'(\\.n|\\.n\s+:|\\)', ' ', item) for item in searches] 27 | searches = [re.sub(r'\s+:', ',', item) for item in searches] 28 | searches = [item.split(',') for item in searches] 29 | 30 | # Determine validity: 31 | # a valid search contains enabled==true and has clicks >= 3 32 | # store only valid searches in the list to return. 33 | valid_searches = [] 34 | for item in searches: 35 | search_dict = {} 36 | for key in item: 37 | if key.split(':')[0] in ('search_id', 'enabled', 'clicks', 38 | 'type', 'listings_sent', 'recommended'): 39 | d_key = key.split(':')[0] 40 | d_value = key.split(':')[1].strip() 41 | search_dict[d_key] = d_value 42 | if search_dict.get('enabled') == 'true' and int(search_dict.get('clicks', 0)) >= 3: 43 | valid_searches.append(search_dict) 44 | 45 | return valid_searches 46 | 47 | def avg_listings_sent(valid_searches): 48 | num_listings = 0 49 | listings = 0 50 | for item in valid_searches: 51 | if item.get('listings_sent'): 52 | num_listings = num_listings + 1 53 | listings = listings + int(item.get('listings_sent')) 54 | 55 | if num_listings > 0: 56 | return np.round(listings / num_listings, 2) 57 | else: 58 | return 0 59 | 60 | def type_of_search(valid_searches): 61 | ''' 62 | Categorize the type of search given a list of searches. 
63 | 64 | :param valid_searches: list of valid searches (one dict per search) 65 | :return: one of 'rental_and_sale', 'sale', 'rental', 'none' 66 | ''' 67 | rental = 0 68 | sale = 0 69 | for item in valid_searches: 70 | if item.get('type') == 'Rental': 71 | rental = rental + 1 72 | elif item.get('type') == 'Sale': 73 | sale = sale + 1 74 | else: 75 | pass 76 | 77 | if rental > 0 and sale > 0: 78 | return "rental_and_sale" 79 | elif rental > 0: 80 | return "rental" 81 | elif sale > 0: 82 | return "sale" 83 | else: 84 | return "none" 85 | 86 | def list_of_valid_searches(valid_searches): 87 | ''' 88 | Collect the search_id of every valid search into a single list 89 | 90 | :param valid_searches: list of valid searches (one dict per search) 91 | :return search_list: list of search ids. 92 | ''' 93 | search_list = [] 94 | for item in valid_searches: 95 | if item.get('search_id'): 96 | search_list.append(item.get('search_id')) 97 | return search_list 98 | -------------------------------------------------------------------------------- /street-easy/plugins/operators/__init__.py: -------------------------------------------------------------------------------- 1 | from operators.extract_and_transform_streeteasy import StreetEasyOperator 2 | from operators.valid_search_stats import ValidSearchStatsOperator 3 | 4 | __all__ = [ 5 | 'StreetEasyOperator', 6 | 'ValidSearchStatsOperator' 7 | ] 8 | -------------------------------------------------------------------------------- /street-easy/plugins/operators/extract_and_transform_streeteasy.py: -------------------------------------------------------------------------------- 1 | from airflow.contrib.hooks.aws_hook import AwsHook 2 | from airflow.models import BaseOperator 3 | from airflow.utils.decorators import apply_defaults 4 | from helpers.transforms import valid_searches 5 | from helpers.transforms import avg_listings_sent 6 | from helpers.transforms import type_of_search 7 | from helpers.transforms import list_of_valid_searches 8 | 9 | import pandas as pd 10 | import numpy as np 11 | import re 12 | 
from s3fs.core import S3FileSystem 13 | 14 | class StreetEasyOperator(BaseOperator): 15 | """ 16 | Extract data from source S3, process it in-memory, load it to dest S3. 17 | 18 | :param aws_credentials_id: reference to source aws hook containing iam details. 19 | :type aws_credentials_id: str 20 | :param aws_credentials_dest_id: reference to dest aws hook containing iam details. 21 | :type aws_credentials_dest_id: str 22 | :param s3_bucket: source s3 bucket name 23 | :type s3_bucket: str 24 | :param s3_dest_bucket: destination s3 bucket name 25 | :type s3_dest_bucket: str 26 | :param s3_key: source s3 file (templated) 27 | :type s3_key: Can receive a str representing a prefix, 28 | the prefix can contain a path that is partitioned by some field. 29 | :param s3_dest_key: first destination s3 file (templated) 30 | :type s3_dest_key: Can receive a str representing a prefix, 31 | the prefix can contain a path that is partitioned by some field. 32 | :param s3_dest_df_key: second destination s3 file (templated) 33 | :type s3_dest_df_key: Can receive a str representing a prefix, 34 | the prefix can contain a path that is partitioned by some field. 
35 | """ 36 | template_fields = ("s3_key", "s3_dest_key", "s3_dest_df_key",) 37 | 38 | @apply_defaults 39 | def __init__(self, 40 | aws_credentials_id="", 41 | aws_credentials_dest_id="", 42 | s3_bucket="", 43 | s3_dest_bucket="", 44 | s3_key="", 45 | s3_dest_key="", 46 | s3_dest_df_key="", 47 | *args, **kwargs): 48 | 49 | super(StreetEasyOperator, self).__init__(*args, **kwargs) 50 | self.aws_credentials_id = aws_credentials_id 51 | self.aws_credentials_dest_id = aws_credentials_dest_id 52 | self.s3_bucket = s3_bucket 53 | self.s3_dest_bucket = s3_dest_bucket 54 | self.s3_key = s3_key 55 | self.s3_dest_key = s3_dest_key 56 | self.s3_dest_df_key = s3_dest_df_key 57 | 58 | 59 | def execute(self, context): 60 | self.log.info("Executing StreetEasyOperator!!") 61 | 62 | # get the aws hooks 63 | aws_hook = AwsHook(self.aws_credentials_id) 64 | aws_dest_hook = AwsHook(self.aws_credentials_dest_id) 65 | 66 | # get the credentials for source and destination 67 | credentials = aws_hook.get_credentials() 68 | credentials_dest = aws_dest_hook.get_credentials() 69 | 70 | # build the s3 source path 71 | # execute() receives the Airflow context as a dictionary 72 | # use **context to unpack the dictionary and format the s3_key 73 | rendered_key = self.s3_key.format(**context) 74 | rendered_key_no_dashes = re.sub(r'-', '', rendered_key) 75 | self.log.info("Rendered Key no dashes {}".format(rendered_key_no_dashes)) 76 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key_no_dashes) 77 | 78 | # get a S3 file handle 79 | s3 = S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_key) 80 | 81 | self.log.info("Extract data from {}".format(s3_path)) 82 | # stream data from s3 and transform it 83 | with s3.open(s3_path, mode='rb') as s3_file: 84 | # read in the data from s3 85 | data = pd.read_csv(s3_file, compression='gzip', names=['user_id', 'searches']) 86 | 87 | # create valid searches 88 | 
data['searches'].apply(valid_searches) 89 | 90 | # calculate num valid searches per user 91 | data['num_valid_searches'] = data['valid_searches'].apply(len) 92 | 93 | # keep only valid searches 94 | data = data[data.num_valid_searches > 0].reset_index(drop=True) 95 | 96 | # remove original searches 97 | data = data.drop(['searches'], axis=1) 98 | 99 | # calculate avg_listings_sent 100 | data['avg_listings'] = data['valid_searches'].apply(avg_listings_sent) 101 | 102 | # calculate type_of_search 103 | data['type_of_search'] = data['valid_searches'].apply(type_of_search) 104 | 105 | # prepare a list of valid search ids 106 | data['list_of_valid_searches'] = data['valid_searches'].apply(list_of_valid_searches) 107 | 108 | # drop valid searches as we don't need it anymore 109 | data = data.drop(['valid_searches'], axis=1) 110 | 111 | # get unique valid searches 112 | unique_valid_searches = set() 113 | for sublist in data['list_of_valid_searches']: 114 | for item in sublist: 115 | unique_valid_searches.add(re.sub(r'\'', '', item)) 116 | 117 | # construct a dataframe 118 | unique_valid_searches_df = pd.DataFrame({'searches': list(unique_valid_searches)}) 119 | 120 | self.log.info("Total valid searches today are: {}".format(np.sum(data['num_valid_searches']))) 121 | self.log.info("Total users today are: {}".format(np.sum(data['num_valid_searches'] > 0))) 122 | 123 | # build the s3 destination path 124 | rendered_dest_key = self.s3_dest_key.format(**context) 125 | rendered_dest_key_no_dashes = re.sub(r'-', '', rendered_dest_key) 126 | self.log.info("Rendered Key no dashes {}".format(rendered_dest_key_no_dashes)) 127 | s3_dest_path = "s3://{}/{}".format(self.s3_dest_bucket, rendered_dest_key_no_dashes) 128 | 129 | # get a S3 file handle for destination 130 | s3_dest = S3FileSystem(anon=False, key=credentials_dest.access_key, secret=credentials_dest.secret_key) 131 | 132 | # stream the transformed data into s3 133 | with s3_dest.open(s3_dest_path, mode='wb') as 
s3_dest_file: 134 | self.log.info("Started writing {}".format(unique_valid_searches_df.shape)) 135 | s3_dest_file.write(unique_valid_searches_df.to_csv(None, index=False).encode()) 136 | self.log.info("Completed writing {}".format(unique_valid_searches_df.shape)) 137 | 138 | rendered_dest_df_key = self.s3_dest_df_key.format(**context) 139 | rendered_dest_df_key_no_dashes = re.sub(r'-', '', rendered_dest_df_key) 140 | self.log.info("Rendered Key no dashes {}".format(rendered_dest_df_key_no_dashes)) 141 | s3_dest_df_path = "s3://{}/{}".format(self.s3_dest_bucket, rendered_dest_df_key_no_dashes) 142 | 143 | with s3_dest.open(s3_dest_df_path, mode='wb') as s3_dest_file: 144 | self.log.info("Started writing {}".format(data.shape)) 145 | s3_dest_file.write(data.to_csv(None, index=False).encode()) 146 | self.log.info("Completed writing {}".format(data.shape)) 147 | 148 | self.log.info("StreetEasyOperator completed") 149 | -------------------------------------------------------------------------------- /street-easy/plugins/operators/valid_search_stats.py: -------------------------------------------------------------------------------- 1 | from airflow.hooks.postgres_hook import PostgresHook 2 | from airflow.contrib.hooks.aws_hook import AwsHook 3 | from airflow.models import BaseOperator 4 | from airflow.utils.decorators import apply_defaults 5 | 6 | import pandas as pd 7 | import numpy as np 8 | import re 9 | from s3fs.core import S3FileSystem 10 | 11 | class ValidSearchStatsOperator(BaseOperator): 12 | """ 13 | Takes data from S3, calculates search stats, loads them into a Redshift table. 14 | 15 | :param aws_credentials_id: reference to source aws hook containing iam details. 16 | :type aws_credentials_id: str 17 | :param redshift_conn_id: reference to a specific redshift cluster hook 18 | :type redshift_conn_id: str 19 | :param table: destination table on redshift. 
20 | :type table: str 21 | :param columns: columns of the destination table 22 | :type columns: str containing column names in csv format. 23 | :param s3_bucket: source s3 bucket name 24 | :type s3_bucket: str 25 | :param s3_key: source s3 file (templated) 26 | :type s3_key: Can receive a str representing a prefix, 27 | the prefix can contain a path that is partitioned by some field. 28 | :param today: date of execution (templated) 29 | :type today: Can receive a str representing the execution_date. 30 | """ 31 | ui_color = '#80BD9E' 32 | template_fields = ("s3_key", "today", ) 33 | load_search_stats_sql = """ 34 | INSERT INTO {} {} VALUES {} 35 | """ 36 | @apply_defaults 37 | def __init__(self, 38 | # Define your operators params (with defaults) here 39 | aws_credentials_id="", 40 | redshift_conn_id="", 41 | table="", 42 | columns="", 43 | s3_bucket="", 44 | s3_key="", 45 | today="", 46 | *args, **kwargs): 47 | 48 | super(ValidSearchStatsOperator, self).__init__(*args, **kwargs) 49 | # Map params here 50 | self.aws_credentials_id = aws_credentials_id 51 | self.redshift_conn_id = redshift_conn_id 52 | self.table = table 53 | self.columns = columns 54 | self.s3_bucket = s3_bucket 55 | self.s3_key = s3_key 56 | self.today = today 57 | 58 | def execute(self, context): 59 | self.log.info('ValidSearchStatsOperator has started') 60 | # get the hooks 61 | redshift_hook = PostgresHook(self.redshift_conn_id) 62 | aws_hook = AwsHook(self.aws_credentials_id) 63 | 64 | # get the credentials for s3 65 | credentials = aws_hook.get_credentials() 66 | 67 | # put the columns in the format the INSERT statement expects 68 | columns = "({})".format(self.columns) 69 | 70 | # build the s3 source path 71 | # execute() receives the Airflow context as a dictionary 72 | # use **context to unpack the dictionary and format the s3_key 73 | rendered_key = self.s3_key.format(**context) 74 | rendered_key_no_dashes = re.sub(r'-', '', rendered_key) 75 | self.log.info("Rendered Key no dashes 
{}".format(rendered_key_no_dashes)) 76 | s3_path = "s3://{}/{}".format(self.s3_bucket, rendered_key_no_dashes) 77 | 78 | # get a S3 file handle and pass in the creds 79 | s3 = S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_key) 80 | 81 | # stream data from s3 82 | # we don't want to store a local copy of the file on the airflow worker's disk 83 | # thus, we process it in-memory using a with-open-file construct. 84 | with s3.open(s3_path, mode='rb') as s3_file: 85 | # read in the data from s3 86 | data = pd.read_csv(s3_file) 87 | self.log.info("Shape of the data is {}".format(data.shape)) 88 | 89 | # execute() receives the Airflow context as a dictionary 90 | # use **context to unpack the dictionary and format the today attribute 91 | render_today = self.today.format(**context) 92 | self.log.info("Today is {}".format(render_today)) 93 | 94 | # calculate summary stats 95 | num_valid_searches = np.sum(data['num_valid_searches']) 96 | num_users_with_valid_searches = np.sum(data['num_valid_searches'] > 0) 97 | num_rental_searches = np.sum(data['type_of_search'] == 'rental') 98 | num_sales_searches = np.sum(data['type_of_search'] == 'sale') 99 | num_rental_and_sales_searches = np.sum(data['type_of_search'] == 'rental_and_sale') 100 | num_none_type_searches = np.sum(data['type_of_search'] == 'none') 101 | 102 | # prepare values to be sent to INSERT stmt 103 | values = (render_today, num_valid_searches, num_users_with_valid_searches, 104 | num_rental_searches, num_sales_searches, num_rental_and_sales_searches, 105 | num_none_type_searches) 106 | 107 | self.log.info("Loading stats into Redshift table") 108 | self.log.info("Total valid searches today are: {}".format(np.sum(data['num_valid_searches']))) 109 | self.log.info("Total users today are: {}".format(np.sum(data['num_valid_searches'] > 0))) 110 | self.log.info("Total rental searches today are: {}".format(np.sum(data['type_of_search'] == 'rental'))) 111 | self.log.info("Total sales 
searches today are: {}".format(np.sum(data['type_of_search'] == 'sale'))) 112 | self.log.info("Total rental and sales searches today are: {}".format(np.sum(data['type_of_search'] == 'rental_and_sale'))) 113 | self.log.info("Total none type searches today are: {}".format(np.sum(data['type_of_search'] == 'none'))) 114 | 115 | # build the insert statement 116 | load_sql = ValidSearchStatsOperator.load_search_stats_sql.format( 117 | self.table, 118 | columns, 119 | values 120 | ) 121 | 122 | # load data into redshift 123 | redshift_hook.run(load_sql) 124 | -------------------------------------------------------------------------------- /street-easy/requirements.txt: -------------------------------------------------------------------------------- 1 | s3fs==0.4.0 2 | --------------------------------------------------------------------------------
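Both operators above build their S3 keys the same way: fill the `{ds}` placeholder from the Airflow context with `str.format`, then strip the dashes so that `2018-01-20` becomes `20180120`. The snippet below is a minimal standalone sketch of that step. The helper name `render_key` and the hand-built `context` dictionary are assumptions made for illustration and are not part of the repository.

```python
import re


def render_key(template, context):
    """Fill a `{ds}`-style key template and drop the dashes from the result.

    Hypothetical helper that mirrors the format + re.sub steps in the operators.
    """
    rendered = template.format(**context)
    # Caveat: this removes every dash in the key, not only those in the date.
    return re.sub(r"-", "", rendered)


# Stand-in for the Airflow context; only the `ds` entry is needed here.
context = {"ds": "2018-01-20"}

source_key = render_key("inferred_users.{ds}.csv.gz", context)
dest_key = render_key("unique_valid_searches_{ds}.csv", context)
print(source_key)
print(dest_key)
```

Because the substitution strips all dashes, a bucket prefix or file name that itself contains a dash would be altered too. Airflow's context also carries a `ds_nodash` entry, so a template such as `inferred_users.{ds_nodash}.csv.gz` might avoid the extra substitution, though that variant is untested against this pipeline.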