├── airflow ├── dags │ ├── __init__.py │ ├── subdags │ │ ├── __init__.py │ │ ├── data_quality_checks.py │ │ ├── spark_jobs.py │ │ └── copy_to_redshift.py │ ├── scripts │ │ ├── photos.py │ │ ├── users.py │ │ ├── tips.py │ │ ├── reviews.py │ │ ├── checkins.py │ │ ├── elite_years.py │ │ ├── friends.py │ │ ├── addresses.py │ │ ├── businesses.py │ │ ├── cities.py │ │ ├── business_hours.py │ │ ├── business_categories.py │ │ ├── business_attributes.py │ │ └── city_weather.py │ ├── configs │ │ ├── check_definitions.yml │ │ └── table_definitions.yml │ └── main.py └── plugins │ ├── __init__.py │ ├── redshift_plugin │ ├── macros │ │ ├── __init__.py │ │ └── redshift_auth.py │ ├── operators │ │ ├── __init__.py │ │ ├── redshift_check_operator.py │ │ └── s3_to_redshift_operator.py │ └── __init__.py │ └── spark_plugin │ ├── operators │ ├── __init__.py │ └── spark_operator.py │ └── __init__.py ├── .gitignore ├── images ├── main.png ├── data-model.png ├── spark_jobs.png ├── amazon-s3-logo.png ├── aws-connection.png ├── copy_to_redshift.png ├── redshift-connection.png ├── 1200px-Yelp_Logo.svg.png ├── livy_http_connection.png ├── 1*eeiD15Xwc_2Ul2DA5u_-Gw.png ├── aws-redshift-connector.png ├── 1200px-Apache_Spark_Logo.svg.png └── airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png ├── dwh.cfg ├── delete_redshift_cluster.ipynb ├── create_redshift_cluster.ipynb └── README.md /airflow/dags/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /airflow/plugins/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /airflow/dags/subdags/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/macros/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/operators/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /airflow/plugins/spark_plugin/operators/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | .ipynb_checkpoints 3 | __pycache__ -------------------------------------------------------------------------------- /images/main.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/main.png -------------------------------------------------------------------------------- /images/data-model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/data-model.png -------------------------------------------------------------------------------- /images/spark_jobs.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/spark_jobs.png -------------------------------------------------------------------------------- /images/amazon-s3-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/amazon-s3-logo.png -------------------------------------------------------------------------------- /images/aws-connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/aws-connection.png -------------------------------------------------------------------------------- /images/copy_to_redshift.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/copy_to_redshift.png -------------------------------------------------------------------------------- /images/redshift-connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/redshift-connection.png -------------------------------------------------------------------------------- /images/1200px-Yelp_Logo.svg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1200px-Yelp_Logo.svg.png -------------------------------------------------------------------------------- /images/livy_http_connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/livy_http_connection.png -------------------------------------------------------------------------------- /images/1*eeiD15Xwc_2Ul2DA5u_-Gw.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1*eeiD15Xwc_2Ul2DA5u_-Gw.png -------------------------------------------------------------------------------- /images/aws-redshift-connector.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/aws-redshift-connector.png -------------------------------------------------------------------------------- /images/1200px-Apache_Spark_Logo.svg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1200px-Apache_Spark_Logo.svg.png -------------------------------------------------------------------------------- /images/airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png -------------------------------------------------------------------------------- /dwh.cfg: -------------------------------------------------------------------------------- 1 | [AWS] 2 | KEY= 3 | SECRET= 4 | 5 | [DWH] 6 | DWH_CLUSTER_TYPE=single-node 7 | DWH_NUM_NODES=1 8 | DWH_NODE_TYPE=dc2.large 9 | DWH_IAM_ROLE_NAME=dwhRole 10 | DWH_CLUSTER_IDENTIFIER=dwhCluster 11 | 12 | [DB] 13 | DB_NAME=dwh 14 | DB_USER=dwhuser 15 | DB_PASSWORD=Passw0rd 16 | DB_PORT=5439 17 | 18 | [DB_ACCESS] 19 | DB_HOST=dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com 20 | 
ROLE_ARN='arn:aws:iam::953225455667:role/dwhRole'
21 | 
--------------------------------------------------------------------------------
/airflow/plugins/spark_plugin/__init__.py:
--------------------------------------------------------------------------------
1 | from airflow.plugins_manager import AirflowPlugin
2 | from spark_plugin.operators.spark_operator import SparkSubmitOperator, LivySparkOperator
3 | 
4 | 
5 | class SparkPlugin(AirflowPlugin):
6 |     name = "SparkPlugin"
7 |     operators = [
8 |         SparkSubmitOperator,
9 |         LivySparkOperator
10 |     ]
11 |     # Leave in for explicitness
12 |     hooks = []
13 |     executors = []
14 |     macros = []
15 |     admin_views = []
16 |     flask_blueprints = []
17 |     menu_links = []
18 | 
--------------------------------------------------------------------------------
/airflow/dags/scripts/photos.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 | 
5 | # File paths
6 | source_photo_path = "s3://polakowo-yelp2/yelp_dataset/photo.json"
7 | target_photos_path = "s3://polakowo-yelp2/staging_data/photos"
8 | 
9 | photos_df = spark.read.json(source_photo_path)
10 | 
11 | # Even if we do not store the photos themselves, this table is useful for knowing how many and what kind of photos were taken.
12 | 
13 | photos_df.write.parquet(target_photos_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/users.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 | 
5 | # File paths
6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json"
7 | target_users_path = "s3://polakowo-yelp2/staging_data/users"
8 | 
9 | user_df = spark.read.json(source_user_path)
10 | 
11 | # Drop fields which will be moved out into separate tables and cast the timestamp field
12 | users_df = user_df.drop("elite", "friends")\
13 |     .withColumn("yelping_since", F.to_timestamp("yelping_since"))
14 | 
15 | users_df.write.parquet(target_users_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/tips.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 | 
5 | # File paths
6 | source_tip_path = "s3://polakowo-yelp2/yelp_dataset/tip.json"
7 | target_tips_path = "s3://polakowo-yelp2/staging_data/tips"
8 | 
9 | tip_df = spark.read.json(source_tip_path)
10 | 
11 | # Assign to each record a unique id for convenience.
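
# Illustrative sketch (not part of the original script): a minimal, hypothetical
# example of the two steps applied below -- parsing the "date" string with
# F.to_timestamp() and attaching a surrogate key with F.monotonically_increasing_id().
# Assumes the `spark` session that the Livy/pyspark context provides; the rows are
# made up. The helper is deliberately never called, so the batch job is unaffected.
def _sketch_tip_ids(spark):
    from pyspark.sql import functions as F
    demo = spark.createDataFrame(
        [("u1", "b1", "2016-07-08 12:30:45", "Great tacos")],
        ["user_id", "business_id", "date", "text"])
    # The generated ids are unique and increasing, but not consecutive:
    # the partition id is encoded in the upper bits.
    return (demo.withColumnRenamed("date", "ts")
                .withColumn("ts", F.to_timestamp("ts"))
                .withColumn("tip_id", F.monotonically_increasing_id()))
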
12 | 13 | tips_df = tip_df.withColumnRenamed("date", "ts")\ 14 | .withColumn("ts", F.to_timestamp("ts"))\ 15 | .withColumn("tip_id", F.monotonically_increasing_id()) 16 | 17 | tips_df.write.parquet(target_tips_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/reviews.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_review_path = "s3://polakowo-yelp2/yelp_dataset/review.json" 7 | target_reviews_path = "s3://polakowo-yelp2/staging_data/reviews" 8 | 9 | review_df = spark.read.json(source_review_path) 10 | 11 | # The table can be used as-is, only minor transformations required. 12 | 13 | # date field looks more like a timestamp 14 | reviews_df = review_df.withColumnRenamed("date", "ts")\ 15 | .withColumn("ts", F.to_timestamp("ts")) 16 | 17 | reviews_df.write.parquet(target_reviews_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/macros/redshift_auth.py: -------------------------------------------------------------------------------- 1 | from airflow.utils.db import provide_session 2 | from airflow.models import Connection 3 | 4 | 5 | @provide_session 6 | def get_conn(conn_id, session=None): 7 | conn = ( 8 | session.query(Connection) 9 | .filter(Connection.conn_id == conn_id) 10 | .first()) 11 | return conn 12 | 13 | 14 | def redshift_auth(s3_conn_id): 15 | s3_conn = get_conn(s3_conn_id) 16 | aws_key = s3_conn.extra_dejson.get('aws_access_key_id') 17 | aws_secret = s3_conn.extra_dejson.get('aws_secret_access_key') 18 | return ("aws_access_key_id={0};aws_secret_access_key={1}" 19 | .format(aws_key, aws_secret)) 20 | -------------------------------------------------------------------------------- /airflow/dags/scripts/checkins.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_checkin_path = "s3://polakowo-yelp2/yelp_dataset/checkin.json" 7 | target_checkins_path = "s3://polakowo-yelp2/staging_data/checkins" 8 | 9 | checkin_df = spark.read.json(source_checkin_path) 10 | 11 | # Basically the same procedure as friends to get the table of pairs business_id:ts 12 | 13 | checkins_df = checkin_df.selectExpr("business_id", "date as ts")\ 14 | .withColumn("ts", F.explode(F.split(F.col("ts"), ", ")))\ 15 | .where("ts != '' and ts is not null")\ 16 | .withColumn("ts", F.to_timestamp("ts")) 17 | 18 | checkins_df.write.parquet(target_checkins_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/elite_years.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json" 7 | elite_years_path = "s3://polakowo-yelp2/staging_data/elite_years" 8 | 9 | user_df = spark.read.json(source_user_path) 10 | 11 | # The field elite is a comma-separated list of strings masked as a string. 12 | # Make a separate table out of it. 
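
# Illustrative sketch (not part of the original script): the explode(split(...))
# pattern used below, on made-up rows. ("u1", "2015,2016,2017") becomes the rows
# (u1, 2015), (u1, 2016), (u1, 2017); empty strings are filtered out so users
# without elite years contribute no rows. Assumes the `spark` session provided by
# the Livy/pyspark context; the helper is never called.
def _sketch_explode_csv_column(spark):
    from pyspark.sql import functions as F
    demo = spark.createDataFrame(
        [("u1", "2015,2016,2017"), ("u2", "")],
        ["user_id", "elite"])
    return (demo.withColumn("year", F.explode(F.split(F.col("elite"), ",")))
                .where("year != '' and year is not null")
                .select("user_id", F.col("year").cast("integer")))
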
13 | 14 | elite_years_df = user_df.select("user_id", "elite")\ 15 | .withColumn("year", F.explode(F.split(F.col("elite"), ",")))\ 16 | .where("year != '' and year is not null")\ 17 | .select(F.col("user_id"), F.col("year").cast("integer")) 18 | 19 | elite_years_df.write.parquet(elite_years_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/friends.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json" 7 | friends_path = "s3://polakowo-yelp2/staging_data/friends" 8 | 9 | user_df = spark.read.json(source_user_path) 10 | 11 | # Basically the same procedure as elite to get the table of user relationships. 12 | # Can take some time. 13 | 14 | friends_df = user_df.select("user_id", "friends")\ 15 | .withColumn("friend_id", F.explode(F.split(F.col("friends"), ", ")))\ 16 | .where("friend_id != '' and friend_id is not null")\ 17 | .select(F.col("user_id"), F.col("friend_id"))\ 18 | .distinct() 19 | 20 | friends_df.write.parquet(friends_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/__init__.py: -------------------------------------------------------------------------------- 1 | from airflow.plugins_manager import AirflowPlugin 2 | from redshift_plugin.operators.s3_to_redshift_operator import S3ToRedshiftOperator 3 | from redshift_plugin.operators.redshift_check_operator import (RedshiftCheckOperator, 4 | RedshiftValueCheckOperator, RedshiftIntervalCheckOperator) 5 | from redshift_plugin.macros.redshift_auth import redshift_auth 6 | 7 | 8 | class S3ToRedshiftPlugin(AirflowPlugin): 9 | name = "S3ToRedshiftPlugin" 10 | operators = [ 11 | S3ToRedshiftOperator, 12 | RedshiftCheckOperator, 13 | RedshiftValueCheckOperator, 14 | RedshiftIntervalCheckOperator 15 | ] 16 | # Leave in for explicitness 17 | hooks = [] 18 | executors = [] 19 | macros = [redshift_auth] 20 | admin_views = [] 21 | flask_blueprints = [] 22 | menu_links = [] 23 | -------------------------------------------------------------------------------- /airflow/dags/scripts/addresses.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | source_cities_path = "s3://polakowo-yelp2/staging_data/cities" 8 | target_addresses_path = "s3://polakowo-yelp2/staging_data/addresses" 9 | 10 | business_df = spark.read.json(source_business_path) 11 | cities_df = spark.read.parquet(source_cities_path) 12 | 13 | # Pull address information from business.json, but instead of city take newly created city_id. 
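
# Illustrative sketch (not part of the original script): swapping the natural key
# (city, state_code) for the surrogate city_id produced by cities.py, on made-up
# rows. The left join keeps an address even if its city is missing from the
# dimension table. Assumes the `spark` session from the Livy/pyspark context;
# the helper is never called.
def _sketch_surrogate_key_swap(spark):
    from pyspark.sql import functions as F
    biz = spark.createDataFrame(
        [("123 Main St", 33.4, -112.0, "85001", "Phoenix", "AZ")],
        ["address", "latitude", "longitude", "postal_code", "city", "state_code"])
    cities = spark.createDataFrame(
        [("Phoenix", "AZ", 1)], ["city", "state_code", "city_id"])
    return (biz.join(cities, ["city", "state_code"], how="left")
               .drop("city", "state_code")
               .distinct()
               .withColumn("address_id", F.monotonically_increasing_id()))
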
14 | 15 | addresses_df = business_df.selectExpr("address", "latitude", "longitude", "postal_code", "city", "state as state_code")\ 16 | .join(cities_df.select("city", "state_code", "city_id"), ["city", "state_code"], how='left')\ 17 | .drop("city", "state_code")\ 18 | .distinct()\ 19 | .withColumn("address_id", F.monotonically_increasing_id()) 20 | 21 | addresses_df.write.parquet(target_addresses_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/businesses.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | source_addresses_path = "s3://polakowo-yelp2/staging_data/addresses" 8 | target_businesses_path = "s3://polakowo-yelp2/staging_data/businesses" 9 | 10 | business_df = spark.read.json(source_business_path) 11 | addresses_df = spark.read.parquet(source_addresses_path) 12 | 13 | # Take any other information and write it into businesses table. 14 | 15 | businesses_df = business_df.join(addresses_df, (business_df["address"] == addresses_df["address"]) 16 | & (business_df["latitude"] == addresses_df["latitude"]) 17 | & (business_df["longitude"] == addresses_df["longitude"]) 18 | & (business_df["postal_code"] == addresses_df["postal_code"]), how="left")\ 19 | .selectExpr("business_id", "address_id", "cast(is_open as boolean)", "name", "review_count", "stars") 20 | 21 | businesses_df.write.parquet(target_businesses_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/subdags/data_quality_checks.py: -------------------------------------------------------------------------------- 1 | # from datetime import datetime, timedelta 2 | from airflow import DAG 3 | from airflow.operators.dummy_operator import DummyOperator 4 | from airflow.operators import RedshiftValueCheckOperator 5 | 6 | 7 | def data_quality_checks_subdag( 8 | parent_dag_id, 9 | dag_id, 10 | redshift_conn_id, 11 | check_definitions, 12 | *args, **kwargs): 13 | """Returns the SubDAG for performing data quality checks""" 14 | 15 | dag = DAG( 16 | f"{parent_dag_id}.{dag_id}", 17 | **kwargs 18 | ) 19 | 20 | start_operator = DummyOperator(dag=dag, task_id='start_operator') 21 | end_operator = DummyOperator(dag=dag, task_id='end_operator') 22 | 23 | for check in check_definitions: 24 | check_operator = RedshiftValueCheckOperator( 25 | dag=dag, 26 | task_id=check.get('task_id', None), 27 | redshift_conn_id="redshift", 28 | sql=check.get('sql', None), 29 | pass_value=check.get('pass_value', None), 30 | tolerance=check.get('tolerance', None) 31 | ) 32 | 33 | start_operator >> check_operator >> end_operator 34 | 35 | return dag -------------------------------------------------------------------------------- /airflow/dags/configs/check_definitions.yml: -------------------------------------------------------------------------------- 1 | - task_id: check_businesses_count 2 | sql: SELECT COUNT(*) FROM businesses 3 | pass_value: 196728 4 | - task_id: check_business_attributes_count 5 | sql: SELECT COUNT(*) FROM business_attributes 6 | pass_value: 192609 7 | - task_id: check_categories_count 8 | sql: SELECT COUNT(*) FROM categories 9 | pass_value: 1298 10 | - task_id: check_business_categories_count 11 | sql: SELECT COUNT(*) FROM business_categories 12 | 
pass_value: 788110 13 | - task_id: check_addresses_count 14 | sql: SELECT COUNT(*) FROM addresses 15 | pass_value: 178763 16 | - task_id: check_cities_count 17 | sql: SELECT COUNT(*) FROM cities 18 | pass_value: 1258 19 | - task_id: check_city_weather_count 20 | sql: SELECT COUNT(*) FROM city_weather 21 | pass_value: 15096 22 | - task_id: check_business_hours_count 23 | sql: SELECT COUNT(*) FROM business_hours 24 | pass_value: 192609 25 | - task_id: check_users_count 26 | sql: SELECT COUNT(*) FROM users 27 | pass_value: 1637138 28 | - task_id: check_elite_years_count 29 | sql: SELECT COUNT(*) FROM elite_years 30 | pass_value: 224499 31 | - task_id: check_friends_count 32 | sql: SELECT COUNT(*) FROM friends 33 | pass_value: 75531114 34 | - task_id: check_reviews_count 35 | sql: SELECT COUNT(*) FROM reviews 36 | pass_value: 6685900 37 | - task_id: check_checkins_count 38 | sql: SELECT COUNT(*) FROM checkins 39 | pass_value: 19089148 40 | - task_id: check_tips_count 41 | sql: SELECT COUNT(*) FROM tips 42 | pass_value: 1223094 43 | - task_id: check_photos_count 44 | sql: SELECT COUNT(*) FROM photos 45 | pass_value: 200000 -------------------------------------------------------------------------------- /airflow/dags/scripts/cities.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | source_demo_path = "s3://polakowo-yelp2/demo_dataset/us-cities-demographics.json" 8 | target_cities_path = "s3://polakowo-yelp2/staging_data/cities" 9 | 10 | business_df = spark.read.json(source_business_path) 11 | 12 | # Take city and state_code from the business.json and enrich them with demographics data. 13 | 14 | ################ 15 | # Demographics # 16 | ################ 17 | 18 | demo_df = spark.read.json(source_demo_path) 19 | 20 | # Each JSON object here seems to describe (1) the demographics of the city 21 | # and (2) the number of people belonging to some race (race and count fields). 22 | # Since each record is unique by city, state_code and race fields, while other 23 | # demographic fields are unique by only city and state_code (which means lots 24 | # of redundancy), we need to transform race column into columns corresponding 25 | # to each of its values via pivot function. 26 | 27 | def prepare_race(x): 28 | # We want to make each race a stand-alone column, thus each race value needs a proper naming 29 | return x.replace(" ", "_").replace("-", "_").lower() 30 | 31 | prepare_race_udf = F.udf(prepare_race, T.StringType()) 32 | 33 | # Group by all columns except race and count and convert race rows into columns 34 | demo_df = demo_df.select("fields.*")\ 35 | .withColumn("race", prepare_race_udf("race")) 36 | demo_df = demo_df.groupby(*set(demo_df.schema.names).difference(set(["race", "count"])))\ 37 | .pivot('race')\ 38 | .max('count') 39 | # Columns have a different order every now and then? 
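
# Illustrative sketch (not part of the original script): the groupby/pivot reshaping
# performed above, on made-up rows. One row per (city, race, count) becomes one row
# per city with a column per race; when pivot() has to infer the distinct race values
# itself, the order of the new columns is not guaranteed, which is why the schema is
# stabilised with a sorted select on the next line. Assumes the `spark` session from
# the Livy/pyspark context; the helper is never called.
def _sketch_pivot_race(spark):
    demo = spark.createDataFrame(
        [("Phoenix", "AZ", 1660272, "white", 1000000),
         ("Phoenix", "AZ", 1660272, "asian", 60000)],
        ["city", "state_code", "total_population", "race", "count"])
    pivoted = demo.groupby("city", "state_code", "total_population")\
        .pivot("race")\
        .max("count")
    return pivoted.select(*sorted(pivoted.columns))
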
40 | demo_df = demo_df.select(*sorted(demo_df.columns)) 41 | 42 | ########## 43 | # Cities # 44 | ########## 45 | 46 | # Merge city data with demographics data 47 | cities_df = business_df.selectExpr("city", "state as state_code")\ 48 | .distinct()\ 49 | .join(demo_df, ["city", "state_code"], how="left")\ 50 | .withColumn("city_id", F.monotonically_increasing_id()) 51 | 52 | cities_df.write.parquet(target_cities_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/business_hours.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | target_business_hours_path = "s3://polakowo-yelp2/staging_data/business_hours" 8 | 9 | business_df = spark.read.json(source_business_path) 10 | business_hours_df = business_df.select("business_id", "hours.*") 11 | 12 | # To enable efficient querying based on business hours, for each day of week, 13 | # split the time range string into "from" and "to" integers. 14 | # From 15 | # Row( 16 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ', 17 | # Monday=u'9:0-0:0', 18 | # Tuesday=u'9:0-0:0', 19 | # Wednesday=u'9:0-0:0', 20 | # Thursday=u'9:0-0:0', 21 | # Friday=u'9:0-1:0', 22 | # Saturday=u'9:0-1:0', 23 | # Sunday=u'9:0-0:0' 24 | # ) 25 | # To 26 | # Row( 27 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ', 28 | # Monday_from=900, 29 | # Monday_to=0, 30 | # Tuesday_from=900, 31 | # Tuesday_to=0, 32 | # Wednesday_from=900, 33 | # Wednesday_to=0, 34 | # Thursday_from=900, 35 | # Thursday_to=0, 36 | # Friday_from=900, 37 | # Friday_to=100, 38 | # Saturday_from=900, 39 | # Saturday_to=100, 40 | # Sunday_from=900, 41 | # Sunday_to=0 42 | # ) 43 | 44 | def parse_hours(x): 45 | # Take "9:0-0:0" (9am-00am) and transform it into {from: 900, to: 0} 46 | if x is None: 47 | return None 48 | convert_to_int = lambda x: int(x.split(':')[0]) * 100 + int(x.split(':')[1]) 49 | return { 50 | "from": convert_to_int(x.split('-')[0]), 51 | "to": convert_to_int(x.split('-')[1]) 52 | } 53 | 54 | parse_hours_udf = F.udf(parse_hours, T.StructType([ 55 | T.StructField('from', T.IntegerType(), nullable=True), 56 | T.StructField('to', T.IntegerType(), nullable=True) 57 | ])) 58 | 59 | hour_attrs = [ 60 | "Monday", 61 | "Tuesday", 62 | "Wednesday", 63 | "Thursday", 64 | "Friday", 65 | "Saturday", 66 | "Sunday", 67 | ] 68 | 69 | for attr in hour_attrs: 70 | business_hours_df = business_hours_df.withColumn(attr, parse_hours_udf(attr))\ 71 | .selectExpr("*", attr+".from as "+attr+"_from", attr+".to as "+attr+"_to")\ 72 | .drop(attr) 73 | 74 | business_hours_df.write.parquet(target_business_hours_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/subdags/spark_jobs.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.operators.dummy_operator import DummyOperator 3 | from airflow.operators import LivySparkOperator 4 | 5 | 6 | def spark_jobs_subdag( 7 | parent_dag_id, 8 | dag_id, 9 | http_conn_id, 10 | session_kind, 11 | *args, **kwargs): 12 | """Returns the SubDAG for processing data with Spark""" 13 | 14 | dag = DAG( 15 | f"{parent_dag_id}.{dag_id}", 16 | **kwargs 17 | ) 18 | 19 | start_operator = DummyOperator(dag=dag, task_id='start_operator') 20 
| end_operator = DummyOperator(dag=dag, task_id='end_operator') 21 | 22 | def create_task(script_name): 23 | """Returns an operator that executes the Spark script under the passed name""" 24 | 25 | with open(f'/Users/olegpolakow/airflow/dags/scripts/{script_name}.py', 'r') as f: 26 | spark_script = f.read() 27 | 28 | return LivySparkOperator( 29 | dag=dag, 30 | task_id=f"{script_name}_script", 31 | spark_script=spark_script, 32 | http_conn_id=http_conn_id, 33 | session_kind=session_kind) 34 | 35 | business_hours = create_task("business_hours") 36 | business_attributes = create_task("business_attributes") 37 | cities = create_task("cities") 38 | addresses = create_task("addresses") 39 | business_categories = create_task("business_categories") 40 | businesses = create_task("businesses") 41 | reviews = create_task("reviews") 42 | users = create_task("users") 43 | elite_years = create_task("elite_years") 44 | friends = create_task("friends") 45 | checkins = create_task("checkins") 46 | tips = create_task("tips") 47 | photos = create_task("photos") 48 | city_weather = create_task("city_weather") 49 | 50 | # Specify relationships between operators 51 | start_operator >> cities >> addresses >> businesses >> end_operator 52 | start_operator >> cities >> city_weather >> end_operator 53 | start_operator >> business_hours >> end_operator 54 | start_operator >> business_attributes >> end_operator 55 | start_operator >> business_categories >> end_operator 56 | start_operator >> reviews >> end_operator 57 | start_operator >> users >> end_operator 58 | start_operator >> elite_years >> end_operator 59 | start_operator >> friends >> end_operator 60 | start_operator >> checkins >> end_operator 61 | start_operator >> tips >> end_operator 62 | start_operator >> photos >> end_operator 63 | 64 | return dag 65 | -------------------------------------------------------------------------------- /airflow/dags/scripts/business_categories.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | target_categories_path = "s3://polakowo-yelp2/staging_data/categories" 8 | target_business_categories_path = "s3://polakowo-yelp2/staging_data/business_categories" 9 | 10 | business_df = spark.read.json(source_business_path) 11 | 12 | ############## 13 | # Categories # 14 | ############## 15 | 16 | # First, create a list of unique categories and assign each of them an id. 
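
# Illustrative sketch (not part of the original script): the dimension/bridge split
# built in this file, on made-up rows with an already-parsed array column. A distinct
# list of categories gets a surrogate category_id, and a bridge table maps each
# business_id to the matching category_id. Assumes the `spark` session from the
# Livy/pyspark context; the helper is never called.
def _sketch_category_tables(spark):
    from pyspark.sql import functions as F
    demo = spark.createDataFrame(
        [("b1", ["Restaurants", "Mexican"]), ("b2", ["Restaurants"])],
        ["business_id", "categories"])
    categories = (demo.select(F.explode("categories").alias("category"))
                      .dropDuplicates()
                      .withColumn("category_id", F.monotonically_increasing_id()))
    business_categories = (demo.select("business_id", F.explode("categories").alias("category"))
                               .join(categories, "category", how="left")
                               .drop("category"))
    return categories, business_categories
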
17 | 18 | import re 19 | def parse_categories(categories): 20 | # Convert comma separated list of strings masked as a string into a native list type 21 | if categories is None: 22 | return [] 23 | parsed = [] 24 | # Some strings contain commas, so they have to be extracted beforehand 25 | require_attention = set(["Wills, Trusts, & Probates"]) 26 | for s in require_attention: 27 | if categories.find(s) > -1: 28 | parsed.append(s) 29 | categories = categories.replace(s, "") 30 | return list(filter(None, parsed + re.split(r",\s*", categories))) 31 | 32 | parse_categories_udf = F.udf(parse_categories, T.ArrayType(T.StringType())) 33 | business_categories_df = business_df.select("business_id", "categories")\ 34 | .withColumn("categories", parse_categories_udf("categories")) 35 | 36 | # Convert the list of categories in each row into a set of rows 37 | categories_df = business_categories_df.select(F.explode("categories").alias("category"))\ 38 | .dropDuplicates()\ 39 | .sort("category")\ 40 | .withColumn("category_id", F.monotonically_increasing_id()) 41 | 42 | categories_df.write.parquet(target_categories_path, mode="overwrite") 43 | 44 | ####################### 45 | # Business categories # 46 | ####################### 47 | 48 | # For each record in business.json, convert list of categories in categories field into rows of pairs business_id-category_id. 49 | 50 | import re 51 | def zip_categories(business_id, categories): 52 | # For each value in categories, zip it with business_id to form a pair 53 | return list(zip([business_id] * len(categories), categories)) 54 | 55 | zip_categories_udf = F.udf(zip_categories, T.ArrayType(T.ArrayType(T.StringType()))) 56 | 57 | # Zip business_id's and categories and extract them into a new table called business_catagories 58 | business_categories_df = business_categories_df.select(F.explode(zip_categories_udf("business_id", "categories")).alias("cols"))\ 59 | .selectExpr("cols[0] as business_id", "cols[1] as category")\ 60 | .dropDuplicates() 61 | business_categories_df = business_categories_df.join(categories_df, business_categories_df["category"] == categories_df["category"], how="left")\ 62 | .drop("category") 63 | 64 | business_categories_df.write.parquet(target_business_categories_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/main.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.operators.dummy_operator import DummyOperator 3 | from airflow.operators.subdag_operator import SubDagOperator 4 | from airflow.operators.postgres_operator import PostgresOperator 5 | 6 | from subdags.copy_to_redshift import copy_to_redshift_subdag 7 | from subdags.data_quality_checks import data_quality_checks_subdag 8 | from subdags.spark_jobs import spark_jobs_subdag 9 | 10 | from datetime import datetime, timedelta 11 | import os 12 | import yaml 13 | 14 | start_date = datetime.now() - timedelta(days=2) 15 | 16 | default_args = { 17 | 'owner': "polakowo", 18 | 'start_date': start_date, 19 | 'catchup': False, 20 | 'depends_on_past': False, 21 | 'retries': 0 22 | } 23 | 24 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 25 | 26 | dag = DAG(DAG_ID, 27 | default_args=default_args, 28 | description="Extracts Yelp data from S3, transforms it into tables with Spark, and loads into Redshift", 29 | schedule_interval=None, 30 | max_active_runs=1) 31 | 32 | start_operator = DummyOperator(dag=dag, 
task_id='start_operator') 33 | 34 | # Create the SubDAG for transforming data with Spark 35 | subdag_id = "spark_jobs" 36 | spark_jobs = SubDagOperator( 37 | subdag=spark_jobs_subdag( 38 | parent_dag_id=DAG_ID, 39 | dag_id=subdag_id, 40 | http_conn_id="livy_http_conn", 41 | session_kind="pyspark", 42 | start_date=start_date), 43 | task_id=subdag_id, 44 | dag=dag) 45 | 46 | # Read table definitions from YAML file 47 | with open('/Users/olegpolakow/airflow/dags/configs/table_definitions.yml', 'r') as f: 48 | table_definitions = yaml.safe_load(f) 49 | 50 | # Create the SubDAG for copying S3 tables into Redshift 51 | subdag_id = "copy_to_redshift" 52 | s3_to_redshift = SubDagOperator( 53 | subdag=copy_to_redshift_subdag( 54 | parent_dag_id=DAG_ID, 55 | dag_id=subdag_id, 56 | table_definitions=table_definitions, 57 | redshift_conn_id='redshift', 58 | redshift_schema='public', 59 | s3_conn_id='aws_credentials', 60 | s3_bucket='polakowo-yelp2/staging_data', 61 | load_type='rebuild', 62 | schema_location='Local', 63 | start_date=start_date), 64 | task_id=subdag_id, 65 | dag=dag) 66 | 67 | # Read check definitions from YAML file 68 | with open('/Users/olegpolakow/airflow/dags/configs/check_definitions.yml', 'r') as f: 69 | check_definitions = yaml.safe_load(f) 70 | 71 | # Create the SubDAG for performing data quality checks 72 | subdag_id = "data_quality_checks" 73 | data_quality_checks = SubDagOperator( 74 | subdag=data_quality_checks_subdag( 75 | parent_dag_id=DAG_ID, 76 | dag_id=subdag_id, 77 | redshift_conn_id='redshift', 78 | check_definitions=check_definitions, 79 | start_date=start_date), 80 | task_id=subdag_id, 81 | dag=dag) 82 | 83 | end_operator = DummyOperator(dag=dag, task_id='end_operator') 84 | 85 | # Specify relationships between operators 86 | start_operator >> spark_jobs >> s3_to_redshift >> data_quality_checks >> end_operator 87 | -------------------------------------------------------------------------------- /airflow/dags/subdags/copy_to_redshift.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.operators.dummy_operator import DummyOperator 3 | from airflow.operators import S3ToRedshiftOperator 4 | 5 | def copy_to_redshift_subdag( 6 | parent_dag_id, 7 | dag_id, 8 | table_definitions, 9 | redshift_conn_id, 10 | redshift_schema, 11 | s3_conn_id, 12 | s3_bucket, 13 | load_type, 14 | schema_location, 15 | *args, **kwargs): 16 | """Returns the SubDAG for copying S3 tables into Redshift""" 17 | 18 | dag = DAG( 19 | f"{parent_dag_id}.{dag_id}", 20 | **kwargs 21 | ) 22 | 23 | start_operator = DummyOperator(dag=dag, task_id='start_operator') 24 | end_operator = DummyOperator(dag=dag, task_id='end_operator') 25 | 26 | def get_table(table_name): 27 | """Returns the table under the passed name""" 28 | 29 | for table in table_definitions: 30 | if table.get('table_name', None) == table_name: 31 | return table 32 | 33 | def create_task(table): 34 | """Returns an operator for copying the table into Redshift""" 35 | 36 | return S3ToRedshiftOperator( 37 | dag=dag, 38 | task_id=f"copy_{table.get('table_name', None)}_to_redshift", 39 | redshift_conn_id=redshift_conn_id, 40 | redshift_schema=redshift_schema, 41 | table=table.get('table_name', None), 42 | s3_conn_id=s3_conn_id, 43 | s3_bucket=s3_bucket, 44 | s3_key=table.get('s3_key', None), 45 | load_type=load_type, 46 | copy_params=table.get('copy_params', None), 47 | origin_schema=table.get('origin_schema', None), 48 | primary_key=table.get('primary_key', None), 49 | 
foreign_key=table.get('foreign_key', {}), 50 | schema_location=schema_location) 51 | 52 | businesses = create_task(get_table("businesses")) 53 | business_attributes = create_task(get_table("business_attributes")) 54 | categories = create_task(get_table("categories")) 55 | business_categories = create_task(get_table("business_categories")) 56 | addresses = create_task(get_table("addresses")) 57 | cities = create_task(get_table("cities")) 58 | city_weather = create_task(get_table("city_weather")) 59 | business_hours = create_task(get_table("business_hours")) 60 | reviews = create_task(get_table("reviews")) 61 | users = create_task(get_table("users")) 62 | elite_years = create_task(get_table("elite_years")) 63 | friends = create_task(get_table("friends")) 64 | checkins = create_task(get_table("checkins")) 65 | tips = create_task(get_table("tips")) 66 | photos = create_task(get_table("photos")) 67 | 68 | # We could execute the entire YAML file in parallel 69 | # But let's respect the referential integrity 70 | # Look at the UML diagram to build the acyclic graph of references 71 | 72 | start_operator >> cities 73 | start_operator >> categories 74 | start_operator >> users 75 | 76 | cities >> addresses 77 | cities >> city_weather >> end_operator 78 | 79 | addresses >> businesses 80 | 81 | businesses >> business_attributes >> end_operator 82 | businesses >> business_categories >> end_operator 83 | businesses >> business_hours >> end_operator 84 | businesses >> checkins >> end_operator 85 | businesses >> photos >> end_operator 86 | businesses >> tips >> end_operator 87 | businesses >> reviews >> end_operator 88 | 89 | categories >> business_categories >> end_operator 90 | 91 | users >> reviews >> end_operator 92 | users >> tips >> end_operator 93 | users >> friends >> end_operator 94 | users >> elite_years >> end_operator 95 | 96 | return dag 97 | -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/operators/redshift_check_operator.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Licensed to the Apache Software Foundation (ASF) under one 4 | # or more contributor license agreements. See the NOTICE file 5 | # distributed with this work for additional information 6 | # regarding copyright ownership. The ASF licenses this file 7 | # to you under the Apache License, Version 2.0 (the 8 | # "License"); you may not use this file except in compliance 9 | # with the License. You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, 14 | # software distributed under the License is distributed on an 15 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | # KIND, either express or implied. See the License for the 17 | # specific language governing permissions and limitations 18 | # under the License. 19 | from typing import Any, Dict 20 | 21 | from airflow.hooks.postgres_hook import PostgresHook 22 | from airflow.operators.check_operator import CheckOperator, \ 23 | ValueCheckOperator, IntervalCheckOperator 24 | from airflow.utils.decorators import apply_defaults 25 | 26 | 27 | class RedshiftCheckOperator(CheckOperator): 28 | """ 29 | Performs checks against Redshift. The ``RedshiftCheckOperator`` expects 30 | a sql query that will return a single row. Each value on that 31 | first row is evaluated using python ``bool`` casting. 
If any of the 32 | values return ``False`` the check is failed and errors out. 33 | Note that Python bool casting evals the following as ``False``: 34 | * ``False`` 35 | * ``0`` 36 | * Empty string (``""``) 37 | * Empty list (``[]``) 38 | * Empty dictionary or set (``{}``) 39 | Given a query like ``SELECT COUNT(*) FROM foo``, it will fail only if 40 | the count ``== 0``. You can craft much more complex query that could, 41 | for instance, check that the table has the same number of rows as 42 | the source table upstream, or that the count of today's partition is 43 | greater than yesterday's partition, or that a set of metrics are less 44 | than 3 standard deviation for the 7 day average. 45 | This operator can be used as a data quality check in your pipeline, and 46 | depending on where you put it in your DAG, you have the choice to 47 | stop the critical path, preventing from 48 | publishing dubious data, or on the side and receive email alerts 49 | without stopping the progress of the DAG. 50 | :param sql: the sql to be executed 51 | :type sql: str 52 | :param redshift_conn_id: reference to the Redshift database 53 | :type redshift_conn_id: str 54 | """ 55 | 56 | @apply_defaults 57 | def __init__( 58 | self, 59 | sql: str, 60 | redshift_conn_id: str = 'redshift_default', 61 | *args, **kwargs) -> None: 62 | super().__init__(sql=sql, *args, **kwargs) 63 | 64 | self.redshift_conn_id = redshift_conn_id 65 | self.sql = sql 66 | 67 | def get_db_hook(self): 68 | return PostgresHook(postgres_conn_id=self.redshift_conn_id) 69 | 70 | 71 | class RedshiftValueCheckOperator(ValueCheckOperator): 72 | """ 73 | Performs a simple value check using sql code. 74 | :param sql: the sql to be executed 75 | :type sql: str 76 | :param redshift_conn_id: reference to the Redshift database 77 | :type redshift_conn_id: str 78 | """ 79 | 80 | @apply_defaults 81 | def __init__( 82 | self, 83 | sql: str, 84 | pass_value: Any, 85 | tolerance: Any = None, 86 | redshift_conn_id: str = 'redshift_default', 87 | *args, **kwargs): 88 | super().__init__( 89 | sql=sql, pass_value=pass_value, tolerance=tolerance, 90 | *args, **kwargs) 91 | self.redshift_conn_id = redshift_conn_id 92 | 93 | def get_db_hook(self): 94 | return PostgresHook(postgres_conn_id=self.redshift_conn_id) 95 | 96 | 97 | class RedshiftIntervalCheckOperator(IntervalCheckOperator): 98 | """ 99 | Checks that the values of metrics given as SQL expressions are within 100 | a certain tolerance of the ones from days_back before. 101 | :param table: the table name 102 | :type table: str 103 | :param days_back: number of days between ds and the ds we want to check 104 | against. 
Defaults to 7 days 105 | :type days_back: int 106 | :param metrics_threshold: a dictionary of ratios indexed by metrics 107 | :type metrics_threshold: dict 108 | :param redshift_conn_id: reference to the Redshift database 109 | :type redshift_conn_id: str 110 | """ 111 | 112 | @apply_defaults 113 | def __init__( 114 | self, 115 | table: str, 116 | metrics_thresholds: Dict, 117 | date_filter_column: str = 'ds', 118 | days_back: int = -7, 119 | redshift_conn_id: str = 'redshift_default', 120 | *args, **kwargs): 121 | super().__init__( 122 | table=table, metrics_thresholds=metrics_thresholds, 123 | date_filter_column=date_filter_column, days_back=days_back, 124 | *args, **kwargs) 125 | self.redshift_conn_id = redshift_conn_id 126 | 127 | def get_db_hook(self): 128 | return PostgresHook(postgres_conn_id=self.redshift_conn_id) -------------------------------------------------------------------------------- /airflow/dags/scripts/business_attributes.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import functions as F 2 | from pyspark.sql import types as T 3 | from pyspark.sql import Window, Row 4 | 5 | # File paths 6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json" 7 | target_business_attributes_path = "s3://polakowo-yelp2/staging_data/business_attributes" 8 | 9 | business_df = spark.read.json(source_business_path) 10 | business_attributes_df = business_df.select("business_id", "attributes.*") 11 | 12 | # Unfold deep nested field attributes into a new table. 13 | 14 | ################## 15 | # Parse booleans # 16 | ################## 17 | 18 | # From 19 | # Row(AcceptsInsurance=None), 20 | # Row(AcceptsInsurance=u'None'), 21 | # Row(AcceptsInsurance=u'False'), 22 | # Row(AcceptsInsurance=u'True') 23 | # To 24 | # Row(AcceptsInsurance=None), 25 | # Row(AcceptsInsurance=None), 26 | # Row(AcceptsInsurance=False) 27 | # Row(AcceptsInsurance=True) 28 | 29 | def parse_boolean(x): 30 | # Convert boolean strings to native boolean format 31 | if x is None or x == 'None': 32 | return None 33 | if x == 'True': 34 | return True 35 | if x == 'False': 36 | return False 37 | 38 | parse_boolean_udf = F.udf(parse_boolean, T.BooleanType()) 39 | 40 | bool_attrs = [ 41 | "AcceptsInsurance", 42 | "BYOB", 43 | "BikeParking", 44 | "BusinessAcceptsBitcoin", 45 | "BusinessAcceptsCreditCards", 46 | "ByAppointmentOnly", 47 | "Caters", 48 | "CoatCheck", 49 | "Corkage", 50 | "DogsAllowed", 51 | "DriveThru", 52 | "GoodForDancing", 53 | "GoodForKids", 54 | "HappyHour", 55 | "HasTV", 56 | "Open24Hours", 57 | "OutdoorSeating", 58 | "RestaurantsCounterService", 59 | "RestaurantsDelivery", 60 | "RestaurantsGoodForGroups", 61 | "RestaurantsReservations", 62 | "RestaurantsTableService", 63 | "RestaurantsTakeOut", 64 | "WheelchairAccessible" 65 | ] 66 | 67 | for attr in bool_attrs: 68 | business_attributes_df = business_attributes_df.withColumn(attr, parse_boolean_udf(attr)) 69 | 70 | ################# 71 | # Parse strings # 72 | ################# 73 | 74 | # From 75 | # Row(AgesAllowed=None), 76 | # Row(AgesAllowed=u'None'), 77 | # Row(AgesAllowed=u"u'18plus'"), 78 | # Row(AgesAllowed=u"u'19plus'") 79 | # Row(AgesAllowed=u"u'21plus'"), 80 | # Row(AgesAllowed=u"u'allages'"), 81 | # To 82 | # Row(AgesAllowed=None), 83 | # Row(AgesAllowed=u'none'), 84 | # Row(AgesAllowed=u'18plus') 85 | # Row(AgesAllowed=u'19plus'), 86 | # Row(AgesAllowed=u'21plus'), 87 | # Row(AgesAllowed=u'allages'), 88 | 89 | 90 | 91 | def parse_string(x): 92 | # Clean and 
standardize strings 93 | # Do not cast "None" into None since it has a special meaning 94 | if x is None or x == '': 95 | return None 96 | # Some strings are of format u"u'string'" 97 | return x.replace("u'", "").replace("'", "").lower() 98 | 99 | parse_string_udf = F.udf(parse_string, T.StringType()) 100 | 101 | str_attrs = [ 102 | "AgesAllowed", 103 | "Alcohol", 104 | "BYOBCorkage", 105 | "NoiseLevel", 106 | "RestaurantsAttire", 107 | "Smoking", 108 | "WiFi", 109 | ] 110 | 111 | for attr in str_attrs: 112 | business_attributes_df = business_attributes_df.withColumn(attr, parse_string_udf(attr)) 113 | 114 | ################## 115 | # Parse integers # 116 | ################## 117 | 118 | # From 119 | # Row(RestaurantsPriceRange2=u'None'), 120 | # Row(RestaurantsPriceRange2=None), 121 | # Row(RestaurantsPriceRange2=u'1'), 122 | # Row(RestaurantsPriceRange2=u'2')] 123 | # Row(RestaurantsPriceRange2=u'3'), 124 | # Row(RestaurantsPriceRange2=u'4'), 125 | # To 126 | # Row(RestaurantsPriceRange2=None), 127 | # Row(RestaurantsPriceRange2=None), 128 | # Row(RestaurantsPriceRange2=1), 129 | # Row(RestaurantsPriceRange2=2) 130 | # Row(RestaurantsPriceRange2=3), 131 | # Row(RestaurantsPriceRange2=4), 132 | 133 | def parse_integer(x): 134 | # Convert integers masked as strings to native integer format 135 | if x is None or x == 'None': 136 | return None 137 | return int(x) 138 | 139 | parse_integer_udf = F.udf(parse_integer, T.IntegerType()) 140 | 141 | int_attrs = [ 142 | "RestaurantsPriceRange2", 143 | ] 144 | 145 | for attr in int_attrs: 146 | business_attributes_df = business_attributes_df.withColumn(attr, parse_integer_udf(attr)) 147 | 148 | ####################### 149 | # Parse boolean dicts # 150 | ####################### 151 | 152 | # From 153 | # Row( 154 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ', 155 | # Ambience=u"{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}" 156 | # ) 157 | # To 158 | # Row( 159 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ', 160 | # Ambience_romantic=False, 161 | # Ambience_intimate=False, 162 | # Ambience_classy=False, 163 | # Ambience_hipster=False, 164 | # Ambience_divey=False, 165 | # Ambience_touristy=False, 166 | # Ambience_trendy=False, 167 | # Ambience_upscale=False, 168 | # Ambience_casual=True 169 | # ) 170 | 171 | import ast 172 | 173 | def parse_boolean_dict(x): 174 | # Convert dicts masked as strings to string:boolean format 175 | if x is None or x == 'None' or x == '': 176 | return None 177 | return ast.literal_eval(x) 178 | 179 | parse_boolean_dict_udf = F.udf(parse_boolean_dict, T.MapType(T.StringType(), T.BooleanType())) 180 | 181 | bool_dict_attrs = [ 182 | "Ambience", 183 | "BestNights", 184 | "BusinessParking", 185 | "DietaryRestrictions", 186 | "GoodForMeal", 187 | "HairSpecializesIn", 188 | "Music" 189 | ] 190 | 191 | for attr in bool_dict_attrs: 192 | business_attributes_df = business_attributes_df.withColumn(attr, parse_boolean_dict_udf(attr)) 193 | # Get all keys of the MapType 194 | # [Row(key=u'romantic'), Row(key=u'casual'), ... 
195 | key_rows = business_attributes_df.select(F.explode(attr)).select("key").distinct().collect() 196 | # Convert each key into column (with proper name) 197 | exprs = ["{}['{}'] as {}".format(attr, row.key, attr+"_"+row.key.replace('-', '_')) for row in key_rows] 198 | business_attributes_df = business_attributes_df.selectExpr("*", *exprs).drop(attr) 199 | 200 | business_attributes_df.write.parquet(target_business_attributes_path, mode="overwrite") -------------------------------------------------------------------------------- /airflow/dags/scripts/city_weather.py: -------------------------------------------------------------------------------- 1 | """Take city and state_code from the business.json and enrich them with demographics data.""" 2 | 3 | from pyspark.sql import functions as F 4 | from pyspark.sql import types as T 5 | from pyspark.sql import Window, Row 6 | 7 | # File paths 8 | source_cities_path = "s3://polakowo-yelp2/staging_data/cities" 9 | source_city_attr_path = "s3://polakowo-yelp2/weather_dataset/city_attributes.csv" 10 | source_weather_temp_path = "s3://polakowo-yelp2/weather_dataset/temperature.csv" 11 | source_weather_desc_path = "s3://polakowo-yelp2/weather_dataset/weather_description.csv" 12 | target_city_weather_path = "s3://polakowo-yelp2/staging_data/city_weather" 13 | 14 | cities_df = spark.read.parquet(source_cities_path) 15 | 16 | ################### 17 | # City attributes # 18 | ################### 19 | 20 | # Get the names of US cities supported by this dataset and assign to each a city_id. 21 | # Requires reading the table cities. 22 | 23 | city_attr_df = spark.read\ 24 | .format('csv')\ 25 | .option("header", "true")\ 26 | .option("delimiter", ",")\ 27 | .load(source_city_attr_path) 28 | 29 | # We only want the list of US cities 30 | cities = city_attr_df.where("Country = 'United States'")\ 31 | .select("City")\ 32 | .distinct()\ 33 | .rdd.flatMap(lambda x: x)\ 34 | .collect() 35 | 36 | # Weather dataset doesn't provide us with the respective state codes though. 37 | # How do we know whether "Phoenix" is in AZ or TX? 38 | # The most appropriate solution is finding the biggest city. 39 | # Let's find out which of those cities are referenced in Yelp dataset and relevant to us. 40 | # Use Google or any other API. 41 | 42 | weather_cities_df = [ 43 | Row(city='Phoenix', state_code='AZ'), 44 | Row(city='Dallas', state_code='TX'), 45 | Row(city='Los Angeles', state_code='CA'), 46 | Row(city='San Diego', state_code='CA'), 47 | Row(city='Pittsburgh', state_code='PA'), 48 | Row(city='Las Vegas', state_code='NV'), 49 | Row(city='Seattle', state_code='WA'), 50 | Row(city='New York', state_code='NY'), 51 | Row(city='Charlotte', state_code='NC'), 52 | Row(city='Denver', state_code='CO'), 53 | Row(city='Boston', state_code='MA') 54 | ] 55 | weather_cities_schema = T.StructType([ 56 | T.StructField("city", T.StringType()), 57 | T.StructField("state_code", T.StringType()) 58 | ]) 59 | weather_cities_df = spark.createDataFrame(weather_cities_df, schema=weather_cities_schema) 60 | 61 | # Join with the cities dataset to find matches 62 | weather_cities_df = cities_df.join(weather_cities_df, ["city", "state_code"])\ 63 | .select("city", "city_id")\ 64 | .distinct() 65 | 66 | ################ 67 | # Temperatures # 68 | ################ 69 | 70 | # Read temperaturs recorded hourly, transform them into daily averages, and filter by our cities. 71 | # Also, cities are columns, so transform them into rows. 
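
# Illustrative sketch (not part of the original script): the wide-to-long reshaping
# and daily averaging described above, on two made-up hourly readings. The
# loop-and-union approach below produces the same long format; SQL stack() is shown
# here only as one compact alternative way to unpivot. Assumes the `spark` session
# from the Livy/pyspark context; the helper is never called.
def _sketch_daily_average(spark):
    from pyspark.sql import functions as F
    demo = spark.createDataFrame(
        [("2012-10-01 12:00:00", 301.1, 285.0),
         ("2012-10-01 13:00:00", 302.3, 286.2)],
        ["datetime", "Phoenix", "Boston"])
    long_df = demo.withColumn("date", F.substring("datetime", 1, 10))\
        .selectExpr("date", "stack(2, 'Phoenix', Phoenix, 'Boston', Boston) as (city, temperature)")
    return long_df.groupBy("date", "city").agg(F.mean("temperature").alias("avg_temperature"))
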
72 | 73 | weather_temp_df = spark.read\ 74 | .format('csv')\ 75 | .option("header", "true")\ 76 | .option("delimiter", ",")\ 77 | .load(source_weather_temp_path) 78 | 79 | # Extract date string from time string to be able to group by day 80 | weather_temp_df = weather_temp_df.select("datetime", *cities)\ 81 | .withColumn("date", F.substring("datetime", 0, 10))\ 82 | .drop("datetime") 83 | 84 | # For data quality check 85 | import numpy as np 86 | phoenix_rows = weather_temp_df.where("Phoenix is not null and date = '2012-10-01'").select("Phoenix").collect() 87 | phoenix_mean_temp = np.mean([float(row.Phoenix) for row in phoenix_rows]) 88 | 89 | # To transform city columns into rows, transform each city individually and union all dataframes 90 | temp_df = None 91 | for city in cities: 92 | # Get average temperature in Fahrenheit for each day and city 93 | df = weather_temp_df.select("date", city)\ 94 | .withColumnRenamed(city, "temperature")\ 95 | .withColumn("temperature", F.col("temperature").cast("double"))\ 96 | .withColumn("city", F.lit(city))\ 97 | .groupBy("date", "city")\ 98 | .agg(F.mean("temperature").alias("avg_temperature")) 99 | if temp_df is None: 100 | temp_df = df 101 | else: 102 | temp_df = temp_df.union(df) 103 | weather_temp_df = temp_df 104 | 105 | # Speed up further joins 106 | weather_temp_df = weather_temp_df.repartition(1).cache() 107 | weather_temp_df.count() 108 | 109 | phoenix_mean_temp2 = weather_temp_df.where("city = 'Phoenix' and date = '2012-10-01'").collect()[0].avg_temperature 110 | assert(phoenix_mean_temp == phoenix_mean_temp2) 111 | # If we pass, the calculations are done correctly 112 | 113 | ######################## 114 | # Weather descriptions # 115 | ######################## 116 | 117 | # Read weather descriptions recorded hourly, pick the most frequent one on each day, and filter by our cities. 118 | # The same as for temperatures, transform columns into rows. 
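
# Illustrative sketch (not part of the original script): the "most frequent value per
# group" pattern used below, on made-up rows. Occurrences are counted per
# (date, city, weather_description) and only the top-counted row of each (date, city)
# window is kept. Assumes the `spark` session from the Livy/pyspark context; the
# helper is never called.
def _sketch_most_frequent_description(spark):
    from pyspark.sql import functions as F
    from pyspark.sql import Window
    demo = spark.createDataFrame(
        [("2012-12-10", "Phoenix", "sky is clear"),
         ("2012-12-10", "Phoenix", "sky is clear"),
         ("2012-12-10", "Phoenix", "mist")],
        ["date", "city", "weather_description"])
    window = Window.partitionBy("date", "city").orderBy(F.desc("count"))
    return (demo.groupBy("date", "city", "weather_description")
                .count()
                .withColumn("order", F.row_number().over(window))
                .where(F.col("order") == 1)
                .drop("count", "order"))
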
119 | 120 | weather_desc_df = spark.read\ 121 | .format('csv')\ 122 | .option("header", "true")\ 123 | .option("delimiter", ",")\ 124 | .load(source_weather_desc_path) 125 | 126 | # Extract date string from time string to be able to group by day 127 | weather_desc_df = weather_desc_df.select("datetime", *cities)\ 128 | .withColumn("date", F.substring("datetime", 0, 10))\ 129 | .drop("datetime") 130 | 131 | # For data quality check 132 | from collections import Counter 133 | phoenix_rows = weather_desc_df.where("Phoenix is not null and date = '2012-12-10'").select("Phoenix").collect() 134 | phoenix_most_common_weather = Counter([row.Phoenix for row in phoenix_rows]).most_common()[0][0] 135 | 136 | # To transform city columns into rows, transform each city individually and union all dataframes 137 | temp_df = None 138 | for city in cities: 139 | # Get the most frequent description for each day and city 140 | window = Window.partitionBy("date", "city").orderBy(F.desc("count")) 141 | df = weather_desc_df.select("date", city)\ 142 | .withColumnRenamed(city, "weather_description")\ 143 | .withColumn("city", F.lit(city))\ 144 | .groupBy("date", "city", "weather_description")\ 145 | .count()\ 146 | .withColumn("order", F.row_number().over(window))\ 147 | .where(F.col("order") == 1)\ 148 | .drop("count", "order") 149 | if temp_df is None: 150 | temp_df = df 151 | else: 152 | temp_df = temp_df.union(df) 153 | weather_desc_df = temp_df 154 | 155 | # Speed up further joins 156 | weather_desc_df = weather_desc_df.repartition(1).cache() 157 | weather_desc_df.count() 158 | 159 | phoenix_most_common_weather2 = weather_desc_df.where("city = 'Phoenix' and date = '2012-12-10'").collect()[0].weather_description 160 | assert(phoenix_most_common_weather == phoenix_most_common_weather2) 161 | # If we pass, the calculations are done correctly 162 | 163 | ################ 164 | # City weather # 165 | ################ 166 | 167 | # What was the weather in the city when the particular review was posted? 168 | # Join weather description with temperature, and keep only city ids which are present in Yelp. 
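
# Illustrative sketch (not part of the original script): the final assembly below, on
# made-up rows. Temperature and description are matched on (city, date), the Yelp
# city_id is attached, the city name is dropped, and the date string is cast to a
# proper date so it can later be matched against review timestamps. Assumes the
# `spark` session from the Livy/pyspark context; the helper is never called.
def _sketch_city_weather(spark):
    from pyspark.sql import functions as F
    temp = spark.createDataFrame([("Phoenix", "2012-10-01", 301.7)],
                                 ["city", "date", "avg_temperature"])
    desc = spark.createDataFrame([("Phoenix", "2012-10-01", "sky is clear")],
                                 ["city", "date", "weather_description"])
    ids = spark.createDataFrame([("Phoenix", 42)], ["city", "city_id"])
    return (temp.join(desc, ["city", "date"])
                .join(ids, "city")
                .drop("city")
                .distinct()
                .withColumn("date", F.to_date("date")))
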
169 | city_weather_df = weather_temp_df.join(weather_desc_df, ["city", "date"])\ 170 | .join(weather_cities_df, "city")\ 171 | .drop("city")\ 172 | .distinct()\ 173 | .withColumn("date", F.to_date("date")) 174 | 175 | city_weather_df.write.parquet(target_city_weather_path, mode="overwrite") -------------------------------------------------------------------------------- /delete_redshift_cluster.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import boto3" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import configparser\n", 19 | "config = configparser.ConfigParser()\n", 20 | "config.read_file(open('dwh.cfg'))\n", 21 | "\n", 22 | "# Load params from configuration file\n", 23 | "KEY = config.get('AWS', 'KEY')\n", 24 | "SECRET = config.get('AWS', 'SECRET')\n", 25 | "DWH_CLUSTER_IDENTIFIER = config.get(\"DWH\", \"DWH_CLUSTER_IDENTIFIER\")\n", 26 | "DWH_IAM_ROLE_NAME = config.get(\"DWH\", \"DWH_IAM_ROLE_NAME\")" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 3, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Create clients\n", 36 | "iam = boto3.client(\n", 37 | " 'iam',aws_access_key_id=KEY,\n", 38 | " aws_secret_access_key=SECRET,\n", 39 | " region_name='us-west-2'\n", 40 | ")\n", 41 | "redshift = boto3.client(\n", 42 | " 'redshift',\n", 43 | " region_name=\"us-west-2\",\n", 44 | " aws_access_key_id=KEY,\n", 45 | " aws_secret_access_key=SECRET\n", 46 | ")" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 4, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "{'Cluster': {'ClusterIdentifier': 'dwhcluster',\n", 58 | " 'NodeType': 'dc2.large',\n", 59 | " 'ClusterStatus': 'deleting',\n", 60 | " 'ClusterAvailabilityStatus': 'Modifying',\n", 61 | " 'MasterUsername': 'dwhuser',\n", 62 | " 'DBName': 'dwh',\n", 63 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n", 64 | " 'Port': 5439},\n", 65 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n", 66 | " 'AutomatedSnapshotRetentionPeriod': 1,\n", 67 | " 'ManualSnapshotRetentionPeriod': -1,\n", 68 | " 'ClusterSecurityGroups': [],\n", 69 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n", 70 | " 'Status': 'active'}],\n", 71 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n", 72 | " 'ParameterApplyStatus': 'in-sync'}],\n", 73 | " 'ClusterSubnetGroupName': 'default',\n", 74 | " 'VpcId': 'vpc-cdb609b5',\n", 75 | " 'AvailabilityZone': 'us-west-2a',\n", 76 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n", 77 | " 'PendingModifiedValues': {},\n", 78 | " 'ClusterVersion': '1.0',\n", 79 | " 'AllowVersionUpgrade': True,\n", 80 | " 'NumberOfNodes': 1,\n", 81 | " 'PubliclyAccessible': True,\n", 82 | " 'Encrypted': False,\n", 83 | " 'Tags': [],\n", 84 | " 'EnhancedVpcRouting': False,\n", 85 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n", 86 | " 'ApplyStatus': 'in-sync'}],\n", 87 | " 'MaintenanceTrackName': 'current',\n", 88 | " 'DeferredMaintenanceWindows': []},\n", 89 | " 'ResponseMetadata': {'RequestId': 'd6e25b1a-c086-11e9-96f7-cfa2f664abf8',\n", 90 | " 'HTTPStatusCode': 200,\n", 91 | " 'HTTPHeaders': {'x-amzn-requestid': 
'd6e25b1a-c086-11e9-96f7-cfa2f664abf8',\n", 92 | " 'content-type': 'text/xml',\n", 93 | " 'content-length': '2290',\n", 94 | " 'vary': 'Accept-Encoding',\n", 95 | " 'date': 'Sat, 17 Aug 2019 00:34:57 GMT'},\n", 96 | " 'RetryAttempts': 0}}" 97 | ] 98 | }, 99 | "execution_count": 4, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "{'ClusterIdentifier': 'dwhcluster',\n", 117 | " 'NodeType': 'dc2.large',\n", 118 | " 'ClusterStatus': 'deleting',\n", 119 | " 'ClusterAvailabilityStatus': 'Modifying',\n", 120 | " 'MasterUsername': 'dwhuser',\n", 121 | " 'DBName': 'dwh',\n", 122 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n", 123 | " 'Port': 5439},\n", 124 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n", 125 | " 'AutomatedSnapshotRetentionPeriod': 1,\n", 126 | " 'ManualSnapshotRetentionPeriod': -1,\n", 127 | " 'ClusterSecurityGroups': [],\n", 128 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n", 129 | " 'Status': 'active'}],\n", 130 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n", 131 | " 'ParameterApplyStatus': 'in-sync'}],\n", 132 | " 'ClusterSubnetGroupName': 'default',\n", 133 | " 'VpcId': 'vpc-cdb609b5',\n", 134 | " 'AvailabilityZone': 'us-west-2a',\n", 135 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n", 136 | " 'PendingModifiedValues': {},\n", 137 | " 'ClusterVersion': '1.0',\n", 138 | " 'AllowVersionUpgrade': True,\n", 139 | " 'NumberOfNodes': 1,\n", 140 | " 'PubliclyAccessible': True,\n", 141 | " 'Encrypted': False,\n", 142 | " 'ClusterPublicKey': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbKhq1KiPJZqupL2t1GFp0catgp9xjoCUJaSdBZmEdZmW7Z6xBdwvXfM7w9TIRvvz5cZxXh9Oq9qkDp1U+q/X5tW6vDyVD7UjHzjL+QyPop8AogOE2hZgHi05DtRADvKgGLgdryxlWFOClWAoxAwEa7XtqcfZlE+KK02lB62YBfUeqI6BYTeQgOFcCd6WrggusDtz7QaJ2eFJ2fvT+FFlXySdH0YYuEIpCmFuFI0JdliX0N4euowwWu1nQH6QSA8ILofFHSjk9QGVtJVMO1GsxxamEoBfJrTpquoVa6u2xma2XdW4JbNt0zxnCcVIW6kwhAND+iBwZ6fMY0uJQBL/j Amazon-Redshift\\n',\n", 143 | " 'ClusterNodes': [{'NodeRole': 'SHARED',\n", 144 | " 'PrivateIPAddress': '172.31.17.98',\n", 145 | " 'PublicIPAddress': '52.24.252.63'}],\n", 146 | " 'ClusterRevisionNumber': '9041',\n", 147 | " 'Tags': [],\n", 148 | " 'EnhancedVpcRouting': False,\n", 149 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n", 150 | " 'ApplyStatus': 'in-sync'}],\n", 151 | " 'MaintenanceTrackName': 'current',\n", 152 | " 'DeferredMaintenanceWindows': []}" 153 | ] 154 | }, 155 | "execution_count": 5, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/plain": [ 172 | "{'ResponseMetadata': {'RequestId': 'd9be23f1-bd54-11e9-b584-d392791cf985',\n", 173 | " 'HTTPStatusCode': 200,\n", 174 | " 'HTTPHeaders': {'x-amzn-requestid': 'd9be23f1-bd54-11e9-b584-d392791cf985',\n", 175 | " 'content-type': 'text/xml',\n", 176 | " 'content-length': '200',\n", 177 | " 'date': 'Mon, 12 Aug 2019 
22:59:33 GMT'},\n", 178 | " 'RetryAttempts': 0}}" 179 | ] 180 | }, 181 | "execution_count": 6, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | } 185 | ], 186 | "source": [ 187 | "iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn=\"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\")\n", 188 | "iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [] 197 | } 198 | ], 199 | "metadata": { 200 | "kernelspec": { 201 | "display_name": "Python 3", 202 | "language": "python", 203 | "name": "python3" 204 | }, 205 | "language_info": { 206 | "codemirror_mode": { 207 | "name": "ipython", 208 | "version": 3 209 | }, 210 | "file_extension": ".py", 211 | "mimetype": "text/x-python", 212 | "name": "python", 213 | "nbconvert_exporter": "python", 214 | "pygments_lexer": "ipython3", 215 | "version": "3.7.3" 216 | } 217 | }, 218 | "nbformat": 4, 219 | "nbformat_minor": 4 220 | } 221 | -------------------------------------------------------------------------------- /create_redshift_cluster.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import boto3\n", 10 | "import json" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import configparser\n", 20 | "config = configparser.ConfigParser()\n", 21 | "config.read_file(open('dwh.cfg'))\n", 22 | "\n", 23 | "# Load params from configuration file\n", 24 | "KEY = config.get('AWS', 'KEY')\n", 25 | "SECRET = config.get('AWS', 'SECRET')\n", 26 | "DWH_CLUSTER_TYPE = config.get(\"DWH\", \"DWH_CLUSTER_TYPE\")\n", 27 | "DWH_NUM_NODES = config.get(\"DWH\", \"DWH_NUM_NODES\")\n", 28 | "DWH_NODE_TYPE = config.get(\"DWH\", \"DWH_NODE_TYPE\")\n", 29 | "DWH_CLUSTER_IDENTIFIER = config.get(\"DWH\", \"DWH_CLUSTER_IDENTIFIER\")\n", 30 | "DWH_IAM_ROLE_NAME = config.get(\"DWH\", \"DWH_IAM_ROLE_NAME\")\n", 31 | "DB_NAME = config.get('DB', \"DB_NAME\")\n", 32 | "DB_USER = config.get('DB', \"DB_USER\")\n", 33 | "DB_PASSWORD = config.get('DB', \"DB_PASSWORD\")\n", 34 | "DB_PORT = config.get('DB', \"DB_PORT\")" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Create clients\n", 44 | "ec2 = boto3.resource(\n", 45 | " 'ec2',\n", 46 | " region_name=\"us-west-2\",\n", 47 | " aws_access_key_id=KEY,\n", 48 | " aws_secret_access_key=SECRET\n", 49 | ")\n", 50 | "iam = boto3.client(\n", 51 | " 'iam',\n", 52 | " aws_access_key_id=KEY,\n", 53 | " aws_secret_access_key=SECRET,\n", 54 | " region_name='us-west-2'\n", 55 | ")\n", 56 | "redshift = boto3.client(\n", 57 | " 'redshift',\n", 58 | " region_name=\"us-west-2\",\n", 59 | " aws_access_key_id=KEY,\n", 60 | " aws_secret_access_key=SECRET\n", 61 | ")" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "1.1 Creating a new IAM Role\n", 74 | "An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name dwhRole already exists.\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "from botocore.exceptions import ClientError\n", 80 | "\n", 81 | "# Create an IAM Role that makes 
Redshift able to access S3 bucket (ReadOnly)\n", 82 | "try:\n", 83 | " print(\"1.1 Creating a new IAM Role\") \n", 84 | " dwhRole = iam.create_role(\n", 85 | " Path='/',\n", 86 | " RoleName=DWH_IAM_ROLE_NAME,\n", 87 | " Description=\"Allows Redshift clusters to call AWS services on your behalf.\",\n", 88 | " AssumeRolePolicyDocument=json.dumps({\n", 89 | " 'Statement': [{\n", 90 | " 'Action': 'sts:AssumeRole',\n", 91 | " 'Effect': 'Allow',\n", 92 | " 'Principal': {\n", 93 | " 'Service': 'redshift.amazonaws.com'\n", 94 | " }\n", 95 | " }]\n", 96 | " })\n", 97 | " ) \n", 98 | "except Exception as e:\n", 99 | " print(e)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "1.2 Attaching Policy\n" 112 | ] 113 | }, 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "200" 118 | ] 119 | }, 120 | "execution_count": 5, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "# Attach Policy\n", 127 | "print(\"1.2 Attaching Policy\")\n", 128 | "iam.attach_role_policy(\n", 129 | " RoleName=DWH_IAM_ROLE_NAME,\n", 130 | " PolicyArn=\"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\"\n", 131 | ")['ResponseMetadata']['HTTPStatusCode']" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 6, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "1.3 Get the IAM role ARN\n", 144 | "arn:aws:iam::953225455667:role/dwhRole\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "# Get and print the IAM role ARN\n", 150 | "print(\"1.3 Get the IAM role ARN\")\n", 151 | "roleArn = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']\n", 152 | "\n", 153 | "print(roleArn)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "# Create Redshift cluster\n", 163 | "try:\n", 164 | " response = redshift.create_cluster( \n", 165 | " #Cluster\n", 166 | " ClusterType=DWH_CLUSTER_TYPE,\n", 167 | " NodeType=DWH_NODE_TYPE,\n", 168 | "\n", 169 | " #Identifiers & Credentials\n", 170 | " DBName=DB_NAME,\n", 171 | " ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,\n", 172 | " MasterUsername=DB_USER,\n", 173 | " MasterUserPassword=DB_PASSWORD,\n", 174 | " \n", 175 | " #Roles (for s3 access)\n", 176 | " IamRoles=[roleArn] \n", 177 | " )\n", 178 | "except Exception as e:\n", 179 | " print(e)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 13, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "{'ClusterIdentifier': 'dwhcluster',\n", 191 | " 'NodeType': 'dc2.large',\n", 192 | " 'ClusterStatus': 'available',\n", 193 | " 'ClusterAvailabilityStatus': 'Unavailable',\n", 194 | " 'MasterUsername': 'dwhuser',\n", 195 | " 'DBName': 'dwh',\n", 196 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n", 197 | " 'Port': 5439},\n", 198 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n", 199 | " 'AutomatedSnapshotRetentionPeriod': 1,\n", 200 | " 'ManualSnapshotRetentionPeriod': -1,\n", 201 | " 'ClusterSecurityGroups': [],\n", 202 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n", 203 | " 'Status': 'active'}],\n", 204 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n", 205 | " 
'ParameterApplyStatus': 'in-sync'}],\n", 206 | " 'ClusterSubnetGroupName': 'default',\n", 207 | " 'VpcId': 'vpc-cdb609b5',\n", 208 | " 'AvailabilityZone': 'us-west-2a',\n", 209 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n", 210 | " 'PendingModifiedValues': {},\n", 211 | " 'ClusterVersion': '1.0',\n", 212 | " 'AllowVersionUpgrade': True,\n", 213 | " 'NumberOfNodes': 1,\n", 214 | " 'PubliclyAccessible': True,\n", 215 | " 'Encrypted': False,\n", 216 | " 'ClusterPublicKey': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbKhq1KiPJZqupL2t1GFp0catgp9xjoCUJaSdBZmEdZmW7Z6xBdwvXfM7w9TIRvvz5cZxXh9Oq9qkDp1U+q/X5tW6vDyVD7UjHzjL+QyPop8AogOE2hZgHi05DtRADvKgGLgdryxlWFOClWAoxAwEa7XtqcfZlE+KK02lB62YBfUeqI6BYTeQgOFcCd6WrggusDtz7QaJ2eFJ2fvT+FFlXySdH0YYuEIpCmFuFI0JdliX0N4euowwWu1nQH6QSA8ILofFHSjk9QGVtJVMO1GsxxamEoBfJrTpquoVa6u2xma2XdW4JbNt0zxnCcVIW6kwhAND+iBwZ6fMY0uJQBL/j Amazon-Redshift\\n',\n", 217 | " 'ClusterNodes': [{'NodeRole': 'SHARED',\n", 218 | " 'PrivateIPAddress': '172.31.17.98',\n", 219 | " 'PublicIPAddress': '52.24.252.63'}],\n", 220 | " 'ClusterRevisionNumber': '9041',\n", 221 | " 'Tags': [],\n", 222 | " 'EnhancedVpcRouting': False,\n", 223 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n", 224 | " 'ApplyStatus': 'in-sync'}],\n", 225 | " 'MaintenanceTrackName': 'current',\n", 226 | " 'DeferredMaintenanceWindows': []}" 227 | ] 228 | }, 229 | "execution_count": 13, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "# Run this block several times until the cluster status becomes available\n", 236 | "cluster_props = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]\n", 237 | "cluster_props" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 14, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "DB_HOST :: dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com\n", 250 | "ROLE_ARN :: arn:aws:iam::953225455667:role/dwhRole\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "DB_HOST = cluster_props['Endpoint']['Address']\n", 256 | "ROLE_ARN = cluster_props['IamRoles'][0]['IamRoleArn']\n", 257 | "\n", 258 | "# Save back to config\n", 259 | "config.set('DB_ACCESS', 'DB_HOST', DB_HOST)\n", 260 | "config.set('DB_ACCESS', 'ROLE_ARN', ROLE_ARN)\n", 261 | "\n", 262 | "print(\"DB_HOST ::\", DB_HOST)\n", 263 | "print(\"ROLE_ARN ::\", ROLE_ARN)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 15, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "ec2.SecurityGroup(id='sg-057a93d5984de3064')\n", 276 | "An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule \"peer: 0.0.0.0/0, TCP, from port: 5439, to port: 5439, ALLOW\" already exists\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "# Open an incoming TCP port to access the cluster endpoint\n", 282 | "try:\n", 283 | " vpc = ec2.Vpc(id=cluster_props['VpcId'])\n", 284 | " defaultSg = list(vpc.security_groups.all())[0]\n", 285 | " print(defaultSg)\n", 286 | " defaultSg.authorize_ingress(\n", 287 | " GroupName=defaultSg.group_name,\n", 288 | " CidrIp='0.0.0.0/0',\n", 289 | " IpProtocol='TCP',\n", 290 | " FromPort=int(DB_PORT),\n", 291 | " ToPort=int(DB_PORT)\n", 292 | " )\n", 293 | "except Exception as e:\n", 294 | " print(e)" 295 | ] 296 | }, 297 | { 298 | "cell_type": 
"code", 299 | "execution_count": 16, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "%load_ext sql" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 17, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "postgresql://dwhuser:Passw0rd@dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com:5439/dwh\n" 316 | ] 317 | }, 318 | { 319 | "data": { 320 | "text/plain": [ 321 | "'Connected: dwhuser@dwh'" 322 | ] 323 | }, 324 | "execution_count": 17, 325 | "metadata": {}, 326 | "output_type": "execute_result" 327 | } 328 | ], 329 | "source": [ 330 | "# Make sure you can connect to the cluster\n", 331 | "conn_string=\"postgresql://{}:{}@{}:{}/{}\".format(DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME)\n", 332 | "print(conn_string)\n", 333 | "%sql $conn_string" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [] 342 | } 343 | ], 344 | "metadata": { 345 | "kernelspec": { 346 | "display_name": "Python 3", 347 | "language": "python", 348 | "name": "python3" 349 | }, 350 | "language_info": { 351 | "codemirror_mode": { 352 | "name": "ipython", 353 | "version": 3 354 | }, 355 | "file_extension": ".py", 356 | "mimetype": "text/x-python", 357 | "name": "python", 358 | "nbconvert_exporter": "python", 359 | "pygments_lexer": "ipython3", 360 | "version": "3.7.3" 361 | } 362 | }, 363 | "nbformat": 4, 364 | "nbformat_minor": 4 365 | } 366 | -------------------------------------------------------------------------------- /airflow/dags/configs/table_definitions.yml: -------------------------------------------------------------------------------- 1 | # businesses 2 | - table_name: businesses 3 | s3_key: businesses 4 | copy_params: 5 | - FORMAT AS PARQUET 6 | origin_schema: 7 | - name: business_id 8 | type: varchar(22) 9 | - name: address_id 10 | type: bigint 11 | - name: is_open 12 | type: boolean 13 | - name: name 14 | type: varchar(256) 15 | - name: review_count 16 | type: bigint 17 | - name: stars 18 | type: float 19 | primary_key: business_id 20 | foreign_key: 21 | - column_name: address_id 22 | reftable: addresses 23 | ref_column: address_id 24 | 25 | # business_attributes 26 | - table_name: business_attributes 27 | s3_key: business_attributes 28 | copy_params: 29 | - FORMAT AS PARQUET 30 | origin_schema: 31 | - name: business_id 32 | type: varchar(22) 33 | - name: AcceptsInsurance 34 | type: boolean 35 | - name: AgesAllowed 36 | type: varchar(7) 37 | - name: Alcohol 38 | type: varchar(13) 39 | - name: BYOB 40 | type: boolean 41 | - name: BYOBCorkage 42 | type: varchar(11) 43 | - name: BikeParking 44 | type: boolean 45 | - name: BusinessAcceptsBitcoin 46 | type: boolean 47 | - name: BusinessAcceptsCreditCards 48 | type: boolean 49 | - name: ByAppointmentOnly 50 | type: boolean 51 | - name: Caters 52 | type: boolean 53 | - name: CoatCheck 54 | type: boolean 55 | - name: Corkage 56 | type: boolean 57 | - name: DogsAllowed 58 | type: boolean 59 | - name: DriveThru 60 | type: boolean 61 | - name: GoodForDancing 62 | type: boolean 63 | - name: GoodForKids 64 | type: boolean 65 | - name: HappyHour 66 | type: boolean 67 | - name: HasTV 68 | type: boolean 69 | - name: NoiseLevel 70 | type: varchar(9) 71 | - name: Open24Hours 72 | type: boolean 73 | - name: OutdoorSeating 74 | type: boolean 75 | - name: RestaurantsAttire 76 | type: varchar(6) 77 | - name: RestaurantsCounterService 78 | type: boolean 79 | - 
name: RestaurantsDelivery 80 | type: boolean 81 | - name: RestaurantsGoodForGroups 82 | type: boolean 83 | - name: RestaurantsPriceRange2 84 | type: integer 85 | - name: RestaurantsReservations 86 | type: boolean 87 | - name: RestaurantsTableService 88 | type: boolean 89 | - name: RestaurantsTakeOut 90 | type: boolean 91 | - name: Smoking 92 | type: varchar(7) 93 | - name: WheelchairAccessible 94 | type: boolean 95 | - name: WiFi 96 | type: varchar(4) 97 | - name: Ambience_romantic 98 | type: boolean 99 | - name: Ambience_casual 100 | type: boolean 101 | - name: Ambience_trendy 102 | type: boolean 103 | - name: Ambience_intimate 104 | type: boolean 105 | - name: Ambience_hipster 106 | type: boolean 107 | - name: Ambience_upscale 108 | type: boolean 109 | - name: Ambience_divey 110 | type: boolean 111 | - name: Ambience_touristy 112 | type: boolean 113 | - name: Ambience_classy 114 | type: boolean 115 | - name: BestNights_sunday 116 | type: boolean 117 | - name: BestNights_thursday 118 | type: boolean 119 | - name: BestNights_monday 120 | type: boolean 121 | - name: BestNights_wednesday 122 | type: boolean 123 | - name: BestNights_saturday 124 | type: boolean 125 | - name: BestNights_friday 126 | type: boolean 127 | - name: BestNights_tuesday 128 | type: boolean 129 | - name: BusinessParking_valet 130 | type: boolean 131 | - name: BusinessParking_lot 132 | type: boolean 133 | - name: BusinessParking_validated 134 | type: boolean 135 | - name: BusinessParking_garage 136 | type: boolean 137 | - name: BusinessParking_street 138 | type: boolean 139 | - name: DietaryRestrictions_kosher 140 | type: boolean 141 | - name: DietaryRestrictions_dairy_free 142 | type: boolean 143 | - name: DietaryRestrictions_vegan 144 | type: boolean 145 | - name: DietaryRestrictions_vegetarian 146 | type: boolean 147 | - name: DietaryRestrictions_gluten_free 148 | type: boolean 149 | - name: DietaryRestrictions_soy_free 150 | type: boolean 151 | - name: DietaryRestrictions_halal 152 | type: boolean 153 | - name: GoodForMeal_lunch 154 | type: boolean 155 | - name: GoodForMeal_brunch 156 | type: boolean 157 | - name: GoodForMeal_dinner 158 | type: boolean 159 | - name: GoodForMeal_latenight 160 | type: boolean 161 | - name: GoodForMeal_dessert 162 | type: boolean 163 | - name: GoodForMeal_breakfast 164 | type: boolean 165 | - name: HairSpecializesIn_curly 166 | type: boolean 167 | - name: HairSpecializesIn_asian 168 | type: boolean 169 | - name: HairSpecializesIn_perms 170 | type: boolean 171 | - name: HairSpecializesIn_africanamerican 172 | type: boolean 173 | - name: HairSpecializesIn_straightperms 174 | type: boolean 175 | - name: HairSpecializesIn_kids 176 | type: boolean 177 | - name: HairSpecializesIn_coloring 178 | type: boolean 179 | - name: HairSpecializesIn_extensions 180 | type: boolean 181 | - name: Music_no_music 182 | type: boolean 183 | - name: Music_dj 184 | type: boolean 185 | - name: Music_live 186 | type: boolean 187 | - name: Music_karaoke 188 | type: boolean 189 | - name: Music_video 190 | type: boolean 191 | - name: Music_background_music 192 | type: boolean 193 | - name: Music_jukebox 194 | type: boolean 195 | primary_key: business_id 196 | foreign_key: 197 | - column_name: business_id 198 | reftable: businesses 199 | ref_column: business_id 200 | 201 | # categories 202 | - table_name: categories 203 | s3_key: categories 204 | copy_params: 205 | - FORMAT AS PARQUET 206 | origin_schema: 207 | - name: category 208 | type: varchar(35) 209 | - name: category_id 210 | type: bigint 211 | primary_key: 
category_id 212 | 213 | # business_categories 214 | - table_name: business_categories 215 | s3_key: business_categories 216 | copy_params: 217 | - FORMAT AS PARQUET 218 | origin_schema: 219 | - name: business_id 220 | type: varchar(22) 221 | - name: category_id 222 | type: bigint 223 | primary_key: 224 | - business_id 225 | - category_id 226 | foreign_key: 227 | - column_name: business_id 228 | reftable: businesses 229 | ref_column: business_id 230 | - column_name: category_id 231 | reftable: categories 232 | ref_column: category_id 233 | 234 | # addresses 235 | - table_name: addresses 236 | s3_key: addresses 237 | copy_params: 238 | - FORMAT AS PARQUET 239 | origin_schema: 240 | - name: address 241 | type: varchar(256) 242 | - name: latitude 243 | type: float 244 | - name: longitude 245 | type: float 246 | - name: postal_code 247 | type: varchar(8) 248 | - name: city_id 249 | type: bigint 250 | - name: address_id 251 | type: bigint 252 | primary_key: address_id 253 | foreign_key: 254 | - column_name: city_id 255 | reftable: cities 256 | ref_column: city_id 257 | 258 | # cities 259 | - table_name: cities 260 | s3_key: cities 261 | copy_params: 262 | - FORMAT AS PARQUET 263 | origin_schema: 264 | - name: city 265 | type: varchar(50) 266 | - name: state_code 267 | type: varchar(3) 268 | - name: american_indian_and_alaska_native 269 | type: bigint 270 | - name: asian 271 | type: bigint 272 | - name: average_household_size 273 | type: float 274 | - name: black_or_african_american 275 | type: bigint 276 | - name: female_population 277 | type: bigint 278 | - name: foreign_born 279 | type: bigint 280 | - name: hispanic_or_latino 281 | type: bigint 282 | - name: male_population 283 | type: bigint 284 | - name: median_age 285 | type: float 286 | - name: number_of_veterans 287 | type: bigint 288 | - name: state 289 | type: varchar(14) 290 | - name: total_population 291 | type: bigint 292 | - name: white 293 | type: bigint 294 | - name: city_id 295 | type: bigint 296 | primary_key: city_id 297 | 298 | # city_weather 299 | - table_name: city_weather 300 | s3_key: city_weather 301 | copy_params: 302 | - FORMAT AS PARQUET 303 | origin_schema: 304 | - name: date 305 | type: date 306 | - name: avg_temperature 307 | type: float 308 | - name: weather_description 309 | type: varchar(23) 310 | - name: city_id 311 | type: bigint 312 | primary_key: 313 | - city_id 314 | - date 315 | foreign_key: 316 | - column_name: city_id 317 | reftable: cities 318 | ref_column: city_id 319 | 320 | # business_hours 321 | - table_name: business_hours 322 | s3_key: business_hours 323 | copy_params: 324 | - FORMAT AS PARQUET 325 | origin_schema: 326 | - name: business_id 327 | type: varchar(22) 328 | - name: Monday_from 329 | type: int 330 | - name: Monday_to 331 | type: int 332 | - name: Tuesday_from 333 | type: int 334 | - name: Tuesday_to 335 | type: int 336 | - name: Wednesday_from 337 | type: int 338 | - name: Wednesday_to 339 | type: int 340 | - name: Thursday_from 341 | type: int 342 | - name: Thursday_to 343 | type: int 344 | - name: Friday_from 345 | type: int 346 | - name: Friday_to 347 | type: int 348 | - name: Saturday_from 349 | type: int 350 | - name: Saturday_to 351 | type: int 352 | - name: Sunday_from 353 | type: int 354 | - name: Sunday_to 355 | type: int 356 | primary_key: business_id 357 | foreign_key: 358 | - column_name: business_id 359 | reftable: businesses 360 | ref_column: business_id 361 | 362 | # users 363 | - table_name: users 364 | s3_key: users 365 | copy_params: 366 | - FORMAT AS PARQUET 367 | 
origin_schema: 368 | - name: average_stars 369 | type: float 370 | - name: compliment_cool 371 | type: bigint 372 | - name: compliment_cute 373 | type: bigint 374 | - name: compliment_funny 375 | type: bigint 376 | - name: compliment_hot 377 | type: bigint 378 | - name: compliment_list 379 | type: bigint 380 | - name: compliment_more 381 | type: bigint 382 | - name: compliment_note 383 | type: bigint 384 | - name: compliment_photos 385 | type: bigint 386 | - name: compliment_plain 387 | type: bigint 388 | - name: compliment_profile 389 | type: bigint 390 | - name: compliment_writer 391 | type: bigint 392 | - name: cool 393 | type: bigint 394 | - name: fans 395 | type: bigint 396 | - name: funny 397 | type: bigint 398 | - name: name 399 | type: varchar(256) 400 | - name: review_count 401 | type: bigint 402 | - name: useful 403 | type: bigint 404 | - name: user_id 405 | type: varchar(22) 406 | - name: yelping_since 407 | type: timestamp 408 | primary_key: user_id 409 | 410 | # elite_years 411 | - table_name: elite_years 412 | s3_key: elite_years 413 | copy_params: 414 | - FORMAT AS PARQUET 415 | origin_schema: 416 | - name: user_id 417 | type: varchar(22) 418 | - name: year 419 | type: int 420 | primary_key: 421 | - user_id 422 | - year 423 | foreign_key: 424 | - column_name: user_id 425 | reftable: users 426 | ref_column: user_id 427 | 428 | # friends 429 | - table_name: friends 430 | s3_key: friends 431 | copy_params: 432 | - FORMAT AS PARQUET 433 | origin_schema: 434 | - name: user_id 435 | type: varchar(22) 436 | - name: friend_id 437 | type: varchar(22) 438 | primary_key: 439 | - user_id 440 | - friend_id 441 | foreign_key: 442 | - column_name: user_id 443 | reftable: users 444 | ref_column: user_id 445 | - column_name: friend_id 446 | reftable: users 447 | ref_column: user_id 448 | 449 | # reviews 450 | - table_name: reviews 451 | s3_key: reviews 452 | copy_params: 453 | - FORMAT AS PARQUET 454 | origin_schema: 455 | - name: business_id 456 | type: varchar(22) 457 | - name: cool 458 | type: bigint 459 | - name: ts 460 | type: timestamp 461 | - name: funny 462 | type: bigint 463 | - name: review_id 464 | type: varchar(22) 465 | - name: stars 466 | type: float 467 | - name: text 468 | type: varchar(20000) 469 | - name: useful 470 | type: bigint 471 | - name: user_id 472 | type: varchar(22) 473 | primary_key: review_id 474 | foreign_key: 475 | - column_name: business_id 476 | reftable: businesses 477 | ref_column: business_id 478 | - column_name: user_id 479 | reftable: users 480 | ref_column: user_id 481 | 482 | # checkins 483 | - table_name: checkins 484 | s3_key: checkins 485 | copy_params: 486 | - FORMAT AS PARQUET 487 | origin_schema: 488 | - name: business_id 489 | type: varchar(22) 490 | - name: ts 491 | type: timestamp 492 | primary_key: 493 | - business_id 494 | - ts 495 | foreign_key: 496 | - column_name: business_id 497 | reftable: businesses 498 | ref_column: business_id 499 | 500 | # tips 501 | - table_name: tips 502 | s3_key: tips 503 | copy_params: 504 | - FORMAT AS PARQUET 505 | origin_schema: 506 | - name: business_id 507 | type: varchar(22) 508 | - name: compliment_count 509 | type: bigint 510 | - name: ts 511 | type: timestamp 512 | - name: text 513 | type: varchar(2000) 514 | - name: user_id 515 | type: varchar(22) 516 | - name: tip_id 517 | type: bigint 518 | primary_key: tip_id 519 | foreign_key: 520 | - column_name: business_id 521 | reftable: businesses 522 | ref_column: business_id 523 | - column_name: user_id 524 | reftable: users 525 | ref_column: user_id 526 | 
527 | # photos 528 | - table_name: photos 529 | s3_key: photos 530 | copy_params: 531 | - FORMAT AS PARQUET 532 | origin_schema: 533 | - name: business_id 534 | type: varchar(22) 535 | - name: caption 536 | type: varchar(560) 537 | - name: label 538 | type: varchar(7) 539 | - name: photo_id 540 | type: varchar(22) 541 | primary_key: photo_id 542 | foreign_key: 543 | - column_name: business_id 544 | reftable: businesses 545 | ref_column: business_id -------------------------------------------------------------------------------- /airflow/plugins/spark_plugin/operators/spark_operator.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from airflow.plugins_manager import AirflowPlugin 16 | from airflow.hooks import HttpHook 17 | from airflow.models import BaseOperator 18 | from airflow.operators import BashOperator 19 | from airflow.utils import apply_defaults 20 | import logging 21 | import textwrap 22 | import time 23 | import json 24 | 25 | 26 | class SparkSubmitOperator(BashOperator): 27 | """ 28 | An operator which executes the spark-submit command through Airflow. This operator accepts all the desired 29 | arguments and assembles the spark-submit command which is then executed by the BashOperator. 30 | :param application_file: Path to a bundled jar including your application 31 | and all dependencies. The URL must be globally visible inside of 32 | your cluster, for instance, an hdfs:// path or a file:// path 33 | that is present on all nodes. 34 | :type application_file: string 35 | :param main_class: The entry point for your application 36 | (e.g. org.apache.spark.examples.SparkPi) 37 | :type main_class: string 38 | :param master: The master value for the cluster. 39 | (e.g. spark://23.195.26.187:7077 or yarn-client) 40 | :type master: string 41 | :param conf: Dictionary consisting of arbitrary Spark configuration properties. 42 | (e.g. {"spark.eventLog.enabled": "false", 43 | "spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"} 44 | :type conf: dict 45 | :param deploy_mode: Whether to deploy your driver on the worker nodes 46 | (cluster) or locally as an external client (default: client) 47 | :type deploy_mode: string 48 | :param other_spark_options: Other options you would like to pass to 49 | the spark submit command that isn't covered by the current 50 | options. (e.g. --files /path/to/file.xml) 51 | :type other_spark_options: string 52 | :param application_args: Arguments passed to the main method of your 53 | main class, if any. 54 | :type application_args: string 55 | :param xcom_push: If xcom_push is True, the last line written to stdout 56 | will also be pushed to an XCom when the bash command completes. 
57 | :type xcom_push: bool 58 | :param env: If env is not None, it must be a mapping that defines the 59 | environment variables for the new process; these are used instead 60 | of inheriting the current process environment, which is the default 61 | behavior. (templated) 62 | :type env: dict 63 | :type output_encoding: output encoding of bash command 64 | """ 65 | 66 | template_fields = ('conf', 'other_spark_options', 'application_args', 'env') 67 | template_ext = [] 68 | ui_color = '#e47128' # Apache Spark's Main Color: Orange 69 | 70 | @apply_defaults 71 | def __init__( 72 | self, 73 | application_file, 74 | main_class=None, 75 | master=None, 76 | conf={}, 77 | deploy_mode=None, 78 | other_spark_options=None, 79 | application_args=None, 80 | xcom_push=False, 81 | env=None, 82 | output_encoding='utf-8', 83 | *args, **kwargs): 84 | self.bash_command = "" 85 | self.env = env 86 | self.output_encoding = output_encoding 87 | self.xcom_push_flag = xcom_push 88 | super(SparkSubmitOperator, self).__init__(bash_command=self.bash_command, xcom_push=xcom_push, env=env, output_encoding=output_encoding, *args, **kwargs) 89 | self.application_file = application_file 90 | self.main_class = main_class 91 | self.master = master 92 | self.conf = conf 93 | self.deploy_mode = deploy_mode 94 | self.other_spark_options = other_spark_options 95 | self.application_args = application_args 96 | 97 | def execute(self, context): 98 | logging.info("Executing SparkSubmitOperator.execute(context)") 99 | 100 | self.bash_command = "spark-submit " 101 | if self.is_not_null_and_is_not_empty_str(self.main_class): 102 | self.bash_command += "--class " + self.main_class + " " 103 | if self.is_not_null_and_is_not_empty_str(self.master): 104 | self.bash_command += "--master " + self.master + " " 105 | if self.is_not_null_and_is_not_empty_str(self.deploy_mode): 106 | self.bash_command += "--deploy-mode " + self.deploy_mode + " " 107 | for conf_key, conf_value in self.conf.items(): 108 | if self.is_not_null_and_is_not_empty_str(conf_key) and self.is_not_null_and_is_not_empty_str(conf_value): 109 | self.bash_command += "--conf " + "'" + conf_key + "=" + conf_value + "'" + " " 110 | if self.is_not_null_and_is_not_empty_str(self.other_spark_options): 111 | self.bash_command += self.other_spark_options + " " 112 | 113 | self.bash_command += self.application_file + " " 114 | 115 | if self.is_not_null_and_is_not_empty_str(self.application_args): 116 | self.bash_command += self.application_args + " " 117 | 118 | logging.info("Finished assembling bash_command in SparkSubmitOperator: " + str(self.bash_command)) 119 | 120 | logging.info("Executing bash execute statement") 121 | super(SparkSubmitOperator, self).execute(context) 122 | 123 | logging.info("Finished executing SparkSubmitOperator.execute(context)") 124 | 125 | @staticmethod 126 | def is_not_null_and_is_not_empty_str(value): 127 | return value is not None and value != "" 128 | 129 | 130 | class LivySparkOperator(BaseOperator): 131 | """ 132 | Operator to facilitate interacting with the Livy Server which executes Apache Spark code via a REST API. 133 | :param spark_script: Scala, Python or R code to submit to the Livy Server (templated) 134 | :type spark_script: string 135 | :param session_kind: Type of session to setup with Livy. This will determine which type of code will be accepted. Possible values include "spark" (executes Scala code), "pyspark" (executes Python code) or "sparkr" (executes R code). 
136 | :type session_kind: string 137 | :param http_conn_id: The http connection to run the operator against 138 | :type http_conn_id: string 139 | :param poll_interval: The polling interval to use when checking if the code in spark_script has finished executing. In seconds. (default: 30 seconds) 140 | :type poll_interval: integer 141 | """ 142 | 143 | template_fields = ['spark_script'] # todo : make sure this works 144 | template_ext = ['.py', '.R', '.r'] 145 | ui_color = '#34a8dd' # Clouderas Main Color: Blue 146 | 147 | acceptable_response_codes = [200, 201] 148 | statement_non_terminated_status_list = ['waiting', 'running'] 149 | 150 | @apply_defaults 151 | def __init__( 152 | self, 153 | spark_script, 154 | session_kind="spark", # spark, pyspark, or sparkr 155 | http_conn_id='http_default', 156 | poll_interval=30, 157 | *args, **kwargs): 158 | super(LivySparkOperator, self).__init__(*args, **kwargs) 159 | 160 | self.spark_script = spark_script 161 | self.session_kind = session_kind 162 | self.http_conn_id = http_conn_id 163 | self.poll_interval = poll_interval 164 | 165 | self.http = HttpHook("GET", http_conn_id=self.http_conn_id) 166 | 167 | def execute(self, context): 168 | logging.info("Executing LivySparkOperator.execute(context)") 169 | 170 | logging.info("Validating arguments...") 171 | self._validate_arguments() 172 | logging.info("Finished validating arguments") 173 | 174 | logging.info("Creating a Livy Session...") 175 | session_id = self._create_session() 176 | logging.info("Finished creating a Livy Session. (session_id: " + str(session_id) + ")") 177 | 178 | logging.info("Submitting spark script...") 179 | statement_id, overall_statements_state = self._submit_spark_script(session_id=session_id) 180 | logging.info("Finished submitting spark script. (statement_id: " + str(statement_id) + ", overall_statements_state: " + str(overall_statements_state) + ")") 181 | 182 | poll_for_completion = (overall_statements_state in self.statement_non_terminated_status_list) 183 | 184 | if poll_for_completion: 185 | logging.info("Spark job did not complete immediately. Starting to Poll for completion...") 186 | 187 | while overall_statements_state in self.statement_non_terminated_status_list: # todo: test execution_timeout 188 | logging.info("Sleeping for " + str(self.poll_interval) + " seconds...") 189 | time.sleep(self.poll_interval) 190 | logging.info("Finished sleeping. Checking if Spark job has completed...") 191 | statements = self._get_session_statements(session_id=session_id) 192 | 193 | is_all_complete = True 194 | for statement in statements: 195 | if statement["state"] in self.statement_non_terminated_status_list: 196 | is_all_complete = False 197 | 198 | # In case one of the statements finished with errors throw exception 199 | elif statement["state"] != 'available' or statement["output"]["status"] == 'error': 200 | logging.error("Statement failed. (state: " + str(statement["state"]) + ". Output:\n" + 201 | str(statement["output"])) 202 | response = self._close_session(session_id=session_id) 203 | logging.error("Closed session. (response: " + str(response) + ")") 204 | raise Exception("Statement failed. (state: " + str(statement["state"]) + ". Output:\n" + 205 | str(statement["output"])) 206 | 207 | if is_all_complete: 208 | overall_statements_state = "available" 209 | 210 | logging.info("Finished checking if Spark job has completed. 
(overall_statements_state: " + str(overall_statements_state) + ")") 211 | 212 | if poll_for_completion: 213 | logging.info("Finished Polling for completion.") 214 | 215 | logging.info("Session Logs:\n" + str(self._get_session_logs(session_id=session_id))) 216 | 217 | for statement in self._get_session_statements(session_id): 218 | logging.info("Statement '" + str(statement["id"]) + "' Output:\n" + str(statement["output"])) 219 | 220 | logging.info("Closing session...") 221 | response = self._close_session(session_id=session_id) 222 | logging.info("Finished closing session. (response: " + str(response) + ")") 223 | 224 | logging.info("Finished executing LivySparkOperator.execute(context)") 225 | 226 | def _validate_arguments(self): 227 | if self.session_kind is None or self.session_kind == "": 228 | raise Exception( 229 | "session_kind argument is invalid. It is empty or None. (value: '" + str(self.session_kind) + "')") 230 | elif self.session_kind not in ["spark", "pyspark", "sparkr"]: 231 | raise Exception( 232 | "session_kind argument is invalid. It should be set to 'spark', 'pyspark', or 'sparkr'. (value: '" + str( 233 | self.session_kind) + "')") 234 | 235 | def _get_sessions(self): 236 | method = "GET" 237 | endpoint = "sessions" 238 | response = self._http_rest_call(method=method, endpoint=endpoint) 239 | 240 | if response.status_code in self.acceptable_response_codes: 241 | return response.json()["sessions"] 242 | else: 243 | raise Exception("Call to get sessions didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.") 244 | 245 | def _get_session(self, session_id): 246 | sessions = self._get_sessions() 247 | for session in sessions: 248 | if session["id"] == session_id: 249 | return session 250 | 251 | def _get_session_logs(self, session_id): 252 | method = "GET" 253 | endpoint = "sessions/" + str(session_id) + "/log" 254 | response = self._http_rest_call(method=method, endpoint=endpoint) 255 | return response.json() 256 | 257 | def _create_session(self): 258 | method = "POST" 259 | endpoint = "sessions" 260 | 261 | data = { 262 | "kind": self.session_kind 263 | } 264 | 265 | response = self._http_rest_call(method=method, endpoint=endpoint, data=data) 266 | 267 | if response.status_code in self.acceptable_response_codes: 268 | response_json = response.json() 269 | session_id = response_json["id"] 270 | session_state = response_json["state"] 271 | 272 | if session_state == "starting": 273 | logging.info("Session is starting. Polling to see if it is ready...") 274 | 275 | session_state_polling_interval = 10 276 | while session_state == "starting": 277 | logging.info("Sleeping for " + str(session_state_polling_interval) + " seconds") 278 | time.sleep(session_state_polling_interval) 279 | session_state_check_response = self._get_session(session_id=session_id) 280 | session_state = session_state_check_response["state"] 281 | logging.info("Got latest session state as '" + session_state + "'") 282 | 283 | return session_id 284 | else: 285 | raise Exception("Call to create a new session didn't return " + str(self.acceptable_response_codes) + ". 
Returned '" + str(response.status_code) + "'.") 286 | 287 | def _submit_spark_script(self, session_id): 288 | method = "POST" 289 | endpoint = "sessions/" + str(session_id) + "/statements" 290 | 291 | logging.info("Executing Spark Script: \n" + str(self.spark_script)) 292 | 293 | data = { 294 | 'code': textwrap.dedent(self.spark_script) 295 | } 296 | 297 | response = self._http_rest_call(method=method, endpoint=endpoint, data=data) 298 | 299 | if response.status_code in self.acceptable_response_codes: 300 | response_json = response.json() 301 | return response_json["id"], response_json["state"] 302 | else: 303 | raise Exception("Call to create a new statement didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.") 304 | 305 | def _get_session_statements(self, session_id): 306 | method = "GET" 307 | endpoint = "sessions/" + str(session_id) + "/statements" 308 | response = self._http_rest_call(method=method, endpoint=endpoint) 309 | 310 | if response.status_code in self.acceptable_response_codes: 311 | response_json = response.json() 312 | statements = response_json["statements"] 313 | return statements 314 | else: 315 | raise Exception("Call to get the session statement response didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.") 316 | 317 | def _close_session(self, session_id): 318 | method = "DELETE" 319 | endpoint = "sessions/" + str(session_id) 320 | return self._http_rest_call(method=method, endpoint=endpoint) 321 | 322 | def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None): 323 | if not extra_options: 324 | extra_options = {} 325 | logging.debug("Performing HTTP REST call... (method: " + str(method) + ", endpoint: " + str(endpoint) + ", data: " + str(data) + ", headers: " + str(headers) + ")") 326 | self.http.method = method 327 | response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options) 328 | 329 | logging.debug("status_code: " + str(response.status_code)) 330 | logging.debug("response_as_json: " + str(response.json())) 331 | 332 | return response 333 | 334 | 335 | # Defining the plugin class 336 | class SparkOperatorPlugin(AirflowPlugin): 337 | name = "spark_operator_plugin" 338 | operators = [SparkSubmitOperator, LivySparkOperator] 339 | flask_blueprints = [] 340 | hooks = [] 341 | executors = [] 342 | admin_views = [] 343 | menu_links = [] -------------------------------------------------------------------------------- /airflow/plugins/redshift_plugin/operators/s3_to_redshift_operator.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | import string 4 | import logging 5 | 6 | from airflow.utils.db import provide_session 7 | from airflow.models import Connection 8 | from airflow.utils.decorators import apply_defaults 9 | 10 | from airflow.models import BaseOperator 11 | from airflow.hooks.S3_hook import S3Hook 12 | from airflow.hooks.postgres_hook import PostgresHook 13 | 14 | # https://github.com/airflow-plugins/redshift_plugin 15 | # We edited it slightly to accept composite primary keys 16 | 17 | 18 | class S3ToRedshiftOperator(BaseOperator): 19 | """ 20 | S3 To Redshift Operator 21 | :param redshift_conn_id: The destination redshift connection id. 22 | :type redshift_conn_id: string 23 | :param redshift_schema: The destination redshift schema. 24 | :type redshift_schema: string 25 | :param table: The destination redshift table. 
26 | :type table: string 27 | :param s3_conn_id: The source s3 connection id. 28 | :type s3_conn_id: string 29 | :param s3_bucket: The source s3 bucket. 30 | :type s3_bucket: string 31 | :param s3_key: The source s3 key. 32 | :type s3_key: string 33 | :param copy_params: The parameters to be included when issuing 34 | the copy statement in Redshift. 35 | :type copy_params: list 36 | :param origin_schema: The s3 key for the incoming data schema. 37 | Expects a JSON file with an array of 38 | dictionaries specifying name and type. 39 | (e.g. {"name": "_id", "type": "int4"}) 40 | :type origin_schema: array of dictionaries 41 | :param schema_location: The location of the origin schema. This 42 | can be set to 'S3' or 'Local'. 43 | If 'S3', it will expect a valid S3 Key. If 44 | 'Local', it will expect a dictionary that 45 | is defined in the operator itself. By 46 | default the location is set to 's3'. 47 | :type schema_location: string 48 | :param load_type: The method of loading into Redshift that 49 | should occur. Options: 50 | - "append" 51 | - "rebuild" 52 | - "truncate" 53 | - "upsert" 54 | Defaults to "append." 55 | :type load_type: string 56 | :param primary_key: *(optional)* The primary key for the 57 | destination table. Not enforced by redshift 58 | and only required if using a load_type of 59 | "upsert". It will expect a string or an 60 | array of strings. 61 | :type primary_key: string 62 | :param incremental_key: *(optional)* The incremental key to compare 63 | new data against the destination table 64 | with. Only required if using a load_type of 65 | "upsert". 66 | :type incremental_key: string 67 | :param foreign_key: *(optional)* This specifies any foreign_keys 68 | in the table and which corresponding table 69 | and key they reference. This may be either 70 | a dictionary or list of dictionaries (for 71 | multiple foreign keys). The fields that are 72 | required in each dictionary are: 73 | - column_name 74 | - reftable 75 | - ref_column 76 | :type foreign_key: dictionary 77 | :param distkey: *(optional)* The distribution key for the 78 | table. Only one key may be specified. 79 | :type distkey: string 80 | :param sortkey: *(optional)* The sort keys for the table. 81 | If more than one key is specified, set this 82 | as a list. 83 | :type sortkey: string 84 | :param sort_type: *(optional)* The style of distribution 85 | to sort the table. Possible values include: 86 | - compound 87 | - interleaved 88 | Defaults to "compound". 
89 | :type sort_type: string 90 | """ 91 | 92 | template_fields = ('s3_key', 93 | 'origin_schema') 94 | 95 | @apply_defaults 96 | def __init__(self, 97 | s3_conn_id, 98 | s3_bucket, 99 | s3_key, 100 | redshift_conn_id, 101 | redshift_schema, 102 | table, 103 | copy_params=[], 104 | origin_schema=None, 105 | schema_location='s3', 106 | load_type='append', 107 | primary_key=None, 108 | incremental_key=None, 109 | foreign_key={}, 110 | distkey=None, 111 | sortkey='', 112 | sort_type='COMPOUND', 113 | *args, 114 | **kwargs): 115 | super().__init__(*args, **kwargs) 116 | self.s3_conn_id = s3_conn_id 117 | self.s3_bucket = s3_bucket 118 | self.s3_key = s3_key 119 | self.redshift_conn_id = redshift_conn_id 120 | self.redshift_schema = redshift_schema.lower() 121 | self.table = table.lower() 122 | self.copy_params = copy_params 123 | self.origin_schema = origin_schema 124 | self.schema_location = schema_location 125 | self.load_type = load_type 126 | self.primary_key = primary_key 127 | self.incremental_key = incremental_key 128 | self.foreign_key = foreign_key 129 | self.distkey = distkey 130 | self.sortkey = sortkey 131 | self.sort_type = sort_type 132 | 133 | if self.load_type.lower() not in ("append", "rebuild", "truncate", "upsert"): 134 | raise Exception('Please choose "append", "rebuild", or "upsert".') 135 | 136 | if self.schema_location.lower() not in ('s3', 'local'): 137 | raise Exception('Valid Schema Locations are "s3" or "local".') 138 | 139 | if not (isinstance(self.sortkey, str) or isinstance(self.sortkey, list)): 140 | raise Exception('Sort Keys must be specified as either a string or list.') 141 | 142 | if not (isinstance(self.foreign_key, dict) or isinstance(self.foreign_key, list)): 143 | raise Exception('Foreign Keys must be specified as either a dictionary or a list of dictionaries.') 144 | 145 | if self.distkey and ((',' in self.distkey) or not isinstance(self.distkey, str)): 146 | raise Exception('Only one distribution key may be specified.') 147 | 148 | if self.sort_type.lower() not in ('compound', 'interleaved'): 149 | raise Exception('Please choose "compound" or "interleaved" for sort type.') 150 | 151 | def execute(self, context): 152 | # Append a random string to the end of the staging table to ensure 153 | # no conflicts if multiple processes running concurrently. 154 | letters = string.ascii_lowercase 155 | random_string = ''.join(random.choice(letters) for _ in range(7)) 156 | self.temp_suffix = '_tmp_{0}'.format(random_string) 157 | 158 | if self.origin_schema: 159 | schema = self.read_and_format() 160 | 161 | pg_hook = PostgresHook(self.redshift_conn_id) 162 | 163 | #self.reconcile_schemas(schema, pg_hook) 164 | self.copy_data(pg_hook, schema) 165 | 166 | def read_and_format(self): 167 | if self.schema_location.lower() == 's3': 168 | hook = S3Hook(self.s3_conn_id) 169 | # NOTE: In retrieving the schema, it is assumed 170 | # that boto3 is being used. 
If using boto, 171 | # `.get()['Body'].read().decode('utf-8'))` 172 | # should be changed to 173 | # `.get_contents_as_string(encoding='utf-8'))` 174 | schema = (hook.get_key(self.origin_schema, 175 | bucket_name= 176 | '{0}'.format(self.s3_bucket)) 177 | .get()['Body'].read().decode('utf-8')) 178 | schema = json.loads(schema.replace("'", '"')) 179 | else: 180 | schema = self.origin_schema 181 | 182 | return schema 183 | 184 | def reconcile_schemas(self, schema, pg_hook): 185 | pg_query = \ 186 | """ 187 | SELECT column_name, udt_name 188 | FROM information_schema.columns 189 | WHERE table_schema = '{0}' AND table_name = '{1}'; 190 | """.format(self.redshift_schema, self.table) 191 | 192 | pg_schema = dict(pg_hook.get_records(pg_query)) 193 | incoming_keys = [column['name'] for column in schema] 194 | diff = list(set(incoming_keys) - set(pg_schema.keys())) 195 | print(diff) 196 | # Check length of column differential to see if any new columns exist 197 | if len(diff): 198 | for i in diff: 199 | for e in schema: 200 | if i == e['name']: 201 | alter_query = \ 202 | """ 203 | ALTER TABLE "{0}"."{1}" 204 | ADD COLUMN "{2}" {3} 205 | """.format(self.redshift_schema, 206 | self.table, 207 | e['name'], 208 | e['type']) 209 | pg_hook.run(alter_query) 210 | logging.info('The new columns were:' + str(diff)) 211 | else: 212 | logging.info('There were no new columns.') 213 | 214 | def copy_data(self, pg_hook, schema=None): 215 | @provide_session 216 | def get_conn(conn_id, session=None): 217 | conn = ( 218 | session.query(Connection) 219 | .filter(Connection.conn_id == conn_id) 220 | .first()) 221 | return conn 222 | 223 | def getS3Conn(): 224 | creds = "" 225 | s3_conn = get_conn(self.s3_conn_id) 226 | aws_key = s3_conn.extra_dejson.get('aws_access_key_id', None) 227 | aws_secret = s3_conn.extra_dejson.get('aws_secret_access_key', None) 228 | 229 | # support for cross account resource access 230 | aws_role_arn = s3_conn.extra_dejson.get('role_arn', None) 231 | 232 | if aws_key and aws_secret: 233 | creds = ("aws_access_key_id={0};aws_secret_access_key={1}" 234 | .format(aws_key, aws_secret)) 235 | elif aws_role_arn: 236 | creds = ("aws_iam_role={0}" 237 | .format(aws_role_arn)) 238 | 239 | if not creds: 240 | logging.error("AWS Credentials not found") 241 | 242 | 243 | return creds 244 | 245 | # Delete records from the destination table where the incremental_key 246 | # is greater than or equal to the incremental_key of the source table 247 | # and the primary key is the same. 248 | # (e.g. Source: {"id": 1, "updated_at": "2017-01-02 00:00:00"}; 249 | # Destination: {"id": 1, "updated_at": "2017-01-01 00:00:00"}) 250 | 251 | delete_sql = \ 252 | ''' 253 | DELETE FROM "{rs_schema}"."{rs_table}" 254 | USING "{rs_schema}"."{rs_table}{rs_suffix}" 255 | WHERE "{rs_schema}"."{rs_table}"."{rs_pk}" = 256 | "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_pk}" 257 | AND "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_ik}" >= 258 | "{rs_schema}"."{rs_table}"."{rs_ik}" 259 | '''.format(rs_schema=self.redshift_schema, 260 | rs_table=self.table, 261 | rs_pk=self.primary_key, 262 | rs_suffix=self.temp_suffix, 263 | rs_ik=self.incremental_key) 264 | 265 | # Delete records from the source table where the incremental_key 266 | # is greater than or equal to the incremental_key of the destination 267 | # table and the primary key is the same. This is done in the edge case 268 | # where data is pulled BEFORE it is altered in the source table but 269 | # AFTER a workflow containing an updated version of the record runs. 
270 | # In this case, not running this will cause the older record to be 271 | # added as a duplicate to the newer record. 272 | # (e.g. Source: {"id": 1, "updated_at": "2017-01-01 00:00:00"}; 273 | # Destination: {"id": 1, "updated_at": "2017-01-02 00:00:00"}) 274 | 275 | delete_confirm_sql = \ 276 | ''' 277 | DELETE FROM "{rs_schema}"."{rs_table}{rs_suffix}" 278 | USING "{rs_schema}"."{rs_table}" 279 | WHERE "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_pk}" = 280 | "{rs_schema}"."{rs_table}"."{rs_pk}" 281 | AND "{rs_schema}"."{rs_table}"."{rs_ik}" >= 282 | "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_ik}" 283 | '''.format(rs_schema=self.redshift_schema, 284 | rs_table=self.table, 285 | rs_pk=self.primary_key, 286 | rs_suffix=self.temp_suffix, 287 | rs_ik=self.incremental_key) 288 | 289 | append_sql = \ 290 | ''' 291 | ALTER TABLE "{0}"."{1}" 292 | APPEND FROM "{0}"."{1}{2}" 293 | FILLTARGET 294 | '''.format(self.redshift_schema, self.table, self.temp_suffix) 295 | 296 | drop_sql = \ 297 | ''' 298 | DROP TABLE IF EXISTS "{0}"."{1}" CASCADE 299 | '''.format(self.redshift_schema, self.table) 300 | 301 | drop_temp_sql = \ 302 | ''' 303 | DROP TABLE IF EXISTS "{0}"."{1}{2}" CASCADE 304 | '''.format(self.redshift_schema, self.table, self.temp_suffix) 305 | 306 | truncate_sql = \ 307 | ''' 308 | TRUNCATE TABLE "{0}"."{1}" 309 | '''.format(self.redshift_schema, self.table) 310 | 311 | params = '\n'.join(self.copy_params) 312 | 313 | # Example params for loading json from US-East-1 S3 region 314 | # params = ["COMPUPDATE OFF", 315 | # "STATUPDATE OFF", 316 | # "JSON 'auto'", 317 | # "TIMEFORMAT 'auto'", 318 | # "TRUNCATECOLUMNS", 319 | # "region as 'us-east-1'"] 320 | 321 | base_sql = \ 322 | """ 323 | FROM 's3://{0}/{1}' 324 | CREDENTIALS '{2}' 325 | {3}; 326 | """.format(self.s3_bucket, 327 | self.s3_key, 328 | getS3Conn(), 329 | params) 330 | 331 | load_sql = '''COPY "{0}"."{1}" {2}'''.format(self.redshift_schema, 332 | self.table, 333 | base_sql) 334 | 335 | if self.load_type == 'append': 336 | self.create_if_not_exists(schema, pg_hook) 337 | pg_hook.run(load_sql) 338 | elif self.load_type == 'rebuild': 339 | pg_hook.run(drop_sql) 340 | self.create_if_not_exists(schema, pg_hook) 341 | pg_hook.run(load_sql) 342 | elif self.load_type == 'truncate': 343 | self.create_if_not_exists(schema, pg_hook) 344 | pg_hook.run(truncate_sql) 345 | pg_hook.run(load_sql) 346 | elif self.load_type == 'upsert': 347 | self.create_if_not_exists(schema, pg_hook, temp=True) 348 | load_temp_sql = \ 349 | '''COPY "{0}"."{1}{2}" {3}'''.format(self.redshift_schema, 350 | self.table, 351 | self.temp_suffix, 352 | base_sql) 353 | pg_hook.run(load_temp_sql) 354 | pg_hook.run(delete_sql) 355 | pg_hook.run(delete_confirm_sql) 356 | pg_hook.run(append_sql, autocommit=True) 357 | pg_hook.run(drop_temp_sql) 358 | 359 | def create_if_not_exists(self, schema, pg_hook, temp=False): 360 | output = '' 361 | for item in schema: 362 | k = "{quote}{key}{quote}".format(quote='"', key=item['name']) 363 | field = ' '.join([k, item['type']]) 364 | if isinstance(self.sortkey, str) and self.sortkey == item['name']: 365 | field += ' sortkey' 366 | output += field 367 | output += ', ' 368 | # Remove last comma and space after schema items loop ends 369 | output = output[:-2] 370 | if temp: 371 | copy_table = '{0}{1}'.format(self.table, self.temp_suffix) 372 | else: 373 | copy_table = self.table 374 | create_schema_query = \ 375 | ''' 376 | CREATE SCHEMA IF NOT EXISTS "{0}"; 377 | '''.format(self.redshift_schema) 378 | 379 | pk = '' 380 | fk = '' 
381 | dk = '' 382 | sk = '' 383 | 384 | if self.primary_key: 385 | pk = ', ' 386 | if isinstance(self.primary_key, list): 387 | pk += 'primary key({0})'.format(', '.join(self.primary_key)) 388 | else: 389 | pk += 'primary key("{0}")'.format(self.primary_key) 390 | 391 | if self.foreign_key: 392 | if isinstance(self.foreign_key, list): 393 | fk = ', ' 394 | for i, e in enumerate(self.foreign_key): 395 | fk += 'foreign key("{0}") references {1}("{2}")'.format(e['column_name'], 396 | e['reftable'], 397 | e['ref_column']) 398 | if i != (len(self.foreign_key) - 1): 399 | fk += ', ' 400 | elif isinstance(self.foreign_key, dict): 401 | fk += ', ' 402 | fk += 'foreign key("{0}") references {1}("{2}")'.format(self.foreign_key['column_name'], 403 | self.foreign_key['reftable'], 404 | self.foreign_key['ref_column']) 405 | if self.distkey: 406 | dk = 'distkey({})'.format(self.distkey) 407 | 408 | if self.sortkey: 409 | if isinstance(self.sortkey, list): 410 | sk += '{0} sortkey({1})'.format(self.sort_type, ', '.join(["{}".format(e) for e in self.sortkey])) 411 | 412 | create_table_query = \ 413 | ''' 414 | CREATE TABLE IF NOT EXISTS "{schema}"."{table}" 415 | ({fields}{primary_key}{foreign_key}) {distkey} {sortkey} 416 | '''.format(schema=self.redshift_schema, 417 | table=copy_table, 418 | fields=output, 419 | primary_key=pk, 420 | foreign_key=fk, 421 | distkey=dk, 422 | sortkey=sk) 423 | 424 | #pg_hook.run([create_schema_query, create_table_query]) 425 | pg_hook.run(create_table_query) 426 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # yelp-3nf 2 | 3 | The developed data pipeline translates the non-relational Yelp dataset distributed over JSON files in Amazon S3 bucket, into a 3NF-normalized dataset stored on Amazon Redshift. The resulting schema ensures data consistency and referential integrity across tables, and is meant to be the source of truth for analytical queries and BI tools. Additionally, the data was enriched with demographics and weather data coming from third-party data sources. 4 | 5 | The entire process was done using Apache Spark, Amazon Redshift and Apache Airflow. 6 | 7 | ## Datasets 8 | 9 | 10 | 11 | The [Yelp Open Dataset](https://www.yelp.com/dataset) is a perfect candidate for this project, since: 12 | 13 | - (1) it is a NoSQL data source; 14 | - (2) it comprises of 6 files that count together more than 10 million rows; 15 | - (3) this dataset provides lots of diverse information and allows for many analysis approaches, from traditional analytical queries (such as "*Give me the average star rating for each city*") to Graph Mining, Photo Classification, Natural Language Processing, and Sentiment Analysis; 16 | - (4) Moreover, it was produced in a real production setting (as opposed to synthetic data generation). 17 | 18 | To make the contribution unique, the Yelp dataset was enriched by demographics and weather data. This allows the end user to make queries such as "*Does the number of ratings depend upon the city's population density?*" or "*Which restaurants are particularly popular during hot weather?*". 19 | 20 | ### Yelp Open Dataset 21 | 22 | The [Yelp Open Dataset](https://www.yelp.com/dataset) dataset is a subset of Yelp's businesses, reviews, and user data, available for academic use. 
The dataset (as of 13.08.2019) takes 9GB of disk space (unzipped) and contains 6,685,900 reviews, 192,609 businesses over 10 metropolitan areas, over 1.2 million business attributes like hours, parking, availability, and ambience, 1,223,094 tips by 1,637,138 users, and aggregated check-ins over time. Each file is composed of a single object type, one JSON object per line. For more details on the dataset structure, proceed to the [Yelp Dataset JSON Documentation](https://www.yelp.com/dataset/documentation/main). 23 | 24 | ### U.S. City Demographic Data 25 | 26 | The [U.S. City Demographic Data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau's 2015 American Community Survey. Each JSON object describes the demographics of a particular city and race, and so it can be uniquely identified by the city, state and race fields. More information can be found [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/) under the section "Dataset schema". 27 | 28 | ### Historical Hourly Weather Data 2012-2017 29 | 30 | The [Historical Hourly Weather Data](https://www.kaggle.com/selfishgene/historical-hourly-weather-data) dataset was collected by a Kaggle user. It contains 5 years of hourly measurements of various weather attributes, such as temperature, humidity, and air pressure. This data is available for 27 larger US cities, 3 cities in Canada, and 6 cities in Israel. Each attribute has its own file and is organized such that the rows are the time axis (timestamps) and the columns are the different cities. Additionally, there is a separate file that identifies which city belongs to which country. 31 | 32 | ## Data model and dictionary 33 | 34 | Our target data model is a 3NF-normalized relational model, which was designed to be neutral to different kinds of analytical queries. The data should depend on the key [1NF], the whole key [2NF] and nothing but the key [3NF] (so help me Codd). Forms beyond 4NF are mainly of academic interest. The following image depicts the logical model of the database: 35 | 36 | ![Data Model](images/data-model.png) 37 | 38 | Note: fields such as *compliment_** are just placeholders for multiple fields with the same prefix (*compliment*). This is done to visually reduce the length of the tables. 39 | 40 | The model consists of 15 tables, the result of normalizing and joining 6 tables provided by Yelp, 1 table with demographic information and 2 tables with weather information. The schema is closer to a snowflake schema, as there are two fact tables - *reviews* and *tips* - and many dimensional tables with multiple levels of hierarchy and many-to-many relationships. Some tables keep their native keys, while for others monotonically increasing ids were generated. Rule of thumb: use generated keys for entities and composite keys for relationships. Moreover, timestamps and dates were converted into Spark's native data types so that they can be imported into Amazon Redshift in the correct format. 41 | 42 | For more details on tables and fields, visit [Yelp Dataset JSON](https://www.yelp.com/dataset/documentation/main) or look at [Redshift table definitions](https://github.com/polakowo/yelp-3nf/blob/master/airflow/dags/configs/table_definitions.yml).
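To illustrate the key generation and timestamp conversion mentioned above, here is a minimal PySpark sketch (the `tips` DataFrame, the bucket path and the timestamp format are illustrative assumptions, not the project's exact code):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("yelp-3nf").getOrCreate()

# each Yelp file contains one JSON object per line, which spark.read.json expects
tips = spark.read.json("s3://<your-bucket>/yelp_dataset/tip.json")

tips = (
    tips
    # cast the string timestamp into Spark's native TimestampType so that it
    # arrives in Redshift as a proper TIMESTAMP column after the Parquet COPY
    .withColumn("date", F.to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
    # tips have no natural key, so generate a surrogate id
    .withColumn("tip_id", F.monotonically_increasing_id())
)
```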
43 | 44 | To dive deeper into the data processing pipeline with Spark, here is [the provided Jupyter Notebook](https://nbviewer.jupyter.org/github/polakowo/yelp-3nf/blob/master/spark-jobs-playground.ipynb). 45 | 46 | #### *businesses* 47 | 48 | The most referenced table in the model. Contains the name of the business, the current star rating, the number of reviews, and whether the business is currently open. The address (one-to-one relationship), hours (one-to-one relationship), business attributes (one-to-one relationship) and categories (one-to-many relationship) were outsourced to separate tables as part of the normalization process. 49 | 50 | #### *business_attributes* 51 | 52 | This was the most challenging part of the remodeling, since Yelp kept business attributes as a nested dict. All fields in the source table were strings, so they had to be parsed into their respective data types. Some values were dirty; for example, boolean fields could be `"True"`, `"False"`, `"None"` and `None`, while some string fields were of double unicode format `u"u'string'"`. Moreover, some fields were dicts formatted as strings. The resulting nested JSON structure of three levels had to be flattened. 53 | 54 | #### *business_categories* and *categories* 55 | 56 | In the original table, business categories were stored as an array. The best solution was to outsource them into a separate table. One way of doing this is to assign a column to each category, but what if we add a new category later on? Then we must update the whole table to reflect the change - a clear violation of 3NF, where columns should not have any transitive functional dependencies. Thus, let's create two tables: *categories*, which contains categories keyed by their ids, and *business_categories*, which contains tuples of business ids and category ids. 57 | 58 | #### *business_hours* 59 | 60 | Business hours were stored as a dict where each key is a day of the week and each value is a string of the format `"hour:minute-hour:minute"`. The best way to make the data representation neutral to queries is to split the "from hour" and "to hour" parts into separate columns and combine "hour" and "minute" into a single field of type integer, for example `"10:00-21:00"` into `1000` and `2100` respectively. This way we could easily formulate the following query: 61 | 62 | ```sql 63 | -- Find businesses open on Sunday at 8pm 64 | SELECT business_id FROM business_hours WHERE Sunday_from <= 2000 AND Sunday_to > 2000; 65 | ``` 66 | 67 | #### *addresses* 68 | 69 | A common practice is to separate business data from address data and connect them through a synthetic key. The resulting link is a one-to-one relationship. Furthermore, addresses were separated from cities, since states and demographic data depend on cities only (otherwise a 3NF violation). 70 | 71 | #### *cities* 72 | 73 | This table contains the city name and state code coming from the Yelp dataset as well as fields on demographics. For most of the cities there is no demographic information since they are too small (< 65k). Each record in the table can be uniquely identified by the city and postal code, but a single primary key is more convenient to connect both addresses and cities. 74 | 75 | #### *city_weather* 76 | 77 | The table *city_weather* was composed from the CSV files `temperature.csv` and `weather_description.csv`. Both files contain information on various (global) cities. To filter the cities by country (= US), one first has to read `city_attributes.csv`. The issue with this dataset is that it doesn't provide the respective state codes, so how do we know whether Phoenix is in AZ or TX? The most practical solution is to assume each name refers to the biggest city with that name: look up the respective state codes manually (with Google) and match them against the cities available in the Yelp dataset. As a result, 8 cities could be enriched. Also, both the temperature and weather description data were recorded hourly, which is too fine-grained. Thus, they were grouped by day using an aggregation statistic: temperatures (of data type `float`) are averaged, while for the weather description (of data type `string`) the most frequent value is chosen.
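A minimal sketch of this daily aggregation (the long-format `hourly_weather` DataFrame and its column names are illustrative assumptions, not the project's exact code):

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("city-weather").getOrCreate()

# long-format hourly data (in the real pipeline this would come from melting
# temperature.csv and weather_description.csv)
hourly_weather = spark.createDataFrame(
    [("Phoenix", "2015-06-01 10:00:00", 305.2, "sky is clear"),
     ("Phoenix", "2015-06-01 11:00:00", 307.1, "sky is clear"),
     ("Phoenix", "2015-06-01 12:00:00", 309.5, "few clouds")],
    ["city", "timestamp", "temperature", "description"])

daily = hourly_weather.withColumn("date", F.to_date("timestamp"))

# average temperature per city and day
avg_temp = (daily.groupBy("city", "date")
            .agg(F.avg("temperature").alias("avg_temperature")))

# most frequent weather description per city and day
counts = daily.groupBy("city", "date", "description").count()
w = Window.partitionBy("city", "date").orderBy(F.desc("count"))
top_desc = (counts.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .select("city", "date", "description"))

city_weather = avg_temp.join(top_desc, on=["city", "date"])
```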
78 | 79 | #### *checkins* 80 | 81 | This table contains checkins on a business and required no further transformations. 82 | 83 | #### *reviews* 84 | 85 | The *reviews* table contains the full review text data, including the id of the user who wrote the review and the id of the business the review is written for. It is the most central table in our data schema and is structured similarly to a fact table. But in order to convert it into a true fact table, the date column would have to be outsourced into a separate dimension table and the text omitted. 86 | 87 | #### *users*, *elite_years* and *friends* 88 | 89 | Originally, the user data includes the user's friend mapping and all the metadata associated with the user. But since the fields `friends` and `elite` are arrays, they have become separate relations in our model, both structured similarly to the *business_categories* table and having composite primary keys. The format of the *friends* table is very convenient, as it can be fed directly into Apache Spark's GraphX API to build a social graph of Yelp users. 90 | 91 | #### *tips* 92 | 93 | Tips were written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions. The table required no transformations apart from assigning it a generated key. 94 | 95 | #### *photos* 96 | 97 | Contains photo data, including the caption and classification. 98 | 99 | ## Data pipeline 100 | 101 | The designed pipeline dynamically loads the JSON files from S3, processes them, and stores their normalized and enriched versions back into S3 in Parquet format. After this, Redshift takes over and copies the tables into a DWH. 102 | 103 | #### Load from S3 104 | 105 | 106 | 107 | All three datasets reside in an Amazon S3 bucket, which is the easiest and safest option to store and retrieve any amount of data at any time from any other AWS service. 108 | 109 | #### Process with Spark 110 | 111 | 112 | 113 | Since the data is in JSON format and contains arrays and nested fields, it first needs to be transformed into a relational form. By design, Amazon Redshift does not support loading nested data (only Redshift Spectrum enables you to query complex data types such as struct, array, or map, without having to transform or load your data). To do this in a quick and scalable fashion, Apache Spark is utilized. In particular, you can execute the entire data processing pipeline in [the provided ETL notebook](https://nbviewer.jupyter.org/github/polakowo/yelp-3nf/blob/master/spark-jobs-playground.ipynb) on an Amazon EMR (Elastic MapReduce) cluster, which uses Apache Spark and Hadoop to quickly and cost-effectively process and analyze vast amounts of data. Another advantage of Spark is the ability to control data quality, so most of our data quality checks are done at this stage.
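To make the typical transformations concrete, here is a minimal sketch (the bucket paths are placeholders; whether `categories` arrives as an array or as a comma-separated string depends on the dataset version, so this is not the project's exact code):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("yelp-3nf").getOrCreate()
businesses = spark.read.json("s3://<your-bucket>/yelp_dataset/business.json")

# explode the categories array into one row per (business, category) pair;
# if your dump stores categories as a comma-separated string, split it first:
# F.explode(F.split("categories", r",\s*"))
business_categories = businesses.select(
    "business_id", F.explode("categories").alias("category"))

# flatten one level of the nested attributes struct into plain columns
business_attributes = businesses.select("business_id", "attributes.*")

# unload back to S3 as Parquet, ready for the Redshift COPY step
business_categories.write.mode("overwrite").parquet(
    "s3://<your-bucket>/parquet/business_categories/")
```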
114 | 115 | #### Unload to S3 116 | 117 | 118 | 119 | Parquet stores nested data structures in a flat columnar format. Compared to a traditional row-oriented format, Parquet is more efficient in terms of storage and performance. Parquet files are well supported in the AWS ecosystem. Moreover, compared to the JSON and CSV formats, we can store timestamp objects, datetime objects and long texts without any post-processing, and load them into Amazon Redshift as-is. From here, we can use an AWS Glue crawler to discover and register the schema for our datasets to be used in Amazon Athena. But our goal is to materialize the data rather than query the files on Amazon S3 directly, so that the data can be retrieved without prolonged load times. 120 | 121 | #### Load into Redshift 122 | 123 | 124 | 125 | There are multiple options for loading the data from the Parquet files into our Redshift DWH. The easiest one is [spark-redshift](https://github.com/databricks/spark-redshift): Spark reads the Parquet files from S3 into the Spark cluster, converts the data to Avro format, writes it to S3, and finally issues a COPY SQL query to Redshift to load the data. Alternatively, we could have [an AWS Glue job load the data into Amazon Redshift](https://www.dbbest.com/blog/aws-glue-etl-service/). But instead, we should define the tables manually: that way we control not only the quality and consistency of the data, but also sortkeys, distkeys and compression. This solution issues SQL statements to Redshift to first CREATE the tables and then COPY the data. To make the table definition process easier and more transparent, we can utilize AWS Glue's data catalog to derive the correct data types (for example, should we use int or bigint?). 126 | 127 | #### Check data quality 128 | 129 | Most data checks are done when transforming the data with Spark. Furthermore, consistency and referential integrity checks are done automatically by importing the data into Redshift (since the data must adhere to the table definitions). To ensure that the output tables are of the right size, we also do some checks at the end of the data pipeline. 130 | 131 | ## Airflow DAGs 132 | 133 | 134 | 135 | This data processing pipeline is executed using Apache Airflow, a tool for orchestrating complex computational workflows and data processing pipelines. The advantage of Airflow over plain Python ETL scripts is that it provides many community-built operators and plugins, so one can build useful things quickly and in a modular fashion. Also, the Airflow scheduler is designed to run as a persistent service in an Airflow production environment and is easier to manage than cron jobs. 136 | 137 | The whole data pipeline is divided into three subDAGs: one that processes data with Spark (`spark_jobs`), one that loads the data into Redshift (`copy_to_redshift`), and one that checks the data for errors (`data_quality_checks`). 138 | 139 | 140 | 141 | ### spark_jobs 142 | 143 | This subDAG comprises a set of tasks, each sending a Spark script to an Amazon EMR cluster. For this, the [LivySparkOperator](https://github.com/rssanders3/airflow-spark-operator-plugin) is used. This operator facilitates interacting with the Livy server on the EMR master node, which lets us send simple Scala or Python code over REST API calls instead of having to manage and deploy large JAR files. This scales well, since multiple Spark jobs can run in parallel rather than serially via the EMR Step API. Each Spark script takes care of loading one or more source JSON files, transforming them into one or more (3NF-normalized) tables, and unloading them back into S3 in Parquet format. The subDAG was partitioned logically by target tables, such that each script does a small amount of work, which simplifies debugging. Note: to increase performance, one might instead divide the tasks by source tables and cache them.
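Under the hood, such an operator boils down to a Livy REST exchange roughly like the following hand-rolled sketch (the host is a placeholder, and this is not the plugin's actual implementation):

```python
import time

import requests

LIVY_URL = "http://<emr-master-dns>:8998"  # Livy's default port (opened in the EMR setup below)

# open a PySpark session on the EMR cluster
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# wait until the session is idle, then submit one of the DAG's scripts as a statement
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(5)

with open("airflow/dags/scripts/businesses.py") as f:
    requests.post(f"{session_url}/statements", json={"code": f.read()})
```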
144 | 145 | 146 | 147 | ### copy_to_redshift 148 | 149 | Airflow takes control of loading the Parquet files into Redshift in the right order and with consistency checks in place. The loading operation is done with the [S3ToRedshiftOperator](https://github.com/airflow-plugins/redshift_plugin), provided by the Airflow community. This operator takes the table definition as a dictionary, creates the Redshift table from it and performs the COPY operation. All table definitions are stored in a YAML configuration file. The order of and relationships between operators were derived from the references between tables; for example, because the *reviews* table references *businesses*, *businesses* has to be loaded first, otherwise referential integrity is violated (and you may get errors). Thus, data integrity and referential constraints are automatically enforced while populating the Redshift database. 150 | 151 | 152 | 153 | ### data_quality_checks 154 | 155 | The data quality checks are executed with a custom [RedshiftCheckOperator](https://github.com/polakowo/yelp-3nf/blob/master/airflow/plugins/redshift_plugin/operators/redshift_check_operator.py), which extends Airflow's default [CheckOperator](https://github.com/apache/airflow/blob/master/airflow/operators/check_operator.py). It takes a SQL statement, the expected pass value, and optionally a tolerance for the result, and performs a simple value check.
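For illustration, a similar check expressed with Airflow's built-in `ValueCheckOperator` (not the project's custom operator and not its exact signature; the table name and expected count are made up, and an Airflow 1.10-style `check_operator` module is assumed):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.check_operator import ValueCheckOperator

dag = DAG("data_quality_checks_example",
          start_date=datetime(2019, 1, 1),
          schedule_interval=None)

check_businesses_count = ValueCheckOperator(
    task_id="check_businesses_count",
    conn_id="redshift",
    sql="SELECT COUNT(*) FROM businesses;",
    pass_value=192609,  # expected row count (illustrative)
    tolerance=0.05,     # accept a 5% deviation
    dag=dag,
)
```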
156 | 157 | ## Data updates 158 | 159 | The whole ETL process for 7 million reviews and related data takes about 20 minutes. As our target data model is meant to be the source for other dimensional tables, the ETL process can take longer. Since the Yelp Open Dataset is only a subset of the real dataset and we don't know how many rows Yelp generates each day, we cannot derive the optimal update frequency. But loading only newly appended rows (for example, those collected over one day) can significantly increase the frequency. 160 | 161 | ## Scenarios 162 | 163 | The following scenarios need to be addressed: 164 | - **The data was increased by 100x:** That wouldn't be a technical issue, as both Amazon EMR and Redshift clusters can handle huge amounts of data. Eventually, they would have to be scaled out. 165 | - **The data populates a dashboard that must be updated on a daily basis by 7am every day:** That's perfectly plausible and could be done by running the ETL script some time prior to 7am. 166 | - **The database needed to be accessed by 100+ people:** That wouldn't be a problem, as Redshift is highly scalable and available. 167 | 168 | ## Installation 169 | 170 | ### Data preparation (Amazon S3) 171 | 172 | - Create an S3 bucket. 173 | - Ensure that the bucket is in the same region as your Amazon EMR and Redshift clusters. 174 | - Be careful with read permissions - you may end up paying significant data transfer fees. 175 | - Option 1: 176 | - Download the [Yelp Open Dataset](https://www.yelp.com/dataset) and upload it directly to your S3 bucket (`yelp_dataset` folder). 177 | - Option 2 (for slow internet connections): 178 | - Launch an EC2 instance with at least 20GB of SSD storage. 179 | - Connect to this instance via SSH (click "Connect" and proceed according to the AWS instructions). 180 | - Proceed to the dataset homepage, fill in your information, copy the download link, and paste it into the command below. Note: the link is valid for 30 seconds. 181 | ```bash 182 | wget -O yelp_dataset.tar.gz "[your_download_link]" 183 | tar -xvzf yelp_dataset.tar.gz 184 | ``` 185 | - Finally, transfer the files as described in [this blog](http://codeomitted.com/transfer-files-from-ec2-to-s3/) 186 | - Remember to provide the IAM role and credentials of a user who has AmazonS3FullAccess. 187 | - In case your instance has no AWS CLI installed, follow [this documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) 188 | - In case you run into errors such as "Unable to locate package python3-pip", follow [this answer](https://askubuntu.com/questions/1061486/unable-to-locate-package-python-pip-when-trying-to-install-from-fresh-18-04-in#answer-1061488) 189 | - Download the JSON file from [U.S. City Demographic Data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) 190 | - Upload it to a separate folder (`demo_dataset`) on your S3 bucket. 191 | - Download the whole dataset from [Historical Hourly Weather Data 2012-2017](https://www.kaggle.com/selfishgene/historical-hourly-weather-data/downloads/historical-hourly-weather-data.zip) 192 | - Unzip and upload the `city_attributes.csv`, `temperature.csv`, and `weather_description.csv` files to a separate folder (`weather_dataset`) on your S3 bucket. 193 | 194 | ### Amazon EMR 195 | 196 | - Configure and create your EMR cluster. 197 | - Go to advanced options and enable Apache Spark, Livy and AWS Glue Data Catalog for Spark. 198 | - Enter the following configuration JSON to make Python 3 the default: 199 | ```json 200 | [ 201 | { 202 | "Classification": "spark-env", 203 | "Configurations": [ 204 | { 205 | "Classification": "export", 206 | "Properties": { 207 | "PYSPARK_PYTHON": "/usr/bin/python3" 208 | } 209 | } 210 | ] 211 | } 212 | ] 213 | ``` 214 | - Go to EC2 Security Groups, select your master node and enable inbound connections to port 8998. 215 | 216 | ### Amazon Redshift 217 | 218 | - Store your credentials and cluster creation parameters in `dwh.cfg` 219 | - Run `create_redshift_cluster.ipynb` to create a Redshift cluster. 220 | - Note: Delete your Redshift cluster with `delete_redshift_cluster.ipynb` when you're finished working. 221 | 222 | ### Apache Airflow 223 | 224 | - Use the [Quick Start](https://airflow.apache.org/start.html) to get a local Airflow instance up and running. 225 | - Copy the `dags` and `plugins` folders to your Airflow work environment (under the `AIRFLOW_HOME` path) 226 | - Create a new HTTP connection `livy_http_conn` by providing the host and port of the Livy server (alternatively, all three connections can be registered programmatically, as sketched after this list). 227 | 228 | 229 | 230 | - Create a new AWS connection `aws_credentials` by providing the user credentials and role ARN (from `dwh.cfg`) 231 | 232 | 233 | 234 | - Create a new Redshift connection `redshift` by providing the database connection parameters (from `dwh.cfg`) 235 | 236 | 237 | 238 | - In the Airflow UI, turn on and manually run the DAG "main".
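If you prefer not to click through the UI, the same connections can also be registered programmatically (a sketch for an Airflow 1.10-style local setup; hosts, credentials and the role ARN are placeholders to be taken from `dwh.cfg` - run it only once to avoid duplicate connections):

```python
from airflow import settings
from airflow.models import Connection

session = settings.Session()

# Livy endpoint on the EMR master node
session.add(Connection(
    conn_id="livy_http_conn", conn_type="http",
    host="<emr-master-dns>", port=8998))

# AWS credentials; the custom S3-to-Redshift operator reads these keys from `extra`
session.add(Connection(
    conn_id="aws_credentials", conn_type="aws",
    extra='{"aws_access_key_id": "<KEY>", "aws_secret_access_key": "<SECRET>", "role_arn": "<ARN>"}'))

# Redshift cluster (Postgres-compatible endpoint)
session.add(Connection(
    conn_id="redshift", conn_type="postgres",
    host="<redshift-endpoint>", schema="dwh",
    login="dwhuser", password="<DB_PASSWORD>", port=5439))

session.commit()
```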
239 | 240 | ## Further resources 241 | 242 | - [Yelp's Academic Dataset Examples](https://github.com/Yelp/dataset-examples) 243 | - [Spark Tips & Tricks](https://gist.github.com/dusenberrymw/30cebf98263fae206ea0ffd2cb155813) 244 | - [Use Pyspark with a Jupyter Notebook in an AWS EMR cluster](https://towardsdatascience.com/use-pyspark-with-a-jupyter-notebook-in-an-aws-emr-cluster-e5abc4cc9bdd) 245 | - [Real-world Python workloads on Spark: EMR clusters](https://becominghuman.ai/real-world-python-workloads-on-spark-emr-clusters-3c6bda1a1350) 246 | --------------------------------------------------------------------------------