├── airflow
│   ├── dags
│   │   ├── __init__.py
│   │   ├── subdags
│   │   │   ├── __init__.py
│   │   │   ├── data_quality_checks.py
│   │   │   ├── spark_jobs.py
│   │   │   └── copy_to_redshift.py
│   │   ├── scripts
│   │   │   ├── photos.py
│   │   │   ├── users.py
│   │   │   ├── tips.py
│   │   │   ├── reviews.py
│   │   │   ├── checkins.py
│   │   │   ├── elite_years.py
│   │   │   ├── friends.py
│   │   │   ├── addresses.py
│   │   │   ├── businesses.py
│   │   │   ├── cities.py
│   │   │   ├── business_hours.py
│   │   │   ├── business_categories.py
│   │   │   ├── business_attributes.py
│   │   │   └── city_weather.py
│   │   ├── configs
│   │   │   ├── check_definitions.yml
│   │   │   └── table_definitions.yml
│   │   └── main.py
│   └── plugins
│       ├── __init__.py
│       ├── redshift_plugin
│       │   ├── macros
│       │   │   ├── __init__.py
│       │   │   └── redshift_auth.py
│       │   ├── operators
│       │   │   ├── __init__.py
│       │   │   ├── redshift_check_operator.py
│       │   │   └── s3_to_redshift_operator.py
│       │   └── __init__.py
│       └── spark_plugin
│           ├── operators
│           │   ├── __init__.py
│           │   └── spark_operator.py
│           └── __init__.py
├── .gitignore
├── images
│   ├── main.png
│   ├── data-model.png
│   ├── spark_jobs.png
│   ├── amazon-s3-logo.png
│   ├── aws-connection.png
│   ├── copy_to_redshift.png
│   ├── redshift-connection.png
│   ├── 1200px-Yelp_Logo.svg.png
│   ├── livy_http_connection.png
│   ├── 1*eeiD15Xwc_2Ul2DA5u_-Gw.png
│   ├── aws-redshift-connector.png
│   ├── 1200px-Apache_Spark_Logo.svg.png
│   └── airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png
├── dwh.cfg
├── delete_redshift_cluster.ipynb
├── create_redshift_cluster.ipynb
└── README.md
/airflow/dags/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/airflow/plugins/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/airflow/dags/subdags/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/macros/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/operators/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/airflow/plugins/spark_plugin/operators/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | .ipynb_checkpoints
3 | __pycache__
--------------------------------------------------------------------------------
/images/main.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/main.png
--------------------------------------------------------------------------------
/images/data-model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/data-model.png
--------------------------------------------------------------------------------
/images/spark_jobs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/spark_jobs.png
--------------------------------------------------------------------------------
/images/amazon-s3-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/amazon-s3-logo.png
--------------------------------------------------------------------------------
/images/aws-connection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/aws-connection.png
--------------------------------------------------------------------------------
/images/copy_to_redshift.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/copy_to_redshift.png
--------------------------------------------------------------------------------
/images/redshift-connection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/redshift-connection.png
--------------------------------------------------------------------------------
/images/1200px-Yelp_Logo.svg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1200px-Yelp_Logo.svg.png
--------------------------------------------------------------------------------
/images/livy_http_connection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/livy_http_connection.png
--------------------------------------------------------------------------------
/images/1*eeiD15Xwc_2Ul2DA5u_-Gw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1*eeiD15Xwc_2Ul2DA5u_-Gw.png
--------------------------------------------------------------------------------
/images/aws-redshift-connector.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/aws-redshift-connector.png
--------------------------------------------------------------------------------
/images/1200px-Apache_Spark_Logo.svg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/1200px-Apache_Spark_Logo.svg.png
--------------------------------------------------------------------------------
/images/airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/polakowo/yelp-3nf/HEAD/images/airflow-stack-220x234-613461a0bb1df0b065a5b69146fbe061.png
--------------------------------------------------------------------------------
/dwh.cfg:
--------------------------------------------------------------------------------
1 | [AWS]
2 | KEY=
3 | SECRET=
4 |
5 | [DWH]
6 | DWH_CLUSTER_TYPE=single-node
7 | DWH_NUM_NODES=1
8 | DWH_NODE_TYPE=dc2.large
9 | DWH_IAM_ROLE_NAME=dwhRole
10 | DWH_CLUSTER_IDENTIFIER=dwhCluster
11 |
12 | [DB]
13 | DB_NAME=dwh
14 | DB_USER=dwhuser
15 | DB_PASSWORD=Passw0rd
16 | DB_PORT=5439
17 |
18 | [DB_ACCESS]
19 | DB_HOST=dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com
20 | ROLE_ARN='arn:aws:iam::953225455667:role/dwhRole'
21 |
--------------------------------------------------------------------------------
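
The cluster notebooks (create_redshift_cluster.ipynb and delete_redshift_cluster.ipynb) read these settings with configparser. A minimal sketch of how the file is consumed, assuming dwh.cfg sits in the working directory:

import configparser

config = configparser.ConfigParser()
config.read('dwh.cfg')

# Credentials and cluster settings used by the notebooks
KEY = config.get('AWS', 'KEY')
SECRET = config.get('AWS', 'SECRET')
DWH_CLUSTER_IDENTIFIER = config.get('DWH', 'DWH_CLUSTER_IDENTIFIER')
DWH_IAM_ROLE_NAME = config.get('DWH', 'DWH_IAM_ROLE_NAME')
DB_PORT = config.getint('DB', 'DB_PORT')  # 5439
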
/airflow/plugins/spark_plugin/__init__.py:
--------------------------------------------------------------------------------
1 | from airflow.plugins_manager import AirflowPlugin
2 | from spark_plugin.operators.spark_operator import SparkSubmitOperator, LivySparkOperator
3 |
4 |
5 | class SparkPlugin(AirflowPlugin):
6 | name = "SparkPlugin"
7 | operators = [
8 | SparkSubmitOperator,
9 | LivySparkOperator
10 | ]
11 | # Leave in for explicitness
12 | hooks = []
13 | executors = []
14 | macros = []
15 | admin_views = []
16 | flask_blueprints = []
17 | menu_links = []
18 |
--------------------------------------------------------------------------------
/airflow/dags/scripts/photos.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_photo_path = "s3://polakowo-yelp2/yelp_dataset/photo.json"
7 | target_photos_path = "s3://polakowo-yelp2/staging_data/photos"
8 |
9 | photos_df = spark.read.json(source_photo_path)
10 |
11 | # Even if we do not store any photos, this table is useful for knowing how many and what kind of photos were taken.
12 |
13 | photos_df.write.parquet(target_photos_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/users.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json"
7 | target_users_path = "s3://polakowo-yelp2/staging_data/users"
8 |
9 | user_df = spark.read.json(source_user_path)
10 |
11 | # Drop fields which will be outsourced and cast timestamp field
12 | users_df = user_df.drop("elite", "friends")\
13 | .withColumn("yelping_since", F.to_timestamp("yelping_since"))
14 |
15 | users_df.write.parquet(target_users_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/tips.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_tip_path = "s3://polakowo-yelp2/yelp_dataset/tip.json"
7 | target_tips_path = "s3://polakowo-yelp2/staging_data/tips"
8 |
9 | tip_df = spark.read.json(source_tip_path)
10 |
11 | # Assign each record a unique id for convenience.
12 |
13 | tips_df = tip_df.withColumnRenamed("date", "ts")\
14 | .withColumn("ts", F.to_timestamp("ts"))\
15 | .withColumn("tip_id", F.monotonically_increasing_id())
16 |
17 | tips_df.write.parquet(target_tips_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/reviews.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_review_path = "s3://polakowo-yelp2/yelp_dataset/review.json"
7 | target_reviews_path = "s3://polakowo-yelp2/staging_data/reviews"
8 |
9 | review_df = spark.read.json(source_review_path)
10 |
11 | # The table can be used almost as-is; only minor transformations are required.
12 |
13 | # date field looks more like a timestamp
14 | reviews_df = review_df.withColumnRenamed("date", "ts")\
15 | .withColumn("ts", F.to_timestamp("ts"))
16 |
17 | reviews_df.write.parquet(target_reviews_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/macros/redshift_auth.py:
--------------------------------------------------------------------------------
1 | from airflow.utils.db import provide_session
2 | from airflow.models import Connection
3 |
4 |
5 | @provide_session
6 | def get_conn(conn_id, session=None):
7 | conn = (
8 | session.query(Connection)
9 | .filter(Connection.conn_id == conn_id)
10 | .first())
11 | return conn
12 |
13 |
14 | def redshift_auth(s3_conn_id):
15 | s3_conn = get_conn(s3_conn_id)
16 | aws_key = s3_conn.extra_dejson.get('aws_access_key_id')
17 | aws_secret = s3_conn.extra_dejson.get('aws_secret_access_key')
18 | return ("aws_access_key_id={0};aws_secret_access_key={1}"
19 | .format(aws_key, aws_secret))
20 |
--------------------------------------------------------------------------------
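
The credentials string returned by redshift_auth() is meant to be interpolated into a Redshift COPY command; the actual use lives in s3_to_redshift_operator.py (not shown here). A hedged sketch of the general pattern, with the schema, table and S3 path below being illustrative placeholders rather than values taken from the operator:

from redshift_plugin.macros.redshift_auth import redshift_auth

# Hypothetical COPY statement; 'public.users' and the S3 path are placeholders.
copy_sql = """
    COPY public.users
    FROM 's3://polakowo-yelp2/staging_data/users'
    CREDENTIALS '{creds}'
    FORMAT AS PARQUET
""".format(creds=redshift_auth('aws_credentials'))
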
/airflow/dags/scripts/checkins.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_checkin_path = "s3://polakowo-yelp2/yelp_dataset/checkin.json"
7 | target_checkins_path = "s3://polakowo-yelp2/staging_data/checkins"
8 |
9 | checkin_df = spark.read.json(source_checkin_path)
10 |
11 | # Basically the same procedure as in friends.py to get a table of (business_id, ts) pairs
12 |
13 | checkins_df = checkin_df.selectExpr("business_id", "date as ts")\
14 | .withColumn("ts", F.explode(F.split(F.col("ts"), ", ")))\
15 | .where("ts != '' and ts is not null")\
16 | .withColumn("ts", F.to_timestamp("ts"))
17 |
18 | checkins_df.write.parquet(target_checkins_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/elite_years.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json"
7 | elite_years_path = "s3://polakowo-yelp2/staging_data/elite_years"
8 |
9 | user_df = spark.read.json(source_user_path)
10 |
11 | # The elite field is a comma-separated list of years masked as a single string.
12 | # Make a separate table out of it.
13 |
14 | elite_years_df = user_df.select("user_id", "elite")\
15 | .withColumn("year", F.explode(F.split(F.col("elite"), ",")))\
16 | .where("year != '' and year is not null")\
17 | .select(F.col("user_id"), F.col("year").cast("integer"))
18 |
19 | elite_years_df.write.parquet(elite_years_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/friends.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_user_path = "s3://polakowo-yelp2/yelp_dataset/user.json"
7 | friends_path = "s3://polakowo-yelp2/staging_data/friends"
8 |
9 | user_df = spark.read.json(source_user_path)
10 |
11 | # Basically the same procedure as in elite_years.py to get the table of user relationships.
12 | # Can take some time.
13 |
14 | friends_df = user_df.select("user_id", "friends")\
15 | .withColumn("friend_id", F.explode(F.split(F.col("friends"), ", ")))\
16 | .where("friend_id != '' and friend_id is not null")\
17 | .select(F.col("user_id"), F.col("friend_id"))\
18 | .distinct()
19 |
20 | friends_df.write.parquet(friends_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/__init__.py:
--------------------------------------------------------------------------------
1 | from airflow.plugins_manager import AirflowPlugin
2 | from redshift_plugin.operators.s3_to_redshift_operator import S3ToRedshiftOperator
3 | from redshift_plugin.operators.redshift_check_operator import (RedshiftCheckOperator,
4 | RedshiftValueCheckOperator, RedshiftIntervalCheckOperator)
5 | from redshift_plugin.macros.redshift_auth import redshift_auth
6 |
7 |
8 | class S3ToRedshiftPlugin(AirflowPlugin):
9 | name = "S3ToRedshiftPlugin"
10 | operators = [
11 | S3ToRedshiftOperator,
12 | RedshiftCheckOperator,
13 | RedshiftValueCheckOperator,
14 | RedshiftIntervalCheckOperator
15 | ]
16 | # Leave in for explicitness
17 | hooks = []
18 | executors = []
19 | macros = [redshift_auth]
20 | admin_views = []
21 | flask_blueprints = []
22 | menu_links = []
23 |
--------------------------------------------------------------------------------
/airflow/dags/scripts/addresses.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | source_cities_path = "s3://polakowo-yelp2/staging_data/cities"
8 | target_addresses_path = "s3://polakowo-yelp2/staging_data/addresses"
9 |
10 | business_df = spark.read.json(source_business_path)
11 | cities_df = spark.read.parquet(source_cities_path)
12 |
13 | # Pull address information from business.json, but instead of the city name take the newly created city_id.
14 |
15 | addresses_df = business_df.selectExpr("address", "latitude", "longitude", "postal_code", "city", "state as state_code")\
16 | .join(cities_df.select("city", "state_code", "city_id"), ["city", "state_code"], how='left')\
17 | .drop("city", "state_code")\
18 | .distinct()\
19 | .withColumn("address_id", F.monotonically_increasing_id())
20 |
21 | addresses_df.write.parquet(target_addresses_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/businesses.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | source_addresses_path = "s3://polakowo-yelp2/staging_data/addresses"
8 | target_businesses_path = "s3://polakowo-yelp2/staging_data/businesses"
9 |
10 | business_df = spark.read.json(source_business_path)
11 | addresses_df = spark.read.parquet(source_addresses_path)
12 |
13 | # Take the remaining business information and write it into the businesses table.
14 |
15 | businesses_df = business_df.join(addresses_df, (business_df["address"] == addresses_df["address"])
16 | & (business_df["latitude"] == addresses_df["latitude"])
17 | & (business_df["longitude"] == addresses_df["longitude"])
18 | & (business_df["postal_code"] == addresses_df["postal_code"]), how="left")\
19 | .selectExpr("business_id", "address_id", "cast(is_open as boolean)", "name", "review_count", "stars")
20 |
21 | businesses_df.write.parquet(target_businesses_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/subdags/data_quality_checks.py:
--------------------------------------------------------------------------------
1 | # from datetime import datetime, timedelta
2 | from airflow import DAG
3 | from airflow.operators.dummy_operator import DummyOperator
4 | from airflow.operators import RedshiftValueCheckOperator
5 |
6 |
7 | def data_quality_checks_subdag(
8 | parent_dag_id,
9 | dag_id,
10 | redshift_conn_id,
11 | check_definitions,
12 | *args, **kwargs):
13 | """Returns the SubDAG for performing data quality checks"""
14 |
15 | dag = DAG(
16 | f"{parent_dag_id}.{dag_id}",
17 | **kwargs
18 | )
19 |
20 | start_operator = DummyOperator(dag=dag, task_id='start_operator')
21 | end_operator = DummyOperator(dag=dag, task_id='end_operator')
22 |
23 | for check in check_definitions:
24 | check_operator = RedshiftValueCheckOperator(
25 | dag=dag,
26 | task_id=check.get('task_id', None),
27 | redshift_conn_id="redshift",
28 | sql=check.get('sql', None),
29 | pass_value=check.get('pass_value', None),
30 | tolerance=check.get('tolerance', None)
31 | )
32 |
33 | start_operator >> check_operator >> end_operator
34 |
35 | return dag
--------------------------------------------------------------------------------
/airflow/dags/configs/check_definitions.yml:
--------------------------------------------------------------------------------
1 | - task_id: check_businesses_count
2 | sql: SELECT COUNT(*) FROM businesses
3 | pass_value: 196728
4 | - task_id: check_business_attributes_count
5 | sql: SELECT COUNT(*) FROM business_attributes
6 | pass_value: 192609
7 | - task_id: check_categories_count
8 | sql: SELECT COUNT(*) FROM categories
9 | pass_value: 1298
10 | - task_id: check_business_categories_count
11 | sql: SELECT COUNT(*) FROM business_categories
12 | pass_value: 788110
13 | - task_id: check_addresses_count
14 | sql: SELECT COUNT(*) FROM addresses
15 | pass_value: 178763
16 | - task_id: check_cities_count
17 | sql: SELECT COUNT(*) FROM cities
18 | pass_value: 1258
19 | - task_id: check_city_weather_count
20 | sql: SELECT COUNT(*) FROM city_weather
21 | pass_value: 15096
22 | - task_id: check_business_hours_count
23 | sql: SELECT COUNT(*) FROM business_hours
24 | pass_value: 192609
25 | - task_id: check_users_count
26 | sql: SELECT COUNT(*) FROM users
27 | pass_value: 1637138
28 | - task_id: check_elite_years_count
29 | sql: SELECT COUNT(*) FROM elite_years
30 | pass_value: 224499
31 | - task_id: check_friends_count
32 | sql: SELECT COUNT(*) FROM friends
33 | pass_value: 75531114
34 | - task_id: check_reviews_count
35 | sql: SELECT COUNT(*) FROM reviews
36 | pass_value: 6685900
37 | - task_id: check_checkins_count
38 | sql: SELECT COUNT(*) FROM checkins
39 | pass_value: 19089148
40 | - task_id: check_tips_count
41 | sql: SELECT COUNT(*) FROM tips
42 | pass_value: 1223094
43 | - task_id: check_photos_count
44 | sql: SELECT COUNT(*) FROM photos
45 | pass_value: 200000
--------------------------------------------------------------------------------
/airflow/dags/scripts/cities.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | source_demo_path = "s3://polakowo-yelp2/demo_dataset/us-cities-demographics.json"
8 | target_cities_path = "s3://polakowo-yelp2/staging_data/cities"
9 |
10 | business_df = spark.read.json(source_business_path)
11 |
12 | # Take city and state_code from the business.json and enrich them with demographics data.
13 |
14 | ################
15 | # Demographics #
16 | ################
17 |
18 | demo_df = spark.read.json(source_demo_path)
19 |
20 | # Each JSON object here seems to describe (1) the demographics of the city
21 | # and (2) the number of people belonging to some race (race and count fields).
22 | # Since each record is unique by the city, state_code and race fields, while the other
23 | # demographic fields are unique by city and state_code alone (which means lots
24 | # of redundancy), we need to transform the race column into one column per race
25 | # value via the pivot function.
26 |
27 | def prepare_race(x):
28 | # We want to make each race a stand-alone column, thus each race value needs a proper naming
29 | return x.replace(" ", "_").replace("-", "_").lower()
30 |
31 | prepare_race_udf = F.udf(prepare_race, T.StringType())
32 |
33 | # Group by all columns except race and count and convert race rows into columns
34 | demo_df = demo_df.select("fields.*")\
35 | .withColumn("race", prepare_race_udf("race"))
36 | demo_df = demo_df.groupby(*set(demo_df.schema.names).difference(set(["race", "count"])))\
37 | .pivot('race')\
38 | .max('count')
39 | # Column order may vary between runs, so sort the columns for a stable schema
40 | demo_df = demo_df.select(*sorted(demo_df.columns))
41 |
42 | ##########
43 | # Cities #
44 | ##########
45 |
46 | # Merge city data with demographics data
47 | cities_df = business_df.selectExpr("city", "state as state_code")\
48 | .distinct()\
49 | .join(demo_df, ["city", "state_code"], how="left")\
50 | .withColumn("city_id", F.monotonically_increasing_id())
51 |
52 | cities_df.write.parquet(target_cities_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/business_hours.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | target_business_hours_path = "s3://polakowo-yelp2/staging_data/business_hours"
8 |
9 | business_df = spark.read.json(source_business_path)
10 | business_hours_df = business_df.select("business_id", "hours.*")
11 |
12 | # To enable efficient querying based on business hours, for each day of the week,
13 | # split the time-range string into "from" and "to" integers.
14 | # From
15 | # Row(
16 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ',
17 | # Monday=u'9:0-0:0',
18 | # Tuesday=u'9:0-0:0',
19 | # Wednesday=u'9:0-0:0',
20 | # Thursday=u'9:0-0:0',
21 | # Friday=u'9:0-1:0',
22 | # Saturday=u'9:0-1:0',
23 | # Sunday=u'9:0-0:0'
24 | # )
25 | # To
26 | # Row(
27 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ',
28 | # Monday_from=900,
29 | # Monday_to=0,
30 | # Tuesday_from=900,
31 | # Tuesday_to=0,
32 | # Wednesday_from=900,
33 | # Wednesday_to=0,
34 | # Thursday_from=900,
35 | # Thursday_to=0,
36 | # Friday_from=900,
37 | # Friday_to=100,
38 | # Saturday_from=900,
39 | # Saturday_to=100,
40 | # Sunday_from=900,
41 | # Sunday_to=0
42 | # )
43 |
44 | def parse_hours(x):
45 | # Take "9:0-0:0" (9am-00am) and transform it into {from: 900, to: 0}
46 | if x is None:
47 | return None
48 | convert_to_int = lambda x: int(x.split(':')[0]) * 100 + int(x.split(':')[1])
49 | return {
50 | "from": convert_to_int(x.split('-')[0]),
51 | "to": convert_to_int(x.split('-')[1])
52 | }
53 |
54 | parse_hours_udf = F.udf(parse_hours, T.StructType([
55 | T.StructField('from', T.IntegerType(), nullable=True),
56 | T.StructField('to', T.IntegerType(), nullable=True)
57 | ]))
58 |
59 | hour_attrs = [
60 | "Monday",
61 | "Tuesday",
62 | "Wednesday",
63 | "Thursday",
64 | "Friday",
65 | "Saturday",
66 | "Sunday",
67 | ]
68 |
69 | for attr in hour_attrs:
70 | business_hours_df = business_hours_df.withColumn(attr, parse_hours_udf(attr))\
71 | .selectExpr("*", attr+".from as "+attr+"_from", attr+".to as "+attr+"_to")\
72 | .drop(attr)
73 |
74 | business_hours_df.write.parquet(target_business_hours_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/subdags/spark_jobs.py:
--------------------------------------------------------------------------------
1 | from airflow import DAG
2 | from airflow.operators.dummy_operator import DummyOperator
3 | from airflow.operators import LivySparkOperator
4 |
5 |
6 | def spark_jobs_subdag(
7 | parent_dag_id,
8 | dag_id,
9 | http_conn_id,
10 | session_kind,
11 | *args, **kwargs):
12 | """Returns the SubDAG for processing data with Spark"""
13 |
14 | dag = DAG(
15 | f"{parent_dag_id}.{dag_id}",
16 | **kwargs
17 | )
18 |
19 | start_operator = DummyOperator(dag=dag, task_id='start_operator')
20 | end_operator = DummyOperator(dag=dag, task_id='end_operator')
21 |
22 | def create_task(script_name):
23 | """Returns an operator that executes the Spark script under the passed name"""
24 |
25 | with open(f'/Users/olegpolakow/airflow/dags/scripts/{script_name}.py', 'r') as f:
26 | spark_script = f.read()
27 |
28 | return LivySparkOperator(
29 | dag=dag,
30 | task_id=f"{script_name}_script",
31 | spark_script=spark_script,
32 | http_conn_id=http_conn_id,
33 | session_kind=session_kind)
34 |
35 | business_hours = create_task("business_hours")
36 | business_attributes = create_task("business_attributes")
37 | cities = create_task("cities")
38 | addresses = create_task("addresses")
39 | business_categories = create_task("business_categories")
40 | businesses = create_task("businesses")
41 | reviews = create_task("reviews")
42 | users = create_task("users")
43 | elite_years = create_task("elite_years")
44 | friends = create_task("friends")
45 | checkins = create_task("checkins")
46 | tips = create_task("tips")
47 | photos = create_task("photos")
48 | city_weather = create_task("city_weather")
49 |
50 | # Specify relationships between operators
51 | start_operator >> cities >> addresses >> businesses >> end_operator
52 | start_operator >> cities >> city_weather >> end_operator
53 | start_operator >> business_hours >> end_operator
54 | start_operator >> business_attributes >> end_operator
55 | start_operator >> business_categories >> end_operator
56 | start_operator >> reviews >> end_operator
57 | start_operator >> users >> end_operator
58 | start_operator >> elite_years >> end_operator
59 | start_operator >> friends >> end_operator
60 | start_operator >> checkins >> end_operator
61 | start_operator >> tips >> end_operator
62 | start_operator >> photos >> end_operator
63 |
64 | return dag
65 |
--------------------------------------------------------------------------------
/airflow/dags/scripts/business_categories.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | target_categories_path = "s3://polakowo-yelp2/staging_data/categories"
8 | target_business_categories_path = "s3://polakowo-yelp2/staging_data/business_categories"
9 |
10 | business_df = spark.read.json(source_business_path)
11 |
12 | ##############
13 | # Categories #
14 | ##############
15 |
16 | # First, create a list of unique categories and assign each of them an id.
17 |
18 | import re
19 | def parse_categories(categories):
20 | # Convert a comma-separated list of categories masked as a single string into a native list type
21 | if categories is None:
22 | return []
23 | parsed = []
24 | # Some strings contain commas, so they have to be extracted beforehand
25 | require_attention = set(["Wills, Trusts, & Probates"])
26 | for s in require_attention:
27 | if categories.find(s) > -1:
28 | parsed.append(s)
29 | categories = categories.replace(s, "")
30 | return list(filter(None, parsed + re.split(r",\s*", categories)))
31 |
32 | parse_categories_udf = F.udf(parse_categories, T.ArrayType(T.StringType()))
33 | business_categories_df = business_df.select("business_id", "categories")\
34 | .withColumn("categories", parse_categories_udf("categories"))
35 |
36 | # Convert the list of categories in each row into a set of rows
37 | categories_df = business_categories_df.select(F.explode("categories").alias("category"))\
38 | .dropDuplicates()\
39 | .sort("category")\
40 | .withColumn("category_id", F.monotonically_increasing_id())
41 |
42 | categories_df.write.parquet(target_categories_path, mode="overwrite")
43 |
44 | #######################
45 | # Business categories #
46 | #######################
47 |
48 | # For each record in business.json, convert list of categories in categories field into rows of pairs business_id-category_id.
49 |
50 | import re
51 | def zip_categories(business_id, categories):
52 | # For each value in categories, zip it with business_id to form a pair
53 | return list(zip([business_id] * len(categories), categories))
54 |
55 | zip_categories_udf = F.udf(zip_categories, T.ArrayType(T.ArrayType(T.StringType())))
56 |
57 | # Zip business_ids and categories and extract them into a new table called business_categories
58 | business_categories_df = business_categories_df.select(F.explode(zip_categories_udf("business_id", "categories")).alias("cols"))\
59 | .selectExpr("cols[0] as business_id", "cols[1] as category")\
60 | .dropDuplicates()
61 | business_categories_df = business_categories_df.join(categories_df, business_categories_df["category"] == categories_df["category"], how="left")\
62 | .drop("category")
63 |
64 | business_categories_df.write.parquet(target_business_categories_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/main.py:
--------------------------------------------------------------------------------
1 | from airflow import DAG
2 | from airflow.operators.dummy_operator import DummyOperator
3 | from airflow.operators.subdag_operator import SubDagOperator
4 | from airflow.operators.postgres_operator import PostgresOperator
5 |
6 | from subdags.copy_to_redshift import copy_to_redshift_subdag
7 | from subdags.data_quality_checks import data_quality_checks_subdag
8 | from subdags.spark_jobs import spark_jobs_subdag
9 |
10 | from datetime import datetime, timedelta
11 | import os
12 | import yaml
13 |
14 | start_date = datetime.now() - timedelta(days=2)
15 |
16 | default_args = {
17 | 'owner': "polakowo",
18 | 'start_date': start_date,
19 | 'catchup': False,
20 | 'depends_on_past': False,
21 | 'retries': 0
22 | }
23 |
24 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
25 |
26 | dag = DAG(DAG_ID,
27 | default_args=default_args,
28 | description="Extracts Yelp data from S3, transforms it into tables with Spark, and loads into Redshift",
29 | schedule_interval=None,
30 | max_active_runs=1)
31 |
32 | start_operator = DummyOperator(dag=dag, task_id='start_operator')
33 |
34 | # Create the SubDAG for transforming data with Spark
35 | subdag_id = "spark_jobs"
36 | spark_jobs = SubDagOperator(
37 | subdag=spark_jobs_subdag(
38 | parent_dag_id=DAG_ID,
39 | dag_id=subdag_id,
40 | http_conn_id="livy_http_conn",
41 | session_kind="pyspark",
42 | start_date=start_date),
43 | task_id=subdag_id,
44 | dag=dag)
45 |
46 | # Read table definitions from YAML file
47 | with open('/Users/olegpolakow/airflow/dags/configs/table_definitions.yml', 'r') as f:
48 | table_definitions = yaml.safe_load(f)
49 |
50 | # Create the SubDAG for copying S3 tables into Redshift
51 | subdag_id = "copy_to_redshift"
52 | s3_to_redshift = SubDagOperator(
53 | subdag=copy_to_redshift_subdag(
54 | parent_dag_id=DAG_ID,
55 | dag_id=subdag_id,
56 | table_definitions=table_definitions,
57 | redshift_conn_id='redshift',
58 | redshift_schema='public',
59 | s3_conn_id='aws_credentials',
60 | s3_bucket='polakowo-yelp2/staging_data',
61 | load_type='rebuild',
62 | schema_location='Local',
63 | start_date=start_date),
64 | task_id=subdag_id,
65 | dag=dag)
66 |
67 | # Read check definitions from YAML file
68 | with open('/Users/olegpolakow/airflow/dags/configs/check_definitions.yml', 'r') as f:
69 | check_definitions = yaml.safe_load(f)
70 |
71 | # Create the SubDAG for performing data quality checks
72 | subdag_id = "data_quality_checks"
73 | data_quality_checks = SubDagOperator(
74 | subdag=data_quality_checks_subdag(
75 | parent_dag_id=DAG_ID,
76 | dag_id=subdag_id,
77 | redshift_conn_id='redshift',
78 | check_definitions=check_definitions,
79 | start_date=start_date),
80 | task_id=subdag_id,
81 | dag=dag)
82 |
83 | end_operator = DummyOperator(dag=dag, task_id='end_operator')
84 |
85 | # Specify relationships between operators
86 | start_operator >> spark_jobs >> s3_to_redshift >> data_quality_checks >> end_operator
87 |
--------------------------------------------------------------------------------
/airflow/dags/subdags/copy_to_redshift.py:
--------------------------------------------------------------------------------
1 | from airflow import DAG
2 | from airflow.operators.dummy_operator import DummyOperator
3 | from airflow.operators import S3ToRedshiftOperator
4 |
5 | def copy_to_redshift_subdag(
6 | parent_dag_id,
7 | dag_id,
8 | table_definitions,
9 | redshift_conn_id,
10 | redshift_schema,
11 | s3_conn_id,
12 | s3_bucket,
13 | load_type,
14 | schema_location,
15 | *args, **kwargs):
16 | """Returns the SubDAG for copying S3 tables into Redshift"""
17 |
18 | dag = DAG(
19 | f"{parent_dag_id}.{dag_id}",
20 | **kwargs
21 | )
22 |
23 | start_operator = DummyOperator(dag=dag, task_id='start_operator')
24 | end_operator = DummyOperator(dag=dag, task_id='end_operator')
25 |
26 | def get_table(table_name):
27 | """Returns the table under the passed name"""
28 |
29 | for table in table_definitions:
30 | if table.get('table_name', None) == table_name:
31 | return table
32 |
33 | def create_task(table):
34 | """Returns an operator for copying the table into Redshift"""
35 |
36 | return S3ToRedshiftOperator(
37 | dag=dag,
38 | task_id=f"copy_{table.get('table_name', None)}_to_redshift",
39 | redshift_conn_id=redshift_conn_id,
40 | redshift_schema=redshift_schema,
41 | table=table.get('table_name', None),
42 | s3_conn_id=s3_conn_id,
43 | s3_bucket=s3_bucket,
44 | s3_key=table.get('s3_key', None),
45 | load_type=load_type,
46 | copy_params=table.get('copy_params', None),
47 | origin_schema=table.get('origin_schema', None),
48 | primary_key=table.get('primary_key', None),
49 | foreign_key=table.get('foreign_key', {}),
50 | schema_location=schema_location)
51 |
52 | businesses = create_task(get_table("businesses"))
53 | business_attributes = create_task(get_table("business_attributes"))
54 | categories = create_task(get_table("categories"))
55 | business_categories = create_task(get_table("business_categories"))
56 | addresses = create_task(get_table("addresses"))
57 | cities = create_task(get_table("cities"))
58 | city_weather = create_task(get_table("city_weather"))
59 | business_hours = create_task(get_table("business_hours"))
60 | reviews = create_task(get_table("reviews"))
61 | users = create_task(get_table("users"))
62 | elite_years = create_task(get_table("elite_years"))
63 | friends = create_task(get_table("friends"))
64 | checkins = create_task(get_table("checkins"))
65 | tips = create_task(get_table("tips"))
66 | photos = create_task(get_table("photos"))
67 |
68 | # We could copy all tables defined in the YAML file in parallel,
69 | # but let's respect referential integrity instead.
70 | # See the UML diagram for the acyclic graph of table references.
71 |
72 | start_operator >> cities
73 | start_operator >> categories
74 | start_operator >> users
75 |
76 | cities >> addresses
77 | cities >> city_weather >> end_operator
78 |
79 | addresses >> businesses
80 |
81 | businesses >> business_attributes >> end_operator
82 | businesses >> business_categories >> end_operator
83 | businesses >> business_hours >> end_operator
84 | businesses >> checkins >> end_operator
85 | businesses >> photos >> end_operator
86 | businesses >> tips >> end_operator
87 | businesses >> reviews >> end_operator
88 |
89 | categories >> business_categories >> end_operator
90 |
91 | users >> reviews >> end_operator
92 | users >> tips >> end_operator
93 | users >> friends >> end_operator
94 | users >> elite_years >> end_operator
95 |
96 | return dag
97 |
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/operators/redshift_check_operator.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # Licensed to the Apache Software Foundation (ASF) under one
4 | # or more contributor license agreements. See the NOTICE file
5 | # distributed with this work for additional information
6 | # regarding copyright ownership. The ASF licenses this file
7 | # to you under the Apache License, Version 2.0 (the
8 | # "License"); you may not use this file except in compliance
9 | # with the License. You may obtain a copy of the License at
10 | #
11 | # http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing,
14 | # software distributed under the License is distributed on an
15 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16 | # KIND, either express or implied. See the License for the
17 | # specific language governing permissions and limitations
18 | # under the License.
19 | from typing import Any, Dict
20 |
21 | from airflow.hooks.postgres_hook import PostgresHook
22 | from airflow.operators.check_operator import CheckOperator, \
23 | ValueCheckOperator, IntervalCheckOperator
24 | from airflow.utils.decorators import apply_defaults
25 |
26 |
27 | class RedshiftCheckOperator(CheckOperator):
28 | """
29 | Performs checks against Redshift. The ``RedshiftCheckOperator`` expects
30 | a sql query that will return a single row. Each value on that
31 | first row is evaluated using python ``bool`` casting. If any of the
32 | values return ``False`` the check is failed and errors out.
33 | Note that Python bool casting evals the following as ``False``:
34 | * ``False``
35 | * ``0``
36 | * Empty string (``""``)
37 | * Empty list (``[]``)
38 | * Empty dictionary or set (``{}``)
39 | Given a query like ``SELECT COUNT(*) FROM foo``, it will fail only if
40 | the count ``== 0``. You can craft much more complex query that could,
41 | for instance, check that the table has the same number of rows as
42 | the source table upstream, or that the count of today's partition is
43 | greater than yesterday's partition, or that a set of metrics are less
44 | than 3 standard deviation for the 7 day average.
45 | This operator can be used as a data quality check in your pipeline, and
46 | depending on where you put it in your DAG, you have the choice to
47 | stop the critical path, preventing from
48 | publishing dubious data, or on the side and receive email alerts
49 | without stopping the progress of the DAG.
50 | :param sql: the sql to be executed
51 | :type sql: str
52 | :param redshift_conn_id: reference to the Redshift database
53 | :type redshift_conn_id: str
54 | """
55 |
56 | @apply_defaults
57 | def __init__(
58 | self,
59 | sql: str,
60 | redshift_conn_id: str = 'redshift_default',
61 | *args, **kwargs) -> None:
62 | super().__init__(sql=sql, *args, **kwargs)
63 |
64 | self.redshift_conn_id = redshift_conn_id
65 | self.sql = sql
66 |
67 | def get_db_hook(self):
68 | return PostgresHook(postgres_conn_id=self.redshift_conn_id)
69 |
70 |
71 | class RedshiftValueCheckOperator(ValueCheckOperator):
72 | """
73 | Performs a simple value check using sql code.
74 | :param sql: the sql to be executed
75 | :type sql: str
76 | :param redshift_conn_id: reference to the Redshift database
77 | :type redshift_conn_id: str
78 | """
79 |
80 | @apply_defaults
81 | def __init__(
82 | self,
83 | sql: str,
84 | pass_value: Any,
85 | tolerance: Any = None,
86 | redshift_conn_id: str = 'redshift_default',
87 | *args, **kwargs):
88 | super().__init__(
89 | sql=sql, pass_value=pass_value, tolerance=tolerance,
90 | *args, **kwargs)
91 | self.redshift_conn_id = redshift_conn_id
92 |
93 | def get_db_hook(self):
94 | return PostgresHook(postgres_conn_id=self.redshift_conn_id)
95 |
96 |
97 | class RedshiftIntervalCheckOperator(IntervalCheckOperator):
98 | """
99 | Checks that the values of metrics given as SQL expressions are within
100 | a certain tolerance of the ones from days_back before.
101 | :param table: the table name
102 | :type table: str
103 | :param days_back: number of days between ds and the ds we want to check
104 | against. Defaults to 7 days
105 | :type days_back: int
106 | :param metrics_thresholds: a dictionary of ratios indexed by metrics
107 | :type metrics_threshold: dict
108 | :param redshift_conn_id: reference to the Redshift database
109 | :type redshift_conn_id: str
110 | """
111 |
112 | @apply_defaults
113 | def __init__(
114 | self,
115 | table: str,
116 | metrics_thresholds: Dict,
117 | date_filter_column: str = 'ds',
118 | days_back: int = -7,
119 | redshift_conn_id: str = 'redshift_default',
120 | *args, **kwargs):
121 | super().__init__(
122 | table=table, metrics_thresholds=metrics_thresholds,
123 | date_filter_column=date_filter_column, days_back=days_back,
124 | *args, **kwargs)
125 | self.redshift_conn_id = redshift_conn_id
126 |
127 | def get_db_hook(self):
128 | return PostgresHook(postgres_conn_id=self.redshift_conn_id)
--------------------------------------------------------------------------------
/airflow/dags/scripts/business_attributes.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import functions as F
2 | from pyspark.sql import types as T
3 | from pyspark.sql import Window, Row
4 |
5 | # File paths
6 | source_business_path = "s3://polakowo-yelp2/yelp_dataset/business.json"
7 | target_business_attributes_path = "s3://polakowo-yelp2/staging_data/business_attributes"
8 |
9 | business_df = spark.read.json(source_business_path)
10 | business_attributes_df = business_df.select("business_id", "attributes.*")
11 |
12 | # Unfold the deeply nested attributes field into a new table.
13 |
14 | ##################
15 | # Parse booleans #
16 | ##################
17 |
18 | # From
19 | # Row(AcceptsInsurance=None),
20 | # Row(AcceptsInsurance=u'None'),
21 | # Row(AcceptsInsurance=u'False'),
22 | # Row(AcceptsInsurance=u'True')
23 | # To
24 | # Row(AcceptsInsurance=None),
25 | # Row(AcceptsInsurance=None),
26 | # Row(AcceptsInsurance=False)
27 | # Row(AcceptsInsurance=True)
28 |
29 | def parse_boolean(x):
30 | # Convert boolean strings to native boolean format
31 | if x is None or x == 'None':
32 | return None
33 | if x == 'True':
34 | return True
35 | if x == 'False':
36 | return False
37 |
38 | parse_boolean_udf = F.udf(parse_boolean, T.BooleanType())
39 |
40 | bool_attrs = [
41 | "AcceptsInsurance",
42 | "BYOB",
43 | "BikeParking",
44 | "BusinessAcceptsBitcoin",
45 | "BusinessAcceptsCreditCards",
46 | "ByAppointmentOnly",
47 | "Caters",
48 | "CoatCheck",
49 | "Corkage",
50 | "DogsAllowed",
51 | "DriveThru",
52 | "GoodForDancing",
53 | "GoodForKids",
54 | "HappyHour",
55 | "HasTV",
56 | "Open24Hours",
57 | "OutdoorSeating",
58 | "RestaurantsCounterService",
59 | "RestaurantsDelivery",
60 | "RestaurantsGoodForGroups",
61 | "RestaurantsReservations",
62 | "RestaurantsTableService",
63 | "RestaurantsTakeOut",
64 | "WheelchairAccessible"
65 | ]
66 |
67 | for attr in bool_attrs:
68 | business_attributes_df = business_attributes_df.withColumn(attr, parse_boolean_udf(attr))
69 |
70 | #################
71 | # Parse strings #
72 | #################
73 |
74 | # From
75 | # Row(AgesAllowed=None),
76 | # Row(AgesAllowed=u'None'),
77 | # Row(AgesAllowed=u"u'18plus'"),
78 | # Row(AgesAllowed=u"u'19plus'")
79 | # Row(AgesAllowed=u"u'21plus'"),
80 | # Row(AgesAllowed=u"u'allages'"),
81 | # To
82 | # Row(AgesAllowed=None),
83 | # Row(AgesAllowed=u'none'),
84 | # Row(AgesAllowed=u'18plus')
85 | # Row(AgesAllowed=u'19plus'),
86 | # Row(AgesAllowed=u'21plus'),
87 | # Row(AgesAllowed=u'allages'),
88 |
89 |
90 |
91 | def parse_string(x):
92 | # Clean and standardize strings
93 | # Do not cast "None" into None since it has a special meaning
94 | if x is None or x == '':
95 | return None
96 | # Some strings are of format u"u'string'"
97 | return x.replace("u'", "").replace("'", "").lower()
98 |
99 | parse_string_udf = F.udf(parse_string, T.StringType())
100 |
101 | str_attrs = [
102 | "AgesAllowed",
103 | "Alcohol",
104 | "BYOBCorkage",
105 | "NoiseLevel",
106 | "RestaurantsAttire",
107 | "Smoking",
108 | "WiFi",
109 | ]
110 |
111 | for attr in str_attrs:
112 | business_attributes_df = business_attributes_df.withColumn(attr, parse_string_udf(attr))
113 |
114 | ##################
115 | # Parse integers #
116 | ##################
117 |
118 | # From
119 | # Row(RestaurantsPriceRange2=u'None'),
120 | # Row(RestaurantsPriceRange2=None),
121 | # Row(RestaurantsPriceRange2=u'1'),
122 | # Row(RestaurantsPriceRange2=u'2')]
123 | # Row(RestaurantsPriceRange2=u'3'),
124 | # Row(RestaurantsPriceRange2=u'4'),
125 | # To
126 | # Row(RestaurantsPriceRange2=None),
127 | # Row(RestaurantsPriceRange2=None),
128 | # Row(RestaurantsPriceRange2=1),
129 | # Row(RestaurantsPriceRange2=2)
130 | # Row(RestaurantsPriceRange2=3),
131 | # Row(RestaurantsPriceRange2=4),
132 |
133 | def parse_integer(x):
134 | # Convert integers masked as strings to native integer format
135 | if x is None or x == 'None':
136 | return None
137 | return int(x)
138 |
139 | parse_integer_udf = F.udf(parse_integer, T.IntegerType())
140 |
141 | int_attrs = [
142 | "RestaurantsPriceRange2",
143 | ]
144 |
145 | for attr in int_attrs:
146 | business_attributes_df = business_attributes_df.withColumn(attr, parse_integer_udf(attr))
147 |
148 | #######################
149 | # Parse boolean dicts #
150 | #######################
151 |
152 | # From
153 | # Row(
154 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ',
155 | # Ambience=u"{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}"
156 | # )
157 | # To
158 | # Row(
159 | # business_id=u'QXAEGFB4oINsVuTFxEYKFQ',
160 | # Ambience_romantic=False,
161 | # Ambience_intimate=False,
162 | # Ambience_classy=False,
163 | # Ambience_hipster=False,
164 | # Ambience_divey=False,
165 | # Ambience_touristy=False,
166 | # Ambience_trendy=False,
167 | # Ambience_upscale=False,
168 | # Ambience_casual=True
169 | # )
170 |
171 | import ast
172 |
173 | def parse_boolean_dict(x):
174 | # Convert dicts masked as strings to string:boolean format
175 | if x is None or x == 'None' or x == '':
176 | return None
177 | return ast.literal_eval(x)
178 |
179 | parse_boolean_dict_udf = F.udf(parse_boolean_dict, T.MapType(T.StringType(), T.BooleanType()))
180 |
181 | bool_dict_attrs = [
182 | "Ambience",
183 | "BestNights",
184 | "BusinessParking",
185 | "DietaryRestrictions",
186 | "GoodForMeal",
187 | "HairSpecializesIn",
188 | "Music"
189 | ]
190 |
191 | for attr in bool_dict_attrs:
192 | business_attributes_df = business_attributes_df.withColumn(attr, parse_boolean_dict_udf(attr))
193 | # Get all keys of the MapType
194 | # [Row(key=u'romantic'), Row(key=u'casual'), ...
195 | key_rows = business_attributes_df.select(F.explode(attr)).select("key").distinct().collect()
196 | # Convert each key into column (with proper name)
197 | exprs = ["{}['{}'] as {}".format(attr, row.key, attr+"_"+row.key.replace('-', '_')) for row in key_rows]
198 | business_attributes_df = business_attributes_df.selectExpr("*", *exprs).drop(attr)
199 |
200 | business_attributes_df.write.parquet(target_business_attributes_path, mode="overwrite")
--------------------------------------------------------------------------------
/airflow/dags/scripts/city_weather.py:
--------------------------------------------------------------------------------
1 | """Take city and state_code from the business.json and enrich them with demographics data."""
2 |
3 | from pyspark.sql import functions as F
4 | from pyspark.sql import types as T
5 | from pyspark.sql import Window, Row
6 |
7 | # File paths
8 | source_cities_path = "s3://polakowo-yelp2/staging_data/cities"
9 | source_city_attr_path = "s3://polakowo-yelp2/weather_dataset/city_attributes.csv"
10 | source_weather_temp_path = "s3://polakowo-yelp2/weather_dataset/temperature.csv"
11 | source_weather_desc_path = "s3://polakowo-yelp2/weather_dataset/weather_description.csv"
12 | target_city_weather_path = "s3://polakowo-yelp2/staging_data/city_weather"
13 |
14 | cities_df = spark.read.parquet(source_cities_path)
15 |
16 | ###################
17 | # City attributes #
18 | ###################
19 |
20 | # Get the names of US cities supported by this dataset and assign to each a city_id.
21 | # Requires reading the table cities.
22 |
23 | city_attr_df = spark.read\
24 | .format('csv')\
25 | .option("header", "true")\
26 | .option("delimiter", ",")\
27 | .load(source_city_attr_path)
28 |
29 | # We only want the list of US cities
30 | cities = city_attr_df.where("Country = 'United States'")\
31 | .select("City")\
32 | .distinct()\
33 | .rdd.flatMap(lambda x: x)\
34 | .collect()
35 |
36 | # The weather dataset doesn't provide the respective state codes, though.
37 | # How do we know whether "Phoenix" is in AZ or TX?
38 | # The simplest solution is to assume the largest, best-known city with that name.
39 | # Then keep only the cities that are referenced in the Yelp dataset and thus relevant to us.
40 | # The state codes below were resolved manually (e.g., via Google or a geocoding API).
41 |
42 | weather_cities_df = [
43 | Row(city='Phoenix', state_code='AZ'),
44 | Row(city='Dallas', state_code='TX'),
45 | Row(city='Los Angeles', state_code='CA'),
46 | Row(city='San Diego', state_code='CA'),
47 | Row(city='Pittsburgh', state_code='PA'),
48 | Row(city='Las Vegas', state_code='NV'),
49 | Row(city='Seattle', state_code='WA'),
50 | Row(city='New York', state_code='NY'),
51 | Row(city='Charlotte', state_code='NC'),
52 | Row(city='Denver', state_code='CO'),
53 | Row(city='Boston', state_code='MA')
54 | ]
55 | weather_cities_schema = T.StructType([
56 | T.StructField("city", T.StringType()),
57 | T.StructField("state_code", T.StringType())
58 | ])
59 | weather_cities_df = spark.createDataFrame(weather_cities_df, schema=weather_cities_schema)
60 |
61 | # Join with the cities dataset to find matches
62 | weather_cities_df = cities_df.join(weather_cities_df, ["city", "state_code"])\
63 | .select("city", "city_id")\
64 | .distinct()
65 |
66 | ################
67 | # Temperatures #
68 | ################
69 |
70 | # Read temperatures recorded hourly, transform them into daily averages, and filter by our cities.
71 | # Also, cities are columns, so transform them into rows.
72 |
73 | weather_temp_df = spark.read\
74 | .format('csv')\
75 | .option("header", "true")\
76 | .option("delimiter", ",")\
77 | .load(source_weather_temp_path)
78 |
79 | # Extract date string from time string to be able to group by day
80 | weather_temp_df = weather_temp_df.select("datetime", *cities)\
81 | .withColumn("date", F.substring("datetime", 0, 10))\
82 | .drop("datetime")
83 |
84 | # For data quality check
85 | import numpy as np
86 | phoenix_rows = weather_temp_df.where("Phoenix is not null and date = '2012-10-01'").select("Phoenix").collect()
87 | phoenix_mean_temp = np.mean([float(row.Phoenix) for row in phoenix_rows])
88 |
89 | # To transform city columns into rows, transform each city individually and union all dataframes
90 | temp_df = None
91 | for city in cities:
92 | # Get average temperature in Fahrenheit for each day and city
93 | df = weather_temp_df.select("date", city)\
94 | .withColumnRenamed(city, "temperature")\
95 | .withColumn("temperature", F.col("temperature").cast("double"))\
96 | .withColumn("city", F.lit(city))\
97 | .groupBy("date", "city")\
98 | .agg(F.mean("temperature").alias("avg_temperature"))
99 | if temp_df is None:
100 | temp_df = df
101 | else:
102 | temp_df = temp_df.union(df)
103 | weather_temp_df = temp_df
104 |
105 | # Speed up further joins
106 | weather_temp_df = weather_temp_df.repartition(1).cache()
107 | weather_temp_df.count()
108 |
109 | phoenix_mean_temp2 = weather_temp_df.where("city = 'Phoenix' and date = '2012-10-01'").collect()[0].avg_temperature
110 | assert(phoenix_mean_temp == phoenix_mean_temp2)
111 | # If the assertion passes, the aggregation was done correctly
112 |
113 | ########################
114 | # Weather descriptions #
115 | ########################
116 |
117 | # Read weather descriptions recorded hourly, pick the most frequent one on each day, and filter by our cities.
118 | # The same as for temperatures, transform columns into rows.
119 |
120 | weather_desc_df = spark.read\
121 | .format('csv')\
122 | .option("header", "true")\
123 | .option("delimiter", ",")\
124 | .load(source_weather_desc_path)
125 |
126 | # Extract date string from time string to be able to group by day
127 | weather_desc_df = weather_desc_df.select("datetime", *cities)\
128 | .withColumn("date", F.substring("datetime", 0, 10))\
129 | .drop("datetime")
130 |
131 | # For data quality check
132 | from collections import Counter
133 | phoenix_rows = weather_desc_df.where("Phoenix is not null and date = '2012-12-10'").select("Phoenix").collect()
134 | phoenix_most_common_weather = Counter([row.Phoenix for row in phoenix_rows]).most_common()[0][0]
135 |
136 | # To transform city columns into rows, transform each city individually and union all dataframes
137 | temp_df = None
138 | for city in cities:
139 | # Get the most frequent description for each day and city
140 | window = Window.partitionBy("date", "city").orderBy(F.desc("count"))
141 | df = weather_desc_df.select("date", city)\
142 | .withColumnRenamed(city, "weather_description")\
143 | .withColumn("city", F.lit(city))\
144 | .groupBy("date", "city", "weather_description")\
145 | .count()\
146 | .withColumn("order", F.row_number().over(window))\
147 | .where(F.col("order") == 1)\
148 | .drop("count", "order")
149 | if temp_df is None:
150 | temp_df = df
151 | else:
152 | temp_df = temp_df.union(df)
153 | weather_desc_df = temp_df
154 |
155 | # Speed up further joins
156 | weather_desc_df = weather_desc_df.repartition(1).cache()
157 | weather_desc_df.count()
158 |
159 | phoenix_most_common_weather2 = weather_desc_df.where("city = 'Phoenix' and date = '2012-12-10'").collect()[0].weather_description
160 | assert(phoenix_most_common_weather == phoenix_most_common_weather2)
161 | # If the assertion passes, the aggregation was done correctly
162 |
163 | ################
164 | # City weather #
165 | ################
166 |
167 | # What was the weather in the city when the particular review was posted?
168 | # Join weather description with temperature, and keep only city ids which are present in Yelp.
169 | city_weather_df = weather_temp_df.join(weather_desc_df, ["city", "date"])\
170 | .join(weather_cities_df, "city")\
171 | .drop("city")\
172 | .distinct()\
173 | .withColumn("date", F.to_date("date"))
174 |
175 | city_weather_df.write.parquet(target_city_weather_path, mode="overwrite")
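# Note: the two column-to-row (unpivot) loops above could also be expressed in a
# single select with Spark SQL's stack() generator. A minimal sketch, assuming the
# wide temperature dataframe is still available as `wide_temp_df` (hypothetical name)
# and that the city names contain no single quotes:
#
#   stack_expr = "stack({n}, {args}) as (city, temperature)".format(
#       n=len(cities),
#       args=", ".join("'{0}', `{0}`".format(c) for c in cities))
#   long_df = wide_temp_df.select("date", F.expr(stack_expr))
#
# The explicit loop-and-union is kept here because it stays readable and the number
# of cities is small.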
--------------------------------------------------------------------------------
/delete_redshift_cluster.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import boto3"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import configparser\n",
19 | "config = configparser.ConfigParser()\n",
20 | "config.read_file(open('dwh.cfg'))\n",
21 | "\n",
22 | "# Load params from configuration file\n",
23 | "KEY = config.get('AWS', 'KEY')\n",
24 | "SECRET = config.get('AWS', 'SECRET')\n",
25 | "DWH_CLUSTER_IDENTIFIER = config.get(\"DWH\", \"DWH_CLUSTER_IDENTIFIER\")\n",
26 | "DWH_IAM_ROLE_NAME = config.get(\"DWH\", \"DWH_IAM_ROLE_NAME\")"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 3,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "# Create clients\n",
36 | "iam = boto3.client(\n",
37 | " 'iam',aws_access_key_id=KEY,\n",
38 | " aws_secret_access_key=SECRET,\n",
39 | " region_name='us-west-2'\n",
40 | ")\n",
41 | "redshift = boto3.client(\n",
42 | " 'redshift',\n",
43 | " region_name=\"us-west-2\",\n",
44 | " aws_access_key_id=KEY,\n",
45 | " aws_secret_access_key=SECRET\n",
46 | ")"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 4,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "data": {
56 | "text/plain": [
57 | "{'Cluster': {'ClusterIdentifier': 'dwhcluster',\n",
58 | " 'NodeType': 'dc2.large',\n",
59 | " 'ClusterStatus': 'deleting',\n",
60 | " 'ClusterAvailabilityStatus': 'Modifying',\n",
61 | " 'MasterUsername': 'dwhuser',\n",
62 | " 'DBName': 'dwh',\n",
63 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n",
64 | " 'Port': 5439},\n",
65 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n",
66 | " 'AutomatedSnapshotRetentionPeriod': 1,\n",
67 | " 'ManualSnapshotRetentionPeriod': -1,\n",
68 | " 'ClusterSecurityGroups': [],\n",
69 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n",
70 | " 'Status': 'active'}],\n",
71 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n",
72 | " 'ParameterApplyStatus': 'in-sync'}],\n",
73 | " 'ClusterSubnetGroupName': 'default',\n",
74 | " 'VpcId': 'vpc-cdb609b5',\n",
75 | " 'AvailabilityZone': 'us-west-2a',\n",
76 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n",
77 | " 'PendingModifiedValues': {},\n",
78 | " 'ClusterVersion': '1.0',\n",
79 | " 'AllowVersionUpgrade': True,\n",
80 | " 'NumberOfNodes': 1,\n",
81 | " 'PubliclyAccessible': True,\n",
82 | " 'Encrypted': False,\n",
83 | " 'Tags': [],\n",
84 | " 'EnhancedVpcRouting': False,\n",
85 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n",
86 | " 'ApplyStatus': 'in-sync'}],\n",
87 | " 'MaintenanceTrackName': 'current',\n",
88 | " 'DeferredMaintenanceWindows': []},\n",
89 | " 'ResponseMetadata': {'RequestId': 'd6e25b1a-c086-11e9-96f7-cfa2f664abf8',\n",
90 | " 'HTTPStatusCode': 200,\n",
91 | " 'HTTPHeaders': {'x-amzn-requestid': 'd6e25b1a-c086-11e9-96f7-cfa2f664abf8',\n",
92 | " 'content-type': 'text/xml',\n",
93 | " 'content-length': '2290',\n",
94 | " 'vary': 'Accept-Encoding',\n",
95 | " 'date': 'Sat, 17 Aug 2019 00:34:57 GMT'},\n",
96 | " 'RetryAttempts': 0}}"
97 | ]
98 | },
99 | "execution_count": 4,
100 | "metadata": {},
101 | "output_type": "execute_result"
102 | }
103 | ],
104 | "source": [
105 | "redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 5,
111 | "metadata": {},
112 | "outputs": [
113 | {
114 | "data": {
115 | "text/plain": [
116 | "{'ClusterIdentifier': 'dwhcluster',\n",
117 | " 'NodeType': 'dc2.large',\n",
118 | " 'ClusterStatus': 'deleting',\n",
119 | " 'ClusterAvailabilityStatus': 'Modifying',\n",
120 | " 'MasterUsername': 'dwhuser',\n",
121 | " 'DBName': 'dwh',\n",
122 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n",
123 | " 'Port': 5439},\n",
124 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n",
125 | " 'AutomatedSnapshotRetentionPeriod': 1,\n",
126 | " 'ManualSnapshotRetentionPeriod': -1,\n",
127 | " 'ClusterSecurityGroups': [],\n",
128 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n",
129 | " 'Status': 'active'}],\n",
130 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n",
131 | " 'ParameterApplyStatus': 'in-sync'}],\n",
132 | " 'ClusterSubnetGroupName': 'default',\n",
133 | " 'VpcId': 'vpc-cdb609b5',\n",
134 | " 'AvailabilityZone': 'us-west-2a',\n",
135 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n",
136 | " 'PendingModifiedValues': {},\n",
137 | " 'ClusterVersion': '1.0',\n",
138 | " 'AllowVersionUpgrade': True,\n",
139 | " 'NumberOfNodes': 1,\n",
140 | " 'PubliclyAccessible': True,\n",
141 | " 'Encrypted': False,\n",
142 | " 'ClusterPublicKey': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbKhq1KiPJZqupL2t1GFp0catgp9xjoCUJaSdBZmEdZmW7Z6xBdwvXfM7w9TIRvvz5cZxXh9Oq9qkDp1U+q/X5tW6vDyVD7UjHzjL+QyPop8AogOE2hZgHi05DtRADvKgGLgdryxlWFOClWAoxAwEa7XtqcfZlE+KK02lB62YBfUeqI6BYTeQgOFcCd6WrggusDtz7QaJ2eFJ2fvT+FFlXySdH0YYuEIpCmFuFI0JdliX0N4euowwWu1nQH6QSA8ILofFHSjk9QGVtJVMO1GsxxamEoBfJrTpquoVa6u2xma2XdW4JbNt0zxnCcVIW6kwhAND+iBwZ6fMY0uJQBL/j Amazon-Redshift\\n',\n",
143 | " 'ClusterNodes': [{'NodeRole': 'SHARED',\n",
144 | " 'PrivateIPAddress': '172.31.17.98',\n",
145 | " 'PublicIPAddress': '52.24.252.63'}],\n",
146 | " 'ClusterRevisionNumber': '9041',\n",
147 | " 'Tags': [],\n",
148 | " 'EnhancedVpcRouting': False,\n",
149 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n",
150 | " 'ApplyStatus': 'in-sync'}],\n",
151 | " 'MaintenanceTrackName': 'current',\n",
152 | " 'DeferredMaintenanceWindows': []}"
153 | ]
154 | },
155 | "execution_count": 5,
156 | "metadata": {},
157 | "output_type": "execute_result"
158 | }
159 | ],
160 | "source": [
161 | "redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 6,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "text/plain": [
172 | "{'ResponseMetadata': {'RequestId': 'd9be23f1-bd54-11e9-b584-d392791cf985',\n",
173 | " 'HTTPStatusCode': 200,\n",
174 | " 'HTTPHeaders': {'x-amzn-requestid': 'd9be23f1-bd54-11e9-b584-d392791cf985',\n",
175 | " 'content-type': 'text/xml',\n",
176 | " 'content-length': '200',\n",
177 | " 'date': 'Mon, 12 Aug 2019 22:59:33 GMT'},\n",
178 | " 'RetryAttempts': 0}}"
179 | ]
180 | },
181 | "execution_count": 6,
182 | "metadata": {},
183 | "output_type": "execute_result"
184 | }
185 | ],
186 | "source": [
187 | "iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn=\"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\")\n",
188 | "iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": []
197 | }
198 | ],
199 | "metadata": {
200 | "kernelspec": {
201 | "display_name": "Python 3",
202 | "language": "python",
203 | "name": "python3"
204 | },
205 | "language_info": {
206 | "codemirror_mode": {
207 | "name": "ipython",
208 | "version": 3
209 | },
210 | "file_extension": ".py",
211 | "mimetype": "text/x-python",
212 | "name": "python",
213 | "nbconvert_exporter": "python",
214 | "pygments_lexer": "ipython3",
215 | "version": "3.7.3"
216 | }
217 | },
218 | "nbformat": 4,
219 | "nbformat_minor": 4
220 | }
221 |
--------------------------------------------------------------------------------
/create_redshift_cluster.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import boto3\n",
10 | "import json"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 2,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "import configparser\n",
20 | "config = configparser.ConfigParser()\n",
21 | "config.read_file(open('dwh.cfg'))\n",
22 | "\n",
23 | "# Load params from configuration file\n",
24 | "KEY = config.get('AWS', 'KEY')\n",
25 | "SECRET = config.get('AWS', 'SECRET')\n",
26 | "DWH_CLUSTER_TYPE = config.get(\"DWH\", \"DWH_CLUSTER_TYPE\")\n",
27 | "DWH_NUM_NODES = config.get(\"DWH\", \"DWH_NUM_NODES\")\n",
28 | "DWH_NODE_TYPE = config.get(\"DWH\", \"DWH_NODE_TYPE\")\n",
29 | "DWH_CLUSTER_IDENTIFIER = config.get(\"DWH\", \"DWH_CLUSTER_IDENTIFIER\")\n",
30 | "DWH_IAM_ROLE_NAME = config.get(\"DWH\", \"DWH_IAM_ROLE_NAME\")\n",
31 | "DB_NAME = config.get('DB', \"DB_NAME\")\n",
32 | "DB_USER = config.get('DB', \"DB_USER\")\n",
33 | "DB_PASSWORD = config.get('DB', \"DB_PASSWORD\")\n",
34 | "DB_PORT = config.get('DB', \"DB_PORT\")"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "# Create clients\n",
44 | "ec2 = boto3.resource(\n",
45 | " 'ec2',\n",
46 | " region_name=\"us-west-2\",\n",
47 | " aws_access_key_id=KEY,\n",
48 | " aws_secret_access_key=SECRET\n",
49 | ")\n",
50 | "iam = boto3.client(\n",
51 | " 'iam',\n",
52 | " aws_access_key_id=KEY,\n",
53 | " aws_secret_access_key=SECRET,\n",
54 | " region_name='us-west-2'\n",
55 | ")\n",
56 | "redshift = boto3.client(\n",
57 | " 'redshift',\n",
58 | " region_name=\"us-west-2\",\n",
59 | " aws_access_key_id=KEY,\n",
60 | " aws_secret_access_key=SECRET\n",
61 | ")"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 4,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stdout",
71 | "output_type": "stream",
72 | "text": [
73 | "1.1 Creating a new IAM Role\n",
74 | "An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name dwhRole already exists.\n"
75 | ]
76 | }
77 | ],
78 | "source": [
79 | "from botocore.exceptions import ClientError\n",
80 | "\n",
81 | "# Create an IAM Role that makes Redshift able to access S3 bucket (ReadOnly)\n",
82 | "try:\n",
83 | " print(\"1.1 Creating a new IAM Role\") \n",
84 | " dwhRole = iam.create_role(\n",
85 | " Path='/',\n",
86 | " RoleName=DWH_IAM_ROLE_NAME,\n",
87 | " Description=\"Allows Redshift clusters to call AWS services on your behalf.\",\n",
88 | " AssumeRolePolicyDocument=json.dumps({\n",
89 | " 'Statement': [{\n",
90 | " 'Action': 'sts:AssumeRole',\n",
91 | " 'Effect': 'Allow',\n",
92 | " 'Principal': {\n",
93 | " 'Service': 'redshift.amazonaws.com'\n",
94 | " }\n",
95 | " }]\n",
96 | " })\n",
97 | " ) \n",
98 | "except Exception as e:\n",
99 | " print(e)"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "1.2 Attaching Policy\n"
112 | ]
113 | },
114 | {
115 | "data": {
116 | "text/plain": [
117 | "200"
118 | ]
119 | },
120 | "execution_count": 5,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "# Attach Policy\n",
127 | "print(\"1.2 Attaching Policy\")\n",
128 | "iam.attach_role_policy(\n",
129 | " RoleName=DWH_IAM_ROLE_NAME,\n",
130 | " PolicyArn=\"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\"\n",
131 | ")['ResponseMetadata']['HTTPStatusCode']"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 6,
137 | "metadata": {},
138 | "outputs": [
139 | {
140 | "name": "stdout",
141 | "output_type": "stream",
142 | "text": [
143 | "1.3 Get the IAM role ARN\n",
144 | "arn:aws:iam::953225455667:role/dwhRole\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "# Get and print the IAM role ARN\n",
150 | "print(\"1.3 Get the IAM role ARN\")\n",
151 | "roleArn = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']\n",
152 | "\n",
153 | "print(roleArn)"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 7,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "# Create Redshift cluster\n",
163 | "try:\n",
164 | " response = redshift.create_cluster( \n",
165 | " #Cluster\n",
166 | " ClusterType=DWH_CLUSTER_TYPE,\n",
167 | " NodeType=DWH_NODE_TYPE,\n",
168 | "\n",
169 | " #Identifiers & Credentials\n",
170 | " DBName=DB_NAME,\n",
171 | " ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,\n",
172 | " MasterUsername=DB_USER,\n",
173 | " MasterUserPassword=DB_PASSWORD,\n",
174 | " \n",
175 | " #Roles (for s3 access)\n",
176 | " IamRoles=[roleArn] \n",
177 | " )\n",
178 | "except Exception as e:\n",
179 | " print(e)"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 13,
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "data": {
189 | "text/plain": [
190 | "{'ClusterIdentifier': 'dwhcluster',\n",
191 | " 'NodeType': 'dc2.large',\n",
192 | " 'ClusterStatus': 'available',\n",
193 | " 'ClusterAvailabilityStatus': 'Unavailable',\n",
194 | " 'MasterUsername': 'dwhuser',\n",
195 | " 'DBName': 'dwh',\n",
196 | " 'Endpoint': {'Address': 'dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com',\n",
197 | " 'Port': 5439},\n",
198 | " 'ClusterCreateTime': datetime.datetime(2019, 8, 16, 17, 57, 27, 660000, tzinfo=tzutc()),\n",
199 | " 'AutomatedSnapshotRetentionPeriod': 1,\n",
200 | " 'ManualSnapshotRetentionPeriod': -1,\n",
201 | " 'ClusterSecurityGroups': [],\n",
202 | " 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-b575adfb',\n",
203 | " 'Status': 'active'}],\n",
204 | " 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',\n",
205 | " 'ParameterApplyStatus': 'in-sync'}],\n",
206 | " 'ClusterSubnetGroupName': 'default',\n",
207 | " 'VpcId': 'vpc-cdb609b5',\n",
208 | " 'AvailabilityZone': 'us-west-2a',\n",
209 | " 'PreferredMaintenanceWindow': 'fri:08:00-fri:08:30',\n",
210 | " 'PendingModifiedValues': {},\n",
211 | " 'ClusterVersion': '1.0',\n",
212 | " 'AllowVersionUpgrade': True,\n",
213 | " 'NumberOfNodes': 1,\n",
214 | " 'PubliclyAccessible': True,\n",
215 | " 'Encrypted': False,\n",
216 | " 'ClusterPublicKey': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbKhq1KiPJZqupL2t1GFp0catgp9xjoCUJaSdBZmEdZmW7Z6xBdwvXfM7w9TIRvvz5cZxXh9Oq9qkDp1U+q/X5tW6vDyVD7UjHzjL+QyPop8AogOE2hZgHi05DtRADvKgGLgdryxlWFOClWAoxAwEa7XtqcfZlE+KK02lB62YBfUeqI6BYTeQgOFcCd6WrggusDtz7QaJ2eFJ2fvT+FFlXySdH0YYuEIpCmFuFI0JdliX0N4euowwWu1nQH6QSA8ILofFHSjk9QGVtJVMO1GsxxamEoBfJrTpquoVa6u2xma2XdW4JbNt0zxnCcVIW6kwhAND+iBwZ6fMY0uJQBL/j Amazon-Redshift\\n',\n",
217 | " 'ClusterNodes': [{'NodeRole': 'SHARED',\n",
218 | " 'PrivateIPAddress': '172.31.17.98',\n",
219 | " 'PublicIPAddress': '52.24.252.63'}],\n",
220 | " 'ClusterRevisionNumber': '9041',\n",
221 | " 'Tags': [],\n",
222 | " 'EnhancedVpcRouting': False,\n",
223 | " 'IamRoles': [{'IamRoleArn': 'arn:aws:iam::953225455667:role/dwhRole',\n",
224 | " 'ApplyStatus': 'in-sync'}],\n",
225 | " 'MaintenanceTrackName': 'current',\n",
226 | " 'DeferredMaintenanceWindows': []}"
227 | ]
228 | },
229 | "execution_count": 13,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "# Run this block several times until the cluster status becomes available\n",
236 | "cluster_props = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]\n",
237 | "cluster_props"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 14,
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "name": "stdout",
247 | "output_type": "stream",
248 | "text": [
249 | "DB_HOST :: dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com\n",
250 | "ROLE_ARN :: arn:aws:iam::953225455667:role/dwhRole\n"
251 | ]
252 | }
253 | ],
254 | "source": [
255 | "DB_HOST = cluster_props['Endpoint']['Address']\n",
256 | "ROLE_ARN = cluster_props['IamRoles'][0]['IamRoleArn']\n",
257 | "\n",
258 |     "# Store DB_HOST and ROLE_ARN in the in-memory config (not written back to dwh.cfg here)\n",
259 | "config.set('DB_ACCESS', 'DB_HOST', DB_HOST)\n",
260 | "config.set('DB_ACCESS', 'ROLE_ARN', ROLE_ARN)\n",
261 | "\n",
262 | "print(\"DB_HOST ::\", DB_HOST)\n",
263 | "print(\"ROLE_ARN ::\", ROLE_ARN)"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 15,
269 | "metadata": {},
270 | "outputs": [
271 | {
272 | "name": "stdout",
273 | "output_type": "stream",
274 | "text": [
275 | "ec2.SecurityGroup(id='sg-057a93d5984de3064')\n",
276 | "An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule \"peer: 0.0.0.0/0, TCP, from port: 5439, to port: 5439, ALLOW\" already exists\n"
277 | ]
278 | }
279 | ],
280 | "source": [
281 | "# Open an incoming TCP port to access the cluster endpoint\n",
282 | "try:\n",
283 | " vpc = ec2.Vpc(id=cluster_props['VpcId'])\n",
284 | " defaultSg = list(vpc.security_groups.all())[0]\n",
285 | " print(defaultSg)\n",
286 | " defaultSg.authorize_ingress(\n",
287 | " GroupName=defaultSg.group_name,\n",
288 | " CidrIp='0.0.0.0/0',\n",
289 | " IpProtocol='TCP',\n",
290 | " FromPort=int(DB_PORT),\n",
291 | " ToPort=int(DB_PORT)\n",
292 | " )\n",
293 | "except Exception as e:\n",
294 | " print(e)"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 16,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "%load_ext sql"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 17,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "name": "stdout",
313 | "output_type": "stream",
314 | "text": [
315 | "postgresql://dwhuser:Passw0rd@dwhcluster.ccg25xgqwmck.us-west-2.redshift.amazonaws.com:5439/dwh\n"
316 | ]
317 | },
318 | {
319 | "data": {
320 | "text/plain": [
321 | "'Connected: dwhuser@dwh'"
322 | ]
323 | },
324 | "execution_count": 17,
325 | "metadata": {},
326 | "output_type": "execute_result"
327 | }
328 | ],
329 | "source": [
330 | "# Make sure you can connect to the cluster\n",
331 | "conn_string=\"postgresql://{}:{}@{}:{}/{}\".format(DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME)\n",
332 | "print(conn_string)\n",
333 | "%sql $conn_string"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "metadata": {},
340 | "outputs": [],
341 | "source": []
342 | }
343 | ],
344 | "metadata": {
345 | "kernelspec": {
346 | "display_name": "Python 3",
347 | "language": "python",
348 | "name": "python3"
349 | },
350 | "language_info": {
351 | "codemirror_mode": {
352 | "name": "ipython",
353 | "version": 3
354 | },
355 | "file_extension": ".py",
356 | "mimetype": "text/x-python",
357 | "name": "python",
358 | "nbconvert_exporter": "python",
359 | "pygments_lexer": "ipython3",
360 | "version": "3.7.3"
361 | }
362 | },
363 | "nbformat": 4,
364 | "nbformat_minor": 4
365 | }
366 |
--------------------------------------------------------------------------------
/airflow/dags/configs/table_definitions.yml:
--------------------------------------------------------------------------------
1 | # businesses
2 | - table_name: businesses
3 | s3_key: businesses
4 | copy_params:
5 | - FORMAT AS PARQUET
6 | origin_schema:
7 | - name: business_id
8 | type: varchar(22)
9 | - name: address_id
10 | type: bigint
11 | - name: is_open
12 | type: boolean
13 | - name: name
14 | type: varchar(256)
15 | - name: review_count
16 | type: bigint
17 | - name: stars
18 | type: float
19 | primary_key: business_id
20 | foreign_key:
21 | - column_name: address_id
22 | reftable: addresses
23 | ref_column: address_id
24 |
25 | # business_attributes
26 | - table_name: business_attributes
27 | s3_key: business_attributes
28 | copy_params:
29 | - FORMAT AS PARQUET
30 | origin_schema:
31 | - name: business_id
32 | type: varchar(22)
33 | - name: AcceptsInsurance
34 | type: boolean
35 | - name: AgesAllowed
36 | type: varchar(7)
37 | - name: Alcohol
38 | type: varchar(13)
39 | - name: BYOB
40 | type: boolean
41 | - name: BYOBCorkage
42 | type: varchar(11)
43 | - name: BikeParking
44 | type: boolean
45 | - name: BusinessAcceptsBitcoin
46 | type: boolean
47 | - name: BusinessAcceptsCreditCards
48 | type: boolean
49 | - name: ByAppointmentOnly
50 | type: boolean
51 | - name: Caters
52 | type: boolean
53 | - name: CoatCheck
54 | type: boolean
55 | - name: Corkage
56 | type: boolean
57 | - name: DogsAllowed
58 | type: boolean
59 | - name: DriveThru
60 | type: boolean
61 | - name: GoodForDancing
62 | type: boolean
63 | - name: GoodForKids
64 | type: boolean
65 | - name: HappyHour
66 | type: boolean
67 | - name: HasTV
68 | type: boolean
69 | - name: NoiseLevel
70 | type: varchar(9)
71 | - name: Open24Hours
72 | type: boolean
73 | - name: OutdoorSeating
74 | type: boolean
75 | - name: RestaurantsAttire
76 | type: varchar(6)
77 | - name: RestaurantsCounterService
78 | type: boolean
79 | - name: RestaurantsDelivery
80 | type: boolean
81 | - name: RestaurantsGoodForGroups
82 | type: boolean
83 | - name: RestaurantsPriceRange2
84 | type: integer
85 | - name: RestaurantsReservations
86 | type: boolean
87 | - name: RestaurantsTableService
88 | type: boolean
89 | - name: RestaurantsTakeOut
90 | type: boolean
91 | - name: Smoking
92 | type: varchar(7)
93 | - name: WheelchairAccessible
94 | type: boolean
95 | - name: WiFi
96 | type: varchar(4)
97 | - name: Ambience_romantic
98 | type: boolean
99 | - name: Ambience_casual
100 | type: boolean
101 | - name: Ambience_trendy
102 | type: boolean
103 | - name: Ambience_intimate
104 | type: boolean
105 | - name: Ambience_hipster
106 | type: boolean
107 | - name: Ambience_upscale
108 | type: boolean
109 | - name: Ambience_divey
110 | type: boolean
111 | - name: Ambience_touristy
112 | type: boolean
113 | - name: Ambience_classy
114 | type: boolean
115 | - name: BestNights_sunday
116 | type: boolean
117 | - name: BestNights_thursday
118 | type: boolean
119 | - name: BestNights_monday
120 | type: boolean
121 | - name: BestNights_wednesday
122 | type: boolean
123 | - name: BestNights_saturday
124 | type: boolean
125 | - name: BestNights_friday
126 | type: boolean
127 | - name: BestNights_tuesday
128 | type: boolean
129 | - name: BusinessParking_valet
130 | type: boolean
131 | - name: BusinessParking_lot
132 | type: boolean
133 | - name: BusinessParking_validated
134 | type: boolean
135 | - name: BusinessParking_garage
136 | type: boolean
137 | - name: BusinessParking_street
138 | type: boolean
139 | - name: DietaryRestrictions_kosher
140 | type: boolean
141 | - name: DietaryRestrictions_dairy_free
142 | type: boolean
143 | - name: DietaryRestrictions_vegan
144 | type: boolean
145 | - name: DietaryRestrictions_vegetarian
146 | type: boolean
147 | - name: DietaryRestrictions_gluten_free
148 | type: boolean
149 | - name: DietaryRestrictions_soy_free
150 | type: boolean
151 | - name: DietaryRestrictions_halal
152 | type: boolean
153 | - name: GoodForMeal_lunch
154 | type: boolean
155 | - name: GoodForMeal_brunch
156 | type: boolean
157 | - name: GoodForMeal_dinner
158 | type: boolean
159 | - name: GoodForMeal_latenight
160 | type: boolean
161 | - name: GoodForMeal_dessert
162 | type: boolean
163 | - name: GoodForMeal_breakfast
164 | type: boolean
165 | - name: HairSpecializesIn_curly
166 | type: boolean
167 | - name: HairSpecializesIn_asian
168 | type: boolean
169 | - name: HairSpecializesIn_perms
170 | type: boolean
171 | - name: HairSpecializesIn_africanamerican
172 | type: boolean
173 | - name: HairSpecializesIn_straightperms
174 | type: boolean
175 | - name: HairSpecializesIn_kids
176 | type: boolean
177 | - name: HairSpecializesIn_coloring
178 | type: boolean
179 | - name: HairSpecializesIn_extensions
180 | type: boolean
181 | - name: Music_no_music
182 | type: boolean
183 | - name: Music_dj
184 | type: boolean
185 | - name: Music_live
186 | type: boolean
187 | - name: Music_karaoke
188 | type: boolean
189 | - name: Music_video
190 | type: boolean
191 | - name: Music_background_music
192 | type: boolean
193 | - name: Music_jukebox
194 | type: boolean
195 | primary_key: business_id
196 | foreign_key:
197 | - column_name: business_id
198 | reftable: businesses
199 | ref_column: business_id
200 |
201 | # categories
202 | - table_name: categories
203 | s3_key: categories
204 | copy_params:
205 | - FORMAT AS PARQUET
206 | origin_schema:
207 | - name: category
208 | type: varchar(35)
209 | - name: category_id
210 | type: bigint
211 | primary_key: category_id
212 |
213 | # business_categories
214 | - table_name: business_categories
215 | s3_key: business_categories
216 | copy_params:
217 | - FORMAT AS PARQUET
218 | origin_schema:
219 | - name: business_id
220 | type: varchar(22)
221 | - name: category_id
222 | type: bigint
223 | primary_key:
224 | - business_id
225 | - category_id
226 | foreign_key:
227 | - column_name: business_id
228 | reftable: businesses
229 | ref_column: business_id
230 | - column_name: category_id
231 | reftable: categories
232 | ref_column: category_id
233 |
234 | # addresses
235 | - table_name: addresses
236 | s3_key: addresses
237 | copy_params:
238 | - FORMAT AS PARQUET
239 | origin_schema:
240 | - name: address
241 | type: varchar(256)
242 | - name: latitude
243 | type: float
244 | - name: longitude
245 | type: float
246 | - name: postal_code
247 | type: varchar(8)
248 | - name: city_id
249 | type: bigint
250 | - name: address_id
251 | type: bigint
252 | primary_key: address_id
253 | foreign_key:
254 | - column_name: city_id
255 | reftable: cities
256 | ref_column: city_id
257 |
258 | # cities
259 | - table_name: cities
260 | s3_key: cities
261 | copy_params:
262 | - FORMAT AS PARQUET
263 | origin_schema:
264 | - name: city
265 | type: varchar(50)
266 | - name: state_code
267 | type: varchar(3)
268 | - name: american_indian_and_alaska_native
269 | type: bigint
270 | - name: asian
271 | type: bigint
272 | - name: average_household_size
273 | type: float
274 | - name: black_or_african_american
275 | type: bigint
276 | - name: female_population
277 | type: bigint
278 | - name: foreign_born
279 | type: bigint
280 | - name: hispanic_or_latino
281 | type: bigint
282 | - name: male_population
283 | type: bigint
284 | - name: median_age
285 | type: float
286 | - name: number_of_veterans
287 | type: bigint
288 | - name: state
289 | type: varchar(14)
290 | - name: total_population
291 | type: bigint
292 | - name: white
293 | type: bigint
294 | - name: city_id
295 | type: bigint
296 | primary_key: city_id
297 |
298 | # city_weather
299 | - table_name: city_weather
300 | s3_key: city_weather
301 | copy_params:
302 | - FORMAT AS PARQUET
303 | origin_schema:
304 | - name: date
305 | type: date
306 | - name: avg_temperature
307 | type: float
308 | - name: weather_description
309 | type: varchar(23)
310 | - name: city_id
311 | type: bigint
312 | primary_key:
313 | - city_id
314 | - date
315 | foreign_key:
316 | - column_name: city_id
317 | reftable: cities
318 | ref_column: city_id
319 |
320 | # business_hours
321 | - table_name: business_hours
322 | s3_key: business_hours
323 | copy_params:
324 | - FORMAT AS PARQUET
325 | origin_schema:
326 | - name: business_id
327 | type: varchar(22)
328 | - name: Monday_from
329 | type: int
330 | - name: Monday_to
331 | type: int
332 | - name: Tuesday_from
333 | type: int
334 | - name: Tuesday_to
335 | type: int
336 | - name: Wednesday_from
337 | type: int
338 | - name: Wednesday_to
339 | type: int
340 | - name: Thursday_from
341 | type: int
342 | - name: Thursday_to
343 | type: int
344 | - name: Friday_from
345 | type: int
346 | - name: Friday_to
347 | type: int
348 | - name: Saturday_from
349 | type: int
350 | - name: Saturday_to
351 | type: int
352 | - name: Sunday_from
353 | type: int
354 | - name: Sunday_to
355 | type: int
356 | primary_key: business_id
357 | foreign_key:
358 | - column_name: business_id
359 | reftable: businesses
360 | ref_column: business_id
361 |
362 | # users
363 | - table_name: users
364 | s3_key: users
365 | copy_params:
366 | - FORMAT AS PARQUET
367 | origin_schema:
368 | - name: average_stars
369 | type: float
370 | - name: compliment_cool
371 | type: bigint
372 | - name: compliment_cute
373 | type: bigint
374 | - name: compliment_funny
375 | type: bigint
376 | - name: compliment_hot
377 | type: bigint
378 | - name: compliment_list
379 | type: bigint
380 | - name: compliment_more
381 | type: bigint
382 | - name: compliment_note
383 | type: bigint
384 | - name: compliment_photos
385 | type: bigint
386 | - name: compliment_plain
387 | type: bigint
388 | - name: compliment_profile
389 | type: bigint
390 | - name: compliment_writer
391 | type: bigint
392 | - name: cool
393 | type: bigint
394 | - name: fans
395 | type: bigint
396 | - name: funny
397 | type: bigint
398 | - name: name
399 | type: varchar(256)
400 | - name: review_count
401 | type: bigint
402 | - name: useful
403 | type: bigint
404 | - name: user_id
405 | type: varchar(22)
406 | - name: yelping_since
407 | type: timestamp
408 | primary_key: user_id
409 |
410 | # elite_years
411 | - table_name: elite_years
412 | s3_key: elite_years
413 | copy_params:
414 | - FORMAT AS PARQUET
415 | origin_schema:
416 | - name: user_id
417 | type: varchar(22)
418 | - name: year
419 | type: int
420 | primary_key:
421 | - user_id
422 | - year
423 | foreign_key:
424 | - column_name: user_id
425 | reftable: users
426 | ref_column: user_id
427 |
428 | # friends
429 | - table_name: friends
430 | s3_key: friends
431 | copy_params:
432 | - FORMAT AS PARQUET
433 | origin_schema:
434 | - name: user_id
435 | type: varchar(22)
436 | - name: friend_id
437 | type: varchar(22)
438 | primary_key:
439 | - user_id
440 | - friend_id
441 | foreign_key:
442 | - column_name: user_id
443 | reftable: users
444 | ref_column: user_id
445 | - column_name: friend_id
446 | reftable: users
447 | ref_column: user_id
448 |
449 | # reviews
450 | - table_name: reviews
451 | s3_key: reviews
452 | copy_params:
453 | - FORMAT AS PARQUET
454 | origin_schema:
455 | - name: business_id
456 | type: varchar(22)
457 | - name: cool
458 | type: bigint
459 | - name: ts
460 | type: timestamp
461 | - name: funny
462 | type: bigint
463 | - name: review_id
464 | type: varchar(22)
465 | - name: stars
466 | type: float
467 | - name: text
468 | type: varchar(20000)
469 | - name: useful
470 | type: bigint
471 | - name: user_id
472 | type: varchar(22)
473 | primary_key: review_id
474 | foreign_key:
475 | - column_name: business_id
476 | reftable: businesses
477 | ref_column: business_id
478 | - column_name: user_id
479 | reftable: users
480 | ref_column: user_id
481 |
482 | # checkins
483 | - table_name: checkins
484 | s3_key: checkins
485 | copy_params:
486 | - FORMAT AS PARQUET
487 | origin_schema:
488 | - name: business_id
489 | type: varchar(22)
490 | - name: ts
491 | type: timestamp
492 | primary_key:
493 | - business_id
494 | - ts
495 | foreign_key:
496 | - column_name: business_id
497 | reftable: businesses
498 | ref_column: business_id
499 |
500 | # tips
501 | - table_name: tips
502 | s3_key: tips
503 | copy_params:
504 | - FORMAT AS PARQUET
505 | origin_schema:
506 | - name: business_id
507 | type: varchar(22)
508 | - name: compliment_count
509 | type: bigint
510 | - name: ts
511 | type: timestamp
512 | - name: text
513 | type: varchar(2000)
514 | - name: user_id
515 | type: varchar(22)
516 | - name: tip_id
517 | type: bigint
518 | primary_key: tip_id
519 | foreign_key:
520 | - column_name: business_id
521 | reftable: businesses
522 | ref_column: business_id
523 | - column_name: user_id
524 | reftable: users
525 | ref_column: user_id
526 |
527 | # photos
528 | - table_name: photos
529 | s3_key: photos
530 | copy_params:
531 | - FORMAT AS PARQUET
532 | origin_schema:
533 | - name: business_id
534 | type: varchar(22)
535 | - name: caption
536 | type: varchar(560)
537 | - name: label
538 | type: varchar(7)
539 | - name: photo_id
540 | type: varchar(22)
541 | primary_key: photo_id
542 | foreign_key:
543 | - column_name: business_id
544 | reftable: businesses
545 | ref_column: business_id
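# Note: each entry above mirrors the parameters of S3ToRedshiftOperator
# (airflow/plugins/redshift_plugin/operators/s3_to_redshift_operator.py):
# `table_name` and `s3_key` select the target table and the Parquet prefix in S3,
# `copy_params` is appended verbatim to the Redshift COPY statement, and
# `origin_schema`, `primary_key` and `foreign_key` are passed to the operator
# (with a local schema_location) for table creation and constraints.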
--------------------------------------------------------------------------------
/airflow/plugins/spark_plugin/operators/spark_operator.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from airflow.plugins_manager import AirflowPlugin
16 | from airflow.hooks import HttpHook
17 | from airflow.models import BaseOperator
18 | from airflow.operators import BashOperator
19 | from airflow.utils import apply_defaults
20 | import logging
21 | import textwrap
22 | import time
23 | import json
24 |
25 |
26 | class SparkSubmitOperator(BashOperator):
27 | """
28 | An operator which executes the spark-submit command through Airflow. This operator accepts all the desired
29 | arguments and assembles the spark-submit command which is then executed by the BashOperator.
30 | :param application_file: Path to a bundled jar including your application
31 | and all dependencies. The URL must be globally visible inside of
32 | your cluster, for instance, an hdfs:// path or a file:// path
33 | that is present on all nodes.
34 | :type application_file: string
35 | :param main_class: The entry point for your application
36 | (e.g. org.apache.spark.examples.SparkPi)
37 | :type main_class: string
38 | :param master: The master value for the cluster.
39 | (e.g. spark://23.195.26.187:7077 or yarn-client)
40 | :type master: string
41 | :param conf: Dictionary consisting of arbitrary Spark configuration properties.
42 | (e.g. {"spark.eventLog.enabled": "false",
43 |                  "spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"})
44 | :type conf: dict
45 | :param deploy_mode: Whether to deploy your driver on the worker nodes
46 | (cluster) or locally as an external client (default: client)
47 | :type deploy_mode: string
48 | :param other_spark_options: Other options you would like to pass to
49 |         the spark-submit command that aren't covered by the current
50 | options. (e.g. --files /path/to/file.xml)
51 | :type other_spark_options: string
52 | :param application_args: Arguments passed to the main method of your
53 | main class, if any.
54 | :type application_args: string
55 | :param xcom_push: If xcom_push is True, the last line written to stdout
56 | will also be pushed to an XCom when the bash command completes.
57 | :type xcom_push: bool
58 | :param env: If env is not None, it must be a mapping that defines the
59 | environment variables for the new process; these are used instead
60 | of inheriting the current process environment, which is the default
61 | behavior. (templated)
62 | :type env: dict
63 |     :param output_encoding: Output encoding of the bash command (default: 'utf-8')
64 | """
65 |
66 | template_fields = ('conf', 'other_spark_options', 'application_args', 'env')
67 | template_ext = []
68 | ui_color = '#e47128' # Apache Spark's Main Color: Orange
69 |
70 | @apply_defaults
71 | def __init__(
72 | self,
73 | application_file,
74 | main_class=None,
75 | master=None,
76 | conf={},
77 | deploy_mode=None,
78 | other_spark_options=None,
79 | application_args=None,
80 | xcom_push=False,
81 | env=None,
82 | output_encoding='utf-8',
83 | *args, **kwargs):
84 | self.bash_command = ""
85 | self.env = env
86 | self.output_encoding = output_encoding
87 | self.xcom_push_flag = xcom_push
88 | super(SparkSubmitOperator, self).__init__(bash_command=self.bash_command, xcom_push=xcom_push, env=env, output_encoding=output_encoding, *args, **kwargs)
89 | self.application_file = application_file
90 | self.main_class = main_class
91 | self.master = master
92 | self.conf = conf
93 | self.deploy_mode = deploy_mode
94 | self.other_spark_options = other_spark_options
95 | self.application_args = application_args
96 |
97 | def execute(self, context):
98 | logging.info("Executing SparkSubmitOperator.execute(context)")
99 |
100 | self.bash_command = "spark-submit "
101 | if self.is_not_null_and_is_not_empty_str(self.main_class):
102 | self.bash_command += "--class " + self.main_class + " "
103 | if self.is_not_null_and_is_not_empty_str(self.master):
104 | self.bash_command += "--master " + self.master + " "
105 | if self.is_not_null_and_is_not_empty_str(self.deploy_mode):
106 | self.bash_command += "--deploy-mode " + self.deploy_mode + " "
107 | for conf_key, conf_value in self.conf.items():
108 | if self.is_not_null_and_is_not_empty_str(conf_key) and self.is_not_null_and_is_not_empty_str(conf_value):
109 | self.bash_command += "--conf " + "'" + conf_key + "=" + conf_value + "'" + " "
110 | if self.is_not_null_and_is_not_empty_str(self.other_spark_options):
111 | self.bash_command += self.other_spark_options + " "
112 |
113 | self.bash_command += self.application_file + " "
114 |
115 | if self.is_not_null_and_is_not_empty_str(self.application_args):
116 | self.bash_command += self.application_args + " "
117 |
118 | logging.info("Finished assembling bash_command in SparkSubmitOperator: " + str(self.bash_command))
119 |
120 | logging.info("Executing bash execute statement")
121 | super(SparkSubmitOperator, self).execute(context)
122 |
123 | logging.info("Finished executing SparkSubmitOperator.execute(context)")
124 |
125 | @staticmethod
126 | def is_not_null_and_is_not_empty_str(value):
127 | return value is not None and value != ""
128 |
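# Example usage (hypothetical task id, file path and settings; shown only to
# illustrate the bash command this operator assembles):
#
#   submit_job = SparkSubmitOperator(
#       task_id='submit_spark_job',
#       application_file='hdfs:///jobs/city_weather.py',
#       master='yarn',
#       deploy_mode='cluster',
#       conf={'spark.executor.memory': '4g'},
#       dag=dag)
#
# execute() would then run roughly:
#   spark-submit --master yarn --deploy-mode cluster \
#       --conf 'spark.executor.memory=4g' hdfs:///jobs/city_weather.py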
129 |
130 | class LivySparkOperator(BaseOperator):
131 | """
132 | Operator to facilitate interacting with the Livy Server which executes Apache Spark code via a REST API.
133 | :param spark_script: Scala, Python or R code to submit to the Livy Server (templated)
134 | :type spark_script: string
135 |     :param session_kind: Type of session to set up with Livy. This will determine which type of code will be accepted. Possible values include "spark" (executes Scala code), "pyspark" (executes Python code) or "sparkr" (executes R code).
136 | :type session_kind: string
137 | :param http_conn_id: The http connection to run the operator against
138 | :type http_conn_id: string
139 | :param poll_interval: The polling interval to use when checking if the code in spark_script has finished executing. In seconds. (default: 30 seconds)
140 | :type poll_interval: integer
141 | """
142 |
143 | template_fields = ['spark_script'] # todo : make sure this works
144 | template_ext = ['.py', '.R', '.r']
145 |     ui_color = '#34a8dd'  # Cloudera's Main Color: Blue
146 |
147 | acceptable_response_codes = [200, 201]
148 | statement_non_terminated_status_list = ['waiting', 'running']
149 |
150 | @apply_defaults
151 | def __init__(
152 | self,
153 | spark_script,
154 | session_kind="spark", # spark, pyspark, or sparkr
155 | http_conn_id='http_default',
156 | poll_interval=30,
157 | *args, **kwargs):
158 | super(LivySparkOperator, self).__init__(*args, **kwargs)
159 |
160 | self.spark_script = spark_script
161 | self.session_kind = session_kind
162 | self.http_conn_id = http_conn_id
163 | self.poll_interval = poll_interval
164 |
165 | self.http = HttpHook("GET", http_conn_id=self.http_conn_id)
166 |
167 | def execute(self, context):
168 | logging.info("Executing LivySparkOperator.execute(context)")
169 |
170 | logging.info("Validating arguments...")
171 | self._validate_arguments()
172 | logging.info("Finished validating arguments")
173 |
174 | logging.info("Creating a Livy Session...")
175 | session_id = self._create_session()
176 | logging.info("Finished creating a Livy Session. (session_id: " + str(session_id) + ")")
177 |
178 | logging.info("Submitting spark script...")
179 | statement_id, overall_statements_state = self._submit_spark_script(session_id=session_id)
180 | logging.info("Finished submitting spark script. (statement_id: " + str(statement_id) + ", overall_statements_state: " + str(overall_statements_state) + ")")
181 |
182 | poll_for_completion = (overall_statements_state in self.statement_non_terminated_status_list)
183 |
184 | if poll_for_completion:
185 | logging.info("Spark job did not complete immediately. Starting to Poll for completion...")
186 |
187 | while overall_statements_state in self.statement_non_terminated_status_list: # todo: test execution_timeout
188 | logging.info("Sleeping for " + str(self.poll_interval) + " seconds...")
189 | time.sleep(self.poll_interval)
190 | logging.info("Finished sleeping. Checking if Spark job has completed...")
191 | statements = self._get_session_statements(session_id=session_id)
192 |
193 | is_all_complete = True
194 | for statement in statements:
195 | if statement["state"] in self.statement_non_terminated_status_list:
196 | is_all_complete = False
197 |
198 | # In case one of the statements finished with errors throw exception
199 | elif statement["state"] != 'available' or statement["output"]["status"] == 'error':
200 | logging.error("Statement failed. (state: " + str(statement["state"]) + ". Output:\n" +
201 | str(statement["output"]))
202 | response = self._close_session(session_id=session_id)
203 | logging.error("Closed session. (response: " + str(response) + ")")
204 | raise Exception("Statement failed. (state: " + str(statement["state"]) + ". Output:\n" +
205 | str(statement["output"]))
206 |
207 | if is_all_complete:
208 | overall_statements_state = "available"
209 |
210 | logging.info("Finished checking if Spark job has completed. (overall_statements_state: " + str(overall_statements_state) + ")")
211 |
212 | if poll_for_completion:
213 | logging.info("Finished Polling for completion.")
214 |
215 | logging.info("Session Logs:\n" + str(self._get_session_logs(session_id=session_id)))
216 |
217 | for statement in self._get_session_statements(session_id):
218 | logging.info("Statement '" + str(statement["id"]) + "' Output:\n" + str(statement["output"]))
219 |
220 | logging.info("Closing session...")
221 | response = self._close_session(session_id=session_id)
222 | logging.info("Finished closing session. (response: " + str(response) + ")")
223 |
224 | logging.info("Finished executing LivySparkOperator.execute(context)")
225 |
226 | def _validate_arguments(self):
227 | if self.session_kind is None or self.session_kind == "":
228 | raise Exception(
229 | "session_kind argument is invalid. It is empty or None. (value: '" + str(self.session_kind) + "')")
230 | elif self.session_kind not in ["spark", "pyspark", "sparkr"]:
231 | raise Exception(
232 | "session_kind argument is invalid. It should be set to 'spark', 'pyspark', or 'sparkr'. (value: '" + str(
233 | self.session_kind) + "')")
234 |
235 | def _get_sessions(self):
236 | method = "GET"
237 | endpoint = "sessions"
238 | response = self._http_rest_call(method=method, endpoint=endpoint)
239 |
240 | if response.status_code in self.acceptable_response_codes:
241 | return response.json()["sessions"]
242 | else:
243 | raise Exception("Call to get sessions didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.")
244 |
245 | def _get_session(self, session_id):
246 | sessions = self._get_sessions()
247 | for session in sessions:
248 | if session["id"] == session_id:
249 | return session
250 |
251 | def _get_session_logs(self, session_id):
252 | method = "GET"
253 | endpoint = "sessions/" + str(session_id) + "/log"
254 | response = self._http_rest_call(method=method, endpoint=endpoint)
255 | return response.json()
256 |
257 | def _create_session(self):
258 | method = "POST"
259 | endpoint = "sessions"
260 |
261 | data = {
262 | "kind": self.session_kind
263 | }
264 |
265 | response = self._http_rest_call(method=method, endpoint=endpoint, data=data)
266 |
267 | if response.status_code in self.acceptable_response_codes:
268 | response_json = response.json()
269 | session_id = response_json["id"]
270 | session_state = response_json["state"]
271 |
272 | if session_state == "starting":
273 | logging.info("Session is starting. Polling to see if it is ready...")
274 |
275 | session_state_polling_interval = 10
276 | while session_state == "starting":
277 | logging.info("Sleeping for " + str(session_state_polling_interval) + " seconds")
278 | time.sleep(session_state_polling_interval)
279 | session_state_check_response = self._get_session(session_id=session_id)
280 | session_state = session_state_check_response["state"]
281 | logging.info("Got latest session state as '" + session_state + "'")
282 |
283 | return session_id
284 | else:
285 | raise Exception("Call to create a new session didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.")
286 |
287 | def _submit_spark_script(self, session_id):
288 | method = "POST"
289 | endpoint = "sessions/" + str(session_id) + "/statements"
290 |
291 | logging.info("Executing Spark Script: \n" + str(self.spark_script))
292 |
293 | data = {
294 | 'code': textwrap.dedent(self.spark_script)
295 | }
296 |
297 | response = self._http_rest_call(method=method, endpoint=endpoint, data=data)
298 |
299 | if response.status_code in self.acceptable_response_codes:
300 | response_json = response.json()
301 | return response_json["id"], response_json["state"]
302 | else:
303 | raise Exception("Call to create a new statement didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.")
304 |
305 | def _get_session_statements(self, session_id):
306 | method = "GET"
307 | endpoint = "sessions/" + str(session_id) + "/statements"
308 | response = self._http_rest_call(method=method, endpoint=endpoint)
309 |
310 | if response.status_code in self.acceptable_response_codes:
311 | response_json = response.json()
312 | statements = response_json["statements"]
313 | return statements
314 | else:
315 | raise Exception("Call to get the session statement response didn't return " + str(self.acceptable_response_codes) + ". Returned '" + str(response.status_code) + "'.")
316 |
317 | def _close_session(self, session_id):
318 | method = "DELETE"
319 | endpoint = "sessions/" + str(session_id)
320 | return self._http_rest_call(method=method, endpoint=endpoint)
321 |
322 | def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
323 | if not extra_options:
324 | extra_options = {}
325 | logging.debug("Performing HTTP REST call... (method: " + str(method) + ", endpoint: " + str(endpoint) + ", data: " + str(data) + ", headers: " + str(headers) + ")")
326 | self.http.method = method
327 | response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
328 |
329 | logging.debug("status_code: " + str(response.status_code))
330 | logging.debug("response_as_json: " + str(response.json()))
331 |
332 | return response
333 |
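# Example usage (hypothetical connection id and script; `http_conn_id` must point to
# an Airflow HTTP connection for the Livy server, e.g. the Livy host on port 8998):
#
#   run_job = LivySparkOperator(
#       task_id='run_city_weather',
#       spark_script='city_weather.py',  # templated: *.py paths are rendered to file contents
#       session_kind='pyspark',
#       http_conn_id='livy_http_conn',
#       poll_interval=30,
#       dag=dag)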
334 |
335 | # Defining the plugin class
336 | class SparkOperatorPlugin(AirflowPlugin):
337 | name = "spark_operator_plugin"
338 | operators = [SparkSubmitOperator, LivySparkOperator]
339 | flask_blueprints = []
340 | hooks = []
341 | executors = []
342 | admin_views = []
343 | menu_links = []
--------------------------------------------------------------------------------
/airflow/plugins/redshift_plugin/operators/s3_to_redshift_operator.py:
--------------------------------------------------------------------------------
1 | import json
2 | import random
3 | import string
4 | import logging
5 |
6 | from airflow.utils.db import provide_session
7 | from airflow.models import Connection
8 | from airflow.utils.decorators import apply_defaults
9 |
10 | from airflow.models import BaseOperator
11 | from airflow.hooks.S3_hook import S3Hook
12 | from airflow.hooks.postgres_hook import PostgresHook
13 |
14 | # https://github.com/airflow-plugins/redshift_plugin
15 | # We edited it slightly to accept composite primary keys
16 |
17 |
18 | class S3ToRedshiftOperator(BaseOperator):
19 | """
20 | S3 To Redshift Operator
21 | :param redshift_conn_id: The destination redshift connection id.
22 | :type redshift_conn_id: string
23 | :param redshift_schema: The destination redshift schema.
24 | :type redshift_schema: string
25 | :param table: The destination redshift table.
26 | :type table: string
27 | :param s3_conn_id: The source s3 connection id.
28 | :type s3_conn_id: string
29 | :param s3_bucket: The source s3 bucket.
30 | :type s3_bucket: string
31 | :param s3_key: The source s3 key.
32 | :type s3_key: string
33 | :param copy_params: The parameters to be included when issuing
34 | the copy statement in Redshift.
35 | :type copy_params: list
36 | :param origin_schema: The s3 key for the incoming data schema.
37 | Expects a JSON file with an array of
38 | dictionaries specifying name and type.
39 | (e.g. {"name": "_id", "type": "int4"})
40 | :type origin_schema: array of dictionaries
41 | :param schema_location: The location of the origin schema. This
42 | can be set to 'S3' or 'Local'.
43 | If 'S3', it will expect a valid S3 Key. If
44 | 'Local', it will expect a dictionary that
45 | is defined in the operator itself. By
46 | default the location is set to 's3'.
47 | :type schema_location: string
48 | :param load_type: The method of loading into Redshift that
49 | should occur. Options:
50 | - "append"
51 | - "rebuild"
52 | - "truncate"
53 | - "upsert"
54 | Defaults to "append."
55 | :type load_type: string
56 | :param primary_key: *(optional)* The primary key for the
57 | destination table. Not enforced by redshift
58 | and only required if using a load_type of
59 | "upsert". It will expect a string or an
60 | array of strings.
61 | :type primary_key: string
62 | :param incremental_key: *(optional)* The incremental key to compare
63 | new data against the destination table
64 | with. Only required if using a load_type of
65 | "upsert".
66 | :type incremental_key: string
67 | :param foreign_key: *(optional)* This specifies any foreign_keys
68 | in the table and which corresponding table
69 | and key they reference. This may be either
70 | a dictionary or list of dictionaries (for
71 | multiple foreign keys). The fields that are
72 | required in each dictionary are:
73 | - column_name
74 | - reftable
75 | - ref_column
76 | :type foreign_key: dictionary
77 | :param distkey: *(optional)* The distribution key for the
78 | table. Only one key may be specified.
79 | :type distkey: string
80 | :param sortkey: *(optional)* The sort keys for the table.
81 | If more than one key is specified, set this
82 | as a list.
83 | :type sortkey: string
84 | :param sort_type: *(optional)* The style of distribution
85 | to sort the table. Possible values include:
86 | - compound
87 | - interleaved
88 | Defaults to "compound".
89 | :type sort_type: string
90 | """
91 |
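    # Example usage (hypothetical connection ids and bucket; the table fields mirror
    # one entry of airflow/dags/configs/table_definitions.yml):
    #
    #   load_businesses = S3ToRedshiftOperator(
    #       task_id='load_businesses',
    #       s3_conn_id='aws_credentials',
    #       s3_bucket='my-yelp-bucket',
    #       s3_key='businesses',
    #       redshift_conn_id='redshift',
    #       redshift_schema='public',
    #       table='businesses',
    #       copy_params=['FORMAT AS PARQUET'],
    #       origin_schema=[{'name': 'business_id', 'type': 'varchar(22)'}],
    #       schema_location='local',
    #       load_type='rebuild',
    #       primary_key='business_id',
    #       dag=dag)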
92 | template_fields = ('s3_key',
93 | 'origin_schema')
94 |
95 | @apply_defaults
96 | def __init__(self,
97 | s3_conn_id,
98 | s3_bucket,
99 | s3_key,
100 | redshift_conn_id,
101 | redshift_schema,
102 | table,
103 | copy_params=[],
104 | origin_schema=None,
105 | schema_location='s3',
106 | load_type='append',
107 | primary_key=None,
108 | incremental_key=None,
109 | foreign_key={},
110 | distkey=None,
111 | sortkey='',
112 | sort_type='COMPOUND',
113 | *args,
114 | **kwargs):
115 | super().__init__(*args, **kwargs)
116 | self.s3_conn_id = s3_conn_id
117 | self.s3_bucket = s3_bucket
118 | self.s3_key = s3_key
119 | self.redshift_conn_id = redshift_conn_id
120 | self.redshift_schema = redshift_schema.lower()
121 | self.table = table.lower()
122 | self.copy_params = copy_params
123 | self.origin_schema = origin_schema
124 | self.schema_location = schema_location
125 | self.load_type = load_type
126 | self.primary_key = primary_key
127 | self.incremental_key = incremental_key
128 | self.foreign_key = foreign_key
129 | self.distkey = distkey
130 | self.sortkey = sortkey
131 | self.sort_type = sort_type
132 |
133 | if self.load_type.lower() not in ("append", "rebuild", "truncate", "upsert"):
134 |             raise Exception('Please choose "append", "rebuild", "truncate", or "upsert".')
135 |
136 | if self.schema_location.lower() not in ('s3', 'local'):
137 | raise Exception('Valid Schema Locations are "s3" or "local".')
138 |
139 | if not (isinstance(self.sortkey, str) or isinstance(self.sortkey, list)):
140 | raise Exception('Sort Keys must be specified as either a string or list.')
141 |
142 | if not (isinstance(self.foreign_key, dict) or isinstance(self.foreign_key, list)):
143 | raise Exception('Foreign Keys must be specified as either a dictionary or a list of dictionaries.')
144 |
145 | if self.distkey and ((',' in self.distkey) or not isinstance(self.distkey, str)):
146 | raise Exception('Only one distribution key may be specified.')
147 |
148 | if self.sort_type.lower() not in ('compound', 'interleaved'):
149 | raise Exception('Please choose "compound" or "interleaved" for sort type.')
150 |
151 | def execute(self, context):
152 | # Append a random string to the end of the staging table to ensure
153 |         # no conflicts when multiple processes are running concurrently.
154 | letters = string.ascii_lowercase
155 | random_string = ''.join(random.choice(letters) for _ in range(7))
156 | self.temp_suffix = '_tmp_{0}'.format(random_string)
157 |
158 | if self.origin_schema:
159 | schema = self.read_and_format()
160 |
161 | pg_hook = PostgresHook(self.redshift_conn_id)
162 |
163 | #self.reconcile_schemas(schema, pg_hook)
164 | self.copy_data(pg_hook, schema)
165 |
166 | def read_and_format(self):
167 | if self.schema_location.lower() == 's3':
168 | hook = S3Hook(self.s3_conn_id)
169 | # NOTE: In retrieving the schema, it is assumed
170 | # that boto3 is being used. If using boto,
171 | # `.get()['Body'].read().decode('utf-8'))`
172 | # should be changed to
173 | # `.get_contents_as_string(encoding='utf-8'))`
174 | schema = (hook.get_key(self.origin_schema,
175 | bucket_name=
176 | '{0}'.format(self.s3_bucket))
177 | .get()['Body'].read().decode('utf-8'))
178 | schema = json.loads(schema.replace("'", '"'))
179 | else:
180 | schema = self.origin_schema
181 |
182 | return schema
183 |
184 | def reconcile_schemas(self, schema, pg_hook):
185 | pg_query = \
186 | """
187 | SELECT column_name, udt_name
188 | FROM information_schema.columns
189 | WHERE table_schema = '{0}' AND table_name = '{1}';
190 | """.format(self.redshift_schema, self.table)
191 |
192 | pg_schema = dict(pg_hook.get_records(pg_query))
193 | incoming_keys = [column['name'] for column in schema]
194 | diff = list(set(incoming_keys) - set(pg_schema.keys()))
195 | print(diff)
196 | # Check length of column differential to see if any new columns exist
197 | if len(diff):
198 | for i in diff:
199 | for e in schema:
200 | if i == e['name']:
201 | alter_query = \
202 | """
203 | ALTER TABLE "{0}"."{1}"
204 | ADD COLUMN "{2}" {3}
205 | """.format(self.redshift_schema,
206 | self.table,
207 | e['name'],
208 | e['type'])
209 | pg_hook.run(alter_query)
210 | logging.info('The new columns were:' + str(diff))
211 | else:
212 | logging.info('There were no new columns.')
213 |
214 | def copy_data(self, pg_hook, schema=None):
215 | @provide_session
216 | def get_conn(conn_id, session=None):
217 | conn = (
218 | session.query(Connection)
219 | .filter(Connection.conn_id == conn_id)
220 | .first())
221 | return conn
222 |
223 | def getS3Conn():
224 | creds = ""
225 | s3_conn = get_conn(self.s3_conn_id)
226 | aws_key = s3_conn.extra_dejson.get('aws_access_key_id', None)
227 | aws_secret = s3_conn.extra_dejson.get('aws_secret_access_key', None)
228 |
229 | # support for cross account resource access
230 | aws_role_arn = s3_conn.extra_dejson.get('role_arn', None)
231 |
232 | if aws_key and aws_secret:
233 | creds = ("aws_access_key_id={0};aws_secret_access_key={1}"
234 | .format(aws_key, aws_secret))
235 | elif aws_role_arn:
236 | creds = ("aws_iam_role={0}"
237 | .format(aws_role_arn))
238 |
239 | if not creds:
240 | logging.error("AWS Credentials not found")
241 |
242 |
243 | return creds
244 |
245 | # Delete records from the destination table where the incremental_key
246 | # is greater than or equal to the incremental_key of the source table
247 | # and the primary key is the same.
248 | # (e.g. Source: {"id": 1, "updated_at": "2017-01-02 00:00:00"};
249 | # Destination: {"id": 1, "updated_at": "2017-01-01 00:00:00"})
250 |
251 | delete_sql = \
252 | '''
253 | DELETE FROM "{rs_schema}"."{rs_table}"
254 | USING "{rs_schema}"."{rs_table}{rs_suffix}"
255 | WHERE "{rs_schema}"."{rs_table}"."{rs_pk}" =
256 | "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_pk}"
257 | AND "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_ik}" >=
258 | "{rs_schema}"."{rs_table}"."{rs_ik}"
259 | '''.format(rs_schema=self.redshift_schema,
260 | rs_table=self.table,
261 | rs_pk=self.primary_key,
262 | rs_suffix=self.temp_suffix,
263 | rs_ik=self.incremental_key)
264 |
265 | # Delete records from the source table where the incremental_key
266 | # is greater than or equal to the incremental_key of the destination
267 | # table and the primary key is the same. This is done in the edge case
268 | # where data is pulled BEFORE it is altered in the source table but
269 | # AFTER a workflow containing an updated version of the record runs.
270 | # In this case, not running this will cause the older record to be
271 | # added as a duplicate to the newer record.
272 | # (e.g. Source: {"id": 1, "updated_at": "2017-01-01 00:00:00"};
273 | # Destination: {"id": 1, "updated_at": "2017-01-02 00:00:00"})
274 |
275 | delete_confirm_sql = \
276 | '''
277 | DELETE FROM "{rs_schema}"."{rs_table}{rs_suffix}"
278 | USING "{rs_schema}"."{rs_table}"
279 | WHERE "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_pk}" =
280 | "{rs_schema}"."{rs_table}"."{rs_pk}"
281 | AND "{rs_schema}"."{rs_table}"."{rs_ik}" >=
282 | "{rs_schema}"."{rs_table}{rs_suffix}"."{rs_ik}"
283 | '''.format(rs_schema=self.redshift_schema,
284 | rs_table=self.table,
285 | rs_pk=self.primary_key,
286 | rs_suffix=self.temp_suffix,
287 | rs_ik=self.incremental_key)
288 |
289 | append_sql = \
290 | '''
291 | ALTER TABLE "{0}"."{1}"
292 | APPEND FROM "{0}"."{1}{2}"
293 | FILLTARGET
294 | '''.format(self.redshift_schema, self.table, self.temp_suffix)
295 |
296 | drop_sql = \
297 | '''
298 | DROP TABLE IF EXISTS "{0}"."{1}" CASCADE
299 | '''.format(self.redshift_schema, self.table)
300 |
301 | drop_temp_sql = \
302 | '''
303 | DROP TABLE IF EXISTS "{0}"."{1}{2}" CASCADE
304 | '''.format(self.redshift_schema, self.table, self.temp_suffix)
305 |
306 | truncate_sql = \
307 | '''
308 | TRUNCATE TABLE "{0}"."{1}"
309 | '''.format(self.redshift_schema, self.table)
310 |
311 | params = '\n'.join(self.copy_params)
312 |
313 | # Example params for loading json from US-East-1 S3 region
314 | # params = ["COMPUPDATE OFF",
315 | # "STATUPDATE OFF",
316 | # "JSON 'auto'",
317 | # "TIMEFORMAT 'auto'",
318 | # "TRUNCATECOLUMNS",
319 | # "region as 'us-east-1'"]
320 |
321 | base_sql = \
322 | """
323 | FROM 's3://{0}/{1}'
324 | CREDENTIALS '{2}'
325 | {3};
326 | """.format(self.s3_bucket,
327 | self.s3_key,
328 | getS3Conn(),
329 | params)
330 |
331 | load_sql = '''COPY "{0}"."{1}" {2}'''.format(self.redshift_schema,
332 | self.table,
333 | base_sql)
334 |
335 | if self.load_type == 'append':
336 | self.create_if_not_exists(schema, pg_hook)
337 | pg_hook.run(load_sql)
338 | elif self.load_type == 'rebuild':
339 | pg_hook.run(drop_sql)
340 | self.create_if_not_exists(schema, pg_hook)
341 | pg_hook.run(load_sql)
342 | elif self.load_type == 'truncate':
343 | self.create_if_not_exists(schema, pg_hook)
344 | pg_hook.run(truncate_sql)
345 | pg_hook.run(load_sql)
346 | elif self.load_type == 'upsert':
347 | self.create_if_not_exists(schema, pg_hook, temp=True)
348 | load_temp_sql = \
349 | '''COPY "{0}"."{1}{2}" {3}'''.format(self.redshift_schema,
350 | self.table,
351 | self.temp_suffix,
352 | base_sql)
353 | pg_hook.run(load_temp_sql)
354 | pg_hook.run(delete_sql)
355 | pg_hook.run(delete_confirm_sql)
356 | pg_hook.run(append_sql, autocommit=True)
357 | pg_hook.run(drop_temp_sql)
358 |
359 | def create_if_not_exists(self, schema, pg_hook, temp=False):
360 | output = ''
361 | for item in schema:
362 | k = "{quote}{key}{quote}".format(quote='"', key=item['name'])
363 | field = ' '.join([k, item['type']])
364 | if isinstance(self.sortkey, str) and self.sortkey == item['name']:
365 | field += ' sortkey'
366 | output += field
367 | output += ', '
368 | # Remove last comma and space after schema items loop ends
369 | output = output[:-2]
370 | if temp:
371 | copy_table = '{0}{1}'.format(self.table, self.temp_suffix)
372 | else:
373 | copy_table = self.table
374 | create_schema_query = \
375 | '''
376 | CREATE SCHEMA IF NOT EXISTS "{0}";
377 | '''.format(self.redshift_schema)
378 |
379 | pk = ''
380 | fk = ''
381 | dk = ''
382 | sk = ''
383 |
384 | if self.primary_key:
385 | pk = ', '
386 | if isinstance(self.primary_key, list):
387 | pk += 'primary key({0})'.format(', '.join(self.primary_key))
388 | else:
389 | pk += 'primary key("{0}")'.format(self.primary_key)
390 |
391 | if self.foreign_key:
392 | if isinstance(self.foreign_key, list):
393 | fk = ', '
394 | for i, e in enumerate(self.foreign_key):
395 | fk += 'foreign key("{0}") references {1}("{2}")'.format(e['column_name'],
396 | e['reftable'],
397 | e['ref_column'])
398 | if i != (len(self.foreign_key) - 1):
399 | fk += ', '
400 | elif isinstance(self.foreign_key, dict):
401 | fk += ', '
402 | fk += 'foreign key("{0}") references {1}("{2}")'.format(self.foreign_key['column_name'],
403 | self.foreign_key['reftable'],
404 | self.foreign_key['ref_column'])
405 | if self.distkey:
406 | dk = 'distkey({})'.format(self.distkey)
407 |
408 | if self.sortkey:
409 | if isinstance(self.sortkey, list):
410 | sk += '{0} sortkey({1})'.format(self.sort_type, ', '.join(["{}".format(e) for e in self.sortkey]))
411 |
412 | create_table_query = \
413 | '''
414 | CREATE TABLE IF NOT EXISTS "{schema}"."{table}"
415 | ({fields}{primary_key}{foreign_key}) {distkey} {sortkey}
416 | '''.format(schema=self.redshift_schema,
417 | table=copy_table,
418 | fields=output,
419 | primary_key=pk,
420 | foreign_key=fk,
421 | distkey=dk,
422 | sortkey=sk)
423 |
424 | #pg_hook.run([create_schema_query, create_table_query])
425 | pg_hook.run(create_table_query)
426 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # yelp-3nf
2 |
3 | The developed data pipeline translates the non-relational Yelp dataset, distributed as JSON files in an Amazon S3 bucket, into a 3NF-normalized dataset stored on Amazon Redshift. The resulting schema ensures data consistency and referential integrity across tables and is meant to be the source of truth for analytical queries and BI tools. Additionally, the data was enriched with demographics and weather data coming from third-party data sources.
4 |
5 | The entire process was done using Apache Spark, Amazon Redshift and Apache Airflow.
6 |
7 | ## Datasets
8 |
9 |
10 |
11 | The [Yelp Open Dataset](https://www.yelp.com/dataset) is a perfect candidate for this project, since:
12 |
13 | - (1) it is a NoSQL data source;
14 | - (2) it comprises 6 files with more than 10 million rows in total;
15 | - (3) this dataset provides lots of diverse information and allows for many analysis approaches, from traditional analytical queries (such as "*Give me the average star rating for each city*") to Graph Mining, Photo Classification, Natural Language Processing, and Sentiment Analysis;
16 | - (4) it was produced in a real production setting (as opposed to being synthetically generated).
17 |
18 | To make the contribution unique, the Yelp dataset was enriched by demographics and weather data. This allows the end user to make queries such as "*Does the number of ratings depend upon the city's population density?*" or "*Which restaurants are particularly popular during hot weather?*".
19 |
20 | ### Yelp Open Dataset
21 |
22 | The [Yelp Open Dataset](https://www.yelp.com/dataset) is a subset of Yelp's businesses, reviews, and user data, available for academic use. The dataset (as of 13.08.2019) takes 9 GB of disk space (unzipped) and contains 6,685,900 reviews, 192,609 businesses across 10 metropolitan areas, over 1.2 million business attributes like hours, parking, availability, and ambience, 1,223,094 tips by 1,637,138 users, and aggregated check-ins over time. Each file is composed of a single object type, one JSON object per line. For more details on the dataset structure, see the [Yelp Dataset JSON Documentation](https://www.yelp.com/dataset/documentation/main).
23 |
24 | ### U.S. City Demographic Data
25 |
26 | The [U.S. City Demographic Data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau's 2015 American Community Survey. Each JSON object describes the demographics of a particular city and race, so a record can be uniquely identified by the city, state and race fields. More information can be found [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/) under the section "Dataset schema".
27 |
28 | ### Historical Hourly Weather Data 2012-2017
29 |
30 | The [Historical Hourly Weather Data](https://www.kaggle.com/selfishgene/historical-hourly-weather-data) dataset was collected by a Kaggle user. It contains 5 years of hourly measurements of various weather attributes, such as temperature, humidity, and air pressure, for 27 larger US cities, 3 cities in Canada, and 6 cities in Israel. Each attribute has its own file, organized such that the rows are the time axis (timestamps) and the columns are the different cities. Additionally, there is a separate file that identifies which city belongs to which country.
31 |
32 | ## Data model and dictionary
33 |
34 | Our target data model is a 3NF-normalized relational model, which was designed to be neutral to different kinds of analytical queries. The data should depend on the key [1NF], the whole key [2NF] and nothing but the key [3NF] (so help me Codd). Forms beyond 4NF are mainly of academic interest. The following image depicts the logical model of the database:
35 |
36 | 
37 |
38 | Note: fields such as *compliment_** are just placeholders for multiple fields with the same prefix (*compliment*). This is done to visually reduce the length of the tables.
39 |
40 | The model consists of 15 tables as a result of normalizing and joining 6 tables provided by Yelp, 1 table with demographic information and 2 tables with weather information. The schema is closer to a Snowflake schema as there are two fact tables - *reviews* and *tips* - and many dimensional tables with multiple levels of hierarchy and many-to-many relationships. Some tables keep their native keys, while for others monotonically increasing ids were generated. Rule of thumb: Use generated keys for entities and composite keys for relationships. Moreover, timestamps and dates were converted into Spark's native data types to be able to import them into Amazon Redshift in a correct format.
41 |
42 | For more details on tables and fields, visit [Yelp Dataset JSON](https://www.yelp.com/dataset/documentation/main) or look at [Redshift table definitions](https://github.com/polakowo/yelp-3nf/blob/master/airflow/dags/configs/table_definitions.yml).
43 |
44 | To dive deeper into the data processing pipeline with Spark, here is [the provided Jupyter Notebook](https://nbviewer.jupyter.org/github/polakowo/yelp-3nf/blob/master/spark-jobs-playground.ipynb).
45 |
46 | #### *businesses*
47 |
48 | The most referenced table in the model. Contains the name of the business, the current star rating, the number of reviews, and whether the business is currently open. The address (one-to-one relationship), hours (one-to-one relationship), business attributes (one-to-one relationship) and categories (one-to-many relationship) were outsourced to separate tables as part of the normalization process.
49 |
50 | #### *business_attributes*
51 |
52 | This was the most challenging part of the remodeling, since Yelp keeps business attributes as a nested dict. All fields in the source table were strings, so they had to be parsed into their respective data types. Some values were dirty; for example, boolean fields could be `"True"`, `"False"`, `"None"` or `None`, while some string fields came in a double-unicode format such as `u"u'string'"`. Moreover, some fields were dicts formatted as strings. The resulting nested JSON structure of three levels had to be flattened.
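To give a flavor of the cleansing involved, here is a minimal, hypothetical PySpark sketch (the real logic lives in the ETL notebook and `business_attributes.py`; `business_df` is assumed to be the DataFrame read from the businesses JSON file, and the two attribute columns are just examples):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType, StringType

def parse_boolean(value):
    # "True"/"False"/"None"/None -> True/False/None
    if value is None or value == "None":
        return None
    return value == "True"

def clean_string(value):
    # Double-unicode strings like u"u'average'" -> "average"
    if value is None or value == "None":
        return None
    if value.startswith("u'") and value.endswith("'"):
        return value[2:-1]
    return value.strip("'")

parse_boolean_udf = F.udf(parse_boolean, BooleanType())
clean_string_udf = F.udf(clean_string, StringType())

# Flatten the `attributes` struct and normalize two example columns
business_attributes = (business_df
    .select("business_id", "attributes.*")
    .withColumn("WheelchairAccessible", parse_boolean_udf("WheelchairAccessible"))
    .withColumn("NoiseLevel", clean_string_udf("NoiseLevel")))
```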
53 |
54 | #### *business_categories* and *categories*
55 |
56 | In the original table, business categories were stored as an array. The best solution was to outsource them into a separate table. One way of doing this is to assign a column to each category, but what if we add a new category later on? Then we would have to update the whole table to reflect the change - a clear violation of 3NF, where columns must not have transitive functional dependencies. Thus, two tables were created: *categories*, which contains categories keyed by their ids, and *business_categories*, which contains tuples of business ids and category ids.
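A hedged PySpark sketch of this split (assuming `business_df` is the businesses DataFrame and `categories` is already an array column; in the raw dump it may be a comma-separated string that first needs `F.split`):

```python
from pyspark.sql import functions as F

# Explode the array: one row per (business, category) pair
exploded = (business_df
            .select("business_id", F.explode("categories").alias("category"))
            .withColumn("category", F.trim(F.col("category"))))

# Dimension table: one generated id per distinct category
categories = (exploded
              .select("category")
              .distinct()
              .withColumn("category_id", F.monotonically_increasing_id()))

# Bridge table: (business_id, category_id) tuples
business_categories = (exploded
                       .join(categories, on="category")
                       .select("business_id", "category_id"))
```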
57 |
58 | #### *business_hours*
59 |
60 | Business hours were stored as a dict where each key is a day of the week and each value is a string of the format `"hour:minute-hour:minute"`. The best way to make the data representation neutral to queries is to split the "from hour" and "to hour" parts into separate columns and to combine "hour" and "minute" into a single integer field, for example turning `"10:00-21:00"` into `1000` and `2100` respectively. This way we can easily formulate the following query:
61 |
62 | ```sql
63 | -- Find businesses opened on Sunday at 8pm
64 | SELECT business_id FROM business_hours WHERE Sunday_from <= 2000 AND Sunday_to > 2000;
65 | ```
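A hypothetical PySpark helper for this conversion might look as follows (only Sunday is shown; the real script handles all seven day-of-week columns, and `business_df` is assumed to be the businesses DataFrame):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def hours_to_int(hours, part):
    # "10:00-21:00" -> 1000 (part="from") or 2100 (part="to")
    if hours is None:
        return None
    start, end = hours.split("-")
    h, m = (start if part == "from" else end).split(":")
    return int(h) * 100 + int(m)

from_udf = F.udf(lambda h: hours_to_int(h, "from"), IntegerType())
to_udf = F.udf(lambda h: hours_to_int(h, "to"), IntegerType())

business_hours = (business_df
                  .select("business_id", "hours.*")
                  .withColumn("Sunday_from", from_udf("Sunday"))
                  .withColumn("Sunday_to", to_udf("Sunday")))
```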
66 |
67 | #### *addresses*
68 |
69 | A common practice is to separate business data from address data and connect them through a synthetic key. The resulting link is a one-to-one relationship. Furthermore, addresses were separated from cities, since states and demographic data depend on cities only (otherwise a 3NF violation).
70 |
71 | #### *cities*
72 |
73 | This table contains city name and state code coming from the Yelp dataset and fields on demographics. For most of the cities there is no demographic information since they are too small (< 65k). Each record in the table can be uniquely identified by the city and postal code, but a single primary key is more convenient to connect both addresses and cities.
74 |
75 | #### *city_weather*
76 |
77 | The table *city_weather* was composed from the CSV files `temperature.csv` and `weather_description.csv`. Both files contain information on various (global) cities. To filter the cities by country (= US), one first has to read `city_attributes.csv`. The issue with this dataset is that it doesn't provide the respective state codes, so how do we know whether Phoenix is in AZ or TX? The most pragmatic solution is to assume each name refers to the biggest city carrying it: look up the respective state codes manually (e.g. with Google) and match them with the cities available in the Yelp dataset. As a result, 8 cities could be enriched. Also, both the temperature and weather description data were recorded hourly, which is too fine-grained. Thus, they were grouped by day using an aggregation statistic: temperatures (of data type `float`) are averaged, while for the weather description (of data type `string`) the most frequent value is chosen.
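A rough sketch of that daily aggregation, assuming `temperature_df` and `description_df` have already been melted from their wide CSV layout (one column per city) into long `(city, timestamp, value)` form:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Average temperature per city and day
daily_temp = (temperature_df
              .withColumn("date", F.to_date("timestamp"))
              .groupBy("city", "date")
              .agg(F.avg("temperature").alias("avg_temperature")))

# Most frequent weather description per city and day
desc_counts = (description_df
               .withColumn("date", F.to_date("timestamp"))
               .groupBy("city", "date", "description")
               .count())
w = Window.partitionBy("city", "date").orderBy(F.desc("count"))
daily_desc = (desc_counts
              .withColumn("rank", F.row_number().over(w))
              .filter("rank = 1")
              .select("city", "date", "description"))

city_weather = daily_temp.join(daily_desc, on=["city", "date"])
```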
78 |
79 | #### *checkins*
80 |
81 | This table contains checkins on a business and required no further transformations.
82 |
83 | #### *reviews*
84 |
85 | The *reviews* table contains the full review text along with the user id that wrote the review and the business id the review is written for. It is the most central table in our data schema and is structured similarly to a fact table. To turn it into a proper fact table, however, the date column would have to be outsourced into a separate dimension table and the text omitted.
86 |
87 | #### *users*, *elite_years* and *friends*
88 |
89 | Originally, the user data includes the user's friend mapping and all the metadata associated with the user. But since the fields `friends` and `elite` are arrays, they have become separate relations in our model, both structured similarly to the *business_categories* table and having composite primary keys. The format of the table *friends* is a very convenient one, as it can be directly fed into Apache Spark's GraphX API to build a social graph of Yelp users.
90 |
91 | #### *tips*
92 |
93 | Tips were written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions. The table required no transformations apart from assigning it a generated key.
94 |
95 | #### *photos*
96 |
97 | Contains photo data including the caption and classification.
98 |
99 | ## Data pipeline
100 |
101 | The designed pipeline dynamically loads the JSON files from S3, processes them, and stores their normalized and enriched versions back into S3 in Parquet format. After this, Redshift takes over and copies the tables into a DWH.
102 |
103 | #### Load from S3
104 |
105 |
106 |
107 | All three datasets reside in an Amazon S3 bucket, which is the easiest and safest option to store and retrieve any amount of data at any time from any other AWS service.
108 |
109 | #### Process with Spark
110 |
111 |
112 |
113 | Since the data is in JSON format and contains arrays and nested fields, it first needs to be transformed into a relational form. By design, Amazon Redshift does not support loading nested data (only Redshift Spectrum enables you to query complex data types such as struct, array, or map, without having to transform or load your data). To do this in a quick and scalable fashion, Apache Spark is utilized. In particular, you can execute the entire data processing pipeline in [the provided ETL notebook](https://nbviewer.jupyter.org/github/polakowo/yelp-3nf/blob/master/spark-jobs-playground.ipynb) on an Amazon EMR (Elastic MapReduce) cluster, which uses Apache Spark and Hadoop to quickly and cost-effectively process and analyze vast amounts of data. Another advantage of Spark is the ability to control data quality, so most of our data quality checks are done at this stage.
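For reference, reading one of the line-delimited JSON files with Spark is a one-liner (bucket and file names below are placeholders):

```python
from pyspark.sql import SparkSession

# Each Yelp file is line-delimited JSON, which Spark reads directly from S3
spark = SparkSession.builder.appName("yelp-3nf").getOrCreate()
business_df = spark.read.json("s3://your-bucket/yelp_dataset/business.json")
business_df.printSchema()  # reveals the nested `attributes` and `hours` structs
```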
114 |
115 | #### Unload to S3
116 |
117 |
118 |
119 | Parquet stores nested data structures in a flat columnar format. Compared to a traditional approach where data is stored in a row-oriented format, Parquet is more efficient in terms of storage and performance. Parquet files are also well supported in the AWS ecosystem. Moreover, compared to the JSON and CSV formats, we can store timestamp objects, datetime objects and long texts without any post-processing and load them into Amazon Redshift as-is. From here, we could use an AWS Glue crawler to discover and register the schema for our datasets to be used in Amazon Athena. But our goal is materializing the data rather than querying it directly from files on Amazon S3 - to be able to retrieve the data without prolonged load times.
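Unloading a transformed table, e.g. the `business_hours` DataFrame from the sketch above, back into S3 is equally short (the output prefix is a placeholder):

```python
# Write the transformed table to S3 as Parquet; timestamps and long texts
# survive the round trip without post-processing
(business_hours
 .write
 .mode("overwrite")
 .parquet("s3://your-bucket/output/business_hours"))
```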
120 |
121 | #### Load into Redshift
122 |
123 |
124 |
125 | To load the data from Parquet files into our Redshift DWH, we can rely on multiple options. The easiest one is using [spark-redshift](https://github.com/databricks/spark-redshift): Spark reads the Parquet files from S3 into the Spark cluster, converts the data to Avro format, writes it to S3, and finally issues a COPY SQL query to Redshift to load the data. Or we could have [an AWS Glue job that loads data into Amazon Redshift](https://www.dbbest.com/blog/aws-glue-etl-service/). But instead, we define the tables manually: that way we control not only the quality and consistency of the data, but also sortkeys, distkeys and compression. This solution issues SQL statements to Redshift to first CREATE the tables and then COPY the data. To make the table definition process easier and more transparent, we can utilize AWS Glue's data catalog to derive the correct data types (for example, should we use int or bigint?).
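As an illustration of this manual approach (not the project's actual operator code), creating one table with explicit keys and COPYing its Parquet files could look like the following sketch; the column definitions, schema name, S3 prefix and IAM role ARN are placeholders:

```python
from airflow.hooks.postgres_hook import PostgresHook

create_sql = """
CREATE TABLE IF NOT EXISTS public.categories (
    category_id BIGINT PRIMARY KEY,
    category    VARCHAR(256)
)
DISTSTYLE ALL
SORTKEY (category_id);
"""

copy_sql = """
COPY public.categories
FROM 's3://your-bucket/output/categories'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS PARQUET;
"""

# The `redshift` connection is configured in the Airflow UI (see Installation)
pg_hook = PostgresHook(postgres_conn_id="redshift")
pg_hook.run(create_sql)
pg_hook.run(copy_sql)
```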
126 |
127 | #### Check data quality
128 |
129 | Most data checks are done when transforming the data with Spark. Furthermore, consistency and referential integrity checks happen automatically when importing the data into Redshift (since the data must adhere to the table definitions). To ensure that the output tables are of the right size, we also run some checks at the end of the data pipeline.
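For example, a final row-count check boils down to something like this (the real checks are configured in `check_definitions.yml` and executed by the `data_quality_checks` subDAG):

```python
from airflow.hooks.postgres_hook import PostgresHook

# Illustration only: fail the pipeline if a target table ended up empty
pg_hook = PostgresHook(postgres_conn_id="redshift")
count = pg_hook.get_first("SELECT COUNT(*) FROM public.reviews;")[0]
if count == 0:
    raise ValueError("Data quality check failed: public.reviews is empty")
```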
130 |
131 | ## Airflow DAGs
132 |
133 |
134 |
135 | The data processing pipeline is executed using Apache Airflow, a tool for orchestrating complex computational workflows and data processing pipelines. The advantage of Airflow over plain Python ETL scripts is that the community provides many ready-made operators, so useful pipelines can be built quickly and in a modular fashion. Also, the Airflow scheduler is designed to run as a persistent service in an Airflow production environment and is easier to manage than cron jobs.
136 |
137 | The whole data pipeline is divided into three subDAGs: one that processes the data with Spark (`spark_jobs`), one that loads the data into Redshift (`copy_to_redshift`), and one that checks the data for errors (`data_quality_checks`). A rough sketch of how they could be wired together in `main.py` is shown below.
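A minimal sketch, assuming hypothetical subDAG factory functions in `dags/subdags` (their names and signatures are assumptions, not the actual code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.subdag_operator import SubDagOperator

# Hypothetical factory imports; the actual functions live in dags/subdags
from subdags.spark_jobs import spark_jobs_subdag
from subdags.copy_to_redshift import copy_to_redshift_subdag
from subdags.data_quality_checks import data_quality_checks_subdag

default_args = {"owner": "airflow", "start_date": datetime(2019, 8, 1)}

with DAG("main", default_args=default_args, schedule_interval=None) as dag:
    spark_jobs = SubDagOperator(
        task_id="spark_jobs",
        subdag=spark_jobs_subdag("main", "spark_jobs", default_args))
    copy_to_redshift = SubDagOperator(
        task_id="copy_to_redshift",
        subdag=copy_to_redshift_subdag("main", "copy_to_redshift", default_args))
    data_quality_checks = SubDagOperator(
        task_id="data_quality_checks",
        subdag=data_quality_checks_subdag("main", "data_quality_checks", default_args))

    # Process -> load -> check
    spark_jobs >> copy_to_redshift >> data_quality_checks
```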
138 |
139 |
140 |
141 | ### spark_jobs
142 |
143 | This subDAG comprises a set of tasks, each sending a Spark script to an Amazon EMR cluster. For this, the [LivySparkOperator](https://github.com/rssanders3/airflow-spark-operator-plugin) is used. This operator interacts with the Livy server on the EMR master node, which lets us send simple Scala or Python code over REST API calls instead of having to manage and deploy large JAR files. This scales the data pipeline easily, with multiple Spark jobs running in parallel rather than serially via the EMR Step API. Each Spark script takes care of loading one or more source JSON files, transforming them into one or more (3NF-normalized) tables, and unloading them back into S3 in Parquet format. The subDAG was partitioned logically by target tables, so that each script does a small amount of work, which simplifies debugging. Note: to increase performance, one might instead divide the tasks by source tables and cache them.
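Roughly speaking, this is what happens under the hood of the operator; the snippet below talks to the Livy REST API directly, with a placeholder host name and simplified session polling:

```python
import json
import time

import requests

livy_url = "http://<emr-master-public-dns>:8998"  # placeholder host
headers = {"Content-Type": "application/json"}

# Open a PySpark session on the EMR master node
session = requests.post(livy_url + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()

time.sleep(60)  # in practice, poll GET /sessions/{id} until state == "idle"

# Submit one of the Spark scripts as a statement
with open("airflow/dags/scripts/businesses.py") as f:
    code = f.read()

statement = requests.post(
    "{0}/sessions/{1}/statements".format(livy_url, session["id"]),
    data=json.dumps({"code": code}),
    headers=headers).json()
```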
144 |
145 |
146 |
147 | ### copy_to_redshift
148 |
149 | Airflow takes control of loading the Parquet files into Redshift in the right order and with consistency checks in place. The loading operation is done with the [S3ToRedshiftOperator](https://github.com/airflow-plugins/redshift_plugin), provided by the Airflow community. This operator takes the table definition as a dictionary, creates the Redshift table from it and performs the COPY operation. All table definitions are stored in a YAML configuration file. The order of and relationships between operators were derived from the references between tables; for example, because the *reviews* table references *businesses*, *businesses* has to be loaded first, otherwise referential integrity is violated (and you may get errors). Thus, data integrity and referential constraints are automatically enforced while populating the Redshift database.
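A hypothetical sketch of two such copy tasks and their ordering is shown below; the YAML structure, the operator's import path and its exact argument names are assumptions inferred from `s3_to_redshift_operator.py`:

```python
import yaml

# Import path depends on how the plugin registers the operator
from airflow.operators.redshift_plugin import S3ToRedshiftOperator

with open("airflow/dags/configs/table_definitions.yml") as f:
    table_definitions = yaml.safe_load(f)

copy_businesses = S3ToRedshiftOperator(
    task_id="copy_businesses",
    redshift_conn_id="redshift",
    s3_conn_id="aws_credentials",
    s3_bucket="your-bucket",
    s3_key="output/businesses",
    redshift_schema="public",
    table="businesses",
    origin_schema=table_definitions["businesses"])

copy_reviews = S3ToRedshiftOperator(
    task_id="copy_reviews",
    redshift_conn_id="redshift",
    s3_conn_id="aws_credentials",
    s3_bucket="your-bucket",
    s3_key="output/reviews",
    redshift_schema="public",
    table="reviews",
    origin_schema=table_definitions["reviews"])

# reviews references businesses, so businesses must be loaded first
copy_businesses >> copy_reviews
```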
150 |
151 |
152 |
153 | ### data_quality_checks
154 |
155 | The data quality checks are executed with a custom [RedshiftCheckOperator](https://github.com/polakowo/yelp-3nf/blob/master/airflow/plugins/redshift_plugin/operators/redshift_check_operator.py), which extends Airflow's default [CheckOperator](https://github.com/apache/airflow/blob/master/airflow/operators/check_operator.py). It takes a SQL statement, the expected pass value, and optionally the tolerance of the result, and performs a simple value check.
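For illustration, a check might be declared as follows (argument names and the import path are assumptions; the actual check definitions live in `check_definitions.yml`):

```python
# Import path depends on how the plugin registers the operator
from airflow.operators.redshift_plugin import RedshiftCheckOperator

# Hypothetical usage: 192,609 is the documented number of businesses in the
# dataset, checked here with a 5% tolerance
check_businesses = RedshiftCheckOperator(
    task_id="check_businesses_count",
    redshift_conn_id="redshift",
    sql="SELECT COUNT(*) FROM public.businesses;",
    pass_value=192609,
    tolerance=0.05)
```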
156 |
157 | ## Data updates
158 |
159 | The whole ETL process for 7 million reviews and related data lasts about 20 minutes. As our target data model is meant to be the source for other dimensional tables, the full ETL process could take longer over time. Since the Yelp Open Dataset is only a subset of the real dataset and we don't know how many rows Yelp generates each day, we cannot derive the optimal update frequency. But processing only newly appended rows (for example, those collected during one day) would allow for significantly more frequent updates.
160 |
161 | ## Scenarios
162 |
163 | The following scenarios need to be addressed:
164 | - **The data was increased by 100x:** That wouldn't be a technical issue as both Amazon EMR and Redshift clusters can handle huge amounts of data. Eventually, they would have to be scaled out.
165 | - **The data populates a dashboard that must be updated on a daily basis by 7am every day:** That's perfectly plausible and could be done by running the ETL script some time prior to 7am.
166 | - **The database needed to be accessed by 100+ people:** That wouldn't be a problem as Redshift is highly scalable and available.
167 |
168 | ## Installation
169 |
170 | ### Data preparation (Amazon S3)
171 |
172 | - Create an S3 bucket.
173 | - Ensure that the bucket is in the same region as your Amazon EMR and Redshift clusters.
174 | - Be careful with read permissions - you may end up paying substantial data transfer fees.
175 | - Option 1:
176 | - Download [Yelp Open Dataset](https://www.yelp.com/dataset) and directly upload to your S3 bucket (`yelp_dataset` folder).
177 | - Option 2 (for slow internet connections):
178 | - Launch an EC2 instance with at least 20 GB of SSD storage.
179 | - Connect to this instance via SSH (click "Connect" and proceed according to the AWS instructions).
180 | - Proceed to the dataset homepage, fill in your information, copy the download link, and paste into the command below. Note: the link is valid for 30 seconds.
181 | ```bash
182 | wget -O yelp_dataset.tar.gz "[your_download_link]"
183 | tar -xvzf yelp_dataset.tar.gz
184 | ```
185 | - Finally, transfer the files as described by [this blog](http://codeomitted.com/transfer-files-from-ec2-to-s3/)
186 | - Remember to provide an IAM role and the credentials of a user who has AmazonS3FullAccess.
187 | - In case your instance has no AWS CLI installed, follow [this documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html)
188 | - In case you run into errors such as "Unable to locate package python3-pip", follow [this answer](https://askubuntu.com/questions/1061486/unable-to-locate-package-python-pip-when-trying-to-install-from-fresh-18-04-in#answer-1061488)
189 | - Download the JSON file from [U.S. City Demographic Data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)
190 | - Upload it to a separate folder (`demo_dataset`) on your S3 bucket.
191 | - Download the whole dataset from [Historical Hourly Weather Data 2012-2017](https://www.kaggle.com/selfishgene/historical-hourly-weather-data/downloads/historical-hourly-weather-data.zip)
192 | - Unzip and upload `city_attributes.csv`, `temperature.csv`, and `weather_description.csv` files to a separate folder (`weather_dataset`) on your S3 bucket.
193 |
194 | ### Amazon EMR
195 |
196 | - Configure and create your EMR cluster.
197 | - Go to advanced options, and enable Apache Spark, Livy and AWS Glue Data Catalog for Spark.
198 | - Enter the following configuration JSON to make Python 3 default:
199 | ```json
200 | [
201 | {
202 | "Classification": "spark-env",
203 | "Configurations": [
204 | {
205 | "Classification": "export",
206 | "Properties": {
207 | "PYSPARK_PYTHON": "/usr/bin/python3"
208 | }
209 | }
210 | ]
211 | }
212 | ]
213 | ```
214 | - Go to EC2 Security Groups, select the security group of your master node and allow inbound connections on port 8998 (Livy).
215 |
216 | ### Amazon Redshift
217 |
218 | - Store your credentials and cluster creation parameters in `dwh.cfg`
219 | - Run `create_redshift_cluster.ipynb` to create a Redshift cluster.
220 | - Note: Delete your Redshift cluster with `delete_redshift_cluster.ipynb` when you're finished working.
221 |
222 | ### Apache Airflow
223 |
224 | - Use the [Quick Start](https://airflow.apache.org/start.html) guide to get a local Airflow instance up and running.
225 | - Copy the `dags` and `plugins` folders to your Airflow work environment (under the path set by `AIRFLOW_HOME`).
226 | - Create a new HTTP connection `livy_http_conn` by providing host and port of the Livy server.
227 |
228 |
229 |
230 | - Create a new AWS connection `aws_credentials` by providing user credentials and ARN role (from `dwh.cfg`)
231 |
232 |
233 |
234 | - Create a new Redshift connection `redshift` by providing database connection parameters (from `dwh.cfg`)
235 |
236 |
237 |
238 | - In the Airflow UI, turn on and manually run the DAG "main".
239 |
240 | ## Further resources
241 |
242 | - [Yelp's Academic Dataset Examples](https://github.com/Yelp/dataset-examples)
243 | - [Spark Tips & Tricks](https://gist.github.com/dusenberrymw/30cebf98263fae206ea0ffd2cb155813)
244 | - [Use Pyspark with a Jupyter Notebook in an AWS EMR cluster](https://towardsdatascience.com/use-pyspark-with-a-jupyter-notebook-in-an-aws-emr-cluster-e5abc4cc9bdd)
245 | - [Real-world Python workloads on Spark: EMR clusters](https://becominghuman.ai/real-world-python-workloads-on-spark-emr-clusters-3c6bda1a1350)
246 |
--------------------------------------------------------------------------------