├── README.md
├── config.yml
└── process_sql_statements.py

/README.md:
--------------------------------------------------------------------------------
**spark-sql-etl-framework**
==============
Multi-Stage SQL-Based ETL Processing Framework Written in PySpark

process_sql_statements.py is a PySpark application which reads its configuration from a YAML document (see config.yml in this project). The configuration specifies a set of input sources -
table objects available from the catalog of the current SparkSession (for instance an AWS Glue Catalog) - in the `sources` section. The `transforms` section is a list of
transformations written as SQL statements using temporary views in Spark SQL; this is akin to using CTEs (common table expressions) or volatile tables when performing typical multi-stage,
complex ETL routines on traditional relational database systems. The `targets` section defines the location to which the final output object is written.

The sample configuration uses the framework (`process_sql_statements.py`) to process a multi-stage SQL ETL routine using data from the
[AWS Sample Tickit Database](https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html), which has been stored as S3 objects and catalogued using Hive/AWS Glue.
Modify the `config.yml` file to specify your targets, projections, filters and transformations, then run as follows:

    spark-submit process_sql_statements.py config.yml

Dependencies
--------------
- Spark 2.x
--------------------------------------------------------------------------------
/config.yml:
--------------------------------------------------------------------------------
---
# the name is used as the Application Name in Spark/YARN
name: multi_stage_sql_test
jobs:
  - name: get_data
    sources:
      # a reference to each source element is loaded in sequence into a temporary view
      - table: tickit.users
        view: vw_users
        # you can limit the columns to be projected in the source view here
        columns:
          - userid
          - username
          - firstname
          - lastname
          - city
        # you can filter rows to be included in the source view here using WHERE conditions
        filters:
          - "city = 'San Diego'"
      - table: tickit.date
        view: vw_dates
        columns:
          - dateid
          - year
        filters:
          - "year = 2008"
      # alternatively you can project all columns and include all rows
      - table: tickit.sales
        view: vw_sales
    transforms:
      # transforms contain a list of SQL statements to be processed in sequence
      - sql: >-
          CREATE TEMPORARY VIEW vw_sales_dates AS
          SELECT s.sellerid, s.qtysold FROM vw_sales s
          INNER JOIN vw_dates d
          ON s.dateid = d.dateid
      # each SQL transform creates a temporary view by referencing views created by previous statements
      - sql: >-
          CREATE TEMPORARY VIEW vw_sales_users AS
          SELECT s.sellerid, u.username,
          (u.firstname ||' '|| u.lastname) AS name,
          u.city, s.qtysold
          FROM vw_sales_dates s
          INNER JOIN vw_users u
          ON s.sellerid = u.userid
      # the final transform element should create the final object which you intend to publish
      - sql: >-
          CREATE TEMPORARY VIEW vw_final AS
          SELECT sellerid, username, name, city,
          SUM(qtysold) AS total_sales
          FROM vw_sales_users
          GROUP BY sellerid, username, name, city
    targets:
      # the final view is used to create a DataFrame which is then written out to the target_location
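      # final_object names the temporary view to materialize; target_location is the path
      # to which process_sql_statements.py writes it as Parquet with mode="overwrite"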
      final_object: vw_final
      target_location: s3:///sales_by_users
  # add a second processing step, for example to de-dupe the results from the first query
  - name: remove_dups
    sources:
      - object: "s3:///sales_by_users"
        view: vw_sales
    transforms:
      - sql: >-
          CREATE OR REPLACE TEMPORARY VIEW vw_final AS
          SELECT DISTINCT * FROM vw_sales
    targets:
      final_object: vw_final
      target_location: "s3:///sales_by_users_deduped"
--------------------------------------------------------------------------------
/process_sql_statements.py:
--------------------------------------------------------------------------------
#
#
# process_sql_statements.py
#
# Process config-driven, multi-stage SQL-based ETL using Spark SQL
#
# Example usage:
# spark-submit process_sql_statements.py config.yml

import yaml, sys, datetime
from pyspark.sql import SparkSession

config_file = sys.argv[1]

# safe_load avoids executing arbitrary YAML tags and works with current PyYAML releases
with open(config_file, 'r') as stream:
    config = yaml.safe_load(stream)

print("Initializing SparkSession (%s)..." % (config["name"]))

spark = SparkSession \
    .builder \
    .appName(config["name"]) \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

overall_start = datetime.datetime.now()

for job in config["jobs"]:
    # Load Sources
    print("Creating source views...")
    for source in job["sources"]:
        if source.get("table") is not None:
            # catalog table source
            print("Creating view %s from table %s..." % (source["view"], source["table"]))
            df = spark.table(source["table"])
        else:
            # object source, read directly as Parquet
            print("Creating view %s from object %s..." % (source["view"], source["object"]))
            df = spark.read.parquet(source["object"])
        if source.get("columns") is not None:
            # columns listed, project the given columns only
            df = df.select(source["columns"])
        if source.get("filters") is not None:
            for condition in source["filters"]:
                df = df.filter(condition)
        df.createOrReplaceTempView(source["view"])

    # Perform Transforms
    print("Performing SQL Transformations...")
    for transform in job["transforms"]:
        spark.sql(transform["sql"])

    # Write out final object
    print("Writing out final object to %s..." % (job["targets"]["target_location"]))
    start = datetime.datetime.now()
    final_df = spark.table(job["targets"]["final_object"])
    final_df.write.parquet(job["targets"]["target_location"], mode="overwrite")
    finish = datetime.datetime.now()
    print("Finished writing out target object...")

    ### Remove this before productionizing
    # print("Sample output:")
    # spark.read.parquet(job["targets"]["target_location"]).show()
    ###

    print("Total number of output rows: %s (%s)" % (str(spark.read.parquet(job["targets"]["target_location"]).count()), str(finish - start)))

overall_finish = datetime.datetime.now()

print("Total time taken: %s" % (str(overall_finish - overall_start)))
spark.sparkContext.stop()
--------------------------------------------------------------------------------
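
Because `object` sources are read with `spark.read.parquet()` (as in the second job above), the framework can also be pointed at Parquet files outside the catalog. The following is a minimal sketch of such a configuration; the local paths and job names are hypothetical, and only the `sellerid` and `qtysold` columns of the Tickit sales data are assumed, as in the sample config:

    ---
    # minimal, hypothetical config: reads Parquet objects instead of catalog tables
    name: local_parquet_test
    jobs:
      - name: summarize_sales
        sources:
          # `object` sources are loaded with spark.read.parquet() rather than spark.table()
          - object: "/tmp/tickit/sales"                    # hypothetical input path
            view: vw_sales
        transforms:
          - sql: >-
              CREATE TEMPORARY VIEW vw_final AS
              SELECT sellerid, SUM(qtysold) AS total_sales
              FROM vw_sales
              GROUP BY sellerid
        targets:
          final_object: vw_final
          target_location: "/tmp/tickit/sales_by_seller"   # hypothetical output path

Such a config is run the same way as the sample one, by passing its filename to spark-submit along with process_sql_statements.py.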