├── README.md
├── config.yml
└── process_sql_statements.py

/README.md:
--------------------------------------------------------------------------------
**spark-sql-etl-framework**
==============
Multi-Stage SQL-Based ETL Processing Framework Written in PySpark

process_sql_statements.py is a PySpark application which reads its configuration from a YAML document (see config.yml in this project). The configuration specifies a set of input sources -
table objects available from the catalog of the current SparkSession (for instance an AWS Glue Catalog) - in the `sources` section. The `transforms` section is a list of
transformations written as SQL statements using temporary views in Spark SQL; this is akin to using CTEs (common table expressions) or volatile tables when performing typical multi-stage,
complex ETL routines on traditional relational database systems. The `targets` section defines the location to which the final output object is written.

The sample configuration uses the framework (`process_sql_statements.py`) to process a multi-stage SQL ETL routine using data from the
[AWS Sample Tickit Database](https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html), which has been stored as S3 objects and catalogued using Hive/AWS Glue.
Modify the `config.yml` file to specify your targets, projections, filters and transformations, then run as follows:

    spark-submit process_sql_statements.py config.yml

Dependencies
--------------
- Spark 2.x
--------------------------------------------------------------------------------
/config.yml:
--------------------------------------------------------------------------------
---
# the name is used as the Application Name in Spark/YARN
name: multi_stage_sql_test
jobs:
  - name: get_data
    sources:
      # a reference to each source element is loaded in sequence into a temporary view
      - table: tickit.users
        view: vw_users
        # you can limit the columns to be projected in the source view here
        columns:
          - userid
          - username
          - firstname
          - lastname
          - city
        # you can filter rows to be included in the source view here using WHERE conditions
        filters:
          - "city = 'San Diego'"
      - table: tickit.date
        view: vw_dates
        columns:
          - dateid
          - year
        filters:
          - "year = 2008"
      # alternatively you can project all columns and include all rows
      - table: tickit.sales
        view: vw_sales
    transforms:
      # transforms contain a list of SQL statements to be processed in sequence
      - sql: >-
          CREATE TEMPORARY VIEW vw_sales_dates AS
          SELECT s.sellerid, s.qtysold FROM vw_sales s
          INNER JOIN vw_dates d
          ON s.dateid = d.dateid
      # each SQL transform creates a temporary view by referencing views created by previous statements
      - sql: >-
          CREATE TEMPORARY VIEW vw_sales_users AS
          SELECT s.sellerid, u.username,
          (u.firstname ||' '|| u.lastname) AS name,
          u.city, s.qtysold
          FROM vw_sales_dates s
          INNER JOIN vw_users u
          ON s.sellerid = u.userid
      # the final transform element should create the final object which you intend to publish
      - sql: >-
          CREATE TEMPORARY VIEW vw_final AS
          SELECT sellerid, username, name, city,
          SUM(qtysold) AS total_sales
          FROM vw_sales_users
          GROUP BY sellerid, username, name, city
    targets:
      # the final view is used to create a DataFrame which is then written out to the target_location
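      # final_object names the temporary view to materialize; target_location is the path
      # to which process_sql_statements.py writes it as Parquet with mode="overwrite"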
      final_object: vw_final
      target_location: s3:///sales_by_users
  # add a second processing step, for example to de-dupe the results from the first query
  - name: remove_dups
    sources:
      - object: "s3:///sales_by_users"
        view: vw_sales
    transforms:
      - sql: >-
          CREATE OR REPLACE TEMPORARY VIEW vw_final AS
          SELECT DISTINCT * FROM vw_sales
    targets:
      final_object: vw_final
      target_location: "s3:///sales_by_users_deduped"
--------------------------------------------------------------------------------
/process_sql_statements.py:
--------------------------------------------------------------------------------
#
#
# process_sql_statements.py
#
# Process config-driven, multi-stage SQL-based ETL using Spark SQL
#
# Example usage:
# spark-submit process_sql_statements.py config.yml

import yaml, sys, datetime
from pyspark.sql import SparkSession

config_file = sys.argv[1]

# safe_load avoids executing arbitrary YAML tags and works with current PyYAML releases
with open(config_file, 'r') as stream:
    config = yaml.safe_load(stream)

print("Initializing SparkSession (%s)..." % (config["name"]))

spark = SparkSession \
    .builder \
    .appName(config["name"]) \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

overall_start = datetime.datetime.now()

for job in config["jobs"]:
    # Load Sources
    print("Creating source views...")
    for source in job["sources"]:
        if source.get("table") is not None:
            # catalog table source
            print("Creating view %s from table %s..." % (source["view"], source["table"]))
            df = spark.table(source["table"])
        else:
            # object source, read directly as Parquet
            print("Creating view %s from object %s..." % (source["view"], source["object"]))
            df = spark.read.parquet(source["object"])
        if source.get("columns") is not None:
            # columns listed, project the given columns only
            df = df.select(source["columns"])
        if source.get("filters") is not None:
            for condition in source["filters"]:
                df = df.filter(condition)
        df.createOrReplaceTempView(source["view"])

    # Perform Transforms
    print("Performing SQL Transformations...")
    for transform in job["transforms"]:
        spark.sql(transform["sql"])

    # Write out final object
    print("Writing out final object to %s..." % (job["targets"]["target_location"]))
    start = datetime.datetime.now()
    final_df = spark.table(job["targets"]["final_object"])
    final_df.write.parquet(job["targets"]["target_location"], mode="overwrite")
    finish = datetime.datetime.now()
    print("Finished writing out target object...")

    ### Remove this before productionizing
    # print("Sample output:")
    # spark.read.parquet(job["targets"]["target_location"]).show()
    ###

    print("Total number of output rows: %s (%s)" % (str(spark.read.parquet(job["targets"]["target_location"]).count()), str(finish - start)))

overall_finish = datetime.datetime.now()

print("Total time taken: %s" % (str(overall_finish - overall_start)))
spark.sparkContext.stop()
--------------------------------------------------------------------------------
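
Because `object` sources are read with `spark.read.parquet()` (as in the second job above), the framework can also be pointed at Parquet files outside the catalog. The following is a minimal sketch of such a configuration; the local paths and job names are hypothetical, and only the `sellerid` and `qtysold` columns of the Tickit sales data are assumed, as in the sample config:

    ---
    # minimal, hypothetical config: reads Parquet objects instead of catalog tables
    name: local_parquet_test
    jobs:
      - name: summarize_sales
        sources:
          # `object` sources are loaded with spark.read.parquet() rather than spark.table()
          - object: "/tmp/tickit/sales"                    # hypothetical input path
            view: vw_sales
        transforms:
          - sql: >-
              CREATE TEMPORARY VIEW vw_final AS
              SELECT sellerid, SUM(qtysold) AS total_sales
              FROM vw_sales
              GROUP BY sellerid
        targets:
          final_object: vw_final
          target_location: "/tmp/tickit/sales_by_seller"   # hypothetical output path

Such a config is run the same way as the sample one, by passing its filename to spark-submit along with process_sql_statements.py.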