├── .gitignore ├── README.md ├── cloudbuild.yaml └── src ├── lib ├── __init__.py ├── bq_api_data_functions.py ├── data_ingestion.py ├── helper_functions.py ├── infrastructure_setup.py └── schemas.py ├── main.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | #exclude files and directories in top directory 2 | * 3 | 4 | #include the src directory 5 | !/src 6 | 7 | #include the src/lib directory 8 | !/src/lib 9 | 10 | #include gitignore 11 | !.gitignore 12 | 13 | #include README 14 | !README.md 15 | 16 | #include cloudbuild config 17 | !cloudbuild.yaml -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Serverless Data Pipeline GCP 2 | 3 | **Deploy an end-to-end data pipeline for Chicago traffic API data and measure function performance using: Cloud Functions, Pub/Sub, Cloud Storage, Cloud Scheduler, BigQuery, Stackdriver Trace** 4 | 5 | **Usage:** 6 | 7 | - Use this as a template for your data pipelines 8 | - Use this to extract and land Chicago traffic data 9 | - Pick and choose which modules/code snippets are useful 10 | - Understand how to measure function performance throughout execution 11 | - Show me how to do it better :) 12 | 13 | **Data Pipeline Operations:** 14 | 15 | 1. Creates raw data bucket 16 | 2. Creates BigQuery dataset and raw, staging, and final tables with defined schemas 17 | 3. Downloads data from Chicago traffic API 18 | 4. Ingests data as pandas dataframe 19 | 5. Uploads pandas dataframe to raw data bucket as a parquet file 20 | 6. Converts dataframe schema to match BigQuery defined schema 21 | 7. Uploads pandas dataframe to raw BigQuery table 22 | 8. Runs SQL queries to capture and accumulate unique records based on the current date 23 | 9. Sends function performance metrics to Stackdriver Trace 24 | 25 | **Technologies:** Cloud Shell, Cloud Functions, Pub/Sub, Cloud Storage, Cloud Scheduler, BigQuery, Stackdriver Trace 26 | 27 | **Languages:** Python 3.7, SQL (Standard) 28 | 29 | **Technical Concepts:** 30 | 31 | - This was designed as a Python-native solution and starter template for simple data pipelines 32 | - This follows the procedural paradigm, which groups like functions in separate files and imports them as modules into the main entrypoint file. Each operation runs ordered actions on objects, vs. OOP in which the objects perform the actions 33 | - Pub/Sub is used as middleware, rather than invoking an HTTP-triggered cloud function, to minimize the overhead code needed to secure the URL; Pub/Sub also acts as a shock absorber against sudden bursts of invocations 34 | - This is not intended for robust production deployment as it doesn't account for edge cases or dynamic error handling 35 | - Lessons Learned: Leverage classes for interdependent functions and for creating extensibility in table objects. I also forced stack-tracing functionality to measure function performance on a pub/sub trigger, so you can't create a report based on HTTP requests to analyze performance trends. Stackdriver Trace will start auto-creating performance reports once there's enough data. 36 | 37 | **Further Reading:** For those looking for production-level deployments 38 | 39 | - Streaming Tutorial: 40 | - BQPipeline Utility Functions: 41 | - I discovered the above after I created this pipeline...HA!
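**Local smoke test (optional):** The Cloud Function entry point `handler(event, context)` in `src/main.py` only reads a base64-encoded `data` field from the Pub/Sub event, so the whole pipeline can be exercised from a local Python session before you deploy anything. The snippet below is a minimal sketch, not a file in this repo: it assumes the packages in `src/requirements.txt` are installed, that you run it from the `src` directory on a machine with a writable `/tmp`, that your application-default credentials can reach BigQuery, Cloud Storage, and Stackdriver Trace, and that you have first pointed `project_id` and `bucket_name` in `main.py` at your own project. It will create real GCP resources.

```python
# local_test.py (hypothetical file) - simulate the Pub/Sub trigger without deploying
import base64

from main import handler  # run from src/ so the `lib` imports resolve

# Pub/Sub delivers the message body base64-encoded under the "data" key
event = {"data": base64.b64encode(b"local smoke test").decode("utf-8")}

handler(event, context=None)  # handler never uses context, so None is fine
```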
42 | 43 | ## Deployment Instructions 44 | 45 | **Prerequisites:** 46 | 47 | - An open Google Cloud account: 48 | - Proficient in Python and SQL 49 | - A heart and mind eager to create data pipelines 50 | 51 | [![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.png)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/sungchun12/serverless_data_pipeline_gcp.git) 52 | 53 | _OR_ 54 | 55 | 1. Activate Cloud Shell: 56 | 2. Clone the repository 57 | 58 | ```bash 59 | git clone https://github.com/sungchun12/serverless_data_pipeline_gcp.git 60 | ``` 61 | 62 | - Enable Google Cloud APIs: Stackdriver Trace API, Cloud Functions API, Cloud Pub/Sub API, Cloud Scheduler API, Cloud Storage API, BigQuery API, Cloud Build API (gcloud CLI equivalents below when submitted through Cloud Shell) 63 | 64 | ```bash 65 | #set project id 66 | gcloud config set project [your-project-id] 67 | ``` 68 | 69 | ```bash 70 | #Enable Google service APIs 71 | gcloud services enable \ 72 | cloudfunctions.googleapis.com \ 73 | cloudtrace.googleapis.com \ 74 | pubsub.googleapis.com \ 75 | cloudscheduler.googleapis.com \ 76 | storage-component.googleapis.com \ 77 | bigquery-json.googleapis.com \ 78 | cloudbuild.googleapis.com 79 | ``` 80 | 81 | --- 82 | 83 | **Note**: If you want to automate the build and deployment of this pipeline, submit the following commands in order in Cloud Shell after completing the above steps. This skips step 5 below, as it is redundant for auto-deployment. 84 | 85 | Find your Cloud Build service account in the IAM console. Ex: [unique-id]@cloudbuild.gserviceaccount.com 86 | 87 | ```bash 88 | #add role permissions to the Cloud Build service account (one role per binding command) 89 | for role in roles/cloudfunctions.developer roles/cloudscheduler.admin roles/logging.viewer; do 90 | gcloud projects add-iam-policy-binding [your-project-id] \ 91 | --member serviceAccount:[unique-id]@cloudbuild.gserviceaccount.com \ 92 | --role "$role" 93 | done 94 | ``` 95 | 96 | ```bash 97 | #Deploy steps in cloudbuild configuration file 98 | gcloud builds submit --config cloudbuild.yaml . 99 | ``` 100 | 101 | --- 102 | 103 | 3. Change directory to the relevant code 104 | 105 | ```bash 106 | cd serverless_data_pipeline_gcp/src 107 | ``` 108 | 109 | 4. Deploy the cloud function with a pub/sub trigger. Note: this will automatically create the topic if it does not exist 110 | 111 | ```bash 112 | gcloud functions deploy [function-name] --entry-point handler --runtime python37 --trigger-topic [topic-name] 113 | ``` 114 | 115 | Ex: 116 | 117 | ```bash 118 | gcloud functions deploy demo_function --entry-point handler --runtime python37 --trigger-topic demo_topic 119 | ``` 120 | 121 | 5. Test the cloud function by publishing a message to the pub/sub topic (a Python client equivalent is sketched at the end of this document) 122 | 123 | ```bash 124 | gcloud pubsub topics publish [topic-name] --message "" 125 | ``` 126 | 127 | Ex: 128 | 129 | ```bash 130 | gcloud pubsub topics publish demo_topic --message "Can you see this?" 131 | ``` 132 | 133 | 6. Check the logs to see how the function performed. You may have to re-execute this command multiple times if logs don't show up initially 134 | 135 | ```bash 136 | gcloud functions logs read --limit 50 137 | ``` 138 | 139 | 7. 
Deploy cloud scheduler job which publishes a message to Pub/Sub every 5 minutes 140 | 141 | ```bash 142 | gcloud beta scheduler jobs create pubsub [job-name] \ 143 | --schedule "*/5 * * * *" \ 144 | --topic [topic-name] \ 145 | --message-body '{""}' \ 146 | --time-zone 'America/Chicago' 147 | ``` 148 | 149 | Ex: 150 | 151 | ```bash 152 | gcloud beta scheduler jobs create pubsub schedule_function \ 153 | --schedule "*/5 * * * *" \ 154 | --topic demo_topic \ 155 | --message-body '{"Can you see this? With love, cloud scheduler"}' \ 156 | --time-zone 'America/Chicago' 157 | ``` 158 | 159 | 8. Test end to end pipeline by manually running cloud scheduler job. Next, repeat step 6 above. 160 | 161 | ```bash 162 | gcloud beta scheduler jobs run [job-name] 163 | ``` 164 | 165 | Ex: 166 | 167 | ```bash 168 | gcloud beta scheduler jobs run schedule_function 169 | ``` 170 | 171 | 9. Understand pipeline performance by opening stacktrace and click "get_kpis": 172 | 173 | **YOUR PIPELINE IS DEPLOYED AND MEASURABLE!** 174 | 175 | **Note:** You'll notice extraneous blocks of comments and commented out code throughout the python scripts. 176 | -------------------------------------------------------------------------------- /cloudbuild.yaml: -------------------------------------------------------------------------------- 1 | steps: 2 | # clone the git repository 3 | - name: "gcr.io/cloud-builders/git" 4 | args: 5 | ["clone", "https://github.com/sungchun12/serverless_data_pipeline_gcp"] 6 | 7 | # Deploy cloud function with pub/sub trigger from clone directory 8 | - name: "gcr.io/cloud-builders/gcloud" 9 | args: 10 | [ 11 | "functions", 12 | "deploy", 13 | "demo_function", 14 | "--entry-point", 15 | "handler", 16 | "--runtime", 17 | "python37", 18 | "--trigger-topic", 19 | "demo_topic", 20 | ] 21 | dir: "serverless_data_pipeline_gcp/src" 22 | 23 | # Deploy cloud scheduler job which publishes a message to Pub/Sub every 5 minutes 24 | # exits if it already exists 25 | - name: "gcr.io/cloud-builders/gcloud" 26 | entrypoint: "bash" 27 | args: 28 | - "-c" 29 | - | 30 | gcloud beta scheduler jobs create pubsub schedule_function \ 31 | --schedule "*/5 * * * *" \ 32 | --topic demo_topic \ 33 | --message-body '{"Can you see this? With love, cloud scheduler"}' \ 34 | --time-zone 'America/Chicago' || exit 0 35 | 36 | # Manually run the cloud scheduler job 37 | - name: "gcr.io/cloud-builders/gcloud" 38 | args: ["beta", "scheduler", "jobs", "run", "schedule_function"] 39 | 40 | # Check logs to see how function performed. 41 | - name: "gcr.io/cloud-builders/gcloud" 42 | args: ["functions", "logs", "read", "--limit", "50"] 43 | 44 | # set logs bucket location for builds 45 | # set this to whatever bucket you want or remove entirely and cloudbuild will create a default bucket 46 | logsBucket: "gs://cloud_function_build_demo" 47 | -------------------------------------------------------------------------------- /src/lib/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sungchun12/serverless-data-pipeline-gcp/14716017356e2ed64f204acd573117d0940b38d2/src/lib/__init__.py -------------------------------------------------------------------------------- /src/lib/bq_api_data_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which captures number of rows in tables and runs SQL queries. 
3 | 4 | This module is responsible for: 5 | -Capturing total number of rows in a BigQuery table 6 | -Querying max timestamp in a date field based on rows within current date 7 | -Querying unique records based on the above timestamp 8 | -Appending unique records to a final table 9 | 10 | """ 11 | # gcp modules 12 | from google.cloud import bigquery 13 | 14 | # import logging 15 | from lib.helper_functions import set_logger 16 | 17 | logger = set_logger(__name__) 18 | 19 | 20 | def bq_table_num_rows(dataset_name, table_name): 21 | """Log total number of rows in destination bigquery table. 22 | 23 | Args: 24 | dataset_name: destination dataset name 25 | table_name: destination table name 26 | 27 | Returns: 28 | Integer object: ``num_rows`` 29 | 30 | """ 31 | bigquery_client = ( 32 | bigquery.Client() 33 | ) # instantiate bigquery client to interact with api 34 | dataset_ref = bigquery_client.dataset(dataset_name) # create dataset obj 35 | table_ref = dataset_ref.table(table_name) # create table obj 36 | table = bigquery_client.get_table(table_ref) # API Request 37 | num_rows = table.num_rows 38 | logger.info(f"Total number of rows: {table.num_rows} in {table_ref.path}") 39 | return num_rows 40 | 41 | 42 | def query_max_timestamp(project_id, dataset_name, table_name): 43 | """Return the max timestamp in BigQuery table after current date in CST. 44 | 45 | Args: 46 | project_id: destination project id 47 | dataset_name: destination dataset name 48 | table_name: destination table name 49 | 50 | Returns: 51 | string object: ``max_timestamp`` 52 | 53 | """ 54 | # in central Chicago time 55 | sql = f"SELECT max(_last_updt) as max_timestamp \ 56 | FROM `{project_id}.{dataset_name}.{table_name}` \ 57 | WHERE _last_updt >= TIMESTAMP(CURRENT_DATE('-06:00'))" 58 | bigquery_client = bigquery.Client() # setup the client 59 | query_job = bigquery_client.query(sql) # run the query 60 | results = query_job.result() # waits for job to complete 61 | for row in results: # returns the result 62 | max_timestamp = row.max_timestamp.strftime("%Y-%m-%d %H:%M:%S") 63 | return max_timestamp 64 | 65 | 66 | # https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#methods 67 | # https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig 68 | # The following values are supported: 69 | # configuration.copy.writeDisposition 70 | # WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table 71 | # WRITE_APPEND: If the table already exists, BigQuery appends the data 72 | # WRITE_EMPTY: If the table already exists and contains data, 73 | # a 'duplicate' error is returned in the job result. 74 | def query_unique_records(project_id, dataset_name, table_name, table_name_2): 75 | """Queries unique records from original table, saves in staging table. 76 | 77 | Unique records are filtered by filtering all records >= the max timestamp. 
78 | 79 | Args: 80 | project_id: destination project id 81 | dataset_name: destination dataset name 82 | table_name: starting table name 83 | table_name_2: destination table name 84 | 85 | """ 86 | bigquery_client = bigquery.Client() 87 | job_config = bigquery.QueryJobConfig() 88 | table_ref = bigquery_client.dataset(dataset_name).table( 89 | table_name_2 90 | ) # set destination table 91 | job_config.destination = table_ref 92 | job_config.write_disposition = "WRITE_TRUNCATE" 93 | max_timestamp = query_max_timestamp(project_id, dataset_name, table_name) 94 | sql = f"SELECT DISTINCT * FROM `{project_id}.{dataset_name}.{table_name}` \ 95 | WHERE _last_updt >= TIMESTAMP(DATETIME '{max_timestamp}');" 96 | query_job = bigquery_client.query( 97 | sql, 98 | # Location must match that of the dataset(s) referenced in the query 99 | # and of the destination table. 100 | location="US", 101 | job_config=job_config, 102 | ) # API request - starts the query 103 | query_job.result() 104 | logger.info(f"Query results loaded to table {table_ref.path}") 105 | 106 | 107 | def append_unique_records(project_id, dataset_name, table_name, table_name_2): 108 | """Queries unique staging table and appends new results onto final table. 109 | 110 | Args: 111 | project_id: destination project id 112 | dataset_name: destination dataset name 113 | table_name: starting table name 114 | table_name_2: destination table name 115 | 116 | """ 117 | bigquery_client = bigquery.Client() 118 | job_config = bigquery.QueryJobConfig() 119 | table_ref = bigquery_client.dataset(dataset_name).table( 120 | table_name_2 121 | ) # set destination table 122 | job_config.destination = table_ref 123 | job_config.write_disposition = "WRITE_APPEND" 124 | # left outer join to avoid appending duplicate data 125 | sql = f"SELECT a.* FROM `{project_id}.{dataset_name}.{table_name}` a \ 126 | LEFT JOIN `{project_id}.{dataset_name}.{table_name_2}` b \ 127 | on a.segmentid = b.segmentid AND a._last_updt = b._last_updt \ 128 | WHERE b.segmentid IS NULL;" 129 | query_job = bigquery_client.query( 130 | sql, 131 | # Location must match that of the dataset(s) referenced in the query 132 | # and of the destination table. 133 | location="US", 134 | job_config=job_config, 135 | ) # API request - starts the query 136 | query_job.result() 137 | logger.info(f"Query results loaded to table {table_ref.path}") 138 | -------------------------------------------------------------------------------- /src/lib/data_ingestion.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which contains functions to ingest and check data for outliers. 
3 | 4 | This module is responsible for: 5 | -Creating a pandas dataframe using an api call to Chicago traffic data 6 | -Uploading a pandas dataframe to a google cloud storage bucket 7 | -Converting pandas dataframe schema 8 | -Checking for null outliers in scope 9 | -Uploading a pandas dataframe to BigQuery 10 | 11 | """ 12 | 13 | # built in python modules 14 | import os 15 | 16 | # gcp modules 17 | from google.cloud import storage 18 | import pandas_gbq as gbq 19 | from google.cloud import bigquery 20 | 21 | # api module 22 | from sodapy import Socrata 23 | 24 | # pandas dataframe module 25 | import pandas as pd 26 | 27 | from lib.helper_functions import _getToday, set_logger 28 | 29 | logger = set_logger(__name__) 30 | 31 | 32 | def create_results_df(): 33 | """Create a dataframe based on JSON from the Chicago traffic API 34 | 35 | Args: 36 | None 37 | 38 | Returns: 39 | Dataframe object: ``results_df`` 40 | 41 | """ 42 | try: 43 | # First 2000 results, returned as JSON from API / converted to Python list of 44 | # dictionaries by sodapy. 45 | # Unauthenticated client only works with public data sets. Note 'None' 46 | # in place of application token, and no username or password: 47 | data_client = Socrata("data.cityofchicago.org", None) 48 | results = data_client.get( 49 | "8v9j-bter", limit=2000 50 | ) # unique id for chicago traffic data 51 | 52 | # Convert to pandas DataFrame 53 | results_df = pd.DataFrame.from_records(results) 54 | logger.info("Successfully created a pandas dataframe!") 55 | 56 | return results_df 57 | except Exception as e: 58 | logger.error("Failure to create a pandas dataframe :(") 59 | raise e 60 | 61 | 62 | # GZIP compression uses more CPU resources than Snappy or LZO, 63 | # but provides a higher compression ratio. 64 | # GZip is often a good choice for cold data, which is accessed infrequently. 65 | # Snappy or LZO are a better choice for hot data, which is accessed frequently. 66 | # The only writeable part of the filesystem is the /tmp directory, which you 67 | # can use to store temporary files in a function instance. 68 | # This is a local disk mount point known as a "tmpfs" volume in which data 69 | # written to the volume is stored in memory. 70 | # Note that it will consume memory resources provisioned for the function. 71 | def upload_raw_data_gcs(results_df, bucket_name): 72 | """Upload dataframe into google cloud storage bucket. 73 | 74 | Deletes file in temp directory after upload. 
75 | 76 | Args: 77 | results_df: pandas dataframe 78 | bucket_name: name of bucket to upload data towards 79 | 80 | """ 81 | # Write the DataFrame to GCS (Google Cloud Storage) 82 | storage_client = storage.Client() 83 | # .from_service_account_json('service_account.json') #authenticate service account 84 | bucket = storage_client.bucket(bucket_name) # capture bucket details 85 | timestamp = _getToday() 86 | source_file_name = "traffic_" + timestamp + ".gzip" # create the file name 87 | temp_path = "/tmp" 88 | os.chdir(temp_path) # change to tmp path 89 | # blob.upload_from_string(results_df.to_parquet(source_file_name, engine = 'pyarrow', compression = 'gzip'),content_type='gzip') 90 | results_df.to_parquet(source_file_name, engine="pyarrow", compression="gzip") 91 | # blob = bucket.blob(os.path.basename(source_file_name)) #define the path to the file 92 | blob = bucket.blob(source_file_name) # define the binary large object(blob) 93 | blob.upload_from_filename(source_file_name) # upload to bucket 94 | logger.info(f"Successfully uploaded parquet gzip file into: {bucket}") 95 | delete_temp_dir() 96 | 97 | 98 | def delete_temp_dir(): 99 | """Deletes every file in the /tmp directory. 100 | 101 | This minimizes memory load during function execution. 102 | 103 | Args: 104 | None 105 | 106 | """ 107 | # check what's in the temporary directory 108 | temp_path = "/tmp" 109 | logger.info(os.listdir(temp_path)) 110 | # delete all files in the temporary path if they exist 111 | logger.info(f"Deleting all files in {temp_path} directory") 112 | for file in os.listdir(temp_path): 113 | file_path = os.path.join(temp_path, file) 114 | if os.path.isfile(file_path): 115 | os.unlink(file_path) 116 | # check that the temporary directory is empty 117 | if len(os.listdir(temp_path)) == 0: 118 | logger.info(f"{temp_path} directory is empty") 119 | else: 120 | logger.warning(f"{temp_path} directory is NOT empty") 121 | 122 | 123 | # https://stackoverflow.com/questions/21886742/convert-pandas-dtypes-to-bigquery-type-representation 124 | # https://stackoverflow.com/questions/44953463/pandas-google-bigquery-schema-mismatch-makes-the-upload-fail 125 | # http://pbpython.com/pandas_dtypes.html 126 | def convert_schema(results_df, schema_df): 127 | """Converts data types in dataframe to match BigQuery destination table. 128 | 129 | Args: 130 | results_df: pandas dataframe 131 | schema_df: schema to convert towards 132 | 133 | Returns: 134 | Dataframe Object: ``results_df_transformed`` 135 | 136 | """ 137 | for ( 138 | k, 139 | v, 140 | ) in ( 141 | schema_df.items() 142 | ): # for each column name in the dictionary, convert the data type in the dataframe 143 | results_df[k] = results_df[k].astype(v) 144 | results_df_transformed = results_df 145 | logger.info("Updated schema to match BigQuery destination table") 146 | return results_df_transformed 147 | 148 | 149 | def check_nulls(results_df_transformed): 150 | """Creates list of column names if any nulls in the columns. 
151 | 152 |     Args: 153 |         results_df_transformed: pandas dataframe with converted schema 154 | 155 |     Returns: 156 |         list object: ``null_columns`` 157 |     """ 158 |     null_columns = [] 159 |     check_bool = ( 160 |         results_df_transformed.isnull().any() 161 |     )  # returns a boolean True/False for every column in dataframe if it contains nulls 162 |     for ( 163 |         k, 164 |         v, 165 |     ) in ( 166 |         check_bool.items() 167 |     ):  # v is True for each column that contains at least one null 168 |         if v: 169 |             null_columns.append(k) 170 |     logger.info(f"These are the null columns: {null_columns}") 171 |     return null_columns 172 | 173 | 174 | def check_null_outliers(null_columns, nulls_expected): 175 |     """Creates list of any outlier null columns. 176 | 177 |     Args: 178 |         null_columns: current list of null columns in pandas dataframe 179 |         nulls_expected: list of nulls in scope 180 | 181 |     Returns: 182 |         list object: ``null_outliers`` 183 | 184 |     """ 185 |     null_outliers = ( 186 |         [] 187 |     )  # empty list to collect columns that are not expected to be null 188 |     for x in null_columns:  # flag any null column that is not in the expected list 189 |         if x not in nulls_expected: 190 |             null_outliers.append(x) 191 |     logger.info(f"These are the outlier null columns: {null_outliers}") 192 |     return null_outliers 193 | 194 | 195 | # figure out the nullable vs. required mode schema mismatch 196 | # https://cloud.google.com/bigquery/docs/pandas-gbq-migration#loading_a_pandas_dataframe_to_a_table 197 | # https://github.com/pydata/pandas-gbq/issues/133#issuecomment-411119426 198 | def upload_to_gbq(results_df_transformed, project_id, dataset_name, table_name): 199 |     """Uploads data into bigquery and appends if data already exists. 200 | 201 |     Args: 202 |         results_df_transformed: pandas dataframe with converted schema 203 |         project_id: name of project where you want to upload data 204 |         dataset_name: name of target dataset 205 |         table_name: name of target table 206 | 207 |     """ 208 |     bigquery_client = bigquery.Client() 209 |     dataset_ref = bigquery_client.dataset(dataset_name) 210 |     table_ref = dataset_ref.table(table_name) 211 |     # job_config = bigquery.job.LoadJobConfig() #configure how the data loads into bigquery 212 |     # job_config.write_disposition = 'WRITE_TRUNCATE' #if table exists, append to it 213 |     # job_config.ignoreUnknownValues = 'T' #ignore columns that don't match destination schema 214 |     # job_config.schema_update_options ='ALLOW_FIELD_ADDITION' 215 |     # TODO: bad request due to schema mismatch with an index field 216 |     # https://github.com/googleapis/google-cloud-python/issues/5572 217 |     # bigquery_client.load_table_from_dataframe(results_df_transformed, 218 |     # table_ref, num_retries = 5, job_config = job_config).result() 219 |     gbq.to_gbq( 220 |         results_df_transformed, 221 |         dataset_name + "." + table_name, 222 |         project_id, 223 |         if_exists="append", 224 |         location="US", 225 |         progress_bar=True, 226 |     ) 227 |     logger.info(f"Data uploaded into: {table_ref.path}") 228 | -------------------------------------------------------------------------------- /src/lib/helper_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module with miscellaneous utility functions. 3 | 4 | This module contains a function to capture the current datetime stamp, 5 | and configures logging format. 6 | 7 | This module can be used to add more helper functions as needed.
8 | 9 | """ 10 | 11 | # built in python modules 12 | 13 | from datetime import datetime 14 | import logging 15 | import sys 16 | 17 | 18 | def _getToday(): 19 | """Returns timestamp string""" 20 | return datetime.now().strftime("%Y%m%d%H%M%S") 21 | 22 | 23 | def set_logger(__name__): 24 | """Configures logger for all modules and returns logger object""" 25 | logger = logging.getLogger(__name__) 26 | logger.setLevel(logging.INFO) 27 | 28 | formatter = logging.Formatter("%(levelname)s|%(asctime)s|%(name)s|%(message)s") 29 | 30 | stream_handler = logging.StreamHandler(sys.stdout) 31 | stream_handler.setFormatter(formatter) 32 | 33 | logger.addHandler(stream_handler) 34 | return logger 35 | -------------------------------------------------------------------------------- /src/lib/infrastructure_setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which creates data pipeline storage infrastructure 3 | 4 | This module has functions that create a raw data bucket 5 | in google cloud storage, and creates dataset-table pairs. 6 | 7 | """ 8 | # gcp modules 9 | from google.cloud import storage 10 | from google.cloud import bigquery 11 | 12 | # import logging 13 | from lib.helper_functions import set_logger 14 | 15 | logger = set_logger(__name__) 16 | 17 | 18 | def create_bucket(bucket_name): 19 | """Creates a bucket if not detected 20 | 21 | Args: 22 | bucket_name: name of GCS bucket to be created 23 | 24 | Returns: 25 | ``Created a new bucket: `` 26 | OR ``Bucket already exists: `` 27 | 28 | """ 29 | client = storage.Client() 30 | # authenticate service account 31 | # .from_service_account_json('service_account.json') 32 | bucket = client.bucket(bucket_name) # capture bucket details 33 | bucket.location = "US-CENTRAL1" # define regional location 34 | if not bucket.exists(): # checks if bucket doesn't exist 35 | bucket.create() 36 | logger.info(f"Created a new bucket: {bucket.path}") 37 | else: 38 | logger.info(f"Bucket already exists: {bucket.path}") 39 | 40 | 41 | def dataset_exists(client, dataset_reference): 42 | """Return if a table exists. 43 | 44 | Args: 45 | client (google.cloud.bigquery.client.Client): 46 | A client to connect to the BigQuery API. 47 | table_reference (google.cloud.bigquery.table.TableReference): 48 | A reference to the table to look for. 49 | 50 | Returns: 51 | bool: ``True`` if the table exists, ``False`` otherwise. 52 | 53 | """ 54 | from google.cloud.exceptions import NotFound 55 | 56 | try: 57 | client.get_dataset(dataset_reference) 58 | return True 59 | except NotFound: 60 | return False 61 | 62 | 63 | def table_exists(client, table_reference): 64 | """Return if a table exists. 65 | 66 | Args: 67 | client (google.cloud.bigquery.client.Client): 68 | A client to connect to the BigQuery API. 69 | table_reference (google.cloud.bigquery.table.TableReference): 70 | A reference to the table to look for. 71 | 72 | Returns: 73 | bool: ``True`` if the table exists, ``False`` otherwise. 74 | 75 | """ 76 | from google.cloud.exceptions import NotFound 77 | 78 | try: 79 | client.get_table(table_reference) 80 | return True 81 | except NotFound: 82 | return False 83 | 84 | 85 | # https://cloud.google.com/bigquery/docs/python-client-migration#update_a_table 86 | def create_dataset_table(dataset_name, table_name, table_desc, schema, partition_by): 87 | """Creates a new dataset and/or table if not detected. 
88 | 89 | Args: 90 | dataset_name: Name of dataset to be created 91 | table_name: Name of table to be created within dataset 92 | table_desc: table descriptions 93 | schema: table schema with data types 94 | partition_by: Which datetime field to partition by 95 | 96 | Returns: 97 | ``Created new dataset: `` 98 | OR ``Dataset already exists: `` 99 | & 100 | ``Created empty table partitioned on column: partition_by`` 101 | OR ``Table already exists: `` 102 | 103 | """ 104 | # setup the client 105 | bigquery_client = bigquery.Client() 106 | 107 | # Create a DatasetReference using a chosen dataset ID. 108 | dataset_ref = bigquery_client.dataset( 109 | dataset_name 110 | ) # The project defaults to the Client's project if not specified. 111 | 112 | # Construct a full Dataset object to send to the API. 113 | dataset = bigquery.Dataset(dataset_ref) 114 | 115 | # Specify the geographic location where the dataset should reside. 116 | dataset.location = "US" 117 | 118 | # Send the dataset to the API for creation. 119 | # Raises google.api_core.exceptions. 120 | # Conflict if the Dataset already exists within the project. 121 | if ( 122 | dataset_exists(bigquery_client, dataset_ref) is False 123 | ): # checks if dataset not found 124 | dataset = bigquery_client.create_dataset(dataset) # API request 125 | logger.info(f"Created new dataset: {dataset_ref.path}") 126 | else: 127 | logger.info(f"Dataset already exists: {dataset_ref.path}") 128 | 129 | # Create an empty table 130 | table_ref = dataset_ref.table( 131 | table_name 132 | ) # construct a full table object to send to the api 133 | 134 | if table_exists(bigquery_client, table_ref) is False: 135 | table = bigquery.Table(table_ref, schema=schema) 136 | table.time_partitioning = bigquery.TimePartitioning( 137 | type_=bigquery.TimePartitioningType.DAY, 138 | field=partition_by, # day is the only supported type for now 139 | ) # name of column to use for partitioning 140 | table = bigquery_client.create_table(table) 141 | assert table.table_id == table_name # checks if table_id matches 142 | 143 | # update the table description 144 | table.description = table_desc 145 | table = bigquery_client.update_table(table, ["description"]) 146 | assert ( 147 | table.description == table_desc 148 | ) # checks if table description matches the update 149 | logger.info( 150 | f"Created empty table partitioned \ 151 | on column: {table.time_partitioning.field}" 152 | ) 153 | else: 154 | logger.info(f"Table already exists: {table_ref.path}") 155 | -------------------------------------------------------------------------------- /src/lib/schemas.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """This contains the bigquery and dataframe schemas for data warehouse setup. 3 | 4 | Change these values for your table schemas in scope. 5 | Data types defined in BigQuery are mapped to pandas dataframe data types. 6 | If data types do not match, data will not be able to be uploaded to bigquery. 
7 | 8 | """ 9 | # gcp modules 10 | from google.cloud import bigquery 11 | 12 | # apply a schema during BigQuery table creation 13 | schema_bq = [ 14 | bigquery.SchemaField( 15 | "_comments", 16 | "STRING", 17 | mode="NULLABLE", 18 | description="Provides extra context to traffic segment", 19 | ), 20 | bigquery.SchemaField( 21 | "_direction", 22 | "STRING", 23 | mode="NULLABLE", 24 | description="Traffic flow direction for the segment.", 25 | ), 26 | bigquery.SchemaField( 27 | "_fromst", 28 | "STRING", 29 | mode="NULLABLE", 30 | description="Start street for the segment in the direction of traffic flow.", 31 | ), 32 | bigquery.SchemaField( 33 | "_last_updt", 34 | "TIMESTAMP", 35 | mode="NULLABLE", 36 | description=" If the \ 37 | LAST_UPDATED time is several days old, it can be assumed that no \ 38 | transit service over the segment currently. \ 39 | These segments are included in the Chicago Traffic Tracker dataset \ 40 | because they are key routes and CDOT intends to monitor \ 41 | traffic conditions through other means in the near future. \ 42 | Will display UTC, but truly represents CST.", 43 | ), # in CST -06:00 44 | bigquery.SchemaField( 45 | "_length", 46 | "FLOAT", 47 | mode="NULLABLE", 48 | description="Length of the segment in miles.", 49 | ), 50 | bigquery.SchemaField( 51 | "_lif_lat", 52 | "FLOAT", 53 | mode="NULLABLE", 54 | description="The starting point latitude. \ 55 | See start_lon for a fuller description.", 56 | ), 57 | bigquery.SchemaField( 58 | "_lit_lat", 59 | "FLOAT", 60 | mode="NULLABLE", 61 | description="The ending point latitude. \ 62 | See start_lon for a fuller description.", 63 | ), 64 | bigquery.SchemaField( 65 | "_lit_lon", 66 | "FLOAT", 67 | mode="NULLABLE", 68 | description="The ending point longitude. \ 69 | See start_lon for a fuller description.", 70 | ), 71 | bigquery.SchemaField( 72 | "_strheading", 73 | "STRING", 74 | mode="NULLABLE", 75 | description="The position of the segment in the address grid. \ 76 | North, South, East, or West of State and Madison.", 77 | ), 78 | bigquery.SchemaField( 79 | "_tost", 80 | "STRING", 81 | mode="NULLABLE", 82 | description="End street for the segment in the direction of traffic flow.", 83 | ), 84 | bigquery.SchemaField( 85 | "_traffic", 86 | "INTEGER", 87 | mode="NULLABLE", 88 | description="Real-time estimated speed in miles per hour. \ 89 | For congestion advisory and traffic maps, this value is compared to \ 90 | a 0-9, 10-20, and 21 & over scale to display heavy, medium, \ 91 | and free flow conditions for the traffic segment. \ 92 | Except for a very few segments speed on city arterials is limited \ 93 | to 30 mph by ordinance.", 94 | ), 95 | bigquery.SchemaField( 96 | "segmentid", 97 | "INTEGER", 98 | mode="NULLABLE", 99 | description="Unique arbitrary number to represent each segment.", 100 | ), 101 | bigquery.SchemaField( 102 | "start_lon", 103 | "FLOAT", 104 | mode="NULLABLE", 105 | description="The longitude associated with the starting point of the \ 106 | segment in the direction of traffic flow. \ 107 | For two-way streets it is roughly at the middle of the half \ 108 | that the segment is representing. \ 109 | For one-way streets this is the street center line. 
", 110 | ), 111 | bigquery.SchemaField( 112 | "street", 113 | "STRING", 114 | mode="NULLABLE", 115 | description="Street name of the traffic segment", 116 | ), 117 | ] 118 | 119 | # apply a schema to pandas dataframe to match BigQuery for equivalent types 120 | schema_df = { 121 | "_comments": "object", 122 | "_direction": "object", 123 | "_fromst": "object", 124 | "_last_updt": "datetime64", 125 | "_length": "float64", 126 | "_lif_lat": "float64", 127 | "_lit_lat": "float64", 128 | "_lit_lon": "float64", 129 | "_strheading": "object", 130 | "_tost": "object", 131 | "_traffic": "int64", 132 | "segmentid": "int64", 133 | "start_lon": "float64", 134 | "street": "object", 135 | } 136 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Cloud Function which creates an end to end data pipeline. 3 | 4 | API Data Source: https://dev.socrata.com/foundry/data.cityofchicago.org/n4j6-wkkf 5 | 6 | This Cloud Function is responsible for: 7 | -Tracing performance of subsets of function calls via spans 8 | -Defining and creating infrastructure such as dataset, tables, bucket 9 | -Ingesting raw data from an api call into google cloud storage 10 | -Converting a pandas dataframe raw data schema to match BigQuery 11 | -Ingesting data into BigQuery 12 | -Run SQL queries capturing recent and unique records based on current date of invocation 13 | -Appending unique records to final table 14 | 15 | """ 16 | # decoding module for pubsub 17 | import base64 18 | 19 | # opencensus modules to trace function performance 20 | # https://opencensus.io/exporters/supported-exporters/python/stackdriver/ 21 | import opencensus 22 | from opencensus.trace import tracer as tracer_module 23 | from opencensus.trace.exporters import stackdriver_exporter 24 | from opencensus.trace.exporters.transports.background_thread import ( 25 | BackgroundThreadTransport, 26 | ) 27 | 28 | # lib modules 29 | from lib.bq_api_data_functions import ( 30 | append_unique_records, 31 | bq_table_num_rows, 32 | query_unique_records, 33 | ) 34 | from lib.data_ingestion import ( 35 | check_null_outliers, 36 | check_nulls, 37 | convert_schema, 38 | create_results_df, 39 | upload_raw_data_gcs, 40 | upload_to_gbq, 41 | ) 42 | from lib.helper_functions import set_logger 43 | from lib.infrastructure_setup import create_bucket, create_dataset_table 44 | 45 | logger = set_logger(__name__) 46 | 47 | 48 | # explains why to use pubsub as middleware 49 | # https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule 50 | def handler(event, context): 51 | """Entry point function that orchestrates the data pipeline from start to finish. 52 | 53 | Triggered from a message on a Cloud Pub/Sub topic. 54 | 55 | Args: 56 | event (dict): Event payload. 57 | context (google.cloud.functions.Context): Metadata for the event. 
58 | 59 | """ 60 | # instantiate trace exporter 61 | project_id = ( 62 | "iconic-range-220603" 63 | ) # capture the project id to where this data will land 64 | exporter = stackdriver_exporter.StackdriverExporter( 65 | project_id=project_id, transport=BackgroundThreadTransport 66 | ) 67 | # instantiate tracer 68 | tracer = tracer_module.Tracer(exporter=exporter) 69 | 70 | with tracer.span(name="get_kpis") as span_get_kpis: 71 | # prints a message from the pubsub trigger 72 | pubsub_message = base64.b64decode(event["data"]).decode("utf-8") 73 | print(pubsub_message) # can be used to configure dynamic pipeline 74 | 75 | with span_get_kpis.span(name="infrastructure_var_setup"): 76 | # define infrastructure variables 77 | bucket_name = ( 78 | "chicago_traffic_raw" 79 | ) # capture bucket name where raw data will be stored 80 | dataset_name = "chicago_traffic_demo" # initial dataset 81 | table_raw = "traffic_raw" # name of table to capture data 82 | table_desc = "Raw, public Chicago traffic data is appended \ 83 | to this table every 5 minutes" # table description 84 | table_staging = "traffic_staging" 85 | table_staging_desc = f"Unique records greater than or equal to \ 86 | current date from table: {table_raw}" 87 | table_final = "traffic_final" 88 | table_final_desc = f"Unique, historical records \ 89 | accumulated from table: {table_raw}" 90 | nulls_expected = ( 91 | "_comments" 92 | ) # tuple of nulls expected for checking data outliers 93 | partition_by = ( 94 | "_last_updt" 95 | ) # partition by the last updated field for faster querying 96 | # and incremental loads 97 | from lib.schemas import schema_bq, schema_df # import schemas 98 | 99 | with span_get_kpis.span(name="infrastructure_creation") as span_infra_create: 100 | # create infrastructure 101 | with span_infra_create.span(name="bucket_creation"): 102 | create_bucket(bucket_name) 103 | with span_infra_create.span(name="dataset_table_creation"): 104 | create_dataset_table( 105 | dataset_name, table_raw, table_desc, schema_bq, partition_by 106 | ) # create raw table 107 | create_dataset_table( 108 | dataset_name, 109 | table_staging, 110 | table_staging_desc, 111 | schema_bq, 112 | partition_by, 113 | ) # create a table for unique records staging 114 | create_dataset_table( 115 | dataset_name, table_final, table_final_desc, schema_bq, partition_by 116 | ) # create a table for unique records final 117 | 118 | with span_get_kpis.span(name="ingest_raw_data") as span_ingest_raw: 119 | with span_ingest_raw.span(name="create_dataframe"): 120 | # access data from API, create dataframe, and upload raw csv 121 | results_df = create_results_df() 122 | with span_ingest_raw.span(name="upload_raw_data_gcs"): 123 | upload_raw_data_gcs(results_df, bucket_name) 124 | 125 | with span_get_kpis.span(name="convert_schema"): 126 | # perform schema conversion on dataframe to match bigquery schema 127 | results_df_transformed = convert_schema(results_df, schema_df) 128 | print(results_df_transformed.dtypes) 129 | 130 | with span_get_kpis.span(name="audit_null_columns"): 131 | # check if there are any nulls in the columns and print exceptions 132 | null_columns = check_nulls(results_df_transformed) 133 | null_outliers = check_null_outliers(null_columns, nulls_expected) 134 | 135 | with span_get_kpis.span(name="upload_to_gbq"): 136 | # upload data to bigquery 137 | upload_to_gbq(results_df_transformed, project_id, dataset_name, table_raw) 138 | bq_table_num_rows(dataset_name, table_raw) 139 | 140 | with span_get_kpis.span(name="preprocess_data") as 
span_prep_data: 141 | # Preprocess data for unique records accumulation 142 | with span_prep_data.span(name="query_unique_records"): 143 | query_unique_records(project_id, dataset_name, table_raw, table_staging) 144 | bq_table_num_rows(dataset_name, table_staging) 145 | with span_prep_data.span(name="append_unique_records"): 146 | append_unique_records( 147 | project_id, dataset_name, table_staging, table_final 148 | ) 149 | bq_table_num_rows(dataset_name, table_final) 150 | logger.info("Data Pipeline Fully Realized!") 151 | -------------------------------------------------------------------------------- /src/requirements.txt: -------------------------------------------------------------------------------- 1 | sodapy 2 | google-cloud-storage 3 | google-cloud-bigquery 4 | google-cloud-trace==0.19.0 5 | opencensus==0.1.8 6 | pyarrow 7 | pandas 8 | pandas-gbq 9 | datetime --------------------------------------------------------------------------------
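Step 5 of the README tests the deployed function with `gcloud pubsub topics publish`; the same message can also be sent with the Pub/Sub Python client library. This is a client-side sketch only, not part of the deployed function: it assumes `pip install google-cloud-pubsub` (the library is not in `src/requirements.txt`, which only covers the function itself), a real project id in place of `[your-project-id]`, and that the `demo_topic` topic already exists.

```python
# publish_test_message.py (hypothetical file) - Python equivalent of:
#   gcloud pubsub topics publish demo_topic --message "Can you see this?"
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("[your-project-id]", "demo_topic")

# Pub/Sub payloads must be bytes; the cloud function decodes them back to UTF-8
future = publisher.publish(topic_path, b"Can you see this?")
print(f"Published message id: {future.result()}")  # blocks until Pub/Sub acknowledges the publish
```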