├── .gitignore ├── README.md ├── cloudbuild.yaml └── src ├── lib ├── __init__.py ├── bq_api_data_functions.py ├── data_ingestion.py ├── helper_functions.py ├── infrastructure_setup.py └── schemas.py ├── main.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | #exclude files and directories in top directory 2 | * 3 | 4 | #include the src directory 5 | !/src 6 | 7 | #include the src/lib directory 8 | !/src/lib 9 | 10 | #include gitignore 11 | !.gitignore 12 | 13 | #include README 14 | !README.md 15 | 16 | #include cloudbuild config 17 | !cloudbuild.yaml -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Serverless Data Pipeline GCP 2 | 3 | **Deploy an end-to-end data pipeline for Chicago traffic API data and measure function performance using: Cloud Functions, Pub/Sub, Cloud Storage, Cloud Scheduler, BigQuery, Stackdriver Trace** 4 | 5 | **Usage:** 6 | 7 | - Use this as a template for your data pipelines 8 | - Use this to extract and land Chicago traffic data 9 | - Pick and choose which modules/code snippets are useful 10 | - Understand how to measure function performance throughout execution 11 | - Show me how to do it better :) 12 | 13 | **Data Pipeline Operations:** 14 | 15 | 1. Creates raw data bucket 16 | 2. Creates BigQuery dataset and raw, staging, and final tables with defined schemas 17 | 3. Downloads data from Chicago traffic API 18 | 4. Ingests data as pandas dataframe 19 | 5. Uploads pandas dataframe to raw data bucket as a parquet file 20 | 6. Converts dataframe schema to match BigQuery defined schema 21 | 7. Uploads pandas dataframe to raw BigQuery table 22 | 8. Runs SQL queries to capture and accumulate unique records based on the current date 23 | 9. Sends function performance metrics to Stackdriver Trace 24 | 25 | **Technologies:** Cloud Shell, Cloud Functions, Pub/Sub, Cloud Storage, Cloud Scheduler, BigQuery, Stackdriver Trace 26 | 27 | **Languages:** Python 3.7, SQL (Standard) 28 | 29 | **Technical Concepts:** 30 | 31 | - This was designed as a Python-native solution and starter template for simple data pipelines 32 | - This follows the procedural paradigm, which groups like functions in separate files and imports them as modules into the main entrypoint file. Each operation runs ordered actions on objects, vs. OOP in which the objects perform the actions 33 | - Pub/Sub is used as middleware, rather than invoking an HTTP-triggered cloud function, to minimize the overhead code needed to secure the URL; Pub/Sub also acts as a shock absorber against sudden bursts of invocations 34 | - This is not intended for robust production deployment as it doesn't account for edge cases or dynamic error handling 35 | - Lessons Learned: Leverage classes for interdependent functions and for creating extensibility in table objects. I also forced stack-tracing functionality to measure function performance on a pub/sub trigger, so you can't create a report based on HTTP requests to analyze performance trends. Stackdriver Trace will start auto-creating performance reports once there's enough data. 36 | 37 | **Further Reading:** For those looking for production-level deployments 38 | 39 | - Streaming Tutorial: 40 | - BQPipeline Utility Functions: 41 | - I discovered the above after I created this pipeline...HA!
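**Local smoke test (optional):** The Cloud Function entry point `handler(event, context)` in `src/main.py` only reads a base64-encoded `data` field from the Pub/Sub event, so the whole pipeline can be exercised from a local Python session before you deploy anything. The snippet below is a minimal sketch, not a file in this repo: it assumes the packages in `src/requirements.txt` are installed, that you run it from the `src` directory on a machine with a writable `/tmp`, that your application-default credentials can reach BigQuery, Cloud Storage, and Stackdriver Trace, and that you have first pointed `project_id` and `bucket_name` in `main.py` at your own project. It will create real GCP resources.

```python
# local_test.py (hypothetical file) - simulate the Pub/Sub trigger without deploying
import base64

from main import handler  # run from src/ so the `lib` imports resolve

# Pub/Sub delivers the message body base64-encoded under the "data" key
event = {"data": base64.b64encode(b"local smoke test").decode("utf-8")}

handler(event, context=None)  # handler never uses context, so None is fine
```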
42 | 43 | ## Deployment Instructions 44 | 45 | **Prerequisites:** 46 | 47 | - An open Google Cloud account: 48 | - Proficient in Python and SQL 49 | - A heart and mind eager to create data pipelines 50 | 51 | [![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.png)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/sungchun12/serverless_data_pipeline_gcp.git) 52 | 53 | _OR_ 54 | 55 | 1. Activate Cloud Shell: 56 | 2. Clone the repository 57 | 58 | ```bash 59 | git clone https://github.com/sungchun12/serverless_data_pipeline_gcp.git 60 | ``` 61 | 62 | - Enable Google Cloud APIs: Stackdriver Trace API, Cloud Functions API, Cloud Pub/Sub API, Cloud Scheduler API, Cloud Storage API, BigQuery API, Cloud Build API (gcloud CLI equivalents below when submitted through Cloud Shell) 63 | 64 | ```bash 65 | #set project id 66 | gcloud config set project [your-project-id] 67 | ``` 68 | 69 | ```bash 70 | #Enable Google service APIs 71 | gcloud services enable \ 72 | cloudfunctions.googleapis.com \ 73 | cloudtrace.googleapis.com \ 74 | pubsub.googleapis.com \ 75 | cloudscheduler.googleapis.com \ 76 | storage-component.googleapis.com \ 77 | bigquery-json.googleapis.com \ 78 | cloudbuild.googleapis.com 79 | ``` 80 | 81 | --- 82 | 83 | **Note**: If you want to automate the build and deployment of this pipeline, submit the following commands in order in Cloud Shell after completing the above steps. This skips step 5 below, as it is redundant for auto-deployment. 84 | 85 | Find your Cloud Build service account in the IAM console. Ex: [unique-id]@cloudbuild.gserviceaccount.com 86 | 87 | ```bash 88 | #add role permissions to the Cloud Build service account (one role per binding command) 89 | for role in roles/cloudfunctions.developer roles/cloudscheduler.admin roles/logging.viewer; do 90 | gcloud projects add-iam-policy-binding [your-project-id] \ 91 | --member serviceAccount:[unique-id]@cloudbuild.gserviceaccount.com \ 92 | --role "$role" 93 | done 94 | ``` 95 | 96 | ```bash 97 | #Deploy steps in cloudbuild configuration file 98 | gcloud builds submit --config cloudbuild.yaml . 99 | ``` 100 | 101 | --- 102 | 103 | 3. Change directory to the relevant code 104 | 105 | ```bash 106 | cd serverless_data_pipeline_gcp/src 107 | ``` 108 | 109 | 4. Deploy the cloud function with a pub/sub trigger. Note: this will automatically create the topic if it does not exist 110 | 111 | ```bash 112 | gcloud functions deploy [function-name] --entry-point handler --runtime python37 --trigger-topic [topic-name] 113 | ``` 114 | 115 | Ex: 116 | 117 | ```bash 118 | gcloud functions deploy demo_function --entry-point handler --runtime python37 --trigger-topic demo_topic 119 | ``` 120 | 121 | 5. Test the cloud function by publishing a message to the pub/sub topic (a Python client equivalent is sketched at the end of this document) 122 | 123 | ```bash 124 | gcloud pubsub topics publish [topic-name] --message "" 125 | ``` 126 | 127 | Ex: 128 | 129 | ```bash 130 | gcloud pubsub topics publish demo_topic --message "Can you see this?" 131 | ``` 132 | 133 | 6. Check the logs to see how the function performed. You may have to re-execute this command multiple times if logs don't show up initially 134 | 135 | ```bash 136 | gcloud functions logs read --limit 50 137 | ``` 138 | 139 | 7. 
Deploy cloud scheduler job which publishes a message to Pub/Sub every 5 minutes 140 | 141 | ```bash 142 | gcloud beta scheduler jobs create pubsub [job-name] \ 143 | --schedule "*/5 * * * *" \ 144 | --topic [topic-name] \ 145 | --message-body '{""}' \ 146 | --time-zone 'America/Chicago' 147 | ``` 148 | 149 | Ex: 150 | 151 | ```bash 152 | gcloud beta scheduler jobs create pubsub schedule_function \ 153 | --schedule "*/5 * * * *" \ 154 | --topic demo_topic \ 155 | --message-body '{"Can you see this? With love, cloud scheduler"}' \ 156 | --time-zone 'America/Chicago' 157 | ``` 158 | 159 | 8. Test end to end pipeline by manually running cloud scheduler job. Next, repeat step 6 above. 160 | 161 | ```bash 162 | gcloud beta scheduler jobs run [job-name] 163 | ``` 164 | 165 | Ex: 166 | 167 | ```bash 168 | gcloud beta scheduler jobs run schedule_function 169 | ``` 170 | 171 | 9. Understand pipeline performance by opening stacktrace and click "get_kpis": 172 | 173 | **YOUR PIPELINE IS DEPLOYED AND MEASURABLE!** 174 | 175 | **Note:** You'll notice extraneous blocks of comments and commented out code throughout the python scripts. 176 | -------------------------------------------------------------------------------- /cloudbuild.yaml: -------------------------------------------------------------------------------- 1 | steps: 2 | # clone the git repository 3 | - name: "gcr.io/cloud-builders/git" 4 | args: 5 | ["clone", "https://github.com/sungchun12/serverless_data_pipeline_gcp"] 6 | 7 | # Deploy cloud function with pub/sub trigger from clone directory 8 | - name: "gcr.io/cloud-builders/gcloud" 9 | args: 10 | [ 11 | "functions", 12 | "deploy", 13 | "demo_function", 14 | "--entry-point", 15 | "handler", 16 | "--runtime", 17 | "python37", 18 | "--trigger-topic", 19 | "demo_topic", 20 | ] 21 | dir: "serverless_data_pipeline_gcp/src" 22 | 23 | # Deploy cloud scheduler job which publishes a message to Pub/Sub every 5 minutes 24 | # exits if it already exists 25 | - name: "gcr.io/cloud-builders/gcloud" 26 | entrypoint: "bash" 27 | args: 28 | - "-c" 29 | - | 30 | gcloud beta scheduler jobs create pubsub schedule_function \ 31 | --schedule "*/5 * * * *" \ 32 | --topic demo_topic \ 33 | --message-body '{"Can you see this? With love, cloud scheduler"}' \ 34 | --time-zone 'America/Chicago' || exit 0 35 | 36 | # Manually run the cloud scheduler job 37 | - name: "gcr.io/cloud-builders/gcloud" 38 | args: ["beta", "scheduler", "jobs", "run", "schedule_function"] 39 | 40 | # Check logs to see how function performed. 41 | - name: "gcr.io/cloud-builders/gcloud" 42 | args: ["functions", "logs", "read", "--limit", "50"] 43 | 44 | # set logs bucket location for builds 45 | # set this to whatever bucket you want or remove entirely and cloudbuild will create a default bucket 46 | logsBucket: "gs://cloud_function_build_demo" 47 | -------------------------------------------------------------------------------- /src/lib/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sungchun12/serverless-data-pipeline-gcp/14716017356e2ed64f204acd573117d0940b38d2/src/lib/__init__.py -------------------------------------------------------------------------------- /src/lib/bq_api_data_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which captures number of rows in tables and runs SQL queries. 
3 | 4 | This module is responsible for: 5 | -Capturing total number of rows in a BigQuery table 6 | -Querying max timestamp in a date field based on rows within current date 7 | -Querying unique records based on the above timestamp 8 | -Appending unique records to a final table 9 | 10 | """ 11 | # gcp modules 12 | from google.cloud import bigquery 13 | 14 | # import logging 15 | from lib.helper_functions import set_logger 16 | 17 | logger = set_logger(__name__) 18 | 19 | 20 | def bq_table_num_rows(dataset_name, table_name): 21 | """Log total number of rows in destination bigquery table. 22 | 23 | Args: 24 | dataset_name: destination dataset name 25 | table_name: destination table name 26 | 27 | Returns: 28 | Integer object: ``num_rows`` 29 | 30 | """ 31 | bigquery_client = ( 32 | bigquery.Client() 33 | ) # instantiate bigquery client to interact with api 34 | dataset_ref = bigquery_client.dataset(dataset_name) # create dataset obj 35 | table_ref = dataset_ref.table(table_name) # create table obj 36 | table = bigquery_client.get_table(table_ref) # API Request 37 | num_rows = table.num_rows 38 | logger.info(f"Total number of rows: {table.num_rows} in {table_ref.path}") 39 | return num_rows 40 | 41 | 42 | def query_max_timestamp(project_id, dataset_name, table_name): 43 | """Return the max timestamp in BigQuery table after current date in CST. 44 | 45 | Args: 46 | project_id: destination project id 47 | dataset_name: destination dataset name 48 | table_name: destination table name 49 | 50 | Returns: 51 | string object: ``max_timestamp`` 52 | 53 | """ 54 | # in central Chicago time 55 | sql = f"SELECT max(_last_updt) as max_timestamp \ 56 | FROM `{project_id}.{dataset_name}.{table_name}` \ 57 | WHERE _last_updt >= TIMESTAMP(CURRENT_DATE('-06:00'))" 58 | bigquery_client = bigquery.Client() # setup the client 59 | query_job = bigquery_client.query(sql) # run the query 60 | results = query_job.result() # waits for job to complete 61 | for row in results: # returns the result 62 | max_timestamp = row.max_timestamp.strftime("%Y-%m-%d %H:%M:%S") 63 | return max_timestamp 64 | 65 | 66 | # https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#methods 67 | # https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig 68 | # The following values are supported: 69 | # configuration.copy.writeDisposition 70 | # WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table 71 | # WRITE_APPEND: If the table already exists, BigQuery appends the data 72 | # WRITE_EMPTY: If the table already exists and contains data, 73 | # a 'duplicate' error is returned in the job result. 74 | def query_unique_records(project_id, dataset_name, table_name, table_name_2): 75 | """Queries unique records from original table, saves in staging table. 76 | 77 | Unique records are filtered by filtering all records >= the max timestamp. 
78 | 79 | Args: 80 | project_id: destination project id 81 | dataset_name: destination dataset name 82 | table_name: starting table name 83 | table_name_2: destination table name 84 | 85 | """ 86 | bigquery_client = bigquery.Client() 87 | job_config = bigquery.QueryJobConfig() 88 | table_ref = bigquery_client.dataset(dataset_name).table( 89 | table_name_2 90 | ) # set destination table 91 | job_config.destination = table_ref 92 | job_config.write_disposition = "WRITE_TRUNCATE" 93 | max_timestamp = query_max_timestamp(project_id, dataset_name, table_name) 94 | sql = f"SELECT DISTINCT * FROM `{project_id}.{dataset_name}.{table_name}` \ 95 | WHERE _last_updt >= TIMESTAMP(DATETIME '{max_timestamp}');" 96 | query_job = bigquery_client.query( 97 | sql, 98 | # Location must match that of the dataset(s) referenced in the query 99 | # and of the destination table. 100 | location="US", 101 | job_config=job_config, 102 | ) # API request - starts the query 103 | query_job.result() 104 | logger.info(f"Query results loaded to table {table_ref.path}") 105 | 106 | 107 | def append_unique_records(project_id, dataset_name, table_name, table_name_2): 108 | """Queries unique staging table and appends new results onto final table. 109 | 110 | Args: 111 | project_id: destination project id 112 | dataset_name: destination dataset name 113 | table_name: starting table name 114 | table_name_2: destination table name 115 | 116 | """ 117 | bigquery_client = bigquery.Client() 118 | job_config = bigquery.QueryJobConfig() 119 | table_ref = bigquery_client.dataset(dataset_name).table( 120 | table_name_2 121 | ) # set destination table 122 | job_config.destination = table_ref 123 | job_config.write_disposition = "WRITE_APPEND" 124 | # left outer join to avoid appending duplicate data 125 | sql = f"SELECT a.* FROM `{project_id}.{dataset_name}.{table_name}` a \ 126 | LEFT JOIN `{project_id}.{dataset_name}.{table_name_2}` b \ 127 | on a.segmentid = b.segmentid AND a._last_updt = b._last_updt \ 128 | WHERE b.segmentid IS NULL;" 129 | query_job = bigquery_client.query( 130 | sql, 131 | # Location must match that of the dataset(s) referenced in the query 132 | # and of the destination table. 133 | location="US", 134 | job_config=job_config, 135 | ) # API request - starts the query 136 | query_job.result() 137 | logger.info(f"Query results loaded to table {table_ref.path}") 138 | -------------------------------------------------------------------------------- /src/lib/data_ingestion.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which contains functions to ingest and check data for outliers. 
3 | 4 | This module is responsible for: 5 | -Creating a pandas dataframe using an api call to Chicago traffic data 6 | -Uploading a pandas dataframe to a google cloud storage bucket 7 | -Converting pandas dataframe schema 8 | -Checking for null outliers in scope 9 | -Uploading a pandas dataframe to BigQuery 10 | 11 | """ 12 | 13 | # built in python modules 14 | import os 15 | 16 | # gcp modules 17 | from google.cloud import storage 18 | import pandas_gbq as gbq 19 | from google.cloud import bigquery 20 | 21 | # api module 22 | from sodapy import Socrata 23 | 24 | # pandas dataframe module 25 | import pandas as pd 26 | 27 | from lib.helper_functions import _getToday, set_logger 28 | 29 | logger = set_logger(__name__) 30 | 31 | 32 | def create_results_df(): 33 | """Create a dataframe based on JSON from the Chicago traffic API 34 | 35 | Args: 36 | None 37 | 38 | Returns: 39 | Dataframe object: ``results_df`` 40 | 41 | """ 42 | try: 43 | # First 2000 results, returned as JSON from API / converted to Python list of 44 | # dictionaries by sodapy. 45 | # Unauthenticated client only works with public data sets. Note 'None' 46 | # in place of application token, and no username or password: 47 | data_client = Socrata("data.cityofchicago.org", None) 48 | results = data_client.get( 49 | "8v9j-bter", limit=2000 50 | ) # unique id for chicago traffic data 51 | 52 | # Convert to pandas DataFrame 53 | results_df = pd.DataFrame.from_records(results) 54 | logger.info("Successfully created a pandas dataframe!") 55 | 56 | return results_df 57 | except Exception as e: 58 | logger.error("Failure to create a pandas dataframe :(") 59 | raise e 60 | 61 | 62 | # GZIP compression uses more CPU resources than Snappy or LZO, 63 | # but provides a higher compression ratio. 64 | # GZip is often a good choice for cold data, which is accessed infrequently. 65 | # Snappy or LZO are a better choice for hot data, which is accessed frequently. 66 | # The only writeable part of the filesystem is the /tmp directory, which you 67 | # can use to store temporary files in a function instance. 68 | # This is a local disk mount point known as a "tmpfs" volume in which data 69 | # written to the volume is stored in memory. 70 | # Note that it will consume memory resources provisioned for the function. 71 | def upload_raw_data_gcs(results_df, bucket_name): 72 | """Upload dataframe into google cloud storage bucket. 73 | 74 | Deletes file in temp directory after upload. 
75 | 76 | Args: 77 | results_df: pandas dataframe 78 | bucket_name: name of bucket to upload data towards 79 | 80 | """ 81 | # Write the DataFrame to GCS (Google Cloud Storage) 82 | storage_client = storage.Client() 83 | # .from_service_account_json('service_account.json') #authenticate service account 84 | bucket = storage_client.bucket(bucket_name) # capture bucket details 85 | timestamp = _getToday() 86 | source_file_name = "traffic_" + timestamp + ".gzip" # create the file name 87 | temp_path = "/tmp" 88 | os.chdir(temp_path) # change to tmp path 89 | # blob.upload_from_string(results_df.to_parquet(source_file_name, engine = 'pyarrow', compression = 'gzip'),content_type='gzip') 90 | results_df.to_parquet(source_file_name, engine="pyarrow", compression="gzip") 91 | # blob = bucket.blob(os.path.basename(source_file_name)) #define the path to the file 92 | blob = bucket.blob(source_file_name) # define the binary large object(blob) 93 | blob.upload_from_filename(source_file_name) # upload to bucket 94 | logger.info(f"Successfully uploaded parquet gzip file into: {bucket}") 95 | delete_temp_dir() 96 | 97 | 98 | def delete_temp_dir(): 99 | """Deletes every file in the /tmp directory. 100 | 101 | This minimizes memory load during function execution. 102 | 103 | Args: 104 | None 105 | 106 | """ 107 | # check what's in the temporary directory 108 | temp_path = "/tmp" 109 | logger.info(os.listdir(temp_path)) 110 | # delete all files in the temporary path if they exist 111 | logger.info(f"Deleting all files in {temp_path} directory") 112 | for file in os.listdir(temp_path): 113 | file_path = os.path.join(temp_path, file) 114 | if os.path.isfile(file_path): 115 | os.unlink(file_path) 116 | # check that the temporary directory is empty 117 | if len(os.listdir(temp_path)) == 0: 118 | logger.info(f"{temp_path} directory is empty") 119 | else: 120 | logger.warning(f"{temp_path} directory is NOT empty") 121 | 122 | 123 | # https://stackoverflow.com/questions/21886742/convert-pandas-dtypes-to-bigquery-type-representation 124 | # https://stackoverflow.com/questions/44953463/pandas-google-bigquery-schema-mismatch-makes-the-upload-fail 125 | # http://pbpython.com/pandas_dtypes.html 126 | def convert_schema(results_df, schema_df): 127 | """Converts data types in dataframe to match BigQuery destination table. 128 | 129 | Args: 130 | results_df: pandas dataframe 131 | schema_df: schema to convert towards 132 | 133 | Returns: 134 | Dataframe Object: ``results_df_transformed`` 135 | 136 | """ 137 | for ( 138 | k, 139 | v, 140 | ) in ( 141 | schema_df.items() 142 | ): # for each column name in the dictionary, convert the data type in the dataframe 143 | results_df[k] = results_df[k].astype(v) 144 | results_df_transformed = results_df 145 | logger.info("Updated schema to match BigQuery destination table") 146 | return results_df_transformed 147 | 148 | 149 | def check_nulls(results_df_transformed): 150 | """Creates list of column names if any nulls in the columns. 
151 | 152 |     Args: 153 |         results_df_transformed: pandas dataframe with converted schema 154 | 155 |     Returns: 156 |         list object: ``null_columns`` 157 |     """ 158 |     null_columns = [] 159 |     check_bool = ( 160 |         results_df_transformed.isnull().any() 161 |     )  # returns a boolean True/False for every column in dataframe if it contains nulls 162 |     for ( 163 |         k, 164 |         v, 165 |     ) in ( 166 |         check_bool.items() 167 |     ):  # v is True for each column that contains at least one null 168 |         if v: 169 |             null_columns.append(k) 170 |     logger.info(f"These are the null columns: {null_columns}") 171 |     return null_columns 172 | 173 | 174 | def check_null_outliers(null_columns, nulls_expected): 175 |     """Creates list of any outlier null columns. 176 | 177 |     Args: 178 |         null_columns: current list of null columns in pandas dataframe 179 |         nulls_expected: list of nulls in scope 180 | 181 |     Returns: 182 |         list object: ``null_outliers`` 183 | 184 |     """ 185 |     null_outliers = ( 186 |         [] 187 |     )  # empty list to collect columns that are not expected to be null 188 |     for x in null_columns:  # flag any null column that is not in the expected list 189 |         if x not in nulls_expected: 190 |             null_outliers.append(x) 191 |     logger.info(f"These are the outlier null columns: {null_outliers}") 192 |     return null_outliers 193 | 194 | 195 | # figure out the nullable vs. required mode schema mismatch 196 | # https://cloud.google.com/bigquery/docs/pandas-gbq-migration#loading_a_pandas_dataframe_to_a_table 197 | # https://github.com/pydata/pandas-gbq/issues/133#issuecomment-411119426 198 | def upload_to_gbq(results_df_transformed, project_id, dataset_name, table_name): 199 |     """Uploads data into bigquery and appends if data already exists. 200 | 201 |     Args: 202 |         results_df_transformed: pandas dataframe with converted schema 203 |         project_id: name of project where you want to upload data 204 |         dataset_name: name of target dataset 205 |         table_name: name of target table 206 | 207 |     """ 208 |     bigquery_client = bigquery.Client() 209 |     dataset_ref = bigquery_client.dataset(dataset_name) 210 |     table_ref = dataset_ref.table(table_name) 211 |     # job_config = bigquery.job.LoadJobConfig() #configure how the data loads into bigquery 212 |     # job_config.write_disposition = 'WRITE_TRUNCATE' #if table exists, append to it 213 |     # job_config.ignoreUnknownValues = 'T' #ignore columns that don't match destination schema 214 |     # job_config.schema_update_options ='ALLOW_FIELD_ADDITION' 215 |     # TODO: bad request due to schema mismatch with an index field 216 |     # https://github.com/googleapis/google-cloud-python/issues/5572 217 |     # bigquery_client.load_table_from_dataframe(results_df_transformed, 218 |     # table_ref, num_retries = 5, job_config = job_config).result() 219 |     gbq.to_gbq( 220 |         results_df_transformed, 221 |         dataset_name + "." + table_name, 222 |         project_id, 223 |         if_exists="append", 224 |         location="US", 225 |         progress_bar=True, 226 |     ) 227 |     logger.info(f"Data uploaded into: {table_ref.path}") 228 | -------------------------------------------------------------------------------- /src/lib/helper_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module with miscellaneous utility functions. 3 | 4 | This module contains a function to capture the current datetime stamp, 5 | and configures logging format. 6 | 7 | This module can be used to add more helper functions as needed.
8 | 9 | """ 10 | 11 | # built in python modules 12 | 13 | from datetime import datetime 14 | import logging 15 | import sys 16 | 17 | 18 | def _getToday(): 19 | """Returns timestamp string""" 20 | return datetime.now().strftime("%Y%m%d%H%M%S") 21 | 22 | 23 | def set_logger(__name__): 24 | """Configures logger for all modules and returns logger object""" 25 | logger = logging.getLogger(__name__) 26 | logger.setLevel(logging.INFO) 27 | 28 | formatter = logging.Formatter("%(levelname)s|%(asctime)s|%(name)s|%(message)s") 29 | 30 | stream_handler = logging.StreamHandler(sys.stdout) 31 | stream_handler.setFormatter(formatter) 32 | 33 | logger.addHandler(stream_handler) 34 | return logger 35 | -------------------------------------------------------------------------------- /src/lib/infrastructure_setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Module which creates data pipeline storage infrastructure 3 | 4 | This module has functions that create a raw data bucket 5 | in google cloud storage, and creates dataset-table pairs. 6 | 7 | """ 8 | # gcp modules 9 | from google.cloud import storage 10 | from google.cloud import bigquery 11 | 12 | # import logging 13 | from lib.helper_functions import set_logger 14 | 15 | logger = set_logger(__name__) 16 | 17 | 18 | def create_bucket(bucket_name): 19 | """Creates a bucket if not detected 20 | 21 | Args: 22 | bucket_name: name of GCS bucket to be created 23 | 24 | Returns: 25 | ``Created a new bucket: `` 26 | OR ``Bucket already exists: `` 27 | 28 | """ 29 | client = storage.Client() 30 | # authenticate service account 31 | # .from_service_account_json('service_account.json') 32 | bucket = client.bucket(bucket_name) # capture bucket details 33 | bucket.location = "US-CENTRAL1" # define regional location 34 | if not bucket.exists(): # checks if bucket doesn't exist 35 | bucket.create() 36 | logger.info(f"Created a new bucket: {bucket.path}") 37 | else: 38 | logger.info(f"Bucket already exists: {bucket.path}") 39 | 40 | 41 | def dataset_exists(client, dataset_reference): 42 | """Return if a table exists. 43 | 44 | Args: 45 | client (google.cloud.bigquery.client.Client): 46 | A client to connect to the BigQuery API. 47 | table_reference (google.cloud.bigquery.table.TableReference): 48 | A reference to the table to look for. 49 | 50 | Returns: 51 | bool: ``True`` if the table exists, ``False`` otherwise. 52 | 53 | """ 54 | from google.cloud.exceptions import NotFound 55 | 56 | try: 57 | client.get_dataset(dataset_reference) 58 | return True 59 | except NotFound: 60 | return False 61 | 62 | 63 | def table_exists(client, table_reference): 64 | """Return if a table exists. 65 | 66 | Args: 67 | client (google.cloud.bigquery.client.Client): 68 | A client to connect to the BigQuery API. 69 | table_reference (google.cloud.bigquery.table.TableReference): 70 | A reference to the table to look for. 71 | 72 | Returns: 73 | bool: ``True`` if the table exists, ``False`` otherwise. 74 | 75 | """ 76 | from google.cloud.exceptions import NotFound 77 | 78 | try: 79 | client.get_table(table_reference) 80 | return True 81 | except NotFound: 82 | return False 83 | 84 | 85 | # https://cloud.google.com/bigquery/docs/python-client-migration#update_a_table 86 | def create_dataset_table(dataset_name, table_name, table_desc, schema, partition_by): 87 | """Creates a new dataset and/or table if not detected. 
88 | 89 | Args: 90 | dataset_name: Name of dataset to be created 91 | table_name: Name of table to be created within dataset 92 | table_desc: table descriptions 93 | schema: table schema with data types 94 | partition_by: Which datetime field to partition by 95 | 96 | Returns: 97 | ``Created new dataset: `` 98 | OR ``Dataset already exists: `` 99 | & 100 | ``Created empty table partitioned on column: partition_by`` 101 | OR ``Table already exists: `` 102 | 103 | """ 104 | # setup the client 105 | bigquery_client = bigquery.Client() 106 | 107 | # Create a DatasetReference using a chosen dataset ID. 108 | dataset_ref = bigquery_client.dataset( 109 | dataset_name 110 | ) # The project defaults to the Client's project if not specified. 111 | 112 | # Construct a full Dataset object to send to the API. 113 | dataset = bigquery.Dataset(dataset_ref) 114 | 115 | # Specify the geographic location where the dataset should reside. 116 | dataset.location = "US" 117 | 118 | # Send the dataset to the API for creation. 119 | # Raises google.api_core.exceptions. 120 | # Conflict if the Dataset already exists within the project. 121 | if ( 122 | dataset_exists(bigquery_client, dataset_ref) is False 123 | ): # checks if dataset not found 124 | dataset = bigquery_client.create_dataset(dataset) # API request 125 | logger.info(f"Created new dataset: {dataset_ref.path}") 126 | else: 127 | logger.info(f"Dataset already exists: {dataset_ref.path}") 128 | 129 | # Create an empty table 130 | table_ref = dataset_ref.table( 131 | table_name 132 | ) # construct a full table object to send to the api 133 | 134 | if table_exists(bigquery_client, table_ref) is False: 135 | table = bigquery.Table(table_ref, schema=schema) 136 | table.time_partitioning = bigquery.TimePartitioning( 137 | type_=bigquery.TimePartitioningType.DAY, 138 | field=partition_by, # day is the only supported type for now 139 | ) # name of column to use for partitioning 140 | table = bigquery_client.create_table(table) 141 | assert table.table_id == table_name # checks if table_id matches 142 | 143 | # update the table description 144 | table.description = table_desc 145 | table = bigquery_client.update_table(table, ["description"]) 146 | assert ( 147 | table.description == table_desc 148 | ) # checks if table description matches the update 149 | logger.info( 150 | f"Created empty table partitioned \ 151 | on column: {table.time_partitioning.field}" 152 | ) 153 | else: 154 | logger.info(f"Table already exists: {table_ref.path}") 155 | -------------------------------------------------------------------------------- /src/lib/schemas.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """This contains the bigquery and dataframe schemas for data warehouse setup. 3 | 4 | Change these values for your table schemas in scope. 5 | Data types defined in BigQuery are mapped to pandas dataframe data types. 6 | If data types do not match, data will not be able to be uploaded to bigquery. 
7 | 8 | """ 9 | # gcp modules 10 | from google.cloud import bigquery 11 | 12 | # apply a schema during BigQuery table creation 13 | schema_bq = [ 14 | bigquery.SchemaField( 15 | "_comments", 16 | "STRING", 17 | mode="NULLABLE", 18 | description="Provides extra context to traffic segment", 19 | ), 20 | bigquery.SchemaField( 21 | "_direction", 22 | "STRING", 23 | mode="NULLABLE", 24 | description="Traffic flow direction for the segment.", 25 | ), 26 | bigquery.SchemaField( 27 | "_fromst", 28 | "STRING", 29 | mode="NULLABLE", 30 | description="Start street for the segment in the direction of traffic flow.", 31 | ), 32 | bigquery.SchemaField( 33 | "_last_updt", 34 | "TIMESTAMP", 35 | mode="NULLABLE", 36 | description=" If the \ 37 | LAST_UPDATED time is several days old, it can be assumed that no \ 38 | transit service over the segment currently. \ 39 | These segments are included in the Chicago Traffic Tracker dataset \ 40 | because they are key routes and CDOT intends to monitor \ 41 | traffic conditions through other means in the near future. \ 42 | Will display UTC, but truly represents CST.", 43 | ), # in CST -06:00 44 | bigquery.SchemaField( 45 | "_length", 46 | "FLOAT", 47 | mode="NULLABLE", 48 | description="Length of the segment in miles.", 49 | ), 50 | bigquery.SchemaField( 51 | "_lif_lat", 52 | "FLOAT", 53 | mode="NULLABLE", 54 | description="The starting point latitude. \ 55 | See start_lon for a fuller description.", 56 | ), 57 | bigquery.SchemaField( 58 | "_lit_lat", 59 | "FLOAT", 60 | mode="NULLABLE", 61 | description="The ending point latitude. \ 62 | See start_lon for a fuller description.", 63 | ), 64 | bigquery.SchemaField( 65 | "_lit_lon", 66 | "FLOAT", 67 | mode="NULLABLE", 68 | description="The ending point longitude. \ 69 | See start_lon for a fuller description.", 70 | ), 71 | bigquery.SchemaField( 72 | "_strheading", 73 | "STRING", 74 | mode="NULLABLE", 75 | description="The position of the segment in the address grid. \ 76 | North, South, East, or West of State and Madison.", 77 | ), 78 | bigquery.SchemaField( 79 | "_tost", 80 | "STRING", 81 | mode="NULLABLE", 82 | description="End street for the segment in the direction of traffic flow.", 83 | ), 84 | bigquery.SchemaField( 85 | "_traffic", 86 | "INTEGER", 87 | mode="NULLABLE", 88 | description="Real-time estimated speed in miles per hour. \ 89 | For congestion advisory and traffic maps, this value is compared to \ 90 | a 0-9, 10-20, and 21 & over scale to display heavy, medium, \ 91 | and free flow conditions for the traffic segment. \ 92 | Except for a very few segments speed on city arterials is limited \ 93 | to 30 mph by ordinance.", 94 | ), 95 | bigquery.SchemaField( 96 | "segmentid", 97 | "INTEGER", 98 | mode="NULLABLE", 99 | description="Unique arbitrary number to represent each segment.", 100 | ), 101 | bigquery.SchemaField( 102 | "start_lon", 103 | "FLOAT", 104 | mode="NULLABLE", 105 | description="The longitude associated with the starting point of the \ 106 | segment in the direction of traffic flow. \ 107 | For two-way streets it is roughly at the middle of the half \ 108 | that the segment is representing. \ 109 | For one-way streets this is the street center line. 
", 110 | ), 111 | bigquery.SchemaField( 112 | "street", 113 | "STRING", 114 | mode="NULLABLE", 115 | description="Street name of the traffic segment", 116 | ), 117 | ] 118 | 119 | # apply a schema to pandas dataframe to match BigQuery for equivalent types 120 | schema_df = { 121 | "_comments": "object", 122 | "_direction": "object", 123 | "_fromst": "object", 124 | "_last_updt": "datetime64", 125 | "_length": "float64", 126 | "_lif_lat": "float64", 127 | "_lit_lat": "float64", 128 | "_lit_lon": "float64", 129 | "_strheading": "object", 130 | "_tost": "object", 131 | "_traffic": "int64", 132 | "segmentid": "int64", 133 | "start_lon": "float64", 134 | "street": "object", 135 | } 136 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Cloud Function which creates an end to end data pipeline. 3 | 4 | API Data Source: https://dev.socrata.com/foundry/data.cityofchicago.org/n4j6-wkkf 5 | 6 | This Cloud Function is responsible for: 7 | -Tracing performance of subsets of function calls via spans 8 | -Defining and creating infrastructure such as dataset, tables, bucket 9 | -Ingesting raw data from an api call into google cloud storage 10 | -Converting a pandas dataframe raw data schema to match BigQuery 11 | -Ingesting data into BigQuery 12 | -Run SQL queries capturing recent and unique records based on current date of invocation 13 | -Appending unique records to final table 14 | 15 | """ 16 | # decoding module for pubsub 17 | import base64 18 | 19 | # opencensus modules to trace function performance 20 | # https://opencensus.io/exporters/supported-exporters/python/stackdriver/ 21 | import opencensus 22 | from opencensus.trace import tracer as tracer_module 23 | from opencensus.trace.exporters import stackdriver_exporter 24 | from opencensus.trace.exporters.transports.background_thread import ( 25 | BackgroundThreadTransport, 26 | ) 27 | 28 | # lib modules 29 | from lib.bq_api_data_functions import ( 30 | append_unique_records, 31 | bq_table_num_rows, 32 | query_unique_records, 33 | ) 34 | from lib.data_ingestion import ( 35 | check_null_outliers, 36 | check_nulls, 37 | convert_schema, 38 | create_results_df, 39 | upload_raw_data_gcs, 40 | upload_to_gbq, 41 | ) 42 | from lib.helper_functions import set_logger 43 | from lib.infrastructure_setup import create_bucket, create_dataset_table 44 | 45 | logger = set_logger(__name__) 46 | 47 | 48 | # explains why to use pubsub as middleware 49 | # https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule 50 | def handler(event, context): 51 | """Entry point function that orchestrates the data pipeline from start to finish. 52 | 53 | Triggered from a message on a Cloud Pub/Sub topic. 54 | 55 | Args: 56 | event (dict): Event payload. 57 | context (google.cloud.functions.Context): Metadata for the event. 
58 | 59 | """ 60 | # instantiate trace exporter 61 | project_id = ( 62 | "iconic-range-220603" 63 | ) # capture the project id to where this data will land 64 | exporter = stackdriver_exporter.StackdriverExporter( 65 | project_id=project_id, transport=BackgroundThreadTransport 66 | ) 67 | # instantiate tracer 68 | tracer = tracer_module.Tracer(exporter=exporter) 69 | 70 | with tracer.span(name="get_kpis") as span_get_kpis: 71 | # prints a message from the pubsub trigger 72 | pubsub_message = base64.b64decode(event["data"]).decode("utf-8") 73 | print(pubsub_message) # can be used to configure dynamic pipeline 74 | 75 | with span_get_kpis.span(name="infrastructure_var_setup"): 76 | # define infrastructure variables 77 | bucket_name = ( 78 | "chicago_traffic_raw" 79 | ) # capture bucket name where raw data will be stored 80 | dataset_name = "chicago_traffic_demo" # initial dataset 81 | table_raw = "traffic_raw" # name of table to capture data 82 | table_desc = "Raw, public Chicago traffic data is appended \ 83 | to this table every 5 minutes" # table description 84 | table_staging = "traffic_staging" 85 | table_staging_desc = f"Unique records greater than or equal to \ 86 | current date from table: {table_raw}" 87 | table_final = "traffic_final" 88 | table_final_desc = f"Unique, historical records \ 89 | accumulated from table: {table_raw}" 90 | nulls_expected = ( 91 | "_comments" 92 | ) # tuple of nulls expected for checking data outliers 93 | partition_by = ( 94 | "_last_updt" 95 | ) # partition by the last updated field for faster querying 96 | # and incremental loads 97 | from lib.schemas import schema_bq, schema_df # import schemas 98 | 99 | with span_get_kpis.span(name="infrastructure_creation") as span_infra_create: 100 | # create infrastructure 101 | with span_infra_create.span(name="bucket_creation"): 102 | create_bucket(bucket_name) 103 | with span_infra_create.span(name="dataset_table_creation"): 104 | create_dataset_table( 105 | dataset_name, table_raw, table_desc, schema_bq, partition_by 106 | ) # create raw table 107 | create_dataset_table( 108 | dataset_name, 109 | table_staging, 110 | table_staging_desc, 111 | schema_bq, 112 | partition_by, 113 | ) # create a table for unique records staging 114 | create_dataset_table( 115 | dataset_name, table_final, table_final_desc, schema_bq, partition_by 116 | ) # create a table for unique records final 117 | 118 | with span_get_kpis.span(name="ingest_raw_data") as span_ingest_raw: 119 | with span_ingest_raw.span(name="create_dataframe"): 120 | # access data from API, create dataframe, and upload raw csv 121 | results_df = create_results_df() 122 | with span_ingest_raw.span(name="upload_raw_data_gcs"): 123 | upload_raw_data_gcs(results_df, bucket_name) 124 | 125 | with span_get_kpis.span(name="convert_schema"): 126 | # perform schema conversion on dataframe to match bigquery schema 127 | results_df_transformed = convert_schema(results_df, schema_df) 128 | print(results_df_transformed.dtypes) 129 | 130 | with span_get_kpis.span(name="audit_null_columns"): 131 | # check if there are any nulls in the columns and print exceptions 132 | null_columns = check_nulls(results_df_transformed) 133 | null_outliers = check_null_outliers(null_columns, nulls_expected) 134 | 135 | with span_get_kpis.span(name="upload_to_gbq"): 136 | # upload data to bigquery 137 | upload_to_gbq(results_df_transformed, project_id, dataset_name, table_raw) 138 | bq_table_num_rows(dataset_name, table_raw) 139 | 140 | with span_get_kpis.span(name="preprocess_data") as 
span_prep_data: 141 | # Preprocess data for unique records accumulation 142 | with span_prep_data.span(name="query_unique_records"): 143 | query_unique_records(project_id, dataset_name, table_raw, table_staging) 144 | bq_table_num_rows(dataset_name, table_staging) 145 | with span_prep_data.span(name="append_unique_records"): 146 | append_unique_records( 147 | project_id, dataset_name, table_staging, table_final 148 | ) 149 | bq_table_num_rows(dataset_name, table_final) 150 | logger.info("Data Pipeline Fully Realized!") 151 | -------------------------------------------------------------------------------- /src/requirements.txt: -------------------------------------------------------------------------------- 1 | sodapy 2 | google-cloud-storage 3 | google-cloud-bigquery 4 | google-cloud-trace==0.19.0 5 | opencensus==0.1.8 6 | pyarrow 7 | pandas 8 | pandas-gbq 9 | datetime --------------------------------------------------------------------------------
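Step 5 of the README tests the deployed function with `gcloud pubsub topics publish`; the same message can also be sent with the Pub/Sub Python client library. This is a client-side sketch only, not part of the deployed function: it assumes `pip install google-cloud-pubsub` (the library is not in `src/requirements.txt`, which only covers the function itself), a real project id in place of `[your-project-id]`, and that the `demo_topic` topic already exists.

```python
# publish_test_message.py (hypothetical file) - Python equivalent of:
#   gcloud pubsub topics publish demo_topic --message "Can you see this?"
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("[your-project-id]", "demo_topic")

# Pub/Sub payloads must be bytes; the cloud function decodes them back to UTF-8
future = publisher.publish(topic_path, b"Can you see this?")
print(f"Published message id: {future.result()}")  # blocks until Pub/Sub acknowledges the publish
```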