├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── cloud-composer
│   ├── mssql_gcs_dataflow_bigquery_dag_1.py
│   └── mssql_gcs_dataflow_bigquery_dag_2.py
├── cloud-dataflow
│   └── process_json.py
├── cloud-functions
│   ├── index.js
│   └── package.json
├── get-client-id
│   ├── get_client_id.py
│   └── requirements.txt
└── images
    ├── dags-folder.png
    ├── diagrams.png
    └── pypi-packages.png

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# How to Contribute

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

## Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement (CLA). You (or your employer) retain the copyright to your
contribution; this simply gives us permission to use and redistribute your
contributions as part of the project. Head over to
<https://cla.developers.google.com/> to see your current agreements on file or
to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

## Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.

## Community Guidelines

This project follows
[Google's Open Source Community Guidelines](https://opensource.google/conduct/).
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files.
30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. 
If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Using Cloud Composer to Orchestrate Moving Data from Microsoft SQL Server to BigQuery
## Google Cloud Composer Example

This repository contains an example of how to leverage Cloud Composer and Cloud Dataflow to move data from a Microsoft SQL Server database to BigQuery. The diagram below demonstrates the workflow pipeline.


![Diagram Part One](images/diagrams.png)


The pipeline steps are as follows:

1. A Cloud Composer DAG is either scheduled or manually triggered. It connects to the defined Microsoft SQL Server and exports the selected data to Google Cloud Storage as a JSON file.

2. A second Cloud Composer DAG is triggered by a Cloud Function once the JSON file has been written to the storage bucket.

3. The second Cloud Composer DAG triggers a Dataflow batch job, which can perform transformations if needed and then writes the data to BigQuery.

4. Both Cloud Composer DAGs can send email notifications.

You can:
* Schedule the Cloud Composer DAG to export data as needed with date filters.
* Perform transformations in Dataflow.
* Get a notification when a job succeeds or fails.

Requirements:
* You need a Microsoft SQL Server instance running either in Google Cloud or elsewhere.

## How to install

1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/install)

2. Create an export storage bucket for **Microsoft SQL Server exports**

``` shell
gsutil mb gs://[BUCKET_NAME]/
```

3. Create a Dataflow staging storage bucket

``` shell
gsutil mb gs://[BUCKET_NAME]/
```

4. Through the [Google Cloud Console](https://console.cloud.google.com) create a folder named **tmp** in the newly created bucket for the Dataflow staging files


5. [Create a Cloud Composer environment](https://cloud.google.com/composer/docs/how-to/managing/creating)
* You need to use an image version equal to or greater than composer-1.10.6-airflow-1.10.6

6. Create a BigQuery dataset
``` shell
bq mk [YOUR_BIGQUERY_DATASET_NAME]
```

7. Enable the Cloud Dataflow API
``` shell
gcloud services enable dataflow.googleapis.com
```

8. Enable the Cloud Composer API
``` shell
gcloud services enable composer.googleapis.com
```

9. Enable the Cloud Functions API
``` shell
gcloud services enable cloudfunctions.googleapis.com
```

10. Grant blob signing permissions to the Cloud Functions service account
```shell
gcloud iam service-accounts add-iam-policy-binding \
[YOUR_PROJECT_ID]@appspot.gserviceaccount.com \
--member=serviceAccount:[YOUR_PROJECT_ID]@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
```

11. Edit the index.js file
* In the cloned repo, go to the `cloud-functions` directory, open the index.js file, and change the variables listed below.

* To get the value for `your-iap-client-id`, execute the following:

``` shell
python get-client-id/get_client_id.py [PROJECT_ID] [GCP_REGION] [COMPOSER_ENVIRONMENT]
```

``` js
// The project that holds your function
const PROJECT_ID = 'your-project-id';
// Run python get-client-id/get_client_id.py [PROJECT_ID] [GCP_REGION] [COMPOSER_ENVIRONMENT] to get your client id
const CLIENT_ID = 'your-iap-client-id';
// This should be part of your webserver's URL:
// {tenant-project-id}.appspot.com
const WEBSERVER_ID = 'your-tenant-project-id';
// The name of the DAG you wish to trigger
const DAG_NAME = 'mssql_gcs_dataflow_bigquery_dag_2';
```

12. Deploy the Cloud Function
* In the cloned repo, go to the `cloud-functions` directory and deploy the following Cloud Function.
``` shell
gcloud functions deploy triggerDag --region=us-central1 --runtime=nodejs8 --trigger-event=google.storage.object.finalize --trigger-resource=[YOUR_UPLOADED_EXPORT_STORAGE_BUCKET_NAME]
```

13. Stage the Cloud Dataflow pipeline
* Update the `fields` object in cloud-dataflow/process_json.py to match your table schema
* In the Cloud Console, go to the Composer Environments page
* Click the DAGs folder icon
* This opens a new window with the bucket details
* Create a folder called dataflow
* Upload the cloud-dataflow/process_json.py file to the dataflow folder (or copy it with gsutil, as shown below)
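If you prefer the command line to the console steps above, the same staging can be done with `gcloud` and `gsutil`. This is a minimal sketch: the bracketed names are placeholders for your own environment and region, and `[DAG_GCS_PREFIX]` stands for the value returned by the first command.

``` shell
# Look up the environment's DAGs prefix (gs://<your-composer-bucket>/dags).
gcloud composer environments describe [COMPOSER_ENVIRONMENT] \
    --location [GCP_REGION] \
    --format="value(config.dagGcsPrefix)"

# Copy the pipeline file into a dataflow/ folder under that prefix;
# mssql_gcs_dataflow_bigquery_dag_2.py expects it at <dags_folder>/dataflow/process_json.py.
gsutil cp cloud-dataflow/process_json.py [DAG_GCS_PREFIX]/dataflow/process_json.py
```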
14. Create the following variables in the Airflow web server

| Key | Value |
| --- | ----------- |
| bq_output_table | [DATASET.TABLE] |
| email | [YOUR_EMAIL_ADDRESS] |
| gcp_project | [YOUR_PROJECT_ID] |
| gcp_temp_location | gs://[YOUR_DATAFLOW_STAGE_BUCKET]/tmp |
| mssql_export_bucket | [YOUR_UPLOADED_EXPORT_STORAGE_BUCKET_NAME] |


* For [DATASET.TABLE], use the dataset name you created in step 6 and choose a name for the table. Cloud Dataflow will create the table for you on its first run.
* You can set these variables in the Airflow web UI (Admin > Variables) or from the command line, as sketched below.
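A minimal sketch of setting the variables with the `gcloud` CLI instead of the web UI. It assumes the Airflow 1.10 CLI bundled with the image version from step 5; `[COMPOSER_ENVIRONMENT]`, `[GCP_REGION]`, and the other bracketed values are placeholders.

``` shell
# Everything after "--" is passed to the Airflow CLI ("airflow variables --set KEY VALUE").
gcloud composer environments run [COMPOSER_ENVIRONMENT] --location [GCP_REGION] \
    variables -- --set bq_output_table [DATASET.TABLE]
gcloud composer environments run [COMPOSER_ENVIRONMENT] --location [GCP_REGION] \
    variables -- --set gcp_project [YOUR_PROJECT_ID]
# Repeat for email, gcp_temp_location, and mssql_export_bucket.
```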
15. Create an Airflow connection
* From the Airflow interface, go to Admin > Connections
* Edit the mssql_default connection
* Change the details to match your Microsoft SQL Server

16. In the Cloud Console, go to the Composer Environments page
* In the PyPI packages section, add pymssql. It should look like:

![PYPI Packages](images/pypi-packages.png)

17. Follow these instructions for [Configuring SendGrid email services](https://cloud.google.com/composer/docs/how-to/managing/creating#notification)

18. Deploy the two Cloud Composer DAGs
* Before uploading mssql_gcs_dataflow_bigquery_dag_1.py, edit line 51 to use your own SQL statement
* Upload the two files below to the DAGs folder in Google Cloud Storage

![Dags Folder](images/dags-folder.png)

* cloud-composer/mssql_gcs_dataflow_bigquery_dag_1.py
* cloud-composer/mssql_gcs_dataflow_bigquery_dag_2.py


**This is not an officially supported Google product**
--------------------------------------------------------------------------------
/cloud-composer/mssql_gcs_dataflow_bigquery_dag_1.py:
--------------------------------------------------------------------------------
1 | # Copyright 2020 Google LLC.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import datetime
16 | import logging
17 | import os
18 | 
19 | from airflow import configuration
20 | from airflow import models
21 | from airflow.contrib.hooks import gcs_hook
22 | from airflow.contrib.operators import mssql_to_gcs
23 | from airflow.operators import python_operator
24 | from airflow.utils.trigger_rule import TriggerRule
25 | from airflow.operators import email_operator
26 | 
27 | # We set the start_date of the DAG to the previous date. This will
28 | # make the DAG immediately available for scheduling.
29 | YESTERDAY = datetime.datetime.combine(
30 |     datetime.datetime.today() - datetime.timedelta(1),
31 |     datetime.datetime.min.time())
32 | 
33 | # We define some variables that we will use in the DAG tasks.
34 | SUCCESS_TAG = 'success'
35 | FAILURE_TAG = 'failure'
36 | 
37 | DATE = '{{ ds }}'
38 | 
39 | DEFAULT_DAG_ARGS = {
40 |     'start_date': YESTERDAY,
41 |     'retries': 0,
42 |     'project_id': models.Variable.get('gcp_project')
43 | }
44 | 
45 | with models.DAG(dag_id='mssql_gcs_dataflow_bigquery_dag_1',
46 |                 description='A DAG that exports data from Microsoft SQL Server to Cloud Storage',
47 |                 schedule_interval=None, default_args=DEFAULT_DAG_ARGS) as dag:
48 |     # Export task that will process the SQL statement and save the file to Cloud Storage.
49 |     export_sales_orders = mssql_to_gcs.MsSqlToGoogleCloudStorageOperator(
50 |         task_id='export_sales_orders',
51 |         sql='SELECT * FROM WideWorldImporters.Sales.Orders;',
52 |         bucket=models.Variable.get('mssql_export_bucket'),
53 |         filename=DATE + '-export.json',
54 |         mssql_conn_id='mssql_default',
55 |         dag=dag
56 |     )
57 | 
58 |     # Here we create two conditional tasks, one of which will be executed
59 |     # based on whether the export_sales_orders task was a success or a failure.
60 |     success_move_task = email_operator.EmailOperator(task_id='success',
61 |                             trigger_rule=TriggerRule.ALL_SUCCESS,
62 |                             to=models.Variable.get('email'),
63 |                             subject='mssql_gcs_dataflow_bigquery_dag_1 Job Succeeded: start_date {{ ds }}',
64 |                             html_content="HTML CONTENT"
65 |                             )
66 | 
67 |     failure_move_task = email_operator.EmailOperator(task_id='failure',
68 |                             trigger_rule=TriggerRule.ALL_FAILED,
69 |                             to=models.Variable.get('email'),
70 |                             subject='mssql_gcs_dataflow_bigquery_dag_1 Job Failed: start_date {{ ds }}',
71 |                             html_content="HTML CONTENT"
72 |                             )
73 | 
74 |     # The success_move_task and failure_move_task are both downstream from the
75 |     # export_sales_orders task.
76 |     export_sales_orders >> success_move_task
77 |     export_sales_orders >> failure_move_task
--------------------------------------------------------------------------------
/cloud-composer/mssql_gcs_dataflow_bigquery_dag_2.py:
--------------------------------------------------------------------------------
1 | # Copyright 2020 Google LLC.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import datetime 16 | import logging 17 | import os 18 | 19 | from airflow import configuration 20 | from airflow import models 21 | from airflow.contrib.hooks import gcs_hook 22 | from airflow.contrib.operators import dataflow_operator 23 | from airflow.operators import python_operator 24 | from airflow.utils.trigger_rule import TriggerRule 25 | from airflow.operators import email_operator 26 | 27 | # We set the start_date of the DAG to the previous date. This will 28 | # make the DAG immediately available for scheduling. 29 | YESTERDAY = datetime.datetime.combine( 30 | datetime.datetime.today() - datetime.timedelta(1), 31 | datetime.datetime.min.time()) 32 | 33 | # We define some variables that we will use in the DAG tasks. 34 | SUCCESS_TAG = 'success' 35 | FAILURE_TAG = 'failure' 36 | 37 | DS_TAG = '{{ ds }}' 38 | DATAFLOW_FILE = os.path.join( 39 | configuration.get('core', 'dags_folder'), 'dataflow', 'process_json.py') 40 | 41 | DEFAULT_DAG_ARGS = { 42 | 'start_date': YESTERDAY, 43 | 'email': models.Variable.get('email'), 44 | 'email_on_failure': True, 45 | 'email_on_retry': False, 46 | 'retries': 0, 47 | 'project_id': models.Variable.get('gcp_project'), 48 | 'dataflow_default_options': { 49 | 'project': models.Variable.get('gcp_project'), 50 | 'temp_location': models.Variable.get('gcp_temp_location'), 51 | 'runner': 'DataflowRunner' 52 | } 53 | } 54 | 55 | # Setting schedule_interval to None as this DAG is externally trigger by a Cloud Function. 56 | with models.DAG(dag_id='mssql_gcs_dataflow_bigquery_dag_2', 57 | description='A DAG triggered by an external Cloud Function', 58 | schedule_interval=None, default_args=DEFAULT_DAG_ARGS) as dag: 59 | # Args required for the Dataflow job. 60 | job_args = { 61 | 'input': 'gs://{{ dag_run.conf["bucket"] }}/{{ dag_run.conf["name"] }}', 62 | 'output': models.Variable.get('bq_output_table'), 63 | 'load_dt': DS_TAG 64 | } 65 | 66 | # Main Dataflow task that will process and load the input delimited file. 67 | dataflow_task = dataflow_operator.DataFlowPythonOperator( 68 | task_id="process-json-to-dataflow", 69 | py_file=DATAFLOW_FILE, 70 | options=job_args) 71 | 72 | # Here we create two conditional tasks, one of which will be executed 73 | # based on whether the export_sales_orders was a success or a failure. 74 | success_move_task = email_operator.EmailOperator(task_id='success', 75 | trigger_rule=TriggerRule.ALL_SUCCESS, 76 | to=models.Variable.get('email'), 77 | subject='mssql_gcs_dataflow_bigquery_dag_2 Job Succeeded: start_date {{ ds }}', 78 | html_content="HTML CONTENT" 79 | ) 80 | 81 | failure_move_task = email_operator.EmailOperator(task_id='failure', 82 | trigger_rule=TriggerRule.ALL_FAILED, 83 | to=models.Variable.get('email'), 84 | subject='mssql_gcs_dataflow_bigquery_dag_2 Job Failed: start_date {{ ds }}', 85 | html_content="HTML CONTENT" 86 | ) 87 | 88 | # The success_move_task and failure_move_task are both downstream from the 89 | # dataflow_task. 
90 | dataflow_task >> success_move_task 91 | dataflow_task >> failure_move_task -------------------------------------------------------------------------------- /cloud-dataflow/process_json.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import argparse 16 | import logging 17 | import ntpath 18 | import re 19 | import json 20 | 21 | import apache_beam as beam 22 | from apache_beam.options import pipeline_options 23 | 24 | class RowTransformer(object): 25 | 26 | def __init__(self, filename, load_dt): 27 | self.filename = filename 28 | self.load_dt = load_dt 29 | 30 | def parse(self, row): 31 | data = json.loads(row) 32 | data['filename'] = self.filename 33 | data['load_dt'] = self.load_dt 34 | return data 35 | 36 | def run(argv=None): 37 | parser = argparse.ArgumentParser() 38 | parser.add_argument( 39 | '--input', dest='input', required=True, 40 | help='Input file to read. This can be a local file or ' 41 | 'a file in a Google Storage Bucket.') 42 | parser.add_argument('--output', dest='output', required=True, 43 | help='Output BQ table to write results to.') 44 | parser.add_argument('--load_dt', dest='load_dt', required=True, 45 | help='Load date in YYYY-MM-DD format.') 46 | known_args, pipeline_args = parser.parse_known_args(argv) 47 | row_transformer = RowTransformer(filename=ntpath.basename(known_args.input), 48 | load_dt=known_args.load_dt) 49 | p_opts = pipeline_options.PipelineOptions(pipeline_args) 50 | 51 | with beam.Pipeline(options=p_opts) as pipeline: 52 | 53 | rows = pipeline | "Read from text file" >> beam.io.ReadFromText(known_args.input) 54 | 55 | dict_records = rows | "Convert to BigQuery row" >> beam.Map( 56 | lambda r: row_transformer.parse(r)) 57 | 58 | bigquery_table_schema = { 59 | "fields": [ 60 | { 61 | "mode": "NULLABLE", 62 | "name": "BackorderOrderID", 63 | "type": "INTEGER" 64 | }, 65 | { 66 | "mode": "NULLABLE", 67 | "name": "Comments", 68 | "type": "STRING" 69 | }, 70 | { 71 | "mode": "NULLABLE", 72 | "name": "ContactPersonID", 73 | "type": "INTEGER" 74 | }, 75 | { 76 | "mode": "NULLABLE", 77 | "name": "CustomerID", 78 | "type": "INTEGER" 79 | }, 80 | { 81 | "mode": "NULLABLE", 82 | "name": "CustomerPurchaseOrderNumber", 83 | "type": "INTEGER" 84 | }, 85 | { 86 | "mode": "NULLABLE", 87 | "name": "DeliveryInstructions", 88 | "type": "STRING" 89 | }, 90 | { 91 | "mode": "NULLABLE", 92 | "name": "ExpectedDeliveryDate", 93 | "type": "DATE" 94 | }, 95 | { 96 | "mode": "NULLABLE", 97 | "name": "InternalComments", 98 | "type": "STRING" 99 | }, 100 | { 101 | "mode": "NULLABLE", 102 | "name": "IsUndersupplyBackordered", 103 | "type": "BOOLEAN" 104 | }, 105 | { 106 | "mode": "NULLABLE", 107 | "name": "LastEditedBy", 108 | "type": "INTEGER" 109 | }, 110 | { 111 | "mode": "NULLABLE", 112 | "name": "LastEditedWhen", 113 | "type": "TIMESTAMP" 114 | }, 115 | { 116 | "mode": "NULLABLE", 117 | "name": "OrderDate", 
118 | "type": "DATE" 119 | }, 120 | { 121 | "mode": "NULLABLE", 122 | "name": "OrderID", 123 | "type": "INTEGER" 124 | }, 125 | { 126 | "mode": "NULLABLE", 127 | "name": "PickedByPersonID", 128 | "type": "INTEGER" 129 | }, 130 | { 131 | "mode": "NULLABLE", 132 | "name": "PickingCompletedWhen", 133 | "type": "TIMESTAMP" 134 | }, 135 | { 136 | "mode": "NULLABLE", 137 | "name": "SalespersonPersonID", 138 | "type": "INTEGER" 139 | }, 140 | { 141 | "mode": "NULLABLE", 142 | "name": "filename", 143 | "type": "STRING" 144 | }, 145 | { 146 | "mode": "NULLABLE", 147 | "name": "load_dt", 148 | "type": "DATE" 149 | } 150 | ] 151 | } 152 | 153 | dict_records | "Write to BigQuery" >> beam.io.WriteToBigQuery( 154 | known_args.output, 155 | schema=bigquery_table_schema, 156 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, 157 | write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND) 158 | 159 | if __name__ == '__main__': 160 | logging.getLogger().setLevel(logging.INFO) 161 | run() -------------------------------------------------------------------------------- /cloud-functions/index.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 2020, Google, Inc. 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, software 10 | * distributed under the License is distributed on an "AS IS" BASIS, 11 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | * See the License for the specific language governing permissions and 13 | * limitations under the License. 14 | */ 15 | 16 | 'use strict'; 17 | 18 | const fetch = require('node-fetch'); 19 | const FormData = require('form-data'); 20 | 21 | /** 22 | * Triggered from a message on a Cloud Storage bucket. 23 | * 24 | * @param {!Object} data The Cloud Functions event data. 25 | * @returns {Promise} 26 | */ 27 | exports.triggerDag = async (data) => { 28 | 29 | // Fill in your Composer environment information here. 30 | 31 | // The project that holds your function 32 | const PROJECT_ID = ''; 33 | // Run python get-client-id/get_client_id.py [PROJECT_ID] [GCP_REGION] [COMPOSER_ENVIRONMENT] to get your client id 34 | const CLIENT_ID = ''; 35 | // This should be part of your webserver's URL: 36 | // {tenant-project-id}.appspot.com 37 | const WEBSERVER_ID = ''; 38 | // The name of the DAG you wish to trigger 39 | const DAG_NAME = 'mssql_gcs_dataflow_bigquery_dag_2'; 40 | 41 | // Other constants 42 | const WEBSERVER_URL = `https://${WEBSERVER_ID}.appspot.com/api/experimental/dags/${DAG_NAME}/dag_runs`; 43 | const USER_AGENT = 'gcf-event-trigger'; 44 | const BODY = {conf: JSON.stringify(data)}; 45 | 46 | // Make the request 47 | try { 48 | const iap = await authorizeIap(CLIENT_ID, PROJECT_ID, USER_AGENT); 49 | 50 | return makeIapPostRequest(WEBSERVER_URL, BODY, iap.idToken, USER_AGENT); 51 | } catch (err) { 52 | console.error('Error authorizing IAP:', err.message); 53 | throw new Error(err); 54 | } 55 | }; 56 | 57 | /** 58 | * @param {string} clientId The client id associated with the Composer webserver application. 59 | * @param {string} projectId The id for the project containing the Cloud Function. 60 | * @param {string} userAgent The user agent string which will be provided with the webserver request. 
61 | */ 62 | const authorizeIap = async (clientId, projectId, userAgent) => { 63 | const SERVICE_ACCOUNT = `${projectId}@appspot.gserviceaccount.com`; 64 | const JWT_HEADER = Buffer.from( 65 | JSON.stringify({alg: 'RS256', typ: 'JWT'}) 66 | ).toString('base64'); 67 | 68 | let jwt = ''; 69 | let jwtClaimset = ''; 70 | 71 | // Obtain an Oauth2 access token for the appspot service account 72 | const res = await fetch( 73 | `http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/${SERVICE_ACCOUNT}/token`, 74 | { 75 | headers: {'User-Agent': userAgent, 'Metadata-Flavor': 'Google'}, 76 | } 77 | ); 78 | const tokenResponse = await res.json(); 79 | if (tokenResponse.error) { 80 | console.error('Error in token reponse:', tokenResponse.error.message); 81 | return Promise.reject(tokenResponse.error); 82 | } 83 | 84 | const accessToken = tokenResponse.access_token; 85 | const iat = Math.floor(new Date().getTime() / 1000); 86 | const claims = { 87 | iss: SERVICE_ACCOUNT, 88 | aud: 'https://www.googleapis.com/oauth2/v4/token', 89 | iat: iat, 90 | exp: iat + 60, 91 | target_audience: clientId, 92 | }; 93 | jwtClaimset = Buffer.from(JSON.stringify(claims)).toString('base64'); 94 | const toSign = [JWT_HEADER, jwtClaimset].join('.'); 95 | 96 | const blob = await fetch( 97 | `https://iam.googleapis.com/v1/projects/${projectId}/serviceAccounts/${SERVICE_ACCOUNT}:signBlob`, 98 | { 99 | method: 'POST', 100 | body: JSON.stringify({ 101 | bytesToSign: Buffer.from(toSign).toString('base64'), 102 | }), 103 | headers: { 104 | 'User-Agent': userAgent, 105 | Authorization: `Bearer ${accessToken}`, 106 | }, 107 | } 108 | ); 109 | const blobJson = await blob.json(); 110 | if (blobJson.error) { 111 | console.error('Error in blob signing:', blobJson.error.message); 112 | return Promise.reject(blobJson.error); 113 | } 114 | 115 | // Request service account signature on header and claimset 116 | const jwtSignature = blobJson.signature; 117 | jwt = [JWT_HEADER, jwtClaimset, jwtSignature].join('.'); 118 | const form = new FormData(); 119 | form.append('grant_type', 'urn:ietf:params:oauth:grant-type:jwt-bearer'); 120 | form.append('assertion', jwt); 121 | 122 | const token = await fetch('https://www.googleapis.com/oauth2/v4/token', { 123 | method: 'POST', 124 | body: form, 125 | }); 126 | const tokenJson = await token.json(); 127 | if (tokenJson.error) { 128 | console.error('Error fetching token:', tokenJson.error.message); 129 | return Promise.reject(tokenJson.error); 130 | } 131 | 132 | return { 133 | idToken: tokenJson.id_token, 134 | }; 135 | }; 136 | 137 | /** 138 | * @param {string} url The url that the post request targets. 139 | * @param {string} body The body of the post request. 140 | * @param {string} idToken Bearer token used to authorize the iap request. 141 | * @param {string} userAgent The user agent to identify the requester. 
142 | */ 143 | const makeIapPostRequest = async (url, body, idToken, userAgent) => { 144 | const res = await fetch(url, { 145 | method: 'POST', 146 | headers: { 147 | 'User-Agent': userAgent, 148 | Authorization: `Bearer ${idToken}`, 149 | }, 150 | body: JSON.stringify(body), 151 | }); 152 | 153 | if (!res.ok) { 154 | const err = await res.text(); 155 | console.error('Error making IAP post request:', err.message); 156 | throw new Error(err); 157 | } 158 | }; -------------------------------------------------------------------------------- /cloud-functions/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "nodejs-docs-samples-functions-composer-storage-trigger", 3 | "version": "0.0.1", 4 | "dependencies": { 5 | "form-data": "^3.0.0", 6 | "node-fetch": "^2.2.0" 7 | }, 8 | "engines": { 9 | "node": ">=8.0.0" 10 | }, 11 | "private": true, 12 | "license": "Apache-2.0", 13 | "author": "Google Inc.", 14 | "repository": { 15 | "type": "git", 16 | "url": "https://github.com/GoogleCloudPlatform/nodejs-docs-samples.git" 17 | }, 18 | "devDependencies": { 19 | "mocha": "^8.0.0", 20 | "proxyquire": "^2.1.0", 21 | "sinon": "^9.0.0" 22 | }, 23 | "scripts": { 24 | "test": "mocha test/*.test.js --timeout=20000" 25 | } 26 | } -------------------------------------------------------------------------------- /get-client-id/get_client_id.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Get the client ID associated with a Cloud Composer environment.""" 16 | 17 | import argparse 18 | 19 | 20 | def get_client_id(project_id, location, composer_environment): 21 | # [START composer_get_environment_client_id] 22 | import google.auth 23 | import google.auth.transport.requests 24 | import requests 25 | import six.moves.urllib.parse 26 | 27 | # Authenticate with Google Cloud. 28 | # See: https://cloud.google.com/docs/authentication/getting-started 29 | credentials, _ = google.auth.default( 30 | scopes=['https://www.googleapis.com/auth/cloud-platform']) 31 | authed_session = google.auth.transport.requests.AuthorizedSession( 32 | credentials) 33 | 34 | # project_id = 'YOUR_PROJECT_ID' 35 | # location = 'us-central1' 36 | # composer_environment = 'YOUR_COMPOSER_ENVIRONMENT_NAME' 37 | # https://composer.googleapis.com/v1beta1/projects/composer-training-282614/locations/us-central1/environments/airflow-env' 38 | 39 | environment_url = ( 40 | 'https://composer.googleapis.com/v1beta1/projects/{}/locations/{}' 41 | '/environments/{}').format(project_id, location, composer_environment) 42 | composer_response = authed_session.request('GET', environment_url) 43 | environment_data = composer_response.json() 44 | airflow_uri = environment_data['config']['airflowUri'] 45 | 46 | # The Composer environment response does not include the IAP client ID. 
47 | # Make a second, unauthenticated HTTP request to the web server to get the 48 | # redirect URI. 49 | redirect_response = requests.get(airflow_uri, allow_redirects=False) 50 | redirect_location = redirect_response.headers['location'] 51 | 52 | # Extract the client_id query parameter from the redirect. 53 | parsed = six.moves.urllib.parse.urlparse(redirect_location) 54 | query_string = six.moves.urllib.parse.parse_qs(parsed.query) 55 | print(query_string['client_id'][0]) 56 | # [END composer_get_environment_client_id] 57 | 58 | 59 | if __name__ == '__main__': 60 | parser = argparse.ArgumentParser( 61 | description=__doc__, 62 | formatter_class=argparse.RawDescriptionHelpFormatter) 63 | parser.add_argument('project_id', help='Your Project ID.') 64 | parser.add_argument( 65 | 'location', help='Region of the Cloud Composer environent.') 66 | parser.add_argument( 67 | 'composer_environment', help='Name of the Cloud Composer environent.') 68 | 69 | args = parser.parse_args() 70 | get_client_id( 71 | args.project_id, args.location, args.composer_environment) 72 | 73 | -------------------------------------------------------------------------------- /get-client-id/requirements.txt: -------------------------------------------------------------------------------- 1 | google-auth==1.18.0 2 | requests==2.24.0 3 | six==1.15.0 -------------------------------------------------------------------------------- /images/dags-folder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery/c93ed21608685376623686f2e0700f95af9fdfa8/images/dags-folder.png -------------------------------------------------------------------------------- /images/diagrams.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery/c93ed21608685376623686f2e0700f95af9fdfa8/images/diagrams.png -------------------------------------------------------------------------------- /images/pypi-packages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery/c93ed21608685376623686f2e0700f95af9fdfa8/images/pypi-packages.png --------------------------------------------------------------------------------