├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── composer ├── gcs_to_sftp_operator.py ├── gcs_to_sftp_operator_test.py ├── mozart_dag.py ├── sa360_create_report_operator.py ├── sa360_create_report_operator_test.py ├── sa360_download_report_file_operator.py ├── sa360_download_report_file_operator_test.py ├── sa360_report_file_available_sensor.py ├── sa360_report_file_available_sensor_test.py ├── sa360_report_request_builder.py ├── sa360_report_request_builder_test.py ├── sa360_reporting_hook.py ├── sa360_reporting_hook_test.py └── test_utils.py ├── dataflow ├── .gitignore ├── pom.xml └── src │ └── main │ └── java │ └── com │ └── google │ └── cse │ └── mozart │ ├── Mozart.java │ └── examples │ ├── FirebaseInput.java │ └── GCSInput.java └── doc └── images ├── environment-set0.png ├── environment-set1.png ├── environment-set2.png ├── environment-set3.png ├── environment-set4.png ├── environment-set5.png └── environment-set6.png /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a Contributor License 9 | Agreement. You (or your employer) retain the copyright to your contribution; 10 | this simply gives us permission to use and redistribute your contributions as 11 | part of the project. Head over to <https://cla.developers.google.com/> to see 12 | your current agreements on file or to sign a new one. 13 | 14 | You generally only need to submit a CLA once, so if you've already submitted one 15 | (even if it was for a different project), you probably don't need to do it 16 | again. 17 | 18 | ## Code reviews 19 | 20 | All submissions, including submissions by project members, require review. We 21 | use GitHub pull requests for this purpose. 
Consult 22 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 23 | information on using pull requests. 24 | 25 | ## Community Guidelines 26 | 27 | This project follows 28 | [Google's Open Source Community 29 | Guidelines](https://opensource.google.com/conduct/). 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 
35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Mozart - Business logic for Search Ads 360 2 | 3 | Table of Contents 4 | ================= 5 | 6 | * [How it works](#how-it-works) 7 | * [Architecture](#architecture) 8 | * [Set-up](#set-up) 9 | * [Pre-requisites](#pre-requisites) 10 | * [Composer set-up](#composer-set-up) 11 | * [DataFlow set-up](#dataflow-set-up) 12 | * [Cloud Storage set-up](#cloud-storage-set-up) 13 | 14 | Mozart is a framework for automating tasks on Search Ads 360 (SA360). Mozart 15 | lets advertisers and agencies apply their own business logic to SA360 campaigns 16 | by leveraging well-known technologies such as Apache Beam. 17 | 18 | Mozart is designed to be deployed in an Airflow+Beam platform. The rest of this 19 | documentation assumes Google Cloud Platform (GCP) is used for deployment. 20 | *Composer* is the name of GCP's managed version of *Airflow*, whereas *DataFlow* 21 | is the name of the managed version of *Beam*. 
22 | 23 | ## How it works 24 | 25 | Mozart leverages the SA360 Reporting API and SA360 Bulksheet uploads to perform 26 | the automation tasks. 27 | 28 | The sequence of high-level operations is: 29 | 30 | 1. Reports are downloaded from the SA360 API. These reports must include all 31 | entities to be processed (e.g.: Keywords, Ads). 32 | 2. Downloaded reports are analyzed, applying the custom business logic. The 33 | output of this logic is a CSV file containing updated information about the 34 | entities (Keywords, Ads). For example: a new Max CPC value for certain 35 | keywords. 36 | 3. CSV files with updated values are uploaded into SA360 using sFTP Bulksheet 37 | upload. 38 | 39 | ## Architecture 40 | 41 | Mozart consists of two main modules: 42 | 43 | 1. An Airflow DAG 44 | 2. A Beam Pipeline 45 | 46 | ## Set-up 47 | 48 | This guide describes how to set up Mozart on Google Cloud Platform. 49 | 50 | For the rest of the guide, it is assumed that you have created a Google Cloud 51 | project, enabled billing, and that you have Admin access to the project via 52 | [console.cloud.google.com](https://console.cloud.google.com/). 53 | 54 | For instructions on how to create a Google Cloud project and enable billing, 55 | please refer to the 56 | [Google Cloud Platform documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects). 57 | 58 | You must also enable certain Google Cloud Platform APIs to be able to use 59 | Mozart. To enable all of them in a single step, click on 60 | [enable APIs](https://console.cloud.google.com/flows/enableapi?apiid=composer.googleapis.com,dataproc.googleapis.com,storage-component.googleapis.com,dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudresourcemanager.googleapis.com,doubleclicksearch) 61 | and follow the instructions. 62 | 63 | Note: The link to enable APIs might take a while to load. Please be patient. 
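Before moving on to the environment setup, step 2 of the "How it works" sequence above (the custom business logic) can be made concrete with a toy, self-contained sketch. This is not part of Mozart: the report columns (`keywordId`, `clicks`, `impr`, `maxCpc`) and the 10% bid-raise rule are invented for illustration only.

```python
import csv
import io


def apply_business_logic(report_csv):
    """Toy rule: raise Max CPC by 10% for keywords whose CTR exceeds 5%.

    The input is a downloaded report as CSV text; the output is the CSV
    that would be uploaded back to SA360. Column names are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(report_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=['keywordId', 'maxCpc'])
    writer.writeheader()
    for row in reader:
        # Click-through rate; guard against zero impressions.
        ctr = int(row['clicks']) / max(int(row['impr']), 1)
        max_cpc = float(row['maxCpc'])
        if ctr > 0.05:
            max_cpc = round(max_cpc * 1.1, 2)
        writer.writerow({'keywordId': row['keywordId'], 'maxCpc': max_cpc})
    return out.getvalue()
```

In Mozart itself, the equivalent per-row transformation runs inside the Beam pipeline (`dataflow/src/main/java/com/google/cse/mozart/Mozart.java`), so the same logic scales to arbitrarily large reports.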
64 | 65 | ### Pre-requisites 66 | 67 | You must have the following software installed on your computer: 68 | 69 | * [Google Cloud SDK](https://cloud.google.com/sdk/install) 70 | * [Apache Maven](https://maven.apache.org/) 71 | 72 | ### Composer set-up 73 | 74 | The first step is to set up Google Cloud Composer. To do so, follow these 75 | steps: 76 | 77 | 1. Create a new Composer environment 78 | 79 | 1. Go to 80 | [console.cloud.google.com/composer](https://console.cloud.google.com/composer) 81 | 1. Check 'Enable beta features' 82 | 83 | ![Enabling beta features](doc/images/environment-set1.png) 84 | 85 | 1. Click on 'Create' 86 | 87 | ![Create Composer environment](doc/images/environment-set2.png) 88 | 89 | 1. Type in the following options: 90 | 91 | * Name: Enter a descriptive name 92 | * Location: select a location close to you (e.g.: 'europe-west1') 93 | * Scopes: 94 | * https://www.googleapis.com/auth/bigquery 95 | * https://www.googleapis.com/auth/devstorage.read_write 96 | * https://www.googleapis.com/auth/doubleclicksearch 97 | * https://www.googleapis.com/auth/compute 98 | * https://www.googleapis.com/auth/cloud-platform 99 | * Disk size: 20 100 | * Airflow version: 1.10.0 101 | 102 | 1. Click on 'Create' at the bottom of the page 103 | 104 | Tip: Creating the environment takes a while. We suggest you continue with 105 | the other sections and come back later to finish the environment configuration. 106 | 107 | 1. Once the environment is created, go to the 108 | [Composer page](https://console.cloud.google.com/composer) and open the 109 | 'Airflow webserver' for the newly created environment. 110 | 111 | ![Airflow webserver](doc/images/environment-set3.png) 112 | 113 | 1. Click on 'Admin' > 'Variables' 114 | 115 | ![Variables](doc/images/environment-set4.png) 116 | 117 | 1. 
Create the following variables: 118 | 119 | Variable | Description | Example value 120 | ------------------------ | ----------- | ------------- 121 | mozart/sa360_agency_id | SA360 agency ID | 123456789 122 | mozart/start_date | Enter today's date | 2018-10-30 123 | mozart/lookback_days | Number of days back to pull reports for. E.g.: if you enter '7', you will work with data (clicks, impressions) from the last 7 days | 7 124 | mozart/gcp_project | Your Google Cloud project ID | mozart-123456 125 | mozart/gcp_zone | The zone of your Composer instance | europe-west1-b 126 | mozart/gcs_bucket | Name of the GCS bucket you created (without 'gs://' prefix) | mozart-data 127 | mozart/dataflow_staging | GCS URI for DataFlow staging folder | gs://mozart/staging 128 | mozart/dataflow_template | GCS URI for DataFlow template | gs://mozart/templates/MozartProcessElements 129 | mozart/sa360_advertisers | JSON describing the advertisers to work with. Each advertiser contains an entry with the *advertiserId* and information about the sFTP endpoint for that advertiser. The sFTP endpoint connection must specify either a *sftpConnId* or the sFTP connection parameters: *sftpHost*, *sftpPort*, *sftpUsername*, *sftpPassword*. Any of these individual fields overrides the configuration provided in the connection ID | \[{"advertiserId": "123", "sftpConnId": "sa360_sftp", "sftpUsername": "username1", "sftpPassword": "password1"},{"advertiserId": "456", "sftpConnId": "sa360_sftp", "sftpUsername": "username2", "sftpPassword": "password2"} \] 130 | 131 | ### DataFlow set-up 132 | 133 | 1. 
Create a Service account 134 | 1. In the GCP Console, go to the 135 | [Create service account key page](https://console.cloud.google.com/apis/credentials/serviceaccountkey). 136 | 1. From the Service account drop-down list, select 'New service account'. 137 | 1. In the Service account name field, enter a name. 138 | 1. From the Role drop-down list, select 'Project > Owner'. 139 | * Use 'Project > Owner' for testing purposes. For a production system, 140 | you should select a more restrictive role. 141 | 1. Click 'Create'. A JSON file that contains your key downloads to your 142 | computer. 143 | 1. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the absolute 144 | path of the JSON file downloaded in the previous step. 145 | 146 | ### Cloud Storage set-up 147 | 148 | Mozart's DataFlow pipeline works with files on Google Cloud Storage. You need to 149 | create a bucket where these files will be stored: 150 | 151 | 1. Go to 152 | [console.cloud.google.com/storage](https://console.cloud.google.com/storage) 153 | 1. Create a bucket 154 | 155 | 1. Click on 'Create bucket' 156 | 1. Choose a name for the bucket 157 | 1. Choose a location that matches the location you used for the Composer 158 | configuration 159 | 160 | Note: A configuration based on the 'Regional' storage class and the same 161 | location as the one used for Composer is suggested. However, you may 162 | want to use other options if you plan on using the bucket for storing 163 | custom data. Check the Cloud Storage docs for more info on all the 164 | options. 165 | 166 | 1. Create a lifecycle rule for the bucket 167 | 168 | 1. In the bucket list view, click on the Lifecycle column value 169 | 170 | ![Lifecycle](doc/images/environment-set6.png) 171 | 172 | 1. Click on 'Add rule' 173 | 174 | 1. Select the 'Age' condition, and set it to 30 days 175 | 176 | 1. As an action, select 'Delete' 177 | 178 | 1. 
Save the rule 179 | 180 | Note: Lifecycle rules help you decrease Cloud Storage costs by deleting old 181 | elements. We suggest setting this 30-day policy, but you should adjust this 182 | if you wish to keep items for longer, or if you plan on storing other data 183 | in the same bucket. 184 | 185 | 1. Create the following folders: 186 | 187 | 1. staging 188 | 1. templates 189 | 1. sa360_reports 190 | 1. sa360_upload 191 | -------------------------------------------------------------------------------- /composer/gcs_to_sftp_operator.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Operator to upload file from Google Cloud Storage into an sFTP. 15 | """ 16 | 17 | import contextlib 18 | import StringIO 19 | import logging 20 | import os 21 | 22 | from airflow import models 23 | 24 | 25 | logger = logging.getLogger(__name__) 26 | 27 | 28 | class GCSToSFTPOperator(models.BaseOperator): 29 | """Operator to upload a GCS file to an sFTP location. 30 | """ 31 | 32 | template_fields = ['_gcs_filename'] 33 | 34 | def __init__(self, gcs_hook, ssh_hook, gcs_bucket, gcs_filename, 35 | sftp_destination_path, header=None, **kwargs): 36 | """Constructor. 37 | 38 | Args: 39 | gcs_hook: GCS Hook for connecting to Google Cloud Storage. 40 | ssh_hook: sFTP Hook for connecting to sFTP server. 
41 | gcs_bucket: GCS bucket where source file is stored. 42 | gcs_filename: File name in GCS of source file. 43 | sftp_destination_path: Destination path in the sFTP server. 44 | header: Optional header to be added to the uploaded file. Defaults to None 45 | **kwargs: Other arguments for use with airflow.models.BaseOperator. 46 | """ 47 | super(GCSToSFTPOperator, self).__init__(**kwargs) 48 | self._gcs_hook = gcs_hook 49 | self._gcs_bucket = gcs_bucket 50 | self._gcs_filename = gcs_filename 51 | self._ssh_hook = ssh_hook 52 | self._sftp_destination_path = sftp_destination_path 53 | self._header = header 54 | 55 | def execute(self, context): 56 | """Execute operator. 57 | 58 | This method is invoked by Airflow to execute the task. 59 | 60 | Args: 61 | context: Airflow context. 62 | 63 | Raises: 64 | ValueError: If provided GCS URI cannot be parsed for bucket and filename. 65 | """ 66 | file_data = self._gcs_hook.download(self._gcs_bucket, 67 | self._gcs_filename) 68 | with contextlib.closing(StringIO.StringIO()) as file_fd: 69 | if self._header: 70 | file_fd.write('%s\n' % self._header) 71 | file_fd.write(file_data) 72 | file_fd.seek(0) 73 | with self._ssh_hook.get_conn() as ssh_client: 74 | sftp_client = ssh_client.open_sftp() 75 | sftp_client.putfo(file_fd, self._sftp_destination_path) 76 | logger.debug('File [gs://%s/%s] uploaded to sFTP destination: %s', 77 | self._gcs_bucket, self._gcs_filename, 78 | self._sftp_destination_path) 79 | 80 | 81 | -------------------------------------------------------------------------------- /composer/gcs_to_sftp_operator_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Tests for common.gcs_to_sftp_operator.""" 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import contextlib 21 | import mock 22 | 23 | from google3.corp.gtech.ads.doubleclick.composer.common import test_utils 24 | from google3.testing.pybase import googletest 25 | 26 | GCSToSFTPOperator = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 27 | 'google3.corp.gtech.ads.doubleclick.composer.' 28 | 'common.gcs_to_sftp_operator.GCSToSFTPOperator', [{ 29 | 'name': 'airflow.models.BaseOperator', 30 | 'mock': mock.MagicMock 31 | }]) 32 | 33 | GCS_BUCKET = 'test_gcs_bucket' 34 | GCS_FILENAME = 'test_gcs_filename' 35 | SFTP_DESTINATION = 'test_sftp_destination' 36 | MOCK_FILE_NAME = 'mock_filename' 37 | CONTEXT_TASK_INSTANCE = 'task_instance' 38 | FILE_CONTENTS = u'Test first line áéíóú\nTest second line\n' 39 | FILE_HEADER = u'Headeráéíóú' 40 | 41 | 42 | class TestException(Exception): 43 | pass 44 | 45 | 46 | class GcsToSftpOperatorTest(googletest.TestCase): 47 | 48 | def putfo_mock(self, file_fd, destination): 49 | self.sftp_filesystem[destination] = file_fd.read() 50 | 51 | def setUp(self): 52 | self.gcs_hook = mock.MagicMock() 53 | self.gcs_hook.download.return_value = FILE_CONTENTS 54 | 55 | self.ssh_hook = mock.MagicMock() 56 | self.ssh_hook.get_conn().__enter__( 57 | ).open_sftp().putfo.side_effect = self.putfo_mock 58 | self.operator = GCSToSFTPOperator(self.gcs_hook, self.ssh_hook, 
GCS_BUCKET, 59 | GCS_FILENAME, SFTP_DESTINATION) 60 | self.context = {CONTEXT_TASK_INSTANCE: mock.MagicMock()} 61 | 62 | self.sftp_filesystem = {} 63 | 64 | def test_gcs_download(self): 65 | self.operator.execute(self.context) 66 | self.gcs_hook.download.assert_called_once_with(GCS_BUCKET, GCS_FILENAME) 67 | 68 | def test_contents(self): 69 | self.operator.execute(self.context) 70 | self.assertEqual(self.sftp_filesystem[SFTP_DESTINATION], FILE_CONTENTS) 71 | 72 | def test_contents_header(self): 73 | self.operator = GCSToSFTPOperator(self.gcs_hook, self.ssh_hook, GCS_BUCKET, 74 | GCS_FILENAME, SFTP_DESTINATION, 75 | FILE_HEADER) 76 | self.operator.execute(self.context) 77 | self.assertEqual(self.sftp_filesystem[SFTP_DESTINATION], '%s\n%s' % ( 78 | FILE_HEADER, FILE_CONTENTS)) 79 | 80 | 81 | if __name__ == '__main__': 82 | googletest.main() 83 | -------------------------------------------------------------------------------- /composer/mozart_dag.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Mozart main DAG. 15 | 16 | This DAG automates the process of downloading SA360 reports into Cloud Storage 17 | and then processing it using DataFlow. 
18 | 19 | DAG requirements: 20 | Airflow variables (see the README for details and example values): 21 | mozart/sa360_agency_id, mozart/sa360_advertisers, mozart/gcs_bucket, 22 | mozart/start_date (format: YYYY-MM-DD), mozart/lookback_days, 23 | mozart/gcp_project, mozart/gcp_zone, mozart/dataflow_staging, 24 | mozart/dataflow_template, input_custom_data_file and 25 | custom_data_column_names. mozart/sa360_advertisers is a JSON list of 26 | advertisers with their sFTP connection settings. 27 | """ 28 | import datetime 29 | import json 30 | 31 | import airflow 32 | from airflow import models 33 | from airflow.contrib.hooks import gcs_hook 34 | from airflow.contrib.hooks import ssh_hook 35 | from airflow.contrib.operators import dataflow_operator 36 | 37 | import gcs_to_sftp_operator 38 | import sa360_create_report_operator as create_operator 39 | import sa360_download_report_file_operator as download_operator 40 | import sa360_report_file_available_sensor as available_sensor 41 | import sa360_report_request_builder as request_builder 42 | import sa360_reporting_hook 43 | 44 | GCS_PATH_FORMAT = 'gs://%s/%s' 45 | REPORT_FILENAME = 'report-{{ run_id }}.csv' 46 | 47 | # Read configuration from Airflow variables 48 | sa360_conn_id = 'google_cloud_default' 49 | ssh_conn_id = 'sa360_sftp' 50 | agency_id = models.Variable.get('mozart/sa360_agency_id') 51 | advertisers = json.loads(models.Variable.get('mozart/sa360_advertisers')) 52 | gcs_bucket = models.Variable.get('mozart/gcs_bucket') 53 | start_date = datetime.datetime.strptime( 54 | models.Variable.get('mozart/start_date'), '%Y-%m-%d') 55 | lookback_days = int(models.Variable.get('mozart/lookback_days')) 56 | gcp_project = models.Variable.get('mozart/gcp_project') 57 | gcp_zone = models.Variable.get('mozart/gcp_zone') 58 | dataflow_staging = models.Variable.get('mozart/dataflow_staging') 59 | dataflow_template = models.Variable.get('mozart/dataflow_template') 60 | input_custom_data_file = models.Variable.get('input_custom_data_file') 61 | custom_data_column_names = models.Variable.get('custom_data_column_names') 62 | 
63 | # Default args that will be applied to all tasks in the DAG 64 | default_args = { 65 | 'owner': 'airflow', 66 | 'depends_on_past': False, 67 | 'start_date': start_date, 68 | 'email_on_failure': False, 69 | 'email_on_retry': False, 70 | 'retries': 1, 71 | 'retry_delay': datetime.timedelta(seconds=10), 72 | } 73 | 74 | # Hooks for connecting to GCS, SA360 API and SA360's sFTP endpoint. 75 | sa360_reporting_hook = sa360_reporting_hook.SA360ReportingHook( 76 | sa360_report_conn_id=sa360_conn_id) 77 | gcs_hook = gcs_hook.GoogleCloudStorageHook() 78 | 79 | # SA360 request builder 80 | request_builder = request_builder.SA360ReportRequestBuilder( 81 | agency_id, list(elem['advertiserId'] for elem in advertisers)) 82 | 83 | output_file_header = ','.join(request_builder.get_headers()) 84 | 85 | # DAG definition 86 | dag = airflow.DAG( 87 | 'mozart_dag', 88 | default_args=default_args, 89 | schedule_interval=datetime.timedelta(1), 90 | concurrency=3) 91 | 92 | create_report = create_operator.SA360CreateReportOperator( 93 | task_id='create_kw_report', 94 | sa360_reporting_hook=sa360_reporting_hook, 95 | request_builder=request_builder, 96 | lookback_days=lookback_days, 97 | dag=dag) 98 | 99 | wait_for_report = available_sensor.SA360ReportFileAvailableSensor( 100 | task_id='wait_for_kw_report', 101 | sa360_reporting_hook=sa360_reporting_hook, 102 | poke_interval=30, 103 | retries=500, 104 | dag=dag) 105 | create_report.set_downstream(wait_for_report) 106 | 107 | download_file = download_operator.SA360DownloadReportFileOperator( 108 | task_id='download_kw_report', 109 | sa360_reporting_hook=sa360_reporting_hook, 110 | gcs_hook=gcs_hook, 111 | gcs_bucket=gcs_bucket, 112 | filename=REPORT_FILENAME, 113 | write_header=False, 114 | dag=dag) 115 | wait_for_report.set_downstream(download_file) 116 | 117 | for advertiser in advertisers: 118 | advertiser_id = advertiser['advertiserId'] 119 | sftp_conn_id = advertiser.get('sftpConnId', None) 120 | sftp_host = 
advertiser.get('sftpHost', None) 121 | sftp_port = advertiser.get('sftpPort', None) 122 | sftp_username = advertiser.get('sftpUsername', None) 123 | sftp_password = advertiser.get('sftpPassword', None) 124 | connection_hook = ssh_hook.SSHHook( 125 | ssh_conn_id=sftp_conn_id, 126 | remote_host=sftp_host, 127 | username=sftp_username, 128 | password=sftp_password, 129 | port=sftp_port) 130 | 131 | output_filename = 'output/report-%s-{{ run_id }}.csv' % advertiser_id 132 | 133 | process_elements = dataflow_operator.DataflowTemplateOperator( 134 | task_id='process_elements-%s' % advertiser_id, 135 | dataflow_default_options={ 136 | 'project': gcp_project, 137 | 'zone': gcp_zone, 138 | 'tempLocation': dataflow_staging, 139 | }, 140 | parameters={ 141 | 'inputKeywordsFile': 142 | GCS_PATH_FORMAT % (gcs_bucket, REPORT_FILENAME), 143 | 'outputKeywordsFile': 144 | GCS_PATH_FORMAT % (gcs_bucket, output_filename), 145 | 'keywordColumnNames': 146 | output_file_header, 147 | 'inputCustomDataFile': 148 | input_custom_data_file, 149 | 'customDataColumnNames': 150 | custom_data_column_names, 151 | 'advertiserId': 152 | advertiser_id 153 | }, 154 | template=dataflow_template, 155 | gcp_conn_id=sa360_conn_id, 156 | dag=dag) 157 | download_file.set_downstream(process_elements) 158 | 159 | upload_to_sftp = gcs_to_sftp_operator.GCSToSFTPOperator( 160 | task_id='upload_to_sftp-%s' % advertiser_id, 161 | gcs_hook=gcs_hook, 162 | ssh_hook=connection_hook, 163 | gcs_bucket=gcs_bucket, 164 | gcs_filename=output_filename, 165 | sftp_destination_path='/input.csv', 166 | gcp_conn_id=sa360_conn_id, 167 | header='Row Type,Action,Status,' + output_file_header, 168 | dag=dag) 169 | process_elements.set_downstream(upload_to_sftp) 170 | -------------------------------------------------------------------------------- /composer/sa360_create_report_operator.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the 
Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Operator to create (request) a new SA360 Report. 15 | """ 16 | 17 | import datetime 18 | import logging 19 | 20 | from airflow import models 21 | 22 | 23 | logger = logging.getLogger(__name__) 24 | 25 | 26 | class SA360CreateReportOperator(models.BaseOperator): 27 | """Operator to request a new SA360 asynchronous report. 28 | 29 | This Operator requests a new report to the SA360 API. The newly requested 30 | report ID is then stored in a 'report_id' xcom so that it can be retrieved by 31 | a subsequent task. 32 | """ 33 | 34 | def __init__(self, sa360_reporting_hook, request_builder, lookback_days, 35 | *args, **kwargs): 36 | """Constructor. 37 | 38 | Args: 39 | sa360_reporting_hook: Airflow hook to create the SA360 Reporting API 40 | service. 41 | request_builder: Request builder. Must be an object with a 42 | *build(start_date, end_date)* method. 43 | lookback_days: Number of days into the past for the report. Report will 44 | start at yesterday - lookback_days. 45 | *args: Other arguments for use with airflow.models.BaseOperator. 46 | **kwargs: Other arguments for use with airflow.models.BaseOperator. 
47 | """ 48 | super(SA360CreateReportOperator, self).__init__(*args, **kwargs) 49 | self._service = None 50 | self._request_builder = request_builder 51 | self._sa360_reporting_hook = sa360_reporting_hook 52 | self._lookback_days = lookback_days 53 | 54 | def execute(self, context): 55 | """Execute operator. 56 | 57 | This method is invoked by Airflow to execute the task. 58 | 59 | Args: 60 | context: Airflow context. 61 | """ 62 | if self._service is None: 63 | self._service = self._sa360_reporting_hook.get_service() 64 | 65 | # start at yesterday - lookback_days and end at yesterday 66 | start_date = datetime.date.today() - datetime.timedelta( 67 | self._lookback_days + 1) 68 | end_date = datetime.date.today() - datetime.timedelta(1) 69 | 70 | request = self._service.reports().request( 71 | body=self._request_builder.build(start_date, end_date)) 72 | response = request.execute() 73 | logger.info('Successfully created report') 74 | logger.debug('Create report response: %s', response) 75 | 76 | # Store report ID in xcom so that it can be retrieve by another task. 77 | context['task_instance'].xcom_push('report_id', response['id']) 78 | -------------------------------------------------------------------------------- /composer/sa360_create_report_operator_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | """Tests for sa360_create_report_operator.""" 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | import datetime 20 | 21 | import mock 22 | 23 | from google3.corp.gtech.ads.doubleclick.composer.common import test_utils 24 | from google3.testing.pybase import googletest 25 | 26 | SA360CreateReportOperator = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 27 | 'google3.corp.gtech.ads.doubleclick.composer.' 28 | 'sa360.sa360_create_report_operator.SA360CreateReportOperator', 29 | [{'name': 'airflow.models.BaseOperator', 30 | 'mock': mock.MagicMock} 31 | ]) 32 | 33 | 34 | class TestException(Exception): 35 | pass 36 | 37 | 38 | class TestRequestBuilder(object): 39 | 40 | def build(self, start_date, end_date): 41 | result = {} 42 | result['startDate'] = start_date.isoformat() 43 | result['endDate'] = end_date.isoformat() 44 | return result 45 | 46 | 47 | class Sa360CreateReportOperatorTest(googletest.TestCase): 48 | 49 | REQUEST_BODY = 'Test request body' 50 | XCOM_VARIABLE = 'report_id' 51 | RESPONSE_DICT = {'id': 'response_report_id'} 52 | CONTEXT_TASK_INSTANCE = 'task_instance' 53 | ID = 'id' 54 | LOOKBACK_DAYS = 10 55 | 56 | def create_and_execute_default(self): 57 | self.operator = SA360CreateReportOperator( 58 | self.hook, self.request_builder, self.LOOKBACK_DAYS) 59 | self.operator.execute(self.context) 60 | 61 | def setUp(self): 62 | self.hook = mock.MagicMock() 63 | self.context = {self.CONTEXT_TASK_INSTANCE: mock.MagicMock()} 64 | self.hook.get_service().reports().request().execute( 65 | ).__getitem__.side_effect = self.RESPONSE_DICT.__getitem__ 66 | self.hook.reset_mock() 67 | self.request_builder = TestRequestBuilder() 68 | self.expected_start_date = datetime.date.today() - datetime.timedelta( 69 | self.LOOKBACK_DAYS + 1) 70 | self.expected_end_date = datetime.date.today() - datetime.timedelta(1) 71 | 72 | def 
tearDown(self): 73 | pass 74 | 75 | def test_request_body(self): 76 | self.create_and_execute_default() 77 | expected_body = { 78 | 'startDate': self.expected_start_date.isoformat(), 79 | 'endDate': self.expected_end_date.isoformat(), 80 | } 81 | self.hook.get_service().reports( 82 | ).request.assert_called_once_with(body=expected_body) 83 | 84 | def test_request_executed(self): 85 | self.create_and_execute_default() 86 | self.hook.get_service().reports( 87 | ).request().execute.assert_called_once_with() 88 | 89 | def test_id_pushed_to_xcom(self): 90 | self.create_and_execute_default() 91 | self.context[self.CONTEXT_TASK_INSTANCE].xcom_push.assert_called_with( 92 | self.XCOM_VARIABLE, self.RESPONSE_DICT[self.ID]) 93 | 94 | def test_exception_on_get_service(self): 95 | self.hook.get_service = mock.MagicMock(side_effect=TestException) 96 | self.operator = SA360CreateReportOperator( 97 | self.hook, self.request_builder, self.LOOKBACK_DAYS) 98 | with self.assertRaises(TestException): 99 | self.operator.execute(self.context) 100 | 101 | def test_exception_on_request_execution(self): 102 | self.hook.get_service().reports().request().execute = mock.MagicMock( 103 | side_effect=TestException) 104 | self.operator = SA360CreateReportOperator( 105 | self.hook, self.request_builder, self.LOOKBACK_DAYS) 106 | with self.assertRaises(TestException): 107 | self.operator.execute(self.context) 108 | 109 | def test_exception_on_request_builder(self): 110 | self.request_builder.build = mock.MagicMock( 111 | side_effect=TestException) 112 | self.operator = SA360CreateReportOperator( 113 | self.hook, self.request_builder, self.LOOKBACK_DAYS) 114 | with self.assertRaises(TestException): 115 | self.operator.execute(self.context) 116 | 117 | if __name__ == '__main__': 118 | googletest.main() 119 | -------------------------------------------------------------------------------- /composer/sa360_download_report_file_operator.py: 
-------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Operator for downloading SA360 reports. 15 | 16 | """ 17 | 18 | import logging 19 | import os 20 | import tempfile 21 | from airflow import models 22 | 23 | logger = logging.getLogger(__name__) 24 | 25 | 26 | class ReportNotReadyError(RuntimeError): 27 | """Error thrown when a Report is not ready on SA360. 28 | """ 29 | pass 30 | 31 | 32 | class SA360DownloadReportFileOperator(models.BaseOperator): 33 | """Operator to download an SA360 report. 34 | 35 | This Operator downloads an SA360 report into Google Cloud Storage (GCS). It 36 | takes the report ID from the *report_id* XCom variable. 37 | 38 | The report should be ready when execution reaches this operator. You may use 39 | SA360ReportFileAvailableSensor to make sure the report is ready. A 40 | ReportNotReadyError will be raised on execution if the report is found not to 41 | be ready. 42 | """ 43 | 44 | template_fields = ['_filename'] 45 | 46 | def __init__(self, sa360_reporting_hook, gcs_hook, gcs_bucket, 47 | filename, write_header, *args, **kwargs): 48 | """Constructor. 49 | 50 | Args: 51 | sa360_reporting_hook: Airflow hook for creating SA360 Reporting API 52 | service. 53 | gcs_hook: Airflow Hook for connecting to Google Cloud Storage (GCS) 54 | (e.g.: airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook). 
55 | gcs_bucket: Name of the GCS bucket where the files downloaded from SA360 56 | API will be stored. 57 | filename: Name of the output file in GCS (templated). 58 | write_header: Whether the output file should contain headers. 59 | *args: Other arguments for use with airflow.models.BaseOperator. 60 | **kwargs: Other arguments for use with airflow.models.BaseOperator. 61 | """ 62 | super(SA360DownloadReportFileOperator, self).__init__(*args, **kwargs) 63 | self._service = None 64 | self._sa360_reporting_hook = sa360_reporting_hook 65 | self._gcs_hook = gcs_hook 66 | self._gcs_bucket = gcs_bucket 67 | self._filename = filename 68 | self._write_header = write_header 69 | 70 | def execute(self, context): 71 | """Execute operator. 72 | 73 | This method is invoked by Airflow to execute the task. 74 | 75 | Args: 76 | context: Airflow context. 77 | 78 | Raises: 79 | ReportNotReadyError: If report is not ready yet. 80 | ValueError: If no report_id is provided via XCom. 81 | """ 82 | report_id = context['task_instance'].xcom_pull( 83 | task_ids=None, key='report_id') 84 | if not report_id: 85 | raise ValueError('No report_id found in XCom') 86 | if self._service is None: 87 | self._service = self._sa360_reporting_hook.get_service() 88 | 89 | # Get report files 90 | request = self._service.reports().get(reportId=report_id) 91 | response = request.execute() 92 | 93 | if not response['isReportReady']: 94 | raise ReportNotReadyError('Report %s is not ready' % (report_id)) 95 | 96 | no_of_files = len(response['files']) 97 | 98 | temp_filename = None 99 | try: 100 | # Create the temp file with 'delete=False' because we need to close it 101 | # and keep the contents stored for use by gcs_hook.upload(). 
102 | with tempfile.NamedTemporaryFile(delete=False) as temp_file: 103 | temp_filename = temp_file.name 104 | for i in xrange(no_of_files): 105 | request = self._service.reports().getFile( 106 | reportId=report_id, reportFragment=i) 107 | response = request.execute() 108 | # Skip header if this is not the first fragment or if headers are 109 | # disabled 110 | if (not self._write_header) or i > 0: 111 | splitted_response = response.split('\n', 1) 112 | if len(splitted_response) > 1: 113 | response = splitted_response[1] 114 | else: 115 | response = '' 116 | temp_file.writelines(response) 117 | logger.info('Report fragment %d written to file: %s', i, 118 | temp_filename) 119 | self._gcs_hook.upload(self._gcs_bucket, self._filename, temp_filename) 120 | finally: 121 | if temp_filename: 122 | os.unlink(temp_filename) 123 | -------------------------------------------------------------------------------- /composer/sa360_download_report_file_operator_test.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 Google LLC 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | """Tests for sa360_download_report_file_operator.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | import contextlib 21 | import StringIO 22 | 23 | import mock 24 | 25 | from google3.corp.gtech.ads.doubleclick.composer.common import test_utils 26 | from google3.testing.pybase import googletest 27 | 28 | FILENAME = 'test_output_filename' 29 | TEMP_FILENAME = 'test_temp_filename' 30 | GCS_BUCKET = 'test_gcs_bucket' 31 | CONTEXT_TASK_INSTANCE = 'task_instance' 32 | REPORT_ID = 'test_report_id' 33 | REPORT_FILE_IDS = ['file_id1'] 34 | REPORT_HEADER = u'This is a test report. áéíóú\n' 35 | REPORT_CONTENTS_FRAGMENT_1 = u'First fragment contents\n' 36 | REPORT_CONTENTS_FRAGMENT_2 = u'Second fragment contents\n' 37 | REPORT_CONTENTS = ['%s%s' % (REPORT_HEADER, REPORT_CONTENTS_FRAGMENT_1), 38 | '%s%s' % (REPORT_HEADER, REPORT_CONTENTS_FRAGMENT_2)] 39 | GCS_URI = 'gs://%s/%s' 40 | 41 | dependencies = [{'name': 'airflow.models.BaseOperator', 'mock': mock.MagicMock}] 42 | 43 | SA360DownloadReportFileOperator = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 44 | 'google3.corp.gtech.ads.doubleclick.composer.' 45 | 'sa360.sa360_download_report_file_operator.SA360DownloadReportFileOperator', 46 | dependencies) 47 | 48 | ReportNotReadyError = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 49 | 'google3.corp.gtech.ads.doubleclick.composer.' 
50 | 'sa360.sa360_download_report_file_operator.ReportNotReadyError', 51 | dependencies) 52 | 53 | 54 | class TestException(Exception): 55 | pass 56 | 57 | 58 | class Sa360DownloadReportFileOperatorTest(googletest.TestCase): 59 | 60 | def create_stringio_file(self, *unused_args, **unused_kwargs): 61 | file_mock = StringIO.StringIO() 62 | file_mock.name = TEMP_FILENAME 63 | # We don't want the StringIO file to be closed since we can't open it again 64 | # for reading like a normal file 65 | file_mock.close = mock.MagicMock() 66 | self.files[file_mock.name] = file_mock 67 | return contextlib.closing(file_mock) 68 | 69 | def create_read_only_stringio_file(self, *unused_args, **unused_kwargs): 70 | file_mock = mock.MagicMock() 71 | file_mock.name = TEMP_FILENAME 72 | file_mock.writelines.side_effect = OSError 73 | return contextlib.closing(file_mock) 74 | 75 | def setUp(self): 76 | self.report_ready = True 77 | self.files = {} 78 | self.gcs_file_contents = {} 79 | self.sa360_hook = mock.MagicMock() 80 | 81 | def get_report(reportId): # Must use same name as SA360 API method pylint: disable=invalid-name 82 | if reportId != REPORT_ID: 83 | raise TestException('Report not found') 84 | request = mock.MagicMock() 85 | request.execute.return_value = { 86 | 'files': REPORT_FILE_IDS, 87 | 'isReportReady': self.report_ready 88 | } 89 | return request 90 | 91 | def get_report_file(reportId, reportFragment): # Must use same name as SA360 API method pylint: disable=invalid-name 92 | if reportId != REPORT_ID: 93 | raise TestException('Report not found') 94 | request = mock.MagicMock() 95 | request.execute.return_value = REPORT_CONTENTS[reportFragment] 96 | return request 97 | 98 | self.sa360_hook.get_service().reports().get = mock.MagicMock( 99 | side_effect=get_report) 100 | self.sa360_hook.get_service().reports().getFile = mock.MagicMock( 101 | side_effect=get_report_file) 102 | 103 | self.gcs_hook = mock.MagicMock() 104 | 105 | def gcs_upload(gcs_bucket, target_filename, 
source_filename): 106 | self.gcs_file_contents[ 107 | GCS_URI % (gcs_bucket, 108 | target_filename)] = self.files[source_filename].getvalue() 109 | 110 | self.gcs_hook.upload = mock.MagicMock(side_effect=gcs_upload) 111 | 112 | self.context = {CONTEXT_TASK_INSTANCE: mock.MagicMock()} 113 | self.context[CONTEXT_TASK_INSTANCE].xcom_pull.return_value = REPORT_ID 114 | 115 | self.operator = SA360DownloadReportFileOperator( 116 | self.sa360_hook, self.gcs_hook, GCS_BUCKET, FILENAME, True) 117 | 118 | def test_report_contents(self): 119 | with mock.patch( 120 | 'tempfile.NamedTemporaryFile', 121 | mock.MagicMock( 122 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 123 | self.operator.execute(self.context) 124 | self.assertEqual(self.gcs_file_contents[GCS_URI % (GCS_BUCKET, FILENAME)], 125 | REPORT_CONTENTS[0]) 126 | 127 | def test_report_contents_multifragment(self): 128 | 129 | def get_report(reportId): # Must use same name as SA360 API method pylint: disable=invalid-name 130 | if reportId != REPORT_ID: 131 | raise TestException('Report not found') 132 | request = mock.MagicMock() 133 | request.execute.return_value = { 134 | 'files': ['file_id1', 'file_id2'], 135 | 'isReportReady': self.report_ready 136 | } 137 | return request 138 | 139 | self.sa360_hook.get_service().reports().get = mock.MagicMock( 140 | side_effect=get_report) 141 | with mock.patch( 142 | 'tempfile.NamedTemporaryFile', 143 | mock.MagicMock( 144 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 145 | self.operator.execute(self.context) 146 | self.assertEqual( 147 | self.gcs_file_contents[GCS_URI % (GCS_BUCKET, FILENAME)], 148 | u'%s%s%s' % (REPORT_HEADER, REPORT_CONTENTS_FRAGMENT_1, 149 | REPORT_CONTENTS_FRAGMENT_2)) 150 | 151 | def test_no_report_id(self): 152 | with mock.patch( 153 | 'tempfile.NamedTemporaryFile', 154 | mock.MagicMock( 155 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 156 | 
self.context[CONTEXT_TASK_INSTANCE].xcom_pull.return_value = None 157 | with self.assertRaises(ValueError): 158 | self.operator.execute(self.context) 159 | 160 | def test_report_not_found(self): 161 | self.context[CONTEXT_TASK_INSTANCE].xcom_pull.return_value = 'invalid_id' 162 | with mock.patch( 163 | 'tempfile.NamedTemporaryFile', 164 | mock.MagicMock( 165 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 166 | with self.assertRaises(TestException): 167 | self.operator.execute(self.context) 168 | 169 | def test_report_not_finished(self): 170 | self.report_ready = False 171 | with mock.patch( 172 | 'tempfile.NamedTemporaryFile', 173 | mock.MagicMock( 174 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 175 | with self.assertRaises(ReportNotReadyError): 176 | self.operator.execute(self.context) 177 | 178 | def test_cannot_open_temp_file(self): 179 | with mock.patch( 180 | 'tempfile.NamedTemporaryFile', 181 | mock.MagicMock(side_effect=OSError)), mock.patch('os.unlink'): 182 | with self.assertRaises(OSError): 183 | self.operator.execute(self.context) 184 | 185 | def test_cannot_write_to_temp_file(self): 186 | with mock.patch( 187 | 'tempfile.NamedTemporaryFile', 188 | mock.MagicMock(side_effect=self.create_read_only_stringio_file) 189 | ), mock.patch('os.unlink'): 190 | with self.assertRaises(OSError): 191 | self.operator.execute(self.context) 192 | 193 | def test_cannot_upload_to_gcs(self): 194 | self.gcs_hook.upload = mock.MagicMock(side_effect=TestException) 195 | with mock.patch( 196 | 'tempfile.NamedTemporaryFile', 197 | mock.MagicMock( 198 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 199 | with self.assertRaises(TestException): 200 | self.operator.execute(self.context) 201 | 202 | def test_one_temp_file(self): 203 | with mock.patch( 204 | 'tempfile.NamedTemporaryFile', 205 | mock.MagicMock( 206 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 207 | 
self.operator.execute(self.context) 208 | self.assertEqual(len(self.files), 1) 209 | 210 | def test_temp_file_deleted(self): 211 | with mock.patch( 212 | 'tempfile.NamedTemporaryFile', 213 | mock.MagicMock(side_effect=self.create_stringio_file)), mock.patch( 214 | 'os.unlink') as unlink_mock: 215 | self.operator.execute(self.context) 216 | unlink_mock.assert_called_with(TEMP_FILENAME) 217 | 218 | def test_temp_file_deleted_on_gcs_exception(self): 219 | self.gcs_hook.upload = mock.MagicMock(side_effect=TestException) 220 | with mock.patch( 221 | 'tempfile.NamedTemporaryFile', 222 | mock.MagicMock(side_effect=self.create_stringio_file)), mock.patch( 223 | 'os.unlink') as unlink_mock: 224 | try: 225 | self.operator.execute(self.context) 226 | except TestException: 227 | pass 228 | unlink_mock.assert_called_with(TEMP_FILENAME) 229 | 230 | def test_report_contents_no_header(self): 231 | self.operator = SA360DownloadReportFileOperator( 232 | self.sa360_hook, self.gcs_hook, GCS_BUCKET, FILENAME, False) 233 | with mock.patch( 234 | 'tempfile.NamedTemporaryFile', 235 | mock.MagicMock( 236 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 237 | self.operator.execute(self.context) 238 | self.assertEqual(self.gcs_file_contents[GCS_URI % (GCS_BUCKET, FILENAME)], 239 | REPORT_CONTENTS_FRAGMENT_1) 240 | 241 | def test_report_contents_multifragment_no_header(self): 242 | self.operator = SA360DownloadReportFileOperator( 243 | self.sa360_hook, self.gcs_hook, GCS_BUCKET, FILENAME, False) 244 | 245 | def get_report(reportId): # Must use same name as SA360 API method pylint: disable=invalid-name 246 | if reportId != REPORT_ID: 247 | raise TestException('Report not found') 248 | request = mock.MagicMock() 249 | request.execute.return_value = { 250 | 'files': ['file_id1', 'file_id2'], 251 | 'isReportReady': self.report_ready 252 | } 253 | return request 254 | 255 | self.sa360_hook.get_service().reports().get = mock.MagicMock( 256 | side_effect=get_report) 257 | with 
mock.patch( 258 | 'tempfile.NamedTemporaryFile', 259 | mock.MagicMock( 260 | side_effect=self.create_stringio_file)), mock.patch('os.unlink'): 261 | self.operator.execute(self.context) 262 | self.assertEqual( 263 | self.gcs_file_contents[GCS_URI % (GCS_BUCKET, FILENAME)], 264 | u'%s%s' % (REPORT_CONTENTS_FRAGMENT_1, REPORT_CONTENTS_FRAGMENT_2)) 265 | 266 | 267 | if __name__ == '__main__': 268 | googletest.main() 269 | -------------------------------------------------------------------------------- /composer/sa360_report_file_available_sensor.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Sensor to check if SA360 report generation has finished. 15 | """ 16 | 17 | import logging 18 | from airflow.operators import sensors 19 | 20 | logger = logging.getLogger(__name__) 21 | 22 | 23 | class SA360ReportFileAvailableSensor(sensors.BaseSensorOperator): 24 | """Sensor to check if SA360 report generation has finished. 25 | 26 | This sensor calls SA360 API to check if a SA360 report is ready to download. 27 | It makes use of SA360ReportingHook to access the SA360 API. It takes the 28 | SA360 report ID from the *report_id* XCom variable. 29 | """ 30 | 31 | def __init__(self, sa360_reporting_hook, *args, **kwargs): 32 | """Constructor. 
33 | 34 | Args: 35 | sa360_reporting_hook: Airflow hook for creating SA360 Reporting API 36 | service. 37 | *args: Other arguments for use with airflow.models.BaseOperator. 38 | **kwargs: Other arguments for use with airflow.models.BaseOperator. 39 | """ 40 | super(SA360ReportFileAvailableSensor, self).__init__(*args, **kwargs) 41 | self._sa360_reporting_hook = sa360_reporting_hook 42 | self._service = None 43 | 44 | def poke(self, context): 45 | """Check for report ready. 46 | 47 | This method is invoked by Airflow to trigger the sensor. It checks whether 48 | the report is ready. 49 | 50 | Args: 51 | context: Airflow context. 52 | 53 | Returns: 54 | True if report is ready. False otherwise. 55 | 56 | Raises: 57 | KeyError: If *report_id* is not found in XCom 58 | """ 59 | if self._service is None: 60 | self._service = self._sa360_reporting_hook.get_service() 61 | 62 | # Pull report_id from xcom. 63 | report_id = context['task_instance'].xcom_pull( 64 | task_ids=None, key='report_id') 65 | report_ready = False 66 | if report_id is None: 67 | raise KeyError("Report ID key not found in XCom: 'report_id'") 68 | request = self._service.reports().get(reportId=report_id) 69 | response = request.execute() 70 | logger.debug('Report poll response: %s', response) 71 | if response['isReportReady']: 72 | report_ready = True 73 | return report_ready 74 | -------------------------------------------------------------------------------- /composer/sa360_report_file_available_sensor_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Tests for sa360_report_file_available_sensor.""" 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | import mock 20 | from google3.corp.gtech.ads.doubleclick.composer.common import test_utils 21 | from google3.testing.pybase import googletest 22 | 23 | sensors_mock = mock.MagicMock 24 | sensors_mock.BaseSensorOperator = mock.MagicMock 25 | 26 | SA360ReportFileAvailableSensor = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 27 | 'google3.corp.gtech.ads.doubleclick.composer.' 
28 | 'sa360.sa360_report_file_available_sensor.SA360ReportFileAvailableSensor', 29 | [{'name': 'airflow.operators.sensors', 30 | 'mock': sensors_mock} 31 | ]) 32 | 33 | 34 | REPORT_ID = 'test_report_id' 35 | XCOM_REPORT_ID_KEY = 'report_id' 36 | CONTEXT_TASK_INSTANCE = 'task_instance' 37 | RESPONSE_IS_READY_KEY = 'isReportReady' 38 | RESPONSE_READY = {RESPONSE_IS_READY_KEY: True} 39 | RESPONSE_NOT_READY = {RESPONSE_IS_READY_KEY: False} 40 | 41 | 42 | class TestException(Exception): 43 | pass 44 | 45 | 46 | class Sa360ReportFileAvailableSensorTest(googletest.TestCase): 47 | 48 | def setUp(self): 49 | self.hook = mock.MagicMock() 50 | self.context = {CONTEXT_TASK_INSTANCE: mock.MagicMock()} 51 | self.context[CONTEXT_TASK_INSTANCE].xcom_pull.return_value = REPORT_ID 52 | self.sensor = SA360ReportFileAvailableSensor(self.hook) 53 | 54 | def test_report_ready(self): 55 | self.hook.get_service().reports().get().execute( 56 | ).__getitem__.side_effect = RESPONSE_READY.__getitem__ 57 | self.assertTrue(self.sensor.poke(self.context)) 58 | 59 | def test_report_not_ready(self): 60 | self.hook.get_service().reports().get().execute( 61 | ).__getitem__.side_effect = RESPONSE_NOT_READY.__getitem__ 62 | self.assertFalse(self.sensor.poke(self.context)) 63 | 64 | def test_correct_xcom_pull(self): 65 | self.sensor.poke(self.context) 66 | self.context[CONTEXT_TASK_INSTANCE].xcom_pull.assert_called_once_with( 67 | task_ids=None, key=XCOM_REPORT_ID_KEY) 68 | 69 | def test_no_report_id(self): 70 | self.context[CONTEXT_TASK_INSTANCE].xcom_pull.return_value = None 71 | with self.assertRaises(KeyError): 72 | self.sensor.poke(self.context) 73 | 74 | def test_exception_on_execute(self): 75 | self.hook.get_service().reports().get().execute.side_effect = TestException 76 | with self.assertRaises(TestException): 77 | self.sensor.poke(self.context) 78 | 79 | def test_exception_on_get_service(self): 80 | self.hook.get_service.side_effect = TestException 81 | with self.assertRaises(TestException): 
82 | self.sensor.poke(self.context) 83 | 84 | 85 | if __name__ == '__main__': 86 | googletest.main() 87 | -------------------------------------------------------------------------------- /composer/sa360_report_request_builder.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """SA360 report request builder. 15 | 16 | Helper for building SA360 report request bodies for SA360 API. 
17 | """ 18 | 19 | ATTR_SCOPE = 'reportScope' 20 | DEFAULT_COLUMNS = [{ 21 | 'columnName': 'advertiser' 22 | }, { 23 | 'columnName': 'account' 24 | }, { 25 | 'columnName': 'campaign' 26 | }, { 27 | 'columnName': 'keywordText' 28 | }, { 29 | 'columnName': 'keywordMatchType' 30 | }, { 31 | 'columnName': 'keywordMaxCpc' 32 | }, { 33 | 'columnName': 'clicks' 34 | }, { 35 | 'columnName': 'keywordId' 36 | }, { 37 | 'columnName': 'advertiserId' 38 | }] 39 | 40 | # Equivalence between column names in the API and in the UI/Bulksheet 41 | COLUMNS_DICT = { 42 | 'advertiser': 'Advertiser', 43 | 'account': 'Account', 44 | 'campaign': 'Campaign', 45 | 'keywordText': 'Keyword', 46 | 'keywordMatchType': 'Match type', 47 | 'keywordMaxCpc': 'Keyword max CPC', 48 | 'clicks': 'Clicks', 49 | 'keywordId': 'Keyword ID', 50 | 'advertiserId': 'Advertiser ID', 51 | } 52 | 53 | FILTERS = 'filters' 54 | 55 | 56 | class SA360ReportRequestBuilder(object): 57 | """Class for building SA360 report requests. 58 | 59 | """ 60 | 61 | def __init__(self, 62 | agency_id, 63 | advertiser_ids=None, 64 | columns=None, 65 | filter_display_stats=True): 66 | """Constructor. 67 | 68 | Args: 69 | agency_id: ID of the agency for report scope. 70 | advertiser_ids: Defaults to None. Advertiser IDs to filter the report to. 71 | columns: JSON descriptor for the report columns. Defaults to basic set of 72 | columns. 73 | filter_display_stats: Whether Display Stats should be filtered out from 74 | the report (optional. Default: True). 75 | """ 76 | self._agency_id = agency_id 77 | self._advertiser_ids = advertiser_ids 78 | self._columns = columns or DEFAULT_COLUMNS 79 | self._filter_display_stats = filter_display_stats 80 | 81 | def _add_filter(self, request_body, new_filter): 82 | """Add filter to request body. 83 | 84 | Args: 85 | request_body: Request body to which filter should be added. 86 | new_filter: Filter to be added. 
87 | """ 88 | if not request_body.get(FILTERS, None): 89 | request_body[FILTERS] = [] 90 | request_body[FILTERS].append(new_filter) 91 | 92 | def build(self, start_date, end_date): 93 | """Generate request body for keyword report. 94 | 95 | Args: 96 | start_date: datetime.date for report start date. 97 | end_date: datetime.date for report end date. 98 | 99 | Returns: 100 | Report request in JSON dict format. 101 | """ 102 | request_body = { 103 | 'downloadFormat': 'csv', 104 | 'maxRowsPerFile': 100000000, 105 | 'reportType': 'keyword', 106 | 'statisticsCurrency': 'USD', 107 | ATTR_SCOPE: { 108 | 'agencyId': self._agency_id 109 | }, 110 | 'columns': self._columns, 111 | 'timeRange': { 112 | 'startDate': start_date.isoformat(), 113 | 'endDate': end_date.isoformat() 114 | } 115 | } 116 | self._add_filter(request_body, { 117 | 'column': { 118 | 'columnName': 'keywordId' 119 | }, 120 | 'operator': 'notEquals', 121 | 'values': ['0'] 122 | }) 123 | if self._advertiser_ids: 124 | if len(self._advertiser_ids) == 1: 125 | request_body[ATTR_SCOPE]['advertiserId'] = self._advertiser_ids[0] 126 | else: 127 | self._add_filter( 128 | request_body, { 129 | 'column': { 130 | 'columnName': 'advertiserId' 131 | }, 132 | 'operator': 'in', 133 | 'values': self._advertiser_ids 134 | }) 135 | if self._filter_display_stats: 136 | self._add_filter( 137 | request_body, { 138 | 'column': { 139 | 'columnName': 'keywordText' 140 | }, 141 | 'operator': 'notEquals', 142 | 'values': ['Display Network Stats'] 143 | }) 144 | return request_body 145 | 146 | def get_headers(self, ui_names=True): 147 | """Get report headers. 148 | 149 | This method returns a list of strings with the headers for the generated 150 | report. 151 | 152 | Args: 153 | ui_names: Whether headers should follow UI naming (True) or API naming 154 | (False). Defaults to UI (True). 155 | 156 | Returns: 157 | List of strings with this report's headers. 
158 | 159 | Raises: 160 | ValueError: 161 | If any column descriptor in self._columns does not contain exactly one 162 | element. 163 | """ 164 | headers = [] 165 | for column in self._columns: 166 | if len(column) != 1: 167 | raise ValueError('Unexpected number of values in column descriptor: %d' 168 | % len(column)) 169 | api_name = list(column.values())[0] 170 | headers.append(COLUMNS_DICT[api_name] if ui_names else api_name) 171 | return headers 172 | -------------------------------------------------------------------------------- /composer/sa360_report_request_builder_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License.
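For reference, this is the shape of the request body that `SA360ReportRequestBuilder.build()` assembles. This is a hedged sketch, not library output: `'my_agency'` and `'adv1'` are placeholder IDs, most columns are elided, and only the always-present `keywordId` filter is shown.

```python
import datetime

start, end = datetime.date(2018, 3, 1), datetime.date(2018, 3, 31)

# Sketch of the builder's output. With exactly one advertiser ID, the builder
# places 'advertiserId' inside reportScope; with several, it appends an 'in'
# filter instead.
request_body = {
    'downloadFormat': 'csv',
    'reportType': 'keyword',
    'statisticsCurrency': 'USD',
    'reportScope': {'agencyId': 'my_agency', 'advertiserId': 'adv1'},
    'columns': [{'columnName': 'keywordText'}, {'columnName': 'clicks'}],
    'timeRange': {'startDate': start.isoformat(), 'endDate': end.isoformat()},
    'filters': [{'column': {'columnName': 'keywordId'},
                 'operator': 'notEquals',
                 'values': ['0']}],
}

assert request_body['timeRange'] == {'startDate': '2018-03-01',
                                     'endDate': '2018-03-31'}
```

Dates are serialized with `datetime.date.isoformat()`, matching what `build()` does with its `start_date` and `end_date` arguments.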
14 | """Tests for sa360_report_request_builder.""" 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | import datetime 20 | 21 | from google3.testing.pybase import googletest 22 | from google3.corp.gtech.ads.doubleclick.mozart.composer import sa360_report_request_builder as request_builder 23 | 24 | AGENCY_ID = 'test_agency_id' 25 | ADVERTISER_ID = 'test_advertiser_id' 26 | YEAR = 2018 27 | MONTH = 3 28 | DAY = 27 29 | DATE_FORMAT = '%04d-%02d-%02d' 30 | 31 | 32 | class Sa360ReportRequestBuilderTest(googletest.TestCase): 33 | 34 | def setUp(self): 35 | self.expected_body = { 36 | 'downloadFormat': 37 | 'csv', 38 | 'maxRowsPerFile': 39 | 100000000, 40 | 'reportType': 41 | 'keyword', 42 | 'statisticsCurrency': 43 | 'USD', 44 | 'reportScope': { 45 | 'agencyId': AGENCY_ID 46 | }, 47 | 'columns': [{ 48 | 'columnName': 'advertiser' 49 | }, { 50 | 'columnName': 'account' 51 | }, { 52 | 'columnName': 'campaign' 53 | }, { 54 | 'columnName': 'keywordText' 55 | }, { 56 | 'columnName': 'keywordMatchType' 57 | }, { 58 | 'columnName': 'keywordMaxCpc' 59 | }, { 60 | 'columnName': 'clicks' 61 | }, { 62 | 'columnName': 'keywordId' 63 | }, { 64 | 'columnName': 'advertiserId' 65 | }], 66 | 'filters': [{ 67 | 'column': { 68 | 'columnName': 'keywordId' 69 | }, 70 | 'operator': 'notEquals', 71 | 'values': ['0'] 72 | }, 73 | { 74 | 'column': { 75 | 'columnName': 'keywordText' 76 | }, 77 | 'operator': 'notEquals', 78 | 'values': ['Display Network Stats'] 79 | }], 80 | 'timeRange': {} 81 | } 82 | 83 | def test_no_advertiser(self): 84 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID) 85 | start_date = datetime.date(YEAR, MONTH, DAY) 86 | end_date = datetime.date(YEAR, MONTH + 1, DAY) 87 | body = builder.build(start_date, end_date) 88 | self.expected_body['timeRange']['startDate'] = DATE_FORMAT % (YEAR, MONTH, 89 | DAY) 90 | self.expected_body['timeRange']['endDate'] = DATE_FORMAT % (YEAR, MONTH 
+ 1, 91 | DAY) 92 | self.assertEqual(body, self.expected_body) 93 | 94 | def test_advertiser(self): 95 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID, 96 | [ADVERTISER_ID]) 97 | start_date = datetime.date(YEAR, MONTH, DAY) 98 | end_date = datetime.date(YEAR, MONTH + 1, DAY) 99 | body = builder.build(start_date, end_date) 100 | self.expected_body['timeRange']['startDate'] = DATE_FORMAT % (YEAR, MONTH, 101 | DAY) 102 | self.expected_body['timeRange']['endDate'] = DATE_FORMAT % (YEAR, MONTH + 1, 103 | DAY) 104 | self.expected_body['reportScope']['advertiserId'] = ADVERTISER_ID 105 | self.assertEqual(body, self.expected_body) 106 | 107 | def test_advertiser_list(self): 108 | advertiser_id_list = ['adv1', 'adv2', 'adv3'] 109 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID, 110 | advertiser_id_list) 111 | start_date = datetime.date(YEAR, MONTH, DAY) 112 | end_date = datetime.date(YEAR, MONTH + 1, DAY) 113 | body = builder.build(start_date, end_date) 114 | self.expected_body['timeRange']['startDate'] = DATE_FORMAT % (YEAR, MONTH, 115 | DAY) 116 | self.expected_body['timeRange']['endDate'] = DATE_FORMAT % (YEAR, MONTH + 1, 117 | DAY) 118 | self.expected_body['filters'].insert( 119 | 1, { 120 | 'column': { 121 | 'columnName': 'advertiserId' 122 | }, 123 | 'operator': 'in', 124 | 'values': advertiser_id_list 125 | }) 126 | self.assertEqual(body, self.expected_body) 127 | 128 | def test_not_valid_start_date(self): 129 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID) 130 | start_date = 'not_valid' 131 | end_date = datetime.date(YEAR, MONTH + 1, DAY) 132 | with self.assertRaises(AttributeError): 133 | builder.build(start_date, end_date) 134 | 135 | def test_not_valid_end_date(self): 136 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID) 137 | start_date = datetime.date(YEAR, MONTH, DAY) 138 | end_date = 'not_valid' 139 | with self.assertRaises(AttributeError): 140 | builder.build(start_date, end_date) 141 | 142 | 
def test_custom_columns(self): 143 | custom_columns = [{'columnName': 'columnA'}, {'columnName': 'columnB'}] 144 | builder = request_builder.SA360ReportRequestBuilder( 145 | AGENCY_ID, columns=custom_columns) 146 | start_date = datetime.date(YEAR, MONTH, DAY) 147 | end_date = datetime.date(YEAR, MONTH + 1, DAY) 148 | body = builder.build(start_date, end_date) 149 | self.expected_body['timeRange']['startDate'] = DATE_FORMAT % (YEAR, MONTH, 150 | DAY) 151 | self.expected_body['timeRange']['endDate'] = DATE_FORMAT % (YEAR, MONTH + 1, 152 | DAY) 153 | self.expected_body['columns'] = custom_columns 154 | self.assertEqual(body, self.expected_body) 155 | 156 | def test_get_headers_api_names_custom_columns(self): 157 | custom_columns = [{'columnName': 'columnA'}, {'columnName': 'columnB'}] 158 | builder = request_builder.SA360ReportRequestBuilder( 159 | AGENCY_ID, columns=custom_columns) 160 | headers = builder.get_headers(ui_names=False) 161 | self.assertEqual(headers, ['columnA', 'columnB']) 162 | 163 | def test_get_headers(self): 164 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID) 165 | self.assertEqual(builder.get_headers(), [ 166 | 'Advertiser', 167 | 'Account', 168 | 'Campaign', 169 | 'Keyword', 170 | 'Match type', 171 | 'Keyword max CPC', 172 | 'Clicks', 173 | 'Keyword ID', 174 | 'Advertiser ID', 175 | ]) 176 | 177 | def test_get_headers_api_names(self): 178 | builder = request_builder.SA360ReportRequestBuilder(AGENCY_ID) 179 | self.assertEqual( 180 | builder.get_headers(ui_names=False), [ 181 | 'advertiser', 'account', 'campaign', 'keywordText', 182 | 'keywordMatchType', 'keywordMaxCpc', 'clicks', 'keywordId', 183 | 'advertiserId' 184 | ]) 185 | 186 | 187 | if __name__ == '__main__': 188 | googletest.main() 189 | -------------------------------------------------------------------------------- /composer/sa360_reporting_hook.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | 
# Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Airflow hook for accessing the SA360 Reporting API. 15 | 16 | This hook provides access to the SA360 Reporting API. It makes use of 17 | GoogleCloudBaseHook to provide an authenticated API client. 18 | """ 19 | 20 | from airflow.contrib.hooks import gcp_api_base_hook 21 | from apiclient import discovery 22 | 23 | 24 | class SA360ReportingHook(gcp_api_base_hook.GoogleCloudBaseHook): 25 | """Airflow hook to connect to the SA360 Reporting API. 26 | 27 | This hook uses a GoogleCloudBaseHook connection to generate the service 28 | object for the SA360 Reporting API. 29 | 30 | Once the hook is instantiated, you can invoke get_service() to obtain a 31 | service object that you can use to invoke the API. 32 | """ 33 | 34 | def __init__(self, 35 | sa360_report_conn_id='sa360_report_default', 36 | api_name='doubleclicksearch', 37 | api_version='v2'): 38 | """Constructor. 39 | 40 | Args: 41 | sa360_report_conn_id: Airflow connection ID to be used for accessing 42 | the SA360 Reporting API and authenticating requests. Default is 43 | sa360_report_default. 44 | api_name: SA360 API name to be used. Default is doubleclicksearch. 45 | api_version: SA360 API version. Default is v2. 46 | """ 47 | super(SA360ReportingHook, self).__init__( 48 | gcp_conn_id=sa360_report_conn_id) 49 | 50 | self.api_name = api_name 51 | self.api_version = api_version 52 | 53 | def get_service(self): 54 | """Get API service object.
55 | 56 | This is called by Airflow whenever an API client service object is needed. 57 | Returned service object is already authenticated using the Airflow 58 | connection data provided in the sa360_report_conn_id constructor parameter. 59 | 60 | Returns: 61 | Google API client service object. 62 | """ 63 | http_authorized = self._authorize() 64 | return discovery.build(self.api_name, 65 | self.api_version, http=http_authorized) 66 | -------------------------------------------------------------------------------- /composer/sa360_reporting_hook_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | """Tests for sa360_reporting_hook.""" 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | import mock 20 | from google3.corp.gtech.ads.doubleclick.composer.common import test_utils 21 | from google3.testing.pybase import googletest 22 | 23 | 24 | HTTP_AUTHORIZED = 'authorized_http' 25 | 26 | cloud_base_hook_mock = mock.MagicMock 27 | cloud_base_hook_mock._authorize = mock.MagicMock(return_value=HTTP_AUTHORIZED) 28 | 29 | gcp_api_base_hook = mock.MagicMock 30 | gcp_api_base_hook.GoogleCloudBaseHook = cloud_base_hook_mock 31 | 32 | 33 | def mock_build_method(api_name, api_version, http): 34 | return (api_name, api_version, http) 35 | 36 | discovery_mock = mock.MagicMock 37 | discovery_mock.build = mock.MagicMock(side_effect=mock_build_method) 38 | 39 | SA360ReportingHook = test_utils.import_with_mock_dependencies( # Mock dependency injection pylint: disable=invalid-name 40 | 'google3.corp.gtech.ads.doubleclick.composer' 41 | '.sa360.sa360_reporting_hook.SA360ReportingHook', 42 | [{'name': 'airflow.contrib.hooks.gcp_api_base_hook', 43 | 'mock': gcp_api_base_hook}, 44 | {'name': 'apiclient.discovery', 45 | 'mock': discovery_mock} 46 | ]) 47 | 48 | 49 | class TestException(Exception): 50 | pass 51 | 52 | 53 | class Sa360ReportingHookTest(googletest.TestCase): 54 | 55 | API_VERSION = 'v2' 56 | API_NAME = 'doubleclicksearch' 57 | CONNECTION_ID = 'sa360_report_default' 58 | 59 | def setUp(self): 60 | self.hook = SA360ReportingHook() 61 | 62 | def test_get_service(self): 63 | service = self.hook.get_service() 64 | self.assertEqual((self.API_NAME, 65 | self.API_VERSION, HTTP_AUTHORIZED), service) 66 | 67 | def test_custom_api_name(self): 68 | self.hook = SA360ReportingHook(api_name='custom_api_name') 69 | service = self.hook.get_service() 70 | self.assertEqual(('custom_api_name', 71 | self.API_VERSION, HTTP_AUTHORIZED), service) 72 | 73 | def test_custom_api_version(self): 74 | self.hook = 
SA360ReportingHook(api_version='custom_api_version') 75 | service = self.hook.get_service() 76 | self.assertEqual((self.API_NAME, 77 | 'custom_api_version', HTTP_AUTHORIZED), service) 78 | 79 | def test_exception_on_authorize(self): 80 | # Check that exceptions are properly propagated should they be raised. 81 | self.hook._authorize = mock.MagicMock(side_effect=TestException) 82 | with self.assertRaises(TestException): 83 | self.hook.get_service() 84 | 85 | def test_exception_on_build(self): 86 | # Check that exceptions are properly propagated should they be raised. 87 | with mock.patch('google3.corp.gtech.ads.doubleclick.composer.sa360' 88 | '.sa360_reporting_hook.discovery.build') as build_mock: 89 | build_mock.side_effect = TestException 90 | with self.assertRaises(TestException): 91 | self.hook.get_service() 92 | 93 | 94 | if __name__ == '__main__': 95 | googletest.main() 96 | -------------------------------------------------------------------------------- /composer/test_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Helper methods and classes for unit tests. 
15 | """ 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | import importlib 21 | import sys 22 | import types 23 | import mock 24 | 25 | 26 | def _import_dummy_module(module_name): 27 | """Import dummy modules. 28 | 29 | This method imports a dummy module and all its parents based on module name. 30 | Modules created by this method are of type ModuleType. This is useful when 31 | packages are not available at runtime and you need to place mock 32 | implementations of classes in those modules for unit testing purposes (e.g.: 33 | Airflow runtime modules). 34 | 35 | Args: 36 | module_name: Full module name to import as a dummy module. 37 | 38 | Returns: 39 | Newly imported module object. 40 | """ 41 | packages = module_name.split('.') 42 | full_name = '' 43 | package_object = None 44 | for package_name in packages: 45 | if full_name: 46 | full_name += '.' 47 | full_name += package_name 48 | if not sys.modules.get(full_name, None): 49 | new_module = types.ModuleType(full_name) 50 | sys.modules[full_name] = new_module 51 | package_object = sys.modules[full_name] 52 | return package_object 53 | 54 | 55 | def _import(element, dependencies, 56 | patch_path=None): 57 | """Import element with provided dependencies. 58 | 59 | This method recursively injects the provided dependencies, assuming all parent 60 | modules previously exist. 61 | 62 | Args: 63 | element: Fully-qualified name of the object to be imported. 64 | dependencies: List of dependencies. Each is a dict with *name*, *mock* and 65 | *module* entries. 66 | patch_path: Current module path to be patched. Paths are recursively 67 | patched. If patch_path is None, it is bootstrapped to the first 68 | submodule of the first dependency in the list. 69 | 70 | Returns: 71 | Imported object with dependencies injected. 72 | """ 73 | # Work with the first element in the list.
74 | dependency = dependencies[0] 75 | dependency_full_name = dependency['name'] 76 | dependency_mock = dependency['mock'] 77 | dependency_module = dependency['module'] 78 | 79 | # Bootstrap patch_path with the first submodule. 80 | if patch_path is None: 81 | patch_path = '.'.join(dependency_full_name.split('.')[:2]) 82 | 83 | if (patch_path == dependency_module.__name__ or 84 | len(dependency_full_name.split('.')) == 2): 85 | # If we are already at the immediate parent of the dependency to be injected 86 | # Patch the parent module with the dependency module. 87 | with mock.patch(patch_path, new=dependency_module, create=True): 88 | # Patch the dependency with the dependency mock. 89 | with mock.patch(dependency_full_name, new=dependency_mock, create=True): 90 | if len(dependencies) > 1: 91 | # If there are still dependencies to patch, continue recursively. 92 | new_dependencies = dependencies[1:] 93 | return _import(element, new_dependencies) 94 | else: 95 | # Else, we have finished patching dependencies, import element and 96 | # return. 97 | element_module_name = '.'.join(element.split('.')[:-1]) 98 | element_name = element.split('.')[-1] 99 | element_module = importlib.import_module(element_module_name) 100 | result = getattr(element_module, element_name) 101 | return result 102 | else: 103 | # Else, continue recursively importing the next parent of the current 104 | # dependency. 
105 | result = None 106 | print('patch_path: %s' % (patch_path)) 107 | print('dependency_full_name: %s' % (dependency_full_name)) 108 | with mock.patch(patch_path, create=True): 109 | next_parent_module = dependency_full_name.split( 110 | '.')[len(patch_path.split('.'))] 111 | print('next_parent_module: %s' % (next_parent_module)) 112 | new_patch_path = '.'.join([patch_path, next_parent_module]) 113 | result = _import(element, dependencies, new_patch_path) 114 | return result 115 | 116 | 117 | def import_with_mock_dependencies(element_full_name, dependencies): 118 | """Import class with mock dependencies. 119 | 120 | This method imports the provided element (module, class, method, etc.). 121 | In contrast with the standard import mechanism, this method injects the 122 | provided dependencies as mocks before executing the import. That way, 123 | you can import objects whose dependencies are not available, or substitute 124 | those dependencies with mocks. 125 | 126 | Args: 127 | element_full_name: fully-qualified name of the element to be imported. 128 | dependencies: list of dependencies. Each element of the list must be a dict 129 | containing a *name* and a *mock*, where *name* contains the 130 | fully-qualified name of the dependency and *mock* contains a mock.Mock 131 | object to be injected. 132 | 133 | Returns: 134 | Imported object with dependencies injected. 135 | """ 136 | # Dependencies' modules might not exist, so we import them as mock modules 137 | # (if not imported already). 138 | for dependency in dependencies: 139 | # Module path of the dependency to be imported.
140 | module_path = '.'.join(dependency['name'].split('.')[:-1]) 141 | print('module_path: %s' % (module_path)) 142 | dependency['module'] = _import_dummy_module(module_path) 143 | return _import(element_full_name, dependencies) 144 | -------------------------------------------------------------------------------- /dataflow/.gitignore: -------------------------------------------------------------------------------- 1 | target 2 | -------------------------------------------------------------------------------- /dataflow/pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 17 | 20 | 4.0.0 21 | 22 | com.google.cse.mozart 23 | mozart-composer 24 | 0.0.1-SNAPSHOT 25 | 26 | 27 | UTF-8 28 | 3.7.0 29 | 1.6.0 30 | 1.7.25 31 | 32 | 33 | 34 | 35 | ossrh.snapshots 36 | Sonatype OSS Repository Hosting 37 | https://oss.sonatype.org/content/repositories/snapshots/ 38 | 39 | false 40 | 41 | 42 | true 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | org.apache.maven.plugins 51 | maven-compiler-plugin 52 | ${maven-compiler-plugin.version} 53 | 54 | 1.8 55 | 1.8 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | org.codehaus.mojo 64 | exec-maven-plugin 65 | ${exec-maven-plugin.version} 66 | 67 | false 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | com.google.cloud.dataflow 78 | google-cloud-dataflow-java-sdk-all 79 | 2.5.0 80 | 81 | 82 | 83 | 84 | org.slf4j 85 | slf4j-api 86 | ${slf4j.version} 87 | 88 | 89 | org.slf4j 90 | slf4j-jdk14 91 | ${slf4j.version} 92 | 93 | 94 | com.mashape.unirest 95 | unirest-java 96 | 1.4.9 97 | 98 | 99 | 100 | -------------------------------------------------------------------------------- /dataflow/src/main/java/com/google/cse/mozart/Mozart.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2018 Google LLC 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except 5 | * in compliance with the License. 
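The dummy-module registration performed by `_import_dummy_module` in `test_utils.py` above can be exercised on its own. This sketch (where `fake_airflow` is a made-up package name, nothing of that name is installed) seeds `sys.modules` with empty `ModuleType` objects so that a subsequent import succeeds:

```python
import sys
import types

def import_dummy_module(module_name):
    """Register module_name and all of its parents as empty ModuleType objects."""
    full_name = ''
    package_object = None
    for package_name in module_name.split('.'):
        full_name = full_name + '.' + package_name if full_name else package_name
        if full_name not in sys.modules:
            sys.modules[full_name] = types.ModuleType(full_name)
        package_object = sys.modules[full_name]
    return package_object

# 'fake_airflow' is a placeholder; no such package exists on disk.
import_dummy_module('fake_airflow.operators')
import fake_airflow.operators  # resolved straight from sys.modules

assert isinstance(sys.modules['fake_airflow'], types.ModuleType)
assert sys.modules['fake_airflow.operators'].__name__ == 'fake_airflow.operators'
```

This is exactly why the test files in this repo can import Airflow-dependent modules without an Airflow installation: the import system consults `sys.modules` before searching the filesystem.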
You may obtain a copy of the License at 6 | * 7 | * https://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, software distributed under the License 10 | * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 11 | * or implied. See the License for the specific language governing permissions and limitations under 12 | * the License. 13 | */ 14 | package com.google.cse.mozart; 15 | 16 | import java.io.Serializable; 17 | import java.util.ArrayList; 18 | import java.util.Arrays; 19 | import java.util.HashMap; 20 | import java.util.List; 21 | import java.util.Map; 22 | import org.apache.beam.sdk.Pipeline; 23 | import org.apache.beam.sdk.io.TextIO; 24 | import org.apache.beam.sdk.options.Description; 25 | import org.apache.beam.sdk.options.PipelineOptions; 26 | import org.apache.beam.sdk.options.ValueProvider; 27 | import org.apache.beam.sdk.transforms.DoFn; 28 | import org.apache.beam.sdk.transforms.ParDo; 29 | import org.apache.beam.sdk.values.PCollection; 30 | import org.slf4j.Logger; 31 | import org.slf4j.LoggerFactory; 32 | 33 | public class Mozart implements Serializable { 34 | 35 | private static final String H_ROW_TYPE = "Row Type"; 36 | private static final String H_ACTION = "Action"; 37 | private static final String H_STATUS = "Status"; 38 | private static final String V_KEYWORD = "keyword"; 39 | private static final String V_EDIT = "edit"; 40 | private static final String V_ACTIVE = "Active"; 41 | private static final String H_ADVERTISER_ID = "Advertiser ID"; 42 | private static final Logger LOG = LoggerFactory.getLogger(Mozart.class); 43 | 44 | public interface MozartOptions extends PipelineOptions { 45 | @Description("Path to the input keywords file") 46 | ValueProvider<String> getInputKeywordsFile(); 47 | 48 | void setInputKeywordsFile(ValueProvider<String> value); 49 | 50 | @Description("Keyword column names") 51 | ValueProvider<String> getKeywordColumnNames(); 52 | 53 | void
setKeywordColumnNames(ValueProvider<String> value); 54 | 55 | @Description("Output keywords file path") 56 | ValueProvider<String> getOutputKeywordsFile(); 57 | 58 | void setOutputKeywordsFile(ValueProvider<String> value); 59 | 60 | @Description("Advertiser ID. This is used to filter the output so that it can be uploaded to " 61 | + "sFTP") 62 | ValueProvider<String> getAdvertiserId(); 63 | 64 | void setAdvertiserId(ValueProvider<String> value); 65 | 66 | } 67 | 68 | /** 69 | * Get keywords PCollection. 70 | * 71 | * This method returns a PCollection with all the keywords. The PCollection contains elements of 72 | * type {@code Map<String, String>}. Each element represents one keyword. The key is the column 73 | * name, and the value is the value for the column. Column names are the same as those in the 74 | * SA360 UI. 75 | * 76 | * For example, for a given element (a keyword), the keyword text is element.get("Keyword"), and 77 | * the Max CPC is element.get("Keyword max CPC"). 78 | * 79 | * @param options Configuration options. 80 | * @param pipeline Beam pipeline that you want to use for this processing.
81 | * @return PCollection with keywords 82 | */ 83 | public static PCollection<Map<String, String>> getKeywords(MozartOptions options, 84 | Pipeline pipeline) { 85 | 86 | options.getKeywordColumnNames().isAccessible(); 87 | options.getAdvertiserId().isAccessible(); 88 | 89 | return pipeline.apply("MozartReadKeywords", TextIO.read().from(options.getInputKeywordsFile())) 90 | // Create dictionary 91 | .apply("MozartCreateKWDict", ParDo.of(new DoFn<String, Map<String, String>>() { 92 | @ProcessElement 93 | public void processElement(ProcessContext c) { 94 | String[] element = c.element().split(","); 95 | String[] headers = c.getPipelineOptions().as(MozartOptions.class) 96 | .getKeywordColumnNames().get().split(","); 97 | Map<String, String> newOutput = new HashMap<>(); 98 | if (element.length == headers.length) { 99 | for (int i = 0; i < headers.length; i++) { 100 | newOutput.put(headers[i], element[i]); 101 | } 102 | newOutput.put(H_ROW_TYPE, V_KEYWORD); 103 | newOutput.put(H_ACTION, V_EDIT); 104 | newOutput.put(H_STATUS, V_ACTIVE); 105 | c.output(newOutput); 106 | } else { 107 | LOG.warn("Different length for headers and element. header: {}. element: {}", headers, 108 | element); 109 | } 110 | } 111 | })); 112 | } 113 | 114 | /** 115 | * Write keywords to output. 116 | * 117 | * This method writes keywords to GCS, so that the other Mozart elements (Composer) can do the 118 | * sFTP upload to SA360. 119 | * 120 | * You should invoke this method when you have processed the keywords PCollection according to 121 | * your business logic and are ready to push the new values to SA360. 122 | * 123 | * @param options Configuration options. 124 | * @param keywordsAfterLogic PCollection with the keywords.
125 | */ 126 | public static void writeKeywordsOutput(MozartOptions options, 127 | PCollection<Map<String, String>> keywordsAfterLogic) { 128 | keywordsAfterLogic 129 | .apply("MozartFlattenLines", ParDo.of(new DoFn<Map<String, String>, String>() { 130 | @ProcessElement 131 | public void processElement(ProcessContext c) { 132 | MozartOptions options = c.getPipelineOptions().as(MozartOptions.class); 133 | String[] headers = options.getKeywordColumnNames().get().split(","); 134 | final List<String> values = new ArrayList<>(headers.length); 135 | final Map<String, String> element = c.element(); 136 | if (element.get(H_ADVERTISER_ID).equals(options.getAdvertiserId().get())) { 137 | List<String> outputHeaders = new ArrayList<>(); 138 | outputHeaders.add(H_ROW_TYPE); 139 | outputHeaders.add(H_ACTION); 140 | outputHeaders.add(H_STATUS); 141 | outputHeaders.addAll(Arrays.asList(headers)); 142 | outputHeaders.forEach(header -> values.add(element.get(header))); 143 | String valuesString = String.join(",", values); 144 | c.output(valuesString); 145 | } 146 | } 147 | })).apply("MozartWriteKeywords", 148 | TextIO.write().to(options.getOutputKeywordsFile()).withoutSharding()); 149 | } 150 | 151 | } 152 | -------------------------------------------------------------------------------- /dataflow/src/main/java/com/google/cse/mozart/examples/FirebaseInput.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2018 Google LLC 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except 5 | * in compliance with the License. You may obtain a copy of the License at 6 | * 7 | * https://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, software distributed under the License 10 | * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 11 | * or implied. See the License for the specific language governing permissions and limitations under 12 | * the License.
13 | */ 14 | package com.google.cse.mozart.examples; 15 | 16 | import com.google.api.client.googleapis.auth.oauth2.GoogleCredential; 17 | import com.google.cse.mozart.Mozart; 18 | import com.google.cse.mozart.Mozart.MozartOptions; 19 | import com.mashape.unirest.http.HttpResponse; 20 | import com.mashape.unirest.http.Unirest; 21 | import com.mashape.unirest.http.exceptions.UnirestException; 22 | import java.io.IOException; 23 | import java.io.InputStream; 24 | import java.util.Arrays; 25 | import java.util.Map; 26 | import org.apache.beam.sdk.Pipeline; 27 | import org.apache.beam.sdk.options.Description; 28 | import org.apache.beam.sdk.options.PipelineOptionsFactory; 29 | import org.apache.beam.sdk.options.ValueProvider; 30 | import org.apache.beam.sdk.transforms.DoFn; 31 | import org.apache.beam.sdk.transforms.PTransform; 32 | import org.apache.beam.sdk.transforms.ParDo; 33 | import org.apache.beam.sdk.values.PCollection; 34 | import org.slf4j.Logger; 35 | import org.slf4j.LoggerFactory; 36 | 37 | /** 38 | * Mozart example pipeline. 39 | * 40 | *
<p>This is an example of a Beam pipeline that uses Mozart. This example pulls data from the Firebase
41 |  * Realtime Database.
42 |  */
43 | public class FirebaseInput {
44 | 
45 |   // Extend MozartOptions to add your custom options to the pipeline
46 |   interface MozartExampleOptions extends MozartOptions {
47 |     @Description("Path to the custom input file (customer data table)")
48 |     ValueProvider<String> getInputCustomDataFile();
49 | 
50 |     void setInputCustomDataFile(ValueProvider<String> value);
51 | 
52 |     @Description("Custom data column names")
53 |     ValueProvider<String> getCustomDataColumnNames();
54 | 
55 |     void setCustomDataColumnNames(ValueProvider<String> value);
56 |   }
57 | 
58 |   private static final Logger LOG = LoggerFactory.getLogger(FirebaseInput.class);
59 | 
60 |   public static String getFirebaseData(String path) throws UnirestException, IOException {
61 |     // Use the Google credential to generate an access token
62 |     InputStream credentialsStream = FirebaseInput.class.getResourceAsStream("service_account.json");
63 |     GoogleCredential credentials = GoogleCredential.fromStream(credentialsStream);
64 |     credentialsStream.close();
65 |     final GoogleCredential scopedCredentials =
66 |         credentials.createScoped(
67 |             Arrays.asList(
68 |                 "https://www.googleapis.com/auth/firebase.database",
69 |                 "https://www.googleapis.com/auth/userinfo.email"));
70 |     scopedCredentials.refreshToken();
71 |     String token = scopedCredentials.getAccessToken();
72 |     HttpResponse<String> jsonResponse =
73 |         Unirest.get("https://fir-test-a5923.firebaseio.com" + path)
74 |             .queryString("access_token", token)
75 |             .asString();
76 |     return jsonResponse.getBody();
77 |   }
78 | 
79 |   public static void main(String[] args) throws IOException {
80 | 
81 |     // Build your Pipeline as usual
82 |     PipelineOptionsFactory.register(MozartOptions.class);
83 |     MozartExampleOptions options =
84 |         PipelineOptionsFactory.fromArgs(args).withValidation().as(MozartExampleOptions.class);
85 |     options.getCustomDataColumnNames().isAccessible();
86 |     options.getInputCustomDataFile().isAccessible();
87 |     Pipeline p = Pipeline.create(options);
88 | 
89 |     // Use the Google credential to generate an access token
90 | 
91 |     // Define your PTransform. Here is where your business logic is implemented
92 |     PTransform<PCollection<? extends Map<String, String>>, PCollection<Map<String, String>>>
93 |         businessTransform =
94 |             ParDo.of(
95 |                 new DoFn<Map<String, String>, Map<String, String>>() {
96 |                   @ProcessElement
97 |                   public void processElement(ProcessContext c) {
98 |                     final Map<String, String> element = c.element();
99 |                     // Skip if element is empty
100 |                     if (element.size() > 0) {
101 |                       try {
102 |                         String newMaxCPC = getFirebaseData("/maxCPC.json");
103 |                         LOG.info("newMaxCPC from Firebase: {}", newMaxCPC);
104 |                         element.put("Keyword max CPC", newMaxCPC);
105 |                       } catch (Exception e) {
106 |                         LOG.error("Error while processing element", e);
107 |                       }
108 |                       c.output(element);
109 |                     }
110 |                   }
111 |                 });
112 | 
113 |     // Use Mozart to complete the pipeline
114 |     // First, get the PCollection with all the keywords
115 |     PCollection<Map<String, String>> keywordsBeforeLogic = Mozart.getKeywords(options, p);
116 | 
117 |     // Then, apply your business logic over the keywords
118 |     PCollection<Map<String, String>> keywordsAfterLogic =
119 |         keywordsBeforeLogic.apply(businessTransform);
120 | 
121 |     // Lastly, use Mozart to write the keywords output
122 |     Mozart.writeKeywordsOutput(options, keywordsAfterLogic);
123 | 
124 |     // Run your pipeline as usual
125 |     p.run();
126 |   }
127 | }
128 | 
--------------------------------------------------------------------------------
/dataflow/src/main/java/com/google/cse/mozart/examples/GCSInput.java:
--------------------------------------------------------------------------------
1 | /*
2 |  * Copyright 2018 Google LLC
3 |  *
4 |  * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
5 |  * in compliance with the License.
You may obtain a copy of the License at 6 | * 7 | * https://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, software distributed under the License 10 | * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 11 | * or implied. See the License for the specific language governing permissions and limitations under 12 | * the License. 13 | */ 14 | package com.google.cse.mozart.examples; 15 | 16 | import com.google.cse.mozart.Mozart; 17 | import com.google.cse.mozart.Mozart.MozartOptions; 18 | import java.util.HashMap; 19 | import java.util.List; 20 | import java.util.Map; 21 | import org.apache.beam.sdk.Pipeline; 22 | import org.apache.beam.sdk.io.TextIO; 23 | import org.apache.beam.sdk.options.Description; 24 | import org.apache.beam.sdk.options.PipelineOptionsFactory; 25 | import org.apache.beam.sdk.options.ValueProvider; 26 | import org.apache.beam.sdk.transforms.DoFn; 27 | import org.apache.beam.sdk.transforms.PTransform; 28 | import org.apache.beam.sdk.transforms.ParDo; 29 | import org.apache.beam.sdk.transforms.View; 30 | import org.apache.beam.sdk.values.PCollection; 31 | import org.apache.beam.sdk.values.PCollectionView; 32 | import org.slf4j.Logger; 33 | import org.slf4j.LoggerFactory; 34 | 35 | /** 36 | * Mozart example pipeline. 37 | * 38 | *
<p>This is an example of a Beam pipeline that uses Mozart.
39 |  */
40 | public class GCSInput {
41 | 
42 |   // Extend MozartOptions to add your custom options to the pipeline
43 |   interface MozartExampleOptions extends MozartOptions {
44 |     @Description("Path to the custom input file (customer data table)")
45 |     ValueProvider<String> getInputCustomDataFile();
46 | 
47 |     void setInputCustomDataFile(ValueProvider<String> value);
48 | 
49 |     @Description("Custom data column names")
50 |     ValueProvider<String> getCustomDataColumnNames();
51 | 
52 |     void setCustomDataColumnNames(ValueProvider<String> value);
53 |   }
54 | 
55 |   private static final Logger LOG = LoggerFactory.getLogger(GCSInput.class);
56 | 
57 |   public static void main(String[] args) {
58 | 
59 |     // Build your Pipeline as usual
60 |     PipelineOptionsFactory.register(MozartOptions.class);
61 |     MozartExampleOptions options =
62 |         PipelineOptionsFactory.fromArgs(args).withValidation().as(MozartExampleOptions.class);
63 |     options.getCustomDataColumnNames().isAccessible();
64 |     Pipeline p = Pipeline.create(options);
65 | 
66 |     // Load additional custom (non-SA360) data. For example: a file containing information about
67 |     // whether a certain brand is in promotion state
68 |     PCollection<Map<String, String>> customData =
69 |         p.apply("ReadCustomData", TextIO.read().from(options.getInputCustomDataFile()))
70 |             // Create dictionary
71 |             .apply(
72 |                 "CreateCustomDataDict",
73 |                 ParDo.of(
74 |                     new DoFn<String, Map<String, String>>() {
75 |                       @ProcessElement
76 |                       public void processElement(ProcessContext c) {
77 |                         String[] element = c.element().split(",");
78 |                         String[] headers =
79 |                             c.getPipelineOptions()
80 |                                 .as(MozartExampleOptions.class)
81 |                                 .getCustomDataColumnNames()
82 |                                 .get()
83 |                                 .split(",");
84 |                         Map<String, String> newOutput = new HashMap<>();
85 |                         if (element.length == headers.length) {
86 |                           for (int i = 0; i < headers.length; i++) {
87 |                             newOutput.put(headers[i], element[i]);
88 |                           }
89 |                           c.output(newOutput);
90 |                         } else {
91 |                           LOG.warn(
92 |                               "Different length for headers and element. header: {}. element: {}",
93 |                               headers,
94 |                               element);
95 |                         }
96 |                       }
97 |                     }));
98 | 
99 |     // Create a view of your custom data (this is necessary to pass it as side input to your
100 |     // main PTransform)
101 |     PCollectionView<List<Map<String, String>>> customDataView =
102 |         customData.apply("CreateCustomDataView", View.asList());
103 | 
104 |     final String maxCPCPromo = "5";
105 |     final String maxCPCNormal = "1";
106 | 
107 |     // Define your PTransform. Here is where your business logic is implemented
108 |     PTransform<PCollection<? extends Map<String, String>>, PCollection<Map<String, String>>>
109 |         businessTransform =
110 |             ParDo.of(
111 |                     new DoFn<Map<String, String>, Map<String, String>>() {
112 |                       @ProcessElement
113 |                       public void processElement(ProcessContext c) {
114 |                         List<Map<String, String>> customData = c.sideInput(customDataView);
115 |                         final Map<String, String> element = c.element();
116 |                         // Skip if element is empty
117 |                         if (element.size() > 0) {
118 |                           String newMaxCPC = maxCPCNormal;
119 |                           for (Map<String, String> customEntry : customData) {
120 |                             if (element.get("Keyword").contains(customEntry.get("brand"))) {
121 |                               if (customEntry.get("status").equals("promotion")) {
122 |                                 newMaxCPC = maxCPCPromo;
123 |                               }
124 |                             }
125 |                           }
126 |                           element.put("Keyword max CPC", newMaxCPC);
127 |                           c.output(element);
128 |                         }
129 |                       }
130 |                     })
131 |                 .withSideInputs(customDataView);
132 | 
133 |     // Use Mozart to complete the pipeline
134 |     // First, get the PCollection with all the keywords
135 |     PCollection<Map<String, String>> keywordsBeforeLogic = Mozart.getKeywords(options, p);
136 | 
137 |     // Then, apply your business logic over the keywords
138 |     PCollection<Map<String, String>> keywordsAfterLogic =
139 |         keywordsBeforeLogic.apply(businessTransform);
140 | 
141 |     // Lastly, use Mozart to write the keywords output
142 |     Mozart.writeKeywordsOutput(options, keywordsAfterLogic);
143 | 
144 |     // Run your pipeline as usual
145 |     p.run();
146 |   }
147 | }
148 | 
--------------------------------------------------------------------------------
/doc/images/environment-set0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set0.png -------------------------------------------------------------------------------- /doc/images/environment-set1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set1.png -------------------------------------------------------------------------------- /doc/images/environment-set2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set2.png -------------------------------------------------------------------------------- /doc/images/environment-set3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set3.png -------------------------------------------------------------------------------- /doc/images/environment-set4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set4.png -------------------------------------------------------------------------------- /doc/images/environment-set5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set5.png -------------------------------------------------------------------------------- /doc/images/environment-set6.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/google/mozart/21cdfef70a8acb8814d17575f458c53285f7f51a/doc/images/environment-set6.png --------------------------------------------------------------------------------
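
Both `Mozart.getKeywords` ("MozartCreateKWDict") and the "CreateCustomDataDict" DoFn in `GCSInput.java` build a per-row dictionary the same way: split a header line and a data line on commas, zip them into a `Map`, and drop any row whose column count does not match the header. That step can be sketched standalone; the class `RowToDict` and its `parse` method below are illustrative names, not part of this repository, and, like the pipeline code, the sketch uses a naive `split(",")` with no quoted-field handling.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not part of this repo) of the dictionary-building step that
// the MozartCreateKWDict and CreateCustomDataDict DoFns perform: zip a comma-split
// header line with a comma-split data line into a column-name -> value map.
public class RowToDict {

  // Returns null on a column-count mismatch, mirroring the DoFn branch that logs
  // "Different length for headers and element" and emits nothing.
  public static Map<String, String> parse(String headerLine, String dataLine) {
    String[] headers = headerLine.split(",");
    String[] element = dataLine.split(",");
    if (element.length != headers.length) {
      return null;
    }
    Map<String, String> row = new HashMap<>();
    for (int i = 0; i < headers.length; i++) {
      row.put(headers[i], element[i]);
    }
    return row;
  }

  public static void main(String[] args) {
    Map<String, String> row = parse("Keyword,Keyword max CPC", "blue shoes,1");
    System.out.println(row.get("Keyword"));            // blue shoes
    System.out.println(parse("a,b", "1,2,3") == null); // true
  }
}
```

Dropping mismatched rows rather than emitting partial maps is the same design choice the pipeline makes: a short row would otherwise produce `null` values for trailing columns such as `Keyword max CPC` and corrupt the bulksheet written by `writeKeywordsOutput`.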