├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── cloud_function ├── main.py ├── main_test.py └── requirements.txt ├── cocoa ├── __init__.py ├── cocoa_template.ipynb ├── nearest_consented_customers.py ├── nearest_consented_customers_test.py ├── preprocess.py ├── preprocess_test.py └── testing_constants.py ├── generate_template.sh ├── pipeline.py ├── pipeline_test.py ├── requirements.txt └── setup.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, gender identity and expression, level of 9 | experience, education, socio-economic status, nationality, personal appearance, 10 | race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or reject 41 | comments, commits, code, wiki edits, issues, and other contributions that are 42 | not aligned to this Code of Conduct, or to ban temporarily or permanently any 43 | contributor for other behaviors that they deem inappropriate, threatening, 44 | offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | This Code of Conduct also applies outside the project spaces when the Project 56 | Steward has a reasonable belief that an individual's behavior may have a 57 | negative impact on the project or its community. 58 | 59 | ## Conflict Resolution 60 | 61 | We do not believe that all conflict is bad; healthy debate and disagreement 62 | often yield positive results. 
However, it is never okay to be disrespectful or 63 | to engage in behavior that violates the project’s code of conduct. 64 | 65 | If you see someone violating the code of conduct, you are encouraged to address 66 | the behavior directly with those involved. Many issues can be resolved quickly 67 | and easily, and this gives people more control over the outcome of their 68 | dispute. If you are unable to resolve the matter for any reason, or if the 69 | behavior is threatening or harassing, report it. We are dedicated to providing 70 | an environment where participants feel welcome and safe. 71 | 72 | Reports should be directed to *[PROJECT STEWARD NAME(s) AND EMAIL(s)]*, the 73 | Project Steward(s) for *[PROJECT NAME]*. It is the Project Steward’s duty to 74 | receive and address reported violations of the code of conduct. They will then 75 | work with a committee consisting of representatives from the Open Source 76 | Programs Office and the Google Open Source Strategy team. If for any reason you 77 | are uncomfortable reaching out to the Project Steward, please email 78 | opensource@google.com. 79 | 80 | We will investigate every complaint, but you may not receive a direct response. 81 | We will use our discretion in determining when and how to follow up on reported 82 | incidents, which may range from not taking action to permanent expulsion from 83 | the project and project-sponsored spaces. We will notify the accused of the 84 | report and provide them an opportunity to discuss it before any action is taken. 85 | The identity of the reporter will be omitted from the details of the report 86 | supplied to the accused. In potentially harmful situations, such as ongoing 87 | harassment or threats to anyone's safety, we may take action without notice. 88 | 89 | ## Attribution 90 | 91 | This Code of Conduct is adapted from the Contributor Covenant, version 1.4, 92 | available at 93 | https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 94 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a Contributor License 9 | Agreement (CLA). You (or your employer) retain the copyright to your 10 | contribution; this simply gives us permission to use and redistribute your 11 | contributions as part of the project. Head over to 12 | to see your current agreements on file or 13 | to sign a new one. 14 | 15 | You generally only need to submit a CLA once, so if you've already submitted one 16 | (even if it was for a different project), you probably don't need to do it 17 | again. 18 | 19 | ## Code Reviews 20 | 21 | All submissions, including submissions by project members, require review. We 22 | use GitHub pull requests for this purpose. Consult 23 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 24 | information on using pull requests. 25 | 26 | ## Community Guidelines 27 | 28 | This project follows 29 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/). 
30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 
62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 
180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Consent-based Conversion Adjustments 2 | 3 | ## Problem statement 4 | 5 | Given regulatory requirements, customers have the choice to accept or decline 6 | third-party cookies. For those who opt-out of third-party cookie tracking 7 | (hereafter, non-consenting customers), data on their conversions on an 8 | advertiser's website cannot be shared with Smart Bidding. This potential data 9 | loss can lead to worse bidding performance, or drifts in the bidding behaviour 10 | away from the advertiser's initial goals. 11 | 12 | We have developed a solution that allows advertisers to capitalise on their 13 | first-party data in order to statistically up-weight conversion values of 14 | customers that gave consent. By doing this, advertisers have the possibility to 15 | feed back up to 100% of the factual conversion values back into Smart Bidding. 16 | 17 | ## Solution description 18 | 19 | We take the following approach: For all consenting and non-consenting customers 20 | that converted on a given day, the advertiser has access to first-party data 21 | that describes the customers. Examples could be the adgroup-title that a 22 | conversion is attributed to, the device type used, or demographic information. 23 | Based on this information, a feature space can be created that describes each 24 | consenting and non-consenting customer. Importantly, this feature space has to 25 | be the same for all customers. 26 | 27 | Given this feature space, we can create a distance-graph for all *consenting* 28 | customers in our dataset, and find the nearest consenting customers for each 29 | *non-consenting* customer. This is done using a NearestNeighbor model. The 30 | non-consenting customer's conversion value can then be split across all 31 | identified nearest consenting customers, in proportion to the similarity between 32 | the non-consenting and the non-consenting customers. 33 | 34 | ## Model Parameters 35 | 36 | * Distance metric: We need to define the distance metric to use when 37 | determining the nearest consenting customers. By default, this is set to 38 | `manhattan distance`. 
39 | * Radius, number of nearest neighbors, or percentile: In coordination with the 40 | advertiser and depending on the dataset as well as business requirements, 41 | the user can choose between: 42 | * setting a fixed radius within which all nearest neighbors should be 43 | selected, 44 | * setting a fixed number of nearest neighbors that should be selected for 45 | each non-consenting customer, independent of their distance to them 46 | * finding the required radius to ensure that at least `x%` of 47 | non-consenting customers would have at least one sufficiently close 48 | neighbor. 49 | 50 | ## Data requirements 51 | 52 | As mentioned above, consenting and non-consenting customers must lie in the same 53 | feature space. This is currently achieved by considering the adgroup a given 54 | customer has clicked on and splitting it according to the advertiser's logic. 55 | This way, customers that came through similar adgroups are considered being more 56 | similar to each other. All customers to be considered must have a valid 57 | conversion value larger zero and must not have missing data. 58 | 59 | ## How to use the solution 60 | 61 | This solution uses an Apache Beam pipeline to find the nearest consenting 62 | customers for each non-consenting customer. The following instructions show how 63 | to run the pipeline on Google Cloud Dataflow, however any other suitable Apache 64 | Beam runner may be used as well. 65 | 66 | ### Installation 67 | 68 | > Note: This solution requires 3.6 <= Python < 3.9 as Beam does not currently 69 | > support Python 3.9. 70 | 71 | #### Set up Dataflow Template 72 | 73 | * Navigate to your Google Cloud Project and activate the Cloud Shell 74 | * Set the current project by running `gcloud config set project 75 | [YOUR_PROJECT_ID]` 76 | * Clone this repository and `cd` into the project directory 77 | * Download pyenv as described 78 | [here](https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-VirtualEnvironmentswithpyenv). 79 | * Create and activate a virtual environment as follows: 80 | 81 | ``` 82 | pyenv install 3.8.12 83 | pyenv virtualenv 3.8.12 env 84 | pyenv activate env 85 | ``` 86 | 87 | * Install python3 dependencies `pip3 install -r requirements.txt` 88 | 89 | * Create a GCS bucket (if one does not already exist) where the Dataflow 90 | template as well as all inputs and outputs will be stored 91 | 92 | * Set an environment variable with the name of the bucket `export 93 | PIPELINE_BUCKET=[YOUR_CLOUD_STORAGE_BUCKET_NAME]` 94 | 95 | * To read data from BigQuery, we need to know the project containing your 96 | BigQuery tables. Set an environment variable `export 97 | BQ_PROJECT_ID=[YOUR_BIGQUERY_PROJECT_ID]` 98 | 99 | * Additionally, set the location of your BigQuery tables `export 100 | BIGQUERY_LOCATION=[YOUR_BIGQUERY_LOCATION]` e.g. 'EU' for Europe 101 | 102 | * Set an environment variable with the name of your BigQuery table with 103 | consenting user data `export TABLE_CONSENT=[CONSENTING_USER_TABLE_NAME]` 104 | 105 | * Set an environment variable with the name of your BigQuery table with 106 | non-consenting user data `export 107 | TABLE_NOCONSENT=[NON_CONSENTING_USER_TABLE_NAME]` 108 | 109 | * Set an environment variable with the name of the data column in your tables 110 | `export DATE_COLUMN=[DATA_COLUMN_NAME]` 111 | 112 | * To up-weight the conversion values in our dataset, we need to know which 113 | column represents the conversion values in the input data. 
Set an 114 | environment variable with the name of the conversion column `export 115 | CONVERSION_COLUMN=[YOUR_CONVERSION_COLUMN_NAME]` 116 | 117 | * The final output of the pipeline is a CSV file that may be used for Offline 118 | Conversion Import (OCI) into Google Ads or Google Marketing Platform (GMP). 119 | Each row of this OCI CSV must be unique. Set an environment variable with 120 | the list of columns in the input data that together form a unique ID `export 121 | ID_COLUMNS=[COMMA_SEPARATED_ID_COLUMNS_LIST]` e.g. export 122 | ID_COLUMNS=GCLID,TIMESTAMP,ADGROUP_NAME (**no spaces** between the commas!) 123 | 124 | * You may want to exclude some columns in your data from being used for 125 | matching. Set an environment variable with the list of columns in the input 126 | data that should be dropped e.g. `export DROP_COLUMNS=feature2,feature5` 127 | 128 | * Provide all categorical columns in your data that should not be 129 | [dummy-coded](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) 130 | e.g. `export non_dummy_columns=GCLID,TIMESTAMP` 131 | 132 | * Set an environment variable with the project id `export 133 | PROJECT_ID=[YOUR_PROJECT_ID]` 134 | 135 | * Set an environment variable with the 136 | [regional endpoint](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints) 137 | to deploy your Dataflow job `export 138 | PIPELINE_REGION=[YOUR_REGIONAL_ENDPOINT]` 139 | 140 | * Generate the template by running `./generate_template.sh` 141 | 142 | * Deactivate the virtual env by typing `pyenv deactivate` and close cloud 143 | shell 144 | 145 | #### Set up Cloud Function 146 | 147 | The Apache Beam pipeline that we set up above will be triggered by a Cloud 148 | Function. Following instructions show how to set up the Cloud Function: 149 | 150 | * Open Cloud Functions from the navigation menu in your Google Cloud Project 151 | * If not done already, enable the Cloud Functions and Cloud Build APIs 152 | * Select `Create function` and fill in the required fields such as `Function 153 | name` and `Region`. Choose Cloud Pub/Sub as a trigger and create a new 154 | topic. We will later write to this topic whenever the BigQuery tables have 155 | new data, thereby triggering the cloud function 156 | * Under runtime setting, set timeout to at least 60 seconds to give ample time 157 | for the Cloud Function to run. Click next 158 | * Upload the contents of the `cloud_function` directory found in the repo to 159 | Cloud Functions 160 | * Select Python 3.8 as Runtime and set Entry point to "run". 161 | * Update the required values in `main.py` as marked by `TODO(): ...` 162 | * Deploy the Cloud Function 163 | 164 | #### Set up Cloud Logging to Pub/Sub sink 165 | 166 | > Note: For this section, we assume that you wish to trigger the Dataflow 167 | > pipeline whenever new data is inserted in the non-consented or consented 168 | > tables. If you have a different requirement, proceed accordingly with setting 169 | > up a trigger for the Cloud Function. See also: 170 | > [Using Cloud Scheduler and Pub/Sub to trigger a Cloud Function](https://cloud.google.com/scheduler/docs/tut-pub-sub). 171 | 172 | * In Cloud Logging on your Google Cloud Project, filter to the relevant 173 | BigQuery event. 
For example, to filter by table inserts, use: 174 | 175 | ``` 176 | protoPayload.serviceName="bigquery.googleapis.com" 177 | protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" 178 | protoPayload.resourceName="projects/[YOUR_PROJECT_ID]/datasets/[YOUR_DATASET]/tables/[YOUR_TABLE_NAME]" 179 | resource.labels.project_id="[YOUR_PROJECT_ID]" 180 | protoPayload.metadata.tableDataChange.reason="QUERY" 181 | ``` 182 | 183 | Once the relevant event is available, create a sink that routes your logs to 184 | the Pub/Sub topic defined above. For more information on creating sinks, see 185 | the 186 | [documentation](https://cloud.google.com/logging/docs/export/configure_export_v2). 187 | 188 | * With this in place, the Dataflow pipeline should get triggered whenever new 189 | data is inserted in your Bigquery tables. 190 | 191 | ## Contributing 192 | 193 | See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details. 194 | 195 | ## License 196 | 197 | Apache 2.0; see [`LICENSE`](LICENSE) for details. 198 | 199 | ## Disclaimer 200 | 201 | This is not an official Google product. 202 | -------------------------------------------------------------------------------- /cloud_function/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """A Cloud Function that triggers the CoCoA Dataflow pipeline.""" 16 | import datetime 17 | import os 18 | from typing import Any, Dict, Sequence 19 | 20 | from google.cloud import bigquery 21 | from google.cloud import storage 22 | from googleapiclient.discovery import build 23 | 24 | # TODO(): Update the project and dataflow parameters below. 25 | # Project and bigquery related parameters. 26 | _PROJECT = 'cocoa' 27 | _GCS_BUCKET = 'cocoa-df' 28 | _BIGQUERY_LOCATION = 'EU' 29 | _DATAFLOW_REGION = 'europe-west3' 30 | _DATAFLOW_SUBNET = 'default' 31 | _DATE_COLUMN = 'conversion_date' 32 | _TABLE_NAME_NOCONSENT = 'pipeline_test.noconsent' 33 | _INPUT_FILE_PATH = 'input' 34 | _LOOKBACK_WINDOW = 1 35 | 36 | # Dataflow parameters. 37 | JOB = 'cocoa-cloud-function-test' 38 | TEMPLATE = f'gs://{_GCS_BUCKET}/templates/cocoa-template' 39 | PARAMETERS = { 40 | 'metric': 'manhattan', 41 | 'number_nearest_neighbors': '1', 42 | } 43 | ENVIRONMENT = { 44 | 'subnetwork': 45 | f'https://www.googleapis.com/compute/v1/projects/{_PROJECT}/regions/{_DATAFLOW_REGION}/subnetworks/{_DATAFLOW_SUBNET}', 46 | } 47 | 48 | 49 | def run( 50 | event: Dict[str, Any], context: Any # Cloud Function so pylint: disable=unused-argument 51 | ) -> None: 52 | """Background Cloud Function to be triggered by Cloud Logging. 53 | 54 | Prepares input data for the Dataflow pipeline and then triggers the pipeline. 55 | 56 | Args: 57 | event : The dictionary with data specific to this type of event. 
For 58 | details, see: 59 | https://cloud.google.com/functions/docs/samples/functions-log-stackdriver#code-sample 60 | context: Metadata of triggering event. 61 | 62 | Raises: 63 | RuntimeError: A dependency was not found, requiring this CF to exit. 64 | The RuntimeError is not raised explicitly in this function but is default 65 | behavior for any Cloud Function. 66 | 67 | Returns: 68 | None. 69 | """ 70 | _prepare_pipeline_input() 71 | 72 | dataflow = build('dataflow', 'v1b3') 73 | request = dataflow.projects().locations().templates().launch( 74 | projectId=_PROJECT, 75 | gcsPath=TEMPLATE, 76 | location=_DATAFLOW_REGION, 77 | body={ 78 | 'jobName': JOB, 79 | 'parameters': PARAMETERS, 80 | 'environment': ENVIRONMENT, 81 | }) 82 | 83 | request.execute() 84 | 85 | 86 | def _prepare_pipeline_input() -> None: 87 | """Prepares dates to be processed for the CoCoA Dataflow pipeline. 88 | 89 | The CoCoA Dataflow pipeline requires an input file containing dates to 90 | process. This function prepares this file and uploads it to a Cloud Storage 91 | bucket. 92 | """ 93 | latest_date = _get_latest_date_from_bigquery(_TABLE_NAME_NOCONSENT, 94 | _BIGQUERY_LOCATION, _PROJECT, 95 | _DATE_COLUMN) 96 | dates_to_process = _get_dates_to_process(latest_date, _LOOKBACK_WINDOW) 97 | 98 | # Prepare string of newline separated dates and write to GCS 99 | write_to_gcs(_GCS_BUCKET, _INPUT_FILE_PATH, 'dates.txt', 100 | '\n'.join(map(datetime.date.isoformat, dates_to_process))) 101 | 102 | 103 | def _get_dates_to_process( 104 | start_date: str, 105 | lookback_window: int) -> Sequence[datetime.date]: 106 | """Generates a sequence of dates. 107 | 108 | Creates a sequence of dates based on the given start date and lookback window. 109 | 110 | Args: 111 | start_date: Starting date for the sequence. 112 | lookback_window: Number of days in the past. 113 | 114 | Returns: 115 | A sequence of dates ranging from start_date - lookback_window to the 116 | start_date. 117 | """ 118 | return [ 119 | datetime.date.fromisoformat(start_date) - 120 | datetime.timedelta(days=delta) for delta in range(lookback_window) 121 | ] 122 | 123 | 124 | def _get_latest_date_from_bigquery(table_name: str, location: str, project: str, 125 | date_column: str) -> str: 126 | """Gets the latest date from a date column in a BigQuery table.""" 127 | bq_client = bigquery.Client(location=location, project=project) 128 | query = f""" 129 | SELECT FORMAT_DATETIME("%F", MAX({date_column})) AS input_date 130 | FROM `{table_name}` 131 | """ 132 | results_iter = bq_client.query(query).result() 133 | # There is only one row as we query by max(), which we now return. 134 | return next(results_iter).input_date 135 | 136 | 137 | def write_to_gcs(bucket_name: str, path: str, filename: str, data: str) -> None: 138 | """Writes the given data to a Google Cloud Storage Bucket.""" 139 | gcs_client = storage.Client() 140 | gcs_bucket = gcs_client.get_bucket(bucket_name) 141 | gcs_bucket.blob(os.path.join(path, 142 | filename)).upload_from_string(data, 'text/csv') 143 | -------------------------------------------------------------------------------- /cloud_function/main_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Tests for main.""" 16 | import datetime 17 | 18 | from absl.testing import parameterized 19 | from consent_based_conversion_adjustments.cloud_function import main 20 | 21 | 22 | class PipelineTest(parameterized.TestCase): 23 | 24 | @parameterized.named_parameters( 25 | dict( 26 | testcase_name='_start_date_as_single_entry_list', 27 | start_date='2021-12-15', 28 | lookback_window=1, 29 | expected_output=[datetime.date.fromisoformat('2021-12-15')]), 30 | dict( 31 | testcase_name='_list_with_start_date_and_previous_day', 32 | start_date='2021-12-15', 33 | lookback_window=2, 34 | expected_output=[ 35 | datetime.date.fromisoformat('2021-12-15'), 36 | datetime.date.fromisoformat('2021-12-14') 37 | ])) 38 | def test_get_dates_to_process_returns(self, start_date, lookback_window, 39 | expected_output): 40 | result = main.get_dates_to_process(start_date, lookback_window) 41 | self.assertListEqual(result, expected_output) 42 | -------------------------------------------------------------------------------- /cloud_function/requirements.txt: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | google-api-python-client 16 | google-cloud-bigquery 17 | google-cloud-storage 18 | -------------------------------------------------------------------------------- /cocoa/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | -------------------------------------------------------------------------------- /cocoa/cocoa_template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "x1LyGCNZhWBH" 7 | }, 8 | "source": [ 9 | "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", 10 | " \u003ctd\u003e\n", 11 | " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/google/consent-based-conversion-adjustments/blob/main/cocoa/cocoa_template.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", 12 | " \u003c/td\u003e\n", 13 | " \u003ctd\u003e\n", 14 | " \u003ca target=\"_blank\" href=\"https://github.com/google/consent-based-conversion-adjustments/blob/main/cocoa/cocoa_template.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", 15 | " \u003c/td\u003e\n", 16 | "\u003c/table\u003e" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "wLQsEboILf9_" 23 | }, 24 | "source": [ 25 | "###### License" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "mMCxIddRJkps" 32 | }, 33 | "source": [ 34 | " Copyright 2020 Google LLC\n", 35 | "\n", 36 | " Licensed under the Apache License, Version 2.0 (the \"License\");\n", 37 | " you may not use this file except in compliance with the License.\n", 38 | " You may obtain a copy of the License at\n", 39 | "\n", 40 | " http://www.apache.org/licenses/LICENSE-2.0\n", 41 | "\n", 42 | " Unless required by applicable law or agreed to in writing, software\n", 43 | " distributed under the License is distributed on an \"AS IS\" BASIS,\n", 44 | " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", 45 | " See the License for the specific language governing permissions and\n", 46 | " limitations under the License." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": { 52 | "id": "4RwBf7lfJxPs" 53 | }, 54 | "source": [ 55 | "# Consent-based Conversion Adjustments\n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": { 61 | "id": "-rGQ-Vm1Lnh_" 62 | }, 63 | "source": [ 64 | "In this notebook, we are illustrating how we can use a non-parametric model (based on k nearest-neighbors) to redistribute conversion values of customers opting out of advertising cookies over customers who opt in. \n", 65 | "The resulting conversion-value adjustments can be used within value-based bidding to prevent biases in the bidding-algorithm due to systematic differences between customers who opt in vs customers who don't." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "-IENCIRJNyqF" 72 | }, 73 | "source": [ 74 | "# Imports" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "id": "yLBUAG1O5S2S" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "!pip install git+https://github.com/google/consent-based-conversion-adjustments.git\n", 86 | "from IPython.display import clear_output\n", 87 | "\n", 88 | "clear_output()\n" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "id": "ZRAT4tTpZY51" 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "from itertools import combinations\n", 100 | "import typing\n", 101 | "\n", 102 | "import matplotlib.pyplot as plt\n", 103 | "import numpy as np\n", 104 | "import pandas as pd\n", 105 | "\n", 106 | "\n", 107 | "np.random.seed(123)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "id": "Bl_p0SoAw7Gz" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "id": "lIfB_qb851sE" 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "from cocoa import nearest_consented_customers" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": { 134 | "id": "QK9vYsW8N7F3" 135 | }, 136 | "source": [ 137 | "# Data Simulation" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "cellView": "form", 145 | "id": "N21wIR0hbCTb" 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "#@title Create fake dataset of adgroups and conversion values\n", 150 | "#@markdown We are generating random data: each row is an individual conversion\n", 151 | "#@markdown with a given conversion value. \\\n", 152 | "#@markdown For each conversion, we know the\n", 153 | "#@markdown adgroup, which is our only feature here and just consists of 3 letters.\n", 154 | "\n", 155 | "n_consenting_customers = 8000 #@param\n", 156 | "n_nonconsenting_customers = 2000 #@param\n", 157 | "\n", 158 | "\n", 159 | "def simulate_conversion_data_consenting_non_consenting(\n", 160 | " n_consenting_customers: int,\n", 161 | " n_nonconsenting_customers: int) -\u003e typing.Tuple[pd.DataFrame, pd.DataFrame]:\n", 162 | " \"\"\"Simulates dataframes for consenting and non-consenting customers.\n", 163 | "\n", 164 | " Args:\n", 165 | " n_consenting_customers: Desired number of consenting customers. 
Should be\n", 166 | " larger than n_nonconsenting_customers.\n", 167 | " n_nonconsenting_customers: Desired number non non-consenting customers.\n", 168 | "\n", 169 | " Returns:\n", 170 | " Two dataframes of simulated consenting and non-consenting customers.\n", 171 | " \"\"\"\n", 172 | " fake_adgroups = np.array(\n", 173 | " ['_'.join(fake_ad) for fake_ad in (combinations('ABCDEFG', 3))])\n", 174 | "\n", 175 | " data_consenting = pd.DataFrame.from_dict({\n", 176 | " 'adgroup':\n", 177 | " fake_adgroups[np.random.randint(\n", 178 | " low=0, high=len(fake_adgroups), size=n_consenting_customers)],\n", 179 | " 'conversion_value':\n", 180 | " np.random.lognormal(1, size=n_consenting_customers)\n", 181 | " })\n", 182 | "\n", 183 | " data_nonconsenting = pd.DataFrame.from_dict({\n", 184 | " 'adgroup':\n", 185 | " fake_adgroups[np.random.randint(\n", 186 | " low=0, high=len(fake_adgroups), size=n_nonconsenting_customers)],\n", 187 | " 'conversion_value':\n", 188 | " np.random.lognormal(1, size=n_nonconsenting_customers)\n", 189 | " })\n", 190 | " return data_consenting, data_nonconsenting\n", 191 | "\n", 192 | "\n", 193 | "data_consenting, data_nonconsenting = simulate_conversion_data_consenting_non_consenting(\n", 194 | " n_consenting_customers, n_nonconsenting_customers)\n", 195 | "data_consenting.head()" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": { 201 | "id": "V_9JEHeZOF8X" 202 | }, 203 | "source": [ 204 | "# Preprocessing" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "cellView": "form", 212 | "id": "K-WQvtKYdn6N" 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "#@title Split adgroups in separate levels\n", 217 | "#@markdown We preprocess our data. Consenting and non-consenting data are\n", 218 | "#@markdown concatenated to ensure that they have the same feature-columns. \\\n", 219 | "#@markdown We then split our adgroup-string into its components and dummy code each.\n", 220 | "#@markdown The level of each letter in the adgroup-string is added as prefix here.\n", 221 | "\n", 222 | "def preprocess_data(data_consenting, data_nonconsenting):\n", 223 | " data_consenting['consent'] = 1\n", 224 | " data_nonconsenting['consent'] = 0\n", 225 | " data_all = pd.concat([data_consenting, data_nonconsenting])\n", 226 | " data_all.reset_index(inplace=True)\n", 227 | "\n", 228 | " # split the adgroups in their levels and dummy-code those.\n", 229 | " data_all = data_all.join(\n", 230 | " pd.get_dummies(data_all['adgroup'].str.split('_').apply(pd.Series)))\n", 231 | " data_all.drop(['adgroup'], axis=1, inplace=True)\n", 232 | " return data_all[data_all['consent'] == 1], data_all[data_all['consent'] == 0]\n", 233 | "\n", 234 | "data_consenting, data_nonconsenting = preprocess_data(data_consenting,\n", 235 | " data_nonconsenting)\n", 236 | "data_consenting.head()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": { 242 | "id": "3v16SF-umvAz" 243 | }, 244 | "source": [ 245 | "# Create NearestCustomerMatcher object and run conversion-adjustments.\n", 246 | "\n", 247 | "We now have our fake data in the right format – similarity here depends alone on\n", 248 | "the adgroup of a given customer. 
In reality, we would have a gCLID and a\n", 249 | "timestamp for each customer that we could pass as `id_columns` to the matcher.\\\n", 250 | "Other example features that could be used instead/in addition to the adgroup are\n", 251 | "\n", 252 | "\n", 253 | "* device type\n", 254 | "* geo\n", 255 | "* time of day\n", 256 | "* ad-type\n", 257 | "* GA-derived features\n", 258 | "* etc. \n", 259 | "\n", 260 | "When using the `NearestCustomerMatcher`, we can choose between three matching\n", 261 | "strategies:\n", 262 | "* if we define `number_nearest_neighbors`, a fixed number of nearest (consenting)\n", 263 | "customers is used, irrespective of how dissimilar those customers are to the \n", 264 | "seed-non-consenting customer.\n", 265 | "* if we define `radius`, all consenting customers that fall within the specified radius of a non-consenting customer are used. This means that the number of nearest-neighbors likely differs between non-consenting customers, and a given non-consenting customer might have no consenting customers in their radius.\n", 266 | "* if we define `percentage`, the `NearestCustomerMatcher` first which minimal radius needs to be set in order to find at least one closest consenting customer for at least `percentage` non-consenting customers (not implemented in beam yet)\n", 267 | "\n", 268 | "In practice, the simplest approach is to set `number_nearest_neighbors` and\n", 269 | "choose a sufficiently high number here to ensure that individual consenting\n", 270 | "customers do not receive too high a share of non-consenting conversion values.\n", 271 | "\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "id": "ezsCLpQ4jExj" 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "matcher = nearest_consented_customers.NearestCustomerMatcher(\n", 283 | " data_consenting, conversion_column='conversion_value', id_columns=['index'])\n", 284 | "data_adjusted = matcher.calculate_adjusted_conversions(\n", 285 | " data_nonconsenting, number_nearest_neighbors=100)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "cellView": "form", 293 | "id": "GCEKmfcdqyRK" 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "#@title We generated a new dataframe containing the conversion-value adjustments\n", 298 | "data_adjusted.sample(5)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "cellView": "form", 306 | "id": "twOhpxSvmIkR" 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "#@title Visualise distribution of adjusted conversions\n", 311 | "#@markdown We can plot the original and adjusted (original + adjustment-values)\n", 312 | "#@markdown conversion values and see that in general, the distributions are\n", 313 | "#@markdown very similar, but as expected, the adjusted values are shifted towards\n", 314 | "#@markdown larger values.\n", 315 | "ax = data_adjusted['conversion_value'].plot(kind='hist', alpha=.5, )\n", 316 | "(data_adjusted['adjusted_conversion']+data_adjusted['conversion_value']).plot(kind='hist', ax=ax, alpha=.5)\n", 317 | "ax.legend(['original conversion value', 'adjusted conversion value'])\n", 318 | "plt.show()" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": { 324 | "id": "KjDO7pncOTON" 325 | }, 326 | "source": [ 327 | "# Next steps\n", 328 | "The above would run automatically on a daily basis within a Google Cloud Project. 
A new table ready to use with Offline Conversion Import is created.\n", 329 | "If no custom pipeline has been set up yet, we recommend using [Tentacles](https://github.com/GoogleCloudPlatform/cloud-for-marketing/blob/master/marketing-analytics/activation/gmp-googleads-connector/README.md)." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "id": "l5kV_6qNPSNF" 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "" 341 | ] 342 | } 343 | ], 344 | "metadata": { 345 | "colab": { 346 | "collapsed_sections": [ 347 | "QK9vYsW8N7F3", 348 | "V_9JEHeZOF8X", 349 | "3v16SF-umvAz" 350 | ], 351 | "last_runtime": { 352 | "build_target": "//corp/gtech/ads/infrastructure/colab_utils/ds_runtime:ds_colab", 353 | "kind": "private" 354 | }, 355 | "name": "cocoa_template.ipynb", 356 | "private_outputs": true, 357 | "provenance": [ 358 | { 359 | "file_id": "1YafoEaxHj4Gfs51FwMEzEjx1oRu92XSB", 360 | "timestamp": 1619712681367 361 | } 362 | ] 363 | }, 364 | "kernelspec": { 365 | "display_name": "Python 3", 366 | "name": "python3" 367 | }, 368 | "language_info": { 369 | "name": "python" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 0 374 | } 375 | -------------------------------------------------------------------------------- /cocoa/nearest_consented_customers.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Module to re-distribute conversion-values of no-consent customers.""" 16 | 17 | import logging 18 | from typing import Any, Callable, List, Optional, Sequence, Tuple, Union 19 | 20 | import numpy as np 21 | import pandas as pd 22 | from scipy import sparse 23 | from scipy import special 24 | from sklearn import neighbors 25 | 26 | 27 | class NearestCustomerMatcher: 28 | """Class to find nearest neighbors and distribute conversion value. 29 | 30 | When we have a dataset of customers that gave consent to cookie-tracking, and 31 | customers that did not give consent, we want to ensure that the total 32 | conversion values (e.g. value of a purchase) across all customers are 33 | accessible to SmartBidding. 34 | The NearestCustomerMatcher finds the most similar customers among the 35 | consenting customers to each of the no-consent customers, and distributes 36 | the conversion values of any no-consent customer across the matches in the 37 | set of consenting customers, in proportion to their distance. 38 | Similarity is defined as the distance between customers in their feature- 39 | space, for instance based on adgroup-levels. Which distance-metric to 40 | choose is up to the user. 41 | The more similar a consenting customer is to a given no-consent 42 | customer, the larger the share of the no-consent customer's conversion- 43 | value that will be added to the consenting customer's conversion value. 
44 | """ 45 | 46 | def __init__(self, 47 | data_consent: pd.DataFrame, 48 | conversion_column: str, 49 | id_columns: List[Union[str, int]], 50 | metric: str = "manhattan", 51 | neighbor: Callable[..., Any] = neighbors.NearestNeighbors): 52 | """Initialises class. 53 | 54 | Args: 55 | data_consent: Dataframe of consented customers (preprocessed). 56 | conversion_column: Name of column in dataframe of conversion-value. 57 | id_columns: Names of columns that identify customers. Usually GCLID and 58 | timestamp. 59 | metric: Distance metric to use when finding nearest neighbors. 60 | neighbor: sklearn NearestNeighbor object. 61 | 62 | Raises: 63 | ValueError: if the conversion values contain NaNs or Nones, or if 64 | conversion values < 0. 65 | """ 66 | # TODO() Test behaviour under different distance metrics. 67 | self._neighbor = neighbor(metric=metric, algorithm="auto") 68 | self._columns_consent = data_consent.drop(id_columns, axis=1).columns 69 | self._data_consent = data_consent[id_columns + [conversion_column]] 70 | features_consent = data_consent.drop( 71 | id_columns + [conversion_column], axis=1 72 | ).values.astype(np.float64) 73 | self._features_consent = sparse.csr_matrix(features_consent).astype( 74 | np.float32 75 | ) 76 | self._conversion_column = conversion_column 77 | self._consent_id = data_consent[id_columns] 78 | self._id_columns = id_columns 79 | if any(self._data_consent[self._conversion_column].isna()): 80 | raise ValueError("The conversion column must not contain NaNs/Nones.") 81 | if any(self._data_consent[self._conversion_column] <= 0): 82 | raise ValueError("The conversion values must be larger than zero.") 83 | self._neighbor = self._neighbor.fit(self._features_consent) 84 | 85 | # These attributes will be populated with data later. 
86 | self._data_noconsent = None 87 | self._data_noconsent_match = None 88 | self._data_noconsent_nomatch = None 89 | 90 | @property 91 | def total_non_matched_conversion_value(self) -> float: 92 | return self._data_noconsent_nomatch[self._conversion_column].sum() 93 | 94 | @property 95 | def total_matched_conversion_value(self) -> float: 96 | return self._data_noconsent_match[self._conversion_column].sum() 97 | 98 | @property 99 | def percentage_matched_conversion_value(self) -> float: 100 | return (self.total_matched_conversion_value / 101 | (self.total_non_matched_conversion_value + 102 | self.total_matched_conversion_value)) * 100 103 | 104 | @property 105 | def number_non_matched_conversions(self) -> int: 106 | return len(self._data_noconsent_nomatch) 107 | 108 | @property 109 | def number_matched_conversions(self) -> int: 110 | return len(self._data_noconsent_match) 111 | 112 | @property 113 | def percentage_matched_conversions(self) -> float: 114 | return self.number_matched_conversions / len(self._data_noconsent) * 100 115 | 116 | @property 117 | def distance_statistics(self): 118 | return self._data_adjusted["average_distance"].describe() 119 | 120 | @property 121 | def nearest_distances_statistics_nonconsenting(self): 122 | return self._data_noconsent_match["distance_to_nearest_neighbor"].describe( 123 | percentiles=[.25, .5, .75, .9, .95, .99]) 124 | 125 | @property 126 | def summary_statistics_matched_conversions(self): 127 | return pd.DataFrame( 128 | { 129 | "percentage_matched_conversion_value": 130 | self.percentage_matched_conversion_value, 131 | "percentage_matched_conversions": 132 | self.percentage_matched_conversions, 133 | "number_matched_conversions": 134 | self.number_matched_conversions, 135 | "total_matched_conversion_value": 136 | self.total_matched_conversion_value 137 | }, 138 | index=["summary_statistics_matched_conversions"]) 139 | 140 | def min_radius_by_percentile(self, percentile: float = .95) -> float: 141 | radius = self._data_noconsent_match[ 142 | "distance_to_nearest_neighbor"].quantile(percentile) 143 | return radius 144 | 145 | def _get_proportional_number_nearest_neighbors( 146 | self, number_nearest_neighbors: float) -> int: 147 | return int(number_nearest_neighbors * len(self._data_consent)) 148 | 149 | def _fit_neighbor(self): 150 | self._neighbor.fit(self._features_consent) 151 | self._fitted = True 152 | 153 | def _get_neighbors_within_radius( 154 | self, data_noconsent: pd.DataFrame, radius: float 155 | ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]: 156 | """Gets neighbors within specified radius. 157 | 158 | Args: 159 | data_noconsent: Data of no-consent customers. 160 | radius: Radius within which nearest neighbors are found. 161 | 162 | Returns: 163 | neighbors_index: Array of indices-arrays of neighboring points. 164 | neighbors_distances: Array of distance-arrays to neighboring points. 165 | has_neighbors_array: Array of booleans indicating whether a given non- 166 | consenting customer had at least one neighbor or not. 
Takes advantage
167 |         of numpy's functionality, e.g.:
168 |         (np.array([0,1,2]) > 0)
169 |         >>> array([False, True, True])
170 |     """
171 |     neighbors_distance, neighbors_index = self._neighbor.radius_neighbors(
172 |         data_noconsent.drop([self._conversion_column], axis=1),
173 |         radius=radius,
174 |         return_distance=True,
175 |     )
176 |     has_neighbors_array = np.array(
177 |         [len(neighbors) for neighbors in neighbors_index]) > 0
178 |     if not any(has_neighbors_array):
179 |       logging.warning("No matching customers within radius %s.", radius)
180 |     neighbors_index = neighbors_index[has_neighbors_array]
181 |     neighbors_distance = neighbors_distance[has_neighbors_array]
182 |     return neighbors_index, neighbors_distance, has_neighbors_array
183 | 
184 |   def _get_n_nearest_neighbors(
185 |       self, data_noconsent: pd.DataFrame, number_nearest_neighbors: float
186 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]:
187 |     """Gets n nearest neighbors.
188 | 
189 |     Args:
190 |       data_noconsent: Data of no-consent customers.
191 |       number_nearest_neighbors: Number of neighbors to return. If smaller
192 |         than 1, it is interpreted as a proportion of the set of consenting
193 |         customers.
194 | 
195 |     Returns:
196 |       neighbors_index: Array of indices-arrays of neighboring points.
197 |       neighbors_distances: Array of distance-arrays to neighboring points.
198 |       has_neighbors_array: Array of booleans indicating whether a given non-
199 |         consenting customer had at least one neighbor or not. Takes advantage
200 |         of numpy's functionality, e.g.:
201 |         (np.array([0,1,2]) > 0)
202 |         >>> array([False, True, True])
203 | 
204 |     Raises:
205 |       ValueError: if the actual number of nearest neighbors is not
206 |         `number_nearest_neighbors`.
207 |     """
208 |     if number_nearest_neighbors < 1:
209 |       number_nearest_neighbors = (
210 |           self._get_proportional_number_nearest_neighbors(
211 |               number_nearest_neighbors))
212 |     neighbors_distance, neighbors_index = self._neighbor.kneighbors(
213 |         data_noconsent.drop([self._conversion_column], axis=1),
214 |         n_neighbors=number_nearest_neighbors,
215 |         return_distance=True)
216 |     has_neighbors_array = np.array(
217 |         [len(neighbors) for neighbors in neighbors_index]) > 0
218 |     if np.shape(neighbors_distance)[1] != number_nearest_neighbors:
219 |       raise ValueError(
220 |           f"Returned number of neighbors is not {number_nearest_neighbors}.")
221 |     return neighbors_index, neighbors_distance, has_neighbors_array
222 | 
223 |   def _get_nearest_neighbors(
224 |       self,
225 |       data_noconsent: pd.DataFrame,
226 |       radius: Optional[float] = None,
227 |       number_nearest_neighbors: Optional[float] = None
228 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]:
229 |     """Get indices and distances to nearest neighbors.
230 | 
231 |     Finds nearest neighbors based on radius or number_nearest_neighbors for
232 |     each entry in data_noconsent. If nearest neighbors are defined via
233 |     radius, entries in data_noconsent without a sufficiently close
234 |     neighbor are removed.
235 | 
236 |     Args:
237 |       data_noconsent: Data of no-consent customers.
238 |       radius: Radius within which neighbors have to lie.
239 |       number_nearest_neighbors: Defines the number (or proportion) of nearest
240 |         neighbors. If smaller than 1, it is interpreted as a proportion of
241 |         the number of consenting customers.
242 | 
243 |     Returns:
244 |       A 3-tuple with:
245 |         Array of indices-arrays of nearest neighbors in data_consent.
246 |         Array of distances-arrays of nearest neighbors in data_consent.
247 |         Array of booleans indicating whether a given no-consent customer
248 |           had at least one neighbor or not.
249 | 
250 |     Raises:
251 |       ValueError: if not exactly one of radius or number_nearest_neighbors is
252 |         provided.
253 |     """
254 |     has_radius = radius is not None
255 |     has_number_nearest_neighbors = number_nearest_neighbors is not None
256 | 
257 |     if has_radius == has_number_nearest_neighbors:
258 |       raise ValueError("Exactly one of radius or number_nearest_neighbors "
259 |                        "has to be provided.")
260 |     if has_radius:
261 |       return self._get_neighbors_within_radius(data_noconsent, radius)
262 |     elif has_number_nearest_neighbors:
263 |       return self._get_n_nearest_neighbors(data_noconsent,
264 |                                            number_nearest_neighbors)
265 | 
266 |   def _assert_all_columns_match_and_conversions_are_valid(self, data_noconsent):
267 |     """Checks that all consenting and no-consent data match and are valid.
268 | 
269 |     Args:
270 |       data_noconsent: Data of no-consent customers.
271 | 
272 |     Raises:
273 |       ValueError: if columns of consenting and no-consent data don't match,
274 |         the conversion values contain NaNs/Nones, or any conversion value is
275 |         not larger than zero.
276 |     """
277 |     if len(self._columns_consent) != len(data_noconsent.columns) or not all(
278 |         self._columns_consent == data_noconsent.columns):
279 |       raise ValueError(
280 |           "Consented and non-consented data must have same columns.")
281 |     for data in (data_noconsent, self._data_consent):
282 |       if any(data[self._conversion_column].isna()):
283 |         raise ValueError("The conversion column should not contain NaNs.")
284 |       if any(data[self._conversion_column] <= 0):
285 |         raise ValueError("The conversion values should be larger than zero.")
286 | 
287 |   def get_indices_and_values_to_nearest_neighbors(
288 |       self,
289 |       data_noconsent: pd.DataFrame,
290 |       radius: Optional[float] = None,
291 |       number_nearest_neighbors: Optional[float] = None
292 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[np.ndarray],
293 |              Sequence[np.ndarray], Sequence[bool]]:
294 |     """Gets indices of nearest neighbors as well as the needed conversions.
295 | 
296 |     Args:
297 |       data_noconsent: Data of no-consent customers.
298 |       radius: Radius within which neighbors have to lie.
299 |       number_nearest_neighbors: Defines the number (or proportion) of nearest
300 |         neighbors.
301 | 
302 |     Returns:
303 |       neighbors_data_index: Arrays of indices to the nearest neighbors in the
304 |         consenting-customer data.
305 |       neighbors_distance: Arrays of distances to the nearest neighbors.
306 |       weighted_conversion_values: Conversion values of no-consent customers
307 |         weighted by their distance to each nearest neighbor.
308 |       weighted_distance: Weighted distances between no-consent and
309 |         consenting customers.
310 |       has_neighbor: Whether or not a given no-consent customer had a
311 |         nearest neighbor.
311 | """ 312 | data_noconsent = data_noconsent.drop(self._id_columns, axis=1) 313 | self._assert_all_columns_match_and_conversions_are_valid(data_noconsent) 314 | neighbors_index, neighbors_distance, has_neighbor = ( 315 | self._get_nearest_neighbors(data_noconsent, radius, 316 | number_nearest_neighbors)) 317 | neighbors_data_index = [ 318 | self._data_consent.index[index] for index in neighbors_index 319 | ] 320 | non_consent_conversion_values = data_noconsent[has_neighbor][ 321 | self._conversion_column].values 322 | weighted_conversion_values, weighted_distance = ( 323 | _calculate_weighted_conversion_values( 324 | non_consent_conversion_values, 325 | neighbors_distance, 326 | )) 327 | return (neighbors_data_index, neighbors_distance, 328 | weighted_conversion_values, weighted_distance, has_neighbor) 329 | 330 | def calculate_adjusted_conversions( 331 | self, 332 | data_noconsent: pd.DataFrame, 333 | radius: Optional[float] = None, 334 | number_nearest_neighbors: Optional[float] = None) -> pd.DataFrame: 335 | """Calculates adjusted conversions for identified nearest neighbors. 336 | 337 | Finds nearest neighbors based on radius or number_nearest_neighbors for each 338 | entry in data_noconsent. If nearest neighbors are defined via radius, 339 | entries in data_noconsent without sufficiently close neighbor are ignored. 340 | Conversion values of consenting customers that are identified as nearest 341 | neighbor to a no-consent customer are adjusted by adding the weighted 342 | proportional conversion value of the respective no-consent customer. 343 | The weighted conversion value is calculated as the product of the conversion 344 | value with the softmax over all neighbor-similarities. 345 | 346 | Args: 347 | data_noconsent: Data for no-consent customer(s). Needs to be pre- 348 | processed and have the same columns as data_consent. 349 | radius: Radius within which neighbors have to lie. 350 | number_nearest_neighbors: Defines the number (or proportion) of nearest 351 | neighbors. 352 | 353 | Returns: 354 | data_adjusted: Copy of data_consent including the modelled conversion 355 | values. 356 | """ 357 | (neighbors_data_index, neighbors_distance, weighted_conversion_values, 358 | weighted_distance, 359 | has_neighbor) = self.get_indices_and_values_to_nearest_neighbors( 360 | data_noconsent, radius, number_nearest_neighbors) 361 | self._data_noconsent = data_noconsent.drop(self._id_columns, axis=1) 362 | self._data_noconsent_nomatch = data_noconsent[np.invert( 363 | has_neighbor)].copy() 364 | self._data_noconsent_match = data_noconsent[has_neighbor].copy() 365 | self._data_noconsent_match["distance_to_nearest_neighbor"] = [ 366 | min(distances) for distances in neighbors_distance 367 | ] 368 | self._data_adjusted = _distribute_conversion_values( 369 | self._data_consent, self._conversion_column, 370 | self._data_noconsent_match[self._conversion_column].values, 371 | weighted_conversion_values, neighbors_data_index, neighbors_distance, 372 | weighted_distance) 373 | return self._data_adjusted 374 | 375 | 376 | def _calculate_weighted_conversion_values( 377 | conversion_values: Sequence[np.ndarray], 378 | neighbors_distance: Sequence[np.ndarray], 379 | ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray]]: 380 | """Calculate weighted conversion values as function of distance. 381 | 382 | The weighted conversion value is calculated as the product of the conversion 383 | value with the softmax over all neighbor-similarities. 
384 | 
385 | 
386 |   Args:
387 |     conversion_values: Array of conversion_values for non-consented customers.
388 |     neighbors_distance: Array of arrays of neighbor-distances.
389 | 
390 |   Returns:
391 |     weighted_conversion_values: Array of weighted conversion_values per non-
392 |       consented customer.
393 |     softmax_similarity: Array of softmax similarities per non-consented
394 |       customer.
395 |   """
396 |   if len(conversion_values) != len(neighbors_distance):
397 |     raise ValueError("conversion_values and neighbors_distance must have "
398 |                      "the same length.")
399 | 
400 |   if any((dist < 0).any() for dist in neighbors_distance):
401 |     raise ValueError("Distances should not contain negative values. "
402 |                      "Please review which distance metric you used.")
403 | 
404 |   softmax_similarity = [
405 |       special.softmax(-distance) for distance in neighbors_distance
406 |   ]
407 |   weighted_conversion_values = [
408 |       conversion_value * weight
409 |       for conversion_value, weight in zip(conversion_values, softmax_similarity)
410 |   ]
411 |   return weighted_conversion_values, softmax_similarity
412 | 
413 | 
414 | def _distribute_conversion_values(
415 |     data_consent: pd.DataFrame,
416 |     conversion_column: str,
417 |     non_consent_conversion_values: Sequence[float],
418 |     weighted_conversion_values: Sequence[np.ndarray],
419 |     neighbors_index: Sequence[np.ndarray],
420 |     neighbors_distance: Sequence[np.ndarray],
421 |     weighted_distance: Sequence[np.ndarray],
422 | ) -> pd.DataFrame:
423 |   """Distributes conversion-values of no-consent over consenting customers.
424 | 
425 |   Conversion values of consenting customers that are identified as nearest
426 |   neighbor to a no-consent customer are adjusted by adding the weighted
427 |   proportional conversion value of the respective no-consent customer.
428 |   Additionally, metrics like average distance to no-consent customers
429 |   and total number of added conversions are calculated.
430 | 
431 |   Args:
432 |     data_consent: DataFrame of consented customers.
433 |     conversion_column: String indicating the conversion KPI in data_consent.
434 |     non_consent_conversion_values: Array of original conversion values.
435 |     weighted_conversion_values: Array of arrays of weighted conversion_values,
436 |       based on distance between consenting and no-consent customers.
437 |     neighbors_index: Array of arrays of neighbor-indices.
438 |     neighbors_distance: Array of arrays of neighbor-distances.
439 |     weighted_distance: Array of arrays of weighted neighbor-distances.
440 | 
441 |   Returns:
442 |     data_adjusted: Copy of data_consent including the modelled conversion
443 |       values.
444 |   """
445 | 
446 |   data_adjusted = data_consent.copy()
447 |   data_adjusted["adjusted_conversion"] = 0
448 |   data_adjusted["average_distance"] = 0
449 |   data_adjusted["n_added_conversions"] = 0
450 |   data_adjusted["sum_distribution_weights"] = 0
451 |   for index, values, distance, weight in zip(neighbors_index,
452 |                                              weighted_conversion_values,
453 |                                              neighbors_distance,
454 |                                              weighted_distance):
455 |     data_adjusted.loc[index, "adjusted_conversion"] += values
456 |     data_adjusted.loc[index, "average_distance"] += distance
457 |     data_adjusted.loc[index, "sum_distribution_weights"] += weight
458 |     data_adjusted.loc[index, "n_added_conversions"] += 1
459 | 
460 |   data_adjusted["average_distance"] = (data_adjusted["average_distance"] /
461 |                                        data_adjusted["n_added_conversions"])
462 | 
463 |   naive_conversion_adjustments = np.sum(non_consent_conversion_values) / len(
464 |       data_consent)
465 |   data_adjusted["naive_adjusted_conversion"] = data_adjusted[
466 |       conversion_column] + naive_conversion_adjustments
467 |   return data_adjusted
468 | 
469 | 
470 | def get_adjustments_and_summary_calculations(
471 |     matcher: NearestCustomerMatcher,
472 |     data_noconsent: pd.DataFrame,
473 |     number_nearest_neighbors: Optional[float] = None,
474 |     radius: Optional[float] = None,
475 |     percentile: Optional[float] = None,
476 | ) -> Tuple[pd.DataFrame, pd.DataFrame]:
477 |   """Calculates adjusted conversions for consenting customers.
478 | 
479 |   Args:
480 |     matcher: Matcher object which has been fit to all of data_consent. It
481 |       provides the functionality to get the nearest neighbors for a given
482 |       no-consent customer.
483 |     data_noconsent: Dataframe of no-consent customers. Needs to have the
484 |       same columns as data_consent to calculate similarity between data points.
485 |     number_nearest_neighbors: Number of consenting customers to choose as
486 |       matches. If smaller than 1, it is taken as a proportion of all customers.
487 |     radius: Radius within which matching consenting customers are searched.
488 |     percentile: Percentile of matched no-consent customers based on which the
489 |       radius is set.
490 | 
491 |   Returns:
492 |     A two-tuple with:
493 |       - adjusted conversion values for new and old customers.
494 |       - summary statistics on matched conversions (% of counts, % of
495 |         conversion value).
496 | 
497 |   Raises:
498 |     ValueError: if not exactly one of number_nearest_neighbors, radius,
499 |       or percentile is provided.
500 |     ValueError: if the provided percentile is not within the range of 0-1.
501 |   """
502 |   has_number_nearest_neighbors = number_nearest_neighbors is not None
503 |   has_radius = radius is not None
504 |   has_percentile = percentile is not None
505 | 
506 |   if (has_number_nearest_neighbors + has_radius + has_percentile) != 1:
507 |     raise ValueError("Exactly one of number_nearest_neighbors, radius, "
508 |                      "or percentile has to be specified.")
509 | 
510 |   if has_percentile and not 0 < percentile <= 1:
511 |     raise ValueError("The percentile has to be a value between 0 and 1.")
512 | 
513 |   if has_number_nearest_neighbors or has_radius:
514 |     data_adjusted = matcher.calculate_adjusted_conversions(
515 |         data_noconsent=data_noconsent,
516 |         number_nearest_neighbors=number_nearest_neighbors,
517 |         radius=radius)
518 |   else:
519 |     matcher.calculate_adjusted_conversions(
520 |         data_noconsent=data_noconsent, number_nearest_neighbors=1)
521 |     radius = matcher.min_radius_by_percentile(percentile=percentile)
522 |     data_adjusted = matcher.calculate_adjusted_conversions(
523 |         data_noconsent=data_noconsent, radius=radius)
524 |   return data_adjusted, matcher.summary_statistics_matched_conversions
525 | 
--------------------------------------------------------------------------------
/cocoa/nearest_consented_customers_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
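
A minimal end-to-end sketch of the matcher API in nearest_consented_customers.py above (an editorial illustration, not part of the repository): the tiny dataframes are invented, and the package is assumed to be installed, e.g. via pip install ., so that the import resolves. The assert documents the key invariant of the softmax weighting: the weights over each no-consent customer's neighbors sum to one, so the distributed values add up to the original no-consent conversion total.

import numpy as np
import pandas as pd

from consent_based_conversion_adjustments.cocoa import nearest_consented_customers

# Hypothetical, already-preprocessed data: two numeric features plus the
# conversion-value and id columns expected by the matcher.
data_consent = pd.DataFrame({
    "a": [1.0, 0.0, 1.0],
    "b": [2.0, 5.0, 8.0],
    "conversion_column": [3.0, 6.0, 9.0],
    "id_column": [0, 1, 2],
})
data_noconsent = pd.DataFrame({
    "a": [4.0, 7.0],
    "b": [5.0, 8.0],
    "conversion_column": [6.0, 9.0],
    "id_column": [3, 4],
})

matcher = nearest_consented_customers.NearestCustomerMatcher(
    data_consent, "conversion_column", ["id_column"], metric="manhattan")
data_adjusted, summary = (
    nearest_consented_customers.get_adjustments_and_summary_calculations(
        matcher, data_noconsent, number_nearest_neighbors=2))

# Softmax weights per no-consent customer sum to one, so the total
# distributed value equals the total no-consent conversion value.
assert np.isclose(data_adjusted["adjusted_conversion"].sum(),
                  data_noconsent["conversion_column"].sum())
print(summary)

The same matcher instance can be reused with radius= or percentile= instead; as enforced above, exactly one of the three selection arguments may be set per call.
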
14 | 15 | """Tests for nearest_consented_customers package.""" 16 | 17 | from absl.testing import absltest 18 | from absl.testing import parameterized 19 | import numpy as np 20 | 21 | from consent_based_conversion_adjustments.cocoa import nearest_consented_customers 22 | from consent_based_conversion_adjustments.cocoa import testing_constants 23 | 24 | CONVERSION_COLUMN = testing_constants.CONVERSION_COLUMN 25 | ID_COLUMNS = testing_constants.ID_COLUMNS 26 | 27 | DATA_CONSENT = testing_constants.DATA_CONSENT 28 | DATA_NOCONSENT = testing_constants.DATA_NOCONSENT 29 | 30 | METRIC = testing_constants.METRIC 31 | 32 | 33 | class NearestCustomerTest(parameterized.TestCase): 34 | 35 | def setUp(self): 36 | super().setUp() 37 | self.matcher = nearest_consented_customers.NearestCustomerMatcher( 38 | DATA_CONSENT, CONVERSION_COLUMN, ID_COLUMNS, METRIC) 39 | self.matcher._data_noconsent = DATA_NOCONSENT.drop(ID_COLUMNS, axis=1) 40 | 41 | def test_columns_differ_raises_value_error(self): 42 | with self.assertRaises(ValueError): 43 | self.matcher.calculate_adjusted_conversions( 44 | data_noconsent=DATA_NOCONSENT.drop('a', axis=1), 45 | number_nearest_neighbors=1) 46 | 47 | def test_number_nearest_neighbors_and_radius_raises_value_error(self): 48 | with self.assertRaises(ValueError): 49 | self.matcher.calculate_adjusted_conversions(data_noconsent=DATA_NOCONSENT, 50 | radius=.5, 51 | number_nearest_neighbors=5) 52 | 53 | @parameterized.parameters(1, 2, 3) 54 | def test_assert_k_matches_length_of_indices(self, number_nearest_neighbors): 55 | indices, distances, _ = self.matcher._get_nearest_neighbors( 56 | data_noconsent=DATA_CONSENT.drop(ID_COLUMNS, axis=1), 57 | number_nearest_neighbors=number_nearest_neighbors) 58 | 59 | self.assertEqual(np.shape(indices)[1], number_nearest_neighbors) 60 | self.assertEqual(np.shape(distances)[1], number_nearest_neighbors) 61 | 62 | def test_number_nearest_neighbors_larger_than_customers_raises_value_error( 63 | self): 64 | number_nearest_neighbors = len(DATA_CONSENT) * 3 65 | 66 | with self.assertRaises(ValueError): 67 | _ = self.matcher.calculate_adjusted_conversions( 68 | data_noconsent=DATA_NOCONSENT, 69 | number_nearest_neighbors=number_nearest_neighbors) 70 | 71 | def test_no_match_in_radius_logs_warning(self): 72 | with self.assertLogs(level='WARNING') as log: 73 | _ = self.matcher.calculate_adjusted_conversions( 74 | data_noconsent=DATA_NOCONSENT, radius=0) 75 | 76 | self.assertEqual(log.records[0].getMessage(), 77 | 'No matching customers within radius 0.') 78 | 79 | def test_adjusted_conversion_value_different_from_original_value(self): 80 | data_adjusted = self.matcher.calculate_adjusted_conversions( 81 | data_noconsent=DATA_NOCONSENT, number_nearest_neighbors=3) 82 | 83 | sum_adjusted_conversions = (data_adjusted[CONVERSION_COLUMN].sum() 84 | + data_adjusted['adjusted_conversion'].sum()) 85 | self.assertLess(data_adjusted[CONVERSION_COLUMN].sum(), 86 | sum_adjusted_conversions) 87 | 88 | def test_negative_distances_raises_value_error(self): 89 | neighbors_distance = [np.array([0.1, 0.1, 0.8]), np.array([-5, 0, 0])] 90 | 91 | with self.assertRaises(ValueError): 92 | nearest_consented_customers._calculate_weighted_conversion_values( 93 | DATA_NOCONSENT[CONVERSION_COLUMN].values, neighbors_distance) 94 | 95 | def test_neighbor_indices_not_in_data_consent_raises_key_error(self): 96 | neighbors_index = [np.array([111, 222, 333])] 97 | neighbors_distance = [np.array([0.1, 0.1, 0.8])] 98 | weighted_conversion_values = [np.array([10, 20, 30])] 99 | 
weighted_distances = neighbors_distance 100 | 101 | with self.assertRaises(KeyError): 102 | nearest_consented_customers._distribute_conversion_values( 103 | DATA_CONSENT, CONVERSION_COLUMN, 104 | DATA_CONSENT[CONVERSION_COLUMN].values, weighted_conversion_values, 105 | neighbors_index, neighbors_distance, weighted_distances) 106 | 107 | def test_adjusted_conversion_smaller_than_upper_limit(self): 108 | upper_limit = DATA_NOCONSENT[CONVERSION_COLUMN].sum() 109 | 110 | data_adjusted = self.matcher.calculate_adjusted_conversions( 111 | data_noconsent=DATA_NOCONSENT, radius=1) 112 | sum_adjusted_conversions = data_adjusted['adjusted_conversion'].sum() 113 | 114 | self.assertLessEqual(sum_adjusted_conversions, upper_limit) 115 | 116 | def test_sum_of_weighted_conversions_matches_original_conversions(self): 117 | original_conversions = DATA_NOCONSENT[CONVERSION_COLUMN].astype( 118 | float).values[:2] 119 | neighbors_distance = [np.array([0.1, 0.1, 0.8]), np.array([1, 0, 0])] 120 | 121 | weighted_conversions, _ = ( 122 | nearest_consented_customers._calculate_weighted_conversion_values( 123 | original_conversions, neighbors_distance)) 124 | 125 | np.testing.assert_almost_equal(np.sum(weighted_conversions, axis=1), 126 | original_conversions) 127 | 128 | def test_raises_value_error_if_number_nearest_neighbors_and_radius(self): 129 | with self.assertRaises(ValueError): 130 | nearest_consented_customers.get_adjustments_and_summary_calculations( 131 | matcher=self.matcher, 132 | data_noconsent=DATA_NOCONSENT.drop(columns=[CONVERSION_COLUMN]), 133 | number_nearest_neighbors=1, 134 | radius=1) 135 | 136 | def test_adjusted_conversions_larger_zero(self): 137 | adjusted_conversions, _ = ( 138 | nearest_consented_customers.get_adjustments_and_summary_calculations( 139 | matcher=self.matcher, 140 | data_noconsent=DATA_NOCONSENT, 141 | number_nearest_neighbors=3, 142 | )) 143 | 144 | self.assertGreater(adjusted_conversions['adjusted_conversion'].sum(), 0) 145 | 146 | @parameterized.named_parameters( 147 | { 148 | 'testcase_name': 'percentile.9', 149 | 'percentile': .9, 150 | }, { 151 | 'testcase_name': 'percentile.5', 152 | 'percentile': .5, 153 | }, { 154 | 'testcase_name': 'percentile.1', 155 | 'percentile': .1, 156 | }) 157 | def test_percentage_matched_conversions_matches_target_percentage( 158 | self, percentile): 159 | _, summary_statistics_matched_conversions = ( 160 | nearest_consented_customers.get_adjustments_and_summary_calculations( 161 | matcher=self.matcher, 162 | data_noconsent=DATA_NOCONSENT, 163 | percentile=percentile, 164 | )) 165 | 166 | self.assertGreaterEqual( 167 | summary_statistics_matched_conversions['percentage_matched_conversions'] 168 | .values, percentile) 169 | 170 | @parameterized.named_parameters( 171 | { 172 | 'testcase_name': 'percentile>1', 173 | 'percentile': 1.1, 174 | }, { 175 | 'testcase_name': 'percentile<0', 176 | 'percentile': -1, 177 | }) 178 | def test_raises_value_error_for_invalid_percentile(self, percentile): 179 | with self.assertRaises(ValueError): 180 | nearest_consented_customers.get_adjustments_and_summary_calculations( 181 | matcher=self.matcher, 182 | data_noconsent=DATA_NOCONSENT, 183 | percentile=percentile, 184 | ) 185 | 186 | def test_length_adjusted_conversions_equals_length_data_consent(self): 187 | adjusted_conversions, _ = ( 188 | nearest_consented_customers.get_adjustments_and_summary_calculations( 189 | matcher=self.matcher, 190 | data_noconsent=DATA_NOCONSENT, 191 | number_nearest_neighbors=3)) 192 | 193 | 
self.assertEqual(len(adjusted_conversions), len(DATA_CONSENT))
194 | 
195 |   # TODO() Add test to assert expected outcome is produced.
196 | 
197 | 
198 | if __name__ == '__main__':
199 |   absltest.main()
200 | 
--------------------------------------------------------------------------------
/cocoa/preprocess.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Prepare data to distribute conversion-values of no-consent customers."""
16 | 
17 | from typing import Any, List, Tuple
18 | 
19 | from absl import logging
20 | import numpy as np
21 | import pandas as pd
22 | 
23 | logging.set_verbosity(logging.WARNING)
24 | 
25 | NON_DUMMY_COLUMNS = ()  # typically at least "GCLID", "TIMESTAMP"
26 | DROP_COLUMNS = ()
27 | CONVERSION_COLUMN = "conversion_column"
28 | 
29 | 
30 | def _clean_data(data: pd.DataFrame, conversion_column: str) -> pd.DataFrame:
31 |   """Cleans data from NaNs and invalid conversion values.
32 | 
33 |   In its most basic form, this function drops entries that don't have a
34 |   conversion value or for which the conversion value is not larger than zero.
35 |   This function should be extended based on custom requirements.
36 | 
37 |   Args:
38 |     data: Dataframe of customer data.
39 |     conversion_column: Name of the conversion-column in data.
40 | 
41 |   Returns:
42 |     cleaned data.
43 |   """
44 |   # Optional: Fill NaNs based on additional information/other columns.
45 |   data.dropna(subset=[conversion_column], inplace=True)
46 |   has_valid_conversion_value = data[conversion_column].values > 0
47 |   data = data[has_valid_conversion_value]
48 |   # Optional: Deduplicate consented users based on timestamp and gclid.
49 |   return data
50 | 
51 | 
52 | def _additional_feature_engineering(data: pd.DataFrame) -> pd.DataFrame:
53 |   """Creates additional features that influence similarity between customers.
54 | 
55 |   Depending on your use-case and the features you have about your customers,
56 |   you may want to do additional feature engineering. For instance, customers
57 |   may purchase products that can be naturally organised into a hierarchy.
58 |   You could code a purchase from "furniture/living/sofa" or from
59 |   "furniture/kitchen/chair" into the following format:
60 | 
61 |   customer | level_0   | level_1 | level_2
62 |   -----------------------------------------
63 |   0        | furniture | living  | sofa
64 |   1        | furniture | kitchen | chair
65 | 
66 |   In a later stage, one-hot encoding of these product-levels will result in a
67 |   representation that reflects the similarity between products. If different
68 |   products have hierarchies of different depths, it might be appropriate to
69 |   choose a threshold and drop low levels that occur only rarely.
70 | 
71 |   Args:
72 |     data: Dataframe to be processed, containing consented and unconsented
73 |       customers.
74 | 
75 |   Returns:
76 |     Dataframe including new features.
77 |   """
78 |   return data
79 | 
80 | 
81 | def preprocess_data(data: pd.DataFrame, drop_columns: List[Any],
82 |                     non_dummy_columns: List[Any],
83 |                     conversion_column: str) -> pd.DataFrame:
84 |   """Preprocesses the passed dataframe.
85 | 
86 |   Cleans data and applies dummy-coding to relevant columns.
87 | 
88 |   Args:
89 |     data: Dataframe to be processed (either for consent or no-consent users).
90 |     drop_columns: List of columns to drop from dataframe.
91 |     non_dummy_columns: List of columns to not include in dummy-coding, but keep.
92 |     conversion_column: Name of column indicating conversion value.
93 | 
94 |   Returns:
95 |     Processed and dummy-coded dataframe.
96 |   """
97 |   data = _clean_data(data, conversion_column=conversion_column)
98 |   data = _additional_feature_engineering(data)
99 |   data_dummies = pd.get_dummies(
100 |       data.drop(drop_columns + non_dummy_columns, axis=1, errors="ignore"),
101 |       sparse=True)
102 |   # Note: get_dummies(sparse=True) already yields sparse dummy columns.
103 |   data_dummies = data_dummies.join(data[non_dummy_columns])
104 |   logging.info("Shape of dummy-coded data is: %s", np.shape(data_dummies))
105 |   return data_dummies
106 | 
107 | 
108 | def concatenate_and_process_data(
109 |     data_consent: pd.DataFrame,
110 |     data_noconsent: pd.DataFrame,
111 |     conversion_column: str = CONVERSION_COLUMN,
112 |     drop_columns: Tuple[Any, ...] = DROP_COLUMNS,
113 |     non_dummy_columns: Tuple[Any, ...] = NON_DUMMY_COLUMNS
114 | ) -> Tuple[pd.DataFrame, pd.DataFrame]:
115 |   """Concatenates consent and no-consent data and preprocesses them.
116 | 
117 |   Args:
118 |     data_consent: Dataframe of consenting customers.
119 |     data_noconsent: Dataframe of no-consent customers.
120 |     conversion_column: Name of the conversion column in the data.
121 |     drop_columns: Names of columns that should be dropped from the data.
122 |     non_dummy_columns: Names of (categorical) columns that should be kept, but
123 |       not dummy-coded.
124 | 
125 |   Raises:
126 |     ValueError: if concatenating consent and no-consent data doesn't
127 |       match the expected length.
128 | 
129 |   Returns:
130 |     Processed dataframes for consent and no-consent customers.
131 |   """
132 |   data_noconsent["consent"] = 0
133 |   data_consent["consent"] = 1
134 |   data_concat = pd.concat([data_noconsent, data_consent])
135 |   data_concat.reset_index(inplace=True, drop=True)
136 |   if len(data_concat) != (len(data_noconsent) + len(data_consent)):
137 |     raise ValueError(
138 |         "Length of concatenated data does not match sum of individual dataframes."
139 |     )
140 |   data_preprocessed = preprocess_data(
141 |       data=data_concat,
142 |       drop_columns=list(drop_columns),
143 |       non_dummy_columns=list(non_dummy_columns),
144 |       conversion_column=conversion_column)
145 |   data_noconsent_processed = data_preprocessed[
146 |       data_preprocessed["consent"] == 0]
147 |   data_consent_processed = data_preprocessed[data_preprocessed["consent"] == 1]
148 |   return data_consent_processed, data_noconsent_processed
149 | 
--------------------------------------------------------------------------------
/cocoa/preprocess_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
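
A quick illustration of the preprocessing entry point in preprocess.py above, again a hedged sketch with invented data (the gclid column name is only an example of an id-like pass-through column):

import pandas as pd

from consent_based_conversion_adjustments.cocoa import preprocess

raw_consent = pd.DataFrame({
    "conversion_column": [10.0, 20.0],
    "product_level": ["furniture/living/sofa", "furniture/kitchen/chair"],
    "gclid": ["a1", "b2"],
})
raw_noconsent = pd.DataFrame({
    "conversion_column": [5.0, 0.0],  # the 0.0 row is dropped by _clean_data
    "product_level": ["furniture/living/sofa", "furniture/kitchen/chair"],
    "gclid": ["c3", "d4"],
})

consent, noconsent = preprocess.concatenate_and_process_data(
    raw_consent, raw_noconsent,
    conversion_column="conversion_column",
    non_dummy_columns=("gclid",))

# Both outputs share one dummy-coded column space (plus the added "consent"
# flag), which is what the nearest-neighbor matcher later relies on.
assert list(consent.columns) == list(noconsent.columns)
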
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Tests for preprocess."""
16 | 
17 | from absl.testing import absltest
18 | from absl.testing import parameterized
19 | import pandas as pd
20 | 
21 | 
22 | from consent_based_conversion_adjustments.cocoa import preprocess
23 | from consent_based_conversion_adjustments.cocoa import testing_constants
24 | 
25 | CONVERSION_COLUMN = testing_constants.CONVERSION_COLUMN
26 | ID_COLUMNS = testing_constants.ID_COLUMNS
27 | 
28 | DATA_CONSENT = testing_constants.DATA_CONSENT
29 | DATA_NOCONSENT = testing_constants.DATA_NOCONSENT
30 | 
31 | METRIC = testing_constants.METRIC
32 | 
33 | 
34 | class PreprocessTest(parameterized.TestCase):
35 | 
36 |   def setUp(self):
37 |     super().setUp()
38 |     self.fake_data_consent = DATA_CONSENT.copy()
39 |     self.fake_data_consent['new_customer'] = 0
40 |     self.fake_data_consent.loc[::3, 'new_customer'] = 1
41 |     self.fake_data_noconsent = DATA_NOCONSENT.copy()
42 |     self.fake_data_noconsent['new_customer'] = 0
43 |     self.fake_data_noconsent.loc[::3, 'new_customer'] = 1
44 | 
45 |   def test_processed_shape_matches_expected_shape(self):
46 |     joined_data = pd.concat([self.fake_data_consent, self.fake_data_noconsent])
47 |     categorical_columns = joined_data.columns[joined_data.dtypes == 'object']
48 |     n_dummy_variables = 0
49 |     for categorical_column in categorical_columns:
50 |       n_dummy_variables += joined_data[categorical_column].nunique() - 1
51 |     target_shape = self.fake_data_consent.shape[1] + n_dummy_variables + 1
52 | 
53 |     fake_preprocessed_data_consent, fake_preprocessed_data_noconsent = (
54 |         preprocess.concatenate_and_process_data(
55 |             self.fake_data_consent.copy(), self.fake_data_noconsent.copy()))
56 | 
57 |     self.assertEqual(fake_preprocessed_data_consent.shape[1], target_shape)
58 |     self.assertEqual(fake_preprocessed_data_noconsent.shape[1], target_shape)
59 | 
60 |   def test_preprocessed_conversion_values_larger_zero(self):
61 |     fake_preprocessed_data_consent, fake_preprocessed_data_noconsent = (
62 |         preprocess.concatenate_and_process_data(
63 |             self.fake_data_consent.copy(), self.fake_data_noconsent.copy()))
64 | 
65 |     values_not_larger_zero = (
66 |         fake_preprocessed_data_consent[CONVERSION_COLUMN] <= 0).sum() + (
67 |             fake_preprocessed_data_noconsent[CONVERSION_COLUMN] <= 0).sum()
68 | 
69 |     self.assertEqual(values_not_larger_zero, 0)
70 | 
71 | 
72 | if __name__ == '__main__':
73 |   absltest.main()
74 | 
--------------------------------------------------------------------------------
/cocoa/testing_constants.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Create mock data and define global variables used across multiple tests.""" 16 | 17 | import numpy as np 18 | import pandas as pd 19 | 20 | CONVERSION_COLUMN = 'conversion_column' 21 | ID_COLUMNS = ['id_column'] 22 | 23 | product_levels = ['1_1', '2_2', '1_1'] 24 | DATA_CONSENT = pd.concat([ 25 | pd.DataFrame( 26 | np.array([[1, 2, 3, 0], [0, 5, 6, 0], [1, 8, 9, 0]]), 27 | columns=['a', 'b', CONVERSION_COLUMN] + ID_COLUMNS) 28 | ] * 10) 29 | DATA_CONSENT.reset_index(inplace=True, drop=True) 30 | DATA_CONSENT['product_level'] = product_levels * 10 31 | DATA_NOCONSENT = pd.concat([ 32 | pd.DataFrame( 33 | np.array([[4, 5, 6, 0], [7, 8, 9, 0], [10, 11, 12, 0]]), 34 | columns=['a', 'b', CONVERSION_COLUMN] + ID_COLUMNS) 35 | ] * 5) 36 | DATA_NOCONSENT.reset_index(inplace=True, drop=True) 37 | DATA_NOCONSENT['product_level'] = product_levels * 5 38 | 39 | METRIC = 'manhattan' 40 | -------------------------------------------------------------------------------- /generate_template.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # Copyright 2021 Google LLC 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | echo "Requesting template to be generated..." 17 | 18 | python -m pipeline \ 19 | --input_path "gs://${PIPELINE_BUCKET}/input/dates.txt" \ 20 | --output_csv_bucket "${PIPELINE_BUCKET}" \ 21 | --output_csv_path "output" \ 22 | --bq_project "${BQ_PROJECT_ID}" \ 23 | --location "${BIGQUERY_LOCATION}" \ 24 | --table_consent "${TABLE_CONSENT}" \ 25 | --table_noconsent "${TABLE_NOCONSENT}" \ 26 | --date_column "${DATE_COLUMN}" \ 27 | --conversion_column "${CONVERSION_COLUMN}" \ 28 | --id_columns "${ID_COLUMNS}" \ 29 | --drop_columns "${DROP_COLUMNS}" \ 30 | --non_dummy_columns "${NON_DUMMY_COLUMNS}" \ 31 | --runner DataflowRunner \ 32 | --project "${PROJECT_ID}" \ 33 | --staging_location "gs://${PIPELINE_BUCKET}/staging/" \ 34 | --temp_location "gs://${PIPELINE_BUCKET}/temp/" \ 35 | --template_location "gs://${PIPELINE_BUCKET}/templates/cocoa-template" \ 36 | --region "${PIPELINE_REGION}" \ 37 | --machine_type "n1-highmem-32" \ 38 | --setup_file ./setup.py 39 | 40 | echo "Done." 41 | -------------------------------------------------------------------------------- /pipeline.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Pipeline to run conversion adjustments.
16 | 
17 | Consenting and non-consenting customer data is read by the pipeline, and
18 | adjustments are applied to the conversion values of consenting
19 | customers based on their distance from non-consenting customers and on runtime
20 | arguments like number_nearest_neighbors, radius and percentile. The adjusted
21 | data is output as a CSV where the adjusted conversions appear in a new column.
22 | """
23 | import argparse
24 | import datetime
25 | import logging
26 | import os
27 | import sys
28 | from typing import Any, Callable, List, Optional, Sequence, Tuple, Union
29 | 
30 | import apache_beam as beam
31 | from apache_beam.options.pipeline_options import PipelineOptions
32 | from apache_beam.options.pipeline_options import SetupOptions
33 | from apache_beam.options.value_provider import RuntimeValueProvider
34 | from google.cloud import bigquery
35 | from google.cloud import storage
36 | import pandas as pd
37 | 
38 | from consent_based_conversion_adjustments.cocoa import nearest_consented_customers
39 | from consent_based_conversion_adjustments.cocoa import preprocess
40 | 
41 | logging.basicConfig(level=logging.INFO)
42 | 
43 | 
44 | def _parse_known_args(
45 |     cmd_line_args: Sequence[str]) -> Tuple[argparse.Namespace, Sequence[str]]:
46 |   """Parses known arguments from the command line using the argparse library.
47 | 
48 |   Args:
49 |     cmd_line_args: Sequence of commandline arguments.
50 | 
51 |   Returns:
52 |     A tuple containing argparse.Namespace with known arguments and a list of
53 |     remaining (unknown) command line arguments.
54 |   """
55 |   parser = argparse.ArgumentParser()
56 |   parser.add_argument(
57 |       '--input_path',
58 |       dest='input_path',
59 |       required=True,
60 |       help='Path to txt file containing dates for which to run the pipeline.')
61 |   parser.add_argument(
62 |       '--output_csv_bucket',
63 |       dest='output_csv_bucket',
64 |       required=True,
65 |       help='Google Cloud Storage bucket for storing CSV output.')
66 |   parser.add_argument(
67 |       '--output_csv_path',
68 |       dest='output_csv_path',
69 |       required=True,
70 |       help='CSV output file location.')
71 |   parser.add_argument(
72 |       '--bq_project',
73 |       dest='bq_project',
74 |       required=True,
75 |       help='Google Cloud project containing the BigQuery tables.')
76 |   parser.add_argument(
77 |       '--location',
78 |       dest='location',
79 |       required=True,
80 |       help='Location of the BigQuery tables, e.g. EU.')
81 |   parser.add_argument(
82 |       '--table_consent',
83 |       dest='table_consent',
84 |       required=True,
85 |       help='BigQuery table containing consented user data.')
86 |   parser.add_argument(
87 |       '--table_noconsent',
88 |       dest='table_noconsent',
89 |       required=True,
90 |       help='BigQuery table containing non-consented user data.')
91 |   parser.add_argument(
92 |       '--date_column',
93 |       dest='date_column',
94 |       required=True,
95 |       help='BigQuery table column containing date value.')
96 |   parser.add_argument(
97 |       '--conversion_column',
98 |       dest='conversion_column',
99 |       required=True,
100 |       help='BigQuery table column containing conversion value.')
101 |   parser.add_argument(
102 |       '--id_columns',
103 |       dest='id_columns',
104 |       required=True,
105 |       help='BigQuery table columns that form a unique row e.g. GCLID,TIMESTAMP.'
106 | ) 107 | parser.add_argument( 108 | '--drop_columns', 109 | dest='drop_columns', 110 | required=False, 111 | help='BigQuery table columns that should be dropped from the data.') 112 | parser.add_argument( 113 | '--non_dummy_columns', 114 | dest='non_dummy_columns', 115 | required=False, 116 | help='BigQuery table (categorical) columns that should be kept, but not dummy-coded.' 117 | ) 118 | return parser.parse_known_args(cmd_line_args) 119 | 120 | 121 | class RuntimeOptions(PipelineOptions): 122 | """Specifies runtime options for the pipeline. 123 | 124 | Class defining the arguments that can be passed to the pipeline to 125 | customize the runtime execution. 126 | """ 127 | 128 | @classmethod # classmethod is required here for Beam's PipelineOptions. 129 | def _add_argparse_args(cls, parser): 130 | parser.add_value_provider_argument( 131 | '--number_nearest_neighbors', 132 | help='number of nearest consenting customers to select.') 133 | parser.add_value_provider_argument( 134 | '--radius', 135 | help='radius within which nearest customers should be considered.') 136 | parser.add_value_provider_argument( 137 | '--percentile', 138 | help='percentage of non-consenting customers that should be matched.') 139 | parser.add_value_provider_argument( 140 | '--metric', help='distance metric.', type=str) 141 | 142 | 143 | def _load_data_from_bq(table_name: str, location: str, project: str, 144 | start_date: str, end_date: str, 145 | date_column: str) -> pd.DataFrame: 146 | """Reads data from BigQuery filtered to the given start and end date.""" 147 | bq_client = bigquery.Client(location=location, project=project) 148 | query = f""" 149 | SELECT * FROM `{table_name}` 150 | WHERE {date_column} >= '{start_date}' and {date_column} < '{end_date}' 151 | ORDER BY {date_column} 152 | """ 153 | return bq_client.query(query).result().to_dataframe() 154 | 155 | 156 | class ConversionAdjustments(beam.DoFn): 157 | """Apache Beam ParDo transform for applying conversion adjustments.""" 158 | 159 | def __init__(self, number_nearest_neighbors: RuntimeValueProvider, 160 | radius: RuntimeValueProvider, percentile: RuntimeValueProvider, 161 | metric: RuntimeValueProvider, project: str, location: str, 162 | table_consent: str, table_noconsent: str, date_column: str, 163 | conversion_column: str, id_columns: List[str], 164 | drop_columns: Tuple[Any, 165 | ...], non_dummy_columns: Tuple[Any, 166 | ...]) -> None: 167 | """Initialises class. 168 | 169 | Args: 170 | number_nearest_neighbors: Number of nearest consenting customers to 171 | select. 172 | radius: Radius within which nearest customers should be considered. 173 | percentile: Percentage of non-consenting customers that should be matched. 174 | metric: Distance metric e.g. manhattan. 175 | project: Name of Google Cloud project containing the BigQuery tables. 176 | location: Location of the BigQuery tables e.g. EU. 177 | table_consent: BigQuery table containing consented user data. 178 | table_noconsent: BigQuery table containing non-consented user data. 179 | date_column: BigQuery table column containing date value. 180 | conversion_column: BigQuery table column containing conversion value. 181 | id_columns: BigQuery table columns that form a unique row. 182 | drop_columns: BigQuery table columns that should be dropped from the data. 183 | non_dummy_columns: BigQuery table (categorical) columns that should be 184 | kept, but not dummy-coded. 
185 | """ 186 | self._number_nearest_neighbors = number_nearest_neighbors 187 | self._radius = radius 188 | self._percentile = percentile 189 | self._metric = metric 190 | self._project = project 191 | self._location = location 192 | self._table_consent = table_consent 193 | self._table_noconsent = table_noconsent 194 | self._date_column = date_column 195 | self._conversion_column = conversion_column 196 | self._id_columns = id_columns 197 | self._drop_columns = drop_columns 198 | self._non_dummy_columns = non_dummy_columns 199 | 200 | def process( 201 | self, process_date: datetime.date 202 | ) -> Optional[Sequence[Tuple[str, pd.DataFrame, pd.DataFrame]]]: 203 | """Calculates conversion adjustments for the given date. 204 | 205 | Args: 206 | process_date: Date to be processed. 207 | 208 | Returns: 209 | Tuple containing processed date, adjusted data and summary statistics. 210 | """ 211 | logging.info('Processing date %r', process_date) 212 | # TODO(): Consider if time delta can be decided by user. 213 | end_date = str((process_date + datetime.timedelta(days=1))) 214 | start_date = str(process_date) 215 | logging.info('Pulling non-consented data for date %r', process_date) 216 | data_noconsent = _load_data_from_bq(self._table_noconsent, self._location, 217 | self._project, start_date, end_date, 218 | self._date_column) 219 | logging.info('Pulling consented data for date %r', process_date) 220 | data_consent = _load_data_from_bq(self._table_consent, self._location, 221 | self._project, start_date, end_date, 222 | self._date_column) 223 | logging.info( 224 | 'Preprocessing consented and non-consented datasets for date %r', 225 | process_date) 226 | data_consent, data_noconsent = preprocess.concatenate_and_process_data( 227 | data_consent, data_noconsent, self._conversion_column, 228 | self._drop_columns, self._non_dummy_columns) 229 | matcher = nearest_consented_customers.NearestCustomerMatcher( 230 | data_consent, self._conversion_column, self._id_columns, 231 | _get_runtime_val_or_none(self._metric)) 232 | logging.info('Calculating conversion adjustments for date %r', process_date) 233 | 234 | data_adjusted, summary_statistics_matched_conversions = nearest_consented_customers.get_adjustments_and_summary_calculations( 235 | matcher, data_noconsent, 236 | _get_runtime_val_or_none(self._number_nearest_neighbors, int), 237 | _get_runtime_val_or_none(self._radius, float), 238 | _get_runtime_val_or_none(self._percentile, float)) 239 | return [(start_date, data_adjusted, summary_statistics_matched_conversions)] 240 | 241 | 242 | def _get_runtime_val_or_none( 243 | runtime_var: RuntimeValueProvider, 244 | apply_type: Callable[[Union[int, float, str]], Union[int, float, str]] = str 245 | ) -> Optional[Union[int, float, str]]: 246 | """Gets the runtime value in the correct type. 247 | 248 | Checks if a runtime value is available. If the value is not None, convert the 249 | value to the requested type. 250 | 251 | Args: 252 | runtime_var: The runtime value provider. 253 | apply_type: A type that may be applied to non-none runtime values. 254 | 255 | Returns: 256 | Typed value if available, otherwise None. 
257 | """ 258 | if runtime_var.is_accessible(): 259 | runtime_val = runtime_var.get() 260 | if runtime_val is not None: 261 | return apply_type(runtime_val) 262 | return None 263 | 264 | 265 | def write_adjustments_to_gcs(adjustments: Tuple[str, pd.DataFrame, 266 | pd.DataFrame], bucket_name: str, 267 | path: str) -> None: 268 | """Prepares the conversion adjustments data to be written to Cloud Storage. 269 | 270 | Args: 271 | adjustments: A tuple containing processed date, adjusted data and summary 272 | statistics. 273 | bucket_name: Name of the Cloud Storage bucket where adjustments are written. 274 | path: Path on the Cloud Storage bucket where adjustments are written. 275 | 276 | Returns: 277 | None. 278 | """ 279 | adjustments_date = adjustments[0] 280 | adjustments_data = adjustments[1].to_csv(index=False) 281 | adjustments_summary = adjustments[2].to_csv(index=False) 282 | gcs_client = storage.Client() 283 | gcs_bucket = gcs_client.get_bucket(bucket_name) 284 | logging.info('Uploading conversion adjustments for date %r', adjustments_date) 285 | write_to_gcs(gcs_bucket, os.path.join(path, adjustments_date), 286 | 'adjustments_data.csv', 'text/csv', adjustments_data) 287 | logging.info('Uploading adjustments summary for date %r', adjustments_date) 288 | write_to_gcs(gcs_bucket, os.path.join(path, adjustments_date), 289 | 'adjustments_summary.csv', 'text/csv', adjustments_summary) 290 | 291 | 292 | def write_to_gcs(bucket: storage.Bucket, path: str, filename: str, 293 | data_type: str, data: str) -> None: 294 | """Writes data to the given Cloud Storage bucket.""" 295 | bucket.blob(os.path.join(path, filename)).upload_from_string(data, data_type) 296 | 297 | 298 | def get_columns_from_str(columns: Optional[str], 299 | separator: str = ',') -> Tuple[Any, ...]: 300 | """Converts columns input as separated string to tuples for further processing. 301 | 302 | A helper function to convert strings containing column names to tuples of 303 | column names. 304 | 305 | Args: 306 | columns: List of columns as a string with separators. 307 | separator: Character that separates the column names in the string. 308 | 309 | Returns: 310 | A tuple containing the columns names or empty if the column string doesn't 311 | exist or is empty. 
312 | """ 313 | if not columns: 314 | return () 315 | return tuple(columns.split(separator)) 316 | 317 | 318 | def main(argv: Sequence[str], save_main_session: bool = True) -> None: 319 | """Main entry point; defines and runs the beam pipeline.""" 320 | 321 | known_args, pipeline_args = _parse_known_args(argv) 322 | 323 | pipeline_options = PipelineOptions(pipeline_args) 324 | pipeline_options.view_as(SetupOptions).save_main_session = save_main_session 325 | runtime_options = pipeline_options.view_as(RuntimeOptions) 326 | 327 | with beam.Pipeline(options=pipeline_options) as p: 328 | 329 | dates_to_process = ( 330 | p 331 | | 'Read ISO format date string from input file' >> beam.io.ReadFromText( 332 | known_args.input_path) 333 | | 'Convert to date type' >> beam.Map(datetime.date.fromisoformat)) 334 | 335 | adjustments = ( 336 | dates_to_process 337 | | 'Apply conversion adjustments' >> beam.ParDo( 338 | ConversionAdjustments( 339 | number_nearest_neighbors=runtime_options 340 | .number_nearest_neighbors, 341 | radius=runtime_options.radius, 342 | percentile=runtime_options.percentile, 343 | metric=runtime_options.metric, 344 | project=known_args.bq_project, 345 | location=known_args.location, 346 | table_consent=known_args.table_consent, 347 | table_noconsent=known_args.table_noconsent, 348 | conversion_column=known_args.conversion_column, 349 | id_columns=list(known_args.id_columns.split(',')), 350 | date_column=known_args.date_column, 351 | drop_columns=get_columns_from_str(known_args.drop_columns), 352 | non_dummy_columns=get_columns_from_str( 353 | known_args.non_dummy_columns)))) 354 | 355 | _ = ( 356 | adjustments 357 | | 'Write adjusted data as CSV files to cloud storage' >> beam.Map( 358 | write_adjustments_to_gcs, 359 | bucket_name=known_args.output_csv_bucket, 360 | path=known_args.output_csv_path)) 361 | 362 | 363 | if __name__ == '__main__': 364 | main(sys.argv) 365 | -------------------------------------------------------------------------------- /pipeline_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
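
The small pure helpers in pipeline.py above are easy to sanity-check in isolation. A couple of illustrative calls for get_columns_from_str (assuming the package is installed so the import resolves):

from consent_based_conversion_adjustments import pipeline

# Comma-separated flag values become tuples of column names ...
assert pipeline.get_columns_from_str('gclid,conversion_timestamp') == (
    'gclid', 'conversion_timestamp')
# ... and an unset optional flag maps to an empty tuple, i.e. "no columns".
assert pipeline.get_columns_from_str(None) == ()
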
14 | 15 | """Tests for pipeline.""" 16 | import dataclasses 17 | import datetime 18 | 19 | from absl.testing import absltest 20 | from absl.testing import parameterized 21 | import apache_beam as beam 22 | from apache_beam.testing.util import assert_that 23 | from apache_beam.testing.util import equal_to 24 | import pandas as pd 25 | 26 | from consent_based_conversion_adjustments import pipeline 27 | 28 | _DATA_NOCONSENT = [{ 29 | 'gclid': '21', 30 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 31 | 'conversion_value': 20.0, 32 | 'conversion_date': '2021-11-20', 33 | 'conversion_item': 'dress' 34 | }] 35 | _DATA_CONSENT = [{ 36 | 'gclid': '1', 37 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 38 | 'conversion_value': 10.0, 39 | 'conversion_date': '2021-11-20', 40 | 'conversion_item': 'dress' 41 | }] 42 | _DATA_CONSENT_MULTI = [ 43 | { 44 | 'gclid': '1', 45 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 46 | 'conversion_value': 10.0, 47 | 'conversion_date': '2021-11-20', 48 | 'conversion_item': 'dress' 49 | }, 50 | { 51 | 'gclid': '2', 52 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 53 | 'conversion_value': 10.0, 54 | 'conversion_date': '2021-11-20', 55 | 'conversion_item': 'dress' 56 | }, 57 | ] 58 | _PIPELINE_RUN_DATE = '2021-11-20' 59 | _PROJECT = 'cocoa_test' 60 | _LOCATION = 'EU' 61 | _TABLE_CONSENT = 'table_consent' 62 | _TABLE_NOCONSENT = 'table_noconsent' 63 | _CONVERSION_COLUMN = 'conversion_value' 64 | _ID_COLUMNS = ['gclid', 'conversion_timestamp'] 65 | _DATE_COLUMN = 'conversion_date' 66 | _DROP_COLUMNS = [] 67 | _NON_DUMMY_COLUMNS = _ID_COLUMNS 68 | 69 | 70 | def _fake_load_data_from_bq(table_name: str, *args, **kwargs) -> pd.DataFrame: # Patching method so pylint: disable=unused-argument 71 | if table_name == 'table_consent': 72 | return pd.DataFrame.from_records(_DATA_CONSENT) 73 | elif table_name == 'table_consent_multi': 74 | return pd.DataFrame.from_records(_DATA_CONSENT_MULTI) 75 | elif table_name == 'table_noconsent': 76 | return pd.DataFrame.from_records(_DATA_NOCONSENT) 77 | else: 78 | return pd.DataFrame([]) 79 | 80 | 81 | @dataclasses.dataclass(frozen=True) 82 | class RuntimeParam: 83 | value: any 84 | accessible: bool = True 85 | 86 | def get(self): 87 | return self.value 88 | 89 | def is_accessible(self): 90 | return self.accessible 91 | 92 | 93 | class PipelineTest(parameterized.TestCase): 94 | 95 | @classmethod 96 | def setUpClass(cls): 97 | super().setUpClass() 98 | # Replace network calls to BigQuery with a local fake reply 99 | pipeline._load_data_from_bq = _fake_load_data_from_bq 100 | 101 | @parameterized.named_parameters( 102 | dict( 103 | testcase_name='_completely_when_single_nearest_neighbor', 104 | number_nearest_neighbors=1, 105 | table_consent='table_consent', 106 | expected_output=20.0), 107 | dict( 108 | testcase_name='_partially_when_multiple_nearest_neighbor', 109 | number_nearest_neighbors=2, 110 | table_consent='table_consent_multi', 111 | expected_output=10.0)) 112 | def test_conversion_adjustments_value_assigned(self, 113 | number_nearest_neighbors: int, 114 | table_consent: str, 115 | expected_output: float): 116 | 117 | with beam.Pipeline(beam.runners.direct.DirectRunner()) as p: 118 | date_to_process = ( 119 | p | 'Process date' >> beam.Create( 120 | [datetime.date.fromisoformat(_PIPELINE_RUN_DATE)])) 121 | adjustments = ( 122 | date_to_process 123 | | beam.ParDo( 124 | pipeline.ConversionAdjustments( 125 | number_nearest_neighbors=RuntimeParam( 126 | number_nearest_neighbors), 127 | radius=RuntimeParam(None), 128 
| percentile=RuntimeParam(None), 129 | metric=RuntimeParam('manhattan'), 130 | project=_PROJECT, 131 | location=_LOCATION, 132 | table_consent=table_consent, 133 | table_noconsent=_TABLE_NOCONSENT, 134 | conversion_column=_CONVERSION_COLUMN, 135 | id_columns=_ID_COLUMNS, 136 | date_column=_DATE_COLUMN, 137 | drop_columns=_DROP_COLUMNS, 138 | non_dummy_columns=_NON_DUMMY_COLUMNS))) 139 | 140 | adjusted_conversion_value = ( 141 | adjustments 142 | | 143 | 'Select single row' >> beam.Map(lambda x: x[1][x[1]['gclid'] == '1']) 144 | | 'Calculate conversion value' >> 145 | beam.Map(lambda x: x['adjusted_conversion'].sum())) 146 | assert_that(adjusted_conversion_value, equal_to([expected_output])) 147 | 148 | 149 | if __name__ == '__main__': 150 | absltest.main() 151 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | . -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 
15 | """Setup script for package CoCoA."""
16 | import setuptools
17 | 
18 | with open('README.md', 'r', encoding='utf-8') as fh:
19 |   long_description = fh.read()
20 | 
21 | setuptools.setup(
22 |     name='consent_based_conversion_adjustments',
23 |     version='0.0.1',
24 |     author='gTech Professional Services',
25 |     author_email='',
26 |     description='Adjust conversion values for Smart Bidding OCI',
27 |     long_description=long_description,
28 |     long_description_content_type='text/markdown',
29 |     packages=setuptools.find_packages(),
30 |     install_requires=[
31 |         'absl-py==1.0.0',
32 |         'apache-beam==2.28.0',
33 |         'avro-python3==1.9.2.1',
34 |         'cachetools==4.2.4',
35 |         'certifi==2021.10.8',
36 |         'charset-normalizer==2.0.12',
37 |         'crcmod==1.7',
38 |         'dill==0.3.1.1',
39 |         'docopt==0.6.2',
40 |         'fastavro==1.4.9',
41 |         'fasteners==0.17.3',
42 |         'future==0.18.2',
43 |         'google-api-core==1.31.5',
44 |         'google-apitools==0.5.31',
45 |         'google-auth==1.35.0',
46 |         'google-cloud-bigquery==2.34.0',
47 |         'google-cloud-bigquery-storage==2.11.0',
48 |         'google-cloud-bigtable==1.7.0',
49 |         'google-cloud-core==1.7.2',
50 |         'google-cloud-datastore==1.15.3',
51 |         'google-cloud-dlp==3.6.0',
52 |         'google-cloud-language==1.3.0',
53 |         'google-cloud-pubsub==1.7.0',
54 |         'google-cloud-recommendations-ai==0.2.0',
55 |         'google-cloud-spanner==1.19.1',
56 |         'google-cloud-storage==2.1.0',
57 |         'google-cloud-videointelligence==1.16.1',
58 |         'google-cloud-vision==1.0.0',
59 |         'google-crc32c==1.3.0',
60 |         'google-resumable-media==2.2.1',
61 |         'googleapis-common-protos==1.54.0',
62 |         'grpc-google-iam-v1==0.12.3',
63 |         'grpcio==1.44.0',
64 |         'grpcio-gcp==0.2.2',
65 |         'hdfs==2.6.0',
66 |         'httplib2==0.17.4',
67 |         'idna==3.3',
68 |         'joblib==1.1.0',
69 |         'libcst==0.4.1',
70 |         'mock==2.0.0',
71 |         'mypy-extensions==0.4.3',
72 |         'numpy==1.19.5',
73 |         'oauth2client==4.1.3',
74 |         'orjson==3.6.7',
75 |         'packaging==21.3',
76 |         'pandas>=1.3',
77 |         'pbr==5.8.1',
78 |         'proto-plus==1.20.3',
79 |         'protobuf==3.19.4',
80 |         'pyarrow==2.0.0',
81 |         'pyasn1==0.4.8',
82 |         'pyasn1-modules==0.2.8',
83 |         'pydot==1.4.2',
84 |         'pymongo==3.12.3',
85 |         'pyparsing==2.4.7',
86 |         'python-dateutil==2.8.2',
87 |         'pytz==2021.3',
88 |         'PyYAML==6.0',
89 |         'requests==2.27.1',
90 |         'rsa==4.8',
91 |         'scikit-learn==1.0.2',
92 |         'scipy>=1.7',
93 |         'six==1.16.0',
94 |         'sklearn==0.0',
95 |         'threadpoolctl==3.1.0',
96 |         'typing-extensions==3.7.4.3',
97 |         'typing-inspect==0.7.1',
98 |         'urllib3==1.26.8',
99 |     ],
100 |     classifiers=[
101 |         'Programming Language :: Python :: 3',
102 |         'License :: OSI Approved :: Apache Software License',
103 |         'Operating System :: OS Independent',
104 |     ],
105 |     python_requires='>=3.7',
106 | )
107 | 
--------------------------------------------------------------------------------
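
Finally, once generate_template.sh has staged the classic template, a job can be launched programmatically. Below is a hedged sketch using the Dataflow v1b3 REST API via google-api-python-client (not pinned in setup.py, so it would need to be installed separately); all project, bucket and region values are placeholders:

from googleapiclient.discovery import build

# Placeholder values; substitute your own project, bucket and region.
project = 'my-project'
region = 'europe-west1'
template_path = 'gs://my-pipeline-bucket/templates/cocoa-template'

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    gcsPath=template_path,
    body={
        'jobName': 'cocoa-adjustments',
        # Runtime parameters declared in RuntimeOptions; per
        # get_adjustments_and_summary_calculations, exactly one of
        # number_nearest_neighbors, radius or percentile may be set.
        'parameters': {
            'number_nearest_neighbors': '10',
            'metric': 'manhattan',
        },
    },
)
response = request.execute()
print(response['job']['id'])
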