├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── cloud_function ├── main.py ├── main_test.py └── requirements.txt ├── cocoa ├── __init__.py ├── cocoa_template.ipynb ├── nearest_consented_customers.py ├── nearest_consented_customers_test.py ├── preprocess.py ├── preprocess_test.py └── testing_constants.py ├── generate_template.sh ├── pipeline.py ├── pipeline_test.py ├── requirements.txt └── setup.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, gender identity and expression, level of 9 | experience, education, socio-economic status, nationality, personal appearance, 10 | race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or reject 41 | comments, commits, code, wiki edits, issues, and other contributions that are 42 | not aligned to this Code of Conduct, or to ban temporarily or permanently any 43 | contributor for other behaviors that they deem inappropriate, threatening, 44 | offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | This Code of Conduct also applies outside the project spaces when the Project 56 | Steward has a reasonable belief that an individual's behavior may have a 57 | negative impact on the project or its community. 58 | 59 | ## Conflict Resolution 60 | 61 | We do not believe that all conflict is bad; healthy debate and disagreement 62 | often yield positive results. 
However, it is never okay to be disrespectful or 63 | to engage in behavior that violates the project’s code of conduct. 64 | 65 | If you see someone violating the code of conduct, you are encouraged to address 66 | the behavior directly with those involved. Many issues can be resolved quickly 67 | and easily, and this gives people more control over the outcome of their 68 | dispute. If you are unable to resolve the matter for any reason, or if the 69 | behavior is threatening or harassing, report it. We are dedicated to providing 70 | an environment where participants feel welcome and safe. 71 | 72 | Reports should be directed to *[PROJECT STEWARD NAME(s) AND EMAIL(s)]*, the 73 | Project Steward(s) for *[PROJECT NAME]*. It is the Project Steward’s duty to 74 | receive and address reported violations of the code of conduct. They will then 75 | work with a committee consisting of representatives from the Open Source 76 | Programs Office and the Google Open Source Strategy team. If for any reason you 77 | are uncomfortable reaching out to the Project Steward, please email 78 | opensource@google.com. 79 | 80 | We will investigate every complaint, but you may not receive a direct response. 81 | We will use our discretion in determining when and how to follow up on reported 82 | incidents, which may range from not taking action to permanent expulsion from 83 | the project and project-sponsored spaces. We will notify the accused of the 84 | report and provide them an opportunity to discuss it before any action is taken. 85 | The identity of the reporter will be omitted from the details of the report 86 | supplied to the accused. In potentially harmful situations, such as ongoing 87 | harassment or threats to anyone's safety, we may take action without notice. 88 | 89 | ## Attribution 90 | 91 | This Code of Conduct is adapted from the Contributor Covenant, version 1.4, 92 | available at 93 | https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 94 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a Contributor License 9 | Agreement (CLA). You (or your employer) retain the copyright to your 10 | contribution; this simply gives us permission to use and redistribute your 11 | contributions as part of the project. Head over to 12 | to see your current agreements on file or 13 | to sign a new one. 14 | 15 | You generally only need to submit a CLA once, so if you've already submitted one 16 | (even if it was for a different project), you probably don't need to do it 17 | again. 18 | 19 | ## Code Reviews 20 | 21 | All submissions, including submissions by project members, require review. We 22 | use GitHub pull requests for this purpose. Consult 23 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 24 | information on using pull requests. 25 | 26 | ## Community Guidelines 27 | 28 | This project follows 29 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/). 
30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 
62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 
180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Consent-based Conversion Adjustments 2 | 3 | ## Problem statement 4 | 5 | Given regulatory requirements, customers have the choice to accept or decline 6 | third-party cookies. For those who opt-out of third-party cookie tracking 7 | (hereafter, non-consenting customers), data on their conversions on an 8 | advertiser's website cannot be shared with Smart Bidding. This potential data 9 | loss can lead to worse bidding performance, or drifts in the bidding behaviour 10 | away from the advertiser's initial goals. 11 | 12 | We have developed a solution that allows advertisers to capitalise on their 13 | first-party data in order to statistically up-weight conversion values of 14 | customers that gave consent. By doing this, advertisers have the possibility to 15 | feed back up to 100% of the factual conversion values back into Smart Bidding. 16 | 17 | ## Solution description 18 | 19 | We take the following approach: For all consenting and non-consenting customers 20 | that converted on a given day, the advertiser has access to first-party data 21 | that describes the customers. Examples could be the adgroup-title that a 22 | conversion is attributed to, the device type used, or demographic information. 23 | Based on this information, a feature space can be created that describes each 24 | consenting and non-consenting customer. Importantly, this feature space has to 25 | be the same for all customers. 26 | 27 | Given this feature space, we can create a distance-graph for all *consenting* 28 | customers in our dataset, and find the nearest consenting customers for each 29 | *non-consenting* customer. This is done using a NearestNeighbor model. The 30 | non-consenting customer's conversion value can then be split across all 31 | identified nearest consenting customers, in proportion to the similarity between 32 | the non-consenting and the non-consenting customers. 33 | 34 | ## Model Parameters 35 | 36 | * Distance metric: We need to define the distance metric to use when 37 | determining the nearest consenting customers. By default, this is set to 38 | `manhattan distance`. 
39 | * Radius, number of nearest neighbors, or percentile: In coordination with the 40 | advertiser and depending on the dataset as well as business requirements, 41 | the user can choose between: 42 | * setting a fixed radius within which all nearest neighbors should be 43 | selected, 44 | * setting a fixed number of nearest neighbors that should be selected for 45 | each non-consenting customer, independent of their distance to them 46 | * finding the required radius to ensure that at least `x%` of 47 | non-consenting customers would have at least one sufficiently close 48 | neighbor. 49 | 50 | ## Data requirements 51 | 52 | As mentioned above, consenting and non-consenting customers must lie in the same 53 | feature space. This is currently achieved by considering the adgroup a given 54 | customer has clicked on and splitting it according to the advertiser's logic. 55 | This way, customers that came through similar adgroups are considered being more 56 | similar to each other. All customers to be considered must have a valid 57 | conversion value larger zero and must not have missing data. 58 | 59 | ## How to use the solution 60 | 61 | This solution uses an Apache Beam pipeline to find the nearest consenting 62 | customers for each non-consenting customer. The following instructions show how 63 | to run the pipeline on Google Cloud Dataflow, however any other suitable Apache 64 | Beam runner may be used as well. 65 | 66 | ### Installation 67 | 68 | > Note: This solution requires 3.6 <= Python < 3.9 as Beam does not currently 69 | > support Python 3.9. 70 | 71 | #### Set up Dataflow Template 72 | 73 | * Navigate to your Google Cloud Project and activate the Cloud Shell 74 | * Set the current project by running `gcloud config set project 75 | [YOUR_PROJECT_ID]` 76 | * Clone this repository and `cd` into the project directory 77 | * Download pyenv as described 78 | [here](https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-VirtualEnvironmentswithpyenv). 79 | * Create and activate a virtual environment as follows: 80 | 81 | ``` 82 | pyenv install 3.8.12 83 | pyenv virtualenv 3.8.12 env 84 | pyenv activate env 85 | ``` 86 | 87 | * Install python3 dependencies `pip3 install -r requirements.txt` 88 | 89 | * Create a GCS bucket (if one does not already exist) where the Dataflow 90 | template as well as all inputs and outputs will be stored 91 | 92 | * Set an environment variable with the name of the bucket `export 93 | PIPELINE_BUCKET=[YOUR_CLOUD_STORAGE_BUCKET_NAME]` 94 | 95 | * To read data from BigQuery, we need to know the project containing your 96 | BigQuery tables. Set an environment variable `export 97 | BQ_PROJECT_ID=[YOUR_BIGQUERY_PROJECT_ID]` 98 | 99 | * Additionally, set the location of your BigQuery tables `export 100 | BIGQUERY_LOCATION=[YOUR_BIGQUERY_LOCATION]` e.g. 'EU' for Europe 101 | 102 | * Set an environment variable with the name of your BigQuery table with 103 | consenting user data `export TABLE_CONSENT=[CONSENTING_USER_TABLE_NAME]` 104 | 105 | * Set an environment variable with the name of your BigQuery table with 106 | non-consenting user data `export 107 | TABLE_NOCONSENT=[NON_CONSENTING_USER_TABLE_NAME]` 108 | 109 | * Set an environment variable with the name of the data column in your tables 110 | `export DATE_COLUMN=[DATA_COLUMN_NAME]` 111 | 112 | * To up-weight the conversion values in our dataset, we need to know which 113 | column represents the conversion values in the input data. 
Set an 114 | environment variable with the name of the conversion column `export 115 | CONVERSION_COLUMN=[YOUR_CONVERSION_COLUMN_NAME]` 116 | 117 | * The final output of the pipeline is a CSV file that may be used for Offline 118 | Conversion Import (OCI) into Google Ads or Google Marketing Platform (GMP). 119 | Each row of this OCI CSV must be unique. Set an environment variable with 120 | the list of columns in the input data that together form a unique ID `export 121 | ID_COLUMNS=[COMMA_SEPARATED_ID_COLUMNS_LIST]` e.g. export 122 | ID_COLUMNS=GCLID,TIMESTAMP,ADGROUP_NAME (**no spaces** between the commas!) 123 | 124 | * You may want to exclude some columns in your data from being used for 125 | matching. Set an environment variable with the list of columns in the input 126 | data that should be dropped e.g. `export DROP_COLUMNS=feature2,feature5` 127 | 128 | * Provide all categorical columns in your data that should not be 129 | [dummy-coded](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) 130 | e.g. `export non_dummy_columns=GCLID,TIMESTAMP` 131 | 132 | * Set an environment variable with the project id `export 133 | PROJECT_ID=[YOUR_PROJECT_ID]` 134 | 135 | * Set an environment variable with the 136 | [regional endpoint](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints) 137 | to deploy your Dataflow job `export 138 | PIPELINE_REGION=[YOUR_REGIONAL_ENDPOINT]` 139 | 140 | * Generate the template by running `./generate_template.sh` 141 | 142 | * Deactivate the virtual env by typing `pyenv deactivate` and close cloud 143 | shell 144 | 145 | #### Set up Cloud Function 146 | 147 | The Apache Beam pipeline that we set up above will be triggered by a Cloud 148 | Function. Following instructions show how to set up the Cloud Function: 149 | 150 | * Open Cloud Functions from the navigation menu in your Google Cloud Project 151 | * If not done already, enable the Cloud Functions and Cloud Build APIs 152 | * Select `Create function` and fill in the required fields such as `Function 153 | name` and `Region`. Choose Cloud Pub/Sub as a trigger and create a new 154 | topic. We will later write to this topic whenever the BigQuery tables have 155 | new data, thereby triggering the cloud function 156 | * Under runtime setting, set timeout to at least 60 seconds to give ample time 157 | for the Cloud Function to run. Click next 158 | * Upload the contents of the `cloud_function` directory found in the repo to 159 | Cloud Functions 160 | * Select Python 3.8 as Runtime and set Entry point to "run". 161 | * Update the required values in `main.py` as marked by `TODO(): ...` 162 | * Deploy the Cloud Function 163 | 164 | #### Set up Cloud Logging to Pub/Sub sink 165 | 166 | > Note: For this section, we assume that you wish to trigger the Dataflow 167 | > pipeline whenever new data is inserted in the non-consented or consented 168 | > tables. If you have a different requirement, proceed accordingly with setting 169 | > up a trigger for the Cloud Function. See also: 170 | > [Using Cloud Scheduler and Pub/Sub to trigger a Cloud Function](https://cloud.google.com/scheduler/docs/tut-pub-sub). 171 | 172 | * In Cloud Logging on your Google Cloud Project, filter to the relevant 173 | BigQuery event. 
For example, to filter by table inserts, use: 174 | 175 | ``` 176 | protoPayload.serviceName="bigquery.googleapis.com" 177 | protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" 178 | protoPayload.resourceName="projects/[YOUR_PROJECT_ID]/datasets/[YOUR_DATASET]/tables/[YOUR_TABLE_NAME]" 179 | resource.labels.project_id="[YOUR_PROJECT_ID]" 180 | protoPayload.metadata.tableDataChange.reason="QUERY" 181 | ``` 182 | 183 | Once the relevant event is available, create a sink that routes your logs to 184 | the Pub/Sub topic defined above. For more information on creating sinks, see 185 | the 186 | [documentation](https://cloud.google.com/logging/docs/export/configure_export_v2). 187 | 188 | * With this in place, the Dataflow pipeline should get triggered whenever new 189 | data is inserted in your Bigquery tables. 190 | 191 | ## Contributing 192 | 193 | See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details. 194 | 195 | ## License 196 | 197 | Apache 2.0; see [`LICENSE`](LICENSE) for details. 198 | 199 | ## Disclaimer 200 | 201 | This is not an official Google product. 202 | -------------------------------------------------------------------------------- /cloud_function/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """A Cloud Function that triggers the CoCoA Dataflow pipeline.""" 16 | import datetime 17 | import os 18 | from typing import Any, Dict, Sequence 19 | 20 | from google.cloud import bigquery 21 | from google.cloud import storage 22 | from googleapiclient.discovery import build 23 | 24 | # TODO(): Update the project and dataflow parameters below. 25 | # Project and bigquery related parameters. 26 | _PROJECT = 'cocoa' 27 | _GCS_BUCKET = 'cocoa-df' 28 | _BIGQUERY_LOCATION = 'EU' 29 | _DATAFLOW_REGION = 'europe-west3' 30 | _DATAFLOW_SUBNET = 'default' 31 | _DATE_COLUMN = 'conversion_date' 32 | _TABLE_NAME_NOCONSENT = 'pipeline_test.noconsent' 33 | _INPUT_FILE_PATH = 'input' 34 | _LOOKBACK_WINDOW = 1 35 | 36 | # Dataflow parameters. 37 | JOB = 'cocoa-cloud-function-test' 38 | TEMPLATE = f'gs://{_GCS_BUCKET}/templates/cocoa-template' 39 | PARAMETERS = { 40 | 'metric': 'manhattan', 41 | 'number_nearest_neighbors': '1', 42 | } 43 | ENVIRONMENT = { 44 | 'subnetwork': 45 | f'https://www.googleapis.com/compute/v1/projects/{_PROJECT}/regions/{_DATAFLOW_REGION}/subnetworks/{_DATAFLOW_SUBNET}', 46 | } 47 | 48 | 49 | def run( 50 | event: Dict[str, Any], context: Any # Cloud Function so pylint: disable=unused-argument 51 | ) -> None: 52 | """Background Cloud Function to be triggered by Cloud Logging. 53 | 54 | Prepares input data for the Dataflow pipeline and then triggers the pipeline. 55 | 56 | Args: 57 | event : The dictionary with data specific to this type of event. 
For 58 | details, see: 59 | https://cloud.google.com/functions/docs/samples/functions-log-stackdriver#code-sample 60 | context: Metadata of triggering event. 61 | 62 | Raises: 63 | RuntimeError: A dependency was not found, requiring this CF to exit. 64 | The RuntimeError is not raised explicitly in this function but is default 65 | behavior for any Cloud Function. 66 | 67 | Returns: 68 | None. 69 | """ 70 | _prepare_pipeline_input() 71 | 72 | dataflow = build('dataflow', 'v1b3') 73 | request = dataflow.projects().locations().templates().launch( 74 | projectId=_PROJECT, 75 | gcsPath=TEMPLATE, 76 | location=_DATAFLOW_REGION, 77 | body={ 78 | 'jobName': JOB, 79 | 'parameters': PARAMETERS, 80 | 'environment': ENVIRONMENT, 81 | }) 82 | 83 | request.execute() 84 | 85 | 86 | def _prepare_pipeline_input() -> None: 87 | """Prepares dates to be processed for the CoCoA Dataflow pipeline. 88 | 89 | The CoCoA Dataflow pipeline requires an input file containing dates to 90 | process. This function prepares this file and uploads it to a Cloud Storage 91 | bucket. 92 | """ 93 | latest_date = _get_latest_date_from_bigquery(_TABLE_NAME_NOCONSENT, 94 | _BIGQUERY_LOCATION, _PROJECT, 95 | _DATE_COLUMN) 96 | dates_to_process = _get_dates_to_process(latest_date, _LOOKBACK_WINDOW) 97 | 98 | # Prepare string of newline separated dates and write to GCS 99 | write_to_gcs(_GCS_BUCKET, _INPUT_FILE_PATH, 'dates.txt', 100 | '\n'.join(map(datetime.date.isoformat, dates_to_process))) 101 | 102 | 103 | def _get_dates_to_process( 104 | start_date: str, 105 | lookback_window: int) -> Sequence[datetime.date]: 106 | """Generates a sequence of dates. 107 | 108 | Creates a sequence of dates based on the given start date and lookback window. 109 | 110 | Args: 111 | start_date: Starting date for the sequence. 112 | lookback_window: Number of days in the past. 113 | 114 | Returns: 115 | A sequence of dates ranging from start_date - lookback_window to the 116 | start_date. 117 | """ 118 | return [ 119 | datetime.date.fromisoformat(start_date) - 120 | datetime.timedelta(days=delta) for delta in range(lookback_window) 121 | ] 122 | 123 | 124 | def _get_latest_date_from_bigquery(table_name: str, location: str, project: str, 125 | date_column: str) -> str: 126 | """Gets the latest date from a date column in a BigQuery table.""" 127 | bq_client = bigquery.Client(location=location, project=project) 128 | query = f""" 129 | SELECT FORMAT_DATETIME("%F", MAX({date_column})) AS input_date 130 | FROM `{table_name}` 131 | """ 132 | results_iter = bq_client.query(query).result() 133 | # There is only one row as we query by max(), which we now return. 134 | return next(results_iter).input_date 135 | 136 | 137 | def write_to_gcs(bucket_name: str, path: str, filename: str, data: str) -> None: 138 | """Writes the given data to a Google Cloud Storage Bucket.""" 139 | gcs_client = storage.Client() 140 | gcs_bucket = gcs_client.get_bucket(bucket_name) 141 | gcs_bucket.blob(os.path.join(path, 142 | filename)).upload_from_string(data, 'text/csv') 143 | -------------------------------------------------------------------------------- /cloud_function/main_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Tests for main.""" 16 | import datetime 17 | 18 | from absl.testing import parameterized 19 | from consent_based_conversion_adjustments.cloud_function import main 20 | 21 | 22 | class PipelineTest(parameterized.TestCase): 23 | 24 | @parameterized.named_parameters( 25 | dict( 26 | testcase_name='_start_date_as_single_entry_list', 27 | start_date='2021-12-15', 28 | lookback_window=1, 29 | expected_output=[datetime.date.fromisoformat('2021-12-15')]), 30 | dict( 31 | testcase_name='_list_with_start_date_and_previous_day', 32 | start_date='2021-12-15', 33 | lookback_window=2, 34 | expected_output=[ 35 | datetime.date.fromisoformat('2021-12-15'), 36 | datetime.date.fromisoformat('2021-12-14') 37 | ])) 38 | def test_get_dates_to_process_returns(self, start_date, lookback_window, 39 | expected_output): 40 | result = main.get_dates_to_process(start_date, lookback_window) 41 | self.assertListEqual(result, expected_output) 42 | -------------------------------------------------------------------------------- /cloud_function/requirements.txt: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | google-api-python-client 16 | google-cloud-bigquery 17 | google-cloud-storage 18 | -------------------------------------------------------------------------------- /cocoa/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | -------------------------------------------------------------------------------- /cocoa/cocoa_template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "x1LyGCNZhWBH" 7 | }, 8 | "source": [ 9 | "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", 10 | " \u003ctd\u003e\n", 11 | " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/google/consent-based-conversion-adjustments/blob/main/cocoa/cocoa_template.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", 12 | " \u003c/td\u003e\n", 13 | " \u003ctd\u003e\n", 14 | " \u003ca target=\"_blank\" href=\"https://github.com/google/consent-based-conversion-adjustments/blob/main/cocoa/cocoa_template.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", 15 | " \u003c/td\u003e\n", 16 | "\u003c/table\u003e" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "wLQsEboILf9_" 23 | }, 24 | "source": [ 25 | "###### License" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "mMCxIddRJkps" 32 | }, 33 | "source": [ 34 | " Copyright 2020 Google LLC\n", 35 | "\n", 36 | " Licensed under the Apache License, Version 2.0 (the \"License\");\n", 37 | " you may not use this file except in compliance with the License.\n", 38 | " You may obtain a copy of the License at\n", 39 | "\n", 40 | " http://www.apache.org/licenses/LICENSE-2.0\n", 41 | "\n", 42 | " Unless required by applicable law or agreed to in writing, software\n", 43 | " distributed under the License is distributed on an \"AS IS\" BASIS,\n", 44 | " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", 45 | " See the License for the specific language governing permissions and\n", 46 | " limitations under the License." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": { 52 | "id": "4RwBf7lfJxPs" 53 | }, 54 | "source": [ 55 | "# Consent-based Conversion Adjustments\n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": { 61 | "id": "-rGQ-Vm1Lnh_" 62 | }, 63 | "source": [ 64 | "In this notebook, we are illustrating how we can use a non-parametric model (based on k nearest-neighbors) to redistribute conversion values of customers opting out of advertising cookies over customers who opt in. \n", 65 | "The resulting conversion-value adjustments can be used within value-based bidding to prevent biases in the bidding-algorithm due to systematic differences between customers who opt in vs customers who don't." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "-IENCIRJNyqF" 72 | }, 73 | "source": [ 74 | "# Imports" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "id": "yLBUAG1O5S2S" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "!pip install git+https://github.com/google/consent-based-conversion-adjustments.git\n", 86 | "from IPython.display import clear_output\n", 87 | "\n", 88 | "clear_output()\n" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "id": "ZRAT4tTpZY51" 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "from itertools import combinations\n", 100 | "import typing\n", 101 | "\n", 102 | "import matplotlib.pyplot as plt\n", 103 | "import numpy as np\n", 104 | "import pandas as pd\n", 105 | "\n", 106 | "\n", 107 | "np.random.seed(123)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "id": "Bl_p0SoAw7Gz" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "id": "lIfB_qb851sE" 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "from cocoa import nearest_consented_customers" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": { 134 | "id": "QK9vYsW8N7F3" 135 | }, 136 | "source": [ 137 | "# Data Simulation" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "cellView": "form", 145 | "id": "N21wIR0hbCTb" 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "#@title Create fake dataset of adgroups and conversion values\n", 150 | "#@markdown We are generating random data: each row is an individual conversion\n", 151 | "#@markdown with a given conversion value. \\\n", 152 | "#@markdown For each conversion, we know the\n", 153 | "#@markdown adgroup, which is our only feature here and just consists of 3 letters.\n", 154 | "\n", 155 | "n_consenting_customers = 8000 #@param\n", 156 | "n_nonconsenting_customers = 2000 #@param\n", 157 | "\n", 158 | "\n", 159 | "def simulate_conversion_data_consenting_non_consenting(\n", 160 | " n_consenting_customers: int,\n", 161 | " n_nonconsenting_customers: int) -\u003e typing.Tuple[pd.DataFrame, pd.DataFrame]:\n", 162 | " \"\"\"Simulates dataframes for consenting and non-consenting customers.\n", 163 | "\n", 164 | " Args:\n", 165 | " n_consenting_customers: Desired number of consenting customers. 
Should be\n", 166 | " larger than n_nonconsenting_customers.\n", 167 | " n_nonconsenting_customers: Desired number non non-consenting customers.\n", 168 | "\n", 169 | " Returns:\n", 170 | " Two dataframes of simulated consenting and non-consenting customers.\n", 171 | " \"\"\"\n", 172 | " fake_adgroups = np.array(\n", 173 | " ['_'.join(fake_ad) for fake_ad in (combinations('ABCDEFG', 3))])\n", 174 | "\n", 175 | " data_consenting = pd.DataFrame.from_dict({\n", 176 | " 'adgroup':\n", 177 | " fake_adgroups[np.random.randint(\n", 178 | " low=0, high=len(fake_adgroups), size=n_consenting_customers)],\n", 179 | " 'conversion_value':\n", 180 | " np.random.lognormal(1, size=n_consenting_customers)\n", 181 | " })\n", 182 | "\n", 183 | " data_nonconsenting = pd.DataFrame.from_dict({\n", 184 | " 'adgroup':\n", 185 | " fake_adgroups[np.random.randint(\n", 186 | " low=0, high=len(fake_adgroups), size=n_nonconsenting_customers)],\n", 187 | " 'conversion_value':\n", 188 | " np.random.lognormal(1, size=n_nonconsenting_customers)\n", 189 | " })\n", 190 | " return data_consenting, data_nonconsenting\n", 191 | "\n", 192 | "\n", 193 | "data_consenting, data_nonconsenting = simulate_conversion_data_consenting_non_consenting(\n", 194 | " n_consenting_customers, n_nonconsenting_customers)\n", 195 | "data_consenting.head()" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": { 201 | "id": "V_9JEHeZOF8X" 202 | }, 203 | "source": [ 204 | "# Preprocessing" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "cellView": "form", 212 | "id": "K-WQvtKYdn6N" 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "#@title Split adgroups in separate levels\n", 217 | "#@markdown We preprocess our data. Consenting and non-consenting data are\n", 218 | "#@markdown concatenated to ensure that they have the same feature-columns. \\\n", 219 | "#@markdown We then split our adgroup-string into its components and dummy code each.\n", 220 | "#@markdown The level of each letter in the adgroup-string is added as prefix here.\n", 221 | "\n", 222 | "def preprocess_data(data_consenting, data_nonconsenting):\n", 223 | " data_consenting['consent'] = 1\n", 224 | " data_nonconsenting['consent'] = 0\n", 225 | " data_all = pd.concat([data_consenting, data_nonconsenting])\n", 226 | " data_all.reset_index(inplace=True)\n", 227 | "\n", 228 | " # split the adgroups in their levels and dummy-code those.\n", 229 | " data_all = data_all.join(\n", 230 | " pd.get_dummies(data_all['adgroup'].str.split('_').apply(pd.Series)))\n", 231 | " data_all.drop(['adgroup'], axis=1, inplace=True)\n", 232 | " return data_all[data_all['consent'] == 1], data_all[data_all['consent'] == 0]\n", 233 | "\n", 234 | "data_consenting, data_nonconsenting = preprocess_data(data_consenting,\n", 235 | " data_nonconsenting)\n", 236 | "data_consenting.head()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": { 242 | "id": "3v16SF-umvAz" 243 | }, 244 | "source": [ 245 | "# Create NearestCustomerMatcher object and run conversion-adjustments.\n", 246 | "\n", 247 | "We now have our fake data in the right format – similarity here depends alone on\n", 248 | "the adgroup of a given customer. 
In reality, we would have a gCLID and a\n", 249 | "timestamp for each customer that we could pass as `id_columns` to the matcher.\\\n", 250 | "Other example features that could be used instead/in addition to the adgroup are\n", 251 | "\n", 252 | "\n", 253 | "* device type\n", 254 | "* geo\n", 255 | "* time of day\n", 256 | "* ad-type\n", 257 | "* GA-derived features\n", 258 | "* etc. \n", 259 | "\n", 260 | "When using the `NearestCustomerMatcher`, we can choose between three matching\n", 261 | "strategies:\n", 262 | "* if we define `number_nearest_neighbors`, a fixed number of nearest (consenting)\n", 263 | "customers is used, irrespective of how dissimilar those customers are to the \n", 264 | "seed-non-consenting customer.\n", 265 | "* if we define `radius`, all consenting customers that fall within the specified radius of a non-consenting customer are used. This means that the number of nearest-neighbors likely differs between non-consenting customers, and a given non-consenting customer might have no consenting customers in their radius.\n", 266 | "* if we define `percentage`, the `NearestCustomerMatcher` first which minimal radius needs to be set in order to find at least one closest consenting customer for at least `percentage` non-consenting customers (not implemented in beam yet)\n", 267 | "\n", 268 | "In practice, the simplest approach is to set `number_nearest_neighbors` and\n", 269 | "choose a sufficiently high number here to ensure that individual consenting\n", 270 | "customers do not receive too high a share of non-consenting conversion values.\n", 271 | "\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "id": "ezsCLpQ4jExj" 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "matcher = nearest_consented_customers.NearestCustomerMatcher(\n", 283 | " data_consenting, conversion_column='conversion_value', id_columns=['index'])\n", 284 | "data_adjusted = matcher.calculate_adjusted_conversions(\n", 285 | " data_nonconsenting, number_nearest_neighbors=100)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "cellView": "form", 293 | "id": "GCEKmfcdqyRK" 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "#@title We generated a new dataframe containing the conversion-value adjustments\n", 298 | "data_adjusted.sample(5)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "cellView": "form", 306 | "id": "twOhpxSvmIkR" 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "#@title Visualise distribution of adjusted conversions\n", 311 | "#@markdown We can plot the original and adjusted (original + adjustment-values)\n", 312 | "#@markdown conversion values and see that in general, the distributions are\n", 313 | "#@markdown very similar, but as expected, the adjusted values are shifted towards\n", 314 | "#@markdown larger values.\n", 315 | "ax = data_adjusted['conversion_value'].plot(kind='hist', alpha=.5, )\n", 316 | "(data_adjusted['adjusted_conversion']+data_adjusted['conversion_value']).plot(kind='hist', ax=ax, alpha=.5)\n", 317 | "ax.legend(['original conversion value', 'adjusted conversion value'])\n", 318 | "plt.show()" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": { 324 | "id": "KjDO7pncOTON" 325 | }, 326 | "source": [ 327 | "# Next steps\n", 328 | "The above would run automatically on a daily basis within a Google Cloud Project. 
A new table ready to use with Offline Conversion Import is created.\n", 329 | "If no custom pipeline has been set up yet, we recommend using [Tentacles](https://github.com/GoogleCloudPlatform/cloud-for-marketing/blob/master/marketing-analytics/activation/gmp-googleads-connector/README.md)." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "id": "l5kV_6qNPSNF" 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "" 341 | ] 342 | } 343 | ], 344 | "metadata": { 345 | "colab": { 346 | "collapsed_sections": [ 347 | "QK9vYsW8N7F3", 348 | "V_9JEHeZOF8X", 349 | "3v16SF-umvAz" 350 | ], 351 | "last_runtime": { 352 | "build_target": "//corp/gtech/ads/infrastructure/colab_utils/ds_runtime:ds_colab", 353 | "kind": "private" 354 | }, 355 | "name": "cocoa_template.ipynb", 356 | "private_outputs": true, 357 | "provenance": [ 358 | { 359 | "file_id": "1YafoEaxHj4Gfs51FwMEzEjx1oRu92XSB", 360 | "timestamp": 1619712681367 361 | } 362 | ] 363 | }, 364 | "kernelspec": { 365 | "display_name": "Python 3", 366 | "name": "python3" 367 | }, 368 | "language_info": { 369 | "name": "python" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 0 374 | } 375 | -------------------------------------------------------------------------------- /cocoa/nearest_consented_customers.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Module to re-distribute conversion-values of no-consent customers.""" 16 | 17 | import logging 18 | from typing import Any, Callable, List, Optional, Sequence, Tuple, Union 19 | 20 | import numpy as np 21 | import pandas as pd 22 | from scipy import sparse 23 | from scipy import special 24 | from sklearn import neighbors 25 | 26 | 27 | class NearestCustomerMatcher: 28 | """Class to find nearest neighbors and distribute conversion value. 29 | 30 | When we have a dataset of customers that gave consent to cookie-tracking, and 31 | customers that did not give consent, we want to ensure that the total 32 | conversion values (e.g. value of a purchase) across all customers are 33 | accessible to SmartBidding. 34 | The NearestCustomerMatcher finds the most similar customers among the 35 | consenting customers to each of the no-consent customers, and distributes 36 | the conversion values of any no-consent customer across the matches in the 37 | set of consenting customers, in proportion to their distance. 38 | Similarity is defined as the distance between customers in their feature- 39 | space, for instance based on adgroup-levels. Which distance-metric to 40 | choose is up to the user. 41 | The more similar a consenting customer is to a given no-consent 42 | customer, the larger the share of the no-consent customer's conversion- 43 | value that will be added to the consenting customer's conversion value. 
44 | """ 45 | 46 | def __init__(self, 47 | data_consent: pd.DataFrame, 48 | conversion_column: str, 49 | id_columns: List[Union[str, int]], 50 | metric: str = "manhattan", 51 | neighbor: Callable[..., Any] = neighbors.NearestNeighbors): 52 | """Initialises class. 53 | 54 | Args: 55 | data_consent: Dataframe of consented customers (preprocessed). 56 | conversion_column: Name of column in dataframe of conversion-value. 57 | id_columns: Names of columns that identify customers. Usually GCLID and 58 | timestamp. 59 | metric: Distance metric to use when finding nearest neighbors. 60 | neighbor: sklearn NearestNeighbor object. 61 | 62 | Raises: 63 | ValueError: if the conversion values contain NaNs or Nones, or if 64 | conversion values < 0. 65 | """ 66 | # TODO() Test behaviour under different distance metrics. 67 | self._neighbor = neighbor(metric=metric, algorithm="auto") 68 | self._columns_consent = data_consent.drop(id_columns, axis=1).columns 69 | self._data_consent = data_consent[id_columns + [conversion_column]] 70 | features_consent = data_consent.drop( 71 | id_columns + [conversion_column], axis=1 72 | ).values.astype(np.float64) 73 | self._features_consent = sparse.csr_matrix(features_consent).astype( 74 | np.float32 75 | ) 76 | self._conversion_column = conversion_column 77 | self._consent_id = data_consent[id_columns] 78 | self._id_columns = id_columns 79 | if any(self._data_consent[self._conversion_column].isna()): 80 | raise ValueError("The conversion column must not contain NaNs/Nones.") 81 | if any(self._data_consent[self._conversion_column] <= 0): 82 | raise ValueError("The conversion values must be larger than zero.") 83 | self._neighbor = self._neighbor.fit(self._features_consent) 84 | 85 | # These attributes will be populated with data later. 
86 | self._data_noconsent = None 87 | self._data_noconsent_match = None 88 | self._data_noconsent_nomatch = None 89 | 90 | @property 91 | def total_non_matched_conversion_value(self) -> float: 92 | return self._data_noconsent_nomatch[self._conversion_column].sum() 93 | 94 | @property 95 | def total_matched_conversion_value(self) -> float: 96 | return self._data_noconsent_match[self._conversion_column].sum() 97 | 98 | @property 99 | def percentage_matched_conversion_value(self) -> float: 100 | return (self.total_matched_conversion_value / 101 | (self.total_non_matched_conversion_value + 102 | self.total_matched_conversion_value)) * 100 103 | 104 | @property 105 | def number_non_matched_conversions(self) -> int: 106 | return len(self._data_noconsent_nomatch) 107 | 108 | @property 109 | def number_matched_conversions(self) -> int: 110 | return len(self._data_noconsent_match) 111 | 112 | @property 113 | def percentage_matched_conversions(self) -> float: 114 | return self.number_matched_conversions / len(self._data_noconsent) * 100 115 | 116 | @property 117 | def distance_statistics(self): 118 | return self._data_adjusted["average_distance"].describe() 119 | 120 | @property 121 | def nearest_distances_statistics_nonconsenting(self): 122 | return self._data_noconsent_match["distance_to_nearest_neighbor"].describe( 123 | percentiles=[.25, .5, .75, .9, .95, .99]) 124 | 125 | @property 126 | def summary_statistics_matched_conversions(self): 127 | return pd.DataFrame( 128 | { 129 | "percentage_matched_conversion_value": 130 | self.percentage_matched_conversion_value, 131 | "percentage_matched_conversions": 132 | self.percentage_matched_conversions, 133 | "number_matched_conversions": 134 | self.number_matched_conversions, 135 | "total_matched_conversion_value": 136 | self.total_matched_conversion_value 137 | }, 138 | index=["summary_statistics_matched_conversions"]) 139 | 140 | def min_radius_by_percentile(self, percentile: float = .95) -> float: 141 | radius = self._data_noconsent_match[ 142 | "distance_to_nearest_neighbor"].quantile(percentile) 143 | return radius 144 | 145 | def _get_proportional_number_nearest_neighbors( 146 | self, number_nearest_neighbors: float) -> int: 147 | return int(number_nearest_neighbors * len(self._data_consent)) 148 | 149 | def _fit_neighbor(self): 150 | self._neighbor.fit(self._features_consent) 151 | self._fitted = True 152 | 153 | def _get_neighbors_within_radius( 154 | self, data_noconsent: pd.DataFrame, radius: float 155 | ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]: 156 | """Gets neighbors within specified radius. 157 | 158 | Args: 159 | data_noconsent: Data of no-consent customers. 160 | radius: Radius within which nearest neighbors are found. 161 | 162 | Returns: 163 | neighbors_index: Array of indices-arrays of neighboring points. 164 | neighbors_distances: Array of distance-arrays to neighboring points. 165 | has_neighbors_array: Array of booleans indicating whether a given non- 166 | consenting customer had at least one neighbor or not. 
Takes advantage
167 |         of numpy's functionality, e.g.:
168 |         (np.array([0,1,2]) > 0)
169 |         >>> array([False, True, True])
170 |     """
171 |     neighbors_distance, neighbors_index = self._neighbor.radius_neighbors(
172 |         data_noconsent.drop([self._conversion_column], axis=1),
173 |         radius=radius,
174 |         return_distance=True,
175 |     )
176 |     has_neighbors_array = np.array(
177 |         [len(neighbors) for neighbors in neighbors_index]) > 0
178 |     if not any(has_neighbors_array):
179 |       logging.warning("No matching customers within radius %s.", radius)
180 |     neighbors_index = neighbors_index[has_neighbors_array]
181 |     neighbors_distance = neighbors_distance[has_neighbors_array]
182 |     return neighbors_index, neighbors_distance, has_neighbors_array
183 | 
184 |   def _get_n_nearest_neighbors(
185 |       self, data_noconsent: pd.DataFrame, number_nearest_neighbors: float
186 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]:
187 |     """Gets n nearest neighbors.
188 | 
189 |     Args:
190 |       data_noconsent: Data of no-consent customers.
191 |       number_nearest_neighbors: Number of neighbors to return. If smaller
192 |         than 1, it is interpreted as a proportion of the set of consenting
193 |         customers.
194 | 
195 |     Returns:
196 |       neighbors_index: Array of indices-arrays of neighboring points.
197 |       neighbors_distances: Array of distance-arrays to neighboring points.
198 |       has_neighbors_array: Array of booleans indicating whether a given non-
199 |         consenting customer had at least one neighbor or not. Takes advantage
200 |         of numpy's functionality, e.g.:
201 |         (np.array([0,1,2]) > 0)
202 |         >>> array([False, True, True])
203 | 
204 |     Raises:
205 |       ValueError: if the actual number of nearest neighbors is not
206 |         `number_nearest_neighbors`.
207 |     """
208 |     if number_nearest_neighbors < 1:
209 |       number_nearest_neighbors = (
210 |           self._get_proportional_number_nearest_neighbors(
211 |               number_nearest_neighbors))
212 |     neighbors_distance, neighbors_index = self._neighbor.kneighbors(
213 |         data_noconsent.drop([self._conversion_column], axis=1),
214 |         n_neighbors=number_nearest_neighbors,
215 |         return_distance=True)
216 |     has_neighbors_array = np.array(
217 |         [len(neighbors) for neighbors in neighbors_index]) > 0
218 |     if np.shape(neighbors_distance)[1] != number_nearest_neighbors:
219 |       raise ValueError(
220 |           f"Returned number of neighbors is not {number_nearest_neighbors}.")
221 |     return neighbors_index, neighbors_distance, has_neighbors_array
222 | 
223 |   def _get_nearest_neighbors(
224 |       self,
225 |       data_noconsent: pd.DataFrame,
226 |       radius: Optional[float] = None,
227 |       number_nearest_neighbors: Optional[float] = None
228 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[bool]]:
229 |     """Get indices and distances to nearest neighbors.
230 | 
231 |     Finds nearest neighbors based on radius or number_nearest_neighbors for
232 |     each entry in data_noconsent. If nearest neighbors are defined via
233 |     radius, entries in data_noconsent without a sufficiently close
234 |     neighbor are removed.
235 | 
236 |     Args:
237 |       data_noconsent: Data of no-consent customers.
238 |       radius: Radius within which neighbors have to lie.
239 |       number_nearest_neighbors: Defines the number (or proportion) of nearest
240 |         neighbors. If smaller than 1, it is interpreted as a proportion of
241 |         the number of consenting customers.
242 | 
243 |     Returns:
244 |       A 3-tuple with:
245 |         Array of indices-arrays of nearest neighbors in data_consent.
246 |         Array of distances-arrays of nearest neighbors in data_consent.
247 |         Array of booleans indicating whether a given no-consent customer
248 |           had at least one neighbor or not.
249 | 
250 |     Raises:
251 |       ValueError: if not exactly one of radius or number_nearest_neighbors is
252 |         provided.
253 |     """
254 |     has_radius = radius is not None
255 |     has_number_nearest_neighbors = number_nearest_neighbors is not None
256 | 
257 |     if has_radius == has_number_nearest_neighbors:
258 |       raise ValueError("Exactly one of radius or number_nearest_neighbors "
259 |                        "has to be provided.")
260 |     if has_radius:
261 |       return self._get_neighbors_within_radius(data_noconsent, radius)
262 |     elif has_number_nearest_neighbors:
263 |       return self._get_n_nearest_neighbors(data_noconsent,
264 |                                            number_nearest_neighbors)
265 | 
266 |   def _assert_all_columns_match_and_conversions_are_valid(self, data_noconsent):
267 |     """Checks that all consenting and no-consent data match and are valid.
268 | 
269 |     Args:
270 |       data_noconsent: Data of no-consent customers.
271 | 
272 |     Raises:
273 |       ValueError: if columns of consenting and no-consent data don't match,
274 |         the conversion values contain NaNs/Nones, or any conversion value is
275 |         not larger than zero.
276 |     """
277 |     if len(self._columns_consent) != len(data_noconsent.columns) or not all(
278 |         self._columns_consent == data_noconsent.columns):
279 |       raise ValueError(
280 |           "Consented and non-consented data must have same columns.")
281 |     for data in (data_noconsent, self._data_consent):
282 |       if any(data[self._conversion_column].isna()):
283 |         raise ValueError("The conversion column should not contain NaNs.")
284 |       if any(data[self._conversion_column] <= 0):
285 |         raise ValueError("The conversion values should be larger than zero.")
286 | 
287 |   def get_indices_and_values_to_nearest_neighbors(
288 |       self,
289 |       data_noconsent: pd.DataFrame,
290 |       radius: Optional[float] = None,
291 |       number_nearest_neighbors: Optional[float] = None
292 |   ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray], Sequence[np.ndarray],
293 |              Sequence[np.ndarray], Sequence[bool]]:
294 |     """Gets indices of nearest neighbors as well as the needed conversions.
295 | 
296 |     Args:
297 |       data_noconsent: Data of no-consent customers.
298 |       radius: Radius within which neighbors have to lie.
299 |       number_nearest_neighbors: Defines the number (or proportion) of nearest
300 |         neighbors.
301 | 
302 |     Returns:
303 |       neighbors_data_index: Arrays of indices to the nearest neighbors in the
304 |         consenting-customer data.
305 |       neighbors_distance: Arrays of distances to the nearest neighbors.
306 |       weighted_conversion_values: Conversion values of no-consent customers
307 |         weighted by their distance to each nearest neighbor.
308 |       weighted_distance: Weighted distances between no-consent and
309 |         consenting customers.
310 |       has_neighbor: Whether or not a given no-consent customer had a
311 |         nearest neighbor.
311 | """ 312 | data_noconsent = data_noconsent.drop(self._id_columns, axis=1) 313 | self._assert_all_columns_match_and_conversions_are_valid(data_noconsent) 314 | neighbors_index, neighbors_distance, has_neighbor = ( 315 | self._get_nearest_neighbors(data_noconsent, radius, 316 | number_nearest_neighbors)) 317 | neighbors_data_index = [ 318 | self._data_consent.index[index] for index in neighbors_index 319 | ] 320 | non_consent_conversion_values = data_noconsent[has_neighbor][ 321 | self._conversion_column].values 322 | weighted_conversion_values, weighted_distance = ( 323 | _calculate_weighted_conversion_values( 324 | non_consent_conversion_values, 325 | neighbors_distance, 326 | )) 327 | return (neighbors_data_index, neighbors_distance, 328 | weighted_conversion_values, weighted_distance, has_neighbor) 329 | 330 | def calculate_adjusted_conversions( 331 | self, 332 | data_noconsent: pd.DataFrame, 333 | radius: Optional[float] = None, 334 | number_nearest_neighbors: Optional[float] = None) -> pd.DataFrame: 335 | """Calculates adjusted conversions for identified nearest neighbors. 336 | 337 | Finds nearest neighbors based on radius or number_nearest_neighbors for each 338 | entry in data_noconsent. If nearest neighbors are defined via radius, 339 | entries in data_noconsent without sufficiently close neighbor are ignored. 340 | Conversion values of consenting customers that are identified as nearest 341 | neighbor to a no-consent customer are adjusted by adding the weighted 342 | proportional conversion value of the respective no-consent customer. 343 | The weighted conversion value is calculated as the product of the conversion 344 | value with the softmax over all neighbor-similarities. 345 | 346 | Args: 347 | data_noconsent: Data for no-consent customer(s). Needs to be pre- 348 | processed and have the same columns as data_consent. 349 | radius: Radius within which neighbors have to lie. 350 | number_nearest_neighbors: Defines the number (or proportion) of nearest 351 | neighbors. 352 | 353 | Returns: 354 | data_adjusted: Copy of data_consent including the modelled conversion 355 | values. 356 | """ 357 | (neighbors_data_index, neighbors_distance, weighted_conversion_values, 358 | weighted_distance, 359 | has_neighbor) = self.get_indices_and_values_to_nearest_neighbors( 360 | data_noconsent, radius, number_nearest_neighbors) 361 | self._data_noconsent = data_noconsent.drop(self._id_columns, axis=1) 362 | self._data_noconsent_nomatch = data_noconsent[np.invert( 363 | has_neighbor)].copy() 364 | self._data_noconsent_match = data_noconsent[has_neighbor].copy() 365 | self._data_noconsent_match["distance_to_nearest_neighbor"] = [ 366 | min(distances) for distances in neighbors_distance 367 | ] 368 | self._data_adjusted = _distribute_conversion_values( 369 | self._data_consent, self._conversion_column, 370 | self._data_noconsent_match[self._conversion_column].values, 371 | weighted_conversion_values, neighbors_data_index, neighbors_distance, 372 | weighted_distance) 373 | return self._data_adjusted 374 | 375 | 376 | def _calculate_weighted_conversion_values( 377 | conversion_values: Sequence[np.ndarray], 378 | neighbors_distance: Sequence[np.ndarray], 379 | ) -> Tuple[Sequence[np.ndarray], Sequence[np.ndarray]]: 380 | """Calculate weighted conversion values as function of distance. 381 | 382 | The weighted conversion value is calculated as the product of the conversion 383 | value with the softmax over all neighbor-similarities. 
384 | 
385 | 
386 |   Args:
387 |     conversion_values: Array of conversion_values for non-consented customers.
388 |     neighbors_distance: Array of arrays of neighbor-distances.
389 | 
390 |   Returns:
391 |     weighted_conversion_values: Array of weighted conversion_values per non-
392 |       consented customer.
393 |     softmax_similarity: Array of softmax similarities per non-consented
394 |       customer.
395 |   """
396 |   if len(conversion_values) != len(neighbors_distance):
397 |     raise ValueError("conversion_values and neighbors_distance must have "
398 |                      "the same length.")
399 | 
400 |   if any((dist < 0).any() for dist in neighbors_distance):
401 |     raise ValueError("Distances should not contain negative values. "
402 |                      "Please review which distance metric you used.")
403 | 
404 |   softmax_similarity = [
405 |       special.softmax(-distance) for distance in neighbors_distance
406 |   ]
407 |   weighted_conversion_values = [
408 |       conversion_value * weight
409 |       for conversion_value, weight in zip(conversion_values, softmax_similarity)
410 |   ]
411 |   return weighted_conversion_values, softmax_similarity
412 | 
413 | 
414 | def _distribute_conversion_values(
415 |     data_consent: pd.DataFrame,
416 |     conversion_column: str,
417 |     non_consent_conversion_values: Sequence[float],
418 |     weighted_conversion_values: Sequence[np.ndarray],
419 |     neighbors_index: Sequence[np.ndarray],
420 |     neighbors_distance: Sequence[np.ndarray],
421 |     weighted_distance: Sequence[np.ndarray],
422 | ) -> pd.DataFrame:
423 |   """Distributes conversion-values of no-consent over consenting customers.
424 | 
425 |   Conversion values of consenting customers that are identified as nearest
426 |   neighbor to a no-consent customer are adjusted by adding the weighted
427 |   proportional conversion value of the respective no-consent customer.
428 |   Additionally, metrics like average distance to no-consent customers
429 |   and total number of added conversions are calculated.
430 | 
431 |   Args:
432 |     data_consent: DataFrame of consented customers.
433 |     conversion_column: String indicating the conversion KPI in data_consent.
434 |     non_consent_conversion_values: Array of original conversion values.
435 |     weighted_conversion_values: Array of arrays of weighted conversion_values,
436 |       based on distance between consenting and no-consent customers.
437 |     neighbors_index: Array of arrays of neighbor-indices.
438 |     neighbors_distance: Array of arrays of neighbor-distances.
439 |     weighted_distance: Array of arrays of weighted neighbor-distances.
440 | 
441 |   Returns:
442 |     data_adjusted: Copy of data_consent including the modelled conversion
443 |       values.
444 |   """
445 | 
446 |   data_adjusted = data_consent.copy()
447 |   data_adjusted["adjusted_conversion"] = 0
448 |   data_adjusted["average_distance"] = 0
449 |   data_adjusted["n_added_conversions"] = 0
450 |   data_adjusted["sum_distribution_weights"] = 0
451 |   for index, values, distance, weight in zip(neighbors_index,
452 |                                              weighted_conversion_values,
453 |                                              neighbors_distance,
454 |                                              weighted_distance):
455 |     data_adjusted.loc[index, "adjusted_conversion"] += values
456 |     data_adjusted.loc[index, "average_distance"] += distance
457 |     data_adjusted.loc[index, "sum_distribution_weights"] += weight
458 |     data_adjusted.loc[index, "n_added_conversions"] += 1
459 | 
460 |   data_adjusted["average_distance"] = (data_adjusted["average_distance"] /
461 |                                        data_adjusted["n_added_conversions"])
462 | 
463 |   naive_conversion_adjustments = np.sum(non_consent_conversion_values) / len(
464 |       data_consent)
465 |   data_adjusted["naive_adjusted_conversion"] = data_adjusted[
466 |       conversion_column] + naive_conversion_adjustments
467 |   return data_adjusted
468 | 
469 | 
470 | def get_adjustments_and_summary_calculations(
471 |     matcher: NearestCustomerMatcher,
472 |     data_noconsent: pd.DataFrame,
473 |     number_nearest_neighbors: Optional[float] = None,
474 |     radius: Optional[float] = None,
475 |     percentile: Optional[float] = None,
476 | ) -> Tuple[pd.DataFrame, pd.DataFrame]:
477 |   """Calculates adjusted conversions for consenting customers.
478 | 
479 |   Args:
480 |     matcher: Matcher object which has been fit to all of data_consent. It
481 |       provides the functionality to get the nearest neighbors for a given
482 |       no-consent customer.
483 |     data_noconsent: Dataframe of no-consent customers. Needs to have the
484 |       same columns as data_consent to calculate similarity between data points.
485 |     number_nearest_neighbors: Number of consenting customers to choose as
486 |       matches. If smaller than 1, it is taken as a proportion of all customers.
487 |     radius: Radius within which matching consenting customers are searched.
488 |     percentile: Percentile of matched no-consent customers based on which the
489 |       radius is set.
490 | 
491 |   Returns:
492 |     A two-tuple with:
493 |       - adjusted conversion values for new and old customers.
494 |       - summary statistics on matched conversions (% of counts, % of
495 |         conversion value).
496 | 
497 |   Raises:
498 |     ValueError: if not exactly one of number_nearest_neighbors, radius,
499 |       or percentile is provided.
500 |     ValueError: if the provided percentile is not within the range of 0-1.
501 |   """
502 |   has_number_nearest_neighbors = number_nearest_neighbors is not None
503 |   has_radius = radius is not None
504 |   has_percentile = percentile is not None
505 | 
506 |   if (has_number_nearest_neighbors + has_radius + has_percentile) != 1:
507 |     raise ValueError("Exactly one of number_nearest_neighbors, radius, "
508 |                      "or percentile has to be specified.")
509 | 
510 |   if has_percentile and not 0 < percentile <= 1:
511 |     raise ValueError("The percentile has to be a value between 0 and 1.")
512 | 
513 |   if has_number_nearest_neighbors or has_radius:
514 |     data_adjusted = matcher.calculate_adjusted_conversions(
515 |         data_noconsent=data_noconsent,
516 |         number_nearest_neighbors=number_nearest_neighbors,
517 |         radius=radius)
518 |   else:
519 |     matcher.calculate_adjusted_conversions(
520 |         data_noconsent=data_noconsent, number_nearest_neighbors=1)
521 |     radius = matcher.min_radius_by_percentile(percentile=percentile)
522 |     data_adjusted = matcher.calculate_adjusted_conversions(
523 |         data_noconsent=data_noconsent, radius=radius)
524 |   return data_adjusted, matcher.summary_statistics_matched_conversions
525 | 
--------------------------------------------------------------------------------
/cocoa/nearest_consented_customers_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
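
A minimal end-to-end sketch of the matcher API in nearest_consented_customers.py above (an editorial illustration, not part of the repository): the tiny dataframes are invented, and the package is assumed to be installed, e.g. via pip install ., so that the import resolves. The assert documents the key invariant of the softmax weighting: the weights over each no-consent customer's neighbors sum to one, so the distributed values add up to the original no-consent conversion total.

import numpy as np
import pandas as pd

from consent_based_conversion_adjustments.cocoa import nearest_consented_customers

# Hypothetical, already-preprocessed data: two numeric features plus the
# conversion-value and id columns expected by the matcher.
data_consent = pd.DataFrame({
    "a": [1.0, 0.0, 1.0],
    "b": [2.0, 5.0, 8.0],
    "conversion_column": [3.0, 6.0, 9.0],
    "id_column": [0, 1, 2],
})
data_noconsent = pd.DataFrame({
    "a": [4.0, 7.0],
    "b": [5.0, 8.0],
    "conversion_column": [6.0, 9.0],
    "id_column": [3, 4],
})

matcher = nearest_consented_customers.NearestCustomerMatcher(
    data_consent, "conversion_column", ["id_column"], metric="manhattan")
data_adjusted, summary = (
    nearest_consented_customers.get_adjustments_and_summary_calculations(
        matcher, data_noconsent, number_nearest_neighbors=2))

# Softmax weights per no-consent customer sum to one, so the total
# distributed value equals the total no-consent conversion value.
assert np.isclose(data_adjusted["adjusted_conversion"].sum(),
                  data_noconsent["conversion_column"].sum())
print(summary)

The same matcher instance can be reused with radius= or percentile= instead; as enforced above, exactly one of the three selection arguments may be set per call.
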
14 | 15 | """Tests for nearest_consented_customers package.""" 16 | 17 | from absl.testing import absltest 18 | from absl.testing import parameterized 19 | import numpy as np 20 | 21 | from consent_based_conversion_adjustments.cocoa import nearest_consented_customers 22 | from consent_based_conversion_adjustments.cocoa import testing_constants 23 | 24 | CONVERSION_COLUMN = testing_constants.CONVERSION_COLUMN 25 | ID_COLUMNS = testing_constants.ID_COLUMNS 26 | 27 | DATA_CONSENT = testing_constants.DATA_CONSENT 28 | DATA_NOCONSENT = testing_constants.DATA_NOCONSENT 29 | 30 | METRIC = testing_constants.METRIC 31 | 32 | 33 | class NearestCustomerTest(parameterized.TestCase): 34 | 35 | def setUp(self): 36 | super().setUp() 37 | self.matcher = nearest_consented_customers.NearestCustomerMatcher( 38 | DATA_CONSENT, CONVERSION_COLUMN, ID_COLUMNS, METRIC) 39 | self.matcher._data_noconsent = DATA_NOCONSENT.drop(ID_COLUMNS, axis=1) 40 | 41 | def test_columns_differ_raises_value_error(self): 42 | with self.assertRaises(ValueError): 43 | self.matcher.calculate_adjusted_conversions( 44 | data_noconsent=DATA_NOCONSENT.drop('a', axis=1), 45 | number_nearest_neighbors=1) 46 | 47 | def test_number_nearest_neighbors_and_radius_raises_value_error(self): 48 | with self.assertRaises(ValueError): 49 | self.matcher.calculate_adjusted_conversions(data_noconsent=DATA_NOCONSENT, 50 | radius=.5, 51 | number_nearest_neighbors=5) 52 | 53 | @parameterized.parameters(1, 2, 3) 54 | def test_assert_k_matches_length_of_indices(self, number_nearest_neighbors): 55 | indices, distances, _ = self.matcher._get_nearest_neighbors( 56 | data_noconsent=DATA_CONSENT.drop(ID_COLUMNS, axis=1), 57 | number_nearest_neighbors=number_nearest_neighbors) 58 | 59 | self.assertEqual(np.shape(indices)[1], number_nearest_neighbors) 60 | self.assertEqual(np.shape(distances)[1], number_nearest_neighbors) 61 | 62 | def test_number_nearest_neighbors_larger_than_customers_raises_value_error( 63 | self): 64 | number_nearest_neighbors = len(DATA_CONSENT) * 3 65 | 66 | with self.assertRaises(ValueError): 67 | _ = self.matcher.calculate_adjusted_conversions( 68 | data_noconsent=DATA_NOCONSENT, 69 | number_nearest_neighbors=number_nearest_neighbors) 70 | 71 | def test_no_match_in_radius_logs_warning(self): 72 | with self.assertLogs(level='WARNING') as log: 73 | _ = self.matcher.calculate_adjusted_conversions( 74 | data_noconsent=DATA_NOCONSENT, radius=0) 75 | 76 | self.assertEqual(log.records[0].getMessage(), 77 | 'No matching customers within radius 0.') 78 | 79 | def test_adjusted_conversion_value_different_from_original_value(self): 80 | data_adjusted = self.matcher.calculate_adjusted_conversions( 81 | data_noconsent=DATA_NOCONSENT, number_nearest_neighbors=3) 82 | 83 | sum_adjusted_conversions = (data_adjusted[CONVERSION_COLUMN].sum() 84 | + data_adjusted['adjusted_conversion'].sum()) 85 | self.assertLess(data_adjusted[CONVERSION_COLUMN].sum(), 86 | sum_adjusted_conversions) 87 | 88 | def test_negative_distances_raises_value_error(self): 89 | neighbors_distance = [np.array([0.1, 0.1, 0.8]), np.array([-5, 0, 0])] 90 | 91 | with self.assertRaises(ValueError): 92 | nearest_consented_customers._calculate_weighted_conversion_values( 93 | DATA_NOCONSENT[CONVERSION_COLUMN].values, neighbors_distance) 94 | 95 | def test_neighbor_indices_not_in_data_consent_raises_key_error(self): 96 | neighbors_index = [np.array([111, 222, 333])] 97 | neighbors_distance = [np.array([0.1, 0.1, 0.8])] 98 | weighted_conversion_values = [np.array([10, 20, 30])] 99 | 
weighted_distances = neighbors_distance 100 | 101 | with self.assertRaises(KeyError): 102 | nearest_consented_customers._distribute_conversion_values( 103 | DATA_CONSENT, CONVERSION_COLUMN, 104 | DATA_CONSENT[CONVERSION_COLUMN].values, weighted_conversion_values, 105 | neighbors_index, neighbors_distance, weighted_distances) 106 | 107 | def test_adjusted_conversion_smaller_than_upper_limit(self): 108 | upper_limit = DATA_NOCONSENT[CONVERSION_COLUMN].sum() 109 | 110 | data_adjusted = self.matcher.calculate_adjusted_conversions( 111 | data_noconsent=DATA_NOCONSENT, radius=1) 112 | sum_adjusted_conversions = data_adjusted['adjusted_conversion'].sum() 113 | 114 | self.assertLessEqual(sum_adjusted_conversions, upper_limit) 115 | 116 | def test_sum_of_weighted_conversions_matches_original_conversions(self): 117 | original_conversions = DATA_NOCONSENT[CONVERSION_COLUMN].astype( 118 | float).values[:2] 119 | neighbors_distance = [np.array([0.1, 0.1, 0.8]), np.array([1, 0, 0])] 120 | 121 | weighted_conversions, _ = ( 122 | nearest_consented_customers._calculate_weighted_conversion_values( 123 | original_conversions, neighbors_distance)) 124 | 125 | np.testing.assert_almost_equal(np.sum(weighted_conversions, axis=1), 126 | original_conversions) 127 | 128 | def test_raises_value_error_if_number_nearest_neighbors_and_radius(self): 129 | with self.assertRaises(ValueError): 130 | nearest_consented_customers.get_adjustments_and_summary_calculations( 131 | matcher=self.matcher, 132 | data_noconsent=DATA_NOCONSENT.drop(columns=[CONVERSION_COLUMN]), 133 | number_nearest_neighbors=1, 134 | radius=1) 135 | 136 | def test_adjusted_conversions_larger_zero(self): 137 | adjusted_conversions, _ = ( 138 | nearest_consented_customers.get_adjustments_and_summary_calculations( 139 | matcher=self.matcher, 140 | data_noconsent=DATA_NOCONSENT, 141 | number_nearest_neighbors=3, 142 | )) 143 | 144 | self.assertGreater(adjusted_conversions['adjusted_conversion'].sum(), 0) 145 | 146 | @parameterized.named_parameters( 147 | { 148 | 'testcase_name': 'percentile.9', 149 | 'percentile': .9, 150 | }, { 151 | 'testcase_name': 'percentile.5', 152 | 'percentile': .5, 153 | }, { 154 | 'testcase_name': 'percentile.1', 155 | 'percentile': .1, 156 | }) 157 | def test_percentage_matched_conversions_matches_target_percentage( 158 | self, percentile): 159 | _, summary_statistics_matched_conversions = ( 160 | nearest_consented_customers.get_adjustments_and_summary_calculations( 161 | matcher=self.matcher, 162 | data_noconsent=DATA_NOCONSENT, 163 | percentile=percentile, 164 | )) 165 | 166 | self.assertGreaterEqual( 167 | summary_statistics_matched_conversions['percentage_matched_conversions'] 168 | .values, percentile) 169 | 170 | @parameterized.named_parameters( 171 | { 172 | 'testcase_name': 'percentile>1', 173 | 'percentile': 1.1, 174 | }, { 175 | 'testcase_name': 'percentile<0', 176 | 'percentile': -1, 177 | }) 178 | def test_raises_value_error_for_invalid_percentile(self, percentile): 179 | with self.assertRaises(ValueError): 180 | nearest_consented_customers.get_adjustments_and_summary_calculations( 181 | matcher=self.matcher, 182 | data_noconsent=DATA_NOCONSENT, 183 | percentile=percentile, 184 | ) 185 | 186 | def test_length_adjusted_conversions_equals_length_data_consent(self): 187 | adjusted_conversions, _ = ( 188 | nearest_consented_customers.get_adjustments_and_summary_calculations( 189 | matcher=self.matcher, 190 | data_noconsent=DATA_NOCONSENT, 191 | number_nearest_neighbors=3)) 192 | 193 | 
self.assertEqual(len(adjusted_conversions), len(DATA_CONSENT))
194 | 
195 |   # TODO() Add test to assert expected outcome is produced.
196 | 
197 | 
198 | if __name__ == '__main__':
199 |   absltest.main()
200 | 
--------------------------------------------------------------------------------
/cocoa/preprocess.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Prepare data to distribute conversion-values of no-consent customers."""
16 | 
17 | from typing import Any, List, Tuple
18 | 
19 | from absl import logging
20 | import numpy as np
21 | import pandas as pd
22 | 
23 | logging.set_verbosity(logging.WARNING)
24 | 
25 | NON_DUMMY_COLUMNS = ()  # typically at least "GCLID", "TIMESTAMP"
26 | DROP_COLUMNS = ()
27 | CONVERSION_COLUMN = "conversion_column"
28 | 
29 | 
30 | def _clean_data(data: pd.DataFrame, conversion_column: str) -> pd.DataFrame:
31 |   """Cleans data from NaNs and invalid conversion values.
32 | 
33 |   In its most basic form, this function drops entries that don't have a
34 |   conversion value or for which the conversion value is not larger than zero.
35 |   This function should be extended based on custom requirements.
36 | 
37 |   Args:
38 |     data: Dataframe of customer data.
39 |     conversion_column: Name of the conversion-column in data.
40 | 
41 |   Returns:
42 |     cleaned data.
43 |   """
44 |   # Optional: Fill NaNs based on additional information/other columns.
45 |   data.dropna(subset=[conversion_column], inplace=True)
46 |   has_valid_conversion_value = data[conversion_column].values > 0
47 |   data = data[has_valid_conversion_value]
48 |   # Optional: Deduplicate consented users based on timestamp and gclid.
49 |   return data
50 | 
51 | 
52 | def _additional_feature_engineering(data: pd.DataFrame) -> pd.DataFrame:
53 |   """Creates additional features that influence similarity between customers.
54 | 
55 |   Depending on your use-case and the features you have about your customers,
56 |   you may want to do additional feature engineering. For instance, customers
57 |   may purchase products that can be naturally organised into a hierarchy.
58 |   You could code a purchase from "furniture/living/sofa" or from
59 |   "furniture/kitchen/chair" into the following format:
60 | 
61 |   customer | level_0   | level_1 | level_2
62 |   -----------------------------------------
63 |   0        | furniture | living  | sofa
64 |   1        | furniture | kitchen | chair
65 | 
66 |   In a later stage, one-hot encoding of these product-levels will result in a
67 |   representation that reflects the similarity between products. If different
68 |   products have hierarchies of different depths, it might be appropriate to
69 |   choose a threshold and drop low levels that occur only rarely.
70 | 
71 |   Args:
72 |     data: Dataframe to be processed, containing consented and unconsented
73 |       customers.
74 | 
75 |   Returns:
76 |     Dataframe including new features.
77 |   """
78 |   return data
79 | 
80 | 
81 | def preprocess_data(data: pd.DataFrame, drop_columns: List[Any],
82 |                     non_dummy_columns: List[Any],
83 |                     conversion_column: str) -> pd.DataFrame:
84 |   """Preprocesses the passed dataframe.
85 | 
86 |   Cleans data and applies dummy-coding to relevant columns.
87 | 
88 |   Args:
89 |     data: Dataframe to be processed (either for consent or no-consent users).
90 |     drop_columns: List of columns to drop from dataframe.
91 |     non_dummy_columns: List of columns to not include in dummy-coding, but keep.
92 |     conversion_column: Name of column indicating conversion value.
93 | 
94 |   Returns:
95 |     Processed and dummy-coded dataframe.
96 |   """
97 |   data = _clean_data(data, conversion_column=conversion_column)
98 |   data = _additional_feature_engineering(data)
99 |   data_dummies = pd.get_dummies(
100 |       data.drop(drop_columns + non_dummy_columns, axis=1, errors="ignore"),
101 |       sparse=True)
102 |   # Note: get_dummies(sparse=True) already yields sparse dummy columns.
103 |   data_dummies = data_dummies.join(data[non_dummy_columns])
104 |   logging.info("Shape of dummy-coded data is: %s", np.shape(data_dummies))
105 |   return data_dummies
106 | 
107 | 
108 | def concatenate_and_process_data(
109 |     data_consent: pd.DataFrame,
110 |     data_noconsent: pd.DataFrame,
111 |     conversion_column: str = CONVERSION_COLUMN,
112 |     drop_columns: Tuple[Any, ...] = DROP_COLUMNS,
113 |     non_dummy_columns: Tuple[Any, ...] = NON_DUMMY_COLUMNS
114 | ) -> Tuple[pd.DataFrame, pd.DataFrame]:
115 |   """Concatenates consent and no-consent data and preprocesses them.
116 | 
117 |   Args:
118 |     data_consent: Dataframe of consenting customers.
119 |     data_noconsent: Dataframe of no-consent customers.
120 |     conversion_column: Name of the conversion column in the data.
121 |     drop_columns: Names of columns that should be dropped from the data.
122 |     non_dummy_columns: Names of (categorical) columns that should be kept, but
123 |       not dummy-coded.
124 | 
125 |   Raises:
126 |     ValueError: if concatenating consent and no-consent data doesn't
127 |       match the expected length.
128 | 
129 |   Returns:
130 |     Processed dataframes for consent and no-consent customers.
131 |   """
132 |   data_noconsent["consent"] = 0
133 |   data_consent["consent"] = 1
134 |   data_concat = pd.concat([data_noconsent, data_consent])
135 |   data_concat.reset_index(inplace=True, drop=True)
136 |   if len(data_concat) != (len(data_noconsent) + len(data_consent)):
137 |     raise ValueError(
138 |         "Length of concatenated data does not match sum of individual dataframes."
139 |     )
140 |   data_preprocessed = preprocess_data(
141 |       data=data_concat,
142 |       drop_columns=list(drop_columns),
143 |       non_dummy_columns=list(non_dummy_columns),
144 |       conversion_column=conversion_column)
145 |   data_noconsent_processed = data_preprocessed[
146 |       data_preprocessed["consent"] == 0]
147 |   data_consent_processed = data_preprocessed[data_preprocessed["consent"] == 1]
148 |   return data_consent_processed, data_noconsent_processed
149 | 
--------------------------------------------------------------------------------
/cocoa/preprocess_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
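
A quick illustration of the preprocessing entry point in preprocess.py above, again a hedged sketch with invented data (the gclid column name is only an example of an id-like pass-through column):

import pandas as pd

from consent_based_conversion_adjustments.cocoa import preprocess

raw_consent = pd.DataFrame({
    "conversion_column": [10.0, 20.0],
    "product_level": ["furniture/living/sofa", "furniture/kitchen/chair"],
    "gclid": ["a1", "b2"],
})
raw_noconsent = pd.DataFrame({
    "conversion_column": [5.0, 0.0],  # the 0.0 row is dropped by _clean_data
    "product_level": ["furniture/living/sofa", "furniture/kitchen/chair"],
    "gclid": ["c3", "d4"],
})

consent, noconsent = preprocess.concatenate_and_process_data(
    raw_consent, raw_noconsent,
    conversion_column="conversion_column",
    non_dummy_columns=("gclid",))

# Both outputs share one dummy-coded column space (plus the added "consent"
# flag), which is what the nearest-neighbor matcher later relies on.
assert list(consent.columns) == list(noconsent.columns)
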
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Tests for preprocess."""
16 | 
17 | from absl.testing import absltest
18 | from absl.testing import parameterized
19 | import pandas as pd
20 | 
21 | 
22 | from consent_based_conversion_adjustments.cocoa import preprocess
23 | from consent_based_conversion_adjustments.cocoa import testing_constants
24 | 
25 | CONVERSION_COLUMN = testing_constants.CONVERSION_COLUMN
26 | ID_COLUMNS = testing_constants.ID_COLUMNS
27 | 
28 | DATA_CONSENT = testing_constants.DATA_CONSENT
29 | DATA_NOCONSENT = testing_constants.DATA_NOCONSENT
30 | 
31 | METRIC = testing_constants.METRIC
32 | 
33 | 
34 | class PreprocessTest(parameterized.TestCase):
35 | 
36 |   def setUp(self):
37 |     super().setUp()
38 |     self.fake_data_consent = DATA_CONSENT.copy()
39 |     self.fake_data_consent['new_customer'] = 0
40 |     self.fake_data_consent.loc[::3, 'new_customer'] = 1
41 |     self.fake_data_noconsent = DATA_NOCONSENT.copy()
42 |     self.fake_data_noconsent['new_customer'] = 0
43 |     self.fake_data_noconsent.loc[::3, 'new_customer'] = 1
44 | 
45 |   def test_processed_shape_matches_expected_shape(self):
46 |     joined_data = pd.concat([self.fake_data_consent, self.fake_data_noconsent])
47 |     categorical_columns = joined_data.columns[joined_data.dtypes == 'object']
48 |     n_dummy_variables = 0
49 |     for categorical_column in categorical_columns:
50 |       n_dummy_variables += joined_data[categorical_column].nunique() - 1
51 |     target_shape = self.fake_data_consent.shape[1] + n_dummy_variables + 1
52 | 
53 |     fake_preprocessed_data_consent, fake_preprocessed_data_noconsent = (
54 |         preprocess.concatenate_and_process_data(
55 |             self.fake_data_consent.copy(), self.fake_data_noconsent.copy()))
56 | 
57 |     self.assertEqual(fake_preprocessed_data_consent.shape[1], target_shape)
58 |     self.assertEqual(fake_preprocessed_data_noconsent.shape[1], target_shape)
59 | 
60 |   def test_preprocessed_conversion_values_larger_zero(self):
61 |     fake_preprocessed_data_consent, fake_preprocessed_data_noconsent = (
62 |         preprocess.concatenate_and_process_data(
63 |             self.fake_data_consent.copy(), self.fake_data_noconsent.copy()))
64 | 
65 |     values_not_larger_zero = (
66 |         fake_preprocessed_data_consent[CONVERSION_COLUMN] <= 0).sum() + (
67 |             fake_preprocessed_data_noconsent[CONVERSION_COLUMN] <= 0).sum()
68 | 
69 |     self.assertEqual(values_not_larger_zero, 0)
70 | 
71 | 
72 | if __name__ == '__main__':
73 |   absltest.main()
74 | 
--------------------------------------------------------------------------------
/cocoa/testing_constants.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Create mock data and define global variables used across multiple tests.""" 16 | 17 | import numpy as np 18 | import pandas as pd 19 | 20 | CONVERSION_COLUMN = 'conversion_column' 21 | ID_COLUMNS = ['id_column'] 22 | 23 | product_levels = ['1_1', '2_2', '1_1'] 24 | DATA_CONSENT = pd.concat([ 25 | pd.DataFrame( 26 | np.array([[1, 2, 3, 0], [0, 5, 6, 0], [1, 8, 9, 0]]), 27 | columns=['a', 'b', CONVERSION_COLUMN] + ID_COLUMNS) 28 | ] * 10) 29 | DATA_CONSENT.reset_index(inplace=True, drop=True) 30 | DATA_CONSENT['product_level'] = product_levels * 10 31 | DATA_NOCONSENT = pd.concat([ 32 | pd.DataFrame( 33 | np.array([[4, 5, 6, 0], [7, 8, 9, 0], [10, 11, 12, 0]]), 34 | columns=['a', 'b', CONVERSION_COLUMN] + ID_COLUMNS) 35 | ] * 5) 36 | DATA_NOCONSENT.reset_index(inplace=True, drop=True) 37 | DATA_NOCONSENT['product_level'] = product_levels * 5 38 | 39 | METRIC = 'manhattan' 40 | -------------------------------------------------------------------------------- /generate_template.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # Copyright 2021 Google LLC 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | echo "Requesting template to be generated..." 17 | 18 | python -m pipeline \ 19 | --input_path "gs://${PIPELINE_BUCKET}/input/dates.txt" \ 20 | --output_csv_bucket "${PIPELINE_BUCKET}" \ 21 | --output_csv_path "output" \ 22 | --bq_project "${BQ_PROJECT_ID}" \ 23 | --location "${BIGQUERY_LOCATION}" \ 24 | --table_consent "${TABLE_CONSENT}" \ 25 | --table_noconsent "${TABLE_NOCONSENT}" \ 26 | --date_column "${DATE_COLUMN}" \ 27 | --conversion_column "${CONVERSION_COLUMN}" \ 28 | --id_columns "${ID_COLUMNS}" \ 29 | --drop_columns "${DROP_COLUMNS}" \ 30 | --non_dummy_columns "${NON_DUMMY_COLUMNS}" \ 31 | --runner DataflowRunner \ 32 | --project "${PROJECT_ID}" \ 33 | --staging_location "gs://${PIPELINE_BUCKET}/staging/" \ 34 | --temp_location "gs://${PIPELINE_BUCKET}/temp/" \ 35 | --template_location "gs://${PIPELINE_BUCKET}/templates/cocoa-template" \ 36 | --region "${PIPELINE_REGION}" \ 37 | --machine_type "n1-highmem-32" \ 38 | --setup_file ./setup.py 39 | 40 | echo "Done." 41 | -------------------------------------------------------------------------------- /pipeline.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Pipeline to run conversion adjustments.
16 | 
17 | Consenting and non-consenting customer data is read by the pipeline, and
18 | adjustments are applied to the conversion values of consenting
19 | customers based on their distance from non-consenting customers and on runtime
20 | arguments like number_nearest_neighbors, radius and percentile. The adjusted
21 | data is output as a CSV where the adjusted conversions appear in a new column.
22 | """
23 | import argparse
24 | import datetime
25 | import logging
26 | import os
27 | import sys
28 | from typing import Any, Callable, List, Optional, Sequence, Tuple, Union
29 | 
30 | import apache_beam as beam
31 | from apache_beam.options.pipeline_options import PipelineOptions
32 | from apache_beam.options.pipeline_options import SetupOptions
33 | from apache_beam.options.value_provider import RuntimeValueProvider
34 | from google.cloud import bigquery
35 | from google.cloud import storage
36 | import pandas as pd
37 | 
38 | from consent_based_conversion_adjustments.cocoa import nearest_consented_customers
39 | from consent_based_conversion_adjustments.cocoa import preprocess
40 | 
41 | logging.basicConfig(level=logging.INFO)
42 | 
43 | 
44 | def _parse_known_args(
45 |     cmd_line_args: Sequence[str]) -> Tuple[argparse.Namespace, Sequence[str]]:
46 |   """Parses known arguments from the command line using the argparse library.
47 | 
48 |   Args:
49 |     cmd_line_args: Sequence of commandline arguments.
50 | 
51 |   Returns:
52 |     A tuple containing argparse.Namespace with known arguments and a list of
53 |     remaining (unknown) command line arguments.
54 |   """
55 |   parser = argparse.ArgumentParser()
56 |   parser.add_argument(
57 |       '--input_path',
58 |       dest='input_path',
59 |       required=True,
60 |       help='Path to txt file containing dates for which to run the pipeline.')
61 |   parser.add_argument(
62 |       '--output_csv_bucket',
63 |       dest='output_csv_bucket',
64 |       required=True,
65 |       help='Google Cloud Storage bucket for storing CSV output.')
66 |   parser.add_argument(
67 |       '--output_csv_path',
68 |       dest='output_csv_path',
69 |       required=True,
70 |       help='CSV output file location.')
71 |   parser.add_argument(
72 |       '--bq_project',
73 |       dest='bq_project',
74 |       required=True,
75 |       help='Google Cloud project containing the BigQuery tables.')
76 |   parser.add_argument(
77 |       '--location',
78 |       dest='location',
79 |       required=True,
80 |       help='Location of the BigQuery tables, e.g. EU.')
81 |   parser.add_argument(
82 |       '--table_consent',
83 |       dest='table_consent',
84 |       required=True,
85 |       help='BigQuery table containing consented user data.')
86 |   parser.add_argument(
87 |       '--table_noconsent',
88 |       dest='table_noconsent',
89 |       required=True,
90 |       help='BigQuery table containing non-consented user data.')
91 |   parser.add_argument(
92 |       '--date_column',
93 |       dest='date_column',
94 |       required=True,
95 |       help='BigQuery table column containing date value.')
96 |   parser.add_argument(
97 |       '--conversion_column',
98 |       dest='conversion_column',
99 |       required=True,
100 |       help='BigQuery table column containing conversion value.')
101 |   parser.add_argument(
102 |       '--id_columns',
103 |       dest='id_columns',
104 |       required=True,
105 |       help='BigQuery table columns that form a unique row e.g. GCLID,TIMESTAMP.'
106 | ) 107 | parser.add_argument( 108 | '--drop_columns', 109 | dest='drop_columns', 110 | required=False, 111 | help='BigQuery table columns that should be dropped from the data.') 112 | parser.add_argument( 113 | '--non_dummy_columns', 114 | dest='non_dummy_columns', 115 | required=False, 116 | help='BigQuery table (categorical) columns that should be kept, but not dummy-coded.' 117 | ) 118 | return parser.parse_known_args(cmd_line_args) 119 | 120 | 121 | class RuntimeOptions(PipelineOptions): 122 | """Specifies runtime options for the pipeline. 123 | 124 | Class defining the arguments that can be passed to the pipeline to 125 | customize the runtime execution. 126 | """ 127 | 128 | @classmethod # classmethod is required here for Beam's PipelineOptions. 129 | def _add_argparse_args(cls, parser): 130 | parser.add_value_provider_argument( 131 | '--number_nearest_neighbors', 132 | help='number of nearest consenting customers to select.') 133 | parser.add_value_provider_argument( 134 | '--radius', 135 | help='radius within which nearest customers should be considered.') 136 | parser.add_value_provider_argument( 137 | '--percentile', 138 | help='percentage of non-consenting customers that should be matched.') 139 | parser.add_value_provider_argument( 140 | '--metric', help='distance metric.', type=str) 141 | 142 | 143 | def _load_data_from_bq(table_name: str, location: str, project: str, 144 | start_date: str, end_date: str, 145 | date_column: str) -> pd.DataFrame: 146 | """Reads data from BigQuery filtered to the given start and end date.""" 147 | bq_client = bigquery.Client(location=location, project=project) 148 | query = f""" 149 | SELECT * FROM `{table_name}` 150 | WHERE {date_column} >= '{start_date}' and {date_column} < '{end_date}' 151 | ORDER BY {date_column} 152 | """ 153 | return bq_client.query(query).result().to_dataframe() 154 | 155 | 156 | class ConversionAdjustments(beam.DoFn): 157 | """Apache Beam ParDo transform for applying conversion adjustments.""" 158 | 159 | def __init__(self, number_nearest_neighbors: RuntimeValueProvider, 160 | radius: RuntimeValueProvider, percentile: RuntimeValueProvider, 161 | metric: RuntimeValueProvider, project: str, location: str, 162 | table_consent: str, table_noconsent: str, date_column: str, 163 | conversion_column: str, id_columns: List[str], 164 | drop_columns: Tuple[Any, 165 | ...], non_dummy_columns: Tuple[Any, 166 | ...]) -> None: 167 | """Initialises class. 168 | 169 | Args: 170 | number_nearest_neighbors: Number of nearest consenting customers to 171 | select. 172 | radius: Radius within which nearest customers should be considered. 173 | percentile: Percentage of non-consenting customers that should be matched. 174 | metric: Distance metric e.g. manhattan. 175 | project: Name of Google Cloud project containing the BigQuery tables. 176 | location: Location of the BigQuery tables e.g. EU. 177 | table_consent: BigQuery table containing consented user data. 178 | table_noconsent: BigQuery table containing non-consented user data. 179 | date_column: BigQuery table column containing date value. 180 | conversion_column: BigQuery table column containing conversion value. 181 | id_columns: BigQuery table columns that form a unique row. 182 | drop_columns: BigQuery table columns that should be dropped from the data. 183 | non_dummy_columns: BigQuery table (categorical) columns that should be 184 | kept, but not dummy-coded. 
185 | """ 186 | self._number_nearest_neighbors = number_nearest_neighbors 187 | self._radius = radius 188 | self._percentile = percentile 189 | self._metric = metric 190 | self._project = project 191 | self._location = location 192 | self._table_consent = table_consent 193 | self._table_noconsent = table_noconsent 194 | self._date_column = date_column 195 | self._conversion_column = conversion_column 196 | self._id_columns = id_columns 197 | self._drop_columns = drop_columns 198 | self._non_dummy_columns = non_dummy_columns 199 | 200 | def process( 201 | self, process_date: datetime.date 202 | ) -> Optional[Sequence[Tuple[str, pd.DataFrame, pd.DataFrame]]]: 203 | """Calculates conversion adjustments for the given date. 204 | 205 | Args: 206 | process_date: Date to be processed. 207 | 208 | Returns: 209 | Tuple containing processed date, adjusted data and summary statistics. 210 | """ 211 | logging.info('Processing date %r', process_date) 212 | # TODO(): Consider if time delta can be decided by user. 213 | end_date = str((process_date + datetime.timedelta(days=1))) 214 | start_date = str(process_date) 215 | logging.info('Pulling non-consented data for date %r', process_date) 216 | data_noconsent = _load_data_from_bq(self._table_noconsent, self._location, 217 | self._project, start_date, end_date, 218 | self._date_column) 219 | logging.info('Pulling consented data for date %r', process_date) 220 | data_consent = _load_data_from_bq(self._table_consent, self._location, 221 | self._project, start_date, end_date, 222 | self._date_column) 223 | logging.info( 224 | 'Preprocessing consented and non-consented datasets for date %r', 225 | process_date) 226 | data_consent, data_noconsent = preprocess.concatenate_and_process_data( 227 | data_consent, data_noconsent, self._conversion_column, 228 | self._drop_columns, self._non_dummy_columns) 229 | matcher = nearest_consented_customers.NearestCustomerMatcher( 230 | data_consent, self._conversion_column, self._id_columns, 231 | _get_runtime_val_or_none(self._metric)) 232 | logging.info('Calculating conversion adjustments for date %r', process_date) 233 | 234 | data_adjusted, summary_statistics_matched_conversions = nearest_consented_customers.get_adjustments_and_summary_calculations( 235 | matcher, data_noconsent, 236 | _get_runtime_val_or_none(self._number_nearest_neighbors, int), 237 | _get_runtime_val_or_none(self._radius, float), 238 | _get_runtime_val_or_none(self._percentile, float)) 239 | return [(start_date, data_adjusted, summary_statistics_matched_conversions)] 240 | 241 | 242 | def _get_runtime_val_or_none( 243 | runtime_var: RuntimeValueProvider, 244 | apply_type: Callable[[Union[int, float, str]], Union[int, float, str]] = str 245 | ) -> Optional[Union[int, float, str]]: 246 | """Gets the runtime value in the correct type. 247 | 248 | Checks if a runtime value is available. If the value is not None, convert the 249 | value to the requested type. 250 | 251 | Args: 252 | runtime_var: The runtime value provider. 253 | apply_type: A type that may be applied to non-none runtime values. 254 | 255 | Returns: 256 | Typed value if available, otherwise None. 
257 | """ 258 | if runtime_var.is_accessible(): 259 | runtime_val = runtime_var.get() 260 | if runtime_val is not None: 261 | return apply_type(runtime_val) 262 | return None 263 | 264 | 265 | def write_adjustments_to_gcs(adjustments: Tuple[str, pd.DataFrame, 266 | pd.DataFrame], bucket_name: str, 267 | path: str) -> None: 268 | """Prepares the conversion adjustments data to be written to Cloud Storage. 269 | 270 | Args: 271 | adjustments: A tuple containing processed date, adjusted data and summary 272 | statistics. 273 | bucket_name: Name of the Cloud Storage bucket where adjustments are written. 274 | path: Path on the Cloud Storage bucket where adjustments are written. 275 | 276 | Returns: 277 | None. 278 | """ 279 | adjustments_date = adjustments[0] 280 | adjustments_data = adjustments[1].to_csv(index=False) 281 | adjustments_summary = adjustments[2].to_csv(index=False) 282 | gcs_client = storage.Client() 283 | gcs_bucket = gcs_client.get_bucket(bucket_name) 284 | logging.info('Uploading conversion adjustments for date %r', adjustments_date) 285 | write_to_gcs(gcs_bucket, os.path.join(path, adjustments_date), 286 | 'adjustments_data.csv', 'text/csv', adjustments_data) 287 | logging.info('Uploading adjustments summary for date %r', adjustments_date) 288 | write_to_gcs(gcs_bucket, os.path.join(path, adjustments_date), 289 | 'adjustments_summary.csv', 'text/csv', adjustments_summary) 290 | 291 | 292 | def write_to_gcs(bucket: storage.Bucket, path: str, filename: str, 293 | data_type: str, data: str) -> None: 294 | """Writes data to the given Cloud Storage bucket.""" 295 | bucket.blob(os.path.join(path, filename)).upload_from_string(data, data_type) 296 | 297 | 298 | def get_columns_from_str(columns: Optional[str], 299 | separator: str = ',') -> Tuple[Any, ...]: 300 | """Converts columns input as separated string to tuples for further processing. 301 | 302 | A helper function to convert strings containing column names to tuples of 303 | column names. 304 | 305 | Args: 306 | columns: List of columns as a string with separators. 307 | separator: Character that separates the column names in the string. 308 | 309 | Returns: 310 | A tuple containing the columns names or empty if the column string doesn't 311 | exist or is empty. 
312 | """ 313 | if not columns: 314 | return () 315 | return tuple(columns.split(separator)) 316 | 317 | 318 | def main(argv: Sequence[str], save_main_session: bool = True) -> None: 319 | """Main entry point; defines and runs the beam pipeline.""" 320 | 321 | known_args, pipeline_args = _parse_known_args(argv) 322 | 323 | pipeline_options = PipelineOptions(pipeline_args) 324 | pipeline_options.view_as(SetupOptions).save_main_session = save_main_session 325 | runtime_options = pipeline_options.view_as(RuntimeOptions) 326 | 327 | with beam.Pipeline(options=pipeline_options) as p: 328 | 329 | dates_to_process = ( 330 | p 331 | | 'Read ISO format date string from input file' >> beam.io.ReadFromText( 332 | known_args.input_path) 333 | | 'Convert to date type' >> beam.Map(datetime.date.fromisoformat)) 334 | 335 | adjustments = ( 336 | dates_to_process 337 | | 'Apply conversion adjustments' >> beam.ParDo( 338 | ConversionAdjustments( 339 | number_nearest_neighbors=runtime_options 340 | .number_nearest_neighbors, 341 | radius=runtime_options.radius, 342 | percentile=runtime_options.percentile, 343 | metric=runtime_options.metric, 344 | project=known_args.bq_project, 345 | location=known_args.location, 346 | table_consent=known_args.table_consent, 347 | table_noconsent=known_args.table_noconsent, 348 | conversion_column=known_args.conversion_column, 349 | id_columns=list(known_args.id_columns.split(',')), 350 | date_column=known_args.date_column, 351 | drop_columns=get_columns_from_str(known_args.drop_columns), 352 | non_dummy_columns=get_columns_from_str( 353 | known_args.non_dummy_columns)))) 354 | 355 | _ = ( 356 | adjustments 357 | | 'Write adjusted data as CSV files to cloud storage' >> beam.Map( 358 | write_adjustments_to_gcs, 359 | bucket_name=known_args.output_csv_bucket, 360 | path=known_args.output_csv_path)) 361 | 362 | 363 | if __name__ == '__main__': 364 | main(sys.argv) 365 | -------------------------------------------------------------------------------- /pipeline_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
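
The small pure helpers in pipeline.py above are easy to sanity-check in isolation. A couple of illustrative calls for get_columns_from_str (assuming the package is installed so the import resolves):

from consent_based_conversion_adjustments import pipeline

# Comma-separated flag values become tuples of column names ...
assert pipeline.get_columns_from_str('gclid,conversion_timestamp') == (
    'gclid', 'conversion_timestamp')
# ... and an unset optional flag maps to an empty tuple, i.e. "no columns".
assert pipeline.get_columns_from_str(None) == ()
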
14 | 15 | """Tests for pipeline.""" 16 | import dataclasses 17 | import datetime 18 | 19 | from absl.testing import absltest 20 | from absl.testing import parameterized 21 | import apache_beam as beam 22 | from apache_beam.testing.util import assert_that 23 | from apache_beam.testing.util import equal_to 24 | import pandas as pd 25 | 26 | from consent_based_conversion_adjustments import pipeline 27 | 28 | _DATA_NOCONSENT = [{ 29 | 'gclid': '21', 30 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 31 | 'conversion_value': 20.0, 32 | 'conversion_date': '2021-11-20', 33 | 'conversion_item': 'dress' 34 | }] 35 | _DATA_CONSENT = [{ 36 | 'gclid': '1', 37 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 38 | 'conversion_value': 10.0, 39 | 'conversion_date': '2021-11-20', 40 | 'conversion_item': 'dress' 41 | }] 42 | _DATA_CONSENT_MULTI = [ 43 | { 44 | 'gclid': '1', 45 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 46 | 'conversion_value': 10.0, 47 | 'conversion_date': '2021-11-20', 48 | 'conversion_item': 'dress' 49 | }, 50 | { 51 | 'gclid': '2', 52 | 'conversion_timestamp': '2021-11-20 12:34:56 UTC', 53 | 'conversion_value': 10.0, 54 | 'conversion_date': '2021-11-20', 55 | 'conversion_item': 'dress' 56 | }, 57 | ] 58 | _PIPELINE_RUN_DATE = '2021-11-20' 59 | _PROJECT = 'cocoa_test' 60 | _LOCATION = 'EU' 61 | _TABLE_CONSENT = 'table_consent' 62 | _TABLE_NOCONSENT = 'table_noconsent' 63 | _CONVERSION_COLUMN = 'conversion_value' 64 | _ID_COLUMNS = ['gclid', 'conversion_timestamp'] 65 | _DATE_COLUMN = 'conversion_date' 66 | _DROP_COLUMNS = [] 67 | _NON_DUMMY_COLUMNS = _ID_COLUMNS 68 | 69 | 70 | def _fake_load_data_from_bq(table_name: str, *args, **kwargs) -> pd.DataFrame: # Patching method so pylint: disable=unused-argument 71 | if table_name == 'table_consent': 72 | return pd.DataFrame.from_records(_DATA_CONSENT) 73 | elif table_name == 'table_consent_multi': 74 | return pd.DataFrame.from_records(_DATA_CONSENT_MULTI) 75 | elif table_name == 'table_noconsent': 76 | return pd.DataFrame.from_records(_DATA_NOCONSENT) 77 | else: 78 | return pd.DataFrame([]) 79 | 80 | 81 | @dataclasses.dataclass(frozen=True) 82 | class RuntimeParam: 83 | value: any 84 | accessible: bool = True 85 | 86 | def get(self): 87 | return self.value 88 | 89 | def is_accessible(self): 90 | return self.accessible 91 | 92 | 93 | class PipelineTest(parameterized.TestCase): 94 | 95 | @classmethod 96 | def setUpClass(cls): 97 | super().setUpClass() 98 | # Replace network calls to BigQuery with a local fake reply 99 | pipeline._load_data_from_bq = _fake_load_data_from_bq 100 | 101 | @parameterized.named_parameters( 102 | dict( 103 | testcase_name='_completely_when_single_nearest_neighbor', 104 | number_nearest_neighbors=1, 105 | table_consent='table_consent', 106 | expected_output=20.0), 107 | dict( 108 | testcase_name='_partially_when_multiple_nearest_neighbor', 109 | number_nearest_neighbors=2, 110 | table_consent='table_consent_multi', 111 | expected_output=10.0)) 112 | def test_conversion_adjustments_value_assigned(self, 113 | number_nearest_neighbors: int, 114 | table_consent: str, 115 | expected_output: float): 116 | 117 | with beam.Pipeline(beam.runners.direct.DirectRunner()) as p: 118 | date_to_process = ( 119 | p | 'Process date' >> beam.Create( 120 | [datetime.date.fromisoformat(_PIPELINE_RUN_DATE)])) 121 | adjustments = ( 122 | date_to_process 123 | | beam.ParDo( 124 | pipeline.ConversionAdjustments( 125 | number_nearest_neighbors=RuntimeParam( 126 | number_nearest_neighbors), 127 | radius=RuntimeParam(None), 128 
| percentile=RuntimeParam(None), 129 | metric=RuntimeParam('manhattan'), 130 | project=_PROJECT, 131 | location=_LOCATION, 132 | table_consent=table_consent, 133 | table_noconsent=_TABLE_NOCONSENT, 134 | conversion_column=_CONVERSION_COLUMN, 135 | id_columns=_ID_COLUMNS, 136 | date_column=_DATE_COLUMN, 137 | drop_columns=_DROP_COLUMNS, 138 | non_dummy_columns=_NON_DUMMY_COLUMNS))) 139 | 140 | adjusted_conversion_value = ( 141 | adjustments 142 | | 143 | 'Select single row' >> beam.Map(lambda x: x[1][x[1]['gclid'] == '1']) 144 | | 'Calculate conversion value' >> 145 | beam.Map(lambda x: x['adjusted_conversion'].sum())) 146 | assert_that(adjusted_conversion_value, equal_to([expected_output])) 147 | 148 | 149 | if __name__ == '__main__': 150 | absltest.main() 151 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | . -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Google LLC. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 
15 | """Setup script for package CoCoA."""
16 | import setuptools
17 | 
18 | with open('README.md', 'r', encoding='utf-8') as fh:
19 |   long_description = fh.read()
20 | 
21 | setuptools.setup(
22 |     name='consent_based_conversion_adjustments',
23 |     version='0.0.1',
24 |     author='gTech Professional Services',
25 |     author_email='',
26 |     description='Adjust conversion values for Smart Bidding OCI',
27 |     long_description=long_description,
28 |     long_description_content_type='text/markdown',
29 |     packages=setuptools.find_packages(),
30 |     install_requires=[
31 |         'absl-py==1.0.0',
32 |         'apache-beam==2.28.0',
33 |         'avro-python3==1.9.2.1',
34 |         'cachetools==4.2.4',
35 |         'certifi==2021.10.8',
36 |         'charset-normalizer==2.0.12',
37 |         'crcmod==1.7',
38 |         'dill==0.3.1.1',
39 |         'docopt==0.6.2',
40 |         'fastavro==1.4.9',
41 |         'fasteners==0.17.3',
42 |         'future==0.18.2',
43 |         'google-api-core==1.31.5',
44 |         'google-apitools==0.5.31',
45 |         'google-auth==1.35.0',
46 |         'google-cloud-bigquery==2.34.0',
47 |         'google-cloud-bigquery-storage==2.11.0',
48 |         'google-cloud-bigtable==1.7.0',
49 |         'google-cloud-core==1.7.2',
50 |         'google-cloud-datastore==1.15.3',
51 |         'google-cloud-dlp==3.6.0',
52 |         'google-cloud-language==1.3.0',
53 |         'google-cloud-pubsub==1.7.0',
54 |         'google-cloud-recommendations-ai==0.2.0',
55 |         'google-cloud-spanner==1.19.1',
56 |         'google-cloud-storage==2.1.0',
57 |         'google-cloud-videointelligence==1.16.1',
58 |         'google-cloud-vision==1.0.0',
59 |         'google-crc32c==1.3.0',
60 |         'google-resumable-media==2.2.1',
61 |         'googleapis-common-protos==1.54.0',
62 |         'grpc-google-iam-v1==0.12.3',
63 |         'grpcio==1.44.0',
64 |         'grpcio-gcp==0.2.2',
65 |         'hdfs==2.6.0',
66 |         'httplib2==0.17.4',
67 |         'idna==3.3',
68 |         'joblib==1.1.0',
69 |         'libcst==0.4.1',
70 |         'mock==2.0.0',
71 |         'mypy-extensions==0.4.3',
72 |         'numpy==1.19.5',
73 |         'oauth2client==4.1.3',
74 |         'orjson==3.6.7',
75 |         'packaging==21.3',
76 |         'pandas>=1.3',
77 |         'pbr==5.8.1',
78 |         'proto-plus==1.20.3',
79 |         'protobuf==3.19.4',
80 |         'pyarrow==2.0.0',
81 |         'pyasn1==0.4.8',
82 |         'pyasn1-modules==0.2.8',
83 |         'pydot==1.4.2',
84 |         'pymongo==3.12.3',
85 |         'pyparsing==2.4.7',
86 |         'python-dateutil==2.8.2',
87 |         'pytz==2021.3',
88 |         'PyYAML==6.0',
89 |         'requests==2.27.1',
90 |         'rsa==4.8',
91 |         'scikit-learn==1.0.2',
92 |         'scipy>=1.7',
93 |         'six==1.16.0',
94 |         'sklearn==0.0',
95 |         'threadpoolctl==3.1.0',
96 |         'typing-extensions==3.7.4.3',
97 |         'typing-inspect==0.7.1',
98 |         'urllib3==1.26.8',
99 |     ],
100 |     classifiers=[
101 |         'Programming Language :: Python :: 3',
102 |         'License :: OSI Approved :: Apache Software License',
103 |         'Operating System :: OS Independent',
104 |     ],
105 |     python_requires='>=3.7',
106 | )
107 | 
--------------------------------------------------------------------------------
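
Finally, once generate_template.sh has staged the classic template, a job can be launched programmatically. Below is a hedged sketch using the Dataflow v1b3 REST API via google-api-python-client (not pinned in setup.py, so it would need to be installed separately); all project, bucket and region values are placeholders:

from googleapiclient.discovery import build

# Placeholder values; substitute your own project, bucket and region.
project = 'my-project'
region = 'europe-west1'
template_path = 'gs://my-pipeline-bucket/templates/cocoa-template'

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    gcsPath=template_path,
    body={
        'jobName': 'cocoa-adjustments',
        # Runtime parameters declared in RuntimeOptions; per
        # get_adjustments_and_summary_calculations, exactly one of
        # number_nearest_neighbors, radius or percentile may be set.
        'parameters': {
            'number_nearest_neighbors': '10',
            'metric': 'manhattan',
        },
    },
)
response = request.execute()
print(response['job']['id'])
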