├── .github
│   └── workflows
│       ├── build.yml
│       └── validate.yml
├── .gitignore
├── .pr-preview.json
├── Makefile
├── README.md
├── error_reporting.md
├── flexible_filtering.md
├── meetings
│   └── 2023-02-15
│       ├── minutes.md
│       └── shared-storage-overview.pdf
├── named_budgets.md
├── reach_whitepaper.md
├── reach_whitepaper_figs
│   ├── cap_discount.png
│   ├── cumulative_error.png
│   ├── diagram.png
│   ├── direct_error.png
│   ├── errors_comparison.png
│   ├── observation_error.png
│   ├── sketches_error.png
│   ├── total_cumulative_error.png
│   ├── total_direct_error.png
│   ├── total_sketches_error.png
│   └── window_discount.png
├── report_verification.md
├── security_and_privacy_questionnaire.md
└── spec.bs

/.github/workflows/build.yml:
--------------------------------------------------------------------------------
 1 | name: Build
 2 | on:
 3 |   push:
 4 |     branches: [main]
 5 |     paths: ["**.bs"]
 6 | jobs:
 7 |   build:
 8 |     name: Build
 9 |     runs-on: ubuntu-latest
10 |     steps:
11 |       - uses: actions/checkout@v3
12 |       - uses: w3c/spec-prod@v2
13 |         with:
14 |           TOOLCHAIN: bikeshed
15 |           DESTINATION: index.html
16 |           SOURCE: spec.bs
17 |           GH_PAGES_BRANCH: gh-pages
18 |           BUILD_FAIL_ON: warning
--------------------------------------------------------------------------------
/.github/workflows/validate.yml:
--------------------------------------------------------------------------------
 1 | name: Validate
 2 | on:
 3 |   pull_request:
 4 |     paths: ["**.bs"]
 5 | jobs:
 6 |   main:
 7 |     name: Validate Spec
 8 |     runs-on: ubuntu-latest
 9 |     steps:
10 |       - uses: actions/checkout@v3
11 |       - uses: w3c/spec-prod@v2
12 |         with:
13 |           TOOLCHAIN: bikeshed
14 |           SOURCE: spec.bs
15 |           BUILD_FAIL_ON: warning
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | spec.html
--------------------------------------------------------------------------------
/.pr-preview.json:
-------------------------------------------------------------------------------- 1 | { 2 | "src_file": "spec.bs", 3 | "type": "bikeshed" 4 | } 5 | 6 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # HTML files that are generated from Markdown sources. 2 | HTML_FROM_MD_TARGETS = README.html README-with-toc.html 3 | COMMIT = $(shell git rev-parse HEAD) 4 | EXPORTS_REPORT_NAME = "exports-report.${COMMIT}.txt" 5 | 6 | .PHONY: all 7 | all: spec.html $(HTML_FROM_MD_TARGETS) 8 | 9 | .PHONY: clean 10 | clean: 11 | -rm spec.html 12 | -rm $(HTML_FROM_MD_TARGETS) 13 | -rm exports-report.*.txt 14 | 15 | # Updates Bikeshed's datafiles. Bikeshed automatically updates them if they're 16 | # more than a few days old, but this will ensure you have the latest version. 17 | .PHONY: update 18 | update: 19 | bikeshed update 20 | 21 | spec.html: spec.bs 22 | bikeshed --die-on=everything spec $< $@ 23 | 24 | # Autogenerates a table of contents for the README. This can in turn be rendered 25 | # as HTML. This is useful for catching mistakes in the README's handwritten TOC. 26 | README-with-toc.md: README.md 27 | pandoc -f gfm --toc --toc-depth 6 -s $< -o $@ 28 | 29 | # Rule for generating HTML from Markdown for a limited set of targets. This uses 30 | # GNU Make "static pattern" syntax. 31 | $(HTML_FROM_MD_TARGETS): %.html : %.md 32 | pandoc -f gfm -s $< -o $@ 33 | 34 | .PHONY: write-exports-report 35 | write-exports-report: ${EXPORTS_REPORT_NAME} 36 | 37 | ${EXPORTS_REPORT_NAME}: spec.bs 38 | bikeshed debug --print-exports > $@ 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **This document is an individual draft proposal. 
It has not been adopted by the Private Advertising Technology Community Group.** 2 | 3 | ------- 4 | 5 | # Private Aggregation API explainer 6 | 7 | Author: Alex Turner (alexmt@chromium.org) 8 | 9 | ### Table of Contents 10 | 11 | - [Introduction](#introduction) 12 | - [Examples](#examples) 13 | - [Protected Audience reporting](#protected-audience-reporting) 14 | - [Measuring user demographics with cross-site information](#measuring-user-demographics-with-cross-site-information) 15 | - [Cross-site reach measurement](#cross-site-reach-measurement) 16 | - [K+ frequency measurement](#k-frequency-measurement) 17 | - [Goals](#goals) 18 | - [Non-goals](#non-goals) 19 | - [Operations](#operations) 20 | - [Reports](#reports) 21 | - [Temporary debugging mechanism](#temporary-debugging-mechanism) 22 | - [Enabling](#enabling) 23 | - [Debug keys](#debug-keys) 24 | - [Duplicate debug report](#duplicate-debug-report) 25 | - [Reducing volume by batching](#reducing-volume-by-batching) 26 | - [Batching scope](#batching-scope) 27 | - [Limiting the number of contributions per report](#limiting-the-number-of-contributions-per-report) 28 | - [Padding](#padding) 29 | - [Aggregation coordinator choice](#aggregation-coordinator-choice) 30 | - [Privacy and security](#privacy-and-security) 31 | - [Metadata readable by the reporting origin](#metadata-readable-by-the-reporting-origin) 32 | - [Open question: what metadata to allow](#open-question-what-metadata-to-allow) 33 | - [Contribution bounding and budgeting](#contribution-bounding-and-budgeting) 34 | - [Scaling values](#scaling-values) 35 | - [Examples](#examples-1) 36 | - [Partition choice](#partition-choice) 37 | - [Implementation plan](#implementation-plan) 38 | - [Enrollment and attestation](#enrollment-and-attestation) 39 | - [Future Iterations](#future-iterations) 40 | - [Supporting different aggregation modes](#supporting-different-aggregation-modes) 41 | - [Shared contribution budget](#shared-contribution-budget) 42 | - 
[Authentication and data integrity](#authentication-and-data-integrity)
43 |   - [Aggregate error reporting](#aggregate-error-reporting)
44 | 
45 | ## Introduction
46 | 
47 | This proposal introduces a generic mechanism for measuring aggregate, cross-site
48 | data in a privacy-preserving manner.
49 | 
50 | Browsers are now working to prevent cross-site user tracking, including by
51 | [partitioning storage and removing third-party
52 | cookies](https://blog.chromium.org/2020/01/building-more-private-web-path-towards.html).
53 | A range of API proposals aim to continue supporting legitimate use cases
54 | in a way that respects user privacy. Many of these proposals, including [Shared
55 | Storage](https://github.com/WICG/shared-storage) and [Protected
56 | Audience](https://github.com/WICG/turtledove), plan to isolate potentially
57 | identifying cross-site data in special contexts, which ensures that the data
58 | cannot escape the user agent.
59 | 
60 | Relative to cross-site data from an individual user, aggregate data about groups
61 | of users can be less sensitive and yet would be sufficient for a wide range of
62 | use cases. An [aggregation
63 | service](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATION_SERVICE_TEE.md)
64 | has been proposed to allow reporting noisy, aggregated cross-site data. This
65 | service was originally proposed for use by the [Attribution Reporting
66 | API](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md),
67 | but allowing more general aggregation would support additional use cases. In
68 | particular, the [Protected Audience](https://github.com/WICG/turtledove)
69 | and [Shared Storage](https://github.com/WICG/shared-storage)
70 | proposals expect this functionality to become available.
71 | 72 | So, to complement the Attribution Reporting API, we propose a general-purpose 73 | Private Aggregation API that can be called from a wide array of contexts, 74 | including isolated contexts that have access to cross-site data (such as a 75 | shared storage worklet). Within these contexts, potentially identifying data 76 | could be encapsulated into "aggregatable reports". To prevent leakage, the 77 | cross-site data in these reports would be encrypted to ensure it can only be 78 | processed by the aggregation service. During processing, this service will add 79 | noise and impose limits on how many queries can be performed. 80 | 81 | This API introduces a `contributeToHistogram()` function; see 82 | [examples](#examples) below. This call registers a histogram contribution for 83 | reporting. Later, the browser constructs an aggregatable report, which contains 84 | an encrypted payload with the specified contribution(s) for later computation 85 | via the aggregation service. 86 | The API queues the constructed report to be sent to the reporting endpoint of 87 | the script's origin (in other words, the reporting origin) after a delay. The 88 | report format and endpoint paths are detailed [below](#reports). 89 | After the endpoint receives the reports, it batches the reports and sends them 90 | to the aggregation service for processing. The output of that process is a 91 | summary report containing the (approximate) result, which is dispatched back to 92 | the reporting origin. 93 | 94 | See [the Private Aggregation API specification](https://patcg-individual-drafts.github.io/private-aggregation-api/). 95 | 96 | ## Examples 97 | 98 | ### Protected Audience reporting 99 | 100 | The [Protected 101 | Audience](https://github.com/WICG/turtledove/blob/main/FLEDGE.md#5-event-level-reporting-for-now) 102 | API plans to run on-device ad 103 | auctions using cross-site data as an input. 
The Private Aggregation API will
104 | allow measurement of the auction results from within the isolated execution
105 | environments.
106 | 
107 | For example, a key measurement use case is to report the price of the auctions'
108 | winning bids. This tells the seller how much they should be paid and who should
109 | pay them. To support this, each seller's JavaScript could define a
110 | `reportResult()` function. For example:
111 | 
112 | ```javascript
113 | function reportResult(auctionConfig, browserSignals) {
114 |   // Helper functions that map each buyer to its predetermined bucket and scale
115 |   // each bid appropriately for measurement; see scaling values below.
116 |   function convertBuyerToBucketId(buyer_origin) { … }
117 |   function convertBidToReportingValue(winning_bid_price) { … }
118 | 
119 |   // The user agent sends the report to the reporting endpoint of the script's
120 |   // origin (that is, the caller of `runAdAuction()`) after a delay.
121 |   privateAggregation.contributeToHistogram({
122 |     // Note: the bucket must be a BigInt and the value an integer Number.
123 |     bucket: convertBuyerToBucketId(browserSignals.interestGroupOwner),
124 |     value: convertBidToReportingValue(browserSignals.bid)
125 |   });
126 | }
127 | ```
128 | 
129 | The buyer can make their own measurements, which could be used to verify the
130 | seller's information. To support this, each buyer's JavaScript would define a
131 | `reportWin()` function (and possibly also a `reportLoss()` function). For
132 | example:
133 | 
134 | ```javascript
135 | function reportWin(auctionSignals, perBuyerSignals, sellerSignals, browserSignals) {
136 |   // The buyer defines their own similar functions.
137 |   function convertSellerToBucketId(seller_origin) { … }
138 |   function convertBidToReportingValue(winning_bid_price) { … }
139 | 
140 |   privateAggregation.contributeToHistogram({
141 |     bucket: convertSellerToBucketId(browserSignals.seller),
142 |     value: convertBidToReportingValue(browserSignals.bid),
143 |   });
144 | }
145 | ```
146 | 
147 | ### Measuring user demographics with cross-site information
148 | 
149 | `publisher.example` wants to measure the demographics of its user base, for
150 | example, a histogram of the number of users split by age range. `demo.example` is a
151 | popular site that knows the demographics of its users. `publisher.example`
152 | embeds `demo.example` as a third party, allowing it to measure the demographics
153 | of the overlapping users.
154 | 
155 | First, `demo.example` saves these demographics to its shared storage when it is
156 | the top-level site:
157 | 
158 | ```javascript
159 | sharedStorage.set("demo", '{"age": "40-49", ...}');
160 | ```
161 | 
162 | Then, in a `demo.example` iframe on `publisher.example`, the appropriate shared
163 | storage operation is triggered once for each user:
164 | 
165 | ```javascript
166 | await sharedStorage.worklet.addModule("measure-demo.js");
167 | await sharedStorage.run("send-demo-report");
168 | ```
169 | 
170 | Shared storage worklet script (i.e. `measure-demo.js`):
171 | 
172 | ```javascript
173 | class SendDemoReportOperation {
174 |   async run() {
175 |     let demo_string = await sharedStorage.get("demo");
176 |     let demo = {};
177 |     if (demo_string) {
178 |       demo = JSON.parse(demo_string);
179 |     }
180 | 
181 |     // Helper function that maps each age range to its predetermined bucket, or
182 |     // a special unknown bucket, e.g. if the user has not visited `demo.example`.
183 |     function convertAgeToBucketId(age_string_or_undefined) { … }
184 | 
185 |     // The report will be sent to `demo.example`'s reporting endpoint after a
186 |     // delay.
187 |     privateAggregation.contributeToHistogram({
188 |       bucket: convertAgeToBucketId(demo["age"]),
189 |       value: 128, // A predetermined fixed value; see scaling values below.
190 |     });
191 | 
192 |     // Could add more contributeToHistogram() calls to measure other demographics.
193 |   }
194 | }
195 | register("send-demo-report", SendDemoReportOperation);
196 | ```
197 | 
198 | ### Cross-site reach measurement
199 | 
200 | Measuring the number of users that have seen an ad.
201 | 
202 | In the ad’s iframe:
203 | 
204 | ```js
205 | await window.sharedStorage.worklet.addModule('reach.js');
206 | await window.sharedStorage.run('send-reach-report', {
207 |   // optional one-time context
208 |   data: { campaignId: '1234' },
209 | });
210 | ```
211 | 
212 | Worklet script (i.e. `reach.js`):
213 | 
214 | ```js
215 | class SendReachReportOperation {
216 |   async run(data) {
217 |     const reportSentForCampaign = `report-sent-${data.campaignId}`;
218 | 
219 |     // Compute reach only for users who haven't previously had a report sent for
220 |     // this campaign. Users who had a report for this campaign triggered by a
221 |     // site other than the current one will be skipped.
222 |     if (await sharedStorage.get(reportSentForCampaign) === 'yes') {
223 |       return; // Don't send a report.
224 |     }
225 | 
226 |     // The user agent will send the report to a default endpoint after a delay.
227 |     privateAggregation.contributeToHistogram({
228 |       bucket: BigInt(data.campaignId), // The bucket must be a BigInt.
229 |       value: 128, // A predetermined fixed value; see Scaling values.
230 |     });
231 | 
232 |     await sharedStorage.set(reportSentForCampaign, 'yes');
233 |   }
234 | }
235 | register('send-reach-report', SendReachReportOperation);
236 | ```
237 | 
238 | ### _K_+ frequency measurement
239 | 
240 | By instead maintaining a counter in shared storage, the approach for cross-site
241 | reach measurement could be extended to _K_+ frequency measurement, i.e.
242 | measuring the number of users who have seen _K_ or more ads on a given browser,
243 | for a pre-chosen value of _K_. A unary counter can be maintained by calling
244 | `window.sharedStorage.append("freq", "1")` on each ad view. Then, the
245 | `send-reach-report` operation would only send a report if there are at least
246 | _K_ characters stored at the key `"freq"`. This counter could also be used to
247 | filter out ads that have been shown too frequently.
248 | 
249 | ## Goals
250 | 
251 | This API aims to support a wide range of aggregation use cases, including
252 | measurement of demographics and reach, while remaining generic and flexible. We
253 | intend for this API to be callable in as many contexts and situations as
254 | possible, including the isolated contexts used by other privacy-preserving API
255 | proposals for processing cross-site data. This will help to foster continued
256 | growth, experimentation, and rapid iteration in the web ecosystem; to support a
257 | thriving, open web; and to prevent ossification and unnecessary rigidity.
258 | 
259 | This API also intends to avoid the privacy risks presented by unpartitioned
260 | storage and third-party cookies. In particular, it seeks to prevent off-browser
261 | cross-site recognition of users. Developer adoption of this API will help to
262 | replace the usage of third-party cookies, making the web more private by
263 | default.
264 | 
265 | ### Non-goals
266 | 
267 | This API does not intend to regulate what data is allowed as an input to
268 | aggregation. Instead, the aggregation service will protect this input by adding
269 | noise to limit the impact of any individual's input data on the output. Learn
270 | more about [contribution bounding and
271 | budgeting](#contribution-bounding-and-budgeting) below.
272 | 
273 | Further, this API does not seek to prevent (probabilistic) cross-site inference
274 | about sufficiently large groups of people.
That is, learning high-confidence
275 | properties of large groups is ok as long as we can bound how much an individual
276 | affects the aggregate measurement. See also [discussion of this non-goal in
277 | other
278 | settings](https://differentialprivacy.org/inference-is-not-a-privacy-violation/).
279 | 
280 | ## Operations
281 | 
282 | The current design supports one operation: constructing a histogram. This
283 | operation matches the description in the [Attribution Reporting API with
284 | Aggregatable Reports
285 | explainer](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#two-party-flow),
286 | with a fixed domain of 'buckets' that the reports contribute bounded integer
287 | 'values' to. Note that sums can be computed using the histogram operation by
288 | contributing values to a fixed, predetermined bucket and ignoring the returned
289 | values for all other buckets after querying.
290 | 
291 | Over time, we should be able to support additional operations by extending the
292 | aggregation service infrastructure. For example, we could add a 'count
293 | distinct' operation that, like the histogram operation, uses a fixed domain of
294 | buckets, but without any values. Instead, the computed result would be
295 | (approximately) how many _unique_ buckets the reports contributed to. Other
296 | possible additions include supporting federated learning via privately
297 | aggregating machine learning update vectors or extending the histogram operation
298 | to support values that are vectors of integers rather than only scalars.
299 | 
300 | The operation would be indicated by using the appropriate JavaScript call, e.g.
301 | `contributeToHistogram()` and `contributeToDistinctCount()` for histograms and
302 | count distinct, respectively.
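The note above about computing sums via the histogram operation can be made concrete. The following is an illustrative, noiseless simulation in plain JavaScript, not the real aggregation service; `SUM_BUCKET`, `makeContribution`, and `aggregate` are hypothetical names introduced only for this sketch:

```javascript
// Illustrative only: a noiseless simulation of server-side aggregation,
// showing how a sum falls out of the histogram operation. The real
// computation happens inside the aggregation service, which also adds noise.
const SUM_BUCKET = 42n; // A fixed, predetermined bucket (hypothetical choice).

// Each contributeToHistogram() call made for a sum conceptually emits one of these.
function makeContribution(value) {
  return { bucket: SUM_BUCKET, value };
}

// Sum the values per bucket, as the histogram operation would (minus noise).
function aggregate(contributions) {
  const histogram = new Map();
  for (const { bucket, value } of contributions) {
    histogram.set(bucket, (histogram.get(bucket) ?? 0) + value);
  }
  return histogram;
}

const histogram = aggregate([makeContribution(3), makeContribution(5), makeContribution(7)]);
// Reading the fixed bucket (and ignoring all others) yields the sum.
const sum = histogram.get(SUM_BUCKET); // 15
```

Because every contribution targets the same predetermined bucket, the queried value for that bucket is simply the (noisy) sum of all contributed values.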
303 | 304 | ## Reports 305 | 306 | The report, including the payload, will mirror the [structure proposed for the Attribution Reporting 307 | API](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#aggregatable-reports). 308 | However, a few details will change. For example, fields with no equivalent 309 | on this API (e.g. `attribution_destination` and `source_registration_time`) 310 | won't be present. Additionally, the `api` field will contain either 311 | `"protected-audience"` 312 | or `"shared-storage"` to reflect which API's context requested the report. 313 | 314 | The following is an example report showing the JSON format 315 | ```jsonc 316 | { 317 | "shared_info": "{\"api\":\"protected-audience\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}", 318 | 319 | "aggregation_service_payloads": [ 320 | { 321 | "payload": "[base64-encoded data]", 322 | "key_id": "[string]", 323 | 324 | // Optional debugging information if debugging is enabled 325 | "debug_cleartext_payload": "[base64-encoded unencrypted payload]", 326 | } 327 | ], 328 | 329 | // Optional debugging information if debugging is enabled and debug key specified 330 | "debug_key": "[64 bit unsigned integer]" 331 | } 332 | ``` 333 | 334 | As described earlier, these reports will be sent to the reporting origin after a 335 | delay. The URL path used for sending the reports will be 336 | `/.well-known/private-aggregation/report-protected-audience` and 337 | `/.well-known/private-aggregation/report-shared-storage` for reports triggered 338 | within a Protected Audience or Shared Storage context, respectively. 339 | 340 | ### Temporary debugging mechanism 341 | 342 | _While third-party cookies are still available_, we plan to have a temporary 343 | mechanism available that allows for easier debugging. 
This mechanism involves
344 | temporarily relaxing some privacy constraints. It will help ensure that the API
345 | can be fully understood during roll-out, help flush out any bugs (either in
346 | browser or caller code), and allow easier comparison of performance against
347 | cookie-based alternatives.
348 | 
349 | This mechanism is similar to the Attribution Reporting API's [debug aggregatable
350 | reports](https://github.com/WICG/attribution-reporting-api/blob/main/EVENT.md#optional-extended-debugging-reports).
351 | When debug mode is enabled for a report, a cleartext version of the payload
352 | will be included in the report. Additionally, the `shared_info` will also include
353 | the flag `"debug_mode": "enabled"` to allow the aggregation service to support
354 | debugging functionality on these reports.
355 | 
356 | This data will only be available in a transitional phase while third-party
357 | cookies are available and are already capable of user tracking. The debug mode
358 | will only be enabled if the context is able to access third-party cookies and
359 | the browser has third-party cookies generally enabled. That is, it will be
360 | disabled if third-party cookies are disabled/deprecated generally or for a
361 | particular site/context. Note that if third-party cookies are generally
362 | disabled, but enabled for a particular site, debug mode will not be enabled for
363 | that site, to protect data saved from other sites.
364 | 
365 | Though the debug mode is tied to third-party cookie availability, browsers may
366 | temporarily allow debug mode without third-party cookies in order to support
367 | testing, such as the browsers in the [Mode
368 | B](https://developers.google.com/privacy-sandbox/setup/web/chrome-facilitated-testing#mode_b_disable_1_of_third-party_cookies)
369 | group of Chrome-facilitated testing.
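For illustration, a debug-enabled report might look like the following sketch, based on the example report format shown earlier; all bracketed values are placeholders. Note the `"debug_mode": "enabled"` flag inside `shared_info` and the `debug_cleartext_payload` field:

```jsonc
{
  // shared_info now carries the "debug_mode":"enabled" flag.
  "shared_info": "{\"api\":\"shared-storage\",\"debug_mode\":\"enabled\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",

  "aggregation_service_payloads": [
    {
      "payload": "[base64-encoded data]",
      "key_id": "[string]",
      "debug_cleartext_payload": "[base64-encoded unencrypted payload]"
    }
  ],

  "debug_key": "[64 bit unsigned integer]"
}
```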
370 | 
371 | #### Enabling
372 | 
373 | The following JavaScript call enables debug mode for all future reports
374 | requested in that context (e.g. a Shared Storage operation or Protected Audience
375 | function call):
376 | ```js
377 | privateAggregation.enableDebugMode();
378 | ```
379 | The browser can optionally apply debug mode to reports requested earlier in that
380 | context.
381 | 
382 | This JavaScript function can only be called once per context. Any subsequent
383 | calls will throw an exception.
384 | 
385 | #### Debug keys
386 | To allow sites to associate reports with the contexts that triggered them, we
387 | also allow setting 64-bit unsigned integer debug keys. These keys are passed as
388 | an optional field to the JavaScript call, for example:
389 | ```js
390 | privateAggregation.enableDebugMode({debugKey: 1234n});
391 | ```
392 | 
393 | #### Duplicate debug report
394 | 
395 | When debug mode is enabled, an additional, duplicate debug report will be sent
396 | immediately (i.e. without the random delay) to a separate debug endpoint. This
397 | endpoint will use a path like
398 | `/.well-known/private-aggregation/debug/report-protected-audience` (and the
399 | equivalent for Shared Storage).
400 | 
401 | The debug reports should be almost identical to the normal reports, including
402 | the additional debug fields. However, the payload ciphertext will differ due to
403 | repeating the encryption operation, and the `key_id` may differ if the previous
404 | key has since expired or the browser randomly chose a different valid public
405 | key.
406 | 
407 | ### Reducing volume by batching
408 | 
409 | In the case of multiple calls to `contributeToHistogram()`, we can reduce report
410 | volume by sending a single report with multiple contributions instead of
411 | multiple reports. For this to be possible, the different calls must involve the
412 | same reporting origin and the same API (i.e. Protected Audience or Shared
413 | Storage).
414 | Additionally, the calls must be made at a similar time, as the reporting time
415 | will necessarily be shared.
416 | 
417 | #### Batching scope
418 | 
419 | For calls within a Shared Storage worklet, calls within the same Shared Storage
420 | operation should be batched together.
421 | 
422 | For calls within a Protected Audience worklet, calls using the same reporting
423 | origin within the same auction should be batched together. This should happen
424 | even between different interest groups or Protected Audience function calls.
425 | However, reports triggered via `window.fence.reportEvent()` (see
426 | [here](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-bidding-data-associated-with-an-event-in-a-frame)
427 | for more detail) should only be batched per-event. This avoids excessive delay
428 | if this event is triggered substantially later. Reports for Protected Audience
429 | bidders may not share the same [aggregation coordinator
430 | choice](#aggregation-coordinator-choice). In this case, calls should be batched
431 | separately for the different coordinator choices.
432 | 
433 | One consideration in the short term is that these calls may have different
434 | associated [debug modes or keys](#temporary-debugging-mechanism). In this case,
435 | only calls sharing those details should be batched together.
436 | 
437 | #### Limiting the number of contributions per report
438 | 
439 | We will also need a limit on the number of contributions within a single report.
440 | In the case that too many contributions are specified within a ‘batching scope’,
441 | we should truncate them to the limit. To reduce the impact of this limit, we
442 | will merge any contributions that have the same bucket and [filtering
443 | ID](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md#proposal-filtering-id-in-the-encrypted-payload)
444 | before truncation.
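The merge-then-truncate behavior described above can be sketched as follows. This is an illustrative sketch only; the `{bucket, value, filteringId}` contribution shape and the `mergeAndTruncate` helper are assumptions for illustration, not the wire format:

```javascript
// Illustrative sketch: merge contributions that share a bucket and filtering
// ID, then truncate to the per-report contribution limit.
function mergeAndTruncate(contributions, maxPerReport) {
  const merged = new Map();
  for (const { bucket, value, filteringId } of contributions) {
    const key = `${bucket}:${filteringId}`;
    const existing = merged.get(key);
    if (existing) {
      existing.value += value; // Same bucket and filtering ID: merge values.
    } else {
      merged.set(key, { bucket, value, filteringId });
    }
  }
  // Truncate only after merging, so fewer real contributions are lost.
  return [...merged.values()].slice(0, maxPerReport);
}

const result = mergeAndTruncate(
  [
    { bucket: 1n, value: 2, filteringId: 0 },
    { bucket: 1n, value: 3, filteringId: 0 }, // merges with the first entry
    { bucket: 2n, value: 5, filteringId: 0 },
    { bucket: 3n, value: 7, filteringId: 0 }, // truncated if the limit is 2
  ],
  2);
// result: [{ bucket: 1n, value: 5, ... }, { bucket: 2n, value: 5, ... }]
```

Merging first means a caller who accidentally splits one logical contribution across several calls is not penalized by the truncation step.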
445 | 446 | Although larger reports have higher utility, they are also more expensive for 447 | the aggregation service to process. To accommodate use cases with diverse 448 | utility requirements and cost tolerances, we will attempt to select reasonable 449 | defaults with optional overrides: 450 | 451 | - *Default limits:* The default limit may depend on the identity of the calling 452 | API. In particular, Protected Audience reports may benefit from a higher limit 453 | more than Shared Storage reports. Our implementation plan is to set the 454 | default limit at 20 contributions per report for Shared Storage and 100 455 | contributions per report for Protected Audience. 456 | 457 | - *Per-context limits:* Callers may request a different limit on each isolated 458 | context they create. Since this affects the payload size, the requested limit 459 | must be specified from outside an isolated context. Consequently, Protected 460 | Audience buyers cannot set per-context limits. The browser must clamp 461 | excessively large values to some maximum value. Our implementation plan is to 462 | clamp the requested limit to a maximum of 1000 contributions per report. 463 | 464 | - *Global config:* A more complex design that enables sites to configure a 465 | global limit may also be possible, but it requires further analysis. (See 466 | [issue #81].) 467 | 468 | [issue #81]: https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81 469 | 470 | #### Padding 471 | 472 | The size of the encrypted payload may reveal information about the number of 473 | contributions embedded in the aggregatable report. This can be mitigated by 474 | padding the plaintext payload (e.g. to a fixed size). In the shorter term, we 475 | plan to pad the payload by adding 'null' contributions (i.e. with value 0) to 476 | a fixed length. In the future, we plan to instead append bytes to a fixed 477 | length, but this will require updating the payload version. 
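The shorter-term padding plan described above can be sketched as follows; `padContributions` is a hypothetical helper name, and the fixed length is an arbitrary choice for illustration:

```javascript
// Illustrative sketch: pad the payload's contribution list with 'null'
// contributions (value 0) to a fixed length, so the encrypted payload size
// does not reveal how many real contributions the report carries.
function padContributions(contributions, fixedLength) {
  const padded = contributions.slice(0, fixedLength);
  while (padded.length < fixedLength) {
    padded.push({ bucket: 0n, value: 0 }); // Null contribution.
  }
  return padded;
}

const padded = padContributions([{ bucket: 5n, value: 128 }], 20);
// padded.length is 20 no matter how many real contributions were made.
```

Null contributions are safe to include because a zero-valued contribution does not change any bucket's aggregated value.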
478 | 479 | ### Aggregation coordinator choice 480 | 481 | This API should support multiple deployment options for the aggregation service, 482 | e.g. deployments on [different cloud 483 | providers](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#initial-experiment-plans). 484 | To avoid a leak, this choice should not be possible from within an isolated 485 | context when a [context 486 | ID](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#shared-storage) 487 | is set. 488 | 489 | We propose a new optional string field `aggregationCoordinatorOrigin` to allow 490 | developers to specify the deployment option, e.g. the origin for the aggregation 491 | service deployed on AWS, GCP, and other platforms in the future. The specified 492 | origin would need to be on an allowlist maintained by the browser. If none is 493 | specified, a default will be used. 494 | 495 | This allowlist matches the Attribution Reporting API's, available 496 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/aggregation_coordinator_origin_allowlist.md). 497 | 498 | Shared Storage callers would specify this field when calling the `run()` or 499 | `selectURL()` APIs, e.g. 500 | 501 | ```js 502 | sharedStorage.run('someOperation', {'privateAggregationConfig': 503 | {'aggregationCoordinatorOrigin': 'https://coordinator.example'}}); 504 | ``` 505 | 506 | Note that we are reusing the `privateAggregationConfig` that currently allows 507 | for specifying a context ID (see 508 | [here](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md)). 509 | 510 | Protected Audience sellers would specify this field in the `auctionConfig`, e.g. 511 | 512 | ```js 513 | const myAuctionConfig = { 514 | ... 
515 |   'privateAggregationConfig': {
516 |     'aggregationCoordinatorOrigin': 'https://coordinator.example'
517 |   }
518 | };
519 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig);
520 | ```
521 | 
522 | For Protected Audience bidders, we plan to allow this choice to be set for each
523 | interest group via `navigator.joinAdInterestGroup()`, e.g.
524 | 
525 | ```js
526 | const interestGroup = {
527 |   ...
528 |   'privateAggregationConfig': {
529 |     'aggregationCoordinatorOrigin': 'https://coordinator.example'
530 |   }
531 | };
532 | await navigator.joinAdInterestGroup(interestGroup);
533 | ```
534 | This setting could be overridden via the typical interest group
535 | mechanisms. For example, the update mechanism could support a new
536 | `privateAggregationConfig` key matching the call to `joinAdInterestGroup()`.
537 | 
538 | ## Privacy and security
539 | 
540 | ### Metadata readable by the reporting origin
541 | 
542 | Reports will, by default, come with a variety of (unencrypted) metadata that the
543 | reporting origin will be able to directly read. While this metadata can be
544 | useful, we must be careful to balance the impact on privacy. Here are some
545 | examples of metadata that could be included, along with some potential risks:
546 | 
547 | - The originally scheduled reporting time (noised within an ~hour granularity)
548 |   - Could be used to identify users on the reporting site within a time window
549 |   - Note that combining this with the actual timestamp the report was received
550 |     could reveal if the user's device was offline, etc.
551 | - The reporting origin
552 |   - Determined by the execution context's origin, but a site could use different
553 |     subdomains, e.g. to separate use cases.
554 | - Which API triggered the report
555 |   - The `api` field and the endpoint path indicate which API triggered the
556 |     report (e.g. Protected Audience or Shared Storage).
557 | - The API version 558 | - A version string used to allow future incompatible changes to the API. This 559 | should usually correspond to the browser version and should not change 560 | often. 561 | - Encrypted payload sizes 562 | - If we do not carefully add padding or enforce that all reports are of the 563 | same natural size, this may expose some information about the contents of 564 | the report. 565 | - Debugging information 566 | - If [debugging](#temporary-debug-mechanism) is enabled, some additional 567 | metadata will be provided. While this information may be potentially 568 | identifying, it will only be available temporarily: while third-party 569 | cookies are enabled. 570 | - Developer-selected metadata 571 | - See [open question: what metadata to 572 | allow](#open-question-what-metadata-to-allow). 573 | 574 | #### Open question: what metadata to allow 575 | 576 | It remains an open question what metadata should be included or allowed in the 577 | report and how that metadata could be selected or configured. Note that any 578 | variation in the reporting endpoint (such as the URL path) would, for this 579 | analysis, be equivalent to including the selected endpoint as additional 580 | metadata. 581 | 582 | While allowing a developer to specify arbitrary metadata from an isolated 583 | context would negate the privacy goals of the API, specifying a report's 584 | metadata from a non-isolated context (e.g. a main document) may be less 585 | worrisome. This could improve the API's utility and flexibility. For example, 586 | allowing this might simplify usage for a single origin using the API for 587 | different use cases. This non-isolated metadata selection could also allow for 588 | first-party trust signals to be associated with a report. 589 | 590 | Alternatively, there may be ways to "noise" the metadata to achieve 591 | differential privacy. Further study and consideration are needed here.
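To ground the discussion above, the cleartext portion of an aggregatable report surfaces this metadata roughly as follows (the values are placeholders and the exact field set here is illustrative, not normative):

```jsonc
{
  // The shared_info string carries the API, version, reporting origin and
  // (coarsened) scheduled report time discussed above.
  "shared_info": "{\"api\":\"shared-storage\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporting.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",
  "aggregation_service_payloads": [
    {
      "key_id": "[string identifying the public key used]",
      "payload": "[base64-encoded encrypted data]"
    }
  ]
}
```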
592 | 593 | ### Contribution bounding and budgeting 594 | 595 | As described above, the aggregation service protects user privacy by adding 596 | noise, aiming to have a framework that could support differential privacy. 597 | However, simply protecting each _query_ to the aggregation service or each 598 | _report_ sent from a user agent would be vulnerable to an adversary that repeats 599 | queries or issues multiple reports, and combines the results. Instead, we 600 | propose the following structure. See [below](#implementation-plan) for the 601 | specific choices we have made in our current implementation. 602 | 603 | First, each user agent will limit the contribution that it could make to the 604 | outputs of aggregation service queries. (Note that this limitation is a rate 605 | over time, not an absolute number; see [Partition choice](#partition-choice) 606 | below.) In the case of a histogram operation, the user agent could bound the 607 | L<sub>1</sub> norm of the _values_, i.e. the sum of all the contributions across 608 | all buckets. The user agent could consider other bounds, e.g. the L<sub>2</sub> 609 | norm. 610 | 611 | The user agent would also need to determine what the appropriate 612 | 'partition' is for this budgeting; see [partition choice](#partition-choice) 613 | below. For example, there could be a separate L<sub>1</sub> 'budget' for each 614 | origin, resetting periodically. Exceeding these limits will cause future 615 | contributions to be silently dropped. 616 | 617 | Second, the server-side processing will limit the number of queries that can be 618 | performed on reports containing the same 'shared ID' to a small number (e.g. a 619 | single query). See 620 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches) 621 | for more detail. 622 | This also limits the number of queries that can contain the same report. The 623 | shared ID is a hash representing the partition.
It is computed by the aggregation 624 | service using data available in each aggregatable report. Note that the hash's 625 | input data will need to differ from the Attribution Reporting API (e.g. to 626 | exclude fields like the destination site that don't exist in this API). 627 | 628 | With the above restrictions, the processing servers only need to sample the 629 | noise for each query from a fixed distribution. In principle, this fixed noise 630 | could be used to achieve differential privacy, e.g. by using Laplace noise with 631 | the following parameter: (max number of queries per report) \* (max 632 | L<sub>1</sub> per user per partition) / epsilon. 633 | 634 | #### Scaling values 635 | 636 | Developers will need to choose an appropriate scale for their measurements. In 637 | other words, they will likely want to multiply their values by a fixed, 638 | predetermined constant. 639 | 640 | Scaling the values up, i.e. choosing a larger constant, will reduce the 641 | _relative_ noise added by the server (as the noise has constant magnitude). 642 | However, this will also cause the limit on the L<sub>1</sub> norm of the values 643 | contributed to reports, i.e. the sum of all contributions across all buckets, to 644 | be reached faster. Recall that no more reports can be sent after depleting the 645 | budget. 646 | 647 | Scaling the values down, i.e. choosing a smaller constant, will increase the 648 | relative noise, but would also reduce the risk of reaching the budget limit. 649 | Developers will have to balance these considerations to choose the appropriate 650 | scale. The examples below explore this in more detail. 651 | 652 | #### Examples 653 | 654 | These examples use an L<sub>1</sub> bound of 2<sup>16</sup> = 65 536. 655 | 656 | Let's consider a basic measurement case: a binary histogram of counts. For 657 | example, using bucket 0 to indicate a user is a member of some group and bucket 658 | 1 to indicate they are not.
Suppose that we don't want to measure anything 659 | else and we've set up our measurement so that each user is only measured once 660 | (per partition per time period). Then, each user could contribute their full limit 661 | (i.e. 2<sup>16</sup>) to the appropriate bucket. After all the reports for all 662 | users are collected, a single query would be performed and the server would add 663 | noise (from a fixed distribution) to each bucket. We would then divide the 664 | values by 2<sup>16</sup> to obtain a fairly precise result (with standard 665 | deviation 1/2<sup>16</sup> times that of the server's noise distribution). 666 | 667 | If each user had instead just contributed a value of 1, we wouldn't have to 668 | divide the query result by 2<sup>16</sup>. However, each user would end the week 669 | with the vast majority of their budget remaining -- and the processing servers 670 | would still add the same noise. So, our result would be much less precise (with 671 | standard deviation equal to that of the server's noise distribution). 672 | 673 | On the other hand, suppose we wanted to allow each user to report multiple times 674 | per time period to this same binary histogram. In this case, we would have to reduce 675 | each contribution from 2<sup>16</sup> to a lower predetermined value, say, 676 | 2<sup>12</sup>. Then, each user would be allowed to contribute up to 16 times to 677 | the histogram. Note that you have to scale down each contribution by the _worst 678 | case_ number of contributions per user. Otherwise, users contributing too much 679 | will have reports dropped. 680 | 681 | #### Partition choice 682 | 683 | A narrow partition (e.g. giving each top-level _URL_ a separate budget) may not 684 | sufficiently protect privacy. Unfortunately, very broad partitions (e.g. a 685 | single budget for the browser) may allow malicious (or simply greedy) actors to 686 | exhaust the budget, denying service to all others. 687 | 688 | The ergonomics of the partition should also be considered. 
Some choices might 689 | require coordination between different entities (e.g. different third parties on 690 | one site) or complex delegation mechanisms. Other choices would require complex 691 | accounting; for example, requiring [Shared 692 | storage](https://github.com/pythagoraskitty/shared-storage) to record the source 693 | of each piece of data that could have contributed (even indirectly) to a report. 694 | 695 | Note also that it is important to include a time component in the partition, 696 | e.g. resetting limits periodically. This does risk long-term information leakage 697 | from dedicated adversaries, but is essential for utility. Other options for 698 | recovering from an exhausted budget may be possible but need further 699 | exploration, e.g. allowing a site to clear its data to reset its budget. 700 | 701 | #### Implementation plan 702 | 703 | We plan to enforce a per-[site](https://web.dev/same-site-same-origin/#site) 704 | budget that resets every 10 minutes; that is, we will bound the contributions 705 | that any site can make to a histogram over any 10-minute rolling window. We plan 706 | to use an L<sub>1</sub> bound of 2<sup>16</sup> = 65 536 for this budget; this 707 | aligns with the [Attribution Reporting API with Aggregatable Reports 708 | explainer](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#privacy-budgeting). 709 | 710 | As a backstop to limit worst-case leakage, we plan a separate, looser per-site 711 | bound that is enforced on a 24-hour rolling window, limiting the daily 712 | L<sub>1</sub> norm to 2<sup>20</sup> = 1 048 576. 713 | 714 | This site will match the site of the execution environment, i.e. the site of the 715 | reporting origin, no matter which top-level sites are involved. For the earlier 716 | [example](#examples), this would correspond to the `runAdAuction()` caller 717 | within `reportResult()` and the interest group owner within 718 | `reportWin()`/`reportLoss()`.
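The rolling-window budgeting described above can be modeled with a small sketch. This is a simplified illustration of the accounting only, not the browser's implementation; a real implementation would track the 10-minute and 24-hour windows per site simultaneously:

```js
// Simplified model of per-site contribution budgeting over a rolling window.
// For illustration only; not the browser's actual implementation.
class ContributionBudget {
  constructor(limit, windowMs) {
    this.limit = limit;        // e.g. 2 ** 16 for the 10-minute window
    this.windowMs = windowMs;  // e.g. 10 * 60 * 1000
    this.entries = [];         // pairs of [timestampMs, consumedValue]
  }

  // Sums the budget consumed inside the rolling window ending at `nowMs`;
  // older entries no longer count and are discarded.
  usedAt(nowMs) {
    this.entries = this.entries.filter(([t]) => t > nowMs - this.windowMs);
    return this.entries.reduce((sum, [, v]) => sum + v, 0);
  }

  // Attempts to consume `value` (the L1 weight of a report's contributions).
  // Returns false if the report would exceed the budget; it is then dropped.
  tryConsume(value, nowMs) {
    if (this.usedAt(nowMs) + value > this.limit) return false;
    this.entries.push([nowMs, value]);
    return true;
  }
}

const tenMinBudget = new ContributionBudget(2 ** 16, 10 * 60 * 1000);
```

For example, a site that consumes the full 2<sup>16</sup> at one moment cannot contribute again until those entries fall out of the 10-minute window.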
719 | 720 | We initially plan to have two separate budgets: one for calls within Shared 721 | Storage worklets and one for Protected Audience worklets. However, see [shared 722 | contribution budget](#shared-contribution-budget) below. 723 | 724 | ### Enrollment and attestation 725 | Use of this API requires 726 | [enrollment](https://github.com/privacysandbox/attestation/blob/main/how-to-enroll.md) 727 | and 728 | [attestation](https://github.com/privacysandbox/attestation/blob/main/README.md#core-privacy-attestations) 729 | via the [Privacy Sandbox enrollment attestation 730 | model](https://github.com/privacysandbox/attestation/blob/main/README.md). 731 | 732 | When an aggregatable report is triggered, a check will be performed to determine 733 | whether the calling 734 | [site](https://html.spec.whatwg.org/multipage/browsers.html#site) is enrolled and 735 | attested. If this check fails, the report will be dropped (i.e. not sent). 736 | 737 | ## Future Iterations 738 | 739 | ### Supporting different aggregation modes 740 | 741 | This API will support an optional parameter `alternativeAggregationMode` that 742 | accepts a string value. This parameter will allow developers to choose among 743 | different options for aggregation infrastructure supported by the user agent. 744 | This will allow experimentation with new technologies, and allow us to test new 745 | approaches without removing core functionality provided by the default option. 746 | The `"experimental-poplar"` option will implement a protocol similar to the 747 | [poplar](https://github.com/cjpatton/vdaf/blob/main/draft-patton-cfrg-vdaf.md#poplar1-poplar1) 748 | VDAF in the [PPM 749 | Framework](https://datatracker.ietf.org/doc/draft-gpew-priv-ppm/).
750 | 751 | ### Shared contribution budget 752 | 753 | Separating contribution budgets for Shared Storage worklets and Protected 754 | Audience worklets 755 | provides additional flexibility; for example, some partition choices may not be 756 | compatible (e.g. a per-interest group budget). However, we could consider 757 | merging the two budgets in the future. 758 | 759 | ### Authentication and data integrity 760 | 761 | To ensure the integrity of the aggregated data, it may be desirable to support a 762 | mechanism for authentication. This would help limit the impact of reports sent 763 | from malicious or misbehaving clients on the results of each query. 764 | 765 | To ensure privacy, the reporting endpoint should be able to determine whether a 766 | report came from a trusted client without determining _which_ client sent it. We 767 | may be able to use [trust tokens](https://github.com/WICG/trust-token-api) for 768 | this, but further design work is required. 769 | 770 | ### Aggregate error reporting 771 | 772 | Unfortunately, errors that occur within isolated execution contexts cannot be 773 | easily reported (e.g. to a non-isolated document or over the network). If 774 | allowed, such errors could be used as an information channel. While these errors 775 | could still appear in the console, adding a mechanism for aggregate error 776 | reporting would also aid developers. This reporting could be automatic or 777 | could require configuration, according to the developers' preferences. 778 | 779 | ------ 780 | 781 | **This document is an individual draft proposal. 
It has not been adopted by the Private Advertising Technology Community Group.** 782 | -------------------------------------------------------------------------------- /error_reporting.md: -------------------------------------------------------------------------------- 1 | # Aggregate error reporting 2 | 3 | #### Table of contents 4 | 5 | - [Introduction](#introduction) 6 | - [Motivating use cases](#motivating-use-cases) 7 | - [Measuring reports dropped due to insufficient budget](#measuring-reports-dropped-due-to-insufficient-budget) 8 | - [Detecting unhandled crashes](#detecting-unhandled-crashes) 9 | - [Defining histogram contributions to send on error events](#defining-histogram-contributions-to-send-on-error-events) 10 | - [New JavaScript API surface](#new-javascript-api-surface) 11 | - [Error events](#error-events) 12 | - [Associating error events with a single interest group](#associating-error-events-with-a-single-interest-group) 13 | - [Measuring insufficient contribution budget](#measuring-insufficient-contribution-budget) 14 | - [Privacy and security](#privacy-and-security) 15 | - [Future iterations](#future-iterations) 16 | - [Global config](#global-config) 17 | - [Per-interest group error-event contributions](#per-interest-group-error-event-contributions) 18 | 19 | ## Introduction 20 | 21 | There are a range of error conditions that can be hit when using the Private 22 | Aggregation API. For example, the privacy budget could run out, preventing any 23 | further histogram contributions. As the errors themselves may be cross-site 24 | information, we cannot simply expose them to the page for users who have 25 | disabled third-party cookies. Instead, we propose a new aggregate, noised error 26 | reporting mechanism that leverages the existing reporting pipelines through the 27 | Aggregation Service. 
28 | 29 | We aim to allow developers to measure the frequency of certain 'error events' 30 | and to split these measurements on relevant developer-specified dimensions (e.g. 31 | version of deployed code). We also aim to support measuring certain error events 32 | in the Shared Storage and Protected Audience APIs that cannot be directly 33 | exposed due to similar cross-site information risks. We aim to support these use 34 | cases with minimal or no privacy regressions. 35 | 36 | The proposed mechanism limits privacy impacts by embedding these debugging 37 | measurements in the same aggregatable reports already used by the API. Ad techs 38 | will be able to avoid interfering with their existing measurements by using 39 | filtering IDs or separate bucket spaces. Note that we have also 40 | [proposed](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/named_budgets.md) 41 | a mechanism to reserve privacy budget for different types of contributions. This 42 | is necessary to allow for budget exhaustion measurement, but will support 43 | additional uses too. 44 | 45 | Note that the Attribution Reporting API (ARA) has introduced a similar feature 46 | called [Aggregate Debug 47 | Reporting](https://github.com/WICG/attribution-reporting-api/blob/main/aggregate_debug_reporting.md). 48 | This allows for measuring events like reaching rate limits. Our proposal has a 49 | few differences in design due to differences in our setting. 50 | 51 | ## Motivating use cases 52 | 53 | ### Measuring reports dropped due to insufficient budget 54 | 55 | Developers using the Private Aggregation API need to [choose an appropriate 56 | scale](https://github.com/patcg-individual-drafts/private-aggregation-api#scaling-values) 57 | for their measurements. This choice is a trade-off between the relative noise 58 | added by the aggregation service and the risk of dropped reports due to budget 59 | limits. 
Measuring the fraction of reports that are dropped due to insufficient 60 | budget would allow developers to better evaluate the impact of their scale 61 | choice. 62 | 63 | ### Detecting unhandled crashes 64 | 65 | Developers using Shared Storage or Protected Audience may accidentally ship code 66 | that crashes, i.e. by triggering a JavaScript exception that isn't caught. 67 | Measuring these situations directly would improve detection. Being able to split 68 | these measurements by relevant dimensions would also ease debugging. 69 | 70 | ## Defining histogram contributions to send on error events 71 | 72 | ### New JavaScript API surface 73 | 74 | To support measuring these error conditions, we propose extending the 75 | [existing](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-api-informal-specification) 76 | `contributeToHistogramOnEvent()` API, currently only exposed in Protected 77 | Audience script runners. For example: 78 | 79 | ```js 80 | privateAggregation.contributeToHistogramOnEvent( 81 | "reserved.uncaught-exception", { bucket: 123n, value: 45, filteringId: 6n }); 82 | ``` 83 | 84 | We would expand the existing list of `reserved.` events supported. We would also 85 | expose this API to Shared Storage, but without the ['filling 86 | in'](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-api-informal-specification) 87 | logic (i.e. without support for signalBuckets and signalValues). Note also that 88 | certain events would only be valid in one type of context; for example, the 89 | existing Protected Audience-specific 90 | [events](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#triggering-reports) 91 | would not be exposed to Shared Storage callers. 92 | 93 | These 'conditional' histogram contributions would be scoped to that specific 94 | JavaScript context. 95 | 96 | This approach is flexible. 
For example, it allows callers to ignore error events 97 | that they are not interested in, or to have two different error events trigger the 98 | same contribution. However, it is also somewhat verbose, requiring these calls 99 | to be repeated for each JavaScript context. There are also certain error cases 100 | that cannot be measured with this approach. See [Future 101 | iterations](#future-iterations) below for some extensions that may 102 | address these issues. 103 | 104 | ### Error events 105 | 106 | We propose the following events, but this list could be expanded in the future. 107 | 108 | The following events would be available in both Shared Storage and Protected 109 | Audience contexts: 110 | 111 | - `reserved.report-success`: a report was scheduled and no contributions were 112 | dropped 113 | - `reserved.too-many-contributions`: a report was scheduled, but some 114 | contributions were dropped due to the per-report limit 115 | - `reserved.empty-report-dropped`: a report was not scheduled as it had no 116 | contributions 117 | - `reserved.too-many-pending-reports`: a report was not scheduled as there were 118 | already too many pending reports 119 | - `reserved.insufficient-budget-for-report`: a report was not scheduled as there 120 | was not enough budget 121 | - `reserved.uncaught-exception`: a JavaScript exception was thrown and not 122 | caught in this context 123 | 124 | The following event would only be available in Shared Storage contexts: 125 | 126 | - `reserved.contribution-timeout-reached`: the JavaScript context was still 127 | running when the contribution timeout occurred 128 | 129 | The following events would only be available in Protected Audience contexts: 130 | 131 | - The existing events (`reserved.win`, `reserved.loss`, `reserved.always`) 132 | - Possibly additional events for various network failures, see [Future 133 | iterations](#future-iterations) below.
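As a sketch of how a caller might register these events up front, the snippet below packs a hypothetical deployed-code-version dimension together with an error-type code into the 128-bit bucket space. The encoding scheme, the `CODE_VERSION` constant, and the contribution value of 1 are our own illustrative choices, not part of the API:

```js
// Illustrative bucket encoding: the high bits identify the deployed code
// version and the low bits identify the error event. CODE_VERSION and the
// encoding are hypothetical choices made for this example.
const CODE_VERSION = 7n;
const ERROR_CODES = {
  "reserved.report-success": 0n,
  "reserved.too-many-contributions": 1n,
  "reserved.insufficient-budget-for-report": 2n,
  "reserved.uncaught-exception": 3n,
};

function errorBucket(event) {
  // Buckets are 128-bit values, so there is room for both dimensions.
  return (CODE_VERSION << 64n) | ERROR_CODES[event];
}

// privateAggregation is only defined inside Shared Storage worklets and
// Protected Audience script runners; there, each error event of interest
// could be registered as a conditional contribution up front.
if (typeof privateAggregation !== "undefined") {
  for (const event of Object.keys(ERROR_CODES)) {
    privateAggregation.contributeToHistogramOnEvent(
        event, { bucket: errorBucket(event), value: 1 });
  }
}
```

Splitting the bucket space by code version like this supports the goal from the introduction of slicing error measurements on developer-specified dimensions.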
134 | 135 | Note that errors in the reporting pipeline caused by public key fetching failures 136 | or running out of retries are not exposed. These errors would be difficult to 137 | report on given that they indicate the reporting pipeline is not functioning 138 | properly; additionally, it would be difficult to perform budgeting for these 139 | cases given that these occur well after the original report was scheduled. 140 | 141 | #### Associating error events with a single interest group 142 | 143 | Contributions associated with different interest groups but with the same 144 | reporting origin are [batched 145 | together](https://github.com/patcg-individual-drafts/private-aggregation-api#batching-scope) 146 | into a single report. However, the number of these other interest groups is not 147 | revealed to Protected Audience contexts. If this single report encounters an 148 | error, we do not want to trigger duplicate contributions due to multiple 149 | contexts registering the same contributions. 150 | 151 | So, we propose that each reporting-related error should only trigger an error event 152 | for one of the contexts that contributed to the report. The browser should pick 153 | one at random. Note that this aligns with the 154 | [proposal](https://github.com/WICG/turtledove/issues/1170) to define a new 155 | `reserved.once` event. 156 | 157 | ### Measuring insufficient contribution budget 158 | 159 | To measure the error events representing insufficient [contribution 160 | budget](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting), 161 | some of the budget for histogram contributions that measure this error event 162 | must be reserved. Otherwise, those contributions would likely also be dropped 163 | due to the budget limit.
This “named budget” functionality also supports other 164 | use cases and has been proposed in [a separate 165 | explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/named_budgets.md). 166 | 167 | ## Privacy and security 168 | 169 | This feature does not introduce any new privacy or security considerations as it 170 | only allows you to send histogram contributions that would've been permitted 171 | without the new feature. (That is, the only change is allowing these 172 | contributions to be conditional on an error event.) These contributions are 173 | embedded into the existing reports sent by the API and use contribution budget 174 | just like existing contributions. 175 | 176 | ## Future iterations 177 | 178 | ### Global config 179 | 180 | As a future extension, we could consider allowing these error-event 181 | contributions to be set up globally for a reporting origin (or site). A global 182 | config has already been proposed for other use cases (see issues 183 | [#81](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81#issuecomment-2091524214) 184 | and 185 | [#143](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/143)) 186 | so we could potentially use a single config for these use cases. 187 | 188 | This would reduce verbosity as calls that don't need to change frequently could 189 | be replaced. It could also allow for the addition of new error events for 190 | failures that would prevent scripts from being run, e.g. if a bidding script 191 | fails to load. Note that this wouldn't allow for contributions to vary based on 192 | specific details like the interest group or the user, and wouldn't allow 193 | sampling. We may need to consider a mechanism that allows for canceling or 194 | overriding these globally configured details. 
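As a purely hypothetical illustration (no config format has been proposed, and every field name here is invented), such a global config might declare fixed error-event contributions once per reporting site:

```jsonc
// Hypothetical shape only -- not a proposed or agreed format.
{
  "error_event_contributions": [
    { "event": "reserved.uncaught-exception", "bucket": "101", "value": 1 },
    { "event": "reserved.insufficient-budget-for-report", "bucket": "102", "value": 1 }
  ]
}
```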
195 | 196 | ### Per-interest group error-event contributions 197 | 198 | We could also consider allowing error-event contributions to be set per interest 199 | group. As with the global config, this would allow for the addition of new error 200 | events for failures that would prevent scripts from being run, e.g. if a bidding 201 | script fails to load. It would also allow these contributions to vary per user 202 | or per interest group. However, this configuration would use up some of the 203 | interest group storage budget, which may not be desired. 204 | -------------------------------------------------------------------------------- /flexible_filtering.md: -------------------------------------------------------------------------------- 1 | # More flexible contribution filtering for Aggregation Service queries 2 | 3 | _Note: This document proposes a new backwards-compatible change in the Private 4 | Aggregation API, Attribution Reporting API and Aggregation Service. While this 5 | new functionality is being developed, we still highly encourage testing the 6 | existing API functionalities to support core utility and compatibility needs._ 7 | 8 | #### Table of Contents 9 | 10 | - [Introduction](#introduction) 11 | - [Motivating use cases](#motivating-use-cases) 12 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences) 13 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id) 14 | - [Non-goals](#non-goals) 15 | - [Proposal: filtering ID in the encrypted payload](#proposal-filtering-id-in-the-encrypted-payload) 16 | - [Use case examples](#use-case-examples) 17 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-1) 18 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-1) 19 | - [Details](#details) 20 | - [Small ID space by default, but configurable](#small-id-space-by-default-but-configurable) 21 | - [Backwards 
compatibility](#backwards-compatibility) 22 | - [One ID per contribution](#one-id-per-contribution) 23 | - [Possible future extension: batching ID in the shared_info](#possible-future-extension-batching-id-in-the-shared_info) 24 | - [Use case examples](#use-case-examples-1) 25 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-2) 26 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-2) 27 | - [Details](#details-1) 28 | - [Requires deterministic reports and specifying batching ID from a single-site context](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context) 29 | - [Backwards compatibility](#backwards-compatibility-1) 30 | - [One ID per report](#one-id-per-report) 31 | - [Use with filtering ID](#use-with-filtering-id) 32 | - [Limits on number of IDs used](#limits-on-number-of-ids-used) 33 | - [Application to Attribution Reporting API](#application-to-attribution-reporting-api) 34 | - [Privacy considerations](#privacy-considerations) 35 | 36 | ## Introduction 37 | 38 | Currently, the Aggregation Service only allows each '[shared 39 | ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 40 | to be present in one query. A set of reports with the same shared ID cannot be 41 | split for separate queries, even if the resulting batches are disjoint. However, 42 | there have been requests to introduce additional flexibility to this query model 43 | (see GitHub issues for [Private 44 | Aggregation](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/92) 45 | and [Attribution Reporting](https://github.com/WICG/attribution-reporting-api/issues/732)). 46 | 47 | Here, we propose introducing a new _filtering ID_ set when a contribution is 48 | made and embedded in the encrypted payload. 
This allows for these queries to be 49 | split further, with the aggregation service filtering contributions based on the 50 | provided IDs. 51 | 52 | We also propose a possible future extension where a _batching ID_ is set from a 53 | first-party context and embedded in the `shared_info`. This would allow for the 54 | ad tech to filter the reports directly, improving the ergonomics for some use 55 | cases. 56 | 57 | ## Motivating use cases 58 | 59 | #### Processing contributions at different cadences 60 | 61 | For some measurements, it may be desirable to query the Aggregation Service less 62 | frequently; this would allow for more contributions to be aggregated before 63 | noise is added, improving the signal-to-noise ratio. However, for other 64 | measurements, it may be more valuable to receive a result faster. (Support for 65 | this use case has been requested for Attribution Reporting 66 | [here](https://github.com/WICG/attribution-reporting-api/issues/732).) Filtering 67 | IDs could be used to separate these measurements into different queries. 68 | 69 | #### Processing contributions by campaign ID 70 | 71 | An ad tech might want to process measurements — for example, reach measurements 72 | — separately for each advertising campaign. To allow for this, it might want to 73 | use a different filtering ID or batching ID for each campaign. Note that, 74 | without this new functionality, the advertiser is not part of the shared ID and 75 | so it's not currently possible to process these separately. 76 | 77 | ## Non-goals 78 | 79 | While we aim to increase the flexibility of report batching strategies, we don't 80 | intend to allow every report or every contribution to be queried separately. 81 | Further, we don't intend to allow for arbitrary groupings decided after 82 | reporting is complete. This is to ensure that the scale of aggregatable 83 | reporting accounting remains feasible, see [discussion 84 | below](#limits-on-number-of-ids-used). 
85 | 86 | ## Proposal: filtering ID in the encrypted payload 87 | 88 | We plan to introduce additional IDs in the payload called _filtering IDs_. By 89 | embedding these IDs within the encrypted payload, their values could be set 90 | within a worklet/script runner — e.g. for a Protected Audience bidder — and 91 | could even be chosen based on cross-site data. For example: 92 | 93 | ```js 94 | privateAggregation.contributeToHistogram( 95 | {bucket: 1234n, value: 56, filteringId: 3n}); 96 | ``` 97 | 98 | If no filtering ID is provided, a default ID of 0 will be used. (See also 99 | [Backwards compatibility](#backwards-compatibility) below.) 100 | 101 | As the reporting endpoint cannot determine the IDs within a given report, the 102 | aggregation service will provide new functionality for filtering contributions 103 | based on their IDs. In particular, each aggregation service query's parameters 104 | should provide a list of allowed filtering IDs and all contributions with other 105 | IDs will be filtered out. For example: 106 | 107 | ```jsonc 108 | // ... 109 | "job_parameters": { 110 | "output_domain_blob_prefix": "domain/domain.avro", 111 | "output_domain_bucket_name": "", 112 | "filtering_ids": [1, 3] // IDs to keep in the query 113 | }, 114 | ``` 115 | 116 | Note that this API is not final, e.g. it might make more sense to specify the 117 | IDs via an avro file. 118 | 119 | The aggregation service would include a filtering ID in the computation of each 120 | '[shared ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 121 | hash. For aggregatable report accounting, the aggregation service would assume 122 | that each filtering ID listed in the job parameters is present in every report. 123 | This avoids leaking any information about which IDs were actually present in 124 | each report. 
125 | 126 | ### Use case examples 127 | 128 | #### Processing contributions at different cadences 129 | 130 | As discussed [above](#processing-contributions-at-different-cadences), a 131 | reporting site may want to query the Aggregation Service at different cadences 132 | for different kinds of measurements. 133 | 134 | Filtering IDs could be used to separate these measurements into different 135 | queries. For example, you could specify a filtering ID of 1 for measurements 136 | that should be queried daily and an ID of 2 for measurements that should be 137 | queried monthly. Each day, the reporting site would then send a day's reports to 138 | the aggregation service and specify that only contributions with a filtering ID 139 | 1 should be processed. Each month, the reporting site would send an entire 140 | month's payloads (which have been sent earlier in the daily queries), but 141 | specify that only contributions with a filtering ID 2 should be processed. 142 | 143 | Note that, in this flow, every report needs to be sent to the aggregation 144 | service multiple times. However, as the filtering IDs are different, no 145 | _contribution_ is being included in an aggregation twice and so there are no 146 | issues with aggregatable report accounting. 147 | 148 | #### Processing contributions by campaign ID 149 | 150 | For certain use cases, the filtering ID may be a deterministic function of the 151 | context. For example, if an ad tech wants to process measurements separately for 152 | each campaign, it could use a different filtering ID for each campaign. As the 153 | campaign would be known outside the Shared Storage worklet, the ad tech could 154 | externally maintain a mapping from the [context 155 | ID](https://patcg-individual-drafts.github.io/private-aggregation-api/#aggregatable-report-context-id) 156 | to the filtering ID. 
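Concretely, the externally maintained mapping and the resulting per-campaign grouping might look like the following sketch. All names and shapes here are illustrative assumptions, and the payloads remain encrypted throughout.

```js
// Hypothetical sketch: the ad tech records the filtering ID it assigned to
// each context ID at request time, then groups incoming (still-encrypted)
// reports into per-campaign batches.
const contextToFilteringId = new Map([
  ['ctx-a', 101], // campaign 101
  ['ctx-b', 102], // campaign 102
]);

function groupReportsByFilteringId(reports) {
  const batches = new Map();
  for (const report of reports) {
    const id = contextToFilteringId.get(report.contextId);
    if (id === undefined) continue; // unknown context; ignored in this sketch
    if (!batches.has(id)) batches.set(id, []);
    batches.get(id).push(report);
  }
  return batches;
}

const batches = groupReportsByFilteringId([
  { contextId: 'ctx-a', payload: '<encrypted>' },
  { contextId: 'ctx-b', payload: '<encrypted>' },
  { contextId: 'ctx-a', payload: '<encrypted>' },
]);
console.log(batches.get(101).length); // 2
```

Each batch could then be sent in a separate aggregation query whose `filtering_ids` list contains just that campaign's ID.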
157 | 158 | When batching reports for the aggregation service, the ad tech could use this 159 | mapping to separate the reports by filtering ID, even though it cannot decrypt 160 | the payload. By avoiding reprocessing every report for each campaign ID, the 161 | number of IDs used can be much larger while keeping processing costs reasonable. 162 | 163 | ### Details 164 | 165 | #### Small ID space by default, but configurable 166 | 167 | The filtering ID would be an unsigned integer limited to a small number of bytes 168 | (1 byte = 8 bits) by default. If no filtering ID is provided, a value of 0 will 169 | be used. We limit the size of the ID space to prevent unnecessarily increasing 170 | the payload size and thus storage and processing costs. As filtering IDs are not 171 | readable by the reporting endpoint, processing multiple filtering IDs separately 172 | would typically require reprocessing the same reports for each query (see [the 173 | first example use](#processing-contributions-at-different-cadences-1) above). 174 | Given this performance constraint, it is unlikely that a larger ID space will be 175 | practical with this flow. 176 | 177 | However, other flows could make use of a larger ID space (see [the second 178 | example use case](#processing-contributions-by-campaign-id-1) above), so we plan 179 | to allow for the ID space to be configurable per-report, up to 8 bytes (i.e. 64 180 | bits). To avoid amplifying a counting attack due to the different payload size, 181 | we plan to make the number of reports emitted with this custom label size 182 | deterministic. This will result in a null report being sent if no contributions 183 | are made. Note that this means the filtering ID _space_ for Private Aggregation 184 | reports must also be specified outside Shared Storage worklets/Protected 185 | Audience script runners. 
186 | 187 | For Shared Storage and Protected Audience sellers, we propose reusing the 188 | `privateAggregationConfig` implemented/proposed for report verification, adding 189 | a new field, e.g. 190 | 191 | ```js 192 | sharedStorage.run('example-operation', { 193 | privateAggregationConfig: { 194 | contextId: 'example-id', 195 | filteringIdMaxBytes: 8 // i.e. allow up to a 64-bit integer 196 | } 197 | }); 198 | ``` 199 | 200 | We do not currently plan to allow the filtering ID bit size to be configured for 201 | Protected Audience bidders as these flows require context IDs to make the scale 202 | practical; we do not currently plan to expose context IDs to bidders (see the 203 | [explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner) 204 | for more discussion). We also do not plan on allowing these fields to be set 205 | from within fenced frames, as they may have access to cross-site information. 206 | 207 | #### Backwards compatibility 208 | 209 | For backwards compatibility, if no list of `filtering_ids` is provided in an 210 | aggregation query, the list containing only the default ID will be used (i.e. 211 | `[0]`). This means that any contributions that don't specify a filtering ID 212 | would be included in that aggregation, along with any contributions that 213 | explicitly specify the default ID. Additionally, the aggregation service will 214 | process reports using older format versions (i.e. before labels were supported) 215 | as if every contribution uses the default filtering ID. 216 | 217 | This should ensure that no changes need to be made to existing pipelines if 218 | filtering IDs are not needed. 219 | 220 | #### One ID per contribution 221 | 222 | We plan to allow for a filtering ID to be set individually for each contribution 223 | in a report's payload. 
To reduce the impact on payload size, we could consider 224 | instead limiting the number of distinct filtering IDs per report to a smaller 225 | number. However, this may pose ergonomic difficulties. 226 | 227 | ## Possible future extension: batching ID in the shared\_info 228 | 229 | Later, to improve ergonomics (see [example 230 | below](#processing-contributions-by-campaign-id-2)), we could consider 231 | introducing a new, optional field to an aggregatable 232 | [report](https://github.com/patcg-individual-drafts/private-aggregation-api#reports)'s 233 | shared\_info called a _batching ID_. For example: 234 | 235 | ```jsonc 236 | "shared_info": "{\"api\":\"shared-storage\",\"batching_id\":1234,\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}", 237 | ``` 238 | 239 | This ID would be an unsigned 32-bit integer. The aggregation service would 240 | include the batching ID in computation of the '[shared 241 | ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 242 | hash, allowing reports with differing batching IDs to be batched and queried 243 | separately. 244 | 245 | ### Use case examples 246 | 247 | #### Processing contributions by campaign ID 248 | 249 | As discussed [above](#processing-contributions-by-campaign-id-1), an ad tech may 250 | want to process measurements separately for each campaign. In that example, the 251 | filtering ID used is a deterministic function of the context. Instead of setting 252 | a filtering ID, a batching ID could be specified. 253 | 254 | As the batching ID would be readable by the ad tech, it would then be able to 255 | use this batching ID to identify what campaign the report is for and to batch 256 | and query the reports for each campaign separately. 
It would no longer have to 257 | rely on maintaining a context ID to filtering ID mapping, which would provide 258 | improved ergonomics and might reduce the risk of bugs from the context ID to 259 | filtering ID mapping. 260 | 261 | #### Processing contributions at different cadences 262 | 263 | While a reporting site could potentially use a batching ID for processing 264 | contributions at different cadences, it has a few downsides relative to a 265 | filtering ID. As only one batching ID can be set per report, multiple reports 266 | would need to be triggered, e.g. through multiple Shared Storage operations. 267 | Further, as the batching ID [requires deterministic 268 | reports](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context), 269 | this would result in a report being sent for each ID, even if there are no 270 | contributions for that cadence. These additional reports would negate the 271 | benefit of being able to split reports into separate batches at the reporting 272 | endpoint. 273 | 274 | ### Details 275 | 276 | #### Requires deterministic reports and specifying batching ID from a single-site context 277 | 278 | As this option embeds highly specific information about the context that 279 | triggered a particular report (in plaintext), we need to make the number of 280 | reports emitted with the batching ID deterministic. (See the [report 281 | verification explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#deterministic-number-of-reports) 282 | for a similar discussion with respect to context IDs.) This will result in a 283 | null report being sent if no contributions are made. Note that this means the 284 | batching ID for Private Aggregation reports must also be specified outside 285 | Shared Storage worklets/Protected Audience script runners. 
286 | 287 | For Shared Storage and Protected Audience sellers, we propose reusing the 288 | `privateAggregationConfig` implemented/proposed for report verification, adding 289 | a new field, e.g. 290 | 291 | ```js 292 | sharedStorage.run('example-operation', { 293 | privateAggregationConfig: { 294 | contextId: 'example-id', 295 | batchingId: 1234 296 | } 297 | }); 298 | ``` 299 | 300 | We do not currently plan to use a context ID for Protected Audience bidders due 301 | to the potential for a large number of null reports, see 302 | [explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner) 303 | for more discussion. Identical considerations would apply to this batching ID in 304 | the `shared_info`; so, we would not allow a batching ID to be set for bidders. 305 | 306 | #### Backwards compatibility 307 | 308 | If no batching ID is specified, the field will not be present in the 309 | `shared_info`. This should ensure the change is backwards compatible. 310 | 311 | #### One ID per report 312 | 313 | Each report can have at most one batching ID (unlike filtering IDs which are 314 | per-contribution). This aligns with the behavior for context IDs, given they are 315 | both readable by the reporting endpoint. 316 | 317 | #### Use with filtering ID 318 | 319 | Both a batching ID and a filtering ID could be used at the same time. 320 | 321 | ## Limits on number of IDs used 322 | 323 | This proposal increases the number of '[shared 324 | IDs](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 325 | that the Aggregatable Report Accounting service will need to keep track of. So, 326 | we will need to ensure there are limits to this increase to prevent scale 327 | issues. (Note that it is not practical for each report to have its own entry 328 | recorded in the accounting service.) 
329 | 330 | We plan to impose a limit on the number of shared IDs for any particular 331 | aggregation. That is, if too many are used by a query, an error would occur. The 332 | effect of this limit on the number of filtering IDs or batching IDs (or both) 333 | that can be provided will depend on other details of the batching strategy. 334 | 335 | Straw proposal: a limit of 1000 shared IDs per aggregation. 336 | 337 | ## Application to Attribution Reporting API 338 | 339 | The filtering ID approach should be extendable to the Attribution Reporting API 340 | and, in principle, we could allow the label to be set based on either source or 341 | trigger-side information. 342 | 343 | The batching ID approach may not be viable for all Attribution Reporting API 344 | callers as a null report would need to be sent for every unattributed trigger. 345 | This could increase report volume substantially (e.g. 4 to 20 times); however, 346 | some callers may be able to tolerate this increase (see the discussion in [ARA 347 | issue #974](https://github.com/WICG/attribution-reporting-api/issues/974) about 348 | introducing a trigger ID). If making reports deterministic is acceptable for 349 | some callers, we could support setting a batching ID for a trigger with a 350 | similar mechanism to the already proposed trigger ID. 351 | 352 | The details of these approaches will be explored in a separate GitHub issue. 353 | 354 | ## Privacy considerations 355 | 356 | While this change does allow for reprocessing the same report in different 357 | aggregations, each query will only aggregate distinct contributions from that 358 | report. In other words, each contribution is still guaranteed to only be 359 | aggregated once, maintaining our current [privacy protection 360 | model](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting). 
361 | 362 | One other potential concern is that introducing new (plaintext) metadata to the 363 | report might amplify counting attacks (see related discussion for context IDs 364 | [here](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#privacy-considerations)). 365 | However, we ensure that any new metadata (including a batching ID and any 366 | non-default payload size) is paired with making the sending of that report 367 | deterministic. This avoids any risk of the report count leaking information. 368 | -------------------------------------------------------------------------------- /meetings/2023-02-15/minutes.md: -------------------------------------------------------------------------------- 1 | # Private Aggregation W3C Call: Agenda & Notes 2 | 3 | Wednesday 15 February 2023 at 4 pm Eastern Time (= 9 pm UTC = 10 pm Paris = 1 pm California). 4 | 5 | This notes doc will be editable during the meeting — if you can only comment, hit reload. 6 | 7 | Notes will be made available on [GitHub](https://github.com/patcg-individual-drafts/private-aggregation-api) after the meeting. 8 | 9 | Additional meetings will be made on an ad hoc basis. 10 | 11 | ## Agenda 12 | 13 | #### Process reminder: Join PATCG and sign in 14 | 15 | If you want to participate in the call, please make sure you join the PATCG: and add yourself to the attendees list below. 16 | 17 | ### [Suggest agenda items here] 18 | 19 | * Intro to Shared Storage and Private Aggregation 20 | * What’s possible with Reach in the current design 21 | * Importance of cross-device Reach 22 | * Supporting statistical methods/adjustments 23 | * Scope of dynamic flexibility 24 | * Issue [#17](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/17): Extending shared storage API to support advanced reach reporting 25 | * Fledge + PAA: 26 | 27 | Can we expect that Private Aggregation API will be available for Fledge on GA (H3’23)? 
28 | 29 | ### Note takers 30 | 31 | Zach Mastromatto 32 | 33 | ### Notes 34 | 35 | * Intro to Shared Storage and Private Aggregation - What’s possible with Reach in the current design 36 | * Shared Storage provides unrestricted write access across sites 37 | * Companies can use relevant Shared Storage data to aggregate data and return noisy summary reports 38 | * You can use Shared Storage data to create an aggregation key, send that to your server, then batch them together and send to Aggregation Service to decrypt, add noise, and return an aggregated summary report 39 | * Q: How does data delineation look? 40 | * A: Each adtech will have its own Shared Storage context to read and write from 41 | * Reach is one of the use cases that Shared Storage is built to support 42 | * When a unique ad is served to a new user, you can write to Shared Storage that this new user was reached by an ad that was served (frequency = 1) 43 | * Each new user view can then be aggregated and sent in the Private Aggregation flow to aggregate and return a noised summary report from the Aggregation Service 44 | * How does the batching of these reports work? 45 | * When a report is sent for batching, it can’t be reprocessed again 46 | * It is important to write the date/date range when you write to Shared Storage 47 | * Q: Is Shared Storage available in incognito mode? 48 | * A: Unclear at the moment, will need to follow up. You would consider it as a separate bucket though if it is enabled. You can still write to it if it is Incognito, but all of the data would be cleared when the browser is closed. 49 | * Q: If I set up a weekly reach report, then I can't send a daily report? 50 | * A: No, you would only be able to process data in one of the reports 51 | * Q: This is browser counting, and not deduplicated, right? 52 | * A: Yes, and we are interested if there are other reach models that are dependent on an extension of that deduplication 53 | * Q: What use case does this solution map to?
What value do advertisers get from doing this specifically within the browser? (from Meta) 54 | * A: From Google Ads’ perspective, broadly speaking, Reach is pretty established and is an important solution that Google Ads provides in-market. Extending the API just a little bit further could significantly improve its utility. Details are in the GitHub issue that Google Ads wrote. 55 | * A: From Chrome's perspective, is it possible to get per-browser estimates, and are there any adjustments that take place after the fact to make Reach closer to a person-based calculation vs a browser-based calculation? Our question is when that needs to happen. 56 | * A: It is unclear if you move the matching to off-device vs on-device (Meta) 57 | * Q: Would you be able to dedupe users across devices? (Amazon) 58 | * A: Cross-device is an interesting question in this context. This seems to be in the context of deduping the browser to an actual person. The challenge for Chrome is figuring out the right way to do this with user privacy in mind. We want to be confident from a statistical perspective on where the inflection point is for knowing that a reached impression is a new person or not. This is probably worth a deeper dive in a later discussion. 59 | * We need as much data flexibility as we can in a dynamic environment (weekly, monthly, mid-campaign flight, etc.). (Amazon) 60 | * Q: Why are there limitations on processing data multiple times in batches? 61 | * A: To limit the amount of information that is available on users. Also to restrict averaging out the noise if you process the data multiple times. Theoretically you can add more noise in if you reprocess it each time to alleviate this, but it adds more complexity for engineering and for users. 62 | * Raimundo: The important thing on the first question is what is the unique thing that you are counting, more so than where you are doing the counting? 63 | * Wendell: Proposing that Chrome simplifies this solution.
Investing in testing costs money both for the advertiser and the adtech. Suggests that simplification here would make it easier to test and build a business around. Noise being added is tough to manage and it has to behave the same way across long periods of time. 64 | * Thomas: Counting devices is really the most important thing, as that is correlated most to a person. Counting users is supposed to be the secret sauce of an adtech to do this. Does not think it is Chrome’s responsibility to provide a device graph for users. It is key to have an integration between Android and Chrome. 65 | * Q: Is there a strong sense that apps and browsers should talk to each other? Or multiple browsers on the same device? Can you elaborate on what you mean by a unique device? Is it not a question of multiple browsers, but across app and web? 66 | * A: Device as defined by a physical device. Across both app and web, yes. 67 | * Q: iFrames and Fenced Frames are fully supported, so Shared Storage should be available everywhere? 68 | * A: Yes, but Chrome wants to double-check the Fenced Frames piece and get back to you. 69 | * Evgeny: We (Google Ads) are asking for a level of privacy protection from Privacy Sandbox, then enabling engineering on top of that to create a reach metric. Chrome should not provide an identity layer. Removing the block on processing batches multiple times is an example of an area that can improve Reach calculations without sacrificing privacy. 70 | * Raimundo: Two challenges we need to solve for Reach: 1) how do you account for people not reached (flexibility) and 2) across all data sources and devices. We already count unique people and we need to be able to deduplicate across all devices, not just browsers. How do you bring in unique counts from devices that do not have Privacy Sandbox enabled? 71 | * Vishal: Reach measurement is very foundational and core to Brand advertisers around the world.
We are more than happy to bring those advertisers in for feedback if Chrome wishes. This is a core metric used to measure across all of TV and digital. 72 | * Jukka: It might sound like we would need a unique identifier across all platforms and media for a single person. However, when we talk about counting, we’re really talking about estimating. We can deal with a world where browser and app usage on the same device is separate. We use panels for modeling purposes. Not accurate at a per-person level, but can be done in aggregate. Reach is what advertisers want the most and want it automatically without much hassle. They also care about privacy, though. We need to make this solution workable for both. 73 | * Asha: It is probably not possible to make a universal identity, so we need to support adtechs in facilitating the estimation counts after the fact. 74 | * Sanjay: The amount of utility for an advertiser decreases when the fidelity of the signal decreases. This isn’t on Chrome to uniquely solve though for all adtechs across all devices. The industry needs to come to a consensus on the signals that are important for Reach and the privacy guardrails. This is true for all browsers, not just Chrome. 75 | * Evgeny: Allowing count(distinct) would unlock the ability to do modeling and increase flexibility by one or two orders of magnitude over what is possible now. It unlocks a key toolbox of modeling. 76 | * Alex: Just want to flag that we have a lot to talk about and need more stakeholders in the room, per Vishal. Chrome would like more attendance in these meetings and encourages follow-up meetings and sharing. 77 | * Asha: Are adtechs used to having standard dimensions for reporting or does it need to be very flexible across any dimension? 78 | * Evgeny: Flexibility across the board is important, date is important. It is important for Reach to be flexible, which is what we hear from our customers.
Trying to list the important metrics might not lead to success here, but rather enabling the flexibility for adtechs to decide for themselves. 79 | * Michal: We plan to use PAA together with FLEDGE. Some of these reports need to be processed daily vs other types of reporting. And the batching frequency needs to be variable. From our perspective, it would be good to have the ability to reprocess the data or to batch data in a way that is not just based on a specific date range. 80 | * Bo: Strongly agree that dimensions should be fully flexible, instead of picking the specific dimensions to be flexible on. Timezone issues are a concern too. In order for the ecosystem to test the data sent back, we need to gain some confidence in the underlying data. Without that we can’t be confident in using it. 81 | * Jukka: Exact timestamp is not as important to Comscore. Exact time is less important than other dimensions. 82 | * Importance of cross-device Reach 83 | * Supporting statistical methods/adjustments 84 | * Scope of dynamic flexibility 85 | * Issue [#17](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/17): Extending shared storage API to support advanced reach reporting 86 | * Fledge + PAA: 87 | 88 | Can we expect that Private Aggregation API will be available for Fledge on GA (H3’23)? 89 | 90 | ### To join the speaker queue 91 | 92 | Please use the "Raise My Hand" feature in Google Meet 93 | 94 | ### Attendees: please sign yourself in 95 | 96 | 1. Wendell Baker (Yahoo) 97 | 2. David Dabbs (Epsilon) 98 | 3. Alex Turner (Google Chrome) 99 | 4. Anish Ahmed (Google, Privacy Sandbox) 100 | 5. Asha Menon (Google Chrome) 101 | 6. Vishal Radhakrishnan (Google Ads - Cross-Media Measurement ) 102 | 7. Sid Sahoo (Google Chrome) 103 | 8. Robert Kubis (Google Chrome) 104 | 9. Renan Feldman (Google Chrome) 105 | 10. Maybelline Boon (Google Chrome) 106 | 11. Christina Ilvento (Google Chrome) 107 | 12. Bo Xiong (Amazon Ads) 108 | 13. 
Hannah Chang (Google) 109 | 14. Thomas Prieur (Criteo) 110 | 15. Craig Wright (Google Ads) 111 | 16. (RTB House) 112 | 17. Jonathan Frederic (Google Ads) 113 | 18. Chris McAndrew (Google Ads - Measurement) 114 | 19. Evgeny Skvortsov(Google Ads) 115 | 20. Michael Perry (Amazon) 116 | 21. Michael Kleber (Google Chrome) 117 | 22. Nan Li (Amazon) 118 | 23. Zach Mastromatto (Google Chrome) 119 | 24. Jonas Schulte (Amazon Ads) 120 | 25. Gaurav Bajaj (Amazon Ads) 121 | 26. Anatolii Bed (Amazon Ads) 122 | 27. Sanjay Saravanan (Meta) 123 | 28. Jukka Ranta (Comscore) 124 | 29. Ruchi Lohani (Google, Privacy Sandbox) 125 | 30. Martin Pal (Google Privacy Sandbox) 126 | 31. Kuang Yi Chen (Amazon) 127 | 32. Konrad Krakowiak (Google Ads) 128 | 129 | **Cursor Parking** 130 | 131 | **🚗🚗🚗🚗🚗🚗** 132 | -------------------------------------------------------------------------------- /meetings/2023-02-15/shared-storage-overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/meetings/2023-02-15/shared-storage-overview.pdf -------------------------------------------------------------------------------- /named_budgets.md: -------------------------------------------------------------------------------- 1 | # Named budgets 2 | 3 | Sharing budget between different use cases or different products can pose a 4 | challenge. If one of the use cases/products uses the [contribution 5 | budget](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting) 6 | more quickly, it may exhaust the budget, preventing the others from being able 7 | to report. (For example, see issue 8 | [#145](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/145).) 9 | 10 | We propose a generic mechanism to better support this. 
Each reporting site would 11 | be able to define a series of _named budgets_ and would allocate a fraction of 12 | their total budget to each, e.g.: 13 | 14 | ```js 15 | privateAggregation.reserveBudget("example-budget", 0.5); 16 | privateAggregation.reserveBudget("debug", 0.125); 17 | ``` 18 | 19 | Then, when making histogram contributions (including via 20 | `contributeToHistogramOnEvent()`), a named budget can optionally be specified, 21 | e.g.: 22 | 23 | ```js 24 | privateAggregation.contributeToHistogram({ 25 | bucket: 123n, 26 | value: 45, 27 | filteringId: 6n, 28 | namedBudget: "example-budget" 29 | }); 30 | ``` 31 | 32 | When approving or rejecting contributions, the browser would then check both 33 | whether the chosen named budget has enough allocated budget left and whether 34 | there is enough overall budget left. This ensures the overall budget is always 35 | respected, even if the named budget allocations change. 36 | 37 | ### Fractional budget 38 | 39 | We propose using a fraction to represent the budget allocation, i.e. not a 40 | direct limit on the contributions’ values sum. Given the two separate time 41 | windows used in Private Aggregation budgeting (per-10 min and per-day), using a 42 | fraction more clearly allocates a portion of both limits simultaneously. 43 | 44 | ### Reservation persistence 45 | 46 | To avoid the complexity of persisting configurations over time, we propose that 47 | the `reserveBudget()` calls are scoped to just the single worklet/script runner 48 | context. That is, the budget allocations would need to be set up at the start of 49 | each Shared Storage operation or Protected Audience function call. Note that the 50 | browser would still keep track of the budget usage for each group over the last 51 | 10 min and last 24 hours, but the per-named budget _limits_ themselves would not 52 | be persisted. 
53 | 54 | See [Future iteration: global config](#future-iteration-global-config) below for 55 | some discussion of a more persistent choice. 56 | 57 | ### Default (null) budget 58 | 59 | If no named budget is explicitly specified for a contribution, it would default 60 | to using a special “null” budget. This “null” budget is implicitly allocated the 61 | remaining budget after all the explicit reservations are accounted for. In the 62 | example above, the “null” budget would be allocated the remaining 37.5% of the 63 | budget. Note that this behavior means that this feature is fully backwards 64 | compatible. 65 | 66 | ### Allocating more than the entire budget 67 | 68 | If more than 100% of the budget is allocated to different named budgets, a 69 | JavaScript error will be thrown. 70 | 71 | ### Privacy considerations 72 | 73 | This feature does not introduce any new privacy concerns, as the overall existing 74 | budget is still always enforced. Note that contribution budgets are enforced per 75 | reporting site and so one reporting origin on a site could exhaust the budget 76 | for its entire reporting site. However, this is intended and is unchanged by the 77 | new feature. 78 | 79 | ### Future iteration: global config 80 | 81 | As a future extension, we could consider allowing these budget reservations to 82 | be set up globally for a reporting origin (or site). A global config has already 83 | been proposed for other use cases (see issues 84 | [#81](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81#issuecomment-2091524214) 85 | and 86 | [#143](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/143)), 87 | so we could potentially use a single config for these use cases. This would 88 | reduce verbosity, as reservations that don’t need to change frequently would not 89 | need to be repeated. We may need to consider a mechanism that allows for canceling or 90 | overriding these globally configured details.
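The two-level approval check described earlier, where both the chosen named budget's allocation and the overall budget must have room, can be sketched as follows. The `TOTAL_BUDGET` value and the tracker structure are illustrative assumptions, not the actual browser implementation (which also tracks separate 10-minute and per-day windows).

```js
// Hedged sketch of the two-level budget check. TOTAL_BUDGET is a placeholder
// value, not the API's actual budget constant.
const TOTAL_BUDGET = 65536;

function makeBudgetTracker(reservations) {
  const allocated = Object.values(reservations).reduce((a, b) => a + b, 0);
  if (allocated > 1) throw new Error('More than 100% of the budget reserved');
  const usedByName = new Map();
  let usedOverall = 0;
  return {
    approve(value, namedBudget = null) {
      // The implicit "null" budget gets whatever fraction is unreserved.
      const fraction =
        namedBudget === null ? 1 - allocated : reservations[namedBudget] ?? 0;
      const used = usedByName.get(namedBudget) ?? 0;
      if (used + value > fraction * TOTAL_BUDGET) return false; // named limit
      if (usedOverall + value > TOTAL_BUDGET) return false; // overall limit
      usedByName.set(namedBudget, used + value);
      usedOverall += value;
      return true;
    },
  };
}

// Mirrors the example reservations above: 50% and 12.5%.
const tracker = makeBudgetTracker({ 'example-budget': 0.5, debug: 0.125 });
console.log(tracker.approve(45, 'example-budget')); // true
console.log(tracker.approve(40000, 'example-budget')); // false (exceeds 50%)
```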
91 | -------------------------------------------------------------------------------- /reach_whitepaper.md: -------------------------------------------------------------------------------- 1 | # Reach Implementation Best Practices in the Privacy Sandbox Shared Storage + Private Aggregation APIs 2 | Authors: Hidayet Aksu (aksu@google.com), Alexander Knop (alexanderknop@google.com), Pasin Manurangsi (pasin@google.com) 3 | 4 | This whitepaper aims to provide ad tech companies with actionable guidance on 5 | implementing reach measurement within the Privacy Sandbox, via the 6 | [Shared Storage](https://developers.google.com/privacy-sandbox/relevance/shared-storage) 7 | and 8 | [Private Aggregation](https://developers.google.com/privacy-sandbox/relevance/private-aggregation) APIs. 9 | We dive into various reach scenarios, covering hierarchical queries and cumulative 10 | reach, and present corresponding implementation strategies catering to diverse 11 | ad tech requirements. 12 | 13 | In various use cases, we recognize the potential value of exploratory 14 | reach queries. 15 | Taking into account the various needs of advertising technology companies, 16 | we consider both scenarios: pre-determined reach queries (queries known in 17 | advance) and ad hoc or unknown reach queries. 18 | Hence, this work presents an exploration of alternative approaches, 19 | including direct measurement and 20 | [sketch-based mechanisms](#sketch-based-methods). 21 | Through empirical evaluation on synthetically generated data, modeled after 22 | real-world ad datasets, we assess the impact of factors like ad event capping, 23 | query granularity, and time windows on measurement accuracy. We aim to equip 24 | ad tech companies with the insights needed to select the best reach measurement 25 | method aligned with their specific query characteristics and privacy 26 | considerations.
Our experiments show that for predetermined queries, the best solution is to use
the point contribution mechanism. However, for more exploratory queries or those
with longer windows, sketch-based mechanisms are beneficial, although techniques
such as capping may need to be used to reduce error.

By shedding light on reach estimation using the Privacy Sandbox, this paper
seeks to empower ad tech companies to navigate the evolving landscape of
privacy-centric advertising technologies.


## Reach Functionality

The goal of reach is to compute the number of unique users that see (or interact
with) content (e.g., a specific ad).
For the purpose of this white paper, we will treat each browser profile as a
unique user, so that the result is actually the number of unique profiles that
see the content.
We use this definition for simplicity in our empirical evaluations.
Accordingly, we will use “user” and “browser profile” interchangeably.

In general, an ad tech would like to perform multiple reach measurements,
depending on how they wish to “slice” the set of users.
In the context of digital advertising, a slice is a distinct segment of a
campaign's target audience, characterized by specific attributes or behaviors.
This segmentation allows for granular analysis and targeted optimization of
advertising efforts. Slices can be constructed from a variety of dimensions,
such as campaign, location, and demographics. For example, one may want to
perform a reach measurement for a single campaign during a single day, or for a
single campaign during a single day in a specific geographic region.

Within a specified time window, reach between slices is not necessarily
additive.
61 | In other words, the count of all users for `campaign=123` is equal to the sum of 62 | user counts of `campaign=123` for each geolocation, only if no user appears in 63 | two different geo areas, which is not always the case. 64 | For features such as a creative ID, a user may view advertisements with multiple 65 | creative IDs, thus additivity does not apply in this scenario either. 66 | Consequently, when queries are pre-determined, it is important to explicitly 67 | define each inquiry and collect relevant contributions. 68 | Conversely, when queries are unknown, contributions must be collected to support 69 | the finest granularity of queries. 70 | 71 | Throughout this paper, we consider hierarchical reach 72 | measurement queries, based on different slices. 73 | For illustration and experimentation purposes, we assume the following four 74 | slicing levels: 75 | 76 | 77 | * Level 1 (coarsest): advertiser 78 | * Level 2: advertiser x campaign 79 | * Level 3: advertiser x campaign x geo 80 | * Level 4 (finest): advertiser x campaign x geo x ad strategy x creative 81 | 82 | To analyze the temporal aspects of reach, let's consider the following 83 | definitions for a running campaign: 84 | 85 | 86 | 87 | * Cumulative Reach: This quantifies the number of unique users exposed to 88 | impressions within a campaign since its inception. In this scenario, we 89 | incrementally measure reach over time. 90 | * Example queries: Reach until January 24th, Reach until January 25th, 91 | Reach until January 26th, etc. 92 | 93 | _Note that we are not considering rolling window queries here for brevity; 94 | however, the approaches we consider for cumulative reach could be 95 | generalized to this use case._ 96 | 97 | * Fixed Window Reach: This calculates the number of unique users exposed 98 | to impressions within a specific time range. The time range, or window, is 99 | determined at the time of data collection and remains constant during 100 | subsequent analysis. 
    * Example queries: Reach between January 22nd and January 24th, Reach
      between January 23rd and January 25th, etc.

The following query examples were selected for pre-determined queries due to
their relevance to real-world scenarios and frequent usage within ad tech
analytics.

| ID | Query                                         | Query Frequency     | Type         |
|----|-----------------------------------------------|---------------------|--------------|
| Q1 | What is the reach for the last day?           | Queried every day   | Fixed Window |
| Q2 | What is the reach for the last calendar week? | Queried every week  | Fixed Window |
| Q3 | What is the reach for the last month?         | Queried every month | Fixed Window |
| QC | What is the cumulative reach till now?        | Queried every day   | Cumulative   |
162 | 163 | 164 | 165 | ## Reach Implementations 166 | 167 | First, we show that reach can be measured directly using the existing Shared 168 | Storage and Private Aggregation APIs – we call this approach a direct method. 169 | Next, we explain how sketch-based methods can be used with the same APIs. 170 | 171 | 172 | ### Single Reach Measurement in Shared Storage + Private Aggregation APIs 173 | 174 | We start by recalling how to compute a single reach measurement via Shared 175 | Storage, as described in the 176 | [developer documentation](https://developers.google.com/privacy-sandbox/relevance/shared-storage/unique-reach). 177 | The idea is to use Shared Storage to record whether this browser has already 178 | contributed to the reach measurement for the slice. 179 | If it has, then we do nothing. 180 | Otherwise, we contribute to the histogram (with a single bucket) and record that 181 | a contribution has been made in Shared Storage. 182 | In doing so, the computed reach is simply equal to the aggregated value. 183 | For convenience, the following code snippet is adapted from the above linked 184 | documentation. 185 | 186 | 187 | ```javascript 188 | const L1_BUDGET = 65536; 189 | 190 | // contentId here is some id of the interaction that is measured; 191 | // i.e., it is a campaign id. 192 | function convertContentIdToBucket(contentId) { 193 | return BigInt(contentId); 194 | } 195 | 196 | class ReachMeasurementOperation { 197 | async run(data) { 198 | const { contentId } = data; 199 | 200 | // Read from Shared Storage 201 | const key = 'has-reported-content'; 202 | const hasReportedContent = (await sharedStorage.get(key)) === 'true'; 203 | 204 | // Do not report if a report has been sent already 205 | if (hasReportedContent) { 206 | // Note that this would happen if less than 30 days 207 | // passed since the first visit. 
      return;
    }

    // Generate the aggregation key and the aggregatable value
    const bucket = convertContentIdToBucket(contentId);
    const value = 1 * L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


Note that the reach result can be computed by taking the aggregated histogram
value of each key and dividing by the contribution budget (`L1_BUDGET`).
Note that the developer documentation uses a scale factor (`SCALE_FACTOR`), but
in this example the scale factor is the same as the contribution budget, so we
use `L1_BUDGET`.
See the
[“Noise and scaling” section](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#noise_and_scaling)
of the Private Aggregation fundamentals article to learn more.


### Single Reach Measurement with Capping

An implicit assumption in the previous section is that each
[privacy unit](https://github.com/patcg-individual-drafts/private-aggregation-api#implementation-plan)
– i.e., <browser x ad tech reporting origin x 10 minutes> –
contributes to only a single query.
The reason is that (i) the Shared Storage check does not incorporate the bucket,
so once the browser has reported for any content, later content will not pass
the check, and (ii) the contribution value is 65,536, which is the same as the
overall contribution budget (see
[contribution budget for summary reports](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#contribution_budget)).
In other words, we “cap” the number of contributions each privacy unit can make
to just one.
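To make the scaling concrete, the following Python sketch shows how a
summary-report value could be mapped back to a reach estimate on the server
side. `estimate_reach` is a hypothetical helper, not part of the API; with a
cap of one it reduces to dividing by the contribution budget as described
above, and the cap parameter anticipates the capped variants discussed next.

```python
L1_BUDGET = 65536

def estimate_reach(aggregated_value: float, contribution_cap: int = 1) -> float:
    """Recover a reach estimate from an aggregated histogram value.

    Each reached browser contributes L1_BUDGET / contribution_cap to its
    bucket, so the (noisy) aggregate is scaled back down accordingly.
    """
    per_user_value = L1_BUDGET // contribution_cap
    return aggregated_value / per_user_value

# With a cap of one, 1,000 reached browsers contribute 65,536 each, and the
# noiseless aggregate of 65,536,000 maps back to a reach of 1000.
print(estimate_reach(1000 * L1_BUDGET))
```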
In many cases, multiple contributions from a single privacy unit may be dropped
due to this capping limit of one.
This can lead to poor utility, as each privacy unit’s contributions are not
fully reflected in the results.

To improve utility, we can increase the contribution cap to an arbitrary number
(denoted by `CONTRIBUTION_CAP` below).
Allowing a larger cap can capture more contributions, which is good for utility.
However, because each contribution’s value is scaled down to
`L1_BUDGET / CONTRIBUTION_CAP`, the fixed noise becomes relatively larger once
the results are scaled back up, which is worse for utility.
As such, the capping parameter has to be tuned carefully.

We give example code below where the capping parameter is set to five.
The two changes are (i) making the Shared Storage check specific to the bucket
and (ii) dividing each contribution value by the capping parameter.


```javascript
const CONTRIBUTION_CAP = 5;
const L1_BUDGET = 65536;

function convertContentIdToBucket(contentId) {
  return BigInt(contentId);
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Generate the aggregation key and the aggregatable value
    const bucket = convertContentIdToBucket(contentId);
    const value = Math.floor(L1_BUDGET / CONTRIBUTION_CAP);

    // Read from Shared Storage
    const key = `has-reported-content-${bucket}`;
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already for this bucket
    if (hasReportedContent) {
      return;
    }

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await
sharedStorage.set(key, true);
  }
}
```


### Multiple Reach Measurements with Capping

While the above method focuses on a scenario where each content can contribute
to only a single reach measurement (i.e., a single “query”), it can be extended
to the case where each content can contribute to multiple reach measurements by
having a key for each query.

Below, we give an example of this for the hierarchical queries described above.
Note that each reached browser (i.e., the first time the advertisement is shown
to the user) contributes to exactly one slice in each
[level](#level).
Thus, the number of histogram keys that each reached content contributes to is
equal to the number of levels.
Due to the
[contribution budget](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#contribution_budget)
constraint, the scale factor is now divided by the number of levels in addition
to being divided by the contribution cap (`CONTRIBUTION_CAP`).
While we adhere to an equal contribution per level, it is possible to further
optimize the contribution budget across different levels, as explained in the
[relevant paper](https://arxiv.org/pdf/2308.13510).

The code for this is below.


```javascript
const CONTRIBUTION_CAP = 5;
const NUM_LEVELS = 4;
const L1_BUDGET = 65536;

function convertContentIdAndLevelToBucket(contentId, level, otherDimensions) {
  // Implement a function that computes the bucket key for a given level using
  // level-related fields from otherDimensions.
  // E.g., in our example, the second level could set the key to be the
  // concatenation of the advertiser and campaign id of this content.
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId, ...otherDimensions } = data;

    for (let level = 1; level <= NUM_LEVELS; level++) {
      // Generate the aggregation key and the aggregatable value
      const bucket = convertContentIdAndLevelToBucket(contentId, level, otherDimensions);
      const value = Math.floor(L1_BUDGET / (CONTRIBUTION_CAP * NUM_LEVELS));

      // Read from Shared Storage
      const key = `has-reported-content-${bucket}`;
      const hasReportedContent = (await sharedStorage.get(key)) === 'true';

      // Skip this level if a report has been sent already for this bucket,
      // but still consider the remaining levels.
      if (hasReportedContent) {
        continue;
      }

      // Send an aggregatable report via the Private Aggregation API
      privateAggregation.contributeToHistogram({ bucket, value });

      // Set the report submission status flag
      await sharedStorage.set(key, true);
    }
  }
}

// Register the operation
register('reach-measurement', ReachMeasurementOperation);
```


A sample call would look like:


```javascript
await window.sharedStorage.run('reach-measurement', { data: {
  contentId: '123',
  geo: 'san jose',
  creativeId: '55'
}});
```


### Cumulative Reach Measurements

So far, we have not mentioned the time aspect of the queries.
In particular, we have assumed that all queries correspond to the same time
period.
However, we may want to measure time-related queries.
A popular family of such queries is cumulative queries, which are queries of
the form “what is the total reach up until timestep `t`?” where
`t` varies over `1, …, T`.
For example, one might want to measure the reach up until the current day, every
day for an entire month. In this case, we would have `T = 30` and a “timestep”
of one day.
392 | 393 | _Note: Both methods described in this section are limited by the 394 | [retention policy](https://github.com/WICG/shared-storage?#data-retention-policy) of Shared Storage which is 30 days from last write._ 395 | 396 | 397 | #### Direct Method 398 | 399 | For cumulative reach queries, we can naively apply the direct method above by 400 | adding the timestamp to the key of the histogram. 401 | If a user is reached at time step $`i`$, then contributions would be made to all 402 | of $`i, \dots, T`$ buckets. 403 | Since there are $`T`$ queries, the noise will scale by a factor of 404 | $`O(T / \epsilon)`$. 405 | Note that with the current API, one would need to wait until the last window to 406 | obtain the results since reprocessing of the reports is not currently supported. 407 | 408 | 409 | #### Point Contribution 410 | 411 | We can improve upon the above direct method by a simple modification. 412 | For simplicity, we only discuss the mechanism (and give the code) for 413 | the single-query cap-one case. 414 | It is easy to extend this to the multi-query case by appending the query id to 415 | the keys. 416 | Before we describe the mechanism, let us first make the following observation. 417 | Consider the number $`n^{(\text{new})}_i`$ of users that were reached for the 418 | first time at timestep $`i`$. 419 | The answer of cumulative reach up to timestep $`t`$ is exactly equal to the 420 | cumulative sum $`\sum\limits_{i=1}^t n^{(\text{new})}_i`$. 421 | Using this fact, we can simply create $`T`$ buckets, where bucket $`i`$ corresponds 422 | to $`n^{(\text{new})}_i`$. 423 | When we wish to compute the cumulative reach up to time step $`i`$, we take the 424 | sum of the first $`i`$ buckets. Since each user contributes to only a single 425 | bucket, each bucket’s noise standard deviation is only $`O(1 / \epsilon)`$. 
Since we are summing up to $`T`$ such (independent) noises, the total noise
standard deviation is $`O(\sqrt{T} / \epsilon)`$, which is an improvement of
$`O(\sqrt{T})`$ over the direct method.

An implementation in Shared Storage is given below.


```javascript
// Time horizon T
const TIME_HORIZON = 30;
const L1_BUDGET = 65536;

function convertContentIdToBucket(contentId, currentTime) {
  // This key corresponds to n_i as described above with i = currentTime.
  // Note that the remainder of "bucket" when divided by TIME_HORIZON
  // is currentTime and the quotient is contentId; hence, this formula gives a
  // unique id to each pair of contentId and currentTime.
  return BigInt(TIME_HORIZON) * BigInt(contentId) + BigInt(currentTime);
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Read from Shared Storage
    const key = 'has-reported-content';
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already
    if (hasReportedContent) {
      return;
    }

    // The current time step, in 0, ..., TIME_HORIZON - 1. It should be
    // appropriately set by the ad tech (e.g., based on hour or day).
    const currentTime = 0;

    // Increment the current bucket
    const bucket = convertContentIdToBucket(contentId, currentTime);
    const value = L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


The code for computing cumulative reach from the aggregated histogram is given
below.
```python
# Time horizon T; needs to be fixed beforehand.
TIME_HORIZON = 30
L1_BUDGET = 65536.0


def convert_content_id_to_bucket(content_id: int, current_time: int) -> int:
    # Same as the client-side function above; current_time is in
    # 0, ..., TIME_HORIZON - 1.
    return TIME_HORIZON * content_id + current_time


def compute_cumulative_reach(
    aggregated_histogram: dict[int, int],
    content_id: int) -> dict[int, float]:
    """Computes cumulative reach for the given content_id.

    Args:
      aggregated_histogram: a mapping from key to aggregated value in the
        summary report output by the Aggregation Service.
      content_id: the content id for which the cumulative reach is computed.

    Returns:
      a mapping from time step to the cumulative reach up until that time step.
    """
    res = {}
    for t in range(TIME_HORIZON):
        res[t] = 0
        for i in range(t + 1):
            key = convert_content_id_to_bucket(content_id, i)
            res[t] += aggregated_histogram.get(key, 0) / L1_BUDGET
    return res
```


Finally, we remark that, for larger values of $`T`$ (which may be the case if
measuring cumulative reach at granularities finer than one day), there are more
advanced mechanisms that can reduce the error further
[[1](#private-decayed-sums),[2](#private-counting)],
but these are beyond the scope of this paper.


### Sketch-Based Methods

Sketching is a technique for estimating reach (and other quantities of
interest) using a compressed aggregate data structure that admits a merge
operation.
Here we only consider one of the most basic families of sketches, called
[counting Bloom filters (CBFs)](https://en.wikipedia.org/wiki/Counting_Bloom_filter).
In this case, we have a hash function $`h`$ that maps any user ID to one of $`m`$
buckets.
The CBF sketch is simply the histogram on these $`m`$ buckets among all the
reached users.
We can estimate the reach based on the number of non-empty buckets in the
sketches.
The exact formula for this estimation depends on the hash function
distribution.

The main advantage of this sketch is that it can be merged: if we have two CBFs,
summing them up gives us the CBF corresponding to the union of the two
(multi)sets.
This allows us to be more flexible when answering queries.
Recall that, in the direct method, we need to know all the queries beforehand,
since we need to allocate a bucket for each query.
This is not needed when using sketches: we can simply create a sketch for each
fine-grained unit (e.g., the reach for each day) and, when we make a query for a
day range, take the union of all the sketches in that range.
This means that the analyst can supply the queries in an on-demand fashion,
which is more convenient. Moreover, sketching offers another significant
advantage: the ability to deduplicate user IDs across multiple devices,
providing a more accurate picture of individual engagement.
This would typically require using the same user ID on multiple devices.
Shared Storage is partitioned per browser, so the ad tech would need to identify
the user across devices independently. Alternatively, there are modeling-based
approaches (e.g.,
[virtual people](https://research.google/pubs/virtual-people-actionable-reach-modeling/))
that would avoid the need for these IDs to map directly to real-world users.

Nevertheless, this sketch also has several disadvantages.
Firstly, each sketch is a histogram with $`m`$ buckets.
In contrast, the direct method only outputs a single number per query.
Therefore, the sketching method can incur a memory overhead during processing of
the aggregatable reports and during post-processing of the summary reports,
compared to the direct or point contribution methods.
Secondly, since noise is added to every entry of the sketch, this can lead to
inaccurate estimates if the noise is large.
Specifically, as we will see below, we need to set the capping parameter to be
small to ensure the noise is small, but doing so results in a large error due to
capping.
This tension means that we have to be very careful with parameter settings to
ensure (reasonably) accurate estimates.

The overall scheme of sketch-based reach estimation is described in the
following diagram.

!["A flow diagram illustrating a sketch based ad reach estimation. On the client side, aggregatable reports are generated. These reports are then sent to a trusted execution environment on the server side, where they are processed by the Aggregation Service to produce sketches for each day. These sketches are then merged and fed into an estimator to calculate the final ad reach."](reach_whitepaper_figs/diagram.png "Computation diagram for sketch-based reach estimation")


We remark that there are many other types of sketches beyond counting Bloom
filters (CBFs), such as
[(Boolean) Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter);
even CBFs could use hashes with different distributions of values.
In this paper, we focus on uniform CBFs, since they are histograms and thus most
compatible with the API, and changing the distribution did not change the
experimental errors on our dataset.
However, it is plausible that other types of sketches can allow for better
utility.
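To make the merge property concrete, here is a toy, self-contained Python
sketch of a uniform CBF. The sketch size and hash function are placeholders
chosen for illustration, not values an ad tech would use in production.

```python
import hashlib

SKETCH_SIZE = 16  # m; kept tiny for illustration

def bucket_of(user_id: str) -> int:
    # Placeholder hash mapping a user id to one of m buckets.
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % SKETCH_SIZE

def build_cbf(user_ids) -> list:
    sketch = [0] * SKETCH_SIZE
    for uid in set(user_ids):  # each reached user is counted once
        sketch[bucket_of(uid)] += 1
    return sketch

def merge(a: list, b: list) -> list:
    # Summing two CBFs yields the CBF of the union of the two (multi)sets.
    return [x + y for x, y in zip(a, b)]

day1 = build_cbf(["u1", "u2", "u3"])
day2 = build_cbf(["u2", "u4"])
merged = merge(day1, day2)
# The merged sketch holds 5 contributions even though only 4 distinct users
# were reached ("u2" appears on both days), which is why the estimator only
# looks at non-empty buckets rather than the total count.
assert sum(merged) == 5
```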
As the sketch is simply a histogram, it is simple to implement via the Shared
Storage and Private Aggregation APIs.
Again, for simplicity, we only include the code snippet for the cap-one case
below. (Note that if there are many different content ids, one might consider
splitting them into groups using
[filtering IDs](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md).)


```javascript
// Sketch size m
const SKETCH_SIZE = 10000;
const L1_BUDGET = 65536;

function sketchHash(userId) {
  // Should output a number in 0, ..., SKETCH_SIZE - 1 based on which bucket of
  // the sketch the user id is hashed to.
}

function convertContentIdToBucket(contentId, userId) {
  // Note that the remainder of the bucket when divided by SKETCH_SIZE is
  // sketchHash(userId) and the quotient is contentId; hence, this formula
  // gives a unique id to each pair of contentId and sketchHash(userId).
  return BigInt(SKETCH_SIZE) * BigInt(contentId) + BigInt(sketchHash(userId));
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Read from Shared Storage
    const key = 'has-reported-content';
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already
    if (hasReportedContent) {
      return;
    }

    // Generate the aggregation key and the aggregatable value.
    // getUserId() is a placeholder for however the ad tech obtains the user
    // id that the sketch hash is applied to.
    const userId = getUserId();
    const bucket = convertContentIdToBucket(contentId, userId);
    const value = 1 * L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


After aggregation, each content id will have a sketch associated with it.
To estimate the reach given the sketches, the following algorithm can be used.


```python
from typing import Callable

import numpy as np


def _invert_monotonic(
    f: Callable[[float], float], lower: float = 0, tolerance: float = 0.001
) -> Callable[[float], float]:
    """Inverts the monotonic function f."""
    f0 = f(lower)

    def inversion(y: float) -> float:
        """Inverted f."""
        assert f0 <= y, (
            "Positive domain inversion error. "
            f"f({lower}) = {f0}, but {y} was requested."
        )
        left = lower
        probe = 1
        while f(probe) < y:
            left = probe
            probe *= 2
        right = probe
        mid = (right + left) / 2
        while right - left > tolerance:
            f_mid = f(mid)
            if f_mid > y:
                right = mid
            else:
                left = mid
            mid = (right + left) / 2
        return mid

    return inversion


def _get_expected_non_zero_buckets(
    bucket_probabilities: list[float],
) -> Callable[[int], float]:
    """Creates a function computing the expected number of non-zero buckets.

    Args:
      bucket_probabilities: A list of probabilities of each bucket being
        selected when a user is added to a sketch.

    Returns:
      A function that, given the number of users, returns the expected number
      of non-empty buckets.
    """
    def internal(count: int) -> float:
        return sum(1 - pow(1 - p, count) for p in bucket_probabilities)
    return internal


def estimate_reach(
    sketches: list[list[int]],
    bucket_probabilities: list[float],
) -> float:
    """Estimates reach given the sketches.

    Args:
      sketches: the list of sketches, where each element of a sketch is an
        element of the histogram.
      bucket_probabilities: probability of a given bucket being chosen for a
        user; for uniform CBFs all the values are the same, while for an
        exponential CBF the ith element has value `alpha ** i` for some fixed
        value alpha.
    """
    sketches = np.array(sketches)

    # Threshold the (noisy) sketches so each bucket is 0 or 1, merge them, and
    # count the non-empty buckets of the merged sketch.
    thresholded_sketches = (sketches >= 0.5).astype(int)
    merged_sketch = thresholded_sketches.sum(axis=0)
    thresholded_merged_sketch = (merged_sketch >= 0.5).astype(int)
    non_zero_buckets_count = thresholded_merged_sketch.sum()

    estimator = _invert_monotonic(
        _get_expected_non_zero_buckets(bucket_probabilities),
        lower=0,
        tolerance=1e-7,
    )

    return estimator(non_zero_buckets_count)
```


## Empirical Evaluation


### Data

Due to privacy concerns, we evaluate errors on synthetic data in this paper.
Working with real ad datasets, we found that reach can be modeled in the
following way. Reach for one day follows a
[power-law distribution](https://arxiv.org/pdf/2311.13586) with shape parameter
$`b`$.
When analyzing different datasets based on different queries or various slice
definitions, such as hierarchical queries, we noticed a consistent pattern.
Specifically, these datasets exhibit a power-law distribution, albeit with
differing shape parameters.
We also observed that capping on the client side does not affect the shape of
the power-law distribution, indicating that campaign impressions are dropped by
capping with equal probability.
In contrast, capping does have a significant impact on the volume of reports
from clients.
The rate of reports dropped by capping, i.e., the capping discount rate, follows
a power-law distribution with respect to the capping value.
Figure 2 displays a representative discount rate versus capping plot and the
corresponding curve fit.

Furthermore, we examined the impact of the time window on reach. Reach does
not demonstrate a linear relationship with time window size.
758 | Instead, it exhibits a diminishing return compared to linear growth as 759 | illustrated in Figure 3. 760 | This discount follows a power-law function relative to the size of the time 761 | window. 762 | 763 | ![Reach estimates receive a discount as a function of cap value.](reach_whitepaper_figs/cap_discount.png) 764 | 765 | ![Reach does not increase linearly with time windows size.](reach_whitepaper_figs/window_discount.png) 766 | 767 | 768 | ```python 769 | def sample_slices( 770 | number_of_days, 771 | cap, 772 | dataset_power_law_shape, 773 | n_samples 774 | ) -> list[SliceSize]: 775 | """Generates a synthetic dataset in slices (tuples of capped and uncapped numbers). 776 | 777 | 778 | Args: 779 | number_of_days: what time window the query is using. 780 | cap: the maximal number of slices a user might contribute. 781 | dataset_power_law_shape: dataset specific power-law shape param. 782 | n_samples: number of slices to generate. 783 | """ 784 | 785 | 786 | # sample reach distribution from power-law distribution. 787 | reach_1_day = sample_discrete_power_law( 788 | b=dataset_power_law_shape, 789 | n_samples=n_samples, 790 | ...) 791 | # approximate impact of capping. 792 | # compute probability of being reported (success, not impacted by capped) 793 | # after capping. 794 | p_reported = (1 - cap_discount_rate(cap=cap)) 795 | # for a slice of true size n, each contribution has p_reported probability to 796 | # be reported. 
Thus after capping, observed contributions become 797 | # np.random.binomial(n, p) (there may be faster approximations) 798 | reach_1_day_capped = np.random.binomial(reach_1_day, p_reported) 799 | 800 | 801 | # approximate impact of time window 802 | reach_n_days = reach_1_day * number_of_days * window_discount_rate(number_of_days) 803 | reach_n_days_capped = reach_1_day_capped * number_of_days * window_discount_rate(number_of_days) 804 | 805 | 806 | # round to nearest int 807 | reach_n_days = np.round(reach_n_days).astype(int) 808 | reach_n_days_capped = np.round(reach_n_days_capped).astype(int) 809 | 810 | 811 | return [ 812 | (uncapped, capped) 813 | for uncapped, capped in zip(reach_n_days, reach_n_days_capped) 814 | ] 815 | ``` 816 | 817 | 818 | Listing 1: pseudocode to generate synthetic data. 819 | 820 | 821 | 822 | 823 | 826 | 829 | 833 | 834 | 835 | 837 | 839 | 841 | 842 | 843 | 845 | 847 | 849 | 850 | 851 | 853 | 855 | 857 | 858 | 859 | 861 | 863 | 865 | 866 |
| Level (granularity) | Query dimensions | Power law shape (b) param (# reach per bucket) |
| --- | --- | --- |
| 1 (coarsest) | advertiser | 1.01 |
| 2 | advertiser x campaign | 1.10 |
| 3 | advertiser x campaign x geo | 1.50 |
| 4 (finest) | advertiser x campaign x geo x ad strategy x creative | 1.60 |
Table 2: shape params used in the experiments for queries of different granularities.

Following the pseudocode presented in Listing 1, we generated synthetic datasets of varying sizes, capping values, and time windows for the experimental evaluation.


### Error Metrics

In our experiments we use the $`\text{RMSRE}_\tau`$ metric, defined as
[follows](https://developers.google.com/privacy-sandbox/relevance/attribution-reporting/design-decisions#expandable-9):
```math
\mathrm{RMSRE}_\tau\left(\{t_i\}_{i = 1}^n, \{e_i\}_{i = 1}^n\right) =
\sqrt{\frac{1}{n} \sum\limits_{i=1}^n \left(\frac{t_i - e_i}{\max(\tau, t_i)}\right)^2},
```
where
$`\{t_i\}_{i=1}^n`$ are the true values that we would like to measure, and
$`\{e_i\}_{i=1}^n`$ are the estimates.


### Experimental Evaluation

As mentioned above, the data used in the experiments is synthetically generated.
We perform four experiments:

1. First, we measure the error introduced by capping.
2. Next, we measure the error introduced by noise, for each granularity level
   and three different windows: one day, one week, and one month.
3. In addition, we measure the error of cumulative queries with and without
   the point contribution mechanism.
4. Finally, we examine the error of sketching-based methods.


### Results


#### Error Caused by Capping

To measure the error caused by capping, we sampled 1000 slices for each window
size (1 day, 7 days, 30 days) and each query granularity, and computed the
$`\text{RMSRE}_\tau`$ error where the estimated value is the observed (capped)
value, before adding noise or using sketches.
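The $`\text{RMSRE}_\tau`$ metric defined above translates directly into code; a minimal sketch:

```python
import math

def rmsre_tau(true_values, estimates, tau):
    """Root mean square relative error, with denominators floored at tau."""
    assert len(true_values) == len(estimates)
    n = len(true_values)
    return math.sqrt(
        sum(((t - e) / max(tau, t)) ** 2 for t, e in zip(true_values, estimates)) / n
    )
```

For example, a single slice with true value 100 estimated as 110 gives an error of 0.1; the `tau` floor prevents tiny true values from dominating the metric.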
![observation error](reach_whitepaper_figs/observation_error.png)

Next, we plot the fraction of the samples with error below a threshold t, as a
function of t.
The plot shows that to achieve an error of at most 0.1 with probability at least
80%, we need to choose a cap of at least 4.


#### Multiple Reach Measurements

To measure the error introduced by noise in the case of multiple measurements,
we sampled 1000 slices for each window size (1 day, 7 days, 30 days) and each
query granularity.
We then computed the $`\text{RMSRE}_\tau`$ error where the true value is the
observed (capped) value and the estimated value is the noised observed value.

![multiple reach measurement error](reach_whitepaper_figs/direct_error.png)

Next, we plot the fraction of the samples with total error (i.e., the error
including both capping and noise) below a threshold t, as a function of t.
The plot shows that:

- For a one-week window, error below 0.1 can be achieved for all granularities,
  and for the coarsest granularity error below 0.05 can be achieved with a cap
  of 4, with probability 80%.
- For a one-month window, error below 0.1 can be achieved for all granularities
  with probability at least 80%, even with a cap of 10.
- For a one-day window, error below 0.1 with probability 80% is only achievable
  with a cap of 1.

![total multiple reach measurement error](reach_whitepaper_figs/total_direct_error.png)


#### Cumulative Reach Method Error

To measure cumulative reach, one can simply use the approach described above
(i.e., the direct method); however, as mentioned before, the point contribution
mechanism adds noise with a smaller standard deviation.
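A useful piece of intuition here: when noise is added independently to many values that are later combined, the combined noise grows only as the square root of the number of values — summing d independent Laplace draws yields a standard deviation √d times that of a single draw. A toy check of this scaling (generic; not a simulation of either mechanism):

```python
import math
import random

def sample_laplace(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def std(samples):
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))

rng = random.Random(42)
scale, days, trials = 1.0, 30, 5000

# Standard deviation of one noise draw vs. a sum of `days` independent draws.
single = [sample_laplace(scale, rng) for _ in range(trials)]
summed = [sum(sample_laplace(scale, rng) for _ in range(days)) for _ in range(trials)]

ratio = std(summed) / std(single)  # close to sqrt(days), about 5.48
```

This is why a mechanism whose per-value noise has a smaller standard deviation compounds into a visibly better cumulative estimate by the end of the window.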
![cumulative reach method error](reach_whitepaper_figs/cumulative_error.png)

To show that this indeed provides better utility in applications, we plot the
fraction of the samples with total error on the last day (i.e., the 30th day)
below a threshold t, as a function of t.
As the plots show, for cap values where the observation error is not too large,
the point contribution method gives a significant improvement over the direct
method and should be used for cumulative queries.

![total cumulative reach method error](reach_whitepaper_figs/total_cumulative_error.png)


#### Sketch Method Error

For the last experiment, for window sizes of 1, 10, 20, …, 360 days, we sampled
100 slices and computed the $`\text{RMSRE}_\tau`$ error where the true value is
the observed (capped) value and the estimated value is obtained from sketches of
size 10000 (sketches need to be large to accommodate a year's worth of data).
Note that for each window size, we assume that reach is distributed uniformly
across the window.

![sketch method error](reach_whitepaper_figs/sketches_error.png)

We plot the window size against the total error; the shaded region denotes the
80% probability band.
As can be seen, the error is fairly stable across window sizes. For caps larger
than 3 the error is too high; however, for a cap equal to 3 the median error is
around 0.1 and stays within 0.2 with probability 80%.

![total sketch method error](reach_whitepaper_figs/total_sketches_error.png)


## Conclusion

Our experiments show that if the query is pre-determined and the window is less
than 30 days, the best solution is to use the
[point contribution mechanism](#point-contribution).
However, if the queries are not determined at collection time, or the necessary
window is over 30 days, then the sketch method can be used; it is important to
limit the cap to small values, since otherwise the error is too high.
The relatively large decrease in utility of the sketching method (compared to
the direct method) is due to the addition of noise to each intermediate sketch
before the final result is computed.
This could be addressed by evolving the API design to allow ad techs to perform
queries on sketches and add noise only to the final result.


![error comparison](reach_whitepaper_figs/errors_comparison.png)

We hope that this paper helps ad techs understand the trade-offs associated with
the various methodologies and develop a structured path to implementing reach
measurement using the APIs.

For the implementation of the experimental evaluations, please refer to the [evaluation code](https://github.com/google-research/google-research/tree/master/privacy_sandbox/reach_whitepaper).

## Acknowledgements

We would like to thank Badih Ghazi, Charlie Harrison, Ravi Kumar, Asha Menon, and Alex Turner for their input and help with this work.


## References

1. Jean Bolot, Nadia Fawaz, S. Muthukrishnan, Aleksandar Nikolov, Nina Taft: _Private decayed predicate sums on streams_. ICDT 2013: 284-295
2. Badih Ghazi, Ravi Kumar, Jelani Nelson, Pasin Manurangsi: _Private Counting of Distinct and k-Occurring Items in Time Windows_. ITCS 2023: 55:1-55:24
3. Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum: _Differential privacy under continual observation_. STOC 2010: 715-724
4. T.-H. Hubert Chan, Elaine Shi, Dawn Song: _Private and Continual Release of Statistics_. ICALP (2) 2010: 405-417
5. 
Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum: _Pure Differential Privacy for Rectangle Queries via Private Partitions_. ASIACRYPT (2) 2015: 735-751 1019 | 6. Joel Daniel Andersson, Rasmus Pagh: _A Smooth Binary Mechanism for Efficient Private Continual Observation_. NeurIPS 2023 -------------------------------------------------------------------------------- /reach_whitepaper_figs/cap_discount.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/cap_discount.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/cumulative_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/cumulative_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/diagram.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/direct_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/direct_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/errors_comparison.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/errors_comparison.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/observation_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/observation_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/sketches_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/sketches_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_cumulative_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_cumulative_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_direct_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_direct_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_sketches_error.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_sketches_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/window_discount.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/window_discount.png -------------------------------------------------------------------------------- /report_verification.md: -------------------------------------------------------------------------------- 1 | # Preventing invalid Private Aggregation API reports with report verification 2 | 3 | ### Table of Contents 4 | 5 | - [Background](#background) 6 | - [Security goals](#security-goals) 7 | - [Existing threat practicality](#existing-threat-practicality) 8 | - [Shared Storage](#shared-storage) 9 | - [Details](#details) 10 | - [Deterministic number of reports](#deterministic-number-of-reports) 11 | - [Allows retrospective filtering](#allows-retrospective-filtering) 12 | - [Security considerations](#security-considerations) 13 | - [Privacy considerations](#privacy-considerations) 14 | - [Reduced delay](#reduced-delay) 15 | - [FLEDGE sellers](#fledge-sellers) 16 | - [Details](#details-1) 17 | - [Privacy considerations](#privacy-considerations-1) 18 | - [FLEDGE bidders](#fledge-bidders) 19 | - [Details](#details-2) 20 | - [New network requests](#new-network-requests) 21 | - [Need to list all possible token issuers](#need-to-list-all-possible-token-issuers) 22 | - [Need to limit the list of token issuers](#need-to-limit-the-list-of-token-issuers) 23 | - [Allocating returned tokens](#allocating-returned-tokens) 24 | - [Delegating token issuance](#delegating-token-issuance) 25 | - [Ensuring delegation consistency](#ensuring-delegation-consistency) 26 | - 
[Issuance mechanism](#issuance-mechanism) 27 | - [Redemption mechanism](#redemption-mechanism) 28 | - [Security considerations](#security-considerations-1) 29 | - [Privacy considerations](#privacy-considerations-2) 30 | - [Partitioning can amplify counting attacks](#partitioning-can-amplify-counting-attacks) 31 | - [Initial design](#initial-design) 32 | - [Potential mitigations](#potential-mitigations) 33 | - [Alternatives considered](#alternatives-considered) 34 | - [Using signals from interest group joining time](#using-signals-from-interest-group-joining-time) 35 | - [New network request](#new-network-request) 36 | - [Different security model](#different-security-model) 37 | - [Difficult to decide on number of tokens to issue](#difficult-to-decide-on-number-of-tokens-to-issue) 38 | - [Requires new persistent token storage](#requires-new-persistent-token-storage) 39 | - [Using signals from both auction running time and interest group joining time](#using-signals-from-both-auction-running-time-and-interest-group-joining-time) 40 | - [Separate token headers/fields](#separate-token-headersfields) 41 | - [Each origin picks one option](#each-origin-picks-one-option) 42 | - [One mechanism preferred, the other a fallback](#one-mechanism-preferred-the-other-a-fallback) 43 | - [Specifying a contextual ID and each possible IG owner](#specifying-a-contextual-id-and-each-possible-ig-owner) 44 | - [Trusted server report verification](#trusted-server-report-verification) 45 | - [Shared Storage in Fenced Frames](#shared-storage-in-fenced-frames) 46 | - [Details](#details-3) 47 | - [Doesn’t support nesting](#doesnt-support-nesting) 48 | - [Privacy considerations](#privacy-considerations-3) 49 | - [Extending to selectURL()](#extending-to-selecturl) 50 | 51 | ## Background 52 | 53 | This document proposes a set of API changes to enhance the security of the 54 | aggregatable reports by making it more difficult for bad actors to interfere 55 | with the accuracy of cross-site 
measurement. Note that a mechanism based on the 56 | [Private State Tokens API](https://github.com/WICG/trust-token-api) has been 57 | [proposed](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md) 58 | for the Attribution Reporting API. 59 | 60 | The proposal is separated by the different contexts the Private Aggregation API 61 | can be invoked in as the constraints and designs differ substantially. 62 | 63 | ## Security goals 64 | 65 | Our security goals match the Attribution Reporting proposal’s, see 66 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#security-goals) 67 | for details. Briefly, our primary security goals are: 68 | 69 | 1. No reports out of thin air 70 | 2. No replaying reports 71 | 72 | We also share the secondary goals: 73 | 74 | 3. Privacy of the invalid traffic (IVT) detector 75 | 4. Limit the attack scope for bad actors that can bypass IVT detectors 76 | 5. No report mutation (lower priority) 77 | 78 | ### Existing threat practicality 79 | 80 | As with the Attribution Reporting API, we don’t currently prevent reports being 81 | created out of thin air, but practical attacks are challenging. More details 82 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#existing-mitigations-and-practical-threats). 83 | 84 | ## Shared Storage 85 | 86 | When triggering a Shared Storage operation that could send an aggregatable 87 | report, we propose allowing the site to specify a high-entropy ID from 88 | outside the isolated context. This ID would then be embedded unencrypted in the 89 | report issued by that worklet operation, e.g. 
adding the following key to the 90 | report: 91 | 92 | ```jsonc 93 | "context_id" : "example_string", 94 | ``` 95 | 96 | This would be achieved by adding a new optional parameter to the Shared Storage 97 | `run()` and `selectURL()` APIs, e.g.: 98 | 99 | ```js 100 | sharedStorage.run('someOperation', {'privateAggregationConfig': {'contextId': 'example_string'}}); 101 | ``` 102 | 103 | Note that this design does not support report verification for Shared Storage 104 | operations run from within a fenced frame. See 105 | [below](#shared-storage-in-fenced-frames) for a discussion of that case. 106 | 107 | An approach based on Private State Tokens was not proposed as it would add 108 | complexity and offer strictly less power than ID-based filtering for invalid 109 | traffic filtering. 110 | 111 | ### Details 112 | 113 | #### Deterministic number of reports 114 | 115 | One key concern with this approach is that the number of reports (with that ID) 116 | could be used to exfiltrate cross-site information. So, when an ID is specified 117 | for a Shared Storage operation, we ensure that a single report is sent no matter 118 | how many calls to `sendHistogramReport()` occur (including zero). Instead, this 119 | report would have a variable number of contributions embedded (see [batching 120 | proposal](https://github.com/patcg-individual-drafts/private-aggregation-api#reducing-volume-by-batching)). 121 | To avoid leaking the number of contributions, we will need to 122 | [pad](https://github.com/patcg-individual-drafts/private-aggregation-api#padding) 123 | the encrypted payload. Additionally, if a context has run out of budget, a 124 | report should still be sent (containing no contributions). 125 | 126 | #### Allows retrospective filtering 127 | 128 | This approach allows for a server to retroactively alter its decisions on report 129 | validity. 
validity. For example, if a new signal for invalid traffic is determined,
previous reports with that signal could be marked as invalid too (if they have
not yet been processed).

#### Security considerations

This option easily achieves all of the higher-priority security goals:

- **No reports out of thin air**: Any report without an ID, or with an
  unexpected ID, can be discarded as invalid. These IDs are high-entropy and so
  can be made infeasible to guess.
- **No replaying reports**: Each ID can be unique, allowing discarding of
  reports with a repeated ID.
- **Privacy of the invalid traffic (IVT) detector**: The valid/invalid decision
  associated with an ID can be made server-side and need not be revealed to the
  client. In fact, the decision need not happen immediately, see [Allows
  retrospective filtering](#allows-retrospective-filtering).
- **Limit the attack scope for bad actors that can bypass IVT detectors**: By
  using a unique ID, server-side checks could be added to ensure that the
  metadata fields in the report match the expected values.
- **No report mutation**: Only partially addressed. The server can verify that
  plaintext fields in the report “reasonably match” their expectations. This
  does not prevent mutation of the payload or of fields that could have multiple
  reasonable values (e.g. small changes to scheduled\_report\_time).

#### Privacy considerations

Adding a high-entropy ID may allow for timing attacks. For example, if a report
is not issued until after a Shared Storage operation completes, the reporting
origin could in principle use the scheduled reporting time to learn something
about how long the operation took to run. This is currently mitigated by a
randomized delay, but we also plan to add a timeout in Shared Storage, see
[Reduced delay](#reduced-delay) below.
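The first two security goals above lend themselves to a simple server-side check: track the set of context IDs handed out and reject reports whose ID is unknown or already redeemed. A minimal sketch (our own illustration; the proposal does not prescribe a server implementation):

```python
import secrets

class ContextIdVerifier:
    """Tracks issued context IDs and rejects unknown or replayed ones."""

    def __init__(self):
        self._pending = set()   # IDs handed out, no report received yet
        self._redeemed = set()  # IDs already seen on a report

    def new_context_id(self):
        # High-entropy ID, infeasible to guess ("no reports out of thin air").
        context_id = secrets.token_urlsafe(16)
        self._pending.add(context_id)
        return context_id

    def accept_report(self, context_id):
        if context_id in self._redeemed:
            return False  # repeated ID: "no replaying reports"
        if context_id not in self._pending:
            return False  # never issued: report out of thin air
        self._pending.discard(context_id)
        self._redeemed.add(context_id)
        return True
```

Because the validity decision lives entirely server-side, it can also be deferred, matching the retrospective filtering described above.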
162 | 163 | Adding a high-entropy ID also allows for the reports to be arbitrarily 164 | partitioned. However, by making the count of reports with the given ID 165 | [deterministic](#deterministic-number-of-reports), we avoid the [major concern 166 | this introduces](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#could-we-just-tag-reports-with-a-trigger_id-instead-of-using-anonymous-tokens) 167 | (non-noisy leaks through counts). We do not consider the ability to process only 168 | a chosen subset of reports to be a privacy concern, given other protections 169 | (e.g. adding noise to the summary report). 170 | 171 | #### Reduced delay 172 | 173 | Currently, reports are delayed by up to one hour to avoid revealing an 174 | association between the issued reports and the original context. As this 175 | approach explicitly reveals this association (with other mitigations), we can 176 | shorten these delays. We plan to impose a 5 second timeout on Shared Storage 177 | operations making contributions. We then plan to wait until the timeout to 178 | send a report, even if execution finishes early. This avoids leaking 179 | information through how long the operation took to run. We also considered 180 | instead keeping a shorter randomized delay (e.g. up to 1 minute), but that 181 | did not seem necessary. 182 | 183 | ## FLEDGE sellers 184 | 185 | We propose a very similar mechanism for FLEDGE seller reporting as for Shared 186 | Storage worklets. That is, we’ll allow the site to specify a high-entropy 187 | ID from outside the isolated context and this ID would then be embedded 188 | unencrypted in the report issued by that seller within that auction, e.g.: 189 | 190 | ```jsonc 191 | "context_id" : "example_string", 192 | ``` 193 | 194 | The seller would specify this ID through an optional parameter into the 195 | `auctionConfig`, e.g.: 196 | 197 | ```js 198 | const myAuctionConfig = { 199 | ... 
200 | 'privateAggregationConfig': { 201 | 'contextId': 'example_string', 202 | } 203 | }; 204 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig); 205 | ``` 206 | 207 | ### Details 208 | 209 | See the [Shared Storage section](#shared-storage) for more details. 210 | 211 | #### Privacy considerations 212 | 213 | Like for shared storage, adding a high-entropy ID could allow for timing attacks 214 | as the reporting origin could use the scheduled reporting time to learn 215 | something about when the report was triggered. This is partially mitigated by 216 | the existing randomized reporting delay (10-60 min) imposed as FLEDGE auctions 217 | impose small timeouts (e.g. 0.5 s). As discussed 218 | [above](#privacy-considerations) for Shared Storage, we avoid concerns about 219 | partitioning by making the number of reports deterministic (and other 220 | protections). 221 | 222 | ## FLEDGE bidders 223 | 224 | **Note: Unlike the above sections which offer relatively straightforward approaches, 225 | this section is highly complex and nuanced. Feedback is appreciated!** 226 | 227 | We can’t easily use a contextual ID for the FLEDGE bidder case as the existence 228 | of a bidder in a particular auction is inherently cross-site data, see 229 | [below](#specifying-a-contextual-id-and-each-possible-ig-owner). So, our options 230 | are more limited and we focus on mechanisms using Private State Tokens. 231 | 232 | However, note also that there are no existing network requests that we can easily reuse 233 | for token issuance. While there is a trusted signals fetch, that is 234 | intentionally uncredentialed. Much like using an ID, we can’t just add a network 235 | request for each bidder as that would reveal cross-site data. 236 | 237 | So, we handle token issuance by adding a new optional parameter to 238 | `runAdAuction()`, e.g.: 239 | 240 | ```js 241 | const myAuctionConfig = { 242 | ... 243 | 'privateAggregationConfig': { 244 | ... 
245 | 'tokenIssuanceURLs': [ 246 | 'https://origin1.example/path?signal1=abc,signal2=def', 247 | 'https://origin2.example/some-other-path', 248 | 'https://origin3.example/etc', 249 | ], 250 | // How many tokens to request from each listed issuer. Optional, defaults to 251 | // each issuer's batch size. 252 | 'numTokensPerIssuer': 10, 253 | } 254 | } 255 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig); 256 | ``` 257 | 258 | This would trigger a token issuance request for each listed origin (see 259 | [below](#issuance-mechanism)). Each token would be redeemed along with any later 260 | reports’ network requests (see [below](#redemption-mechanism)). 261 | 262 | If this token successfully verifies, then the reporting origin has a guarantee 263 | that the report was associated with a `runAdAuction()` request that was signed. 264 | 265 | ### Details 266 | 267 | #### New network requests 268 | 269 | This requires the addition of a new network request for each listed token issuer, 270 | emitted when `runAdAuction` is invoked. 271 | 272 | #### Need to list all possible token issuers 273 | 274 | As the presence of bidders in an auction is inherently cross-site, we require 275 | listing all possible token issuers from the publisher site. The user agent will 276 | then unconditionally perform a token issuance request for each listed token 277 | issuer to avoid cross-site leakage, i.e. even if the issuer is not used by any 278 | bidder in the auction. 279 | 280 | #### Need to limit the list of token issuers 281 | 282 | The user agent will also need to impose a limit on the number of token issuers 283 | listed in each auction to avoid too many network requests being added. 284 | Practically, this means interest group owners will likely need to use the 285 | [delegation mechanism](#delegating-token-issuance). 286 | 287 | #### Allocating returned tokens 288 | 289 | A single bidder origin may own multiple interest groups that a user is enrolled 290 | in. 
in. Additionally, multiple interest group owner origins may use the same token
issuer (due to [delegation](#delegating-token-issuance)). In these cases, the
interest groups will have to share the issued tokens.

In the case of multiple owner origins using the same token issuer, tokens can’t
be reused, as we don’t want to reveal that both interest group owners were
present in the same auction (for the same user). However, multiple tokens can be
requested from a single token issuer to mitigate this. If not enough tokens were
issued, some reports will be sent unattested.

In the case of multiple interest groups with the same owner, the histogram
contributions should be
[batched](https://github.com/patcg-individual-drafts/private-aggregation-api#reducing-volume-by-batching)
together into a single report, avoiding the need to use multiple tokens.
However, the [extended reporting
plans](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md)
for Private Aggregation allow fenced frames to trigger reports indirectly
with `window.fence.reportPrivateAggregationEvent()`. This could occur
arbitrarily later, so we may need to ignore events triggered too long after the
auction (e.g. after 1 hour). We could consider replacing the randomized delay
with simply waiting until the timeout, even if execution finishes early.

#### Delegating token issuance

Interest group owners will be able to delegate their token issuance by hosting a
`.well-known` file which specifies the origin to delegate to. This will be
optional (i.e. each origin can choose itself as its token issuer), but note that
all origins choosing themselves would likely exceed limits; see [Need to limit
the list of token issuers](#need-to-limit-the-list-of-token-issuers) above.
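The allocation rules above — at most one token per report, shared pools per issuer, and unattested reports once a pool runs dry — can be sketched as follows (an illustration of the stated constraints, not a specification of browser behavior; all names are hypothetical):

```python
def allocate_tokens(reports, issuer_for_owner, tokens_per_issuer):
    """Assigns at most one token to each report, drawing from per-issuer pools.

    reports: list of interest-group owner origins, one entry per pending report.
    issuer_for_owner: owner origin -> token issuer origin (via delegation).
    tokens_per_issuer: issuer origin -> list of unused issued tokens.

    Returns a list of (owner, token-or-None); None means the report is sent
    unattested because its issuer's pool was exhausted.
    """
    pools = {issuer: list(tokens) for issuer, tokens in tokens_per_issuer.items()}
    allocation = []
    for owner in reports:
        issuer = issuer_for_owner.get(owner)
        pool = pools.get(issuer, [])
        # Tokens are never reused across owners sharing an issuer: each
        # report consumes one token from the shared pool.
        allocation.append((owner, pool.pop() if pool else None))
    return allocation
```

For example, two owners delegating to the same issuer with only two issued tokens will leave a third report unattested, matching the behavior described above.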
319 | 320 | ##### Ensuring delegation consistency 321 | 322 | To ensure that the same file is served across different browser instances, the 323 | user agent vendor may re-distribute these files through a separate mechanism. 324 | Further, to ensure that the origin does not change frequently, the user agent 325 | could impose some limits on the rotation frequency. 326 | 327 | #### Issuance mechanism 328 | 329 | Token issuance network requests will be sent to the specified token issuer URLs. 330 | The URL path and query string allows for metadata to be embedded by the seller, 331 | but note that only the token issuance _origin_ is used for 332 | [delegation](#delegating-token-issuance). Each request will have a 333 | `Sec-Private-Aggregation-Private-State-Token` header with one or more blinded 334 | messages (each of which embeds a report\_id) according to the number of tokens 335 | requested. If the number of tokens is not requested, the token issuer’s [batch 336 | size](https://source.chromium.org/chromium/chromium/src/+/main:services/network/public/mojom/trust_tokens.mojom;drc=96d76471a47949536f88e90cbf03596cda41f6e1;l=232) 337 | will be used. The token issuer will inspect the request and decide whether it is 338 | valid, i.e. whether the issuer suspects it is coming from a real, honest client 339 | and should therefore be allowed to generate aggregatable reports. 340 | 341 | If the request is considered **invalid** and hence shouldn’t be taken into 342 | account to calculate aggregate measurement results, the origin should respond 343 | without adding a `Sec-Private-Aggregation-Private-State-Token` response header. 344 | If this header is omitted or is not valid, the browser will proceed normally, 345 | but any report generated will not contain the report verification header. Note: 346 | more advanced deployments can consider issuing an "invalid" token using private 347 | metadata to avoid the client learning the detection result. 
See privacy of the 348 | IVT detector in [Security considerations](#security-considerations-1) for more 349 | details. 350 | 351 | If the request is considered **valid**, the origin should add a 352 | `Sec-Private-Aggregation-Private-State-Token` header with a blind token (the 353 | blind signature over the blinded message) for each blinded message included in 354 | the original request. The origin could also return a token for only a subset of 355 | the blinded messages if it wishes to limit the number of tokens issued to limit 356 | exfiltration risk. 357 | 358 | Internally, the browser will store the token associated with any generated 359 | report until it is sent. 360 | 361 | #### Redemption mechanism 362 | 363 | If a token is [allocated](#allocating-returned-tokens) to an aggregatable 364 | report, it will be sent along with the report’s request in the form of a new 365 | request header `Sec-Private-Aggregation-Private-State-Token`. If this token is 366 | successfully verified, then the reporting origin has a guarantee that the report 367 | was associated with a previous request that was signed. 368 | 369 | Note: unlike the basic Private State Token API (which enables conveying tokens 370 | from one site to another), there are no redemption limits for Private 371 | Aggregation API integration. See [Privacy 372 | considerations](#privacy-considerations-2) for discussion of other mitigations. 373 | 374 | #### Security considerations 375 | 376 | This option easily achieves the primary security goals plus some secondary 377 | security goals. The considerations largely match the Attribution Reporting 378 | proposal’s given the similar token-based approach, see 379 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#security-considerations) 380 | for details. 
381 | 382 | #### Privacy considerations 383 | 384 | Much like Attribution Reporting’s 385 | [proposal](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#privacy-considerations), 386 | this integration is intended to be as privacy-neutral as possible. In 387 | particular, we want to avoid cross-site information leakage. While each token’s 388 | issuance occurs using a request from a single site, this token – including its 389 | metadata, or no token if none was issued – will later be sent with a report from 390 | a bidder. Which bidders participated in an auction is itself cross-site 391 | data. 392 | 393 | ##### Partitioning can amplify counting attacks 394 | 395 | If the count of reports is sensitive, this partitioning could amplify counting 396 | attacks. However, note that reports can already be partitioned by the 397 | `scheduled_report_time` and `api` fields. There are designs for [protecting the 398 | count of encrypted reports](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATE.md#hide-the-true-number-of-attribution-reports) 399 | to mitigate or eliminate the risk of counting attacks. These designs target the 400 | Attribution Reporting API, but could be adapted for Private Aggregation. Still, even 401 | with less extreme mitigations in place, there are privacy benefits to reducing the 402 | available partitioning. 403 | 404 | ##### Initial design 405 | 406 | For the initial design, we do not plan to implement any changes to the Private 407 | State Token protocol’s public/private metadata bits. So, each token will have 408 | [six buckets](https://github.com/WICG/trust-token-api/blob/main/ISSUER_PROTOCOL.md#issuance-metadata) 409 | of metadata embedded. Further, each report could either have a token or no 410 | token, allowing up to 7 total possibilities (~2.8 bits). This would therefore 411 | allow the reporting origin to partition its reports into 7 buckets.
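The bucket arithmetic above can be checked directly: six metadata buckets plus the no-token case give seven distinguishable states per report.

```javascript
// Six public/private metadata buckets per token, plus the "no token" case.
const metadataBuckets = 6;
const states = metadataBuckets + 1; // 7 distinguishable states per report
const bits = Math.log2(states); // ~2.81 bits of partitioning

// A second, independent token field would square this:
// 7 ** 2 = 49 states, i.e. Math.log2(49) ≈ 5.61 bits.
console.log(states, bits.toFixed(2)); // → 7 2.81
```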
412 | 413 | ##### Potential mitigations 414 | 415 | We could consider mitigations in the future. For example: 416 | 417 | - restricting the public/private metadata to one bucket – or just a single 418 | private bit to avoid an invalid traffic oracle. 419 | - refusing to send reports to reporting origins using report verification if no 420 | token was available/issued. 421 | - sending null reports with some frequency for buyers that delegate to an issuer 422 | who issued tokens. 423 | 424 | ### Alternatives considered 425 | 426 | #### Using signals from interest group joining time 427 | 428 | Alternatively, we could associate any trust signals available at the 429 | `joinAdInterestGroup()` call with reports later sent from a bidder under that 430 | interest group. 431 | 432 | Token issuance could be handled by adding a new optional parameter to 433 | `joinAdInterestGroup()`, e.g.: 434 | 435 | ```js 436 | const myGroup = { 437 | ... 438 | 'privateAggregationTokens': 10, // number of tokens to request 439 | } 440 | const joinPromise = navigator.joinAdInterestGroup(myGroup, 30 * kSecsPerDay); 441 | ``` 442 | 443 | This would trigger a token issuance request (see [above](#issuance-mechanism)) 444 | with the requested number of blinded messages. Each resulting token would be 445 | redeemed along with the later report’s network request (see 446 | [above](#redemption-mechanism)). 447 | 448 | If the token successfully verifies, then the reporting origin has a guarantee 449 | that the report was associated with a previous `joinAdInterestGroup()` request 450 | that was signed. 451 | 452 | ##### New network request 453 | 454 | This requires the addition of one new network request at `joinAdInterestGroup()` 455 | time. 456 | 457 | ##### Different security model 458 | 459 | This approach uses a different security model to Attribution Reporting’s, with a 460 | potentially large time delay between token issuance and use. 
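To make the storage implications of this alternative concrete, a toy model of a per-interest-group token pool might look like the following. All names are hypothetical, not part of any proposal; a real browser-side store would also need to track issuer key epochs.

```javascript
// Toy model: tokens obtained at joinAdInterestGroup() time are held until a
// later aggregatable report consumes one.
class InterestGroupTokenStore {
  constructor() {
    this.tokens = [];
  }

  // Called after a successful issuance at joinAdInterestGroup() time.
  deposit(tokens) {
    this.tokens.push(...tokens);
  }

  // Called when an aggregatable report is about to be sent. Returns the token
  // to attach, or null if the pool is exhausted (the report goes unverified).
  takeOne() {
    return this.tokens.length > 0 ? this.tokens.shift() : null;
  }
}
```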
461 | 462 | ##### Difficult to decide on number of tokens to issue 463 | 464 | Due to this large time delay between token issuance and last possible use, it 465 | will be difficult to decide on the number of tokens to issue. If too few are 466 | issued, later auctions may go unattested. Issuing too many may 467 | degrade performance, e.g. by unnecessarily using storage, and may exacerbate token 468 | exfiltration attacks. 469 | 470 | ##### Requires new persistent token storage 471 | 472 | This approach requires Private State Tokens to be persisted for later use. This 473 | store will need to be separate from the existing Private State Token store. Note 474 | also that key rotations will cause issues here, as any tokens issued before a 475 | rotation could not be used after it. 476 | 477 | #### Using signals from both auction running time and interest group joining time 478 | 479 | We could combine the functionality of both the proposal and the above 480 | alternative. There are a few different ways we could do this. 481 | 482 | ##### Separate token headers/fields 483 | 484 | We could allow both mechanisms to be independently implemented. Separate 485 | headers could be used to distinguish between the two. This would allow for 486 | maximum flexibility, but comes at a possible complexity and privacy cost. 487 | 488 | **Privacy risk:** By supporting two separate token fields, the number of 489 | possible token states is ‘squared’. That is, without additional mitigations, 490 | adding a second Private State Token field would increase the number of states 491 | from 7 to 49 (~5.6 bits). This partitioning would allow for amplified counting 492 | attacks unless other mitigations are put in place; see 493 | [above](#privacy-considerations-2). 494 | 495 | ##### Each origin picks one option 496 | 497 | We could allow each origin to pick one of the two mechanisms, via a mechanism 498 | similar to the one for picking a token issuer.
Any attempt to use the other mechanism 499 | would be ignored or cause an error. 500 | 501 | ##### One mechanism preferred, the other a fallback 502 | 503 | We could allow for both mechanisms, but only allow one token to be bound to each 504 | report. If a token is available via each mechanism, the browser will prefer one 505 | (e.g. the token associated with runAdAuction()). 506 | 507 | #### Specifying a contextual ID and each possible IG owner 508 | 509 | Instead of using Private State Tokens, we could also use a contextual ID here. 510 | But, to avoid a cross-site leak, this would require that a report be sent to 511 | each origin listed in `interestGroupBuyers`, even if that bidder did not 512 | actually participate in the auction. This could lead to a large number of (null) 513 | reports, which would pose a performance concern. 514 | 515 | #### Trusted server report verification 516 | 517 | Ideally for performance, the user agent would only request a token 518 | for reports that are actually going to be sent. But that would inherently leak 519 | cross-site data, which we can't allow. However, it might be possible to design a 520 | trusted server architecture that can perform the required invalid traffic 521 | determination and token issuance while ensuring that any cross-site data is not 522 | persisted. This is not feasible in the short term, however, as it would require 523 | significant design and exploration. 524 | 525 | ## Shared Storage in Fenced Frames 526 | 527 | When a shared storage operation is run from a fenced frame instead of a 528 | document, we can no longer set a contextual ID. Any cross-site information the 529 | fenced frame has could be embedded in the context ID, so the ability to set it 530 | is disabled. 531 | 532 | Instead, we propose allowing a Private State Token to be bound to the 533 | FencedFrameConfig output of a FLEDGE auction.
We would reuse the FLEDGE bidder 534 | mechanism chosen [above](#fledge-bidders) and take an additional token from the 535 | same source for this purpose. When the shared storage worklet triggers a report 536 | to be sent, any context ID specified would be ignored and the token would be 537 | used instead. 538 | 539 | ### Details 540 | 541 | As it uses the same token source, most details match the FLEDGE bidder 542 | discussion (see [above](#details-2)). Additional considerations are listed 543 | below. 544 | 545 | #### Doesn’t support nesting 546 | 547 | This proposal does not currently support cross-origin subframes or nested fenced frames 548 | within the top-level fenced frame. 549 | 550 | #### Privacy considerations 551 | 552 | As discussed [above](#privacy-considerations-2), adding a token allows reports 553 | to be partitioned, which exacerbates the risk of a counting attack. 554 | 555 | This design also implicitly reveals whether a Shared Storage worklet’s 556 | aggregatable report came from an operation run by a document or a fenced frame. 557 | This may allow for further partitioning, but is unlikely to be a significant 558 | issue. 559 | 560 | ### Extending to selectURL() 561 | 562 | Further design work is needed to extend this mechanism to fenced frames 563 | rendering the output of a `selectURL()` operation. 564 | -------------------------------------------------------------------------------- /security_and_privacy_questionnaire.md: -------------------------------------------------------------------------------- 1 | # Security and Privacy Questionnaire 2 | 3 | Responses to the W3C TAG’s [Self-Review Questionnaire: Security and 4 | Privacy](https://w3ctag.github.io/security-questionnaire/) for the Private 5 | Aggregation API. 6 | 7 | ### 2.1. What information does this feature expose, and for what purposes? 8 | 9 | This API lets isolated contexts with access to cross-site data (i.e. 
[Shared 10 | Storage](https://github.com/WICG/shared-storage) worklets/[Protected 11 | Audience](https://github.com/WICG/turtledove) script runners) send aggregatable 12 | reports over the network. Aggregatable reports contain encrypted, high-entropy 13 | cross-site information, in the form of key-value pairs (i.e. contributions to a 14 | histogram), but this information is not exposed directly. Instead, these reports 15 | can only be processed by a trusted aggregation service. This trusted aggregation 16 | service sums the values across the reports for each key and adds noise to each 17 | of these values to produce ‘summary reports’. It also limits the number of times 18 | that reports may be queried. 19 | 20 | The aggregatable reports also contain some unencrypted metadata that is not 21 | based on cross-site information. 22 | 23 | The purpose of this API is to allow generic aggregate cross-site measurement for 24 | a range of use cases, even if third-party cookies are no longer available. Use 25 | cases include reach measurement and Protected Audience auction reporting. 26 | 27 | ### 2.2. Do features in your specification expose the minimum amount of information necessary to implement the intended functionality? 28 | 29 | We strictly limit access to the cross-site information embedded in the 30 | aggregatable reports. The cross-site information embedded in these reports is 31 | encrypted and only processable by a trusted aggregation service. The output of 32 | that processing will be an aggregate, noised histogram. The service ensures that 33 | no report can be processed multiple times. Further, information exposure is 34 | limited by contribution budgets on the client. In principle, this framework can 35 | support specifying a noise parameter that satisfies differential privacy. 36 | 37 | The plaintext portion of an aggregatable report includes information necessary 38 | to organize (batch) reports for aggregation.
The encrypted portion is assumed to 39 | be not readable by an attacker (except for ciphertext size). 40 | 41 | The amount of information exposed by this API is a product of the privacy 42 | parameters used (e.g. contribution limits and the noise distribution). While we 43 | aim to minimize the amount of information exposed, we also aim to support a wide 44 | range of use cases. The privacy parameters can be customized to reflect the 45 | appropriate tradeoff between information exposure and utility. The exact choice 46 | of parameters is currently left unfixed to allow for exploration of this 47 | tradeoff and will eventually be fixed based on community feedback. 48 | 49 | These reports also expose a limited amount of metadata, which is not based on 50 | cross-site data. However, the number of reports with the given metadata could 51 | expose some cross-site information. To protect against this, we make the number 52 | of reports deterministic in certain situations (sending reports containing no 53 | contributions in the payloads if necessary). We are considering mitigations for 54 | other situations, e.g. adding noise to the report count. 55 | 56 | The recipient of the report may also be able to observe side-channel information 57 | such as the time when the report was sent, or IP address of the sender. 58 | 59 | ### 2.3. Do the features in your specification expose personal information, personally-identifiable information (PII), or information derived from either? 60 | 61 | This API does not directly expose PII or personal information. However, it is a 62 | generic mechanism that does not place any limits on the kinds of data that sites 63 | may encapsulate into the reports. See above for how all cross-site information 64 | is protected. 65 | 66 | ### 2.4. How do the features in your specification deal with sensitive information? 67 | 68 | See 2.3. 69 | 70 | ### 2.5. Do the features in your specification introduce state that persists across browsing sessions? 
71 | 72 | Yes, we introduce new storage for reports that are not yet sent (i.e. scheduled 73 | to be sent in the future), for enforcing limits on the total sum of contribution 74 | values (per-reporting site, per-context type, per-10 min / per-day) and for 75 | caching the public keys of the trusted aggregation service. These all persist 76 | across browsing sessions, but will be cleared along with other site data when 77 | requested by a user and are not exposed to JavaScript. 78 | 79 | ### 2.6. Do the features in your specification expose information about the underlying platform to origins? 80 | 81 | No 82 | 83 | ### 2.7. Does this specification allow an origin to send data to the underlying platform? 84 | 85 | No 86 | 87 | ### 2.8. Do features in this specification enable access to device sensors? 88 | 89 | No 90 | 91 | ### 2.9. Do features in this specification enable new script execution/loading mechanisms? 92 | 93 | No, but this API is proposed only for the new isolated script execution contexts 94 | specified by other proposed features (i.e. Shared Storage worklets/Protected 95 | Audience script runners). 96 | 97 | ### 2.10. Do features in this specification allow an origin to access other devices? 98 | 99 | No 100 | 101 | ### 2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI? 102 | 103 | No 104 | 105 | ### 2.12. What temporary identifiers do the features in this specification create or expose to the web? 106 | 107 | None 108 | 109 | ### 2.13. How does this specification distinguish between behavior in first-party and third-party contexts? 110 | 111 | This API is only exposed in isolated contexts that may have access to cross-site 112 | data. There are mechanisms proposed for controlling access to those isolated 113 | contexts, e.g. see Protected Audience’s response 114 | [here](https://github.com/w3ctag/design-reviews/issues/723). 115 | 116 | ### 2.14. 
How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode? 117 | 118 | The contexts this API is exposed in (Shared Storage worklets and Protected 119 | Audience script runners) are not available in Private Browsing/Incognito mode, 120 | so it is not possible to use this API in that mode. 121 | 122 | ### 2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections? 123 | 124 | We are still working on the spec, but it will include both sections. 125 | 126 | ### 2.16. Do features in your specification enable origins to downgrade default security protections? 127 | 128 | No 129 | 130 | ### 2.17. What happens when a document that uses your feature is kept alive in BFCache (instead of getting destroyed) after navigation, and potentially gets reused on future navigations back to the document? 131 | 132 | This API is not available in document contexts, so there is no need to handle 133 | this case. 134 | 135 | ### 2.18. What happens when a document that uses your feature gets disconnected? 136 | 137 | This API is not available in document contexts, so there is no need to handle 138 | this case. 139 | 140 | ### 2.19. What should this questionnaire have asked? 141 | 142 | N/A 143 | --------------------------------------------------------------------------------