├── .github
│   └── workflows
│       ├── build.yml
│       └── validate.yml
├── .gitignore
├── .pr-preview.json
├── Makefile
├── README.md
├── error_reporting.md
├── flexible_filtering.md
├── meetings
│   └── 2023-02-15
│       ├── minutes.md
│       └── shared-storage-overview.pdf
├── named_budgets.md
├── reach_whitepaper.md
├── reach_whitepaper_figs
│   ├── cap_discount.png
│   ├── cumulative_error.png
│   ├── diagram.png
│   ├── direct_error.png
│   ├── errors_comparison.png
│   ├── observation_error.png
│   ├── sketches_error.png
│   ├── total_cumulative_error.png
│   ├── total_direct_error.png
│   ├── total_sketches_error.png
│   └── window_discount.png
├── report_verification.md
├── security_and_privacy_questionnaire.md
└── spec.bs

/.github/workflows/build.yml:
--------------------------------------------------------------------------------
 1 | name: Build
 2 | on:
 3 |   push:
 4 |     branches: [main]
 5 |     paths: ["**.bs"]
 6 | jobs:
 7 |   build:
 8 |     name: Build
 9 |     runs-on: ubuntu-latest
10 |     steps:
11 |       - uses: actions/checkout@v3
12 |       - uses: w3c/spec-prod@v2
13 |         with:
14 |           TOOLCHAIN: bikeshed
15 |           DESTINATION: index.html
16 |           SOURCE: spec.bs
17 |           GH_PAGES_BRANCH: gh-pages
18 |           BUILD_FAIL_ON: warning
--------------------------------------------------------------------------------
/.github/workflows/validate.yml:
--------------------------------------------------------------------------------
 1 | name: Validate
 2 | on:
 3 |   pull_request:
 4 |     paths: ["**.bs"]
 5 | jobs:
 6 |   main:
 7 |     name: Validate Spec
 8 |     runs-on: ubuntu-latest
 9 |     steps:
10 |       - uses: actions/checkout@v3
11 |       - uses: w3c/spec-prod@v2
12 |         with:
13 |           TOOLCHAIN: bikeshed
14 |           SOURCE: spec.bs
15 |           BUILD_FAIL_ON: warning
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | spec.html
--------------------------------------------------------------------------------
/.pr-preview.json:
-------------------------------------------------------------------------------- 1 | { 2 | "src_file": "spec.bs", 3 | "type": "bikeshed" 4 | } 5 | 6 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # HTML files that are generated from Markdown sources. 2 | HTML_FROM_MD_TARGETS = README.html README-with-toc.html 3 | COMMIT = $(shell git rev-parse HEAD) 4 | EXPORTS_REPORT_NAME = "exports-report.${COMMIT}.txt" 5 | 6 | .PHONY: all 7 | all: spec.html $(HTML_FROM_MD_TARGETS) 8 | 9 | .PHONY: clean 10 | clean: 11 | -rm spec.html 12 | -rm $(HTML_FROM_MD_TARGETS) 13 | -rm exports-report.*.txt 14 | 15 | # Updates Bikeshed's datafiles. Bikeshed automatically updates them if they're 16 | # more than a few days old, but this will ensure you have the latest version. 17 | .PHONY: update 18 | update: 19 | bikeshed update 20 | 21 | spec.html: spec.bs 22 | bikeshed --die-on=everything spec $< $@ 23 | 24 | # Autogenerates a table of contents for the README. This can in turn be rendered 25 | # as HTML. This is useful for catching mistakes in the README's handwritten TOC. 26 | README-with-toc.md: README.md 27 | pandoc -f gfm --toc --toc-depth 6 -s $< -o $@ 28 | 29 | # Rule for generating HTML from Markdown for a limited set of targets. This uses 30 | # GNU Make "static pattern" syntax. 31 | $(HTML_FROM_MD_TARGETS): %.html : %.md 32 | pandoc -f gfm -s $< -o $@ 33 | 34 | .PHONY: write-exports-report 35 | write-exports-report: ${EXPORTS_REPORT_NAME} 36 | 37 | ${EXPORTS_REPORT_NAME}: spec.bs 38 | bikeshed debug --print-exports > $@ 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **This document is an individual draft proposal. 
It has not been adopted by the Private Advertising Technology Community Group.** 2 | 3 | ------- 4 | 5 | # Private Aggregation API explainer 6 | 7 | Author: Alex Turner (alexmt@chromium.org) 8 | 9 | ### Table of Contents 10 | 11 | - [Introduction](#introduction) 12 | - [Examples](#examples) 13 | - [Protected Audience reporting](#protected-audience-reporting) 14 | - [Measuring user demographics with cross-site information](#measuring-user-demographics-with-cross-site-information) 15 | - [Cross-site reach measurement](#cross-site-reach-measurement) 16 | - [K+ frequency measurement](#k-frequency-measurement) 17 | - [Goals](#goals) 18 | - [Non-goals](#non-goals) 19 | - [Operations](#operations) 20 | - [Reports](#reports) 21 | - [Temporary debugging mechanism](#temporary-debugging-mechanism) 22 | - [Enabling](#enabling) 23 | - [Debug keys](#debug-keys) 24 | - [Duplicate debug report](#duplicate-debug-report) 25 | - [Reducing volume by batching](#reducing-volume-by-batching) 26 | - [Batching scope](#batching-scope) 27 | - [Limiting the number of contributions per report](#limiting-the-number-of-contributions-per-report) 28 | - [Padding](#padding) 29 | - [Aggregation coordinator choice](#aggregation-coordinator-choice) 30 | - [Privacy and security](#privacy-and-security) 31 | - [Metadata readable by the reporting origin](#metadata-readable-by-the-reporting-origin) 32 | - [Open question: what metadata to allow](#open-question-what-metadata-to-allow) 33 | - [Contribution bounding and budgeting](#contribution-bounding-and-budgeting) 34 | - [Scaling values](#scaling-values) 35 | - [Examples](#examples-1) 36 | - [Partition choice](#partition-choice) 37 | - [Implementation plan](#implementation-plan) 38 | - [Enrollment and attestation](#enrollment-and-attestation) 39 | - [Future Iterations](#future-iterations) 40 | - [Supporting different aggregation modes](#supporting-different-aggregation-modes) 41 | - [Shared contribution budget](#shared-contribution-budget) 42 | - 
[Authentication and data integrity](#authentication-and-data-integrity)
43 |   - [Aggregate error reporting](#aggregate-error-reporting)
44 | 
45 | ## Introduction
46 | 
47 | This proposal introduces a generic mechanism for measuring aggregate, cross-site
48 | data in a privacy-preserving manner.
49 | 
50 | Browsers are now working to prevent cross-site user tracking, including by
51 | [partitioning storage and removing third-party
52 | cookies](https://blog.chromium.org/2020/01/building-more-private-web-path-towards.html).
53 | A range of API proposals aim to continue supporting legitimate use cases
54 | in a way that respects user privacy. Many of these proposals, including [Shared
55 | Storage](https://github.com/WICG/shared-storage) and [Protected
56 | Audience](https://github.com/WICG/turtledove), plan to isolate potentially
57 | identifying cross-site data in special contexts, which ensures that the data
58 | cannot escape the user agent.
59 | 
60 | Relative to cross-site data from an individual user, aggregate data about groups
61 | of users can be less sensitive and yet would be sufficient for a wide range of
62 | use cases. An [aggregation
63 | service](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATION_SERVICE_TEE.md)
64 | has been proposed to allow reporting noisy, aggregated cross-site data. This
65 | service was originally proposed for use by the [Attribution Reporting
66 | API](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md),
67 | but allowing more general aggregation would support additional use cases. In
68 | particular, the [Protected Audience](https://github.com/WICG/turtledove)
69 | and [Shared Storage](https://github.com/WICG/shared-storage)
70 | proposals expect this functionality to become available.
71 | 72 | So, to complement the Attribution Reporting API, we propose a general-purpose 73 | Private Aggregation API that can be called from a wide array of contexts, 74 | including isolated contexts that have access to cross-site data (such as a 75 | shared storage worklet). Within these contexts, potentially identifying data 76 | could be encapsulated into "aggregatable reports". To prevent leakage, the 77 | cross-site data in these reports would be encrypted to ensure it can only be 78 | processed by the aggregation service. During processing, this service will add 79 | noise and impose limits on how many queries can be performed. 80 | 81 | This API introduces a `contributeToHistogram()` function; see 82 | [examples](#examples) below. This call registers a histogram contribution for 83 | reporting. Later, the browser constructs an aggregatable report, which contains 84 | an encrypted payload with the specified contribution(s) for later computation 85 | via the aggregation service. 86 | The API queues the constructed report to be sent to the reporting endpoint of 87 | the script's origin (in other words, the reporting origin) after a delay. The 88 | report format and endpoint paths are detailed [below](#reports). 89 | After the endpoint receives the reports, it batches the reports and sends them 90 | to the aggregation service for processing. The output of that process is a 91 | summary report containing the (approximate) result, which is dispatched back to 92 | the reporting origin. 93 | 94 | See [the Private Aggregation API specification](https://patcg-individual-drafts.github.io/private-aggregation-api/). 95 | 96 | ## Examples 97 | 98 | ### Protected Audience reporting 99 | 100 | The [Protected 101 | Audience](https://github.com/WICG/turtledove/blob/main/FLEDGE.md#5-event-level-reporting-for-now) 102 | API plans to run on-device ad 103 | auctions using cross-site data as an input. 
The Private Aggregation API will
104 | allow measurement of the auction results from within the isolated execution
105 | environments.
106 | 
107 | For example, a key measurement use case is to report the price of the auctions'
108 | winning bids. This tells the seller how much they should be paid and who should
109 | pay them. To support this, each seller's JavaScript could define a
110 | `reportResult()` function. For example:
111 | 
112 | ```javascript
113 | function reportResult(auctionConfig, browserSignals) {
114 |   // Helper functions that map each buyer to its predetermined bucket and scale
115 |   // each bid appropriately for measurement; see scaling values below.
116 |   function convertBuyerToBucketId(buyer_origin) { … }
117 |   function convertBidToReportingValue(winning_bid_price) { … }
118 | 
119 |   // The user agent sends the report to the reporting endpoint of the script's
120 |   // origin (that is, the caller of `runAdAuction()`) after a delay.
121 |   privateAggregation.contributeToHistogram({
122 |     // Note: the bucket must be a BigInt and the value an integer Number.
123 |     bucket: convertBuyerToBucketId(browserSignals.interestGroupOwner),
124 |     value: convertBidToReportingValue(browserSignals.bid)
125 |   });
126 | }
127 | ```
128 | 
129 | The buyer can make their own measurements, which could be used to verify the
130 | seller's information. To support this, each buyer's JavaScript would define a
131 | `reportWin()` function (and possibly also a `reportLoss()` function). For
132 | example:
133 | 
134 | ```javascript
135 | function reportWin(auctionSignals, perBuyerSignals, sellerSignals, browserSignals) {
136 |   // The buyer defines their own similar functions.
137 |   function convertSellerToBucketId(seller_origin) { … }
138 |   function convertBidToReportingValue(winning_bid_price) { … }
139 | 
140 |   privateAggregation.contributeToHistogram({
141 |     bucket: convertSellerToBucketId(browserSignals.seller),
142 |     value: convertBidToReportingValue(browserSignals.bid),
143 |   });
144 | }
145 | ```
146 | 
147 | ### Measuring user demographics with cross-site information
148 | 
149 | `publisher.example` wants to measure the demographics of its user base, for
150 | example, a histogram of the number of users split by age range. `demo.example` is a
151 | popular site that knows the demographics of its users. `publisher.example`
152 | embeds `demo.example` as a third party, allowing it to measure the demographics
153 | of the overlapping users.
154 | 
155 | First, `demo.example` saves these demographics to its shared storage when it is
156 | the top-level site:
157 | 
158 | ```javascript
159 | sharedStorage.set("demo", '{"age": "40-49", ...}');
160 | ```
161 | 
162 | Then, in a `demo.example` iframe on `publisher.example`, the appropriate shared
163 | storage operation is triggered once for each user:
164 | 
165 | ```javascript
166 | await sharedStorage.worklet.addModule("measure-demo.js");
167 | await sharedStorage.run("send-demo-report");
168 | ```
169 | 
170 | Shared storage worklet script (i.e. `measure-demo.js`):
171 | 
172 | ```javascript
173 | class SendDemoReportOperation {
174 |   async run() {
175 |     let demo_string = await sharedStorage.get("demo");
176 |     let demo = {};
177 |     if (demo_string) {
178 |       demo = JSON.parse(demo_string);
179 |     }
180 | 
181 |     // Helper function that maps each age range to its predetermined bucket, or
182 |     // a special unknown bucket, e.g. if the user has not visited `demo.example`.
183 |     function convertAgeToBucketId(age_string_or_undefined) { … }
184 | 
185 |     // The report will be sent to `demo.example`'s reporting endpoint after a
186 |     // delay.
187 |     privateAggregation.contributeToHistogram({
188 |       bucket: convertAgeToBucketId(demo["age"]),
189 |       value: 128, // A predetermined fixed value; see scaling values below.
190 |     });
191 | 
192 |     // Could add more contributeToHistogram() calls to measure other demographics.
193 |   }
194 | }
195 | register("send-demo-report", SendDemoReportOperation);
196 | ```
197 | 
198 | ### Cross-site reach measurement
199 | 
200 | Measuring the number of users that have seen an ad.
201 | 
202 | In the ad’s iframe:
203 | 
204 | ```js
205 | await window.sharedStorage.worklet.addModule('reach.js');
206 | await window.sharedStorage.run('send-reach-report', {
207 |   // optional one-time context
208 |   data: { campaignId: '1234' },
209 | });
210 | ```
211 | 
212 | Worklet script (i.e. `reach.js`):
213 | 
214 | ```js
215 | class SendReachReportOperation {
216 |   async run(data) {
217 |     const reportSentForCampaign = `report-sent-${data.campaignId}`;
218 | 
219 |     // Compute reach only for users who haven't previously had a report sent for
220 |     // this campaign. Users who had a report for this campaign triggered by a
221 |     // site other than the current one will be skipped.
222 |     if (await sharedStorage.get(reportSentForCampaign) === 'yes') {
223 |       return; // Don't send a report.
224 |     }
225 | 
226 |     // The user agent will send the report to a default endpoint after a delay.
227 |     privateAggregation.contributeToHistogram({
228 |       bucket: BigInt(data.campaignId), // The bucket must be a BigInt.
229 |       value: 128, // A predetermined fixed value; see Scaling values.
230 |     });
231 | 
232 |     await sharedStorage.set(reportSentForCampaign, 'yes');
233 |   }
234 | }
235 | register('send-reach-report', SendReachReportOperation);
236 | ```
237 | 
238 | ### _K_+ frequency measurement
239 | 
240 | By instead maintaining a counter in shared storage, the approach for cross-site
241 | reach measurement could be extended to _K_+ frequency measurement, i.e.
242 | measuring the number of users who have seen _K_ or more ads on a given browser,
243 | for a pre-chosen value of _K_. A unary counter can be maintained by calling
244 | `window.sharedStorage.append("freq", "1")` on each ad view. Then, the
245 | `send-reach-report` operation would only send a report if there are at least
246 | _K_ characters stored at the key `"freq"`. This counter could also be used to
247 | filter out ads that have been shown too frequently.
248 | 
249 | ## Goals
250 | 
251 | This API aims to support a wide range of aggregation use cases, including
252 | measurement of demographics and reach, while remaining generic and flexible. We
253 | intend for this API to be callable in as many contexts and situations as
254 | possible, including the isolated contexts used by other privacy-preserving API
255 | proposals for processing cross-site data. This will help to foster continued
256 | growth, experimentation, and rapid iteration in the web ecosystem; to support a
257 | thriving, open web; and to prevent ossification and unnecessary rigidity.
258 | 
259 | This API also intends to avoid the privacy risks presented by unpartitioned
260 | storage and third-party cookies. In particular, it seeks to prevent off-browser
261 | cross-site recognition of users. Developer adoption of this API will help to
262 | replace the usage of third-party cookies, making the web more private by
263 | default.
264 | 
265 | ### Non-goals
266 | 
267 | This API does not intend to regulate what data is allowed as an input to
268 | aggregation. Instead, the aggregation service will protect this input by adding
269 | noise to limit the impact of any individual's input data on the output. Learn
270 | more about [contribution bounding and
271 | budgeting](#contribution-bounding-and-budgeting) below.
272 | 
273 | Further, this API does not seek to prevent (probabilistic) cross-site inference
274 | about sufficiently large groups of people.
That is, learning high-confidence
275 | properties of large groups is ok as long as we can bound how much an individual
276 | affects the aggregate measurement. See also [discussion of this non-goal in
277 | other
278 | settings](https://differentialprivacy.org/inference-is-not-a-privacy-violation/).
279 | 
280 | ## Operations
281 | 
282 | The current design supports one operation: constructing a histogram. This
283 | operation matches the description in the [Attribution Reporting API with
284 | Aggregatable Reports
285 | explainer](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#two-party-flow),
286 | with a fixed domain of 'buckets' that the reports contribute bounded integer
287 | 'values' to. Note that sums can be computed using the histogram operation by
288 | contributing values to a fixed, predetermined bucket and ignoring the returned
289 | values for all other buckets after querying.
290 | 
291 | Over time, we should be able to support additional operations by extending the
292 | aggregation service infrastructure. For example, we could add a 'count
293 | distinct' operation that, like the histogram operation, uses a fixed domain of
294 | buckets, but without any values. Instead, the computed result would be
295 | (approximately) how many _unique_ buckets the reports contributed to. Other
296 | possible additions include supporting federated learning via privately
297 | aggregating machine learning update vectors or extending the histogram operation
298 | to support values that are vectors of integers rather than only scalars.
299 | 
300 | The operation would be indicated by using the appropriate JavaScript call, e.g.
301 | `contributeToHistogram()` and `contributeToDistinctCount()` for histograms and
302 | count distinct, respectively.
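The note above about computing sums via the histogram operation can be made concrete. The following is an illustrative, noiseless simulation in plain JavaScript, not the real aggregation service; `SUM_BUCKET`, `makeContribution`, and `aggregate` are hypothetical names introduced only for this sketch:

```javascript
// Illustrative only: a noiseless simulation of server-side aggregation,
// showing how a sum falls out of the histogram operation. The real
// computation happens inside the aggregation service, which also adds noise.
const SUM_BUCKET = 42n; // A fixed, predetermined bucket (hypothetical choice).

// Each contributeToHistogram() call made for a sum conceptually emits one of these.
function makeContribution(value) {
  return { bucket: SUM_BUCKET, value };
}

// Sum the values per bucket, as the histogram operation would (minus noise).
function aggregate(contributions) {
  const histogram = new Map();
  for (const { bucket, value } of contributions) {
    histogram.set(bucket, (histogram.get(bucket) ?? 0) + value);
  }
  return histogram;
}

const histogram = aggregate([makeContribution(3), makeContribution(5), makeContribution(7)]);
// Reading the fixed bucket (and ignoring all others) yields the sum.
const sum = histogram.get(SUM_BUCKET); // 15
```

Because every contribution targets the same predetermined bucket, the queried value for that bucket is simply the (noisy) sum of all contributed values.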
303 | 304 | ## Reports 305 | 306 | The report, including the payload, will mirror the [structure proposed for the Attribution Reporting 307 | API](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#aggregatable-reports). 308 | However, a few details will change. For example, fields with no equivalent 309 | on this API (e.g. `attribution_destination` and `source_registration_time`) 310 | won't be present. Additionally, the `api` field will contain either 311 | `"protected-audience"` 312 | or `"shared-storage"` to reflect which API's context requested the report. 313 | 314 | The following is an example report showing the JSON format 315 | ```jsonc 316 | { 317 | "shared_info": "{\"api\":\"protected-audience\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}", 318 | 319 | "aggregation_service_payloads": [ 320 | { 321 | "payload": "[base64-encoded data]", 322 | "key_id": "[string]", 323 | 324 | // Optional debugging information if debugging is enabled 325 | "debug_cleartext_payload": "[base64-encoded unencrypted payload]", 326 | } 327 | ], 328 | 329 | // Optional debugging information if debugging is enabled and debug key specified 330 | "debug_key": "[64 bit unsigned integer]" 331 | } 332 | ``` 333 | 334 | As described earlier, these reports will be sent to the reporting origin after a 335 | delay. The URL path used for sending the reports will be 336 | `/.well-known/private-aggregation/report-protected-audience` and 337 | `/.well-known/private-aggregation/report-shared-storage` for reports triggered 338 | within a Protected Audience or Shared Storage context, respectively. 339 | 340 | ### Temporary debugging mechanism 341 | 342 | _While third-party cookies are still available_, we plan to have a temporary 343 | mechanism available that allows for easier debugging. 
This mechanism involves
344 | temporarily relaxing some privacy constraints. It will help ensure that the API
345 | can be fully understood during roll-out, help flush out any bugs (either in
346 | browser or caller code), and allow easier comparison of performance against
347 | cookie-based alternatives.
348 | 
349 | This mechanism is similar to the Attribution Reporting API's [debug aggregatable
350 | reports](https://github.com/WICG/attribution-reporting-api/blob/main/EVENT.md#optional-extended-debugging-reports).
351 | When debug mode is enabled for a report, a cleartext version of the payload
352 | will be included in the report. Additionally, the `shared_info` will also include
353 | the flag `"debug_mode": "enabled"` to allow the aggregation service to support
354 | debugging functionality on these reports.
355 | 
356 | This data will only be available in a transitional phase while third-party
357 | cookies are available and are already capable of user tracking. The debug mode
358 | will only be enabled if the context is able to access third-party cookies and
359 | the browser has third-party cookies generally enabled. That is, it will be
360 | disabled if third-party cookies are disabled/deprecated generally or for a
361 | particular site/context. Note that if third-party cookies are generally
362 | disabled, but enabled for a particular site, debug mode will not be enabled for
363 | that site, to protect data saved from other sites.
364 | 
365 | Though the debug mode is tied to third-party cookie availability, browsers may
366 | temporarily allow debug mode without third-party cookies in order to support
367 | testing, such as the browsers in the [Mode
368 | B](https://developers.google.com/privacy-sandbox/setup/web/chrome-facilitated-testing#mode_b_disable_1_of_third-party_cookies)
369 | group of Chrome-facilitated testing.
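For illustration, a debug-enabled report might look like the following sketch, based on the example report format shown earlier; all bracketed values are placeholders. Note the `"debug_mode": "enabled"` flag inside `shared_info` and the `debug_cleartext_payload` field:

```jsonc
{
  // shared_info now carries the "debug_mode":"enabled" flag.
  "shared_info": "{\"api\":\"shared-storage\",\"debug_mode\":\"enabled\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",

  "aggregation_service_payloads": [
    {
      "payload": "[base64-encoded data]",
      "key_id": "[string]",
      "debug_cleartext_payload": "[base64-encoded unencrypted payload]"
    }
  ],

  "debug_key": "[64 bit unsigned integer]"
}
```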
370 | 
371 | #### Enabling
372 | 
373 | The following JavaScript call enables debug mode for all future reports
374 | requested in that context (e.g. a Shared Storage operation or Protected Audience
375 | function call):
376 | ```js
377 | privateAggregation.enableDebugMode();
378 | ```
379 | The browser can optionally apply debug mode to reports requested earlier in that
380 | context.
381 | 
382 | This JavaScript function can only be called once per context. Any subsequent
383 | calls will throw an exception.
384 | 
385 | #### Debug keys
386 | To allow sites to associate reports with the contexts that triggered them, we
387 | also allow setting 64-bit unsigned integer debug keys. These keys are passed as
388 | an optional field to the JavaScript call, for example:
389 | ```js
390 | privateAggregation.enableDebugMode({debugKey: 1234n});
391 | ```
392 | 
393 | #### Duplicate debug report
394 | 
395 | When debug mode is enabled, an additional, duplicate debug report will be sent
396 | immediately (i.e. without the random delay) to a separate debug endpoint. This
397 | endpoint will use a path like
398 | `/.well-known/private-aggregation/debug/report-protected-audience` (and the
399 | equivalent for Shared Storage).
400 | 
401 | The debug reports should be almost identical to the normal reports, including
402 | the additional debug fields. However, the payload ciphertext will differ due to
403 | repeating the encryption operation, and the `key_id` may differ if the previous
404 | key has since expired or the browser randomly chose a different valid public
405 | key.
406 | 
407 | ### Reducing volume by batching
408 | 
409 | In the case of multiple calls to `contributeToHistogram()`, we can reduce report
410 | volume by sending a single report with multiple contributions instead of
411 | multiple reports. For this to be possible, the different calls must involve the
412 | same reporting origin and the same API (i.e. Protected Audience or Shared
413 | Storage).
414 | Additionally, the calls must be made at a similar time, as the reporting time
415 | will necessarily be shared.
416 | 
417 | #### Batching scope
418 | 
419 | For calls within a Shared Storage worklet, calls within the same Shared Storage
420 | operation should be batched together.
421 | 
422 | For calls within a Protected Audience worklet, calls using the same reporting
423 | origin within the same auction should be batched together. This should happen
424 | even between different interest groups or Protected Audience function calls.
425 | However, reports triggered via `window.fence.reportEvent()` (see
426 | [here](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-bidding-data-associated-with-an-event-in-a-frame)
427 | for more detail) should only be batched per-event. This avoids excessive delay
428 | if this event is triggered substantially later. Reports for Protected Audience
429 | bidders may not share the same [aggregation coordinator
430 | choice](#aggregation-coordinator-choice). In this case, calls should be batched
431 | separately for the different coordinator choices.
432 | 
433 | One consideration in the short term is that these calls may have different
434 | associated [debug modes or keys](#temporary-debugging-mechanism). In this case,
435 | only calls sharing those details should be batched together.
436 | 
437 | #### Limiting the number of contributions per report
438 | 
439 | We will also need a limit on the number of contributions within a single report.
440 | In the case that too many contributions are specified within a ‘batching scope’,
441 | we should truncate them to the limit. To reduce the impact of this limit, we
442 | will merge any contributions that have the same bucket and [filtering
443 | ID](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md#proposal-filtering-id-in-the-encrypted-payload)
444 | before truncation.
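The merge-then-truncate behavior described above can be sketched as follows. This is an illustrative sketch only; the `{bucket, value, filteringId}` contribution shape and the `mergeAndTruncate` helper are assumptions for illustration, not the wire format:

```javascript
// Illustrative sketch: merge contributions that share a bucket and filtering
// ID, then truncate to the per-report contribution limit.
function mergeAndTruncate(contributions, maxPerReport) {
  const merged = new Map();
  for (const { bucket, value, filteringId } of contributions) {
    const key = `${bucket}:${filteringId}`;
    const existing = merged.get(key);
    if (existing) {
      existing.value += value; // Same bucket and filtering ID: merge values.
    } else {
      merged.set(key, { bucket, value, filteringId });
    }
  }
  // Truncate only after merging, so fewer real contributions are lost.
  return [...merged.values()].slice(0, maxPerReport);
}

const result = mergeAndTruncate(
  [
    { bucket: 1n, value: 2, filteringId: 0 },
    { bucket: 1n, value: 3, filteringId: 0 }, // merges with the first entry
    { bucket: 2n, value: 5, filteringId: 0 },
    { bucket: 3n, value: 7, filteringId: 0 }, // truncated if the limit is 2
  ],
  2);
// result: [{ bucket: 1n, value: 5, ... }, { bucket: 2n, value: 5, ... }]
```

Merging first means a caller who accidentally splits one logical contribution across several calls is not penalized by the truncation step.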
445 | 446 | Although larger reports have higher utility, they are also more expensive for 447 | the aggregation service to process. To accommodate use cases with diverse 448 | utility requirements and cost tolerances, we will attempt to select reasonable 449 | defaults with optional overrides: 450 | 451 | - *Default limits:* The default limit may depend on the identity of the calling 452 | API. In particular, Protected Audience reports may benefit from a higher limit 453 | more than Shared Storage reports. Our implementation plan is to set the 454 | default limit at 20 contributions per report for Shared Storage and 100 455 | contributions per report for Protected Audience. 456 | 457 | - *Per-context limits:* Callers may request a different limit on each isolated 458 | context they create. Since this affects the payload size, the requested limit 459 | must be specified from outside an isolated context. Consequently, Protected 460 | Audience buyers cannot set per-context limits. The browser must clamp 461 | excessively large values to some maximum value. Our implementation plan is to 462 | clamp the requested limit to a maximum of 1000 contributions per report. 463 | 464 | - *Global config:* A more complex design that enables sites to configure a 465 | global limit may also be possible, but it requires further analysis. (See 466 | [issue #81].) 467 | 468 | [issue #81]: https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81 469 | 470 | #### Padding 471 | 472 | The size of the encrypted payload may reveal information about the number of 473 | contributions embedded in the aggregatable report. This can be mitigated by 474 | padding the plaintext payload (e.g. to a fixed size). In the shorter term, we 475 | plan to pad the payload by adding 'null' contributions (i.e. with value 0) to 476 | a fixed length. In the future, we plan to instead append bytes to a fixed 477 | length, but this will require updating the payload version. 
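The shorter-term padding plan described above can be sketched as follows; `padContributions` is a hypothetical helper name, and the fixed length is an arbitrary choice for illustration:

```javascript
// Illustrative sketch: pad the payload's contribution list with 'null'
// contributions (value 0) to a fixed length, so the encrypted payload size
// does not reveal how many real contributions the report carries.
function padContributions(contributions, fixedLength) {
  const padded = contributions.slice(0, fixedLength);
  while (padded.length < fixedLength) {
    padded.push({ bucket: 0n, value: 0 }); // Null contribution.
  }
  return padded;
}

const padded = padContributions([{ bucket: 5n, value: 128 }], 20);
// padded.length is 20 no matter how many real contributions were made.
```

Null contributions are safe to include because a zero-valued contribution does not change any bucket's aggregated value.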
478 | 479 | ### Aggregation coordinator choice 480 | 481 | This API should support multiple deployment options for the aggregation service, 482 | e.g. deployments on [different cloud 483 | providers](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#initial-experiment-plans). 484 | To avoid a leak, this choice should not be possible from within an isolated 485 | context when a [context 486 | ID](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#shared-storage) 487 | is set. 488 | 489 | We propose a new optional string field `aggregationCoordinatorOrigin` to allow 490 | developers to specify the deployment option, e.g. the origin for the aggregation 491 | service deployed on AWS, GCP, and other platforms in the future. The specified 492 | origin would need to be on an allowlist maintained by the browser. If none is 493 | specified, a default will be used. 494 | 495 | This allowlist matches the Attribution Reporting API's, available 496 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/aggregation_coordinator_origin_allowlist.md). 497 | 498 | Shared Storage callers would specify this field when calling the `run()` or 499 | `selectURL()` APIs, e.g. 500 | 501 | ```js 502 | sharedStorage.run('someOperation', {'privateAggregationConfig': 503 | {'aggregationCoordinatorOrigin': 'https://coordinator.example'}}); 504 | ``` 505 | 506 | Note that we are reusing the `privateAggregationConfig` that currently allows 507 | for specifying a context ID (see 508 | [here](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md)). 509 | 510 | Protected Audience sellers would specify this field in the `auctionConfig`, e.g. 511 | 512 | ```js 513 | const myAuctionConfig = { 514 | ... 
515 |   'privateAggregationConfig': {
516 |     'aggregationCoordinatorOrigin': 'https://coordinator.example'
517 |   }
518 | };
519 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig);
520 | ```
521 | 
522 | For Protected Audience bidders, we plan to allow this choice to be set for each
523 | interest group via `navigator.joinAdInterestGroup()`, e.g.
524 | 
525 | ```js
526 | const interestGroup = {
527 |   ...
528 |   'privateAggregationConfig': {
529 |     'aggregationCoordinatorOrigin': 'https://coordinator.example'
530 |   }
531 | };
532 | await navigator.joinAdInterestGroup(interestGroup);
533 | ```
534 | This setting could be overridden via the typical interest group
535 | mechanisms. For example, the update mechanism could support a new
536 | `privateAggregationConfig` key matching the call to `joinAdInterestGroup()`.
537 | 
538 | ## Privacy and security
539 | 
540 | ### Metadata readable by the reporting origin
541 | 
542 | Reports will, by default, come with a variety of (unencrypted) metadata that the
543 | reporting origin will be able to directly read. While this metadata can be
544 | useful, we must be careful to balance the impact on privacy. Here are some
545 | examples of metadata that could be included, along with some potential risks:
546 | 
547 | - The originally scheduled reporting time (noised within an ~hour granularity)
548 |   - Could be used to identify users on the reporting site within a time window
549 |   - Note that combining this with the actual timestamp the report was received
550 |     could reveal if the user's device was offline, etc.
551 | - The reporting origin
552 |   - Determined by the execution context's origin, but a site could use different
553 |     subdomains, e.g. to separate use cases.
554 | - Which API triggered the report
555 |   - The `api` field and the endpoint path indicate which API triggered the
556 |     report (e.g. Protected Audience or Shared Storage).
557 | - The API version 558 | - A version string used to allow future incompatible changes to the API. This 559 | should usually correspond to the browser version and should not change 560 | often. 561 | - Encrypted payload sizes 562 | - If we do not carefully add padding or enforce that all reports are of the 563 | same natural size, this may expose some information about the contents of 564 | the report. 565 | - Debugging information 566 | - If [debugging](#temporary-debug-mechanism) is enabled, some additional 567 | metadata will be provided. While this information may be potentially 568 | identifying, it will only be available temporarily: while third-party 569 | cookies are enabled. 570 | - Developer-selected metadata 571 | - See [open question: what metadata to 572 | allow](#open-question-what-metadata-to-allow). 573 | 574 | #### Open question: what metadata to allow 575 | 576 | It remains an open question what metadata should be included or allowed in the 577 | report and how that metadata could be selected or configured. Note that any 578 | variation in the reporting endpoint (such as the URL path) would, for this 579 | analysis, be equivalent to including the selected endpoint as additional 580 | metadata. 581 | 582 | While allowing a developer to specify arbitrary metadata from an isolated 583 | context would negate the privacy goals of the API, specifying a report's 584 | metadata from a non-isolated context (e.g. a main document) may be less 585 | worrisome. This could improve the API's utility and flexibility. For example, 586 | allowing this might simplify usage for a single origin using the API for 587 | different use cases. This non-isolated metadata selection could also allow for 588 | first-party trust signals to be associated with a report. 589 | 590 | Alternatively, there may be ways to "noise" the metadata to achieve 591 | differential privacy. Further study and consideration are needed here.
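To ground the discussion above, the cleartext portion of an aggregatable report surfaces this metadata roughly as follows (the values are placeholders and the exact field set here is illustrative, not normative):

```jsonc
{
  // The shared_info string carries the API, version, reporting origin and
  // (coarsened) scheduled report time discussed above.
  "shared_info": "{\"api\":\"shared-storage\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporting.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",
  "aggregation_service_payloads": [
    {
      "key_id": "[string identifying the public key used]",
      "payload": "[base64-encoded encrypted data]"
    }
  ]
}
```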
592 | 593 | ### Contribution bounding and budgeting 594 | 595 | As described above, the aggregation service protects user privacy by adding 596 | noise, aiming to have a framework that could support differential privacy. 597 | However, simply protecting each _query_ to the aggregation service or each 598 | _report_ sent from a user agent would be vulnerable to an adversary that repeats 599 | queries or issues multiple reports, and combines the results. Instead, we 600 | propose the following structure. See [below](#implementation-plan) for the 601 | specific choices we have made in our current implementation. 602 | 603 | First, each user agent will limit the contribution that it could make to the 604 | outputs of aggregation service queries. (Note that this limitation is a rate 605 | over time, not an absolute number; see [Partition choice](#partition-choice) 606 | below.) In the case of a histogram operation, the user agent could bound the 607 | L<sub>1</sub> norm of the _values_, i.e. the sum of all the contributions across 608 | all buckets. The user agent could consider other bounds, e.g. the L<sub>2</sub> 609 | norm. 610 | 611 | The user agent would also need to determine what the appropriate 612 | 'partition' is for this budgeting; see [partition choice](#partition-choice) 613 | below. For example, there could be a separate L<sub>1</sub> 'budget' for each 614 | origin, resetting periodically. Exceeding these limits will cause future 615 | contributions to be silently dropped. 616 | 617 | Second, the server-side processing will limit the number of queries that can be 618 | performed on reports containing the same 'shared ID' to a small number (e.g. a 619 | single query). See 620 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches) 621 | for more detail. 622 | This also limits the number of queries that can contain the same report. The 623 | shared ID is a hash representing the partition.
It is computed by the aggregation 624 | service using data available in each aggregatable report. Note that the hash's 625 | input data will need to differ from the Attribution Reporting API (e.g. to 626 | exclude fields like the destination site that don't exist in this API). 627 | 628 | With the above restrictions, the processing servers only need to sample the 629 | noise for each query from a fixed distribution. In principle, this fixed noise 630 | could be used to achieve differential privacy, e.g. by using Laplace noise with 631 | the following parameter: (max number of queries per report) \* (max 632 | L<sub>1</sub> per user per partition) / epsilon. 633 | 634 | #### Scaling values 635 | 636 | Developers will need to choose an appropriate scale for their measurements. In 637 | other words, they will likely want to multiply their values by a fixed, 638 | predetermined constant. 639 | 640 | Scaling the values up, i.e. choosing a larger constant, will reduce the 641 | _relative_ noise added by the server (as the noise has constant magnitude). 642 | However, this will also cause the limit on the L<sub>1</sub> norm of the values 643 | contributed to reports, i.e. the sum of all contributions across all buckets, to 644 | be reached faster. Recall that no more reports can be sent after depleting the 645 | budget. 646 | 647 | Scaling the values down, i.e. choosing a smaller constant, will increase the 648 | relative noise, but would also reduce the risk of reaching the budget limit. 649 | Developers will have to balance these considerations to choose the appropriate 650 | scale. The examples below explore this in more detail. 651 | 652 | #### Examples 653 | 654 | These examples use an L<sub>1</sub> bound of 2<sup>16</sup> = 65 536. 655 | 656 | Let's consider a basic measurement case: a binary histogram of counts. For 657 | example, using bucket 0 to indicate a user is a member of some group and bucket 658 | 1 to indicate they are not.
Suppose that we don't want to measure anything 659 | else and we've set up our measurement so that each user is only measured once 660 | (per partition per time period). Then, each user could contribute their full limit 661 | (i.e. 2<sup>16</sup>) to the appropriate bucket. After all the reports for all 662 | users are collected, a single query would be performed and the server would add 663 | noise (from a fixed distribution) to each bucket. We would then divide the 664 | values by 2<sup>16</sup> to obtain a fairly precise result (with standard 665 | deviation 1/2<sup>16</sup> times that of the server's noise distribution). 666 | 667 | If each user had instead just contributed a value of 1, we wouldn't have to 668 | divide the query result by 2<sup>16</sup>. However, each user would end the week 669 | with the vast majority of their budget remaining -- and the processing servers 670 | would still add the same noise. So, our result would be much less precise (with 671 | standard deviation equal to that of the server's noise distribution). 672 | 673 | On the other hand, suppose we wanted to allow each user to report multiple times 674 | per time period to this same binary histogram. In this case, we would have to reduce 675 | each contribution from 2<sup>16</sup> to a lower predetermined value, say, 676 | 2<sup>12</sup>. Then, each user would be allowed to contribute up to 16 times to 677 | the histogram. Note that you have to scale down each contribution by the _worst 678 | case_ number of contributions per user. Otherwise, users contributing too much 679 | will have reports dropped. 680 | 681 | #### Partition choice 682 | 683 | A narrow partition (e.g. giving each top-level _URL_ a separate budget) may not 684 | sufficiently protect privacy. Unfortunately, very broad partitions (e.g. a 685 | single budget for the browser) may allow malicious (or simply greedy) actors to 686 | exhaust the budget, denying service to all others. 687 | 688 | The ergonomics of the partition should also be considered. 
Some choices might 689 | require coordination between different entities (e.g. different third parties on 690 | one site) or complex delegation mechanisms. Other choices would require complex 691 | accounting; for example, requiring [Shared 692 | storage](https://github.com/pythagoraskitty/shared-storage) to record the source 693 | of each piece of data that could have contributed (even indirectly) to a report. 694 | 695 | Note also that it is important to include a time component in the partition, 696 | e.g. resetting limits periodically. This does risk long-term information leakage 697 | from dedicated adversaries, but is essential for utility. Other options for 698 | recovering from an exhausted budget may be possible but need further 699 | exploration, e.g. allowing a site to clear its data to reset its budget. 700 | 701 | #### Implementation plan 702 | 703 | We plan to enforce a per-[site](https://web.dev/same-site-same-origin/#site) 704 | budget that resets every 10 minutes; that is, we will bound the contributions 705 | that any site can make to a histogram over any 10-minute rolling window. We plan 706 | to use an L<sub>1</sub> bound of 2<sup>16</sup> = 65 536 for this budget; this 707 | aligns with the [Attribution Reporting API with Aggregatable Reports 708 | explainer](https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#privacy-budgeting). 709 | 710 | As a backstop to limit worst-case leakage, we plan a separate, looser per-site 711 | bound that is enforced on a 24-hour rolling window, limiting the daily 712 | L<sub>1</sub> norm to 2<sup>20</sup> = 1 048 576. 713 | 714 | This site will match the site of the execution environment, i.e. the site of the 715 | reporting origin, no matter which top-level sites are involved. For the earlier 716 | [example](#examples), this would correspond to the `runAdAuction()` caller 717 | within `reportResult()` and the interest group owner within 718 | `reportWin()`/`reportLoss()`.
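The rolling-window budgeting described above can be modeled with a small sketch. This is a simplified illustration of the accounting only, not the browser's implementation; a real implementation would track the 10-minute and 24-hour windows per site simultaneously:

```js
// Simplified model of per-site contribution budgeting over a rolling window.
// For illustration only; not the browser's actual implementation.
class ContributionBudget {
  constructor(limit, windowMs) {
    this.limit = limit;        // e.g. 2 ** 16 for the 10-minute window
    this.windowMs = windowMs;  // e.g. 10 * 60 * 1000
    this.entries = [];         // pairs of [timestampMs, consumedValue]
  }

  // Sums the budget consumed inside the rolling window ending at `nowMs`;
  // older entries no longer count and are discarded.
  usedAt(nowMs) {
    this.entries = this.entries.filter(([t]) => t > nowMs - this.windowMs);
    return this.entries.reduce((sum, [, v]) => sum + v, 0);
  }

  // Attempts to consume `value` (the L1 weight of a report's contributions).
  // Returns false if the report would exceed the budget; it is then dropped.
  tryConsume(value, nowMs) {
    if (this.usedAt(nowMs) + value > this.limit) return false;
    this.entries.push([nowMs, value]);
    return true;
  }
}

const tenMinBudget = new ContributionBudget(2 ** 16, 10 * 60 * 1000);
```

For example, a site that consumes the full 2<sup>16</sup> at one moment cannot contribute again until those entries fall out of the 10-minute window.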
719 | 720 | We initially plan to have two separate budgets: one for calls within Shared 721 | Storage worklets and one for Protected Audience worklets. However, see [shared 722 | contribution budget](#shared-contribution-budget) below. 723 | 724 | ### Enrollment and attestation 725 | Use of this API requires 726 | [enrollment](https://github.com/privacysandbox/attestation/blob/main/how-to-enroll.md) 727 | and 728 | [attestation](https://github.com/privacysandbox/attestation/blob/main/README.md#core-privacy-attestations) 729 | via the [Privacy Sandbox enrollment attestation 730 | model](https://github.com/privacysandbox/attestation/blob/main/README.md). 731 | 732 | When an aggregatable report is triggered, a check will be performed to determine 733 | whether the calling 734 | [site](https://html.spec.whatwg.org/multipage/browsers.html#site) is enrolled and 735 | attested. If this check fails, the report will be dropped (i.e. not sent). 736 | 737 | ## Future Iterations 738 | 739 | ### Supporting different aggregation modes 740 | 741 | This API will support an optional parameter `alternativeAggregationMode` that 742 | accepts a string value. This parameter will allow developers to choose among 743 | different options for aggregation infrastructure supported by the user agent. 744 | This will allow experimentation with new technologies, and allow us to test new 745 | approaches without removing core functionality provided by the default option. 746 | The `"experimental-poplar"` option will implement a protocol similar to the 747 | [poplar](https://github.com/cjpatton/vdaf/blob/main/draft-patton-cfrg-vdaf.md#poplar1-poplar1) 748 | VDAF in the [PPM 749 | Framework](https://datatracker.ietf.org/doc/draft-gpew-priv-ppm/).
750 | 751 | ### Shared contribution budget 752 | 753 | Separating contribution budgets for Shared Storage worklets and Protected 754 | Audience worklets 755 | provides additional flexibility; for example, some partition choices may not be 756 | compatible (e.g. a per-interest group budget). However, we could consider 757 | merging the two budgets in the future. 758 | 759 | ### Authentication and data integrity 760 | 761 | To ensure the integrity of the aggregated data, it may be desirable to support a 762 | mechanism for authentication. This would help limit the impact of reports sent 763 | from malicious or misbehaving clients on the results of each query. 764 | 765 | To ensure privacy, the reporting endpoint should be able to determine whether a 766 | report came from a trusted client without determining _which_ client sent it. We 767 | may be able to use [trust tokens](https://github.com/WICG/trust-token-api) for 768 | this, but further design work is required. 769 | 770 | ### Aggregate error reporting 771 | 772 | Unfortunately, errors that occur within isolated execution contexts cannot be 773 | easily reported (e.g. to a non-isolated document or over the network). If 774 | allowed, such errors could be used as an information channel. While these errors 775 | could still appear in the console, adding a mechanism for aggregate error 776 | reporting would also aid developers. This reporting could be automatic or 777 | could require configuration, according to the developers' preferences. 778 | 779 | ------ 780 | 781 | **This document is an individual draft proposal. 
It has not been adopted by the Private Advertising Technology Community Group.** 782 | -------------------------------------------------------------------------------- /error_reporting.md: -------------------------------------------------------------------------------- 1 | # Aggregate error reporting 2 | 3 | #### Table of contents 4 | 5 | - [Introduction](#introduction) 6 | - [Motivating use cases](#motivating-use-cases) 7 | - [Measuring reports dropped due to insufficient budget](#measuring-reports-dropped-due-to-insufficient-budget) 8 | - [Detecting unhandled crashes](#detecting-unhandled-crashes) 9 | - [Defining histogram contributions to send on error events](#defining-histogram-contributions-to-send-on-error-events) 10 | - [New JavaScript API surface](#new-javascript-api-surface) 11 | - [Error events](#error-events) 12 | - [Associating error events with a single interest group](#associating-error-events-with-a-single-interest-group) 13 | - [Measuring insufficient contribution budget](#measuring-insufficient-contribution-budget) 14 | - [Privacy and security](#privacy-and-security) 15 | - [Future iterations](#future-iterations) 16 | - [Global config](#global-config) 17 | - [Per-interest group error-event contributions](#per-interest-group-error-event-contributions) 18 | 19 | ## Introduction 20 | 21 | There are a range of error conditions that can be hit when using the Private 22 | Aggregation API. For example, the privacy budget could run out, preventing any 23 | further histogram contributions. As the errors themselves may be cross-site 24 | information, we cannot simply expose them to the page for users who have 25 | disabled third-party cookies. Instead, we propose a new aggregate, noised error 26 | reporting mechanism that leverages the existing reporting pipelines through the 27 | Aggregation Service. 
28 | 29 | We aim to allow developers to measure the frequency of certain 'error events' 30 | and to split these measurements on relevant developer-specified dimensions (e.g. 31 | version of deployed code). We also aim to support measuring certain error events 32 | in the Shared Storage and Protected Audience APIs that cannot be directly 33 | exposed due to similar cross-site information risks. We aim to support these use 34 | cases with minimal or no privacy regressions. 35 | 36 | The proposed mechanism limits privacy impacts by embedding these debugging 37 | measurements in the same aggregatable reports already used by the API. Ad techs 38 | will be able to avoid interfering with their existing measurements by using 39 | filtering IDs or separate bucket spaces. Note that we have also 40 | [proposed](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/named_budgets.md) 41 | a mechanism to reserve privacy budget for different types of contributions. This 42 | is necessary to allow for budget exhaustion measurement, but will support 43 | additional uses too. 44 | 45 | Note that the Attribution Reporting API (ARA) has introduced a similar feature 46 | called [Aggregate Debug 47 | Reporting](https://github.com/WICG/attribution-reporting-api/blob/main/aggregate_debug_reporting.md). 48 | This allows for measuring events like reaching rate limits. Our proposal has a 49 | few differences in design due to differences in our setting. 50 | 51 | ## Motivating use cases 52 | 53 | ### Measuring reports dropped due to insufficient budget 54 | 55 | Developers using the Private Aggregation API need to [choose an appropriate 56 | scale](https://github.com/patcg-individual-drafts/private-aggregation-api#scaling-values) 57 | for their measurements. This choice is a trade-off between the relative noise 58 | added by the aggregation service and the risk of dropped reports due to budget 59 | limits. 
Measuring the fraction of reports that are dropped due to insufficient 60 | budget would allow developers to better evaluate the impact of their scale 61 | choice. 62 | 63 | ### Detecting unhandled crashes 64 | 65 | Developers using Shared Storage or Protected Audience may accidentally ship code 66 | that crashes, i.e. by triggering a JavaScript exception that isn't caught. 67 | Measuring these situations directly would improve detection. Being able to split 68 | these measurements by relevant dimensions would also ease debugging. 69 | 70 | ## Defining histogram contributions to send on error events 71 | 72 | ### New JavaScript API surface 73 | 74 | To support measuring these error conditions, we propose extending the 75 | [existing](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-api-informal-specification) 76 | `contributeToHistogramOnEvent()` API, currently only exposed in Protected 77 | Audience script runners. For example: 78 | 79 | ```js 80 | privateAggregation.contributeToHistogramOnEvent( 81 | "reserved.uncaught-exception", { bucket: 123n, value: 45, filteringId: 6n }); 82 | ``` 83 | 84 | We would expand the existing list of `reserved.` events supported. We would also 85 | expose this API to Shared Storage, but without the ['filling 86 | in'](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#reporting-api-informal-specification) 87 | logic (i.e. without support for signalBuckets and signalValues). Note also that 88 | certain events would only be valid in one type of context; for example, the 89 | existing Protected Audience-specific 90 | [events](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md#triggering-reports) 91 | would not be exposed to Shared Storage callers. 92 | 93 | These 'conditional' histogram contributions would be scoped to that specific 94 | JavaScript context. 95 | 96 | This approach is flexible. 
For example, it allows callers to ignore error events 97 | that they are not interested in, or to have two different error events trigger the 98 | same contribution. However, it is also somewhat verbose, requiring these calls 99 | to be repeated for each JavaScript context. There are also certain error cases 100 | that cannot be measured with this approach. See [Future 101 | iterations](#future-iterations) below for some extensions that may 102 | address these issues. 103 | 104 | ### Error events 105 | 106 | We propose the following events, but this list could be expanded in the future. 107 | 108 | The following events would be available in both Shared Storage and Protected 109 | Audience contexts: 110 | 111 | - `reserved.report-success`: a report was scheduled and no contributions were 112 | dropped 113 | - `reserved.too-many-contributions`: a report was scheduled, but some 114 | contributions were dropped due to the per-report limit 115 | - `reserved.empty-report-dropped`: a report was not scheduled as it had no 116 | contributions 117 | - `reserved.too-many-pending-reports`: a report was not scheduled as there were 118 | already too many pending reports 119 | - `reserved.insufficient-budget-for-report`: a report was not scheduled as there 120 | was not enough budget 121 | - `reserved.uncaught-exception`: a JavaScript exception was thrown and not 122 | caught in this context 123 | 124 | The following event would only be available in Shared Storage contexts: 125 | 126 | - `reserved.contribution-timeout-reached`: the JavaScript context was still 127 | running when the contribution timeout occurred 128 | 129 | The following events would only be available in Protected Audience contexts: 130 | 131 | - The existing events (`reserved.win`, `reserved.loss`, `reserved.always`) 132 | - Possibly additional events for various network failures, see [Future 133 | iterations](#future-iterations) below.
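As a sketch of how a caller might register these events up front, the snippet below packs a hypothetical deployed-code-version dimension together with an error-type code into the 128-bit bucket space. The encoding scheme, the `CODE_VERSION` constant, and the contribution value of 1 are our own illustrative choices, not part of the API:

```js
// Illustrative bucket encoding: the high bits identify the deployed code
// version and the low bits identify the error event. CODE_VERSION and the
// encoding are hypothetical choices made for this example.
const CODE_VERSION = 7n;
const ERROR_CODES = {
  "reserved.report-success": 0n,
  "reserved.too-many-contributions": 1n,
  "reserved.insufficient-budget-for-report": 2n,
  "reserved.uncaught-exception": 3n,
};

function errorBucket(event) {
  // Buckets are 128-bit values, so there is room for both dimensions.
  return (CODE_VERSION << 64n) | ERROR_CODES[event];
}

// privateAggregation is only defined inside Shared Storage worklets and
// Protected Audience script runners; there, each error event of interest
// could be registered as a conditional contribution up front.
if (typeof privateAggregation !== "undefined") {
  for (const event of Object.keys(ERROR_CODES)) {
    privateAggregation.contributeToHistogramOnEvent(
        event, { bucket: errorBucket(event), value: 1 });
  }
}
```

Splitting the bucket space by code version like this supports the goal from the introduction of slicing error measurements on developer-specified dimensions.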
134 | 135 | Note that errors in the reporting pipeline caused by public key fetching failures 136 | or running out of retries are not exposed. These errors would be difficult to 137 | report on given that they indicate the reporting pipeline is not functioning 138 | properly; additionally, it would be difficult to perform budgeting for these 139 | cases given that these occur well after the original report was scheduled. 140 | 141 | #### Associating error events with a single interest group 142 | 143 | Contributions associated with different interest groups but with the same 144 | reporting origin are [batched 145 | together](https://github.com/patcg-individual-drafts/private-aggregation-api#batching-scope) 146 | into a single report. However, the number of these other interest groups is not 147 | revealed to Protected Audience contexts. If this single report encounters an 148 | error, we do not want to trigger duplicate contributions due to multiple 149 | contexts registering the same contributions. 150 | 151 | So, we propose that each reporting-related error should only trigger an error event 152 | for one of the contexts that contributed to the report. The browser should pick 153 | one at random. Note that this aligns with the 154 | [proposal](https://github.com/WICG/turtledove/issues/1170) to define a new 155 | `reserved.once` event. 156 | 157 | ### Measuring insufficient contribution budget 158 | 159 | To measure the error events representing insufficient [contribution 160 | budget](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting), 161 | some of the budget for histogram contributions that measure this error event 162 | must be reserved. Otherwise, those contributions would likely also be dropped 163 | due to the budget limit.
This “named budget” functionality also supports other 164 | use cases and has been proposed in [a separate 165 | explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/named_budgets.md). 166 | 167 | ## Privacy and security 168 | 169 | This feature does not introduce any new privacy or security considerations as it 170 | only allows you to send histogram contributions that would've been permitted 171 | without the new feature. (That is, the only change is allowing these 172 | contributions to be conditional on an error event.) These contributions are 173 | embedded into the existing reports sent by the API and use contribution budget 174 | just like existing contributions. 175 | 176 | ## Future iterations 177 | 178 | ### Global config 179 | 180 | As a future extension, we could consider allowing these error-event 181 | contributions to be set up globally for a reporting origin (or site). A global 182 | config has already been proposed for other use cases (see issues 183 | [#81](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81#issuecomment-2091524214) 184 | and 185 | [#143](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/143)) 186 | so we could potentially use a single config for these use cases. 187 | 188 | This would reduce verbosity as calls that don't need to change frequently could 189 | be replaced. It could also allow for the addition of new error events for 190 | failures that would prevent scripts from being run, e.g. if a bidding script 191 | fails to load. Note that this wouldn't allow for contributions to vary based on 192 | specific details like the interest group or the user, and wouldn't allow 193 | sampling. We may need to consider a mechanism that allows for canceling or 194 | overriding these globally configured details. 
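As a purely hypothetical illustration (no config format has been proposed, and every field name here is invented), such a global config might declare fixed error-event contributions once per reporting site:

```jsonc
// Hypothetical shape only -- not a proposed or agreed format.
{
  "error_event_contributions": [
    { "event": "reserved.uncaught-exception", "bucket": "101", "value": 1 },
    { "event": "reserved.insufficient-budget-for-report", "bucket": "102", "value": 1 }
  ]
}
```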
195 | 196 | ### Per-interest group error-event contributions 197 | 198 | We could also consider allowing error-event contributions to be set per interest 199 | group. As with the global config, this would allow for the addition of new error 200 | events for failures that would prevent scripts from being run, e.g. if a bidding 201 | script fails to load. It would also allow these contributions to vary per user 202 | or per interest group. However, this configuration would use up some of the 203 | interest group storage budget, which may not be desired. 204 | -------------------------------------------------------------------------------- /flexible_filtering.md: -------------------------------------------------------------------------------- 1 | # More flexible contribution filtering for Aggregation Service queries 2 | 3 | _Note: This document proposes a new backwards-compatible change in the Private 4 | Aggregation API, Attribution Reporting API and Aggregation Service. While this 5 | new functionality is being developed, we still highly encourage testing the 6 | existing API functionalities to support core utility and compatibility needs._ 7 | 8 | #### Table of Contents 9 | 10 | - [Introduction](#introduction) 11 | - [Motivating use cases](#motivating-use-cases) 12 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences) 13 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id) 14 | - [Non-goals](#non-goals) 15 | - [Proposal: filtering ID in the encrypted payload](#proposal-filtering-id-in-the-encrypted-payload) 16 | - [Use case examples](#use-case-examples) 17 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-1) 18 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-1) 19 | - [Details](#details) 20 | - [Small ID space by default, but configurable](#small-id-space-by-default-but-configurable) 21 | - [Backwards 
compatibility](#backwards-compatibility) 22 | - [One ID per contribution](#one-id-per-contribution) 23 | - [Possible future extension: batching ID in the shared_info](#possible-future-extension-batching-id-in-the-shared_info) 24 | - [Use case examples](#use-case-examples-1) 25 | - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-2) 26 | - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-2) 27 | - [Details](#details-1) 28 | - [Requires deterministic reports and specifying batching ID from a single-site context](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context) 29 | - [Backwards compatibility](#backwards-compatibility-1) 30 | - [One ID per report](#one-id-per-report) 31 | - [Use with filtering ID](#use-with-filtering-id) 32 | - [Limits on number of IDs used](#limits-on-number-of-ids-used) 33 | - [Application to Attribution Reporting API](#application-to-attribution-reporting-api) 34 | - [Privacy considerations](#privacy-considerations) 35 | 36 | ## Introduction 37 | 38 | Currently, the Aggregation Service only allows each '[shared 39 | ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 40 | to be present in one query. A set of reports with the same shared ID cannot be 41 | split for separate queries, even if the resulting batches are disjoint. However, 42 | there have been requests to introduce additional flexibility to this query model 43 | (see GitHub issues for [Private 44 | Aggregation](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/92) 45 | and [Attribution Reporting](https://github.com/WICG/attribution-reporting-api/issues/732)). 46 | 47 | Here, we propose introducing a new _filtering ID_ set when a contribution is 48 | made and embedded in the encrypted payload. 
This allows for these queries to be 49 | split further, with the aggregation service filtering contributions based on the 50 | provided IDs. 51 | 52 | We also propose a possible future extension where a _batching ID_ is set from a 53 | first-party context and embedded in the `shared_info`. This would allow for the 54 | ad tech to filter the reports directly, improving the ergonomics for some use 55 | cases. 56 | 57 | ## Motivating use cases 58 | 59 | #### Processing contributions at different cadences 60 | 61 | For some measurements, it may be desirable to query the Aggregation Service less 62 | frequently; this would allow for more contributions to be aggregated before 63 | noise is added, improving the signal-to-noise ratio. However, for other 64 | measurements, it may be more valuable to receive a result faster. (Support for 65 | this use case has been requested for Attribution Reporting 66 | [here](https://github.com/WICG/attribution-reporting-api/issues/732).) Filtering 67 | IDs could be used to separate these measurements into different queries. 68 | 69 | #### Processing contributions by campaign ID 70 | 71 | An ad tech might want to process measurements — for example, reach measurements 72 | — separately for each advertising campaign. To allow for this, it might want to 73 | use a different filtering ID or batching ID for each campaign. Note that, 74 | without this new functionality, the advertiser is not part of the shared ID and 75 | so it's not currently possible to process these separately. 76 | 77 | ## Non-goals 78 | 79 | While we aim to increase the flexibility of report batching strategies, we don't 80 | intend to allow every report or every contribution to be queried separately. 81 | Further, we don't intend to allow for arbitrary groupings decided after 82 | reporting is complete. This is to ensure that the scale of aggregatable 83 | reporting accounting remains feasible, see [discussion 84 | below](#limits-on-number-of-ids-used). 
85 | 86 | ## Proposal: filtering ID in the encrypted payload 87 | 88 | We plan to introduce additional IDs in the payload called _filtering IDs_. By 89 | embedding these IDs within the encrypted payload, their values could be set 90 | within a worklet/script runner — e.g. for a Protected Audience bidder — and 91 | could even be chosen based on cross-site data. For example: 92 | 93 | ```js 94 | privateAggregation.contributeToHistogram( 95 | {bucket: 1234n, value: 56, filteringId: 3n}); 96 | ``` 97 | 98 | If no filtering ID is provided, a default ID of 0 will be used. (See also 99 | [Backwards compatibility](#backwards-compatibility) below.) 100 | 101 | As the reporting endpoint cannot determine the IDs within a given report, the 102 | aggregation service will provide new functionality for filtering contributions 103 | based on their IDs. In particular, each aggregation service query's parameters 104 | should provide a list of allowed filtering IDs and all contributions with other 105 | IDs will be filtered out. For example: 106 | 107 | ```jsonc 108 | // ... 109 | "job_parameters": { 110 | "output_domain_blob_prefix": "domain/domain.avro", 111 | "output_domain_bucket_name": "", 112 | "filtering_ids": [1, 3] // IDs to keep in the query 113 | }, 114 | ``` 115 | 116 | Note that this API is not final, e.g. it might make more sense to specify the 117 | IDs via an avro file. 118 | 119 | The aggregation service would include a filtering ID in the computation of each 120 | '[shared ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 121 | hash. For aggregatable report accounting, the aggregation service would assume 122 | that each filtering ID listed in the job parameters is present in every report. 123 | This avoids leaking any information about which IDs were actually present in 124 | each report. 
125 | 126 | ### Use case examples 127 | 128 | #### Processing contributions at different cadences 129 | 130 | As discussed [above](#processing-contributions-at-different-cadences), a 131 | reporting site may want to query the Aggregation Service at different cadences 132 | for different kinds of measurements. 133 | 134 | Filtering IDs could be used to separate these measurements into different 135 | queries. For example, you could specify a filtering ID of 1 for measurements 136 | that should be queried daily and an ID of 2 for measurements that should be 137 | queried monthly. Each day, the reporting site would then send a day's reports to 138 | the aggregation service and specify that only contributions with a filtering ID 139 | 1 should be processed. Each month, the reporting site would send an entire 140 | month's payloads (which have been sent earlier in the daily queries), but 141 | specify that only contributions with a filtering ID 2 should be processed. 142 | 143 | Note that, in this flow, every report needs to be sent to the aggregation 144 | service multiple times. However, as the filtering IDs are different, no 145 | _contribution_ is being included in an aggregation twice and so there are no 146 | issues with aggregatable report accounting. 147 | 148 | #### Processing contributions by campaign ID 149 | 150 | For certain use cases, the filtering ID may be a deterministic function of the 151 | context. For example, if an ad tech wants to process measurements separately for 152 | each campaign, it could use a different filtering ID for each campaign. As the 153 | campaign would be known outside the Shared Storage worklet, the ad tech could 154 | externally maintain a mapping from the [context 155 | ID](https://patcg-individual-drafts.github.io/private-aggregation-api/#aggregatable-report-context-id) 156 | to the filtering ID. 
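Concretely, the externally maintained mapping and the resulting per-campaign grouping might look like the following sketch. All names and shapes here are illustrative assumptions, and the payloads remain encrypted throughout.

```js
// Hypothetical sketch: the ad tech records the filtering ID it assigned to
// each context ID at request time, then groups incoming (still-encrypted)
// reports into per-campaign batches.
const contextToFilteringId = new Map([
  ['ctx-a', 101], // campaign 101
  ['ctx-b', 102], // campaign 102
]);

function groupReportsByFilteringId(reports) {
  const batches = new Map();
  for (const report of reports) {
    const id = contextToFilteringId.get(report.contextId);
    if (id === undefined) continue; // unknown context; ignored in this sketch
    if (!batches.has(id)) batches.set(id, []);
    batches.get(id).push(report);
  }
  return batches;
}

const batches = groupReportsByFilteringId([
  { contextId: 'ctx-a', payload: '<encrypted>' },
  { contextId: 'ctx-b', payload: '<encrypted>' },
  { contextId: 'ctx-a', payload: '<encrypted>' },
]);
console.log(batches.get(101).length); // 2
```

Each batch could then be sent in a separate aggregation query whose `filtering_ids` list contains just that campaign's ID.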
157 | 158 | When batching reports for the aggregation service, the ad tech could use this 159 | mapping to separate the reports by filtering ID, even though it cannot decrypt 160 | the payload. By avoiding reprocessing every report for each campaign ID, the 161 | number of IDs used can be much larger while keeping processing costs reasonable. 162 | 163 | ### Details 164 | 165 | #### Small ID space by default, but configurable 166 | 167 | The filtering ID would be an unsigned integer limited to a small number of bytes 168 | (1 byte = 8 bits) by default. If no filtering ID is provided, a value of 0 will 169 | be used. We limit the size of the ID space to prevent unnecessarily increasing 170 | the payload size and thus storage and processing costs. As filtering IDs are not 171 | readable by the reporting endpoint, processing multiple filtering IDs separately 172 | would typically require reprocessing the same reports for each query (see [the 173 | first example use](#processing-contributions-at-different-cadences-1) above). 174 | Given this performance constraint, it is unlikely that a larger ID space will be 175 | practical with this flow. 176 | 177 | However, other flows could make use of a larger ID space (see [the second 178 | example use case](#processing-contributions-by-campaign-id-1) above), so we plan 179 | to allow for the ID space to be configurable per-report, up to 8 bytes (i.e. 64 180 | bits). To avoid amplifying a counting attack due to the different payload size, 181 | we plan to make the number of reports emitted with this custom label size 182 | deterministic. This will result in a null report being sent if no contributions 183 | are made. Note that this means the filtering ID _space_ for Private Aggregation 184 | reports must also be specified outside Shared Storage worklets/Protected 185 | Audience script runners. 
186 | 187 | For Shared Storage and Protected Audience sellers, we propose reusing the 188 | `privateAggregationConfig` implemented/proposed for report verification, adding 189 | a new field, e.g. 190 | 191 | ```js 192 | sharedStorage.run('example-operation', { 193 | privateAggregationConfig: { 194 | contextId: 'example-id', 195 | filteringIdMaxBytes: 8 // i.e. allow up to a 64-bit integer 196 | } 197 | }); 198 | ``` 199 | 200 | We do not currently plan to allow the filtering ID bit size to be configured for 201 | Protected Audience bidders as these flows require context IDs to make the scale 202 | practical; we do not currently plan to expose context IDs to bidders (see the 203 | [explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner) 204 | for more discussion). We also do not plan on allowing these fields to be set 205 | from within fenced frames, as they may have access to cross-site information. 206 | 207 | #### Backwards compatibility 208 | 209 | For backwards compatibility, if no list of `filtering_ids` is provided in an 210 | aggregation query, the list containing only the default ID will be used (i.e. 211 | `[0]`). This means that any contributions that don't specify a filtering ID 212 | would be included in that aggregation, along with any contributions that 213 | explicitly specify the default ID. Additionally, the aggregation service will 214 | process reports using older format versions (i.e. before labels were supported) 215 | as if every contribution uses the default filtering ID. 216 | 217 | This should ensure that no changes need to be made to existing pipelines if 218 | filtering IDs are not needed. 219 | 220 | #### One ID per contribution 221 | 222 | We plan to allow for a filtering ID to be set individually for each contribution 223 | in a report's payload. 
To reduce the impact on payload size, we could consider 224 | instead limiting the number of distinct filtering IDs per report to a smaller 225 | number. However, this may pose ergonomic difficulties. 226 | 227 | ## Possible future extension: batching ID in the shared\_info 228 | 229 | Later, to improve ergonomics (see [example 230 | below](#processing-contributions-by-campaign-id-2)), we could consider 231 | introducing a new, optional field to an aggregatable 232 | [report](https://github.com/patcg-individual-drafts/private-aggregation-api#reports)'s 233 | shared\_info called a _batching ID_. For example: 234 | 235 | ```jsonc 236 | "shared_info": "{\"api\":\"shared-storage\",\"batching_id\":1234,\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}", 237 | ``` 238 | 239 | This ID would be an unsigned 32-bit integer. The aggregation service would 240 | include the batching ID in computation of the '[shared 241 | ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 242 | hash, allowing reports with differing batching IDs to be batched and queried 243 | separately. 244 | 245 | ### Use case examples 246 | 247 | #### Processing contributions by campaign ID 248 | 249 | As discussed [above](#processing-contributions-by-campaign-id-1), an ad tech may 250 | want to process measurements separately for each campaign. In that example, the 251 | filtering ID used is a deterministic function of the context. Instead of setting 252 | a filtering ID, a batching ID could be specified. 253 | 254 | As the batching ID would be readable by the ad tech, it would then be able to 255 | use this batching ID to identify what campaign the report is for and to batch 256 | and query the reports for each campaign separately. 
It would no longer have to 257 | rely on maintaining a context ID to filtering ID mapping, which would provide 258 | improved ergonomics and might reduce the risk of bugs from the context ID to 259 | filtering ID mapping. 260 | 261 | #### Processing contributions at different cadences 262 | 263 | While a reporting site could potentially use a batching ID for processing 264 | contributions at different cadences, it has a few downsides relative to a 265 | filtering ID. As only one batching ID can be set per report, multiple reports 266 | would need to be triggered, e.g. through multiple Shared Storage operations. 267 | Further, as the batching ID [requires deterministic 268 | reports](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context), 269 | this would result in a report being sent for each ID, even if there are no 270 | contributions for that cadence. These additional reports would negate the 271 | benefit of being able to split reports into separate batches at the reporting 272 | endpoint. 273 | 274 | ### Details 275 | 276 | #### Requires deterministic reports and specifying batching ID from a single-site context 277 | 278 | As this option embeds highly specific information about the context that 279 | triggered a particular report (in plaintext), we need to make the number of 280 | reports emitted with the batching ID deterministic. (See the [report 281 | verification explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#deterministic-number-of-reports) 282 | for a similar discussion with respect to context IDs.) This will result in a 283 | null report being sent if no contributions are made. Note that this means the 284 | batching ID for Private Aggregation reports must also be specified outside 285 | Shared Storage worklets/Protected Audience script runners. 
286 | 287 | For Shared Storage and Protected Audience sellers, we propose reusing the 288 | `privateAggregationConfig` implemented/proposed for report verification, adding 289 | a new field, e.g. 290 | 291 | ```js 292 | sharedStorage.run('example-operation', { 293 | privateAggregationConfig: { 294 | contextId: 'example-id', 295 | batchingId: 1234 296 | } 297 | }); 298 | ``` 299 | 300 | We do not currently plan to use a context ID for Protected Audience bidders due 301 | to the potential for a large number of null reports, see 302 | [explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner) 303 | for more discussion. Identical considerations would apply to this batching ID in 304 | the `shared_info`; so, we would not allow a batching ID to be set for bidders. 305 | 306 | #### Backwards compatibility 307 | 308 | If no batching ID is specified, the field will not be present in the 309 | `shared_info`. This should ensure the change is backwards compatible. 310 | 311 | #### One ID per report 312 | 313 | Each report can have at most one batching ID (unlike filtering IDs which are 314 | per-contribution). This aligns with the behavior for context IDs, given they are 315 | both readable by the reporting endpoint. 316 | 317 | #### Use with filtering ID 318 | 319 | Both a batching ID and a filtering ID could be used at the same time. 320 | 321 | ## Limits on number of IDs used 322 | 323 | This proposal increases the number of '[shared 324 | IDs](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)' 325 | that the Aggregatable Report Accounting service will need to keep track of. So, 326 | we will need to ensure there are limits to this increase to prevent scale 327 | issues. (Note that it is not practical for each report to have its own entry 328 | recorded in the accounting service.) 
329 | 330 | We plan to impose a limit on the number of shared IDs for any particular 331 | aggregation. That is, if too many are used by a query, an error would occur. The 332 | effect of this limit on the number of filtering IDs or batching IDs (or both) 333 | that can be provided will depend on other details of the batching strategy. 334 | 335 | Straw proposal: a limit of 1000 shared IDs per aggregation. 336 | 337 | ## Application to Attribution Reporting API 338 | 339 | The filtering ID approach should be extendable to the Attribution Reporting API 340 | and, in principle, we could allow the label to be set based on either source or 341 | trigger-side information. 342 | 343 | The batching ID approach may not be viable for all Attribution Reporting API 344 | callers as a null report would need to be sent for every unattributed trigger. 345 | This could increase report volume substantially (e.g. 4 to 20 times); however, 346 | some callers may be able to tolerate this increase (see the discussion in [ARA 347 | issue #974](https://github.com/WICG/attribution-reporting-api/issues/974) about 348 | introducing a trigger ID). If making reports deterministic is acceptable for 349 | some callers, we could support setting a batching ID for a trigger with a 350 | similar mechanism to the already proposed trigger ID. 351 | 352 | The details of these approaches will be explored in a separate GitHub issue. 353 | 354 | ## Privacy considerations 355 | 356 | While this change does allow for reprocessing the same report in different 357 | aggregations, each query will only aggregate distinct contributions from that 358 | report. In other words, each contribution is still guaranteed to only be 359 | aggregated once, maintaining our current [privacy protection 360 | model](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting). 
361 | 362 | One other potential concern is that introducing new (plaintext) metadata to the 363 | report might amplify counting attacks (see related discussion for context IDs 364 | [here](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#privacy-considerations)). 365 | However, we ensure that any new metadata (including a batching ID and any 366 | non-default payload size) is paired with making the sending of that report 367 | deterministic. This avoids any risk of the report count leaking information. 368 | -------------------------------------------------------------------------------- /meetings/2023-02-15/minutes.md: -------------------------------------------------------------------------------- 1 | # Private Aggregation W3C Call: Agenda & Notes 2 | 3 | Wednesday 15 February 2023 at 4 pm Eastern Time (= 9 pm UTC = 10 pm Paris = 1 pm California). 4 | 5 | This notes doc will be editable during the meeting — if you can only comment, hit reload. 6 | 7 | Notes will be made available on [GitHub](https://github.com/patcg-individual-drafts/private-aggregation-api) after the meeting. 8 | 9 | Additional meetings will be made on an ad hoc basis. 10 | 11 | ## Agenda 12 | 13 | #### Process reminder: Join PATCG and sign in 14 | 15 | If you want to participate in the call, please make sure you join the PATCG: and add yourself to the attendees list below. 16 | 17 | ### [Suggest agenda items here] 18 | 19 | * Intro to Shared Storage and Private Aggregation 20 | * What’s possible with Reach in the current design 21 | * Importance of cross-device Reach 22 | * Supporting statistical methods/adjustments 23 | * Scope of dynamic flexibility 24 | * Issue [#17](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/17): Extending shared storage API to support advanced reach reporting 25 | * Fledge + PAA: 26 | 27 | Can we expect that Private Aggregation API will be available for Fledge on GA (H3’23)? 
28 | 29 | ### Note takers 30 | 31 | Zach Mastromatto 32 | 33 | ### Notes 34 | 35 | * Intro to Shared Storage and Private Aggregation - What’s possible with Reach in the current design 36 | * Shared Storage provides unrestricted write access across sites 37 | * Companies can use relevant Shared Storage data to aggregate data and return noisy summary reports 38 | * You can use Shared Storage data to create an aggregation key, send that to your server, then batch them together and send to Aggregation Service to decrypt, add noise, and return an aggregated summary report 39 | * Q: How does data delineation look? 40 | * A: Each adtech will have its own Shared Storage context to read and write from 41 | * Reach is one of the use cases that Shared Storage is built to support 42 | * When a unique ad is served to a new user, you can write to Shared Storage that this new user was reached by an ad that was served (frequency = 1) 43 | * Each new user view can then be aggregated and sent in the Private Aggregation flow to aggregate and return a noised summary report from the Aggregation Service 44 | * How does the batching of these reports work? 45 | * When a report is sent for batching, it can’t be reprocessed again 46 | * It is important to write the date/date range when you write to Shared Storage 47 | * Q: Is Shared Storage available in incognito mode? 48 | * A: Unclear at the moment, will need to follow up. You would consider it as a separate bucket though if it is enabled. You can still write to it if it is Incognito, but all of the data would be cleared when the browser is closed. 49 | * Q: If I set up a weekly reach report, then I can't send a daily report? 50 | * A: No, you would only be able to process data in one of the reports 51 | * Q: This is browser counting, and not deduplicated, right? 52 | * A: Yes, and we are interested if there are other reach models that are dependent on an extension of that deduplication 53 | * Q: What use case does this solution map to?
What value do advertisers get from doing this specifically within the browser? (from Meta) 54 | * A: From Google Ads’ perspective, broadly speaking, Reach is pretty established and is an important solution that Google Ads provides in-market. Extending the API just a little bit further could significantly improve its utility. Details are in the GitHub issue that Google Ads wrote. 55 | * A: From Chrome's perspective, is it possible to get per-browser estimates, and are there any adjustments that take place after the fact to make Reach closer to a person-based calculation vs a browser-based calculation? Our question is when that needs to happen. 56 | * A: It is unclear if you move the matching to off-device vs on-device (Meta) 57 | * Q: Would you be able to dedupe users across devices? (Amazon) 58 | * A: Cross-device is an interesting question in this context. This seems to be in the context of deduping the browser to an actual person. The challenge for Chrome is figuring out the right way to do this with user privacy in mind. We want to be confident from a statistical perspective on where the inflection point is for knowing that a reached impression is a new person or not. This is probably worth a deeper dive in a later discussion. 59 | * We need as much data flexibility as we can in a dynamic environment (weekly, monthly, mid-campaign flight, etc.). (Amazon) 60 | * Q: Why are there limitations on processing data multiple times in batches? 61 | * A: To limit the amount of information that is available on users. Also to restrict averaging out the noise if you process the data multiple times. Theoretically you can add more noise in if you reprocess it each time to alleviate this, but it adds more complexity for engineering and for users. 62 | * Raimundo: The important thing on the first question is what is the unique thing that you are counting, more so than where you are doing the counting? 63 | * Wendell: Proposing that Chrome simplifies this solution.
Investing in testing costs money both for the advertiser and the adtech. Suggests that simplification here would make it easier to test and build a business around. Noise being added is tough to manage and it has to behave the same way across long periods of time. 64 | * Thomas: Counting devices is really the most important thing, as that is correlated most to a person. Counting users is supposed to be the secret sauce of an adtech to do this. Does not think it is Chrome’s responsibility to provide a device graph for users. It is key to have an integration between Android and Chrome. 65 | * Q: Is there a strong sense that apps and browsers should talk to each other? Or multiple browsers on the same device? Can you elaborate on what you mean by a unique device? Is it not a question of multiple browsers, but across app and web? 66 | * A: Device as defined by a physical device. Across both app and web, yes. 67 | * Q: iFrames and Fenced Frames are fully supported, so Shared Storage should be available everywhere? 68 | * A: Yes, but Chrome wants to double-check the Fenced Frames piece and get back to you. 69 | * Evgeny: We (Google Ads) are asking for a level of privacy protection from Privacy Sandbox, then enabling engineering on top of that to create a reach metric. Chrome should not provide an identity layer. Removing the block on processing batches multiple times is an example of an area that can improve Reach calculations without sacrificing privacy. 70 | * Raimundo: Two challenges we need to solve for Reach: 1) how do you account for people not reached (flexibility) and 2) across all data sources and devices. We already count unique people and we need to be able to deduplicate across all devices, not just browsers. How do you bring in unique counts from devices that do not have Privacy Sandbox enabled? 71 | * Vishal: Reach measurement is very foundational and core to Brand advertisers around the world.
We are more than happy to bring those advertisers in for feedback if Chrome wishes. This is a core metric used to measure across all of TV and digital. 72 | * Jukka: It might sound like we would need a unique identifier across all platforms and media for a single person. However, when we talk about counting, we’re really talking about estimating. We can deal with a world where browser and app usage on the same device is separate. We use panels for modeling purposes. Not accurate at a per-person level, but can be done in aggregate. Reach is what advertisers want the most and want it automatically without much hassle. They also care about privacy, though. We need to make this solution workable for both. 73 | * Asha: It is probably not possible to make a universal identity, so we need to support adtechs in facilitating the estimation counts after the fact. 74 | * Sanjay: The amount of utility for an advertiser decreases when the fidelity of the signal decreases. This isn’t on Chrome to uniquely solve though for all adtechs across all devices. The industry needs to come to a consensus on the signals that are important for Reach and the privacy guardrails. This is true for all browsers, not just Chrome. 75 | * Evgeny: Allowing count(distinct) would unlock the ability to do modeling and increase flexibility by one or two orders of magnitude over what is possible now. It unlocks a key toolbox of modeling. 76 | * Alex: Just want to flag that we have a lot to talk about and need more stakeholders in the room, per Vishal. Chrome would like more attendance in these meetings and encourages follow-up meetings and sharing. 77 | * Asha: Are adtechs used to having standard dimensions for reporting or does it need to be very flexible across any dimension? 78 | * Evgeny: Flexibility across the board is important, date is important. It is important for Reach to be flexible, which is what we hear from our customers.
Trying to list the important metrics might not lead to success here, but rather enabling the flexibility for adtechs to decide for themselves. 79 | * Michal: We plan to use PAA together with FLEDGE. Some of these reports need to be processed daily vs other types of reporting. And the batching frequency needs to be variable. From our perspective, it would be good to have the ability to reprocess the data or to batch data in a way that is not just based on a specific date range. 80 | * Bo: Strongly agree that dimensions should be fully flexible, instead of picking the specific dimensions to be flexible on. Timezone issues are a concern too. In order for the ecosystem to test the data sent back, we need to gain some confidence in the underlying data. Without that we can’t be confident in using it. 81 | * Jukka: Exact timestamp is not as important to Comscore. Exact time is less important than other dimensions. 82 | * Importance of cross-device Reach 83 | * Supporting statistical methods/adjustments 84 | * Scope of dynamic flexibility 85 | * Issue [#17](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/17): Extending shared storage API to support advanced reach reporting 86 | * Fledge + PAA: 87 | 88 | Can we expect that Private Aggregation API will be available for Fledge on GA (H3’23)? 89 | 90 | ### To join the speaker queue 91 | 92 | Please use the "Raise My Hand" feature in Google Meet 93 | 94 | ### Attendees: please sign yourself in 95 | 96 | 1. Wendell Baker (Yahoo) 97 | 2. David Dabbs (Epsilon) 98 | 3. Alex Turner (Google Chrome) 99 | 4. Anish Ahmed (Google, Privacy Sandbox) 100 | 5. Asha Menon (Google Chrome) 101 | 6. Vishal Radhakrishnan (Google Ads - Cross-Media Measurement ) 102 | 7. Sid Sahoo (Google Chrome) 103 | 8. Robert Kubis (Google Chrome) 104 | 9. Renan Feldman (Google Chrome) 105 | 10. Maybelline Boon (Google Chrome) 106 | 11. Christina Ilvento (Google Chrome) 107 | 12. Bo Xiong (Amazon Ads) 108 | 13. 
Hannah Chang (Google) 109 | 14. Thomas Prieur (Criteo) 110 | 15. Craig Wright (Google Ads) 111 | 16. (RTB House) 112 | 17. Jonathan Frederic (Google Ads) 113 | 18. Chris McAndrew (Google Ads - Measurement) 114 | 19. Evgeny Skvortsov(Google Ads) 115 | 20. Michael Perry (Amazon) 116 | 21. Michael Kleber (Google Chrome) 117 | 22. Nan Li (Amazon) 118 | 23. Zach Mastromatto (Google Chrome) 119 | 24. Jonas Schulte (Amazon Ads) 120 | 25. Gaurav Bajaj (Amazon Ads) 121 | 26. Anatolii Bed (Amazon Ads) 122 | 27. Sanjay Saravanan (Meta) 123 | 28. Jukka Ranta (Comscore) 124 | 29. Ruchi Lohani (Google, Privacy Sandbox) 125 | 30. Martin Pal (Google Privacy Sandbox) 126 | 31. Kuang Yi Chen (Amazon) 127 | 32. Konrad Krakowiak (Google Ads) 128 | 129 | **Cursor Parking** 130 | 131 | **🚗🚗🚗🚗🚗🚗** 132 | -------------------------------------------------------------------------------- /meetings/2023-02-15/shared-storage-overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/meetings/2023-02-15/shared-storage-overview.pdf -------------------------------------------------------------------------------- /named_budgets.md: -------------------------------------------------------------------------------- 1 | # Named budgets 2 | 3 | Sharing budget between different use cases or different products can pose a 4 | challenge. If one of the use cases/products uses the [contribution 5 | budget](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting) 6 | more quickly, it may exhaust the budget, preventing the others from being able 7 | to report. (For example, see issue 8 | [#145](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/145).) 9 | 10 | We propose a generic mechanism to better support this. 
Each reporting site would 11 | be able to define a series of _named budgets_ and would allocate a fraction of 12 | their total budget to each, e.g.: 13 | 14 | ```js 15 | privateAggregation.reserveBudget("example-budget", 0.5); 16 | privateAggregation.reserveBudget("debug", 0.125); 17 | ``` 18 | 19 | Then, when making histogram contributions (including via 20 | `contributeToHistogramOnEvent()`), a named budget can optionally be specified, 21 | e.g.: 22 | 23 | ```js 24 | privateAggregation.contributeToHistogram({ 25 | bucket: 123n, 26 | value: 45, 27 | filteringId: 6n, 28 | namedBudget: "example-budget" 29 | }); 30 | ``` 31 | 32 | When approving or rejecting contributions, the browser would then check both 33 | whether the chosen named budget has enough allocated budget left and whether 34 | there is enough overall budget left. This ensures the overall budget is always 35 | respected, even if the named budget allocations change. 36 | 37 | ### Fractional budget 38 | 39 | We propose using a fraction to represent the budget allocation, i.e. not a 40 | direct limit on the contributions’ values sum. Given the two separate time 41 | windows used in Private Aggregation budgeting (per-10 min and per-day), using a 42 | fraction more clearly allocates a portion of both limits simultaneously. 43 | 44 | ### Reservation persistence 45 | 46 | To avoid the complexity of persisting configurations over time, we propose that 47 | the `reserveBudget()` calls are scoped to just the single worklet/script runner 48 | context. That is, the budget allocations would need to be set up at the start of 49 | each Shared Storage operation or Protected Audience function call. Note that the 50 | browser would still keep track of the budget usage for each group over the last 51 | 10 min and last 24 hours, but the per-named budget _limits_ themselves would not 52 | be persisted. 
53 | 54 | See [Future iteration: global config](#future-iteration-global-config) below for 55 | some discussion of a more persistent choice. 56 | 57 | ### Default (null) budget 58 | 59 | If no named budget is explicitly specified for a contribution, it would default 60 | to using a special “null” budget. This “null” budget is implicitly allocated the 61 | remaining budget after all the explicit reservations are accounted for. In the 62 | example above, the “null” budget would be allocated the remaining 37.5% of the 63 | budget. Note that this behavior means that this feature is fully backwards 64 | compatible. 65 | 66 | ### Allocating more than the entire budget 67 | 68 | If more than 100% of the budget is allocated to different named budgets, a 69 | JavaScript error will be thrown. 70 | 71 | ### Privacy considerations 72 | 73 | This feature does not introduce any new privacy concerns, as the overall existing 74 | budget is still always enforced. Note that contribution budgets are enforced per 75 | reporting site and so one reporting origin on a site could exhaust the budget 76 | for its entire reporting site. However, this is intended and is unchanged by the 77 | new feature. 78 | 79 | ### Future iteration: global config 80 | 81 | As a future extension, we could consider allowing these budget reservations to 82 | be set up globally for a reporting origin (or site). A global config has already 83 | been proposed for other use cases (see issues 84 | [#81](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/81#issuecomment-2091524214) 85 | and 86 | [#143](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/143)), 87 | so we could potentially use a single config for these use cases. This would 88 | reduce verbosity, as reservations that don’t need to change frequently would not 89 | need to be repeated. We may need to consider a mechanism that allows for canceling or 90 | overriding these globally configured details.
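The two-level approval check described earlier, where both the chosen named budget's allocation and the overall budget must have room, can be sketched as follows. The `TOTAL_BUDGET` value and the tracker structure are illustrative assumptions, not the actual browser implementation (which also tracks separate 10-minute and per-day windows).

```js
// Hedged sketch of the two-level budget check. TOTAL_BUDGET is a placeholder
// value, not the API's actual budget constant.
const TOTAL_BUDGET = 65536;

function makeBudgetTracker(reservations) {
  const allocated = Object.values(reservations).reduce((a, b) => a + b, 0);
  if (allocated > 1) throw new Error('More than 100% of the budget reserved');
  const usedByName = new Map();
  let usedOverall = 0;
  return {
    approve(value, namedBudget = null) {
      // The implicit "null" budget gets whatever fraction is unreserved.
      const fraction =
        namedBudget === null ? 1 - allocated : reservations[namedBudget] ?? 0;
      const used = usedByName.get(namedBudget) ?? 0;
      if (used + value > fraction * TOTAL_BUDGET) return false; // named limit
      if (usedOverall + value > TOTAL_BUDGET) return false; // overall limit
      usedByName.set(namedBudget, used + value);
      usedOverall += value;
      return true;
    },
  };
}

// Mirrors the example reservations above: 50% and 12.5%.
const tracker = makeBudgetTracker({ 'example-budget': 0.5, debug: 0.125 });
console.log(tracker.approve(45, 'example-budget')); // true
console.log(tracker.approve(40000, 'example-budget')); // false (exceeds 50%)
```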
91 | -------------------------------------------------------------------------------- /reach_whitepaper.md: -------------------------------------------------------------------------------- 1 | # Reach Implementation Best Practices in the Privacy Sandbox Shared Storage + Private Aggregation APIs 2 | Authors: Hidayet Aksu (aksu@google.com), Alexander Knop (alexanderknop@google.com), Pasin Manurangsi (pasin@google.com) 3 | 4 | This whitepaper aims to provide ad tech companies with actionable guidance on 5 | implementing reach measurement within the Privacy Sandbox, via the 6 | [Shared Storage](https://developers.google.com/privacy-sandbox/relevance/shared-storage) 7 | and 8 | [Private Aggregation](https://developers.google.com/privacy-sandbox/relevance/private-aggregation) APIs. 9 | We dive into various reach scenarios, covering hierarchical queries and cumulative 10 | reach, and present corresponding implementation strategies catering to diverse 11 | ad tech requirements. 12 | 13 | In various use cases, we recognize the potential value of exploratory 14 | reach queries. 15 | Taking into account the various needs of advertising technology companies, 16 | we consider both scenarios: pre-determined reach queries (queries known in 17 | advance) and ad hoc or unknown reach queries. 18 | Hence, this work presents an exploration of alternative approaches, 19 | including direct measurement and 20 | [sketch-based mechanisms](#sketch-based-methods). 21 | Through empirical evaluation on synthetically generated data, modeled after 22 | real-world ad datasets, we assess the impact of factors like ad event capping, 23 | query granularity, and time windows on measurement accuracy. We aim to equip 24 | ad tech companies with the insights needed to select the best reach measurement 25 | method aligned with their specific query characteristics and privacy 26 | considerations.
Our experiments show that for predetermined queries, the best solution is to use
the point contribution mechanism. However, for more exploratory queries or those
with longer windows, sketch-based mechanisms are beneficial, although techniques
such as capping may need to be used to reduce error.

By shedding light on reach estimation using the Privacy Sandbox, this paper
seeks to empower ad tech companies to navigate the evolving landscape of
privacy-centric advertising technologies.


## Reach Functionality

The goal of reach is to compute the number of unique users that see (or interact
with) content (e.g., a specific ad).
For the purpose of this white paper, we will treat each browser profile as a
unique user, so that the result is actually the number of unique profiles that
see the content.
We use this definition for simplicity in our empirical evaluations.
Accordingly, we will use “user” and “browser profile” interchangeably.

In general, an ad tech would like to perform multiple reach measurements,
depending on how they wish to “slice” the set of users.
In the context of digital advertising, a slice is a distinct segment of a
campaign's target audience, characterized by specific attributes or behaviors.
This segmentation allows for granular analysis and targeted optimization of
advertising efforts. Slices can be constructed from a variety of dimensions,
such as campaign, location, and demographics. For example, one may want to
perform a reach measurement for a single campaign during a single day, or for a
single campaign during a single day in a specific geographic region.

Within a specified time window, reach between slices is not necessarily
additive.
61 | In other words, the count of all users for `campaign=123` is equal to the sum of 62 | user counts of `campaign=123` for each geolocation, only if no user appears in 63 | two different geo areas, which is not always the case. 64 | For features such as a creative ID, a user may view advertisements with multiple 65 | creative IDs, thus additivity does not apply in this scenario either. 66 | Consequently, when queries are pre-determined, it is important to explicitly 67 | define each inquiry and collect relevant contributions. 68 | Conversely, when queries are unknown, contributions must be collected to support 69 | the finest granularity of queries. 70 | 71 | Throughout this paper, we consider hierarchical reach 72 | measurement queries, based on different slices. 73 | For illustration and experimentation purposes, we assume the following four 74 | slicing levels: 75 | 76 | 77 | * Level 1 (coarsest): advertiser 78 | * Level 2: advertiser x campaign 79 | * Level 3: advertiser x campaign x geo 80 | * Level 4 (finest): advertiser x campaign x geo x ad strategy x creative 81 | 82 | To analyze the temporal aspects of reach, let's consider the following 83 | definitions for a running campaign: 84 | 85 | 86 | 87 | * Cumulative Reach: This quantifies the number of unique users exposed to 88 | impressions within a campaign since its inception. In this scenario, we 89 | incrementally measure reach over time. 90 | * Example queries: Reach until January 24th, Reach until January 25th, 91 | Reach until January 26th, etc. 92 | 93 | _Note that we are not considering rolling window queries here for brevity; 94 | however, the approaches we consider for cumulative reach could be 95 | generalized to this use case._ 96 | 97 | * Fixed Window Reach: This calculates the number of unique users exposed 98 | to impressions within a specific time range. The time range, or window, is 99 | determined at the time of data collection and remains constant during 100 | subsequent analysis. 
    * Example queries: Reach between January 22nd and January 24th, Reach
      between January 23rd and January 25th, etc.

The following query examples were selected for pre-determined queries due to
their relevance to real-world scenarios and frequent usage within ad tech
analytics.

| ID | Query                                         | Query Frequency     | Type         |
|----|-----------------------------------------------|---------------------|--------------|
| Q1 | What is the reach for the last day?           | Queried every day   | Fixed Window |
| Q2 | What is the reach for the last calendar week? | Queried every week  | Fixed Window |
| Q3 | What is the reach for the last month?         | Queried every month | Fixed Window |
| QC | What is the cumulative reach till now?        | Queried every day   | Cumulative   |
162 | 163 | 164 | 165 | ## Reach Implementations 166 | 167 | First, we show that reach can be measured directly using the existing Shared 168 | Storage and Private Aggregation APIs – we call this approach a direct method. 169 | Next, we explain how sketch-based methods can be used with the same APIs. 170 | 171 | 172 | ### Single Reach Measurement in Shared Storage + Private Aggregation APIs 173 | 174 | We start by recalling how to compute a single reach measurement via Shared 175 | Storage, as described in the 176 | [developer documentation](https://developers.google.com/privacy-sandbox/relevance/shared-storage/unique-reach). 177 | The idea is to use Shared Storage to record whether this browser has already 178 | contributed to the reach measurement for the slice. 179 | If it has, then we do nothing. 180 | Otherwise, we contribute to the histogram (with a single bucket) and record that 181 | a contribution has been made in Shared Storage. 182 | In doing so, the computed reach is simply equal to the aggregated value. 183 | For convenience, the following code snippet is adapted from the above linked 184 | documentation. 185 | 186 | 187 | ```javascript 188 | const L1_BUDGET = 65536; 189 | 190 | // contentId here is some id of the interaction that is measured; 191 | // i.e., it is a campaign id. 192 | function convertContentIdToBucket(contentId) { 193 | return BigInt(contentId); 194 | } 195 | 196 | class ReachMeasurementOperation { 197 | async run(data) { 198 | const { contentId } = data; 199 | 200 | // Read from Shared Storage 201 | const key = 'has-reported-content'; 202 | const hasReportedContent = (await sharedStorage.get(key)) === 'true'; 203 | 204 | // Do not report if a report has been sent already 205 | if (hasReportedContent) { 206 | // Note that this would happen if less than 30 days 207 | // passed since the first visit. 
      return;
    }

    // Generate the aggregation key and the aggregatable value
    const bucket = convertContentIdToBucket(contentId);
    const value = 1 * L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


Note that the reach result can be computed by taking the aggregated histogram
value of each key and dividing by the contribution budget (`L1_BUDGET`).
Note that the developer documentation uses a scale factor (`SCALE_FACTOR`), but
in this example the scale factor is the same as the contribution budget, so we
use `L1_BUDGET`.
See the
[“Noise and scaling” section](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#noise_and_scaling)
of the Private Aggregation fundamentals article to learn more.


### Single Reach Measurement with Capping

An implicit assumption in the previous section is that each
[privacy unit](https://github.com/patcg-individual-drafts/private-aggregation-api#implementation-plan)
– i.e., <browser x ad tech reporting origin x 10 minutes> –
contributes to only a single query.
The reason is that (i) the Shared Storage check does not incorporate the bucket,
so once the browser has reported for any content, later content will not pass
the check, and (ii) the contribution value is 65,536, which is the same as the
overall contribution budget (see
[contribution budget for summary reports](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#contribution_budget)).
In other words, we “cap” the number of contributions each privacy unit can make
to just one.
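To make the scaling concrete, the following Python sketch shows how a
summary-report value could be mapped back to a reach estimate on the server
side. `estimate_reach` is a hypothetical helper, not part of the API; with a
cap of one it reduces to dividing by the contribution budget as described
above, and the cap parameter anticipates the capped variants discussed next.

```python
L1_BUDGET = 65536

def estimate_reach(aggregated_value: float, contribution_cap: int = 1) -> float:
    """Recover a reach estimate from an aggregated histogram value.

    Each reached browser contributes L1_BUDGET / contribution_cap to its
    bucket, so the (noisy) aggregate is scaled back down accordingly.
    """
    per_user_value = L1_BUDGET // contribution_cap
    return aggregated_value / per_user_value

# With a cap of one, 1,000 reached browsers contribute 65,536 each, and the
# noiseless aggregate of 65,536,000 maps back to a reach of 1000.
print(estimate_reach(1000 * L1_BUDGET))
```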
In many cases, multiple contributions from a single privacy unit may be dropped
due to this capping limit of one.
This can lead to poor utility, as each privacy unit’s contributions are not
fully reflected in the results.

To improve utility, we can increase the contribution cap to an arbitrary number
(denoted by `CONTRIBUTION_CAP` below).
Allowing a larger cap can capture more contributions, which is good for utility.
However, because each contribution’s value is scaled down to
`L1_BUDGET / CONTRIBUTION_CAP`, the fixed noise becomes relatively larger once
the results are scaled back up, which is worse for utility.
As such, the capping parameter has to be tuned carefully.

We give example code below where the capping parameter is set to five.
The two changes are (i) making the Shared Storage check specific to the bucket
and (ii) dividing each contribution value by the capping parameter.


```javascript
const CONTRIBUTION_CAP = 5;
const L1_BUDGET = 65536;

function convertContentIdToBucket(contentId) {
  return BigInt(contentId);
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Generate the aggregation key and the aggregatable value
    const bucket = convertContentIdToBucket(contentId);
    const value = Math.floor(L1_BUDGET / CONTRIBUTION_CAP);

    // Read from Shared Storage
    const key = `has-reported-content-${bucket}`;
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already for this bucket
    if (hasReportedContent) {
      return;
    }

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await
sharedStorage.set(key, true);
  }
}
```


### Multiple Reach Measurements with Capping

While the above method focuses on a scenario where each content can contribute
to only a single reach measurement (i.e., a single “query”), it can be extended
to the case where each content can contribute to multiple reach measurements by
having a key for each query.

Below, we give an example of this for the hierarchical queries described above.
Note that each reached browser (i.e., the first time the advertisement is shown
to the user) contributes to exactly one slice in each
[level](#level).
Thus, the number of histogram keys that each reached content contributes to is
equal to the number of levels.
Due to the
[contribution budget](https://developers.google.com/privacy-sandbox/relevance/private-aggregation/fundamentals#contribution_budget)
constraint, the scale factor is now divided by the number of levels in addition
to being divided by the contribution cap (`CONTRIBUTION_CAP`).
While we adhere to an equal contribution per level, it is possible to further
optimize the contribution budget across different levels, as explained in the
[relevant paper](https://arxiv.org/pdf/2308.13510).

The code for this is below.


```javascript
const CONTRIBUTION_CAP = 5;
const NUM_LEVELS = 4;
const L1_BUDGET = 65536;

function convertContentIdAndLevelToBucket(contentId, level, otherDimensions) {
  // Implement a function that computes the bucket key for a given level using
  // level-related fields from otherDimensions.
  // E.g., in our example, the second level could set the key to be the
  // concatenation of the advertiser and campaign id of this content.
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId, ...otherDimensions } = data;

    for (let level = 1; level <= NUM_LEVELS; level++) {
      // Generate the aggregation key and the aggregatable value
      const bucket = convertContentIdAndLevelToBucket(contentId, level, otherDimensions);
      const value = Math.floor(L1_BUDGET / (CONTRIBUTION_CAP * NUM_LEVELS));

      // Read from Shared Storage
      const key = `has-reported-content-${bucket}`;
      const hasReportedContent = (await sharedStorage.get(key)) === 'true';

      // Skip this level if a report has been sent already for this bucket,
      // but still consider the remaining levels.
      if (hasReportedContent) {
        continue;
      }

      // Send an aggregatable report via the Private Aggregation API
      privateAggregation.contributeToHistogram({ bucket, value });

      // Set the report submission status flag
      await sharedStorage.set(key, true);
    }
  }
}

// Register the operation
register('reach-measurement', ReachMeasurementOperation);
```


A sample call would look like:


```javascript
await window.sharedStorage.run('reach-measurement', { data: {
  contentId: '123',
  geo: 'san jose',
  creativeId: '55'
}});
```


### Cumulative Reach Measurements

So far, we have not mentioned the time aspect of the queries.
In particular, we have assumed that all queries correspond to the same time
period.
However, we may want to measure time-related queries.
A popular family of such queries is cumulative queries, which are queries of
the form “what is the total reach up until timestep `t`?” where
`t` varies over `1, …, T`.
For example, one might want to measure the reach up until the current day, every
day for an entire month. In this case, we would have `T = 30` and a “timestep”
of one day.
392 | 393 | _Note: Both methods described in this section are limited by the 394 | [retention policy](https://github.com/WICG/shared-storage?#data-retention-policy) of Shared Storage which is 30 days from last write._ 395 | 396 | 397 | #### Direct Method 398 | 399 | For cumulative reach queries, we can naively apply the direct method above by 400 | adding the timestamp to the key of the histogram. 401 | If a user is reached at time step $`i`$, then contributions would be made to all 402 | of $`i, \dots, T`$ buckets. 403 | Since there are $`T`$ queries, the noise will scale by a factor of 404 | $`O(T / \epsilon)`$. 405 | Note that with the current API, one would need to wait until the last window to 406 | obtain the results since reprocessing of the reports is not currently supported. 407 | 408 | 409 | #### Point Contribution 410 | 411 | We can improve upon the above direct method by a simple modification. 412 | For simplicity, we only discuss the mechanism (and give the code) for 413 | the single-query cap-one case. 414 | It is easy to extend this to the multi-query case by appending the query id to 415 | the keys. 416 | Before we describe the mechanism, let us first make the following observation. 417 | Consider the number $`n^{(\text{new})}_i`$ of users that were reached for the 418 | first time at timestep $`i`$. 419 | The answer of cumulative reach up to timestep $`t`$ is exactly equal to the 420 | cumulative sum $`\sum\limits_{i=1}^t n^{(\text{new})}_i`$. 421 | Using this fact, we can simply create $`T`$ buckets, where bucket $`i`$ corresponds 422 | to $`n^{(\text{new})}_i`$. 423 | When we wish to compute the cumulative reach up to time step $`i`$, we take the 424 | sum of the first $`i`$ buckets. Since each user contributes to only a single 425 | bucket, each bucket’s noise standard deviation is only $`O(1 / \epsilon)`$. 
Since we are summing up to $`T`$ such (independent) noises, the total noise
standard deviation is $`O(\sqrt{T} / \epsilon)`$, which is an improvement of
$`O(\sqrt{T})`$ over the direct method.

An implementation in Shared Storage is given below.


```javascript
// Time horizon T
const TIME_HORIZON = 30;
const L1_BUDGET = 65536;

function convertContentIdToBucket(contentId, currentTime) {
  // This key corresponds to n_i as described above with i = currentTime.
  // Note that the remainder of "bucket" when divided by TIME_HORIZON
  // is currentTime and the quotient is contentId; hence, this formula gives a
  // unique id to each pair of contentId and currentTime.
  return BigInt(TIME_HORIZON) * BigInt(contentId) + BigInt(currentTime);
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Read from Shared Storage
    const key = 'has-reported-content';
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already
    if (hasReportedContent) {
      return;
    }

    // The current time step, in 0, ..., TIME_HORIZON - 1. It should be
    // appropriately set by the ad tech (e.g., based on hour or day).
    const currentTime = 0;

    // Increment the current bucket
    const bucket = convertContentIdToBucket(contentId, currentTime);
    const value = L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


The code for computing cumulative reach from the aggregated histogram is given
below.
```python
# Time horizon T; needs to be fixed beforehand.
TIME_HORIZON = 30
L1_BUDGET = 65536.0


def convert_content_id_to_bucket(content_id: int, current_time: int) -> int:
    # Same as the client-side function above; current_time is in
    # 0, ..., TIME_HORIZON - 1.
    return TIME_HORIZON * content_id + current_time


def compute_cumulative_reach(
    aggregated_histogram: dict[int, int],
    content_id: int) -> dict[int, float]:
    """Computes cumulative reach for the given content_id.

    Args:
      aggregated_histogram: a mapping from key to aggregated value in the
        summary report output by the Aggregation Service.
      content_id: the content id for which the cumulative reach is computed.

    Returns:
      a mapping from time step to the cumulative reach up until that time step.
    """
    res = {}
    for t in range(TIME_HORIZON):
        res[t] = 0
        for i in range(t + 1):
            key = convert_content_id_to_bucket(content_id, i)
            res[t] += aggregated_histogram.get(key, 0) / L1_BUDGET
    return res
```


Finally, we remark that, for larger values of $`T`$ (which may be the case if
measuring cumulative reach at granularities finer than one day), there are more
advanced mechanisms that can reduce the error further
[[1](#private-decayed-sums),[2](#private-counting)],
but these are beyond the scope of this paper.


### Sketch-Based Methods

Sketching is a technique for estimating reach (and other quantities of
interest) using a compressed aggregate data structure that admits a merge
operation.
Here we only consider one of the most basic families of sketches, called
[counting Bloom filters (CBFs)](https://en.wikipedia.org/wiki/Counting_Bloom_filter).
In this case, we have a hash function $`h`$ that maps any user ID to one of $`m`$
buckets.
The CBF sketch is simply the histogram on these $`m`$ buckets among all the
reached users.
We can estimate the reach based on the number of non-empty buckets in the
sketches.
The exact formula for this estimation depends on the hash function
distribution.

The main advantage of this sketch is that it can be merged: if we have two CBFs,
summing them up gives us the CBF corresponding to the union of the two
(multi)sets.
This allows us to be more flexible when answering queries.
Recall that, in the direct method, we need to know all the queries beforehand,
since we need to allocate a bucket for each query.
This is not needed when using sketches: we can simply create a sketch for each
fine-grained unit (e.g., the reach for each day) and, when we make a query for a
day range, take the union of all the sketches in that range.
This means that the analyst can supply the queries in an on-demand fashion,
which is more convenient. Moreover, sketching offers another significant
advantage: the ability to deduplicate user IDs across multiple devices,
providing a more accurate picture of individual engagement.
This would typically require using the same user ID on multiple devices.
Shared Storage is partitioned per browser, so the ad tech would need to identify
the user across devices independently. Alternatively, there are modeling-based
approaches (e.g.,
[virtual people](https://research.google/pubs/virtual-people-actionable-reach-modeling/))
that would avoid the need for these IDs to map directly to real-world users.

Nevertheless, this sketch also has several disadvantages.
Firstly, each sketch is a histogram with $`m`$ buckets.
In contrast, the direct method only outputs a single number per query.
Therefore, the sketching method can incur a memory overhead during processing of
the aggregatable reports and during post-processing of the summary reports,
compared to the direct or point contribution methods.
Secondly, since noise is added to every entry of the sketch, this can lead to
inaccurate estimates if the noise is large.
Specifically, as we will see below, we need to set the capping parameter to be
small to ensure the noise is small, but doing so results in a large error due to
capping.
This tension means that we have to be very careful with parameter settings to
ensure (reasonably) accurate estimates.

The overall scheme of sketch-based reach estimation is described in the
following diagram.

!["A flow diagram illustrating a sketch based ad reach estimation. On the client side, aggregatable reports are generated. These reports are then sent to a trusted execution environment on the server side, where they are processed by the Aggregation Service to produce sketches for each day. These sketches are then merged and fed into an estimator to calculate the final ad reach."](reach_whitepaper_figs/diagram.png "Computation diagram for sketch-based reach estimation")


We remark that there are many other types of sketches beyond counting Bloom
filters (CBFs), such as
[(Boolean) Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter);
even CBFs could use hashes with different distributions of values.
In this paper, we focus on uniform CBFs, since they are histograms and thus most
compatible with the API, and changing the distribution did not change the
experimental errors on our dataset.
However, it is plausible that other types of sketches can allow for better
utility.
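To make the merge property concrete, here is a toy, self-contained Python
sketch of a uniform CBF. The sketch size and hash function are placeholders
chosen for illustration, not values an ad tech would use in production.

```python
import hashlib

SKETCH_SIZE = 16  # m; kept tiny for illustration

def bucket_of(user_id: str) -> int:
    # Placeholder hash mapping a user id to one of m buckets.
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % SKETCH_SIZE

def build_cbf(user_ids) -> list:
    sketch = [0] * SKETCH_SIZE
    for uid in set(user_ids):  # each reached user is counted once
        sketch[bucket_of(uid)] += 1
    return sketch

def merge(a: list, b: list) -> list:
    # Summing two CBFs yields the CBF of the union of the two (multi)sets.
    return [x + y for x, y in zip(a, b)]

day1 = build_cbf(["u1", "u2", "u3"])
day2 = build_cbf(["u2", "u4"])
merged = merge(day1, day2)
# The merged sketch holds 5 contributions even though only 4 distinct users
# were reached ("u2" appears on both days), which is why the estimator only
# looks at non-empty buckets rather than the total count.
assert sum(merged) == 5
```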
As the sketch is simply a histogram, it is simple to implement via the Shared
Storage and Private Aggregation APIs.
Again, for simplicity, we only include the code snippet for the cap-one case
below. (Note that if there are many different content ids, one might consider
splitting them into groups using
[filtering IDs](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md).)


```javascript
// Sketch size m
const SKETCH_SIZE = 10000;
const L1_BUDGET = 65536;

function sketchHash(userId) {
  // Should output a number in 0, ..., SKETCH_SIZE - 1 based on which bucket of
  // the sketch the user id is hashed to.
}

function convertContentIdToBucket(contentId, userId) {
  // Note that the remainder of the bucket when divided by SKETCH_SIZE is
  // sketchHash(userId) and the quotient is contentId; hence, this formula
  // gives a unique id to each pair of contentId and sketchHash(userId).
  return BigInt(SKETCH_SIZE) * BigInt(contentId) + BigInt(sketchHash(userId));
}

class ReachMeasurementOperation {
  async run(data) {
    const { contentId } = data;

    // Read from Shared Storage
    const key = 'has-reported-content';
    const hasReportedContent = (await sharedStorage.get(key)) === 'true';

    // Do not report if a report has been sent already
    if (hasReportedContent) {
      return;
    }

    // Generate the aggregation key and the aggregatable value.
    // getUserId() is a placeholder for however the ad tech obtains the user
    // id that the sketch hash is applied to.
    const userId = getUserId();
    const bucket = convertContentIdToBucket(contentId, userId);
    const value = 1 * L1_BUDGET;

    // Send an aggregatable report via the Private Aggregation API
    privateAggregation.contributeToHistogram({ bucket, value });

    // Set the report submission status flag
    await sharedStorage.set(key, true);
  }
}
```


After aggregation, each content id will have a sketch associated with it.
To estimate the reach given the sketches, the following algorithm can be used.


```python
from typing import Callable

import numpy as np


def _invert_monotonic(
    f: Callable[[float], float], lower: float = 0, tolerance: float = 0.001
) -> Callable[[float], float]:
    """Inverts the monotonic function f."""
    f0 = f(lower)

    def inversion(y: float) -> float:
        """Inverted f."""
        assert f0 <= y, (
            "Positive domain inversion error. "
            f"f({lower}) = {f0}, but {y} was requested."
        )
        left = lower
        probe = 1
        while f(probe) < y:
            left = probe
            probe *= 2
        right = probe
        mid = (right + left) / 2
        while right - left > tolerance:
            f_mid = f(mid)
            if f_mid > y:
                right = mid
            else:
                left = mid
            mid = (right + left) / 2
        return mid

    return inversion


def _get_expected_non_zero_buckets(
    bucket_probabilities: list[float],
) -> Callable[[int], float]:
    """Creates a function computing the expected number of non-zero buckets.

    Args:
      bucket_probabilities: A list of probabilities of each bucket being
        selected when a user is added to a sketch.

    Returns:
      A function that, given the number of users, returns the expected number
      of non-empty buckets.
    """
    def internal(count: int) -> float:
        return sum(1 - pow(1 - p, count) for p in bucket_probabilities)
    return internal


def estimate_reach(
    sketches: list[list[int]],
    bucket_probabilities: list[float],
) -> float:
    """Estimates reach given the sketches.

    Args:
      sketches: the list of sketches, where each element of a sketch is an
        element of the histogram.
      bucket_probabilities: probability of a given bucket being chosen for a
        user; for uniform CBFs all the values are the same, while for an
        exponential CBF the ith element has value `alpha ** i` for some fixed
        value alpha.
    """
    sketches = np.array(sketches)

    # Threshold the (noisy) sketches so each bucket is 0 or 1, merge them, and
    # count the non-empty buckets of the merged sketch.
    thresholded_sketches = (sketches >= 0.5).astype(int)
    merged_sketch = thresholded_sketches.sum(axis=0)
    thresholded_merged_sketch = (merged_sketch >= 0.5).astype(int)
    non_zero_buckets_count = thresholded_merged_sketch.sum()

    estimator = _invert_monotonic(
        _get_expected_non_zero_buckets(bucket_probabilities),
        lower=0,
        tolerance=1e-7,
    )

    return estimator(non_zero_buckets_count)
```


## Empirical Evaluation


### Data

Due to privacy concerns, we evaluate errors on synthetic data in this paper.
Working with real ad datasets, we found that reach can be modeled in the
following way. Reach for one day follows a
[power-law distribution](https://arxiv.org/pdf/2311.13586) with shape parameter
$`b`$.
When analyzing different datasets based on different queries or various slice
definitions, such as hierarchical queries, we noticed a consistent pattern.
Specifically, these datasets exhibit a power-law distribution, albeit with
differing shape parameters.
We also observed that capping on the client side does not affect the shape of
the power-law distribution, indicating that campaign impressions are dropped by
capping with equal probability.
In contrast, capping does have a significant impact on the volume of reports
from clients.
The rate of reports dropped by capping, i.e., the capping discount rate, follows
a power-law distribution with respect to the capping value.
Figure 2 displays a representative discount rate versus capping plot and the
corresponding curve fit.

Furthermore, we examined the impact of the time window on reach. Reach does
not demonstrate a linear relationship with time window size.
758 | Instead, it exhibits a diminishing return compared to linear growth as 759 | illustrated in Figure 3. 760 | This discount follows a power-law function relative to the size of the time 761 | window. 762 | 763 | ![Reach estimates receive a discount as a function of cap value.](reach_whitepaper_figs/cap_discount.png) 764 | 765 | ![Reach does not increase linearly with time windows size.](reach_whitepaper_figs/window_discount.png) 766 | 767 | 768 | ```python 769 | def sample_slices( 770 | number_of_days, 771 | cap, 772 | dataset_power_law_shape, 773 | n_samples 774 | ) -> list[SliceSize]: 775 | """Generates a synthetic dataset in slices (tuples of capped and uncapped numbers). 776 | 777 | 778 | Args: 779 | number_of_days: what time window the query is using. 780 | cap: the maximal number of slices a user might contribute. 781 | dataset_power_law_shape: dataset specific power-law shape param. 782 | n_samples: number of slices to generate. 783 | """ 784 | 785 | 786 | # sample reach distribution from power-law distribution. 787 | reach_1_day = sample_discrete_power_law( 788 | b=dataset_power_law_shape, 789 | n_samples=n_samples, 790 | ...) 791 | # approximate impact of capping. 792 | # compute probability of being reported (success, not impacted by capped) 793 | # after capping. 794 | p_reported = (1 - cap_discount_rate(cap=cap)) 795 | # for a slice of true size n, each contribution has p_reported probability to 796 | # be reported. 
Thus after capping, observed contributions become 797 | # np.random.binomial(n, p) (there may be faster approximations) 798 | reach_1_day_capped = np.random.binomial(reach_1_day, p_reported) 799 | 800 | 801 | # approximate impact of time window 802 | reach_n_days = reach_1_day * number_of_days * window_discount_rate(number_of_days) 803 | reach_n_days_capped = reach_1_day_capped * number_of_days * window_discount_rate(number_of_days) 804 | 805 | 806 | # round to nearest int 807 | reach_n_days = np.round(reach_n_days).astype(int) 808 | reach_n_days_capped = np.round(reach_n_days_capped).astype(int) 809 | 810 | 811 | return [ 812 | (uncapped, capped) 813 | for uncapped, capped in zip(reach_n_days, reach_n_days_capped) 814 | ] 815 | ``` 816 | 817 | 818 | Listing 1: pseudocode to generate synthetic data. 819 | 820 | 821 | 822 | 823 | 826 | 829 | 833 | 834 | 835 | 837 | 839 | 841 | 842 | 843 | 845 | 847 | 849 | 850 | 851 | 853 | 855 | 857 | 858 | 859 | 861 | 863 | 865 | 866 |
| Level (granularity) | Query dimensions | Power law shape (b) param (# reach per bucket) |
| --- | --- | --- |
| 1 (coarsest) | advertiser | 1.01 |
| 2 | advertiser x campaign | 1.10 |
| 3 | advertiser x campaign x geo | 1.50 |
| 4 (finest) | advertiser x campaign x geo x ad strategy x creative | 1.60 |
Table 2: shape params used in the experiments for queries of different granularities.

Following the pseudocode presented in Listing 1, we generated synthetic datasets of varying sizes, capping values, and time windows for the experimental evaluation.


### Error Metrics

In our experiments we use the $`\text{RMSRE}_\tau`$ metric, defined as
[follows](https://developers.google.com/privacy-sandbox/relevance/attribution-reporting/design-decisions#expandable-9):
```math
\mathrm{RMSRE}_\tau\left(\{t_i\}_{i = 1}^n, \{e_i\}_{i = 1}^n\right) =
\sqrt{\frac{1}{n} \sum\limits_{i=1}^n \left(\frac{t_i - e_i}{\max(\tau, t_i)}\right)^2},
```
where
$`\{t_i\}_{i=1}^n`$ are the true values that we would like to measure, and
$`\{e_i\}_{i=1}^n`$ are the estimates.


### Experimental Evaluation

As mentioned above, the data used in the experiments is synthetically generated.
We perform four experiments:

1. First, we measure the error introduced by capping.
2. Next, we measure the error introduced by noise, for each granularity level
   and three different windows: one day, one week, and one month.
3. In addition, we measure the error of cumulative queries with and without
   the point contribution mechanism.
4. Finally, we examine the error of sketching-based methods.


### Results


#### Error Caused by Capping

To measure the error caused by capping, we sampled 1000 slices for each window
size (1 day, 7 days, 30 days) and each query granularity, and computed the
$`\text{RMSRE}_\tau`$ error where the estimated value is the observed (capped)
value, before adding noise or using sketches.
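The $`\text{RMSRE}_\tau`$ metric defined above translates directly into code; a minimal sketch:

```python
import math

def rmsre_tau(true_values, estimates, tau):
    """Root mean square relative error, with denominators floored at tau."""
    assert len(true_values) == len(estimates)
    n = len(true_values)
    return math.sqrt(
        sum(((t - e) / max(tau, t)) ** 2 for t, e in zip(true_values, estimates)) / n
    )
```

For example, a single slice with true value 100 estimated as 110 gives an error of 0.1; the `tau` floor prevents tiny true values from dominating the metric.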
![observation error](reach_whitepaper_figs/observation_error.png)

Next, we plot the fraction of the samples with error below a threshold t, as a
function of t.
The plot shows that to achieve an error of at most 0.1 with probability at least
80%, we need to choose a cap of at least 4.


#### Multiple Reach Measurements

To measure the error introduced by noise in the case of multiple measurements,
we sampled 1000 slices for each window size (1 day, 7 days, 30 days) and each
query granularity.
We then computed the $`\text{RMSRE}_\tau`$ error where the true value is the
observed (capped) value and the estimated value is the noised observed value.

![multiple reach measurement error](reach_whitepaper_figs/direct_error.png)

Next, we plot the fraction of the samples with total error (i.e., the error
including both capping and noise) below a threshold t, as a function of t.
The plot shows that:

- For a one-week window, error below 0.1 can be achieved for all granularities,
  and for the coarsest granularity error below 0.05 can be achieved with a cap
  of 4, with probability 80%.
- For a one-month window, error below 0.1 can be achieved for all granularities
  with probability at least 80%, even with a cap of 10.
- For a one-day window, error below 0.1 with probability 80% is only achievable
  with a cap of 1.

![total multiple reach measurement error](reach_whitepaper_figs/total_direct_error.png)


#### Cumulative Reach Method Error

To measure cumulative reach, one can simply use the approach described above
(i.e., the direct method); however, as mentioned before, the point contribution
mechanism adds noise with a smaller standard deviation.
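A useful piece of intuition here: when noise is added independently to many values that are later combined, the combined noise grows only as the square root of the number of values — summing d independent Laplace draws yields a standard deviation √d times that of a single draw. A toy check of this scaling (generic; not a simulation of either mechanism):

```python
import math
import random

def sample_laplace(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def std(samples):
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))

rng = random.Random(42)
scale, days, trials = 1.0, 30, 5000

# Standard deviation of one noise draw vs. a sum of `days` independent draws.
single = [sample_laplace(scale, rng) for _ in range(trials)]
summed = [sum(sample_laplace(scale, rng) for _ in range(days)) for _ in range(trials)]

ratio = std(summed) / std(single)  # close to sqrt(days), about 5.48
```

This is why a mechanism whose per-value noise has a smaller standard deviation compounds into a visibly better cumulative estimate by the end of the window.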
![cumulative reach method error](reach_whitepaper_figs/cumulative_error.png)

To show that this indeed provides better utility in applications, we plot the
fraction of the samples with total error on the last day (i.e., the 30th day)
below a threshold t, as a function of t.
As the plots show, for cap values where the observation error is not too large,
the point contribution method gives a significant improvement over the direct
method and should be used for cumulative queries.

![total cumulative reach method error](reach_whitepaper_figs/total_cumulative_error.png)


#### Sketch Method Error

For the last experiment, for window sizes of 1, 10, 20, …, 360 days, we sampled
100 slices and computed the $`\text{RMSRE}_\tau`$ error where the true value is
the observed (capped) value and the estimated value is obtained from sketches of
size 10000 (sketches need to be large to accommodate a year's worth of data).
Note that for each window size, we assume that reach is distributed uniformly
across the window.

![sketch method error](reach_whitepaper_figs/sketches_error.png)

We plot the window size against the total error; the shaded region denotes the
80% probability band.
As can be seen, the error is fairly stable across window sizes. For caps larger
than 3 the error is too high; however, for a cap equal to 3 the median error is
around 0.1 and stays within 0.2 with probability 80%.

![total sketch method error](reach_whitepaper_figs/total_sketches_error.png)


## Conclusion

Our experiments show that if the query is pre-determined and the window is less
than 30 days, the best solution is to use the
[point contribution mechanism](#point-contribution).
However, if the queries are not determined at collection time, or the necessary
window is over 30 days, then the sketch method can be used; it is important to
limit the cap to small values, since otherwise the error is too high.
The relatively large decrease in utility of the sketching method (compared to
the direct method) is due to the addition of noise to each intermediate sketch
before the final result is computed.
This could be addressed by evolving the API design to allow ad techs to perform
queries on sketches and add noise only to the final result.


![error comparison](reach_whitepaper_figs/errors_comparison.png)

We hope that this paper helps ad techs understand the trade-offs associated with
the various methodologies and develop a structured path to implementing reach
measurement using the APIs.

For the implementation of the experimental evaluations, please refer to the [evaluation code](https://github.com/google-research/google-research/tree/master/privacy_sandbox/reach_whitepaper).

## Acknowledgements

We would like to thank Badih Ghazi, Charlie Harrison, Ravi Kumar, Asha Menon, and Alex Turner for their input and help with this work.


## References

1. Jean Bolot, Nadia Fawaz, S. Muthukrishnan, Aleksandar Nikolov, Nina Taft: _Private decayed predicate sums on streams_. ICDT 2013: 284-295
2. Badih Ghazi, Ravi Kumar, Jelani Nelson, Pasin Manurangsi: _Private Counting of Distinct and k-Occurring Items in Time Windows_. ITCS 2023: 55:1-55:24
3. Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum: _Differential privacy under continual observation_. STOC 2010: 715-724
4. T.-H. Hubert Chan, Elaine Shi, Dawn Song: _Private and Continual Release of Statistics_. ICALP (2) 2010: 405-417
5. 
Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum: _Pure Differential Privacy for Rectangle Queries via Private Partitions_. ASIACRYPT (2) 2015: 735-751 1019 | 6. Joel Daniel Andersson, Rasmus Pagh: _A Smooth Binary Mechanism for Efficient Private Continual Observation_. NeurIPS 2023 -------------------------------------------------------------------------------- /reach_whitepaper_figs/cap_discount.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/cap_discount.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/cumulative_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/cumulative_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/diagram.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/direct_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/direct_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/errors_comparison.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/errors_comparison.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/observation_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/observation_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/sketches_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/sketches_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_cumulative_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_cumulative_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_direct_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_direct_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/total_sketches_error.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/total_sketches_error.png -------------------------------------------------------------------------------- /reach_whitepaper_figs/window_discount.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patcg-individual-drafts/private-aggregation-api/fc26ab5366899a12809b7816657ccc1da2e1c60d/reach_whitepaper_figs/window_discount.png -------------------------------------------------------------------------------- /report_verification.md: -------------------------------------------------------------------------------- 1 | # Preventing invalid Private Aggregation API reports with report verification 2 | 3 | ### Table of Contents 4 | 5 | - [Background](#background) 6 | - [Security goals](#security-goals) 7 | - [Existing threat practicality](#existing-threat-practicality) 8 | - [Shared Storage](#shared-storage) 9 | - [Details](#details) 10 | - [Deterministic number of reports](#deterministic-number-of-reports) 11 | - [Allows retrospective filtering](#allows-retrospective-filtering) 12 | - [Security considerations](#security-considerations) 13 | - [Privacy considerations](#privacy-considerations) 14 | - [Reduced delay](#reduced-delay) 15 | - [FLEDGE sellers](#fledge-sellers) 16 | - [Details](#details-1) 17 | - [Privacy considerations](#privacy-considerations-1) 18 | - [FLEDGE bidders](#fledge-bidders) 19 | - [Details](#details-2) 20 | - [New network requests](#new-network-requests) 21 | - [Need to list all possible token issuers](#need-to-list-all-possible-token-issuers) 22 | - [Need to limit the list of token issuers](#need-to-limit-the-list-of-token-issuers) 23 | - [Allocating returned tokens](#allocating-returned-tokens) 24 | - [Delegating token issuance](#delegating-token-issuance) 25 | - [Ensuring delegation consistency](#ensuring-delegation-consistency) 26 | - 
[Issuance mechanism](#issuance-mechanism) 27 | - [Redemption mechanism](#redemption-mechanism) 28 | - [Security considerations](#security-considerations-1) 29 | - [Privacy considerations](#privacy-considerations-2) 30 | - [Partitioning can amplify counting attacks](#partitioning-can-amplify-counting-attacks) 31 | - [Initial design](#initial-design) 32 | - [Potential mitigations](#potential-mitigations) 33 | - [Alternatives considered](#alternatives-considered) 34 | - [Using signals from interest group joining time](#using-signals-from-interest-group-joining-time) 35 | - [New network request](#new-network-request) 36 | - [Different security model](#different-security-model) 37 | - [Difficult to decide on number of tokens to issue](#difficult-to-decide-on-number-of-tokens-to-issue) 38 | - [Requires new persistent token storage](#requires-new-persistent-token-storage) 39 | - [Using signals from both auction running time and interest group joining time](#using-signals-from-both-auction-running-time-and-interest-group-joining-time) 40 | - [Separate token headers/fields](#separate-token-headersfields) 41 | - [Each origin picks one option](#each-origin-picks-one-option) 42 | - [One mechanism preferred, the other a fallback](#one-mechanism-preferred-the-other-a-fallback) 43 | - [Specifying a contextual ID and each possible IG owner](#specifying-a-contextual-id-and-each-possible-ig-owner) 44 | - [Trusted server report verification](#trusted-server-report-verification) 45 | - [Shared Storage in Fenced Frames](#shared-storage-in-fenced-frames) 46 | - [Details](#details-3) 47 | - [Doesn’t support nesting](#doesnt-support-nesting) 48 | - [Privacy considerations](#privacy-considerations-3) 49 | - [Extending to selectURL()](#extending-to-selecturl) 50 | 51 | ## Background 52 | 53 | This document proposes a set of API changes to enhance the security of the 54 | aggregatable reports by making it more difficult for bad actors to interfere 55 | with the accuracy of cross-site 
measurement. Note that a mechanism based on the 56 | [Private State Tokens API](https://github.com/WICG/trust-token-api) has been 57 | [proposed](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md) 58 | for the Attribution Reporting API. 59 | 60 | The proposal is separated by the different contexts the Private Aggregation API 61 | can be invoked in as the constraints and designs differ substantially. 62 | 63 | ## Security goals 64 | 65 | Our security goals match the Attribution Reporting proposal’s, see 66 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#security-goals) 67 | for details. Briefly, our primary security goals are: 68 | 69 | 1. No reports out of thin air 70 | 2. No replaying reports 71 | 72 | We also share the secondary goals: 73 | 74 | 3. Privacy of the invalid traffic (IVT) detector 75 | 4. Limit the attack scope for bad actors that can bypass IVT detectors 76 | 5. No report mutation (lower priority) 77 | 78 | ### Existing threat practicality 79 | 80 | As with the Attribution Reporting API, we don’t currently prevent reports being 81 | created out of thin air, but practical attacks are challenging. More details 82 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#existing-mitigations-and-practical-threats). 83 | 84 | ## Shared Storage 85 | 86 | When triggering a Shared Storage operation that could send an aggregatable 87 | report, we propose allowing the site to specify a high-entropy ID from 88 | outside the isolated context. This ID would then be embedded unencrypted in the 89 | report issued by that worklet operation, e.g. 
adding the following key to the 90 | report: 91 | 92 | ```jsonc 93 | "context_id" : "example_string", 94 | ``` 95 | 96 | This would be achieved by adding a new optional parameter to the Shared Storage 97 | `run()` and `selectURL()` APIs, e.g.: 98 | 99 | ```js 100 | sharedStorage.run('someOperation', {'privateAggregationConfig': {'contextId': 'example_string'}}); 101 | ``` 102 | 103 | Note that this design does not support report verification for Shared Storage 104 | operations run from within a fenced frame. See 105 | [below](#shared-storage-in-fenced-frames) for a discussion of that case. 106 | 107 | An approach based on Private State Tokens was not proposed as it would add 108 | complexity and offer strictly less power than ID-based filtering for invalid 109 | traffic filtering. 110 | 111 | ### Details 112 | 113 | #### Deterministic number of reports 114 | 115 | One key concern with this approach is that the number of reports (with that ID) 116 | could be used to exfiltrate cross-site information. So, when an ID is specified 117 | for a Shared Storage operation, we ensure that a single report is sent no matter 118 | how many calls to `sendHistogramReport()` occur (including zero). Instead, this 119 | report would have a variable number of contributions embedded (see [batching 120 | proposal](https://github.com/patcg-individual-drafts/private-aggregation-api#reducing-volume-by-batching)). 121 | To avoid leaking the number of contributions, we will need to 122 | [pad](https://github.com/patcg-individual-drafts/private-aggregation-api#padding) 123 | the encrypted payload. Additionally, if a context has run out of budget, a 124 | report should still be sent (containing no contributions). 125 | 126 | #### Allows retrospective filtering 127 | 128 | This approach allows for a server to retroactively alter its decisions on report 129 | validity. 
validity. For example, if a new signal for invalid traffic is determined,
previous reports with that signal could be marked as invalid too (if they have
not yet been processed).

#### Security considerations

This option easily achieves all of the higher-priority security goals:

- **No reports out of thin air**: Any report without an ID, or with an
  unexpected ID, can be discarded as invalid. These IDs are high-entropy and so
  can be made infeasible to guess.
- **No replaying reports**: Each ID can be unique, allowing discarding of
  reports with a repeated ID.
- **Privacy of the invalid traffic (IVT) detector**: The valid/invalid decision
  associated with an ID can be made server-side and need not be revealed to the
  client. In fact, the decision need not happen immediately, see [Allows
  retrospective filtering](#allows-retrospective-filtering).
- **Limit the attack scope for bad actors that can bypass IVT detectors**: By
  using a unique ID, server-side checks could be added to ensure that the
  metadata fields in the report match the expected values.
- **No report mutation**: Only partially addressed. The server can verify that
  plaintext fields in the report “reasonably match” their expectations. This
  does not prevent mutation of the payload or of fields that could have multiple
  reasonable values (e.g. small changes to scheduled\_report\_time).

#### Privacy considerations

Adding a high-entropy ID may allow for timing attacks. For example, if a report
is not issued until after a Shared Storage operation completes, the reporting
origin could in principle use the scheduled reporting time to learn something
about how long the operation took to run. This is currently mitigated by a
randomized delay, but we also plan to add a timeout in Shared Storage, see
[Reduced delay](#reduced-delay) below.
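The first two security goals above lend themselves to a simple server-side check: track the set of context IDs handed out and reject reports whose ID is unknown or already redeemed. A minimal sketch (our own illustration; the proposal does not prescribe a server implementation):

```python
import secrets

class ContextIdVerifier:
    """Tracks issued context IDs and rejects unknown or replayed ones."""

    def __init__(self):
        self._pending = set()   # IDs handed out, no report received yet
        self._redeemed = set()  # IDs already seen on a report

    def new_context_id(self):
        # High-entropy ID, infeasible to guess ("no reports out of thin air").
        context_id = secrets.token_urlsafe(16)
        self._pending.add(context_id)
        return context_id

    def accept_report(self, context_id):
        if context_id in self._redeemed:
            return False  # repeated ID: "no replaying reports"
        if context_id not in self._pending:
            return False  # never issued: report out of thin air
        self._pending.discard(context_id)
        self._redeemed.add(context_id)
        return True
```

Because the validity decision lives entirely server-side, it can also be deferred, matching the retrospective filtering described above.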
162 | 163 | Adding a high-entropy ID also allows for the reports to be arbitrarily 164 | partitioned. However, by making the count of reports with the given ID 165 | [deterministic](#deterministic-number-of-reports), we avoid the [major concern 166 | this introduces](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#could-we-just-tag-reports-with-a-trigger_id-instead-of-using-anonymous-tokens) 167 | (non-noisy leaks through counts). We do not consider the ability to process only 168 | a chosen subset of reports to be a privacy concern, given other protections 169 | (e.g. adding noise to the summary report). 170 | 171 | #### Reduced delay 172 | 173 | Currently, reports are delayed by up to one hour to avoid revealing an 174 | association between the issued reports and the original context. As this 175 | approach explicitly reveals this association (with other mitigations), we can 176 | shorten these delays. We plan to impose a 5 second timeout on Shared Storage 177 | operations making contributions. We then plan to wait until the timeout to 178 | send a report, even if execution finishes early. This avoids leaking 179 | information through how long the operation took to run. We also considered 180 | instead keeping a shorter randomized delay (e.g. up to 1 minute), but that 181 | did not seem necessary. 182 | 183 | ## FLEDGE sellers 184 | 185 | We propose a very similar mechanism for FLEDGE seller reporting as for Shared 186 | Storage worklets. That is, we’ll allow the site to specify a high-entropy 187 | ID from outside the isolated context and this ID would then be embedded 188 | unencrypted in the report issued by that seller within that auction, e.g.: 189 | 190 | ```jsonc 191 | "context_id" : "example_string", 192 | ``` 193 | 194 | The seller would specify this ID through an optional parameter into the 195 | `auctionConfig`, e.g.: 196 | 197 | ```js 198 | const myAuctionConfig = { 199 | ... 
200 | 'privateAggregationConfig': { 201 | 'contextId': 'example_string', 202 | } 203 | }; 204 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig); 205 | ``` 206 | 207 | ### Details 208 | 209 | See the [Shared Storage section](#shared-storage) for more details. 210 | 211 | #### Privacy considerations 212 | 213 | Like for shared storage, adding a high-entropy ID could allow for timing attacks 214 | as the reporting origin could use the scheduled reporting time to learn 215 | something about when the report was triggered. This is partially mitigated by 216 | the existing randomized reporting delay (10-60 min) imposed as FLEDGE auctions 217 | impose small timeouts (e.g. 0.5 s). As discussed 218 | [above](#privacy-considerations) for Shared Storage, we avoid concerns about 219 | partitioning by making the number of reports deterministic (and other 220 | protections). 221 | 222 | ## FLEDGE bidders 223 | 224 | **Note: Unlike the above sections which offer relatively straightforward approaches, 225 | this section is highly complex and nuanced. Feedback is appreciated!** 226 | 227 | We can’t easily use a contextual ID for the FLEDGE bidder case as the existence 228 | of a bidder in a particular auction is inherently cross-site data, see 229 | [below](#specifying-a-contextual-id-and-each-possible-ig-owner). So, our options 230 | are more limited and we focus on mechanisms using Private State Tokens. 231 | 232 | However, note also that there are no existing network requests that we can easily reuse 233 | for token issuance. While there is a trusted signals fetch, that is 234 | intentionally uncredentialed. Much like using an ID, we can’t just add a network 235 | request for each bidder as that would reveal cross-site data. 236 | 237 | So, we handle token issuance by adding a new optional parameter to 238 | `runAdAuction()`, e.g.: 239 | 240 | ```js 241 | const myAuctionConfig = { 242 | ... 243 | 'privateAggregationConfig': { 244 | ... 
245 | 'tokenIssuanceURLs': [ 246 | 'https://origin1.example/path?signal1=abc,signal2=def', 247 | 'https://origin2.example/some-other-path', 248 | 'https://origin3.example/etc', 249 | ], 250 | // How many tokens to request from each listed issuer. Optional, defaults to 251 | // each issuer's batch size. 252 | 'numTokensPerIssuer': 10, 253 | } 254 | } 255 | const auctionResultPromise = navigator.runAdAuction(myAuctionConfig); 256 | ``` 257 | 258 | This would trigger a token issuance request for each listed origin (see 259 | [below](#issuance-mechanism)). Each token would be redeemed along with any later 260 | reports’ network requests (see [below](#redemption-mechanism)). 261 | 262 | If this token successfully verifies, then the reporting origin has a guarantee 263 | that the report was associated with a `runAdAuction()` request that was signed. 264 | 265 | ### Details 266 | 267 | #### New network requests 268 | 269 | This requires the addition of a new network request for each listed token issuer, 270 | emitted when `runAdAuction` is invoked. 271 | 272 | #### Need to list all possible token issuers 273 | 274 | As the presence of bidders in an auction is inherently cross-site, we require 275 | listing all possible token issuers from the publisher site. The user agent will 276 | then unconditionally perform a token issuance request for each listed token 277 | issuer to avoid cross-site leakage, i.e. even if the issuer is not used by any 278 | bidder in the auction. 279 | 280 | #### Need to limit the list of token issuers 281 | 282 | The user agent will also need to impose a limit on the number of token issuers 283 | listed in each auction to avoid too many network requests being added. 284 | Practically, this means interest group owners will likely need to use the 285 | [delegation mechanism](#delegating-token-issuance). 286 | 287 | #### Allocating returned tokens 288 | 289 | A single bidder origin may own multiple interest groups that a user is enrolled 290 | in. 
in. Additionally, multiple interest group owner origins may use the same token
issuer (due to [delegation](#delegating-token-issuance)). In these cases, the
interest groups will have to share the issued tokens.

In the case of multiple owner origins using the same token issuer, tokens can’t
be reused, as we don’t want to reveal that both interest group owners were
present in the same auction (for the same user). However, multiple tokens can be
requested from a single token issuer to mitigate this. If not enough tokens were
issued, some reports will be sent unattested.

In the case of multiple interest groups with the same owner, the histogram
contributions should be
[batched](https://github.com/patcg-individual-drafts/private-aggregation-api#reducing-volume-by-batching)
together into a single report, avoiding the need to use multiple tokens.
However, the [extended reporting
plans](https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md)
for Private Aggregation allow fenced frames to trigger reports indirectly
with `window.fence.reportPrivateAggregationEvent()`. This could occur
arbitrarily later, so we may need to ignore events triggered too long after the
auction (e.g. after 1 hour). We could consider replacing the randomized delay
with simply waiting until the timeout, even if execution finishes early.

#### Delegating token issuance

Interest group owners will be able to delegate their token issuance by hosting a
`.well-known` file which specifies the origin to delegate to. This will be
optional (i.e. each origin can choose itself as its token issuer), but note that
all origins choosing themselves would likely exceed limits; see [Need to limit
the list of token issuers](#need-to-limit-the-list-of-token-issuers) above.
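The allocation rules above — at most one token per report, shared pools per issuer, and unattested reports once a pool runs dry — can be sketched as follows (an illustration of the stated constraints, not a specification of browser behavior; all names are hypothetical):

```python
def allocate_tokens(reports, issuer_for_owner, tokens_per_issuer):
    """Assigns at most one token to each report, drawing from per-issuer pools.

    reports: list of interest-group owner origins, one entry per pending report.
    issuer_for_owner: owner origin -> token issuer origin (via delegation).
    tokens_per_issuer: issuer origin -> list of unused issued tokens.

    Returns a list of (owner, token-or-None); None means the report is sent
    unattested because its issuer's pool was exhausted.
    """
    pools = {issuer: list(tokens) for issuer, tokens in tokens_per_issuer.items()}
    allocation = []
    for owner in reports:
        issuer = issuer_for_owner.get(owner)
        pool = pools.get(issuer, [])
        # Tokens are never reused across owners sharing an issuer: each
        # report consumes one token from the shared pool.
        allocation.append((owner, pool.pop() if pool else None))
    return allocation
```

For example, two owners delegating to the same issuer with only two issued tokens will leave a third report unattested, matching the behavior described above.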
319 | 320 | ##### Ensuring delegation consistency 321 | 322 | To ensure that the same file is served across different browser instances, the 323 | user agent vendor may re-distribute these files through a separate mechanism. 324 | Further, to ensure that the origin does not change frequently, the user agent 325 | could impose some limits on the rotation frequency. 326 | 327 | #### Issuance mechanism 328 | 329 | Token issuance network requests will be sent to the specified token issuer URLs. 330 | The URL path and query string allows for metadata to be embedded by the seller, 331 | but note that only the token issuance _origin_ is used for 332 | [delegation](#delegating-token-issuance). Each request will have a 333 | `Sec-Private-Aggregation-Private-State-Token` header with one or more blinded 334 | messages (each of which embeds a report\_id) according to the number of tokens 335 | requested. If the number of tokens is not requested, the token issuer’s [batch 336 | size](https://source.chromium.org/chromium/chromium/src/+/main:services/network/public/mojom/trust_tokens.mojom;drc=96d76471a47949536f88e90cbf03596cda41f6e1;l=232) 337 | will be used. The token issuer will inspect the request and decide whether it is 338 | valid, i.e. whether the issuer suspects it is coming from a real, honest client 339 | and should therefore be allowed to generate aggregatable reports. 340 | 341 | If the request is considered **invalid** and hence shouldn’t be taken into 342 | account to calculate aggregate measurement results, the origin should respond 343 | without adding a `Sec-Private-Aggregation-Private-State-Token` response header. 344 | If this header is omitted or is not valid, the browser will proceed normally, 345 | but any report generated will not contain the report verification header. Note: 346 | more advanced deployments can consider issuing an "invalid" token using private 347 | metadata to avoid the client learning the detection result. 
See privacy of the 348 | IVT detector in [Security considerations](#security-considerations-1) for more 349 | details. 350 | 351 | If the request is considered **valid**, the origin should add a 352 | `Sec-Private-Aggregation-Private-State-Token` header with a blind token (the 353 | blind signature over the blinded message) for each blinded message included in 354 | the original request. The origin could also return a token for only a subset of 355 | the blinded messages if it wishes to limit the number of tokens issued to limit 356 | exfiltration risk. 357 | 358 | Internally, the browser will store the token associated with any generated 359 | report until it is sent. 360 | 361 | #### Redemption mechanism 362 | 363 | If a token is [allocated](#allocating-returned-tokens) to an aggregatable 364 | report, it will be sent along with the report’s request in the form of a new 365 | request header `Sec-Private-Aggregation-Private-State-Token`. If this token is 366 | successfully verified, then the reporting origin has a guarantee that the report 367 | was associated with a previous request that was signed. 368 | 369 | Note: unlike the basic Private State Token API (which enables conveying tokens 370 | from one site to another), there are no redemption limits for Private 371 | Aggregation API integration. See [Privacy 372 | considerations](#privacy-considerations-2) for discussion of other mitigations. 373 | 374 | #### Security considerations 375 | 376 | This option easily achieves the primary security goals plus some secondary 377 | security goals. The considerations largely match the Attribution Reporting 378 | proposal’s given the similar token-based approach, see 379 | [here](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#security-considerations) 380 | for details. 
381 | 382 | #### Privacy considerations 383 | 384 | Much like Attribution Reporting’s 385 | [proposal](https://github.com/WICG/attribution-reporting-api/blob/main/report_verification.md#privacy-considerations), 386 | this integration is intended to be as privacy-neutral as possible. In 387 | particular, we want to avoid cross-site information leakage. While each token’s 388 | issuance occurs using a request from a single site, this token – including its 389 | metadata, or no token if none was issued – will later be sent with a report from 390 | a bidder. Which bidders participated in an auction is itself cross-site 391 | data. 392 | 393 | ##### Partitioning can amplify counting attacks 394 | 395 | If the count of reports is sensitive, this partitioning could amplify counting 396 | attacks. However, note that reports can already be partitioned by the 397 | `scheduled_report_time` and `api` fields. There are designs for [protecting the 398 | count of encrypted reports](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATE.md#hide-the-true-number-of-attribution-reports) 399 | to mitigate or eliminate the risk of counting attacks. These designs target the 400 | Attribution Reporting API, but could be adapted for Private Aggregation. Still, even 401 | with less extreme mitigations in place, there are privacy benefits to reducing the 402 | available partitioning. 403 | 404 | ##### Initial design 405 | 406 | For the initial design, we do not plan to implement any changes to the Private 407 | State Token protocol’s public/private metadata bits. So, each token will have 408 | [six buckets](https://github.com/WICG/trust-token-api/blob/main/ISSUER_PROTOCOL.md#issuance-metadata) 409 | of metadata embedded. Further, each report could either have a token or no 410 | token, allowing up to 7 total possibilities (~2.8 bits). This would therefore 411 | allow the reporting origin to partition its reports into 7 buckets.
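The bucket arithmetic above can be checked directly: six metadata buckets plus the no-token case give seven distinguishable states per report.

```javascript
// Six public/private metadata buckets per token, plus the "no token" case.
const metadataBuckets = 6;
const states = metadataBuckets + 1; // 7 distinguishable states per report
const bits = Math.log2(states); // ~2.81 bits of partitioning

// A second, independent token field would square this:
// 7 ** 2 = 49 states, i.e. Math.log2(49) ≈ 5.61 bits.
console.log(states, bits.toFixed(2)); // → 7 2.81
```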
412 | 413 | ##### Potential mitigations 414 | 415 | We could consider mitigations in the future. For example: 416 | 417 | - restricting the public/private metadata to one bucket – or just a single 418 | private bit to avoid an invalid traffic oracle. 419 | - refusing to send reports to reporting origins using report verification if no 420 | token was available/issued. 421 | - sending null reports with some frequency for buyers that delegate to an issuer 422 | who issued tokens. 423 | 424 | ### Alternatives considered 425 | 426 | #### Using signals from interest group joining time 427 | 428 | Alternatively, we could associate any trust signals available at the 429 | `joinAdInterestGroup()` call with reports later sent from a bidder under that 430 | interest group. 431 | 432 | Token issuance could be handled by adding a new optional parameter to 433 | `joinAdInterestGroup()`, e.g.: 434 | 435 | ```js 436 | const myGroup = { 437 | ... 438 | 'privateAggregationTokens': 10, // number of tokens to request 439 | } 440 | const joinPromise = navigator.joinAdInterestGroup(myGroup, 30 * kSecsPerDay); 441 | ``` 442 | 443 | This would trigger a token issuance request (see [above](#issuance-mechanism)) 444 | with the requested number of blinded messages. Each resulting token would be 445 | redeemed along with the later report’s network request (see 446 | [above](#redemption-mechanism)). 447 | 448 | If the token successfully verifies, then the reporting origin has a guarantee 449 | that the report was associated with a previous `joinAdInterestGroup()` request 450 | that was signed. 451 | 452 | ##### New network request 453 | 454 | This requires the addition of one new network request at `joinAdInterestGroup()` 455 | time. 456 | 457 | ##### Different security model 458 | 459 | This approach uses a different security model to Attribution Reporting’s, with a 460 | potentially large time delay between token issuance and use. 
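To make the storage implications of this alternative concrete, a toy model of a per-interest-group token pool might look like the following. All names are hypothetical, not part of any proposal; a real browser-side store would also need to track issuer key epochs.

```javascript
// Toy model: tokens obtained at joinAdInterestGroup() time are held until a
// later aggregatable report consumes one.
class InterestGroupTokenStore {
  constructor() {
    this.tokens = [];
  }

  // Called after a successful issuance at joinAdInterestGroup() time.
  deposit(tokens) {
    this.tokens.push(...tokens);
  }

  // Called when an aggregatable report is about to be sent. Returns the token
  // to attach, or null if the pool is exhausted (the report goes unverified).
  takeOne() {
    return this.tokens.length > 0 ? this.tokens.shift() : null;
  }
}
```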
461 | 462 | ##### Difficult to decide on number of tokens to issue 463 | 464 | Due to this large time delay between token issuance and last possible use, it 465 | will be difficult to decide on the number of tokens to issue. If too few are 466 | issued, later auctions may go unattested. Issuing too many may 467 | degrade performance, e.g. by unnecessarily using storage, and may exacerbate token 468 | exfiltration attacks. 469 | 470 | ##### Requires new persistent token storage 471 | 472 | This approach requires Private State Tokens to be persisted for later use. This 473 | store will need to be separate from the existing Private State Token store. Note 474 | also that key rotations will cause issues here, as any tokens issued before a 475 | rotation could not be used after it. 476 | 477 | #### Using signals from both auction running time and interest group joining time 478 | 479 | We could combine the functionality of both the proposal and the above 480 | alternative. There are a few different ways we could do this. 481 | 482 | ##### Separate token headers/fields 483 | 484 | We could allow both mechanisms to be independently implemented. Separate 485 | headers could be used to distinguish between the two. This would allow for 486 | maximum flexibility, but comes at a possible complexity and privacy cost. 487 | 488 | **Privacy risk:** By supporting two separate token fields, the number of 489 | possible token states is ‘squared’. That is, without additional mitigations, 490 | adding a second Private State Token field would increase the number of states 491 | from 7 to 49 (~5.6 bits). This partitioning would allow for amplified counting 492 | attacks unless other mitigations are put in place; see 493 | [above](#privacy-considerations-2). 494 | 495 | ##### Each origin picks one option 496 | 497 | We could allow each origin to pick one of the two mechanisms, via a mechanism 498 | similar to the one for picking a token issuer.
Any attempt to use the other mechanism 499 | would be ignored or cause an error. 500 | 501 | ##### One mechanism preferred, the other a fallback 502 | 503 | We could allow for both mechanisms, but only allow one token to be bound to each 504 | report. If a token is available via each mechanism, the browser will prefer one 505 | (e.g. the token associated with runAdAuction()). 506 | 507 | #### Specifying a contextual ID and each possible IG owner 508 | 509 | Instead of using Private State Tokens, we could also use a contextual ID here. 510 | But, to avoid a cross-site leak, this would require that a report be sent to 511 | each origin listed in `interestGroupBuyers`, even if that bidder did not 512 | actually participate in the auction. This could lead to a large number of (null) 513 | reports, which would pose a performance concern. 514 | 515 | #### Trusted server report verification 516 | 517 | Ideally for performance, the user agent would only request a token 518 | for reports that are actually going to be sent. But that would inherently leak 519 | cross-site data, which we can't allow. However, it might be possible to design a 520 | trusted server architecture that can perform the required invalid traffic 521 | determination and token issuance while ensuring that any cross-site data is not 522 | persisted. This is not feasible in the short term, however, as it would require 523 | significant design and exploration. 524 | 525 | ## Shared Storage in Fenced Frames 526 | 527 | When a shared storage operation is run from a fenced frame instead of a 528 | document, we can no longer set a contextual ID. Any cross-site information the 529 | fenced frame has could be embedded in the context ID, so the ability to set it 530 | is disabled. 531 | 532 | Instead, we propose allowing a Private State Token to be bound to the 533 | FencedFrameConfig output of a FLEDGE auction.
We would reuse the FLEDGE bidder 534 | mechanism chosen [above](#fledge-bidders) and take an additional token from the 535 | same source for this purpose. When the shared storage worklet triggers a report 536 | to be sent, any context ID specified would be ignored and the token would be 537 | used instead. 538 | 539 | ### Details 540 | 541 | As it uses the same token source, most details match the FLEDGE bidder 542 | discussion (see [above](#details-2)). Additional considerations are listed 543 | below. 544 | 545 | #### Doesn’t support nesting 546 | 547 | This proposal does not currently support cross-origin subframes or nested fenced frames 548 | within the top-level fenced frame. 549 | 550 | #### Privacy considerations 551 | 552 | As discussed [above](#privacy-considerations-2), adding a token allows reports 553 | to be partitioned, which exacerbates the risk of a counting attack. 554 | 555 | This design also implicitly reveals whether a Shared Storage worklet’s 556 | aggregatable report came from an operation run by a document or a fenced frame. 557 | This may allow for further partitioning, but is unlikely to be a significant 558 | issue. 559 | 560 | ### Extending to selectURL() 561 | 562 | Further design work is needed to extend this mechanism to fenced frames 563 | rendering the output of a `selectURL()` operation. 564 | -------------------------------------------------------------------------------- /security_and_privacy_questionnaire.md: -------------------------------------------------------------------------------- 1 | # Security and Privacy Questionnaire 2 | 3 | Responses to the W3C TAG’s [Self-Review Questionnaire: Security and 4 | Privacy](https://w3ctag.github.io/security-questionnaire/) for the Private 5 | Aggregation API. 6 | 7 | ### 2.1. What information does this feature expose, and for what purposes? 8 | 9 | This API lets isolated contexts with access to cross-site data (i.e. 
[Shared 10 | Storage](https://github.com/WICG/shared-storage) worklets/[Protected 11 | Audience](https://github.com/WICG/turtledove) script runners) send aggregatable 12 | reports over the network. Aggregatable reports contain encrypted, high-entropy 13 | cross-site information, in the form of key-value pairs (i.e. contributions to a 14 | histogram), but this information is not exposed directly. Instead, these reports 15 | can only be processed by a trusted aggregation service. This trusted aggregation 16 | service sums the values across the reports for each key and adds noise to each 17 | of these values to produce ‘summary reports’. It also limits the number of times 18 | that reports may be queried. 19 | 20 | The aggregatable reports also contain some unencrypted metadata that is not 21 | based on cross-site information. 22 | 23 | The purpose of this API is to allow generic aggregate cross-site measurement for 24 | a range of use cases, even if third-party cookies are no longer available. Use 25 | cases include reach measurement and Protected Audience auction reporting. 26 | 27 | ### 2.2. Do features in your specification expose the minimum amount of information necessary to implement the intended functionality? 28 | 29 | We strictly limit access to the cross-site information embedded in the 30 | aggregatable reports. The cross-site information embedded in these reports is 31 | encrypted and only processable by a trusted aggregation service. The output of 32 | that processing will be an aggregate, noised histogram. The service ensures that 33 | no report can be processed multiple times. Further, information exposure is 34 | limited by contribution budgets on the client. In principle, this framework can 35 | support specifying a noise parameter that satisfies differential privacy. 36 | 37 | The plaintext portion of an aggregatable report includes information necessary 38 | to organize (batch) reports for aggregation.
The encrypted portion is assumed to 39 | be not readable by an attacker (except for ciphertext size). 40 | 41 | The amount of information exposed by this API is a product of the privacy 42 | parameters used (e.g. contribution limits and the noise distribution). While we 43 | aim to minimize the amount of information exposed, we also aim to support a wide 44 | range of use cases. The privacy parameters can be customized to reflect the 45 | appropriate tradeoff between information exposure and utility. The exact choice 46 | of parameters is currently left unfixed to allow for exploration of this 47 | tradeoff and will eventually be fixed based on community feedback. 48 | 49 | These reports also expose a limited amount of metadata, which is not based on 50 | cross-site data. However, the number of reports with the given metadata could 51 | expose some cross-site information. To protect against this, we make the number 52 | of reports deterministic in certain situations (sending reports containing no 53 | contributions in the payloads if necessary). We are considering mitigations for 54 | other situations, e.g. adding noise to the report count. 55 | 56 | The recipient of the report may also be able to observe side-channel information 57 | such as the time when the report was sent, or IP address of the sender. 58 | 59 | ### 2.3. Do the features in your specification expose personal information, personally-identifiable information (PII), or information derived from either? 60 | 61 | This API does not directly expose PII or personal information. However, it is a 62 | generic mechanism that does not place any limits on the kinds of data that sites 63 | may encapsulate into the reports. See above for how all cross-site information 64 | is protected. 65 | 66 | ### 2.4. How do the features in your specification deal with sensitive information? 67 | 68 | See 2.3. 69 | 70 | ### 2.5. Do the features in your specification introduce state that persists across browsing sessions? 
71 | 72 | Yes, we introduce new storage for reports that are not yet sent (i.e. scheduled 73 | to be sent in the future), for enforcing limits on the total sum of contribution 74 | values (per-reporting site, per-context type, per-10 min / per-day) and for 75 | caching the public keys of the trusted aggregation service. These all persist 76 | across browsing sessions, but will be cleared along with other site data when 77 | requested by a user and are not exposed to JavaScript. 78 | 79 | ### 2.6. Do the features in your specification expose information about the underlying platform to origins? 80 | 81 | No 82 | 83 | ### 2.7. Does this specification allow an origin to send data to the underlying platform? 84 | 85 | No 86 | 87 | ### 2.8. Do features in this specification enable access to device sensors? 88 | 89 | No 90 | 91 | ### 2.9. Do features in this specification enable new script execution/loading mechanisms? 92 | 93 | No, but this API is proposed only for the new isolated script execution contexts 94 | specified by other proposed features (i.e. Shared Storage worklets/Protected 95 | Audience script runners). 96 | 97 | ### 2.10. Do features in this specification allow an origin to access other devices? 98 | 99 | No 100 | 101 | ### 2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI? 102 | 103 | No 104 | 105 | ### 2.12. What temporary identifiers do the features in this specification create or expose to the web? 106 | 107 | None 108 | 109 | ### 2.13. How does this specification distinguish between behavior in first-party and third-party contexts? 110 | 111 | This API is only exposed in isolated contexts that may have access to cross-site 112 | data. There are mechanisms proposed for controlling access to those isolated 113 | contexts, e.g. see Protected Audience’s response 114 | [here](https://github.com/w3ctag/design-reviews/issues/723). 115 | 116 | ### 2.14. 
How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode? 117 | 118 | The contexts this API is exposed in (Shared Storage worklets and Protected 119 | Audience script runners) are not available in Private Browsing/Incognito mode, 120 | so it is not possible to use this API in that mode. 121 | 122 | ### 2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections? 123 | 124 | We are still working on the spec, but it will include both sections. 125 | 126 | ### 2.16. Do features in your specification enable origins to downgrade default security protections? 127 | 128 | No 129 | 130 | ### 2.17. What happens when a document that uses your feature is kept alive in BFCache (instead of getting destroyed) after navigation, and potentially gets reused on future navigations back to the document? 131 | 132 | This API is not available in document contexts, so there is no need to handle 133 | this case. 134 | 135 | ### 2.18. What happens when a document that uses your feature gets disconnected? 136 | 137 | This API is not available in document contexts, so there is no need to handle 138 | this case. 139 | 140 | ### 2.19. What should this questionnaire have asked? 141 | 142 | N/A 143 | --------------------------------------------------------------------------------