├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── README_depricated.md
├── analyses
    ├── 2018_09_biskit1_mordax__canvas_fingerprinting.ipynb
    ├── 2018_12_LABBsoft_tracking_review
    │   ├── Ad Blocker Report.md
    │   ├── Evercookies Report.md
    │   ├── RelevantSymbolCounter.py
    │   ├── Tracking Method Sources.md
    │   ├── Tracking Methods.md
    │   ├── Tracking Report Template.md
    │   └── window.name Report.md
    ├── 2018_12_ddobre_static_analysis
    │   ├── 1-get_script_urls
    │   │   ├── README.md
    │   │   ├── config.ini
    │   │   ├── explore_url_lists.ipynb
    │   │   ├── generate_url_list_spark.py
    │   │   ├── requirements.txt
    │   │   ├── test_generate_url_list_spark.py
    │   │   └── test_urls.csv
    │   ├── 2-scrape_js
    │   │   ├── README.md
    │   │   ├── async_js_get.py
    │   │   ├── config.ini
    │   │   ├── downloads_analysis
    │   │   │   ├── README.md
    │   │   │   ├── compare_condensed_with_full.py
    │   │   │   ├── explore_downloads.ipynb
    │   │   │   ├── extract_hashes_from_full_dataset.py
    │   │   │   └── js_status.csv
    │   │   ├── requirements.txt
    │   │   └── single_js_get.py
    │   ├── 3-generate_symbols_of_interest
    │   │   ├── README.md
    │   │   ├── config.ini
    │   │   ├── master.txt
    │   │   ├── process_APIs.py
    │   │   └── symbol_dict.json
    │   ├── 4-ast_analysis
    │   │   ├── README.md
    │   │   ├── async_tree_explorer.py
    │   │   ├── config.ini
    │   │   ├── master_sym_list.json
    │   │   ├── new_async_tree_explorer.py
    │   │   ├── output_data
    │   │   │   ├── extended_symbol_counts.json
    │   │   │   └── symbol_counts.json
    │   │   ├── requirements.txt
    │   │   └── single_tree_explorer.py
    │   └── README.md
    ├── 2018_12_willoughr__fingerprinting_prevalence.txt
    ├── 2019_03_willougr_fingerprinting_implementation_sixth_sense
    │   ├── Audio Fingerprinting Heuristics.ipynb
    │   ├── Canvas Fingerprinting Heuristics.ipynb
    │   ├── Font Fingerprinting Heuristics.ipynb
    │   ├── README.md
    │   └── WebRTC Fingerprinting Heuristics.ipynb
    ├── README.md
    ├── environment.yaml
    ├── hello_mozfest.ipynb
    ├── hello_world.ipynb
    ├── hello_world.md
    ├── issue_34_setup_and_dask_tips.ipynb
    └── issue_36.ipynb
├── data_prep
    ├── Process All Data.ipynb
    ├── Process All Data.md
    ├── Sample Review.ipynb
    ├── raw_data_schema.template
    └── symbol_counts.csv
└── schema.md


/.gitignore:
--------------------------------------------------------------------------------
1 | # jupyter notebook checkpoints
2 | .ipynb_checkpoints
3 | 
4 | # vim swap files
5 | *.sw?
6 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
 1 | # Community Participation Guidelines
 2 | 
 3 | This repository is governed by Mozilla's code of conduct and etiquette guidelines. 
 4 | For more details, please read the
 5 | [Mozilla Community Participation Guidelines](https://www.mozilla.org/about/governance/policies/participation/). 
 6 | 
 7 | ## How to Report
 8 | For more information on how to report violations of the Community Participation Guidelines, please read our '[How to Report](https://www.mozilla.org/about/governance/policies/participation/reporting/)' page.
 9 | 
10 | <!--
11 | ## Project Specific Etiquette
12 | 
13 | In some cases, there will be additional project etiquette i.e.: (https://bugzilla.mozilla.org/page.cgi?id=etiquette.html).
14 | Please update for your project.
15 | -->
16 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | Mozilla Public License Version 2.0
  2 | ==================================
  3 | 
  4 | 1. Definitions
  5 | --------------
  6 | 
  7 | 1.1. "Contributor"
  8 |     means each individual or legal entity that creates, contributes to
  9 |     the creation of, or owns Covered Software.
 10 | 
 11 | 1.2. "Contributor Version"
 12 |     means the combination of the Contributions of others (if any) used
 13 |     by a Contributor and that particular Contributor's Contribution.
 14 | 
 15 | 1.3. "Contribution"
 16 |     means Covered Software of a particular Contributor.
 17 | 
 18 | 1.4. "Covered Software"
 19 |     means Source Code Form to which the initial Contributor has attached
 20 |     the notice in Exhibit A, the Executable Form of such Source Code
 21 |     Form, and Modifications of such Source Code Form, in each case
 22 |     including portions thereof.
 23 | 
 24 | 1.5. "Incompatible With Secondary Licenses"
 25 |     means
 26 | 
 27 |     (a) that the initial Contributor has attached the notice described
 28 |         in Exhibit B to the Covered Software; or
 29 | 
 30 |     (b) that the Covered Software was made available under the terms of
 31 |         version 1.1 or earlier of the License, but not also under the
 32 |         terms of a Secondary License.
 33 | 
 34 | 1.6. "Executable Form"
 35 |     means any form of the work other than Source Code Form.
 36 | 
 37 | 1.7. "Larger Work"
 38 |     means a work that combines Covered Software with other material, in
 39 |     a separate file or files, that is not Covered Software.
 40 | 
 41 | 1.8. "License"
 42 |     means this document.
 43 | 
 44 | 1.9. "Licensable"
 45 |     means having the right to grant, to the maximum extent possible,
 46 |     whether at the time of the initial grant or subsequently, any and
 47 |     all of the rights conveyed by this License.
 48 | 
 49 | 1.10. "Modifications"
 50 |     means any of the following:
 51 | 
 52 |     (a) any file in Source Code Form that results from an addition to,
 53 |         deletion from, or modification of the contents of Covered
 54 |         Software; or
 55 | 
 56 |     (b) any new file in Source Code Form that contains any Covered
 57 |         Software.
 58 | 
 59 | 1.11. "Patent Claims" of a Contributor
 60 |     means any patent claim(s), including without limitation, method,
 61 |     process, and apparatus claims, in any patent Licensable by such
 62 |     Contributor that would be infringed, but for the grant of the
 63 |     License, by the making, using, selling, offering for sale, having
 64 |     made, import, or transfer of either its Contributions or its
 65 |     Contributor Version.
 66 | 
 67 | 1.12. "Secondary License"
 68 |     means either the GNU General Public License, Version 2.0, the GNU
 69 |     Lesser General Public License, Version 2.1, the GNU Affero General
 70 |     Public License, Version 3.0, or any later versions of those
 71 |     licenses.
 72 | 
 73 | 1.13. "Source Code Form"
 74 |     means the form of the work preferred for making modifications.
 75 | 
 76 | 1.14. "You" (or "Your")
 77 |     means an individual or a legal entity exercising rights under this
 78 |     License. For legal entities, "You" includes any entity that
 79 |     controls, is controlled by, or is under common control with You. For
 80 |     purposes of this definition, "control" means (a) the power, direct
 81 |     or indirect, to cause the direction or management of such entity,
 82 |     whether by contract or otherwise, or (b) ownership of more than
 83 |     fifty percent (50%) of the outstanding shares or beneficial
 84 |     ownership of such entity.
 85 | 
 86 | 2. License Grants and Conditions
 87 | --------------------------------
 88 | 
 89 | 2.1. Grants
 90 | 
 91 | Each Contributor hereby grants You a world-wide, royalty-free,
 92 | non-exclusive license:
 93 | 
 94 | (a) under intellectual property rights (other than patent or trademark)
 95 |     Licensable by such Contributor to use, reproduce, make available,
 96 |     modify, display, perform, distribute, and otherwise exploit its
 97 |     Contributions, either on an unmodified basis, with Modifications, or
 98 |     as part of a Larger Work; and
 99 | 
100 | (b) under Patent Claims of such Contributor to make, use, sell, offer
101 |     for sale, have made, import, and otherwise transfer either its
102 |     Contributions or its Contributor Version.
103 | 
104 | 2.2. Effective Date
105 | 
106 | The licenses granted in Section 2.1 with respect to any Contribution
107 | become effective for each Contribution on the date the Contributor first
108 | distributes such Contribution.
109 | 
110 | 2.3. Limitations on Grant Scope
111 | 
112 | The licenses granted in this Section 2 are the only rights granted under
113 | this License. No additional rights or licenses will be implied from the
114 | distribution or licensing of Covered Software under this License.
115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a
116 | Contributor:
117 | 
118 | (a) for any code that a Contributor has removed from Covered Software;
119 |     or
120 | 
121 | (b) for infringements caused by: (i) Your and any other third party's
122 |     modifications of Covered Software, or (ii) the combination of its
123 |     Contributions with other software (except as part of its Contributor
124 |     Version); or
125 | 
126 | (c) under Patent Claims infringed by Covered Software in the absence of
127 |     its Contributions.
128 | 
129 | This License does not grant any rights in the trademarks, service marks,
130 | or logos of any Contributor (except as may be necessary to comply with
131 | the notice requirements in Section 3.4).
132 | 
133 | 2.4. Subsequent Licenses
134 | 
135 | No Contributor makes additional grants as a result of Your choice to
136 | distribute the Covered Software under a subsequent version of this
137 | License (see Section 10.2) or under the terms of a Secondary License (if
138 | permitted under the terms of Section 3.3).
139 | 
140 | 2.5. Representation
141 | 
142 | Each Contributor represents that the Contributor believes its
143 | Contributions are its original creation(s) or it has sufficient rights
144 | to grant the rights to its Contributions conveyed by this License.
145 | 
146 | 2.6. Fair Use
147 | 
148 | This License is not intended to limit any rights You have under
149 | applicable copyright doctrines of fair use, fair dealing, or other
150 | equivalents.
151 | 
152 | 2.7. Conditions
153 | 
154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted
155 | in Section 2.1.
156 | 
157 | 3. Responsibilities
158 | -------------------
159 | 
160 | 3.1. Distribution of Source Form
161 | 
162 | All distribution of Covered Software in Source Code Form, including any
163 | Modifications that You create or to which You contribute, must be under
164 | the terms of this License. You must inform recipients that the Source
165 | Code Form of the Covered Software is governed by the terms of this
166 | License, and how they can obtain a copy of this License. You may not
167 | attempt to alter or restrict the recipients' rights in the Source Code
168 | Form.
169 | 
170 | 3.2. Distribution of Executable Form
171 | 
172 | If You distribute Covered Software in Executable Form then:
173 | 
174 | (a) such Covered Software must also be made available in Source Code
175 |     Form, as described in Section 3.1, and You must inform recipients of
176 |     the Executable Form how they can obtain a copy of such Source Code
177 |     Form by reasonable means in a timely manner, at a charge no more
178 |     than the cost of distribution to the recipient; and
179 | 
180 | (b) You may distribute such Executable Form under the terms of this
181 |     License, or sublicense it under different terms, provided that the
182 |     license for the Executable Form does not attempt to limit or alter
183 |     the recipients' rights in the Source Code Form under this License.
184 | 
185 | 3.3. Distribution of a Larger Work
186 | 
187 | You may create and distribute a Larger Work under terms of Your choice,
188 | provided that You also comply with the requirements of this License for
189 | the Covered Software. If the Larger Work is a combination of Covered
190 | Software with a work governed by one or more Secondary Licenses, and the
191 | Covered Software is not Incompatible With Secondary Licenses, this
192 | License permits You to additionally distribute such Covered Software
193 | under the terms of such Secondary License(s), so that the recipient of
194 | the Larger Work may, at their option, further distribute the Covered
195 | Software under the terms of either this License or such Secondary
196 | License(s).
197 | 
198 | 3.4. Notices
199 | 
200 | You may not remove or alter the substance of any license notices
201 | (including copyright notices, patent notices, disclaimers of warranty,
202 | or limitations of liability) contained within the Source Code Form of
203 | the Covered Software, except that You may alter any license notices to
204 | the extent required to remedy known factual inaccuracies.
205 | 
206 | 3.5. Application of Additional Terms
207 | 
208 | You may choose to offer, and to charge a fee for, warranty, support,
209 | indemnity or liability obligations to one or more recipients of Covered
210 | Software. However, You may do so only on Your own behalf, and not on
211 | behalf of any Contributor. You must make it absolutely clear that any
212 | such warranty, support, indemnity, or liability obligation is offered by
213 | You alone, and You hereby agree to indemnify every Contributor for any
214 | liability incurred by such Contributor as a result of warranty, support,
215 | indemnity or liability terms You offer. You may include additional
216 | disclaimers of warranty and limitations of liability specific to any
217 | jurisdiction.
218 | 
219 | 4. Inability to Comply Due to Statute or Regulation
220 | ---------------------------------------------------
221 | 
222 | If it is impossible for You to comply with any of the terms of this
223 | License with respect to some or all of the Covered Software due to
224 | statute, judicial order, or regulation then You must: (a) comply with
225 | the terms of this License to the maximum extent possible; and (b)
226 | describe the limitations and the code they affect. Such description must
227 | be placed in a text file included with all distributions of the Covered
228 | Software under this License. Except to the extent prohibited by statute
229 | or regulation, such description must be sufficiently detailed for a
230 | recipient of ordinary skill to be able to understand it.
231 | 
232 | 5. Termination
233 | --------------
234 | 
235 | 5.1. The rights granted under this License will terminate automatically
236 | if You fail to comply with any of its terms. However, if You become
237 | compliant, then the rights granted under this License from a particular
238 | Contributor are reinstated (a) provisionally, unless and until such
239 | Contributor explicitly and finally terminates Your grants, and (b) on an
240 | ongoing basis, if such Contributor fails to notify You of the
241 | non-compliance by some reasonable means prior to 60 days after You have
242 | come back into compliance. Moreover, Your grants from a particular
243 | Contributor are reinstated on an ongoing basis if such Contributor
244 | notifies You of the non-compliance by some reasonable means, this is the
245 | first time You have received notice of non-compliance with this License
246 | from such Contributor, and You become compliant prior to 30 days after
247 | Your receipt of the notice.
248 | 
249 | 5.2. If You initiate litigation against any entity by asserting a patent
250 | infringement claim (excluding declaratory judgment actions,
251 | counter-claims, and cross-claims) alleging that a Contributor Version
252 | directly or indirectly infringes any patent, then the rights granted to
253 | You by any and all Contributors for the Covered Software under Section
254 | 2.1 of this License shall terminate.
255 | 
256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all
257 | end user license agreements (excluding distributors and resellers) which
258 | have been validly granted by You or Your distributors under this License
259 | prior to termination shall survive termination.
260 | 
261 | ************************************************************************
262 | *                                                                      *
263 | *  6. Disclaimer of Warranty                                           *
264 | *  -------------------------                                           *
265 | *                                                                      *
266 | *  Covered Software is provided under this License on an "as is"       *
267 | *  basis, without warranty of any kind, either expressed, implied, or  *
268 | *  statutory, including, without limitation, warranties that the       *
269 | *  Covered Software is free of defects, merchantable, fit for a        *
270 | *  particular purpose or non-infringing. The entire risk as to the     *
271 | *  quality and performance of the Covered Software is with You.        *
272 | *  Should any Covered Software prove defective in any respect, You     *
273 | *  (not any Contributor) assume the cost of any necessary servicing,   *
274 | *  repair, or correction. This disclaimer of warranty constitutes an   *
275 | *  essential part of this License. No use of any Covered Software is   *
276 | *  authorized under this License except under this disclaimer.         *
277 | *                                                                      *
278 | ************************************************************************
279 | 
280 | ************************************************************************
281 | *                                                                      *
282 | *  7. Limitation of Liability                                          *
283 | *  --------------------------                                          *
284 | *                                                                      *
285 | *  Under no circumstances and under no legal theory, whether tort      *
286 | *  (including negligence), contract, or otherwise, shall any           *
287 | *  Contributor, or anyone who distributes Covered Software as          *
288 | *  permitted above, be liable to You for any direct, indirect,         *
289 | *  special, incidental, or consequential damages of any character      *
290 | *  including, without limitation, damages for lost profits, loss of    *
291 | *  goodwill, work stoppage, computer failure or malfunction, or any    *
292 | *  and all other commercial damages or losses, even if such party      *
293 | *  shall have been informed of the possibility of such damages. This   *
294 | *  limitation of liability shall not apply to liability for death or   *
295 | *  personal injury resulting from such party's negligence to the       *
296 | *  extent applicable law prohibits such limitation. Some               *
297 | *  jurisdictions do not allow the exclusion or limitation of           *
298 | *  incidental or consequential damages, so this exclusion and          *
299 | *  limitation may not apply to You.                                    *
300 | *                                                                      *
301 | ************************************************************************
302 | 
303 | 8. Litigation
304 | -------------
305 | 
306 | Any litigation relating to this License may be brought only in the
307 | courts of a jurisdiction where the defendant maintains its principal
308 | place of business and such litigation shall be governed by laws of that
309 | jurisdiction, without reference to its conflict-of-law provisions.
310 | Nothing in this Section shall prevent a party's ability to bring
311 | cross-claims or counter-claims.
312 | 
313 | 9. Miscellaneous
314 | ----------------
315 | 
316 | This License represents the complete agreement concerning the subject
317 | matter hereof. If any provision of this License is held to be
318 | unenforceable, such provision shall be reformed only to the extent
319 | necessary to make it enforceable. Any law or regulation which provides
320 | that the language of a contract shall be construed against the drafter
321 | shall not be used to construe this License against a Contributor.
322 | 
323 | 10. Versions of the License
324 | ---------------------------
325 | 
326 | 10.1. New Versions
327 | 
328 | Mozilla Foundation is the license steward. Except as provided in Section
329 | 10.3, no one other than the license steward has the right to modify or
330 | publish new versions of this License. Each version will be given a
331 | distinguishing version number.
332 | 
333 | 10.2. Effect of New Versions
334 | 
335 | You may distribute the Covered Software under the terms of the version
336 | of the License under which You originally received the Covered Software,
337 | or under the terms of any subsequent version published by the license
338 | steward.
339 | 
340 | 10.3. Modified Versions
341 | 
342 | If you create software not governed by this License, and you want to
343 | create a new license for such software, you may create and use a
344 | modified version of this License if you rename the license and remove
345 | any references to the name of the license steward (except to note that
346 | such modified license differs from this License).
347 | 
348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary
349 | Licenses
350 | 
351 | If You choose to distribute Source Code Form that is Incompatible With
352 | Secondary Licenses under the terms of this version of the License, the
353 | notice described in Exhibit B of this License must be attached.
354 | 
355 | Exhibit A - Source Code Form License Notice
356 | -------------------------------------------
357 | 
358 |   This Source Code Form is subject to the terms of the Mozilla Public
359 |   License, v. 2.0. If a copy of the MPL was not distributed with this
360 |   file, You can obtain one at http://mozilla.org/MPL/2.0/.
361 | 
362 | If it is not possible or desirable to put the notice in a particular
363 | file, then You may include the notice in a location (such as a LICENSE
364 | file in a relevant directory) where a recipient would be likely to look
365 | for such a notice.
366 | 
367 | You may add additional accurate notices of copyright ownership.
368 | 
369 | Exhibit B - "Incompatible With Secondary Licenses" Notice
370 | ---------------------------------------------------------
371 | 
372 |   This Source Code Form is "Incompatible With Secondary Licenses", as
373 |   defined by the Mozilla Public License, v. 2.0.
374 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Overscripted Web: Data Analysis in the Open
 2 | 
 3 | The Systems Research Group (SRG) at Mozilla have created and open sourced a data set of publicly available information that was collected by a November 2017 Web crawl. We want to empower the community to explore the unseen or otherwise not obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content.
 4 | Some preliminary insights already uncovered from this data are illustrated in this [blog post](https://medium.com/firefox-context-graph/overscripted-digging-into-javascript-execution-at-scale-2ed508f21862). 
 5 | Ongoing analyses can be tracked [here](https://github.com/mozilla/overscripted/tree/master/analyses)
 6 | 
 7 | The crawl data hosted here was collected using [OpenWPM](https://github.com/mozilla/OpenWPM), which is developed and maintained by the Mozilla Security Engineering team.
 8 | 
 9 | ### Submitting an analysis:
10 | - Analyses should be performed in Python using the [jupyter scientific notebook](https://jupyter-notebook.readthedocs.io/en/stable/) format and executing in this [environment](https://github.com/mozilla/overscripted/blob/master/analyses/environment.yaml). 
11 | - Analysis can be submitted by filing a [Pull Request](https://help.github.com/articles/using-pull-requests) against this repository with the analysis formatted as an *.ipynb file or folder in the /analyses/ folder. 
12 | - Set-up instructions are provided here: https://github.com/mozilla/overscripted/blob/master/analyses/README.md
13 | - Notebooks must be well documented and run on the [environment](https://github.com/mozilla/overscripted/blob/master/analyses/environment.yaml) described. If additional installations are needed these should be documented.
14 | - Files and folders should have the format `yyyy_mm_username__short-title` - the analyses directory contains examples already if this is not clear.
15 | - PRs altering or updating an existing analysis will not be accepted unless they are tweaking formatting / small errors to facilitate that notebook running. If you wish to continue / build-on someone else's existing analysis start your own analysis folder / file, cite their work, and then proceed with your extension.
16 | 
17 | ### Accessing the Data
18 | Each of the links below links to a bz2 zipped portion of the total dataset. 
19 | 
20 | A small sample of the data is available in `safe_dataset.sample.tar.bz2` to get a feel for the content without committing to the full download.
21 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.sample.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.sample.tar.bz2)
22 | 
23 | Three samples that are large enough to meaningful analysis of the dataset are
24 | also available as the full dataset is very large. More details about the
25 | samples are available in [data_prep/Sample Review.ipynb](https://github.com/mozilla/overscripted/blob/master/data_prep/Sample%20Review.ipynb)
26 | - https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent_value_1000_only.parquet.tar.bz2 - 900MB download / 1.3GB on disk
27 | - https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent.parquet.tar.bz2 - 3.7GB download / 7.4GB on disk
28 | - https://public-data.telemetry.mozilla.org/bigcrawl/value_1000_only.parquet.tar.bz2 - 9.1GB download / 15GB on disk
29 | 
30 | The full dataset. Unzipped the full parquet data will be approximately 70GB. Each (compressed) chunk dataset is around 9GB. `SHA256SUMS` contains the checksums for all datasets including the sample.
31 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.0.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.0.tar.bz2)
32 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.1.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.1.tar.bz2)
33 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.2.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.2.tar.bz2)
34 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.3.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.3.tar.bz2)
35 | - [https://public-data.telemetry.mozilla.org/bigcrawl/SHA256SUMS](https://public-data.telemetry.mozilla.org/bigcrawl/SHA256SUMS)
36 | 
37 | Refer [hello_world.ipynb](https://github.com/mozilla/overscripted/blob/master/analyses/hello_world.ipynb) to load and have a quick look at the data with pandas, dask and spark.
38 | 
39 | ### New Contributor Tips
40 | 
41 | - Make contributions with respect to any of your learnings, be it by reading related research papers or through your interaction with the community on gitter and submitting a Pull Request (PR) to the repository. You can submit the PR to the README on the main page or the analysis folder README.
42 | 
43 | - This is not a one issue per person repo. All the questions are very open ended and different people may find very different and complementary things when looking at a question.
44 | 
45 | - Use a reaction emoji to acknowledge a comment rather than writing a comment like "sure" - helps to keep things clean - but the contributor can still let folks know that they saw a comment.
46 | 
47 | - You can ask for help and discuss your ideas on gitter. Click [here](https://gitter.im/overscripted-discuss/community) to join !
48 | 
49 | - When you open an issue and work on a Pull Request relating to the same, add "WIP" in the title of the PR. "WIP" is work in progress. When your PR is ready for review remove the WIP tag. You can also request feedback on specific things while it's still a WIP.
50 | 
51 | - Please reference your issues on a PR so that they link and autoclose. Refer to [this](https://help.github.com/en/articles/closing-issues-using-keywords)
52 | 
53 | - If your OS is Ubuntu and you have trouble installing spark with conda. Refer to this [link](https://datawookie.netlify.com/blog/2017/07/installing-spark-on-ubuntu/).
54 | 
55 | - The dataset is very large. Even the subsets of the dataset are unlikely to fit into memory. Working with this dataset will typically require using Dask (http://dask.pydata.org/), Spark (http://spark.apache.org/) or similar tools to enable parallelized / out-of-core / distributed processing.
56 | 
57 | ### Glossary
58 | - [Fingerprinting](https://en.wikipedia.org/wiki/Device_fingerprint) is the process of creating a unique identifier based off of some characteristics of your hardware, operating  system and browser. 
59 | - TLD means Top-level Domain. You can read up more about it [here.](https://en.wikipedia.org/wiki/Top-level_domain) 
60 | - [User Agent](https://en.wikipedia.org/wiki/User_agent) (UA), is a string that helps identify which browser is being used, what version, and on which operating system. 
61 | - [Web Crawler](https://en.wikipedia.org/wiki/Web_crawler)- It is a program or automated script which browses the World Wide   Web in a methodical, automated manner.
62 | 
63 | 
64 | 
65 | ### Resources
66 | 
67 | - Please refer the [reading list](https://github.com/mozilla/overscripted/wiki/Reading-List-(WIP)) for additional references and information.
68 | 
69 | - [This](https://github.com/brandon-rhodes/pycon-pandas-tutorial) is a great tutorial to learn Pandas.
70 | 
71 | - [Tutorial](https://www.youtube.com/watch?v=HW29067qVWk) on Jupyter Notebook.
72 | 
73 | - We have used dask in some of our Jupyter notebooks. Dask gives you a pandas-like API but lets you work on data that is too big to fit in memory. Dask can be used on a single machine or a cluster. Most analyses done for this project were done on a single machine. Please start by reviewing the [docs](https://dask.org/) to learn more about it.
74 | 
75 | - [This](https://github.com/aSquare14/Git-Cheat-Sheet) will help you get started with GIT. For visual thinkers this [tutorial](https://www.youtube.com/playlist?list=PL6gx4Cwl9DGAKWClAD_iKpNC0bGHxGhcx) can be a good start.
76 | 
77 | - Other Dask resources: [overview](https://www.youtube.com/watch?v=ods97a5Pzw0) video and [cheatsheet](https://dask.pydata.org/en/latest/_downloads/daskcheatsheet.pdf).
78 | 
79 | - [Apache Spark](https://spark.apache.org/docs/latest/api/python/pyspark.html) is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. We use [findspark](https://github.com/minrk/findspark) to set up spark. You can learn more about it [here](https://github.com/apache/spark )
80 | 
81 | 


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/Ad Blocker Report.md:
--------------------------------------------------------------------------------
 1 | # Ad Blocker Usage Detection Report
 2 | ### Summary
 3 | Ad blocker usage can be detected by checking the state of one of your ad scripts.  It is frequently used to restrict functionality for ad blocking users or as a bit of information in user fingerprinting.
 4 | 
 5 | 
 6 | ### Detection
 7 | #### In Literature
 8 | In _Measuring and Disrupting Anti-Adblockers Using
 9 | Differential Execution Analysis_ anti adblockers were detected on 30.5% of the Alexa top-10K websites. This is trending upwards significantly as in May 2016 only 6.7% were detected using anti-adblock software.
10 | 
11 | #### In The Overscripted Dataset
12 | Anti-adblock behaviour could not be detected in the dataset.
13 | 
14 | #### What else would we need to detect it?
15 | Adblock detection can be done by running two crawls, one with adblock one without, and comparing the behaviour differences. This would allow the overscripted set to detect behaviour differences as well as frequency.
16 | 


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/Evercookies Report.md:
--------------------------------------------------------------------------------
 1 | # Evercookies
 2 | ### Summary
 3 | The practice of restoring cleared cookies by storing the cookie information in an alternate location and recreating the cookie if it has been removed. This preserves the lifespan of the cookies and allows sites to identify users that are clearing their cookies.
 4 | 
 5 | Should other cookie behaviour such as duplicating cookies to all browsers on a system or passing cookies between domains count as evercookies?
 6 | 
 7 | ### Detection
 8 | #### In Literature
 9 | The TOR Browser design draft identifies the following locations for identifier storage:
10 | - **Cookies**
11 | - **Cache**
12 | - **HTTP Authentication**
13 | - **DOM Storage**
14 | - IndexedDB Storage
15 | - **Flash Cookies**
16 | - SSL+TLS session resumption
17 | - Javascript SharedWorkers - threads with a shared scope between all threads from the same JS origin, could have access to objects from the same third party loaded at another origin
18 | - URL.createObjectURL
19 | - SPDY, HTTP/2
20 | - Cross-origin redirects
21 | - **window.name**
22 | - Auto form-fill
23 | - HSTS - Stores an effective bit of information per domain name
24 | - HPKP 
25 | - BroadcastChannel API
26 | - OCSP Requests
27 | - Favicons
28 | - mediasource URIs and MediaStreams
29 | - Speculative and prefetched connections
30 | - Permissions API
31 | 
32 | 
33 | 
34 | #### In The Overscripted Dataset
35 | We found 52,669,910 instances of local storage access, cookie data access or window.name access.
36 | 
37 | #### What else would we need to detect it?
38 | The best way to detect evercookies would be multiple runs of the same crawl while clearing cache between attempts. This would allow us to compare cookies between attempts as well as log the state of all possible storage mechanisms after each run.
39 | 
40 | 
41 | #### Do we see it?
42 | Using our symbol_counts.csv, we can check for all instances of the "window.navigator" interface and counting all instances.
43 | ```
44 | count = 0
45 | target = ["window.name", "setItem", "getItem", "document.cookie"]
46 | for line in f.split('\n'):
47 |     for t in target:
48 |         if t in line:
49 |                 count += int(line.split(",")[1])
50 |                 print(line)
51 | print("Total:", count)
52 | ```
53 | 
54 | Output:
55 | ```
56 | window.document.cookie,35455680
57 | window.Storage.getItem,10553944
58 | window.Storage.setItem,4175556
59 | window.name,2484730
60 | Total: 52669910
61 | ```


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/RelevantSymbolCounter.py:
--------------------------------------------------------------------------------
  1 | f = """
  2 | window.document.cookie,35455680
  3 | window.navigator.userAgent,15534371
  4 | window.Storage.getItem,10553944
  5 | window.localStorage,8767285
  6 | window.Storage.setItem,4175556
  7 | window.sessionStorage,4033894
  8 | window.Storage.removeItem,2932713
  9 | window.name,2484730
 10 | CanvasRenderingContext2D.fillStyle,1957519
 11 | window.navigator.plugins[Shockwave Flash].description,1863285
 12 | window.screen.colorDepth,1449905
 13 | window.navigator.appName,1286084
 14 | window.navigator.language,1172256
 15 | window.navigator.platform,1140738
 16 | CanvasRenderingContext2D.save,1000762
 17 | CanvasRenderingContext2D.restore,997755
 18 | CanvasRenderingContext2D.fill,954340
 19 | CanvasRenderingContext2D.fillRect,936267
 20 | window.navigator.plugins[Shockwave Flash].name,895289
 21 | CanvasRenderingContext2D.font,814310
 22 | CanvasRenderingContext2D.lineWidth,718195
 23 | window.navigator.appVersion,707298
 24 | window.navigator.cookieEnabled,692524
 25 | HTMLCanvasElement.width,681003
 26 | CanvasRenderingContext2D.strokeStyle,650211
 27 | HTMLCanvasElement.height,644476
 28 | HTMLCanvasElement.getContext,596749
 29 | window.Storage.key,551691
 30 | CanvasRenderingContext2D.fillText,542896
 31 | window.Storage.length,541077
 32 | CanvasRenderingContext2D.stroke,537024
 33 | CanvasRenderingContext2D.measureText,522209
 34 | window.navigator.vendor,487833
 35 | window.navigator.doNotTrack,468365
 36 | CanvasRenderingContext2D.arc,413449
 37 | HTMLCanvasElement.style,294223
 38 | CanvasRenderingContext2D.textBaseline,293489
 39 | window.navigator.product,279653
 40 | CanvasRenderingContext2D.textAlign,246380
 41 | window.navigator.plugins[Shockwave Flash].filename,225751
 42 | window.navigator.mimeTypes[application/x-shockwave-flash].type,213769
 43 | window.navigator.languages,199435
 44 | window.navigator.plugins[Shockwave Flash].length,184995
 45 | CanvasRenderingContext2D.bezierCurveTo,176757
 46 | CanvasRenderingContext2D.shadowBlur,172808
 47 | CanvasRenderingContext2D.shadowOffsetY,161446
 48 | CanvasRenderingContext2D.shadowOffsetX,159263
 49 | CanvasRenderingContext2D.shadowColor,158579
 50 | window.screen.pixelDepth,156326
 51 | CanvasRenderingContext2D.rect,154007
 52 | HTMLCanvasElement.nodeType,153630
 53 | CanvasRenderingContext2D.lineJoin,151407
 54 | window.navigator.mimeTypes[application/futuresplash].type,150364
 55 | CanvasRenderingContext2D.lineCap,149511
 56 | window.navigator.plugins[Shockwave Flash].version,144656
 57 | CanvasRenderingContext2D.strokeRect,142838
 58 | HTMLCanvasElement.toDataURL,135041
 59 | CanvasRenderingContext2D.createRadialGradient,132444
 60 | CanvasRenderingContext2D.globalCompositeOperation,122162
 61 | window.navigator.onLine,116037
 62 | CanvasRenderingContext2D.scale,115227
 63 | window.Storage.hasOwnProperty,108138
 64 | CanvasRenderingContext2D.clip,106238
 65 | CanvasRenderingContext2D.miterLimit,102589
 66 | window.navigator.mimeTypes[application/x-shockwave-flash].suffixes,94030
 67 | window.navigator.mimeTypes[application/futuresplash].suffixes,94025
 68 | RTCPeerConnection.localDescription,88683
 69 | window.navigator.productSub,71139
 70 | window.navigator.mimeTypes[application/x-shockwave-flash].description,70284
 71 | window.navigator.mimeTypes[application/futuresplash].description,70278
 72 | HTMLCanvasElement.nodeName,67621
 73 | CanvasRenderingContext2D.rotate,63824
 74 | HTMLCanvasElement.parentNode,57192
 75 | window.navigator.oscpu,54799
 76 | window.navigator.appCodeName,51161
 77 | CanvasRenderingContext2D.createLinearGradient,46710
 78 | CanvasRenderingContext2D.putImageData,45469
 79 | window.navigator.geolocation,43022
 80 | CanvasRenderingContext2D.getImageData,41412
 81 | HTMLCanvasElement.ownerDocument,37831
 82 | HTMLCanvasElement.className,36778
 83 | RTCPeerConnection.onicecandidate,32522
 84 | HTMLCanvasElement.getAttribute,31800
 85 | window.navigator.vendorSub,26840
 86 | HTMLCanvasElement.addEventListener,23485
 87 | window.navigator.buildID,23419
 88 | HTMLCanvasElement.classList,22963
 89 | HTMLCanvasElement.setAttribute,20689
 90 | HTMLCanvasElement.clientHeight,20665
 91 | HTMLCanvasElement.clientWidth,20341
 92 | HTMLCanvasElement.getElementsByTagName,16224
 93 | HTMLCanvasElement.tagName,14475
 94 | RTCPeerConnection.iceGatheringState,13984
 95 | RTCPeerConnection.createDataChannel,13776
 96 | RTCPeerConnection.signalingState,13160
 97 | RTCPeerConnection.remoteDescription,13113
 98 | RTCPeerConnection.createOffer,13015
 99 | CanvasRenderingContext2D.setLineDash,12590
100 | HTMLCanvasElement.onselectstart,12054
101 | RTCPeerConnection.setLocalDescription,11844
102 | CanvasRenderingContext2D.arcTo,11428
103 | CanvasRenderingContext2D.isPointInPath,11342
104 | CanvasRenderingContext2D.createImageData,11163
105 | HTMLCanvasElement.id,10941
106 | CanvasRenderingContext2D.imageSmoothingEnabled,9722
107 | HTMLCanvasElement.draggable,9558
108 | HTMLCanvasElement.constructor,9246
109 | CanvasRenderingContext2D.createPattern,8713
110 | CanvasRenderingContext2D.lineDashOffset,7726
111 | HTMLCanvasElement.offsetWidth,7346
112 | CanvasRenderingContext2D.mozImageSmoothingEnabled,6561
113 | RTCPeerConnection.idpLoginUrl,6556
114 | RTCPeerConnection.peerIdentity,6556
115 | RTCPeerConnection.onremovestream,6556
116 | HTMLCanvasElement.offsetHeight,6175
117 | CanvasRenderingContext2D.strokeText,5147
118 | HTMLCanvasElement.firstChild,4897
119 | HTMLCanvasElement.hasAttribute,4604
120 | HTMLCanvasElement.localName,4577
121 | HTMLCanvasElement.attributes,4507
122 | HTMLCanvasElement.nextSibling,3857
123 | AudioContext.destination,3758
124 | HTMLCanvasElement.firstElementChild,3586
125 | HTMLCanvasElement.nextElementSibling,3560
126 | window.Storage.clear,3348
127 | HTMLCanvasElement.dir,3171
128 | CanvasRenderingContext2D.mozCurrentTransform,3102
129 | OscillatorNode.frequency,3056
130 | AudioContext.createOscillator,2898
131 | OscillatorNode.start,2687
132 | CanvasRenderingContext2D.__lookupGetter__,2543
133 | HTMLCanvasElement.childNodes,2541
134 | CanvasRenderingContext2D.hasOwnProperty,2422
135 | HTMLCanvasElement.getBoundingClientRect,2276
136 | HTMLCanvasElement.offsetLeft,2096
137 | OscillatorNode.type,2011
138 | OscillatorNode.connect,2011
139 | CanvasRenderingContext2D.mozCurrentTransformInverse,1890
140 | HTMLCanvasElement.removeAttribute,1814
141 | HTMLCanvasElement.offsetTop,1812
142 | HTMLCanvasElement.children,1795
143 | HTMLCanvasElement.dispatchEvent,1698
144 | HTMLCanvasElement.mozOpaque,1687
145 | HTMLCanvasElement.onmousemove,1538
146 | AudioContext.createDynamicsCompressor,1535
147 | HTMLCanvasElement.offsetParent,1499
148 | OfflineAudioContext.startRendering,1381
149 | OfflineAudioContext.createDynamicsCompressor,1380
150 | OfflineAudioContext.oncomplete,1380
151 | OfflineAudioContext.createOscillator,1380
152 | OfflineAudioContext.destination,1380
153 | HTMLCanvasElement.remove,1257
154 | HTMLCanvasElement.compareDocumentPosition,1253
155 | AudioContext.state,1249
156 | AudioContext.listener,1230
157 | GainNode.connect,1204
158 | AudioContext.createGain,1197
159 | GainNode.gain,1112
160 | HTMLCanvasElement.__proto__,1028
161 | window.Storage.toString,1027
162 | AudioContext.createAnalyser,905
163 | HTMLCanvasElement.cloneNode,899
164 | AudioContext.sampleRate,882
165 | AudioContext.decodeAudioData,876
166 | AudioContext.createMediaElementSource,860
167 | HTMLCanvasElement.toBlob,837
168 | HTMLCanvasElement.removeEventListener,779
169 | AnalyserNode.fftSize,774
170 | AnalyserNode.maxDecibels,771
171 | AnalyserNode.smoothingTimeConstant,770
172 | AnalyserNode.frequencyBinCount,770
173 | AnalyserNode.minDecibels,769
174 | RTCPeerConnection.addIceCandidate,769
175 | AudioContext.onstatechange,745
176 | HTMLCanvasElement.textContent,628
177 | HTMLCanvasElement.onclick,466
178 | HTMLCanvasElement.innerHTML,437
179 | window.Storage.valueOf,423
180 | RTCPeerConnection.setRemoteDescription,379
181 | RTCPeerConnection.getStats,361
182 | AudioContext.currentTime,354
183 | OscillatorNode.stop,351
184 | RTCPeerConnection.removeEventListener,346
185 | RTCPeerConnection.addEventListener,346
186 | HTMLCanvasElement.__lookupGetter__,344
187 | AudioContext.createScriptProcessor,337
188 | HTMLCanvasElement.hasOwnProperty,312
189 | HTMLCanvasElement.onmousedown,310
190 | HTMLCanvasElement.toString,291
191 | ScriptProcessorNode.connect,288
192 | ScriptProcessorNode.onaudioprocess,287
193 | AnalyserNode.connect,285
194 | HTMLCanvasElement.blur,280
195 | HTMLCanvasElement.getAttributeNode,237
196 | HTMLCanvasElement.onmouseout,232
197 | HTMLCanvasElement.onmouseover,229
198 | HTMLCanvasElement.append,227
199 | HTMLCanvasElement.onmouseup,227
200 | CanvasRenderingContext2D.ellipse,166
201 | HTMLCanvasElement.setAttributeNode,152
202 | HTMLCanvasElement.oncontextmenu,152
203 | CanvasRenderingContext2D.getLineDash,146
204 | HTMLCanvasElement.previousSibling,139
205 | HTMLCanvasElement.parentElement,136
206 | HTMLCanvasElement.innerText,134
207 | HTMLCanvasElement.onkeydown,132
208 | HTMLCanvasElement.onkeyup,129
209 | HTMLCanvasElement.onkeypress,128
210 | HTMLCanvasElement.onblur,128
211 | HTMLCanvasElement.onfocus,128
212 | HTMLCanvasElement.onmouseleave,127
213 | HTMLCanvasElement.ondblclick,126
214 | HTMLCanvasElement.ondragenter,125
215 | HTMLCanvasElement.onresize,125
216 | HTMLCanvasElement.onpaste,125
217 | HTMLCanvasElement.onchange,125
218 | HTMLCanvasElement.oncut,125
219 | HTMLCanvasElement.ondragover,125
220 | HTMLCanvasElement.ondragleave,125
221 | HTMLCanvasElement.ondrop,125
222 | HTMLCanvasElement.onmouseenter,125
223 | HTMLCanvasElement.onload,125
224 | HTMLCanvasElement.contains,102
225 | HTMLCanvasElement.querySelectorAll,98
226 | GainNode.disconnect,77
227 | AudioContext.createBufferSource,70
228 | HTMLCanvasElement.hasChildNodes,67
229 | AudioContext.createBuffer,63
230 | AudioContext.createPanner,60
231 | HTMLCanvasElement.scrollLeft,60
232 | HTMLCanvasElement.scrollTop,60
233 | CanvasRenderingContext2D.__lookupSetter__,58
234 | CanvasRenderingContext2D.__defineSetter__,58
235 | HTMLCanvasElement.ondragstart,50
236 | HTMLCanvasElement.getClientRects,49
237 | HTMLCanvasElement.title,44
238 | HTMLCanvasElement.tabIndex,43
239 | RTCPeerConnection.close,43
240 | RTCPeerConnection.iceConnectionState,33
241 | AudioContext.close,32
242 | HTMLCanvasElement.hasAttributes,25
243 | HTMLCanvasElement.previousElementSibling,23
244 | OscillatorNode.disconnect,22
245 | HTMLCanvasElement.focus,22
246 | RTCPeerConnection.onsignalingstatechange,16
247 | RTCPeerConnection.oniceconnectionstatechange,16
248 | HTMLCanvasElement.valueOf,16
249 | HTMLCanvasElement.dataset,15
250 | HTMLCanvasElement.requestPointerLock,15
251 | HTMLCanvasElement.namespaceURI,13
252 | HTMLCanvasElement.webkitMatchesSelector,12
253 | HTMLCanvasElement.childElementCount,11
254 | HTMLCanvasElement.removeChild,8
255 | HTMLCanvasElement.insertBefore,8
256 | GainNode.numberOfOutputs,7
257 | HTMLCanvasElement.matches,6
258 | HTMLCanvasElement.outerHTML,6
259 | HTMLCanvasElement.appendChild,6
260 | AudioContext.resume,5
261 | AnalyserNode.getByteFrequencyData,5
262 | HTMLCanvasElement.clientTop,4
263 | HTMLCanvasElement.clientLeft,4
264 | HTMLCanvasElement.onwheel,4
265 | HTMLCanvasElement.DOCUMENT_NODE,4
266 | RTCPeerConnection.onaddstream,3
267 | AnalyserNode.channelInterpretation,3
268 | AnalyserNode.numberOfInputs,3
269 | AnalyserNode.channelCountMode,3
270 | AnalyserNode.numberOfOutputs,3
271 | AnalyserNode.channelCount,3
272 | HTMLCanvasElement.scrollWidth,3
273 | HTMLCanvasElement.scrollHeight,3
274 | CanvasRenderingContext2D.__proto__,3
275 | HTMLCanvasElement.getElementsByClassName,3
276 | CanvasRenderingContext2D.__defineGetter__,3
277 | HTMLCanvasElement.querySelector,2
278 | OfflineAudioContext.decodeAudioData,2
279 | RTCPeerConnection.createAnswer,2
280 | CanvasRenderingContext2D.filter,2
281 | AudioContext.createConvolver,1
282 | HTMLCanvasElement.lastChild,1
283 | CanvasRenderingContext2D.toString,1
284 | """
285 | count = 0
286 | for line in f.split('\n'):
287 |     if "window.navigator" in line:
288 |         count += int(line.split(",")[1])
289 |         print(line[17:])
290 | print("Total:", count)
291 |         
292 | 
293 | 


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/Tracking Method Sources.md:
--------------------------------------------------------------------------------
  1 | # Tracking Method Sources
  2 | 
  3 | ## Web Tracking: Mechanisms, Implications, and Defenses
  4 | Note: This source covers almost all forms of fingerprinting detected or theorised
  5 | #### Session Only
  6 | * Session identifiers passed through links/requests
  7 | * Explicit web-form authentication
  8 | * window.name DOM property
  9 | 
 10 | #### Storage-based
 11 | * HTTP cookies
 12 | * Flash cookies and Java JNLP PersistenceService
 13 | * Flash LocalConnection object
 14 | * Silverlight Isolated Storage
 15 | * HTML5 Global, Local, and Session Storage
 16 | * Web SQL Database and HTML5 IndexedDB
 17 | * Internet Explorer userData storage
 18 | 
 19 | #### Cache-based
 20 | * Web cache
 21 | * Embedding identifiers in cached documents
 22 | * Loading performance tests
 23 | * ETags and Last-Modified headers
 24 | * DNS lookups
 25 | * Operational caches
 26 | * HTTP301 redirect cache
 27 | * HTTP authentication cache
 28 | * HTTP Strict Transport Security cache
 29 | * TLS Session Resumption cache and TLS Session IDs
 30 | 
 31 | #### Fingerprinting
 32 | * Network and location fingerprinting
 33 | * Device fingerprinting
 34 | * OS instance fingerprinting
 35 | * Browser version fingerprinting
 36 | * Browser instance fingerprinting using canvas
 37 | * Browser instance fingerprinting using web browsing history
 38 | 
 39 | #### Other Tracking mechanisms
 40 | * Headers attached to outgoing HTTP requests
 41 | * Using telephone metadata
 42 | * Timing attacks
 43 | * Using unconscious collaboration of the user
 44 | * Clickjacking
 45 | * Evercookies
 46 | 
 47 | 
 48 | ## Hiding in the Crowd: an Analysis of the Effectiveness of Browser Fingerprinting at Large Scale  
 49 | Note: Color depth, encoding, do not track, and plugins are valuable additions to the list
 50 | * User-Agent
 51 | * Header-accept
 52 | * Content encoding
 53 | * Content language
 54 | * List of plugins
 55 | * Cookies enabled
 56 | * Use of local/session storage
 57 | * Timezone
 58 | * Screen resolution & color depth
 59 | * Available fonts
 60 | * List of HTTP headers
 61 | * Platform
 62 | * Do Not Track
 63 | * Canvas
 64 | * WebGL Vendor
 65 | * WebGL Renderer
 66 | * Use of an ad blocker
 67 | 
 68 | ## The Web Never Forgets
 69 | * Canvas Fingerprinting
 70 | * Evercookies/Respawning
 71 | * Cookie Syncing
 72 | * Flash Cookies and using flash for respawning
 73 | 
 74 | ## Fingerprinting Information in JavaScript Implementations
 75 | * Performance Fingerprinting
 76 | * Whitelist Fingerprinting
 77 | 
 78 | ## Web Tracking – A Literature Review on the State of Research
 79 | TODO: Parse this list of papers for additional fingerprinting methods  
 80 | Stateful tracking:
 81 | * The Web Never Forgets: Persistant Tracking Mechanisms in the Wild - Acar et al. (2013, 2014)
 82 | * Hybrid Information Flow Monitoring Against Web Tracking - Besson et al. (2014)
 83 | * Web Tracking: Mechanisms, Implications, and Defenses - Bujlow et al. (2015, 2017)
 84 | * Online Tracking: A 1-million-site Measurement and Analysis - Englehardt & Narayanan (2016)
 85 | * Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-class Learning - Ikram et al. (2016)
 86 | * Third-Party Web Tracking: Policy and Technology - Mayer & Mitchell (2012)
 87 | * Web Tracking: Overview and applicability in digital investigations - Pugliese (2015)
 88 | * The Web is Watching You: A Comprehensive Review of Web-Tracking Techniques and Countermeasures - Sanchez-Rola et al. (2016) 
 89 | 
 90 | ## Web tracking: Overview and applicability in digital investigation
 91 | Note: In “The Web Never Forgets” there is reference to this paper that mentions behavioural fingerprinting. I could not access the paper.  
 92 | “Pugliese (2015) also mentions behavioral biometric features, namely those dynamics that occur when typing, moving and clicking the mouse, or touching a touch screen. Such behavioral biometric features can be used to improve stateless tracking”
 93 | 
 94 | ## Evaluating the effectiveness of defences to web tracking
 95 | https://www.royalholloway.ac.uk/media/5620/rhul-isg-2018-7-techreport-darrellnewman.pdf
 96 | Note: Lots to dig into here, not enough time this week to get through all of it
 97 | * Javascript Engines
 98 | * DOM Objects
 99 | * Installed add-ons/extensions
100 | * Fonts
101 | * Canvas API
102 | * AudioContext API
103 | * Battery Status API
104 | * Emojis
105 | * SSL/TLS Handshake Fingerprinting
106 | * WebRTC fingerprinting
107 | * Behavioural Biometrics
108 | 
109 | ## Remote physical device fingerprinting
110 | Note: This paper is from 2005
111 | * Clock Skews
112 | 
113 | ## Papers I couldn't access
114 | SHPF: Enhancing HTTP(S) Session Security with Browser Fingerprinting  


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/Tracking Methods.md:
--------------------------------------------------------------------------------
 1 | Note: I'm not sure how I'd like to sort these and what level of granularity to apply. For example, evercookies is a much broader category than the window.name DOM property.
 2 | 
 3 | # Tracking Methods
 4 | * Session identifiers passed through links/requests
 5 | * Explicit web-form authentication
 6 | * window.name DOM property
 7 | * HTTP cookies
 8 | * Flash cookies and Java JNLP PersistenceService
 9 | * Flash LocalConnection object
10 | * Silverlight Isolated Storage
11 | * HTML5 Global, Local, and Session Storage
12 | * Web SQL Database and HTML5 IndexedDB
13 | * Internet Explorer userData storage
14 | * Web cache
15 | * Embedding identifiers in cached documents
16 | * Loading performance tests
17 | * ETags and Last-Modified headers
18 | * DNS lookups
19 | * Operational caches
20 | * HTTP301 redirect cache
21 | * HTTP authentication cache
22 | * HTTP Strict Transport Security cache
23 | * TLS Session Resumption cache and TLS Session IDs
24 | * Network and location fingerprinting
25 | * Device fingerprinting
26 | * OS instance fingerprinting
27 | * Browser version fingerprinting
28 | * Browser instance fingerprinting using canvas
29 | * Browser instance fingerprinting using web browsing history
30 | * Headers attached to outgoing HTTP requests
31 | * Using telephone metadata
32 | * Timing attacks
33 | * Using unconscious collaboration of the user
34 | * Clickjacking
35 | * Evercookies
36 | * User-Agent
37 | * Header-accept
38 | * Content encoding
39 | * Content language
40 | * List of plugins
41 | * Cookies enabled
42 | * Use of local/session storage
43 | * Timezone
44 | * Screen resolution & color depth
45 | * Available fonts
46 | * List of HTTP headers
47 | * Platform
48 | * Do Not Track
49 | * Canvas
50 | * WebGL Vendor
51 | * WebGL Renderer
52 | * Use of an ad blocker
53 | * Canvas Fingerprinting
54 | * Evercookies/Respawning
55 | * Cookie Syncing
56 | * Flash Cookies and using flash for respawning
57 | * Performance Fingerprinting
58 | * Whitelist Fingerprinting
59 | * Javascript Engines
60 | * DOM Objects
61 | * Installed add-ons/extensions
62 | * Fonts
63 | * Canvas API
64 | * AudioContext API
65 | * Battery Status API
66 | * Emojis
67 | * SSL/TLS Handshake Fingerprinting
68 | * WebRTC fingerprinting
69 | * Behavioural Biometrics
70 | * Clock Skews
71 | 
72 | ## Papers Still to Read
73 | SHPF: Enhancing HTTP(S) Session Security with Browser Fingerprinting
74 | Web Tracking: Overview and applicability in digital investigations - Pugliese (2015)
75 | The Web is Watching You: A Comprehensive Review of Web-Tracking Techniques and Countermeasures - Sanchez-Rola et al. (2016) 


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/Tracking Report Template.md:
--------------------------------------------------------------------------------
 1 | # Browser Version Identification
 2 | ### Summary
 3 | Identifying web browsing software and version is a core component of a stateless fingerprint. In most cases the user will provide a genuine User-Agent request header that will contain the browser type, version, developer and the operating system. Some browsers are obfuscating this header to protect the user's privacy. Unfortunately, because of small differences between browsers there exist ways to detect a forged User-Agent string.
 4 | 
 5 | ### Detection
 6 | #### In Literature
 7 | The User-Agent string is available through the Navigator interface, so any usage of window.navigator will be accessing User-Agent data. 
 8 | However, a site owner could manually parse the request headers to access that data without using the Navigator interface.
 9 | Identifying forged User-Agents is browser version specific. In _How Unique Is Your Web Browser_ they found forged strings when they noticed there were iOS devices that had flash enabled.
10 | 
11 | #### In The Overscripted Dataset
12 | The Navigator interface was detected 26,361,700 times in the overscripted dataset.
13 | I don't know how to identify if a Navigator interface was called for tracking purposes or for functionality (to display an error message on IE6 for example)
14 | 
15 | 
16 | #### What else would we need to detect it?
17 | The dataset allows us to identify when a website accesses the User-Agent through the Navigator interface. 
18 | It may be possible to identify sites that test for forged User-Agents by crawling with a specific User-Agent string and checking for calls to unsupported features.
19 | 
20 | #### Do we see it?
21 | Using our symbol_counts.csv, we can check for all instances of the "window.navigator" interface and counting all instances.
22 | ```
23 | count = 0
24 | for line in f.split('\n'):
25 |     if "window.navigator" in line:
26 |         count += int(line.split(",")[1])
27 |         print(line[17:])
28 | print("Total:", count)
29 | ```
30 | 
31 | Output:
32 | ```
33 | userAgent,15534371
34 | plugins[Shockwave Flash].description,1863285
35 | appName,1286084
36 | language,1172256
37 | platform,1140738
38 | plugins[Shockwave Flash].name,895289
39 | appVersion,707298
40 | cookieEnabled,692524
41 | vendor,487833
42 | doNotTrack,468365
43 | product,279653
44 | plugins[Shockwave Flash].filename,225751
45 | mimeTypes[application/x-shockwave-flash].type,213769
46 | languages,199435
47 | plugins[Shockwave Flash].length,184995
48 | mimeTypes[application/futuresplash].type,150364
49 | plugins[Shockwave Flash].version,144656
50 | onLine,116037
51 | mimeTypes[application/x-shockwave-flash].suffixes,94030
52 | mimeTypes[application/futuresplash].suffixes,94025
53 | productSub,71139
54 | mimeTypes[application/x-shockwave-flash].description,70284
55 | mimeTypes[application/futuresplash].description,70278
56 | oscpu,54799
57 | appCodeName,51161
58 | geolocation,43022
59 | vendorSub,26840
60 | buildID,23419
61 | Total: 26361700
62 | ```


--------------------------------------------------------------------------------
/analyses/2018_12_LABBsoft_tracking_review/window.name Report.md:
--------------------------------------------------------------------------------
 1 | # window.name DOM property
 2 | ### Summary
 3 | The window.name property stores up to 2MB of information that persists for the lifespan of a tab. This allows sites that are closed and revisited to reaccess whatever information they may have stored inside.
 4 | 
 5 | ### Detection
 6 | #### In Literature
 7 | There were no examples of the detection of window.name trarcking found in the web tracking papers.
 8 | 
 9 | #### In The Overscripted Dataset
10 | We found 2,484,730 instances of window.name access.
11 | 
12 | #### What else would we need to detect it?
13 | We're equipped to track the calls to the window.name property. Detection could be improved if we logged the line of the window.name call.
14 | 
15 | #### Do we see it?
16 | Using our symbol_counts.csv, we can check for all instances of the "window.navigator" interface and counting all instances.
17 | ```
18 | count = 0
19 | target = ["window.name"]
20 | for line in f.split('\n'):
21 |     for t in target:
22 |         if t in line:
23 |                 count += int(line.split(",")[1])
24 |                 print(line)
25 | ```
26 | 
27 | Output:
28 | ```
29 | window.name,2484730
30 | ```


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/README.md:
--------------------------------------------------------------------------------
 1 | # Extract script URLs from dataset
 2 | _Note, this section uses pySpark_
 3 | 
 4 | ## 1) Setup Spark:
 5 | 
 6 | 1) You must have Open JDK 8. Install via `$ sudo apt-get install openjdk-8-jdk` if you don't have this installed (check if `/usr/lib/jvm/java-8-openjdk-amd64/` exists).
 7 | 
 8 | 2) Download the latest version of [Spark](https://spark.apache.org/downloads.html) (prebuilt for Apache Hadoop 2.7 and later), and unpack the tar to a directory of your choosing.
 9 | 
10 | 3) Set some environment variables:
11 | ```
12 | $ export PYSPARK_PYTHON=python3
13 | $ export PATH=${PATH}:/path/to/spark-<version>-bin-hadoop2.7/bin
14 | $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ 
15 | ```
16 | 
17 | To run any of the pyspark scripts on their own, you can run
18 | ```
19 | $ spark-submit sparkscript.py
20 | ```
21 | 
22 | **Note: you may need to increase spark driver memory.**
23 | In `<spark_dir>/conf/spark-defaults.conf`, add the line `spark.driver.memory 15g`, or whatever is acceptable for your system.
24 | 
25 | 
26 | ## 2) Running the scripts:
27 | 
28 | Replace the appropriate directories in `config.ini` for your system. You will need the [Full Mozilla Overscripted Dataset](https://github.com/mozilla/Overscripted-Data-Analysis-Challenge). 
29 | 
30 | Ensuring spark is set up as in pt. 1, run:
31 | ```
32 | $ spark-submit generate_url_list_spark.py
33 | ```
34 | 
35 | To test individual user specified urls, you can place urls in `test_urls.csv` and run `$ spark-submit test_generate_url_list_spark.py`. This will output `parsed_test_urls.csv` using the exact same process as on the full dataset for easy debugging and sanity checks.
36 | 
37 | A jupyter notebook with barebones loading and displaying of the data to make exploring the results easier.
38 | 
39 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/config.ini:
--------------------------------------------------------------------------------
 1 | # Note: Make sure to not add comments on any lines being parsed in!
 2 | [DEFAULT]
 3 | 
 4 | # Location of the top of where data is located
 5 | #datatop         = </path/to/datatop>
 6 | datatop         = /mnt/Data/UCOSP_DATA
 7 | 
 8 | # Location of the parent directory of all parquet subfolders (iterates over)
 9 | #parquet_dataset   = <parent_of_parquet_subfolders/*>
10 | parquet_dataset = full_data/*
11 | 
12 | # Specify output directory (must create yourself!)
13 | # Make sure this directory exists!
14 | #output_dir      </output/subdir>
15 | output_dir      = resources/full_url_list_parsed_v2
16 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/generate_url_list_spark.py:
--------------------------------------------------------------------------------
 1 | import configparser
 2 | from os import path
 3 | import sys
 4 | from slugify import slugify
 5 | from pyspark.sql import SparkSession, functions, types
 6 | 
 7 | # Safety for spark stuff
 8 | spark = SparkSession.builder.appName('URL extractor').getOrCreate()
 9 | assert sys.version_info >= (3, 4) # make sure we have Python 3.4+
10 | assert spark.version >= '2.1' # make sure we have Spark 2.1+
11 | 
12 | # UDF to generate a text file from the script URL
13 | def shorten_name(url_name):
14 |     # Strip out 'http', 'https', '/', and '.js'
15 |     shortened_url = url_name.replace(
16 |                                 'https://', ''
17 |                             ).replace(
18 |                                 'http://', ''
19 |                             ).replace(
20 |                                 '/', '_'
21 |                             ).replace(
22 |                                 '.js', ''
23 |                             )
24 | 
25 |     # Shorten url to 250 characters (max file system can support)
26 |     shortened_url = slugify(shortened_url)[:250]
27 | 
28 |     # Specify the suffix for each downloaded file
29 |     suffix = '.txt'
30 | 
31 |     # Final output
32 |     file_name = shortened_url + suffix
33 |     return file_name
34 | 
35 | def main():
36 | 
37 |     # Specify target directory
38 |     config = configparser.ConfigParser()
39 |     config.read('config.ini')
40 | 
41 |     datatop = config['DEFAULT']['datatop']
42 |     parquet_dataset = path.join(datatop,config['DEFAULT']['parquet_dataset'])
43 |     output_dir = path.join(datatop,config['DEFAULT']['output_dir'])
44 | 
45 |     # Read in dataset, selecting the 'script_url' column and filter duplicates
46 |     data = spark.read.parquet(parquet_dataset).select('script_url').distinct()
47 | 
48 |     # Split the string on reserved url characters to get canonical url
49 |     data = data.withColumn(
50 |                     "parsed_url",
51 |                     functions.split("script_url", "[\?\#\,\;]")[0]
52 |                 ).distinct()
53 | 
54 |     # Only keep urls that are actually .js files
55 |     data = data.filter(
56 |                 data["parsed_url"].rlike("\.js$")
57 |             ).dropDuplicates(["parsed_url"])
58 | 
59 |     # User Defined Function to convert script URL to a filename usable by ext4
60 |     shorten_udf = functions.udf(shorten_name, returnType=types.StringType())
61 | 
62 |     # Apply the UDF over the whole list to generate a new column 'filename'
63 |     data = data.withColumn(
64 |                 'filename',
65 |                 shorten_udf(data.parsed_url)
66 |             ).sort('filename')
67 | 
68 |     # Save the data to parquet files
69 |     data.write.parquet(output_dir)
70 |     config = configparser.ConfigParser()
71 |     config.read('config.ini')
72 | 
73 | 
74 | if __name__ == '__main__':
75 |     main();
76 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/requirements.txt:
--------------------------------------------------------------------------------
1 | appdirs==1.4.3
2 | attrs==18.2.0
3 | Click==7.0
4 | pkg-resources==0.0.0
5 | python-slugify==1.2.6
6 | toml==0.10.0
7 | Unidecode==1.0.23
8 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/test_generate_url_list_spark.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | from slugify import slugify
 3 | from pyspark.sql import SparkSession, functions, types
 4 | 
 5 | # Safety for spark stuff
 6 | spark = SparkSession.builder.appName('URL extractor').getOrCreate()
 7 | assert sys.version_info >= (3, 4) # make sure we have Python 3.4+
 8 | assert spark.version >= '2.1' # make sure we have Spark 2.1+
 9 | 
10 | # UDF to generate a text file from the script URL
11 | def shorten_name(url_name):
12 |     # Strip out 'http', 'https', '/', and '.js'
13 |     shortened_url = url_name.replace(
14 |                                 'https://', ''
15 |                             ).replace(
16 |                                 'http://', ''
17 |                             ).replace(
18 |                                 '/', '_'
19 |                             ).replace(
20 |                                 '.js', ''
21 |                             )
22 | 
23 |     # Shorten url to 250 characters (max file system can support)
24 |     shortened_url = slugify(shortened_url)[:250]
25 | 
26 |     # Specify the suffix for each downloaded file
27 |     suffix = '.txt'
28 | 
29 |     # Final output
30 |     file_name = shortened_url + suffix
31 |     return file_name
32 | 
33 | def main():
34 | 
35 |     # Specify test file, a csv of urls to parse
36 |     TEST_FILE = "test_urls.csv"
37 |     OUTPUT_FILE = "parsed_test_urls.csv"
38 | 
39 |     # Read in dataset, selecting the 'script_url' column and filter duplicates
40 |     data = spark.read.csv(TEST_FILE,header='true').distinct()
41 | 
42 |     # Split the string on reserved url characters to get canonical url
43 |     data = data.withColumn(
44 |                     "parsed_url",
45 |                     functions.split("script_url", "[\?\#\,\;]")[0]
46 |                 ).distinct()
47 | 
48 |     # Only keep urls that are actually .js files
49 |     data = data.filter(
50 |                 data["parsed_url"].rlike("\.js$")
51 |             ).dropDuplicates(["parsed_url"])
52 | 
53 |     # User Defined Function to convert script URL to a filename usable by ext4
54 |     shorten_udf = functions.udf(shorten_name, returnType=types.StringType())
55 | 
56 |     # Apply the UDF over the whole list to generate a new column 'filename'
57 |     data = data.withColumn(
58 |                 'filename',
59 |                 shorten_udf(data.parsed_url)
60 |             )#.sort('filename')
61 | 
62 |     # Save the data to parquet files
63 |     data.toPandas().to_csv(OUTPUT_FILE)
64 | 
65 | if __name__ == '__main__':
66 |     main();
67 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/test_urls.csv:
--------------------------------------------------------------------------------
 1 | script_url
 2 | https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com
 3 | https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js
 4 | http://cpro.baidustatic.com/cpro/ui/noexpire/js/4.0.1/adClosefeedbackUpgrade.min.js
 5 | https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=fe1ad16a94c816&origin=http%3A%2F%2Farabi21.com
 6 | https://static.dynamicyield.com/scripts/12290/dy-coll-min.js
 7 | https://www.syracuse.edu/about/
 8 | https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL
 9 | https://www.syracuse.edu/wp-includes/js/wp-emoji-release.min.js?ver=4.9.1
10 | https://www.google-analytics.com/analytics.js
11 | https://code.jquery.com/jquery-migrate-1.4.1.min.js
12 | https://www.syracuse.edu/wp-content/themes/g6-carbon/js/carbon-all.js?ver=6.3.6
13 | https://www.syracuse.edu/wp-includes/js/wp-embed.min.js?ver=4.9.1
14 | https://syr-piwik-prod.syr.edu/piwik.js
15 | http://ads.pubmatic.com/AdServer/js/showad.js#PIX&kdntuid=1&SPug=true&p=37855&predirect=http%3A%2F%2Fdelivery.swid.switchadhub.com%2Fadserver%2Fuser_sync.php%3FSWID%3Dc3a5b350611a2a8227e5f3e7fb4785fc%26sKey%3DPM3%26sVal%3D%26do%5Bsingle%5D%3D1&it=0&np=0
16 | https://z.moatads.com/keplerpaypaldcm168224283233/moatad.js#moatClientLevel1=11459972&moatClientLevel2=3346219&moatClientLevel3=210658248&moatClientLevel4=95946136&zMoatG=ct=US&st=&city=0&dma=0&zp=&bw=4&zMoatUSER=AMsySZYB24A5c8yM4Xpv9GSzr-9f
17 | https://securepubads.g.doubleclick.net/gpt/pubads_impl_170.js
18 | http://cdn.optimizely.com/js/549871026.js
19 | http://w.sharethis.com/button/sharethis.js#&offsetLeft=-283&offsetTop=-7&publisher=8a80909b-bbba-4773-ad9c-57ff5f2349d5&type=website&post_services=email%2Cfacebook%2Ctwitter%2Cgbuzz%2Cmyspace%2Cdigg%2Csms%2Cwindows_live%2Cdelicious%2Cstumbleupon%2Creddit%2Cgoogle_bmarks%2Clinkedin%2Cbebo%2Cybuzz%2Cblogger%2Cyahoo_bmarks%2Cmixx%2Ctechnorati%2Cfriendfeed%2Cpropeller%2Cwordpress%2Cnewsvine&button=false
20 | http://ajax.googleapis.com/ajax/libs/swfobject/2.2/swfobject.js
21 | http://www.google-analytics.com/analytics.js
22 | http://cdn.mplxtms.com/s/MasterTMS.min.js#2076
23 | http://b.scorecardresearch.com/beacon.js
24 | https://www.googletagmanager.com/gtm.js?id=GTM-WZQCWK
25 | http://dthq3mor50viz.cloudfront.net/zbajck9faU.js
26 | http://pagead2.googlesyndication.com/pagead/osd.js
27 | http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/jwplayer7/jwplayer.js
28 | http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/cufon-yui.js
29 | https://apis.google.com/js/plusone.js
30 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/README.md:
--------------------------------------------------------------------------------
 1 | # Asynchronous Batch Downloading of JavaScript files
 2 | 
 3 | `async_js_get.py` iterates over the parquet files generated in part 1 to asynchronously download all of the files. The script first scans your specified output directory to skip over the already downloaded files. No arguments are needed, just ensure that you satisfied all of the requirements specified in `requirements.txt`. Run with:
 4 | ```
 5 | $ ./async_js_get.py > js_status.csv
 6 | ```
 7 | This saves the status responses for some analysis if so desired later. Note that the error codes may have some characters invalidating the schema (namely extra commas).
 8 | 
 9 | You may run into some system configuration issues; you may need to increase the ulimit (3000 worked for me):
10 | 
11 | ```
12 | $ ulimit -n 3000
13 | ```
14 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/async_js_get.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | #
  3 | # Original code: Cristian Garcia, Sep 21 '18
  4 | #       https://medium.com/@cgarciae/making-an-infinite-number-of-requests-with-
  5 | #               python-aiohttp-pypeln-3a552b97dc95
  6 | #
  7 | # Script adapted by David Dobre Nov 20 '18:
  8 | #       Added parquet loading, iteration over dataframes, and content saving
  9 | #
 10 | # NOTE: You may need to increase the ulimit (3000 worked for me):
 11 | #
 12 | #           $ ulimit -n 3000
 13 | #
 14 | ################################################################################
 15 | 
 16 | from aiohttp import ClientError, ClientSession, TCPConnector
 17 | import asyncio
 18 | import configparser
 19 | import glob
 20 | import os, os.path
 21 | import pandas as pd
 22 | import ssl
 23 | import sys
 24 | 
 25 | from pathlib import Path
 26 | from pypeln import asyncio_task as aio
 27 | 
 28 | ##### Specify directories and parameters
 29 | config = configparser.ConfigParser()
 30 | config.read('config.ini')
 31 | 
 32 | # Top directory
 33 | datatop = config['DEFAULT']['datatop']
 34 | 
 35 | # Input directory
 36 | url_list = os.path.join(datatop, config['DEFAULT']['url_list'])
 37 | 
 38 | # Output directory
 39 | output_dir = os.path.join(datatop, config['DEFAULT']['output_dir'])
 40 | 
 41 | # Max number of workers
 42 | limit = config['DEFAULT'].getint('limit')
 43 | 
 44 | print(datatop)
 45 | print(url_list)
 46 | print(output_dir)
 47 | print(limit)
 48 | 
 49 | # Not sure what this monkey business does
 50 | ssl.match_hostname = lambda cert, hostname: True
 51 | 
 52 | ##### Load in dataset ##########################################################
 53 | parquet_dir = Path(url_list)
 54 | input_data = pd.concat(
 55 |     pd.read_parquet(parquet_file)
 56 |     for parquet_file in parquet_dir.glob('*.parquet')
 57 | )
 58 | 
 59 | # Check for existing files in output directory
 60 | existing_files = [os.path.basename(x) for x in glob.glob(output_dir + "*.txt")]
 61 | 
 62 | # Remove those from the "to request" list as they're already downloaded
 63 | input_data = input_data[~input_data['filename'].isin(existing_files)]
 64 | 
 65 | # Append filename to the output folder and get the raw string value
 66 | input_data['filename'] = output_dir + input_data['filename']
 67 | input_data = input_data.values
 68 | 
 69 | print("url,status,")
 70 | 
 71 | ##### Async fetch ##############################################################
 72 | async def fetch(data, session):
 73 | 
 74 |     url = data[1]
 75 |     filename = data[2]
 76 | 
 77 |     try:
 78 |         async with session.get(url, timeout=2) as response:
 79 |             output = await response.read()
 80 |             print("{},{},".format(url, response.status))
 81 | 
 82 |             if (response.status == 200 and output):
 83 |                 with open(filename, "wb") as source_file:
 84 |                     source_file.write(output)
 85 |                 return output
 86 | 
 87 |             return response.status
 88 | 
 89 |     # Catch exceptions
 90 |     except ClientError as e:
 91 |         print("{},{},".format(url, e))
 92 |         return e
 93 | 
 94 |     except asyncio.TimeoutError as e:
 95 |         print("{},{},".format(url, e))
 96 |         return erb
 97 | 
 98 |     except ssl.CertificateError as e:
 99 |         print("{},{},".format(url, e))
100 |         return e
101 | 
102 |     except ssl.SSLError as e:
103 |         print("{},{},".format(url, e))
104 |         return e
105 | 
106 |     except ValueError as e:
107 |         print("{},{},".format(url, e))
108 |         return e
109 | 
110 |     except TimeoutError as e:
111 |         print("{},{},".format(url, e))
112 |         return e
113 | 
114 |     except concurrent.futures._base.TimeoutError as e:
115 |         print("{},{},".format(url, e))
116 |         return e
117 | 
118 | 
119 | ##### Iterate over each list entry #############################################
120 | aio.each(
121 |     fetch,              # worker function
122 |     input_data,         # input arguments
123 |     workers = limit,    # max number of workers
124 |     on_start = lambda: ClientSession(connector=TCPConnector(limit=None)),
125 |     on_done = lambda _status, session: session.close(),
126 |     run = True,
127 | )
128 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/config.ini:
--------------------------------------------------------------------------------
 1 | # Note: Make sure to not add comments on any lines being parsed in!
 2 | [DEFAULT]
 3 | 
 4 | # Location of the top of where data is located
 5 | #datatop         = </path/to/datatop>
 6 | datatop         = /mnt/Data/UCOSP_DATA
 7 | 
 8 | # Specify the directory containing the output of Pt. 1
 9 | url_list        = resources/full_url_list_parsed
10 | 
11 | # Specify directory to save all text files (make sure it exists!)
12 | output_dir      = js_source_files
13 | 
14 | # Worker limit (larger values fail for me, experiment)
15 | limit           = 20
16 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/README.md:
--------------------------------------------------------------------------------
 1 | # JavaScript Redundancy Analysis
 2 | 
 3 | The purpose of these scripts are to analyze the difference between the uncleansed `script_url`s from the full dataset and their parsed counterparts. The parsed 'condensed dataset' essentially omits any queries that were included in the `script_url`, condensing the number of entries from over a million to less than 200000.
 4 | 
 5 | `extract_hashes_from_full_dataset.py` generates a dictionary of hashes for every file (with `utf-8` encoding) in the specified directory. It is far too expensive to try to do this whole process in one step - filesystems are not fond of having over 800000 files in one directory. The pickled output is used in the next step.
 6 |  
 7 | `compare_condensed_with_full.py` uses the data from the above script and iterates over the specified condensed dataset, generating a final pickled output of a pandas dataframe with a schema of `['parent_url', 'url', 'hash']`, allowing the comparison between hashes sharing the same base URL.
 8 | 
 9 | `explore_downloads.ipynb` is a jupyter-notebook containing a brief analysis on the redundancy of the JS files when making requests with queries.
10 | 
11 | Note: for a small section on the HTTP status response, you will require `js_status.csv`, which is the output of Pt.2's download script dumped into a text file. This stores the URLs and their HTTP response status. This is not necessary for the rest of the analysis so you can remove the associated code to this.
12 | 
13 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/compare_condensed_with_full.py:
--------------------------------------------------------------------------------
 1 | #! /usr/bin/env python3
 2 | import sys
 3 | import hashlib
 4 | import numpy as np
 5 | import pandas as pd
 6 | import glob
 7 | from pathlib import Path
 8 | import pickle
 9 | from pypeln import asyncio_task as aio
10 | from os import path
11 | 
12 | STORAGE_DIR = "/mnt/Data/UCOSP_DATA"
13 | 
14 | FULL_URL_LIST = "full_data.pickle"
15 | 
16 | CLEANED_FILES = path.join(STORAGE_DIR, ("js_source_files" + "/*"))
17 | 
18 | OUTPUT_FILE = "final_processed.pickle"
19 | 
20 | ##### Load in dataset
21 | print("Retrieving list from: '{}'".format(CLEANED_FILES))
22 | input_data_cleaned = list(glob.glob(CLEANED_FILES))
23 | 
24 | with open(FULL_URL_LIST, "rb") as handle:
25 |     input_data_full = pickle.load(handle)
26 | 
27 | # Sanity check
28 | print(
29 |     "\nThere are {} urls found in the cleaned dataset:\n\t'{}'".format(
30 |         len(input_data_cleaned), CLEANED_FILES
31 |     )
32 | )
33 | print(
34 |     "\nThere are {} hashes found in the complete dataset:\n\t'{}'".format(
35 |         len(input_data_full), FULL_URL_LIST
36 |     )
37 | )
38 | 
39 | # Generating new dataframe
40 | def get_hash_from_file(filename):
41 |     sha1 = hashlib.sha1()
42 |     with open(filename, "r") as f:
43 |         data = f.read()
44 |         sha1.update(data.encode("utf-8"))
45 |     return sha1.hexdigest()
46 | 
47 | output_success = []
48 | output_fails = []
49 | counter_success = 0
50 | counter_failed = 0
51 | 
52 | # Iterate over all of the cleansed dataset
53 | for filename in input_data_cleaned:
54 |     if counter_success % 5000 == 0:
55 |         print("{}/{}".format(counter_success, len(input_data_cleaned)))
56 | 
57 |     # Get source url
58 |     raw_filename = filename.split("/")[-1]
59 | 
60 |     # Try to get the file hash, only works with utf-8
61 |     try:
62 |         file_hash = get_hash_from_file(filename)
63 | 
64 |     except UnicodeDecodeError as e:
65 |         print("Bad type:\n{}".format(e))
66 |         counter_failed += 1
67 |         output_fails.append(raw_filename)
68 |         continue
69 | 
70 |     # Create an entry for the parent and append it
71 |     parent_dict = {
72 |         "parent_filename": raw_filename,
73 |         "filename": raw_filename,
74 |         "hash": get_hash_from_file(filename),
75 |     }
76 | 
77 |     output_success.append(parent_dict)
78 | 
79 |     # Now search for all entries in the complete crawl with the same base url
80 |     search = raw_filename.split(".txt")[0]
81 |     for key in input_data_full:
82 | 
83 |         if key.startswith(search) and key != raw_filename:
84 |             child_dict = {
85 |                 "parent_filename": raw_filename,
86 |                 "filename": key,
87 |                 "hash": input_data_full[key],
88 |             }
89 |             output_success.append(child_dict)
90 |     counter_success += 1
91 | 
92 | # Pickle output
93 | df = pd.DataFrame(output_success)
94 | df.to_pickle(OUTPUT_FILE)
95 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/extract_hashes_from_full_dataset.py:
--------------------------------------------------------------------------------
 1 | #! /usr/bin/env python3
 2 | import sys
 3 | import hashlib
 4 | import glob
 5 | from pathlib import Path
 6 | from os import path
 7 | import pickle
 8 | 
 9 | STORAGE_DIR = "/mnt/Data/UCOSP_DATA"
10 | 
11 | ALL_FILES = path.join(STORAGE_DIR, ("1st_batch_js_source_files" + "/*"))
12 | 
13 | OUTPUT_FILE = "full_data.pickle"
14 | 
15 | OUTPUT_FILE_FAILS = "fails.pickle"
16 | 
17 | # Generating new dataframe
18 | def get_hash_from_file(filename):
19 |     sha1 = hashlib.sha1()
20 |     with open(filename, "r") as f:
21 |         data = f.read()
22 |         sha1.update(data.encode("utf-8"))
23 |     return sha1.hexdigest()
24 | 
25 | # Get file list
26 | print("Retrieving list from: '{}'".format(ALL_FILES))
27 | total_file_list = list(glob.glob(ALL_FILES))
28 | total_num_files = len(total_file_list)
29 | print("Retrieved list of {} files.\n".format(total_num_files))
30 | 
31 | output_dict = {}
32 | output_fails = []
33 | counter = 0
34 | fails = 0
35 | 
36 | # Iterate over entire file list
37 | for filename in total_file_list:
38 |     if counter % 20000 == 0:
39 |         print(80 * "#" + "\n")
40 |         print("{}/{}".format(counter, total_num_files))
41 |         print(80 * "#" + "\n")
42 | 
43 |     raw_filename = filename.split("/")[-1]
44 |     try:
45 |         file_hash = get_hash_from_file(filename)
46 |     except UnicodeDecodeError as e:
47 |         print("Bad type:\n{}".format(e))
48 |         fails += 1
49 |         output_fails.append(raw_filename)
50 |         continue
51 | 
52 |     output_dict.update({raw_filename: file_hash})
53 |     counter += 1
54 | 
55 | print(
56 |     "{}/{}\n\nDONE... Now pickling data to: '{}'".format(
57 |         counter, total_num_files, OUTPUT_FILE
58 |     )
59 | )
60 | 
61 | with open(OUTPUT_FILE, "wb") as f:
62 |     # Pickle the 'data' dictionary using the highest protocol available.
63 |     pickle.dump(output_dict, f, pickle.HIGHEST_PROTOCOL)
64 | 
65 | # Store fails for a later time
66 | with open(OUTPUT_FILE_FAILS, "wb") as f:
67 |     pickle.dump(output_fails, f)
68 | 
69 | print("DONE\nFinal stats:\nSuccess:\t{}\nFails:\t{}".format(counter, fails))
70 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/requirements.txt:
--------------------------------------------------------------------------------
 1 | aiohttp==3.4.4
 2 | async-timeout==3.0.1
 3 | attrs==18.2.0
 4 | chardet==3.0.4
 5 | idna==2.8
 6 | idna-ssl==1.1.0
 7 | multidict==4.5.2
 8 | numpy==1.15.4
 9 | pandas==0.23.4
10 | pkg-resources==0.0.0
11 | pypeln==0.1.6
12 | python-dateutil==2.7.5
13 | pytz==2018.7
14 | six==1.12.0
15 | yarl==1.3.0
16 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/single_js_get.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | import requests
 3 | import sys
 4 | 
 5 | url = sys.argv[1]
 6 | 
 7 | TIMEOUT = 2
 8 | OUT_FILENAME = "downloaded.js"
 9 | 
10 | try:
11 |     response = requests.get(url, timeout=TIMEOUT)
12 | 
13 |     if response.status_code == 200:
14 |         print("File found - writing contents to {}".format(OUT_FILENAME))
15 | 
16 |         with open(OUT_FILENAME, "w") as source_file:
17 |             source_file.write(response.text)
18 | 
19 |     else:
20 |         print("Error! Status code: {}".format(response.status_code))
21 | 
22 | except request.exceptions.RequestException as e:
23 |     print("Exception")
24 |     print(e)
25 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/README.md:
--------------------------------------------------------------------------------
 1 | # API Symbol extraction from JSON database:
 2 | 
 3 | Requires Mozilla's API list found on the [browser compatability data github](https://github.com/mdn/browser-compat-data/tree/master/api). For this reference, I've downloaded all of the json files into a directory called `api/`.
 4 | 
 5 | To run, configure the provided `config.ini` file. Make sure to specify both the directory of the api json files (found above) as well a list of those to generate symbol lists for. Run with:
 6 | ```
 7 | $ ./process_APIs.py
 8 | ```
 9 | 
10 | Program will run through the specified .json files, extracting methods/symbols, and then checking if these methods/symbols also exist as .json files in the json directory. If they do (and they haven't already been parsed), then program will also check through those files.
11 | 
12 | Spits out a json file with a user specified name, with keys corresponding to the original interface, and values containing the lists of the corresponding symbols and methods.
13 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/config.ini:
--------------------------------------------------------------------------------
 1 | # Note: Make sure to not add comments on any lines being parsed in!
 2 | [DEFAULT]
 3 | 
 4 | # Location to the file containing a list of all files you want to generate symbol lists for
 5 | seed            = master.txt
 6 | 
 7 | # Location to the folder containing all API json files
 8 | api_data        = api/
 9 | 
10 | # Specify output file containing the symbol list
11 | output          = symbol_dict.json
12 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/process_APIs.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | import configparser
  3 | from csv import DictWriter
  4 | import json
  5 | from os import listdir
  6 | import sys
  7 | 
  8 | # want to return a dict (lowercase : FileName.json) to recursively get nested
  9 | #   methods and properties.
 10 | def getAllFileDict(api_data):
 11 |     master_API_dict = {}
 12 | 
 13 |     for entry in listdir(api_data):
 14 |         key = str(entry.split(".")[0])
 15 |         master_API_dict[key] = entry
 16 | 
 17 |     return master_API_dict
 18 | 
 19 | 
 20 | # import all data from specified filename
 21 | def importData(filename):
 22 | 
 23 |     # read in JSON data
 24 |     with open(filename, encoding='utf-8') as data_file:
 25 |         data = json.loads(data_file.read())
 26 | 
 27 |     return data['api']
 28 | 
 29 | 
 30 | # once imported data from file, extract the desired properties
 31 | def extractProperties(json_data):
 32 | 
 33 |     # this only works if guaranteed that json file has one key under 'api'
 34 |     interface_name = list(json_data.keys())[0]
 35 |     property_list = list(json_data[interface_name].keys())
 36 |     if "__compat" in property_list:
 37 |         property_list.remove("__compat")
 38 | 
 39 |     return str(interface_name), property_list
 40 | 
 41 | 
 42 | def recursivelyGetProperties(current_interface, res_dict, master_API_dict):
 43 | 
 44 |     # import data from file
 45 |     data = importData(current_interface)
 46 | 
 47 |     # extract interface name and its associated property list
 48 |     interface_name, property_list = extractProperties(data)
 49 | 
 50 |     # add entries into the output
 51 |     res_dict[interface_name] = property_list
 52 | 
 53 |     # iterate over all properties, search if they exist in the master API list
 54 |     for entry in property_list:
 55 | 
 56 |         # (all keys within res_dict and master_API_dict are lowercase)
 57 |         entry = str.lower(entry)
 58 |         if (entry in master_API_dict):
 59 | 
 60 |             # if they exist, make sure they aren't already in the results dict
 61 |             if (entry not in res_dict.keys()):
 62 | 
 63 |                 # take result, create new seed and recursively fill res_dict
 64 |                 new_seed = sys.argv[2] + master_API_dict[entry]
 65 |                 res_dict = recursivelyGetProperties(new_seed, res_dict, \
 66 |                         master_API_dict)
 67 |     return res_dict
 68 | 
 69 | def main():
 70 |     config = configparser.ConfigParser()
 71 |     config.read('config.ini')
 72 | 
 73 |     seed_apis = config['DEFAULT']['seed']
 74 |     api_data = config['DEFAULT']['api_data']
 75 |     output = config['DEFAULT']['output']
 76 | 
 77 |     # init empty dict to store all results
 78 |     res_dict = {}
 79 | 
 80 |     # read in all available files to recurse over
 81 |     master_API_dict = getAllFileDict(api_data)
 82 | 
 83 |     # open seed file
 84 |     file_list = open(seed_apis).read().splitlines()
 85 | 
 86 |     # iterate over the list of specified files
 87 |     for entry in file_list:
 88 | 
 89 |         # generate a nice filename
 90 |         file_location = api_data + entry
 91 |         res_dict = recursivelyGetProperties(file_location, res_dict, \
 92 |                 master_API_dict)
 93 | 
 94 |     # dump all results to a file
 95 |     with open(output, 'w') as fp:
 96 |         json.dump(res_dict, fp, indent=4)
 97 | 
 98 | if __name__ == '__main__':
 99 |     main();
100 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/README.md:
--------------------------------------------------------------------------------
 1 | # Analyze Syntax Tree Counts
 2 | 
 3 | This portion performs the actual counting of API symbol calls by generating an abstract syntax tree (AST) and walking through it (breadth-first) to extract symbol information. The AST generation is performed by [Esprima](https://github.com/Kronuz/esprima-python).
 4 | 
 5 | There are two scripts. `async_tree_explorer.py` uses the parameters specified in `config.ini` to asynchronously generate symbol counts for all files within a directory. Depending on your machine, this may take a long time (10000 files took about 12 hours in one sample using 16 workers). There isn't a significant memory overhead, so depending on the size of your dataset, 16-32GB of RAM should be enough. The is a pandas dataframe with the schema: `['filename', 'symbol_1' , ... , 'symbol_n']`. This results in NaN values for scrips that do not have a symbol call that others do. The benefit of this is that one can analyze specific symbols.
 6 | 
 7 | `single_tree_explorer.py` performs the same analysis as `async_tree_explorer.py` on a single file specified by the user at runtime. Runs with:
 8 | ```
 9 | $ ./single_tree_explorer.py <path/to/api_list> <path/to/js_file>
10 | ```
11 | This generates two outputs in `output_data`. `symbol_counts.json` is the output the count of bottom level calls, for example `window.location.href` would be counted as `href`. `extended_symbol_counts.json` would attempt to find the whole symbol `window.location.href`. There are limitations to this however. If the calls were broken up over several different functions, the AST has no way of recovering that information, so you might be left with `location.href`. Depending on the symbol, it might be possible to infer the parent APIs, but there are many cases of conflict which makes this not guaranteed. Furthermore, if the script is heavily obfuscated, then this approach will almost certainly fail. 
12 | 
13 | The cases this script looks for are `MemberExpressions`, `CallExpressions`, and `object`s within either of the expressions.
14 | 
15 | Looking at some examples, the JavaScript line
16 | ```js
17 | if ((new RegExp("WebKit")).test(navigator.userAgent))
18 | ```
19 | parses out to:
20 | ```json
21 | "type": "IfStatement",
22 | "test": {
23 |         "type": "CallExpression",
24 |         "callee": {
25 |                 "type": "MemberExpression",
26 |                 "computed": false,
27 |                 "object": {
28 |                         "type": "NewExpression",
29 |                         "callee": {
30 |                                 "type": "Identifier",
31 |                                 "name": "RegExp"
32 |                         },
33 |                         "arguments": [
34 |                         {
35 |                                 "type": "Literal",
36 |                                 "value": "WebKit",
37 |                                 "raw": "\"WebKit\""
38 |                         }
39 |                         ]
40 |                 },
41 |                 "property": {
42 |                         "type": "Identifier",
43 |                         "name": "test"
44 |                 }
45 |         },
46 |         "arguments": [
47 |         {
48 |                 "type": "MemberExpression",
49 |                 "computed": false,
50 |                 "object": {
51 |                         "type": "Identifier",
52 |                         "name": "navigator"
53 |                 },
54 |                 "property": {
55 |                         "type": "Identifier",
56 |                         "name": "userAgent"
57 |                 }
58 |         }
59 |         ]
60 | },
61 | ...
62 | ```
63 | The syntax tree walker will focus on nodes with the type `MemberExpression` or `CallExpression`. Focusing on the node which has the appropriate type:
64 | ```json
65 | {
66 |         "type": "MemberExpression",
67 |         "computed": false,
68 |         "object": {
69 |                 "type": "Identifier",
70 |                 "name": "navigator"
71 |         },
72 |         "property": {
73 |                 "type": "Identifier",
74 |                 "name": "userAgent"
75 |         }
76 | }
77 | ```
78 | The script first grabs the `property.name` parameter of this node, which is `userAgent`. The value is then compared to the list of symbols fed in, and if it does not exist in that list, this node is ignored. This would be the "outermost" symbol call, so this value is set to be the current output string (`output = userAgent`). The algorithm now recursively checks for objects in the node and repeats the first part of checking for types and properties. In this case, the object type is `Identifier`, so this this is the bottom of this branch and this value is prepended before the final value, i.e.  `tmp = navigator.userAgent`. Note that the object's type might have been `MemberExpression` or `CallExpression`, in which case `property.name` is prepended to the output, and then another `object` check is done (until the object is of type `Identifier`).
79 | 
80 | In the case of `CallExpression`, the same logic applies but instead of looking for `node.property.name`, we must look for `node.callee.name`. For example:
81 | ```json
82 | "type": "CallExpression",
83 | "callee": {
84 |         "type": "Identifier",
85 |         "name": "setInterval"
86 | },
87 | ```
88 | where `setInterval` is the symbol extracted.
89 | 
90 | Note that this approach is vulnerable to prepending nodes which share names with valid symbols in the symbol list. This is largely the case for single character object names, such as `d.userAgent`, where `d` is a valid symbol in `DOMMatrixReadOnly`. This is mitigated by assuming that parent objects are unlikely to be of length 1 or 2, so any parents which have only one or two characters are discarded. The only vulnerable two character values would be `id`, `go`, `as`, `ch`, and `db` (using the provided master list).
91 | 
92 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/async_tree_explorer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | import esprima
  4 | import configparser
  5 | import glob
  6 | import json
  7 | import multiprocessing
  8 | import pandas as pd
  9 | import sys
 10 | 
 11 | from os import path
 12 | 
 13 | config = configparser.ConfigParser()
 14 | config.read("config.ini")
 15 | 
 16 | # Top directory for all data and resource files
 17 | DATATOP = config["DEFAULT"]["datatop"]
 18 | 
 19 | # Stage 1 output:
 20 | #   directory for url:filename dictionary files (PARQUET!)
 21 | URL_FILENAME_DICT = path.join(DATATOP, 'resources/full_url_list_parsed/')
 22 | 
 23 | # Stage 2 output:
 24 | #   path to downloaded javascript files
 25 | JS_SOURCE_FILES = path.join(DATATOP, config["DEFAULT"]["js_source_files"])
 26 | 
 27 | # Stage 3 output:
 28 | #   symbol list
 29 | SYM_LIST = config["DEFAULT"]["sym_list"]
 30 | 
 31 | # Output directory
 32 | OUTPUT_DIR = path.join(DATATOP, "resources/symbol_counts/")
 33 | 
 34 | OUTPUT_FILE = "full_run_trial2"
 35 | OUTPUT_FAIL = "fails"
 36 | 
 37 | 
 38 | # Number of workers:
 39 | WORKERS = multiprocessing.cpu_count()
 40 | 
 41 | # Number of files per queue batch
 42 | BATCH_SIZE = 1000
 43 | 
 44 | 
 45 | class SymbolNode:
 46 |     def __init__(self, depth, width, parent_depth, parent_width):
 47 |         self._depth = depth
 48 |         self._width = width
 49 |         self._parent_depth = parent_depth
 50 |         self._parent_width = parent_width
 51 | 
 52 |     def setDepthWidth(depth, width):
 53 |         self._depth = depth
 54 |         self._width = width
 55 | 
 56 | 
 57 | class CustomEncoder(json.JSONEncoder):
 58 |     def default(self, obj):
 59 | 
 60 |         # Symbol node class
 61 |         if isinstance(obj, SymbolNode):
 62 |             return {
 63 |                 "depth": obj._depth,
 64 |                 "width": obj._width,
 65 |                 "parent_depth": obj._parent_depth,
 66 |                 "parent_width": obj._parent_width,
 67 |             }
 68 | 
 69 |         return json.JSONEncoder.default(self, obj)
 70 | 
 71 | 
 72 | class Element:
 73 |     ## Define keys you want to skip over
 74 |     BLACKLISTEDKEYS = ["parent"]
 75 | 
 76 |     ## Constructor
 77 |     def __init__(self, esprima_ast):
 78 |         self._ast = esprima_ast  # Assign member var AST
 79 |         self._visitors = []  # Init empty visitor array
 80 | 
 81 |     ## Add a new visitor to execute (will be executed at each node)
 82 |     def accept(self, visitor):
 83 |         self._visitors.append(visitor)
 84 | 
 85 |     ## (private) Step through the node's queue of potential nodes to visit
 86 |     def _step(self, node, queue, depth, width):
 87 |         before = len(queue)
 88 | 
 89 |         for key in node.keys():  # Enumerate keys for possible children
 90 |             if key in self.BLACKLISTEDKEYS:
 91 |                 continue  # Ignore node if it is blacklisted
 92 | 
 93 |             child = getattr(node, key)  # Assign child = node.key
 94 | 
 95 |             # if the child exists && the child has an attribute 'type'
 96 |             if child and hasattr(child, "type") == True:
 97 |                 child.parent = node  # Assign this node as child's parent
 98 |                 child.parent_depth = depth
 99 |                 child.parent_width = width
100 |                 queue.append(child)  # Append the child in this node's queue
101 | 
102 |             # if there is a list of children
103 |             if isinstance(child, list):
104 |                 for item in child:  # Iterate through them and do the same
105 |                     #   as above
106 |                     if hasattr(item, "type") == True:
107 |                         item.parent = node
108 |                         item.parent_depth = depth
109 |                         item.parent_width = width
110 |                         queue.append(item)
111 | 
112 |         return len(queue) - before  # Return whether any children were pushed
113 | 
114 |     ## Walk through this AST
115 |     def walk(self, api_symbols, filename):
116 |         queue = [self._ast]  # Add the imported AST to the queue
117 | 
118 |         # Initialize these entries
119 |         for node in queue:
120 |             node.parent_depth = 0
121 |             node.parent_width = 0
122 | 
123 |         # Depth and width counting
124 |         depth = 0  # what level of the tree we are in
125 |         width = 0  # how far from first node on this level we are
126 |         this_depth_num_nodes = 1  # how many nodes in this level are left
127 |         next_depth_num_nodes = 0  # how many nodes in the next level
128 |         node_counter = 0  # how many total nodes have been visited
129 |         this_depth_count = 0  # how many nodes are on this level (tot)
130 | 
131 |         # storage for the data
132 |         extended_symbol_counter = {}
133 |         symbol_counter = {key: 0 for key in api_symbols}
134 |         node_dict = {key: [] for key in api_symbols}
135 | 
136 |         extended_symbol_counter["script_url_filename"] = filename
137 | 
138 |         while len(queue) > 0:  # While stuff in the queue
139 |             node = queue.pop(0)  # Pop stuff off of the FRONT (0)
140 |             this_depth_num_nodes -= 1
141 |             node_counter += 1
142 |             width = node_counter - this_depth_count - 1
143 | 
144 |             for v in self._visitors:  # Run visitor instances here
145 |                 result = v.visit(node, api_symbols)
146 |                 if result:
147 |                     if result not in extended_symbol_counter.keys():
148 |                         extended_symbol_counter[result] = 1
149 |                     else:
150 |                         extended_symbol_counter[result] += 1
151 | 
152 |                     # MemberExpression
153 |                     if "MemberExpression" == node.type:
154 |                         tmp = node.property.name
155 | 
156 |                     # CallExpression
157 |                     if "CallExpression" == node.type:
158 |                         tmp = node.callee.name
159 | 
160 |                     symbol_counter[tmp] += 1  # increment counter
161 |                     this_node = SymbolNode(
162 |                         depth, width, node.parent_depth, node.parent_width
163 |                     )
164 |                     node_dict[tmp].append(this_node)
165 |                     break
166 | 
167 |             # If node is an instance of "esprima node", step through the node
168 |             #   Returns how many children have been added to the queue
169 |             if isinstance(node, esprima.nodes.Node):
170 | 
171 |                 # Feed the nodes that will be labeled as children the current
172 |                 #   depth and width
173 |                 next_depth_num_nodes += self._step(node, queue, depth, width)
174 | 
175 |             # Once this tree depth has been walked, update with the existing
176 |             #   "next" set and reset the next set to 0. Increment depth by 1,
177 |             #   and keep a tally on how many nodes have been counted up until
178 |             #   this depth.
179 |             if this_depth_num_nodes == 0:
180 |                 this_depth_num_nodes = next_depth_num_nodes  # update current list
181 |                 next_depth_num_nodes = 0  # reset this list
182 |                 this_depth_count = node_counter  #
183 |                 depth += 1
184 | 
185 |         print(
186 |             "Done '{}'. Total stats: Tree depth: {}\tTotal nodes:{}".format(
187 |                 filename, depth, node_counter
188 |             )
189 |         )
190 |         return symbol_counter, extended_symbol_counter, node_dict
191 | 
192 | 
193 | """
194 | Executes specified code given that an input node matches the property name of
195 |     this node.
196 | 
197 | Attributes:
198 |     _property_name: the name of the property required to execute the handler
199 |     _node_handler:  code to execute if _property_name matches
200 |     visit(node):    checks if input node's property matches this nodes; if yes,
201 |                         executes the code passed into _node_handler, passing the
202 |                         input node as an argument
203 | """
204 | class MatchPropertyVisitor:
205 |     ## Constructor
206 |     def __init__(self, property_name):
207 |         self._property_name = property_name  # userAgent, getContext, etc
208 | 
209 |     ##################################################
210 |     def _recursive_check_objects(self, node, api_symbols):
211 |         if node.object:
212 |             return self._recurrance_visit(node.object, api_symbols)
213 |         return False
214 | 
215 |     ## Visit the nodes, check if matches, and execute handler if it does
216 |     def _recurrance_visit(self, node, api_symbols):
217 | 
218 |         # No more objects to look through
219 |         if "Identifier" == node.type:
220 |             if node.name in api_symbols:
221 |                 return node.name
222 | 
223 |         # MemberExpression; maybe more objects
224 |         elif "MemberExpression" == node.type:
225 |             if node.property.name in api_symbols:
226 | 
227 |                 return_val = node.property.name
228 |                 tmp = self._recursive_check_objects(node, api_symbols)
229 | 
230 |                 if tmp:
231 |                     return_val = tmp + "." + return_val
232 | 
233 |                 return return_val
234 | 
235 |         # CallExpression; maybe more objects
236 |         elif "CallExpression" == node.type:
237 |             if node.callee.name in api_symbols:
238 | 
239 |                 return_val = node.callee.name
240 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
241 | 
242 |                 if tmp:
243 |                     return_val = tmp + "." + return_val
244 | 
245 |                 return return_val
246 |         return False
247 | 
248 |     ##################################################
249 | 
250 |     def _filter_parent_API(self, arg):
251 |         if len(arg.split(".")[0]) == 1:
252 |             arg = arg.split(".")
253 |             arg.pop(0)
254 |             arg = ".".join(arg)
255 |         return arg
256 | 
257 |     # FIRST VISIT: visit node, check if matches, and check for objects
258 |     def visit(self, node, api_symbols):
259 | 
260 |         # MemberExpression
261 |         if "MemberExpression" == node.type:
262 |             if node.property.name == self._property_name:
263 | 
264 |                 return_val = node.property.name
265 |                 tmp = self._recursive_check_objects(node, api_symbols)
266 | 
267 |                 if tmp:
268 |                     return_val = tmp + "." + return_val
269 |                     return_val = self._filter_parent_API(return_val)
270 | 
271 |                 return return_val
272 | 
273 |         # CallExpression
274 |         if "CallExpression" == node.type:
275 |             if node.callee.name == self._property_name:
276 | 
277 |                 return_val = node.callee.name
278 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
279 | 
280 |                 if tmp:
281 |                     return_val = tmp + "." + return_val
282 |                     return_val = self._filter_parent_API(return_val)
283 | 
284 |                 return return_val
285 |         return False
286 | 
287 | 
288 | # Extract JSON and JavaScript AST data from precompiled list
289 | def importData(js_file):
290 | 
291 |     return ast, api_list
292 | 
293 | 
294 | def uniquifyList(seq, idfun=None):
295 |     # order preserving
296 |     if idfun is None:
297 | 
298 |         def idfun(x):
299 |             return x
300 | 
301 |     seen = {}
302 |     result = []
303 |     for item in seq:
304 |         marker = idfun(item)
305 |         if marker in seen:
306 |             continue
307 |         seen[marker] = 1
308 |         result.append(item)
309 |     return result
310 | 
311 | 
312 | ################################################################################
313 | def worker_process(input_file):
314 | 
315 |     filename = input_file.split("/")[-1]
316 | 
317 |     # Try getting the AST using esprima, bail if non-JS syntax
318 |     try:
319 |         with open(input_file) as f:
320 |             ast = esprima.parseScript(f.read())
321 | 
322 |     except esprima.error_handler.Error as e:
323 |         print("Failure: non-javascript syntax detected. Terminating...")
324 |         return False, filename
325 | 
326 |     # Create an element using that AST
327 |     el = Element(ast)
328 |     for entry in api_symbols:
329 |         visitor = MatchPropertyVisitor(entry)
330 |         el.accept(visitor)
331 | 
332 |     # Walk down the AST (breadth-first)
333 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols, filename)
334 | 
335 |     return True, extended_symbol_counter
336 | 
337 | 
338 | ################################################################################
339 | if __name__ == "__main__":
340 | 
341 |     print("Initialized to use {} workers.".format(workers))
342 | 
343 |     # Extract all symbols from the generated "Symbols of Interest" list,
344 |     #   and flatten all api symbols into a single list (for efficiency)
345 |     with open(sym_list, encoding="utf-8") as f:
346 |         api_list = json.loads(f.read())
347 | 
348 |     print("Looking in '{}' for the API list...".format(sym_list))
349 |     api_symbols = [val for sublist in api_list.values() for val in sublist]
350 |     api_symbols = uniquifyList(api_symbols)
351 |     print("Success.")
352 | 
353 |     # Get file list from data directory
354 |     print("Looking in '{}' for all .txt files...".format(js_source_files))
355 |     file_list = glob.glob(js_source_files + "/*")
356 |     print("Success. Found {} files.".format(len(file_list)))
357 |     print("Begin iterating over the files to get symbol info.")
358 |     print("-" * 80)
359 | 
360 |     # Storage arrays
361 |     symbol_counts = []
362 |     fails_list = []
363 | 
364 |     # Callback
365 |     def log_result(result):
366 |         if result[0]:
367 |             symbol_counts.append(result[1])
368 |         else:
369 |             fails_list.append(result[1])
370 | 
371 |     # Setup thread pool
372 |     pool = multiprocessing.Pool(workers)
373 |     for filename in file_list:
374 |         re = pool.apply_async(worker_process, args=(filename,), callback=log_result)
375 | 
376 |     pool.close()
377 |     pool.join()
378 | 
379 |     # Done counting symbols
380 |     print("All files done. Saving...")
381 |     df = pd.DataFrame(symbol_counts)
382 | 
383 |     # Saving
384 |     df.to_parquet(output_dir + "/" + output_file + ".parquet.gzip", compression="gzip")
385 | 
386 |     with open(output_dir + "/" + output_fail + ".txt", "w") as f:
387 |         for item in fails_list:
388 |             f.write("%s\n" % item)
389 | 
390 |     print("Success.\n\nDONE SCRIPT!")
391 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/config.ini:
--------------------------------------------------------------------------------
 1 | # Note: Make sure to not add comments on any lines being parsed in!
 2 | [DEFAULT]
 3 | 
 4 | # Location of the top of where data is located
 5 | #datatop         = </path/to/datatop>
 6 | datatop         = /mnt/Data/UCOSP_DATA
 7 | 
 8 | # Stage 1 output:
 9 | #   directory for url:filename dictionary files (PARQUET!)
10 | url_filename_dict = resources/full_url_list_parsed
11 | 
12 | # Stage 2 output:
13 | #      Path to directory containing all of the downloaded scripts
14 | js_source_files = js_source_files
15 | 
16 | # Output directory to save parquet output (make sure it exists!)
17 | output_dir      = resources/symbol_counts_output
18 | 
19 | # Output file names
20 | output_file = output_counts_part
21 | output_fail = fails
22 | 
23 | # Specify the list of symbols you want to search for in the AST scan
24 | sym_list = master_sym_list.json
25 | 
26 | # Number of files per queue batch
27 | batch_size = 1000
28 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/new_async_tree_explorer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | import configparser
  4 | import esprima
  5 | import glob
  6 | import json
  7 | import multiprocessing
  8 | import pandas as pd
  9 | import sys
 10 | 
 11 | from os import path
 12 | from tqdm import tqdm
 13 | 
 14 | ################################################################################
 15 | 
 16 | config = configparser.ConfigParser()
 17 | config.read("config.ini")
 18 | 
 19 | # Top directory for all data and resource files
 20 | DATATOP = config["DEFAULT"]["datatop"]
 21 | 
 22 | # Stage 1 output:
 23 | #   directory for url:filename dictionary files (PARQUET!)
 24 | URL_FILENAME_DICT = path.join(DATATOP, config["DEFAULT"]["url_filename_dict"])
 25 | 
 26 | # Stage 2 output:
 27 | #   path to downloaded javascript files
 28 | JS_SOURCE_FILES = path.join(DATATOP, config["DEFAULT"]["js_source_files"])
 29 | 
 30 | # Stage 3 output:
 31 | #   specify the list of symbols you want to search for in the AST scan
 32 | SYM_LIST = config["DEFAULT"]["sym_list"]
 33 | 
 34 | # Output directory
 35 | OUTPUT_DIR = path.join(DATATOP, config["DEFAULT"]["output_dir"])
 36 | 
 37 | OUTPUT_FILE = config["DEFAULT"]["output_file"]
 38 | OUTPUT_FAIL = config["DEFAULT"]["output_fail"]
 39 | 
 40 | # Number of workers:
 41 | WORKERS = multiprocessing.cpu_count()
 42 | 
 43 | # Number of files per queue batch
 44 | BATCH_SIZE = config["DEFAULT"].getint("batch_size")
 45 | 
 46 | ################################################################################
 47 | class SymbolNode:
 48 | 
 49 |     def __init__(self, depth, width, parent_depth, parent_width):
 50 |         self._depth = depth
 51 |         self._width = width
 52 |         self._parent_depth = parent_depth
 53 |         self._parent_width = parent_width
 54 | 
 55 |     def setDepthWidth(depth, width):
 56 |         self._depth = depth
 57 |         self._width = width
 58 | 
 59 | class CustomEncoder(json.JSONEncoder):
 60 |     def default(self, obj):
 61 | 
 62 |         # Symbol node class
 63 |         if isinstance(obj, SymbolNode):
 64 |             return{ "depth" : obj._depth,
 65 |                     "width" : obj._width,
 66 |                     "parent_depth": obj._parent_depth,
 67 |                     "parent_width": obj._parent_width}
 68 | 
 69 |         return json.JSONEncoder.default(self, obj)
 70 | 
 71 | ################################################################################
 72 | class Element:
 73 | 
 74 |     ## Define keys you want to skip over
 75 |     BLACKLISTEDKEYS = ['parent']
 76 | 
 77 |     ## Constructor
 78 |     def __init__(self, esprima_ast):
 79 |         self._ast = esprima_ast         # Assign member var AST
 80 |         self._visitors = []             # Init empty visitor array
 81 | 
 82 | 
 83 |     ## Add a new visitor to execute (will be executed at each node)
 84 |     def accept(self, visitor):
 85 |         self._visitors.append(visitor)
 86 | 
 87 | 
 88 |     ## (private) Step through the node's queue of potential nodes to visit
 89 |     def _step(self, node, queue, depth, width):
 90 |         before = len(queue)
 91 | 
 92 |         for key in node.keys():         # Enumerate keys for possible children
 93 |             if key in self.BLACKLISTEDKEYS:
 94 |                 continue                # Ignore node if it is blacklisted
 95 | 
 96 |             child = getattr(node, key)  # Assign child = node.key
 97 | 
 98 |             # if the child exists && the child has an attribute 'type'
 99 |             if child and hasattr(child, 'type') == True:
100 |                 child.parent = node     # Assign this node as child's parent
101 |                 child.parent_depth = depth
102 |                 child.parent_width = width
103 |                 queue.append(child)     # Append the child in this node's queue
104 | 
105 |             # if there is a list of children
106 |             if isinstance(child, list):
107 |                 for item in child:      # Iterate through them and do the same
108 |                                         #   as above
109 |                     if hasattr(item, 'type') == True:
110 |                         item.parent = node
111 |                         item.parent_depth = depth
112 |                         item.parent_width = width
113 |                         queue.append(item)
114 | 
115 |         return len(queue) - before     # Return whether any children were pushed
116 | 
117 |     ## Walk through this AST
118 |     def walk(self, api_symbols, filename):
119 |         queue = [self._ast]             # Add the imported AST to the queue
120 | 
121 |         # Initialize these entries
122 |         for node in queue:
123 |             node.parent_depth = 0
124 |             node.parent_width = 0
125 | 
126 |         # Depth and width counting
127 |         depth                   = 0     # what level of the tree we are in
128 |         width                   = 0     # how far from first node on this level we are
129 |         this_depth_num_nodes    = 1     # how many nodes in this level are left
130 |         next_depth_num_nodes    = 0     # how many nodes in the next level
131 |         node_counter            = 0     # how many total nodes have been visited
132 |         this_depth_count        = 0     # how many nodes are on this level (tot)
133 | 
134 |         # storage for the data
135 |         extended_symbol_counter = {}
136 |         symbol_counter          = {key: 0 for key in api_symbols}
137 |         node_dict               = {key: [] for key in api_symbols}
138 | 
139 |         extended_symbol_counter['script_url_filename'] = filename
140 | 
141 |         while len(queue) > 0:           # While stuff in the queue
142 |             node = queue.pop(0)          # Pop stuff off of the FRONT (0)
143 |             this_depth_num_nodes -= 1
144 |             node_counter += 1
145 |             width = node_counter - this_depth_count - 1
146 | 
147 |             for v in self._visitors:    # Run visitor instances here
148 |                 result = v.visit(node, api_symbols)
149 |                 if result:
150 |                     if result not in extended_symbol_counter.keys():
151 |                         extended_symbol_counter[result] = 1;
152 |                     else:
153 |                         extended_symbol_counter[result] += 1;
154 | 
155 |                     #MemberExpression
156 |                     if 'MemberExpression' == node.type:
157 |                         tmp = node.property.name
158 | 
159 |                     # CallExpression
160 |                     if 'CallExpression' == node.type:
161 |                         tmp = node.callee.name
162 | 
163 |                     symbol_counter[tmp] += 1 # increment counter
164 |                     this_node = SymbolNode(depth, width, node.parent_depth, node.parent_width)
165 |                     node_dict[tmp].append(this_node)
166 |                     break
167 | 
168 | 
169 |             # If node is an instance of "esprima node", step through the node
170 |             #   Returns how many children have been added to the queue
171 |             if isinstance(node, esprima.nodes.Node):
172 | 
173 |                 # Feed the nodes that will be labeled as children the current
174 |                 #   depth and width
175 |                 next_depth_num_nodes += self._step(node, queue, depth, width)
176 | 
177 | 
178 |             # Once this tree depth has been walked, update with the existing
179 |             #   "next" set and reset the next set to 0. Increment depth by 1,
180 |             #   and keep a tally on how many nodes have been counted up until
181 |             #   this depth.
182 |             if this_depth_num_nodes == 0:
183 |                 this_depth_num_nodes = next_depth_num_nodes # update current list
184 |                 next_depth_num_nodes = 0                    # reset this list
185 |                 this_depth_count = node_counter             #
186 |                 depth += 1
187 | 
188 |         return symbol_counter, extended_symbol_counter, node_dict
189 | 
190 | 
191 | ################################################################################
192 | """
193 | Executes specified code given that an input node matches the property name of
194 |     this node.
195 | 
196 | Attributes:
197 |     _property_name: the name of the property required to execute the handler
198 |     _node_handler:  code to execute if _property_name matches
199 |     visit(node):    checks if input node's property matches this nodes; if yes,
200 |                         executes the code passed into _node_handler, passing the
201 |                         input node as an argument
202 | """
203 | class MatchPropertyVisitor:
204 | 
205 |     ## Constructor
206 |     def __init__(self, property_name):
207 |         self._property_name = property_name # userAgent, getContext, etc
208 | 
209 |     ##################################################
210 |     def _recursive_check_objects(self, node, api_symbols):
211 |         if node.object:
212 |             return self._recurrance_visit(node.object, api_symbols)
213 |         return False
214 | 
215 |     ## Visit the nodes, check if matches, and execute handler if it does
216 |     def _recurrance_visit(self, node, api_symbols):
217 | 
218 |         # No more objects to look through
219 |         if 'Identifier' == node.type:
220 |             if node.name in api_symbols:
221 |                 return node.name
222 | 
223 |         # MemberExpression; maybe more objects
224 |         elif 'MemberExpression' == node.type:
225 |             if node.property.name in api_symbols:
226 |                 #self._memb_expr_handler(node)
227 | 
228 |                 return_val = node.property.name
229 |                 tmp = self._recursive_check_objects(node, api_symbols)
230 | 
231 |                 if tmp:
232 |                     return_val = tmp + '.' + return_val
233 | 
234 |                 return return_val
235 | 
236 |         # CallExpression; maybe more objects
237 |         elif 'CallExpression' == node.type:
238 |             if node.callee.name in api_symbols:
239 |                 #self._call_expr_handler(node)
240 | 
241 |                 return_val = node.callee.name
242 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
243 | 
244 |                 if tmp:
245 |                     return_val = tmp + '.' + return_val
246 | 
247 |                 return return_val
248 |         return False
249 |     ##################################################
250 | 
251 |     def _filter_parent_API(self, arg):
252 |         if len(arg.split('.')[0]) == 1:
253 |             arg = arg.split('.')
254 |             arg.pop(0)
255 |             arg = '.'.join(arg)
256 |         return arg
257 | 
258 |     # FIRST VISIT: visit node, check if matches, and check for objects
259 |     def visit(self, node, api_symbols):
260 | 
261 |         #MemberExpression
262 |         if 'MemberExpression' == node.type:
263 |             if node.property.name == self._property_name:
264 | 
265 |                 return_val = node.property.name
266 |                 tmp = self._recursive_check_objects(node, api_symbols)
267 | 
268 |                 if tmp:
269 |                     return_val = tmp + '.' + return_val
270 |                     return_val = self._filter_parent_API(return_val)
271 | 
272 |                 return return_val
273 | 
274 |         # CallExpression
275 |         if 'CallExpression' == node.type:
276 |             if node.callee.name == self._property_name:
277 | 
278 |                 return_val = node.callee.name
279 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
280 | 
281 |                 if tmp:
282 |                     return_val = tmp + '.' + return_val
283 |                     return_val = self._filter_parent_API(return_val)
284 | 
285 |                 return return_val
286 |         return False
287 | 
288 | 
289 | ################################################################################
290 | def uniquifyList(seq, idfun=None):
291 |    # order preserving
292 |    if idfun is None:
293 |        def idfun(x): return x
294 |    seen = {}
295 |    result = []
296 |    for item in seq:
297 |        marker = idfun(item)
298 |        if marker in seen: continue
299 |        seen[marker] = 1
300 |        result.append(item)
301 |    return result
302 | 
303 | 
304 | ################################################################################
305 | def worker_process(input_file):
306 | 
307 |     filename = input_file.split('/')[-1]
308 | 
309 |     # Try getting the AST using esprima, bail if non-JS syntax
310 |     try:
311 |         with open(input_file) as f:
312 |             ast = esprima.parseScript(f.read())
313 | 
314 |     except esprima.error_handler.Error as e:
315 |         return False, filename
316 | 
317 |     # Create an element using that AST
318 |     el = Element(ast)
319 |     for entry in api_symbols:
320 |         visitor = MatchPropertyVisitor(entry)
321 |         el.accept(visitor)
322 | 
323 |     # Walk down the AST (breadth-first)
324 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols, filename)
325 | 
326 |     return True, extended_symbol_counter
327 | 
328 | ################################################################################
329 | if __name__ == '__main__':
330 | 
331 |     print("Initialized to use {} workers.".format(WORKERS))
332 | 
333 |     # Extract all symbols from the generated "Symbols of Interest" list,
334 |     #   and flatten all api symbols into a single list (for efficiency)
335 |     with open(SYM_LIST, encoding='utf-8') as f:
336 |         api_list = json.loads(f.read())
337 | 
338 |     print("Looking in \'{}\' for the API list...".format(SYM_LIST))
339 |     api_symbols = [val for sublist in api_list.values() for val in sublist];
340 |     api_symbols = uniquifyList(api_symbols)
341 |     print("Success.")
342 | 
343 |     # Get file list from data directory
344 |     print("Looking in \'{}\' for all .txt files...".format(JS_SOURCE_FILES))
345 | 
346 |     file_list = glob.glob(JS_SOURCE_FILES + "/*")
347 |     file_list_size = len(file_list)
348 | 
349 |     print("Success. Found {} files.".format(file_list_size))
350 |     print("Begin iterating over the files to get symbol info.")
351 |     print("-" * 80)
352 | 
353 |     pbar = tqdm(total=file_list_size)
354 | 
355 |     # Storage queue
356 |     symbol_counts = multiprocessing.Queue()
357 | 
358 |     # Fail list (don't need a queue here)
359 |     fails_list = multiprocessing.Queue()
360 | 
361 |     # Callback (add to queue)
362 |     def log_result(result):
363 |         pbar.update(1)
364 |         if result[0]:
365 |             symbol_counts.put(result[1])
366 |         else:
367 |             fails_list.put(result[1])
368 | 
369 |     # Process queue thread
370 |     def test():
371 |         counter = 0
372 |         buffer_list = []
373 | 
374 |         def dump_files(buffer_list):
375 |             from math import ceil
376 |             df = pd.DataFrame(buffer_list)
377 |             filename = OUTPUT_DIR + OUTPUT_FILE + "_" + str(ceil(counter/BATCH_SIZE)) + '.parquet'
378 |             df.to_parquet(filename)
379 |             return []
380 | 
381 |         while counter + fails_list.qsize() < file_list_size:
382 |             try:
383 |                 # Attempt to get data from the queue. Note that
384 |                 # symbol_counts.get() will block this thread's execution
385 |                 # until data is available
386 |                 data = symbol_counts.get()
387 |             except queue.Empty:
388 |                 pass
389 |             except multiprocessing.TimeoutError:
390 |                 pass
391 |             else:
392 |                 counter += 1;
393 |                 buffer_list.append(data)
394 | 
395 |                 if counter % BATCH_SIZE == 0:
396 |                     buffer_list = dump_files(buffer_list)
397 | 
398 |         buffer_list = dump_files(buffer_list)
399 | 
400 | 
401 |     # Setup thread pool
402 |     pool = multiprocessing.Pool(WORKERS)
403 | 
404 |     # Queue thread
405 |     re = pool.apply_async(test, args=())
406 | 
407 |     # Individual file threads
408 |     for filename in file_list:
409 |         re = pool.apply_async(worker_process, args=(filename,), callback=log_result)
410 | 
411 |     pool.close()
412 |     pool.join()
413 | 
414 |     print("Success.\n\nDONE SCRIPT!")
415 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/output_data/extended_symbol_counts.json:
--------------------------------------------------------------------------------
  1 | {
  2 |     "length": 83,
  3 |     "c": 7,
  4 |     "f": 15,
  5 |     "cookie": 11,
  6 |     "b": 37,
  7 |     "parse": 5,
  8 |     "d": 2,
  9 |     "r": 10,
 10 |     "replace": 13,
 11 |     "a": 2,
 12 |     "code": 1,
 13 |     "add": 48,
 14 |     "y": 5,
 15 |     "x": 9,
 16 |     "addEventListener": 28,
 17 |     "warn": 13,
 18 |     "toJSON": 5,
 19 |     "crypto": 1,
 20 |     "crypto.getRandomValues": 2,
 21 |     "toString": 11,
 22 |     "key": 2,
 23 |     "document.cookie": 2,
 24 |     "forEach": 8,
 25 |     "slice": 15,
 26 |     "nodeName": 10,
 27 |     "plugins": 7,
 28 |     "href": 7,
 29 |     "geolocation": 3,
 30 |     "value.toJSON": 2,
 31 |     "escape": 2,
 32 |     "parentNode": 2,
 33 |     "cookieEnabled": 2,
 34 |     "documentElement": 5,
 35 |     "body": 3,
 36 |     "mimeTypes": 3,
 37 |     "mimeTypes.length": 1,
 38 |     "colorDepth": 2,
 39 |     "className.match": 1,
 40 |     "title": 4,
 41 |     "target": 4,
 42 |     "userAgent": 2,
 43 |     "name": 5,
 44 |     "max": 3,
 45 |     "clientWidth": 1,
 46 |     "offsetWidth": 1,
 47 |     "scrollWidth": 1,
 48 |     "clientHeight": 1,
 49 |     "offsetHeight": 2,
 50 |     "scrollHeight": 2,
 51 |     "javaEnabled": 3,
 52 |     "height": 2,
 53 |     "className": 1,
 54 |     "filter": 4,
 55 |     "hostname": 1,
 56 |     "href.replace": 1,
 57 |     "type": 12,
 58 |     "document.links": 4,
 59 |     "domain": 2,
 60 |     "location.href": 2,
 61 |     "platform": 2,
 62 |     "doNotTrack": 2,
 63 |     "characterSet": 1,
 64 |     "charset": 1,
 65 |     "language": 1,
 66 |     "bufferSize": 1,
 67 |     "ay": 5,
 68 |     "w": 4,
 69 |     "value.length": 1,
 70 |     "value": 4,
 71 |     "document.getElementsByTagName": 2,
 72 |     "localStorage.setItem": 2,
 73 |     "localStorage.removeItem": 1,
 74 |     "plugins.length": 1,
 75 |     "width": 2,
 76 |     "window.location.href": 2,
 77 |     "window.top.document.referrer": 1,
 78 |     "window.parent": 2,
 79 |     "document.referrer": 1,
 80 |     "console.warn": 1,
 81 |     "localStorage.getItem": 2,
 82 |     "innerHTML": 2,
 83 |     "assign": 1,
 84 |     "compact": 1,
 85 |     "keys": 4,
 86 |     "extend": 1,
 87 |     "select": 1,
 88 |     "clone": 1,
 89 |     "id": 1,
 90 |     "s": 8,
 91 |     "window.event": 1,
 92 |     "which": 1,
 93 |     "button": 1,
 94 |     "srcElement": 1,
 95 |     "open": 1,
 96 |     "withCredentials": 1,
 97 |     "setRequestHeader": 1,
 98 |     "navigator.userAgent": 1,
 99 |     "location": 7,
100 |     "document.links.length": 1,
101 |     "geolocation.getCurrentPosition": 1,
102 |     "sessionStorage": 1,
103 |     "localStorage": 3,
104 |     "text": 2,
105 |     "window.location": 2,
106 |     "window.top.document": 1,
107 |     "tagName": 1,
108 |     "scrollLeft": 1,
109 |     "pageXOffset": 1,
110 |     "scrollTop": 1,
111 |     "pageYOffset": 1,
112 |     "timing": 3,
113 |     "setInterval": 2,
114 |     "getElementsByTagName": 2,
115 |     "checked": 2,
116 |     "forms": 1,
117 |     "z": 9,
118 |     "plugins.s.length": 1,
119 |     "sort": 1,
120 |     "window.top": 1,
121 |     "children": 5,
122 |     "window": 1,
123 |     "setTimeout": 2,
124 |     "onreadystatechange": 1,
125 |     "onload": 1,
126 |     "onerror": 1,
127 |     "src": 1,
128 |     "compatMode": 2,
129 |     "navigator.geolocation.getCurrentPosition": 1,
130 |     "text.replace": 2,
131 |     "plugins.s": 5,
132 |     "nodeType": 2,
133 |     "create": 1,
134 |     "bottom": 2,
135 |     "top": 7,
136 |     "performance": 1,
137 |     "navigator.geolocation": 1,
138 |     "top.location": 2,
139 |     "location.protocol": 2,
140 |     "transaction": 11,
141 |     "transaction.total": 1,
142 |     "transaction.city": 1,
143 |     "transaction.state": 1,
144 |     "transaction.country": 1,
145 |     "transaction.context": 1,
146 |     "items.length": 1,
147 |     "plugins.s.description": 1,
148 |     "loop": 6,
149 |     "top.replace": 1,
150 |     "send": 2,
151 |     "items": 3,
152 |     "suffixes": 1,
153 |     "plugins.s.name": 1,
154 |     "window.parent.document.referrer": 1,
155 |     "context": 1,
156 |     "enabledPlugin": 1,
157 |     "window.parent.document": 1,
158 |     "index": 2,
159 |     "removeEventListener": 2,
160 |     "coords": 1,
161 |     "source": 1,
162 |     "Q": 1,
163 |     "abort": 1,
164 |     "status": 3,
165 |     "clearTimeout": 2,
166 |     "readyState": 4,
167 |     "clearInterval": 1,
168 |     "latitude": 1,
169 |     "longitude": 1,
170 |     "accuracy": 1,
171 |     "altitude": 1,
172 |     "altitudeAccuracy": 1,
173 |     "heading": 1,
174 |     "speed": 1,
175 |     "timestamp": 1,
176 |     "document.body.children": 1,
177 |     "document.body": 1
178 | }


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/requirements.txt:
--------------------------------------------------------------------------------
1 | esprima==4.0.1
2 | numpy==1.15.4
3 | pandas==0.23.4
4 | pkg-resources==0.0.0
5 | pyarrow==0.11.1
6 | python-dateutil==2.7.5
7 | pytz==2018.7
8 | six==1.12.0
9 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/single_tree_explorer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | import esprima
  4 | import json
  5 | import json2parquet as j2p
  6 | import sys
  7 | 
  8 | class SymbolNode:
  9 |     def __init__(self, depth, width, parent_depth, parent_width):
 10 |         self._depth = depth
 11 |         self._width = width
 12 |         self._parent_depth = parent_depth
 13 |         self._parent_width = parent_width
 14 | 
 15 |     def setDepthWidth(depth, width):
 16 |         self._depth = depth
 17 |         self._width = width
 18 | 
 19 | 
 20 | class CustomEncoder(json.JSONEncoder):
 21 |     def default(self, obj):
 22 | 
 23 |         # Symbol node class
 24 |         if isinstance(obj, SymbolNode):
 25 |             return {
 26 |                 "depth": obj._depth,
 27 |                 "width": obj._width,
 28 |                 "parent_depth": obj._parent_depth,
 29 |                 "parent_width": obj._parent_width,
 30 |             }
 31 | 
 32 |         return json.JSONEncoder.default(self, obj)
 33 | 
 34 | 
 35 | class Element:
 36 | 
 37 |     ## Define keys you want to skip over
 38 |     BLACKLISTEDKEYS = ["parent"]
 39 | 
 40 |     ## Constructor
 41 |     def __init__(self, esprima_ast):
 42 |         self._ast = esprima_ast  # Assign member var AST
 43 |         self._visitors = []  # Init empty visitor array
 44 | 
 45 |     ## Add a new visitor to execute (will be executed at each node)
 46 |     def accept(self, visitor):
 47 |         self._visitors.append(visitor)
 48 | 
 49 |     ## (private) Step through the node's queue of potential nodes to visit
 50 |     def _step(self, node, queue, depth, width):
 51 |         before = len(queue)
 52 | 
 53 |         for key in node.keys():  # Enumerate keys for possible children
 54 |             if key in self.BLACKLISTEDKEYS:
 55 |                 continue  # Ignore node if it is blacklisted
 56 | 
 57 |             child = getattr(node, key)  # Assign child = node.key
 58 | 
 59 |             # if the child exists && the child has an attribute 'type'
 60 |             if child and hasattr(child, "type") == True:
 61 |                 child.parent = node  # Assign this node as child's parent
 62 |                 child.parent_depth = depth
 63 |                 child.parent_width = width
 64 |                 queue.append(child)  # Append the child in this node's queue
 65 | 
 66 |             # if there is a list of children
 67 |             if isinstance(child, list):
 68 |                 for item in child:  # Iterate through them and do the same
 69 |                     #   as above
 70 |                     if hasattr(item, "type") == True:
 71 |                         item.parent = node
 72 |                         item.parent_depth = depth
 73 |                         item.parent_width = width
 74 |                         queue.append(item)
 75 | 
 76 |         return len(queue) - before  # Return whether any children were pushed
 77 | 
 78 |     ## Walk through this AST
 79 |     def walk(self, api_symbols):
 80 |         queue = [self._ast]  # Add the imported AST to the queue
 81 | 
 82 |         # Initialize these entries
 83 |         for node in queue:
 84 |             node.parent_depth = 0
 85 |             node.parent_width = 0
 86 | 
 87 |         # Depth and width counting
 88 |         depth = 0  # what level of the tree we are in
 89 |         width = 0  # how far from first node on this level we are
 90 |         this_depth_num_nodes = 1  # how many nodes in this level are left
 91 |         next_depth_num_nodes = 0  # how many nodes in the next level
 92 |         node_counter = 0  # how many total nodes have been visited
 93 |         this_depth_count = 0  # how many nodes are on this level (tot)
 94 | 
 95 |         # storage for the data
 96 |         symbol_counter = {key: 0 for key in api_symbols}
 97 |         extended_symbol_counter = {}
 98 |         node_dict = {key: [] for key in api_symbols}
 99 | 
100 |         while len(queue) > 0:  # While stuff in the queue
101 |             node = queue.pop(0)  # Pop stuff off of the FRONT (0)
102 |             this_depth_num_nodes -= 1  # Reduce how many left
103 |             node_counter += 1  # Increment counter
104 |             width = node_counter - this_depth_count - 1
105 | 
106 |             for v in self._visitors:  # Run visitor instances here
107 |                 result = v.visit(node, api_symbols)
108 |                 if result:
109 |                     if result not in extended_symbol_counter.keys():
110 |                         extended_symbol_counter[result] = 1
111 |                     else:
112 |                         extended_symbol_counter[result] += 1
113 | 
114 |                     # MemberExpression
115 |                     if "MemberExpression" == node.type:
116 |                         tmp = node.property.name
117 | 
118 |                     # CallExpression
119 |                     if "CallExpression" == node.type:
120 |                         tmp = node.callee.name
121 | 
122 |                     symbol_counter[tmp] += 1  # increment counter
123 |                     this_node = SymbolNode(
124 |                         depth, width, node.parent_depth, node.parent_width
125 |                     )
126 |                     node_dict[tmp].append(this_node)
127 |                     break
128 | 
129 |             # If node is an instance of "esprima node", step through the node
130 |             #   Returns how many children have been added to the queue
131 |             if isinstance(node, esprima.nodes.Node):
132 | 
133 |                 # Feed the nodes that will be labeled as children the current
134 |                 #   depth and width
135 |                 next_depth_num_nodes += self._step(node, queue, depth, width)
136 | 
137 |             # Once this tree depth has been walked, update with the existing
138 |             #   "next" set and reset the next set to 0. Increment depth by 1,
139 |             #   and keep a tally on how many nodes have been counted up until
140 |             #   this depth.
141 |             if this_depth_num_nodes == 0:
142 |                 this_depth_num_nodes = next_depth_num_nodes  # update current list
143 |                 next_depth_num_nodes = 0  # reset this list
144 |                 this_depth_count = node_counter  #
145 |                 depth += 1
146 | 
147 |         print("Done. Total stats: Tree depth: {}\tTotal nodes:{}".format(depth, node_counter))
148 | 
149 |         return symbol_counter, extended_symbol_counter, node_dict
150 | 
151 | 
152 | """
153 | Executes specified code given that an input node matches the property name of
154 |     this node.
155 | 
156 | Attributes:
157 |     _property_name: the name of the property required to execute the handler
158 |     _node_handler:  code to execute if _property_name matches
159 |     visit(node):    checks if input node's property matches this nodes; if yes,
160 |                         executes the code passed into _node_handler, passing the
161 |                         input node as an argument
162 | """
163 | class MatchPropertyVisitor:
164 | 
165 |     ## Constructor
166 |     def __init__(self, property_name):
167 |         self._property_name = property_name  # userAgent, getContext, etc
168 | 
169 |     ##################################################
170 | 
171 |     def _recursive_check_objects(self, node, api_symbols):
172 |         if node.object:
173 |             return self._recurrance_visit(node.object, api_symbols)
174 |         return False
175 | 
176 |     ## Visit the nodes, check if matches, and execute handler if it does
177 |     def _recurrance_visit(self, node, api_symbols):
178 | 
179 |         # No more objects to look through
180 |         if "Identifier" == node.type:
181 |             if node.name in api_symbols:
182 |                 return node.name
183 | 
184 |         # MemberExpression; maybe more objects
185 |         elif "MemberExpression" == node.type:
186 |             if node.property.name in api_symbols:
187 | 
188 |                 return_val = node.property.name
189 |                 tmp = self._recursive_check_objects(node, api_symbols)
190 | 
191 |                 if tmp:
192 |                     return_val = tmp + "." + return_val
193 | 
194 |                 return return_val
195 | 
196 |         # CallExpression; maybe more objects
197 |         elif "CallExpression" == node.type:
198 |             if node.callee.name in api_symbols:
199 | 
200 |                 return_val = node.callee.name
201 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
202 | 
203 |                 if tmp:
204 |                     return_val = tmp + "." + return_val
205 | 
206 |                 return return_val
207 |         return False
208 | 
209 |     ##################################################
210 | 
211 |     def _filter_parent_API(self, arg):
212 |         if len(arg.split(".")[0]) == 1:
213 |             arg = arg.split(".")
214 |             arg.pop(0)
215 |             arg = ".".join(arg)
216 |         return arg
217 | 
218 |     # FIRST VISIT: visit node, check if matches, and check for objects
219 |     def visit(self, node, api_symbols):
220 | 
221 |         # MemberExpression
222 |         if "MemberExpression" == node.type:
223 |             if node.property.name == self._property_name:
224 | 
225 |                 return_val = node.property.name
226 |                 tmp = self._recursive_check_objects(node, api_symbols)
227 | 
228 |                 if tmp:
229 |                     return_val = tmp + "." + return_val
230 |                     return_val = self._filter_parent_API(return_val)
231 | 
232 |                 return return_val
233 | 
234 |         # CallExpression
235 |         if "CallExpression" == node.type:
236 |             if node.callee.name == self._property_name:
237 | 
238 |                 return_val = node.callee.name
239 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
240 | 
241 |                 if tmp:
242 |                     return_val = tmp + "." + return_val
243 |                     return_val = self._filter_parent_API(return_val)
244 | 
245 |                 return return_val
246 |         return False
247 | 
248 | 
249 | # Extract JSON and JavaScript AST data from precompiled list
250 | def importData():
251 | 
252 |     if len(sys.argv) == 3:
253 |         with open(sys.argv[1], encoding="utf-8") as data_file:
254 |             api_list = json.loads(data_file.read())
255 |         with open(sys.argv[2]) as js_file:
256 |             ast = esprima.parseScript(js_file.read())
257 | 
258 |         return ast, api_list
259 | 
260 |     else:
261 |         print(
262 |             """Warning: invalid input type! This script does not use config.ini.
263 | Syntax:
264 | 
265 |         $ python3.6 this_script.py <path/to/api_list.json> \
266 | <path/to/javascript.js>
267 | """
268 |         )
269 |         exit()
270 |     return
271 | 
272 | 
273 | ################################################################################
274 | def uniquifyList(seq, idfun=None):
275 |     # order preserving
276 |     if idfun is None:
277 | 
278 |         def idfun(x):
279 |             return x
280 | 
281 |     seen = {}
282 |     result = []
283 |     for item in seq:
284 |         marker = idfun(item)
285 |         if marker in seen:
286 |             continue
287 |         seen[marker] = 1
288 |         result.append(item)
289 |     return result
290 | 
291 | 
292 | ################################################################################
293 | def main():
294 | 
295 |     # Try getting the AST using esprima, bail if non-JS syntax
296 |     try:
297 |         print("Trying to parse AST...")
298 |         ast, api_list = importData()
299 | 
300 |     except esprima.error_handler.Error as e:
301 |         print("Failure: non-javascript syntax detected. Terminating...")
302 |         sys.exit()
303 | 
304 |     print("Success. Walking through the AST...")
305 | 
306 |     # Extract all symbols from the generated "Symbols of Interest" list,
307 |     #   and flatten all api symbols into a single list (for efficiency)
308 |     api_symbols = [val for sublist in api_list.values() for val in sublist]
309 |     api_symbols = uniquifyList(api_symbols)
310 | 
311 |     # Create an element using that AST
312 |     el = Element(ast)
313 |     for entry in api_symbols:
314 |         visitor = MatchPropertyVisitor(entry)
315 |         el.accept(visitor)
316 | 
317 |     # Walk down the AST (breadth-first)
318 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols)
319 | 
320 |     # Dump outputs
321 |     with open("output_data/symbol_counts.json", "w") as out1:
322 |         json.dump(symbol_counter, out1, indent=4)
323 | 
324 |     with open("output_data/extended_symbol_counts.json", "w") as out2:
325 |         json.dump(extended_symbol_counter, out2, indent=4)
326 | 
327 |     #with open("output_data/symbol_node_info.json", "w") as out3:
328 |     #    json.dump(node_dict, out3, indent=4, cls=CustomEncoder)
329 | 
330 | 
331 | if __name__ == "__main__":
332 |     main()
333 | 


--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/README.md:
--------------------------------------------------------------------------------
1 | A 4 stage pipeline for mass downloading javascript files, generating a list of potential symbols of interest, and parsing these symbols out of the downloaded scripts. Each stage of the pipeline has a more exhaustive readme - please check those for more details. 
2 | 


--------------------------------------------------------------------------------
/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/Audio Fingerprinting Heuristics.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {
  7 |     "scrolled": false
  8 |    },
  9 |    "outputs": [
 10 |     {
 11 |      "data": {
 12 |       "text/html": [
 13 |        "<table style=\"border: 2px solid white;\">\n",
 14 |        "<tr>\n",
 15 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 16 |        "<h3>Client</h3>\n",
 17 |        "<ul>\n",
 18 |        "  <li><b>Scheduler: </b>tcp://127.0.0.1:56640\n",
 19 |        "  <li><b>Dashboard: </b><a href='http://127.0.0.1:8787/status' target='_blank'>http://127.0.0.1:8787/status</a>\n",
 20 |        "</ul>\n",
 21 |        "</td>\n",
 22 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 23 |        "<h3>Cluster</h3>\n",
 24 |        "<ul>\n",
 25 |        "  <li><b>Workers: </b>4</li>\n",
 26 |        "  <li><b>Cores: </b>4</li>\n",
 27 |        "  <li><b>Memory: </b>12.00 GB</li>\n",
 28 |        "</ul>\n",
 29 |        "</td>\n",
 30 |        "</tr>\n",
 31 |        "</table>"
 32 |       ],
 33 |       "text/plain": [
 34 |        "<Client: scheduler='tcp://127.0.0.1:56640' processes=4 cores=4>"
 35 |       ]
 36 |      },
 37 |      "execution_count": 1,
 38 |      "metadata": {},
 39 |      "output_type": "execute_result"
 40 |     }
 41 |    ],
 42 |    "source": [
 43 |     "import dask.dataframe as dd\n",
 44 |     "import json\n",
 45 |     "\n",
 46 |     "from dask.distributed import Client, progress\n",
 47 |     "\n",
 48 |     "DATA_DIR = 'YOUR DATA DIRECTORY HERE'\n",
 49 |     "DATA_DIR_FULL = DATA_DIR + \"PATH TO PARQUET FILES\"\n",
 50 |     "Client()"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "markdown",
 55 |    "metadata": {},
 56 |    "source": [
 57 |     "## Setup"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "code",
 62 |    "execution_count": 6,
 63 |    "metadata": {},
 64 |    "outputs": [],
 65 |    "source": [
 66 |     "df = dd.read_parquet(DATA_DIR_FULL, columns=['script_url', 'symbol'])"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "## Build Candidate URLs for `OfflineAudioContext.createOscillator`"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": 7,
 79 |    "metadata": {},
 80 |    "outputs": [
 81 |     {
 82 |      "name": "stdout",
 83 |      "output_type": "stream",
 84 |      "text": [
 85 |       "[########################################] | 100% Completed | 51.8s\r"
 86 |      ]
 87 |     }
 88 |    ],
 89 |    "source": [
 90 |     "create_oscillator_df = df[df.symbol == 'OfflineAudioContext.createOscillator']\n",
 91 |     "create_oscillator_urls = create_oscillator_df.script_url.unique().persist()\n",
 92 |     "progress(create_oscillator_urls, notebook=False)"
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": 8,
 98 |    "metadata": {},
 99 |    "outputs": [
100 |     {
101 |      "data": {
102 |       "text/plain": [
103 |        "0      https://www.alaskaair.com/px/client/main.min.js\n",
104 |        "1    https://client.perimeterx.net/PXQ76Auu14/main....\n",
105 |        "2    https://client.perimeterx.net/PXM636Svr4/main....\n",
106 |        "3    http://client.perimeterx.net/PX0F3091f3/main.m...\n",
107 |        "4             https://media1.admicro.vn/core/fipmin.js\n",
108 |        "Name: script_url, dtype: object"
109 |       ]
110 |      },
111 |      "execution_count": 8,
112 |      "metadata": {},
113 |      "output_type": "execute_result"
114 |     }
115 |    ],
116 |    "source": [
117 |     "create_oscillator_urls = create_oscillator_urls.compute()\n",
118 |     "create_oscillator_urls[0:5]"
119 |    ]
120 |   },
121 |   {
122 |    "cell_type": "markdown",
123 |    "metadata": {},
124 |    "source": [
125 |     "## Build Candidate URLs for `OfflineAudioContext.createDynamicsCompressor`"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": 9,
131 |    "metadata": {},
132 |    "outputs": [
133 |     {
134 |      "name": "stdout",
135 |      "output_type": "stream",
136 |      "text": [
137 |       "[########################################] | 100% Completed | 47.3s\r"
138 |      ]
139 |     }
140 |    ],
141 |    "source": [
142 |     "create_dynamics_df = df[df.symbol == 'OfflineAudioContext.createDynamicsCompressor']\n",
143 |     "create_dynamics_urls = create_dynamics_df.script_url.unique().persist()\n",
144 |     "progress(create_dynamics_urls, notebook=False)"
145 |    ]
146 |   },
147 |   {
148 |    "cell_type": "code",
149 |    "execution_count": 10,
150 |    "metadata": {},
151 |    "outputs": [
152 |     {
153 |      "data": {
154 |       "text/plain": [
155 |        "0      https://www.alaskaair.com/px/client/main.min.js\n",
156 |        "1    https://client.perimeterx.net/PXQ76Auu14/main....\n",
157 |        "2    https://client.perimeterx.net/PXM636Svr4/main....\n",
158 |        "3    http://client.perimeterx.net/PX0F3091f3/main.m...\n",
159 |        "4             https://media1.admicro.vn/core/fipmin.js\n",
160 |        "Name: script_url, dtype: object"
161 |       ]
162 |      },
163 |      "execution_count": 10,
164 |      "metadata": {},
165 |      "output_type": "execute_result"
166 |     }
167 |    ],
168 |    "source": [
169 |     "create_dynamics_urls = create_dynamics_urls.compute()\n",
170 |     "create_dynamics_urls[0:5]"
171 |    ]
172 |   },
173 |   {
174 |    "cell_type": "markdown",
175 |    "metadata": {},
176 |    "source": [
177 |     "## Build Candidate URLs for `OfflineAudioContext.destination`"
178 |    ]
179 |   },
180 |   {
181 |    "cell_type": "code",
182 |    "execution_count": 11,
183 |    "metadata": {},
184 |    "outputs": [
185 |     {
186 |      "name": "stdout",
187 |      "output_type": "stream",
188 |      "text": [
189 |       "[########################################] | 100% Completed | 39.6s\r"
190 |      ]
191 |     }
192 |    ],
193 |    "source": [
194 |     "destination_df = df[df.symbol == 'OfflineAudioContext.destination']\n",
195 |     "destination_urls = destination_df.script_url.unique().persist()\n",
196 |     "progress(destination_urls, notebook=False)"
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "code",
201 |    "execution_count": 12,
202 |    "metadata": {},
203 |    "outputs": [
204 |     {
205 |      "data": {
206 |       "text/plain": [
207 |        "0      https://www.alaskaair.com/px/client/main.min.js\n",
208 |        "1    https://client.perimeterx.net/PXQ76Auu14/main....\n",
209 |        "2    https://client.perimeterx.net/PXM636Svr4/main....\n",
210 |        "3    http://client.perimeterx.net/PX0F3091f3/main.m...\n",
211 |        "4             https://media1.admicro.vn/core/fipmin.js\n",
212 |        "Name: script_url, dtype: object"
213 |       ]
214 |      },
215 |      "execution_count": 12,
216 |      "metadata": {},
217 |      "output_type": "execute_result"
218 |     }
219 |    ],
220 |    "source": [
221 |     "destination_urls = destination_urls.compute()\n",
222 |     "destination_urls[0:5]"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "markdown",
227 |    "metadata": {},
228 |    "source": [
229 |     "## Build Candidate URLs for `OfflineAudioContext.startRendering`"
230 |    ]
231 |   },
232 |   {
233 |    "cell_type": "code",
234 |    "execution_count": 13,
235 |    "metadata": {},
236 |    "outputs": [
237 |     {
238 |      "name": "stdout",
239 |      "output_type": "stream",
240 |      "text": [
241 |       "[########################################] | 100% Completed | 40.3s\r"
242 |      ]
243 |     }
244 |    ],
245 |    "source": [
246 |     "start_rendering_df = df[df.symbol == 'OfflineAudioContext.startRendering']\n",
247 |     "start_rendering_urls = start_rendering_df.script_url.unique().persist()\n",
248 |     "progress(start_rendering_urls, notebook=False)"
249 |    ]
250 |   },
251 |   {
252 |    "cell_type": "code",
253 |    "execution_count": 14,
254 |    "metadata": {},
255 |    "outputs": [
256 |     {
257 |      "data": {
258 |       "text/plain": [
259 |        "0      https://www.alaskaair.com/px/client/main.min.js\n",
260 |        "1    https://client.perimeterx.net/PXQ76Auu14/main....\n",
261 |        "2    https://client.perimeterx.net/PXM636Svr4/main....\n",
262 |        "3    http://client.perimeterx.net/PX0F3091f3/main.m...\n",
263 |        "4             https://media1.admicro.vn/core/fipmin.js\n",
264 |        "Name: script_url, dtype: object"
265 |       ]
266 |      },
267 |      "execution_count": 14,
268 |      "metadata": {},
269 |      "output_type": "execute_result"
270 |     }
271 |    ],
272 |    "source": [
273 |     "start_rendering_urls = start_rendering_urls.compute()\n",
274 |     "start_rendering_urls[0:5]"
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "markdown",
279 |    "metadata": {},
280 |    "source": [
281 |     "## Build Candidate URLs for `OfflineAudioContext.oncomplete`"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": 15,
287 |    "metadata": {},
288 |    "outputs": [
289 |     {
290 |      "name": "stdout",
291 |      "output_type": "stream",
292 |      "text": [
293 |       "[########################################] | 100% Completed | 44.8s\r"
294 |      ]
295 |     }
296 |    ],
297 |    "source": [
298 |     "on_complete_df = df[df.symbol == 'OfflineAudioContext.oncomplete']\n",
299 |     "on_complete_urls = on_complete_df.script_url.unique().persist()\n",
300 |     "progress(on_complete_urls, notebook=False)"
301 |    ]
302 |   },
303 |   {
304 |    "cell_type": "code",
305 |    "execution_count": 16,
306 |    "metadata": {},
307 |    "outputs": [
308 |     {
309 |      "data": {
310 |       "text/plain": [
311 |        "0      https://www.alaskaair.com/px/client/main.min.js\n",
312 |        "1    https://client.perimeterx.net/PXQ76Auu14/main....\n",
313 |        "2    https://client.perimeterx.net/PXM636Svr4/main....\n",
314 |        "3    http://client.perimeterx.net/PX0F3091f3/main.m...\n",
315 |        "4             https://media1.admicro.vn/core/fipmin.js\n",
316 |        "Name: script_url, dtype: object"
317 |       ]
318 |      },
319 |      "execution_count": 16,
320 |      "metadata": {},
321 |      "output_type": "execute_result"
322 |     }
323 |    ],
324 |    "source": [
325 |     "on_complete_urls = on_complete_urls.compute()\n",
326 |     "on_complete_urls[0:5]"
327 |    ]
328 |   },
329 |   {
330 |    "cell_type": "markdown",
331 |    "metadata": {},
332 |    "source": [
333 |     "## Scripts must call all 5 functions: [\"OfflineAudioContext.createOscillator\", \"OfflineAudioContext.createDynamicsCompressor\", \"OfflineAudioContext.destination\", \"OfflineAudioContext.startRendering\", \"OfflineAudioContext.oncomplete\"]"
334 |    ]
335 |   },
336 |   {
337 |    "cell_type": "code",
338 |    "execution_count": 17,
339 |    "metadata": {},
340 |    "outputs": [
341 |     {
342 |      "name": "stdout",
343 |      "output_type": "stream",
344 |      "text": [
345 |       "# of script_urls using audio fingerprinting: 170\n"
346 |      ]
347 |     }
348 |    ],
349 |    "source": [
350 |     "audio_fp_urls = set(create_oscillator_urls) & \\\n",
351 |     "    set(create_dynamics_urls) & \\\n",
352 |     "    set(destination_urls) & \\\n",
353 |     "    set(start_rendering_urls) & \\\n",
354 |     "    set(on_complete_urls)\n",
355 |     "print('# of script_urls using audio fingerprinting:', len(audio_fp_urls))"
356 |    ]
357 |   },
358 |   {
359 |    "cell_type": "code",
360 |    "execution_count": 18,
361 |    "metadata": {},
362 |    "outputs": [
363 |     {
364 |      "name": "stdout",
365 |      "output_type": "stream",
366 |      "text": [
367 |       "# of script_urls that did not call all 5 symbols: 0\n"
368 |      ]
369 |     }
370 |    ],
371 |    "source": [
372 |     "all_candidate_urls = set(create_oscillator_urls) | \\\n",
373 |     "    set(create_dynamics_urls) | \\\n",
374 |     "    set(destination_urls) | \\\n",
375 |     "    set(start_rendering_urls) | \\\n",
376 |     "    set(on_complete_urls)\n",
377 |     "not_audio_fp_urls = all_candidate_urls - audio_fp_urls\n",
378 |     "print('# of script_urls that did not call all 5 symbols:', len(not_audio_fp_urls))"
379 |    ]
380 |   },
381 |   {
382 |    "cell_type": "markdown",
383 |    "metadata": {},
384 |    "source": [
385 |     "## Save URLs"
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": 19,
391 |    "metadata": {},
392 |    "outputs": [],
393 |    "source": [
394 |     "with open('audio_fingerprinting.json', 'w') as f:\n",
395 |     "    f.write(json.dumps(list(audio_fp_urls)))"
396 |    ]
397 |   },
398 |   {
399 |    "cell_type": "code",
400 |    "execution_count": 20,
401 |    "metadata": {},
402 |    "outputs": [],
403 |    "source": [
404 |     "with open('not_audio_fingerprinting.json', 'w') as f:\n",
405 |     "    f.write(json.dumps(list(not_audio_fp_urls)))"
406 |    ]
407 |   },
408 |   {
409 |    "cell_type": "markdown",
410 |    "metadata": {},
411 |    "source": [
412 |     "## Find Locations"
413 |    ]
414 |   },
415 |   {
416 |    "cell_type": "code",
417 |    "execution_count": 2,
418 |    "metadata": {},
419 |    "outputs": [],
420 |    "source": [
421 |     "with open('audio_fingerprinting.json', 'r') as f:\n",
422 |     "    audio_fp_urls = json.load(f)"
423 |    ]
424 |   },
425 |   {
426 |    "cell_type": "code",
427 |    "execution_count": 4,
428 |    "metadata": {},
429 |    "outputs": [],
430 |    "source": [
431 |     "df = dd.read_parquet(DATA_DIR_FULL, columns=['script_url', 'location'])"
432 |    ]
433 |   },
434 |   {
435 |    "cell_type": "code",
436 |    "execution_count": 6,
437 |    "metadata": {},
438 |    "outputs": [
439 |     {
440 |      "name": "stdout",
441 |      "output_type": "stream",
442 |      "text": [
443 |       "[########################################] | 100% Completed | 44.7s\r"
444 |      ]
445 |     }
446 |    ],
447 |    "source": [
448 |     "df_locs = df[df.script_url.isin(audio_fp_urls)]\n",
449 |     "locs = df_locs.location.unique().persist()\n",
450 |     "progress(locs, notebook=False)"
451 |    ]
452 |   },
453 |   {
454 |    "cell_type": "code",
455 |    "execution_count": 8,
456 |    "metadata": {},
457 |    "outputs": [
458 |     {
459 |      "name": "stdout",
460 |      "output_type": "stream",
461 |      "text": [
462 |       "# of locations that call audio fingerprinting scripts: 2006\n"
463 |      ]
464 |     }
465 |    ],
466 |    "source": [
467 |     "print('# of locations that call audio fingerprinting scripts:', len(locs))"
468 |    ]
469 |   },
470 |   {
471 |    "cell_type": "code",
472 |    "execution_count": null,
473 |    "metadata": {},
474 |    "outputs": [],
475 |    "source": []
476 |   }
477 |  ],
478 |  "metadata": {
479 |   "kernelspec": {
480 |    "display_name": "overscripted",
481 |    "language": "python",
482 |    "name": "overscripted"
483 |   },
484 |   "language_info": {
485 |    "codemirror_mode": {
486 |     "name": "ipython",
487 |     "version": 3
488 |    },
489 |    "file_extension": ".py",
490 |    "mimetype": "text/x-python",
491 |    "name": "python",
492 |    "nbconvert_exporter": "python",
493 |    "pygments_lexer": "ipython3",
494 |    "version": "3.5.6"
495 |   }
496 |  },
497 |  "nbformat": 4,
498 |  "nbformat_minor": 2
499 | }
500 | 


--------------------------------------------------------------------------------
/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/Canvas Fingerprinting Heuristics.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 2,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "import dask.dataframe as dd\n",
 10 |     "import os\n",
 11 |     "import re\n",
 12 |     "import json\n",
 13 |     "\n",
 14 |     "from dask.distributed import Client, progress\n",
 15 |     "from pandas.api.types import CategoricalDtype\n",
 16 |     "\n",
 17 |     "DATA_DIR = 'YOUR DATA DIRECTORY HERE'\n",
 18 |     "DATA_DIR_FULL = DATA_DIR + \"PATH TO PARQUET FILES\""
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 3,
 24 |    "metadata": {
 25 |     "scrolled": false
 26 |    },
 27 |    "outputs": [
 28 |     {
 29 |      "data": {
 30 |       "text/html": [
 31 |        "<table style=\"border: 2px solid white;\">\n",
 32 |        "<tr>\n",
 33 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 34 |        "<h3>Client</h3>\n",
 35 |        "<ul>\n",
 36 |        "  <li><b>Scheduler: </b>tcp://127.0.0.1:64926\n",
 37 |        "  <li><b>Dashboard: </b><a href='http://127.0.0.1:8787/status' target='_blank'>http://127.0.0.1:8787/status</a>\n",
 38 |        "</ul>\n",
 39 |        "</td>\n",
 40 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 41 |        "<h3>Cluster</h3>\n",
 42 |        "<ul>\n",
 43 |        "  <li><b>Workers: </b>4</li>\n",
 44 |        "  <li><b>Cores: </b>4</li>\n",
 45 |        "  <li><b>Memory: </b>32.00 GB</li>\n",
 46 |        "</ul>\n",
 47 |        "</td>\n",
 48 |        "</tr>\n",
 49 |        "</table>"
 50 |       ],
 51 |       "text/plain": [
 52 |        "<Client: scheduler='tcp://127.0.0.1:64926' processes=4 cores=4>"
 53 |       ]
 54 |      },
 55 |      "execution_count": 3,
 56 |      "metadata": {},
 57 |      "output_type": "execute_result"
 58 |     }
 59 |    ],
 60 |    "source": [
 61 |     "Client()"
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {},
 67 |    "source": [
 68 |     "# Build candidates"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "code",
 73 |    "execution_count": 3,
 74 |    "metadata": {},
 75 |    "outputs": [
 76 |     {
 77 |      "data": {
 78 |       "text/plain": [
 79 |        "0                  http://www.qvc.com/akam/10/2b30e194\n",
 80 |        "1                      http://www.qvc.com/_bm/async.js\n",
 81 |        "2                http://www.coupang.com/akam/10/4f2b47\n",
 82 |        "3     https://www.coches.net/ztkieflaaxcvaiwh121837.js\n",
 83 |        "4    https://a1.alicdn.com/creation/html/2016/06/20...\n",
 84 |        "Name: script_url, dtype: object"
 85 |       ]
 86 |      },
 87 |      "execution_count": 3,
 88 |      "metadata": {},
 89 |      "output_type": "execute_result"
 90 |     }
 91 |    ],
 92 |    "source": [
 93 |     "df_to_data_urls_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol'])\n",
 94 |     "df_to_data_urls_df = df_to_data_urls_df[df_to_data_urls_df.symbol == 'HTMLCanvasElement.toDataURL']\n",
 95 |     "to_data_urls = df_to_data_urls_df.script_url.unique().compute()\n",
 96 |     "to_data_urls[0:5]"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "code",
101 |    "execution_count": 4,
102 |    "metadata": {},
103 |    "outputs": [
104 |     {
105 |      "data": {
106 |       "text/plain": [
107 |        "0    http://p6.drtst.com/templates/drtuber/js/drtub...\n",
108 |        "1      https://www.jigsawplanet.com/js/jp.js?v=b177a4b\n",
109 |        "2    http://p5.vptpsn.com/templates/frontend/viptub...\n",
110 |        "3    https://code.createjs.com/createjs-2015.11.26....\n",
111 |        "4           http://cdn.promodj.com/core/core.js?1ce4f0\n",
112 |        "Name: script_url, dtype: object"
113 |       ]
114 |      },
115 |      "execution_count": 4,
116 |      "metadata": {},
117 |      "output_type": "execute_result"
118 |     }
119 |    ],
120 |    "source": [
121 |     "def large_enough(row):\n",
122 |     "    width = float(row.argument_2)\n",
123 |     "    height = float(row.argument_3)\n",
124 |     "    return width >= 16 and height >= 16\n",
125 |     "\n",
126 |     "df_get_image_data_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol', 'argument_2', 'argument_3'])\n",
127 |     "df_get_image_data_df = df_get_image_data_df[df_get_image_data_df.symbol == 'CanvasRenderingContext2D.getImageData']\n",
128 |     "df_get_image_data_df = df_get_image_data_df[df_get_image_data_df.apply(large_enough, axis=1, meta=('bool'))]\n",
129 |     "get_image_data_urls = df_get_image_data_df.script_url.unique().compute()\n",
130 |     "get_image_data_urls[0:5]"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 5,
136 |    "metadata": {},
137 |    "outputs": [
138 |     {
139 |      "name": "stdout",
140 |      "output_type": "stream",
141 |      "text": [
142 |       "n to_data_urls 26481\n",
143 |       "n get_image_data_urls 559\n",
144 |       "n candidate urls 27009\n"
145 |      ]
146 |     }
147 |    ],
148 |    "source": [
149 |     "print('n to_data_urls', len(to_data_urls))\n",
150 |     "print('n get_image_data_urls', len(get_image_data_urls))\n",
151 |     "candidate_urls = to_data_urls.append(get_image_data_urls).unique()\n",
152 |     "print('n candidate urls', len(candidate_urls))"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": 6,
158 |    "metadata": {},
159 |    "outputs": [],
160 |    "source": [
161 |     "all_candidate_urls = candidate_urls.copy()"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "markdown",
166 |    "metadata": {},
167 |    "source": [
168 |     "# Start removing\n",
169 |     "\n",
170 |     "## 1. Remove manually filtered"
171 |    ]
172 |   },
173 |   {
174 |    "cell_type": "code",
175 |    "execution_count": 7,
176 |    "metadata": {},
177 |    "outputs": [],
178 |    "source": [
179 |     "false_positive_script_urls = {\n",
180 |     "    'http://www.fivola.com/',\n",
181 |     "    'http://cdn02.centraledachats.be/dist/js/holder.js',\n",
182 |     "    'http://ccmedia.fr/accueil.php',\n",
183 |     "    'http://rozup.ir/up/moisrex/themes/space_theme/script.js'\n",
184 |     "}"
185 |    ]
186 |   },
187 |   {
188 |    "cell_type": "code",
189 |    "execution_count": 8,
190 |    "metadata": {},
191 |    "outputs": [
192 |     {
193 |      "name": "stdout",
194 |      "output_type": "stream",
195 |      "text": [
196 |       "n candidate urls 27009\n"
197 |      ]
198 |     }
199 |    ],
200 |    "source": [
201 |     "candidate_urls = [url for url in candidate_urls if url not in false_positive_script_urls]\n",
202 |     "print('n candidate urls', len(candidate_urls))"
203 |    ]
204 |   },
205 |   {
206 |    "cell_type": "code",
207 |    "execution_count": 9,
208 |    "metadata": {},
209 |    "outputs": [
210 |     {
211 |      "name": "stdout",
212 |      "output_type": "stream",
213 |      "text": [
214 |       "0\n"
215 |      ]
216 |     }
217 |    ],
218 |    "source": [
219 |     "print(len(set(all_candidate_urls) - set(candidate_urls)))\n",
220 |     "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n",
221 |     "with open('not_canvas_fingerprinting_1.json', 'w') as f:\n",
222 |     "    f.write(json.dumps(disgarded_urls)) "
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "markdown",
227 |    "metadata": {},
228 |    "source": [
229 |     "## 2. Remove save, restore, addEventListener"
230 |    ]
231 |   },
232 |   {
233 |    "cell_type": "code",
234 |    "execution_count": 10,
235 |    "metadata": {
236 |     "scrolled": true
237 |    },
238 |    "outputs": [
239 |     {
240 |      "data": {
241 |       "text/plain": [
242 |        "array(['https://tpc.googlesyndication.com/sadbundle/$csp%3Der3%26dns%3Doff$/4134920871885725337/createjs-2015.11.26.min.js',\n",
243 |        "       'http://pics3.city-data.com/js/maps/CANVAS/boxMap.js',\n",
244 |        "       'https://code.createjs.com/createjs-2015.11.26.min.js',\n",
245 |        "       'http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/cufon-yui.js',\n",
246 |        "       'https://sale.yhd.com/act/J3oKuL4Izcsvpn.html'], dtype=object)"
247 |       ]
248 |      },
249 |      "execution_count": 10,
250 |      "metadata": {},
251 |      "output_type": "execute_result"
252 |     }
253 |    ],
254 |    "source": [
255 |     "df_valid_calls_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol'])\n",
256 |     "df_valid_calls_df = df_valid_calls_df[df_valid_calls_df.symbol.isin(\n",
257 |     "    ['CanvasRenderingContext2D.save', 'CanvasRenderingContext2D.restore', 'HTMLCanvasElement.addEventListener']\n",
258 |     ")]\n",
259 |     "valid_calls_urls = df_valid_calls_df.script_url.unique().values.compute()\n",
260 |     "valid_calls_urls[0:5]"
261 |    ]
262 |   },
263 |   {
264 |    "cell_type": "code",
265 |    "execution_count": 11,
266 |    "metadata": {},
267 |    "outputs": [
268 |     {
269 |      "name": "stdout",
270 |      "output_type": "stream",
271 |      "text": [
272 |       "n candidate urls 26877\n"
273 |      ]
274 |     }
275 |    ],
276 |    "source": [
277 |     "candidate_urls = [url for url in candidate_urls if url not in valid_calls_urls]\n",
278 |     "print('n candidate urls', len(candidate_urls))"
279 |    ]
280 |   },
281 |   {
282 |    "cell_type": "code",
283 |    "execution_count": 12,
284 |    "metadata": {},
285 |    "outputs": [
286 |     {
287 |      "data": {
288 |       "text/plain": [
289 |        "132"
290 |       ]
291 |      },
292 |      "execution_count": 12,
293 |      "metadata": {},
294 |      "output_type": "execute_result"
295 |     }
296 |    ],
297 |    "source": [
298 |     "len(set(all_candidate_urls) - set(candidate_urls))"
299 |    ]
300 |   },
301 |   {
302 |    "cell_type": "code",
303 |    "execution_count": 13,
304 |    "metadata": {},
305 |    "outputs": [
306 |     {
307 |      "name": "stdout",
308 |      "output_type": "stream",
309 |      "text": [
310 |       "132\n"
311 |      ]
312 |     }
313 |    ],
314 |    "source": [
315 |     "print(len(set(all_candidate_urls) - set(candidate_urls)))\n",
316 |     "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n",
317 |     "with open('not_canvas_fingerprinting_2.json', 'w') as f:\n",
318 |     "    f.write(json.dumps(disgarded_urls)) "
319 |    ]
320 |   },
321 |   {
322 |    "cell_type": "markdown",
323 |    "metadata": {},
324 |    "source": [
325 |     "## 3. Must have written 10 or more characters"
326 |    ]
327 |   },
328 |   {
329 |    "cell_type": "code",
330 |    "execution_count": 14,
331 |    "metadata": {},
332 |    "outputs": [],
333 |    "source": [
334 |     "## Code sourced from: github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py\n",
335 |     "\n",
336 |     "def text_length(arg_0):\n",
337 |     "    return len(arg_0.encode('ascii', 'ignore'))"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "code",
342 |    "execution_count": 15,
343 |    "metadata": {
344 |     "scrolled": true
345 |    },
346 |    "outputs": [
347 |     {
348 |      "data": {
349 |       "text/html": [
350 |        "<div>\n",
351 |        "<style scoped>\n",
352 |        "    .dataframe tbody tr th:only-of-type {\n",
353 |        "        vertical-align: middle;\n",
354 |        "    }\n",
355 |        "\n",
356 |        "    .dataframe tbody tr th {\n",
357 |        "        vertical-align: top;\n",
358 |        "    }\n",
359 |        "\n",
360 |        "    .dataframe thead th {\n",
361 |        "        text-align: right;\n",
362 |        "    }\n",
363 |        "</style>\n",
364 |        "<table border=\"1\" class=\"dataframe\">\n",
365 |        "  <thead>\n",
366 |        "    <tr style=\"text-align: right;\">\n",
367 |        "      <th></th>\n",
368 |        "      <th>script_url</th>\n",
369 |        "      <th>symbol</th>\n",
370 |        "      <th>argument_0</th>\n",
371 |        "      <th>len_arg</th>\n",
372 |        "    </tr>\n",
373 |        "  </thead>\n",
374 |        "  <tbody>\n",
375 |        "    <tr>\n",
376 |        "      <th>944</th>\n",
377 |        "      <td>http://www.qvc.com/akam/10/2b30e194</td>\n",
378 |        "      <td>CanvasRenderingContext2D.fillText</td>\n",
379 |        "      <td>Soft Ruddy Foothold 2</td>\n",
380 |        "      <td>21</td>\n",
381 |        "    </tr>\n",
382 |        "    <tr>\n",
383 |        "      <th>951</th>\n",
384 |        "      <td>http://www.qvc.com/akam/10/2b30e194</td>\n",
385 |        "      <td>CanvasRenderingContext2D.fillText</td>\n",
386 |        "      <td>!H71JCaj)]# 1@#</td>\n",
387 |        "      <td>15</td>\n",
388 |        "    </tr>\n",
389 |        "    <tr>\n",
390 |        "      <th>1007</th>\n",
391 |        "      <td>http://www.qvc.com/_bm/async.js</td>\n",
392 |        "      <td>CanvasRenderingContext2D.fillText</td>\n",
393 |        "      <td>&lt;@nv45. F1n63r,Pr1n71n6!</td>\n",
394 |        "      <td>24</td>\n",
395 |        "    </tr>\n",
396 |        "    <tr>\n",
397 |        "      <th>2824</th>\n",
398 |        "      <td>http://www.coupang.com/akam/10/4f2b47</td>\n",
399 |        "      <td>CanvasRenderingContext2D.fillText</td>\n",
400 |        "      <td>Soft Ruddy Foothold 2</td>\n",
401 |        "      <td>21</td>\n",
402 |        "    </tr>\n",
403 |        "    <tr>\n",
404 |        "      <th>2831</th>\n",
405 |        "      <td>http://www.coupang.com/akam/10/4f2b47</td>\n",
406 |        "      <td>CanvasRenderingContext2D.fillText</td>\n",
407 |        "      <td>!H71JCaj)]# 1@#</td>\n",
408 |        "      <td>15</td>\n",
409 |        "    </tr>\n",
410 |        "  </tbody>\n",
411 |        "</table>\n",
412 |        "</div>"
413 |       ],
414 |       "text/plain": [
415 |        "                                 script_url  \\\n",
416 |        "944     http://www.qvc.com/akam/10/2b30e194   \n",
417 |        "951     http://www.qvc.com/akam/10/2b30e194   \n",
418 |        "1007        http://www.qvc.com/_bm/async.js   \n",
419 |        "2824  http://www.coupang.com/akam/10/4f2b47   \n",
420 |        "2831  http://www.coupang.com/akam/10/4f2b47   \n",
421 |        "\n",
422 |        "                                 symbol                argument_0  len_arg  \n",
423 |        "944   CanvasRenderingContext2D.fillText     Soft Ruddy Foothold 2       21  \n",
424 |        "951   CanvasRenderingContext2D.fillText           !H71JCaj)]# 1@#       15  \n",
425 |        "1007  CanvasRenderingContext2D.fillText  <@nv45. F1n63r,Pr1n71n6!       24  \n",
426 |        "2824  CanvasRenderingContext2D.fillText     Soft Ruddy Foothold 2       21  \n",
427 |        "2831  CanvasRenderingContext2D.fillText           !H71JCaj)]# 1@#       15  "
428 |       ]
429 |      },
430 |      "execution_count": 15,
431 |      "metadata": {},
432 |      "output_type": "execute_result"
433 |     }
434 |    ],
435 |    "source": [
436 |     "df_write = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol', 'argument_0'])\n",
437 |     "df_write = df_write[df_write.script_url.isin(candidate_urls)]\n",
438 |     "df_write = df_write[df_write.symbol.isin(['CanvasRenderingContext2D.fillText', 'CanvasRenderingContext2D.strokeText'])]\n",
439 |     "df_write['len_arg'] = df_write.argument_0.apply(text_length, meta=('int'))\n",
440 |     "df_write = df_write[df_write.len_arg >= 10]\n",
441 |     "df_write = df_write.compute()\n",
442 |     "df_write.head()"
443 |    ]
444 |   },
445 |   {
446 |    "cell_type": "code",
447 |    "execution_count": 16,
448 |    "metadata": {
449 |     "scrolled": true
450 |    },
451 |    "outputs": [
452 |     {
453 |      "name": "stdout",
454 |      "output_type": "stream",
455 |      "text": [
456 |       "n \"3 too long writes\" urls 8514\n"
457 |      ]
458 |     }
459 |    ],
460 |    "source": [
461 |     "too_many_write_urls = df_write.script_url.unique()\n",
462 |     "print('n \"3 too long writes\" urls', len(too_many_write_urls))"
463 |    ]
464 |   },
465 |   {
466 |    "cell_type": "markdown",
467 |    "metadata": {},
468 |    "source": [
469 |     "## Apply 3"
470 |    ]
471 |   },
472 |   {
473 |    "cell_type": "code",
474 |    "execution_count": 17,
475 |    "metadata": {},
476 |    "outputs": [],
477 |    "source": [
478 |     "text_filter = set(too_many_write_urls)\n",
479 |     "candidate_urls = list(text_filter)"
480 |    ]
481 |   },
482 |   {
483 |    "cell_type": "code",
484 |    "execution_count": 18,
485 |    "metadata": {
486 |     "scrolled": true
487 |    },
488 |    "outputs": [
489 |     {
490 |      "name": "stdout",
491 |      "output_type": "stream",
492 |      "text": [
493 |       "18495\n"
494 |      ]
495 |     }
496 |    ],
497 |    "source": [
498 |     "print(len(set(all_candidate_urls) - set(candidate_urls)))\n",
499 |     "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n",
500 |     "with open('not_canvas_fingerprinting_3.json', 'w') as f:\n",
501 |     "    f.write(json.dumps(disgarded_urls)) "
502 |    ]
503 |   },
504 |   {
505 |    "cell_type": "code",
506 |    "execution_count": 19,
507 |    "metadata": {},
508 |    "outputs": [],
509 |    "source": [
510 |     "with open('canvas_fingerprinting.json', 'w') as f:\n",
511 |     "    f.write(json.dumps(candidate_urls))"
512 |    ]
513 |   },
514 |   {
515 |    "cell_type": "code",
516 |    "execution_count": 20,
517 |    "metadata": {},
518 |    "outputs": [],
519 |    "source": [
520 |     "with open('not_canvas_fingerprinting.json', 'w') as f:\n",
521 |     "    f.write(json.dumps(disgarded_urls))"
522 |    ]
523 |   },
524 |   {
525 |    "cell_type": "markdown",
526 |    "metadata": {},
527 |    "source": [
528 |     "## Find Locations"
529 |    ]
530 |   },
531 |   {
532 |    "cell_type": "code",
533 |    "execution_count": 32,
534 |    "metadata": {},
535 |    "outputs": [
536 |     {
537 |      "name": "stdout",
538 |      "output_type": "stream",
539 |      "text": [
540 |       "8514 == 8514\n"
541 |      ]
542 |     }
543 |    ],
544 |    "source": [
545 |     "with open('canvas_fingerprinting.json', 'r') as f:\n",
546 |     "    canvas_fp_urls = json.load(f)\n",
547 |     "    \n",
548 |     "print(len(canvas_fp_urls), '== 8514')"
549 |    ]
550 |   },
551 |   {
552 |    "cell_type": "code",
553 |    "execution_count": 7,
554 |    "metadata": {},
555 |    "outputs": [],
556 |    "source": [
557 |     "df = dd.read_parquet(DATA_FILE, columns=['script_url', 'location'])"
558 |    ]
559 |   },
560 |   {
561 |    "cell_type": "code",
562 |    "execution_count": 34,
563 |    "metadata": {},
564 |    "outputs": [
565 |     {
566 |      "name": "stdout",
567 |      "output_type": "stream",
568 |      "text": [
569 |       "[########################################] | 100% Completed |  3min  0.8s\r"
570 |      ]
571 |     }
572 |    ],
573 |    "source": [
574 |     "df_locs = df[df.script_url.isin(canvas_fp_urls)]\n",
575 |     "locs = df_locs.location.unique().persist()\n",
576 |     "progress(locs, notebook=False)"
577 |    ]
578 |   },
579 |   {
580 |    "cell_type": "code",
581 |    "execution_count": 35,
582 |    "metadata": {},
583 |    "outputs": [
584 |     {
585 |      "name": "stdout",
586 |      "output_type": "stream",
587 |      "text": [
588 |       "# of locations that call canvas fingerprinting scripts: 38419\n"
589 |      ]
590 |     }
591 |    ],
592 |    "source": [
593 |     "print('# of locations that call canvas fingerprinting scripts:', len(locs))"
594 |    ]
595 |   },
596 |   {
597 |    "cell_type": "code",
598 |    "execution_count": null,
599 |    "metadata": {},
600 |    "outputs": [],
601 |    "source": []
602 |   }
603 |  ],
604 |  "metadata": {
605 |   "kernelspec": {
606 |    "display_name": "overscripted",
607 |    "language": "python",
608 |    "name": "overscripted"
609 |   },
610 |   "language_info": {
611 |    "codemirror_mode": {
612 |     "name": "ipython",
613 |     "version": 3
614 |    },
615 |    "file_extension": ".py",
616 |    "mimetype": "text/x-python",
617 |    "name": "python",
618 |    "nbconvert_exporter": "python",
619 |    "pygments_lexer": "ipython3",
620 |    "version": "3.5.6"
621 |   }
622 |  },
623 |  "nbformat": 4,
624 |  "nbformat_minor": 2
625 | }
626 | 


--------------------------------------------------------------------------------
/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/README.md:
--------------------------------------------------------------------------------
 1 | ## Overview
 2 | 
 3 | This is an implementation of the heuristics defined in _The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors_ 
 4 | by Anupam Das, Gunes Acar, Nikita Borisov and Amogh Pradeep. The heuristics are then run against the OverScripted dataset
 5 | to determine the prevalence of fingerprinting scripts and locations that call those scripts.
 6 | 
 7 | ## Original Code
 8 | 
 9 | The open-source implementation of _The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors_ can be found on 
10 | [GitHub](https://github.com/sensor-js/OpenWPM-mobile). All heuristics are implemented in 
11 | [extract_features.py](https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py).
12 | 
13 | ## Overall Stats
14 | 
15 | - Audio Fingerprinting: 170 scripts present on 2006 locations
16 | - Canvas Fingerprinting: 8,514 scripts present on 38,419 locations
17 | - CanvasFont Fingerprinting: 1,387 scripts present on 2,293 locations
18 | - WebRTC Fingerprinting: 1,313 scripts present on 15,360 locations
19 | 


--------------------------------------------------------------------------------
/analyses/README.md:
--------------------------------------------------------------------------------
 1 | # Analyses Folder
 2 | 
 3 | 
 4 | ## 1. Download Data
 5 | 
 6 | See main [README.md](https://github.com/mozilla/overscripted/blob/master/README.md) and unzip.
 7 | 
 8 | 
 9 | ## 2. Install  
10 | Install [Anaconda](https://www.anaconda.com/download) or [Miniconda](https://conda.io/miniconda.html).
11 | 
12 | > Optionally
13 | 
14 | Install [Spark](http://spark.apache.org/)
15 | 
16 | 
17 | ## 3. Setup and activate environment
18 | 
19 | ```
20 |  $ conda env create -f environment.yaml    
21 | ```
22 | ```
23 |  $ conda activate overscripted    
24 | ```    
25 | ## 4. Run Jupyter
26 | 
27 | ```
28 |  $ jupyter notebook   
29 | ```
30 | 
31 | 


--------------------------------------------------------------------------------
/analyses/environment.yaml:
--------------------------------------------------------------------------------
 1 | name: overscripted
 2 | channels:
 3 |     - defaults
 4 |     - conda-forge
 5 | dependencies:
 6 |     - dask=1.1.4
 7 |     - distributed=1.26.0
 8 |     - findspark=1.3.0
 9 |     - jupyter=1.0.0
10 |     - python=3.6
11 |     - pyarrow=0.12.1
12 |     - pandas=0.24.2
13 |     - tldextract=2.2.0
14 | 


--------------------------------------------------------------------------------
/data_prep/Sample Review.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [
  8 |     {
  9 |      "name": "stderr",
 10 |      "output_type": "stream",
 11 |      "text": [
 12 |       "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/config.py:168: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n",
 13 |       "  data = yaml.load(f.read()) or {}\n",
 14 |       "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/distributed/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n",
 15 |       "  defaults = yaml.load(f)\n"
 16 |      ]
 17 |     },
 18 |     {
 19 |      "data": {
 20 |       "text/html": [
 21 |        "<table style=\"border: 2px solid white;\">\n",
 22 |        "<tr>\n",
 23 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 24 |        "<h3>Client</h3>\n",
 25 |        "<ul>\n",
 26 |        "  <li><b>Scheduler: </b>tcp://127.0.0.1:44039\n",
 27 |        "  <li><b>Dashboard: </b><a href='http://127.0.0.1:8787/status' target='_blank'>http://127.0.0.1:8787/status</a>\n",
 28 |        "</ul>\n",
 29 |        "</td>\n",
 30 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 31 |        "<h3>Cluster</h3>\n",
 32 |        "<ul>\n",
 33 |        "  <li><b>Workers: </b>4</li>\n",
 34 |        "  <li><b>Cores: </b>12</li>\n",
 35 |        "  <li><b>Memory: </b>33.35 GB</li>\n",
 36 |        "</ul>\n",
 37 |        "</td>\n",
 38 |        "</tr>\n",
 39 |        "</table>"
 40 |       ],
 41 |       "text/plain": [
 42 |        "<Client: scheduler='tcp://127.0.0.1:44039' processes=4 cores=12>"
 43 |       ]
 44 |      },
 45 |      "execution_count": 1,
 46 |      "metadata": {},
 47 |      "output_type": "execute_result"
 48 |     }
 49 |    ],
 50 |    "source": [
 51 |     "import dask.dataframe as dd\n",
 52 |     "from dask.distributed import Client\n",
 53 |     "\n",
 54 |     "client = Client()\n",
 55 |     "client"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "code",
 60 |    "execution_count": 9,
 61 |    "metadata": {},
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "import os\n",
 65 |     "\n",
 66 |     "# Set to your local data directory with a string or an environment variable\n",
 67 |     "# DATA_DIR = '/path/to/your/data'\n",
 68 |     "DATA_DIR = os.environ.get('DATA_DIR')"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "markdown",
 73 |    "metadata": {},
 74 |    "source": [
 75 |     "This notebook briefly describes each of the sample datasets that have been uploaded.\n",
 76 |     "\n",
 77 |     "When using a sample dataset, take time to consider if it's representative  / suitable for the problem you're trying to solve and discuss your thoughts at the start of your analysis.\n",
 78 |     "\n",
 79 |     "Samples:\n",
 80 |     "\n",
 81 |     "1. value_1000_only\n",
 82 |     "1. sample_10percent\n",
 83 |     "1. sample_10percent_value_1000_only"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "metadata": {},
 89 |    "source": [
 90 |     "**1. value_1000_only**\n",
 91 |     "\n",
 92 |     "The `value` field can get very large. This dataset contains all the rows of the dataset, but truncates the `value` field to only keep the first 1000 characters in a column called `value_1000`. The `value_len` column contains the length of the original `value` field.\n",
 93 |     "\n",
 94 |     "This shrinks the full dataset to 15GB on disk when uncompressed (from 70GB)."
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": 28,
100 |    "metadata": {},
101 |    "outputs": [
102 |     {
103 |      "data": {
104 |       "text/plain": [
105 |        "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n",
106 |        "       'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n",
107 |        "       'arguments_n_keys', 'call_id', 'call_stack', 'file_name', 'func_name',\n",
108 |        "       'in_crawl_list', 'in_iframe', 'in_stripped_crawl_list', 'location',\n",
109 |        "       'locations_len', 'operation', 'script_url', 'symbol', 'time_stamp',\n",
110 |        "       'value_1000', 'value_len'],\n",
111 |        "      dtype='object')"
112 |       ]
113 |      },
114 |      "execution_count": 28,
115 |      "metadata": {},
116 |      "output_type": "execute_result"
117 |     }
118 |    ],
119 |    "source": [
120 |     "df = dd.read_parquet(DATA_DIR + 'clean_value_1000.parquet', engine='pyarrow')\n",
121 |     "df.columns"
122 |    ]
123 |   },
124 |   {
125 |    "cell_type": "code",
126 |    "execution_count": 7,
127 |    "metadata": {},
128 |    "outputs": [
129 |     {
130 |      "name": "stdout",
131 |      "output_type": "stream",
132 |      "text": [
133 |       "Number of rows: 113,790,686.\n"
134 |      ]
135 |     }
136 |    ],
137 |    "source": [
138 |     "print(f'Number of rows: {len(df):,}.')"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "markdown",
143 |    "metadata": {},
144 |    "source": [
145 |     "**2. sample_10percent**\n",
146 |     "\n",
147 |     "This is a 10% sample of the complete dataset. It is 7.4GB when uncompressed on disk.\n",
148 |     "\n",
149 |     "The sampling procedure was to take 10% of the unique values in the `location` field, and then take all calls for those locations."
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "code",
154 |    "execution_count": 10,
155 |    "metadata": {},
156 |    "outputs": [
157 |     {
158 |      "data": {
159 |       "text/plain": [
160 |        "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n",
161 |        "       'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n",
162 |        "       'arguments_n_keys', 'call_stack', 'crawl_id', 'file_name', 'func_name',\n",
163 |        "       'in_iframe', 'location', 'operation', 'script_col', 'script_line',\n",
164 |        "       'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'value',\n",
165 |        "       'value_1000', 'value_len'],\n",
166 |        "      dtype='object')"
167 |       ]
168 |      },
169 |      "execution_count": 10,
170 |      "metadata": {},
171 |      "output_type": "execute_result"
172 |     }
173 |    ],
174 |    "source": [
175 |     "df = dd.read_parquet(DATA_DIR + 'sample_10percent.parquet', engine='pyarrow')\n",
176 |     "df.columns"
177 |    ]
178 |   },
179 |   {
180 |    "cell_type": "code",
181 |    "execution_count": 20,
182 |    "metadata": {},
183 |    "outputs": [
184 |     {
185 |      "name": "stdout",
186 |      "output_type": "stream",
187 |      "text": [
188 |       "Number of rows: 11,292,867.\n"
189 |      ]
190 |     }
191 |    ],
192 |    "source": [
193 |     "print(f'Number of rows: {len(df):,}.')"
194 |    ]
195 |   },
196 |   {
197 |    "cell_type": "markdown",
198 |    "metadata": {},
199 |    "source": [
200 |     "**3. sample_10percent_value_1000_only**\n",
201 |     "\n",
202 |     "This is a combination of the ideas from the previous two samples. It is the *same* 10% from \"2\" with the `value` column removed, leaving just the `value_1000` column.\n",
203 |     "\n",
204 |     "This dataset is just 1.3GB uncompressed on disk."
205 |    ]
206 |   },
207 |   {
208 |    "cell_type": "code",
209 |    "execution_count": 22,
210 |    "metadata": {},
211 |    "outputs": [
212 |     {
213 |      "data": {
214 |       "text/plain": [
215 |        "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n",
216 |        "       'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n",
217 |        "       'arguments_n_keys', 'call_stack', 'crawl_id', 'file_name', 'func_name',\n",
218 |        "       'in_iframe', 'location', 'operation', 'script_col', 'script_line',\n",
219 |        "       'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'value_1000',\n",
220 |        "       'value_len'],\n",
221 |        "      dtype='object')"
222 |       ]
223 |      },
224 |      "execution_count": 22,
225 |      "metadata": {},
226 |      "output_type": "execute_result"
227 |     }
228 |    ],
229 |    "source": [
230 |     "df = dd.read_parquet(DATA_DIR + 'sample_10percent_value_1000_only.parquet', engine='pyarrow')\n",
231 |     "df.columns"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "code",
236 |    "execution_count": 23,
237 |    "metadata": {},
238 |    "outputs": [
239 |     {
240 |      "name": "stdout",
241 |      "output_type": "stream",
242 |      "text": [
243 |       "Number of rows: 11,292,867.\n"
244 |      ]
245 |     }
246 |    ],
247 |    "source": [
248 |     "print(f'Number of rows: {len(df):,}.')"
249 |    ]
250 |   },
251 |   {
252 |    "cell_type": "code",
253 |    "execution_count": null,
254 |    "metadata": {},
255 |    "outputs": [],
256 |    "source": []
257 |   }
258 |  ],
259 |  "metadata": {
260 |   "kernelspec": {
261 |    "display_name": "Python 3",
262 |    "language": "python",
263 |    "name": "python3"
264 |   },
265 |   "language_info": {
266 |    "codemirror_mode": {
267 |     "name": "ipython",
268 |     "version": 3
269 |    },
270 |    "file_extension": ".py",
271 |    "mimetype": "text/x-python",
272 |    "name": "python",
273 |    "nbconvert_exporter": "python",
274 |    "pygments_lexer": "ipython3",
275 |    "version": "3.6.7"
276 |   }
277 |  },
278 |  "nbformat": 4,
279 |  "nbformat_minor": 2
280 | }
281 | 


--------------------------------------------------------------------------------
/data_prep/raw_data_schema.template:
--------------------------------------------------------------------------------
 1 | {
 2 |     "$schema": "http://json-schema.org/draft-04/schema#",
 3 |     "title": "UCOSP Crawl - Call Schema",
 4 |     "description": "The schema for a row of the raw data in the crawl catalog. (The final dataset as additional derived columns)",
 5 |     "type": "object",
 6 |     "properties": {
 7 |         "arguments": {
 8 |             "description": "Any arguments passed to the javascript call. When present takes the form of an object with numeric string keys e.g. '0', '1', up to a max of '9'. Validator does not check for this yet as couldn't find a satisfactory regex",
 9 |             "type": "string",
10 |         },
11 |         "call_stack": {
12 |             "description": "69% of calls have no call_stack. Where there is a call_stack, it appears you can: split on '\n' and get the same values that are in func_name, script_url, script_col, script_line - func_name@script_url:script_line:script_col",
13 |             "type": "string",
14 |             "pattern": "^$|^(?!undefined).*$|"
15 |         },
16 |         "crawl_id": {
17 |             "description": "The ID for this crawl",
18 |             "const": 1
19 |         },
20 |         "func_name": {
21 |             "description": "Empty string or the name of the function that was executed. Note more liberal than current validation.",
22 |             "type": "string",
23 |         },
24 |         "in_iframe": {
25 |             "description": "Was JS being exectuted in an iframe.",
26 |             "type": "boolean"
27 |         },
28 |         "location": {
29 |             "description": "The location of the loaded page where from which javascript calls are being captured.",
30 |             "type": "string",
31 |             "format": "uri",
32 |             "pattern": "^(https?|http?):\/\/\S*",
33 |         },
34 |         "operation": {
35 |             "description": "The type of operation.",
36 |             "type": "string",
37 |             "enum": ["get", "set", "call", "set (failed)"]
38 |         },
39 |         "script_col": {
40 |             "description": "The column location in the script where the call is captured. We want this to be an integer, but we must test for numeric string.",
41 |             "type": "string",
42 |             "pattern": "^$|^[0-9]+$"
43 |         },
44 |         "script_line": {
45 |             "description": "The line location in the script where the call is captured. We want this to be an integer, but we must test for numeric string.",
46 |             "type": "string",
47 |             "pattern": "^[0-9]+$"
48 |         },
49 |         "script_loc_eval": {
50 |             "description": "Empty string or .... What is this?",
51 |             "type": "string",
52 |             "pattern": "^$|^(line [0-9]* > (eval|Function)[ ]?)*$"
53 |         },
54 |         "script_url": {
55 |             "description": "The location of the script url that is being executed. Liberally letting things through to see what's there.",
56 |             "type": "string",
57 |             "minLength": 1,
58 |         },
59 |         "symbol": {
60 |             "description": "The js Symbol. Has 282 possible values in this dataset (this is derived there are more possible symbols in JS)",
61 |             "type": "string",
62 |             "enum": {{ list_of_symbols }}
63 |         },
64 |         "time_stamp": {
65 |             "description": "Time at which call was captured. Valid timestamps 2017-12-16T10:12:58Z, 2017-12-16T10:12:58.000Z, 2017-12-16T10:12:58+0000 ",
66 |             "type": "string",
67 |             "format": "date-time",
68 |             "pattern": "^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.?\d{0,3}Z$|^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{4}$"
69 |         },
70 |         "value": {
71 |             "description": "The value that was passed to the javascript call. Can be a few or over 1million characters.",
72 |             "type": "string"
73 |         }
74 |         
75 |     },
76 |     "required": [
77 |         "call_stack",
78 |         "crawl_id",
79 |         "func_name",
80 |         "in_iframe",
81 |         "location",
82 |         "operation",
83 |         "script_col",
84 |         "script_line",
85 |         "script_loc_eval",
86 |         "script_url",
87 |         "symbol",
88 |         "time_stamp",
89 |         "value",
90 |     ]
91 | }


--------------------------------------------------------------------------------
/data_prep/symbol_counts.csv:
--------------------------------------------------------------------------------
  1 | window.document.cookie,35455680
  2 | window.navigator.userAgent,15534371
  3 | window.Storage.getItem,10553944
  4 | window.localStorage,8767285
  5 | window.Storage.setItem,4175556
  6 | window.sessionStorage,4033894
  7 | window.Storage.removeItem,2932713
  8 | window.name,2484730
  9 | CanvasRenderingContext2D.fillStyle,1957519
 10 | window.navigator.plugins[Shockwave Flash].description,1863285
 11 | window.screen.colorDepth,1449905
 12 | window.navigator.appName,1286084
 13 | window.navigator.language,1172256
 14 | window.navigator.platform,1140738
 15 | CanvasRenderingContext2D.save,1000762
 16 | CanvasRenderingContext2D.restore,997755
 17 | CanvasRenderingContext2D.fill,954340
 18 | CanvasRenderingContext2D.fillRect,936267
 19 | window.navigator.plugins[Shockwave Flash].name,895289
 20 | CanvasRenderingContext2D.font,814310
 21 | CanvasRenderingContext2D.lineWidth,718195
 22 | window.navigator.appVersion,707298
 23 | window.navigator.cookieEnabled,692524
 24 | HTMLCanvasElement.width,681003
 25 | CanvasRenderingContext2D.strokeStyle,650211
 26 | HTMLCanvasElement.height,644476
 27 | HTMLCanvasElement.getContext,596749
 28 | window.Storage.key,551691
 29 | CanvasRenderingContext2D.fillText,542896
 30 | window.Storage.length,541077
 31 | CanvasRenderingContext2D.stroke,537024
 32 | CanvasRenderingContext2D.measureText,522209
 33 | window.navigator.vendor,487833
 34 | window.navigator.doNotTrack,468365
 35 | CanvasRenderingContext2D.arc,413449
 36 | HTMLCanvasElement.style,294223
 37 | CanvasRenderingContext2D.textBaseline,293489
 38 | window.navigator.product,279653
 39 | CanvasRenderingContext2D.textAlign,246380
 40 | window.navigator.plugins[Shockwave Flash].filename,225751
 41 | window.navigator.mimeTypes[application/x-shockwave-flash].type,213769
 42 | window.navigator.languages,199435
 43 | window.navigator.plugins[Shockwave Flash].length,184995
 44 | CanvasRenderingContext2D.bezierCurveTo,176757
 45 | CanvasRenderingContext2D.shadowBlur,172808
 46 | CanvasRenderingContext2D.shadowOffsetY,161446
 47 | CanvasRenderingContext2D.shadowOffsetX,159263
 48 | CanvasRenderingContext2D.shadowColor,158579
 49 | window.screen.pixelDepth,156326
 50 | CanvasRenderingContext2D.rect,154007
 51 | HTMLCanvasElement.nodeType,153630
 52 | CanvasRenderingContext2D.lineJoin,151407
 53 | window.navigator.mimeTypes[application/futuresplash].type,150364
 54 | CanvasRenderingContext2D.lineCap,149511
 55 | window.navigator.plugins[Shockwave Flash].version,144656
 56 | CanvasRenderingContext2D.strokeRect,142838
 57 | HTMLCanvasElement.toDataURL,135041
 58 | CanvasRenderingContext2D.createRadialGradient,132444
 59 | CanvasRenderingContext2D.globalCompositeOperation,122162
 60 | window.navigator.onLine,116037
 61 | CanvasRenderingContext2D.scale,115227
 62 | window.Storage.hasOwnProperty,108138
 63 | CanvasRenderingContext2D.clip,106238
 64 | CanvasRenderingContext2D.miterLimit,102589
 65 | window.navigator.mimeTypes[application/x-shockwave-flash].suffixes,94030
 66 | window.navigator.mimeTypes[application/futuresplash].suffixes,94025
 67 | RTCPeerConnection.localDescription,88683
 68 | window.navigator.productSub,71139
 69 | window.navigator.mimeTypes[application/x-shockwave-flash].description,70284
 70 | window.navigator.mimeTypes[application/futuresplash].description,70278
 71 | HTMLCanvasElement.nodeName,67621
 72 | CanvasRenderingContext2D.rotate,63824
 73 | HTMLCanvasElement.parentNode,57192
 74 | window.navigator.oscpu,54799
 75 | window.navigator.appCodeName,51161
 76 | CanvasRenderingContext2D.createLinearGradient,46710
 77 | CanvasRenderingContext2D.putImageData,45469
 78 | window.navigator.geolocation,43022
 79 | CanvasRenderingContext2D.getImageData,41412
 80 | HTMLCanvasElement.ownerDocument,37831
 81 | HTMLCanvasElement.className,36778
 82 | RTCPeerConnection.onicecandidate,32522
 83 | HTMLCanvasElement.getAttribute,31800
 84 | window.navigator.vendorSub,26840
 85 | HTMLCanvasElement.addEventListener,23485
 86 | window.navigator.buildID,23419
 87 | HTMLCanvasElement.classList,22963
 88 | HTMLCanvasElement.setAttribute,20689
 89 | HTMLCanvasElement.clientHeight,20665
 90 | HTMLCanvasElement.clientWidth,20341
 91 | HTMLCanvasElement.getElementsByTagName,16224
 92 | HTMLCanvasElement.tagName,14475
 93 | RTCPeerConnection.iceGatheringState,13984
 94 | RTCPeerConnection.createDataChannel,13776
 95 | RTCPeerConnection.signalingState,13160
 96 | RTCPeerConnection.remoteDescription,13113
 97 | RTCPeerConnection.createOffer,13015
 98 | CanvasRenderingContext2D.setLineDash,12590
 99 | HTMLCanvasElement.onselectstart,12054
100 | RTCPeerConnection.setLocalDescription,11844
101 | CanvasRenderingContext2D.arcTo,11428
102 | CanvasRenderingContext2D.isPointInPath,11342
103 | CanvasRenderingContext2D.createImageData,11163
104 | HTMLCanvasElement.id,10941
105 | CanvasRenderingContext2D.imageSmoothingEnabled,9722
106 | HTMLCanvasElement.draggable,9558
107 | HTMLCanvasElement.constructor,9246
108 | CanvasRenderingContext2D.createPattern,8713
109 | CanvasRenderingContext2D.lineDashOffset,7726
110 | HTMLCanvasElement.offsetWidth,7346
111 | CanvasRenderingContext2D.mozImageSmoothingEnabled,6561
112 | RTCPeerConnection.idpLoginUrl,6556
113 | RTCPeerConnection.peerIdentity,6556
114 | RTCPeerConnection.onremovestream,6556
115 | HTMLCanvasElement.offsetHeight,6175
116 | CanvasRenderingContext2D.strokeText,5147
117 | HTMLCanvasElement.firstChild,4897
118 | HTMLCanvasElement.hasAttribute,4604
119 | HTMLCanvasElement.localName,4577
120 | HTMLCanvasElement.attributes,4507
121 | HTMLCanvasElement.nextSibling,3857
122 | AudioContext.destination,3758
123 | HTMLCanvasElement.firstElementChild,3586
124 | HTMLCanvasElement.nextElementSibling,3560
125 | window.Storage.clear,3348
126 | HTMLCanvasElement.dir,3171
127 | CanvasRenderingContext2D.mozCurrentTransform,3102
128 | OscillatorNode.frequency,3056
129 | AudioContext.createOscillator,2898
130 | OscillatorNode.start,2687
131 | CanvasRenderingContext2D.__lookupGetter__,2543
132 | HTMLCanvasElement.childNodes,2541
133 | CanvasRenderingContext2D.hasOwnProperty,2422
134 | HTMLCanvasElement.getBoundingClientRect,2276
135 | HTMLCanvasElement.offsetLeft,2096
136 | OscillatorNode.type,2011
137 | OscillatorNode.connect,2011
138 | CanvasRenderingContext2D.mozCurrentTransformInverse,1890
139 | HTMLCanvasElement.removeAttribute,1814
140 | HTMLCanvasElement.offsetTop,1812
141 | HTMLCanvasElement.children,1795
142 | HTMLCanvasElement.dispatchEvent,1698
143 | HTMLCanvasElement.mozOpaque,1687
144 | HTMLCanvasElement.onmousemove,1538
145 | AudioContext.createDynamicsCompressor,1535
146 | HTMLCanvasElement.offsetParent,1499
147 | OfflineAudioContext.startRendering,1381
148 | OfflineAudioContext.createDynamicsCompressor,1380
149 | OfflineAudioContext.oncomplete,1380
150 | OfflineAudioContext.createOscillator,1380
151 | OfflineAudioContext.destination,1380
152 | HTMLCanvasElement.remove,1257
153 | HTMLCanvasElement.compareDocumentPosition,1253
154 | AudioContext.state,1249
155 | AudioContext.listener,1230
156 | GainNode.connect,1204
157 | AudioContext.createGain,1197
158 | GainNode.gain,1112
159 | HTMLCanvasElement.__proto__,1028
160 | window.Storage.toString,1027
161 | AudioContext.createAnalyser,905
162 | HTMLCanvasElement.cloneNode,899
163 | AudioContext.sampleRate,882
164 | AudioContext.decodeAudioData,876
165 | AudioContext.createMediaElementSource,860
166 | HTMLCanvasElement.toBlob,837
167 | HTMLCanvasElement.removeEventListener,779
168 | AnalyserNode.fftSize,774
169 | AnalyserNode.maxDecibels,771
170 | AnalyserNode.smoothingTimeConstant,770
171 | AnalyserNode.frequencyBinCount,770
172 | AnalyserNode.minDecibels,769
173 | RTCPeerConnection.addIceCandidate,769
174 | AudioContext.onstatechange,745
175 | HTMLCanvasElement.textContent,628
176 | HTMLCanvasElement.onclick,466
177 | HTMLCanvasElement.innerHTML,437
178 | window.Storage.valueOf,423
179 | RTCPeerConnection.setRemoteDescription,379
180 | RTCPeerConnection.getStats,361
181 | AudioContext.currentTime,354
182 | OscillatorNode.stop,351
183 | RTCPeerConnection.removeEventListener,346
184 | RTCPeerConnection.addEventListener,346
185 | HTMLCanvasElement.__lookupGetter__,344
186 | AudioContext.createScriptProcessor,337
187 | HTMLCanvasElement.hasOwnProperty,312
188 | HTMLCanvasElement.onmousedown,310
189 | HTMLCanvasElement.toString,291
190 | ScriptProcessorNode.connect,288
191 | ScriptProcessorNode.onaudioprocess,287
192 | AnalyserNode.connect,285
193 | HTMLCanvasElement.blur,280
194 | HTMLCanvasElement.getAttributeNode,237
195 | HTMLCanvasElement.onmouseout,232
196 | HTMLCanvasElement.onmouseover,229
197 | HTMLCanvasElement.append,227
198 | HTMLCanvasElement.onmouseup,227
199 | CanvasRenderingContext2D.ellipse,166
200 | HTMLCanvasElement.setAttributeNode,152
201 | HTMLCanvasElement.oncontextmenu,152
202 | CanvasRenderingContext2D.getLineDash,146
203 | HTMLCanvasElement.previousSibling,139
204 | HTMLCanvasElement.parentElement,136
205 | HTMLCanvasElement.innerText,134
206 | HTMLCanvasElement.onkeydown,132
207 | HTMLCanvasElement.onkeyup,129
208 | HTMLCanvasElement.onkeypress,128
209 | HTMLCanvasElement.onblur,128
210 | HTMLCanvasElement.onfocus,128
211 | HTMLCanvasElement.onmouseleave,127
212 | HTMLCanvasElement.ondblclick,126
213 | HTMLCanvasElement.ondragenter,125
214 | HTMLCanvasElement.onresize,125
215 | HTMLCanvasElement.onpaste,125
216 | HTMLCanvasElement.onchange,125
217 | HTMLCanvasElement.oncut,125
218 | HTMLCanvasElement.ondragover,125
219 | HTMLCanvasElement.ondragleave,125
220 | HTMLCanvasElement.ondrop,125
221 | HTMLCanvasElement.onmouseenter,125
222 | HTMLCanvasElement.onload,125
223 | HTMLCanvasElement.contains,102
224 | HTMLCanvasElement.querySelectorAll,98
225 | GainNode.disconnect,77
226 | AudioContext.createBufferSource,70
227 | HTMLCanvasElement.hasChildNodes,67
228 | AudioContext.createBuffer,63
229 | AudioContext.createPanner,60
230 | HTMLCanvasElement.scrollLeft,60
231 | HTMLCanvasElement.scrollTop,60
232 | CanvasRenderingContext2D.__lookupSetter__,58
233 | CanvasRenderingContext2D.__defineSetter__,58
234 | HTMLCanvasElement.ondragstart,50
235 | HTMLCanvasElement.getClientRects,49
236 | HTMLCanvasElement.title,44
237 | HTMLCanvasElement.tabIndex,43
238 | RTCPeerConnection.close,43
239 | RTCPeerConnection.iceConnectionState,33
240 | AudioContext.close,32
241 | HTMLCanvasElement.hasAttributes,25
242 | HTMLCanvasElement.previousElementSibling,23
243 | OscillatorNode.disconnect,22
244 | HTMLCanvasElement.focus,22
245 | RTCPeerConnection.onsignalingstatechange,16
246 | RTCPeerConnection.oniceconnectionstatechange,16
247 | HTMLCanvasElement.valueOf,16
248 | HTMLCanvasElement.dataset,15
249 | HTMLCanvasElement.requestPointerLock,15
250 | HTMLCanvasElement.namespaceURI,13
251 | HTMLCanvasElement.webkitMatchesSelector,12
252 | HTMLCanvasElement.childElementCount,11
253 | HTMLCanvasElement.removeChild,8
254 | HTMLCanvasElement.insertBefore,8
255 | GainNode.numberOfOutputs,7
256 | HTMLCanvasElement.matches,6
257 | HTMLCanvasElement.outerHTML,6
258 | HTMLCanvasElement.appendChild,6
259 | AudioContext.resume,5
260 | AnalyserNode.getByteFrequencyData,5
261 | HTMLCanvasElement.clientTop,4
262 | HTMLCanvasElement.clientLeft,4
263 | HTMLCanvasElement.onwheel,4
264 | HTMLCanvasElement.DOCUMENT_NODE,4
265 | RTCPeerConnection.onaddstream,3
266 | AnalyserNode.channelInterpretation,3
267 | AnalyserNode.numberOfInputs,3
268 | AnalyserNode.channelCountMode,3
269 | AnalyserNode.numberOfOutputs,3
270 | AnalyserNode.channelCount,3
271 | HTMLCanvasElement.scrollWidth,3
272 | HTMLCanvasElement.scrollHeight,3
273 | CanvasRenderingContext2D.__proto__,3
274 | HTMLCanvasElement.getElementsByClassName,3
275 | CanvasRenderingContext2D.__defineGetter__,3
276 | HTMLCanvasElement.querySelector,2
277 | OfflineAudioContext.decodeAudioData,2
278 | RTCPeerConnection.createAnswer,2
279 | CanvasRenderingContext2D.filter,2
280 | AudioContext.createConvolver,1
281 | HTMLCanvasElement.lastChild,1
282 | CanvasRenderingContext2D.toString,1
283 | 


--------------------------------------------------------------------------------
/schema.md:
--------------------------------------------------------------------------------
  1 | * __call_stack:__
  2 |     * __Type:__ String
  3 |     * __Description:__ The call stack at the point when the function is called. The output is in the format: (function_name)(@)(javascript_source_file)(:)(line_number)(column_number)(new_line_character)
  4 |     * __Example:__ 
  5 |     ```
  6 |     jQuery.cookie@https://cdn.livechatinc.com/js/embedded.20171215135707.js:5:8393\nStore</s.get@https://cdn.livechatinc.com/js/embedded.20171215135707.js:8:3323\nStore</</s[p]@https://cdn.livechatinc.com/js/embedded.20171215135707.js:8:3746\nWindowsCommunicator.prototype.startCheckingForMainWindow/e<@https://cdn.livechatinc.com/js/embedded.20171215135707.js:10:11730
  7 |     ```
  8 | * __crawl_id:__
  9 |     * __Type:__ Integer
 10 |     * __Description:__ Crawl_id appears to be the value 1 for all json files. It is possible this field was not used when generating the data using the crawler.
 11 |     * __Example:__ 
 12 |     ```
 13 |     1
 14 |     ```
 15 | * __func_name:__
 16 |     * __Type:__ String
 17 |     * __Description:__ The name of the javascript function. Due to obfuscation the functions are often nonsensical and thus can be thought of as tokens. Anonymous functions will not have a name and the value will be an empty string.
 18 |     * __Examples:__ 
 19 |     ```
 20 |     ""
 21 |     a<4k
 22 |     getName
 23 |     ```
 24 | * __in_iframe:__
 25 |     * __Type:__ boolean
 26 |     * __Description:__ in_iframe is a boolean that indicates that the javascript code was run inside of an iframe. This is new functionality that was added ontop of the origional OpenWPM repository.
 27 | * __location:__
 28 |     * __Type:__ string
 29 |     * __Description:__ The url of the file that was being crawled to generate the json file. For iFrame resources, the location will be different that the parent url where the iFrame was encountered. For example, if Parent.html contains iFrame.html, iFrame.html is added inside an <iframe> tag. Inside iFrame.html a line of javascript such as: alert("window.location") is used to assert the location of content. When openWPM queries content that is inside iFrame.html which is found on Parent.html the location of the content is reported as: iFrame.html not Parent.html. Due to the paralellization of the crawl, the iFrame content can not be associated with the parent site on which it was encountered, only the __in_iframe__ filed can indicate whether the content was executed inside an iFrame or not. All objects in a json file that were accessed from the crawled page outside of an iFrame should have the same location value. The url can be for any type of file such as .html, .js or have no file extension.
 30 |     * __Examples:__ 
 31 |     ```
 32 |     https://www.dresslily.com/bottom-c-36.html
 33 |     http://www.vidalfrance.com/component/forme/?fid=2
 34 |     ```
 35 | * __operation:__
 36 |     * __Type:__ string
 37 |     * __Description:__ Corresponds to the "symbol" field. Operation is a call if the symbol is a method. Get/set operations get and set symbols that are properties with values. 
 38 |     * __Possible Values:__ get, call, set
 39 | * __script_col:__
 40 |     * __Type:__ string
 41 |     * __Description:__  The column in the `script_line` where the function call starts. Note: currently some string do not contain numbers, but instead they contain urls such as the example bellow.
 42 |     * __Examples:__ 
 43 |     ```
 44 |     57
 45 |     211
 46 |     //hdjs.hiido.com/hiido_internal.js?siteid=mhssj
 47 |     ```
 48 | * __script_line:__
 49 |     * __Type:__ string
 50 |     * __Description:__ The line in the file, indicated in the above `location` element, where the function call is located. Note: Currently some strings do not contain numbers, but instead they contain the protocol identifier for a url, such as in the example bellow.
 51 |     * __Examples:__ 
 52 |     ```
 53 |     12
 54 |     129
 55 |     http
 56 |     https
 57 |     ```
 58 | * __script_loc_eval:__
 59 |     * __Type:__ string
 60 |     * __Description:__ If a function call is generated using the `eval()` function, or is created using `new Function()`, then the "script_loc_eval" value will be set. For example `eval("console.log('my message')")` or `var log = new Function("message", "console.log(message)"); log("my message");` will both cause the "script_loc_evel" value be set when the function calls were collected. The format of "scipt_loc_eval" is: (line) (LINE_NUMBER) (>) (eval | Function) and can be repeated multiple times.  Additional information on how the eval line number is generated can be found at the bottom of the [MDN page](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error/Stack) which discusses the `Error` objects `stack` property. The "script_loc_eval" element is generated from this stack property.
 61 |     * __Examples:__ 
 62 |     ```
 63 |     ""
 64 |     line 2 > eval
 65 |     line 70 > Function
 66 |     line 140 > eval line 232 > Function
 67 |     line 1 > Function line 1 > eval line 1 > eval
 68 |     ```
 69 | * __script_url:__
 70 |     * __Type:__ string
 71 |     * __Description:__ The url of the file where the javascript function call was run.  This may be the same value at "location", or it may be an external web url that was loaded into the website with the use of the `<script>` tag. 
 72 |     * __Examples:__ 
 73 |     ```
 74 |     http://www.google-analytics.com/analytics.js
 75 |     http://ajax.googleapis.com/ajax/libs/jquery/1.6/jquery.min.js
 76 |     http://pw.myersinfosys.com/javascripts/jquery-cookie.js?rwdv2
 77 |     https://g.alicdn.com/alilog/oneplus/blk.html#coid=52m7EjiWaj8CASPiP1nwaYXC&noid=&grd=n
 78 |     inline-cloudflare-rocketloader-executed-3.js
 79 |     /_/scs/shopping-verified-reviews-static/_/js/k=boq-shopping-verified-reviews.VerifiedReviewsBadgeUi.en_US.-JtwBcVsOWQ.O/m=_b,_tp/rt=j/d=1/excm=badgeview,_b,_tp/ed=1/rs=AC8lLkQbsBabKLQ4BgeJxo8BUz31aigxHA
 80 |     blob:http://nadgames.com/3334aa5f-24af-4c2f-9e52-fe196a0068b6
 81 |     ```
 82 | * __symbol:__
 83 |     * __Type:__ string
 84 |     * __Description:__ Either a Web API interface property (with a value) or method (which may take args as listed in "arguments" field). Symbol corresponds to "operation" field.
 85 |     * __Examples:__ 
 86 |     ``` 
 87 |     window.Storage.getItem 
 88 |     window.navigator.userAgent
 89 |     CanvasRenderingContext2D.textBaseline
 90 |     ```
 91 | * __time_stamp:__
 92 |     * __Type:__ string
 93 |     * __Description:__ The timestamp of when the javascript function information was collected. The timestamp is collected using Javascripts Date.now() function.  It is in the format YYYY-MM-DDTHH:mm:ss.sssZ. 
 94 |       * YYYY-MM-DD is the: year-month-day. 
 95 |       * "T" is a delimiter to seperate the two sections. 
 96 |       * HH:mm:ss.sss represents the: hours, minutes, seconds, and milliseconds. 
 97 |       * Z is optional and denotes the time zone. Z represents the time zone UTC+0.
 98 |     * __Examples:__ 
 99 |     ```
100 |     2017-12-16T00:17:37.973Z
101 |     2017-12-16T00:24:09.355Z
102 |     2017-12-16T08:10:24.749Z
103 |     ```
104 | * __value:__
105 |     * __Type:__ string
106 |     * __Description:__ The value that the function returned.
107 |     * __Examples:__
108 |     ```
109 |     ""
110 |     {}
111 |     Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
112 |     \_ga=GA1.2.1076416180.1513383458; \_gid=GA1.2.1940452730.1513383458
113 |     {"name": "example", "Browser": "Mozilla/5.0"}
114 |     ```
115 | * __value_1000:__
116 |     * __Type:__ string
117 |     * __Description:__ The value that the function returned, truncated to 1000 characters.
118 |     * __Examples:__
119 |     ```
120 |     ""
121 |     {}
122 |     Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
123 |     \_ga=GA1.2.1076416180.1513383458; \_gid=GA1.2.1940452730.1513383458
124 |     {"name": "example", "Browser": "Mozilla/5.0"}
125 |     ```
126 | * __value_len:__
127 |     * __Type:__ Integer
128 |     * __Description:__ The number of characters in the string representation of the value a function returned
129 |     * __Examples:__ 
130 |     ```
131 |     59508
132 |     ```
133 | * __arguments:__ 
134 | 	* __Type:__ object
135 | 	* __Description:__ Optional property which lists the arguments taken by the method in "symbol" field. 
136 | 	* __Examples:__ 
137 |     ```
138 |     {\"0\":\"liveAgentPc\"}
139 |     {\"0\":\"liveAgentPage_0\",\"1\":\"http://www.alamy.com/help/what-is-model-release-property-release.aspx\"}
140 |     ```
141 | * __file_name:__ 
142 |     * __Type:__ string
143 |     * __Description:__ Concatenation of the crawl_id and the name of a JSON file corresponding to a `location` in the raw collected data, format (crawl_id)_(JSON file name)
144 |     * __Examples:__ 
145 |     ```
146 |     1_f001bb59462bc80ee8ec9e6592b571d0a465cf3e05665953e71b9fe9.json
147 |     ```
148 | * __call_id:__ 
149 |     * __Type:__ string
150 |     * __Description:__ Concatenation of the file name and a row identifier to distinguish between different calls to the same file, format (file_name)__(identifier)
151 |     * __Examples:__ 
152 |     ```
153 |     1_f001bb59462bc80ee8ec9e6592b571d0a465cf3e05665953e71b9fe9.json__121
154 |     ```
155 | * __arguments_i:__ 
156 |     * __Type:__ string
157 |     * __Description:__ String representation of the argument passed to the function in position i, zero indexed.
158 |     * __Examples:__ 
159 |     ```
160 |     ''
161 |     {"domain":"backcountry.com"}
162 |     None
163 |     ```
164 | * __arguments_n_keys:__ 
165 |     * __Type:__ Integer
166 |     * __Description:__ The number of arguments in a function call
167 |     * __Examples:__ 
168 |     ```
169 |     0
170 |     5
171 |     ```
172 | 
173 | * __valid:__
174 |     * __Type:__ Boolean
175 |     * __Description:__ Whether the row returned a valid result during parsing
176 |     * __Examples:__ 
177 |     ```
178 |     True
179 |     ```
180 | * __errors:__
181 |     * __Type:__ string
182 |     * __Description:__ An error message if an error arised during row parsing
183 |     * __Examples:__
184 | 


--------------------------------------------------------------------------------