├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── README_depricated.md ├── analyses ├── 2018_09_biskit1_mordax__canvas_fingerprinting.ipynb ├── 2018_12_LABBsoft_tracking_review │ ├── Ad Blocker Report.md │ ├── Evercookies Report.md │ ├── RelevantSymbolCounter.py │ ├── Tracking Method Sources.md │ ├── Tracking Methods.md │ ├── Tracking Report Template.md │ └── window.name Report.md ├── 2018_12_ddobre_static_analysis │ ├── 1-get_script_urls │ │ ├── README.md │ │ ├── config.ini │ │ ├── explore_url_lists.ipynb │ │ ├── generate_url_list_spark.py │ │ ├── requirements.txt │ │ ├── test_generate_url_list_spark.py │ │ └── test_urls.csv │ ├── 2-scrape_js │ │ ├── README.md │ │ ├── async_js_get.py │ │ ├── config.ini │ │ ├── downloads_analysis │ │ │ ├── README.md │ │ │ ├── compare_condensed_with_full.py │ │ │ ├── explore_downloads.ipynb │ │ │ ├── extract_hashes_from_full_dataset.py │ │ │ └── js_status.csv │ │ ├── requirements.txt │ │ └── single_js_get.py │ ├── 3-generate_symbols_of_interest │ │ ├── README.md │ │ ├── config.ini │ │ ├── master.txt │ │ ├── process_APIs.py │ │ └── symbol_dict.json │ ├── 4-ast_analysis │ │ ├── README.md │ │ ├── async_tree_explorer.py │ │ ├── config.ini │ │ ├── master_sym_list.json │ │ ├── new_async_tree_explorer.py │ │ ├── output_data │ │ │ ├── extended_symbol_counts.json │ │ │ └── symbol_counts.json │ │ ├── requirements.txt │ │ └── single_tree_explorer.py │ └── README.md ├── 2018_12_willoughr__fingerprinting_prevalence.txt ├── 2019_03_willougr_fingerprinting_implementation_sixth_sense │ ├── Audio Fingerprinting Heuristics.ipynb │ ├── Canvas Fingerprinting Heuristics.ipynb │ ├── Font Fingerprinting Heuristics.ipynb │ ├── README.md │ └── WebRTC Fingerprinting Heuristics.ipynb ├── README.md ├── environment.yaml ├── hello_mozfest.ipynb ├── hello_world.ipynb ├── hello_world.md ├── issue_34_setup_and_dask_tips.ipynb └── issue_36.ipynb ├── data_prep ├── Process All Data.ipynb ├── Process All Data.md ├── Sample Review.ipynb ├── raw_data_schema.template └── symbol_counts.csv └── schema.md /.gitignore: -------------------------------------------------------------------------------- 1 | # jupyter notebook checkpoints 2 | .ipynb_checkpoints 3 | 4 | # vim swap files 5 | *.sw? 6 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Community Participation Guidelines 2 | 3 | This repository is governed by Mozilla's code of conduct and etiquette guidelines. 4 | For more details, please read the 5 | [Mozilla Community Participation Guidelines](https://www.mozilla.org/about/governance/policies/participation/). 6 | 7 | ## How to Report 8 | For more information on how to report violations of the Community Participation Guidelines, please read our '[How to Report](https://www.mozilla.org/about/governance/policies/participation/reporting/)' page. 9 | 10 | 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Mozilla Public License Version 2.0 2 | ================================== 3 | 4 | 1. Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. 
"Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor's Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. "Incompatible With Secondary Licenses" 25 | means 26 | 27 | (a) that the initial Contributor has attached the notice described 28 | in Exhibit B to the Covered Software; or 29 | 30 | (b) that the Covered Software was made available under the terms of 31 | version 1.1 or earlier of the License, but not also under the 32 | terms of a Secondary License. 33 | 34 | 1.6. "Executable Form" 35 | means any form of the work other than Source Code Form. 36 | 37 | 1.7. "Larger Work" 38 | means a work that combines Covered Software with other material, in 39 | a separate file or files, that is not Covered Software. 40 | 41 | 1.8. "License" 42 | means this document. 43 | 44 | 1.9. "Licensable" 45 | means having the right to grant, to the maximum extent possible, 46 | whether at the time of the initial grant or subsequently, any and 47 | all of the rights conveyed by this License. 48 | 49 | 1.10. "Modifications" 50 | means any of the following: 51 | 52 | (a) any file in Source Code Form that results from an addition to, 53 | deletion from, or modification of the contents of Covered 54 | Software; or 55 | 56 | (b) any new file in Source Code Form that contains any Covered 57 | Software. 58 | 59 | 1.11. "Patent Claims" of a Contributor 60 | means any patent claim(s), including without limitation, method, 61 | process, and apparatus claims, in any patent Licensable by such 62 | Contributor that would be infringed, but for the grant of the 63 | License, by the making, using, selling, offering for sale, having 64 | made, import, or transfer of either its Contributions or its 65 | Contributor Version. 66 | 67 | 1.12. "Secondary License" 68 | means either the GNU General Public License, Version 2.0, the GNU 69 | Lesser General Public License, Version 2.1, the GNU Affero General 70 | Public License, Version 3.0, or any later versions of those 71 | licenses. 72 | 73 | 1.13. "Source Code Form" 74 | means the form of the work preferred for making modifications. 75 | 76 | 1.14. "You" (or "Your") 77 | means an individual or a legal entity exercising rights under this 78 | License. For legal entities, "You" includes any entity that 79 | controls, is controlled by, or is under common control with You. For 80 | purposes of this definition, "control" means (a) the power, direct 81 | or indirect, to cause the direction or management of such entity, 82 | whether by contract or otherwise, or (b) ownership of more than 83 | fifty percent (50%) of the outstanding shares or beneficial 84 | ownership of such entity. 85 | 86 | 2. License Grants and Conditions 87 | -------------------------------- 88 | 89 | 2.1. 
Grants 90 | 91 | Each Contributor hereby grants You a world-wide, royalty-free, 92 | non-exclusive license: 93 | 94 | (a) under intellectual property rights (other than patent or trademark) 95 | Licensable by such Contributor to use, reproduce, make available, 96 | modify, display, perform, distribute, and otherwise exploit its 97 | Contributions, either on an unmodified basis, with Modifications, or 98 | as part of a Larger Work; and 99 | 100 | (b) under Patent Claims of such Contributor to make, use, sell, offer 101 | for sale, have made, import, and otherwise transfer either its 102 | Contributions or its Contributor Version. 103 | 104 | 2.2. Effective Date 105 | 106 | The licenses granted in Section 2.1 with respect to any Contribution 107 | become effective for each Contribution on the date the Contributor first 108 | distributes such Contribution. 109 | 110 | 2.3. Limitations on Grant Scope 111 | 112 | The licenses granted in this Section 2 are the only rights granted under 113 | this License. No additional rights or licenses will be implied from the 114 | distribution or licensing of Covered Software under this License. 115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a 116 | Contributor: 117 | 118 | (a) for any code that a Contributor has removed from Covered Software; 119 | or 120 | 121 | (b) for infringements caused by: (i) Your and any other third party's 122 | modifications of Covered Software, or (ii) the combination of its 123 | Contributions with other software (except as part of its Contributor 124 | Version); or 125 | 126 | (c) under Patent Claims infringed by Covered Software in the absence of 127 | its Contributions. 128 | 129 | This License does not grant any rights in the trademarks, service marks, 130 | or logos of any Contributor (except as may be necessary to comply with 131 | the notice requirements in Section 3.4). 132 | 133 | 2.4. Subsequent Licenses 134 | 135 | No Contributor makes additional grants as a result of Your choice to 136 | distribute the Covered Software under a subsequent version of this 137 | License (see Section 10.2) or under the terms of a Secondary License (if 138 | permitted under the terms of Section 3.3). 139 | 140 | 2.5. Representation 141 | 142 | Each Contributor represents that the Contributor believes its 143 | Contributions are its original creation(s) or it has sufficient rights 144 | to grant the rights to its Contributions conveyed by this License. 145 | 146 | 2.6. Fair Use 147 | 148 | This License is not intended to limit any rights You have under 149 | applicable copyright doctrines of fair use, fair dealing, or other 150 | equivalents. 151 | 152 | 2.7. Conditions 153 | 154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 155 | in Section 2.1. 156 | 157 | 3. Responsibilities 158 | ------------------- 159 | 160 | 3.1. Distribution of Source Form 161 | 162 | All distribution of Covered Software in Source Code Form, including any 163 | Modifications that You create or to which You contribute, must be under 164 | the terms of this License. You must inform recipients that the Source 165 | Code Form of the Covered Software is governed by the terms of this 166 | License, and how they can obtain a copy of this License. You may not 167 | attempt to alter or restrict the recipients' rights in the Source Code 168 | Form. 169 | 170 | 3.2. 
Distribution of Executable Form 171 | 172 | If You distribute Covered Software in Executable Form then: 173 | 174 | (a) such Covered Software must also be made available in Source Code 175 | Form, as described in Section 3.1, and You must inform recipients of 176 | the Executable Form how they can obtain a copy of such Source Code 177 | Form by reasonable means in a timely manner, at a charge no more 178 | than the cost of distribution to the recipient; and 179 | 180 | (b) You may distribute such Executable Form under the terms of this 181 | License, or sublicense it under different terms, provided that the 182 | license for the Executable Form does not attempt to limit or alter 183 | the recipients' rights in the Source Code Form under this License. 184 | 185 | 3.3. Distribution of a Larger Work 186 | 187 | You may create and distribute a Larger Work under terms of Your choice, 188 | provided that You also comply with the requirements of this License for 189 | the Covered Software. If the Larger Work is a combination of Covered 190 | Software with a work governed by one or more Secondary Licenses, and the 191 | Covered Software is not Incompatible With Secondary Licenses, this 192 | License permits You to additionally distribute such Covered Software 193 | under the terms of such Secondary License(s), so that the recipient of 194 | the Larger Work may, at their option, further distribute the Covered 195 | Software under the terms of either this License or such Secondary 196 | License(s). 197 | 198 | 3.4. Notices 199 | 200 | You may not remove or alter the substance of any license notices 201 | (including copyright notices, patent notices, disclaimers of warranty, 202 | or limitations of liability) contained within the Source Code Form of 203 | the Covered Software, except that You may alter any license notices to 204 | the extent required to remedy known factual inaccuracies. 205 | 206 | 3.5. Application of Additional Terms 207 | 208 | You may choose to offer, and to charge a fee for, warranty, support, 209 | indemnity or liability obligations to one or more recipients of Covered 210 | Software. However, You may do so only on Your own behalf, and not on 211 | behalf of any Contributor. You must make it absolutely clear that any 212 | such warranty, support, indemnity, or liability obligation is offered by 213 | You alone, and You hereby agree to indemnify every Contributor for any 214 | liability incurred by such Contributor as a result of warranty, support, 215 | indemnity or liability terms You offer. You may include additional 216 | disclaimers of warranty and limitations of liability specific to any 217 | jurisdiction. 218 | 219 | 4. Inability to Comply Due to Statute or Regulation 220 | --------------------------------------------------- 221 | 222 | If it is impossible for You to comply with any of the terms of this 223 | License with respect to some or all of the Covered Software due to 224 | statute, judicial order, or regulation then You must: (a) comply with 225 | the terms of this License to the maximum extent possible; and (b) 226 | describe the limitations and the code they affect. Such description must 227 | be placed in a text file included with all distributions of the Covered 228 | Software under this License. Except to the extent prohibited by statute 229 | or regulation, such description must be sufficiently detailed for a 230 | recipient of ordinary skill to be able to understand it. 231 | 232 | 5. Termination 233 | -------------- 234 | 235 | 5.1. 
The rights granted under this License will terminate automatically 236 | if You fail to comply with any of its terms. However, if You become 237 | compliant, then the rights granted under this License from a particular 238 | Contributor are reinstated (a) provisionally, unless and until such 239 | Contributor explicitly and finally terminates Your grants, and (b) on an 240 | ongoing basis, if such Contributor fails to notify You of the 241 | non-compliance by some reasonable means prior to 60 days after You have 242 | come back into compliance. Moreover, Your grants from a particular 243 | Contributor are reinstated on an ongoing basis if such Contributor 244 | notifies You of the non-compliance by some reasonable means, this is the 245 | first time You have received notice of non-compliance with this License 246 | from such Contributor, and You become compliant prior to 30 days after 247 | Your receipt of the notice. 248 | 249 | 5.2. If You initiate litigation against any entity by asserting a patent 250 | infringement claim (excluding declaratory judgment actions, 251 | counter-claims, and cross-claims) alleging that a Contributor Version 252 | directly or indirectly infringes any patent, then the rights granted to 253 | You by any and all Contributors for the Covered Software under Section 254 | 2.1 of this License shall terminate. 255 | 256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all 257 | end user license agreements (excluding distributors and resellers) which 258 | have been validly granted by You or Your distributors under this License 259 | prior to termination shall survive termination. 260 | 261 | ************************************************************************ 262 | * * 263 | * 6. Disclaimer of Warranty * 264 | * ------------------------- * 265 | * * 266 | * Covered Software is provided under this License on an "as is" * 267 | * basis, without warranty of any kind, either expressed, implied, or * 268 | * statutory, including, without limitation, warranties that the * 269 | * Covered Software is free of defects, merchantable, fit for a * 270 | * particular purpose or non-infringing. The entire risk as to the * 271 | * quality and performance of the Covered Software is with You. * 272 | * Should any Covered Software prove defective in any respect, You * 273 | * (not any Contributor) assume the cost of any necessary servicing, * 274 | * repair, or correction. This disclaimer of warranty constitutes an * 275 | * essential part of this License. No use of any Covered Software is * 276 | * authorized under this License except under this disclaimer. * 277 | * * 278 | ************************************************************************ 279 | 280 | ************************************************************************ 281 | * * 282 | * 7. Limitation of Liability * 283 | * -------------------------- * 284 | * * 285 | * Under no circumstances and under no legal theory, whether tort * 286 | * (including negligence), contract, or otherwise, shall any * 287 | * Contributor, or anyone who distributes Covered Software as * 288 | * permitted above, be liable to You for any direct, indirect, * 289 | * special, incidental, or consequential damages of any character * 290 | * including, without limitation, damages for lost profits, loss of * 291 | * goodwill, work stoppage, computer failure or malfunction, or any * 292 | * and all other commercial damages or losses, even if such party * 293 | * shall have been informed of the possibility of such damages. 
This * 294 | * limitation of liability shall not apply to liability for death or * 295 | * personal injury resulting from such party's negligence to the * 296 | * extent applicable law prohibits such limitation. Some * 297 | * jurisdictions do not allow the exclusion or limitation of * 298 | * incidental or consequential damages, so this exclusion and * 299 | * limitation may not apply to You. * 300 | * * 301 | ************************************************************************ 302 | 303 | 8. Litigation 304 | ------------- 305 | 306 | Any litigation relating to this License may be brought only in the 307 | courts of a jurisdiction where the defendant maintains its principal 308 | place of business and such litigation shall be governed by laws of that 309 | jurisdiction, without reference to its conflict-of-law provisions. 310 | Nothing in this Section shall prevent a party's ability to bring 311 | cross-claims or counter-claims. 312 | 313 | 9. Miscellaneous 314 | ---------------- 315 | 316 | This License represents the complete agreement concerning the subject 317 | matter hereof. If any provision of this License is held to be 318 | unenforceable, such provision shall be reformed only to the extent 319 | necessary to make it enforceable. Any law or regulation which provides 320 | that the language of a contract shall be construed against the drafter 321 | shall not be used to construe this License against a Contributor. 322 | 323 | 10. Versions of the License 324 | --------------------------- 325 | 326 | 10.1. New Versions 327 | 328 | Mozilla Foundation is the license steward. Except as provided in Section 329 | 10.3, no one other than the license steward has the right to modify or 330 | publish new versions of this License. Each version will be given a 331 | distinguishing version number. 332 | 333 | 10.2. Effect of New Versions 334 | 335 | You may distribute the Covered Software under the terms of the version 336 | of the License under which You originally received the Covered Software, 337 | or under the terms of any subsequent version published by the license 338 | steward. 339 | 340 | 10.3. Modified Versions 341 | 342 | If you create software not governed by this License, and you want to 343 | create a new license for such software, you may create and use a 344 | modified version of this License if you rename the license and remove 345 | any references to the name of the license steward (except to note that 346 | such modified license differs from this License). 347 | 348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary 349 | Licenses 350 | 351 | If You choose to distribute Source Code Form that is Incompatible With 352 | Secondary Licenses under the terms of this version of the License, the 353 | notice described in Exhibit B of this License must be attached. 354 | 355 | Exhibit A - Source Code Form License Notice 356 | ------------------------------------------- 357 | 358 | This Source Code Form is subject to the terms of the Mozilla Public 359 | License, v. 2.0. If a copy of the MPL was not distributed with this 360 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 361 | 362 | If it is not possible or desirable to put the notice in a particular 363 | file, then You may include the notice in a location (such as a LICENSE 364 | file in a relevant directory) where a recipient would be likely to look 365 | for such a notice. 366 | 367 | You may add additional accurate notices of copyright ownership. 
368 | 369 | Exhibit B - "Incompatible With Secondary Licenses" Notice 370 | --------------------------------------------------------- 371 | 372 | This Source Code Form is "Incompatible With Secondary Licenses", as 373 | defined by the Mozilla Public License, v. 2.0. 374 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Overscripted Web: Data Analysis in the Open 2 | 3 | The Systems Research Group (SRG) at Mozilla has created and open-sourced a data set of publicly available information that was collected by a November 2017 Web crawl. We want to empower the community to explore the unseen or otherwise not obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content. 4 | Some preliminary insights already uncovered from this data are illustrated in this [blog post](https://medium.com/firefox-context-graph/overscripted-digging-into-javascript-execution-at-scale-2ed508f21862). 5 | Ongoing analyses can be tracked [here](https://github.com/mozilla/overscripted/tree/master/analyses). 6 | 7 | The crawl data hosted here was collected using [OpenWPM](https://github.com/mozilla/OpenWPM), which is developed and maintained by the Mozilla Security Engineering team. 8 | 9 | ### Submitting an analysis: 10 | - Analyses should be performed in Python using the [jupyter scientific notebook](https://jupyter-notebook.readthedocs.io/en/stable/) format and executed in this [environment](https://github.com/mozilla/overscripted/blob/master/analyses/environment.yaml). 11 | - Analyses can be submitted by filing a [Pull Request](https://help.github.com/articles/using-pull-requests) against this repository with the analysis formatted as an *.ipynb file or folder in the /analyses/ folder. 12 | - Set-up instructions are provided here: https://github.com/mozilla/overscripted/blob/master/analyses/README.md 13 | - Notebooks must be well documented and run on the [environment](https://github.com/mozilla/overscripted/blob/master/analyses/environment.yaml) described. If additional installations are needed, these should be documented. 14 | - Files and folders should have the format `yyyy_mm_username__short-title` - the analyses directory contains examples already if this is not clear. 15 | - PRs altering or updating an existing analysis will not be accepted unless they only tweak formatting or fix small errors so that the notebook runs. If you wish to continue / build on someone else's existing analysis, start your own analysis folder / file, cite their work, and then proceed with your extension. 16 | 17 | ### Accessing the Data 18 | Each of the links below links to a bz2-zipped portion of the total dataset. 19 | 20 | A small sample of the data is available in `safe_dataset.sample.tar.bz2` to get a feel for the content without committing to the full download. 21 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.sample.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.sample.tar.bz2) 22 | 23 | Three samples that are large enough for meaningful analysis of the dataset are 24 | also available, as the full dataset is very large.
More details about the 25 | samples are available in [data_prep/Sample Review.ipynb](https://github.com/mozilla/overscripted/blob/master/data_prep/Sample%20Review.ipynb) 26 | - https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent_value_1000_only.parquet.tar.bz2 - 900MB download / 1.3GB on disk 27 | - https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent.parquet.tar.bz2 - 3.7GB download / 7.4GB on disk 28 | - https://public-data.telemetry.mozilla.org/bigcrawl/value_1000_only.parquet.tar.bz2 - 9.1GB download / 15GB on disk 29 | 30 | The full dataset: unzipped, the full parquet data will be approximately 70GB. Each (compressed) chunk dataset is around 9GB. `SHA256SUMS` contains the checksums for all datasets including the sample. 31 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.0.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.0.tar.bz2) 32 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.1.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.1.tar.bz2) 33 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.2.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.2.tar.bz2) 34 | - [https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.3.tar.bz2](https://public-data.telemetry.mozilla.org/bigcrawl/safe_dataset.3.tar.bz2) 35 | - [https://public-data.telemetry.mozilla.org/bigcrawl/SHA256SUMS](https://public-data.telemetry.mozilla.org/bigcrawl/SHA256SUMS) 36 | 37 | Refer to [hello_world.ipynb](https://github.com/mozilla/overscripted/blob/master/analyses/hello_world.ipynb) to see how to load and have a quick look at the data with pandas, Dask and Spark. 38 | 39 | ### New Contributor Tips 40 | 41 | - Contribute what you learn - whether from reading related research papers or from interacting with the community on gitter - by submitting a Pull Request (PR) to the repository. You can submit the PR against the README on the main page or the analyses folder README. 42 | 43 | - This is not a one-issue-per-person repo. All the questions are very open-ended, and different people may find very different and complementary things when looking at a question. 44 | 45 | - Use a reaction emoji to acknowledge a comment rather than writing a comment like "sure" - it keeps threads clean while still letting the commenter know that you saw their comment. 46 | 47 | - You can ask for help and discuss your ideas on gitter. Click [here](https://gitter.im/overscripted-discuss/community) to join! 48 | 49 | - When you open an issue and work on a Pull Request relating to it, add "WIP" (work in progress) to the title of the PR. When your PR is ready for review, remove the WIP tag. You can also request feedback on specific things while it's still a WIP. 50 | 51 | - Please reference your issues on a PR so that they link and autoclose. Refer to [this guide](https://help.github.com/en/articles/closing-issues-using-keywords). 52 | 53 | - If your OS is Ubuntu and you have trouble installing Spark with conda, refer to this [link](https://datawookie.netlify.com/blog/2017/07/installing-spark-on-ubuntu/). 54 | 55 | - The dataset is very large. Even the subsets of the dataset are unlikely to fit into memory. Working with this dataset will typically require using Dask (http://dask.pydata.org/), Spark (http://spark.apache.org/) or similar tools to enable parallelized / out-of-core / distributed processing. A minimal Dask example is sketched below.
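The sketch below is illustrative only, assuming one of the parquet samples above has been unpacked locally; the path and the `symbol` column name are assumptions for illustration (check `schema.md` and the hello_world notebook for the authoritative layout).

```python
import dask.dataframe as dd

# Hypothetical path: adjust to wherever the sample archive was unpacked.
df = dd.read_parquet('sample_10percent.parquet/', engine='pyarrow')

# Nothing is read into memory yet: Dask builds a lazy task graph and
# only materializes results when .compute() is called.
print(df.columns)

# Tally the most frequently called JavaScript symbols across the sample.
top_symbols = df['symbol'].value_counts().nlargest(20).compute()
print(top_symbols)
```

Because every intermediate result stays lazy until `.compute()`, the same code runs against the small sample or the full dataset without modification.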
56 | 57 | ### Glossary 58 | - [Fingerprinting](https://en.wikipedia.org/wiki/Device_fingerprint) is the process of creating a unique identifier based on characteristics of your hardware, operating system and browser. 59 | - TLD stands for Top-Level Domain. You can read more about it [here](https://en.wikipedia.org/wiki/Top-level_domain). 60 | - A [User Agent](https://en.wikipedia.org/wiki/User_agent) (UA) is a string that helps identify which browser is being used, which version, and on which operating system. 61 | - A [Web Crawler](https://en.wikipedia.org/wiki/Web_crawler) is a program or automated script which browses the World Wide Web in a methodical, automated manner. 62 | 63 | 64 | 65 | ### Resources 66 | 67 | - Please refer to the [reading list](https://github.com/mozilla/overscripted/wiki/Reading-List-(WIP)) for additional references and information. 68 | 69 | - [This](https://github.com/brandon-rhodes/pycon-pandas-tutorial) is a great tutorial to learn Pandas. 70 | 71 | - [Tutorial](https://www.youtube.com/watch?v=HW29067qVWk) on Jupyter Notebook. 72 | 73 | - We have used Dask in some of our Jupyter notebooks. Dask gives you a pandas-like API but lets you work on data that is too big to fit in memory. Dask can be used on a single machine or a cluster. Most analyses done for this project were done on a single machine. Please start by reviewing the [docs](https://dask.org/) to learn more about it. 74 | 75 | - [This](https://github.com/aSquare14/Git-Cheat-Sheet) will help you get started with Git. For visual thinkers this [tutorial](https://www.youtube.com/playlist?list=PL6gx4Cwl9DGAKWClAD_iKpNC0bGHxGhcx) can be a good start. 76 | 77 | - Other Dask resources: [overview](https://www.youtube.com/watch?v=ods97a5Pzw0) video and [cheatsheet](https://dask.pydata.org/en/latest/_downloads/daskcheatsheet.pdf). 78 | 79 | - [Apache Spark](https://spark.apache.org/docs/latest/api/python/pyspark.html) is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. We use [findspark](https://github.com/minrk/findspark) to set up Spark. You can learn more about it [here](https://github.com/apache/spark). 80 | 81 | -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/Ad Blocker Report.md: -------------------------------------------------------------------------------- 1 | # Ad Blocker Usage Detection Report 2 | ### Summary 3 | Ad blocker usage can be detected by checking the state of one of your ad scripts. It is frequently used to restrict functionality for ad-blocking users or as a bit of information in user fingerprinting. 4 | 5 | 6 | ### Detection 7 | #### In Literature 8 | In _Measuring and Disrupting Anti-Adblockers Using 9 | Differential Execution Analysis_ anti-adblockers were detected on 30.5% of the Alexa top-10K websites. This is trending upwards significantly: in May 2016 only 6.7% of those sites were detected using anti-adblock software. 10 | 11 | #### In The Overscripted Dataset 12 | Anti-adblock behaviour could not be detected in the dataset. 13 | 14 | #### What else would we need to detect it? 15 | Adblock detection can be done by running two crawls, one with an ad blocker and one without, and comparing the behaviour differences. This would allow the Overscripted dataset to capture behaviour differences as well as frequency; the sketch below illustrates one way such a comparison could be made.
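Assuming per-symbol call counts from two such crawls were available, the comparison might look like this (the input file names are hypothetical; nothing in this repo produces them yet):

```python
import pandas as pd

# Hypothetical inputs: symbol call counts from two otherwise identical
# crawls, one run with an ad blocker installed and one without.
plain = pd.read_csv('symbol_counts_plain.csv', names=['symbol', 'count'])
adblock = pd.read_csv('symbol_counts_adblock.csv', names=['symbol', 'count'])

# Outer-join so symbols that appear in only one crawl are kept.
diff = plain.merge(adblock, on='symbol', how='outer',
                   suffixes=('_plain', '_adblock')).fillna(0)
diff['delta'] = diff['count_plain'] - diff['count_adblock']

# Symbols whose call counts shift the most between the two crawls are
# candidates for anti-adblock (or ad-dependent) behaviour.
print(diff.sort_values('delta', ascending=False).head(20))
```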
16 | -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/Evercookies Report.md: -------------------------------------------------------------------------------- 1 | # Evercookies 2 | ### Summary 3 | Evercookies are the practice of restoring cleared cookies by storing the cookie information in an alternate location and recreating the cookie if it has been removed. This extends the lifespan of the cookies and allows sites to identify users who clear their cookies. 4 | 5 | Should other cookie behaviour, such as duplicating cookies to all browsers on a system or passing cookies between domains, count as evercookies? 6 | 7 | ### Detection 8 | #### In Literature 9 | The Tor Browser design draft identifies the following locations for identifier storage: 10 | - **Cookies** 11 | - **Cache** 12 | - **HTTP Authentication** 13 | - **DOM Storage** 14 | - IndexedDB Storage 15 | - **Flash Cookies** 16 | - SSL+TLS session resumption 17 | - Javascript SharedWorkers - threads with a shared scope between all threads from the same JS origin, could have access to objects from the same third party loaded at another origin 18 | - URL.createObjectURL 19 | - SPDY, HTTP/2 20 | - Cross-origin redirects 21 | - **window.name** 22 | - Auto form-fill 23 | - HSTS - Stores an effective bit of information per domain name 24 | - HPKP 25 | - BroadcastChannel API 26 | - OCSP Requests 27 | - Favicons 28 | - mediasource URIs and MediaStreams 29 | - Speculative and prefetched connections 30 | - Permissions API 31 | 32 | 33 | 34 | #### In The Overscripted Dataset 35 | We found 52,669,910 instances of local storage access, cookie data access or window.name access. 36 | 37 | #### What else would we need to detect it? 38 | The best way to detect evercookies would be multiple runs of the same crawl while clearing the cache between attempts. This would allow us to compare cookies between attempts as well as log the state of all possible storage mechanisms after each run. 39 | 40 | 41 | #### Do we see it? 42 | Using our symbol_counts.csv, we can check for all instances of window.name, Storage getItem/setItem and document.cookie access, and count them all.
43 | ``` 44 | count = 0 45 | target = ["window.name", "setItem", "getItem", "document.cookie"] 46 | for line in f.split('\n'): 47 | for t in target: 48 | if t in line: 49 | count += int(line.split(",")[1]) 50 | print(line) 51 | print("Total:", count) 52 | ``` 53 | 54 | Output: 55 | ``` 56 | window.document.cookie,35455680 57 | window.Storage.getItem,10553944 58 | window.Storage.setItem,4175556 59 | window.name,2484730 60 | Total: 52669910 61 | ``` -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/RelevantSymbolCounter.py: -------------------------------------------------------------------------------- 1 | f = """ 2 | window.document.cookie,35455680 3 | window.navigator.userAgent,15534371 4 | window.Storage.getItem,10553944 5 | window.localStorage,8767285 6 | window.Storage.setItem,4175556 7 | window.sessionStorage,4033894 8 | window.Storage.removeItem,2932713 9 | window.name,2484730 10 | CanvasRenderingContext2D.fillStyle,1957519 11 | window.navigator.plugins[Shockwave Flash].description,1863285 12 | window.screen.colorDepth,1449905 13 | window.navigator.appName,1286084 14 | window.navigator.language,1172256 15 | window.navigator.platform,1140738 16 | CanvasRenderingContext2D.save,1000762 17 | CanvasRenderingContext2D.restore,997755 18 | CanvasRenderingContext2D.fill,954340 19 | CanvasRenderingContext2D.fillRect,936267 20 | window.navigator.plugins[Shockwave Flash].name,895289 21 | CanvasRenderingContext2D.font,814310 22 | CanvasRenderingContext2D.lineWidth,718195 23 | window.navigator.appVersion,707298 24 | window.navigator.cookieEnabled,692524 25 | HTMLCanvasElement.width,681003 26 | CanvasRenderingContext2D.strokeStyle,650211 27 | HTMLCanvasElement.height,644476 28 | HTMLCanvasElement.getContext,596749 29 | window.Storage.key,551691 30 | CanvasRenderingContext2D.fillText,542896 31 | window.Storage.length,541077 32 | CanvasRenderingContext2D.stroke,537024 33 | CanvasRenderingContext2D.measureText,522209 34 | window.navigator.vendor,487833 35 | window.navigator.doNotTrack,468365 36 | CanvasRenderingContext2D.arc,413449 37 | HTMLCanvasElement.style,294223 38 | CanvasRenderingContext2D.textBaseline,293489 39 | window.navigator.product,279653 40 | CanvasRenderingContext2D.textAlign,246380 41 | window.navigator.plugins[Shockwave Flash].filename,225751 42 | window.navigator.mimeTypes[application/x-shockwave-flash].type,213769 43 | window.navigator.languages,199435 44 | window.navigator.plugins[Shockwave Flash].length,184995 45 | CanvasRenderingContext2D.bezierCurveTo,176757 46 | CanvasRenderingContext2D.shadowBlur,172808 47 | CanvasRenderingContext2D.shadowOffsetY,161446 48 | CanvasRenderingContext2D.shadowOffsetX,159263 49 | CanvasRenderingContext2D.shadowColor,158579 50 | window.screen.pixelDepth,156326 51 | CanvasRenderingContext2D.rect,154007 52 | HTMLCanvasElement.nodeType,153630 53 | CanvasRenderingContext2D.lineJoin,151407 54 | window.navigator.mimeTypes[application/futuresplash].type,150364 55 | CanvasRenderingContext2D.lineCap,149511 56 | window.navigator.plugins[Shockwave Flash].version,144656 57 | CanvasRenderingContext2D.strokeRect,142838 58 | HTMLCanvasElement.toDataURL,135041 59 | CanvasRenderingContext2D.createRadialGradient,132444 60 | CanvasRenderingContext2D.globalCompositeOperation,122162 61 | window.navigator.onLine,116037 62 | CanvasRenderingContext2D.scale,115227 63 | window.Storage.hasOwnProperty,108138 64 | CanvasRenderingContext2D.clip,106238 65 | CanvasRenderingContext2D.miterLimit,102589 66 | 
window.navigator.mimeTypes[application/x-shockwave-flash].suffixes,94030 67 | window.navigator.mimeTypes[application/futuresplash].suffixes,94025 68 | RTCPeerConnection.localDescription,88683 69 | window.navigator.productSub,71139 70 | window.navigator.mimeTypes[application/x-shockwave-flash].description,70284 71 | window.navigator.mimeTypes[application/futuresplash].description,70278 72 | HTMLCanvasElement.nodeName,67621 73 | CanvasRenderingContext2D.rotate,63824 74 | HTMLCanvasElement.parentNode,57192 75 | window.navigator.oscpu,54799 76 | window.navigator.appCodeName,51161 77 | CanvasRenderingContext2D.createLinearGradient,46710 78 | CanvasRenderingContext2D.putImageData,45469 79 | window.navigator.geolocation,43022 80 | CanvasRenderingContext2D.getImageData,41412 81 | HTMLCanvasElement.ownerDocument,37831 82 | HTMLCanvasElement.className,36778 83 | RTCPeerConnection.onicecandidate,32522 84 | HTMLCanvasElement.getAttribute,31800 85 | window.navigator.vendorSub,26840 86 | HTMLCanvasElement.addEventListener,23485 87 | window.navigator.buildID,23419 88 | HTMLCanvasElement.classList,22963 89 | HTMLCanvasElement.setAttribute,20689 90 | HTMLCanvasElement.clientHeight,20665 91 | HTMLCanvasElement.clientWidth,20341 92 | HTMLCanvasElement.getElementsByTagName,16224 93 | HTMLCanvasElement.tagName,14475 94 | RTCPeerConnection.iceGatheringState,13984 95 | RTCPeerConnection.createDataChannel,13776 96 | RTCPeerConnection.signalingState,13160 97 | RTCPeerConnection.remoteDescription,13113 98 | RTCPeerConnection.createOffer,13015 99 | CanvasRenderingContext2D.setLineDash,12590 100 | HTMLCanvasElement.onselectstart,12054 101 | RTCPeerConnection.setLocalDescription,11844 102 | CanvasRenderingContext2D.arcTo,11428 103 | CanvasRenderingContext2D.isPointInPath,11342 104 | CanvasRenderingContext2D.createImageData,11163 105 | HTMLCanvasElement.id,10941 106 | CanvasRenderingContext2D.imageSmoothingEnabled,9722 107 | HTMLCanvasElement.draggable,9558 108 | HTMLCanvasElement.constructor,9246 109 | CanvasRenderingContext2D.createPattern,8713 110 | CanvasRenderingContext2D.lineDashOffset,7726 111 | HTMLCanvasElement.offsetWidth,7346 112 | CanvasRenderingContext2D.mozImageSmoothingEnabled,6561 113 | RTCPeerConnection.idpLoginUrl,6556 114 | RTCPeerConnection.peerIdentity,6556 115 | RTCPeerConnection.onremovestream,6556 116 | HTMLCanvasElement.offsetHeight,6175 117 | CanvasRenderingContext2D.strokeText,5147 118 | HTMLCanvasElement.firstChild,4897 119 | HTMLCanvasElement.hasAttribute,4604 120 | HTMLCanvasElement.localName,4577 121 | HTMLCanvasElement.attributes,4507 122 | HTMLCanvasElement.nextSibling,3857 123 | AudioContext.destination,3758 124 | HTMLCanvasElement.firstElementChild,3586 125 | HTMLCanvasElement.nextElementSibling,3560 126 | window.Storage.clear,3348 127 | HTMLCanvasElement.dir,3171 128 | CanvasRenderingContext2D.mozCurrentTransform,3102 129 | OscillatorNode.frequency,3056 130 | AudioContext.createOscillator,2898 131 | OscillatorNode.start,2687 132 | CanvasRenderingContext2D.__lookupGetter__,2543 133 | HTMLCanvasElement.childNodes,2541 134 | CanvasRenderingContext2D.hasOwnProperty,2422 135 | HTMLCanvasElement.getBoundingClientRect,2276 136 | HTMLCanvasElement.offsetLeft,2096 137 | OscillatorNode.type,2011 138 | OscillatorNode.connect,2011 139 | CanvasRenderingContext2D.mozCurrentTransformInverse,1890 140 | HTMLCanvasElement.removeAttribute,1814 141 | HTMLCanvasElement.offsetTop,1812 142 | HTMLCanvasElement.children,1795 143 | HTMLCanvasElement.dispatchEvent,1698 144 | HTMLCanvasElement.mozOpaque,1687 
145 | HTMLCanvasElement.onmousemove,1538 146 | AudioContext.createDynamicsCompressor,1535 147 | HTMLCanvasElement.offsetParent,1499 148 | OfflineAudioContext.startRendering,1381 149 | OfflineAudioContext.createDynamicsCompressor,1380 150 | OfflineAudioContext.oncomplete,1380 151 | OfflineAudioContext.createOscillator,1380 152 | OfflineAudioContext.destination,1380 153 | HTMLCanvasElement.remove,1257 154 | HTMLCanvasElement.compareDocumentPosition,1253 155 | AudioContext.state,1249 156 | AudioContext.listener,1230 157 | GainNode.connect,1204 158 | AudioContext.createGain,1197 159 | GainNode.gain,1112 160 | HTMLCanvasElement.__proto__,1028 161 | window.Storage.toString,1027 162 | AudioContext.createAnalyser,905 163 | HTMLCanvasElement.cloneNode,899 164 | AudioContext.sampleRate,882 165 | AudioContext.decodeAudioData,876 166 | AudioContext.createMediaElementSource,860 167 | HTMLCanvasElement.toBlob,837 168 | HTMLCanvasElement.removeEventListener,779 169 | AnalyserNode.fftSize,774 170 | AnalyserNode.maxDecibels,771 171 | AnalyserNode.smoothingTimeConstant,770 172 | AnalyserNode.frequencyBinCount,770 173 | AnalyserNode.minDecibels,769 174 | RTCPeerConnection.addIceCandidate,769 175 | AudioContext.onstatechange,745 176 | HTMLCanvasElement.textContent,628 177 | HTMLCanvasElement.onclick,466 178 | HTMLCanvasElement.innerHTML,437 179 | window.Storage.valueOf,423 180 | RTCPeerConnection.setRemoteDescription,379 181 | RTCPeerConnection.getStats,361 182 | AudioContext.currentTime,354 183 | OscillatorNode.stop,351 184 | RTCPeerConnection.removeEventListener,346 185 | RTCPeerConnection.addEventListener,346 186 | HTMLCanvasElement.__lookupGetter__,344 187 | AudioContext.createScriptProcessor,337 188 | HTMLCanvasElement.hasOwnProperty,312 189 | HTMLCanvasElement.onmousedown,310 190 | HTMLCanvasElement.toString,291 191 | ScriptProcessorNode.connect,288 192 | ScriptProcessorNode.onaudioprocess,287 193 | AnalyserNode.connect,285 194 | HTMLCanvasElement.blur,280 195 | HTMLCanvasElement.getAttributeNode,237 196 | HTMLCanvasElement.onmouseout,232 197 | HTMLCanvasElement.onmouseover,229 198 | HTMLCanvasElement.append,227 199 | HTMLCanvasElement.onmouseup,227 200 | CanvasRenderingContext2D.ellipse,166 201 | HTMLCanvasElement.setAttributeNode,152 202 | HTMLCanvasElement.oncontextmenu,152 203 | CanvasRenderingContext2D.getLineDash,146 204 | HTMLCanvasElement.previousSibling,139 205 | HTMLCanvasElement.parentElement,136 206 | HTMLCanvasElement.innerText,134 207 | HTMLCanvasElement.onkeydown,132 208 | HTMLCanvasElement.onkeyup,129 209 | HTMLCanvasElement.onkeypress,128 210 | HTMLCanvasElement.onblur,128 211 | HTMLCanvasElement.onfocus,128 212 | HTMLCanvasElement.onmouseleave,127 213 | HTMLCanvasElement.ondblclick,126 214 | HTMLCanvasElement.ondragenter,125 215 | HTMLCanvasElement.onresize,125 216 | HTMLCanvasElement.onpaste,125 217 | HTMLCanvasElement.onchange,125 218 | HTMLCanvasElement.oncut,125 219 | HTMLCanvasElement.ondragover,125 220 | HTMLCanvasElement.ondragleave,125 221 | HTMLCanvasElement.ondrop,125 222 | HTMLCanvasElement.onmouseenter,125 223 | HTMLCanvasElement.onload,125 224 | HTMLCanvasElement.contains,102 225 | HTMLCanvasElement.querySelectorAll,98 226 | GainNode.disconnect,77 227 | AudioContext.createBufferSource,70 228 | HTMLCanvasElement.hasChildNodes,67 229 | AudioContext.createBuffer,63 230 | AudioContext.createPanner,60 231 | HTMLCanvasElement.scrollLeft,60 232 | HTMLCanvasElement.scrollTop,60 233 | CanvasRenderingContext2D.__lookupSetter__,58 234 | CanvasRenderingContext2D.__defineSetter__,58 
235 | HTMLCanvasElement.ondragstart,50 236 | HTMLCanvasElement.getClientRects,49 237 | HTMLCanvasElement.title,44 238 | HTMLCanvasElement.tabIndex,43 239 | RTCPeerConnection.close,43 240 | RTCPeerConnection.iceConnectionState,33 241 | AudioContext.close,32 242 | HTMLCanvasElement.hasAttributes,25 243 | HTMLCanvasElement.previousElementSibling,23 244 | OscillatorNode.disconnect,22 245 | HTMLCanvasElement.focus,22 246 | RTCPeerConnection.onsignalingstatechange,16 247 | RTCPeerConnection.oniceconnectionstatechange,16 248 | HTMLCanvasElement.valueOf,16 249 | HTMLCanvasElement.dataset,15 250 | HTMLCanvasElement.requestPointerLock,15 251 | HTMLCanvasElement.namespaceURI,13 252 | HTMLCanvasElement.webkitMatchesSelector,12 253 | HTMLCanvasElement.childElementCount,11 254 | HTMLCanvasElement.removeChild,8 255 | HTMLCanvasElement.insertBefore,8 256 | GainNode.numberOfOutputs,7 257 | HTMLCanvasElement.matches,6 258 | HTMLCanvasElement.outerHTML,6 259 | HTMLCanvasElement.appendChild,6 260 | AudioContext.resume,5 261 | AnalyserNode.getByteFrequencyData,5 262 | HTMLCanvasElement.clientTop,4 263 | HTMLCanvasElement.clientLeft,4 264 | HTMLCanvasElement.onwheel,4 265 | HTMLCanvasElement.DOCUMENT_NODE,4 266 | RTCPeerConnection.onaddstream,3 267 | AnalyserNode.channelInterpretation,3 268 | AnalyserNode.numberOfInputs,3 269 | AnalyserNode.channelCountMode,3 270 | AnalyserNode.numberOfOutputs,3 271 | AnalyserNode.channelCount,3 272 | HTMLCanvasElement.scrollWidth,3 273 | HTMLCanvasElement.scrollHeight,3 274 | CanvasRenderingContext2D.__proto__,3 275 | HTMLCanvasElement.getElementsByClassName,3 276 | CanvasRenderingContext2D.__defineGetter__,3 277 | HTMLCanvasElement.querySelector,2 278 | OfflineAudioContext.decodeAudioData,2 279 | RTCPeerConnection.createAnswer,2 280 | CanvasRenderingContext2D.filter,2 281 | AudioContext.createConvolver,1 282 | HTMLCanvasElement.lastChild,1 283 | CanvasRenderingContext2D.toString,1 284 | """ 285 | count = 0 286 | for line in f.split('\n'): 287 | if "window.navigator" in line: 288 | count += int(line.split(",")[1]) 289 | print(line[17:]) 290 | print("Total:", count) 291 | 292 | 293 | -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/Tracking Method Sources.md: -------------------------------------------------------------------------------- 1 | # Tracking Method Sources 2 | 3 | ## Web Tracking: Mechanisms, Implications, and Defenses 4 | Note: This source covers almost all forms of fingerprinting detected or theorised 5 | #### Session Only 6 | * Session identifiers passed through links/requests 7 | * Explicit web-form authentication 8 | * window.name DOM property 9 | 10 | #### Storage-based 11 | * HTTP cookies 12 | * Flash cookies and Java JNLP PersistenceService 13 | * Flash LocalConnection object 14 | * Silverlight Isolated Storage 15 | * HTML5 Global, Local, and Session Storage 16 | * Web SQL Database and HTML5 IndexedDB 17 | * Internet Explorer userData storage 18 | 19 | #### Cache-based 20 | * Web cache 21 | * Embedding identifiers in cached documents 22 | * Loading performance tests 23 | * ETags and Last-Modified headers 24 | * DNS lookups 25 | * Operational caches 26 | * HTTP301 redirect cache 27 | * HTTP authentication cache 28 | * HTTP Strict Transport Security cache 29 | * TLS Session Resumption cache and TLS Session IDs 30 | 31 | #### Fingerprinting 32 | * Network and location fingerprinting 33 | * Device fingerprinting 34 | * OS instance fingerprinting 35 | * Browser version fingerprinting 36 | 
* Browser instance fingerprinting using canvas 37 | * Browser instance fingerprinting using web browsing history 38 | 39 | #### Other Tracking mechanisms 40 | * Headers attached to outgoing HTTP requests 41 | * Using telephone metadata 42 | * Timing attacks 43 | * Using unconscious collaboration of the user 44 | * Clickjacking 45 | * Evercookies 46 | 47 | 48 | ## Hiding in the Crowd: an Analysis of the Effectiveness of Browser Fingerprinting at Large Scale 49 | Note: Color depth, encoding, do not track, and plugins are valuable additions to the list 50 | * User-Agent 51 | * Header-accept 52 | * Content encoding 53 | * Content language 54 | * List of plugins 55 | * Cookies enabled 56 | * Use of local/session storage 57 | * Timezone 58 | * Screen resolution & color depth 59 | * Available fonts 60 | * List of HTTP headers 61 | * Platform 62 | * Do Not Track 63 | * Canvas 64 | * WebGL Vendor 65 | * WebGL Renderer 66 | * Use of an ad blocker 67 | 68 | ## The Web Never Forgets 69 | * Canvas Fingerprinting 70 | * Evercookies/Respawning 71 | * Cookie Syncing 72 | * Flash Cookies and using flash for respawning 73 | 74 | ## Fingerprinting Information in JavaScript Implementations 75 | * Performance Fingerprinting 76 | * Whitelist Fingerprinting 77 | 78 | ## Web Tracking – A Literature Review on the State of Research 79 | TODO: Parse this list of papers for additional fingerprinting methods 80 | Stateful tracking: 81 | * The Web Never Forgets: Persistent Tracking Mechanisms in the Wild - Acar et al. (2013, 2014) 82 | * Hybrid Information Flow Monitoring Against Web Tracking - Besson et al. (2014) 83 | * Web Tracking: Mechanisms, Implications, and Defenses - Bujlow et al. (2015, 2017) 84 | * Online Tracking: A 1-million-site Measurement and Analysis - Englehardt & Narayanan (2016) 85 | * Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-class Learning - Ikram et al. (2016) 86 | * Third-Party Web Tracking: Policy and Technology - Mayer & Mitchell (2012) 87 | * Web Tracking: Overview and applicability in digital investigations - Pugliese (2015) 88 | * The Web is Watching You: A Comprehensive Review of Web-Tracking Techniques and Countermeasures - Sanchez-Rola et al. (2016) 89 | 90 | ## Web tracking: Overview and applicability in digital investigation 91 | Note: In "The Web Never Forgets" there is a reference to this paper that mentions behavioural fingerprinting. I could not access the paper. 92 | "Pugliese (2015) also mentions behavioral biometric features, namely those dynamics that occur when typing, moving and clicking the mouse, or touching a touch screen.
Such behavioral biometric features can be used to improve stateless tracking” 93 | 94 | ## Evaluating the effectiveness of defences to web tracking 95 | https://www.royalholloway.ac.uk/media/5620/rhul-isg-2018-7-techreport-darrellnewman.pdf 96 | Note: Lots to dig into here, not enough time this week to get through all of it 97 | * Javascript Engines 98 | * DOM Objects 99 | * Installed add-ons/extensions 100 | * Fonts 101 | * Canvas API 102 | * AudioContext API 103 | * Battery Status API 104 | * Emojis 105 | * SSL/TLS Handshake Fingerprinting 106 | * WebRTC fingerprinting 107 | * Behavioural Biometrics 108 | 109 | ## Remote physical device fingerprinting 110 | Note: This paper is from 2005 111 | * Clock Skews 112 | 113 | ## Papers I couldn't access 114 | SHPF: Enhancing HTTP(S) Session Security with Browser Fingerprinting -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/Tracking Methods.md: -------------------------------------------------------------------------------- 1 | Note: I'm not sure how I'd like to sort these and what level of granularity to apply. For example, evercookies is a much broader category than the window.name DOM property. 2 | 3 | # Tracking Methods 4 | * Session identifiers passed through links/requests 5 | * Explicit web-form authentication 6 | * window.name DOM property 7 | * HTTP cookies 8 | * Flash cookies and Java JNLP PersistenceService 9 | * Flash LocalConnection object 10 | * Silverlight Isolated Storage 11 | * HTML5 Global, Local, and Session Storage 12 | * Web SQL Database and HTML5 IndexedDB 13 | * Internet Explorer userData storage 14 | * Web cache 15 | * Embedding identifiers in cached documents 16 | * Loading performance tests 17 | * ETags and Last-Modified headers 18 | * DNS lookups 19 | * Operational caches 20 | * HTTP301 redirect cache 21 | * HTTP authentication cache 22 | * HTTP Strict Transport Security cache 23 | * TLS Session Resumption cache and TLS Session IDs 24 | * Network and location fingerprinting 25 | * Device fingerprinting 26 | * OS instance fingerprinting 27 | * Browser version fingerprinting 28 | * Browser instance fingerprinting using canvas 29 | * Browser instance fingerprinting using web browsing history 30 | * Headers attached to outgoing HTTP requests 31 | * Using telephone metadata 32 | * Timing attacks 33 | * Using unconscious collaboration of the user 34 | * Clickjacking 35 | * Evercookies 36 | * User-Agent 37 | * Header-accept 38 | * Content encoding 39 | * Content language 40 | * List of plugins 41 | * Cookies enabled 42 | * Use of local/session storage 43 | * Timezone 44 | * Screen resolution & color depth 45 | * Available fonts 46 | * List of HTTP headers 47 | * Platform 48 | * Do Not Track 49 | * Canvas 50 | * WebGL Vendor 51 | * WebGL Renderer 52 | * Use of an ad blocker 53 | * Canvas Fingerprinting 54 | * Evercookies/Respawning 55 | * Cookie Syncing 56 | * Flash Cookies and using flash for respawning 57 | * Performance Fingerprinting 58 | * Whitelist Fingerprinting 59 | * Javascript Engines 60 | * DOM Objects 61 | * Installed add-ons/extensions 62 | * Fonts 63 | * Canvas API 64 | * AudioContext API 65 | * Battery Status API 66 | * Emojis 67 | * SSL/TLS Handshake Fingerprinting 68 | * WebRTC fingerprinting 69 | * Behavioural Biometrics 70 | * Clock Skews 71 | 72 | ## Papers Still to Read 73 | SHPF: Enhancing HTTP(S) Session Security with Browser Fingerprinting 74 | Web Tracking: Overview and applicability in digital investigations - Pugliese (2015) 
75 | The Web is Watching You: A Comprehensive Review of Web-Tracking Techniques and Countermeasures - Sanchez-Rola et al. (2016) -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/Tracking Report Template.md: -------------------------------------------------------------------------------- 1 | # Browser Version Identification 2 | ### Summary 3 | Identifying the web browsing software and version is a core component of a stateless fingerprint. In most cases the user will provide a genuine User-Agent request header that contains the browser type, version, developer and the operating system. Some browsers obfuscate this header to protect the user's privacy. Unfortunately, because of small differences between browsers there exist ways to detect a forged User-Agent string. 4 | 5 | ### Detection 6 | #### In Literature 7 | The User-Agent string is available through the Navigator interface, so any usage of window.navigator may be accessing User-Agent data. 8 | However, a site owner could manually parse the request headers to access that data without using the Navigator interface. 9 | Identifying forged User-Agents is browser-version specific. In _How Unique Is Your Web Browser_ they found forged strings when they noticed there were iOS devices that had Flash enabled. 10 | 11 | #### In The Overscripted Dataset 12 | The Navigator interface was detected 26,361,700 times in the Overscripted dataset. 13 | I don't know how to identify whether a Navigator interface was called for tracking purposes or for functionality (to display an error message on IE6, for example). 14 | 15 | 16 | #### What else would we need to detect it? 17 | The dataset allows us to identify when a website accesses the User-Agent through the Navigator interface. 18 | It may be possible to identify sites that test for forged User-Agents by crawling with a specific User-Agent string and checking for calls to unsupported features. 19 | 20 | #### Do we see it? 21 | Using our symbol_counts.csv, we can check for all instances of the "window.navigator" interface and count them all.
22 | ``` 23 | count = 0 24 | for line in f.split('\n'): 25 | if "window.navigator" in line: 26 | count += int(line.split(",")[1]) 27 | print(line[17:]) 28 | print("Total:", count) 29 | ``` 30 | 31 | Output: 32 | ``` 33 | userAgent,15534371 34 | plugins[Shockwave Flash].description,1863285 35 | appName,1286084 36 | language,1172256 37 | platform,1140738 38 | plugins[Shockwave Flash].name,895289 39 | appVersion,707298 40 | cookieEnabled,692524 41 | vendor,487833 42 | doNotTrack,468365 43 | product,279653 44 | plugins[Shockwave Flash].filename,225751 45 | mimeTypes[application/x-shockwave-flash].type,213769 46 | languages,199435 47 | plugins[Shockwave Flash].length,184995 48 | mimeTypes[application/futuresplash].type,150364 49 | plugins[Shockwave Flash].version,144656 50 | onLine,116037 51 | mimeTypes[application/x-shockwave-flash].suffixes,94030 52 | mimeTypes[application/futuresplash].suffixes,94025 53 | productSub,71139 54 | mimeTypes[application/x-shockwave-flash].description,70284 55 | mimeTypes[application/futuresplash].description,70278 56 | oscpu,54799 57 | appCodeName,51161 58 | geolocation,43022 59 | vendorSub,26840 60 | buildID,23419 61 | Total: 26361700 62 | ``` -------------------------------------------------------------------------------- /analyses/2018_12_LABBsoft_tracking_review/window.name Report.md: -------------------------------------------------------------------------------- 1 | # window.name DOM property 2 | ### Summary 3 | The window.name property stores up to 2MB of information that persists for the lifespan of a tab. This allows sites that are closed and revisited to re-access whatever information they may have stored inside. 4 | 5 | ### Detection 6 | #### In Literature 7 | There were no examples of the detection of window.name tracking found in the web tracking papers. 8 | 9 | #### In The Overscripted Dataset 10 | We found 2,484,730 instances of window.name access. 11 | 12 | #### What else would we need to detect it? 13 | We're equipped to track the calls to the window.name property. Detection could be improved if we logged the line of the window.name call. 14 | 15 | #### Do we see it? 16 | Using our symbol_counts.csv, we can check for all instances of the "window.name" property and count them. 17 | ``` 18 | count = 0 19 | target = ["window.name"] 20 | for line in f.split('\n'): 21 | for t in target: 22 | if t in line: 23 | count += int(line.split(",")[1]) 24 | print(line) 25 | ``` 26 | 27 | Output: 28 | ``` 29 | window.name,2484730 30 | ``` -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/1-get_script_urls/README.md: -------------------------------------------------------------------------------- 1 | # Extract script URLs from dataset 2 | _Note: this section uses PySpark_ 3 | 4 | ## 1) Setup Spark: 5 | 6 | 1) You must have OpenJDK 8. Install via `$ sudo apt-get install openjdk-8-jdk` if you don't have this installed (check if `/usr/lib/jvm/java-8-openjdk-amd64/` exists). 7 | 8 | 2) Download the latest version of [Spark](https://spark.apache.org/downloads.html) (prebuilt for Apache Hadoop 2.7 and later), and unpack the tar to a directory of your choosing.
9 |
10 | 3) Set some environment variables:
11 | ```
12 | $ export PYSPARK_PYTHON=python3
13 | $ export PATH=${PATH}:/path/to/spark--bin-hadoop2.7/bin
14 | $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
15 | ```
16 |
17 | To run any of the PySpark scripts on their own, you can run
18 | ```
19 | $ spark-submit sparkscript.py
20 | ```
21 |
22 | **Note: you may need to increase the Spark driver memory.**
23 | In `conf/spark-defaults.conf` (inside your Spark directory), add the line `spark.driver.memory 15g`, or whatever is acceptable for your system.
24 |
25 |
26 | ## 2) Running the scripts:
27 |
28 | Replace the appropriate directories in `config.ini` for your system. You will need the [Full Mozilla Overscripted Dataset](https://github.com/mozilla/Overscripted-Data-Analysis-Challenge).
29 |
30 | With Spark set up as in part 1, run:
31 | ```
32 | $ spark-submit generate_url_list_spark.py
33 | ```
34 |
35 | To test individual user-specified URLs, you can place URLs in `test_urls.csv` and run `$ spark-submit test_generate_url_list_spark.py`. This will output `parsed_test_urls.csv` using the exact same process as on the full dataset for easy debugging and sanity checks.
36 |
37 | A Jupyter notebook (`explore_url_lists.ipynb`) with barebones loading and displaying of the data is included to make exploring the results easier.
38 |
39 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/config.ini:
--------------------------------------------------------------------------------
1 | # Note: Make sure to not add comments on any lines being parsed in!
2 | [DEFAULT]
3 |
4 | # Location of the top of where data is located
5 | #datatop =
6 | datatop = /mnt/Data/UCOSP_DATA
7 |
8 | # Location of the parent directory of all parquet subfolders (iterates over)
9 | #parquet_dataset =
10 | parquet_dataset = full_data/*
11 |
12 | # Specify output directory (must create yourself!)
13 | # Make sure this directory exists!
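# (The script joins datatop and output_dir, so with the values in this file
#  that would be, for example: $ mkdir -p /mnt/Data/UCOSP_DATA/resources/full_url_list_parsed_v2 )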
14 | #output_dir
15 | output_dir = resources/full_url_list_parsed_v2
16 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/generate_url_list_spark.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | from os import path
3 | import sys
4 | from slugify import slugify
5 | from pyspark.sql import SparkSession, functions, types
6 |
7 | # Safety for spark stuff
8 | spark = SparkSession.builder.appName('URL extractor').getOrCreate()
9 | assert sys.version_info >= (3, 4)  # make sure we have Python 3.4+
10 | assert spark.version >= '2.1'  # make sure we have Spark 2.1+
11 |
12 | # UDF to generate a text file name from the script URL
13 | def shorten_name(url_name):
14 |     # Strip out 'http', 'https', '/', and '.js'
15 |     shortened_url = url_name.replace(
16 |         'https://', ''
17 |     ).replace(
18 |         'http://', ''
19 |     ).replace(
20 |         '/', '_'
21 |     ).replace(
22 |         '.js', ''
23 |     )
24 |
25 |     # Shorten url to 250 characters (max the file system can support)
26 |     shortened_url = slugify(shortened_url)[:250]
27 |
28 |     # Specify the suffix for each downloaded file
29 |     suffix = '.txt'
30 |
31 |     # Final output
32 |     file_name = shortened_url + suffix
33 |     return file_name
34 |
35 | def main():
36 |
37 |     # Specify target directory
38 |     config = configparser.ConfigParser()
39 |     config.read('config.ini')
40 |
41 |     datatop = config['DEFAULT']['datatop']
42 |     parquet_dataset = path.join(datatop, config['DEFAULT']['parquet_dataset'])
43 |     output_dir = path.join(datatop, config['DEFAULT']['output_dir'])
44 |
45 |     # Read in dataset, selecting the 'script_url' column and filter duplicates
46 |     data = spark.read.parquet(parquet_dataset).select('script_url').distinct()
47 |
48 |     # Split the string on reserved url characters to get the canonical url
49 |     data = data.withColumn(
50 |         "parsed_url",
51 |         functions.split("script_url", r"[\?\#\,\;]")[0]
52 |     ).distinct()
53 |
54 |     # Only keep urls that are actually .js files
55 |     data = data.filter(
56 |         data["parsed_url"].rlike(r"\.js$")
57 |     ).dropDuplicates(["parsed_url"])
58 |
59 |     # User Defined Function to convert script URL to a filename usable by ext4
60 |     shorten_udf = functions.udf(shorten_name, returnType=types.StringType())
61 |
62 |     # Apply the UDF over the whole list to generate a new column 'filename'
63 |     data = data.withColumn(
64 |         'filename',
65 |         shorten_udf(data.parsed_url)
66 |     ).sort('filename')
67 |
68 |     # Save the data to parquet files
69 |     data.write.parquet(output_dir)
70 |
71 |
72 | if __name__ == '__main__':
73 |     main()
74 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/requirements.txt:
--------------------------------------------------------------------------------
1 | appdirs==1.4.3
2 | attrs==18.2.0
3 | Click==7.0
4 | pkg-resources==0.0.0
5 | python-slugify==1.2.6
6 | toml==0.10.0
7 | Unidecode==1.0.23
8 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/1-get_script_urls/test_generate_url_list_spark.py:
--------------------------------------------------------------------------------
1 | import sys
2 | from slugify import slugify
3 | from pyspark.sql import SparkSession, functions, types
4 |
5 | # Safety for spark stuff
6 | spark = SparkSession.builder.appName('URL
extractor').getOrCreate() 7 | assert sys.version_info >= (3, 4) # make sure we have Python 3.4+ 8 | assert spark.version >= '2.1' # make sure we have Spark 2.1+ 9 | 10 | # UDF to generate a text file from the script URL 11 | def shorten_name(url_name): 12 | # Strip out 'http', 'https', '/', and '.js' 13 | shortened_url = url_name.replace( 14 | 'https://', '' 15 | ).replace( 16 | 'http://', '' 17 | ).replace( 18 | '/', '_' 19 | ).replace( 20 | '.js', '' 21 | ) 22 | 23 | # Shorten url to 250 characters (max file system can support) 24 | shortened_url = slugify(shortened_url)[:250] 25 | 26 | # Specify the suffix for each downloaded file 27 | suffix = '.txt' 28 | 29 | # Final output 30 | file_name = shortened_url + suffix 31 | return file_name 32 | 33 | def main(): 34 | 35 | # Specify test file, a csv of urls to parse 36 | TEST_FILE = "test_urls.csv" 37 | OUTPUT_FILE = "parsed_test_urls.csv" 38 | 39 | # Read in dataset, selecting the 'script_url' column and filter duplicates 40 | data = spark.read.csv(TEST_FILE,header='true').distinct() 41 | 42 | # Split the string on reserved url characters to get canonical url 43 | data = data.withColumn( 44 | "parsed_url", 45 | functions.split("script_url", "[\?\#\,\;]")[0] 46 | ).distinct() 47 | 48 | # Only keep urls that are actually .js files 49 | data = data.filter( 50 | data["parsed_url"].rlike("\.js$") 51 | ).dropDuplicates(["parsed_url"]) 52 | 53 | # User Defined Function to convert script URL to a filename usable by ext4 54 | shorten_udf = functions.udf(shorten_name, returnType=types.StringType()) 55 | 56 | # Apply the UDF over the whole list to generate a new column 'filename' 57 | data = data.withColumn( 58 | 'filename', 59 | shorten_udf(data.parsed_url) 60 | )#.sort('filename') 61 | 62 | # Save the data to parquet files 63 | data.toPandas().to_csv(OUTPUT_FILE) 64 | 65 | if __name__ == '__main__': 66 | main(); 67 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/1-get_script_urls/test_urls.csv: -------------------------------------------------------------------------------- 1 | script_url 2 | https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com 3 | https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js 4 | http://cpro.baidustatic.com/cpro/ui/noexpire/js/4.0.1/adClosefeedbackUpgrade.min.js 5 | https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=fe1ad16a94c816&origin=http%3A%2F%2Farabi21.com 6 | https://static.dynamicyield.com/scripts/12290/dy-coll-min.js 7 | https://www.syracuse.edu/about/ 8 | https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL 9 | https://www.syracuse.edu/wp-includes/js/wp-emoji-release.min.js?ver=4.9.1 10 | https://www.google-analytics.com/analytics.js 11 | https://code.jquery.com/jquery-migrate-1.4.1.min.js 12 | https://www.syracuse.edu/wp-content/themes/g6-carbon/js/carbon-all.js?ver=6.3.6 13 | https://www.syracuse.edu/wp-includes/js/wp-embed.min.js?ver=4.9.1 14 | https://syr-piwik-prod.syr.edu/piwik.js 15 | http://ads.pubmatic.com/AdServer/js/showad.js#PIX&kdntuid=1&SPug=true&p=37855&predirect=http%3A%2F%2Fdelivery.swid.switchadhub.com%2Fadserver%2Fuser_sync.php%3FSWID%3Dc3a5b350611a2a8227e5f3e7fb4785fc%26sKey%3DPM3%26sVal%3D%26do%5Bsingle%5D%3D1&it=0&np=0 16 | 
https://z.moatads.com/keplerpaypaldcm168224283233/moatad.js#moatClientLevel1=11459972&moatClientLevel2=3346219&moatClientLevel3=210658248&moatClientLevel4=95946136&zMoatG=ct=US&st=&city=0&dma=0&zp=&bw=4&zMoatUSER=AMsySZYB24A5c8yM4Xpv9GSzr-9f 17 | https://securepubads.g.doubleclick.net/gpt/pubads_impl_170.js 18 | http://cdn.optimizely.com/js/549871026.js 19 | http://w.sharethis.com/button/sharethis.js#&offsetLeft=-283&offsetTop=-7&publisher=8a80909b-bbba-4773-ad9c-57ff5f2349d5&type=website&post_services=email%2Cfacebook%2Ctwitter%2Cgbuzz%2Cmyspace%2Cdigg%2Csms%2Cwindows_live%2Cdelicious%2Cstumbleupon%2Creddit%2Cgoogle_bmarks%2Clinkedin%2Cbebo%2Cybuzz%2Cblogger%2Cyahoo_bmarks%2Cmixx%2Ctechnorati%2Cfriendfeed%2Cpropeller%2Cwordpress%2Cnewsvine&button=false 20 | http://ajax.googleapis.com/ajax/libs/swfobject/2.2/swfobject.js 21 | http://www.google-analytics.com/analytics.js 22 | http://cdn.mplxtms.com/s/MasterTMS.min.js#2076 23 | http://b.scorecardresearch.com/beacon.js 24 | https://www.googletagmanager.com/gtm.js?id=GTM-WZQCWK 25 | http://dthq3mor50viz.cloudfront.net/zbajck9faU.js 26 | http://pagead2.googlesyndication.com/pagead/osd.js 27 | http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/jwplayer7/jwplayer.js 28 | http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/cufon-yui.js 29 | https://apis.google.com/js/plusone.js 30 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/2-scrape_js/README.md: -------------------------------------------------------------------------------- 1 | # Asynchronous Batch Downloading of JavaScript files 2 | 3 | `async_js_get.py` iterates over the parquet files generated in part 1 to asynchronously download all of the files. The script first scans your specified output directory to skip over the already downloaded files. No arguments are needed, just ensure that you satisfied all of the requirements specified in `requirements.txt`. Run with: 4 | ``` 5 | $ ./async_js_get.py > js_status.csv 6 | ``` 7 | This saves the status responses for some analysis if so desired later. Note that the error codes may have some characters invalidating the schema (namely extra commas). 
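If you later want to load `js_status.csv` with pandas, a minimal sketch (assuming a recent pandas; the bad-line option has been renamed across versions, older releases use `error_bad_lines=False`) is to skip the malformed rows rather than fail on them:
```
import pandas as pd

# Rows whose error message contains extra commas break the url,status schema;
# skip them instead of aborting the whole load.
status = pd.read_csv("js_status.csv", on_bad_lines="skip")
print(status["status"].value_counts().head())
```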
8 |
9 | You may run into some system configuration issues; you may need to increase the ulimit (3000 worked for me):
10 |
11 | ```
12 | $ ulimit -n 3000
13 | ```
14 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/async_js_get.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | #
3 | # Original code: Cristian Garcia, Sep 21 '18
4 | # https://medium.com/@cgarciae/making-an-infinite-number-of-requests-with-
5 | # python-aiohttp-pypeln-3a552b97dc95
6 | #
7 | # Script adapted by David Dobre Nov 20 '18:
8 | # Added parquet loading, iteration over dataframes, and content saving
9 | #
10 | # NOTE: You may need to increase the ulimit (3000 worked for me):
11 | #
12 | #   $ ulimit -n 3000
13 | #
14 | ################################################################################
15 |
16 | from aiohttp import ClientError, ClientSession, TCPConnector
17 | import asyncio, concurrent.futures  # the latter for its TimeoutError, caught below
18 | import configparser
19 | import glob
20 | import os, os.path
21 | import pandas as pd
22 | import ssl
23 | import sys
24 |
25 | from pathlib import Path
26 | from pypeln import asyncio_task as aio
27 |
28 | ##### Specify directories and parameters
29 | config = configparser.ConfigParser()
30 | config.read('config.ini')
31 |
32 | # Top directory
33 | datatop = config['DEFAULT']['datatop']
34 |
35 | # Input directory
36 | url_list = os.path.join(datatop, config['DEFAULT']['url_list'])
37 |
38 | # Output directory
39 | output_dir = os.path.join(datatop, config['DEFAULT']['output_dir'])
40 |
41 | # Max number of workers
42 | limit = config['DEFAULT'].getint('limit')
43 |
44 | print(datatop)
45 | print(url_list)
46 | print(output_dir)
47 | print(limit)
48 |
49 | # Disable SSL hostname verification so certificate/hostname mismatches don't abort downloads
50 | ssl.match_hostname = lambda cert, hostname: True
51 |
52 | ##### Load in dataset ##########################################################
53 | parquet_dir = Path(url_list)
54 | input_data = pd.concat(
55 |     pd.read_parquet(parquet_file)
56 |     for parquet_file in parquet_dir.glob('*.parquet')
57 | )
58 |
59 | # Check for existing files in output directory
60 | existing_files = [os.path.basename(x) for x in glob.glob(output_dir + "*.txt")]
61 |
62 | # Remove those from the "to request" list as they're already downloaded
63 | input_data = input_data[~input_data['filename'].isin(existing_files)]
64 |
65 | # Append filename to the output folder and get the raw string value
66 | input_data['filename'] = output_dir + input_data['filename']
67 | input_data = input_data.values
68 |
69 | print("url,status,")
70 |
71 | ##### Async fetch ##############################################################
72 | async def fetch(data, session):
73 |
74 |     url = data[1]
75 |     filename = data[2]
76 |
77 |     try:
78 |         async with session.get(url, timeout=2) as response:
79 |             output = await response.read()
80 |             print("{},{},".format(url, response.status))
81 |
82 |             if (response.status == 200 and output):
83 |                 with open(filename, "wb") as source_file:
84 |                     source_file.write(output)
85 |                 return output
86 |
87 |             return response.status
88 |
89 |     # Catch exceptions
90 |     except ClientError as e:
91 |         print("{},{},".format(url, e))
92 |         return e
93 |
94 |     except asyncio.TimeoutError as e:
95 |         print("{},{},".format(url, e))
96 |         return e
97 |
98 |     except ssl.CertificateError as e:
99 |         print("{},{},".format(url, e))
100 |         return e
101 |
102 |     except ssl.SSLError as e:
103 |         print("{},{},".format(url, e))
104 |         return e
105 |
106 |     except ValueError as e:
107 |         print("{},{},".format(url, e))
108 |         return e
109 |
110 |     except TimeoutError as e:
111 |         print("{},{},".format(url, e))
112 |         return e
113 |
114 |     except concurrent.futures.TimeoutError as e:
115 |         print("{},{},".format(url, e))
116 |         return e
117 |
118 |
119 | ##### Iterate over each list entry #############################################
120 | aio.each(
121 |     fetch,            # worker function
122 |     input_data,       # input arguments
123 |     workers = limit,  # max number of workers
124 |     on_start = lambda: ClientSession(connector=TCPConnector(limit=None)),
125 |     on_done = lambda _status, session: session.close(),
126 |     run = True,
127 | )
128 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/config.ini:
--------------------------------------------------------------------------------
1 | # Note: Make sure to not add comments on any lines being parsed in!
2 | [DEFAULT]
3 |
4 | # Location of the top of where data is located
5 | #datatop =
6 | datatop = /mnt/Data/UCOSP_DATA
7 |
8 | # Specify the directory containing the output of Pt. 1
9 | url_list = resources/full_url_list_parsed
10 |
11 | # Specify directory to save all text files (make sure it exists!)
12 | output_dir = js_source_files
13 |
14 | # Worker limit (larger values fail for me, experiment)
15 | limit = 20
16 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/README.md:
--------------------------------------------------------------------------------
1 | # JavaScript Redundancy Analysis
2 |
3 | The purpose of these scripts is to analyze the difference between the uncleansed `script_url`s from the full dataset and their parsed counterparts. The parsed 'condensed dataset' essentially omits any queries that were included in the `script_url`, condensing the number of entries from over a million to fewer than 200,000.
4 |
5 | `extract_hashes_from_full_dataset.py` generates a dictionary of hashes for every file (with `utf-8` encoding) in the specified directory. It is far too expensive to try to do this whole process in one step - filesystems are not fond of having over 800,000 files in one directory. The pickled output is used in the next step.
6 |
7 | `compare_condensed_with_full.py` uses the data from the above script and iterates over the specified condensed dataset, generating a final pickled output of a pandas dataframe with a schema of `['parent_filename', 'filename', 'hash']`, allowing the comparison between hashes sharing the same base URL.
8 |
9 | `explore_downloads.ipynb` is a Jupyter notebook containing a brief analysis of the redundancy of the JS files when making requests with queries.
10 |
11 | Note: for a small section on the HTTP status response, you will require `js_status.csv`, which is the output of Pt. 2's download script dumped into a text file. This stores the URLs and their HTTP response status. This is not necessary for the rest of the analysis, so you can remove the code associated with it.
12 |
13 |
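As a pointer for working with the final output, a minimal sketch (assuming the pickled dataframe produced by `compare_condensed_with_full.py`, with the `['parent_filename', 'filename', 'hash']` schema described above):
```
import pandas as pd

df = pd.read_pickle("final_processed.pickle")

# For each base script, count the distinct hashes among its query variants;
# a count of 1 means every variant served byte-identical JavaScript.
unique_hashes = df.groupby("parent_filename")["hash"].nunique()
print("fraction of fully redundant base scripts:", (unique_hashes == 1).mean())
```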
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/compare_condensed_with_full.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import sys
3 | import hashlib
4 | import numpy as np
5 | import pandas as pd
6 | import glob
7 | from pathlib import Path
8 | import pickle
9 | from pypeln import asyncio_task as aio
10 | from os import path
11 |
12 | STORAGE_DIR = "/mnt/Data/UCOSP_DATA"
13 |
14 | FULL_URL_LIST = "full_data.pickle"
15 |
16 | CLEANED_FILES = path.join(STORAGE_DIR, ("js_source_files" + "/*"))
17 |
18 | OUTPUT_FILE = "final_processed.pickle"
19 |
20 | ##### Load in dataset
21 | print("Retrieving list from: '{}'".format(CLEANED_FILES))
22 | input_data_cleaned = list(glob.glob(CLEANED_FILES))
23 |
24 | with open(FULL_URL_LIST, "rb") as handle:
25 |     input_data_full = pickle.load(handle)
26 |
27 | # Sanity check
28 | print(
29 |     "\nThere are {} urls found in the cleaned dataset:\n\t'{}'".format(
30 |         len(input_data_cleaned), CLEANED_FILES
31 |     )
32 | )
33 | print(
34 |     "\nThere are {} hashes found in the complete dataset:\n\t'{}'".format(
35 |         len(input_data_full), FULL_URL_LIST
36 |     )
37 | )
38 |
39 | # Generating new dataframe
40 | def get_hash_from_file(filename):
41 |     sha1 = hashlib.sha1()
42 |     with open(filename, "r") as f:
43 |         data = f.read()
44 |         sha1.update(data.encode("utf-8"))
45 |     return sha1.hexdigest()
46 |
47 | output_success = []
48 | output_fails = []
49 | counter_success = 0
50 | counter_failed = 0
51 |
52 | # Iterate over all of the cleansed dataset
53 | for filename in input_data_cleaned:
54 |     if counter_success % 5000 == 0:
55 |         print("{}/{}".format(counter_success, len(input_data_cleaned)))
56 |
57 |     # Get source url
58 |     raw_filename = filename.split("/")[-1]
59 |
60 |     # Try to get the file hash, only works with utf-8
61 |     try:
62 |         file_hash = get_hash_from_file(filename)
63 |
64 |     except UnicodeDecodeError as e:
65 |         print("Bad type:\n{}".format(e))
66 |         counter_failed += 1
67 |         output_fails.append(raw_filename)
68 |         continue
69 |
70 |     # Create an entry for the parent and append it
71 |     parent_dict = {
72 |         "parent_filename": raw_filename,
73 |         "filename": raw_filename,
74 |         "hash": file_hash,  # reuse the hash computed above rather than re-hashing
75 |     }
76 |
77 |     output_success.append(parent_dict)
78 |
79 |     # Now search for all entries in the complete crawl with the same base url
80 |     search = raw_filename.split(".txt")[0]
81 |     for key in input_data_full:
82 |
83 |         if key.startswith(search) and key != raw_filename:
84 |             child_dict = {
85 |                 "parent_filename": raw_filename,
86 |                 "filename": key,
87 |                 "hash": input_data_full[key],
88 |             }
89 |             output_success.append(child_dict)
90 |     counter_success += 1
91 |
92 | # Pickle output
93 | df = pd.DataFrame(output_success)
94 | df.to_pickle(OUTPUT_FILE)
95 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/2-scrape_js/downloads_analysis/extract_hashes_from_full_dataset.py:
--------------------------------------------------------------------------------
1 | #!
/usr/bin/env python3 2 | import sys 3 | import hashlib 4 | import glob 5 | from pathlib import Path 6 | from os import path 7 | import pickle 8 | 9 | STORAGE_DIR = "/mnt/Data/UCOSP_DATA" 10 | 11 | ALL_FILES = path.join(STORAGE_DIR, ("1st_batch_js_source_files" + "/*")) 12 | 13 | OUTPUT_FILE = "full_data.pickle" 14 | 15 | OUTPUT_FILE_FAILS = "fails.pickle" 16 | 17 | # Generating new dataframe 18 | def get_hash_from_file(filename): 19 | sha1 = hashlib.sha1() 20 | with open(filename, "r") as f: 21 | data = f.read() 22 | sha1.update(data.encode("utf-8")) 23 | return sha1.hexdigest() 24 | 25 | # Get file list 26 | print("Retrieving list from: '{}'".format(ALL_FILES)) 27 | total_file_list = list(glob.glob(ALL_FILES)) 28 | total_num_files = len(total_file_list) 29 | print("Retrieved list of {} files.\n".format(total_num_files)) 30 | 31 | output_dict = {} 32 | output_fails = [] 33 | counter = 0 34 | fails = 0 35 | 36 | # Iterate over entire file list 37 | for filename in total_file_list: 38 | if counter % 20000 == 0: 39 | print(80 * "#" + "\n") 40 | print("{}/{}".format(counter, total_num_files)) 41 | print(80 * "#" + "\n") 42 | 43 | raw_filename = filename.split("/")[-1] 44 | try: 45 | file_hash = get_hash_from_file(filename) 46 | except UnicodeDecodeError as e: 47 | print("Bad type:\n{}".format(e)) 48 | fails += 1 49 | output_fails.append(raw_filename) 50 | continue 51 | 52 | output_dict.update({raw_filename: file_hash}) 53 | counter += 1 54 | 55 | print( 56 | "{}/{}\n\nDONE... Now pickling data to: '{}'".format( 57 | counter, total_num_files, OUTPUT_FILE 58 | ) 59 | ) 60 | 61 | with open(OUTPUT_FILE, "wb") as f: 62 | # Pickle the 'data' dictionary using the highest protocol available. 63 | pickle.dump(output_dict, f, pickle.HIGHEST_PROTOCOL) 64 | 65 | # Store fails for a later time 66 | with open(OUTPUT_FILE_FAILS, "wb") as f: 67 | pickle.dump(output_fails, f) 68 | 69 | print("DONE\nFinal stats:\nSuccess:\t{}\nFails:\t{}".format(counter, fails)) 70 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/2-scrape_js/requirements.txt: -------------------------------------------------------------------------------- 1 | aiohttp==3.4.4 2 | async-timeout==3.0.1 3 | attrs==18.2.0 4 | chardet==3.0.4 5 | idna==2.8 6 | idna-ssl==1.1.0 7 | multidict==4.5.2 8 | numpy==1.15.4 9 | pandas==0.23.4 10 | pkg-resources==0.0.0 11 | pypeln==0.1.6 12 | python-dateutil==2.7.5 13 | pytz==2018.7 14 | six==1.12.0 15 | yarl==1.3.0 16 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/2-scrape_js/single_js_get.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import requests 3 | import sys 4 | 5 | url = sys.argv[1] 6 | 7 | TIMEOUT = 2 8 | OUT_FILENAME = "downloaded.js" 9 | 10 | try: 11 | response = requests.get(url, timeout=TIMEOUT) 12 | 13 | if response.status_code == 200: 14 | print("File found - writing contents to {}".format(OUT_FILENAME)) 15 | 16 | with open(OUT_FILENAME, "w") as source_file: 17 | source_file.write(response.text) 18 | 19 | else: 20 | print("Error! 
Status code: {}".format(response.status_code))
21 |
22 | except requests.exceptions.RequestException as e:
23 |     print("Exception")
24 |     print(e)
25 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/README.md:
--------------------------------------------------------------------------------
1 | # API Symbol extraction from JSON database:
2 |
3 | Requires Mozilla's API list found on the [browser compatibility data github](https://github.com/mdn/browser-compat-data/tree/master/api). For this reference, I've downloaded all of the json files into a directory called `api/`.
4 |
5 | To run, configure the provided `config.ini` file. Make sure to specify both the directory of the api json files (found above) as well as a list of those to generate symbol lists for. Run with:
6 | ```
7 | $ ./process_APIs.py
8 | ```
9 |
10 | The program will run through the specified .json files, extracting methods/symbols, and then checking whether these methods/symbols also exist as .json files in the json directory. If they do (and they haven't already been parsed), the program will also check through those files.
11 |
12 | It spits out a json file with a user-specified name, with keys corresponding to the original interface, and values containing the lists of the corresponding symbols and methods.
13 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/config.ini:
--------------------------------------------------------------------------------
1 | # Note: Make sure to not add comments on any lines being parsed in!
2 | [DEFAULT]
3 |
4 | # Location of the file containing a list of all files you want to generate symbol lists for
5 | seed = master.txt
6 |
7 | # Location of the folder containing all API json files
8 | api_data = api/
9 |
10 | # Specify output file containing the symbol list
11 | output = symbol_dict.json
12 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/3-generate_symbols_of_interest/process_APIs.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import configparser
3 | from csv import DictWriter
4 | import json
5 | from os import listdir
6 | import sys
7 |
8 | # want to return a dict (lowercase : FileName.json) to recursively get nested
9 | # methods and properties.
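# For example (illustrative file names, assuming the api/ directory was
# downloaded from browser-compat-data), the returned mapping looks like:
#   {"navigator": "Navigator.json", "geolocation": "Geolocation.json"}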
10 | def getAllFileDict(api_data):
11 |     master_API_dict = {}
12 |
13 |     for entry in listdir(api_data):
14 |         key = str(entry.split(".")[0]).lower()  # lowercase, as the lookups below expect
15 |         master_API_dict[key] = entry
16 |
17 |     return master_API_dict
18 |
19 |
20 | # import all data from specified filename
21 | def importData(filename):
22 |
23 |     # read in JSON data
24 |     with open(filename, encoding='utf-8') as data_file:
25 |         data = json.loads(data_file.read())
26 |
27 |     return data['api']
28 |
29 |
30 | # once imported data from file, extract the desired properties
31 | def extractProperties(json_data):
32 |
33 |     # this only works if guaranteed that json file has one key under 'api'
34 |     interface_name = list(json_data.keys())[0]
35 |     property_list = list(json_data[interface_name].keys())
36 |     if "__compat" in property_list:
37 |         property_list.remove("__compat")
38 |
39 |     return str(interface_name), property_list
40 |
41 |
42 | def recursivelyGetProperties(current_interface, res_dict, master_API_dict, api_data):
43 |
44 |     # import data from file
45 |     data = importData(current_interface)
46 |
47 |     # extract interface name and its associated property list
48 |     interface_name, property_list = extractProperties(data)
49 |
50 |     # add entries into the output
51 |     res_dict[interface_name] = property_list
52 |
53 |     # iterate over all properties, search if they exist in the master API list
54 |     for entry in property_list:
55 |
56 |         # (all keys within master_API_dict are lowercase)
57 |         entry = str.lower(entry)
58 |         if (entry in master_API_dict):
59 |
60 |             # if they exist, make sure they aren't already in the results dict
61 |             if (entry not in map(str.lower, res_dict.keys())):
62 |
63 |                 # take result, create new seed and recursively fill res_dict
64 |                 new_seed = api_data + master_API_dict[entry]
65 |                 res_dict = recursivelyGetProperties(new_seed, res_dict, \
66 |                            master_API_dict, api_data)
67 |     return res_dict
68 |
69 | def main():
70 |     config = configparser.ConfigParser()
71 |     config.read('config.ini')
72 |
73 |     seed_apis = config['DEFAULT']['seed']
74 |     api_data = config['DEFAULT']['api_data']
75 |     output = config['DEFAULT']['output']
76 |
77 |     # init empty dict to store all results
78 |     res_dict = {}
79 |
80 |     # read in all available files to recurse over
81 |     master_API_dict = getAllFileDict(api_data)
82 |
83 |     # open seed file
84 |     file_list = open(seed_apis).read().splitlines()
85 |
86 |     # iterate over the list of specified files
87 |     for entry in file_list:
88 |
89 |         # generate a nice filename
90 |         file_location = api_data + entry
91 |         res_dict = recursivelyGetProperties(file_location, res_dict, \
92 |                    master_API_dict, api_data)
93 |
94 |     # dump all results to a file
95 |     with open(output, 'w') as fp:
96 |         json.dump(res_dict, fp, indent=4)
97 |
98 | if __name__ == '__main__':
99 |     main()
100 |
--------------------------------------------------------------------------------
/analyses/2018_12_ddobre_static_analysis/4-ast_analysis/README.md:
--------------------------------------------------------------------------------
1 | # Analyze Syntax Tree Counts
2 |
3 | This portion performs the actual counting of API symbol calls by generating an abstract syntax tree (AST) and walking through it (breadth-first) to extract symbol information. The AST generation is performed by [Esprima](https://github.com/Kronuz/esprima-python).
4 |
5 | There are two scripts. `async_tree_explorer.py` uses the parameters specified in `config.ini` to asynchronously generate symbol counts for all files within a directory.
Depending on your machine, this may take a long time (10,000 files took about 12 hours in one sample using 16 workers). There isn't a significant memory overhead, so depending on the size of your dataset, 16-32GB of RAM should be enough. The output is a pandas dataframe with the schema: `['filename', 'symbol_1' , ... , 'symbol_n']`. This results in NaN values for scripts that do not have a symbol call that other scripts do. The benefit of this is that one can analyze specific symbols.
6 |
7 | `single_tree_explorer.py` performs the same analysis as `async_tree_explorer.py` on a single file specified by the user at runtime. Run it with:
8 | ```
9 | $ ./single_tree_explorer.py
10 | ```
11 | This generates two outputs in `output_data`. `symbol_counts.json` holds the counts of bottom-level calls; for example, `window.location.href` would be counted as `href`. `extended_symbol_counts.json` attempts to recover the whole symbol, `window.location.href`. There are limitations to this, however. If the calls were broken up over several different functions, the AST has no way of recovering that information, so you might be left with `location.href`. Depending on the symbol, it might be possible to infer the parent APIs, but there are many cases of conflict, which makes this not guaranteed. Furthermore, if the script is heavily obfuscated, then this approach will almost certainly fail.
12 |
13 | The cases this script looks for are `MemberExpressions`, `CallExpressions`, and `object`s within either of the expressions.
14 |
15 | Looking at some examples, the JavaScript line
16 | ```js
17 | if ((new RegExp("WebKit")).test(navigator.userAgent))
18 | ```
19 | parses out to:
20 | ```json
21 | "type": "IfStatement",
22 | "test": {
23 |     "type": "CallExpression",
24 |     "callee": {
25 |         "type": "MemberExpression",
26 |         "computed": false,
27 |         "object": {
28 |             "type": "NewExpression",
29 |             "callee": {
30 |                 "type": "Identifier",
31 |                 "name": "RegExp"
32 |             },
33 |             "arguments": [
34 |                 {
35 |                     "type": "Literal",
36 |                     "value": "WebKit",
37 |                     "raw": "\"WebKit\""
38 |                 }
39 |             ]
40 |         },
41 |         "property": {
42 |             "type": "Identifier",
43 |             "name": "test"
44 |         }
45 |     },
46 |     "arguments": [
47 |         {
48 |             "type": "MemberExpression",
49 |             "computed": false,
50 |             "object": {
51 |                 "type": "Identifier",
52 |                 "name": "navigator"
53 |             },
54 |             "property": {
55 |                 "type": "Identifier",
56 |                 "name": "userAgent"
57 |             }
58 |         }
59 |     ]
60 | },
61 | ...
62 | ```
63 | The syntax tree walker will focus on nodes with the type `MemberExpression` or `CallExpression`. Focusing on the node which has the appropriate type:
64 | ```json
65 | {
66 |     "type": "MemberExpression",
67 |     "computed": false,
68 |     "object": {
69 |         "type": "Identifier",
70 |         "name": "navigator"
71 |     },
72 |     "property": {
73 |         "type": "Identifier",
74 |         "name": "userAgent"
75 |     }
76 | }
77 | ```
78 | The script first grabs the `property.name` parameter of this node, which is `userAgent`. The value is then compared to the list of symbols fed in, and if it does not exist in that list, this node is ignored. This would be the "outermost" symbol call, so this value is set to be the current output string (`output = userAgent`). The algorithm now recursively checks for objects in the node and repeats the first part of checking for types and properties. In this case, the object type is `Identifier`, so this is the bottom of this branch, and this value is prepended before the final value, i.e. `tmp = navigator.userAgent` (a minimal sketch of this unrolling is shown below).
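A minimal sketch of the unrolling (using esprima-python on a toy one-liner; a hand-rolled loop rather than the repository's full visitor classes):
```python
import esprima

ast = esprima.parseScript("var ua = navigator.userAgent;")
node = ast.body[0].declarations[0].init  # the MemberExpression node

parts = []
while node.type == "MemberExpression":   # collect property names, outermost first
    parts.append(node.property.name)
    node = node.object
if node.type == "Identifier":            # bottom of the branch
    parts.append(node.name)

print(".".join(reversed(parts)))         # -> navigator.userAgent
```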
Note that the object's type might have been `MemberExpression` or `CallExpression`, in which case `property.name` is prepended to the output, and then another `object` check is done (until the object is of type `Identifier`). 79 | 80 | In the case of `CallExpression`, the same logic applies but instead of looking for `node.property.name`, we must look for `node.callee.name`. For example: 81 | ```json 82 | "type": "CallExpression", 83 | "callee": { 84 | "type": "Identifier", 85 | "name": "setInterval" 86 | }, 87 | ``` 88 | where `setInterval` is the symbol extracted. 89 | 90 | Note that this approach is vulnerable to prepending nodes which share names with valid symbols in the symbol list. This is largely the case for single character object names, such as `d.userAgent`, where `d` is a valid symbol in `DOMMatrixReadOnly`. This is mitigated by assuming that parent objects are unlikely to be of length 1 or 2, so any parents which have only one or two characters are discarded. The only vulnerable two character values would be `id`, `go`, `as`, `ch`, and `db` (using the provided master list). 91 | 92 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/async_tree_explorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import esprima 4 | import configparser 5 | import glob 6 | import json 7 | import multiprocessing 8 | import pandas as pd 9 | import sys 10 | 11 | from os import path 12 | 13 | config = configparser.ConfigParser() 14 | config.read("config.ini") 15 | 16 | # Top directory for all data and resource files 17 | DATATOP = config["DEFAULT"]["datatop"] 18 | 19 | # Stage 1 output: 20 | # directory for url:filename dictionary files (PARQUET!) 
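# (illustrative: each row of these stage-1 parquet files pairs a script_url
#  with the slugified filename it was saved under, e.g.
#  "https://example.com/a.js" -> "example-com-a.txt")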
21 | URL_FILENAME_DICT = path.join(DATATOP, 'resources/full_url_list_parsed/') 22 | 23 | # Stage 2 output: 24 | # path to downloaded javascript files 25 | JS_SOURCE_FILES = path.join(DATATOP, config["DEFAULT"]["js_source_files"]) 26 | 27 | # Stage 3 output: 28 | # symbol list 29 | SYM_LIST = config["DEFAULT"]["sym_list"] 30 | 31 | # Output directory 32 | OUTPUT_DIR = path.join(DATATOP, "resources/symbol_counts/") 33 | 34 | OUTPUT_FILE = "full_run_trial2" 35 | OUTPUT_FAIL = "fails" 36 | 37 | 38 | # Number of workers: 39 | WORKERS = multiprocessing.cpu_count() 40 | 41 | # Number of files per queue batch 42 | BATCH_SIZE = 1000 43 | 44 | 45 | class SymbolNode: 46 | def __init__(self, depth, width, parent_depth, parent_width): 47 | self._depth = depth 48 | self._width = width 49 | self._parent_depth = parent_depth 50 | self._parent_width = parent_width 51 | 52 | def setDepthWidth(depth, width): 53 | self._depth = depth 54 | self._width = width 55 | 56 | 57 | class CustomEncoder(json.JSONEncoder): 58 | def default(self, obj): 59 | 60 | # Symbol node class 61 | if isinstance(obj, SymbolNode): 62 | return { 63 | "depth": obj._depth, 64 | "width": obj._width, 65 | "parent_depth": obj._parent_depth, 66 | "parent_width": obj._parent_width, 67 | } 68 | 69 | return json.JSONEncoder.default(self, obj) 70 | 71 | 72 | class Element: 73 | ## Define keys you want to skip over 74 | BLACKLISTEDKEYS = ["parent"] 75 | 76 | ## Constructor 77 | def __init__(self, esprima_ast): 78 | self._ast = esprima_ast # Assign member var AST 79 | self._visitors = [] # Init empty visitor array 80 | 81 | ## Add a new visitor to execute (will be executed at each node) 82 | def accept(self, visitor): 83 | self._visitors.append(visitor) 84 | 85 | ## (private) Step through the node's queue of potential nodes to visit 86 | def _step(self, node, queue, depth, width): 87 | before = len(queue) 88 | 89 | for key in node.keys(): # Enumerate keys for possible children 90 | if key in self.BLACKLISTEDKEYS: 91 | continue # Ignore node if it is blacklisted 92 | 93 | child = getattr(node, key) # Assign child = node.key 94 | 95 | # if the child exists && the child has an attribute 'type' 96 | if child and hasattr(child, "type") == True: 97 | child.parent = node # Assign this node as child's parent 98 | child.parent_depth = depth 99 | child.parent_width = width 100 | queue.append(child) # Append the child in this node's queue 101 | 102 | # if there is a list of children 103 | if isinstance(child, list): 104 | for item in child: # Iterate through them and do the same 105 | # as above 106 | if hasattr(item, "type") == True: 107 | item.parent = node 108 | item.parent_depth = depth 109 | item.parent_width = width 110 | queue.append(item) 111 | 112 | return len(queue) - before # Return whether any children were pushed 113 | 114 | ## Walk through this AST 115 | def walk(self, api_symbols, filename): 116 | queue = [self._ast] # Add the imported AST to the queue 117 | 118 | # Initialize these entries 119 | for node in queue: 120 | node.parent_depth = 0 121 | node.parent_width = 0 122 | 123 | # Depth and width counting 124 | depth = 0 # what level of the tree we are in 125 | width = 0 # how far from first node on this level we are 126 | this_depth_num_nodes = 1 # how many nodes in this level are left 127 | next_depth_num_nodes = 0 # how many nodes in the next level 128 | node_counter = 0 # how many total nodes have been visited 129 | this_depth_count = 0 # how many nodes are on this level (tot) 130 | 131 | # storage for the data 132 | 
extended_symbol_counter = {} 133 | symbol_counter = {key: 0 for key in api_symbols} 134 | node_dict = {key: [] for key in api_symbols} 135 | 136 | extended_symbol_counter["script_url_filename"] = filename 137 | 138 | while len(queue) > 0: # While stuff in the queue 139 | node = queue.pop(0) # Pop stuff off of the FRONT (0) 140 | this_depth_num_nodes -= 1 141 | node_counter += 1 142 | width = node_counter - this_depth_count - 1 143 | 144 | for v in self._visitors: # Run visitor instances here 145 | result = v.visit(node, api_symbols) 146 | if result: 147 | if result not in extended_symbol_counter.keys(): 148 | extended_symbol_counter[result] = 1 149 | else: 150 | extended_symbol_counter[result] += 1 151 | 152 | # MemberExpression 153 | if "MemberExpression" == node.type: 154 | tmp = node.property.name 155 | 156 | # CallExpression 157 | if "CallExpression" == node.type: 158 | tmp = node.callee.name 159 | 160 | symbol_counter[tmp] += 1 # increment counter 161 | this_node = SymbolNode( 162 | depth, width, node.parent_depth, node.parent_width 163 | ) 164 | node_dict[tmp].append(this_node) 165 | break 166 | 167 | # If node is an instance of "esprima node", step through the node 168 | # Returns how many children have been added to the queue 169 | if isinstance(node, esprima.nodes.Node): 170 | 171 | # Feed the nodes that will be labeled as children the current 172 | # depth and width 173 | next_depth_num_nodes += self._step(node, queue, depth, width) 174 | 175 | # Once this tree depth has been walked, update with the existing 176 | # "next" set and reset the next set to 0. Increment depth by 1, 177 | # and keep a tally on how many nodes have been counted up until 178 | # this depth. 179 | if this_depth_num_nodes == 0: 180 | this_depth_num_nodes = next_depth_num_nodes # update current list 181 | next_depth_num_nodes = 0 # reset this list 182 | this_depth_count = node_counter # 183 | depth += 1 184 | 185 | print( 186 | "Done '{}'. Total stats: Tree depth: {}\tTotal nodes:{}".format( 187 | filename, depth, node_counter 188 | ) 189 | ) 190 | return symbol_counter, extended_symbol_counter, node_dict 191 | 192 | 193 | """ 194 | Executes specified code given that an input node matches the property name of 195 | this node. 
196 | 197 | Attributes: 198 | _property_name: the name of the property required to execute the handler 199 | _node_handler: code to execute if _property_name matches 200 | visit(node): checks if input node's property matches this nodes; if yes, 201 | executes the code passed into _node_handler, passing the 202 | input node as an argument 203 | """ 204 | class MatchPropertyVisitor: 205 | ## Constructor 206 | def __init__(self, property_name): 207 | self._property_name = property_name # userAgent, getContext, etc 208 | 209 | ################################################## 210 | def _recursive_check_objects(self, node, api_symbols): 211 | if node.object: 212 | return self._recurrance_visit(node.object, api_symbols) 213 | return False 214 | 215 | ## Visit the nodes, check if matches, and execute handler if it does 216 | def _recurrance_visit(self, node, api_symbols): 217 | 218 | # No more objects to look through 219 | if "Identifier" == node.type: 220 | if node.name in api_symbols: 221 | return node.name 222 | 223 | # MemberExpression; maybe more objects 224 | elif "MemberExpression" == node.type: 225 | if node.property.name in api_symbols: 226 | 227 | return_val = node.property.name 228 | tmp = self._recursive_check_objects(node, api_symbols) 229 | 230 | if tmp: 231 | return_val = tmp + "." + return_val 232 | 233 | return return_val 234 | 235 | # CallExpression; maybe more objects 236 | elif "CallExpression" == node.type: 237 | if node.callee.name in api_symbols: 238 | 239 | return_val = node.callee.name 240 | tmp = self._recursive_check_objects(node.callee, api_symbols) 241 | 242 | if tmp: 243 | return_val = tmp + "." + return_val 244 | 245 | return return_val 246 | return False 247 | 248 | ################################################## 249 | 250 | def _filter_parent_API(self, arg): 251 | if len(arg.split(".")[0]) == 1: 252 | arg = arg.split(".") 253 | arg.pop(0) 254 | arg = ".".join(arg) 255 | return arg 256 | 257 | # FIRST VISIT: visit node, check if matches, and check for objects 258 | def visit(self, node, api_symbols): 259 | 260 | # MemberExpression 261 | if "MemberExpression" == node.type: 262 | if node.property.name == self._property_name: 263 | 264 | return_val = node.property.name 265 | tmp = self._recursive_check_objects(node, api_symbols) 266 | 267 | if tmp: 268 | return_val = tmp + "." + return_val 269 | return_val = self._filter_parent_API(return_val) 270 | 271 | return return_val 272 | 273 | # CallExpression 274 | if "CallExpression" == node.type: 275 | if node.callee.name == self._property_name: 276 | 277 | return_val = node.callee.name 278 | tmp = self._recursive_check_objects(node.callee, api_symbols) 279 | 280 | if tmp: 281 | return_val = tmp + "." 
+ return_val
282 |                 return_val = self._filter_parent_API(return_val)
283 |
284 |             return return_val
285 |         return False
286 |
287 |
288 | # Extract JSON and JavaScript AST data from precompiled list
289 | def importData(js_file):
290 |     # unused leftover stub; nothing below calls it
291 |     raise NotImplementedError
292 |
293 |
294 | def uniquifyList(seq, idfun=None):
295 |     # order preserving
296 |     if idfun is None:
297 |
298 |         def idfun(x):
299 |             return x
300 |
301 |     seen = {}
302 |     result = []
303 |     for item in seq:
304 |         marker = idfun(item)
305 |         if marker in seen:
306 |             continue
307 |         seen[marker] = 1
308 |         result.append(item)
309 |     return result
310 |
311 |
312 | ################################################################################
313 | def worker_process(input_file):
314 |
315 |     filename = input_file.split("/")[-1]
316 |
317 |     # Try getting the AST using esprima, bail if non-JS syntax
318 |     try:
319 |         with open(input_file) as f:
320 |             ast = esprima.parseScript(f.read())
321 |
322 |     except esprima.error_handler.Error as e:
323 |         print("Failure: non-javascript syntax detected. Terminating...")
324 |         return False, filename
325 |
326 |     # Create an element using that AST
327 |     el = Element(ast)
328 |     for entry in api_symbols:
329 |         visitor = MatchPropertyVisitor(entry)
330 |         el.accept(visitor)
331 |
332 |     # Walk down the AST (breadth-first)
333 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols, filename)
334 |
335 |     return True, extended_symbol_counter
336 |
337 |
338 | ################################################################################
339 | if __name__ == "__main__":
340 |
341 |     print("Initialized to use {} workers.".format(WORKERS))
342 |
343 |     # Extract all symbols from the generated "Symbols of Interest" list,
344 |     # and flatten all api symbols into a single list (for efficiency)
345 |     with open(SYM_LIST, encoding="utf-8") as f:
346 |         api_list = json.loads(f.read())
347 |
348 |     print("Looking in '{}' for the API list...".format(SYM_LIST))
349 |     api_symbols = [val for sublist in api_list.values() for val in sublist]
350 |     api_symbols = uniquifyList(api_symbols)
351 |     print("Success.")
352 |
353 |     # Get file list from data directory
354 |     print("Looking in '{}' for all .txt files...".format(JS_SOURCE_FILES))
355 |     file_list = glob.glob(JS_SOURCE_FILES + "/*")
356 |     print("Success. Found {} files.".format(len(file_list)))
357 |     print("Begin iterating over the files to get symbol info.")
358 |     print("-" * 80)
359 |
360 |     # Storage arrays
361 |     symbol_counts = []
362 |     fails_list = []
363 |
364 |     # Callback
365 |     def log_result(result):
366 |         if result[0]:
367 |             symbol_counts.append(result[1])
368 |         else:
369 |             fails_list.append(result[1])
370 |
371 |     # Set up the process pool
372 |     pool = multiprocessing.Pool(WORKERS)
373 |     for filename in file_list:
374 |         re = pool.apply_async(worker_process, args=(filename,), callback=log_result)
375 |
376 |     pool.close()
377 |     pool.join()
378 |
379 |     # Done counting symbols
380 |     print("All files done. Saving...")
Saving...") 381 | df = pd.DataFrame(symbol_counts) 382 | 383 | # Saving 384 | df.to_parquet(output_dir + "/" + output_file + ".parquet.gzip", compression="gzip") 385 | 386 | with open(output_dir + "/" + output_fail + ".txt", "w") as f: 387 | for item in fails_list: 388 | f.write("%s\n" % item) 389 | 390 | print("Success.\n\nDONE SCRIPT!") 391 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/config.ini: -------------------------------------------------------------------------------- 1 | # Note: Make sure to not add comments on any lines being parsed in! 2 | [DEFAULT] 3 | 4 | # Location of the top of where data is located 5 | #datatop = 6 | datatop = /mnt/Data/UCOSP_DATA 7 | 8 | # Stage 1 output: 9 | # directory for url:filename dictionary files (PARQUET!) 10 | url_filename_dict = resources/full_url_list_parsed 11 | 12 | # Stage 2 output: 13 | # Path to directory containing all of the downloaded scripts 14 | js_source_files = js_source_files 15 | 16 | # Output directory to save parquet output (make sure it exists!) 17 | output_dir = resources/symbol_counts_output 18 | 19 | # Output file names 20 | output_file = output_counts_part 21 | output_fail = fails 22 | 23 | # Specify the list of symbols you want to search for in the AST scan 24 | sym_list = master_sym_list.json 25 | 26 | # Number of files per queue batch 27 | batch_size = 1000 28 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/new_async_tree_explorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import configparser 4 | import esprima 5 | import glob 6 | import json 7 | import multiprocessing 8 | import pandas as pd 9 | import sys 10 | 11 | from os import path 12 | from tqdm import tqdm 13 | 14 | ################################################################################ 15 | 16 | config = configparser.ConfigParser() 17 | config.read("config.ini") 18 | 19 | # Top directory for all data and resource files 20 | DATATOP = config["DEFAULT"]["datatop"] 21 | 22 | # Stage 1 output: 23 | # directory for url:filename dictionary files (PARQUET!) 
24 | URL_FILENAME_DICT = path.join(DATATOP, config["DEFAULT"]["url_filename_dict"]) 25 | 26 | # Stage 2 output: 27 | # path to downloaded javascript files 28 | JS_SOURCE_FILES = path.join(DATATOP, config["DEFAULT"]["js_source_files"]) 29 | 30 | # Stage 3 output: 31 | # specify the list of symbols you want to search for in the AST scan 32 | SYM_LIST = config["DEFAULT"]["sym_list"] 33 | 34 | # Output directory 35 | OUTPUT_DIR = path.join(DATATOP, config["DEFAULT"]["output_dir"]) 36 | 37 | OUTPUT_FILE = config["DEFAULT"]["output_file"] 38 | OUTPUT_FAIL = config["DEFAULT"]["output_fail"] 39 | 40 | # Number of workers: 41 | WORKERS = multiprocessing.cpu_count() 42 | 43 | # Number of files per queue batch 44 | BATCH_SIZE = config["DEFAULT"].getint("batch_size") 45 | 46 | ################################################################################ 47 | class SymbolNode: 48 | 49 | def __init__(self, depth, width, parent_depth, parent_width): 50 | self._depth = depth 51 | self._width = width 52 | self._parent_depth = parent_depth 53 | self._parent_width = parent_width 54 | 55 | def setDepthWidth(depth, width): 56 | self._depth = depth 57 | self._width = width 58 | 59 | class CustomEncoder(json.JSONEncoder): 60 | def default(self, obj): 61 | 62 | # Symbol node class 63 | if isinstance(obj, SymbolNode): 64 | return{ "depth" : obj._depth, 65 | "width" : obj._width, 66 | "parent_depth": obj._parent_depth, 67 | "parent_width": obj._parent_width} 68 | 69 | return json.JSONEncoder.default(self, obj) 70 | 71 | ################################################################################ 72 | class Element: 73 | 74 | ## Define keys you want to skip over 75 | BLACKLISTEDKEYS = ['parent'] 76 | 77 | ## Constructor 78 | def __init__(self, esprima_ast): 79 | self._ast = esprima_ast # Assign member var AST 80 | self._visitors = [] # Init empty visitor array 81 | 82 | 83 | ## Add a new visitor to execute (will be executed at each node) 84 | def accept(self, visitor): 85 | self._visitors.append(visitor) 86 | 87 | 88 | ## (private) Step through the node's queue of potential nodes to visit 89 | def _step(self, node, queue, depth, width): 90 | before = len(queue) 91 | 92 | for key in node.keys(): # Enumerate keys for possible children 93 | if key in self.BLACKLISTEDKEYS: 94 | continue # Ignore node if it is blacklisted 95 | 96 | child = getattr(node, key) # Assign child = node.key 97 | 98 | # if the child exists && the child has an attribute 'type' 99 | if child and hasattr(child, 'type') == True: 100 | child.parent = node # Assign this node as child's parent 101 | child.parent_depth = depth 102 | child.parent_width = width 103 | queue.append(child) # Append the child in this node's queue 104 | 105 | # if there is a list of children 106 | if isinstance(child, list): 107 | for item in child: # Iterate through them and do the same 108 | # as above 109 | if hasattr(item, 'type') == True: 110 | item.parent = node 111 | item.parent_depth = depth 112 | item.parent_width = width 113 | queue.append(item) 114 | 115 | return len(queue) - before # Return whether any children were pushed 116 | 117 | ## Walk through this AST 118 | def walk(self, api_symbols, filename): 119 | queue = [self._ast] # Add the imported AST to the queue 120 | 121 | # Initialize these entries 122 | for node in queue: 123 | node.parent_depth = 0 124 | node.parent_width = 0 125 | 126 | # Depth and width counting 127 | depth = 0 # what level of the tree we are in 128 | width = 0 # how far from first node on this level we are 129 | this_depth_num_nodes 
= 1 # how many nodes in this level are left 130 | next_depth_num_nodes = 0 # how many nodes in the next level 131 | node_counter = 0 # how many total nodes have been visited 132 | this_depth_count = 0 # how many nodes are on this level (tot) 133 | 134 | # storage for the data 135 | extended_symbol_counter = {} 136 | symbol_counter = {key: 0 for key in api_symbols} 137 | node_dict = {key: [] for key in api_symbols} 138 | 139 | extended_symbol_counter['script_url_filename'] = filename 140 | 141 | while len(queue) > 0: # While stuff in the queue 142 | node = queue.pop(0) # Pop stuff off of the FRONT (0) 143 | this_depth_num_nodes -= 1 144 | node_counter += 1 145 | width = node_counter - this_depth_count - 1 146 | 147 | for v in self._visitors: # Run visitor instances here 148 | result = v.visit(node, api_symbols) 149 | if result: 150 | if result not in extended_symbol_counter.keys(): 151 | extended_symbol_counter[result] = 1; 152 | else: 153 | extended_symbol_counter[result] += 1; 154 | 155 | #MemberExpression 156 | if 'MemberExpression' == node.type: 157 | tmp = node.property.name 158 | 159 | # CallExpression 160 | if 'CallExpression' == node.type: 161 | tmp = node.callee.name 162 | 163 | symbol_counter[tmp] += 1 # increment counter 164 | this_node = SymbolNode(depth, width, node.parent_depth, node.parent_width) 165 | node_dict[tmp].append(this_node) 166 | break 167 | 168 | 169 | # If node is an instance of "esprima node", step through the node 170 | # Returns how many children have been added to the queue 171 | if isinstance(node, esprima.nodes.Node): 172 | 173 | # Feed the nodes that will be labeled as children the current 174 | # depth and width 175 | next_depth_num_nodes += self._step(node, queue, depth, width) 176 | 177 | 178 | # Once this tree depth has been walked, update with the existing 179 | # "next" set and reset the next set to 0. Increment depth by 1, 180 | # and keep a tally on how many nodes have been counted up until 181 | # this depth. 182 | if this_depth_num_nodes == 0: 183 | this_depth_num_nodes = next_depth_num_nodes # update current list 184 | next_depth_num_nodes = 0 # reset this list 185 | this_depth_count = node_counter # 186 | depth += 1 187 | 188 | return symbol_counter, extended_symbol_counter, node_dict 189 | 190 | 191 | ################################################################################ 192 | """ 193 | Executes specified code given that an input node matches the property name of 194 | this node. 
195 | 196 | Attributes: 197 | _property_name: the name of the property required to execute the handler 198 | _node_handler: code to execute if _property_name matches 199 | visit(node): checks if input node's property matches this nodes; if yes, 200 | executes the code passed into _node_handler, passing the 201 | input node as an argument 202 | """ 203 | class MatchPropertyVisitor: 204 | 205 | ## Constructor 206 | def __init__(self, property_name): 207 | self._property_name = property_name # userAgent, getContext, etc 208 | 209 | ################################################## 210 | def _recursive_check_objects(self, node, api_symbols): 211 | if node.object: 212 | return self._recurrance_visit(node.object, api_symbols) 213 | return False 214 | 215 | ## Visit the nodes, check if matches, and execute handler if it does 216 | def _recurrance_visit(self, node, api_symbols): 217 | 218 | # No more objects to look through 219 | if 'Identifier' == node.type: 220 | if node.name in api_symbols: 221 | return node.name 222 | 223 | # MemberExpression; maybe more objects 224 | elif 'MemberExpression' == node.type: 225 | if node.property.name in api_symbols: 226 | #self._memb_expr_handler(node) 227 | 228 | return_val = node.property.name 229 | tmp = self._recursive_check_objects(node, api_symbols) 230 | 231 | if tmp: 232 | return_val = tmp + '.' + return_val 233 | 234 | return return_val 235 | 236 | # CallExpression; maybe more objects 237 | elif 'CallExpression' == node.type: 238 | if node.callee.name in api_symbols: 239 | #self._call_expr_handler(node) 240 | 241 | return_val = node.callee.name 242 | tmp = self._recursive_check_objects(node.callee, api_symbols) 243 | 244 | if tmp: 245 | return_val = tmp + '.' + return_val 246 | 247 | return return_val 248 | return False 249 | ################################################## 250 | 251 | def _filter_parent_API(self, arg): 252 | if len(arg.split('.')[0]) == 1: 253 | arg = arg.split('.') 254 | arg.pop(0) 255 | arg = '.'.join(arg) 256 | return arg 257 | 258 | # FIRST VISIT: visit node, check if matches, and check for objects 259 | def visit(self, node, api_symbols): 260 | 261 | #MemberExpression 262 | if 'MemberExpression' == node.type: 263 | if node.property.name == self._property_name: 264 | 265 | return_val = node.property.name 266 | tmp = self._recursive_check_objects(node, api_symbols) 267 | 268 | if tmp: 269 | return_val = tmp + '.' + return_val 270 | return_val = self._filter_parent_API(return_val) 271 | 272 | return return_val 273 | 274 | # CallExpression 275 | if 'CallExpression' == node.type: 276 | if node.callee.name == self._property_name: 277 | 278 | return_val = node.callee.name 279 | tmp = self._recursive_check_objects(node.callee, api_symbols) 280 | 281 | if tmp: 282 | return_val = tmp + '.' 
283 |                     return_val = self._filter_parent_API(return_val)
284 | 
285 |                 return return_val
286 |         return False
287 | 
288 | 
289 | ################################################################################
290 | def uniquifyList(seq, idfun=None):
291 |     # order preserving
292 |     if idfun is None:
293 |         def idfun(x): return x
294 |     seen = {}
295 |     result = []
296 |     for item in seq:
297 |         marker = idfun(item)
298 |         if marker in seen: continue
299 |         seen[marker] = 1
300 |         result.append(item)
301 |     return result
302 | 
303 | 
304 | ################################################################################
305 | def worker_process(input_file):
306 | 
307 |     filename = input_file.split('/')[-1]
308 | 
309 |     # Try getting the AST using esprima, bail if non-JS syntax
310 |     try:
311 |         with open(input_file) as f:
312 |             ast = esprima.parseScript(f.read())
313 | 
314 |     except esprima.error_handler.Error:
315 |         return False, filename
316 | 
317 |     # Create an element using that AST
318 |     el = Element(ast)
319 |     for entry in api_symbols:
320 |         visitor = MatchPropertyVisitor(entry)
321 |         el.accept(visitor)
322 | 
323 |     # Walk down the AST (breadth-first)
324 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols, filename)
325 | 
326 |     return True, extended_symbol_counter
327 | 
328 | ################################################################################
329 | if __name__ == '__main__':
330 | 
331 |     print("Initialized to use {} workers.".format(WORKERS))
332 | 
333 |     # Extract all symbols from the generated "Symbols of Interest" list,
334 |     # and flatten all api symbols into a single list (for efficiency)
335 |     with open(SYM_LIST, encoding='utf-8') as f:
336 |         api_list = json.loads(f.read())
337 | 
338 |     print("Looking in '{}' for the API list...".format(SYM_LIST))
339 |     api_symbols = [val for sublist in api_list.values() for val in sublist]
340 |     api_symbols = uniquifyList(api_symbols)
341 |     print("Success.")
342 | 
343 |     # Get file list from data directory
344 |     print("Looking in '{}' for all .txt files...".format(JS_SOURCE_FILES))
345 | 
346 |     file_list = glob.glob(JS_SOURCE_FILES + "/*")
347 |     file_list_size = len(file_list)
348 | 
349 |     print("Success. Found {} files.".format(file_list_size))
350 |     print("Begin iterating over the files to get symbol info.")
351 |     print("-" * 80)
352 | 
353 |     pbar = tqdm(total=file_list_size)
354 | 
355 |     # Queue for successfully processed symbol counts
356 |     symbol_counts = multiprocessing.Queue()
357 | 
358 |     # Queue for files that failed to parse
359 |     fails_list = multiprocessing.Queue()
360 | 
361 |     # Callback (route each worker's result to the appropriate queue)
362 |     def log_result(result):
363 |         pbar.update(1)
364 |         if result[0]:
365 |             symbol_counts.put(result[1])
366 |         else:
367 |             fails_list.put(result[1])
368 | 
369 |     # Consumer that drains the results queue and writes parquet batches
370 |     def test():
371 |         counter = 0
372 |         buffer_list = []
373 | 
374 |         def dump_files(buffer_list):
375 |             from math import ceil
376 |             df = pd.DataFrame(buffer_list)
377 |             filename = OUTPUT_DIR + OUTPUT_FILE + "_" + str(ceil(counter/BATCH_SIZE)) + '.parquet'
378 |             df.to_parquet(filename)
379 |             return []
380 | 
381 |         while counter + fails_list.qsize() < file_list_size:
382 |             try:
383 |                 # Attempt to get data from the queue.
Note that 384 | # symbol_counts.get() will block this thread's execution 385 | # until data is available 386 | data = symbol_counts.get() 387 | except queue.Empty: 388 | pass 389 | except multiprocessing.TimeoutError: 390 | pass 391 | else: 392 | counter += 1; 393 | buffer_list.append(data) 394 | 395 | if counter % BATCH_SIZE == 0: 396 | buffer_list = dump_files(buffer_list) 397 | 398 | buffer_list = dump_files(buffer_list) 399 | 400 | 401 | # Setup thread pool 402 | pool = multiprocessing.Pool(WORKERS) 403 | 404 | # Queue thread 405 | re = pool.apply_async(test, args=()) 406 | 407 | # Individual file threads 408 | for filename in file_list: 409 | re = pool.apply_async(worker_process, args=(filename,), callback=log_result) 410 | 411 | pool.close() 412 | pool.join() 413 | 414 | print("Success.\n\nDONE SCRIPT!") 415 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/output_data/extended_symbol_counts.json: -------------------------------------------------------------------------------- 1 | { 2 | "length": 83, 3 | "c": 7, 4 | "f": 15, 5 | "cookie": 11, 6 | "b": 37, 7 | "parse": 5, 8 | "d": 2, 9 | "r": 10, 10 | "replace": 13, 11 | "a": 2, 12 | "code": 1, 13 | "add": 48, 14 | "y": 5, 15 | "x": 9, 16 | "addEventListener": 28, 17 | "warn": 13, 18 | "toJSON": 5, 19 | "crypto": 1, 20 | "crypto.getRandomValues": 2, 21 | "toString": 11, 22 | "key": 2, 23 | "document.cookie": 2, 24 | "forEach": 8, 25 | "slice": 15, 26 | "nodeName": 10, 27 | "plugins": 7, 28 | "href": 7, 29 | "geolocation": 3, 30 | "value.toJSON": 2, 31 | "escape": 2, 32 | "parentNode": 2, 33 | "cookieEnabled": 2, 34 | "documentElement": 5, 35 | "body": 3, 36 | "mimeTypes": 3, 37 | "mimeTypes.length": 1, 38 | "colorDepth": 2, 39 | "className.match": 1, 40 | "title": 4, 41 | "target": 4, 42 | "userAgent": 2, 43 | "name": 5, 44 | "max": 3, 45 | "clientWidth": 1, 46 | "offsetWidth": 1, 47 | "scrollWidth": 1, 48 | "clientHeight": 1, 49 | "offsetHeight": 2, 50 | "scrollHeight": 2, 51 | "javaEnabled": 3, 52 | "height": 2, 53 | "className": 1, 54 | "filter": 4, 55 | "hostname": 1, 56 | "href.replace": 1, 57 | "type": 12, 58 | "document.links": 4, 59 | "domain": 2, 60 | "location.href": 2, 61 | "platform": 2, 62 | "doNotTrack": 2, 63 | "characterSet": 1, 64 | "charset": 1, 65 | "language": 1, 66 | "bufferSize": 1, 67 | "ay": 5, 68 | "w": 4, 69 | "value.length": 1, 70 | "value": 4, 71 | "document.getElementsByTagName": 2, 72 | "localStorage.setItem": 2, 73 | "localStorage.removeItem": 1, 74 | "plugins.length": 1, 75 | "width": 2, 76 | "window.location.href": 2, 77 | "window.top.document.referrer": 1, 78 | "window.parent": 2, 79 | "document.referrer": 1, 80 | "console.warn": 1, 81 | "localStorage.getItem": 2, 82 | "innerHTML": 2, 83 | "assign": 1, 84 | "compact": 1, 85 | "keys": 4, 86 | "extend": 1, 87 | "select": 1, 88 | "clone": 1, 89 | "id": 1, 90 | "s": 8, 91 | "window.event": 1, 92 | "which": 1, 93 | "button": 1, 94 | "srcElement": 1, 95 | "open": 1, 96 | "withCredentials": 1, 97 | "setRequestHeader": 1, 98 | "navigator.userAgent": 1, 99 | "location": 7, 100 | "document.links.length": 1, 101 | "geolocation.getCurrentPosition": 1, 102 | "sessionStorage": 1, 103 | "localStorage": 3, 104 | "text": 2, 105 | "window.location": 2, 106 | "window.top.document": 1, 107 | "tagName": 1, 108 | "scrollLeft": 1, 109 | "pageXOffset": 1, 110 | "scrollTop": 1, 111 | "pageYOffset": 1, 112 | "timing": 3, 113 | "setInterval": 2, 114 | "getElementsByTagName": 2, 115 | 
"checked": 2, 116 | "forms": 1, 117 | "z": 9, 118 | "plugins.s.length": 1, 119 | "sort": 1, 120 | "window.top": 1, 121 | "children": 5, 122 | "window": 1, 123 | "setTimeout": 2, 124 | "onreadystatechange": 1, 125 | "onload": 1, 126 | "onerror": 1, 127 | "src": 1, 128 | "compatMode": 2, 129 | "navigator.geolocation.getCurrentPosition": 1, 130 | "text.replace": 2, 131 | "plugins.s": 5, 132 | "nodeType": 2, 133 | "create": 1, 134 | "bottom": 2, 135 | "top": 7, 136 | "performance": 1, 137 | "navigator.geolocation": 1, 138 | "top.location": 2, 139 | "location.protocol": 2, 140 | "transaction": 11, 141 | "transaction.total": 1, 142 | "transaction.city": 1, 143 | "transaction.state": 1, 144 | "transaction.country": 1, 145 | "transaction.context": 1, 146 | "items.length": 1, 147 | "plugins.s.description": 1, 148 | "loop": 6, 149 | "top.replace": 1, 150 | "send": 2, 151 | "items": 3, 152 | "suffixes": 1, 153 | "plugins.s.name": 1, 154 | "window.parent.document.referrer": 1, 155 | "context": 1, 156 | "enabledPlugin": 1, 157 | "window.parent.document": 1, 158 | "index": 2, 159 | "removeEventListener": 2, 160 | "coords": 1, 161 | "source": 1, 162 | "Q": 1, 163 | "abort": 1, 164 | "status": 3, 165 | "clearTimeout": 2, 166 | "readyState": 4, 167 | "clearInterval": 1, 168 | "latitude": 1, 169 | "longitude": 1, 170 | "accuracy": 1, 171 | "altitude": 1, 172 | "altitudeAccuracy": 1, 173 | "heading": 1, 174 | "speed": 1, 175 | "timestamp": 1, 176 | "document.body.children": 1, 177 | "document.body": 1 178 | } -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/requirements.txt: -------------------------------------------------------------------------------- 1 | esprima==4.0.1 2 | numpy==1.15.4 3 | pandas==0.23.4 4 | pkg-resources==0.0.0 5 | pyarrow==0.11.1 6 | python-dateutil==2.7.5 7 | pytz==2018.7 8 | six==1.12.0 9 | -------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/4-ast_analysis/single_tree_explorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import esprima 4 | import json 5 | import json2parquet as j2p 6 | import sys 7 | 8 | class SymbolNode: 9 | def __init__(self, depth, width, parent_depth, parent_width): 10 | self._depth = depth 11 | self._width = width 12 | self._parent_depth = parent_depth 13 | self._parent_width = parent_width 14 | 15 | def setDepthWidth(depth, width): 16 | self._depth = depth 17 | self._width = width 18 | 19 | 20 | class CustomEncoder(json.JSONEncoder): 21 | def default(self, obj): 22 | 23 | # Symbol node class 24 | if isinstance(obj, SymbolNode): 25 | return { 26 | "depth": obj._depth, 27 | "width": obj._width, 28 | "parent_depth": obj._parent_depth, 29 | "parent_width": obj._parent_width, 30 | } 31 | 32 | return json.JSONEncoder.default(self, obj) 33 | 34 | 35 | class Element: 36 | 37 | ## Define keys you want to skip over 38 | BLACKLISTEDKEYS = ["parent"] 39 | 40 | ## Constructor 41 | def __init__(self, esprima_ast): 42 | self._ast = esprima_ast # Assign member var AST 43 | self._visitors = [] # Init empty visitor array 44 | 45 | ## Add a new visitor to execute (will be executed at each node) 46 | def accept(self, visitor): 47 | self._visitors.append(visitor) 48 | 49 | ## (private) Step through the node's queue of potential nodes to visit 50 | def _step(self, node, queue, depth, width): 51 | before = len(queue) 52 | 53 | for key in node.keys(): # 
53 |         for key in node.keys():  # Enumerate keys for possible children
54 |             if key in self.BLACKLISTEDKEYS:
55 |                 continue  # Ignore the key if it is blacklisted
56 | 
57 |             child = getattr(node, key)  # Assign child = node.key
58 | 
59 |             # if the child exists && the child has an attribute 'type'
60 |             if child and hasattr(child, "type"):
61 |                 child.parent = node  # Assign this node as child's parent
62 |                 child.parent_depth = depth
63 |                 child.parent_width = width
64 |                 queue.append(child)  # Append the child to this node's queue
65 | 
66 |             # if there is a list of children
67 |             if isinstance(child, list):
68 |                 for item in child:  # Iterate through them and do the
69 |                                     # same as above
70 |                     if hasattr(item, "type"):
71 |                         item.parent = node
72 |                         item.parent_depth = depth
73 |                         item.parent_width = width
74 |                         queue.append(item)
75 | 
76 |         return len(queue) - before  # Return how many children were pushed
77 | 
78 |     ## Walk through this AST
79 |     def walk(self, api_symbols):
80 |         queue = [self._ast]  # Add the imported AST to the queue
81 | 
82 |         # Initialize these entries
83 |         for node in queue:
84 |             node.parent_depth = 0
85 |             node.parent_width = 0
86 | 
87 |         # Depth and width counting
88 |         depth = 0                 # what level of the tree we are in
89 |         width = 0                 # how far from the first node on this level we are
90 |         this_depth_num_nodes = 1  # how many nodes in this level are left
91 |         next_depth_num_nodes = 0  # how many nodes in the next level
92 |         node_counter = 0          # how many total nodes have been visited
93 |         this_depth_count = 0      # how many nodes were visited before this level
94 | 
95 |         # storage for the data
96 |         symbol_counter = {key: 0 for key in api_symbols}
97 |         extended_symbol_counter = {}
98 |         node_dict = {key: [] for key in api_symbols}
99 | 
100 |         while len(queue) > 0:    # While there is stuff in the queue
101 |             node = queue.pop(0)  # Pop nodes off of the FRONT (index 0)
102 |             this_depth_num_nodes -= 1  # Reduce how many are left
103 |             node_counter += 1          # Increment the counter
104 |             width = node_counter - this_depth_count - 1
105 | 
106 |             for v in self._visitors:  # Run visitor instances here
107 |                 result = v.visit(node, api_symbols)
108 |                 if result:
109 |                     if result not in extended_symbol_counter:
110 |                         extended_symbol_counter[result] = 1
111 |                     else:
112 |                         extended_symbol_counter[result] += 1
113 | 
114 |                     # MemberExpression
115 |                     if node.type == "MemberExpression":
116 |                         tmp = node.property.name
117 | 
118 |                     # CallExpression
119 |                     if node.type == "CallExpression":
120 |                         tmp = node.callee.name
121 | 
122 |                     symbol_counter[tmp] += 1  # increment counter
123 |                     this_node = SymbolNode(
124 |                         depth, width, node.parent_depth, node.parent_width
125 |                     )
126 |                     node_dict[tmp].append(this_node)
127 |                     break
128 | 
129 |             # If node is an instance of "esprima node", step through the node.
130 |             # _step returns how many children have been added to the queue.
131 |             if isinstance(node, esprima.nodes.Node):
132 | 
133 |                 # Feed the nodes that will be labeled as children the current
134 |                 # depth and width
135 |                 next_depth_num_nodes += self._step(node, queue, depth, width)
136 | 
137 |             # Once this tree depth has been walked, update with the existing
138 |             # "next" count and reset the next count to 0. Increment depth by 1,
139 |             # and keep a tally of how many nodes have been counted up to
140 |             # this depth.
141 |             if this_depth_num_nodes == 0:
142 |                 this_depth_num_nodes = next_depth_num_nodes  # update current count
143 |                 next_depth_num_nodes = 0                     # reset the next count
144 |                 this_depth_count = node_counter              # tally up to this depth
145 |                 depth += 1
146 | 
147 |         print("Done. Total stats: Tree depth: {}\tTotal nodes: {}".format(depth, node_counter))
148 | 
149 |         return symbol_counter, extended_symbol_counter, node_dict
150 | 
151 | 
152 | """
153 | Visitor that matches AST nodes against a configured property name and, on a
154 | match, reconstructs the dotted symbol path for that node.
155 | 
156 | Attributes:
157 |     _property_name: the name of the property this visitor matches against
158 |     visit(node, api_symbols): checks whether the input node's property (or
159 |                               callee) matches this visitor's property name;
160 |                               if so, returns the dotted symbol path for the
161 |                               match (e.g. 'navigator.userAgent')
162 | """
163 | class MatchPropertyVisitor:
164 | 
165 |     ## Constructor
166 |     def __init__(self, property_name):
167 |         self._property_name = property_name  # userAgent, getContext, etc.
168 | 
169 |     ##################################################
170 | 
171 |     def _recursive_check_objects(self, node, api_symbols):
172 |         if node.object:
173 |             return self._recursive_visit(node.object, api_symbols)
174 |         return False
175 | 
176 |     ## Visit the node, check if it matches, and build up the symbol path
177 |     def _recursive_visit(self, node, api_symbols):
178 | 
179 |         # No more objects to look through
180 |         if node.type == "Identifier":
181 |             if node.name in api_symbols:
182 |                 return node.name
183 | 
184 |         # MemberExpression; maybe more objects
185 |         elif node.type == "MemberExpression":
186 |             if node.property.name in api_symbols:
187 | 
188 |                 return_val = node.property.name
189 |                 tmp = self._recursive_check_objects(node, api_symbols)
190 | 
191 |                 if tmp:
192 |                     return_val = tmp + "." + return_val
193 | 
194 |                 return return_val
195 | 
196 |         # CallExpression; maybe more objects
197 |         elif node.type == "CallExpression":
198 |             if node.callee.name in api_symbols:
199 | 
200 |                 return_val = node.callee.name
201 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
202 | 
203 |                 if tmp:
204 |                     return_val = tmp + "." + return_val
205 | 
206 |                 return return_val
207 |         return False
208 | 
209 |     ##################################################
210 | 
211 |     ## Strip a single-character (likely minified) parent object off the path
212 |     def _filter_parent_API(self, arg):
213 |         if len(arg.split(".")[0]) == 1:
214 |             arg = arg.split(".")
215 |             arg.pop(0)
216 |             arg = ".".join(arg)
217 |         return arg
218 | 
219 |     # FIRST VISIT: visit node, check if it matches, and check for objects
220 |     def visit(self, node, api_symbols):
221 | 
222 |         # MemberExpression
223 |         if node.type == "MemberExpression":
224 |             if node.property.name == self._property_name:
225 | 
226 |                 return_val = node.property.name
227 |                 tmp = self._recursive_check_objects(node, api_symbols)
228 | 
229 |                 if tmp:
230 |                     return_val = tmp + "." + return_val
231 |                     return_val = self._filter_parent_API(return_val)
232 | 
233 |                 return return_val
234 | 
235 |         # CallExpression
236 |         if node.type == "CallExpression":
237 |             if node.callee.name == self._property_name:
238 | 
239 |                 return_val = node.callee.name
240 |                 tmp = self._recursive_check_objects(node.callee, api_symbols)
241 | 
242 |                 if tmp:
243 |                     return_val = tmp + "." + return_val
244 |                     return_val = self._filter_parent_API(return_val)
245 | 
246 |                 return return_val
247 |         return False
248 | 
249 | 
250 | # Extract the JSON API list and the JavaScript AST from the command line
251 | def importData():
252 | 
253 |     if len(sys.argv) == 3:
254 |         with open(sys.argv[1], encoding="utf-8") as data_file:
255 |             api_list = json.loads(data_file.read())
256 |         with open(sys.argv[2]) as js_file:
257 |             ast = esprima.parseScript(js_file.read())
258 | 
259 |         return ast, api_list
260 | 
261 |     else:
262 |         print(
263 |             """Warning: invalid input type! This script does not use config.ini.
264 | Syntax:
265 | 
266 |     $ python3.6 this_script.py \
267 |         <api_list.json> <input_script.js>
268 | """
269 |         )
270 |         sys.exit()
271 | 
272 | 
273 | ################################################################################
274 | def uniquifyList(seq, idfun=None):
275 |     # order preserving
276 |     if idfun is None:
277 | 
278 |         def idfun(x):
279 |             return x
280 | 
281 |     seen = {}
282 |     result = []
283 |     for item in seq:
284 |         marker = idfun(item)
285 |         if marker in seen:
286 |             continue
287 |         seen[marker] = 1
288 |         result.append(item)
289 |     return result
290 | 
291 | 
292 | ################################################################################
293 | def main():
294 | 
295 |     # Try getting the AST using esprima, bail if non-JS syntax
296 |     try:
297 |         print("Trying to parse AST...")
298 |         ast, api_list = importData()
299 | 
300 |     except esprima.error_handler.Error:
301 |         print("Failure: non-JavaScript syntax detected. Terminating...")
302 |         sys.exit()
303 | 
304 |     print("Success. Walking through the AST...")
305 | 
306 |     # Extract all symbols from the generated "Symbols of Interest" list,
307 |     # and flatten all api symbols into a single list (for efficiency)
308 |     api_symbols = [val for sublist in api_list.values() for val in sublist]
309 |     api_symbols = uniquifyList(api_symbols)
310 | 
311 |     # Create an element using that AST
312 |     el = Element(ast)
313 |     for entry in api_symbols:
314 |         visitor = MatchPropertyVisitor(entry)
315 |         el.accept(visitor)
316 | 
317 |     # Walk down the AST (breadth-first)
318 |     symbol_counter, extended_symbol_counter, node_dict = el.walk(api_symbols)
319 | 
320 |     # Dump outputs
321 |     with open("output_data/symbol_counts.json", "w") as out1:
322 |         json.dump(symbol_counter, out1, indent=4)
323 | 
324 |     with open("output_data/extended_symbol_counts.json", "w") as out2:
325 |         json.dump(extended_symbol_counter, out2, indent=4)
326 | 
327 |     # with open("output_data/symbol_node_info.json", "w") as out3:
328 |     #     json.dump(node_dict, out3, indent=4, cls=CustomEncoder)
329 | 
330 | 
331 | if __name__ == "__main__":
332 |     main()
333 | 
-------------------------------------------------------------------------------- /analyses/2018_12_ddobre_static_analysis/README.md: --------------------------------------------------------------------------------
1 | A four-stage pipeline for mass-downloading JavaScript files, generating a list of potential symbols of interest, and parsing those symbols out of the downloaded scripts. Each stage of the pipeline has a more exhaustive README; please check those for more details.
2 | 
-------------------------------------------------------------------------------- /analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/Audio Fingerprinting Heuristics.ipynb: --------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {
7 |     "scrolled": false
8 |    },
9 |    "outputs": [
10 |     {
11 |      "data": {
12 |       "text/html": [
13 |        "\n",
14 |        "\n",
15 |        "[Dask Client / Cluster widget: Workers: 4, Cores: 4, Memory: 12.00 GB]\n"
\n", 16 | "

Client

\n", 17 | "\n", 21 | "
\n", 23 | "

Cluster

\n", 24 | "
    \n", 25 | "
  • Workers: 4
  • \n", 26 | "
  • Cores: 4
  • \n", 27 | "
  • Memory: 12.00 GB
  • \n", 28 | "
\n", 29 | "
" 32 | ], 33 | "text/plain": [ 34 | "" 35 | ] 36 | }, 37 | "execution_count": 1, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "import dask.dataframe as dd\n", 44 | "import json\n", 45 | "\n", 46 | "from dask.distributed import Client, progress\n", 47 | "\n", 48 | "DATA_DIR = 'YOUR DATA DIRECTORY HERE'\n", 49 | "DATA_DIR_FULL = DATA_DIR + \"PATH TO PARQUET FILES\"\n", 50 | "Client()" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## Setup" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 6, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "df = dd.read_parquet(DATA_DIR_FULL, columns=['script_url', 'symbol'])" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Build Candidate URLs for `OfflineAudioContext.createOscillator`" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 7, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "[########################################] | 100% Completed | 51.8s\r" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "create_oscillator_df = df[df.symbol == 'OfflineAudioContext.createOscillator']\n", 91 | "create_oscillator_urls = create_oscillator_df.script_url.unique().persist()\n", 92 | "progress(create_oscillator_urls, notebook=False)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 8, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "0 https://www.alaskaair.com/px/client/main.min.js\n", 104 | "1 https://client.perimeterx.net/PXQ76Auu14/main....\n", 105 | "2 https://client.perimeterx.net/PXM636Svr4/main....\n", 106 | "3 http://client.perimeterx.net/PX0F3091f3/main.m...\n", 107 | "4 https://media1.admicro.vn/core/fipmin.js\n", 108 | "Name: script_url, dtype: object" 109 | ] 110 | }, 111 | "execution_count": 8, 112 | "metadata": {}, 113 | "output_type": "execute_result" 114 | } 115 | ], 116 | "source": [ 117 | "create_oscillator_urls = create_oscillator_urls.compute()\n", 118 | "create_oscillator_urls[0:5]" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## Build Candidate URLs for `OfflineAudioContext.createDynamicsCompressor`" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 9, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[########################################] | 100% Completed | 47.3s\r" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "create_dynamics_df = df[df.symbol == 'OfflineAudioContext.createDynamicsCompressor']\n", 143 | "create_dynamics_urls = create_dynamics_df.script_url.unique().persist()\n", 144 | "progress(create_dynamics_urls, notebook=False)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 10, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/plain": [ 155 | "0 https://www.alaskaair.com/px/client/main.min.js\n", 156 | "1 https://client.perimeterx.net/PXQ76Auu14/main....\n", 157 | "2 https://client.perimeterx.net/PXM636Svr4/main....\n", 158 | "3 http://client.perimeterx.net/PX0F3091f3/main.m...\n", 159 | "4 https://media1.admicro.vn/core/fipmin.js\n", 160 | "Name: script_url, dtype: object" 161 | ] 162 | }, 163 | "execution_count": 10, 164 | "metadata": {}, 165 | "output_type": "execute_result" 
166 | } 167 | ], 168 | "source": [ 169 | "create_dynamics_urls = create_dynamics_urls.compute()\n", 170 | "create_dynamics_urls[0:5]" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "## Build Candidate URLs for `OfflineAudioContext.destination`" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 11, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "[########################################] | 100% Completed | 39.6s\r" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "destination_df = df[df.symbol == 'OfflineAudioContext.destination']\n", 195 | "destination_urls = destination_df.script_url.unique().persist()\n", 196 | "progress(destination_urls, notebook=False)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 12, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/plain": [ 207 | "0 https://www.alaskaair.com/px/client/main.min.js\n", 208 | "1 https://client.perimeterx.net/PXQ76Auu14/main....\n", 209 | "2 https://client.perimeterx.net/PXM636Svr4/main....\n", 210 | "3 http://client.perimeterx.net/PX0F3091f3/main.m...\n", 211 | "4 https://media1.admicro.vn/core/fipmin.js\n", 212 | "Name: script_url, dtype: object" 213 | ] 214 | }, 215 | "execution_count": 12, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "destination_urls = destination_urls.compute()\n", 222 | "destination_urls[0:5]" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## Build Candidate URLs for `OfflineAudioContext.startRendering`" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 13, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "name": "stdout", 239 | "output_type": "stream", 240 | "text": [ 241 | "[########################################] | 100% Completed | 40.3s\r" 242 | ] 243 | } 244 | ], 245 | "source": [ 246 | "start_rendering_df = df[df.symbol == 'OfflineAudioContext.startRendering']\n", 247 | "start_rendering_urls = start_rendering_df.script_url.unique().persist()\n", 248 | "progress(start_rendering_urls, notebook=False)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 14, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "0 https://www.alaskaair.com/px/client/main.min.js\n", 260 | "1 https://client.perimeterx.net/PXQ76Auu14/main....\n", 261 | "2 https://client.perimeterx.net/PXM636Svr4/main....\n", 262 | "3 http://client.perimeterx.net/PX0F3091f3/main.m...\n", 263 | "4 https://media1.admicro.vn/core/fipmin.js\n", 264 | "Name: script_url, dtype: object" 265 | ] 266 | }, 267 | "execution_count": 14, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "start_rendering_urls = start_rendering_urls.compute()\n", 274 | "start_rendering_urls[0:5]" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## Build Candidate URLs for `OfflineAudioContext.oncomplete`" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 15, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "[########################################] | 100% Completed | 44.8s\r" 294 | ] 295 | } 296 | ], 297 | "source": [ 298 | "on_complete_df = 
df[df.symbol == 'OfflineAudioContext.oncomplete']\n", 299 | "on_complete_urls = on_complete_df.script_url.unique().persist()\n", 300 | "progress(on_complete_urls, notebook=False)" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 16, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "data": { 310 | "text/plain": [ 311 | "0 https://www.alaskaair.com/px/client/main.min.js\n", 312 | "1 https://client.perimeterx.net/PXQ76Auu14/main....\n", 313 | "2 https://client.perimeterx.net/PXM636Svr4/main....\n", 314 | "3 http://client.perimeterx.net/PX0F3091f3/main.m...\n", 315 | "4 https://media1.admicro.vn/core/fipmin.js\n", 316 | "Name: script_url, dtype: object" 317 | ] 318 | }, 319 | "execution_count": 16, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "on_complete_urls = on_complete_urls.compute()\n", 326 | "on_complete_urls[0:5]" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Scripts must call all 5 functions: [\"OfflineAudioContext.createOscillator\", \"OfflineAudioContext.createDynamicsCompressor\", \"OfflineAudioContext.destination\", \"OfflineAudioContext.startRendering\", \"OfflineAudioContext.oncomplete\"]" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 17, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "# of script_urls using audio fingerprinting: 170\n" 346 | ] 347 | } 348 | ], 349 | "source": [ 350 | "audio_fp_urls = set(create_oscillator_urls) & \\\n", 351 | " set(create_dynamics_urls) & \\\n", 352 | " set(destination_urls) & \\\n", 353 | " set(start_rendering_urls) & \\\n", 354 | " set(on_complete_urls)\n", 355 | "print('# of script_urls using audio fingerprinting:', len(audio_fp_urls))" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 18, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "# of script_urls that did not call all 5 symbols: 0\n" 368 | ] 369 | } 370 | ], 371 | "source": [ 372 | "all_candidate_urls = set(create_oscillator_urls) | \\\n", 373 | " set(create_dynamics_urls) | \\\n", 374 | " set(destination_urls) | \\\n", 375 | " set(start_rendering_urls) | \\\n", 376 | " set(on_complete_urls)\n", 377 | "not_audio_fp_urls = all_candidate_urls - audio_fp_urls\n", 378 | "print('# of script_urls that did not call all 5 symbols:', len(not_audio_fp_urls))" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "## Save URLs" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 19, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "with open('audio_fingerprinting.json', 'w') as f:\n", 395 | " f.write(json.dumps(list(audio_fp_urls)))" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 20, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "with open('not_audio_fingerprinting.json', 'w') as f:\n", 405 | " f.write(json.dumps(list(not_audio_fp_urls)))" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "## Find Locations" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 2, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "with open('audio_fingerprinting.json', 'r') as f:\n", 422 | " audio_fp_urls = json.load(f)" 
423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 4, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "df = dd.read_parquet(DATA_DIR_FULL, columns=['script_url', 'location'])" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 6, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "name": "stdout", 441 | "output_type": "stream", 442 | "text": [ 443 | "[########################################] | 100% Completed | 44.7s\r" 444 | ] 445 | } 446 | ], 447 | "source": [ 448 | "df_locs = df[df.script_url.isin(audio_fp_urls)]\n", 449 | "locs = df_locs.location.unique().persist()\n", 450 | "progress(locs, notebook=False)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 8, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "name": "stdout", 460 | "output_type": "stream", 461 | "text": [ 462 | "# of locations that call audio fingerprinting scripts: 2006\n" 463 | ] 464 | } 465 | ], 466 | "source": [ 467 | "print('# of locations that call audio fingerprinting scripts:', len(locs))" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [] 476 | } 477 | ], 478 | "metadata": { 479 | "kernelspec": { 480 | "display_name": "overscripted", 481 | "language": "python", 482 | "name": "overscripted" 483 | }, 484 | "language_info": { 485 | "codemirror_mode": { 486 | "name": "ipython", 487 | "version": 3 488 | }, 489 | "file_extension": ".py", 490 | "mimetype": "text/x-python", 491 | "name": "python", 492 | "nbconvert_exporter": "python", 493 | "pygments_lexer": "ipython3", 494 | "version": "3.5.6" 495 | } 496 | }, 497 | "nbformat": 4, 498 | "nbformat_minor": 2 499 | } 500 | -------------------------------------------------------------------------------- /analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/Canvas Fingerprinting Heuristics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import dask.dataframe as dd\n", 10 | "import os\n", 11 | "import re\n", 12 | "import json\n", 13 | "\n", 14 | "from dask.distributed import Client, progress\n", 15 | "from pandas.api.types import CategoricalDtype\n", 16 | "\n", 17 | "DATA_DIR = 'YOUR DATA DIRECTORY HERE'\n", 18 | "DATA_DIR_FULL = DATA_DIR + \"PATH TO PARQUET FILES\"" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": { 25 | "scrolled": false 26 | }, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "\n", 32 | "\n", 33 | "\n", 40 | "\n", 48 | "\n", 49 | "
\n", 34 | "

Client

\n", 35 | "\n", 39 | "
\n", 41 | "

Cluster

\n", 42 | "
    \n", 43 | "
  • Workers: 4
  • \n", 44 | "
  • Cores: 4
  • \n", 45 | "
  • Memory: 32.00 GB
  • \n", 46 | "
\n", 47 | "
" 50 | ], 51 | "text/plain": [ 52 | "" 53 | ] 54 | }, 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "Client()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "# Build candidates" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "data": { 78 | "text/plain": [ 79 | "0 http://www.qvc.com/akam/10/2b30e194\n", 80 | "1 http://www.qvc.com/_bm/async.js\n", 81 | "2 http://www.coupang.com/akam/10/4f2b47\n", 82 | "3 https://www.coches.net/ztkieflaaxcvaiwh121837.js\n", 83 | "4 https://a1.alicdn.com/creation/html/2016/06/20...\n", 84 | "Name: script_url, dtype: object" 85 | ] 86 | }, 87 | "execution_count": 3, 88 | "metadata": {}, 89 | "output_type": "execute_result" 90 | } 91 | ], 92 | "source": [ 93 | "df_to_data_urls_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol'])\n", 94 | "df_to_data_urls_df = df_to_data_urls_df[df_to_data_urls_df.symbol == 'HTMLCanvasElement.toDataURL']\n", 95 | "to_data_urls = df_to_data_urls_df.script_url.unique().compute()\n", 96 | "to_data_urls[0:5]" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 4, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/plain": [ 107 | "0 http://p6.drtst.com/templates/drtuber/js/drtub...\n", 108 | "1 https://www.jigsawplanet.com/js/jp.js?v=b177a4b\n", 109 | "2 http://p5.vptpsn.com/templates/frontend/viptub...\n", 110 | "3 https://code.createjs.com/createjs-2015.11.26....\n", 111 | "4 http://cdn.promodj.com/core/core.js?1ce4f0\n", 112 | "Name: script_url, dtype: object" 113 | ] 114 | }, 115 | "execution_count": 4, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "def large_enough(row):\n", 122 | " width = float(row.argument_2)\n", 123 | " height = float(row.argument_3)\n", 124 | " return width >= 16 and height >= 16\n", 125 | "\n", 126 | "df_get_image_data_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol', 'argument_2', 'argument_3'])\n", 127 | "df_get_image_data_df = df_get_image_data_df[df_get_image_data_df.symbol == 'CanvasRenderingContext2D.getImageData']\n", 128 | "df_get_image_data_df = df_get_image_data_df[df_get_image_data_df.apply(large_enough, axis=1, meta=('bool'))]\n", 129 | "get_image_data_urls = df_get_image_data_df.script_url.unique().compute()\n", 130 | "get_image_data_urls[0:5]" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "name": "stdout", 140 | "output_type": "stream", 141 | "text": [ 142 | "n to_data_urls 26481\n", 143 | "n get_image_data_urls 559\n", 144 | "n candidate urls 27009\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "print('n to_data_urls', len(to_data_urls))\n", 150 | "print('n get_image_data_urls', len(get_image_data_urls))\n", 151 | "candidate_urls = to_data_urls.append(get_image_data_urls).unique()\n", 152 | "print('n candidate urls', len(candidate_urls))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "all_candidate_urls = candidate_urls.copy()" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "# Start removing\n", 169 | "\n", 170 | "## 1. 
Remove manually filtered" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 7, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "false_positive_script_urls = {\n", 180 | " 'http://www.fivola.com/',\n", 181 | " 'http://cdn02.centraledachats.be/dist/js/holder.js',\n", 182 | " 'http://ccmedia.fr/accueil.php',\n", 183 | " 'http://rozup.ir/up/moisrex/themes/space_theme/script.js'\n", 184 | "}" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 8, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "n candidate urls 27009\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "candidate_urls = [url for url in candidate_urls if url not in false_positive_script_urls]\n", 202 | "print('n candidate urls', len(candidate_urls))" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 9, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "name": "stdout", 212 | "output_type": "stream", 213 | "text": [ 214 | "0\n" 215 | ] 216 | } 217 | ], 218 | "source": [ 219 | "print(len(set(all_candidate_urls) - set(candidate_urls)))\n", 220 | "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n", 221 | "with open('not_canvas_fingerprinting_1.json', 'w') as f:\n", 222 | " f.write(json.dumps(disgarded_urls)) " 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## 2. Remove save, restore, addEventListener" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 10, 235 | "metadata": { 236 | "scrolled": true 237 | }, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/plain": [ 242 | "array(['https://tpc.googlesyndication.com/sadbundle/$csp%3Der3%26dns%3Doff$/4134920871885725337/createjs-2015.11.26.min.js',\n", 243 | " 'http://pics3.city-data.com/js/maps/CANVAS/boxMap.js',\n", 244 | " 'https://code.createjs.com/createjs-2015.11.26.min.js',\n", 245 | " 'http://media.ufc.tv/ufc_system_assets/ufc_201707101050/js/cufon-yui.js',\n", 246 | " 'https://sale.yhd.com/act/J3oKuL4Izcsvpn.html'], dtype=object)" 247 | ] 248 | }, 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "df_valid_calls_df = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol'])\n", 256 | "df_valid_calls_df = df_valid_calls_df[df_valid_calls_df.symbol.isin(\n", 257 | " ['CanvasRenderingContext2D.save', 'CanvasRenderingContext2D.restore', 'HTMLCanvasElement.addEventListener']\n", 258 | ")]\n", 259 | "valid_calls_urls = df_valid_calls_df.script_url.unique().values.compute()\n", 260 | "valid_calls_urls[0:5]" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 11, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "n candidate urls 26877\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "candidate_urls = [url for url in candidate_urls if url not in valid_calls_urls]\n", 278 | "print('n candidate urls', len(candidate_urls))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 12, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "text/plain": [ 289 | "132" 290 | ] 291 | }, 292 | "execution_count": 12, 293 | "metadata": {}, 294 | "output_type": "execute_result" 295 | } 296 | ], 297 | "source": [ 298 | "len(set(all_candidate_urls) - set(candidate_urls))" 299 | ] 300 | 
}, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 13, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "132\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "print(len(set(all_candidate_urls) - set(candidate_urls)))\n", 316 | "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n", 317 | "with open('not_canvas_fingerprinting_2.json', 'w') as f:\n", 318 | " f.write(json.dumps(disgarded_urls)) " 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "## 3. Must have written 10 or more characters" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 14, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "## Code sourced from: github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py\n", 335 | "\n", 336 | "def text_length(arg_0):\n", 337 | " return len(arg_0.encode('ascii', 'ignore'))" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 15, 343 | "metadata": { 344 | "scrolled": true 345 | }, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/html": [ 350 | "
\n", 351 | "\n", 364 | "\n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | "
script_urlsymbolargument_0len_arg
944http://www.qvc.com/akam/10/2b30e194CanvasRenderingContext2D.fillTextSoft Ruddy Foothold 221
951http://www.qvc.com/akam/10/2b30e194CanvasRenderingContext2D.fillText!H71JCaj)]# 1@#15
1007http://www.qvc.com/_bm/async.jsCanvasRenderingContext2D.fillText<@nv45. F1n63r,Pr1n71n6!24
2824http://www.coupang.com/akam/10/4f2b47CanvasRenderingContext2D.fillTextSoft Ruddy Foothold 221
2831http://www.coupang.com/akam/10/4f2b47CanvasRenderingContext2D.fillText!H71JCaj)]# 1@#15
\n", 412 | "
" 413 | ], 414 | "text/plain": [ 415 | " script_url \\\n", 416 | "944 http://www.qvc.com/akam/10/2b30e194 \n", 417 | "951 http://www.qvc.com/akam/10/2b30e194 \n", 418 | "1007 http://www.qvc.com/_bm/async.js \n", 419 | "2824 http://www.coupang.com/akam/10/4f2b47 \n", 420 | "2831 http://www.coupang.com/akam/10/4f2b47 \n", 421 | "\n", 422 | " symbol argument_0 len_arg \n", 423 | "944 CanvasRenderingContext2D.fillText Soft Ruddy Foothold 2 21 \n", 424 | "951 CanvasRenderingContext2D.fillText !H71JCaj)]# 1@# 15 \n", 425 | "1007 CanvasRenderingContext2D.fillText <@nv45. F1n63r,Pr1n71n6! 24 \n", 426 | "2824 CanvasRenderingContext2D.fillText Soft Ruddy Foothold 2 21 \n", 427 | "2831 CanvasRenderingContext2D.fillText !H71JCaj)]# 1@# 15 " 428 | ] 429 | }, 430 | "execution_count": 15, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "df_write = dd.read_parquet(DATA_FILE, columns=['script_url', 'symbol', 'argument_0'])\n", 437 | "df_write = df_write[df_write.script_url.isin(candidate_urls)]\n", 438 | "df_write = df_write[df_write.symbol.isin(['CanvasRenderingContext2D.fillText', 'CanvasRenderingContext2D.strokeText'])]\n", 439 | "df_write['len_arg'] = df_write.argument_0.apply(text_length, meta=('int'))\n", 440 | "df_write = df_write[df_write.len_arg >= 10]\n", 441 | "df_write = df_write.compute()\n", 442 | "df_write.head()" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 16, 448 | "metadata": { 449 | "scrolled": true 450 | }, 451 | "outputs": [ 452 | { 453 | "name": "stdout", 454 | "output_type": "stream", 455 | "text": [ 456 | "n \"3 too long writes\" urls 8514\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "too_many_write_urls = df_write.script_url.unique()\n", 462 | "print('n \"3 too long writes\" urls', len(too_many_write_urls))" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "## Apply 3" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 17, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "text_filter = set(too_many_write_urls)\n", 479 | "candidate_urls = list(text_filter)" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 18, 485 | "metadata": { 486 | "scrolled": true 487 | }, 488 | "outputs": [ 489 | { 490 | "name": "stdout", 491 | "output_type": "stream", 492 | "text": [ 493 | "18495\n" 494 | ] 495 | } 496 | ], 497 | "source": [ 498 | "print(len(set(all_candidate_urls) - set(candidate_urls)))\n", 499 | "disgarded_urls = [url for url in all_candidate_urls if url not in candidate_urls]\n", 500 | "with open('not_canvas_fingerprinting_3.json', 'w') as f:\n", 501 | " f.write(json.dumps(disgarded_urls)) " 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 19, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "with open('canvas_fingerprinting.json', 'w') as f:\n", 511 | " f.write(json.dumps(candidate_urls))" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": 20, 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [ 520 | "with open('not_canvas_fingerprinting.json', 'w') as f:\n", 521 | " f.write(json.dumps(disgarded_urls))" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "## Find Locations" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 32, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": 
"stream", 539 | "text": [ 540 | "8514 == 8514\n" 541 | ] 542 | } 543 | ], 544 | "source": [ 545 | "with open('canvas_fingerprinting.json', 'r') as f:\n", 546 | " canvas_fp_urls = json.load(f)\n", 547 | " \n", 548 | "print(len(canvas_fp_urls), '== 8514')" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 7, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "df = dd.read_parquet(DATA_FILE, columns=['script_url', 'location'])" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 34, 563 | "metadata": {}, 564 | "outputs": [ 565 | { 566 | "name": "stdout", 567 | "output_type": "stream", 568 | "text": [ 569 | "[########################################] | 100% Completed | 3min 0.8s\r" 570 | ] 571 | } 572 | ], 573 | "source": [ 574 | "df_locs = df[df.script_url.isin(canvas_fp_urls)]\n", 575 | "locs = df_locs.location.unique().persist()\n", 576 | "progress(locs, notebook=False)" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 35, 582 | "metadata": {}, 583 | "outputs": [ 584 | { 585 | "name": "stdout", 586 | "output_type": "stream", 587 | "text": [ 588 | "# of locations that call canvas fingerprinting scripts: 38419\n" 589 | ] 590 | } 591 | ], 592 | "source": [ 593 | "print('# of locations that call canvas fingerprinting scripts:', len(locs))" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [] 602 | } 603 | ], 604 | "metadata": { 605 | "kernelspec": { 606 | "display_name": "overscripted", 607 | "language": "python", 608 | "name": "overscripted" 609 | }, 610 | "language_info": { 611 | "codemirror_mode": { 612 | "name": "ipython", 613 | "version": 3 614 | }, 615 | "file_extension": ".py", 616 | "mimetype": "text/x-python", 617 | "name": "python", 618 | "nbconvert_exporter": "python", 619 | "pygments_lexer": "ipython3", 620 | "version": "3.5.6" 621 | } 622 | }, 623 | "nbformat": 4, 624 | "nbformat_minor": 2 625 | } 626 | -------------------------------------------------------------------------------- /analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense/README.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | 3 | This is an implementation of the heuristics defined in _The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors_ 4 | by Anupam Das, Gunes Acar, Nikita Borisov and Amogh Pradeep. The heuristics are then run against the OverScripted dataset 5 | to determine the prevalence of fingerprinting scripts and locations that call those scripts. 6 | 7 | ## Original Code 8 | 9 | The open-source implementation of _The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors_ can be found on 10 | [GitHub](https://github.com/sensor-js/OpenWPM-mobile). All heuristics are implemented in 11 | [extract_features.py](https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py). 
12 | 13 | ## Overall Stats 14 | 15 | - Audio Fingerprinting: 170 scripts present on 2006 locations 16 | - Canvas Fingerprinting: 8,514 scripts present on 38,419 locations 17 | - CanvasFont Fingerprinting: 1,387 scripts present on 2,293 locations 18 | - WebRTC Fingerprinting: 1,313 scripts present on 15,360 locations 19 | -------------------------------------------------------------------------------- /analyses/README.md: -------------------------------------------------------------------------------- 1 | # Analyses Folder 2 | 3 | 4 | ## 1. Download Data 5 | 6 | See main [README.md](https://github.com/mozilla/overscripted/blob/master/README.md) and unzip. 7 | 8 | 9 | ## 2. Install 10 | Install [Anaconda](https://www.anaconda.com/download) or [Miniconda](https://conda.io/miniconda.html). 11 | 12 | > Optionally 13 | 14 | Install [Spark](http://spark.apache.org/) 15 | 16 | 17 | ## 3. Setup and activate environment 18 | 19 | ``` 20 | $ conda env create -f environment.yaml 21 | ``` 22 | ``` 23 | $ conda activate overscripted 24 | ``` 25 | ## 4. Run Jupyter 26 | 27 | ``` 28 | $ jupyter notebook 29 | ``` 30 | 31 | -------------------------------------------------------------------------------- /analyses/environment.yaml: -------------------------------------------------------------------------------- 1 | name: overscripted 2 | channels: 3 | - defaults 4 | - conda-forge 5 | dependencies: 6 | - dask=1.1.4 7 | - distributed=1.26.0 8 | - findspark=1.3.0 9 | - jupyter=1.0.0 10 | - python=3.6 11 | - pyarrow=0.12.1 12 | - pandas=0.24.2 13 | - tldextract=2.2.0 14 | -------------------------------------------------------------------------------- /data_prep/Sample Review.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/dask/config.py:168: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", 13 | " data = yaml.load(f.read()) or {}\n", 14 | "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/distributed/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", 15 | " defaults = yaml.load(f)\n" 16 | ] 17 | }, 18 | { 19 | "data": { 20 | "text/html": [ 21 | "\n", 22 | "\n", 23 | "\n", 30 | "\n", 38 | "\n", 39 | "
\n", 24 | "

Client

\n", 25 | "\n", 29 | "
\n", 31 | "

Cluster

\n", 32 | "
    \n", 33 | "
  • Workers: 4
  • \n", 34 | "
  • Cores: 12
  • \n", 35 | "
  • Memory: 33.35 GB
  • \n", 36 | "
\n", 37 | "
" 40 | ], 41 | "text/plain": [ 42 | "" 43 | ] 44 | }, 45 | "execution_count": 1, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "import dask.dataframe as dd\n", 52 | "from dask.distributed import Client\n", 53 | "\n", 54 | "client = Client()\n", 55 | "client" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 9, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "import os\n", 65 | "\n", 66 | "# Set to your local data directory with a string or an environment variable\n", 67 | "# DATA_DIR = '/path/to/your/data'\n", 68 | "DATA_DIR = os.environ.get('DATA_DIR')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "This notebook briefly describes each of the sample datasets that have been uploaded.\n", 76 | "\n", 77 | "When using a sample dataset, take time to consider if it's representative / suitable for the problem you're trying to solve and discuss your thoughts at the start of your analysis.\n", 78 | "\n", 79 | "Samples:\n", 80 | "\n", 81 | "1. value_1000_only\n", 82 | "1. sample_10percent\n", 83 | "1. sample_10percent_value_1000_only" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "**1. value_1000_only**\n", 91 | "\n", 92 | "The `value` field can get very large. This dataset contains all the rows of the dataset, but truncates the `value` field to only keep the first 1000 characters in a column called `value_1000`. The `value_len` column contains the length of the original `value` field.\n", 93 | "\n", 94 | "This shrinks the full dataset to 15GB on disk when uncompressed (from 70GB)." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 28, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n", 106 | " 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n", 107 | " 'arguments_n_keys', 'call_id', 'call_stack', 'file_name', 'func_name',\n", 108 | " 'in_crawl_list', 'in_iframe', 'in_stripped_crawl_list', 'location',\n", 109 | " 'locations_len', 'operation', 'script_url', 'symbol', 'time_stamp',\n", 110 | " 'value_1000', 'value_len'],\n", 111 | " dtype='object')" 112 | ] 113 | }, 114 | "execution_count": 28, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "df = dd.read_parquet(DATA_DIR + 'clean_value_1000.parquet', engine='pyarrow')\n", 121 | "df.columns" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 7, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "Number of rows: 113,790,686.\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "print(f'Number of rows: {len(df):,}.')" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "**2. sample_10percent**\n", 146 | "\n", 147 | "This is a 10% sample of the complete dataset. It is 7.4GB when uncompressed on disk.\n", 148 | "\n", 149 | "The sampling procedure was to take 10% of the unique values in the `location` field, and then take all calls for those locations." 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 10, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n", 161 | " 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n", 162 | " 'arguments_n_keys', 'call_stack', 'crawl_id', 'file_name', 'func_name',\n", 163 | " 'in_iframe', 'location', 'operation', 'script_col', 'script_line',\n", 164 | " 'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'value',\n", 165 | " 'value_1000', 'value_len'],\n", 166 | " dtype='object')" 167 | ] 168 | }, 169 | "execution_count": 10, 170 | "metadata": {}, 171 | "output_type": "execute_result" 172 | } 173 | ], 174 | "source": [ 175 | "df = dd.read_parquet(DATA_DIR + 'sample_10percent.parquet', engine='pyarrow')\n", 176 | "df.columns" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 20, 182 | "metadata": {}, 183 | "outputs": [ 184 | { 185 | "name": "stdout", 186 | "output_type": "stream", 187 | "text": [ 188 | "Number of rows: 11,292,867.\n" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "print(f'Number of rows: {len(df):,}.')" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "**3. sample_10percent_value_1000_only**\n", 201 | "\n", 202 | "This is a combination of the ideas from the previous two samples. It is the *same* 10% from \"2\" with the `value` column removed, leaving just the `value_1000` column.\n", 203 | "\n", 204 | "This dataset is just 1.3GB uncompressed on disk." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 22, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "data": { 214 | "text/plain": [ 215 | "Index(['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',\n", 216 | " 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',\n", 217 | " 'arguments_n_keys', 'call_stack', 'crawl_id', 'file_name', 'func_name',\n", 218 | " 'in_iframe', 'location', 'operation', 'script_col', 'script_line',\n", 219 | " 'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'value_1000',\n", 220 | " 'value_len'],\n", 221 | " dtype='object')" 222 | ] 223 | }, 224 | "execution_count": 22, 225 | "metadata": {}, 226 | "output_type": "execute_result" 227 | } 228 | ], 229 | "source": [ 230 | "df = dd.read_parquet(DATA_DIR + 'sample_10percent_value_1000_only.parquet', engine='pyarrow')\n", 231 | "df.columns" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 23, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "Number of rows: 11,292,867.\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "print(f'Number of rows: {len(df):,}.')" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [] 257 | } 258 | ], 259 | "metadata": { 260 | "kernelspec": { 261 | "display_name": "Python 3", 262 | "language": "python", 263 | "name": "python3" 264 | }, 265 | "language_info": { 266 | "codemirror_mode": { 267 | "name": "ipython", 268 | "version": 3 269 | }, 270 | "file_extension": ".py", 271 | "mimetype": "text/x-python", 272 | "name": "python", 273 | "nbconvert_exporter": "python", 274 | "pygments_lexer": "ipython3", 275 | "version": "3.6.7" 276 | } 277 | }, 278 | "nbformat": 4, 279 | "nbformat_minor": 2 280 | } 281 | 
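
The location-based sampling procedure described in the notebook above can be reproduced roughly as follows. This is a sketch only: the input and output filenames and the random seed are placeholders, and the full dataset is assumed to be readable as parquet with a `location` column.

```python
import dask.dataframe as dd
import pandas as pd

DATA_DIR = '/path/to/your/data/'  # placeholder, as in the notebook above

# Load the full dataset (the filename here is illustrative).
df = dd.read_parquet(DATA_DIR + 'full_dataset.parquet', engine='pyarrow')

# 1. Take 10% of the unique values in the `location` field ...
locations = pd.Series(df.location.unique().compute())
sampled_locations = locations.sample(frac=0.1, random_state=0)

# 2. ... then keep all calls made on those locations.
sample = df[df.location.isin(sampled_locations.tolist())]
sample.to_parquet(DATA_DIR + 'sample_10percent.parquet', engine='pyarrow')
```
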
-------------------------------------------------------------------------------- /data_prep/raw_data_schema.template: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-04/schema#", 3 | "title": "UCOSP Crawl - Call Schema", 4 | "description": "The schema for a row of the raw data in the crawl catalog. (The final dataset has additional derived columns.)", 5 | "type": "object", 6 | "properties": { 7 | "arguments": { 8 | "description": "Any arguments passed to the javascript call. When present, takes the form of an object with numeric string keys e.g. '0', '1', up to a max of '9'. The validator does not check for this yet, as a satisfactory regex has not been found.", 9 | "type": "string" 10 | }, 11 | "call_stack": { 12 | "description": "69% of calls have no call_stack. Where there is a call_stack, splitting on '\n' yields entries of the form func_name@script_url:script_line:script_col, i.e. the same values that are in func_name, script_url, script_line and script_col.", 13 | "type": "string", 14 | "pattern": "^$|^(?!undefined).*$" 15 | }, 16 | "crawl_id": { 17 | "description": "The ID for this crawl.", 18 | "enum": [1] 19 | }, 20 | "func_name": { 21 | "description": "Empty string or the name of the function that was executed. Note: this is more liberal than the current validation.", 22 | "type": "string" 23 | }, 24 | "in_iframe": { 25 | "description": "Was JS being executed in an iframe.", 26 | "type": "boolean" 27 | }, 28 | "location": { 29 | "description": "The location of the loaded page from which javascript calls are being captured.", 30 | "type": "string", 31 | "format": "uri", 32 | "pattern": "^https?:\/\/\\S*" 33 | }, 34 | "operation": { 35 | "description": "The type of operation.", 36 | "type": "string", 37 | "enum": ["get", "set", "call", "set (failed)"] 38 | }, 39 | "script_col": { 40 | "description": "The column location in the script where the call is captured. We want this to be an integer, but we must test for a numeric string.", 41 | "type": "string", 42 | "pattern": "^$|^[0-9]+$" 43 | }, 44 | "script_line": { 45 | "description": "The line location in the script where the call is captured. We want this to be an integer, but we must test for a numeric string.", 46 | "type": "string", 47 | "pattern": "^[0-9]+$" 48 | }, 49 | "script_loc_eval": { 50 | "description": "Empty string, or set when the call originates from eval() or new Function(); takes the form 'line <n> > eval' or 'line <n> > Function', possibly repeated. See schema.md for details.", 51 | "type": "string", 52 | "pattern": "^$|^(line [0-9]* > (eval|Function)[ ]?)*$" 53 | }, 54 | "script_url": { 55 | "description": "The location of the script url that is being executed. Liberally letting things through to see what's there.", 56 | "type": "string", 57 | "minLength": 1 58 | }, 59 | "symbol": { 60 | "description": "The js Symbol. Has 282 possible values in this dataset (this enum is derived from the data; there are more possible symbols in JS).", 61 | "type": "string", 62 | "enum": {{ list_of_symbols }} 63 | }, 64 | "time_stamp": { 65 | "description": "Time at which the call was captured. Valid timestamps: 2017-12-16T10:12:58Z, 2017-12-16T10:12:58.000Z, 2017-12-16T10:12:58+0000", 66 | "type": "string", 67 | "format": "date-time", 68 | "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.?\\d{0,3}Z$|^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\+\\d{4}$" 69 | }, 70 | "value": { 71 | "description": "The value that was passed to the javascript call. 
Can be a few characters or over 1 million.", 72 | "type": "string" 73 | } 74 | 75 | }, 76 | "required": [ 77 | "call_stack", 78 | "crawl_id", 79 | "func_name", 80 | "in_iframe", 81 | "location", 82 | "operation", 83 | "script_col", 84 | "script_line", 85 | "script_loc_eval", 86 | "script_url", 87 | "symbol", 88 | "time_stamp", 89 | "value" 90 | ] 91 | } -------------------------------------------------------------------------------- /data_prep/symbol_counts.csv: -------------------------------------------------------------------------------- 1 | window.document.cookie,35455680 2 | window.navigator.userAgent,15534371 3 | window.Storage.getItem,10553944 4 | window.localStorage,8767285 5 | window.Storage.setItem,4175556 6 | window.sessionStorage,4033894 7 | window.Storage.removeItem,2932713 8 | window.name,2484730 9 | CanvasRenderingContext2D.fillStyle,1957519 10 | window.navigator.plugins[Shockwave Flash].description,1863285 11 | window.screen.colorDepth,1449905 12 | window.navigator.appName,1286084 13 | window.navigator.language,1172256 14 | window.navigator.platform,1140738 15 | CanvasRenderingContext2D.save,1000762 16 | CanvasRenderingContext2D.restore,997755 17 | CanvasRenderingContext2D.fill,954340 18 | CanvasRenderingContext2D.fillRect,936267 19 | window.navigator.plugins[Shockwave Flash].name,895289 20 | CanvasRenderingContext2D.font,814310 21 | CanvasRenderingContext2D.lineWidth,718195 22 | window.navigator.appVersion,707298 23 | window.navigator.cookieEnabled,692524 24 | HTMLCanvasElement.width,681003 25 | CanvasRenderingContext2D.strokeStyle,650211 26 | HTMLCanvasElement.height,644476 27 | HTMLCanvasElement.getContext,596749 28 | window.Storage.key,551691 29 | CanvasRenderingContext2D.fillText,542896 30 | window.Storage.length,541077 31 | CanvasRenderingContext2D.stroke,537024 32 | CanvasRenderingContext2D.measureText,522209 33 | window.navigator.vendor,487833 34 | window.navigator.doNotTrack,468365 35 | CanvasRenderingContext2D.arc,413449 36 | HTMLCanvasElement.style,294223 37 | CanvasRenderingContext2D.textBaseline,293489 38 | window.navigator.product,279653 39 | CanvasRenderingContext2D.textAlign,246380 40 | window.navigator.plugins[Shockwave Flash].filename,225751 41 | window.navigator.mimeTypes[application/x-shockwave-flash].type,213769 42 | window.navigator.languages,199435 43 | window.navigator.plugins[Shockwave Flash].length,184995 44 | CanvasRenderingContext2D.bezierCurveTo,176757 45 | CanvasRenderingContext2D.shadowBlur,172808 46 | CanvasRenderingContext2D.shadowOffsetY,161446 47 | CanvasRenderingContext2D.shadowOffsetX,159263 48 | CanvasRenderingContext2D.shadowColor,158579 49 | window.screen.pixelDepth,156326 50 | CanvasRenderingContext2D.rect,154007 51 | HTMLCanvasElement.nodeType,153630 52 | CanvasRenderingContext2D.lineJoin,151407 53 | window.navigator.mimeTypes[application/futuresplash].type,150364 54 | CanvasRenderingContext2D.lineCap,149511 55 | window.navigator.plugins[Shockwave Flash].version,144656 56 | CanvasRenderingContext2D.strokeRect,142838 57 | HTMLCanvasElement.toDataURL,135041 58 | CanvasRenderingContext2D.createRadialGradient,132444 59 | CanvasRenderingContext2D.globalCompositeOperation,122162 60 | window.navigator.onLine,116037 61 | CanvasRenderingContext2D.scale,115227 62 | window.Storage.hasOwnProperty,108138 63 | CanvasRenderingContext2D.clip,106238 64 | CanvasRenderingContext2D.miterLimit,102589 65 | window.navigator.mimeTypes[application/x-shockwave-flash].suffixes,94030 66 | window.navigator.mimeTypes[application/futuresplash].suffixes,94025 67 
| RTCPeerConnection.localDescription,88683 68 | window.navigator.productSub,71139 69 | window.navigator.mimeTypes[application/x-shockwave-flash].description,70284 70 | window.navigator.mimeTypes[application/futuresplash].description,70278 71 | HTMLCanvasElement.nodeName,67621 72 | CanvasRenderingContext2D.rotate,63824 73 | HTMLCanvasElement.parentNode,57192 74 | window.navigator.oscpu,54799 75 | window.navigator.appCodeName,51161 76 | CanvasRenderingContext2D.createLinearGradient,46710 77 | CanvasRenderingContext2D.putImageData,45469 78 | window.navigator.geolocation,43022 79 | CanvasRenderingContext2D.getImageData,41412 80 | HTMLCanvasElement.ownerDocument,37831 81 | HTMLCanvasElement.className,36778 82 | RTCPeerConnection.onicecandidate,32522 83 | HTMLCanvasElement.getAttribute,31800 84 | window.navigator.vendorSub,26840 85 | HTMLCanvasElement.addEventListener,23485 86 | window.navigator.buildID,23419 87 | HTMLCanvasElement.classList,22963 88 | HTMLCanvasElement.setAttribute,20689 89 | HTMLCanvasElement.clientHeight,20665 90 | HTMLCanvasElement.clientWidth,20341 91 | HTMLCanvasElement.getElementsByTagName,16224 92 | HTMLCanvasElement.tagName,14475 93 | RTCPeerConnection.iceGatheringState,13984 94 | RTCPeerConnection.createDataChannel,13776 95 | RTCPeerConnection.signalingState,13160 96 | RTCPeerConnection.remoteDescription,13113 97 | RTCPeerConnection.createOffer,13015 98 | CanvasRenderingContext2D.setLineDash,12590 99 | HTMLCanvasElement.onselectstart,12054 100 | RTCPeerConnection.setLocalDescription,11844 101 | CanvasRenderingContext2D.arcTo,11428 102 | CanvasRenderingContext2D.isPointInPath,11342 103 | CanvasRenderingContext2D.createImageData,11163 104 | HTMLCanvasElement.id,10941 105 | CanvasRenderingContext2D.imageSmoothingEnabled,9722 106 | HTMLCanvasElement.draggable,9558 107 | HTMLCanvasElement.constructor,9246 108 | CanvasRenderingContext2D.createPattern,8713 109 | CanvasRenderingContext2D.lineDashOffset,7726 110 | HTMLCanvasElement.offsetWidth,7346 111 | CanvasRenderingContext2D.mozImageSmoothingEnabled,6561 112 | RTCPeerConnection.idpLoginUrl,6556 113 | RTCPeerConnection.peerIdentity,6556 114 | RTCPeerConnection.onremovestream,6556 115 | HTMLCanvasElement.offsetHeight,6175 116 | CanvasRenderingContext2D.strokeText,5147 117 | HTMLCanvasElement.firstChild,4897 118 | HTMLCanvasElement.hasAttribute,4604 119 | HTMLCanvasElement.localName,4577 120 | HTMLCanvasElement.attributes,4507 121 | HTMLCanvasElement.nextSibling,3857 122 | AudioContext.destination,3758 123 | HTMLCanvasElement.firstElementChild,3586 124 | HTMLCanvasElement.nextElementSibling,3560 125 | window.Storage.clear,3348 126 | HTMLCanvasElement.dir,3171 127 | CanvasRenderingContext2D.mozCurrentTransform,3102 128 | OscillatorNode.frequency,3056 129 | AudioContext.createOscillator,2898 130 | OscillatorNode.start,2687 131 | CanvasRenderingContext2D.__lookupGetter__,2543 132 | HTMLCanvasElement.childNodes,2541 133 | CanvasRenderingContext2D.hasOwnProperty,2422 134 | HTMLCanvasElement.getBoundingClientRect,2276 135 | HTMLCanvasElement.offsetLeft,2096 136 | OscillatorNode.type,2011 137 | OscillatorNode.connect,2011 138 | CanvasRenderingContext2D.mozCurrentTransformInverse,1890 139 | HTMLCanvasElement.removeAttribute,1814 140 | HTMLCanvasElement.offsetTop,1812 141 | HTMLCanvasElement.children,1795 142 | HTMLCanvasElement.dispatchEvent,1698 143 | HTMLCanvasElement.mozOpaque,1687 144 | HTMLCanvasElement.onmousemove,1538 145 | AudioContext.createDynamicsCompressor,1535 146 | HTMLCanvasElement.offsetParent,1499 147 | 
OfflineAudioContext.startRendering,1381 148 | OfflineAudioContext.createDynamicsCompressor,1380 149 | OfflineAudioContext.oncomplete,1380 150 | OfflineAudioContext.createOscillator,1380 151 | OfflineAudioContext.destination,1380 152 | HTMLCanvasElement.remove,1257 153 | HTMLCanvasElement.compareDocumentPosition,1253 154 | AudioContext.state,1249 155 | AudioContext.listener,1230 156 | GainNode.connect,1204 157 | AudioContext.createGain,1197 158 | GainNode.gain,1112 159 | HTMLCanvasElement.__proto__,1028 160 | window.Storage.toString,1027 161 | AudioContext.createAnalyser,905 162 | HTMLCanvasElement.cloneNode,899 163 | AudioContext.sampleRate,882 164 | AudioContext.decodeAudioData,876 165 | AudioContext.createMediaElementSource,860 166 | HTMLCanvasElement.toBlob,837 167 | HTMLCanvasElement.removeEventListener,779 168 | AnalyserNode.fftSize,774 169 | AnalyserNode.maxDecibels,771 170 | AnalyserNode.smoothingTimeConstant,770 171 | AnalyserNode.frequencyBinCount,770 172 | AnalyserNode.minDecibels,769 173 | RTCPeerConnection.addIceCandidate,769 174 | AudioContext.onstatechange,745 175 | HTMLCanvasElement.textContent,628 176 | HTMLCanvasElement.onclick,466 177 | HTMLCanvasElement.innerHTML,437 178 | window.Storage.valueOf,423 179 | RTCPeerConnection.setRemoteDescription,379 180 | RTCPeerConnection.getStats,361 181 | AudioContext.currentTime,354 182 | OscillatorNode.stop,351 183 | RTCPeerConnection.removeEventListener,346 184 | RTCPeerConnection.addEventListener,346 185 | HTMLCanvasElement.__lookupGetter__,344 186 | AudioContext.createScriptProcessor,337 187 | HTMLCanvasElement.hasOwnProperty,312 188 | HTMLCanvasElement.onmousedown,310 189 | HTMLCanvasElement.toString,291 190 | ScriptProcessorNode.connect,288 191 | ScriptProcessorNode.onaudioprocess,287 192 | AnalyserNode.connect,285 193 | HTMLCanvasElement.blur,280 194 | HTMLCanvasElement.getAttributeNode,237 195 | HTMLCanvasElement.onmouseout,232 196 | HTMLCanvasElement.onmouseover,229 197 | HTMLCanvasElement.append,227 198 | HTMLCanvasElement.onmouseup,227 199 | CanvasRenderingContext2D.ellipse,166 200 | HTMLCanvasElement.setAttributeNode,152 201 | HTMLCanvasElement.oncontextmenu,152 202 | CanvasRenderingContext2D.getLineDash,146 203 | HTMLCanvasElement.previousSibling,139 204 | HTMLCanvasElement.parentElement,136 205 | HTMLCanvasElement.innerText,134 206 | HTMLCanvasElement.onkeydown,132 207 | HTMLCanvasElement.onkeyup,129 208 | HTMLCanvasElement.onkeypress,128 209 | HTMLCanvasElement.onblur,128 210 | HTMLCanvasElement.onfocus,128 211 | HTMLCanvasElement.onmouseleave,127 212 | HTMLCanvasElement.ondblclick,126 213 | HTMLCanvasElement.ondragenter,125 214 | HTMLCanvasElement.onresize,125 215 | HTMLCanvasElement.onpaste,125 216 | HTMLCanvasElement.onchange,125 217 | HTMLCanvasElement.oncut,125 218 | HTMLCanvasElement.ondragover,125 219 | HTMLCanvasElement.ondragleave,125 220 | HTMLCanvasElement.ondrop,125 221 | HTMLCanvasElement.onmouseenter,125 222 | HTMLCanvasElement.onload,125 223 | HTMLCanvasElement.contains,102 224 | HTMLCanvasElement.querySelectorAll,98 225 | GainNode.disconnect,77 226 | AudioContext.createBufferSource,70 227 | HTMLCanvasElement.hasChildNodes,67 228 | AudioContext.createBuffer,63 229 | AudioContext.createPanner,60 230 | HTMLCanvasElement.scrollLeft,60 231 | HTMLCanvasElement.scrollTop,60 232 | CanvasRenderingContext2D.__lookupSetter__,58 233 | CanvasRenderingContext2D.__defineSetter__,58 234 | HTMLCanvasElement.ondragstart,50 235 | HTMLCanvasElement.getClientRects,49 236 | HTMLCanvasElement.title,44 237 | 
HTMLCanvasElement.tabIndex,43 238 | RTCPeerConnection.close,43 239 | RTCPeerConnection.iceConnectionState,33 240 | AudioContext.close,32 241 | HTMLCanvasElement.hasAttributes,25 242 | HTMLCanvasElement.previousElementSibling,23 243 | OscillatorNode.disconnect,22 244 | HTMLCanvasElement.focus,22 245 | RTCPeerConnection.onsignalingstatechange,16 246 | RTCPeerConnection.oniceconnectionstatechange,16 247 | HTMLCanvasElement.valueOf,16 248 | HTMLCanvasElement.dataset,15 249 | HTMLCanvasElement.requestPointerLock,15 250 | HTMLCanvasElement.namespaceURI,13 251 | HTMLCanvasElement.webkitMatchesSelector,12 252 | HTMLCanvasElement.childElementCount,11 253 | HTMLCanvasElement.removeChild,8 254 | HTMLCanvasElement.insertBefore,8 255 | GainNode.numberOfOutputs,7 256 | HTMLCanvasElement.matches,6 257 | HTMLCanvasElement.outerHTML,6 258 | HTMLCanvasElement.appendChild,6 259 | AudioContext.resume,5 260 | AnalyserNode.getByteFrequencyData,5 261 | HTMLCanvasElement.clientTop,4 262 | HTMLCanvasElement.clientLeft,4 263 | HTMLCanvasElement.onwheel,4 264 | HTMLCanvasElement.DOCUMENT_NODE,4 265 | RTCPeerConnection.onaddstream,3 266 | AnalyserNode.channelInterpretation,3 267 | AnalyserNode.numberOfInputs,3 268 | AnalyserNode.channelCountMode,3 269 | AnalyserNode.numberOfOutputs,3 270 | AnalyserNode.channelCount,3 271 | HTMLCanvasElement.scrollWidth,3 272 | HTMLCanvasElement.scrollHeight,3 273 | CanvasRenderingContext2D.__proto__,3 274 | HTMLCanvasElement.getElementsByClassName,3 275 | CanvasRenderingContext2D.__defineGetter__,3 276 | HTMLCanvasElement.querySelector,2 277 | OfflineAudioContext.decodeAudioData,2 278 | RTCPeerConnection.createAnswer,2 279 | CanvasRenderingContext2D.filter,2 280 | AudioContext.createConvolver,1 281 | HTMLCanvasElement.lastChild,1 282 | CanvasRenderingContext2D.toString,1 283 | -------------------------------------------------------------------------------- /schema.md: -------------------------------------------------------------------------------- 1 | * __call_stack:__ 2 | * __Type:__ String 3 | * __Description:__ The call stack at the point when the function is called. The output is in the format: (function_name)(@)(javascript_source_file)(:)(line_number)(:)(column_number)(new_line_character) 4 | * __Example:__ 5 | ``` 6 | jQuery.cookie@https://cdn.livechatinc.com/js/embedded.20171215135707.js:5:8393\nStore tag. Inside iFrame.html a line of javascript such as: alert("window.location") is used to assert the location of content. When openWPM queries content that is inside iFrame.html, which is found on Parent.html, the location of the content is reported as iFrame.html, not Parent.html. Due to the parallelization of the crawl, the iFrame content cannot be associated with the parent site on which it was encountered; only the __in_iframe__ field can indicate whether the content was executed inside an iFrame or not. All objects in a json file that were accessed from the crawled page outside of an iFrame should have the same location value. The url can be for any type of file, such as .html or .js, or have no file extension. 30 | * __Examples:__ 31 | ``` 32 | https://www.dresslily.com/bottom-c-36.html 33 | http://www.vidalfrance.com/component/forme/?fid=2 34 | ``` 35 | * __operation:__ 36 | * __Type:__ string 37 | * __Description:__ Corresponds to the "symbol" field. The operation is a call if the symbol is a method; get and set operations read and write symbols that are properties with values. 
38 | * __Possible Values:__ get, call, set, set (failed) 39 | * __script_col:__ 40 | * __Type:__ string 41 | * __Description:__ The column in the `script_line` where the function call starts. Note: currently some strings do not contain numbers; instead they contain urls, such as in the example below. 42 | * __Examples:__ 43 | ``` 44 | 57 45 | 211 46 | //hdjs.hiido.com/hiido_internal.js?siteid=mhssj 47 | ``` 48 | * __script_line:__ 49 | * __Type:__ string 50 | * __Description:__ The line in the file, indicated in the above `location` element, where the function call is located. Note: currently some strings do not contain numbers; instead they contain the protocol identifier for a url, such as in the example below. 51 | * __Examples:__ 52 | ``` 53 | 12 54 | 129 55 | http 56 | https 57 | ``` 58 | * __script_loc_eval:__ 59 | * __Type:__ string 60 | * __Description:__ If a function call is generated using the `eval()` function, or is created using `new Function()`, then the "script_loc_eval" value will be set. For example, `eval("console.log('my message')")` or `var log = new Function("message", "console.log(message)"); log("my message");` will both cause the "script_loc_eval" value to be set when the function calls are collected. The format of "script_loc_eval" is: (line) (LINE_NUMBER) (>) (eval | Function) and can be repeated multiple times. Additional information on how the eval line number is generated can be found at the bottom of the [MDN page](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error/Stack) which discusses the `Error` object's `stack` property. The "script_loc_eval" element is generated from this stack property. 61 | * __Examples:__ 62 | ``` 63 | "" 64 | line 2 > eval 65 | line 70 > Function 66 | line 140 > eval line 232 > Function 67 | line 1 > Function line 1 > eval line 1 > eval 68 | ``` 69 | * __script_url:__ 70 | * __Type:__ string 71 | * __Description:__ The url of the file where the javascript function call was run. This may be the same value as "location", or it may be an external web url that was loaded into the website with the use of the `