├── .gitignore
├── LICENSE
├── README.md
├── conf
│   ├── nginx.conf
│   ├── ssl.conf
│   ├── uwsgi-upstart.conf
│   └── uwsgi.ini
├── crawl.py
├── crawler
│   ├── args.py
│   ├── collector.py
│   ├── crawler_manager.py
│   ├── crawler_process.py
│   └── utils.py
├── requirements.txt
├── results_schema.sql
├── setup.cfg
├── urls.txt
├── utils
│   └── database.py
├── view.py
└── viewer
    ├── app.py
    ├── static
    │   ├── 16.png
    │   ├── index.css
    │   └── index.js
    └── templates
        ├── _crawls_table.html
        ├── _fingerprinters.html
        ├── crawls.html
        ├── errors.html
        ├── helpers
        │   └── form.html
        ├── index.html
        └── results.html

/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | *.crx
3 | *.sw?
4 | Session.vim
5 | results.sqlite3
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Mozilla Public License Version 2.0
2 | ==================================
3 |
4 | 1. Definitions
5 | --------------
6 |
7 | 1.1. "Contributor"
8 |     means each individual or legal entity that creates, contributes to
9 |     the creation of, or owns Covered Software.
10 |
11 | 1.2. "Contributor Version"
12 |     means the combination of the Contributions of others (if any) used
13 |     by a Contributor and that particular Contributor's Contribution.
14 |
15 | 1.3. "Contribution"
16 |     means Covered Software of a particular Contributor.
17 |
18 | 1.4. "Covered Software"
19 |     means Source Code Form to which the initial Contributor has attached
20 |     the notice in Exhibit A, the Executable Form of such Source Code
21 |     Form, and Modifications of such Source Code Form, in each case
22 |     including portions thereof.
23 |
24 | 1.5.
"Incompatible With Secondary Licenses" 25 | means 26 | 27 | (a) that the initial Contributor has attached the notice described 28 | in Exhibit B to the Covered Software; or 29 | 30 | (b) that the Covered Software was made available under the terms of 31 | version 1.1 or earlier of the License, but not also under the 32 | terms of a Secondary License. 33 | 34 | 1.6. "Executable Form" 35 | means any form of the work other than Source Code Form. 36 | 37 | 1.7. "Larger Work" 38 | means a work that combines Covered Software with other material, in 39 | a separate file or files, that is not Covered Software. 40 | 41 | 1.8. "License" 42 | means this document. 43 | 44 | 1.9. "Licensable" 45 | means having the right to grant, to the maximum extent possible, 46 | whether at the time of the initial grant or subsequently, any and 47 | all of the rights conveyed by this License. 48 | 49 | 1.10. "Modifications" 50 | means any of the following: 51 | 52 | (a) any file in Source Code Form that results from an addition to, 53 | deletion from, or modification of the contents of Covered 54 | Software; or 55 | 56 | (b) any new file in Source Code Form that contains any Covered 57 | Software. 58 | 59 | 1.11. "Patent Claims" of a Contributor 60 | means any patent claim(s), including without limitation, method, 61 | process, and apparatus claims, in any patent Licensable by such 62 | Contributor that would be infringed, but for the grant of the 63 | License, by the making, using, selling, offering for sale, having 64 | made, import, or transfer of either its Contributions or its 65 | Contributor Version. 66 | 67 | 1.12. "Secondary License" 68 | means either the GNU General Public License, Version 2.0, the GNU 69 | Lesser General Public License, Version 2.1, the GNU Affero General 70 | Public License, Version 3.0, or any later versions of those 71 | licenses. 72 | 73 | 1.13. "Source Code Form" 74 | means the form of the work preferred for making modifications. 75 | 76 | 1.14. 
"You" (or "Your") 77 | means an individual or a legal entity exercising rights under this 78 | License. For legal entities, "You" includes any entity that 79 | controls, is controlled by, or is under common control with You. For 80 | purposes of this definition, "control" means (a) the power, direct 81 | or indirect, to cause the direction or management of such entity, 82 | whether by contract or otherwise, or (b) ownership of more than 83 | fifty percent (50%) of the outstanding shares or beneficial 84 | ownership of such entity. 85 | 86 | 2. License Grants and Conditions 87 | -------------------------------- 88 | 89 | 2.1. Grants 90 | 91 | Each Contributor hereby grants You a world-wide, royalty-free, 92 | non-exclusive license: 93 | 94 | (a) under intellectual property rights (other than patent or trademark) 95 | Licensable by such Contributor to use, reproduce, make available, 96 | modify, display, perform, distribute, and otherwise exploit its 97 | Contributions, either on an unmodified basis, with Modifications, or 98 | as part of a Larger Work; and 99 | 100 | (b) under Patent Claims of such Contributor to make, use, sell, offer 101 | for sale, have made, import, and otherwise transfer either its 102 | Contributions or its Contributor Version. 103 | 104 | 2.2. Effective Date 105 | 106 | The licenses granted in Section 2.1 with respect to any Contribution 107 | become effective for each Contribution on the date the Contributor first 108 | distributes such Contribution. 109 | 110 | 2.3. Limitations on Grant Scope 111 | 112 | The licenses granted in this Section 2 are the only rights granted under 113 | this License. No additional rights or licenses will be implied from the 114 | distribution or licensing of Covered Software under this License. 
115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a 116 | Contributor: 117 | 118 | (a) for any code that a Contributor has removed from Covered Software; 119 | or 120 | 121 | (b) for infringements caused by: (i) Your and any other third party's 122 | modifications of Covered Software, or (ii) the combination of its 123 | Contributions with other software (except as part of its Contributor 124 | Version); or 125 | 126 | (c) under Patent Claims infringed by Covered Software in the absence of 127 | its Contributions. 128 | 129 | This License does not grant any rights in the trademarks, service marks, 130 | or logos of any Contributor (except as may be necessary to comply with 131 | the notice requirements in Section 3.4). 132 | 133 | 2.4. Subsequent Licenses 134 | 135 | No Contributor makes additional grants as a result of Your choice to 136 | distribute the Covered Software under a subsequent version of this 137 | License (see Section 10.2) or under the terms of a Secondary License (if 138 | permitted under the terms of Section 3.3). 139 | 140 | 2.5. Representation 141 | 142 | Each Contributor represents that the Contributor believes its 143 | Contributions are its original creation(s) or it has sufficient rights 144 | to grant the rights to its Contributions conveyed by this License. 145 | 146 | 2.6. Fair Use 147 | 148 | This License is not intended to limit any rights You have under 149 | applicable copyright doctrines of fair use, fair dealing, or other 150 | equivalents. 151 | 152 | 2.7. Conditions 153 | 154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 155 | in Section 2.1. 156 | 157 | 3. Responsibilities 158 | ------------------- 159 | 160 | 3.1. Distribution of Source Form 161 | 162 | All distribution of Covered Software in Source Code Form, including any 163 | Modifications that You create or to which You contribute, must be under 164 | the terms of this License. 
You must inform recipients that the Source 165 | Code Form of the Covered Software is governed by the terms of this 166 | License, and how they can obtain a copy of this License. You may not 167 | attempt to alter or restrict the recipients' rights in the Source Code 168 | Form. 169 | 170 | 3.2. Distribution of Executable Form 171 | 172 | If You distribute Covered Software in Executable Form then: 173 | 174 | (a) such Covered Software must also be made available in Source Code 175 | Form, as described in Section 3.1, and You must inform recipients of 176 | the Executable Form how they can obtain a copy of such Source Code 177 | Form by reasonable means in a timely manner, at a charge no more 178 | than the cost of distribution to the recipient; and 179 | 180 | (b) You may distribute such Executable Form under the terms of this 181 | License, or sublicense it under different terms, provided that the 182 | license for the Executable Form does not attempt to limit or alter 183 | the recipients' rights in the Source Code Form under this License. 184 | 185 | 3.3. Distribution of a Larger Work 186 | 187 | You may create and distribute a Larger Work under terms of Your choice, 188 | provided that You also comply with the requirements of this License for 189 | the Covered Software. If the Larger Work is a combination of Covered 190 | Software with a work governed by one or more Secondary Licenses, and the 191 | Covered Software is not Incompatible With Secondary Licenses, this 192 | License permits You to additionally distribute such Covered Software 193 | under the terms of such Secondary License(s), so that the recipient of 194 | the Larger Work may, at their option, further distribute the Covered 195 | Software under the terms of either this License or such Secondary 196 | License(s). 197 | 198 | 3.4. 
Notices 199 | 200 | You may not remove or alter the substance of any license notices 201 | (including copyright notices, patent notices, disclaimers of warranty, 202 | or limitations of liability) contained within the Source Code Form of 203 | the Covered Software, except that You may alter any license notices to 204 | the extent required to remedy known factual inaccuracies. 205 | 206 | 3.5. Application of Additional Terms 207 | 208 | You may choose to offer, and to charge a fee for, warranty, support, 209 | indemnity or liability obligations to one or more recipients of Covered 210 | Software. However, You may do so only on Your own behalf, and not on 211 | behalf of any Contributor. You must make it absolutely clear that any 212 | such warranty, support, indemnity, or liability obligation is offered by 213 | You alone, and You hereby agree to indemnify every Contributor for any 214 | liability incurred by such Contributor as a result of warranty, support, 215 | indemnity or liability terms You offer. You may include additional 216 | disclaimers of warranty and limitations of liability specific to any 217 | jurisdiction. 218 | 219 | 4. Inability to Comply Due to Statute or Regulation 220 | --------------------------------------------------- 221 | 222 | If it is impossible for You to comply with any of the terms of this 223 | License with respect to some or all of the Covered Software due to 224 | statute, judicial order, or regulation then You must: (a) comply with 225 | the terms of this License to the maximum extent possible; and (b) 226 | describe the limitations and the code they affect. Such description must 227 | be placed in a text file included with all distributions of the Covered 228 | Software under this License. Except to the extent prohibited by statute 229 | or regulation, such description must be sufficiently detailed for a 230 | recipient of ordinary skill to be able to understand it. 231 | 232 | 5. 
Termination 233 | -------------- 234 | 235 | 5.1. The rights granted under this License will terminate automatically 236 | if You fail to comply with any of its terms. However, if You become 237 | compliant, then the rights granted under this License from a particular 238 | Contributor are reinstated (a) provisionally, unless and until such 239 | Contributor explicitly and finally terminates Your grants, and (b) on an 240 | ongoing basis, if such Contributor fails to notify You of the 241 | non-compliance by some reasonable means prior to 60 days after You have 242 | come back into compliance. Moreover, Your grants from a particular 243 | Contributor are reinstated on an ongoing basis if such Contributor 244 | notifies You of the non-compliance by some reasonable means, this is the 245 | first time You have received notice of non-compliance with this License 246 | from such Contributor, and You become compliant prior to 30 days after 247 | Your receipt of the notice. 248 | 249 | 5.2. If You initiate litigation against any entity by asserting a patent 250 | infringement claim (excluding declaratory judgment actions, 251 | counter-claims, and cross-claims) alleging that a Contributor Version 252 | directly or indirectly infringes any patent, then the rights granted to 253 | You by any and all Contributors for the Covered Software under Section 254 | 2.1 of this License shall terminate. 255 | 256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all 257 | end user license agreements (excluding distributors and resellers) which 258 | have been validly granted by You or Your distributors under this License 259 | prior to termination shall survive termination. 260 | 261 | ************************************************************************ 262 | * * 263 | * 6. 
Disclaimer of Warranty * 264 | * ------------------------- * 265 | * * 266 | * Covered Software is provided under this License on an "as is" * 267 | * basis, without warranty of any kind, either expressed, implied, or * 268 | * statutory, including, without limitation, warranties that the * 269 | * Covered Software is free of defects, merchantable, fit for a * 270 | * particular purpose or non-infringing. The entire risk as to the * 271 | * quality and performance of the Covered Software is with You. * 272 | * Should any Covered Software prove defective in any respect, You * 273 | * (not any Contributor) assume the cost of any necessary servicing, * 274 | * repair, or correction. This disclaimer of warranty constitutes an * 275 | * essential part of this License. No use of any Covered Software is * 276 | * authorized under this License except under this disclaimer. * 277 | * * 278 | ************************************************************************ 279 | 280 | ************************************************************************ 281 | * * 282 | * 7. Limitation of Liability * 283 | * -------------------------- * 284 | * * 285 | * Under no circumstances and under no legal theory, whether tort * 286 | * (including negligence), contract, or otherwise, shall any * 287 | * Contributor, or anyone who distributes Covered Software as * 288 | * permitted above, be liable to You for any direct, indirect, * 289 | * special, incidental, or consequential damages of any character * 290 | * including, without limitation, damages for lost profits, loss of * 291 | * goodwill, work stoppage, computer failure or malfunction, or any * 292 | * and all other commercial damages or losses, even if such party * 293 | * shall have been informed of the possibility of such damages. 
This * 294 | * limitation of liability shall not apply to liability for death or * 295 | * personal injury resulting from such party's negligence to the * 296 | * extent applicable law prohibits such limitation. Some * 297 | * jurisdictions do not allow the exclusion or limitation of * 298 | * incidental or consequential damages, so this exclusion and * 299 | * limitation may not apply to You. * 300 | * * 301 | ************************************************************************ 302 | 303 | 8. Litigation 304 | ------------- 305 | 306 | Any litigation relating to this License may be brought only in the 307 | courts of a jurisdiction where the defendant maintains its principal 308 | place of business and such litigation shall be governed by laws of that 309 | jurisdiction, without reference to its conflict-of-law provisions. 310 | Nothing in this Section shall prevent a party's ability to bring 311 | cross-claims or counter-claims. 312 | 313 | 9. Miscellaneous 314 | ---------------- 315 | 316 | This License represents the complete agreement concerning the subject 317 | matter hereof. If any provision of this License is held to be 318 | unenforceable, such provision shall be reformed only to the extent 319 | necessary to make it enforceable. Any law or regulation which provides 320 | that the language of a contract shall be construed against the drafter 321 | shall not be used to construe this License against a Contributor. 322 | 323 | 10. Versions of the License 324 | --------------------------- 325 | 326 | 10.1. New Versions 327 | 328 | Mozilla Foundation is the license steward. Except as provided in Section 329 | 10.3, no one other than the license steward has the right to modify or 330 | publish new versions of this License. Each version will be given a 331 | distinguishing version number. 332 | 333 | 10.2. 
Effect of New Versions 334 | 335 | You may distribute the Covered Software under the terms of the version 336 | of the License under which You originally received the Covered Software, 337 | or under the terms of any subsequent version published by the license 338 | steward. 339 | 340 | 10.3. Modified Versions 341 | 342 | If you create software not governed by this License, and you want to 343 | create a new license for such software, you may create and use a 344 | modified version of this License if you rename the license and remove 345 | any references to the name of the license steward (except to note that 346 | such modified license differs from this License). 347 | 348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary 349 | Licenses 350 | 351 | If You choose to distribute Source Code Form that is Incompatible With 352 | Secondary Licenses under the terms of this version of the License, the 353 | notice described in Exhibit B of this License must be attached. 354 | 355 | Exhibit A - Source Code Form License Notice 356 | ------------------------------------------- 357 | 358 | This Source Code Form is subject to the terms of the Mozilla Public 359 | License, v. 2.0. If a copy of the MPL was not distributed with this 360 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 361 | 362 | If it is not possible or desirable to put the notice in a particular 363 | file, then You may include the notice in a location (such as a LICENSE 364 | file in a relevant directory) where a recipient would be likely to look 365 | for such a notice. 366 | 367 | You may add additional accurate notices of copyright ownership. 368 | 369 | Exhibit B - "Incompatible With Secondary Licenses" Notice 370 | --------------------------------------------------------- 371 | 372 | This Source Code Form is "Incompatible With Secondary Licenses", as 373 | defined by the Mozilla Public License, v. 2.0. 
374 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Chameleon Crawler
2 |
3 | Browser automation for [Chameleon](https://github.com/ghostwords/chameleon).
4 |
5 |
6 | ## Setup
7 |
8 | - Install Chromium, chromedriver, python3 and xvfb. On Ubuntu:
9 |   ```
10 |   sudo apt-get install chromium-browser chromium-chromedriver python3 xvfb
11 |   ```
12 |
13 | - Install the project's Python dependencies (documented in [requirements.txt](requirements.txt)). You might do this with `virtualenv` and `pip`, or maybe Docker. Note this is a Python 3 project.
14 |
15 | - Make sure `chromedriver` is in your $PATH. It's not on Ubuntu, so we have to fix that:
16 |   ```
17 |   sudo ln -s /usr/lib/chromium-browser/chromedriver /usr/local/bin/chromedriver
18 |   ```
19 |
20 | - If using Ubuntu 14.04, [fix chromedriver's shared libraries error](http://stackoverflow.com/questions/25695299/chromedriver-on-ubuntu-14-04-error-while-loading-shared-libraries-libui-base):
21 |   ```
22 |   echo "/usr/lib/chromium-browser/libs" | sudo tee --append /etc/ld.so.conf.d/chrome_lib.conf >/dev/null
23 |   sudo ldconfig
24 |   ```
25 |
26 | - Finally, generate a Chameleon CRX package [by following development setup steps 1 and 4 in Chameleon's checkout](https://github.com/ghostwords/chameleon#development-setup).
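
One way to confirm the setup steps above took effect is a quick check that the required executables can be found on `$PATH`. This is only a sketch and not part of the project: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the executable names listed are assumptions that may differ on your distribution.

```python
import shutil

# Executable names are assumptions; adjust them for your distribution.
REQUIRED_TOOLS = ("chromedriver", "Xvfb", "python3")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

If `chromedriver` is reported as missing on Ubuntu, the symlink step above is the likely fix.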
27 |
28 |
29 | ## Usage
30 |
31 | Run `./crawl.py /path/to/chameleon.crx` to perform a crawl, or `./crawl.py -h` to see the optional arguments:
32 |
33 | ```
34 | usage: crawl.py [-h] [--headless | --no-headless] [-n {1,2,3,4,5,6,7,8}] [-q]
35 |                 [-t SECONDS] [--urls URL_FILE_PATH]
36 |                 CHAMELEON_CRX_FILE_PATH
37 |
38 | positional arguments:
39 |   CHAMELEON_CRX_FILE_PATH
40 |                         path to Chameleon CRX package
41 |
42 | optional arguments:
43 |   -h, --help            show this help message and exit
44 |   --headless            use a virtual display (default)
45 |   --no-headless
46 |   -n {1,2,3,4,5,6,7,8}  how many browsers to use in parallel (default: 4)
47 |   -q, --quiet           turn off standard output
48 |   -t SECONDS, --timeout SECONDS
49 |                         how many seconds to wait for pages to finish loading
50 |                         before timing out (default: 20)
51 |   --urls URL_FILE_PATH  path to URL list file (default: urls.txt)
52 | ```
53 |
54 | Run `./view.py` and visit the displayed URL to review crawl results.
55 |
56 |
57 | ## Roadmap
58 |
59 | 1. Crawl Alexa Global Top 1,000,000 Sites: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
60 | 2. Analyze results:
61 |     - Discover fingerprinters
62 |     - Confirm detection of known fingerprinters
63 | 3. Tweak the heuristic to minimize false negatives/positives.
64 | 4. Create minisite to chart (the growth of?) fingerprinting across the Web.
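
For the first roadmap item, the ranking file would need to be turned into a URL list that `--urls` accepts (one entry per line). The sketch below is a hedged illustration, not project code: it assumes the extracted CSV consists of `rank,domain` rows, and `domains_from_ranking` and `write_url_list` are hypothetical helper names.

```python
import csv
import io


def domains_from_ranking(csv_text, limit=None):
    """Return the domain column of "rank,domain" rows, in file order.

    The two-column layout is an assumption about the ranking file.
    """
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if limit is not None and len(domains) >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        domains.append(row[1].strip())
    return domains


def write_url_list(domains, path):
    """Write one domain per line, the layout crawl.py reads via --urls."""
    with open(path, "w") as url_file:
        for domain in domains:
            url_file.write(domain + "\n")


if __name__ == "__main__":
    sample = "1,example.com\n2,example.org\n3,example.net\n"
    print(domains_from_ranking(sample, limit=2))
```

Bare domains should be enough here, since `crawl.py` prepends `http://` to any entry without a scheme. The resulting file could then be passed as `./crawl.py /path/to/chameleon.crx --urls top-sites.txt`, where `top-sites.txt` is a placeholder name.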
65 | 66 | 67 | ## Code license 68 | 69 | Mozilla Public License Version 2.0 70 | -------------------------------------------------------------------------------- /conf/nginx.conf: -------------------------------------------------------------------------------- 1 | # needs to end up in /etc/nginx/sites-available (and get linked to ../sites-enabled) 2 | 3 | server { 4 | listen [::]:80; 5 | listen 80; 6 | 7 | server_name panopticlicker.com www.panopticlicker.com; 8 | 9 | # serve ACME challenge files over HTTP 10 | location /.well-known/acme-challenge/ { 11 | alias /var/www/challenges/.well-known/acme-challenge/; 12 | try_files $uri @forward_https; 13 | } 14 | 15 | # redirect HTTP requests to HTTPS 16 | location @forward_https { 17 | return 301 https://www.panopticlicker.com$request_uri; 18 | } 19 | location / { 20 | return 301 https://www.panopticlicker.com$request_uri; 21 | } 22 | } 23 | 24 | server { 25 | listen [::]:443 ssl spdy; 26 | listen 443 ssl spdy; 27 | 28 | server_name panopticlicker.com; 29 | 30 | include /etc/nginx/ssl/panopticlicker.ssl.conf; 31 | 32 | return 301 https://www.panopticlicker.com$request_uri; 33 | } 34 | 35 | server { 36 | listen [::]:443 ssl spdy; 37 | listen 443 ssl spdy; 38 | 39 | server_name www.panopticlicker.com; 40 | 41 | include /etc/nginx/ssl/panopticlicker.ssl.conf; 42 | 43 | root /var/www/panopticlicker.com; 44 | 45 | location / { 46 | uwsgi_pass unix:/tmp/uwsgi.sock; 47 | include /etc/nginx/uwsgi_params; 48 | } 49 | 50 | location /static { 51 | alias /var/www/panopticlicker.com/viewer/static; 52 | } 53 | } 54 | -------------------------------------------------------------------------------- /conf/ssl.conf: -------------------------------------------------------------------------------- 1 | # /etc/nginx/ssl/panopticlicker.ssl.conf 2 | 3 | #ssl_certificate /etc/nginx/ssl/www.panopticlicker.com.crt; 4 | #ssl_certificate_key /etc/nginx/ssl/www.panopticlicker.com.key; 5 | ssl_certificate 
/etc/letsencrypt/live/panopticlicker.com/fullchain.pem; 6 | ssl_certificate_key /etc/letsencrypt/live/panopticlicker.com/key.pem; 7 | 8 | ssl_session_timeout 1d; 9 | ssl_session_cache shared:SSL:50m; 10 | 11 | ssl_protocols TLSv1 TLSv1.1 TLSv1.2; 12 | ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA'; 13 | ssl_prefer_server_ciphers on; 14 | -------------------------------------------------------------------------------- /conf/uwsgi-upstart.conf: -------------------------------------------------------------------------------- 1 | # needs to end up in /etc/init 2 | 3 | description "uWSGI Chameleon Crawler Viewer server" 4 | 5 | start on runlevel [2345] 6 | stop on runlevel [!2345] 7 | 8 | setuid www-data 9 | setgid www-data 10 | 11 | env PATH=/var/www/panopticlicker.com/env/bin 12 | chdir /var/www/panopticlicker.com 13 | exec uwsgi --ini conf/uwsgi.ini 14 | -------------------------------------------------------------------------------- /conf/uwsgi.ini: -------------------------------------------------------------------------------- 1 | [uwsgi] 2 | socket = /tmp/uwsgi.sock 3 | chmod-socket = 660 4 | uid = www-data 5 | gid = www-data 6 | home = env 7 | module = viewer.app:app 8 | master = true 9 | processes = 4 10 | die-on-term = true 11 | req-logger = file:/var/log/uwsgi/access.log 12 | logger = 
file:/var/log/uwsgi/error.log 13 | -------------------------------------------------------------------------------- /crawl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | from datetime import datetime 12 | from random import shuffle 13 | from multiprocessing import Process, Queue 14 | from urllib.parse import urlparse 15 | 16 | from crawler.args import parse_args 17 | from crawler.collector import collect 18 | from crawler.crawler_manager import Crawler 19 | from crawler.utils import Logger 20 | from utils.database import DATABASE_URL, initialize_database 21 | 22 | import dataset 23 | import sys 24 | 25 | 26 | def run(): 27 | # get commandline args 28 | args = parse_args() 29 | 30 | initialize_database() 31 | 32 | # store start time & args, plus get an ID for this crawl 33 | with dataset.connect(DATABASE_URL) as db: 34 | crawl_id = db['crawl'].insert(dict( 35 | args=" ".join(sys.argv[1:]), 36 | start_time=datetime.now() 37 | )) 38 | 39 | url_queue = Queue() # (url, num_timeouts) tuples 40 | result_queue = Queue() 41 | 42 | # read in URLs and populate the job queue 43 | with args.urls: 44 | urls = list(args.urls) 45 | # randomize crawl order 46 | shuffle(urls) 47 | for url in urls: 48 | url = url.strip() 49 | if not urlparse(url).scheme: 50 | url = 'http://' + url 51 | url_queue.put((url, 0)) 52 | 53 | log = Logger().log if not args.quiet else lambda *args, **kwargs: None 54 | 55 | # launch browsers 56 | crawlers = [] 57 | for i in range(args.num_crawlers): 58 | crawler = Process( 59 | target=Crawler, 60 | args=(i + 1,), 61 | kwargs={ 62 | 'crx': args.crx, 63 | 'headless': args.headless, 64 | 'logger': log, 65 | 
'timeout': args.timeout, 66 | 'url_queue': url_queue, 67 | 'result_queue': result_queue 68 | } 69 | ) 70 | crawler.start() 71 | crawlers.append(crawler) 72 | 73 | # start the collector process 74 | Process(target=collect, args=(crawl_id, result_queue, log)).start() 75 | 76 | # wait for all browsers to finish 77 | for crawler in crawlers: 78 | crawler.join() 79 | 80 | # tell collector we are done 81 | result_queue.put(None) 82 | 83 | # store completion time 84 | with dataset.connect(DATABASE_URL) as db: 85 | db['crawl'].update(dict(id=crawl_id, end_time=datetime.now()), 'id') 86 | 87 | log("Main process all done!") 88 | 89 | 90 | if __name__ == '__main__': 91 | run() 92 | -------------------------------------------------------------------------------- /crawler/args.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | from os import path 12 | 13 | import argparse 14 | 15 | 16 | def is_valid_file(f, parser): 17 | if path.isfile(f): 18 | return f 19 | raise argparse.ArgumentTypeError("%s does not exist!" 
% f) 20 | 21 | 22 | def parse_args(): 23 | parser = argparse.ArgumentParser() 24 | 25 | parser.add_argument( 26 | "crx", metavar='CHAMELEON_CRX_FILE_PATH', 27 | type=lambda x: is_valid_file(x, parser), 28 | help="path to Chameleon CRX package" 29 | ) 30 | 31 | group = parser.add_mutually_exclusive_group() 32 | group.add_argument( 33 | "--headless", action="store_true", default=True, 34 | help="use a virtual display (default)" 35 | ) 36 | group.add_argument("--no-headless", dest='headless', action="store_false") 37 | 38 | parser.add_argument( 39 | "-n", dest='num_crawlers', type=int, 40 | choices=range(1, 9), default=4, 41 | help="how many browsers to use in parallel " 42 | "(default: %(default)s)" 43 | ) 44 | 45 | parser.add_argument( 46 | "-q", "--quiet", action="store_true", default=False, 47 | help="turn off standard output" 48 | ) 49 | 50 | parser.add_argument( 51 | "-t", "--timeout", metavar='SECONDS', 52 | type=int, default=20, 53 | help="how many seconds to wait for pages to finish " 54 | "loading before timing out (default: %(default)s)" 55 | ) 56 | 57 | parser.add_argument( 58 | "--urls", metavar='URL_FILE_PATH', 59 | type=argparse.FileType('r'), default='urls.txt', 60 | help="path to URL list file (default: %(default)s)" 61 | ) 62 | 63 | return parser.parse_args() 64 | -------------------------------------------------------------------------------- /crawler/collector.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
10 | 11 | from time import sleep 12 | 13 | from utils.database import DATABASE_URL 14 | 15 | import dataset 16 | 17 | 18 | def collect(crawl_id, result_queue, log): 19 | db = dataset.connect(DATABASE_URL) 20 | 21 | while True: 22 | if result_queue.empty(): 23 | sleep(0.01) 24 | continue 25 | 26 | result = result_queue.get() 27 | 28 | if result is None: 29 | break 30 | 31 | crawl_url, error, result = result 32 | 33 | if not result: 34 | with db: 35 | db['result'].insert(dict( 36 | crawl_id=crawl_id, 37 | crawl_url=crawl_url, 38 | error=error 39 | )) 40 | continue 41 | 42 | for page_url, page_data in result.items(): 43 | if not page_data['domains']: 44 | # nothing found 45 | with db: 46 | db['result'].insert(dict( 47 | crawl_id=crawl_id, 48 | crawl_url=crawl_url, 49 | page_url=page_url 50 | )) 51 | continue 52 | 53 | for script_domain, ddata in page_data['domains'].items(): 54 | for script_url, sdata in ddata['scripts'].items(): 55 | with db: 56 | canvas_id = None 57 | if 'dataURL' in sdata['canvas']: 58 | data_url = sdata['canvas']['dataURL'] 59 | if data_url: 60 | db.query("""INSERT OR IGNORE INTO canvas (data_url) 61 | VALUES (:data_url)""", data_url=data_url) 62 | canvas_id = db['canvas'].find_one( 63 | data_url=data_url)['id'] 64 | 65 | result_id = db['result'].insert(dict( 66 | crawl_id=crawl_id, 67 | crawl_url=crawl_url, 68 | page_url=page_url, 69 | script_url=script_url, 70 | script_domain=script_domain, 71 | canvas=sdata['canvas']['fingerprinting'], 72 | canvas_id=canvas_id, 73 | font_enum=sdata['fontEnumeration'], 74 | navigator_enum=sdata['navigatorEnumeration'] 75 | )) 76 | 77 | # property access counts get saved in `property_count` 78 | rows = [] 79 | for property, count in sdata['counts'].items(): 80 | rows.append(dict( 81 | result_id=result_id, 82 | property=property, 83 | count=count 84 | )) 85 | with db: 86 | db['property_count'].insert_many(rows) 87 | 88 | log("Collecting finished.") 89 | 
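
The collector above polls `result_queue.empty()` and sleeps between checks until it sees the `None` sentinel that `crawl.py` enqueues. As a hedged alternative sketch (not the project's code), the same sentinel-terminated loop can rely on a blocking `get()` instead of polling. `drain` and `handle` are hypothetical names, and the example uses `queue.Queue` for simplicity; `multiprocessing.Queue` exposes the same blocking `get()`.

```python
import queue


def drain(result_queue, handle):
    """Call handle(item) for each queued item until a None sentinel arrives.

    Returns the number of items handled (the sentinel is not counted).
    """
    handled = 0
    while True:
        item = result_queue.get()  # blocks until an item is available
        if item is None:
            break
        handle(item)
        handled += 1
    return handled


if __name__ == "__main__":
    q = queue.Queue()
    q.put(("http://example.com", "TIMEOUT", None))
    q.put(None)  # sentinel: no more results
    print(drain(q, print))
```

A blocking `get()` avoids waking up every 10 ms while the queue is idle, but whether that matters depends on crawl volume.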
-------------------------------------------------------------------------------- /crawler/crawler_manager.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | from multiprocessing import Process, Queue 12 | 13 | from .crawler_process import CrawlerProcess 14 | 15 | import os 16 | import queue 17 | import signal 18 | 19 | 20 | class Crawler(object): 21 | def __init__(self, id, headless=True, timeout=20, **kwargs): 22 | self.id = id 23 | self.headless = headless 24 | 25 | self.crx = kwargs['crx'] 26 | self.glob_url_queue = kwargs['url_queue'] 27 | self.glob_result_queue = kwargs['result_queue'] 28 | self.log = kwargs['logger'] 29 | 30 | self.url_queue = Queue() 31 | self.result_queue = Queue() 32 | 33 | self.start_process() 34 | 35 | while not self.glob_url_queue.empty(): 36 | url, num_timeouts = self.glob_url_queue.get() 37 | 38 | self.url_queue.put(url) 39 | 40 | try: 41 | error, result = self.result_queue.get( 42 | True, 43 | timeout * ((num_timeouts + 1) ** 2) 44 | ) 45 | 46 | except queue.Empty: 47 | self.log("%s timed out fetching %s" % (self.process.name, url)) 48 | num_timeouts += 1 49 | 50 | self.stop_process() 51 | 52 | if num_timeouts > 2: 53 | self.log("Too many timeouts, giving up on %s" % url) 54 | self.glob_result_queue.put((url, 'TIMEOUT', None)) 55 | reinserted = False 56 | else: 57 | self.glob_url_queue.put((url, num_timeouts)) 58 | reinserted = True 59 | 60 | if reinserted or not self.glob_url_queue.empty(): 61 | self.start_process() 62 | 63 | else: 64 | self.glob_result_queue.put((url, error, result)) 65 | 66 | # tell the process we are done 67 | self.url_queue.put(None) 68 | 69 | def 
start_process(self): 70 | name = "Crawler %i" % self.id 71 | self.log("Starting %s" % name) 72 | 73 | self.process = Process( 74 | target=CrawlerProcess, 75 | name=name, 76 | kwargs={ 77 | 'crx': self.crx, 78 | 'headless': self.headless, 79 | 'logger': self.log, 80 | 'url_queue': self.url_queue, 81 | 'result_queue': self.result_queue 82 | } 83 | ) 84 | 85 | self.process.start() 86 | 87 | self.driver_pid, self.display_pid = self.result_queue.get() 88 | 89 | # TODO cross-platform termination (with psutil?) 90 | def stop_process(self): 91 | self.log("Stopping %s" % self.process.name) 92 | 93 | try: 94 | os.kill(self.process.pid, signal.SIGKILL) 95 | except ProcessLookupError: 96 | self.log("Crawler process not found.") 97 | 98 | try: 99 | os.kill(self.driver_pid, signal.SIGKILL) 100 | except ProcessLookupError: 101 | self.log("chromedriver process not found.") 102 | 103 | if self.headless: 104 | try: 105 | os.kill(self.display_pid, signal.SIGKILL) 106 | except ProcessLookupError: 107 | self.log("Xvfb process not found.") 108 | -------------------------------------------------------------------------------- /crawler/crawler_process.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
10 | 11 | from contextlib import contextmanager 12 | from multiprocessing import current_process 13 | from pyvirtualdisplay import Display 14 | from random import random 15 | from selenium import webdriver 16 | from selenium.common.exceptions import ( 17 | TimeoutException, 18 | UnexpectedAlertPresentException 19 | ) 20 | from selenium.webdriver.support.ui import WebDriverWait 21 | from time import sleep 22 | 23 | 24 | class CrawlerProcess(object): 25 | def __init__(self, headless=True, **kwargs): 26 | self.headless = headless 27 | 28 | self.crx = kwargs['crx'] 29 | self.log = kwargs['logger'] 30 | self.url_queue = kwargs['url_queue'] 31 | self.result_queue = kwargs['result_queue'] 32 | 33 | self.name = current_process().name 34 | 35 | self.crawl() 36 | 37 | self.log("%s is all done!" % self.name) 38 | 39 | def crawl(self): 40 | with self.selenium(): 41 | # open the extension's background page in window 0 42 | # Chrome extension APIs just aren't there sometimes ... 43 | while True: 44 | self.get( 45 | "chrome-extension://%s/_generated_background_page.html" % 46 | self.extension_id) 47 | 48 | if self.js("return chrome.hasOwnProperty('tabs')"): 49 | break 50 | 51 | # get a URL from the job queue 52 | while True: 53 | if self.url_queue.empty(): 54 | sleep(0.01) 55 | continue 56 | 57 | url = self.url_queue.get() 58 | 59 | if url is None: 60 | break 61 | 62 | # open a new window 63 | self.js('window.open()') 64 | # switch to the new window 65 | self.driver.switch_to_window(self.driver.window_handles[-1]) 66 | 67 | # load the URL in the new window 68 | self.get(url) 69 | 70 | if not self.get_data(): 71 | # must be a browser error page: "Unable to connect to the 72 | # Internet", "This webpage is not available", ... 73 | self.log("%s got a browser error." 
% self.name) 74 | self.result_queue.put(('BROWSER ERROR', None)) 75 | continue 76 | 77 | # scroll to encourage dynamic scripts to load/execute 78 | # this takes between 3 and 13 seconds 79 | for _ in range(10): 80 | sleep(0.3 + random()) 81 | 82 | # handle modal dialogs (alerts) 83 | while True: 84 | try: 85 | self.js('window.scrollBy(0, %f)' % 86 | (random() * 500)) 87 | break 88 | except UnexpectedAlertPresentException: 89 | self.driver.switch_to_alert().dismiss() 90 | 91 | self.log("%s getting data ..." % self.name) 92 | self.result_queue.put((None, self.get_data())) 93 | 94 | # close the window opened above 95 | self.driver.close() 96 | # close any popups 97 | for window in self.driver.window_handles: 98 | if window != self.driver.window_handles[0]: 99 | self.driver.switch_to_window(window) 100 | self.driver.close() 101 | # switch back to the extension's background page 102 | self.driver.switch_to_window(self.driver.window_handles[0]) 103 | 104 | @contextmanager 105 | def selenium(self): 106 | self.startup() 107 | 108 | self.extension_id = self.get_extension_id() 109 | 110 | try: 111 | yield 112 | finally: 113 | self.shutdown() 114 | 115 | def startup(self): 116 | if self.headless: 117 | self.display = Display(visible=0, size=(1440, 900)) 118 | self.display.start() 119 | 120 | opts = webdriver.chrome.options.Options() 121 | opts.add_extension(self.crx) 122 | self.driver = webdriver.Chrome(chrome_options=opts) 123 | self.driver.implicitly_wait(5) 124 | 125 | # communicate chromedriver and Xvfb PIDs back to crawler.py 126 | display_pid = self.display.pid if self.headless else None 127 | # NOTE Firefox is different: self.driver.binary.process.pid 128 | driver_pid = self.driver.service.process.pid 129 | self.result_queue.put((driver_pid, display_pid)) 130 | 131 | def shutdown(self): 132 | self.driver.quit() 133 | if self.headless: 134 | self.display.stop() 135 | 136 | def get_data(self): 137 | cwh = self.driver.current_window_handle 138 | # switch to window 0 (our 
extension's background page) 139 | self.driver.switch_to_window(self.driver.window_handles[0]) 140 | 141 | self.js("""chrome.tabs.query({}, function (tabs) { 142 | window.result = tabs.reduce(function (memo, tab) { 143 | var data = tabData.get(tab.id); 144 | if (data) { 145 | memo[data.url] = data; 146 | } 147 | return memo; 148 | }, {}); 149 | });""") 150 | 151 | try: 152 | self.wait_for_script( 153 | "return typeof result == 'object' && !!result") 154 | except TimeoutException: 155 | self.log("%s failed to get data from the extension!" % self.name) 156 | raise 157 | 158 | data = self.js("return result") 159 | 160 | # switch back to original window 161 | self.driver.switch_to_window(cwh) 162 | 163 | return data 164 | 165 | def get(self, url): 166 | self.log("%s fetching %s ..." % (self.name, url)) 167 | self.driver.get(url) 168 | 169 | def get_extension_id(self): 170 | self.driver.get('chrome://extensions-frame/') 171 | self.driver.find_element_by_id('toggle-dev-on').click() 172 | return self.driver.find_element_by_class_name( 173 | 'extension-list-item-wrapper').get_attribute('id') 174 | 175 | def js(self, script): 176 | return self.driver.execute_script(script) 177 | 178 | def wait_for_script(self, script, timeout=5): 179 | return WebDriverWait(self.driver, timeout, poll_frequency=0.5).until( 180 | lambda drv: drv.execute_script(script), 181 | ("Timeout waiting for script to eval to True:\n%s" % script) 182 | ) 183 | -------------------------------------------------------------------------------- /crawler/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
10 | 11 | from multiprocessing import Lock 12 | 13 | 14 | class Logger(object): 15 | def __init__(self): 16 | self.lock = Lock() 17 | 18 | def log(self, *args, **kwargs): 19 | with self.lock: 20 | print(*args, **kwargs) 21 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | dataset==0.5.6 2 | flask==0.10.1 3 | flask-failsafe==0.2 4 | pyvirtualdisplay==0.1.5 5 | selenium==2.44.0 6 | -------------------------------------------------------------------------------- /results_schema.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE crawl ( 2 | id INTEGER PRIMARY KEY AUTOINCREMENT, 3 | args TEXT, 4 | start_time DATETIME NOT NULL, 5 | end_time DATETIME 6 | ); 7 | CREATE TABLE result ( 8 | id INTEGER PRIMARY KEY AUTOINCREMENT, 9 | crawl_id INTEGER NOT NULL, 10 | crawl_url TEXT, 11 | error TEXT, 12 | page_url TEXT, 13 | script_url TEXT, 14 | script_domain TEXT, 15 | canvas BOOLEAN, 16 | canvas_id INTEGER, 17 | font_enum BOOLEAN, 18 | navigator_enum BOOLEAN, 19 | FOREIGN KEY(crawl_id) REFERENCES crawl(id), 20 | FOREIGN KEY(canvas_id) REFERENCES canvas(id) 21 | ); 22 | CREATE TABLE property_count ( 23 | id INTEGER PRIMARY KEY AUTOINCREMENT, 24 | result_id INTEGER NOT NULL, 25 | property TEXT, 26 | count INTEGER, 27 | FOREIGN KEY(result_id) REFERENCES result(id) 28 | ); 29 | CREATE TABLE canvas ( 30 | id INTEGER PRIMARY KEY AUTOINCREMENT, 31 | data_url TEXT NOT NULL UNIQUE 32 | ); 33 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [flake8] 2 | ignore = E128,E261,E265 3 | -------------------------------------------------------------------------------- /urls.txt: -------------------------------------------------------------------------------- 1 | 
https://www.google.com/ 2 | https://www.yahoo.com/ 3 | -------------------------------------------------------------------------------- /utils/database.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | import dataset 12 | 13 | DATABASE_URL = 'sqlite:///results.sqlite3' 14 | SCHEMA_VERSION = 3 15 | 16 | MIGRATIONS = { 17 | 2: ["ALTER TABLE crawl ADD COLUMN args TEXT"], 18 | 3: [ 19 | ( 20 | "CREATE TABLE canvas (" 21 | "id INTEGER PRIMARY KEY AUTOINCREMENT," 22 | "data_url TEXT NOT NULL UNIQUE" 23 | ")" 24 | ), 25 | "ALTER TABLE result ADD COLUMN canvas_id INTEGER REFERENCES canvas(id)" 26 | ] 27 | } 28 | 29 | 30 | def migrate_database(current_version): 31 | # TODO in a transaction, BUT pysqlite doesn't do DDL transactions correctly 32 | # http://docs.sqlalchemy.org/en/latest/dialects/sqlite.html#transactional-ddl 33 | with dataset.connect(DATABASE_URL) as db: 34 | for i in range(current_version + 1, SCHEMA_VERSION + 1): 35 | for sql in MIGRATIONS[i]: 36 | db.query(sql) 37 | 38 | 39 | def initialize_database(): 40 | # in a transaction 41 | with dataset.connect(DATABASE_URL) as db: 42 | current_version = next(db.query("PRAGMA user_version"))['user_version'] 43 | 44 | if current_version == 0: 45 | # new database: create tables from schema 46 | if 'crawl' not in db.tables: 47 | with open('results_schema.sql') as f: 48 | for sql in f.read().split(';'): 49 | db.query(sql) 50 | 51 | # "version 1" database: existing db from before versioning 52 | else: 53 | migrate_database(1) 54 | 55 | # existing database: apply schema updates 56 | elif current_version != SCHEMA_VERSION: 57 | migrate_database(current_version) 58 | 59 | if 
current_version != SCHEMA_VERSION: 60 | db.query("PRAGMA user_version = %i" % SCHEMA_VERSION) 61 | -------------------------------------------------------------------------------- /view.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | from flask_failsafe import failsafe 12 | 13 | 14 | @failsafe 15 | def create_app(): 16 | # imports of our code are inside this function so that Flask-Failsafe can 17 | # catch errors that happen at import time 18 | from viewer.app import app 19 | 20 | return app 21 | 22 | 23 | if __name__ == '__main__': 24 | create_app().run(debug=True) 25 | -------------------------------------------------------------------------------- /viewer/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # chameleon-crawler 4 | # 5 | # Copyright 2016 ghostwords. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
10 | 11 | from datetime import datetime 12 | from flask import Flask, render_template, request 13 | from utils.database import DATABASE_URL, initialize_database 14 | 15 | import dataset 16 | 17 | 18 | app = Flask(__name__) 19 | 20 | app.config['DATABASE_URL'] = DATABASE_URL 21 | app.jinja_env.trim_blocks = True 22 | app.jinja_env.lstrip_blocks = True 23 | 24 | initialize_database() 25 | 26 | 27 | def get_canvases(result_ids): 28 | canvases = {} 29 | 30 | sql = """SELECT canvas.id, data_url 31 | FROM canvas 32 | JOIN result ON result.canvas_id = canvas.id 33 | WHERE result.id IN (%s)""" % ','.join( 34 | [str(int(id)) for id in result_ids]) 35 | 36 | with dataset.connect(app.config['DATABASE_URL']) as db: 37 | for row in db.query(sql): 38 | canvases[row['id']] = row['data_url'] 39 | 40 | return canvases 41 | 42 | 43 | def get_fingerprinters(crawl_ids=None, canvas=True, font_enum=True, 44 | navigator_enum=True, num_properties=4, webgl=False, webrtc=False): 45 | fp = {} 46 | result_ids = set() 47 | 48 | in_clause = "" 49 | if crawl_ids: 50 | in_clause = "AND crawl_id IN (%s)" % ','.join( 51 | [str(int(id)) for id in crawl_ids]) 52 | 53 | filters = [] 54 | if canvas: 55 | filters.append("canvas = 1") 56 | if font_enum: 57 | filters.append("font_enum = 1") 58 | if navigator_enum: 59 | filters.append("navigator_enum = 1") 60 | if num_properties: 61 | filters.append("num_properties >= %s" % str(int(num_properties))) 62 | 63 | having_clause = "" 64 | if filters: 65 | having_clause = "HAVING %s" % " OR ".join(filters) 66 | 67 | where = [] 68 | if webgl: 69 | where.append("""( 70 | pc.property = 'WebGLRenderingContext.prototype.getParameter' 71 | OR pc.property = 72 | 'WebGLRenderingContext.prototype.getSupportedExtensions' 73 | )""") 74 | if webrtc: 75 | where.append("""( 76 | pc.property = 'RTCPeerConnection' 77 | OR pc.property = 'webkitRTCPeerConnection' 78 | )""") 79 | 80 | union_clause = "" 81 | if where and filters: 82 | union_clause = """UNION 83 | SELECT 84 | 
result.*, 85 | COUNT(pc.result_id) num_properties 86 | FROM result 87 | JOIN property_count pc ON pc.result_id = result.id 88 | WHERE %s 89 | GROUP BY pc.result_id""" % " OR ".join(where) 90 | elif where and not filters: 91 | in_clause = "%s AND (%s)" % (in_clause, " OR ".join(where)) 92 | 93 | sql = """SELECT * FROM ( 94 | SELECT 95 | result.*, 96 | COUNT(pc.result_id) num_properties 97 | FROM result 98 | LEFT JOIN property_count pc ON pc.result_id = result.id 99 | WHERE 1 {in_clause} 100 | GROUP BY COALESCE(pc.result_id, result.id) 101 | {having_clause} 102 | {union_clause} 103 | ) GROUP BY 104 | crawl_url, 105 | page_url, 106 | script_url, 107 | canvas, 108 | canvas_id, 109 | font_enum, 110 | navigator_enum, 111 | num_properties""".format( 112 | in_clause=in_clause, 113 | having_clause=having_clause, 114 | union_clause=union_clause) 115 | 116 | with dataset.connect(app.config['DATABASE_URL']) as db: 117 | result = db.query(sql) 118 | 119 | """ { 120 | row['script_domain']: { 121 | row['script_url']: [ 122 | row, 123 | ... 124 | ], 125 | ... 126 | }, 127 | ... 
128 | }""" 129 | for row in result: 130 | fp.setdefault(row['script_domain'], {}).setdefault( 131 | row['script_url'], []).append(row) 132 | 133 | result_ids.add(row['id']) 134 | 135 | return fp, result_ids 136 | 137 | 138 | def get_problem_pages(crawl_ids): 139 | if crawl_ids: 140 | in_clause = "AND crawl_id IN (%s)" % ','.join( 141 | [str(int(id)) for id in crawl_ids]) 142 | 143 | sql = """SELECT crawl_url, error, COUNT(*) count 144 | FROM result 145 | WHERE error IS NOT NULL %s 146 | GROUP BY crawl_url, error 147 | ORDER BY count DESC""" % (in_clause if crawl_ids else "") 148 | 149 | with dataset.connect(app.config['DATABASE_URL']) as db: 150 | result = db.query(sql) 151 | return list(result) 152 | 153 | 154 | def get_error_counts(): 155 | errors = {} 156 | 157 | with dataset.connect(app.config['DATABASE_URL']) as db: 158 | result = db.query( 159 | """SELECT crawl_id, error, COUNT(DISTINCT crawl_url) num_urls 160 | FROM result 161 | WHERE error IS NOT NULL 162 | GROUP BY crawl_id, error""" 163 | ) 164 | 165 | for row in result: 166 | errors.setdefault( 167 | row['crawl_id'], {})[row['error']] = row['num_urls'] 168 | 169 | return errors 170 | 171 | 172 | def get_crawls(): 173 | with dataset.connect(app.config['DATABASE_URL']) as db: 174 | return list(db.query( 175 | """SELECT 176 | crawl.id, 177 | crawl.args, 178 | COUNT(DISTINCT crawl_url) num_urls, 179 | (STRFTIME('%s', end_time) - STRFTIME('%s', start_time)) duration, 180 | STRFTIME('%s', start_time) start_time 181 | FROM crawl 182 | JOIN result ON result.crawl_id = crawl.id 183 | GROUP BY crawl.id 184 | ORDER BY crawl.id DESC""" 185 | )) 186 | 187 | 188 | @app.template_filter('number_format') 189 | def number_format(value): 190 | return '{:,}'.format(value) 191 | 192 | 193 | @app.route('/errors') 194 | def errors(): 195 | crawl_ids = [int(i) for i in request.args.getlist('crawl')] 196 | 197 | return render_template( 198 | 'errors.html', 199 | problem_pages=get_problem_pages(crawl_ids) 200 | ) 201 | 202 | 203 
| @app.route('/results') 204 | def results(): 205 | args = {} 206 | 207 | crawl_ids = [int(i) for i in request.args.getlist('crawl')] 208 | 209 | if crawl_ids: 210 | args['crawl_ids'] = crawl_ids 211 | 212 | all_filters = { 213 | 'canvas', 214 | 'font_enum', 215 | 'navigator_enum', 216 | 'num_properties', 217 | 'webgl', 218 | 'webrtc' 219 | } 220 | filters = request.args.getlist('filter') 221 | 222 | if set.intersection(all_filters, filters): 223 | for filt in all_filters: 224 | args[filt] = filt in filters 225 | 226 | if 'num_properties' in filters: 227 | args['num_properties'] = request.args.get('num_properties') 228 | 229 | fingerprinters, result_ids = get_fingerprinters(**args) 230 | 231 | return render_template( 232 | 'results.html', 233 | fingerprinters=fingerprinters, 234 | canvases=get_canvases(result_ids) 235 | ) 236 | 237 | 238 | @app.route('/') 239 | def index(): 240 | crawls = get_crawls() 241 | 242 | for crawl in crawls: 243 | crawl['start_time'] = datetime.utcfromtimestamp( 244 | int(crawl['start_time'])).strftime("%b %d %Y %I:%M %p") 245 | 246 | return render_template( 247 | 'crawls.html', 248 | crawls=crawls, 249 | error_counts=get_error_counts() 250 | ) 251 | -------------------------------------------------------------------------------- /viewer/static/16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ghostwords/chameleon-crawler/56fc325ad11c57539d9fcad7feb9eaaf51411167/viewer/static/16.png -------------------------------------------------------------------------------- /viewer/static/index.css: -------------------------------------------------------------------------------- 1 | a, a:visited { 2 | color: #c00; 3 | text-decoration: none; 4 | } 5 | 6 | a:hover { 7 | text-decoration: underline; 8 | } 9 | 10 | body { 11 | color: #444; 12 | font: normal 15px 'Helvetica Neue', 'Helvetica', sans-serif; 13 | line-height: 1.3em; 14 | margin: 30px auto; 15 | width: 990px; 16 | } 17 | 
18 | h1 { 19 | margin: 0 0 15px; 20 | } 21 | 22 | h3 { 23 | background-color: pink; 24 | display: inline-block; 25 | margin: 10px 0; 26 | padding: 5px; 27 | } 28 | 29 | hr { 30 | background-color: #ccc; 31 | border: 0; 32 | height: 1px; 33 | margin: 20px 0; 34 | } 35 | 36 | table { 37 | border: 1px solid #ddd; 38 | border-collapse: collapse; 39 | } 40 | 41 | tbody tr:nth-child(odd) { 42 | background-color: #eee; 43 | } 44 | 45 | th { 46 | border: 1px solid #ddd; 47 | font-size: 13px; 48 | padding: 5px 10px; 49 | white-space: nowrap; 50 | } 51 | 52 | td { 53 | border: 1px solid #ddd; 54 | padding: 5px 10px; 55 | white-space: nowrap; 56 | } 57 | 58 | input[type=submit] { 59 | padding: 5px 10px; 60 | } 61 | 62 | .canvas-image-data { 63 | border: 2px dotted pink; 64 | max-width: 75px; 65 | } 66 | 67 | .ellipsis { 68 | overflow: hidden; 69 | text-overflow: ellipsis; 70 | white-space: nowrap; 71 | } 72 | 73 | .center { 74 | text-align: center; 75 | } 76 | 77 | .right { 78 | text-align: right; 79 | } 80 | 81 | .script-url, .url { 82 | max-width: 280px; 83 | } 84 | 85 | .script-url { 86 | max-width: 900px; 87 | padding-bottom: 5px; 88 | } 89 | -------------------------------------------------------------------------------- /viewer/static/index.js: -------------------------------------------------------------------------------- 1 | /*! 2 | * chameleon-crawler 3 | * 4 | * Copyright 2016 ghostwords. 5 | * 6 | * This Source Code Form is subject to the terms of the Mozilla Public 7 | * License, v. 2.0. If a copy of the MPL was not distributed with this 8 | * file, You can obtain one at http://mozilla.org/MPL/2.0/. 
9 | * 10 | */ 11 | 12 | function init_checkboxes() { 13 | var els = document.querySelectorAll('.select-all'); 14 | 15 | function find_parent(el) { 16 | var parent = el; 17 | while (['table', 'form'].indexOf(parent.tagName.toLowerCase()) == -1) { 18 | parent = parent.parentNode; 19 | } 20 | return parent; 21 | } 22 | 23 | for (var i = 0, count = els.length; i < count; i++) { 24 | var child_checks = find_parent(els[i]) 25 | .querySelectorAll('input[type=checkbox]'); 26 | 27 | (function (checks) { 28 | els[i].addEventListener('click', function () { 29 | for (var i = 0, count = checks.length; i < count; i++) { 30 | if (checks[i] != this) { 31 | checks[i].checked = this.checked; 32 | } 33 | } 34 | }); 35 | }(child_checks)); 36 | } 37 | } 38 | 39 | document.addEventListener('DOMContentLoaded', function () { 40 | init_checkboxes(); 41 | }); 42 | -------------------------------------------------------------------------------- /viewer/templates/_crawls_table.html: -------------------------------------------------------------------------------- 1 |
2 |

Crawls

3 |

4 | See what Chameleon found on its travels around the Web. 5 |

6 | 7 |   8 | 9 |
10 |
11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | {% for crawl in crawls %} 24 | 25 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 43 | 44 | 51 | 52 | 55 | 56 | 59 | 60 | 65 | 66 | {% endfor %} 67 |
crawl # | start time | number of URLs | timeout rate | error rate | duration (minutes) | speed (URLs/min) | crawl args
26 | 28 | {{ crawl.id }}{{ crawl.start_time }}{{ crawl.num_urls|number_format }} 37 | {% with num_timeouts = error_counts[crawl.id]['TIMEOUT']|default(0) if crawl.id in error_counts else 0 %} 38 | {% if num_timeouts > 0 %} 39 | {{ ((num_timeouts / crawl.num_urls * 100)|round(1)) }}% 40 | {% endif %} 41 | {% endwith %} 42 | 45 | {% with num_errors = error_counts[crawl.id]['BROWSER ERROR']|default(0) if crawl.id in error_counts else 0 %} 46 | {% if num_errors > 0 %} 47 | {{ ((num_errors / crawl.num_urls * 100)|round(1)) }}% 48 | {% endif %} 49 | {% endwith %} 50 | 53 | {{ (crawl.duration / 60)|round|int if crawl.duration is number else '-' }} 54 | 57 | {{ (crawl.num_urls / crawl.duration * 60)|round(1) if crawl.duration is number else '-' }} 58 | 61 |
62 | {{ crawl.args if crawl.args }} 63 |
64 |
68 |
69 | 70 |   71 | 72 |
73 | -------------------------------------------------------------------------------- /viewer/templates/_fingerprinters.html: -------------------------------------------------------------------------------- 1 | {% from 'helpers/form.html' import checkbox %} 2 |
3 | 4 |

Fingerprinters

5 | 6 | 8 | 54 | 55 |
56 | 57 | {% if not fingerprinters %} 58 |

No results.

59 | {% endif %} 60 | {% for script_domain, domain_data in fingerprinters|dictsort %} 61 |

{{ script_domain }}

62 |
63 | {% for script_url, script_data in domain_data|dictsort %} 64 |
65 | {{ script_url|replace(script_domain, ''|safe + script_domain + ''|safe, 1) }} 66 |
67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | {% for fp in script_data %} 77 | 78 | 81 | 86 | 96 | 97 | 98 | 99 | 100 | {% endfor %} 101 |
crawl URL | page URL | canvas | font enum. | nav. enum. | num. props.
79 |
{{ fp.crawl_url }}
80 |
82 |
83 | {{ fp.page_url if fp.page_url != fp.crawl_url else '-' }} 84 |
85 |
87 | {% if fp.canvas_id %} 88 | 89 | 91 | 92 | {% elif fp.canvas %} 93 | ● 94 | {% endif %} 95 | {{ "●" if fp.font_enum }}{{ "●" if fp.navigator_enum }}{{ fp.num_properties if fp.num_properties > 0 }}
102 |
103 | {% endfor %} 104 | {% endfor %} 105 | -------------------------------------------------------------------------------- /viewer/templates/crawls.html: -------------------------------------------------------------------------------- 1 | {% extends "index.html" %} 2 | 3 | {% block body %} 4 | 5 | {% if crawls %} 6 | {% include '_crawls_table.html' %} 7 | {% else %} 8 | No crawls found. Please run crawl.py first. 9 | {% endif %} 10 | 11 | {% endblock %} 12 | -------------------------------------------------------------------------------- /viewer/templates/errors.html: -------------------------------------------------------------------------------- 1 | {% extends "index.html" %} 2 | 3 | {% block body %} 4 | 5 | back to list 6 | 7 |
8 | 9 |
10 |

Pages that timed out

11 | 12 | 13 | 14 | 15 | 16 | {% for page in problem_pages %} 17 | {% if page.error == 'TIMEOUT' %} 18 | 19 | 20 | 21 | 22 | {% endif %} 23 | {% endfor %} 24 |
crawl URL | count
{{ page.crawl_url }} | {{ page.count }}
25 |
26 | 27 |
28 |

Pages that errored out

29 | 30 | 31 | 32 | 33 | 34 | {% for page in problem_pages %} 35 | {% if page.error == 'BROWSER ERROR' %} 36 | 37 | 38 | 39 | 40 | {% endif %} 41 | {% endfor %} 42 |
crawl URL | count
{{ page.crawl_url }} | {{ page.count }}
43 |
44 | 45 |
46 | 47 | {% endblock %} 48 | -------------------------------------------------------------------------------- /viewer/templates/helpers/form.html: -------------------------------------------------------------------------------- 1 | {% macro checkbox(name, value="1") -%} 2 | 4 | {%- endmacro %} 5 | -------------------------------------------------------------------------------- /viewer/templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 11 | 12 | 13 | Chameleon Crawler Viewer 14 | 15 | 16 | {% block head %} 17 | {% endblock %} 18 | 19 | 20 | {% block body %} 21 | {% endblock %} 22 | 23 | 24 | 25 | -------------------------------------------------------------------------------- /viewer/templates/results.html: -------------------------------------------------------------------------------- 1 | {% extends "index.html" %} 2 | 3 | {% block body %} 4 | 5 | back to list 6 | 7 | {% include '_fingerprinters.html' %} 8 | 9 | 35 | 36 | {% endblock %} 37 | --------------------------------------------------------------------------------