├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SmarterEncryption └── Crawl.pm ├── config.yml.example ├── cpanfile ├── https_crawl.pl ├── netsniff_screenshot.js ├── sql ├── domain_exceptions.sql ├── full_urls.sql ├── https_crawl.sql ├── https_crawl_aggregate.sql ├── https_queue.sql ├── https_response_headers.sql ├── mixed_assets.sql ├── ssl_cert_info.sql └── upgradeable_domains_func.sql └── third-party.txt /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing guidelines 2 | 3 | * [Reporting bugs](#reporting-bugs) 4 | * [Development](#development) 5 | * [New features](#new-features) 6 | * [Bug fixes](#bug-fixes) 7 | * [Getting Started](#getting-started) 8 | * [Pre-Requisites](#pre-requisites) 9 | * [Setup](#setup) 10 | * [Running the crawler](#running-the-crawler) 11 | * [Checking the results](#checking-the-results) 12 | * [Data Model](#data-model) 13 | * [full_urls](#full_urls) 14 | * [https_queue](#https_queue) 15 | * [https_crawl](#https_crawl) 16 | * [mixed_assets](#mixed_assets) 17 | * [https_response_headers](#https_response_headers) 18 | * [ssl_cert_info](#ssl_cert_info) 19 | * [https_crawl_aggregate](#https_crawl_aggregate) 20 | * [https_upgrade_metrics](#https_upgrade_metrics) 21 | * [domain_exceptions](#domain_exceptions) 22 | * [upgradeable_domains](#upgradeable_domains) 23 | 24 | # Reporting bugs 25 | 26 | 1. First check whether the bug has already been [reported](https://github.com/duckduckgo/smarter-encryption/issues). 27 | 2. If it hasn't, create a bug report [issue](https://github.com/duckduckgo/smarter-encryption/issues/new?template=bug_report.md). 28 | 29 | # Development 30 | 31 | ## New features 32 | 33 | Right now all new feature development is handled internally. 34 | 35 | ## Bug fixes 36 | 37 | Most bug fixes are handled internally, but we will accept pull requests for bug fixes if you first: 38 | 1. Create an issue describing the bug. See [Reporting bugs](CONTRIBUTING.md#reporting-bugs). 39 | 2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects. 40 | 41 | ## Getting Started 42 | 43 | ### Pre-Requisites 44 | - [PostgreSQL](https://www.postgresql.org/) database 45 | - [PhantomJS 2.1.1](https://phantomjs.org/download.html) 46 | - [Perl](https://www.perl.org/get.html) 47 | - [compare](https://imagemagick.org/script/compare.php) 48 | - [pkill](https://en.wikipedia.org/wiki/Pkill) 49 | - Should run on many varieties of Linux/*BSD 50 | 51 | ### Setup 52 | 53 | 1. Install required Perl modules via cpanfile: 54 | ```sh 55 | cpanm --installdeps . 56 | ``` 57 | 2. Connect to PostgreSQL with psql and create the tables needed by the crawler: 58 | ``` 59 | \i sql/full_urls.sql 60 | \i sql/https_crawl.sql 61 | \i sql/mixed_assets.sql 62 | etc. 63 | ``` 64 | 3. Create a copy of the crawler configuration file: 65 | ```sh 66 | cp config.yml.example config.yml 67 | ``` 68 | Edit the settings as necessary for your system. 69 | 70 | 4. If you have a source of URLs for a host that you would like crawled, they can be added to the [full_urls](#full_urls) table: 71 | ```sql 72 | insert into full_urls (host, url) values ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'), ... 73 | ``` 74 | The crawler will attempt to get URLs from the home page even if none are available in this table. 75 | 76 | ### Running the crawler 77 | 78 | 1.
Add hosts to be crawled to the [https_queue](#https_queue) table: 79 | ```sql 80 | insert into https_queue (domain) values ('duckduckgo.com'); 81 | ``` 82 | 83 | 2. The crawler can be run as follows: 84 | ```sh 85 | perl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml 86 | ``` 87 | 88 | ### Checking the results 89 | 90 | 1. The individual HTTP and HTTPs comparisons for each URL crawled are stored in [https_crawl](#https_crawl): 91 | ```sql 92 | select * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10; 93 | ``` 94 | The maximum number of URLs per crawl session, i.e. the `limit` used above, is determined by [URLS_PER_SITE](config.yml.example#L49). 95 | 96 | 2. Aggregate session data for each host is stored in [https_crawl_aggregate](#https_crawl_aggregate): 97 | ```sql 98 | select * from https_crawl_aggregate where domain = 'duckduckgo.com'; 99 | ``` 100 | There is also an associated view - [https_upgrade_metrics](#https_upgrade_metrics) - that calculates some additional metrics: 101 | ```sql 102 | select * from https_upgrade_metrics where domain = 'duckduckgo.com'; 103 | ``` 104 | 105 | 3. Additional information from the crawl can be found in: 106 | 107 | * [ssl_cert_info](#ssl_cert_info) 108 | * [mixed_assets](#mixed_assets) 109 | * [https_response_headers](#https_response_headers) 110 | 111 | 4. Hosts can be selected based on various combinations of criteria directly from the above tables or by using the [upgradeable_domains](#upgradeable_domains) function. 112 | 113 | ### Data Model 114 | 115 | #### full_urls 116 | 117 | Complete URLs for hosts that will be used in addition to those the crawler extracts from the home page. 118 | 119 | | Column | Description | Type | Key | 120 | | --- | --- | --- | --- | 121 | | host | hostname | text |unique| 122 | | url | Complete URL with scheme | text |unique| 123 | | updated | When added to table | timestamp with time zone || 124 | 125 | #### https_queue 126 | 127 | Domains to be crawled in rank order. Multiple crawlers can access this concurrently. 128 | 129 | | Column | Description | Type | Key | 130 | | --- | --- | --- | --- | 131 | | rank | Processing order | integer | primary | 132 | |domain | Domain to be crawled | character varying(500) || 133 | |processing_host|Hostname of server processing domain|character varying(50)|| 134 | |worker_pid|Process ID of crawler handling domain|integer|| 135 | |reserved|When domain was selected for processing|timestamp with time zone|| 136 | |started|When processing of domain started|timestamp with time zone|| 137 | |finished|When processing of domain completed|timestamp with time zone|| 138 | 139 | #### https_crawl 140 | 141 | Log table of HTTP and HTTPs comparisons made by the crawler.
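For example, a sketch of a query that pulls the most recent comparisons for a domain that loaded mixed content, joined to the individual assets recorded in [mixed_assets](#mixed_assets) (column names are taken from the tables below; illustrative only):

```sql
-- Recent mixed-content comparisons for one domain (illustrative query).
select c.id, c.https_request_uri, c.https_response, m.asset
  from https_crawl c
  join mixed_assets m on m.https_crawl_id = c.id
 where c.domain = 'duckduckgo.com'
   and c.mixed
 order by c.id desc
 limit 20;
```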
142 | 143 | | Column | Description | Type | Key | 144 | | --- | --- | --- | --- | 145 | | id | Comparison ID | bigint | unique | 146 | |domain|Domain evaluated|text|| 147 | |http_request_uri|Resulting URI of HTTP request|text|| 148 | |http_response|HTTP status code for HTTP request|integer|| 149 | |http_requests|Total requests made, including child subrequests, for HTTP request|integer|| 150 | |http_size|Size of HTTP response (bytes)|integer|| 151 | |https_request_uri|Resulting URI of HTTPs request|text|| 152 | |https_response|HTTP status code for HTTPs request|integer|| 153 | |https_requests|Total requests made, including child subrequests, for HTTPs request|integer|| 154 | |https_size|Size of HTTPs response (bytes)|integer|| 155 | |timestamp|When inserted|timestamp with time zone|| 156 | |screenshot_diff|Percentage difference between HTTP and HTTPs screenshots after page load|real|| 157 | |autoupgrade|Whether HTTP request was redirected to HTTPs|boolean|| 158 | |mixed|Whether HTTPs request had HTTP child requests|boolean|| 159 | 160 | #### mixed_assets 161 | 162 | HTTP child requests made for HTTPs. 163 | 164 | | Column | Description | Type | Key | 165 | | --- | --- | --- | --- | 166 | | https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign | 167 | | asset | URI of HTTP subrequest made during HTTPs request | text | unique | 168 | 169 | 170 | #### https_response_headers 171 | 172 | The response headers for HTTPs requests. 173 | 174 | | Column | Description | Type | Key | 175 | | --- | --- | --- | --- | 176 | | https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign | 177 | |response_headers|key/value of all HTTPs response headers|jsonb|| 178 | 179 | 180 | #### ssl_cert_info 181 | 182 | SSL certificate information for domains crawled. 183 | 184 | | Column | Description | Type | Key | 185 | | --- | --- | --- | --- | 186 | | domain | Domain evaluated | text | primary | 187 | |issuer|Issuer of SSL certificate|text|| 188 | |notbefore|Valid from timestamp|timestamp with time zone|| 189 | |notafter|Valid to timestamp|timestamp with time zone|| 190 | |host_valid|Whether the domain is covered by the SSL certificate|boolean|| 191 | |err|Connection err|text|| 192 | |updated|When last updated|timestamp with time zone|| 193 | 194 | #### https_crawl_aggregate 195 | 196 | Aggregate of [https_crawl](#https_crawl) that creates latest crawl sessions based on domain. Can also include domains that were redirected to and not directly crawled. 
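As a sketch of how the redirect bookkeeping can be inspected, the query below lists crawled (non-redirect) domains that redirected some HTTPs requests to other hosts, along with the per-host counts stored in `redirect_hosts` (columns are described in the table below):

```sql
-- Crawled domains whose HTTPs requests were redirected to other hosts (illustrative query).
select domain, redirects, redirect_hosts
  from https_crawl_aggregate
 where is_redirect = false
   and redirects > 0
 order by redirects desc
 limit 10;
```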
197 | 198 | | Column | Description | Type | Key | 199 | | --- | --- | --- | --- | 200 | | domain | Domain evaluated | text | primary | 201 | |https|Comparisons where only HTTPs was supported|integer|| 202 | |http_and_https|Comparisons where HTTP and HTTPs were supported|integer|| 203 | |http|Comparisons where only HTTP was supported|integer|| 204 | |https_errs|Number of non-2xx HTTPs responses|integer|| 205 | |unknown|Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed|integer|| 206 | |autoupgrade|Comparisons where HTTP was redirected to HTTPs|integer|| 207 | |mixed_requests|HTTPs requests that made HTTP calls|integer|| 208 | |max_screenshot_diff|Maximum percentage difference between HTTP and HTTPs screenshots|real|| 209 | |redirects|Number of HTTPs requests redirected to a different host|integer|| 210 | |requests|Number of comparison requests actually made during the crawl session|integer|| 211 | |session_request_limit|The number of comparisons wanted for the session|integer|| 212 | |is_redirect|Whether the domain was actually crawled or is a redirect from another host in the table that was crawled|boolean|| 213 | |max_https_crawl_id|https_crawl.id of last comparison made during crawl session|bigint|| 214 | |redirect_hosts|key/value pairs of hosts and the number of redirects to each|jsonb|| 215 | 216 | #### https_upgrade_metrics 217 | 218 | View of [https_crawl_aggregate](#https_crawl_aggregate) that calculates crawl session percentages for easier selection based on cutoffs. 219 | 220 | | Column | Description | Type | Key | 221 | | --- | --- | --- | --- | 222 | | domain | Domain evaluated | text | | 223 | | unknown_pct | Percentage of unknown comparisons|real|| 224 | | combined_pct | Percentage that supported HTTPs|real|| 225 | | https_err_rate | Rate of non-2xx HTTPs responses|real|| 226 | | max_screenshot_diff | https_crawl_aggregate.max_screenshot_diff|real|| 227 | | mixed_ok | Whether HTTPs requests contained mixed content requests|boolean|| 228 | | autoupgrade_pct|Percentage of comparisons where HTTP was redirected to HTTPs|real|| 229 | 230 | #### domain_exceptions 231 | 232 | For manually excluding domains that may otherwise pass specific upgrade criteria given to [upgradeable_domains](#upgradeable_domains). 233 | 234 | | Column | Description | Type | Key | 235 | | --- | --- | --- | --- | 236 | | domain | Domain to exclude | text | primary | 237 | | comment | Reason for exclusion | text || 238 | |updated|When added|timestamp with time zone|| 239 | 240 | #### upgradeable_domains 241 | 242 | Function to select domains based on a variety of criteria.
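A hypothetical invocation is sketched below. The cutoff values are illustrative only, and the exact signature (parameter names, order, defaults, and return columns) is defined in [upgradeable_domains_func.sql](sql/upgradeable_domains_func.sql); the sketch assumes the function accepts the parameters listed in the table that follows:

```sql
-- Illustrative cutoffs only; tune them to your own upgrade criteria.
select *
  from upgradeable_domains(
    autoupgrade_min     => 0.0,
    combined_min        => 0.9,
    screenshot_diff_max => 0.05,
    mixed_ok            => false,
    max_err_rate        => 0.1,
    unknown_max         => 0.1,
    ssl_cert_buffer     => now() + interval '30 days',
    exclude_issuers     => array['Example Untrusted CA']  -- hypothetical issuer name
  );
```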
243 | 244 | | Parameter | Description | Type | Source | 245 | | --- | --- | --- | --- | 246 | |autoupgrade_min|Minimum autoupgrade percentage|real|[https_upgrade_metrics](#https_upgrade_metrics)| 247 | |combined_min|Minimum percentage of HTTPs responses|real|[https_upgrade_metrics](#https_upgrade_metrics)| 248 | |screenshot_diff_max|Maximum observed screenshot diff allowed|real|[https_upgrade_metrics](#https_upgrade_metrics)| 249 | |mixed_ok|Whether to allow domains that had mixed content|boolean|[https_upgrade_metrics](#https_upgrade_metrics)| 250 | |max_err_rate|Maximum https_err_rate|real|[https_upgrade_metrics](#https_upgrade_metrics)| 251 | |unknown_max|Maximum unknown comparisons|real|[https_upgrade_metrics](#https_upgrade_metrics)| 252 | |ssl_cert_buffer|SSL certificate must be valid until this timestamp|timestamp with time zone|[ssl_cert_info](#ssl_cert_info)| 253 | |exclude_issuers|Array of SSL cert issuers to exclude|text array|[ssl_cert_info](#ssl_cert_info)| 254 | 255 | In addition to the above parameters, the function enforces several other conditions: 256 | 257 | 1. Domain must not be in [domain_exceptions](#domain_exceptions) 258 | 2. From values in [ssl_cert_info](#ssl_cert_info): 259 | 1. No err 260 | 2. The domain, or host, must be valid for the certificate. 261 | 3. Valid from/to and the issuer must not be null 262 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This license does not apply to any DuckDuckGo logos or marks that may be contained 2 | in this repo. DuckDuckGo logos and marks are licensed separately under the CCBY-NC-ND 4.0 3 | license (https://creativecommons.org/licenses/by-nc-nd/4.0/), and official up-to-date 4 | versions can be downloaded from https://duckduckgo.com/press. 5 | 6 | Copyright 2010 Duck Duck Go, Inc. 7 | 8 | Licensed under the Apache License, Version 2.0 (the "License"); 9 | you may not use this file except in compliance with the License. 10 | You may obtain a copy of the License at 11 | 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | 14 | Unless required by applicable law or agreed to in writing, software 15 | distributed under the License is distributed on an "AS IS" BASIS, 16 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DuckDuckGo Smarter Encryption 2 | 3 | DuckDuckGo Smarter Encryption is a large list of web sites that we know support HTTPS. The list is automatically generated and updated by using the crawler in this repository. 4 | 5 | For more information about where the list is being used and how it compares to other solutions, see our blog post [Your Connection is Secure with DuckDuckGo Smarter Encryption](https://spreadprivacy.com/duckduckgo-smarter-encryption). 6 | 7 | This software is licensed under the terms of the Apache License, Version 2.0 (see [LICENSE](LICENSE)). Copyright (c) 2019 [Duck Duck Go, Inc.](https://duckduckgo.com) 8 | 9 | ## Contributing 10 | 11 | See [Contributing](CONTRIBUTING.md) for more information about [Reporting bugs](CONTRIBUTING.md#reporting-bugs) and [Getting Started](CONTRIBUTING.md#getting-started) with the crawler. 12 | 13 | ## Just want the list?
14 | 15 | The list we use (as a result of running this code) is [publicly available](https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz) under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/). 16 | 17 | If you'd like to license the list for commercial use, [please reach out](https://help.duckduckgo.com/duckduckgo-help-pages/company/contact-us/). 18 | 19 | ## Questions or help with other DuckDuckGo things? 20 | See [DuckDuckGo Help Pages](https://duck.co/help). 21 | -------------------------------------------------------------------------------- /SmarterEncryption/Crawl.pm: -------------------------------------------------------------------------------- 1 | package SmarterEncryption::Crawl; 2 | 3 | use Exporter::Shiny qw' 4 | aggregate_crawl_session 5 | check_ssl_cert 6 | dupe_link 7 | urls_by_path 8 | '; 9 | 10 | use IO::Socket::SSL; 11 | use IO::Socket::SSL::Utils 'CERT_asHash'; 12 | use Cpanel::JSON::XS 'encode_json'; 13 | use List::Util 'sum'; 14 | use URI; 15 | use List::AllUtils qw'each_arrayref'; 16 | use Domain::PublicSuffix; 17 | 18 | use strict; 19 | use warnings; 20 | no warnings 'uninitialized'; 21 | use feature 'state'; 22 | 23 | my $SSL_TIMEOUT = 5; 24 | my $DEBUG = 0; 25 | 26 | # Fields we want to convert to int if null 27 | my @CONVERT_TO_INT = qw' 28 | https 29 | http_s 30 | https_errs 31 | http 32 | unknown 33 | autoupgrade 34 | mixed_requests 35 | max_ss_diff 36 | redirects 37 | '; 38 | 39 | sub screenshot_threshold { 0.05 } 40 | # Number of URLs checked for each domain per run. 41 | sub urls_per_domain { 10 } 42 | 43 | sub check_ssl_cert { 44 | my $host = shift; 45 | 46 | my ($issuer, $not_before, $not_after, $host_valid, $err); 47 | 48 | if(my $iossl = IO::Socket::SSL->new( 49 | PeerHost => $host, 50 | PeerPort => 'https', 51 | SSL_hostname => $host, 52 | Timeout => $SSL_TIMEOUT, 53 | )){ 54 | $host_valid = $iossl->verify_hostname($host, 'http') || 0; 55 | my $c = $iossl->peer_certificate; 56 | my $cert = CERT_asHash($c); 57 | $issuer = $cert->{issuer}{organizationName}; 58 | $not_before = gmtime($cert->{not_before}) . ' UTC'; 59 | $not_after = gmtime($cert->{not_after}) . 
' UTC'; 60 | } 61 | else{ 62 | my $sys_err = $!; 63 | $err = $SSL_ERROR; 64 | if($sys_err){ $err .= ": $sys_err"; } 65 | } 66 | 67 | return [$issuer, $not_before, $not_after, $host_valid, $err]; 68 | } 69 | 70 | sub aggregate_crawl_session { 71 | my ($domain, $session) = @_; 72 | 73 | state $dps = Domain::PublicSuffix->new; 74 | my $root_domain = $dps->get_root_domain($domain); 75 | 76 | my %domain_stats = (is_redirect => 0); 77 | my %redirects; 78 | for my $comparison (@$session){ 79 | my ($http_request_uri, 80 | $http_response, 81 | $https_request_uri, 82 | $https_response, 83 | $autoupgrade, 84 | $mixed, 85 | $screenshot_diff, 86 | $id 87 | ) = @$comparison{qw' 88 | http_request_uri 89 | http_response 90 | https_request_uri 91 | https_response 92 | autoupgrade 93 | mixed 94 | ss_diff 95 | id 96 | '}; 97 | 98 | 99 | my $http_valid = $http_request_uri =~ /^http:/i; 100 | my $https_valid = $https_request_uri =~ /^https:/i; 101 | 102 | my $redirect; 103 | if($https_valid){ 104 | if(my $host = eval { URI->new($https_request_uri)->host }){ 105 | if($host ne $domain){ 106 | my $host_root_domain = $dps->get_root_domain($host); 107 | if($root_domain eq $host_root_domain){ 108 | ++$domain_stats{redirects}{$host}; 109 | unless(exists $redirects{$host}){ 110 | $redirects{$host} = {is_redirect => 1}; 111 | } 112 | $redirect = $redirects{$host}; 113 | } 114 | } 115 | } 116 | } 117 | 118 | ++$domain_stats{requests}; 119 | $redirect && ++$redirect->{requests}; 120 | 121 | $domain_stats{max_id} = $id if $domain_stats{max_id} < $id; 122 | $redirect->{max_id} = $id if $redirect && ($redirect->{max_id} < $id); 123 | 124 | if($autoupgrade){ 125 | ++$domain_stats{autoupgrade}; 126 | $redirect && ++$redirect->{autoupgrade}; 127 | } 128 | 129 | if($mixed){ 130 | ++$domain_stats{mixed_requests}; 131 | $redirect && ++$redirect->{mixed_requests}; 132 | } 133 | 134 | if(defined($screenshot_diff)){ 135 | $domain_stats{max_ss_diff} = $screenshot_diff if $domain_stats{max_ss_diff} < $screenshot_diff; 136 | $redirect->{max_ss_diff} = $screenshot_diff if $redirect && ($redirect->{max_ss_diff} < $screenshot_diff) 137 | } 138 | 139 | my $http_s_same_response = $http_response == $https_response; 140 | my $http_response_good = $http_valid && ( ($http_response == 200) || $http_s_same_response ); 141 | my $https_response_good = $https_valid && ( ($https_response == 200) || $http_s_same_response); 142 | 143 | if($https_response_good){ 144 | if($http_response_good){ 145 | ++$domain_stats{http_s}; 146 | $redirect && ++$redirect->{http_s}; 147 | } 148 | else{ 149 | ++$domain_stats{https}; 150 | $redirect && ++$redirect->{https}; 151 | } 152 | 153 | if($https_response =~ /^[45]/){ 154 | ++$domain_stats{https_errs}; 155 | $redirect && ++$redirect->{https_errs}; 156 | } 157 | } 158 | elsif($http_response_good){ 159 | ++$domain_stats{http}; 160 | $redirect && ++$redirect->{http}; 161 | } 162 | else{ 163 | ++$domain_stats{unknown}; 164 | $redirect && ++$redirect->{unknown}; 165 | } 166 | } 167 | 168 | my %aggs; 169 | if(my $hosts = delete $domain_stats{redirects}){ 170 | $domain_stats{redirects} = sum values(%$hosts); 171 | $domain_stats{redirect_hosts} = encode_json($hosts); 172 | 173 | while(my ($host, $agg) = each %redirects){ 174 | null_to_int($agg); 175 | $aggs{$host} = $agg; 176 | } 177 | } 178 | 179 | null_to_int(\%domain_stats); 180 | $aggs{$domain} = \%domain_stats; 181 | 182 | return \%aggs; 183 | } 184 | 185 | sub null_to_int { 186 | my $h = shift; 187 | $h->{$_} += 0 for @CONVERT_TO_INT; 188 | } 189 | 190 | sub 
urls_by_path { 191 | my ($urls, $rr, $url_limit) = @_; 192 | 193 | my %links; 194 | for my $url (@$urls){ 195 | eval { 196 | my @segs = URI->new($url)->path_segments; 197 | push @{$links{$segs[1]}}, $url; 198 | }; 199 | } 200 | 201 | my @sorted_paths = sort {@{$links{$b}} <=> @{$links{$a}}} keys %links; 202 | 203 | my @urls_by_path; 204 | 205 | my $paths = each_arrayref @links{@sorted_paths}; 206 | CLICK_GROUP: while(my @urls = $paths->()){ 207 | for my $url (@urls){ 208 | next unless $url; 209 | last CLICK_GROUP unless @urls_by_path < $url_limit; 210 | next unless $rr->allowed($url); 211 | push @urls_by_path, $url; 212 | } 213 | } 214 | 215 | @$urls = @urls_by_path; 216 | } 217 | 218 | 219 | sub dupe_link { 220 | my ($url, $urls) = @_; 221 | 222 | $url =~ s{^https:}{http:}i; 223 | 224 | for (@$urls){ 225 | my $u = $_ =~ s{^https:}{http:}ir; 226 | return 1 if URI::eq($u, $url); 227 | } 228 | 229 | 0; 230 | } 231 | 232 | 1; 233 | -------------------------------------------------------------------------------- /config.yml.example: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | # Top-level temp directory will be created on start and removed 4 | # on exit. Each crawler will have its own subdirectory with 5 | # PID appended 6 | TMP_DIR: /tmp/smarter_encryption 7 | CRAWLER_TMP_PREFIX: crawler_ 8 | 9 | # User agent. Will use defaults if not specified 10 | #UA: 11 | VERBOSE: 1 12 | 13 | # Paths to system binaries. If in path already, just the program 14 | # name should suffice. 15 | COMPARE: /usr/local/bin/compare 16 | PKILL: /usr/bin/pkill 17 | 18 | # Database connection options. If not specified will connect as 19 | # the current user. 20 | #DB: 21 | #HOST: 22 | #PORT: 23 | #USER: 24 | #PASS: 25 | 26 | # Number of concurrent crawlers per cpu. 27 | CRAWLERS_PER_CPU: 3 28 | # or exact number 29 | # MAX_CONCURRENT_CRAWLERS: 10 30 | 31 | # Path to phantomjs. Should be v2.1.1 32 | PHANTOMJS: phantomjs 33 | 34 | # Path to modified netsniff.js 35 | NETSNIFF_SS: netsniff_screenshot.js 36 | 37 | # Timeout before killing phantomjs in seconds 38 | HEADLESS_ALARM: 30 39 | 40 | # Whether to continue running and polling the queue or exit when finished. 41 | # If specified and non-zero, it is the number of seconds to wait in 42 | # between polls. 43 | POLL: 60 44 | 45 | # Number of sites a crawler should process before exiting 46 | SITES_PER_CRAWLER: 10 47 | 48 | # Desired number of URLs to check for each site 49 | URLS_PER_SITE: 10 50 | 51 | # Max percentage of URLS_PER_SITE included from the current home page 52 | HOMEPAGE_LINK_PCT: 0.5 53 | 54 | # Number of times to re-request HTTPs URL on failure 55 | HTTPS_RETRIES: 1 56 | 57 | # If SCREENSHOT_RETRIES is not 0, the comparison between HTTP and HTTPs 58 | # pages will be re-run if the diff is above SCREENSHOT_THRESHOLD. It 59 | # will also introduce a delay before taking the screenshot to potentially 60 | # overcome slight network differences between the two. The delay will 61 | # remain in effect for links still to be processed for the site. 
62 | SCREENSHOT_RETRIES: 1 63 | SCREENSHOT_THRESHOLD: 0.05 64 | PHANTOM_RENDER_DELAY: 1000 65 | -------------------------------------------------------------------------------- /cpanfile: -------------------------------------------------------------------------------- 1 | requires 'Cpanel::JSON::XS', 2.3310; 2 | requires 'DBI', '1.631'; 3 | requires 'Domain::PublicSuffix', '0.10'; 4 | requires 'Exporter::Shiny', '0.038'; 5 | requires 'Exporter::Tiny', 0.038; 6 | requires 'File::Copy::Recursive', 0.38; 7 | requires 'IO::Socket::SSL', 2.060; 8 | requires 'IO::Socket::SSL::Utils', 2.014; 9 | requires 'IPC::Run', 0.92; 10 | requires 'IPC::Run::Timer', 0.90; 11 | requires 'LWP', 6.05; 12 | requires 'List::AllUtils', 0.07; 13 | requires 'List::Util', 1.52; 14 | requires 'POE', 1.358; 15 | requires 'POE::XS::Loop::Poll', 1.000; 16 | requires 'URI', 1.71; 17 | requires 'URI::Escape', 3.31; 18 | requires 'WWW::Mechanize', 1.73; 19 | requires 'WWW::RobotRules', 6.02; 20 | requires 'YAML::XS', 0.41; 21 | -------------------------------------------------------------------------------- /https_crawl.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use LWP::UserAgent; 4 | use WWW::Mechanize; 5 | use POE::Kernel { loop => 'POE::XS::Loop::Poll' }; 6 | use POE qw(Wheel::Run Filter::Reference); 7 | use DBI; 8 | use Sys::Hostname 'hostname'; 9 | use Cpanel::JSON::XS qw'decode_json encode_json'; 10 | use URI; 11 | use File::Copy::Recursive qw'pathmk pathrmdir'; 12 | use WWW::RobotRules; 13 | use IPC::Run; 14 | use YAML::XS 'LoadFile'; 15 | use List::AllUtils 'each_arrayref'; 16 | use SmarterEncryption::Crawl qw' 17 | aggregate_crawl_session 18 | check_ssl_cert 19 | dupe_link 20 | urls_by_path 21 | '; 22 | use Module::Load::Conditional 'can_load'; 23 | 24 | use feature 'state'; 25 | use strict; 26 | use warnings; 27 | no warnings 'uninitialized'; 28 | 29 | my $DDG_INTERNAL; 30 | if(can_load(modules => {'DDG::Util::HTTPS2' => undef})){ 31 | DDG::Util::HTTPS2->import(qw'add_stat backfill_urls'); 32 | $DDG_INTERNAL = 1; 33 | } 34 | 35 | my $HOST = hostname(); 36 | 37 | # Crawler Config 38 | my %CC; 39 | 40 | # Derived config values 41 | my ($MAX_CONCURRENT_CRAWLERS, $PHANTOM_TIMEOUT, $HOMEPAGE_LINKS_MAX); 42 | 43 | POE::Session->create( 44 | inline_states => { 45 | _start => \&_start, 46 | _stop => \&normal_cleanup, 47 | crawl => \&start_crawlers, 48 | crawler_done => \&crawler_done, 49 | crawler_debug => \&crawler_debug, 50 | sig_child => \&sig_child, 51 | shutdown => \&shutdown_now, 52 | prune_tmp_dirs => \&prune_tmp_dirs 53 | } 54 | ); 55 | 56 | POE::Kernel->run; 57 | exit; 58 | 59 | sub _start { 60 | my ($k, $h) = @_[KERNEL, HEAP]; 61 | 62 | parse_argv(); 63 | 64 | unless($MAX_CONCURRENT_CRAWLERS){ 65 | $MAX_CONCURRENT_CRAWLERS = `nproc` * $CC{CRAWLERS_PER_CPU}; 66 | } 67 | 68 | $PHANTOM_TIMEOUT = $CC{HEADLESS_ALARM} * 1000; # in ms 69 | $HOMEPAGE_LINKS_MAX = sprintf '%d', $CC{HOMEPAGE_LINK_PCT} * $CC{URLS_PER_SITE}; 70 | 71 | my $TMP_DIR = $CC{TMP_DIR}; 72 | unless(-d $TMP_DIR){ 73 | $CC{VERBOSE} && warn "Creating temp dir $TMP_DIR\n"; 74 | pathmk($TMP_DIR) or die "Failed to create tmp dir $TMP_DIR: $!"; 75 | } 76 | 77 | # clean up leftover junk for forced shutdown 78 | while(<$TMP_DIR/$CC{CRAWLER_TMP_PREFIX}*>){ 79 | chomp; 80 | pathrmdir($_) or warn "Failed to remove old crawler tmp dir $_: $!"; 81 | } 82 | 83 | $k->sig($_ => 'shutdown') for qw{TERM INT}; 84 | 85 | $k->yield('crawl'); 86 | } 87 | 88 | sub shutdown_now { 89 | 
$_[KERNEL]->sig_handled; 90 | 91 | # Kill crawlers 92 | $_->kill() for values %{$_[HEAP]->{crawlers}}; 93 | 94 | # Make unfinished tasks available in the queue 95 | my $db = prep_db('queue'); 96 | $db->{reset_unfinished_tasks}->execute; 97 | 98 | normal_cleanup(); 99 | 100 | exit 1; 101 | } 102 | 103 | sub normal_cleanup { 104 | # remove tmp dir 105 | pathrmdir($CC{TMP_DIR}) if -d $CC{TMP_DIR}; 106 | } 107 | 108 | sub start_crawlers{ 109 | my ($k, $h) = @_[KERNEL, HEAP]; 110 | 111 | my $db = prep_db('queue'); 112 | 113 | my $reserve_tasks = $db->{reserve_tasks}; 114 | while(keys %{$h->{crawlers}} < $MAX_CONCURRENT_CRAWLERS){ 115 | 116 | $reserve_tasks->execute(); 117 | if(my @ranks = sort map { $_->[0] } @{$reserve_tasks->fetchall_arrayref}){ 118 | 119 | my $c = POE::Wheel::Run->new( 120 | Program => \&crawl_sites, 121 | ProgramArgs => [\@ranks], 122 | CloseOnCall => 1, 123 | NoSetSid => 1, 124 | StderrEvent => 'crawler_debug', 125 | CloseEvent => 'crawler_done', 126 | StdinFilter => POE::Filter::Reference->new, 127 | StderrFilter => POE::Filter::Line->new 128 | ); 129 | $h->{crawlers}{$c->ID} = $c; 130 | $k->sig_child($c->PID, 'sig_child'); 131 | } 132 | else{ 133 | $CC{POLL} && $k->delay(crawl => $CC{POLL}); 134 | last; 135 | } 136 | } 137 | } 138 | 139 | sub crawl_sites{ 140 | my ($ranks) = @_; 141 | 142 | my $VERBOSE = $CC{VERBOSE}; 143 | my $db = prep_db('crawl'); 144 | 145 | my $crawler_tmp_dir = "$CC{TMP_DIR}/$CC{CRAWLER_TMP_PREFIX}$$"; 146 | my $rm_tmp = pathmk($crawler_tmp_dir); 147 | 148 | my @urls_by_domain; 149 | for(my $i = 0;$i < @$ranks;++$i){ 150 | my $rank = $ranks->[$i]; 151 | 152 | my $domain; 153 | eval { 154 | $db->{start_task}->execute($$, $rank); 155 | $domain = $db->{start_task}->fetchall_arrayref->[0][0]; 156 | } 157 | or do { 158 | warn "Failed to start task for rank $rank: $@"; 159 | next; 160 | }; 161 | 162 | eval { 163 | $domain = URI->new("https://$domain/")->host; 164 | 1; 165 | } 166 | or do { 167 | warn "Failed to filter domain $domain: $@"; 168 | next; 169 | }; 170 | 171 | $VERBOSE && warn "checking domain $domain\n"; 172 | my $urls = get_urls_for_domain($domain, $db); 173 | my @pairs; 174 | for my $url (@$urls){ 175 | push @pairs, [$domain, $url]; 176 | } 177 | push @urls_by_domain, \@pairs if @pairs; 178 | } 179 | 180 | my $ranks_str = '{' . join(',', @$ranks) . '}'; 181 | 182 | my $ea = each_arrayref @urls_by_domain; 183 | 184 | my (%ssl_cert_checked, %domain_render_delay, %sessions); 185 | while(my @urls = $ea->()){ 186 | for my $u (@urls){ 187 | next unless $u; 188 | my ($domain, $url) = @$u; 189 | next unless $url =~ /^http/i; 190 | 191 | # for the command-line 192 | $url =~ s/'/%27/g; 193 | 194 | my ($http_url) = $url =~ s/^https:/http:/ri; 195 | my ($https_url) = $url =~ s/^http:/https:/ri; 196 | 197 | my $http_ss = $crawler_tmp_dir . '/http.' . $domain . '.png'; 198 | 199 | unless($ssl_cert_checked{$domain}){ 200 | my $ssl = check_ssl_cert($domain); 201 | eval { 202 | $db->{insert_ssl}->execute($domain, @$ssl); 203 | ++$ssl_cert_checked{$domain}; 204 | } 205 | or do { 206 | warn "Failed to insert ssl info for $domain: $@"; 207 | }; 208 | } 209 | 210 | my %comparison; 211 | # We will compare a URL twice max: 212 | # 1. Compare HTTP vs. HTTPS 213 | # 2. 
Redo if the screenshot is a above the threshold to check for rendering problems 214 | SCREENSHOT_RETRY: for (0..$CC{SCREENSHOT_RETRIES}){ 215 | my $redo_comparison = 0; 216 | 217 | my %stats = (domain => $domain); 218 | check_site(\%stats, $http_url, $http_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); 219 | # the idea behind screenshots is: 220 | # 1. Do for HTTP automatically so we don't have to make another request if it works 221 | # 2. Do for HTTPS if HTTP worked and wasn't autoupgraded 222 | # 3. If HTTPS worked and didn't downgrade, compare them 223 | my $https_ss; 224 | if( (-e $http_ss) && ($stats{http_request_uri} =~ /^http:/i) && ($stats{http_response} == 200)){ 225 | $https_ss = $crawler_tmp_dir . '/https.' . $domain . '.png'; 226 | } 227 | 228 | HTTPS_RETRY: for my $https_attempt (0..$CC{HTTPS_RETRIES}){ 229 | my $redo_https; 230 | check_site(\%stats, $https_url, $https_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); 231 | if( ($stats{https_request_uri} =~ /^https:/i) && ($stats{https_response} == 200)){ 232 | if($https_ss && (-e $https_ss)){ 233 | my $out = `$CC{COMPARE} -metric mae $http_ss $https_ss /dev/null 2>&1`; 234 | 235 | if(my ($diff) = $out =~ /\(([\d\.e\-]+)\)/){ 236 | if($CC{SCREENSHOT_THRESHOLD} < $diff){ 237 | # Only need to redo on the first failure. After that, the delay 238 | # will have already been increased by a previous URL 239 | unless($domain_render_delay{$domain} == $CC{PHANTOM_RENDER_DELAY}){ 240 | $domain_render_delay{$domain} = $CC{PHANTOM_RENDER_DELAY}; 241 | $redo_comparison = 1; 242 | $VERBOSE && warn "redoing $http_url (diff: $diff)\n"; 243 | } 244 | } 245 | $stats{ss_diff} = $diff; 246 | } 247 | else{ 248 | warn "Failed to extract compare diff betweeen $http_ss and $https_ss from $out\n"; 249 | } 250 | unlink $_ for $http_ss, $https_ss; 251 | } 252 | 253 | if($DDG_INTERNAL && $https_attempt){ 254 | add_stat(qw'increment smarter_encryption.crawl.https_retries.success'); 255 | } 256 | } 257 | elsif($DDG_INTERNAL && $https_attempt){ 258 | add_stat(qw'increment smarter_encryption.crawl.https_retries.failure'); 259 | } 260 | elsif( ($stats{https_request_uri} !~ /^http:/) && ($stats{http_response} != $stats{https_response})){ 261 | $redo_https = 1; 262 | $VERBOSE && warn "Redoing HTTPS request for $domain: $https_url\n"; 263 | } 264 | 265 | last HTTPS_RETRY unless $redo_https; 266 | } 267 | 268 | # Most should exit here 269 | unless($redo_comparison){ 270 | %comparison = %stats; 271 | last; 272 | } 273 | } 274 | 275 | unless($db->{con}->ping){ 276 | $VERBOSE && warn "Reconnecting to DB before inserting comparison"; 277 | $db = prep_db('crawl'); 278 | } 279 | 280 | if(my $host = eval { URI->new($comparison{https_request_uri})->host}){ 281 | unless($ssl_cert_checked{$host}){ 282 | my $ssl = check_ssl_cert($host); 283 | eval { 284 | $db->{insert_ssl}->execute($host, @$ssl); 285 | ++$ssl_cert_checked{$host}; 286 | } 287 | or do { 288 | warn "Failed to insert ssl info for $host: $@"; 289 | }; 290 | } 291 | } 292 | 293 | if($comparison{http_request_uri} || $comparison{https_request_uri}){ 294 | my $log_id; 295 | eval { 296 | $db->{insert_domain}->execute(@comparison{qw' 297 | domain 298 | http_request_uri 299 | http_response 300 | http_requests 301 | http_size 302 | https_request_uri 303 | https_response 304 | https_requests 305 | https_size 306 | autoupgrade 307 | mixed 308 | ss_diff'} 309 | ); 310 | $log_id = $db->{insert_domain}->fetch()->[0]; 311 | } 312 | or do { 313 | $VERBOSE && warn "Failed to insert request for $domain: $@"; 314 | }; 
315 | 316 | if($log_id){ 317 | if(my $hdrs = delete $comparison{https_response_headers}){ 318 | eval { 319 | $db->{insert_headers}->execute($log_id, $hdrs); 320 | } 321 | or do { 322 | $VERBOSE && warn "Failed to insert response headers for $domain ($log_id): $@"; 323 | }; 324 | } 325 | 326 | if(my $mixed_reqs = delete $comparison{mixed_children}){ 327 | for my $m (keys %$mixed_reqs){ 328 | eval{ 329 | $db->{insert_mixed}->execute($log_id, $m); 330 | 1; 331 | } 332 | or do { 333 | $VERBOSE && warn "Failed to insert mixed request for $domain: $@"; 334 | }; 335 | } 336 | } 337 | $comparison{id} = $log_id; 338 | push @{$sessions{$domain}}, \%comparison; 339 | } 340 | } 341 | } 342 | } 343 | 344 | unless($db->{con}->ping){ 345 | $VERBOSE && warn "Reconnecting to DB before updating aggregate data"; 346 | $db = prep_db('crawl'); 347 | } 348 | 349 | while(my ($domain, $session) = each %sessions){ 350 | my $aggregates = aggregate_crawl_session($domain, $session); 351 | while(my ($host, $agg) = each %$aggregates){ 352 | eval { 353 | $db->{upsert_aggregate}->execute( 354 | $host, @$agg{qw' 355 | https 356 | http_s 357 | https_errs 358 | http 359 | unknown 360 | autoupgrade 361 | mixed_requests 362 | max_ss_diff 363 | redirects 364 | max_id 365 | requests 366 | is_redirect 367 | redirect_hosts' 368 | } 369 | ); 370 | 1; 371 | } 372 | or do { 373 | warn "Failed to upsert aggregate for $host: $@"; 374 | }; 375 | } 376 | } 377 | 378 | eval { 379 | $db->{finish_tasks}->execute($ranks_str); 380 | 1; 381 | } 382 | or do { 383 | warn "Failed to finish tasks for ranks ($ranks_str): $@"; 384 | }; 385 | 386 | system "$CC{PKILL} -9 -f '$crawler_tmp_dir '"; 387 | pathrmdir($crawler_tmp_dir) if $rm_tmp; 388 | } 389 | 390 | sub prep_db { 391 | my $target = shift; 392 | 393 | my %db; 394 | 395 | my $con = get_con(); 396 | 397 | if($target eq 'queue'){ 398 | $db{reserve_tasks} = $con->prepare(" 399 | update https_queue 400 | set processing_host = '$HOST', 401 | reserved = now() 402 | where rank in ( 403 | select rank from https_queue 404 | where processing_host is null 405 | order by rank 406 | limit $CC{SITES_PER_CRAWLER} 407 | for update skip locked 408 | ) 409 | returning rank 410 | "); 411 | $db{reset_unfinished_tasks} = $con->prepare(" 412 | update https_queue 413 | set processing_host = null, 414 | worker_pid = null, 415 | reserved = null, 416 | started = null 417 | where 418 | processing_host = '$HOST' and 419 | finished is null 420 | "); 421 | $db{complete_unfinished_worker_tasks} = $con->prepare(" 422 | update https_queue 423 | set finished = now(), 424 | processing_host = '$HOST (incomplete)' 425 | where 426 | processing_host = '$HOST' and 427 | finished is null and 428 | worker_pid = ? 429 | "); 430 | } 431 | elsif($target eq 'crawl'){ 432 | $db{start_task} = $con->prepare('update https_queue set worker_pid = ?, started = now() where rank = ? returning domain'); 433 | $db{select_urls} = $con->prepare('select url from full_urls where host = ?'); 434 | $db{insert_domain} = $con->prepare(' 435 | insert into https_crawl 436 | (domain, http_request_uri, http_response, http_requests, http_size, https_request_uri, https_response, https_requests, https_size, autoupgrade, mixed, screenshot_diff) 437 | values (?,?,?,?,?,?,?,?,?,?,?,?) 
returning id 438 | '); 439 | $db{insert_mixed} = $con->prepare('insert into mixed_assets (https_crawl_id, asset) values (?,?)'); 440 | $db{insert_headers} = $con->prepare('insert into https_response_headers (https_crawl_id, response_headers) values (?,?)'); 441 | $db{finish_tasks} = $con->prepare('update https_queue set finished = now() where rank = ANY(?::integer[])'); 442 | $db{insert_ssl} = $con->prepare(' 443 | insert into ssl_cert_info (domain, issuer, notBefore, notAfter, host_valid, err) values (?,?,?,?,?,?) 444 | on conflict (domain) do update set 445 | issuer = EXCLUDED.issuer, 446 | notBefore = EXCLUDED.notBefore, 447 | notAfter = EXCLUDED.notAfter, 448 | host_valid = EXCLUDED.host_valid, 449 | err = EXCLUDED.err, 450 | updated = now() 451 | '); 452 | # Note where clause: 453 | # 1. Non-redirects update any, including changing a redirect to a non-redirect 454 | # 2. Redirects update other redirects 455 | $db{upsert_aggregate} = $con->prepare(" 456 | insert into https_crawl_aggregate ( 457 | domain, 458 | https, 459 | http_and_https, 460 | https_errs, http, 461 | unknown, 462 | autoupgrade, 463 | mixed_requests, 464 | max_screenshot_diff, 465 | redirects, 466 | max_https_crawl_id, 467 | requests, 468 | is_redirect, 469 | redirect_hosts, 470 | session_request_limit) 471 | values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,$CC{URLS_PER_SITE}) 472 | on conflict (domain) do update set ( 473 | https, 474 | http_and_https, 475 | https_errs, 476 | http, 477 | unknown, 478 | autoupgrade, 479 | mixed_requests, 480 | max_screenshot_diff, 481 | redirects, 482 | max_https_crawl_id, 483 | requests, 484 | is_redirect, 485 | redirect_hosts, 486 | session_request_limit 487 | ) = ( 488 | EXCLUDED.https, 489 | EXCLUDED.http_and_https, 490 | EXCLUDED.https_errs, 491 | EXCLUDED.http, 492 | EXCLUDED.unknown, 493 | EXCLUDED.autoupgrade, 494 | EXCLUDED.mixed_requests, 495 | EXCLUDED.max_screenshot_diff, 496 | EXCLUDED.redirects, 497 | EXCLUDED.max_https_crawl_id, 498 | EXCLUDED.requests, 499 | EXCLUDED.is_redirect, 500 | EXCLUDED.redirect_hosts, 501 | EXCLUDED.session_request_limit) 502 | where 503 | EXCLUDED.is_redirect = false or 504 | https_crawl_aggregate.is_redirect = true 505 | "); 506 | } 507 | 508 | $db{con} = $con; 509 | return \%db; 510 | } 511 | 512 | # Strategy behind url selection: 513 | # 1. Fill queue with homepage and click urls sort by top-level path 514 | # prevalence 515 | # 2. If necessary, get backfill_urls 516 | sub get_urls_for_domain { 517 | my ($domain, $db) = @_; 518 | 519 | state $rr = WWW::RobotRules->new($CC{UA}); 520 | state $mech = get_ua('mech'); 521 | state $VERBOSE = $CC{VERBOSE}; 522 | 523 | # Get latest robot rules for domain 524 | my $res = $mech->get("http://$domain/robots.txt"); 525 | if($res->is_success){ 526 | # the uri may be different than what we requested 527 | my @doms = ($domain); 528 | my $uri = $res->request->uri; 529 | if(my $host = eval { URI->new($uri)->host }){ 530 | push @doms, $host if $host ne $domain; 531 | } 532 | my $robots_txt = $res->decoded_content; 533 | 534 | # Add the rules for the: 535 | # 1. The domain and redirect host if different 536 | # 2. HTTP/HTTPS for each 537 | # yes, http and https could be different 538 | for my $d (@doms){ 539 | for my $p (qw(http https)){ 540 | $rr->parse("$p://$d/", $robots_txt); 541 | } 542 | } 543 | } 544 | 545 | my @urls; 546 | my $homepage = 'http://' . $domain . 
'/'; 547 | 548 | $res = $mech->get($homepage); 549 | 550 | if($res->is_success){ 551 | # the uri may be different than what we requested 552 | my $uri = $res->request->uri; 553 | if(my $host = eval { URI->new($uri)->host }){ 554 | # all links with the same host 555 | my @homepage_links; 556 | if(my $l = $mech->find_all_links(url_abs_regex => qr{//\Q$host\E/})){ 557 | @homepage_links = @$l; 558 | } 559 | 560 | for my $l (@homepage_links){ 561 | my $abs_url = $l->url_abs; 562 | $abs_url = "$abs_url"; 563 | next if dupe_link($abs_url, \@urls); 564 | push @urls, $abs_url; 565 | } 566 | } 567 | } 568 | else { 569 | $VERBOSE && warn "Failed to get homepage links for $domain: " . $res->status_line; 570 | } 571 | 572 | eval { 573 | my $select_urls = $db->{select_urls}; 574 | $select_urls->execute($domain); 575 | while(my $r = $select_urls->fetchrow_arrayref){ 576 | my $url = $r->[0]; 577 | next if dupe_link($url, \@urls); 578 | push @urls, $url; 579 | } 580 | 1; 581 | } 582 | or do { 583 | $VERBOSE && warn "Failed to get click urls for $domain: $@"; 584 | }; 585 | 586 | state $URLS_PER_SITE = $CC{URLS_PER_SITE}; 587 | 588 | urls_by_path(\@urls, $rr, $URLS_PER_SITE); 589 | 590 | if($DDG_INTERNAL && (@urls < $URLS_PER_SITE)){ 591 | backfill_urls($domain, \@urls, $rr, $db, $mech, $URLS_PER_SITE, $VERBOSE); 592 | } 593 | 594 | # Add home by default since it often behaves differently 595 | unless(dupe_link($homepage, \@urls)){ 596 | if(@urls < $URLS_PER_SITE){ 597 | push @urls, $homepage; 598 | } 599 | else{ 600 | splice(@urls, -1, 1, $homepage); 601 | } 602 | } 603 | 604 | return \@urls; 605 | } 606 | 607 | sub prune_tmp_dirs { 608 | my $h = $_[HEAP]; 609 | 610 | return unless exists $h->{crawler_tmp_dirs}; 611 | 612 | my ($TMP_DIR, $CRAWLER_TMP_PREFIX) = @CC{qw'TMP_DIR CRAWLER_TMP_PREFIX'}; 613 | for my $pid (keys %{$h->{crawler_tmp_dirs}}){ 614 | my $crawler_tmp_dir = "$TMP_DIR/$CRAWLER_TMP_PREFIX$pid"; 615 | if(-d $crawler_tmp_dir){ 616 | next unless pathrmdir($crawler_tmp_dir); 617 | } 618 | delete $h->{crawler_tmp_dirs}{$pid}; 619 | } 620 | } 621 | 622 | sub check_site { 623 | my ($stats, $site, $screenshot, $delay, $crawler_tmp_dir) = @_; 624 | 625 | if(my ($request_scheme) = $site =~ /^(https?):/i){ 626 | $request_scheme = lc $request_scheme; 627 | 628 | eval{ 629 | @ENV{qw(PHANTOM_RENDER_DELAY PHANTOM_UA PHANTOM_TIMEOUT)} = 630 | ($delay, "'$CC{UA}'", $PHANTOM_TIMEOUT); 631 | 632 | my $out; 633 | my @cmd = ( 634 | $CC{PHANTOMJS}, 635 | "--local-storage-path=$crawler_tmp_dir", "--offline-storage-path=$crawler_tmp_dir", 636 | $CC{NETSNIFF_SS}, $site); 637 | push @cmd, $screenshot if $screenshot; 638 | 639 | IPC::Run::run \@cmd, \undef, \$out, 640 | IPC::Run::timeout($CC{HEADLESS_ALARM}, exception => "$site timed out after $CC{HEADLESS_ALARM} seconds"); 641 | die "PHANTOMJS $out" if $out =~ /^FAIL/; 642 | 643 | # Can have error messages at the end so have to extract the json 644 | my ($j) = $out =~ /^(\{\s+"log".+\})/ms; 645 | my $m = decode_json($j)->{log}; 646 | 647 | my ($main_request_scheme, $check_mixed); 648 | for my $e (@{$m->{entries}}){ 649 | my $response_status = $e->{response}{status}; 650 | # netsniff records the redirects to https for some sites 651 | next if $response_status =~ /^3/; 652 | my $url = $e->{request}{url}; 653 | next unless my ($scheme) = $url =~ /^(https?):/i; 654 | $scheme = lc $scheme; 655 | 656 | if($check_mixed && ($scheme eq 'http')){ 657 | # Absolute links. 
Even if the same host as parent, browsers will mark 658 | # this as mixed and the extension can't upgrade them 659 | $stats->{mixed_children}{$url} = 1; 660 | } 661 | 662 | unless($main_request_scheme){ 663 | $stats->{"${request_scheme}_request_uri"} = $url; 664 | $stats->{"${request_scheme}_response"} = $response_status; 665 | if($request_scheme eq 'http'){ 666 | $stats->{autoupgrade} = $scheme eq 'https' ? 1 : 0; 667 | } 668 | elsif($scheme eq 'https'){ 669 | $check_mixed = lc URI->new($url)->host; 670 | my $hdrs = delete $e->{response}{headers}; 671 | my %response_headers; 672 | # We don't want to store an array of one-key hashes. 673 | for my $h (@$hdrs){ 674 | my ($name, $value) = @$h{qw(name value)}; 675 | if(exists $response_headers{$name}){ 676 | # https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2 677 | $response_headers{$name} .= ",$value"; 678 | } 679 | else{ 680 | $response_headers{$name} = $value; 681 | } 682 | } 683 | $stats->{https_response_headers} = encode_json(\%response_headers); 684 | } 685 | $main_request_scheme = $scheme; 686 | } 687 | 688 | $stats->{"${request_scheme}_size"} += $e->{response}{bodySize}; 689 | ++$stats->{"${request_scheme}_requests"}; 690 | 691 | } 692 | 693 | if($check_mixed){ 694 | $stats->{mixed} = exists $stats->{mixed_children} ? 1 : 0; 695 | } 696 | 1; 697 | } 698 | or do { 699 | warn "check_site error: $@ ($site)"; 700 | system "$CC{PKILL} -9 -f '$crawler_tmp_dir '" if $crawler_tmp_dir =~ /\S/; 701 | }; 702 | } 703 | } 704 | 705 | sub crawler_done{ 706 | my ($k, $h, $id) = @_[KERNEL, HEAP, ARG0]; 707 | 708 | state $VERBOSE = $CC{VERBOSE}; 709 | $VERBOSE && warn "deleting crawler $id\n"; 710 | my $c = delete $h->{crawlers}{$id}; 711 | 712 | # see if any of its domains were left unfinished 713 | my $pid = $c->PID; 714 | eval { 715 | my $db = prep_db('queue'); 716 | my $unfinished = $db->{complete_unfinished_worker_tasks}->execute($pid); 717 | if($unfinished > 0){ 718 | $VERBOSE && warn "Marked $unfinished tasks incomplete for crawler with pid $pid\n"; 719 | } 720 | 1; 721 | } 722 | or do { 723 | warn "Failed to verify worker tasks: $@"; 724 | }; 725 | 726 | # Check and clean up tmp dirs for hung crawlers 727 | $h->{crawler_tmp_dirs}{$pid} = 1; 728 | $k->yield('prune_tmp_dirs'); 729 | 730 | $k->yield('crawl'); 731 | } 732 | 733 | sub crawler_debug{ 734 | my $msg = $_[ARG0]; 735 | 736 | $CC{VERBOSE} && warn 'crawler debug: ' . $msg. "\n"; 737 | } 738 | 739 | sub sig_child { 740 | warn 'Got signal from pid ' . $_[ARG1] . ', exit status: ' . $_[ARG2] if $_[ARG2]; 741 | $_[KERNEL]->sig_handled; 742 | } 743 | 744 | sub get_ua { 745 | my $type = shift; 746 | 747 | my $ua = $type eq 'mech' ? 748 | WWW::Mechanize->new( 749 | onerror => undef, # We'll check these ourselves so we don't have to catch die in eval 750 | quiet => 1 751 | ) 752 | : 753 | LWP::UserAgent->new(); 754 | 755 | $ua->agent($CC{UA}); 756 | $ua->timeout(10); 757 | return $ua; 758 | } 759 | 760 | sub get_con { 761 | 762 | $ENV{PGDATABASE} = $CC{DB} if exists $CC{DB}; 763 | $ENV{PGHOST} = $CC{HOST} if exists $CC{HOST}; 764 | $ENV{PGPORT} = $CC{PORT} if exists $CC{PORT}; 765 | $ENV{PGUSER} = $CC{USER} if exists $CC{USER}; 766 | $ENV{PGPASSWORD} = $CC{PASS} if exists $CC{PASS}; 767 | 768 | return DBI->connect('dbi:Pg:', '', '', { 769 | RaiseError => 1, 770 | PrintError => 0, 771 | AutoCommit => 1, 772 | }); 773 | } 774 | 775 | sub parse_argv { 776 | my $usage = <