├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SmarterEncryption └── Crawl.pm ├── config.yml.example ├── cpanfile ├── https_crawl.pl ├── netsniff_screenshot.js ├── sql ├── domain_exceptions.sql ├── full_urls.sql ├── https_crawl.sql ├── https_crawl_aggregate.sql ├── https_queue.sql ├── https_response_headers.sql ├── mixed_assets.sql ├── ssl_cert_info.sql └── upgradeable_domains_func.sql └── third-party.txt /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing guidelines 2 | 3 | * [Reporting bugs](#reporting-bugs) 4 | * [Development](#development) 5 | * [New features](#new-features) 6 | * [Bug fixes](#bug-fixes) 7 | * [Getting Started](#getting-started) 8 | * [Pre-Requisites](#pre-requisites) 9 | * [Setup](#setup) 10 | * [Running the crawler](#running-the-crawler) 11 | * [Checking the results](#checking-the-results) 12 | * [Data Model](#data-model) 13 | * [full_urls](#full_urls) 14 | * [https_queue](#https_queue) 15 | * [https_crawl](#https_crawl) 16 | * [mixed_assets](#mixed_assets) 17 | * [https_response_headers](#https_response_headers) 18 | * [ssl_cert_info](#ssl_cert_info) 19 | * [https_crawl_aggregate](#https_crawl_aggregate) 20 | * [https_upgrade_metrics](#https_upgrade_metrics) 21 | * [domain_exceptions](#domain_exceptions) 22 | * [upgradeable_domains](#upgradeable_domains) 23 | 24 | # Reporting bugs 25 | 26 | 1. First check whether the bug has already been [reported](https://github.com/duckduckgo/smarter-encryption/issues). 27 | 2. If it hasn't, create a bug report [issue](https://github.com/duckduckgo/smarter-encryption/issues/new?template=bug_report.md). 28 | 29 | # Development 30 | 31 | ## New features 32 | 33 | Right now all new feature development is handled internally. 34 | 35 | ## Bug fixes 36 | 37 | Most bug fixes are handled internally, but we will accept pull requests for bug fixes if you first: 38 | 1. Create an issue describing the bug. See [Reporting bugs](CONTRIBUTING.md#reporting-bugs). 39 | 2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects. 40 | 41 | ## Getting Started 42 | 43 | ### Pre-Requisites 44 | - [PostgreSQL](https://www.postgresql.org/) database 45 | - [PhantomJS 2.1.1](https://phantomjs.org/download.html) 46 | - [Perl](https://www.perl.org/get.html) 47 | - [compare](https://imagemagick.org/script/compare.php) 48 | - [pkill](https://en.wikipedia.org/wiki/Pkill) 49 | - Should run on many varieties of Linux/*BSD 50 | 51 | ### Setup 52 | 53 | 1. Install required Perl modules via cpanfile: 54 | ```sh 55 | cpanm --installdeps . 56 | ``` 57 | 2. Connect to PostgreSQL with psql and create the tables needed by the crawler: 58 | ``` 59 | \i sql/full_urls.sql 60 | \i sql/https_crawl.sql 61 | \i sql/mixed_assets.sql 62 | etc. 63 | ``` 64 | 3. Create a copy of the crawler configuration file: 65 | ```sh 66 | cp config.yml.example config.yml 67 | ``` 68 | Edit the settings as necessary for your system. 69 | 70 | 4. If you have a source of URLs for a host that you would like crawled, they can be added to the [full_urls](#full_urls) table: 71 | ```sql 72 | insert into full_urls (host, url) values ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'), ... 73 | ``` 74 | The crawler will attempt to get URLs from the home page even if none are available in this table. 75 | 76 | ### Running the crawler 77 | 78 | 1.
Add hosts to be crawled to the [https_queue](#https_queue) table: 79 | ```sql 80 | insert into https_queue (domain) values ('duckduckgo.com'); 81 | ``` 82 | 83 | 2. The crawler can be run as follows: 84 | ```sh 85 | perl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml 86 | ``` 87 | 88 | ### Checking the results 89 | 90 | 1. The individual HTTP and HTTPs comparisons for each URL crawled are stored in [https_crawl](#https_crawl): 91 | ```sql 92 | select * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10; 93 | ``` 94 | The maximum number of URLs per crawl session, i.e. the `limit` used above, is determined by [URLS_PER_SITE](config.yml.example#L49). 95 | 96 | 2. Aggregate session data for each host is stored in [https_crawl_aggregate](#https_crawl_aggregate): 97 | ```sql 98 | select * from https_crawl_aggregate where domain = 'duckduckgo.com'; 99 | ``` 100 | There is also an associated view - [https_upgrade_metrics](#https_upgrade_metrics) - that calculates some additional metrics: 101 | ```sql 102 | select * from https_upgrade_metrics where domain = 'duckduckgo.com'; 103 | ``` 104 | 105 | 3. Additional information from the crawl can be found in: 106 | 107 | * [ssl_cert_info](#ssl_cert_info) 108 | * [mixed_assets](#mixed_assets) 109 | * [https_response_headers](#https_response_headers) 110 | 111 | 4. Hosts can be selected based on various combinations of criteria directly from the above tables or by using the [upgradeable_domains](#upgradeable_domains) function. 112 | 113 | ### Data Model 114 | 115 | #### full_urls 116 | 117 | Complete URLs for hosts that will be used in addition to those the crawler extracts from the home page. 118 | 119 | | Column | Description | Type | Key | 120 | | --- | --- | --- | --- | 121 | | host | hostname | text |unique| 122 | | url | Complete URL with scheme | text |unique| 123 | | updated | When added to table | timestamp with time zone || 124 | 125 | #### https_queue 126 | 127 | Domains to be crawled in rank order. Multiple crawlers can access this concurrently. 128 | 129 | | Column | Description | Type | Key | 130 | | --- | --- | --- | --- | 131 | | rank | Processing order | integer | primary | 132 | |domain | Domain to be crawled | character varying(500) || 133 | |processing_host|Hostname of server processing domain|character varying(50)|| 134 | |worker_pid|Process ID of crawler handling domain|integer|| 135 | |reserved|When domain was selected for processing|timestamp with time zone|| 136 | |started|When processing of domain started|timestamp with time zone|| 137 | |finished|When processing of domain completed|timestamp with time zone|| 138 | 139 | #### https_crawl 140 | 141 | Log table of HTTP and HTTPs comparisons made by the crawler.
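For example, a sketch of a query that pulls the most recent comparisons for a domain that loaded mixed content, joined to the individual assets recorded in [mixed_assets](#mixed_assets) (column names are taken from the tables below; illustrative only):

```sql
-- Recent mixed-content comparisons for one domain (illustrative query).
select c.id, c.https_request_uri, c.https_response, m.asset
  from https_crawl c
  join mixed_assets m on m.https_crawl_id = c.id
 where c.domain = 'duckduckgo.com'
   and c.mixed
 order by c.id desc
 limit 20;
```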
142 | 143 | | Column | Description | Type | Key | 144 | | --- | --- | --- | --- | 145 | | id | Comparison ID | bigint | unique | 146 | |domain|Domain evaluated|text|| 147 | |http_request_uri|Resulting URI of HTTP request|text|| 148 | |http_response|HTTP status code for HTTP request|integer|| 149 | |http_requests|Total requests made, including child subrequests, for HTTP request|integer|| 150 | |http_size|Size of HTTP response (bytes)|integer|| 151 | |https_request_uri|Resulting URI of HTTPs request|text|| 152 | |https_response|HTTP status code for HTTPs request|integer|| 153 | |https_requests|Total requests made, including child subrequests, for HTTPs request|integer|| 154 | |https_size|Size of HTTPs response (bytes)|integer|| 155 | |timestamp|When inserted|timestamp with time zone|| 156 | |screenshot_diff|Percentage difference between HTTP and HTTPs screenshots after page load|real|| 157 | |autoupgrade|Whether HTTP request was redirected to HTTPs|boolean|| 158 | |mixed|Whether HTTPs request had HTTP child requests|boolean|| 159 | 160 | #### mixed_assets 161 | 162 | HTTP child requests made for HTTPs. 163 | 164 | | Column | Description | Type | Key | 165 | | --- | --- | --- | --- | 166 | | https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign | 167 | | asset | URI of HTTP subrequest made during HTTPs request | text | unique | 168 | 169 | 170 | #### https_response_headers 171 | 172 | The response headers for HTTPs requests. 173 | 174 | | Column | Description | Type | Key | 175 | | --- | --- | --- | --- | 176 | | https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign | 177 | |response_headers|key/value of all HTTPs response headers|jsonb|| 178 | 179 | 180 | #### ssl_cert_info 181 | 182 | SSL certificate information for domains crawled. 183 | 184 | | Column | Description | Type | Key | 185 | | --- | --- | --- | --- | 186 | | domain | Domain evaluated | text | primary | 187 | |issuer|Issuer of SSL certificate|text|| 188 | |notbefore|Valid from timestamp|timestamp with time zone|| 189 | |notafter|Valid to timestamp|timestamp with time zone|| 190 | |host_valid|Whether the domain is covered by the SSL certificate|boolean|| 191 | |err|Connection err|text|| 192 | |updated|When last updated|timestamp with time zone|| 193 | 194 | #### https_crawl_aggregate 195 | 196 | Aggregate of [https_crawl](#https_crawl) that creates latest crawl sessions based on domain. Can also include domains that were redirected to and not directly crawled. 
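As a sketch of how the redirect bookkeeping can be inspected, the query below lists crawled (non-redirect) domains that redirected some HTTPs requests to other hosts, along with the per-host counts stored in `redirect_hosts` (columns are described in the table below):

```sql
-- Crawled domains whose HTTPs requests were redirected to other hosts (illustrative query).
select domain, redirects, redirect_hosts
  from https_crawl_aggregate
 where is_redirect = false
   and redirects > 0
 order by redirects desc
 limit 10;
```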
197 | 198 | | Column | Description | Type | Key | 199 | | --- | --- | --- | --- | 200 | | domain | Domain evaluated | text | primary | 201 | |https|Comparisons where only HTTPs was supported|integer|| 202 | |http_and_https|Comparisons where HTTP and HTTPs were supported|integer|| 203 | |http|Comparisons where only HTTP was supported|integer|| 204 | |https_errs|Number of non-2xx HTTPs responses|integer|| 205 | |unknown|Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed|integer|| 206 | |autoupgrade|Comparisons where HTTP was redirected to HTTPs|integer|| 207 | |mixed_requests|HTTPs requests that made HTTP calls|integer|| 208 | |max_screenshot_diff|Maximum percentage difference between HTTP and HTTPs screenshots|real|| 209 | |redirects|Number of HTTPs requests redirected to a different host|integer|| 210 | |requests|Number of comparison requests actually made during the crawl session|integer|| 211 | |session_request_limit|The number of comparisons wanted for the session|integer|| 212 | |is_redirect|Whether the domain was actually crawled or is a redirect from another host in the table that was crawled|boolean|| 213 | |max_https_crawl_id|https_crawl.id of last comparison made during crawl session|bigint|| 214 | |redirect_hosts|key/value pairs of hosts and the number of redirects to each|jsonb|| 215 | 216 | #### https_upgrade_metrics 217 | 218 | View of [https_crawl_aggregate](#https_crawl_aggregate) that calculates crawl session percentages for easier selection based on cutoffs. 219 | 220 | | Column | Description | Type | Key | 221 | | --- | --- | --- | --- | 222 | | domain | Domain evaluated | text | | 223 | | unknown_pct | Percentage of unknown comparisons|real|| 224 | | combined_pct | Percentage that supported HTTPs|real|| 225 | | https_err_rate | Rate of non-2xx HTTPs responses|real|| 226 | | max_screenshot_diff | https_crawl_aggregate.max_screenshot_diff|real|| 227 | | mixed_ok | Whether HTTPs requests contained mixed content requests|boolean|| 228 | | autoupgrade_pct|Percentage of comparisons where HTTP was redirected to HTTPs|real|| 229 | 230 | #### domain_exceptions 231 | 232 | For manually excluding domains that may otherwise pass specific upgrade criteria given to [upgradeable_domains](#upgradeable_domains). 233 | 234 | | Column | Description | Type | Key | 235 | | --- | --- | --- | --- | 236 | | domain | Domain to exclude | text | primary | 237 | | comment | Reason for exclusion | text || 238 | |updated|When added|timestamp with time zone|| 239 | 240 | #### upgradeable_domains 241 | 242 | Function to select domains based on a variety of criteria.
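A hypothetical invocation is sketched below. The cutoff values are illustrative only, and the exact signature (parameter names, order, defaults, and return columns) is defined in [upgradeable_domains_func.sql](sql/upgradeable_domains_func.sql); the sketch assumes the function accepts the parameters listed in the table that follows:

```sql
-- Illustrative cutoffs only; tune them to your own upgrade criteria.
select *
  from upgradeable_domains(
    autoupgrade_min     => 0.0,
    combined_min        => 0.9,
    screenshot_diff_max => 0.05,
    mixed_ok            => false,
    max_err_rate        => 0.1,
    unknown_max         => 0.1,
    ssl_cert_buffer     => now() + interval '30 days',
    exclude_issuers     => array['Example Untrusted CA']  -- hypothetical issuer name
  );
```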
243 | 244 | | Parameter | Description | Type | Source | 245 | | --- | --- | --- | --- | 246 | |autoupgrade_min|Minimum autoupgrade percentage|real|[https_upgrade_metrics](#https_upgrade_metrics)| 247 | |combined_min|Minimum percentage of HTTPs responses|real|[https_upgrade_metrics](#https_upgrade_metrics)| 248 | |screenshot_diff_max|Maximum observed screenshot diff allowed|real|[https_upgrade_metrics](#https_upgrade_metrics)| 249 | |mixed_ok|Whether to allow domains that had mixed content|boolean|[https_upgrade_metrics](#https_upgrade_metrics)| 250 | |max_err_rate|Maximum https_err_rate|real|[https_upgrade_metrics](#https_upgrade_metrics)| 251 | |unknown_max|Maximum unknown comparisons|real|[https_upgrade_metrics](#https_upgrade_metrics)| 252 | |ssl_cert_buffer|SSL certificate must be valid until this timestamp|timestamp with time zone|[ssl_cert_info](#ssl_cert_info)| 253 | |exclude_issuers|Array of SSL cert issuers to exclude|text array|[ssl_cert_info](#ssl_cert_info)| 254 | 255 | In addition to the above parameters, the function enforces several other conditions: 256 | 257 | 1. Domain must not be in [domain_exceptions](#domain_exceptions) 258 | 2. From values in [ssl_cert_info](#ssl_cert_info): 259 | 1. No err 260 | 2. The domain, or host, must be valid for the certificate. 261 | 3. Valid from/to and the issuer must not be null 262 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This license does not apply to any DuckDuckGo logos or marks that may be contained 2 | in this repo. DuckDuckGo logos and marks are licensed separately under the CCBY-NC-ND 4.0 3 | license (https://creativecommons.org/licenses/by-nc-nd/4.0/), and official up-to-date 4 | versions can be downloaded from https://duckduckgo.com/press. 5 | 6 | Copyright 2010 Duck Duck Go, Inc. 7 | 8 | Licensed under the Apache License, Version 2.0 (the "License"); 9 | you may not use this file except in compliance with the License. 10 | You may obtain a copy of the License at 11 | 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | 14 | Unless required by applicable law or agreed to in writing, software 15 | distributed under the License is distributed on an "AS IS" BASIS, 16 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DuckDuckGo Smarter Encryption 2 | 3 | DuckDuckGo Smarter Encryption is a large list of web sites that we know support HTTPS. The list is automatically generated and updated by using the crawler in this repository. 4 | 5 | For more information about where the list is being used and how it compares to other solutions, see our blog post [Your Connection is Secure with DuckDuckGo Smarter Encryption](https://spreadprivacy.com/duckduckgo-smarter-encryption). 6 | 7 | This software is licensed under the terms of the Apache License, Version 2.0 (see [LICENSE](LICENSE)). Copyright (c) 2019 [Duck Duck Go, Inc.](https://duckduckgo.com) 8 | 9 | ## Contributing 10 | 11 | See [Contributing](CONTRIBUTING.md) for more information about [Reporting bugs](CONTRIBUTING.md#reporting-bugs) and [Getting Started](CONTRIBUTING.md#getting-started) with the crawler. 12 | 13 | ## Just want the list?
14 | 15 | The list we use (as a result of running this code) is [publicly available](https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz) under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/). 16 | 17 | If you'd like to license the list for commercial use, [please reach out](https://help.duckduckgo.com/duckduckgo-help-pages/company/contact-us/). 18 | 19 | ## Questions or help with other DuckDuckGo things? 20 | See [DuckDuckGo Help Pages](https://duck.co/help). 21 | -------------------------------------------------------------------------------- /SmarterEncryption/Crawl.pm: -------------------------------------------------------------------------------- 1 | package SmarterEncryption::Crawl; 2 | 3 | use Exporter::Shiny qw' 4 | aggregate_crawl_session 5 | check_ssl_cert 6 | dupe_link 7 | urls_by_path 8 | '; 9 | 10 | use IO::Socket::SSL; 11 | use IO::Socket::SSL::Utils 'CERT_asHash'; 12 | use Cpanel::JSON::XS 'encode_json'; 13 | use List::Util 'sum'; 14 | use URI; 15 | use List::AllUtils qw'each_arrayref'; 16 | use Domain::PublicSuffix; 17 | 18 | use strict; 19 | use warnings; 20 | no warnings 'uninitialized'; 21 | use feature 'state'; 22 | 23 | my $SSL_TIMEOUT = 5; 24 | my $DEBUG = 0; 25 | 26 | # Fields we want to convert to int if null 27 | my @CONVERT_TO_INT = qw' 28 | https 29 | http_s 30 | https_errs 31 | http 32 | unknown 33 | autoupgrade 34 | mixed_requests 35 | max_ss_diff 36 | redirects 37 | '; 38 | 39 | sub screenshot_threshold { 0.05 } 40 | # Number of URLs checked for each domain per run. 41 | sub urls_per_domain { 10 } 42 | 43 | sub check_ssl_cert { 44 | my $host = shift; 45 | 46 | my ($issuer, $not_before, $not_after, $host_valid, $err); 47 | 48 | if(my $iossl = IO::Socket::SSL->new( 49 | PeerHost => $host, 50 | PeerPort => 'https', 51 | SSL_hostname => $host, 52 | Timeout => $SSL_TIMEOUT, 53 | )){ 54 | $host_valid = $iossl->verify_hostname($host, 'http') || 0; 55 | my $c = $iossl->peer_certificate; 56 | my $cert = CERT_asHash($c); 57 | $issuer = $cert->{issuer}{organizationName}; 58 | $not_before = gmtime($cert->{not_before}) . ' UTC'; 59 | $not_after = gmtime($cert->{not_after}) . 
' UTC'; 60 | } 61 | else{ 62 | my $sys_err = $!; 63 | $err = $SSL_ERROR; 64 | if($sys_err){ $err .= ": $sys_err"; } 65 | } 66 | 67 | return [$issuer, $not_before, $not_after, $host_valid, $err]; 68 | } 69 | 70 | sub aggregate_crawl_session { 71 | my ($domain, $session) = @_; 72 | 73 | state $dps = Domain::PublicSuffix->new; 74 | my $root_domain = $dps->get_root_domain($domain); 75 | 76 | my %domain_stats = (is_redirect => 0); 77 | my %redirects; 78 | for my $comparison (@$session){ 79 | my ($http_request_uri, 80 | $http_response, 81 | $https_request_uri, 82 | $https_response, 83 | $autoupgrade, 84 | $mixed, 85 | $screenshot_diff, 86 | $id 87 | ) = @$comparison{qw' 88 | http_request_uri 89 | http_response 90 | https_request_uri 91 | https_response 92 | autoupgrade 93 | mixed 94 | ss_diff 95 | id 96 | '}; 97 | 98 | 99 | my $http_valid = $http_request_uri =~ /^http:/i; 100 | my $https_valid = $https_request_uri =~ /^https:/i; 101 | 102 | my $redirect; 103 | if($https_valid){ 104 | if(my $host = eval { URI->new($https_request_uri)->host }){ 105 | if($host ne $domain){ 106 | my $host_root_domain = $dps->get_root_domain($host); 107 | if($root_domain eq $host_root_domain){ 108 | ++$domain_stats{redirects}{$host}; 109 | unless(exists $redirects{$host}){ 110 | $redirects{$host} = {is_redirect => 1}; 111 | } 112 | $redirect = $redirects{$host}; 113 | } 114 | } 115 | } 116 | } 117 | 118 | ++$domain_stats{requests}; 119 | $redirect && ++$redirect->{requests}; 120 | 121 | $domain_stats{max_id} = $id if $domain_stats{max_id} < $id; 122 | $redirect->{max_id} = $id if $redirect && ($redirect->{max_id} < $id); 123 | 124 | if($autoupgrade){ 125 | ++$domain_stats{autoupgrade}; 126 | $redirect && ++$redirect->{autoupgrade}; 127 | } 128 | 129 | if($mixed){ 130 | ++$domain_stats{mixed_requests}; 131 | $redirect && ++$redirect->{mixed_requests}; 132 | } 133 | 134 | if(defined($screenshot_diff)){ 135 | $domain_stats{max_ss_diff} = $screenshot_diff if $domain_stats{max_ss_diff} < $screenshot_diff; 136 | $redirect->{max_ss_diff} = $screenshot_diff if $redirect && ($redirect->{max_ss_diff} < $screenshot_diff) 137 | } 138 | 139 | my $http_s_same_response = $http_response == $https_response; 140 | my $http_response_good = $http_valid && ( ($http_response == 200) || $http_s_same_response ); 141 | my $https_response_good = $https_valid && ( ($https_response == 200) || $http_s_same_response); 142 | 143 | if($https_response_good){ 144 | if($http_response_good){ 145 | ++$domain_stats{http_s}; 146 | $redirect && ++$redirect->{http_s}; 147 | } 148 | else{ 149 | ++$domain_stats{https}; 150 | $redirect && ++$redirect->{https}; 151 | } 152 | 153 | if($https_response =~ /^[45]/){ 154 | ++$domain_stats{https_errs}; 155 | $redirect && ++$redirect->{https_errs}; 156 | } 157 | } 158 | elsif($http_response_good){ 159 | ++$domain_stats{http}; 160 | $redirect && ++$redirect->{http}; 161 | } 162 | else{ 163 | ++$domain_stats{unknown}; 164 | $redirect && ++$redirect->{unknown}; 165 | } 166 | } 167 | 168 | my %aggs; 169 | if(my $hosts = delete $domain_stats{redirects}){ 170 | $domain_stats{redirects} = sum values(%$hosts); 171 | $domain_stats{redirect_hosts} = encode_json($hosts); 172 | 173 | while(my ($host, $agg) = each %redirects){ 174 | null_to_int($agg); 175 | $aggs{$host} = $agg; 176 | } 177 | } 178 | 179 | null_to_int(\%domain_stats); 180 | $aggs{$domain} = \%domain_stats; 181 | 182 | return \%aggs; 183 | } 184 | 185 | sub null_to_int { 186 | my $h = shift; 187 | $h->{$_} += 0 for @CONVERT_TO_INT; 188 | } 189 | 190 | sub 
urls_by_path { 191 | my ($urls, $rr, $url_limit) = @_; 192 | 193 | my %links; 194 | for my $url (@$urls){ 195 | eval { 196 | my @segs = URI->new($url)->path_segments; 197 | push @{$links{$segs[1]}}, $url; 198 | }; 199 | } 200 | 201 | my @sorted_paths = sort {@{$links{$b}} <=> @{$links{$a}}} keys %links; 202 | 203 | my @urls_by_path; 204 | 205 | my $paths = each_arrayref @links{@sorted_paths}; 206 | CLICK_GROUP: while(my @urls = $paths->()){ 207 | for my $url (@urls){ 208 | next unless $url; 209 | last CLICK_GROUP unless @urls_by_path < $url_limit; 210 | next unless $rr->allowed($url); 211 | push @urls_by_path, $url; 212 | } 213 | } 214 | 215 | @$urls = @urls_by_path; 216 | } 217 | 218 | 219 | sub dupe_link { 220 | my ($url, $urls) = @_; 221 | 222 | $url =~ s{^https:}{http:}i; 223 | 224 | for (@$urls){ 225 | my $u = $_ =~ s{^https:}{http:}ir; 226 | return 1 if URI::eq($u, $url); 227 | } 228 | 229 | 0; 230 | } 231 | 232 | 1; 233 | -------------------------------------------------------------------------------- /config.yml.example: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | # Top-level temp directory will be created on start and removed 4 | # on exit. Each crawler will have its own subdirectory with 5 | # PID appended 6 | TMP_DIR: /tmp/smarter_encryption 7 | CRAWLER_TMP_PREFIX: crawler_ 8 | 9 | # User agent. Will use defaults if not specified 10 | #UA: 11 | VERBOSE: 1 12 | 13 | # Paths to system binaries. If in path already, just the program 14 | # name should suffice. 15 | COMPARE: /usr/local/bin/compare 16 | PKILL: /usr/bin/pkill 17 | 18 | # Database connection options. If not specified will connect as 19 | # the current user. 20 | #DB: 21 | #HOST: 22 | #PORT: 23 | #USER: 24 | #PASS: 25 | 26 | # Number of concurrent crawlers per cpu. 27 | CRAWLERS_PER_CPU: 3 28 | # or exact number 29 | # MAX_CONCURRENT_CRAWLERS: 10 30 | 31 | # Path to phantomjs. Should be v2.1.1 32 | PHANTOMJS: phantomjs 33 | 34 | # Path to modified netsniff.js 35 | NETSNIFF_SS: netsniff_screenshot.js 36 | 37 | # Timeout before killing phantomjs in seconds 38 | HEADLESS_ALARM: 30 39 | 40 | # Whether to continue running and polling the queue or exit when finished. 41 | # If specified and non-zero, it is the number of seconds to wait in 42 | # between polls. 43 | POLL: 60 44 | 45 | # Number of sites a crawler should process before exiting 46 | SITES_PER_CRAWLER: 10 47 | 48 | # Desired number of URLs to check for each site 49 | URLS_PER_SITE: 10 50 | 51 | # Max percentage of URLS_PER_SITE included from the current home page 52 | HOMEPAGE_LINK_PCT: 0.5 53 | 54 | # Number of times to re-request HTTPs URL on failure 55 | HTTPS_RETRIES: 1 56 | 57 | # If SCREENSHOT_RETRIES is not 0, the comparison between HTTP and HTTPs 58 | # pages will be re-run if the diff is above SCREENSHOT_THRESHOLD. It 59 | # will also introduce a delay before taking the screenshot to potentially 60 | # overcome slight network differences between the two. The delay will 61 | # remain in effect for links still to be processed for the site. 
62 | SCREENSHOT_RETRIES: 1 63 | SCREENSHOT_THRESHOLD: 0.05 64 | PHANTOM_RENDER_DELAY: 1000 65 | -------------------------------------------------------------------------------- /cpanfile: -------------------------------------------------------------------------------- 1 | requires 'Cpanel::JSON::XS', 2.3310; 2 | requires 'DBI', '1.631'; 3 | requires 'Domain::PublicSuffix', '0.10'; 4 | requires 'Exporter::Shiny', '0.038'; 5 | requires 'Exporter::Tiny', 0.038; 6 | requires 'File::Copy::Recursive', 0.38; 7 | requires 'IO::Socket::SSL', 2.060; 8 | requires 'IO::Socket::SSL::Utils', 2.014; 9 | requires 'IPC::Run', 0.92; 10 | requires 'IPC::Run::Timer', 0.90; 11 | requires 'LWP', 6.05; 12 | requires 'List::AllUtils', 0.07; 13 | requires 'List::Util', 1.52; 14 | requires 'POE', 1.358; 15 | requires 'POE::XS::Loop::Poll', 1.000; 16 | requires 'URI', 1.71; 17 | requires 'URI::Escape', 3.31; 18 | requires 'WWW::Mechanize', 1.73; 19 | requires 'WWW::RobotRules', 6.02; 20 | requires 'YAML::XS', 0.41; 21 | -------------------------------------------------------------------------------- /https_crawl.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use LWP::UserAgent; 4 | use WWW::Mechanize; 5 | use POE::Kernel { loop => 'POE::XS::Loop::Poll' }; 6 | use POE qw(Wheel::Run Filter::Reference); 7 | use DBI; 8 | use Sys::Hostname 'hostname'; 9 | use Cpanel::JSON::XS qw'decode_json encode_json'; 10 | use URI; 11 | use File::Copy::Recursive qw'pathmk pathrmdir'; 12 | use WWW::RobotRules; 13 | use IPC::Run; 14 | use YAML::XS 'LoadFile'; 15 | use List::AllUtils 'each_arrayref'; 16 | use SmarterEncryption::Crawl qw' 17 | aggregate_crawl_session 18 | check_ssl_cert 19 | dupe_link 20 | urls_by_path 21 | '; 22 | use Module::Load::Conditional 'can_load'; 23 | 24 | use feature 'state'; 25 | use strict; 26 | use warnings; 27 | no warnings 'uninitialized'; 28 | 29 | my $DDG_INTERNAL; 30 | if(can_load(modules => {'DDG::Util::HTTPS2' => undef})){ 31 | DDG::Util::HTTPS2->import(qw'add_stat backfill_urls'); 32 | $DDG_INTERNAL = 1; 33 | } 34 | 35 | my $HOST = hostname(); 36 | 37 | # Crawler Config 38 | my %CC; 39 | 40 | # Derived config values 41 | my ($MAX_CONCURRENT_CRAWLERS, $PHANTOM_TIMEOUT, $HOMEPAGE_LINKS_MAX); 42 | 43 | POE::Session->create( 44 | inline_states => { 45 | _start => \&_start, 46 | _stop => \&normal_cleanup, 47 | crawl => \&start_crawlers, 48 | crawler_done => \&crawler_done, 49 | crawler_debug => \&crawler_debug, 50 | sig_child => \&sig_child, 51 | shutdown => \&shutdown_now, 52 | prune_tmp_dirs => \&prune_tmp_dirs 53 | } 54 | ); 55 | 56 | POE::Kernel->run; 57 | exit; 58 | 59 | sub _start { 60 | my ($k, $h) = @_[KERNEL, HEAP]; 61 | 62 | parse_argv(); 63 | 64 | unless($MAX_CONCURRENT_CRAWLERS){ 65 | $MAX_CONCURRENT_CRAWLERS = `nproc` * $CC{CRAWLERS_PER_CPU}; 66 | } 67 | 68 | $PHANTOM_TIMEOUT = $CC{HEADLESS_ALARM} * 1000; # in ms 69 | $HOMEPAGE_LINKS_MAX = sprintf '%d', $CC{HOMEPAGE_LINK_PCT} * $CC{URLS_PER_SITE}; 70 | 71 | my $TMP_DIR = $CC{TMP_DIR}; 72 | unless(-d $TMP_DIR){ 73 | $CC{VERBOSE} && warn "Creating temp dir $TMP_DIR\n"; 74 | pathmk($TMP_DIR) or die "Failed to create tmp dir $TMP_DIR: $!"; 75 | } 76 | 77 | # clean up leftover junk for forced shutdown 78 | while(<$TMP_DIR/$CC{CRAWLER_TMP_PREFIX}*>){ 79 | chomp; 80 | pathrmdir($_) or warn "Failed to remove old crawler tmp dir $_: $!"; 81 | } 82 | 83 | $k->sig($_ => 'shutdown') for qw{TERM INT}; 84 | 85 | $k->yield('crawl'); 86 | } 87 | 88 | sub shutdown_now { 89 | 
$_[KERNEL]->sig_handled; 90 | 91 | # Kill crawlers 92 | $_->kill() for values %{$_[HEAP]->{crawlers}}; 93 | 94 | # Make unfinished tasks available in the queue 95 | my $db = prep_db('queue'); 96 | $db->{reset_unfinished_tasks}->execute; 97 | 98 | normal_cleanup(); 99 | 100 | exit 1; 101 | } 102 | 103 | sub normal_cleanup { 104 | # remove tmp dir 105 | pathrmdir($CC{TMP_DIR}) if -d $CC{TMP_DIR}; 106 | } 107 | 108 | sub start_crawlers{ 109 | my ($k, $h) = @_[KERNEL, HEAP]; 110 | 111 | my $db = prep_db('queue'); 112 | 113 | my $reserve_tasks = $db->{reserve_tasks}; 114 | while(keys %{$h->{crawlers}} < $MAX_CONCURRENT_CRAWLERS){ 115 | 116 | $reserve_tasks->execute(); 117 | if(my @ranks = sort map { $_->[0] } @{$reserve_tasks->fetchall_arrayref}){ 118 | 119 | my $c = POE::Wheel::Run->new( 120 | Program => \&crawl_sites, 121 | ProgramArgs => [\@ranks], 122 | CloseOnCall => 1, 123 | NoSetSid => 1, 124 | StderrEvent => 'crawler_debug', 125 | CloseEvent => 'crawler_done', 126 | StdinFilter => POE::Filter::Reference->new, 127 | StderrFilter => POE::Filter::Line->new 128 | ); 129 | $h->{crawlers}{$c->ID} = $c; 130 | $k->sig_child($c->PID, 'sig_child'); 131 | } 132 | else{ 133 | $CC{POLL} && $k->delay(crawl => $CC{POLL}); 134 | last; 135 | } 136 | } 137 | } 138 | 139 | sub crawl_sites{ 140 | my ($ranks) = @_; 141 | 142 | my $VERBOSE = $CC{VERBOSE}; 143 | my $db = prep_db('crawl'); 144 | 145 | my $crawler_tmp_dir = "$CC{TMP_DIR}/$CC{CRAWLER_TMP_PREFIX}$$"; 146 | my $rm_tmp = pathmk($crawler_tmp_dir); 147 | 148 | my @urls_by_domain; 149 | for(my $i = 0;$i < @$ranks;++$i){ 150 | my $rank = $ranks->[$i]; 151 | 152 | my $domain; 153 | eval { 154 | $db->{start_task}->execute($$, $rank); 155 | $domain = $db->{start_task}->fetchall_arrayref->[0][0]; 156 | } 157 | or do { 158 | warn "Failed to start task for rank $rank: $@"; 159 | next; 160 | }; 161 | 162 | eval { 163 | $domain = URI->new("https://$domain/")->host; 164 | 1; 165 | } 166 | or do { 167 | warn "Failed to filter domain $domain: $@"; 168 | next; 169 | }; 170 | 171 | $VERBOSE && warn "checking domain $domain\n"; 172 | my $urls = get_urls_for_domain($domain, $db); 173 | my @pairs; 174 | for my $url (@$urls){ 175 | push @pairs, [$domain, $url]; 176 | } 177 | push @urls_by_domain, \@pairs if @pairs; 178 | } 179 | 180 | my $ranks_str = '{' . join(',', @$ranks) . '}'; 181 | 182 | my $ea = each_arrayref @urls_by_domain; 183 | 184 | my (%ssl_cert_checked, %domain_render_delay, %sessions); 185 | while(my @urls = $ea->()){ 186 | for my $u (@urls){ 187 | next unless $u; 188 | my ($domain, $url) = @$u; 189 | next unless $url =~ /^http/i; 190 | 191 | # for the command-line 192 | $url =~ s/'/%27/g; 193 | 194 | my ($http_url) = $url =~ s/^https:/http:/ri; 195 | my ($https_url) = $url =~ s/^http:/https:/ri; 196 | 197 | my $http_ss = $crawler_tmp_dir . '/http.' . $domain . '.png'; 198 | 199 | unless($ssl_cert_checked{$domain}){ 200 | my $ssl = check_ssl_cert($domain); 201 | eval { 202 | $db->{insert_ssl}->execute($domain, @$ssl); 203 | ++$ssl_cert_checked{$domain}; 204 | } 205 | or do { 206 | warn "Failed to insert ssl info for $domain: $@"; 207 | }; 208 | } 209 | 210 | my %comparison; 211 | # We will compare a URL twice max: 212 | # 1. Compare HTTP vs. HTTPS 213 | # 2. 
Redo if the screenshot is a above the threshold to check for rendering problems 214 | SCREENSHOT_RETRY: for (0..$CC{SCREENSHOT_RETRIES}){ 215 | my $redo_comparison = 0; 216 | 217 | my %stats = (domain => $domain); 218 | check_site(\%stats, $http_url, $http_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); 219 | # the idea behind screenshots is: 220 | # 1. Do for HTTP automatically so we don't have to make another request if it works 221 | # 2. Do for HTTPS if HTTP worked and wasn't autoupgraded 222 | # 3. If HTTPS worked and didn't downgrade, compare them 223 | my $https_ss; 224 | if( (-e $http_ss) && ($stats{http_request_uri} =~ /^http:/i) && ($stats{http_response} == 200)){ 225 | $https_ss = $crawler_tmp_dir . '/https.' . $domain . '.png'; 226 | } 227 | 228 | HTTPS_RETRY: for my $https_attempt (0..$CC{HTTPS_RETRIES}){ 229 | my $redo_https; 230 | check_site(\%stats, $https_url, $https_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); 231 | if( ($stats{https_request_uri} =~ /^https:/i) && ($stats{https_response} == 200)){ 232 | if($https_ss && (-e $https_ss)){ 233 | my $out = `$CC{COMPARE} -metric mae $http_ss $https_ss /dev/null 2>&1`; 234 | 235 | if(my ($diff) = $out =~ /\(([\d\.e\-]+)\)/){ 236 | if($CC{SCREENSHOT_THRESHOLD} < $diff){ 237 | # Only need to redo on the first failure. After that, the delay 238 | # will have already been increased by a previous URL 239 | unless($domain_render_delay{$domain} == $CC{PHANTOM_RENDER_DELAY}){ 240 | $domain_render_delay{$domain} = $CC{PHANTOM_RENDER_DELAY}; 241 | $redo_comparison = 1; 242 | $VERBOSE && warn "redoing $http_url (diff: $diff)\n"; 243 | } 244 | } 245 | $stats{ss_diff} = $diff; 246 | } 247 | else{ 248 | warn "Failed to extract compare diff betweeen $http_ss and $https_ss from $out\n"; 249 | } 250 | unlink $_ for $http_ss, $https_ss; 251 | } 252 | 253 | if($DDG_INTERNAL && $https_attempt){ 254 | add_stat(qw'increment smarter_encryption.crawl.https_retries.success'); 255 | } 256 | } 257 | elsif($DDG_INTERNAL && $https_attempt){ 258 | add_stat(qw'increment smarter_encryption.crawl.https_retries.failure'); 259 | } 260 | elsif( ($stats{https_request_uri} !~ /^http:/) && ($stats{http_response} != $stats{https_response})){ 261 | $redo_https = 1; 262 | $VERBOSE && warn "Redoing HTTPS request for $domain: $https_url\n"; 263 | } 264 | 265 | last HTTPS_RETRY unless $redo_https; 266 | } 267 | 268 | # Most should exit here 269 | unless($redo_comparison){ 270 | %comparison = %stats; 271 | last; 272 | } 273 | } 274 | 275 | unless($db->{con}->ping){ 276 | $VERBOSE && warn "Reconnecting to DB before inserting comparison"; 277 | $db = prep_db('crawl'); 278 | } 279 | 280 | if(my $host = eval { URI->new($comparison{https_request_uri})->host}){ 281 | unless($ssl_cert_checked{$host}){ 282 | my $ssl = check_ssl_cert($host); 283 | eval { 284 | $db->{insert_ssl}->execute($host, @$ssl); 285 | ++$ssl_cert_checked{$host}; 286 | } 287 | or do { 288 | warn "Failed to insert ssl info for $host: $@"; 289 | }; 290 | } 291 | } 292 | 293 | if($comparison{http_request_uri} || $comparison{https_request_uri}){ 294 | my $log_id; 295 | eval { 296 | $db->{insert_domain}->execute(@comparison{qw' 297 | domain 298 | http_request_uri 299 | http_response 300 | http_requests 301 | http_size 302 | https_request_uri 303 | https_response 304 | https_requests 305 | https_size 306 | autoupgrade 307 | mixed 308 | ss_diff'} 309 | ); 310 | $log_id = $db->{insert_domain}->fetch()->[0]; 311 | } 312 | or do { 313 | $VERBOSE && warn "Failed to insert request for $domain: $@"; 314 | }; 
315 | 316 | if($log_id){ 317 | if(my $hdrs = delete $comparison{https_response_headers}){ 318 | eval { 319 | $db->{insert_headers}->execute($log_id, $hdrs); 320 | } 321 | or do { 322 | $VERBOSE && warn "Failed to insert response headers for $domain ($log_id): $@"; 323 | }; 324 | } 325 | 326 | if(my $mixed_reqs = delete $comparison{mixed_children}){ 327 | for my $m (keys %$mixed_reqs){ 328 | eval{ 329 | $db->{insert_mixed}->execute($log_id, $m); 330 | 1; 331 | } 332 | or do { 333 | $VERBOSE && warn "Failed to insert mixed request for $domain: $@"; 334 | }; 335 | } 336 | } 337 | $comparison{id} = $log_id; 338 | push @{$sessions{$domain}}, \%comparison; 339 | } 340 | } 341 | } 342 | } 343 | 344 | unless($db->{con}->ping){ 345 | $VERBOSE && warn "Reconnecting to DB before updating aggregate data"; 346 | $db = prep_db('crawl'); 347 | } 348 | 349 | while(my ($domain, $session) = each %sessions){ 350 | my $aggregates = aggregate_crawl_session($domain, $session); 351 | while(my ($host, $agg) = each %$aggregates){ 352 | eval { 353 | $db->{upsert_aggregate}->execute( 354 | $host, @$agg{qw' 355 | https 356 | http_s 357 | https_errs 358 | http 359 | unknown 360 | autoupgrade 361 | mixed_requests 362 | max_ss_diff 363 | redirects 364 | max_id 365 | requests 366 | is_redirect 367 | redirect_hosts' 368 | } 369 | ); 370 | 1; 371 | } 372 | or do { 373 | warn "Failed to upsert aggregate for $host: $@"; 374 | }; 375 | } 376 | } 377 | 378 | eval { 379 | $db->{finish_tasks}->execute($ranks_str); 380 | 1; 381 | } 382 | or do { 383 | warn "Failed to finish tasks for ranks ($ranks_str): $@"; 384 | }; 385 | 386 | system "$CC{PKILL} -9 -f '$crawler_tmp_dir '"; 387 | pathrmdir($crawler_tmp_dir) if $rm_tmp; 388 | } 389 | 390 | sub prep_db { 391 | my $target = shift; 392 | 393 | my %db; 394 | 395 | my $con = get_con(); 396 | 397 | if($target eq 'queue'){ 398 | $db{reserve_tasks} = $con->prepare(" 399 | update https_queue 400 | set processing_host = '$HOST', 401 | reserved = now() 402 | where rank in ( 403 | select rank from https_queue 404 | where processing_host is null 405 | order by rank 406 | limit $CC{SITES_PER_CRAWLER} 407 | for update skip locked 408 | ) 409 | returning rank 410 | "); 411 | $db{reset_unfinished_tasks} = $con->prepare(" 412 | update https_queue 413 | set processing_host = null, 414 | worker_pid = null, 415 | reserved = null, 416 | started = null 417 | where 418 | processing_host = '$HOST' and 419 | finished is null 420 | "); 421 | $db{complete_unfinished_worker_tasks} = $con->prepare(" 422 | update https_queue 423 | set finished = now(), 424 | processing_host = '$HOST (incomplete)' 425 | where 426 | processing_host = '$HOST' and 427 | finished is null and 428 | worker_pid = ? 429 | "); 430 | } 431 | elsif($target eq 'crawl'){ 432 | $db{start_task} = $con->prepare('update https_queue set worker_pid = ?, started = now() where rank = ? returning domain'); 433 | $db{select_urls} = $con->prepare('select url from full_urls where host = ?'); 434 | $db{insert_domain} = $con->prepare(' 435 | insert into https_crawl 436 | (domain, http_request_uri, http_response, http_requests, http_size, https_request_uri, https_response, https_requests, https_size, autoupgrade, mixed, screenshot_diff) 437 | values (?,?,?,?,?,?,?,?,?,?,?,?) 
returning id 438 | '); 439 | $db{insert_mixed} = $con->prepare('insert into mixed_assets (https_crawl_id, asset) values (?,?)'); 440 | $db{insert_headers} = $con->prepare('insert into https_response_headers (https_crawl_id, response_headers) values (?,?)'); 441 | $db{finish_tasks} = $con->prepare('update https_queue set finished = now() where rank = ANY(?::integer[])'); 442 | $db{insert_ssl} = $con->prepare(' 443 | insert into ssl_cert_info (domain, issuer, notBefore, notAfter, host_valid, err) values (?,?,?,?,?,?) 444 | on conflict (domain) do update set 445 | issuer = EXCLUDED.issuer, 446 | notBefore = EXCLUDED.notBefore, 447 | notAfter = EXCLUDED.notAfter, 448 | host_valid = EXCLUDED.host_valid, 449 | err = EXCLUDED.err, 450 | updated = now() 451 | '); 452 | # Note where clause: 453 | # 1. Non-redirects update any, including changing a redirect to a non-redirect 454 | # 2. Redirects update other redirects 455 | $db{upsert_aggregate} = $con->prepare(" 456 | insert into https_crawl_aggregate ( 457 | domain, 458 | https, 459 | http_and_https, 460 | https_errs, http, 461 | unknown, 462 | autoupgrade, 463 | mixed_requests, 464 | max_screenshot_diff, 465 | redirects, 466 | max_https_crawl_id, 467 | requests, 468 | is_redirect, 469 | redirect_hosts, 470 | session_request_limit) 471 | values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,$CC{URLS_PER_SITE}) 472 | on conflict (domain) do update set ( 473 | https, 474 | http_and_https, 475 | https_errs, 476 | http, 477 | unknown, 478 | autoupgrade, 479 | mixed_requests, 480 | max_screenshot_diff, 481 | redirects, 482 | max_https_crawl_id, 483 | requests, 484 | is_redirect, 485 | redirect_hosts, 486 | session_request_limit 487 | ) = ( 488 | EXCLUDED.https, 489 | EXCLUDED.http_and_https, 490 | EXCLUDED.https_errs, 491 | EXCLUDED.http, 492 | EXCLUDED.unknown, 493 | EXCLUDED.autoupgrade, 494 | EXCLUDED.mixed_requests, 495 | EXCLUDED.max_screenshot_diff, 496 | EXCLUDED.redirects, 497 | EXCLUDED.max_https_crawl_id, 498 | EXCLUDED.requests, 499 | EXCLUDED.is_redirect, 500 | EXCLUDED.redirect_hosts, 501 | EXCLUDED.session_request_limit) 502 | where 503 | EXCLUDED.is_redirect = false or 504 | https_crawl_aggregate.is_redirect = true 505 | "); 506 | } 507 | 508 | $db{con} = $con; 509 | return \%db; 510 | } 511 | 512 | # Strategy behind url selection: 513 | # 1. Fill queue with homepage and click urls sort by top-level path 514 | # prevalence 515 | # 2. If necessary, get backfill_urls 516 | sub get_urls_for_domain { 517 | my ($domain, $db) = @_; 518 | 519 | state $rr = WWW::RobotRules->new($CC{UA}); 520 | state $mech = get_ua('mech'); 521 | state $VERBOSE = $CC{VERBOSE}; 522 | 523 | # Get latest robot rules for domain 524 | my $res = $mech->get("http://$domain/robots.txt"); 525 | if($res->is_success){ 526 | # the uri may be different than what we requested 527 | my @doms = ($domain); 528 | my $uri = $res->request->uri; 529 | if(my $host = eval { URI->new($uri)->host }){ 530 | push @doms, $host if $host ne $domain; 531 | } 532 | my $robots_txt = $res->decoded_content; 533 | 534 | # Add the rules for the: 535 | # 1. The domain and redirect host if different 536 | # 2. HTTP/HTTPS for each 537 | # yes, http and https could be different 538 | for my $d (@doms){ 539 | for my $p (qw(http https)){ 540 | $rr->parse("$p://$d/", $robots_txt); 541 | } 542 | } 543 | } 544 | 545 | my @urls; 546 | my $homepage = 'http://' . $domain . 
'/'; 547 | 548 | $res = $mech->get($homepage); 549 | 550 | if($res->is_success){ 551 | # the uri may be different than what we requested 552 | my $uri = $res->request->uri; 553 | if(my $host = eval { URI->new($uri)->host }){ 554 | # all links with the same host 555 | my @homepage_links; 556 | if(my $l = $mech->find_all_links(url_abs_regex => qr{//\Q$host\E/})){ 557 | @homepage_links = @$l; 558 | } 559 | 560 | for my $l (@homepage_links){ 561 | my $abs_url = $l->url_abs; 562 | $abs_url = "$abs_url"; 563 | next if dupe_link($abs_url, \@urls); 564 | push @urls, $abs_url; 565 | } 566 | } 567 | } 568 | else { 569 | $VERBOSE && warn "Failed to get homepage links for $domain: " . $res->status_line; 570 | } 571 | 572 | eval { 573 | my $select_urls = $db->{select_urls}; 574 | $select_urls->execute($domain); 575 | while(my $r = $select_urls->fetchrow_arrayref){ 576 | my $url = $r->[0]; 577 | next if dupe_link($url, \@urls); 578 | push @urls, $url; 579 | } 580 | 1; 581 | } 582 | or do { 583 | $VERBOSE && warn "Failed to get click urls for $domain: $@"; 584 | }; 585 | 586 | state $URLS_PER_SITE = $CC{URLS_PER_SITE}; 587 | 588 | urls_by_path(\@urls, $rr, $URLS_PER_SITE); 589 | 590 | if($DDG_INTERNAL && (@urls < $URLS_PER_SITE)){ 591 | backfill_urls($domain, \@urls, $rr, $db, $mech, $URLS_PER_SITE, $VERBOSE); 592 | } 593 | 594 | # Add home by default since it often behaves differently 595 | unless(dupe_link($homepage, \@urls)){ 596 | if(@urls < $URLS_PER_SITE){ 597 | push @urls, $homepage; 598 | } 599 | else{ 600 | splice(@urls, -1, 1, $homepage); 601 | } 602 | } 603 | 604 | return \@urls; 605 | } 606 | 607 | sub prune_tmp_dirs { 608 | my $h = $_[HEAP]; 609 | 610 | return unless exists $h->{crawler_tmp_dirs}; 611 | 612 | my ($TMP_DIR, $CRAWLER_TMP_PREFIX) = @CC{qw'TMP_DIR CRAWLER_TMP_PREFIX'}; 613 | for my $pid (keys %{$h->{crawler_tmp_dirs}}){ 614 | my $crawler_tmp_dir = "$TMP_DIR/$CRAWLER_TMP_PREFIX$pid"; 615 | if(-d $crawler_tmp_dir){ 616 | next unless pathrmdir($crawler_tmp_dir); 617 | } 618 | delete $h->{crawler_tmp_dirs}{$pid}; 619 | } 620 | } 621 | 622 | sub check_site { 623 | my ($stats, $site, $screenshot, $delay, $crawler_tmp_dir) = @_; 624 | 625 | if(my ($request_scheme) = $site =~ /^(https?):/i){ 626 | $request_scheme = lc $request_scheme; 627 | 628 | eval{ 629 | @ENV{qw(PHANTOM_RENDER_DELAY PHANTOM_UA PHANTOM_TIMEOUT)} = 630 | ($delay, "'$CC{UA}'", $PHANTOM_TIMEOUT); 631 | 632 | my $out; 633 | my @cmd = ( 634 | $CC{PHANTOMJS}, 635 | "--local-storage-path=$crawler_tmp_dir", "--offline-storage-path=$crawler_tmp_dir", 636 | $CC{NETSNIFF_SS}, $site); 637 | push @cmd, $screenshot if $screenshot; 638 | 639 | IPC::Run::run \@cmd, \undef, \$out, 640 | IPC::Run::timeout($CC{HEADLESS_ALARM}, exception => "$site timed out after $CC{HEADLESS_ALARM} seconds"); 641 | die "PHANTOMJS $out" if $out =~ /^FAIL/; 642 | 643 | # Can have error messages at the end so have to extract the json 644 | my ($j) = $out =~ /^(\{\s+"log".+\})/ms; 645 | my $m = decode_json($j)->{log}; 646 | 647 | my ($main_request_scheme, $check_mixed); 648 | for my $e (@{$m->{entries}}){ 649 | my $response_status = $e->{response}{status}; 650 | # netsniff records the redirects to https for some sites 651 | next if $response_status =~ /^3/; 652 | my $url = $e->{request}{url}; 653 | next unless my ($scheme) = $url =~ /^(https?):/i; 654 | $scheme = lc $scheme; 655 | 656 | if($check_mixed && ($scheme eq 'http')){ 657 | # Absolute links. 
Even if the same host as parent, browsers will mark 658 | # this as mixed and the extension can't upgrade them 659 | $stats->{mixed_children}{$url} = 1; 660 | } 661 | 662 | unless($main_request_scheme){ 663 | $stats->{"${request_scheme}_request_uri"} = $url; 664 | $stats->{"${request_scheme}_response"} = $response_status; 665 | if($request_scheme eq 'http'){ 666 | $stats->{autoupgrade} = $scheme eq 'https' ? 1 : 0; 667 | } 668 | elsif($scheme eq 'https'){ 669 | $check_mixed = lc URI->new($url)->host; 670 | my $hdrs = delete $e->{response}{headers}; 671 | my %response_headers; 672 | # We don't want to store an array of one-key hashes. 673 | for my $h (@$hdrs){ 674 | my ($name, $value) = @$h{qw(name value)}; 675 | if(exists $response_headers{$name}){ 676 | # https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2 677 | $response_headers{$name} .= ",$value"; 678 | } 679 | else{ 680 | $response_headers{$name} = $value; 681 | } 682 | } 683 | $stats->{https_response_headers} = encode_json(\%response_headers); 684 | } 685 | $main_request_scheme = $scheme; 686 | } 687 | 688 | $stats->{"${request_scheme}_size"} += $e->{response}{bodySize}; 689 | ++$stats->{"${request_scheme}_requests"}; 690 | 691 | } 692 | 693 | if($check_mixed){ 694 | $stats->{mixed} = exists $stats->{mixed_children} ? 1 : 0; 695 | } 696 | 1; 697 | } 698 | or do { 699 | warn "check_site error: $@ ($site)"; 700 | system "$CC{PKILL} -9 -f '$crawler_tmp_dir '" if $crawler_tmp_dir =~ /\S/; 701 | }; 702 | } 703 | } 704 | 705 | sub crawler_done{ 706 | my ($k, $h, $id) = @_[KERNEL, HEAP, ARG0]; 707 | 708 | state $VERBOSE = $CC{VERBOSE}; 709 | $VERBOSE && warn "deleting crawler $id\n"; 710 | my $c = delete $h->{crawlers}{$id}; 711 | 712 | # see if any of its domains were left unfinished 713 | my $pid = $c->PID; 714 | eval { 715 | my $db = prep_db('queue'); 716 | my $unfinished = $db->{complete_unfinished_worker_tasks}->execute($pid); 717 | if($unfinished > 0){ 718 | $VERBOSE && warn "Marked $unfinished tasks incomplete for crawler with pid $pid\n"; 719 | } 720 | 1; 721 | } 722 | or do { 723 | warn "Failed to verify worker tasks: $@"; 724 | }; 725 | 726 | # Check and clean up tmp dirs for hung crawlers 727 | $h->{crawler_tmp_dirs}{$pid} = 1; 728 | $k->yield('prune_tmp_dirs'); 729 | 730 | $k->yield('crawl'); 731 | } 732 | 733 | sub crawler_debug{ 734 | my $msg = $_[ARG0]; 735 | 736 | $CC{VERBOSE} && warn 'crawler debug: ' . $msg. "\n"; 737 | } 738 | 739 | sub sig_child { 740 | warn 'Got signal from pid ' . $_[ARG1] . ', exit status: ' . $_[ARG2] if $_[ARG2]; 741 | $_[KERNEL]->sig_handled; 742 | } 743 | 744 | sub get_ua { 745 | my $type = shift; 746 | 747 | my $ua = $type eq 'mech' ? 748 | WWW::Mechanize->new( 749 | onerror => undef, # We'll check these ourselves so we don't have to catch die in eval 750 | quiet => 1 751 | ) 752 | : 753 | LWP::UserAgent->new(); 754 | 755 | $ua->agent($CC{UA}); 756 | $ua->timeout(10); 757 | return $ua; 758 | } 759 | 760 | sub get_con { 761 | 762 | $ENV{PGDATABASE} = $CC{DB} if exists $CC{DB}; 763 | $ENV{PGHOST} = $CC{HOST} if exists $CC{HOST}; 764 | $ENV{PGPORT} = $CC{PORT} if exists $CC{PORT}; 765 | $ENV{PGUSER} = $CC{USER} if exists $CC{USER}; 766 | $ENV{PGPASSWORD} = $CC{PASS} if exists $CC{PASS}; 767 | 768 | return DBI->connect('dbi:Pg:', '', '', { 769 | RaiseError => 1, 770 | PrintError => 0, 771 | AutoCommit => 1, 772 | }); 773 | } 774 | 775 | sub parse_argv { 776 | my $usage = <