├── Documentation
│   ├── License
│   └── Changelog
├── README.md
└── Source
    └── 404.py

/Documentation/License:
--------------------------------------------------------------------------------

LICENSE

Permission is hereby granted, free of charge, to anyone
obtaining a copy of this document and accompanying files,
to do whatever they want with them without any restriction,
including, but not limited to, copying, modification and redistribution.

NO WARRANTY OF ANY KIND IS PROVIDED.


--------------------------------------------------------------------------------
/Documentation/Changelog:
--------------------------------------------------------------------------------

CHANGELOG

* 2016/02/03:

  - Revised. Working on Python 3.5.0, beautifulsoup4 4.4.1,
    requests 2.9.1.

* 2015/07/27:

  - 404 now uses 'html.parser' explicitly.

* 2015/05/12:

  - Allow ignoring internal links.

* 2015/05/10:

  - Avoid parsing the entire HTML and look only for link tags.
    This should make 404 faster and use less memory.

  - Show the number of errors in the final stats.

* 2015/05/09:

  - Make a single, lazy get request instead of head/get.
    This is a major performance improvement.

  - Also look for <img> image links when crawling.

  - Added links and time statistics and an option to suppress them (--quiet).

  - Added an option to disable redirects (--no-redirects).

  - Bugfix: add the root url to the link cache too.

  - Bugfix: check that the number of threads is positive.

* 2015/05/08:

  - Implemented 'ignore', 'check' and 'follow' for links,
    allowing recursive link crawling.

  - Print all http status codes > 400 instead of just 404.

  - Follow redirects.
    May add an option later to turn them off.

  - Check the headers content type when crawling
    to avoid doing get requests when possible.

  - Ignore fragments.

* 2015/05/06:

  - First version.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

## About

This shouldn't have happened.

The thing is... I was testing a new programming language by writing
a simple web crawler as an exercise. Being frustrated by multiple concurrency
bugs in the stdlib, I thought: "Okay, enough. I can probably write this in
Python in an evening".

Famous last words.

A week later, the program had snowballed from a toy example and it currently
has the following features:

* Supports SSL, redirections and custom timeouts, thanks
  to the excellent [requests][] library.

* Lenient HTML parsing, so dubious markup should be fine, using
  the also excellent [beautifulsoup4][] library.

* Validates both the usual `<a href="...">` hyperlinks and `<img src="...">`
  image links.

* Can check, ignore or recursively follow both internal (same domain)
  and external links.

* Tries to be efficient: multithreaded, ignores [fragments][], does not build
  a parse tree for non-link markup.

* Fits in 404 lines. :)

[beautifulsoup4]: http://www.crummy.com/software/BeautifulSoup/
[fragments]: http://en.wikipedia.org/wiki/Fragment_identifier
[requests]: http://docs.python-requests.org/en/latest/

Here is an example, checking my entire blog:

```bash
$ 404.py http://beluki.github.io --threads 20 --internal follow
404: http://cdimage.debian.org/debian-cd/7.8.0/i386/iso-cd/
Checked 144 total links in 6.54 seconds.
46 internal, 98 external.
0 network/parsing errors, 1 link errors.
```

(please be polite and don't spawn many concurrent connections to the
same server; this is just a demonstration)

## Installation

First, make sure you are using Python 3.3+ and have the [beautifulsoup4][]
and [requests][] libraries installed. Both can be installed with pip.

Other than that, 404 is a single Python script that you can put in your PATH.

## Command-line options

404 has some options that can be used to change its behavior:

* `--external [check, ignore, follow]` sets the behavior for external (different
  domain) links. The default is to check them. Be careful! 'follow' may try
  to recursively crawl the entire internet and should only be used on an
  intranet.

* `--internal [check, ignore, follow]` like above, but for internal links.
  The default is also 'check'.

* `--newline [dos, mac, unix, system]` changes the newline format.
  I tend to use Unix newlines everywhere, even on Windows. The default is
  `system`, which uses the current platform newline format.

* `--no-redirects` avoids following redirections. Links with redirections
  will be considered ok according to their 3xx status code.

* `--print-all` prints all the status codes/links, regardless of whether
  they indicate an error. This is useful for grepping specific non-error codes
  such as 204 (no content).

* `--quiet` avoids printing the statistics to stderr at the end.
  Useful for scripts.

* `--threads n` uses n concurrent threads to process requests.
  The default is to use a single thread.

* `--timeout n` waits n seconds for request responses. 10 seconds by
  default. Use `--timeout 0` to wait forever for the response.

Some examples:

```bash
# check all the reachable internal links, ignoring external links
# (e.g. check that all the links a static blog generator creates are ok)
404.py url --internal follow --external ignore

# check all the external links in a single page:
404.py url --internal ignore --external check

# wait forever for a url to be available:
404.py url --internal ignore --external ignore --timeout 0

# get all the links in a site and dump them to a txt file (without status codes)
# (errors and statistics on stderr)
404.py url --internal follow --print-all | awk '{ print $2 }' > links.txt
```

## Portability

Status codes/links are written to stdout, using UTF-8 and the newline
format specified by `--newline`.

Network or HTML parsing errors and statistics are written to stderr using
the current platform newline format.

The exit status is 0 on success and 1 on errors. After an error,
404 skips the current url and proceeds with the next one instead of aborting.
It can be interrupted with control + c.

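For instance, because dead links (4xx/5xx responses) are reported on stdout
while the exit status only reflects crawl failures, a small wrapper script can
treat the two cases separately. Here is a minimal sketch (the wrapper itself
and the `dead-links.txt` file name are illustrative, not part of 404):

```bash
#!/bin/sh
# sketch: crawl the url given as the first argument, collecting dead links
# on stdout and detecting crawl failures through the exit status
if ! 404.py "$1" --internal follow --external ignore --quiet > dead-links.txt; then
    echo "404.py reported network or parsing errors" >&2
fi

if [ -s dead-links.txt ]; then
    echo "dead links found:" >&2
    cat dead-links.txt >&2
    exit 1
fi
```
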
Note that a link returning a 404 status code (or any 4xx or 5xx status) is
NOT an error. Only being unable to get a status code at all due to network
problems or invalid input is considered an error.

404 is tested on Windows 7 and 8 and on Debian (both x86 and x86-64)
using Python 3.4+, beautifulsoup4 4.3.2+ and requests 2.6.2+. Older versions
are not supported.

## Status

This program is finished!

404 is feature-complete and has no known bugs. Unless issues are reported,
I plan no further development on it other than maintenance.

## License

Like all my hobby projects, this is Free Software. See the [Documentation][]
folder for more information. No warranty though.

[Documentation]: Documentation


--------------------------------------------------------------------------------
/Source/404.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
404.
A simple multithreaded dead link crawler.
"""


import os
import queue
import sys
import time
import urllib.parse

from contextlib import closing
from queue import Queue
from threading import Thread

from argparse import ArgumentParser, RawDescriptionHelpFormatter


# Information and error messages:

def outln(line):
    """ Write 'line' to stdout, using the platform encoding and newline format. """
    print(line, flush = True)


def errln(line):
    """ Write 'line' to stderr, using the platform encoding and newline format. """
    print('404.py: error:', line, file = sys.stderr, flush = True)


# Non-builtin imports:

try:
    import requests

    from bs4 import BeautifulSoup, SoupStrainer
    from requests import Timeout

except ImportError:
    errln('404 requires the following modules:')
    errln('beautifulsoup4 4.3.2+ - http://www.crummy.com/software/BeautifulSoup/')
    errln('requests 2.7.0+ - http://docs.python-requests.org/en/latest/')
    sys.exit(1)


# Threads and a thread pool:

class Worker(Thread):
    """
    Thread that pops tasks from a '.todo' Queue, executes them, and puts
    the completed tasks in a '.done' Queue.

    A task is any object that has a run() method.
    Tasks themselves are responsible for holding their own results.
    """

    def __init__(self, todo, done):
        super().__init__()
        self.todo = todo
        self.done = done
        self.daemon = True
        self.start()

    def run(self):
        while True:
            task = self.todo.get()
            task.run()
            self.done.put(task)
            self.todo.task_done()


class ThreadPool(object):
    """
    Maintains a list of 'todo' and 'done' tasks and a number of threads
    consuming the tasks. Child threads are expected to put the tasks
    in the 'done' queue when those are completed.
    """

    def __init__(self, threads):
        self.threads = threads

        self.todo = Queue()
        self.done = Queue()

        self.pending_tasks = 0

    def add_task(self, task):
        """
        Add a new task to complete.
        Can be called after start().
        """
        self.pending_tasks += 1
        self.todo.put(task)

    def start(self):
        """ Start computing tasks. """
        for _ in range(self.threads):
            Worker(self.todo, self.done)

    def wait_for_task(self):
        """ Wait for one task to complete. """
        # note: the completed task object itself is returned; callers read
        # results (status, links, exception) directly from its attributes.
        while True:
            try:
                return self.done.get(block = False)

            # give tasks processor time:
            except queue.Empty:
                time.sleep(0.1)

    def poll_completed_tasks(self):
        """ Yield the computed tasks as soon as they are finished. """
        while self.pending_tasks > 0:
            yield self.wait_for_task()
            self.pending_tasks -= 1

        # at this point, all the tasks are completed:
        self.todo.join()


# Tasks:

# A BeautifulSoup strainer that only cares about links/images:
link_strainer = SoupStrainer(lambda name, attrs: name == 'a' or name == 'img')


class LinkTask(object):
    """
    A task that checks one link and optionally parses
    the HTML to get links in the body.
    """
    def __init__(self, link, parse_links, timeout, allow_redirects):
        self.link = link
        self.parse_links = parse_links
        self.timeout = timeout
        self.allow_redirects = allow_redirects

        # will contain the links found in the url body when it is HTML and parse_links = True:
        self.links = []

        # will hold the status code after executing run():
        self.status = None

        # since we run in a thread with its own context
        # exception information is captured here:
        self.exception = None

    def run(self):
        try:
            # stream = True makes this a lazy get: the body is only downloaded
            # (through response.content) when we actually need to parse links.
            with closing(requests.get(self.link,
                                      timeout = self.timeout,
                                      allow_redirects = self.allow_redirects,
                                      stream = True)) as response:

                self.status = response.status_code

                # when not looking for links, we have all the information needed:
                if not self.parse_links:
                    return

                # when the status is a client/server error, don't look for links either:
                if 400 <= self.status < 600:
                    return

                # when not html/xml, no links:
                content_type = response.headers.get('content-type', '').strip()
                if not content_type.startswith(('text/html', 'application/xhtml+xml')):
                    return

                # parse:
                soup = BeautifulSoup(response.content, 'html.parser',
                                     parse_only = link_strainer,
                                     from_encoding = response.encoding)

                # <a href="..."> hyperlinks:
                for tag in soup.find_all('a', href = True):
                    absolute_link = urllib.parse.urljoin(self.link, tag['href'])
                    self.links.append(absolute_link)

                # <img src="..."> image links:
                for tag in soup.find_all('img', src = True):
                    absolute_link = urllib.parse.urljoin(self.link, tag['src'])
                    self.links.append(absolute_link)

        except Exception:
            self.exception = sys.exc_info()


# IO:

# For portability, all output is done in bytes
# to avoid Python default encoding and automatic newline conversion:

def utf8_bytes(string):
    """ Convert 'string' to bytes using UTF-8. """
    return bytes(string, 'UTF-8')


BYTES_NEWLINES = {
    'dos'    : b'\r\n',
    'mac'    : b'\r',
    'unix'   : b'\n',
    'system' : utf8_bytes(os.linesep),
}


def binary_stdout_writeline(line, newline):
    """
    Write 'line' (as bytes) to stdout without buffering
    using the specified 'newline' format (as bytes).
    """
    sys.stdout.buffer.write(line)
    sys.stdout.buffer.write(newline)
    sys.stdout.flush()


# Parser:

def make_parser():
    parser = ArgumentParser(
        description = __doc__,
        formatter_class = RawDescriptionHelpFormatter,
        epilog = 'example: 404.py http://beluki.github.io --internal follow --threads 5',
        usage = '404.py url [option [options ...]]',
    )

    # positional:
    parser.add_argument('url',
        help = 'url to crawl looking for links')

    # optional:
    parser.add_argument('--external',
        help = 'whether to check, ignore or follow external links (default: check)',
        choices = ['check', 'ignore', 'follow'],
        default = 'check')

    parser.add_argument('--internal',
        help = 'whether to check, ignore or follow internal links (default: check)',
        choices = ['check', 'ignore', 'follow'],
        default = 'check')

    parser.add_argument('--newline',
        help = 'use a specific newline mode (default: system)',
        choices = ['dos', 'mac', 'unix', 'system'],
        default = 'system')

    parser.add_argument('--no-redirects',
        help = 'do not follow redirects, just return the status code',
        action = 'store_true')

    parser.add_argument('--print-all',
        help = 'print all status codes and urls instead of only errors',
        action = 'store_true')

    parser.add_argument('--quiet',
        help = 'do not print statistics to stderr after crawling',
        action = 'store_true')

    parser.add_argument('--threads',
        help = 'number of threads (default: 1)',
        default = 1,
        type = int)

    parser.add_argument('--timeout',
        help = 'seconds to wait for request responses (default: 10)',
        default = 10,
        type = int)

    return parser


# Main program:

def run(url, allow_redirects, internal, external, newline, print_all, quiet, threads, timeout):
    """
    Set up a thread pool and start checking links.
    """
    status = 0

    # create the pool and a task to start at the root:
    pool = ThreadPool(threads)
    pool.add_task(LinkTask(url, True, timeout, allow_redirects))
    pool.start()

    # link cache to avoid checking repeated links:
    link_cache = set([url])

    # url domain:
    netloc = urllib.parse.urlparse(url).netloc

    # stats:
    st_total_links = 1
    st_total_internal = 1
    st_total_external = 0
    st_error_task = 0
    st_error_link = 0
    st_start_time = time.perf_counter()

    # start checking links:
    for task in pool.poll_completed_tasks():

        # error in request:
        if task.exception:
            status = 1
            exc_type, exc_obj, exc_trace = task.exception

            # provide a concise error message for timeouts (common):
            if isinstance(exc_obj, Timeout):
                errln('{} - timeout.'.format(task.link))
            else:
                errln('{} - {}.'.format(task.link, exc_obj))

            st_error_task += 1

        else:
            client_or_server_error = (400 <= task.status < 600)

            if client_or_server_error or print_all:
                output = utf8_bytes('{}: {}'.format(task.status, task.link))
                binary_stdout_writeline(output, newline)

            if client_or_server_error:
                st_error_link += 1

            for link in task.links:

                # ignore client-side fragment:
                link, _ = urllib.parse.urldefrag(link)

                if link not in link_cache:
                    link_cache.add(link)
                    parsed = urllib.parse.urlparse(link)

                    # accept http/s protocols:
                    if parsed.scheme not in ('http', 'https'):
                        continue

                    # internal or external link?
                    if parsed.netloc == netloc:
                        if internal == 'ignore':
                            continue

                        st_total_internal += 1
                        get_links = (internal == 'follow')

                    else:
                        if external == 'ignore':
                            continue

                        st_total_external += 1
                        get_links = (external == 'follow')

                    link_task = LinkTask(link, get_links, timeout, allow_redirects)
                    pool.add_task(link_task)
                    st_total_links += 1

    if not quiet:
        st_end_time = time.perf_counter() - st_start_time

        print('Checked {} total links in {:.3} seconds.'.format(st_total_links, st_end_time), file = sys.stderr)
        print('{} internal, {} external.'.format(st_total_internal, st_total_external), file = sys.stderr)
        print('{} network/parsing errors, {} link errors.'.format(st_error_task, st_error_link), file = sys.stderr)

    sys.exit(status)


# Entry point:

def main():
    parser = make_parser()
    options = parser.parse_args()

    url = options.url
    external = options.external
    internal = options.internal
    newline = BYTES_NEWLINES[options.newline]
    no_redirects = options.no_redirects
    print_all = options.print_all
    quiet = options.quiet
    threads = options.threads

    # validate threads number:
    if threads < 1:
        errln('the number of threads must be positive.')
        sys.exit(1)

    # 0 means no timeout:
    if options.timeout > 0:
        timeout = options.timeout
    else:
        timeout = None

    allow_redirects = not no_redirects
    run(url, allow_redirects, internal, external, newline, print_all, quiet, threads, timeout)


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        pass

--------------------------------------------------------------------------------