├── Documentation
│   ├── License
│   └── Changelog
├── README.md
└── Source
    └── 404.py
/Documentation/License:
--------------------------------------------------------------------------------
1 |
2 | LICENSE
3 |
4 | Permission is hereby granted, free of charge, to anyone
5 | obtaining a copy of this document and accompanying files,
6 | to do whatever they want with them without any restriction,
7 | including, but not limited to, copying, modification and redistribution.
8 |
9 | NO WARRANTY OF ANY KIND IS PROVIDED.
10 |
11 |
--------------------------------------------------------------------------------
/Documentation/Changelog:
--------------------------------------------------------------------------------
1 |
2 | CHANGELOG
3 |
4 | * 2016/02/03:
5 |
6 | - Revised. Working on Python 3.5.0, beautifulsoup4 4.4.1,
7 | requests 2.9.1.
8 |
9 | * 2015/07/27:
10 |
11 | - 404 now uses 'html.parser' explicitly.
12 |
13 | * 2015/05/12:
14 |
15 | - Allow ignoring internal links.
16 |
17 | * 2015/05/10:
18 |
19 | - Avoid parsing the entire HTML and look only for link tags.
20 | This should make 404 faster and use less memory.
21 |
22 | - Show the number of errors in the final stats.
23 |
24 | * 2015/05/09:
25 |
26 | - Make a single, lazy get request instead of head/get.
27 | This is a major performance improvement.
28 |
29 |     - Also look for <img src="..."> tags when crawling.
30 |
31 | - Added links and time statistics and an option to suppress them (--quiet).
32 |
33 | - Added an option to disable redirects (--no-redirects).
34 |
35 | - Bugfix: add the root url to the link cache too.
36 |
37 | - Bugfix: check that the number of threads is positive.
38 |
39 | * 2015/05/08:
40 |
41 | - Implemented 'ignore', 'check' and 'follow' for links
42 | allowing recursive link crawling.
43 |
44 | - Print all http status codes > 400 instead of just 404.
45 |
46 | - Follow redirects.
47 | May add an option later to turn them off.
48 |
49 | - Check the headers content type when crawling
50 | to avoid doing get requests when possible.
51 |
52 | - Ignore fragments.
53 |
54 | * 2015/05/06:
55 |
56 | - First version.
57 |
58 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## About
3 |
4 | This shouldn't have happened.
5 |
6 | The thing is... I was testing a new programming language by writing
7 | a simple web crawler as an exercise. Being frustrated by multiple concurrency
8 | bugs in the stdlib I thought: "Okay, enough. I can probably write this in
9 | Python in an evening".
10 |
11 | Famous last words.
12 |
13 | A week later, the program snowballed from a toy example and it currently
14 | has the following features:
15 |
16 | * Supports SSL, redirections and custom timeouts, thanks
17 | to the excellent [requests][] library.
18 |
19 | * Lenient HTML parsing, so dubious markup should be fine, using
20 | the also excellent [beautifulsoup4][] library.
21 |
22 | * Validates both usual `<a href="...">` hyperlinks and `<img src="...">`
23 |   image links.
24 |
25 | * Can check, ignore or recursively follow both internal (same domain)
26 | and external links.
27 |
28 | * Tries to be efficient: multithreaded, ignores [fragments][], does not build
29 | a parse tree for non-link markup.
30 |
31 | * Fits in 404 lines. :)
32 |
33 | [beautifulsoup4]: http://www.crummy.com/software/BeautifulSoup/
34 | [fragments]: http://en.wikipedia.org/wiki/Fragment_identifier
35 | [requests]: http://docs.python-requests.org/en/latest/
36 |
37 | Here is an example, checking my entire blog:
38 |
39 | ```bash
40 | $ 404.py http://beluki.github.io --threads 20 --internal follow
41 | 404: http://cdimage.debian.org/debian-cd/7.8.0/i386/iso-cd/
42 | Checked 144 total links in 6.54 seconds.
43 | 46 internal, 98 external.
44 | 0 network/parsing errors, 1 link errors.
45 | ```
46 |
47 | (please be polite and don't spawn many concurrent connections to the
48 | same server; this is just a demonstration)
49 |
50 | ## Installation
51 |
52 | First, make sure you are using Python 3.3+ and have the [beautifulsoup4][]
53 | and [requests][] libraries installed. Both are available in pip.
54 |
55 | Other than that, 404 is a single Python script that you can put in your PATH.
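
For example, a typical setup on a Unix-like system might look like this
(`~/bin` is just an example; any directory in your PATH works):

```bash
# install the dependencies:
pip install beautifulsoup4 requests

# make the script executable and put it somewhere in your PATH:
chmod +x 404.py
cp 404.py ~/bin/
```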
56 |
57 | ## Command-line options
58 |
59 | 404 has some options that can be used to change the behavior:
60 |
61 | * `--external [check, ignore, follow]` sets the behavior for external (different
62 | domain) links. The default is to check them. Be careful! 'follow' may try
63 | to recursively crawl the entire internet and should only be used on an
64 | intranet.
65 |
66 | * `--internal [check, ignore, follow]` is like the above, but for internal links.
67 | The default is also 'check'.
68 |
69 | * `--newline [dos, mac, unix, system]` changes the newline format.
70 | I tend to use Unix newlines everywhere, even on Windows. The default is
71 | `system`, which uses the current platform newline format.
72 |
73 | * `--no-redirects` avoids following redirections. Links that redirect are
74 |   considered ok and keep their 3xx status code.
75 |
76 | * `--print-all` prints all the status codes/links, regardless of whether
77 |   they indicate an error. This is useful for grepping specific non-error codes
78 | such as 204 (no content).
79 |
80 | * `--quiet` avoids printing the statistics to stderr at the end.
81 | Useful for scripts.
82 |
83 | * `--threads n` uses n concurrent threads to process requests.
84 | The default is to use a single thread.
85 |
86 | * `--timeout n` waits n seconds for request responses. 10 seconds by
87 | default. Use `--timeout 0` to wait forever for the response.
88 |
89 | Some examples:
90 |
91 | ```bash
92 | # check all the reachable internal links, ignoring external links
93 | # (e.g. check that all the links a static blog generator creates are ok)
94 | 404.py url --internal follow --external ignore
95 |
96 | # check all the external links in a single page:
97 | 404.py url --internal ignore --external check
98 |
99 | # wait forever for a url to be available:
100 | 404.py url --internal ignore --external ignore --timeout 0
101 |
102 | # get all the links in a site and dump them to a txt (without status code)
103 | # (errors and statistics on stderr)
104 | 404.py url --internal follow --print-all | awk '{ print $2 }' > links.txt
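
# one more possible combination: check a single page with unix newlines,
# without following redirects and without printing statistics:
404.py url --newline unix --no-redirects --quiet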
105 | ```
106 |
107 | ## Portability
108 |
109 | Status codes/links are written to stdout, using UTF-8 and the newline
110 | format specified by `--newline`.
111 |
112 | Network or HTML parsing errors and statistics are written to stderr using
113 | the current platform newline format.
114 |
115 | The exit status is 0 on success and 1 on errors. After an error,
116 | 404 skips the current url and proceeds with the next one instead of aborting.
117 | It can be interrupted with control + c.
118 |
119 | Note that a link returning a 404 status code (or any 4xx or 5xx status) is
120 | NOT an error. Only being unable to get a status code at all due to network
121 | problems or invalid input is considered an error.
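
For instance, a script could rely on the exit status alone, something like:

```bash
# non-zero exit status means network or parsing errors, not 4xx/5xx links:
404.py http://example.com --quiet || echo "some links could not be checked"
```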
122 |
123 | 404 is tested on Windows 7 and 8 and on Debian (both x86 and x86-64)
124 | using Python 3.4+, beautifulsoup4 4.3.2+ and requests 2.6.2+. Older versions
125 | are not supported.
126 |
127 | ## Status
128 |
129 | This program is finished!
130 |
131 | 404 is feature-complete and has no known bugs. Unless issues are reported
132 | I plan no further development on it other than maintenance.
133 |
134 | ## License
135 |
136 | Like all my hobby projects, this is Free Software. See the [Documentation][]
137 | folder for more information. No warranty though.
138 |
139 | [Documentation]: Documentation
140 |
141 |
--------------------------------------------------------------------------------
/Source/404.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | """
5 | 404.
6 | A simple multithreaded dead link crawler.
7 | """
8 |
9 |
10 | import os
11 | import queue
12 | import sys
13 | import time
14 | import urllib
15 |
16 | from contextlib import closing
17 | from queue import Queue
18 | from threading import Thread
19 |
20 | from argparse import ArgumentParser, RawDescriptionHelpFormatter
21 |
22 |
23 | # Information and error messages:
24 |
25 | def outln(line):
26 | """ Write 'line' to stdout, using the platform encoding and newline format. """
27 | print(line, flush = True)
28 |
29 |
30 | def errln(line):
31 | """ Write 'line' to stderr, using the platform encoding and newline format. """
32 | print('404.py: error:', line, file = sys.stderr, flush = True)
33 |
34 |
35 | # Non-builtin imports:
36 |
37 | try:
38 | import requests
39 |
40 | from bs4 import BeautifulSoup, SoupStrainer
41 | from requests import Timeout
42 |
43 | except ImportError:
44 | errln('404 requires the following modules:')
45 |     errln('beautifulsoup4 4.3.2+ - http://www.crummy.com/software/BeautifulSoup/')
46 |     errln('requests 2.7.0+ - http://docs.python-requests.org/en/latest/')
47 | sys.exit(1)
48 |
49 |
50 | # Threads and a thread pool:
51 |
52 | class Worker(Thread):
53 | """
54 | Thread that pops tasks from a '.todo' Queue, executes them, and puts
55 | the completed tasks in a '.done' Queue.
56 |
57 | A task is any object that has a run() method.
58 |     Tasks themselves are responsible for holding their own results.
59 | """
60 |
61 | def __init__(self, todo, done):
62 | super().__init__()
63 | self.todo = todo
64 | self.done = done
65 | self.daemon = True
66 | self.start()
67 |
68 | def run(self):
69 | while True:
70 | task = self.todo.get()
71 | task.run()
72 | self.done.put(task)
73 | self.todo.task_done()
74 |
75 |
76 | class ThreadPool(object):
77 | """
78 |     Maintains a list of 'todo' and 'done' tasks and a number of threads
79 | consuming the tasks. Child threads are expected to put the tasks
80 | in the 'done' queue when those are completed.
81 | """
82 |
83 | def __init__(self, threads):
84 | self.threads = threads
85 |
86 | self.todo = Queue()
87 | self.done = Queue()
88 |
89 | self.pending_tasks = 0
90 |
91 | def add_task(self, task):
92 | """
93 | Add a new task to complete.
94 | Can be called after start().
95 | """
96 | self.pending_tasks += 1
97 | self.todo.put(task)
98 |
99 | def start(self):
100 | """ Start computing tasks. """
101 | for x in range(self.threads):
102 | Worker(self.todo, self.done)
103 |
104 | def wait_for_task(self):
105 | """ Wait for one task to complete. """
106 | while True:
107 | try:
108 | return self.done.get(block = False)
109 |
110 | # give tasks processor time:
111 | except queue.Empty:
112 | time.sleep(0.1)
113 |
114 | def poll_completed_tasks(self):
115 | """ Yield the computed tasks as soon as they are finished. """
116 | while self.pending_tasks > 0:
117 | yield self.wait_for_task()
118 | self.pending_tasks -= 1
119 |
120 | # at this point, all the tasks are completed:
121 | self.todo.join()
122 |
123 |
124 | # Tasks:
125 |
126 | # A BeautifulSoup strainer that only cares about links/images:
127 | link_strainer = SoupStrainer(lambda name, attrs: name == 'a' or name == 'img')
128 |
129 |
130 | class LinkTask(object):
131 | """
132 | A task that checks one link and optionally parses
133 | the HTML to get links in the body.
134 | """
135 | def __init__(self, link, parse_links, timeout, allow_redirects):
136 | self.link = link
137 | self.parse_links = parse_links
138 | self.timeout = timeout
139 | self.allow_redirects = allow_redirects
140 |
141 | # will contain the links found in the url body when HTML and parse_links = True:
142 | self.links = []
143 |
144 |         # will hold the status code after executing run():
145 | self.status = None
146 |
147 |         # since we run in a thread with its own context,
148 |         # exception information is captured here:
149 | self.exception = None
150 |
151 | def run(self):
152 | try:
153 | with closing(requests.get(self.link,
154 | timeout = self.timeout,
155 | allow_redirects = self.allow_redirects,
156 | stream = True)) as response:
157 |
158 | self.status = response.status_code
159 |
160 | # when not looking for links, we have all the information needed:
161 | if not self.parse_links:
162 | return
163 |
164 | # when the status is a client/server error, don't look for links either:
165 | if 400 <= self.status < 600:
166 | return
167 |
168 | # when not html/xml, no links:
169 | content_type = response.headers.get('content-type', '').strip()
170 | if not content_type.startswith(('text/html', 'application/xhtml+xml')):
171 | return
172 |
173 | # parse:
174 | soup = BeautifulSoup(response.content, 'html.parser', parse_only = link_strainer, from_encoding = response.encoding)
175 |
176 |                 # <a href="..."> links:
177 | for tag in soup.find_all('a', href = True):
178 | absolute_link = urllib.parse.urljoin(self.link, tag['href'])
179 | self.links.append(absolute_link)
180 |
181 |                 # <img src="..."> links:
182 | for tag in soup.find_all('img', src = True):
183 | absolute_link = urllib.parse.urljoin(self.link, tag['src'])
184 | self.links.append(absolute_link)
185 |
186 | except:
187 | self.exception = sys.exc_info()
188 |
189 |
190 | # IO:
191 |
192 | # For portability, all output is done in bytes
193 | # to avoid Python default encoding and automatic newline conversion:
194 |
195 | def utf8_bytes(string):
196 | """ Convert 'string' to bytes using UTF-8. """
197 | return bytes(string, 'UTF-8')
198 |
199 |
200 | BYTES_NEWLINES = {
201 | 'dos' : b'\r\n',
202 | 'mac' : b'\r',
203 | 'unix' : b'\n',
204 | 'system' : utf8_bytes(os.linesep),
205 | }
206 |
207 |
208 | def binary_stdout_writeline(line, newline):
209 | """
210 | Write 'line' (as bytes) to stdout without buffering
211 | using the specified 'newline' format (as bytes).
212 | """
213 | sys.stdout.buffer.write(line)
214 | sys.stdout.buffer.write(newline)
215 | sys.stdout.flush()
216 |
217 |
218 | # Parser:
219 |
220 | def make_parser():
221 | parser = ArgumentParser(
222 | description = __doc__,
223 | formatter_class = RawDescriptionHelpFormatter,
224 | epilog = 'example: 404.py http://beluki.github.io --internal follow --threads 5',
225 | usage = '404.py url [option [options ...]]',
226 | )
227 |
228 | # positional:
229 | parser.add_argument('url',
230 | help = 'url to crawl looking for links')
231 |
232 | # optional:
233 | parser.add_argument('--external',
234 | help = 'whether to check, ignore or follow external links (default: check)',
235 | choices = ['check', 'ignore', 'follow'],
236 | default = 'check')
237 |
238 | parser.add_argument('--internal',
239 | help = 'whether to check, ignore or follow internal links (default: check)',
240 | choices = ['check', 'ignore', 'follow'],
241 | default = 'check')
242 |
243 | parser.add_argument('--newline',
244 | help = 'use a specific newline mode (default: system)',
245 | choices = ['dos', 'mac', 'unix', 'system'],
246 | default = 'system')
247 |
248 | parser.add_argument('--no-redirects',
249 | help = 'do not follow redirects, just return the status code',
250 | action = 'store_true')
251 |
252 | parser.add_argument('--print-all',
253 | help = 'print all status codes and urls instead of only errors',
254 | action = 'store_true')
255 |
256 | parser.add_argument('--quiet',
257 | help = 'do not print statistics to stderr after crawling',
258 | action = 'store_true')
259 |
260 | parser.add_argument('--threads',
261 | help = 'number of threads (default: 1)',
262 | default = 1,
263 | type = int)
264 |
265 | parser.add_argument('--timeout',
266 | help = 'seconds to wait for request responses (default: 10)',
267 | default = 10,
268 | type = int)
269 |
270 | return parser
271 |
272 |
273 | # Main program:
274 |
275 | def run(url, allow_redirects, internal, external, newline, print_all, quiet, threads, timeout):
276 | """
277 |     Set up a threadpool and start checking links.
278 | """
279 | status = 0
280 |
281 | # create the pool and a task to start at the root:
282 | pool = ThreadPool(threads)
283 | pool.add_task(LinkTask(url, True, timeout, allow_redirects))
284 | pool.start()
285 |
287 |     # link cache to avoid processing the same link twice:
287 | link_cache = set([url])
288 |
289 | # url domain:
290 | netloc = urllib.parse.urlparse(url).netloc
291 |
292 | # stats:
293 | st_total_links = 1
294 | st_total_internal = 1
295 | st_total_external = 0
296 | st_error_task = 0
297 | st_error_link = 0
298 |     st_start_time = time.time()
299 |
300 | # start checking links:
301 | for task in pool.poll_completed_tasks():
302 |
303 | # error in request:
304 | if task.exception:
305 | status = 1
306 | exc_type, exc_obj, exc_trace = task.exception
307 |
308 | # provide a concise error message for timeouts (common):
309 | if isinstance(exc_obj, Timeout):
310 | errln('{} - timeout.'.format(task.link))
311 | else:
312 | errln('{} - {}.'.format(task.link, exc_obj))
313 |
314 | st_error_task += 1
315 |
316 | else:
317 | client_or_server_error = (400 <= task.status < 600)
318 |
319 | if client_or_server_error or print_all:
320 | output = utf8_bytes('{}: {}'.format(task.status, task.link))
321 | binary_stdout_writeline(output, newline)
322 |
323 | if client_or_server_error:
324 | st_error_link += 1
325 |
326 | for link in task.links:
327 |
328 | # ignore client-side fragment:
329 | link, _ = urllib.parse.urldefrag(link)
330 |
331 | if link not in link_cache:
332 | link_cache.add(link)
333 | parsed = urllib.parse.urlparse(link)
334 |
335 | # accept http/s protocols:
336 |                 if parsed.scheme not in ('http', 'https'):
337 | continue
338 |
339 | # internal or external link?
340 | if parsed.netloc == netloc:
341 | if internal == 'ignore':
342 | continue
343 |
344 | st_total_internal += 1
345 | get_links = (internal == 'follow')
346 |
347 | else:
348 | if external == 'ignore':
349 | continue
350 |
351 | st_total_external += 1
352 | get_links = (external == 'follow')
353 |
354 | link_task = LinkTask(link, get_links, timeout, allow_redirects)
355 | pool.add_task(link_task)
356 | st_total_links += 1
357 |
358 | if not quiet:
359 |         st_end_time = time.time() - st_start_time
360 |
361 | print('Checked {} total links in {:.3} seconds.'.format(st_total_links, st_end_time), file = sys.stderr)
362 | print('{} internal, {} external.'.format(st_total_internal, st_total_external), file = sys.stderr)
363 | print('{} network/parsing errors, {} link errors.'.format(st_error_task, st_error_link), file = sys.stderr)
364 |
365 | sys.exit(status)
366 |
367 |
368 | # Entry point:
369 |
370 | def main():
371 | parser = make_parser()
372 | options = parser.parse_args()
373 |
374 | url = options.url
375 | external = options.external
376 | internal = options.internal
377 | newline = BYTES_NEWLINES[options.newline]
378 | no_redirects = options.no_redirects
379 | print_all = options.print_all
380 | quiet = options.quiet
381 | threads = options.threads
382 |
383 | # validate threads number:
384 | if threads < 1:
385 | errln('the number of threads must be positive.')
386 | sys.exit(1)
387 |
388 | # 0 means no timeout:
389 | if options.timeout > 0:
390 | timeout = options.timeout
391 | else:
392 | timeout = None
393 |
394 |     allow_redirects = not no_redirects
395 | run(url, allow_redirects, internal, external, newline, print_all, quiet, threads, timeout)
396 |
397 |
398 | if __name__ == '__main__':
399 | try:
400 | main()
401 | except KeyboardInterrupt:
402 | pass
403 |
404 |
--------------------------------------------------------------------------------