├── .gitignore ├── LICENSE ├── README.md ├── gcrawl ├── __init__.py ├── page.py └── url.py ├── setup.py └── test ├── testAllowed.py └── testSanitize.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | dist/* 3 | build/* 4 | gcrawl.egg-info/* 5 | 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2011 SEOmoz 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be 12 | included in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 17 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 18 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 19 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 20 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | g-crawl-py 2 | ========== 3 | Qless-based crawl jobs, and gevent-based crawling. Hot damn! 4 | 5 | ![Status: Production](https://img.shields.io/badge/status-production-green.svg?style=flat) 6 | ![Team: Big Data](https://img.shields.io/badge/team-big_data-green.svg?style=flat) 7 | ![Scope: External](https://img.shields.io/badge/scope-external-green.svg?style=flat) 8 | ![Open Source: Yes](https://img.shields.io/badge/open_source-MIT-green.svg?style=flat) 9 | ![Critical: Yes](https://img.shields.io/badge/critical-yes-red.svg?style=flat) 10 | 11 | Installation 12 | ============ 13 | Install it in the typical way: 14 | 15 | sudo python setup.py install 16 | 17 | This installs, aside from the library itself, `reppy`, `urllib3`, `requests` and 18 | `gevent`. In addition to these, you'll have to install qless-py manually until it 19 | is released on `pip`: 20 | 21 | git clone git://github.com/seomoz/qless-py.git 22 | cd qless-py 23 | sudo python setup.py install 24 | 25 | You won't need Redis until you run `qless` jobs, but when that day inevitably 26 | comes, you'll need Redis 2.6. Unfortunately, Redis 2.6 has been delayed, and is 27 | still only available as an unstable release. __NOTE__: You only need Redis 2.6 28 | installed on the system where you'll be hosting the queue. This may or may not be 29 | the same system where you'll be running your crawlers: 30 | 31 | git clone git://github.com/antirez/redis.git 32 | cd redis 33 | git checkout 2.6 34 | make && make test && sudo make install 35 | 36 | Development 37 | =========== 38 | There's really only one class that you have to worry about subclassing: `gcrawl.Crawl`.
39 | It provides a few methods that you're encouraged to override in subclasses: 40 | 41 | def before(self): 42 | # This method gets executed before the crawl loop, in case you need it 43 | 44 | def after(self): 45 | # Before's counterpart, executed right before returning from `run()` 46 | 47 | def delay(self, page): 48 | # Return how many seconds to wait before sending the next request in this crawl 49 | 50 | def pop(self): 51 | # Return the next url to fetch 52 | 53 | def extend(self, urls, page): 54 | # These urls were discovered in the response. Pick, choose, add to your list. 55 | # `page` is the `Page` object for the response, and `urls` is a list of 56 | # string urls 57 | 58 | def got(self, page): 59 | # We've fetched a page. You should decide what requests to add 60 | 61 | def count(self, page): 62 | # Should this page count against the max_pages count? 63 | 64 | A `Crawl` object also has a method `run()`, which performs the crawl in a non-blocking 65 | way in a greenlet, and returns a list of results from the crawl (the values returned 66 | by your `got()` and `exception()` methods). The `run()` method first fetches the seed url, finds 67 | links, and then provides them to you to decide if you want to follow them. 68 | 69 | Many of these methods accept a `page` argument. That's a `Page` object, which has 70 | some helpful lazily-loaded attributes: 71 | 72 | - `page.content` -- raw HTML message, decoded and decompressed 73 | - `page.html` -- an etree of the HTML 74 | - `page.xml` -- an etree of the XML 75 | - `page.redirection` -- redirection location (through a 301, 302, 303, 307 or Refresh header) 76 | - `page.links` -- returns a two-element dictionary, with each value a list of links. This 77 | takes into account not only the 'nofollow' attribute of the links themselves, but also 78 | any meta robots on the page. 79 | 80 | { 81 | 'follow': [...], 82 | 'nofollow': [...] 83 | } 84 | 85 | Usage 86 | ===== 87 | Hot damn! You can run it straight from the interpreter (go ahead -- try it out): 88 | 89 | >>> from gcrawl import Crawl 90 | >>> c = Crawl('http://www.seomoz.org') 91 | >>> c.run() 92 | 93 | This is probably a good way to debug for development. When it comes time to run the 94 | thing in production, you'll want to have `qless` (which amounts to having Redis 2.6 95 | installed) on a server somewhere, and then invoke `qless-py-worker` (included with 96 | qless-py), which will: 97 | 98 | 1. Fork itself to make use of multiple cores on your machine, and manage the child 99 | processes. If a child process exits, it spawns a replacement. 100 | 2. Each process spawns a pool of greenlets to run crawls in a non-blocking way 101 | 102 | The details of the invocation can be discussed further. -------------------------------------------------------------------------------- /gcrawl/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import urllib3 4 | import requests 5 | import urlparse 6 | from .page import Page 7 | 8 | # Gevent!
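# An illustrative sketch (not part of the library) of the kind of subclass the
# README describes -- the class name, the same-host filter and the (url, status)
# results below are example choices, not anything gcrawl prescribes:
#
#     import urlparse
#
#     class MyCrawl(Crawl):
#         def delay(self, page):
#             # Be polite, but quicker than the default two seconds
#             return 1
#
#         def extend(self, urls, page):
#             # Only keep links on the same host as the page they came from
#             host = urlparse.urlparse(page.url).hostname
#             self.requests.extend(
#                 u for u in urls if urlparse.urlparse(u).hostname == host)
#
#         def got(self, page):
#             # Queue the followable links, and have `run()` collect the
#             # url and status code of every page we fetch
#             self.extend(page.links['follow'], page)
#             return (page.url, page.status)
#
#     results = MyCrawl('http://www.seomoz.org', max_pages=5).run()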
9 | import gevent 10 | from gevent import sleep 11 | from gevent import monkey; monkey.patch_all() 12 | 13 | import logging 14 | logging.basicConfig() 15 | logger = logging.getLogger('gcrawl') 16 | logger.setLevel(logging.INFO) 17 | 18 | 19 | class TimeoutException(Exception): 20 | '''A timeout happened''' 21 | pass 22 | 23 | 24 | class Crawl(object): 25 | '''A crawl of a set of urls''' 26 | # The headers that we'd like to use when making requests 27 | headers = {} 28 | 29 | @staticmethod 30 | def crawl(job): 31 | # @Matt, this is where you'd describe your qless job. For 32 | # example, you'd probably do something like this. This would 33 | # allow you to spawn up a bunch of these, and really saturate 34 | # all the CPU without worrying about network IO stuff 35 | c = Crawl(job['seed'], job['allow_subdomains'], job['max_pages']) 36 | results = c.run() 37 | with file('%s-dump' % job.jid, 'w+') as f: 38 | import cPickle as pickle 39 | pickle.dump(results, f) 40 | job.complete() 41 | 42 | def __init__(self, seed, allow_subdomains=False, max_pages=10, allow_redirects=False, timeout=10): 43 | self.requests = [seed] 44 | self.results = [] 45 | self.crawled = 0 46 | self.timeout = timeout 47 | self.allow_subdomains = allow_subdomains 48 | self.max_pages = max_pages 49 | self.allow_redirects = allow_redirects 50 | 51 | def before(self): 52 | '''This is executed before we run the main crawl loop''' 53 | pass 54 | 55 | def run(self): 56 | '''Run the crawl!''' 57 | self.before() 58 | while self.requests and self.crawled < self.max_pages: 59 | url = self.pop() 60 | try: 61 | logger.info('Requesting %s' % url) 62 | try: 63 | page = None 64 | with gevent.timeout.Timeout(self.timeout, False): 65 | page = Page(requests.get(url, headers=self.headers, 66 | allow_redirects=self.allow_redirects)) 67 | if page is None: 68 | logger.warn('Timed out fetching %s' % url) 69 | raise TimeoutException('Url %s timed out' % url) 70 | except Exception as exc: 71 | res = self.exception(url, exc) 72 | if res: 73 | self.results.append(res) 74 | continue 75 | 76 | # Should we append these results? 77 | res = self.got(page) 78 | if res: 79 | self.results.append(res) 80 | 81 | if self.count(page): 82 | self.crawled += 1 83 | 84 | delay = None 85 | with gevent.timeout.Timeout(self.timeout, False): 86 | delay = self.delay(page) 87 | sleep(delay) 88 | if delay is None: 89 | logger.warn('Timed out getting delay for %s' % url) 90 | except Exception as exc: 91 | logger.exception('Failed to request %s' % url) 92 | 93 | self.after() 94 | return self.results 95 | 96 | def after(self): 97 | '''This is executed after we run the main crawl loop, before returning''' 98 | pass 99 | 100 | def delay(self, page): 101 | '''How long to wait before sending the next request''' 102 | hostname = urlparse.urlparse(page.url).hostname 103 | if (hostname == 'localhost') or (hostname == '127.0.0.1'): 104 | # No delay if the request was to localhost 105 | return 0 106 | return 2 107 | 108 | def pop(self): 109 | '''Get the next url we should fetch''' 110 | return self.requests.pop(0) 111 | 112 | def extend(self, urls, page): 113 | '''Add these urls to the list of requests we have to make''' 114 | self.requests.extend(urls) 115 | 116 | def exception(self, url, exc): 117 | '''We encountered an exception when parsing this page''' 118 | pass 119 | 120 | def got(self, page): 121 | '''We fetched a page. Here is where you should decide what 122 | to do. Most likely, you'll add all the followable links to 123 | the requests you'll make.
If you want something appended to 124 | the array returned by `run`, return that value here''' 125 | logger.info('Fetched %s' % page.url) 126 | if page.status in (301, 302, 303, 307): 127 | logger.info('Following redirect %s => %s' % (page.url, page.redirection)) 128 | self.extend([page.redirection], page) 129 | else: 130 | self.extend(page.links['follow'], page) 131 | return None 132 | 133 | def count(self, page): 134 | '''Return true here if this request should count toward the 135 | max number of pages.''' 136 | return page.status not in (301, 302, 303, 307) 137 | -------------------------------------------------------------------------------- /gcrawl/page.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | from .url import Url 4 | from lxml import etree 5 | 6 | # The XPath functions available in lxml.etree do not include a lower() 7 | # function, and so we have to provide it ourselves. Ugly, yes. 8 | import string 9 | def lower(dummy, l): 10 | return [string.lower(s) for s in l] 11 | 12 | ns = etree.FunctionNamespace(None) 13 | ns['lower'] = lower 14 | 15 | class Page(object): 16 | # Disallowed schemes 17 | banned_schemes = ('mailto', 'javascript', 'tel') 18 | 19 | # A few xpaths we use 20 | # Meta robots. This applies to /all/ robots, though... 21 | metaRobotsXpath = etree.XPath('//meta[lower(@name)="robots"]/@content') 22 | # The xpath for finding the base tag, if one is supplied 23 | baseXpath = etree.XPath('//base[1]/@href') 24 | # Links we count should: 25 | # - have rel not containing nofollow 26 | # - have a valid href 27 | # - not start with any of our blacklisted schemes (like 'javascript', 'mailto', etc.) 28 | banned = ''.join('[not(starts-with(normalize-space(@href),"%s:"))]' % sc for sc in banned_schemes) 29 | 30 | followableLinksXpath = etree.XPath('//a[not(contains(lower(@rel),"nofollow"))]' + banned + '/@href') 31 | unfollowableLinksXpath = etree.XPath('//a[contains(lower(@rel),"nofollow")]' + banned + '/@href') 32 | allLinksXpath = etree.XPath('//a' + banned + '/@href') 33 | 34 | def __init__(self, response): 35 | self.url = response.url 36 | self.status = response.status_code 37 | self.headers = response.headers 38 | self.response = response 39 | self._content = '' 40 | self._tree = None 41 | 42 | def __getstate__(self): 43 | # Make sure we get the content -- this will fire that 44 | self.content 45 | result = self.__dict__.copy() 46 | del result['response'] 47 | return result 48 | 49 | def __getattr__(self, key): 50 | if key == 'content': 51 | self.content = self.response.content 52 | return self.content 53 | elif key == 'html': 54 | self.html = etree.fromstring(self.content, etree.HTMLParser(recover=True)) 55 | return self.html 56 | elif key == 'xml': 57 | self.xml = etree.fromstring(self.content, etree.XMLParser(recover=True)) 58 | return self.xml 59 | elif key == 'redirection': 60 | # This looks at both the Refresh header and the Location header 61 | self.redirection = self.headers.get('location') 62 | if self.redirection: 63 | self.redirection = Url.sanitize(self.redirection, self.url) 64 | return self.redirection 65 | 66 | rate, sep, self.redirection = self.headers.get('refresh', '').partition('=') 67 | if self.redirection: 68 | self.redirection = Url.sanitize(self.redirection, self.url) 69 | return self.redirection 70 | elif key == 'links': 71 | # Returns the links in the document that are followable and not: 72 | # 73 | # { 74 | # 'follow' : [...], 75 | # 'nofollow': [...] 
76 | # } 77 | robots = ';'.join(self.metaRobotsXpath(self.html)) 78 | base = ''.join(self.baseXpath(self.html)) or self.url 79 | if 'nofollow' in robots: 80 | self.links = { 81 | 'follow' : [], 82 | 'nofollow': [Url.sanitize(link, base) for link in self.allLinksXpath(self.html)] 83 | } 84 | else: 85 | self.links = { 86 | 'follow' : [Url.sanitize(link, base) for link in self.followableLinksXpath(self.html)], 87 | 'nofollow': [Url.sanitize(link, base) for link in self.unfollowableLinksXpath(self.html)] 88 | } 89 | return self.links 90 | 91 | def text(self): 92 | '''Return all the text in the document, excluding tags''' 93 | raise NotImplementedError() 94 | -------------------------------------------------------------------------------- /gcrawl/url.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import re 4 | import reppy 5 | import urllib 6 | import urlparse 7 | 8 | class Url(object): 9 | # Link types. Links can be considered any of these types, relative to 10 | # one another: 11 | # 12 | # internal => On the same domain 13 | # external => On a different domain 14 | # subdomain => Same top-level domain, but one is a subdomain of the other 15 | internal = 1 16 | external = 2 17 | subdomain = 3 18 | 19 | @staticmethod 20 | def sanitize(url, base=None, param_blacklist=None): 21 | '''Given a url, and optionally a base and a parameter blacklist, 22 | sanitize the provided url. This includes removing blacklisted 23 | parameters, relative paths, and more. 24 | 25 | For more information on how and what we parse / sanitize: 26 | http://tools.ietf.org/html/rfc1808.html 27 | The more up-to-date RFC is this one: 28 | http://www.ietf.org/rfc/rfc3986.txt''' 29 | 30 | # Parse the url once it has been evaluated relative to a base 31 | p = urlparse.urlparse(urlparse.urljoin(base or '', url)) 32 | 33 | # And remove all the black-listed query parameters 34 | query = '&'.join(q for q in p.query.split('&') if q.partition('=')[0].lower() not in (param_blacklist or ())) 35 | query = re.sub(r'^&|&$', '', re.sub(r'&{2,}', '&', query)) 36 | # And remove all the black-listed param parameters 37 | params = ';'.join(q for q in p.params.split(';') if q.partition('=')[0].lower() not in (param_blacklist or ())) 38 | params = re.sub(r'^;|;$', '', re.sub(r';{2,}', ';', params)) 39 | 40 | # Remove double forward-slashes from the path 41 | path = re.sub(r'\/{2,}', '/', p.path) 42 | # With that done, go through and remove all the relative references 43 | unsplit = [] 44 | for part in path.split('/'): 45 | # If we encounter the parent directory, and there's 46 | # a segment to pop off, then we should pop it off. 47 | if part == '..' and (not unsplit or unsplit.pop() != None): 48 | pass 49 | elif part != '.': 50 | unsplit.append(part) 51 | 52 | # With all these pieces, assemble! 53 | path = '/'.join(unsplit) 54 | if p.netloc: 55 | path = path or '/' 56 | 57 | if isinstance(path, unicode): 58 | path = urllib.quote(urllib.unquote(path.encode('utf-8'))) 59 | else: 60 | path = urllib.quote(urllib.unquote(path)) 61 | 62 | return urlparse.urlunparse((p.scheme, re.sub(r'^\.+|\.+$', '', p.netloc.lower()), path, params, query, p.fragment)).replace(' ', '%20') 63 | 64 | @staticmethod 65 | def allowed(url, useragent, headers=None, meta_robots=None): 66 | '''Given a url, a useragent and optionally headers and a dictionary of 67 | meta robots key-value pairs, determine whether that url is allowed or not. 
68 | 69 | The headers must be a dictionary mapping of string key to list value: 70 | 71 | { 72 | 'Content-Type': ['...'], 73 | 'X-Powered-By': ['...', '...'] 74 | } 75 | 76 | The meta robots must be provided as a mapping of string key to directives: 77 | 78 | { 79 | 'robots': 'index, follow', 80 | 'foobot': 'index, nofollow' 81 | } 82 | ''' 83 | # First, check robots.txt 84 | r = reppy.findRobot(url) 85 | allowed = (r == None) or r.allowed(url, useragent) 86 | 87 | # Next, check the X-Robots-Tag 88 | # There can be multiple instances of X-Robots-Tag in the headers, so we've 89 | # joined them together with semicolons. Now, make a dictionary of each of 90 | # the directives we found. They can be specific, in which case it's in the 91 | # format: 92 | # botname : directive 93 | # In the absence of a botname, it applies to all bots. Strictly speaking, 94 | # there is one directive that itself has a value, but as we're not interested 95 | # in it (it's the unavailable_after directive), it's ok to ignore it. 96 | if headers: 97 | for bot in headers.get('x-robots-tag', []): 98 | botname, sep, directive = bot.partition(':') 99 | if directive and botname == useragent: 100 | # This is when it applies just to us 101 | allowed = allowed and ('noindex' not in directive) and ('none' not in directive) 102 | else: 103 | # This is when it applies to all bots 104 | allowed = allowed and ('noindex' not in botname) and ('none' not in botname) 105 | 106 | # Now check for specific and general meta tags 107 | # In this implementation, specific meta tags override general meta robots 108 | if meta_robots: 109 | s = meta_robots.get(useragent, '') + meta_robots.get('robots', '') 110 | allowed = allowed and ('noindex' not in s) and ('none' not in s) 111 | 112 | return allowed 113 | 114 | @staticmethod 115 | def relationship(frm, to): 116 | '''Determine the relationship of the link `to` found on url `frm`. 117 | For example, the relationship between these urls is `internal`: 118 | http://foo.com/bar 119 | http://foo.com/howdy 120 | 121 | And `external`: 122 | http://foo.com/bar 123 | http://bar.com/foo 124 | 125 | And `subdomain`: 126 | http://foo.com/ 127 | http://bar.foo.com/ 128 | ''' 129 | return 0 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # 3 | # Copyright (c) 2011 SEOmoz 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining 6 | # a copy of this software and associated documentation files (the 7 | # "Software"), to deal in the Software without restriction, including 8 | # without limitation the rights to use, copy, modify, merge, publish, 9 | # distribute, sublicense, and/or sell copies of the Software, and to 10 | # permit persons to whom the Software is furnished to do so, subject to 11 | # the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be 14 | # included in all copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | # NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 20 | # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 21 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 22 | # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | try: 25 | from setuptools import setup 26 | extra = { 27 | 'install_requires' : ['reppy', 'urllib3', 'requests', 'gevent'] 28 | } 29 | except ImportError: 30 | from distutils.core import setup 31 | extra = { 32 | 'requires' : ['reppy', 'urllib3', 'requests', 'gevent'] 33 | } 34 | 35 | setup(name = 'gcrawl', 36 | version = '0.1.0', 37 | description = 'Fetch urls. Faster.', 38 | url = 'http://github.com/seomoz/g-crawl-py', 39 | author = 'Dan Lecocq', 40 | author_email = 'dan@seomoz.org', 41 | license = 'MIT', 42 | packages = ['gcrawl'], 43 | classifiers = [ 44 | 'License :: OSI Approved :: MIT License', 45 | 'Programming Language :: Python', 46 | 'Intended Audience :: Developers', 47 | 'Operating System :: OS Independent', 48 | 'Topic :: Internet :: WWW/HTTP' 49 | ], 50 | **extra 51 | ) -------------------------------------------------------------------------------- /test/testAllowed.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import unittest 5 | from gcrawl.url import Url 6 | 7 | class TestAllowed(unittest.TestCase): 8 | # In these tests, we're going to assume that reppy's unit tests 9 | # are working properly 10 | def test_x_robots_header(self): 11 | examples = [ 12 | (['noindex'] , False), 13 | (['none'] , False), 14 | (['noindex,none'], False), 15 | (['index'] , True ), 16 | (['foobot:index'], True ), 17 | (['foobot:none' ], False), 18 | (['barbar:index'], True ), 19 | (['barbot:none' ], True ) 20 | ] 21 | for line in examples: 22 | e, result = line 23 | d = { 24 | 'x-robots-tag': e 25 | } 26 | self.assertEqual(Url.allowed('http://www.seomoz.org/', 'foobot', headers=d), result) 27 | 28 | def test_meta_robots(self): 29 | pass 30 | 31 | unittest.main() 32 | -------------------------------------------------------------------------------- /test/testSanitize.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import unittest 5 | from gcrawl.url import Url 6 | 7 | # An example of banned parameters, used for a few tests 8 | banned = [ 9 | 'foobar', 10 | 'widget', 11 | 'wassup' 12 | ] 13 | 14 | class TestParamSanitization(unittest.TestCase): 15 | ########################################################################### 16 | # Params 17 | # 18 | # In this context, params are urls with ';'-style parameters. Yes, they 19 | # really exist, and customers get pissed off when we don't honor them.
20 | ########################################################################### 21 | def test_pruning_with_other_params(self): 22 | '''Make sure we can strip out a single blacklisted param''' 23 | for b in banned: 24 | bad = 'http://testing.com/page;%s=foo;ok=foo' % b 25 | good = 'http://testing.com/page;ok=foo' 26 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 27 | 28 | def test_case_insensitivity_params(self): 29 | '''Make sure we can do it upper-cased''' 30 | for b in banned: 31 | bad = 'http://testing.com/page;%s=foo;ok=foo' % b.upper() 32 | good = 'http://testing.com/page;ok=foo' 33 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 34 | 35 | def test_all_together_params(self): 36 | '''And make sure we can remove all of the blacklisted query params''' 37 | params = ';'.join('%s=foo' % b for b in banned) 38 | bad = 'http://testing.com/page;%s' % params 39 | good = 'http://testing.com/page' 40 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 41 | 42 | def test_preserve_order_params(self): 43 | '''Make sure we keep it all in order''' 44 | for b in banned: 45 | bad = 'http://testing.com/page;hi=low;hello=goodbye;%s=foo;howdy=doodeedoo;whats=up' % b 46 | good = 'http://testing.com/page;hi=low;hello=goodbye;howdy=doodeedoo;whats=up' 47 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 48 | 49 | def test_pruning_alone_params(self): 50 | '''Make sure we don't include that ";"''' 51 | for b in banned: 52 | bad = 'http://testing.com/page;%s=foo' % b 53 | good = 'http://testing.com/page' 54 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 55 | 56 | def test_param_values_ok_params(self): 57 | '''Make sure we can include them as param values''' 58 | for b in banned: 59 | ok = 'http://testing.com/page;foo=%s;ok=foo' % b 60 | self.assertEqual(Url.sanitize(ok), ok) 61 | 62 | def test_prefix_param_ok_params(self): 63 | '''Make sure we can give each blacklisted param a prefix''' 64 | for b in banned: 65 | ok = 'http://testing.com/page;howdy_%s=foo;ok=foo' % b 66 | self.assertEqual(Url.sanitize(ok), ok) 67 | 68 | 69 | class TestQuerySanitization(unittest.TestCase): 70 | ########################################################################### 71 | # Query 72 | # 73 | # In this context, query strings are after a ? 
and are joined with ampersands 74 | # For more, 75 | # http://tools.ietf.org/html/rfc1808.html 76 | ########################################################################### 77 | def test_pruning_with_other_args(self): 78 | '''Make sure we can strip out a single blacklisted query''' 79 | for b in banned: 80 | bad = 'http://testing.com/page?%s=foo&ok=foo' % b 81 | good = 'http://testing.com/page?ok=foo' 82 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 83 | 84 | def test_case_insensitivity(self): 85 | '''Make sure we can do it upper-cased''' 86 | for b in banned: 87 | bad = 'http://testing.com/page?%s=foo&ok=foo' % b.upper() 88 | good = 'http://testing.com/page?ok=foo' 89 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 90 | 91 | def test_all_together(self): 92 | '''And make sure we can remove all of the blacklisted query params''' 93 | params = '&'.join('%s=foo' % b for b in banned) 94 | bad = 'http://testing.com/page?%s' % params 95 | good = 'http://testing.com/page' 96 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 97 | 98 | def test_preserve_order(self): 99 | '''Make sure we keep it all in order''' 100 | for b in banned: 101 | bad = 'http://testing.com/page?hi=low&hello=goodbye&%s=foo&howdy=doodeedoo&whats=up' % b 102 | good = 'http://testing.com/page?hi=low&hello=goodbye&howdy=doodeedoo&whats=up' 103 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 104 | 105 | def test_pruning_alone(self): 106 | '''Make sure we don't include that "?"''' 107 | for b in banned: 108 | bad = 'http://testing.com/page?%s=foo' % b 109 | good = 'http://testing.com/page' 110 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 111 | 112 | def test_param_values_ok(self): 113 | '''Make sure we can include them as query values''' 114 | for b in banned: 115 | ok = 'http://testing.com/page?foo=%s&ok=foo' % b 116 | self.assertEqual(Url.sanitize(ok), ok) 117 | 118 | def test_prefix_param_ok(self): 119 | '''Make sure we can give each blacklisted param a prefix''' 120 | for b in banned: 121 | ok = 'http://testing.com/page?howdy_%s=foo&ok=foo' % b 122 | self.assertEqual(Url.sanitize(ok), ok) 123 | 124 | def test_multiple_ampersands(self): 125 | paths = [ 126 | ('howdy?&&' , 'howdy'), 127 | ('howdy?&&&foo=bar&&&' , 'howdy?foo=bar'), 128 | ('howdy;;;;foo=bar;' , 'howdy;foo=bar'), 129 | # These come from the prototype lsapi: https://github.com/seomoz/lsapi-prototype/blob/master/tests/test_convert_url.py 130 | # In query parameters, we should escape these characters 131 | #('?foo=\xe4\xb8\xad' , '?foo=%E4%B8%AD'), 132 | # But in a path, we should not 133 | #('\xe4\xb8\xad/bar.html', '\xe4\xb8\xadbar.html') 134 | ] 135 | 136 | base = 'http://testing.com/' 137 | for bad, clean in paths: 138 | self.assertEqual(Url.sanitize(base + bad), base + clean) 139 | 140 | class TestRelative(unittest.TestCase): 141 | ########################################################################### 142 | # Relative url support tests 143 | ########################################################################### 144 | def test_case_insensitivity(self): 145 | paths = [ 146 | ('www.TESTING.coM' , 'www.testing.com/'), 147 | ('WWW.testing.com' , 'www.testing.com/'), 148 | ('WWW.testing.COM/FOOBAR', 'www.testing.com/FOOBAR') 149 | ] 150 | 151 | for bad, clean in paths: 152 | self.assertEqual(Url.sanitize('http://' + bad), 'http://' + clean) 153 | 154 | def test_double_forward_slash(self): 155 | paths = [ 156 | ('howdy' , 'howdy'), 157 | ('hello//how//are' 
, 'hello/how/are'), 158 | ('hello/../how/are', 'how/are'), 159 | ('hello//..//how/' , 'how/'), 160 | ('a/b/../../c' , 'c'), 161 | ('../../../c' , 'c'), 162 | ('./hello' , 'hello'), 163 | ('./././hello' , 'hello'), 164 | ('a/b/c/' , 'a/b/c/') 165 | ] 166 | 167 | base = 'http://testing.com/' 168 | 169 | for bad, clean in paths: 170 | self.assertEqual(Url.sanitize(base + bad), base + clean) 171 | 172 | # This is the example from the wild that spawned this whole change 173 | bad = 'http://www.vagueetvent.com/../fonctions_pack/ajouter_pack_action.php?id_produit=26301' 174 | clean = 'http://www.vagueetvent.com/fonctions_pack/ajouter_pack_action.php?id_produit=26301' 175 | self.assertEqual(Url.sanitize(bad), clean) 176 | 177 | def test_insert_trailing_slash(self): 178 | # When dealing with a path-less url, we should insert a trailing slash. 179 | paths = [ 180 | ('foo.com?page=home', 'foo.com/?page=home'), 181 | ('foo.com' , 'foo.com/') 182 | ] 183 | 184 | for bad, clean in paths: 185 | self.assertEqual(Url.sanitize('http://' + bad), 'http://' + clean) 186 | 187 | 188 | 189 | 190 | class TestTheRest(unittest.TestCase): 191 | def test_join(self): 192 | # We should be able to join urls 193 | self.assertEqual(Url.sanitize('/foo', 'http://cnn.com'), 'http://cnn.com/foo') 194 | 195 | def test_escaping(self): 196 | paths = [ 197 | ('hello%20and%20how%20are%20you', 'hello%20and%20how%20are%20you'), 198 | ('danny\'s pub' , 'danny%27s%20pub'), 199 | ('danny%27s pub?foo=bar&yo' , 'danny%27s%20pub?foo=bar&yo') 200 | ] 201 | 202 | base = 'http://testing.com/' 203 | for bad, clean in paths: 204 | self.assertEqual(Url.sanitize(base + bad), base + clean) 205 | 206 | def test_wild(self): 207 | # These are some examples from the wild that have been seeming to fail 208 | # It apparently comes from the fact that the input is a unicode string, 209 | # and has disallowed character 210 | pairs = [ 211 | (u'http://www.jointingmortar.co.uk/rompox®-easy.html', 212 | 'http://www.jointingmortar.co.uk/rompox%C2%AE-easy.html'), 213 | (u'http://www.dinvard.se//index.php/result/type/owner/Stift Fonden för mindre arbetarbos/', 214 | 'http://www.dinvard.se/index.php/result/type/owner/Stift%20Fonden%20f%C3%B6r%20mindre%20arbetarbos/'), 215 | (u'http://www.ewaterways.com/cruises/all/alaska//ship/safari quest/itinerary/mexico\'s sea of cortés - aquarium of the world (8 days)/itinerary/', 216 | 'http://www.ewaterways.com/cruises/all/alaska/ship/safari%20quest/itinerary/mexico%27s%20sea%20of%20cort%C3%A9s%20-%20aquarium%20of%20the%20world%20%288%20days%29/itinerary/'), 217 | (u'http://www.mydeals.gr/prosfores/p/Υπόλοιπα%20Νησιά/', 218 | 'http://www.mydeals.gr/prosfores/p/%CE%A5%CF%80%CF%8C%CE%BB%CE%BF%CE%B9%CF%80%CE%B1%20%CE%9D%CE%B7%CF%83%CE%B9%CE%AC/') 219 | ] 220 | 221 | for bad, good in pairs: 222 | self.assertEqual(Url.sanitize(bad), good) 223 | 224 | unittest.main() 225 | --------------------------------------------------------------------------------
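A quick interpreter sketch of `Url.sanitize`, mirroring the tests above (the urls and the one-item blacklist are just example data; the expected outputs are the ones the tests assert):

    >>> from gcrawl.url import Url
    >>> Url.sanitize('/foo', 'http://cnn.com')
    'http://cnn.com/foo'
    >>> Url.sanitize('http://testing.com/page?widget=foo&ok=foo', param_blacklist=['widget'])
    'http://testing.com/page?ok=foo'
    >>> Url.sanitize('http://testing.com/a/b/../../c')
    'http://testing.com/c'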