├── .gitignore ├── LICENSE ├── README.md ├── gcrawl ├── __init__.py ├── page.py └── url.py ├── setup.py └── test ├── testAllowed.py └── testSanitize.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | dist/* 3 | build/* 4 | gcrawl.egg-info/* 5 | 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2011 SEOmoz 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be 12 | included in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 17 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 18 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 19 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 20 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | g-crawl-py 2 | ========== 3 | Qless-based crawl jobs, and gevent-based crawling. Hot damn! 4 | 5 | ![Status: Production](https://img.shields.io/badge/status-production-green.svg?style=flat) 6 | ![Team: Big Data](https://img.shields.io/badge/team-big_data-green.svg?style=flat) 7 | ![Scope: External](https://img.shields.io/badge/scope-external-green.svg?style=flat) 8 | ![Open Source: Yes](https://img.shields.io/badge/open_source-MIT-green.svg?style=flat) 9 | ![Critical: Yes](https://img.shields.io/badge/critical-yes-red.svg?style=flat) 10 | 11 | Installation 12 | ============ 13 | Install it in the typical way: 14 | 15 | sudo python setup.py install 16 | 17 | This installs, aside from the library itself, `reppy`, `urllib3`, `requests` and 18 | `gevent`. In addition to these, you'll have to install qless-py manually until it 19 | is released on `pip`: 20 | 21 | git clone git://github.com/seomoz/qless-py.git 22 | cd qless-py 23 | sudo python setup.py install 24 | 25 | You won't need Redis until you run `qless` jobs, but when that day inevitably 26 | comes, you'll need Redis 2.6. Unfortunately, Redis 2.6 has been delayed, and is 27 | still only available as an unstable release. __NOTE__: You only need Redis 2.6 28 | installed on the system where you'll be hosting the queue. This may or may not be 29 | the same system where you'll be running your crawlers: 30 | 31 | git clone git://github.com/antirez/redis.git 32 | cd redis 33 | git checkout 2.6 34 | make && make test && sudo make install 35 | 36 | Development 37 | =========== 38 | There's really only one class that you have to worry about subclassing: `gcrawl.Crawl`.
39 | It provides a few methods that you're encouraged to override in subclasses: 40 | 41 | def before(self): 42 | # This method gets executed before the crawl loop, in case you need it 43 | 44 | def after(self): 45 | # Before's counterpart, executed right before returning from `run()` 46 | 47 | def delay(self, page): 48 | # Return how many seconds to wait before sending the next request in this crawl 49 | 50 | def pop(self): 51 | # Return the next url to fetch 52 | 53 | def extend(self, urls, page): 54 | # These urls were discovered in the response. Pick, choose, add to your list. 55 | # `page` is the `Page` object for the response, and `urls` is a list of 56 | # string urls 57 | 58 | def got(self, page): 59 | # We've fetched a page. You should decide what requests to add 60 | 61 | def count(self, page): 62 | # Should this page count against the max_pages count? 63 | 64 | A `Crawl` object also has a method `run()`, which performs the crawl in a non-blocking 65 | way in a greenlet, and returns a list of results from the crawl (the values returned 66 | by your `got()` and `exception()` methods). The `run()` method first fetches the seed url, finds 67 | links, and then provides them to you to decide if you want to follow them. 68 | 69 | Many of these methods accept a `page` argument. That's a `Page` object, which has 70 | some helpful lazily-loaded attributes: 71 | 72 | - `page.content` -- raw HTML message, decoded and decompressed 73 | - `page.html` -- an etree of the HTML 74 | - `page.xml` -- an etree of the XML 75 | - `page.redirection` -- redirection location (through a 301, 302, 303, 307 or Refresh header) 76 | - `page.links` -- returns a two-element dictionary, with each value a list of links. This 77 | takes into account not only the 'nofollow' attribute of the links themselves, but also 78 | any meta robots on the page. 79 | 80 | { 81 | 'follow': [...], 82 | 'nofollow': [...] 83 | } 84 | 85 | Usage 86 | ===== 87 | Hot damn! You can run it straight from the interpreter (go ahead -- try it out): 88 | 89 | >>> from gcrawl import Crawl 90 | >>> c = Crawl('http://www.seomoz.org') 91 | >>> c.run() 92 | 93 | This is probably a good way to debug for development. When it comes time to run the 94 | thing in production, you'll want to have `qless` (which amounts to having Redis 2.6 95 | installed) on a server somewhere, and then invoke `qless-py-worker` (included with 96 | qless-py), which will: 97 | 98 | 1. Fork itself to make use of multiple cores on your machine, and manage the child 99 | processes. If a child process exits, it spawns a replacement. 100 | 2. Each process spawns a pool of greenlets to run crawls in a non-blocking way 101 | 102 | The details of the invocation can be discussed further. -------------------------------------------------------------------------------- /gcrawl/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import urllib3 4 | import requests 5 | import urlparse 6 | from .page import Page 7 | 8 | # Gevent!
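# An illustrative sketch (not part of the library) of the kind of subclass the
# README describes -- the class name, the same-host filter and the (url, status)
# results below are example choices, not anything gcrawl prescribes:
#
#     import urlparse
#
#     class MyCrawl(Crawl):
#         def delay(self, page):
#             # Be polite, but quicker than the default two seconds
#             return 1
#
#         def extend(self, urls, page):
#             # Only keep links on the same host as the page they came from
#             host = urlparse.urlparse(page.url).hostname
#             self.requests.extend(
#                 u for u in urls if urlparse.urlparse(u).hostname == host)
#
#         def got(self, page):
#             # Queue the followable links, and have `run()` collect the
#             # url and status code of every page we fetch
#             self.extend(page.links['follow'], page)
#             return (page.url, page.status)
#
#     results = MyCrawl('http://www.seomoz.org', max_pages=5).run()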
9 | import gevent 10 | from gevent import sleep 11 | from gevent import monkey; monkey.patch_all() 12 | 13 | import logging 14 | logging.basicConfig() 15 | logger = logging.getLogger('gcrawl') 16 | logger.setLevel(logging.INFO) 17 | 18 | 19 | class TimeoutException(Exception): 20 | '''A timeout happened''' 21 | pass 22 | 23 | 24 | class Crawl(object): 25 | '''A crawl of a set of urls''' 26 | # The headers that we'd like to use when making requests 27 | headers = {} 28 | 29 | @staticmethod 30 | def crawl(job): 31 | # @Matt, this is where you'd describe your qless job. For 32 | # example, you'd probably do something like this. This would 33 | # allow you to spawn up a bunch of these, and really saturate 34 | # all the CPU without worrying about network IO stuff 35 | c = Crawl(job['seed'], job['allow_subdomains'], job['max_pages']) 36 | results = c.run() 37 | with file('%s-dump' % job.jid, 'w+') as f: 38 | import cPickle as pickle 39 | pickle.dump(results, f) 40 | job.complete() 41 | 42 | def __init__(self, seed, allow_subdomains=False, max_pages=10, allow_redirects=False, timeout=10): 43 | self.requests = [seed] 44 | self.results = [] 45 | self.crawled = 0 46 | self.timeout = timeout 47 | self.allow_subdomains = allow_subdomains 48 | self.max_pages = max_pages 49 | self.allow_redirects = allow_redirects 50 | 51 | def before(self): 52 | '''This is executed before we run the main crawl loop''' 53 | pass 54 | 55 | def run(self): 56 | '''Run the crawl!''' 57 | self.before() 58 | while self.requests and self.crawled < self.max_pages: 59 | url = self.pop() 60 | try: 61 | logger.info('Requesting %s' % url) 62 | try: 63 | page = None 64 | with gevent.timeout.Timeout(self.timeout, False): 65 | page = Page(requests.get(url, headers=self.headers, 66 | allow_redirects=self.allow_redirects)) 67 | if page is None: 68 | logger.warn('Timed out fetching %s' % url) 69 | raise TimeoutException('Url %s timed out' % url) 70 | except Exception as exc: 71 | res = self.exception(url, exc) 72 | if res: 73 | self.results.append(res) 74 | continue 75 | 76 | # Should we append these results? 77 | res = self.got(page) 78 | if res: 79 | self.results.append(res) 80 | 81 | if self.count(page): 82 | self.crawled += 1 83 | 84 | delay = None 85 | with gevent.timeout.Timeout(self.timeout, False): 86 | delay = self.delay(page) 87 | sleep(delay) 88 | if delay is None: 89 | logger.warn('Timed out getting delay for %s' % url) 90 | except Exception as exc: 91 | logger.exception('Failed to request %s' % url) 92 | 93 | self.after() 94 | return self.results 95 | 96 | def after(self): 97 | '''This is executed after we run the main crawl loop, before returning''' 98 | pass 99 | 100 | def delay(self, page): 101 | '''How long to wait before sending the next request''' 102 | hostname = urlparse.urlparse(page.url).hostname 103 | if (hostname == 'localhost') or (hostname == '127.0.0.1'): 104 | # No delay if the request was to localhost 105 | return 0 106 | return 2 107 | 108 | def pop(self): 109 | '''Get the next url we should fetch''' 110 | return self.requests.pop(0) 111 | 112 | def extend(self, urls, page): 113 | '''Add these urls to the list of requests we have to make''' 114 | self.requests.extend(urls) 115 | 116 | def exception(self, url, exc): 117 | '''We encountered an exception when parsing this page''' 118 | pass 119 | 120 | def got(self, page): 121 | '''We fetched a page. Here is where you should decide what 122 | to do. Most likely, you'll add all the followable links to 123 | the requests you'll make.
If you want something appended to 124 | the array returned by `run`, return that value here''' 125 | logger.info('Fetched %s' % page.url) 126 | if page.status in (301, 302, 303, 307): 127 | logger.info('Following redirect %s => %s' % (page.url, page.redirection)) 128 | self.extend([page.redirection], page) 129 | else: 130 | self.extend(page.links['follow'], page) 131 | return None 132 | 133 | def count(self, page): 134 | '''Return true here if this request should count toward the 135 | max number of pages.''' 136 | return page.status not in (301, 302, 303, 307) 137 | -------------------------------------------------------------------------------- /gcrawl/page.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | from .url import Url 4 | from lxml import etree 5 | 6 | # The XPath functions available in lxml.etree do not include a lower() 7 | # function, and so we have to provide it ourselves. Ugly, yes. 8 | import string 9 | def lower(dummy, l): 10 | return [string.lower(s) for s in l] 11 | 12 | ns = etree.FunctionNamespace(None) 13 | ns['lower'] = lower 14 | 15 | class Page(object): 16 | # Disallowed schemes 17 | banned_schemes = ('mailto', 'javascript', 'tel') 18 | 19 | # A few xpaths we use 20 | # Meta robots. This applies to /all/ robots, though... 21 | metaRobotsXpath = etree.XPath('//meta[lower(@name)="robots"]/@content') 22 | # The xpath for finding the base tag, if one is supplied 23 | baseXpath = etree.XPath('//base[1]/@href') 24 | # Links we count should: 25 | # - have rel not containing nofollow 26 | # - have a valid href 27 | # - not start with any of our blacklisted schemes (like 'javascript', 'mailto', etc.) 28 | banned = ''.join('[not(starts-with(normalize-space(@href),"%s:"))]' % sc for sc in banned_schemes) 29 | 30 | followableLinksXpath = etree.XPath('//a[not(contains(lower(@rel),"nofollow"))]' + banned + '/@href') 31 | unfollowableLinksXpath = etree.XPath('//a[contains(lower(@rel),"nofollow")]' + banned + '/@href') 32 | allLinksXpath = etree.XPath('//a' + banned + '/@href') 33 | 34 | def __init__(self, response): 35 | self.url = response.url 36 | self.status = response.status_code 37 | self.headers = response.headers 38 | self.response = response 39 | self._content = '' 40 | self._tree = None 41 | 42 | def __getstate__(self): 43 | # Make sure we get the content -- this will fire that 44 | self.content 45 | result = self.__dict__.copy() 46 | del result['response'] 47 | return result 48 | 49 | def __getattr__(self, key): 50 | if key == 'content': 51 | self.content = self.response.content 52 | return self.content 53 | elif key == 'html': 54 | self.html = etree.fromstring(self.content, etree.HTMLParser(recover=True)) 55 | return self.html 56 | elif key == 'xml': 57 | self.xml = etree.fromstring(self.content, etree.XMLParser(recover=True)) 58 | return self.xml 59 | elif key == 'redirection': 60 | # This looks at both the Refresh header and the Location header 61 | self.redirection = self.headers.get('location') 62 | if self.redirection: 63 | self.redirection = Url.sanitize(self.redirection, self.url) 64 | return self.redirection 65 | 66 | rate, sep, self.redirection = self.headers.get('refresh', '').partition('=') 67 | if self.redirection: 68 | self.redirection = Url.sanitize(self.redirection, self.url) 69 | return self.redirection 70 | elif key == 'links': 71 | # Returns the links in the document that are followable and not: 72 | # 73 | # { 74 | # 'follow' : [...], 75 | # 'nofollow': [...] 
76 | # } 77 | robots = ';'.join(self.metaRobotsXpath(self.html)) 78 | base = ''.join(self.baseXpath(self.html)) or self.url 79 | if 'nofollow' in robots: 80 | self.links = { 81 | 'follow' : [], 82 | 'nofollow': [Url.sanitize(link, base) for link in self.allLinksXpath(self.html)] 83 | } 84 | else: 85 | self.links = { 86 | 'follow' : [Url.sanitize(link, base) for link in self.followableLinksXpath(self.html)], 87 | 'nofollow': [Url.sanitize(link, base) for link in self.unfollowableLinksXpath(self.html)] 88 | } 89 | return self.links 90 | 91 | def text(self): 92 | '''Return all the text in the document, excluding tags''' 93 | raise NotImplementedError() 94 | -------------------------------------------------------------------------------- /gcrawl/url.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | import re 4 | import reppy 5 | import urllib 6 | import urlparse 7 | 8 | class Url(object): 9 | # Link types. Links can be considered any of these types, relative to 10 | # one another: 11 | # 12 | # internal => On the same domain 13 | # external => On a different domain 14 | # subdomain => Same top-level domain, but one is a subdomain of the other 15 | internal = 1 16 | external = 2 17 | subdomain = 3 18 | 19 | @staticmethod 20 | def sanitize(url, base=None, param_blacklist=None): 21 | '''Given a url, and optionally a base and a parameter blacklist, 22 | sanitize the provided url. This includes removing blacklisted 23 | parameters, relative paths, and more. 24 | 25 | For more information on how and what we parse / sanitize: 26 | http://tools.ietf.org/html/rfc1808.html 27 | The more up-to-date RFC is this one: 28 | http://www.ietf.org/rfc/rfc3986.txt''' 29 | 30 | # Parse the url once it has been evaluated relative to a base 31 | p = urlparse.urlparse(urlparse.urljoin(base or '', url)) 32 | 33 | # And remove all the black-listed query parameters 34 | query = '&'.join(q for q in p.query.split('&') if q.partition('=')[0].lower() not in (param_blacklist or ())) 35 | query = re.sub(r'^&|&$', '', re.sub(r'&{2,}', '&', query)) 36 | # And remove all the black-listed param parameters 37 | params = ';'.join(q for q in p.params.split(';') if q.partition('=')[0].lower() not in (param_blacklist or ())) 38 | params = re.sub(r'^;|;$', '', re.sub(r';{2,}', ';', params)) 39 | 40 | # Remove double forward-slashes from the path 41 | path = re.sub(r'\/{2,}', '/', p.path) 42 | # With that done, go through and remove all the relative references 43 | unsplit = [] 44 | for part in path.split('/'): 45 | # If we encounter the parent directory, and there's 46 | # a segment to pop off, then we should pop it off. 47 | if part == '..' and (not unsplit or unsplit.pop() != None): 48 | pass 49 | elif part != '.': 50 | unsplit.append(part) 51 | 52 | # With all these pieces, assemble! 53 | path = '/'.join(unsplit) 54 | if p.netloc: 55 | path = path or '/' 56 | 57 | if isinstance(path, unicode): 58 | path = urllib.quote(urllib.unquote(path.encode('utf-8'))) 59 | else: 60 | path = urllib.quote(urllib.unquote(path)) 61 | 62 | return urlparse.urlunparse((p.scheme, re.sub(r'^\.+|\.+$', '', p.netloc.lower()), path, params, query, p.fragment)).replace(' ', '%20') 63 | 64 | @staticmethod 65 | def allowed(url, useragent, headers=None, meta_robots=None): 66 | '''Given a url, a useragent and optionally headers and a dictionary of 67 | meta robots key-value pairs, determine whether that url is allowed or not. 
68 | 69 | The headers must be a dictionary mapping of string key to list value: 70 | 71 | { 72 | 'Content-Type': ['...'], 73 | 'X-Powered-By': ['...', '...'] 74 | } 75 | 76 | The meta robots must be provided as a mapping of string key to directives: 77 | 78 | { 79 | 'robots': 'index, follow', 80 | 'foobot': 'index, nofollow' 81 | } 82 | ''' 83 | # First, check robots.txt 84 | r = reppy.findRobot(url) 85 | allowed = (r == None) or r.allowed(url, useragent) 86 | 87 | # Next, check the X-Robots-Tag 88 | # There can be multiple instances of X-Robots-Tag in the headers, so we've 89 | # joined them together with semicolons. Now, make a dictionary of each of 90 | # the directives we found. They can be specific, in which case it's in the 91 | # format: 92 | # botname : directive 93 | # In the absence of a botname, it applies to all bots. Strictly speaking, 94 | # there is one directive that itself has a value, but as we're not interested 95 | # in it (it's the unavailable_after directive), it's ok to ignore it. 96 | if headers: 97 | for bot in headers.get('x-robots-tag', []): 98 | botname, sep, directive = bot.partition(':') 99 | if directive and botname == useragent: 100 | # This is when it applies just to us 101 | allowed = allowed and ('noindex' not in directive) and ('none' not in directive) 102 | else: 103 | # This is when it applies to all bots 104 | allowed = allowed and ('noindex' not in botname) and ('none' not in botname) 105 | 106 | # Now check for specific and general meta tags 107 | # In this implementation, specific meta tags override general meta robots 108 | if meta_robots: 109 | s = meta_robots.get(useragent, '') + meta_robots.get('robots', '') 110 | allowed = allowed and ('noindex' not in s) and ('none' not in s) 111 | 112 | return allowed 113 | 114 | @staticmethod 115 | def relationship(frm, to): 116 | '''Determine the relationship of the link `to` found on url `frm`. 117 | For example, the relationship between these urls is `internal`: 118 | http://foo.com/bar 119 | http://foo.com/howdy 120 | 121 | And `external`: 122 | http://foo.com/bar 123 | http://bar.com/foo 124 | 125 | And `subdomain`: 126 | http://foo.com/ 127 | http://bar.foo.com/ 128 | ''' 129 | return 0 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # 3 | # Copyright (c) 2011 SEOmoz 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining 6 | # a copy of this software and associated documentation files (the 7 | # "Software"), to deal in the Software without restriction, including 8 | # without limitation the rights to use, copy, modify, merge, publish, 9 | # distribute, sublicense, and/or sell copies of the Software, and to 10 | # permit persons to whom the Software is furnished to do so, subject to 11 | # the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be 14 | # included in all copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | # NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 20 | # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 21 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 22 | # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | try: 25 | from setuptools import setup 26 | extra = { 27 | 'install_requires' : ['reppy', 'urllib3', 'requests', 'gevent'] 28 | } 29 | except ImportError: 30 | from distutils.core import setup 31 | extra = { 32 | 'requires' : ['reppy', 'urllib3', 'requests', 'gevent'] 33 | } 34 | 35 | setup(name = 'gcrawl', 36 | version = '0.1.0', 37 | description = 'Fetch urls. Faster.', 38 | url = 'http://github.com/seomoz/g-crawl-py', 39 | author = 'Dan Lecocq', 40 | author_email = 'dan@seomoz.org', 41 | license = 'MIT', 42 | packages = ['gcrawl'], 43 | classifiers = [ 44 | 'License :: OSI Approved :: MIT License', 45 | 'Programming Language :: Python', 46 | 'Intended Audience :: Developers', 47 | 'Operating System :: OS Independent', 48 | 'Topic :: Internet :: WWW/HTTP' 49 | ], 50 | **extra 51 | ) -------------------------------------------------------------------------------- /test/testAllowed.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import unittest 5 | from gcrawl.url import Url 6 | 7 | class TestAllowed(unittest.TestCase): 8 | # In these tests, we're going to assume that reppy's unit tests 9 | # are working properly 10 | def test_x_robots_header(self): 11 | examples = [ 12 | (['noindex'] , False), 13 | (['none'] , False), 14 | (['noindex,none'], False), 15 | (['index'] , True ), 16 | (['foobot:index'], True ), 17 | (['foobot:none' ], False), 18 | (['barbar:index'], True ), 19 | (['barbot:none' ], True ) 20 | ] 21 | for line in examples: 22 | e, result = line 23 | d = { 24 | 'x-robots-tag': e 25 | } 26 | self.assertEqual(Url.allowed('http://www.seomoz.org/', 'foobot', headers=d), result) 27 | 28 | def test_meta_robots(self): 29 | pass 30 | 31 | unittest.main() 32 | -------------------------------------------------------------------------------- /test/testSanitize.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import unittest 5 | from gcrawl.url import Url 6 | 7 | # An example of banned parameters, used for a few tests 8 | banned = [ 9 | 'foobar', 10 | 'widget', 11 | 'wassup' 12 | ] 13 | 14 | class TestParamSanitization(unittest.TestCase): 15 | ########################################################################### 16 | # Params 17 | # 18 | # In this context, params are urls with ';'-style parameters. Yes, they 19 | # really exist, and customers get pissed off when we don't honor them.
20 | ########################################################################### 21 | def test_pruning_with_other_params(self): 22 | '''Make sure we can strip out a single blacklisted param''' 23 | for b in banned: 24 | bad = 'http://testing.com/page;%s=foo;ok=foo' % b 25 | good = 'http://testing.com/page;ok=foo' 26 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 27 | 28 | def test_case_insensitivity_params(self): 29 | '''Make sure we can do it upper-cased''' 30 | for b in banned: 31 | bad = 'http://testing.com/page;%s=foo;ok=foo' % b.upper() 32 | good = 'http://testing.com/page;ok=foo' 33 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 34 | 35 | def test_all_together_params(self): 36 | '''And make sure we can remove all of the blacklisted query params''' 37 | params = ';'.join('%s=foo' % b for b in banned) 38 | bad = 'http://testing.com/page;%s' % params 39 | good = 'http://testing.com/page' 40 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 41 | 42 | def test_preserve_order_params(self): 43 | '''Make sure we keep it all in order''' 44 | for b in banned: 45 | bad = 'http://testing.com/page;hi=low;hello=goodbye;%s=foo;howdy=doodeedoo;whats=up' % b 46 | good = 'http://testing.com/page;hi=low;hello=goodbye;howdy=doodeedoo;whats=up' 47 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 48 | 49 | def test_pruning_alone_params(self): 50 | '''Make sure we don't include that ";"''' 51 | for b in banned: 52 | bad = 'http://testing.com/page;%s=foo' % b 53 | good = 'http://testing.com/page' 54 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 55 | 56 | def test_param_values_ok_params(self): 57 | '''Make sure we can include them as param values''' 58 | for b in banned: 59 | ok = 'http://testing.com/page;foo=%s;ok=foo' % b 60 | self.assertEqual(Url.sanitize(ok), ok) 61 | 62 | def test_prefix_param_ok_params(self): 63 | '''Make sure we can give each blacklisted param a prefix''' 64 | for b in banned: 65 | ok = 'http://testing.com/page;howdy_%s=foo;ok=foo' % b 66 | self.assertEqual(Url.sanitize(ok), ok) 67 | 68 | 69 | class TestQuerySanitization(unittest.TestCase): 70 | ########################################################################### 71 | # Query 72 | # 73 | # In this context, query strings are after a ? 
and are joined with ampersands 74 | # For more, 75 | # http://tools.ietf.org/html/rfc1808.html 76 | ########################################################################### 77 | def test_pruning_with_other_args(self): 78 | '''Make sure we can strip out a single blacklisted query''' 79 | for b in banned: 80 | bad = 'http://testing.com/page?%s=foo&ok=foo' % b 81 | good = 'http://testing.com/page?ok=foo' 82 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 83 | 84 | def test_case_insensitivity(self): 85 | '''Make sure we can do it upper-cased''' 86 | for b in banned: 87 | bad = 'http://testing.com/page?%s=foo&ok=foo' % b.upper() 88 | good = 'http://testing.com/page?ok=foo' 89 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 90 | 91 | def test_all_together(self): 92 | '''And make sure we can remove all of the blacklisted query params''' 93 | params = '&'.join('%s=foo' % b for b in banned) 94 | bad = 'http://testing.com/page?%s' % params 95 | good = 'http://testing.com/page' 96 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 97 | 98 | def test_preserve_order(self): 99 | '''Make sure we keep it all in order''' 100 | for b in banned: 101 | bad = 'http://testing.com/page?hi=low&hello=goodbye&%s=foo&howdy=doodeedoo&whats=up' % b 102 | good = 'http://testing.com/page?hi=low&hello=goodbye&howdy=doodeedoo&whats=up' 103 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 104 | 105 | def test_pruning_alone(self): 106 | '''Make sure we don't include that "?"''' 107 | for b in banned: 108 | bad = 'http://testing.com/page?%s=foo' % b 109 | good = 'http://testing.com/page' 110 | self.assertEqual(Url.sanitize(bad, param_blacklist=banned), good) 111 | 112 | def test_param_values_ok(self): 113 | '''Make sure we can include them as query values''' 114 | for b in banned: 115 | ok = 'http://testing.com/page?foo=%s&ok=foo' % b 116 | self.assertEqual(Url.sanitize(ok), ok) 117 | 118 | def test_prefix_param_ok(self): 119 | '''Make sure we can give each blacklisted param a prefix''' 120 | for b in banned: 121 | ok = 'http://testing.com/page?howdy_%s=foo&ok=foo' % b 122 | self.assertEqual(Url.sanitize(ok), ok) 123 | 124 | def test_multiple_ampersands(self): 125 | paths = [ 126 | ('howdy?&&' , 'howdy'), 127 | ('howdy?&&&foo=bar&&&' , 'howdy?foo=bar'), 128 | ('howdy;;;;foo=bar;' , 'howdy;foo=bar'), 129 | # These come from the prototype lsapi: https://github.com/seomoz/lsapi-prototype/blob/master/tests/test_convert_url.py 130 | # In query parameters, we should escape these characters 131 | #('?foo=\xe4\xb8\xad' , '?foo=%E4%B8%AD'), 132 | # But in a path, we should not 133 | #('\xe4\xb8\xad/bar.html', '\xe4\xb8\xadbar.html') 134 | ] 135 | 136 | base = 'http://testing.com/' 137 | for bad, clean in paths: 138 | self.assertEqual(Url.sanitize(base + bad), base + clean) 139 | 140 | class TestRelative(unittest.TestCase): 141 | ########################################################################### 142 | # Relative url support tests 143 | ########################################################################### 144 | def test_case_insensitivity(self): 145 | paths = [ 146 | ('www.TESTING.coM' , 'www.testing.com/'), 147 | ('WWW.testing.com' , 'www.testing.com/'), 148 | ('WWW.testing.COM/FOOBAR', 'www.testing.com/FOOBAR') 149 | ] 150 | 151 | for bad, clean in paths: 152 | self.assertEqual(Url.sanitize('http://' + bad), 'http://' + clean) 153 | 154 | def test_double_forward_slash(self): 155 | paths = [ 156 | ('howdy' , 'howdy'), 157 | ('hello//how//are' 
, 'hello/how/are'), 158 | ('hello/../how/are', 'how/are'), 159 | ('hello//..//how/' , 'how/'), 160 | ('a/b/../../c' , 'c'), 161 | ('../../../c' , 'c'), 162 | ('./hello' , 'hello'), 163 | ('./././hello' , 'hello'), 164 | ('a/b/c/' , 'a/b/c/') 165 | ] 166 | 167 | base = 'http://testing.com/' 168 | 169 | for bad, clean in paths: 170 | self.assertEqual(Url.sanitize(base + bad), base + clean) 171 | 172 | # This is the example from the wild that spawned this whole change 173 | bad = 'http://www.vagueetvent.com/../fonctions_pack/ajouter_pack_action.php?id_produit=26301' 174 | clean = 'http://www.vagueetvent.com/fonctions_pack/ajouter_pack_action.php?id_produit=26301' 175 | self.assertEqual(Url.sanitize(bad), clean) 176 | 177 | def test_insert_trailing_slash(self): 178 | # When dealing with a path-less url, we should insert a trailing slash. 179 | paths = [ 180 | ('foo.com?page=home', 'foo.com/?page=home'), 181 | ('foo.com' , 'foo.com/') 182 | ] 183 | 184 | for bad, clean in paths: 185 | self.assertEqual(Url.sanitize('http://' + bad), 'http://' + clean) 186 | 187 | 188 | 189 | 190 | class TestTheRest(unittest.TestCase): 191 | def test_join(self): 192 | # We should be able to join urls 193 | self.assertEqual(Url.sanitize('/foo', 'http://cnn.com'), 'http://cnn.com/foo') 194 | 195 | def test_escaping(self): 196 | paths = [ 197 | ('hello%20and%20how%20are%20you', 'hello%20and%20how%20are%20you'), 198 | ('danny\'s pub' , 'danny%27s%20pub'), 199 | ('danny%27s pub?foo=bar&yo' , 'danny%27s%20pub?foo=bar&yo') 200 | ] 201 | 202 | base = 'http://testing.com/' 203 | for bad, clean in paths: 204 | self.assertEqual(Url.sanitize(base + bad), base + clean) 205 | 206 | def test_wild(self): 207 | # These are some examples from the wild that have been seeming to fail 208 | # It apparently comes from the fact that the input is a unicode string, 209 | # and has disallowed character 210 | pairs = [ 211 | (u'http://www.jointingmortar.co.uk/rompox®-easy.html', 212 | 'http://www.jointingmortar.co.uk/rompox%C2%AE-easy.html'), 213 | (u'http://www.dinvard.se//index.php/result/type/owner/Stift Fonden för mindre arbetarbos/', 214 | 'http://www.dinvard.se/index.php/result/type/owner/Stift%20Fonden%20f%C3%B6r%20mindre%20arbetarbos/'), 215 | (u'http://www.ewaterways.com/cruises/all/alaska//ship/safari quest/itinerary/mexico\'s sea of cortés - aquarium of the world (8 days)/itinerary/', 216 | 'http://www.ewaterways.com/cruises/all/alaska/ship/safari%20quest/itinerary/mexico%27s%20sea%20of%20cort%C3%A9s%20-%20aquarium%20of%20the%20world%20%288%20days%29/itinerary/'), 217 | (u'http://www.mydeals.gr/prosfores/p/Υπόλοιπα%20Νησιά/', 218 | 'http://www.mydeals.gr/prosfores/p/%CE%A5%CF%80%CF%8C%CE%BB%CE%BF%CE%B9%CF%80%CE%B1%20%CE%9D%CE%B7%CF%83%CE%B9%CE%AC/') 219 | ] 220 | 221 | for bad, good in pairs: 222 | self.assertEqual(Url.sanitize(bad), good) 223 | 224 | unittest.main() 225 | --------------------------------------------------------------------------------
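A quick interpreter sketch of `Url.sanitize`, mirroring the tests above (the urls and the one-item blacklist are just example data; the expected outputs are the ones the tests assert):

    >>> from gcrawl.url import Url
    >>> Url.sanitize('/foo', 'http://cnn.com')
    'http://cnn.com/foo'
    >>> Url.sanitize('http://testing.com/page?widget=foo&ok=foo', param_blacklist=['widget'])
    'http://testing.com/page?ok=foo'
    >>> Url.sanitize('http://testing.com/a/b/../../c')
    'http://testing.com/c'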