├── .gitignore
├── CHANGES.txt
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── docs
│   ├── exporting_tc.png
│   └── saving_tc.png
├── requirements.txt
├── seleniumcrawler
│   ├── __init__.py
│   ├── config.py
│   ├── filesystem.py
│   ├── handle.py
│   ├── makeparsed.py
│   ├── sites
│   │   ├── __init__.py
│   │   ├── forbes
│   │   │   ├── __init__.py
│   │   │   ├── forbes.py
│   │   │   ├── forbes.tc
│   │   │   └── forbes_raw.py
│   │   ├── hackaday
│   │   │   ├── __init__.py
│   │   │   ├── hackaday.py
│   │   │   ├── hackaday.tc
│   │   │   └── hackaday_raw.py
│   │   ├── hnews
│   │   │   ├── __init__.py
│   │   │   ├── hnews.py
│   │   │   ├── hnews.tc
│   │   │   └── hnews_raw.py
│   │   └── reddit
│   │       ├── __init__.py
│   │       ├── reddit.py
│   │       ├── reddit.tc
│   │       └── reddit_raw.py
│   └── tests
│       ├── __init__.py
│       └── test_all.py
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | *.pyc
3 | chromedriver.log
4 | selenium_crawler.egg-info
5 |
--------------------------------------------------------------------------------
/CHANGES.txt:
--------------------------------------------------------------------------------
1 | v0.1.0, April 29 2013 -- Initial release.
2 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2013 Cory Walker
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include *.txt
2 | recursive-include docs *.txt
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | selenium-crawler 0.1.0
2 | ======================
3 |
4 | Have you ever needed to crawl a list of URLs that may or may not directly
5 | contain the content you so desperately crave? The web is full of links that do
6 | not behave, for any number of reasons. Here are just some of them:
7 |
8 | 1. The page is actually a landing page that links to the page you want:
9 | * Hacker News
10 | * Hack-a-day
11 | * Reddit
12 | 2. The page content is only made available after closing an ad:
13 | * Forbes
14 | 3. The content is behind some sort of login or paywall:
15 | * Boston Globe
16 | 4. One must click through some sort of pagination to find the content:
17 | * Web forums
18 |
19 | You might be asking: why use Selenium when you could use a combination of
20 | PhantomJS and BeautifulSoup to extract the needed data? That is a great way to
21 | accomplish some of the tasks listed above, but it has a couple of drawbacks:
22 |
23 | * Business teams often prefer working with a visual tool to writing lines of
24 |   code.
25 | * Selenium already has most of the code you would need built in.
26 |
27 | **Depending on Selenium DOES NOT mean that your crawling servers will need to
28 | also run a GUI. Selenium can run in a headless environment. See below for more
29 | information.**
30 |
31 | Quickstart
32 | ==========
33 |
34 | ```bash
35 | pip install -e git+https://github.com/cmwslw/selenium-crawler.git#egg=selenium-crawler
36 | ```
37 |
38 | ```python
39 | from seleniumcrawler import handle_url
40 | print handle_url('https://news.ycombinator.com/item?id=5626377')
41 | ```
42 |
43 | This will open up a browser window, 'click' on the main link, and load the
44 | article. It will then print something like the following:
45 |
46 | ```python
47 | {
48 | 'url': 'http://googleblog.blogspot.com/2013/04/google-now-on-your-iphone-and-ipad-with.html',
49 | 'source': {{HTMLSOURCE}},
50 | 'handler': 'hnews'
51 | }
52 | ```
53 |
54 | Where `{{HTMLSOURCE}}` is the actual HTML of the article.
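
If no pattern in the crawler's site list matches the URL, `handle_url` raises a
`HandlerError`. Here is a small sketch of crawling a list of URLs (the second
URL is a made-up example of an unsupported site):

```python
from seleniumcrawler import handle_url
from seleniumcrawler.handle import HandlerError

urls = [
    'https://news.ycombinator.com/item?id=5626377',
    'http://example.com/unsupported-site',
]
for url in urls:
    try:
        result = handle_url(url)
        print 'Fetched %s via the %s handler' % (result['url'], result['handler'])
    except HandlerError:
        print 'No handler defined for %s' % url
```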
55 |
56 | Creating test cases
57 | ===================
58 |
59 | Create test cases as you normally would in Selenium, but be sure to take the
60 | following things into account:
61 |
62 | ### Never make a test case article-specific
63 | By default, when you record a Selenium test case, the very first instruction
64 | tells the browser to load a specific URL: the article you happened to be
65 | viewing while recording. Since you are designing scripts that should work for
66 | ANY article on the site, make sure you always remove this line (see the excerpt
67 | below). I might make selenium-crawler take care of this step in the future.
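
A hypothetical excerpt of a fresh WebDriver export (the method name and locator
are illustrative); the `driver.get` line is the one to delete:

```python
def test_hnews_tc(self):
    driver = self.driver
    # Remove this line -- it pins the test case to one specific article:
    driver.get(self.base_url + "/item?id=5626377")
    driver.find_element_by_css_selector("td.title > a").click()
```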
68 |
69 | ### Make your locators as robust as possible
70 | On the one hand, make your locators specific enough that you can be confident
71 | they will only match exactly what needs to be matched. On the other hand, make
72 | sure they will continue to match through a website theme change or even a
73 | redesign. It is impossible to account for every possible change in site
74 | structure, but you get the idea. By default, Selenium tends to record very
75 | specific locators. Fix these up a bit, as in the example below.
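
For example (illustrative XPaths, not taken from a real recording):

```python
# Brittle: tied to the exact DOM position at recording time.
driver.find_element_by_xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a").click()
# More robust: anchored to a meaningful class name that is more likely
# to survive a theme change.
driver.find_element_by_xpath("//td[@class='title']/a").click()
```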
76 |
77 | ### Save test cases in a standard way
78 | While it doesn't matter much where saved test case files go, for reference and
79 | debugging purposes it is useful to keep them in a standard location. For the
80 | time being, I have established the form `sites/{name}/{name}.tc`, e.g.
81 | `sites/hnews/hnews.tc`. Hopefully this will prove to be a decent convention.
82 |
83 | Exporting test cases
84 | ====================
85 |
86 | Selenium-crawler takes exported Selenium Python scripts as input. Follow these
87 | instructions to obtain the exported scripts. Be sure to use the WebDriver
88 | backend.
89 |
90 | 
91 |
92 | Next, we need to save the script in a place where selenium-crawler can find it.
93 | Find the `sites` directory and create a subdirectory for the site, choosing a
94 | useful, concise name. Save the exported script there as `{name}_raw.py`, e.g.
95 | `sites/hnews/hnews_raw.py`:
96 |
97 | 
98 |
99 | Parsing Selenium cases
100 | ======================
101 |
102 | The test cases that Selenium exports are not even valid Python code. Here is an
103 | example:
104 |
105 | ```python
106 | self.accept_next_alert = true
107 | ```
108 |
109 | Even once you fix the syntax errors, they are useful for writing test suites,
110 | but not for writing crawlers. Running `python makeparsed.py` takes all these
111 | Selenium test cases and converts them into Python code usable for crawling. Here
112 | is an example that clicks through an annoying JavaScript ad on Forbes pages:
113 |
114 | ```python
115 | from selenium import webdriver
116 |
117 | def handle_link(link):
118 | driver = webdriver.Firefox()
119 | driver.implicitly_wait(30)
120 | driver.get(link)
121 |
122 | driver.find_element_by_xpath("//div[@class='header']/div[@class='continue']/a").click()
123 |
124 | results = {
125 | 'url': driver.current_url,
126 | 'source': driver.page_source
127 | }
128 | driver.quit()
129 |
130 | return results
131 | ```
132 |
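For reference, only the `find_element_by_xpath(...).click()` line above is
lifted from the raw export; the surrounding boilerplate comes from the template
in `makeparsed.py` (which, in the current version, also sets up a Firefox
profile that disables CSS, images, and Flash; that part is trimmed here for
brevity). A hypothetical excerpt of the `forbes_raw.py` method it would be
extracted from (the method name is illustrative):

```python
    def test_forbes_t_c(self):
        driver = self.driver
        driver.find_element_by_xpath("//div[@class='header']/div[@class='continue']/a").click()
```
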
133 | So go ahead. Run `python makeparsed.py`. You should see output similar to the
134 | following:
135 |
136 | ```
137 | Parsed ./sites/forbes/forbes_raw.py.
138 | Parsed ./sites/hnews/hnews_raw.py.
139 | Parsed ./sites/reddit/reddit_raw.py.
140 | ```
141 |
142 | Don't worry if the paths are different for your installation. Keep in mind that
143 | `makeparsed.py` only needs to be run when site scripts have been changed or
144 | added.
145 |
146 | Headless configuration
147 | ======================
148 |
149 | Running headless means that no actual GUI will be running on a monitor during
150 | use. Put simply, no browser window will pop up when handling a URL. One way to
151 | run headless is with Xvfb, a tool that sets up virtual framebuffers. Run this
152 | before using selenium-crawler:
153 |
154 | ```bash
155 | sh -e /etc/init.d/xvfb start
156 | export DISPLAY=:99.0
157 | ```
158 |
159 | This is the method that CI services like Travis CI and CircleCI recommend.
160 | There are other ways to run Selenium in a headless environment; one of them is
161 | sketched below, and a quick Google search will turn up more.
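
For instance, the third-party `pyvirtualdisplay` package wraps Xvfb in Python.
A minimal sketch, assuming you have installed both `pyvirtualdisplay` and Xvfb
(neither is a dependency of selenium-crawler):

```python
from pyvirtualdisplay import Display
from seleniumcrawler import handle_url

# Start a virtual framebuffer so Firefox has a display to render into.
display = Display(visible=0, size=(1024, 768))
display.start()
try:
    print handle_url('https://news.ycombinator.com/item?id=5626377')
finally:
    display.stop()
```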
162 |
163 | Contributing
164 | ============
165 |
166 | Contributing is easy. If you write any new site handling scripts, just be sure
167 | to follow the guide above and write a quick test for them in `test_all.py` (a
168 | sketch follows). Send in a pull request and I'll get a `CONTRIBUTORS.txt` file going.
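
A minimal sketch of such a test (the site name, URL, and assertions are
hypothetical; it assumes a matching entry has been added to `config.py`):

```python
import unittest

from seleniumcrawler import handle_url

class TestMySite(unittest.TestCase):
    def test_mysite(self):
        # Assumes a 'mysite' handler and URL regex have been registered.
        result = handle_url('http://mysite.com/some-article')
        self.assertEqual(result['handler'], 'mysite')
        self.assertTrue('<html' in result['source'].lower())

if __name__ == '__main__':
    unittest.main()
```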
169 |
170 |
--------------------------------------------------------------------------------
/docs/exporting_tc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/corywalker/selenium-crawler/e49df4c1f40a330af19d90b7c59d49ac1acee86c/docs/exporting_tc.png
--------------------------------------------------------------------------------
/docs/saving_tc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/corywalker/selenium-crawler/e49df4c1f40a330af19d90b7c59d49ac1acee86c/docs/saving_tc.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | selenium==2.32.0
2 |
--------------------------------------------------------------------------------
/seleniumcrawler/__init__.py:
--------------------------------------------------------------------------------
1 | from seleniumcrawler.handle import handle_url
2 |
--------------------------------------------------------------------------------
/seleniumcrawler/config.py:
--------------------------------------------------------------------------------
1 | # Maps each handler name to a regex that handle_url matches (via re.match)
2 | # against incoming URLs. To support a new site, add an entry here and create
3 | # a matching seleniumcrawler/sites/{name}/ package.
4 | sites_dict = {
5 |     'forbes': r'(.*)forbes\.com(.*)',
6 |     'hnews': r'(.*)news\.ycombinator\.com(.*)',
7 |     'reddit': r'(.*)reddit\.com(.*)',
8 |     'hackaday': r'(.*)hackaday\.com(.*)',
9 | }
10 |
--------------------------------------------------------------------------------
/seleniumcrawler/filesystem.py:
--------------------------------------------------------------------------------
1 | # Global modules
2 | import os
3 | # Local modules
4 | from seleniumcrawler.config import sites_dict
5 |
6 | THIS_DIR = os.path.dirname(os.path.realpath(__file__))
7 | SITES_DIR = os.path.join(THIS_DIR, 'sites')
8 | def locate_sites():
9 | location_list = []
10 |     for site in sites_dict:
11 | this_site_dir = os.path.join(SITES_DIR, site)
12 | # This is only the EXPECTED script name.
13 | # All scripts should follow this convention.
14 | script_name = site + '_raw.py'
15 | script_path = os.path.join(this_site_dir, script_name)
16 | config_path = os.path.join(this_site_dir, 'config.py')
17 | location_dict = {
18 | 'name': site,
19 | 'script_path': script_path,
20 | 'config_path': config_path,
21 | 'site_dir': this_site_dir
22 | }
23 | location_list.append(location_dict)
24 |
25 | return location_list
26 |
--------------------------------------------------------------------------------
/seleniumcrawler/handle.py:
--------------------------------------------------------------------------------
1 | # Global modules
2 | import re
3 | # Local modules
4 | from seleniumcrawler.config import sites_dict
5 |
6 | class HandlerError(Exception):
7 | def __init__(self, value):
8 | self.value = value
9 | def __str__(self):
10 | return repr(self.value)
11 |
12 | def import_from(module, name):
13 | module = __import__(module, fromlist=[name])
14 | return getattr(module, name)
15 |
16 | def handle_url(url):
17 | for site, regex in sites_dict.items():
18 | if re.match(regex, url):
19 | handler = import_from('seleniumcrawler.sites.%s.%s' % (site, site), 'handle_link')
20 | result = handler(url)
21 | result['handler'] = site
22 | return result
23 | raise HandlerError('Handler for URL not defined')
24 |
--------------------------------------------------------------------------------
/seleniumcrawler/makeparsed.py:
--------------------------------------------------------------------------------
1 | # Global modules
2 | import os
3 | # Third party modules
4 | import pystache
5 | # Local modules
6 | from seleniumcrawler.filesystem import locate_sites
7 |
8 | template = '''from selenium import webdriver
9 | from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
10 |
11 | def get_profile():
12 | # get the Firefox profile object
13 | firefoxProfile = FirefoxProfile()
14 | # Disable CSS
15 | firefoxProfile.set_preference('permissions.default.stylesheet', 2)
16 | # Disable images
17 | firefoxProfile.set_preference('permissions.default.image', 2)
18 | # Disable Flash
19 | firefoxProfile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so',
20 | 'false')
21 |
22 | return firefoxProfile
23 |
24 | def handle_link(link):
25 | profile = get_profile()
26 |
27 | # Set the modified profile while creating the browser object
28 | driver = webdriver.Firefox(profile)
29 | driver.implicitly_wait(30)
30 | driver.get(link)
31 |
32 | {{{code}}}
33 | results = {
34 | 'url': driver.current_url,
35 | 'source': driver.page_source
36 | }
37 | driver.quit()
38 |
39 | return results
40 | '''
41 |
42 | def parse_raw_script(name, directory, path):
43 |     f = open(path)
44 |
45 |     # Collect the body of the exported 'def test_...' method, dedented one
46 |     # indentation level.
47 |     codelines = []
48 |     at_main_code = False
49 |     for line in f:
50 |         if at_main_code:
51 |             codelines.append(line[4:])
52 |         if line.startswith('    def test_'):
53 |             at_main_code = True
54 |         elif line.startswith('    def') and at_main_code:
55 |             # Trim the leading 'driver = self.driver' line plus the trailing
56 |             # blank line and the next method's 'def' line, all of which were
57 |             # captured along with the body.
58 |             codelines = codelines[1:-2]
59 |             at_main_code = False
60 |             break
61 |     f.close()
62 |
63 |     # Render the crawler module from the template and write it next to the
64 |     # raw script as {name}.py.
65 |     fout = open(os.path.join(directory, name + '.py'), 'w')
66 |     code = ''.join(codelines)
67 |     data = {'code': code}
68 |     rendered = pystache.render(template, data)
69 |     fout.write(rendered)
70 |     fout.close()
71 |
72 | for site in locate_sites():
73 |     parse_raw_script(site['name'], site['site_dir'], site['script_path'])
74 |     print "Parsed %s." % site['script_path']
75 |
--------------------------------------------------------------------------------
/seleniumcrawler/sites/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/corywalker/selenium-crawler/e49df4c1f40a330af19d90b7c59d49ac1acee86c/seleniumcrawler/sites/__init__.py
--------------------------------------------------------------------------------
/seleniumcrawler/sites/forbes/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/corywalker/selenium-crawler/e49df4c1f40a330af19d90b7c59d49ac1acee86c/seleniumcrawler/sites/forbes/__init__.py
--------------------------------------------------------------------------------
/seleniumcrawler/sites/forbes/forbes.py:
--------------------------------------------------------------------------------
1 | from selenium import webdriver
2 | from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
3 |
4 | def get_profile():
5 | # get the Firefox profile object
6 | firefoxProfile = FirefoxProfile()
7 | # Disable CSS
8 | firefoxProfile.set_preference('permissions.default.stylesheet', 2)
9 | # Disable images
10 | firefoxProfile.set_preference('permissions.default.image', 2)
11 | # Disable Flash
12 | firefoxProfile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so',
13 | 'false')
14 |
15 | return firefoxProfile
16 |
17 | def handle_link(link):
18 | profile = get_profile()
19 |
20 | # Set the modified profile while creating the browser object
21 | driver = webdriver.Firefox(profile)
22 | driver.implicitly_wait(30)
23 | driver.get(link)
24 |
25 | driver.find_element_by_xpath("//div[@class='header']/div[@class='continue']/a").click()
26 |
27 | results = {
28 | 'url': driver.current_url,
29 | 'source': driver.page_source
30 | }
31 | driver.quit()
32 |
33 | return results
34 |
--------------------------------------------------------------------------------
/seleniumcrawler/sites/forbes/forbes.tc:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |