├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── debian
│   ├── changelog
│   ├── compat
│   ├── control
│   ├── copyright
│   ├── examples
│   ├── lintian-overrides
│   ├── rules
│   └── source
│       ├── format
│       └── options
├── docs
│   ├── .gitignore
│   ├── Makefile
│   ├── apidoc.rst
│   ├── conf.py
│   ├── index.rst
│   ├── installation.rst
│   └── usage.rst
├── dryscrape
│   ├── __init__.py
│   ├── driver
│   │   ├── __init__.py
│   │   └── webkit.py
│   ├── mixins.py
│   ├── session.py
│   └── xvfb.py
├── examples
│   └── google.py
├── requirements.txt
└── setup.py
/.gitignore: -------------------------------------------------------------------------------- 1 | *.png 2 | *.pyc 3 | *~ 4 | MANIFEST 5 | /dist 6 | /build 7 | 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) Niklas Baumstark 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
20 | 21 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE README.md 2 | recursive-include examples *.py 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | **NOTE: This package is not actively maintained. It uses QtWebkit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like [Spynner](https://github.com/makinacorpus/spynner) instead.** 3 | 4 | 5 | # Overview 6 | 7 | **Author:** Niklas Baumstark 8 | 9 | 10 | 11 | dryscrape is a lightweight web scraping library for Python. It uses a 12 | headless Webkit instance to evaluate Javascript on the visited pages. This 13 | enables painless scraping of plain web pages as well as Javascript-heavy 14 | “Web 2.0” applications like 15 | Facebook. 16 | 17 | It is built on the shoulders of 18 | [capybara-webkit](https://github.com/thoughtbot/capybara-webkit)'s 19 | [webkit-server](https://github.com/niklasb/webkit-server). A big thanks goes 20 | to thoughtbot, inc. for building this excellent piece of software! 21 | 22 | # Changelog 23 | 24 | * 1.0: Added Python 3 support, small performance fixes, header names are now 25 | properly normalized. Also added the function `dryscrape.start_xvfb()` to 26 | easily start Xvfb. 27 | * 0.9.1: Changed semantics of the `headers` function in 28 | a backwards-incompatible way: It now returns a list of (key, value) 29 | pairs instead of a dictionary. 30 | 31 | # Supported Platforms 32 | 33 | The library has been confirmed to work on the following platforms: 34 | 35 | * Mac OS X 10.9 Mavericks and 10.10 Yosemite 36 | * Ubuntu Linux 37 | * Arch Linux 38 | 39 | Other unixoid systems should work just fine. 
40 | 41 | Windows is not officially supported, although dryscrape should work 42 | with [cygwin](https://www.cygwin.com/). 43 | 44 | ### A word about Qt 5.6 45 | 46 | The 5.6 version of Qt removes the Qt WebKit module in favor of the new module Qt WebEngine. So far webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt <= 5.5 is a requirement. 47 | 48 | # Installation, Usage, API Docs 49 | 50 | Documentation can be found at 51 | [dryscrape's ReadTheDocs page](http://dryscrape.readthedocs.io/). 52 | 53 | Quick installation instructions for Ubuntu: 54 | 55 | # apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb 56 | # pip install dryscrape 57 | 58 | # Contact, Bugs, Contributions 59 | 60 | If you have any problems with this software, don't hesitate to open an 61 | issue on [Github](https://github.com/niklasb/dryscrape), open a pull 62 | request, or write a mail to **niklas baumstark at Gmail**. 63 | -------------------------------------------------------------------------------- /debian/changelog: -------------------------------------------------------------------------------- 1 | dryscrape (1.0-1) unstable; urgency=low 2 | 3 | * Initial import 4 | 5 | -- Niklas Baumstark Wed, 23 Sep 2015 13:00:10 +0000 6 | -------------------------------------------------------------------------------- /debian/compat: -------------------------------------------------------------------------------- 1 | 7 2 | -------------------------------------------------------------------------------- /debian/control: -------------------------------------------------------------------------------- 1 | Source: dryscrape 2 | Maintainer: Niklas Baumstark 3 | Section: python 4 | Priority: optional 5 | Build-Depends: python-all (>= 2.6.6-3), debhelper (>= 7) 6 | Standards-Version: 3.9.1 7 | 8 | Package: python-dryscrape 9 | Architecture: all 10 | XB-Python-Version: ${python:Versions} 11 | Depends: ${misc:Depends}, 
${python:Depends}, python-webkit-server, python-lxml, python-xvfbwrapper 12 | Provides: ${python:Provides} 13 | Description: a lightweight Javascript-aware, headless web scraping library 14 | dryscrape is a lightweight web scraping library for Python. 15 | It uses a headless Webkit instance to evaluate Javascript on the visited pages. 16 | This enables painless scraping of plain web pages as well 17 | as Javascript-heavy "Web 2.0" applications like Facebook. 18 | -------------------------------------------------------------------------------- /debian/copyright: -------------------------------------------------------------------------------- 1 | Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ 2 | Upstream-Name: dryscrape 3 | Source: https://github.com/niklasb/dryscrape 4 | 5 | Files: * 6 | Copyright: Copyright (c) 2012 Niklas Baumstark 7 | License: MIT 8 | For details see http://opensource.org/licenses/MIT. 9 | -------------------------------------------------------------------------------- /debian/examples: -------------------------------------------------------------------------------- 1 | examples/* 2 | -------------------------------------------------------------------------------- /debian/lintian-overrides: -------------------------------------------------------------------------------- 1 | python-dryscrape binary: description-synopsis-starts-with-article 2 | python-dryscrape binary: new-package-should-close-itp-bug 3 | 4 | -------------------------------------------------------------------------------- /debian/rules: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -f 2 | 3 | %: 4 | dh $@ --with python2 --buildsystem=python_distutils 5 | 6 | 7 | -------------------------------------------------------------------------------- /debian/source/format: -------------------------------------------------------------------------------- 1 | 3.0 (quilt) 2 | 
-------------------------------------------------------------------------------- /debian/source/options: -------------------------------------------------------------------------------- 1 | extend-diff-ignore="\.egg-info" -------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | /_* 2 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # Internal variables. 11 | PAPEROPT_a4 = -D latex_paper_size=a4 12 | PAPEROPT_letter = -D latex_paper_size=letter 13 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 14 | 15 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest 16 | 17 | help: 18 | @echo "Please use \`make <target>' where <target> is one of" 19 | @echo " html to make standalone HTML files" 20 | @echo " dirhtml to make HTML files named index.html in directories" 21 | @echo " singlehtml to make a single large HTML file" 22 | @echo " pickle to make pickle files" 23 | @echo " json to make JSON files" 24 | @echo " htmlhelp to make HTML files and a HTML help project" 25 | @echo " qthelp to make HTML files and a qthelp project" 26 | @echo " devhelp to make HTML files and a Devhelp project" 27 | @echo " epub to make an epub" 28 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 29 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 30 | @echo " text to make text files" 31 | @echo " man to make manual pages" 32 | @echo " changes to make an overview of all changed/added/deprecated items" 33 | @echo
" linkcheck to check all external links for integrity" 34 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 35 | 36 | clean: 37 | -rm -rf $(BUILDDIR)/* 38 | 39 | html: 40 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 41 | @echo 42 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 43 | 44 | dirhtml: 45 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 46 | @echo 47 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 48 | 49 | singlehtml: 50 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 51 | @echo 52 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 53 | 54 | pickle: 55 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 56 | @echo 57 | @echo "Build finished; now you can process the pickle files." 58 | 59 | json: 60 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 61 | @echo 62 | @echo "Build finished; now you can process the JSON files." 63 | 64 | htmlhelp: 65 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 66 | @echo 67 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 68 | ".hhp project file in $(BUILDDIR)/htmlhelp." 69 | 70 | qthelp: 71 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 72 | @echo 73 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 74 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 75 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/dryscrape.qhcp" 76 | @echo "To view the help file:" 77 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/dryscrape.qhc" 78 | 79 | devhelp: 80 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 81 | @echo 82 | @echo "Build finished." 
83 | @echo "To view the help file:" 84 | @echo "# mkdir -p $$HOME/.local/share/devhelp/dryscrape" 85 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/dryscrape" 86 | @echo "# devhelp" 87 | 88 | epub: 89 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 90 | @echo 91 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 92 | 93 | latex: 94 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 95 | @echo 96 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 97 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 98 | "(use \`make latexpdf' here to do that automatically)." 99 | 100 | latexpdf: 101 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 102 | @echo "Running LaTeX files through pdflatex..." 103 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 104 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 105 | 106 | text: 107 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 108 | @echo 109 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 110 | 111 | man: 112 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 113 | @echo 114 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 115 | 116 | changes: 117 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 118 | @echo 119 | @echo "The overview file is in $(BUILDDIR)/changes." 120 | 121 | linkcheck: 122 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 123 | @echo 124 | @echo "Link check complete; look for any errors in the above output " \ 125 | "or in $(BUILDDIR)/linkcheck/output.txt." 126 | 127 | doctest: 128 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 129 | @echo "Testing of doctests in the sources finished, look at the " \ 130 | "results in $(BUILDDIR)/doctest/output.txt." 
131 | -------------------------------------------------------------------------------- /docs/apidoc.rst: -------------------------------------------------------------------------------- 1 | API Documentation 2 | ================= 3 | 4 | This documentation also contains the API docs for the ``webkit_server`` 5 | module, for convenience (and because I am too lazy to set up dedicated docs 6 | for it). 7 | 8 | Overview 9 | ---------- 10 | 11 | .. inheritance-diagram:: dryscrape.session 12 | dryscrape.mixins 13 | dryscrape.driver.webkit 14 | webkit_server 15 | 16 | Module :mod:`dryscrape.session` 17 | ------------------------------- 18 | 19 | .. automodule:: dryscrape.session 20 | :members: 21 | 22 | Module :mod:`dryscrape.mixins` 23 | ------------------------------- 24 | 25 | .. automodule:: dryscrape.mixins 26 | :members: 27 | 28 | Module :mod:`dryscrape.driver.webkit` 29 | ------------------------------------- 30 | 31 | .. automodule:: dryscrape.driver.webkit 32 | :members: 33 | 34 | Module :mod:`webkit_server` 35 | ------------------------------- 36 | 37 | .. automodule:: webkit_server 38 | :members: 39 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # dryscrape documentation build configuration file, created by 4 | # sphinx-quickstart on Thu Jan 12 15:55:25 2012. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 
13 | 14 | import sys, os 15 | 16 | class Mock(object): 17 | def __init__(self, *args, **kwargs): 18 | pass 19 | 20 | def __call__(self, *args, **kwargs): 21 | return Mock() 22 | 23 | @classmethod 24 | def __getattr__(self, name): 25 | if name in ('__file__', '__path__'): 26 | return '/dev/null' 27 | elif name[0].upper() == name[0]: 28 | return type(name, (), {}) 29 | else: 30 | return Mock() 31 | 32 | # mock some modules... 33 | MOCK_MODULES = [] 34 | for mod_name in MOCK_MODULES: 35 | sys.modules[mod_name] = Mock() 36 | 37 | # If extensions (or modules to document with autodoc) are in another directory, 38 | # add these directories to sys.path here. If the directory is relative to the 39 | # documentation root, use os.path.abspath to make it absolute, like shown here. 40 | sys.path.insert(0, os.path.abspath('..')) 41 | 42 | # -- General configuration ----------------------------------------------------- 43 | 44 | # If your documentation needs a minimal Sphinx version, state it here. 45 | #needs_sphinx = '1.0' 46 | 47 | # Add any Sphinx extension module names here, as strings. They can be extensions 48 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 49 | extensions = ['sphinx.ext.autodoc', 50 | 'sphinx.ext.viewcode', 51 | 'sphinx.ext.graphviz', 52 | 'sphinx.ext.inheritance_diagram'] 53 | 54 | # autodoc config 55 | autodoc_default_flags = ['show-inheritance'] 56 | 57 | # Add any paths that contain templates here, relative to this directory. 58 | templates_path = ['_templates'] 59 | 60 | # The suffix of source filenames. 61 | source_suffix = '.rst' 62 | 63 | # The encoding of source files. 64 | #source_encoding = 'utf-8-sig' 65 | 66 | # The master toctree document. 67 | master_doc = 'index' 68 | 69 | # General information about the project. 
70 | project = u'dryscrape' 71 | copyright = u'2012, Niklas Baumstark' 72 | 73 | # The version info for the project you're documenting, acts as replacement for 74 | # |version| and |release|, also used in various other places throughout the 75 | # built documents. 76 | # 77 | # The short X.Y version. 78 | version = '1.0' 79 | # The full version, including alpha/beta/rc tags. 80 | release = '1.0.1' 81 | 82 | # The language for content autogenerated by Sphinx. Refer to documentation 83 | # for a list of supported languages. 84 | #language = None 85 | 86 | # There are two options for replacing |today|: either, you set today to some 87 | # non-false value, then it is used: 88 | #today = '' 89 | # Else, today_fmt is used as the format for a strftime call. 90 | #today_fmt = '%B %d, %Y' 91 | 92 | # List of patterns, relative to source directory, that match files and 93 | # directories to ignore when looking for source files. 94 | exclude_patterns = ['_build'] 95 | 96 | # The reST default role (used for this markup: `text`) to use for all documents. 97 | #default_role = None 98 | 99 | # If true, '()' will be appended to :func: etc. cross-reference text. 100 | #add_function_parentheses = True 101 | 102 | # If true, the current module name will be prepended to all description 103 | # unit titles (such as .. function::). 104 | #add_module_names = True 105 | 106 | # If true, sectionauthor and moduleauthor directives will be shown in the 107 | # output. They are ignored by default. 108 | #show_authors = False 109 | 110 | # The name of the Pygments (syntax highlighting) style to use. 111 | pygments_style = 'sphinx' 112 | 113 | # A list of ignored prefixes for module index sorting. 114 | #modindex_common_prefix = [] 115 | 116 | 117 | # -- Options for HTML output --------------------------------------------------- 118 | 119 | # The theme to use for HTML and HTML Help pages. See the documentation for 120 | # a list of builtin themes. 
121 | html_theme = 'default' 122 | 123 | # Theme options are theme-specific and customize the look and feel of a theme 124 | # further. For a list of options available for each theme, see the 125 | # documentation. 126 | #html_theme_options = {} 127 | 128 | # Add any paths that contain custom themes here, relative to this directory. 129 | #html_theme_path = [] 130 | 131 | # The name for this set of Sphinx documents. If None, it defaults to 132 | # " v documentation". 133 | #html_title = None 134 | 135 | # A shorter title for the navigation bar. Default is the same as html_title. 136 | #html_short_title = None 137 | 138 | # The name of an image file (relative to this directory) to place at the top 139 | # of the sidebar. 140 | #html_logo = None 141 | 142 | # The name of an image file (within the static path) to use as favicon of the 143 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 144 | # pixels large. 145 | #html_favicon = None 146 | 147 | # Add any paths that contain custom static files (such as style sheets) here, 148 | # relative to this directory. They are copied after the builtin static files, 149 | # so a file named "default.css" will overwrite the builtin "default.css". 150 | html_static_path = ['_static'] 151 | 152 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 153 | # using the given strftime format. 154 | #html_last_updated_fmt = '%b %d, %Y' 155 | 156 | # If true, SmartyPants will be used to convert quotes and dashes to 157 | # typographically correct entities. 158 | #html_use_smartypants = True 159 | 160 | # Custom sidebar templates, maps document names to template names. 161 | #html_sidebars = {} 162 | 163 | # Additional templates that should be rendered to pages, maps page names to 164 | # template names. 165 | #html_additional_pages = {} 166 | 167 | # If false, no module index is generated. 168 | #html_domain_indices = True 169 | 170 | # If false, no index is generated. 
171 | #html_use_index = True 172 | 173 | # If true, the index is split into individual pages for each letter. 174 | #html_split_index = False 175 | 176 | # If true, links to the reST sources are added to the pages. 177 | #html_show_sourcelink = True 178 | 179 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 180 | #html_show_sphinx = True 181 | 182 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 183 | #html_show_copyright = True 184 | 185 | # If true, an OpenSearch description file will be output, and all pages will 186 | # contain a tag referring to it. The value of this option must be the 187 | # base URL from which the finished HTML is served. 188 | #html_use_opensearch = '' 189 | 190 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 191 | #html_file_suffix = None 192 | 193 | # Output file base name for HTML help builder. 194 | htmlhelp_basename = 'dryscrapedoc' 195 | 196 | 197 | # -- Options for LaTeX output -------------------------------------------------- 198 | 199 | # The paper size ('letter' or 'a4'). 200 | #latex_paper_size = 'letter' 201 | 202 | # The font size ('10pt', '11pt' or '12pt'). 203 | #latex_font_size = '10pt' 204 | 205 | # Grouping the document tree into LaTeX files. List of tuples 206 | # (source start file, target name, title, author, documentclass [howto/manual]). 207 | latex_documents = [ 208 | ('index', 'dryscrape.tex', u'dryscrape Documentation', 209 | u'Niklas Baumstark', 'manual'), 210 | ] 211 | 212 | # The name of an image file (relative to this directory) to place at the top of 213 | # the title page. 214 | #latex_logo = None 215 | 216 | # For "manual" documents, if this is true, then toplevel headings are parts, 217 | # not chapters. 218 | #latex_use_parts = False 219 | 220 | # If true, show page references after internal links. 221 | #latex_show_pagerefs = False 222 | 223 | # If true, show URL addresses after external links. 
224 | #latex_show_urls = False 225 | 226 | # Additional stuff for the LaTeX preamble. 227 | #latex_preamble = '' 228 | 229 | # Documents to append as an appendix to all manuals. 230 | #latex_appendices = [] 231 | 232 | # If false, no module index is generated. 233 | #latex_domain_indices = True 234 | 235 | 236 | # -- Options for manual page output -------------------------------------------- 237 | 238 | # One entry per manual page. List of tuples 239 | # (source start file, name, description, authors, manual section). 240 | man_pages = [ 241 | ('index', 'dryscrape', u'dryscrape Documentation', 242 | [u'Niklas Baumstark'], 1) 243 | ] 244 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | Welcome to dryscrape's documentation! 2 | ==================================== 3 | 4 | dryscrape_ is a lightweight web scraping library for Python. It uses a 5 | headless Webkit instance to evaluate Javascript on the visited pages. This 6 | enables painless scraping of plain web pages as well as Javascript-heavy 7 | “Web 2.0” applications like 8 | Facebook. 9 | 10 | It is built on the shoulders of capybara-webkit_'s webkit-server_. 11 | A big thanks goes to thoughtbot, inc. for building this excellent 12 | piece of software! 13 | 14 | .. _dryscrape: https://github.com/niklasb/dryscrape 15 | .. _capybara-webkit: https://github.com/thoughtbot/capybara-webkit 16 | .. _webkit-server: https://github.com/niklasb/webkit-server 17 | 18 | Contents 19 | ---------- 20 | 21 | .. 
toctree:: 22 | :maxdepth: 2 23 | 24 | installation 25 | usage 26 | apidoc 27 | 28 | Indices and tables 29 | ================== 30 | 31 | * :ref:`genindex` 32 | * :ref:`modindex` 33 | * :ref:`search` 34 | 35 | -------------------------------------------------------------------------------- /docs/installation.rst: -------------------------------------------------------------------------------- 1 | .. highlight:: none 2 | 3 | Installation 4 | ============ 5 | 6 | Prerequisites 7 | ------------- 8 | 9 | Before installing dryscrape_, you need to install some software it depends on: 10 | 11 | * Qt_, QtWebKit_ 12 | * lxml_ 13 | * pip_ 14 | * xvfb_ (necessary only if no other X server is available) 15 | 16 | On Ubuntu you can do that with one command (the ``#`` indicates that you need 17 | root privileges for this): 18 | 19 | :: 20 | 21 | # apt-get install qt5-default libqt5webkit5-dev build-essential \ 22 | python-lxml python-pip xvfb 23 | 24 | Please note that Qt4 is also supported. 25 | 26 | On Mac OS X, you can use Homebrew_ to install Qt and 27 | easy_install_ to install pip_: 28 | 29 | :: 30 | 31 | # brew install qt 32 | # easy_install pip 33 | 34 | On other operating systems, you can use pip_ to install lxml (though you might 35 | have to install libxml2 and the Python headers first). 36 | 37 | Recommended: Installing dryscrape from PyPI 38 | ------------------------------------------- 39 | 40 | This is as simple as a quick 41 | 42 | :: 43 | 44 | # pip install dryscrape 45 | 46 | Note that dryscrape supports Python 2.7 and 3 as of version 1.0. 47 | 48 | Installing dryscrape from Git 49 | ------------------------------- 50 | 51 | First, get a copy of dryscrape_ using Git: 52 | 53 | :: 54 | 55 | $ git clone https://github.com/niklasb/dryscrape.git dryscrape 56 | $ cd dryscrape 57 | 58 | To install dryscrape, you first need to install webkit-server_. You can use 59 | pip_ to do this for you (while still in the dryscrape directory). 
60 | 61 | :: 62 | 63 | # pip install -r requirements.txt 64 | 65 | If you want, you can of course also install the dependencies manually. 66 | 67 | Afterwards, you can use the included ``setup.py`` script to install dryscrape: 68 | 69 | :: 70 | 71 | # python setup.py install 72 | 73 | .. _Qt: http://www.qt.io 74 | .. _QtWebKit: http://doc.qt.io/qt-5/qtwebkit-index.html 75 | .. _lxml: http://lxml.de/ 76 | .. _webkit-server: https://github.com/niklasb/webkit-server/ 77 | .. _pip: http://pypi.python.org/pypi/pip 78 | .. _dryscrape: https://github.com/niklasb/dryscrape/ 79 | .. _Homebrew: http://brew.sh/ 80 | .. _easy_install: https://pypi.python.org/pypi/setuptools 81 | -------------------------------------------------------------------------------- /docs/usage.rst: -------------------------------------------------------------------------------- 1 | Usage 2 | ====== 3 | 4 | First demonstration 5 | ------------------------ 6 | 7 | A code sample tells more than a thousand words: 8 | 9 | .. literalinclude:: /../examples/google.py 10 | 11 | In this sample, we use dryscrape to do a simple web search on Google. 12 | Note that we set up a Webkit driver instance here and pass it to a dryscrape 13 | :py:class:`~dryscrape.session.Session` in the constructor. The session instance 14 | then passes every method call it cannot resolve -- such as 15 | :py:meth:`~webkit_server.CommandsMixin.visit`, in this case -- to the 16 | underlying driver. 
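The delegation described above is plain Python attribute lookup. A minimal, self-contained sketch of the same proxy pattern (``StubDriver`` and its canned ``visit`` are hypothetical stand-ins for the real Webkit driver, which needs a running webkit_server):

```python
# Sketch of the proxy pattern used by dryscrape's Session: attribute
# lookups that fail on the session fall through to the wrapped driver.
# StubDriver is a hypothetical stand-in for the real Webkit driver.

class StubDriver(object):
    def visit(self, url):
        return "visited %s" % url


class ProxySession(object):
    def __init__(self, driver):
        self.driver = driver

    def __getattr__(self, attr):
        # __getattr__ runs only when normal lookup fails, so methods
        # defined on the session itself still take precedence
        return getattr(self.driver, attr)


sess = ProxySession(StubDriver())
print(sess.visit("http://example.com/"))  # visited http://example.com/
```

Because ``__getattr__`` is consulted only after regular attribute lookup fails, the session can override selected driver methods (as the real ``Session.visit`` does) while transparently forwarding everything else.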
17 | -------------------------------------------------------------------------------- /dryscrape/__init__.py: -------------------------------------------------------------------------------- 1 | from .session import * 2 | from .xvfb import * 3 | import dryscrape.driver 4 | -------------------------------------------------------------------------------- /dryscrape/driver/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/niklasb/dryscrape/4d3dabdec02f321a37325ff8dbb43d049d451931/dryscrape/driver/__init__.py -------------------------------------------------------------------------------- /dryscrape/driver/webkit.py: -------------------------------------------------------------------------------- 1 | """ 2 | Headless Webkit driver for dryscrape. Wraps the ``webkit_server`` module. 3 | """ 4 | 5 | import dryscrape.mixins 6 | import webkit_server 7 | 8 | class Node(webkit_server.Node, 9 | dryscrape.mixins.SelectionMixin, 10 | dryscrape.mixins.AttributeMixin): 11 | """ Node implementation wrapping a ``webkit_server`` node. """ 12 | 13 | 14 | class NodeFactory(webkit_server.NodeFactory): 15 | """ overrides the NodeFactory provided by ``webkit_server``. """ 16 | def create(self, node_id): 17 | return Node(self.client, node_id) 18 | 19 | 20 | class Driver(webkit_server.Client, 21 | dryscrape.mixins.WaitMixin, 22 | dryscrape.mixins.HtmlParsingMixin): 23 | """ Driver implementation wrapping a ``webkit_server`` driver. 24 | 25 | Keyword arguments are passed through to the underlying ``webkit_server.Client`` 26 | constructor. By default, `node_factory_class` is set to use the dryscrape 27 | node implementation. 
""" 28 | def __init__(self, **kw): 29 | kw.setdefault('node_factory_class', NodeFactory) 30 | super(Driver, self).__init__(**kw) 31 | -------------------------------------------------------------------------------- /dryscrape/mixins.py: -------------------------------------------------------------------------------- 1 | """ 2 | Mixins for use in dryscrape drivers. 3 | """ 4 | 5 | import time 6 | import lxml.html 7 | 8 | class SelectionMixin(object): 9 | """ Mixin that adds different methods of node selection to an object that 10 | provides an ``xpath`` method returning a collection of matches. """ 11 | 12 | def css(self, css): 13 | """ Returns all nodes matching the given CSSv3 expression. """ 14 | return self.css(css) 15 | 16 | def at_css(self, css): 17 | """ Returns the first node matching the given CSSv3 18 | expression or ``None``. """ 19 | return self._first_or_none(self.css(css)) 20 | 21 | def at_xpath(self, xpath): 22 | """ Returns the first node matching the given XPath 2.0 expression or ``None``. 23 | """ 24 | return self._first_or_none(self.xpath(xpath)) 25 | 26 | def parent(self): 27 | """ Returns the parent node. """ 28 | return self.at_xpath('..') 29 | 30 | def children(self): 31 | """ Returns the child nodes. """ 32 | return self.xpath('*') 33 | 34 | def form(self): 35 | """ Returns the form wherein this node is contained or ``None``. """ 36 | return self.at_xpath("ancestor::form") 37 | 38 | def _first_or_none(self, list): 39 | return list[0] if list else None 40 | 41 | 42 | class AttributeMixin(object): 43 | """ Mixin that adds ``[]`` access syntax sugar to an object that supports a 44 | ``set_attr`` and ``get_attr`` method. 
""" 45 | 46 | def __getitem__(self, attr): 47 | """ Syntax sugar for accessing this node's attributes """ 48 | return self.get_attr(attr) 49 | 50 | def __setitem__(self, attr, value): 51 | """ Syntax sugar for setting this node's attributes """ 52 | self.set_attr(attr, value) 53 | 54 | 55 | class HtmlParsingMixin(object): 56 | """ Mixin that adds a ``document`` method to an object that supports a ``body`` 57 | method returning valid HTML. """ 58 | 59 | def document(self): 60 | """ Parses the HTML returned by ``body`` and returns it as an lxml.html 61 | document. If the driver supports live DOM manipulation (like webkit_server 62 | does), changes performed on the returned document will not take effect. """ 63 | return lxml.html.document_fromstring(self.body()) 64 | 65 | 66 | # default timeout values 67 | DEFAULT_WAIT_INTERVAL = 0.5 68 | DEFAULT_WAIT_TIMEOUT = 10 69 | DEFAULT_AT_TIMEOUT = 1 70 | 71 | class WaitTimeoutError(Exception): 72 | """ Raised when a wait times out """ 73 | 74 | class WaitMixin(SelectionMixin): 75 | """ Mixin that allows waiting for conditions or elements. """ 76 | 77 | def wait_for(self, 78 | condition, 79 | interval = DEFAULT_WAIT_INTERVAL, 80 | timeout = DEFAULT_WAIT_TIMEOUT): 81 | """ Wait until a condition holds by checking it in regular intervals. 82 | Raises ``WaitTimeoutError`` on timeout. """ 83 | 84 | start = time.time() 85 | 86 | # at least execute the check once! 87 | while True: 88 | res = condition() 89 | if res: 90 | return res 91 | 92 | # timeout? 93 | if time.time() - start > timeout: 94 | break 95 | 96 | # wait a bit 97 | time.sleep(interval) 98 | 99 | # timeout occured! 100 | raise WaitTimeoutError("wait_for timed out") 101 | 102 | def wait_for_safe(self, *args, **kw): 103 | """ Wait until a condition holds and return 104 | ``None`` on timeout. 
""" 105 | try: 106 | return self.wait_for(*args, **kw) 107 | except WaitTimeoutError: 108 | return None 109 | 110 | def wait_while(self, condition, *args, **kw): 111 | """ Wait while a condition holds. """ 112 | return self.wait_for(lambda: not condition(), *args, **kw) 113 | 114 | def at_css(self, css, timeout = DEFAULT_AT_TIMEOUT, **kw): 115 | """ Returns the first node matching the given CSSv3 expression or ``None`` 116 | if a timeout occurs. """ 117 | return self.wait_for_safe(lambda: super(WaitMixin, self).at_css(css), 118 | timeout = timeout, 119 | **kw) 120 | 121 | def at_xpath(self, xpath, timeout = DEFAULT_AT_TIMEOUT, **kw): 122 | """ Returns the first node matching the given XPath 2.0 expression or ``None`` 123 | if a timeout occurs. """ 124 | return self.wait_for_safe(lambda: super(WaitMixin, self).at_xpath(xpath), 125 | timeout = timeout, 126 | **kw) 127 | -------------------------------------------------------------------------------- /dryscrape/session.py: -------------------------------------------------------------------------------- 1 | from dryscrape.driver.webkit import Driver as DefaultDriver 2 | 3 | from itertools import chain 4 | try: 5 | import urlparse 6 | except ImportError: 7 | import urllib.parse 8 | urlparse = urllib.parse 9 | 10 | class Session(object): 11 | """ A web scraping session based on a driver instance. Implements the proxy 12 | pattern to pass unresolved method calls to the underlying driver. 13 | 14 | If no `driver` is specified, the instance will create an instance of 15 | ``dryscrape.session.DefaultDriver`` to get a driver instance (defaults to 16 | ``dryscrape.driver.webkit.Driver``). 17 | 18 | If `base_url` is present, relative URLs are completed with this URL base. 19 | If not, the `get_base_url` method is called on itself to get the base URL. 
""" 20 | 21 | def __init__(self, 22 | driver = None, 23 | base_url = None): 24 | self.driver = driver or DefaultDriver() 25 | self.base_url = base_url 26 | 27 | # implement proxy pattern 28 | def __getattr__(self, attr): 29 | """ Pass unresolved method calls to underlying driver. """ 30 | return getattr(self.driver, attr) 31 | 32 | def __dir__(self): 33 | """Allow for `dir` to detect proxied methods from `Driver`.""" 34 | dir_chain = chain(dir(type(self)), dir(self.driver)) 35 | return list(set(dir_chain)) 36 | 37 | def visit(self, url): 38 | """ Passes through the URL to the driver after completing it using the 39 | instance's URL base. """ 40 | return self.driver.visit(self.complete_url(url)) 41 | 42 | def complete_url(self, url): 43 | """ Completes a given URL with this instance's URL base. """ 44 | if self.base_url: 45 | return urlparse.urljoin(self.base_url, url) 46 | else: 47 | return url 48 | 49 | def interact(self, **local): 50 | """ Drops the user into an interactive Python session with the ``sess`` variable 51 | set to the current session instance. If keyword arguments are supplied, these 52 | names will also be available within the session. 
""" 53 | import code 54 | code.interact(local=dict(sess=self, **local)) 55 | -------------------------------------------------------------------------------- /dryscrape/xvfb.py: -------------------------------------------------------------------------------- 1 | import atexit 2 | import os 3 | 4 | _xvfb = None 5 | 6 | 7 | def start_xvfb(): 8 | from xvfbwrapper import Xvfb 9 | global _xvfb 10 | _xvfb = Xvfb() 11 | _xvfb.start() 12 | atexit.register(_xvfb.stop) 13 | 14 | 15 | def stop_xvfb(): 16 | global _xvfb 17 | if _xvfb is not None: 18 | _xvfb.stop() 19 | _xvfb = None 20 | -------------------------------------------------------------------------------- /examples/google.py: -------------------------------------------------------------------------------- 1 | import dryscrape 2 | import sys 3 | 4 | if 'linux' in sys.platform: 5 | # start xvfb in case no X is running. Make sure xvfb 6 | # is installed, otherwise this won't work! 7 | dryscrape.start_xvfb() 8 | 9 | search_term = 'dryscrape' 10 | 11 | # set up a web scraping session 12 | sess = dryscrape.Session(base_url = 'http://google.com') 13 | 14 | # we don't need images 15 | sess.set_attribute('auto_load_images', False) 16 | 17 | # visit homepage and search for a term 18 | sess.visit('/') 19 | q = sess.at_xpath('//*[@name="q"]') 20 | q.set(search_term) 21 | q.form().submit() 22 | 23 | # extract all links 24 | for link in sess.xpath('//a[@href]'): 25 | print(link['href']) 26 | 27 | # save a screenshot of the web page 28 | sess.render('google.png') 29 | print("Screenshot written to 'google.png'") 30 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | lxml 2 | git+https://github.com/niklasb/webkit-server.git 3 | xvfbwrapper 4 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup(name='dryscrape', 4 | version='1.0.1', 5 | description='a lightweight Javascript-aware, headless web scraping library for Python', 6 | author='Niklas Baumstark', 7 | author_email='niklas.baumstark@gmail.com', 8 | license='MIT', 9 | url='https://github.com/niklasb/dryscrape', 10 | packages=['dryscrape', 'dryscrape.driver'], 11 | install_requires=['webkit_server>=1.0', 'lxml', 'xvfbwrapper'], 12 | ) 13 | --------------------------------------------------------------------------------
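
Addendum: the polling loop at the heart of `WaitMixin.wait_for` in `dryscrape/mixins.py` can be exercised on its own, without a browser or driver. The sketch below reproduces that loop standalone; the `ready` helper and the short interval/timeout values are illustrative only and not part of the library.

```python
import time

class WaitTimeoutError(Exception):
    """ Raised when a wait times out (mirrors dryscrape.mixins). """

def wait_for(condition, interval=0.5, timeout=10):
    """ Check `condition` at regular intervals until it returns a truthy
    value; raise WaitTimeoutError once `timeout` seconds have elapsed. """
    start = time.time()
    while True:
        res = condition()
        if res:
            return res  # hand back the truthy result, as the mixin does
        if time.time() - start > timeout:
            raise WaitTimeoutError("wait_for timed out")
        time.sleep(interval)

# illustrative condition: becomes truthy on the third poll
calls = {'n': 0}
def ready():
    calls['n'] += 1
    return calls['n'] >= 3

result = wait_for(ready, interval=0.01, timeout=1.0)
print(result)  # True
```

This is also why `at_css`/`at_xpath` in the library accept a `timeout` keyword: they wrap the same loop via `wait_for_safe`, which swallows the `WaitTimeoutError` and returns `None` instead.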