├── .gitignore
├── .travis.yml
├── CHANGES
├── LICENSE
├── README.md
├── ankle
    ├── __init__.py
    ├── find.py
    ├── match.py
    └── utils.py
├── setup.py
├── tests.py
└── tox.ini


/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.egg-info
3 | .coverage
4 | .tox
5 | dist
6 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: python
 2 | python:
 3 |   - "2.7"
 4 |   - "3.4"
 5 |   - "3.5"
 6 |   - "pypy"
 7 | before_install:
 8 |   - pip install --upgrade setuptools
 9 | install:
10 |   - python setup.py -q install
11 | script: nosetests
12 | 


--------------------------------------------------------------------------------
/CHANGES:
--------------------------------------------------------------------------------
 1 | ankle Changelog
 2 | ===============
 3 | 
 4 | Version 0.4.1
 5 | -------------
 6 | - Lock html5lib version for now, until it's stable. Latest releases broke things.
 7 | 
 8 | Version 0.4.0
 9 | -------------
10 | 
11 | - lxml is no longer a dependency
12 | - Now works with more modern html5lib
13 | 
14 | Version 0.3.0
15 | -------------
16 | 
17 | - Order of elements in the skeleton is now significant
18 | - Text between elements is now significant
19 | - Old `match` API is gone
20 | 
21 | Version 0.2.0
22 | -------------
23 | 
24 | - Add support for checking node text in skeletons
25 | - Replace `match` with `find`, `find_all` and `find_iter` interfaces
26 | 
27 | Version 0.1.0
28 | -------------
29 | 
30 | - Initial release
31 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2016 Aleksei Voronov
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | ankle
  2 | =====
  3 | 
  4 | [![Build Status](https://travis-ci.org/despawnerer/ankle.svg?branch=master)](https://travis-ci.org/despawnerer/ankle)
  5 | [![PyPI version](https://badge.fury.io/py/ankle.svg)](https://badge.fury.io/py/ankle)
  6 | 
  7 | `ankle` is a tool that finds elements inside HTML documents by comparing them with an HTML skeleton. It is useful in tests that check markup returned by a server, and possibly for web scraping.
  8 | 
  9 | Works on Python 2.7 and Python 3.4+.
 10 | 
 11 | 
 12 | Installation
 13 | ------------
 14 | 
 15 | 	$ pip install ankle
 16 | 
 17 | 
 18 | Usage
 19 | -----
 20 | 
 21 | ```python
 22 | ankle.find_all(skeleton, document)
 23 | ```
 24 | 
 25 | Return all elements from document that match given skeleton.
 26 | 
 27 | Skeleton elements are compared with the document's by tag name,
 28 | attributes and text inside or between them.
 29 | 
 30 | Children of elements in the skeleton are looked for in the descendants of
 31 | matching elements in the document.
 32 | 
 33 | Order of elements in the skeleton is significant.
 34 | 
 35 | Skeleton must contain one root element.
 36 | 
 37 | `document` and `skeleton` may be either HTML strings or parsed etrees.
 38 | 
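The attribute rule described above can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree` in place of html5lib, so it assumes well-formed markup, and `attributes_match` is a hypothetical helper name, not part of ankle's public API. Every attribute in the skeleton must be present with an equal value on the document element; extra attributes on the document element are ignored.

```python
import xml.etree.ElementTree as ET


def attributes_match(bone, node):
    """Return True if every skeleton attribute is on the node with an equal value."""
    return all(node.attrib.get(name) == value for name, value in bone.attrib.items())


node = ET.fromstring('<form id="login" method="POST"/>')

# Extra attributes on the document element do not prevent a match.
subset_result = attributes_match(ET.fromstring('<form id="login"/>'), node)

# A differing value does.
mismatch_result = attributes_match(ET.fromstring('<form id="signup"/>'), node)
```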
 39 | 
 40 | ```python
 41 | ankle.find(skeleton, document)
 42 | ```
 43 | 
 44 | Return the first element that matches given skeleton in the document.
 45 | 
 46 | 
 47 | ```python
 48 | ankle.find_iter(skeleton, document)
 49 | ```
 50 | 
 51 | Return an iterator that yields elements from the document that
 52 | match given skeleton.
 53 | 
 54 | See `find_all` for details.
 55 | 
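The three entry points differ only in how they consume matches. The sketch below mirrors that relationship with a stand-in generator; `iter_matches`, `first_match`, `all_matches` and the predicate argument are hypothetical names that stand in for the real matching logic.

```python
def iter_matches(predicate, items):
    """Lazily yield items that satisfy the predicate (stand-in for find_iter)."""
    for item in items:
        if predicate(item):
            yield item


def first_match(predicate, items):
    """Return the first match, or None when nothing matches (stand-in for find)."""
    return next(iter_matches(predicate, items), None)


def all_matches(predicate, items):
    """Return every match as a list (stand-in for find_all)."""
    return list(iter_matches(predicate, items))


def is_even(n):
    return n % 2 == 0


numbers = [1, 2, 3, 4]
first_even = first_match(is_even, numbers)
all_even = all_matches(is_even, numbers)
no_match = first_match(is_even, [1, 3])
```

Because the generator is lazy, the stand-in for `find` stops at the first match instead of scanning the whole input.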
 56 | 
 57 | Caveats
 58 | -------
 59 | 
 60 | - Class attribute is checked by strict equality, but it may be desirable to ignore classes that aren't present in the skeleton
 61 | - Text inside and between elements is checked strictly, so, for example, if it's broken up by a \<span\> in the document, but presented without it in the skeleton, it won't be found
 62 | - There are no nice assertion failure messages when used with py.test, so it is difficult to see what failed to match and why
 63 | 
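The first two caveats can be seen in a small sketch. It relies on the standard library's `xml.etree.ElementTree` rather than html5lib, so it assumes well-formed markup; the subset comparison is a hypothetical alternative, not something ankle does.

```python
import xml.etree.ElementTree as ET

# Caveat 1: the class attribute is compared as one string.
skeleton_class = ET.fromstring('<div class="row"/>').attrib["class"]
document_class = ET.fromstring('<div class="row wide"/>').attrib["class"]
strict_equal = skeleton_class == document_class

# A hypothetical looser rule: every skeleton class must be among the document's classes.
subset_equal = set(skeleton_class.split()) <= set(document_class.split())

# Caveat 2: a <span> splits the surrounding text into separate pieces,
# so no single text node equals the full sentence any more.
plain = ET.fromstring("<p>I agree to TOS</p>")
broken = ET.fromstring("<p>I agree to <span>TOS</span></p>")
plain_text = plain.text
broken_text = broken.text
span_text = broken[0].text
```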
 64 | 
 65 | Example
 66 | -------
 67 | 
 68 | ```python
 69 | import ankle
 70 | 
 71 | document = """
 72 | <html>
 73 | 	<body>
 74 | 		<h1>My document</h1>
 75 | 		<p>
 76 | 			Some text
 77 | 		</p>
 78 | 		<form id="subscription-form">
 79 | 			<h2>Subscribe for more information!</h2>
 80 | 			<div class="control-row">
 81 | 				<label for="email">Email</label>
 82 | 				<input name="email" placeholder="Email"/>
 83 | 			</div>
 84 | 			<div class="control-row">
 85 | 				<label for="tos">
 86 | 					<input name="tos" type="checkbox" class="checkbox-input">
 87 | 					<span class="checkbox-text">I agree to TOS</span>
 88 | 				</label>
 89 | 			</div>
 90 | 			<div class="submit-row">
 91 | 				<button type="submit">Subscribe</button>
 92 | 			</div>
 93 | 		</form>
 94 | 	</body>
 95 | </html>
 96 | """
 97 | 
 98 | skeleton = """
 99 | <form>
100 | 	Subscribe for more information!
101 | 	<label for="email">Email</label>
102 | 	<input name="email">
103 | 	<label for="tos">
104 | 		<input name="tos" type="checkbox">
105 | 		I agree to TOS
106 | 	</label>
107 | 	<button type="submit"></button>
108 | </form>
109 | """
110 | 
111 | ankle.find(skeleton, document)  # will return the <form> element from the document
112 | ```
113 | 
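To make it clearer why the skeleton above matches, here is a simplified recursive sketch of the matching rules. It is an illustration under assumptions, not ankle's implementation: it parses with the standard library's `xml.etree.ElementTree` (so the markup must be well-formed, with void elements self-closed), all function names are hypothetical, and ankle itself walks the tree iteratively.

```python
import xml.etree.ElementTree as ET


def child_nodes(element):
    """Yield an element's non-blank text pieces and child elements in document order."""
    text = (element.text or "").strip()
    if text:
        yield text
    for child in element:
        yield child
        tail = (child.tail or "").strip()
        if tail:
            yield tail


def matches(node, bone):
    """Compare a document node with a skeleton node ("bone")."""
    if isinstance(node, str) or isinstance(bone, str):
        return node == bone
    if node.tag != bone.tag:
        return False
    if any(node.attrib.get(name) != value for name, value in bone.attrib.items()):
        return False
    return contains_in_order(node, list(child_nodes(bone)))


def contains_in_order(element, bones):
    """Check that all bones occur, in order, among the element's descendants."""
    remaining = list(bones)

    def walk(current):
        for node in child_nodes(current):
            if not remaining:
                return
            if matches(node, remaining[0]):
                remaining.pop(0)
            elif not isinstance(node, str):
                walk(node)  # look deeper only when the node itself did not match

    walk(element)
    return not remaining


def find_first(skeleton, document):
    """Return the first document element matching the skeleton, or None."""
    return next((el for el in document.iter() if matches(el, skeleton)), None)


document = ET.fromstring("""
<body>
  <form id="subscription-form">
    <h2>Subscribe for more information!</h2>
    <div class="control-row">
      <label for="email">Email</label>
      <input name="email" placeholder="Email"/>
    </div>
    <div class="submit-row"><button type="submit">Subscribe</button></div>
  </form>
</body>
""")

skeleton = ET.fromstring("""
<form>
  Subscribe for more information!
  <label for="email">Email</label>
  <input name="email"/>
  <button type="submit"></button>
</form>
""")

# Same bones in a different order: order is significant, so this should not match.
reordered = ET.fromstring("""
<form>
  <input name="email"/>
  <label for="email">Email</label>
</form>
""")

found = find_first(skeleton, document)
not_found = find_first(reordered, document)
```

The heading text is found even though the skeleton places it directly inside `<form>`, because skeleton children are searched for among all descendants of the matching element.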


--------------------------------------------------------------------------------
/ankle/__init__.py:
--------------------------------------------------------------------------------
1 | from .find import *  # noqa
2 | 


--------------------------------------------------------------------------------
/ankle/find.py:
--------------------------------------------------------------------------------
 1 | import html5lib
 2 | 
 3 | from .utils import is_string
 4 | from .match import node_matches_bone
 5 | 
 6 | 
 7 | __all__ = ['find', 'find_all', 'find_iter']
 8 | 
 9 | 
10 | def find(skeleton, document):
11 |     """
12 |     Return the first element that matches given skeleton in the document.
13 |     """
14 |     return next(find_iter(skeleton, document), None)
15 | 
16 | 
17 | def find_all(skeleton, document):
18 |     """
19 |     Return all elements from document that match given skeleton.
20 | 
21 |     Skeleton elements are compared with the document's by tag name,
22 |     attributes and text inside or between them.
23 | 
24 |     Children of elements in the skeleton are looked for in the descendants of
25 |     matching elements in the document.
26 | 
27 |     Order of elements in the skeleton is significant.
28 | 
29 |     Skeleton must contain one root element.
30 | 
31 |     `document` and `skeleton` may be either HTML strings or parsed etrees.
32 |     """
33 |     return list(find_iter(skeleton, document))
34 | 
35 | 
36 | def find_iter(skeleton, document):
37 |     """
38 |     Return an iterator that yields elements from the document that
39 |     match given skeleton.
40 | 
41 |     See `find_all` for details.
42 |     """
43 |     if is_string(document):
44 |         document = html5lib.parse(document)
45 |     if is_string(skeleton):
46 |         fragment = html5lib.parseFragment(skeleton)
47 |         if len(fragment) != 1:
48 |             raise ValueError("Skeleton must have exactly one root element.")
49 |         skeleton = fragment[0]
50 | 
51 |     for element in document.iter():
52 |         if node_matches_bone(element, skeleton):
53 |             yield element
54 | 


--------------------------------------------------------------------------------
/ankle/match.py:
--------------------------------------------------------------------------------
 1 | from .utils import is_string, maybe_strip
 2 | 
 3 | 
 4 | def node_matches_bone(node, bone):
 5 |     if is_string(bone) or is_string(node):
 6 |         return node == bone
 7 |     else:
 8 |         return (
 9 |             node.tag == bone.tag and
10 |             all(bone.attrib[x] == node.attrib.get(x) for x in bone.attrib) and
11 |             has_all_matching_elements(node, iter_child_nodes(bone))
12 |         )
13 | 
14 | 
15 | def has_all_matching_elements(element, bones):
16 |     # this is sort of convoluted in comparison with recursive version, but
17 |     # there are advantages:
18 |     # - it's easy to break out once we've found all the matching elements
19 |     # - there's no possibility of recursion errors (documents may be very large)
20 |     bones_iter = iter(bones)
21 |     nodes_iters = [iter_child_nodes(element)]
22 | 
23 |     try:
24 |         bone = next(bones_iter)
25 |     except StopIteration:
26 |         return True
27 | 
28 |     while nodes_iters:
29 |         try:
30 |             node = next(nodes_iters[-1])
31 |         except StopIteration:
32 |             nodes_iters.pop()
33 |             continue
34 | 
35 |         if node_matches_bone(node, bone):
36 |             try:
37 |                 bone = next(bones_iter)
38 |             except StopIteration:
39 |                 return True
40 |         elif not is_string(node):
41 |             nodes_iters.append(iter_child_nodes(node))
42 |     else:
43 |         return False
44 | 
45 | 
46 | def iter_child_nodes(element):
47 |     text = maybe_strip(element.text)
48 |     if text:
49 |         yield text
50 | 
51 |     for child in element:
52 |         yield child
53 |         tail = maybe_strip(child.tail)
54 |         if tail:
55 |             yield tail
56 | 


--------------------------------------------------------------------------------
/ankle/utils.py:
--------------------------------------------------------------------------------
 1 | import six
 2 | 
 3 | 
 4 | def maybe_strip(text):
 5 |     return text and text.strip()
 6 | 
 7 | 
 8 | def is_string(value):
 9 |     return isinstance(value, six.string_types)
10 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup
 2 | 
 3 | setup(
 4 |     name='ankle',
 5 |     version='0.4.1',
 6 |     description='Find elements in HTML by matching them with a skeleton',
 7 |     url='https://github.com/despawnerer/ankle',
 8 |     author='Aleksei Voronov',
 9 |     author_email='despawn@gmail.com',
10 |     license='MIT',
11 |     classifiers=[
12 |         'Development Status :: 4 - Beta',
13 |         'Intended Audience :: Developers',
14 |         'Operating System :: OS Independent',
15 |         'License :: OSI Approved :: MIT License',
16 |         'Topic :: Software Development :: Libraries',
17 |         'Topic :: Software Development :: Libraries :: Python Modules',
18 |         'Programming Language :: Python :: 2',
19 |         'Programming Language :: Python :: 2.7',
20 |         'Programming Language :: Python :: 3',
21 |         'Programming Language :: Python :: 3.3',
22 |         'Programming Language :: Python :: 3.4',
23 |         'Programming Language :: Python :: 3.5',
24 |     ],
25 |     packages=['ankle'],
26 |     install_requires=[
27 |         'html5lib==0.9999999',
28 |         'six>=1.0'
29 |     ]
30 | )
31 | 


--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
  1 | import ankle
  2 | import unittest
  3 | 
  4 | 
  5 | class SimpleTestCase(unittest.TestCase):
  6 |     def test_works(self):
  7 |         document = '''
  8 |             <form id="test1" class="form"></form>
  9 |             <form id="test2" class="form"></form>
 10 |         '''
 11 |         skeleton = '<form class="form"></form>'
 12 |         self.assertEqual(len(ankle.find_all(skeleton, document)), 2)
 13 | 
 14 |     def test_disallows_skeleton_with_multiple_elements(self):
 15 |         document = '<html></html>'
 16 |         skeleton = '<p></p><p></p>'
 17 |         with self.assertRaises(ValueError):
 18 |             ankle.find_all(skeleton, document)
 19 | 
 20 | 
 21 | class FindTestCase(unittest.TestCase):
 22 |     def test_returns_first_found_element_when_found(self):
 23 |         document = '''
 24 |             <form id="test1" class="form"></form>
 25 |             <form id="test2" class="form"></form>
 26 |         '''
 27 |         skeleton = '<form class="form"></form>'
 28 |         element = ankle.find(skeleton, document)
 29 |         self.assertEqual(element.attrib['id'], 'test1')
 30 | 
 31 |     def test_returns_none_when_nothing_found(self):
 32 |         document = '<form id="test"></form>'
 33 |         skeleton = '<div class="other"></div>'
 34 |         self.assertIsNone(ankle.find(skeleton, document))
 35 | 
 36 | 
 37 | class MatchingTestCase(unittest.TestCase):
 38 |     def test_match_by_tag_name(self):
 39 |         document = '''
 40 |             <form id="test"></form>
 41 |         '''
 42 |         skeleton = '<form></form>'
 43 |         matches = ankle.find_all(skeleton, document)
 44 |         self.assertEqual(len(matches), 1)
 45 |         self.assertEqual(matches[0].attrib['id'], 'test')
 46 | 
 47 |     def test_match_by_attribute(self):
 48 |         document = '''
 49 |             <form id="test1"></form>
 50 |             <form id="test2"></form>
 51 |         '''
 52 |         skeleton = '<form id="test1"></form>'
 53 |         matches = ankle.find_all(skeleton, document)
 54 |         self.assertEqual(len(matches), 1)
 55 |         self.assertEqual(matches[0].attrib['id'], 'test1')
 56 | 
 57 |     def test_match_by_child(self):
 58 |         document = '''
 59 |             <form id="test1"><input name="match"></form>
 60 |             <form id="test2"><input name="no-match"></form>
 61 |         '''
 62 |         skeleton = '<form><input name="match"></form>'
 63 |         matches = ankle.find_all(skeleton, document)
 64 |         self.assertEqual(len(matches), 1)
 65 |         self.assertEqual(matches[0].attrib['id'], 'test1')
 66 | 
 67 |     def test_match_by_descendant(self):
 68 |         document = '''
 69 |             <form id="test1">
 70 |                 <div><span><input name="match"></span></div>
 71 |                 <button>Submit</button>
 72 |             </form>
 73 |             <form id="test2">
 74 |                 <input name="whatever">
 75 |                 <div><button>Go</button></div>
 76 |             </form>
 77 |         '''
 78 |         skeleton = '<form><input name="match"></form>'
 79 |         matches = ankle.find_all(skeleton, document)
 80 |         self.assertEqual(len(matches), 1)
 81 |         self.assertEqual(matches[0].attrib['id'], 'test1')
 82 | 
 83 |     def test_match_by_multiple_children(self):
 84 |         document = '''
 85 |             <form id="login">
 86 |                 <input name="name">
 87 |                 <input name="password">
 88 |             </form>
 89 |             <form id="some-other-form">
 90 |                 <input name="no-match">
 91 |                 <input name="different-input">
 92 |             </form>
 93 |         '''
 94 |         skeleton = '<form><input name="name"><input name="password"></form>'
 95 |         matches = ankle.find_all(skeleton, document)
 96 |         self.assertEqual(len(matches), 1)
 97 |         self.assertEqual(matches[0].attrib['id'], 'login')
 98 | 
 99 |     def test_multiple_matches(self):
100 |         document = '''
101 |             <form id="test1"></form>
102 |             <form id="test2"></form>
103 |         '''
104 |         skeleton = '<form></form>'
105 |         matches = ankle.find_all(skeleton, document)
106 |         self.assertEqual(len(matches), 2)
107 |         self.assertEqual(matches[0].attrib['id'], 'test1')
108 |         self.assertEqual(matches[1].attrib['id'], 'test2')
109 | 
110 |     def test_attribute_order_doesnt_matter(self):
111 |         document = '<form method="POST" action="." id="test1"></form>'
112 |         skeleton = '<form id="test1" action="." method="POST"></form>'
113 |         matches = ankle.find_all(skeleton, document)
114 |         self.assertEqual(len(matches), 1)
115 |         self.assertEqual(matches[0].attrib['id'], 'test1')
116 | 
117 |     def test_match_deep_descendants(self):
118 |         document = '''
119 |             <form id="test1">
120 |                 <div class="red">
121 |                     <input name="wonderful">
122 |                 </div>
123 |             </form>
124 |             <form id="test2">
125 |                 <div class="red">
126 |                     <input name="different">
127 |                 </div>
128 |             </form>
129 |         '''
130 |         skeleton = (
131 |             '<form><div class="red"><input name="wonderful"></div></form>'
132 |         )
133 |         matches = ankle.find_all(skeleton, document)
134 |         self.assertEqual(len(matches), 1)
135 |         self.assertEqual(matches[0].attrib['id'], 'test1')
136 | 
137 |     def test_match_text(self):
138 |         document = '''
139 |             <form id="test1">
140 |                 <label for="name">
141 |                     Correct label
142 |                 </label>
143 |                 <input name="name">
144 |             </form>
145 |             <form id="test2">
146 |                 <label for="name">
147 |                     Wrong label
148 |                 </label>
149 |                 <input name="name">
150 |             </form>
151 |         '''
152 |         skeleton = '''
153 |             <form>
154 |                 <label for="name">Correct label</label>
155 |                 <input name="name">
156 |             </form>
157 |         '''
158 |         matches = ankle.find_all(skeleton, document)
159 |         self.assertEqual(len(matches), 1)
160 |         self.assertEqual(matches[0].attrib['id'], 'test1')
161 | 
162 |     def test_match_by_order(self):
163 |         document = '''
164 |             <form id="test1">
165 |                 <label for="name">Label</label>
166 |                 <input name="name">
167 |             </form>
168 |             <form id="test2">
169 |                 <input name="name">
170 |                 <label for="name">Label</label>
171 |             </form>
172 |         '''
173 |         skeleton = '''
174 |             <form>
175 |                 <label for="name">Label</label>
176 |                 <input name="name">
177 |             </form>
178 |         '''
179 |         matches = ankle.find_all(skeleton, document)
180 |         self.assertEqual(len(matches), 1)
181 |         self.assertEqual(matches[0].attrib['id'], 'test1')
182 | 
183 |     def test_matches_skeleton_with_just_text(self):
184 |         document = '''
185 |             <h1 id="test1">Correct title</h1>
186 |             <h2 id="test2">Different title</h2>
187 |         '''
188 |         skeleton = '''
189 |             <h1>Correct title</h1>
190 |         '''
191 |         matches = ankle.find_all(skeleton, document)
192 |         self.assertEqual(len(matches), 1)
193 |         self.assertEqual(matches[0].attrib['id'], 'test1')
194 | 
195 |     def test_match_text_between_elements(self):
196 |         document = '''
197 |             <form id="test1">
198 |                 <label for="name">Label</label>
199 |                 Correct text
200 |                 <input name="name">
201 |             </form>
202 |             <form id="test2">
203 |                 <label for="name">Label</label>
204 |                 Incorrect text
205 |                 <input name="name">
206 |             </form>
207 |         '''
208 |         skeleton = '''
209 |             <form>
210 |                 <label for="name">Label</label>
211 |                 Correct text
212 |                 <input name="name">
213 |             </form>
214 |         '''
215 |         matches = ankle.find_all(skeleton, document)
216 |         self.assertEqual(len(matches), 1)
217 |         self.assertEqual(matches[0].attrib['id'], 'test1')
218 | 
219 |     def test_match_text_in_the_beginning_of_element(self):
220 |         document = '''
221 |             <form id="test1">
222 |                 Correct text
223 |                 <label for="name">Label</label>
224 |                 <input name="name">
225 |             </form>
226 |             <form id="test2">
227 |                 Incorrect text
228 |                 <label for="name">Label</label>
229 |                 <input name="name">
230 |             </form>
231 |         '''
232 |         skeleton = '''
233 |             <form>
234 |                 Correct text
235 |                 <label for="name">Label</label>
236 |                 <input name="name">
237 |             </form>
238 |         '''
239 |         matches = ankle.find_all(skeleton, document)
240 |         self.assertEqual(len(matches), 1)
241 |         self.assertEqual(matches[0].attrib['id'], 'test1')
242 | 


--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist = py27, py34, py35, pypy
3 | 
4 | [testenv]
5 | commands = nosetests {posargs}
6 | deps =
7 |     nose
8 | 


--------------------------------------------------------------------------------