├── .gitignore
├── .travis.yml
├── CHANGES
├── LICENSE
├── README.md
├── ankle
    ├── __init__.py
    ├── find.py
    ├── match.py
    └── utils.py
├── setup.py
├── tests.py
└── tox.ini

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*.egg-info
.coverage
.tox
dist

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - "2.7"
  - "3.4"
  - "3.5"
  - "pypy"
before_install:
  - pip install --upgrade setuptools
install:
  - python setup.py -q install
script: nosetests

--------------------------------------------------------------------------------
/CHANGES:
--------------------------------------------------------------------------------
ankle Changelog
===============

Version 0.4.1
-------------
- Lock html5lib version for now, until it's stable. Latest releases broke things.

Version 0.4.0
-------------

- lxml is no longer a dependency
- Now works with more modern html5lib

Version 0.3.0
-------------

- Order of elements in the skeleton is now significant
- Text between elements is now significant
- Old `match` API is gone

Version 0.2.0
-------------

- Add support for checking node text in skeletons
- Replace `match` with `find`, `find_all` and `find_iter` interfaces

Version 0.1.0
-------------

- Initial release

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016 Aleksei Voronov

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
ankle
=====

[![Build Status](https://travis-ci.org/despawnerer/ankle.svg?branch=master)](https://travis-ci.org/despawnerer/ankle)
[![PyPI version](https://badge.fury.io/py/ankle.svg)](https://badge.fury.io/py/ankle)

`ankle` is a tool that finds elements inside HTML documents by comparing them with an HTML skeleton. It is useful in tests for checking markup returned by a server, and possibly for web scraping.

Works on Python 2.7 and Python 3.4+.


Installation
------------

    $ pip install ankle


Usage
-----

```python
ankle.find_all(skeleton, document)
```

Return all elements from the document that match the given skeleton.

Skeleton elements are compared with the document's by tag name,
attributes and text inside or between them.

Children of elements in the skeleton are looked for in the descendants of
matching elements in the document.

Order of elements in the skeleton is significant.

Skeleton must contain one root element.

`document` and `skeleton` may be either HTML strings or parsed etrees.


```python
ankle.find(skeleton, document)
```

Return the first element in the document that matches the given skeleton.


```python
ankle.find_iter(skeleton, document)
```

Return an iterator that yields elements from the document that
match the given skeleton.

See `find_all` for details.
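
As a rough illustration of how the three functions relate to each other, here is a minimal standard-library sketch. It is not ankle's implementation: it parses with `xml.etree.ElementTree` instead of html5lib, compares only tag names and attributes (children, text and ordering are ignored), and the `simple_*` names are made up for this example.

```python
import xml.etree.ElementTree as ET


def simple_matches(node, bone):
    # Hypothetical helper: tags must be equal, and every attribute present
    # on the skeleton element must have the same value on the document element.
    return node.tag == bone.tag and all(
        node.attrib.get(name) == value for name, value in bone.attrib.items()
    )


def simple_find_iter(bone, document):
    # Lazily yield matching elements in document order.
    for element in document.iter():
        if simple_matches(element, bone):
            yield element


def simple_find(bone, document):
    # First match, or None when nothing matches.
    return next(simple_find_iter(bone, document), None)


def simple_find_all(bone, document):
    # All matches, collected into a list.
    return list(simple_find_iter(bone, document))


document = ET.fromstring(
    '<body>'
    '<form id="login" method="post"/>'
    '<form id="search" method="get"/>'
    '</body>'
)
skeleton = ET.fromstring('<form method="post"/>')

print(simple_find(skeleton, document).get("id"))  # login
print(len(simple_find_all(ET.fromstring("<form/>"), document)))  # 2
```

Extra attributes on the document side (here `id`) do not prevent a match, because only the attributes named in the skeleton are compared.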


Caveats
-------

- The class attribute is checked by strict equality, but it may be desirable to ignore classes that aren't present in the skeleton
- Text inside and between elements is checked strictly, so, for example, if it's broken up by a tag in the document, but presented without it in the skeleton, it won't be found
- There are no nice assertion failure messages when used with py.test, so it's difficult to see what failed to match and why


Example
-------

```python
import ankle

document = """

My document

75 |

76 | Some text 77 |

78 |
79 |

Subscribe for more information!

80 |
81 | 82 | 83 |
84 |
85 | 89 |
90 |
91 | 92 |
93 |
94 | 95 | 96 | """ 97 | 98 | skeleton = """ 99 |
100 | Subscribe for more information! 101 | 102 | 103 | 107 | 108 |
"""

ankle.find(skeleton, document)  # will return the element from the document
```

--------------------------------------------------------------------------------
/ankle/__init__.py:
--------------------------------------------------------------------------------
from .find import *  # noqa

--------------------------------------------------------------------------------
/ankle/find.py:
--------------------------------------------------------------------------------
import html5lib

from .utils import is_string
from .match import node_matches_bone


__all__ = ['find', 'find_all', 'find_iter']


def find(skeleton, document):
    """
    Return the first element that matches given skeleton in the document.
    """
    return next(find_iter(skeleton, document), None)


def find_all(skeleton, document):
    """
    Return all elements from document that match given skeleton.

    Skeleton elements are compared with the document's by tag name,
    attributes and text inside or between them.

    Children of elements in the skeleton are looked for in the descendants of
    matching elements in the document.

    Order of elements in the skeleton is significant.

    Skeleton must contain one root element.

    `document` and `skeleton` may be either HTML strings or parsed etrees.
    """
    return list(find_iter(skeleton, document))


def find_iter(skeleton, document):
    """
    Return an iterator that yields elements from the document that
    match given skeleton.

    See `find_all` for details.
42 | """ 43 | if is_string(document): 44 | document = html5lib.parse(document) 45 | if is_string(skeleton): 46 | fragment = html5lib.parseFragment(skeleton) 47 | if len(fragment) != 1: 48 | raise ValueError("Skeleton must have exactly one root element.") 49 | skeleton = fragment[0] 50 | 51 | for element in document.iter(): 52 | if node_matches_bone(element, skeleton): 53 | yield element 54 | -------------------------------------------------------------------------------- /ankle/match.py: -------------------------------------------------------------------------------- 1 | from .utils import is_string, maybe_strip 2 | 3 | 4 | def node_matches_bone(node, bone): 5 | if is_string(bone) or is_string(node): 6 | return node == bone 7 | else: 8 | return ( 9 | node.tag == bone.tag and 10 | all(bone.attrib[x] == node.attrib.get(x) for x in bone.attrib) and 11 | has_all_matching_elements(node, iter_child_nodes(bone)) 12 | ) 13 | 14 | 15 | def has_all_matching_elements(element, bones): 16 | # this is sort of convoluted in comparison with recursive version, but 17 | # there are advantages: 18 | # - it's easy to break out once we've found all the matching elements 19 | # - there's no possibility of recursion errors (documents may be very large) 20 | bones_iter = iter(bones) 21 | nodes_iters = [iter_child_nodes(element)] 22 | 23 | try: 24 | bone = next(bones_iter) 25 | except StopIteration: 26 | return True 27 | 28 | while nodes_iters: 29 | try: 30 | node = next(nodes_iters[-1]) 31 | except StopIteration: 32 | nodes_iters.pop() 33 | continue 34 | 35 | if node_matches_bone(node, bone): 36 | try: 37 | bone = next(bones_iter) 38 | except StopIteration: 39 | return True 40 | elif not is_string(node): 41 | nodes_iters.append(iter_child_nodes(node)) 42 | else: 43 | return False 44 | 45 | 46 | def iter_child_nodes(element): 47 | text = maybe_strip(element.text) 48 | if text: 49 | yield text 50 | 51 | for child in element: 52 | yield child 53 | tail = maybe_strip(child.tail) 54 | if tail: 
            yield tail

--------------------------------------------------------------------------------
/ankle/utils.py:
--------------------------------------------------------------------------------
import six


def maybe_strip(text):
    return text and text.strip()


def is_string(value):
    return isinstance(value, six.string_types)

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup

setup(
    name='ankle',
    version='0.4.1',
    description='Find elements in HTML by matching them with a skeleton',
    url='https://github.com/despawnerer/ankle',
    author='Aleksei Voronov',
    author_email='despawn@gmail.com',
    license='MIT',
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Operating System :: OS Independent',
        'License :: OSI Approved :: MIT License',
        'Topic :: Software Development :: Libraries',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
    ],
    packages=['ankle'],
    install_requires=[
        'html5lib==0.9999999',
        'six>=1.0'
    ]
)

--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
import ankle
import unittest


class SimpleTestCase(unittest.TestCase):
    def test_works(self):
        document = '''
9 |
10 | ''' 11 | skeleton = '
' 12 | ankle.find_all(skeleton, document) 13 | 14 | def test_disallows_skeleton_with_multiple_elements(self): 15 | document = '' 16 | skeleton = '

' 17 | with self.assertRaises(ValueError): 18 | ankle.find_all(skeleton, document) 19 | 20 | 21 | class FindTestCase(unittest.TestCase): 22 | def test_returns_first_found_element_when_found(self): 23 | document = ''' 24 |
25 |
26 | ''' 27 | skeleton = '
' 28 | element = ankle.find(skeleton, document) 29 | self.assertEqual(element.attrib['id'], 'test1') 30 | 31 | def test_returns_none_when_nothing_found(self): 32 | document = '
' 33 | skeleton = '
' 34 | self.assertIsNone(ankle.find(skeleton, document)) 35 | 36 | 37 | class MatchingTestCase(unittest.TestCase): 38 | def test_match_by_tag_name(self): 39 | document = ''' 40 |
41 | ''' 42 | skeleton = '
' 43 | matches = ankle.find_all(skeleton, document) 44 | self.assertEqual(len(matches), 1) 45 | self.assertEqual(matches[0].attrib['id'], 'test') 46 | 47 | def test_match_by_attribute(self): 48 | document = ''' 49 |
50 |
51 | ''' 52 | skeleton = '
' 53 | matches = ankle.find_all(skeleton, document) 54 | self.assertEqual(len(matches), 1) 55 | self.assertEqual(matches[0].attrib['id'], 'test1') 56 | 57 | def test_match_by_child(self): 58 | document = ''' 59 |
60 |
61 | ''' 62 | skeleton = '
' 63 | matches = ankle.find_all(skeleton, document) 64 | self.assertEqual(len(matches), 1) 65 | self.assertEqual(matches[0].attrib['id'], 'test1') 66 | 67 | def test_match_by_descendant(self): 68 | document = ''' 69 |
70 |
71 | 72 |
73 |
74 | 75 |
76 |
77 | ''' 78 | skeleton = '
' 79 | matches = ankle.find_all(skeleton, document) 80 | self.assertEqual(len(matches), 1) 81 | self.assertEqual(matches[0].attrib['id'], 'test1') 82 | 83 | def test_match_by_multiple_children(self): 84 | document = ''' 85 |
86 | 87 | 88 |
89 |
90 | 91 | 92 |
93 | ''' 94 | skeleton = '
' 95 | matches = ankle.find_all(skeleton, document) 96 | self.assertEqual(len(matches), 1) 97 | self.assertEqual(matches[0].attrib['id'], 'login') 98 | 99 | def test_multiple_matches(self): 100 | document = ''' 101 |
102 |
103 | ''' 104 | skeleton = '
' 105 | matches = ankle.find_all(skeleton, document) 106 | self.assertEqual(len(matches), 2) 107 | self.assertEqual(matches[0].attrib['id'], 'test1') 108 | self.assertEqual(matches[1].attrib['id'], 'test2') 109 | 110 | def test_attribute_order_doesnt_matter(self): 111 | document = '
' 112 | skeleton = '
' 113 | matches = ankle.find_all(skeleton, document) 114 | self.assertEqual(len(matches), 1) 115 | self.assertEqual(matches[0].attrib['id'], 'test1') 116 | 117 | def test_match_deep_descendants(self): 118 | document = ''' 119 |
120 |
121 | 122 |
123 |
124 |
125 |
126 | 127 |
128 |
129 | ''' 130 | skeleton = ( 131 | '
' 132 | ) 133 | matches = ankle.find_all(skeleton, document) 134 | self.assertEqual(len(matches), 1) 135 | self.assertEqual(matches[0].attrib['id'], 'test1') 136 | 137 | def test_match_text(self): 138 | document = ''' 139 |
140 | 143 | 144 |
145 |
146 | 149 | 150 |
151 | ''' 152 | skeleton = ''' 153 |
154 | 155 | 156 |
157 | ''' 158 | matches = ankle.find_all(skeleton, document) 159 | self.assertEqual(len(matches), 1) 160 | self.assertEqual(matches[0].attrib['id'], 'test1') 161 | 162 | def test_match_by_order(self): 163 | document = ''' 164 |
165 | 166 | 167 |
168 |
169 | 170 | 171 |
172 | ''' 173 | skeleton = ''' 174 |
175 | 176 | 177 |
178 | ''' 179 | matches = ankle.find_all(skeleton, document) 180 | self.assertEqual(len(matches), 1) 181 | self.assertEqual(matches[0].attrib['id'], 'test1') 182 | 183 | def test_matches_skeleton_with_just_text(self): 184 | document = ''' 185 |

Correct title

186 |

Different title

187 | ''' 188 | skeleton = ''' 189 |

Correct title

190 | ''' 191 | matches = ankle.find_all(skeleton, document) 192 | self.assertEqual(len(matches), 1) 193 | self.assertEqual(matches[0].attrib['id'], 'test1') 194 | 195 | def test_match_text_between_elements(self): 196 | document = ''' 197 |
198 | 199 | Correct text 200 | 201 |
202 |
203 | 204 | Incorrect text 205 | 206 |
207 | ''' 208 | skeleton = ''' 209 |
210 | 211 | Correct text 212 | 213 |
214 | ''' 215 | matches = ankle.find_all(skeleton, document) 216 | self.assertEqual(len(matches), 1) 217 | self.assertEqual(matches[0].attrib['id'], 'test1') 218 | 219 | def test_match_text_in_the_beginning_of_element(self): 220 | document = ''' 221 |
222 | Correct text 223 | 224 | 225 |
226 |
227 | Incorrect text 228 | 229 | 230 |
231 | ''' 232 | skeleton = ''' 233 |
234 | Correct text 235 | 236 | 237 |
        '''
        matches = ankle.find_all(skeleton, document)
        self.assertEqual(len(matches), 1)
        self.assertEqual(matches[0].attrib['id'], 'test1')

--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
[tox]
envlist = py27, py34, py35, pypy

[testenv]
commands = nosetests {posargs}
deps =
    nose

--------------------------------------------------------------------------------