├── .coveragerc ├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.rst ├── docs ├── advanced.rst ├── api.rst ├── builder.rst ├── conf.py ├── index.rst ├── nodes.rst ├── parser.rst └── writer.rst ├── requirements-dev.txt ├── setup.py ├── tests ├── __init__.py ├── data │ ├── example_doc.small.xml │ ├── example_doc.unicode.xml │ ├── monty_python_films.ns.xml │ └── monty_python_films.xml ├── test_builder.py ├── test_nodes.py ├── test_parser.py └── test_writer.py ├── tox.ini └── xml4h ├── __init__.py ├── builder.py ├── exceptions.py ├── impls ├── __init__.py ├── interface.py ├── lxml_etree.py ├── xml_dom_minidom.py └── xml_etree_elementtree.py ├── nodes.py └── writer.py /.coveragerc: -------------------------------------------------------------------------------- 1 | [report] 2 | show_missing = 1 3 | exclude_lines = 4 | pragma: no cover 5 | raise NotImplementedError 6 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Build artifacts 2 | dist 3 | build 4 | xml4h.egg-info 5 | 6 | # Sphinx documentation 7 | docs/_* 8 | docs/.* 9 | 10 | # Nosetests coverage report 11 | .coverage 12 | 13 | # Tox virtualenvs 14 | .tox/ 15 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | - "3.5" 5 | - "3.6" 6 | - "3.7" 7 | - "3.8" 8 | install: pip install tox-travis 9 | script: tox 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 James Murty. 
4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.rst LICENSE requirements-dev.txt 2 | recursive-include tests *.py *.xml 3 | recursive-include docs *.rst 4 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | xml4h: XML for Humans in Python 3 | =============================== 4 | 5 | *xml4h* is an MIT licensed library for Python to make it easier to work with XML. 6 | 7 | This library exists because Python is awesome, XML is everywhere, and combining 8 | the two should be a pleasure but often is not. With *xml4h*, it can be easy. 
9 | 10 | As of version 1.0 *xml4h* supports Python versions 2.7 and 3.5+. 11 | 12 | 13 | Features 14 | -------- 15 | 16 | *xml4h* is a simplification layer over existing Python XML processing libraries 17 | such as *lxml*, *ElementTree* and the *minidom*. It provides: 18 | 19 | - a rich pythonic API to traverse and manipulate the XML DOM. 20 | - a document builder to simply and safely construct complex documents with 21 | minimal code. 22 | - a writer that serialises XML documents with the structure and format that you 23 | expect, unlike the machine- but not human-friendly output you tend to get 24 | from other libraries. 25 | 26 | The *xml4h* abstraction layer also offers some other benefits, beyond a nice 27 | API and tool set: 28 | 29 | - A common interface to different underlying XML libraries, so code written 30 | against *xml4h* need not be rewritten if you switch implementations. 31 | - You can easily move between *xml4h* and the underlying implementation: parse 32 | your document using the fastest implementation, manipulate the DOM with 33 | human-friendly code using *xml4h*, then get back to the underlying 34 | implementation if you need to. 35 | 36 | 37 | Installation 38 | ------------ 39 | 40 | Install *xml4h* with pip:: 41 | 42 | $ pip install xml4h 43 | 44 | Or install the tarball manually with:: 45 | 46 | $ python setup.py install 47 | 48 | 49 | Links 50 | ----- 51 | 52 | - GitHub for source code and issues: https://github.com/jmurty/xml4h 53 | - ReadTheDocs for documentation: https://xml4h.readthedocs.org 54 | - Install from the Python Package Index: https://pypi.python.org/pypi/xml4h 55 | 56 | 57 | Introduction 58 | ------------ 59 | 60 | With *xml4h* you can easily parse XML files and access their data. 
61 | 62 | Let's start with an example XML document:: 63 | 64 | $ cat tests/data/monty_python_films.xml 65 | 66 | 67 | And Now for Something Completely Different 68 | 69 | A collection of sketches from the first and second TV series of 70 | Monty Python's Flying Circus purposely re-enacted and shot for film. 71 | 72 | 73 | 74 | Monty Python and the Holy Grail 75 | 76 | King Arthur and his knights embark on a low-budget search for 77 | the Holy Grail, encountering humorous obstacles along the way. 78 | Some of these turned into standalone sketches. 79 | 80 | 81 | 82 | Monty Python's Life of Brian 83 | 84 | Brian is born on the first Christmas, in the stable next to 85 | Jesus'. He spends his life being mistaken for a messiah. 86 | 87 | 88 | <... more Film elements here ...> 89 | 90 | 91 | With *xml4h* you can parse the XML file and use "magical" element and attribute 92 | lookups to read data:: 93 | 94 | >>> import xml4h 95 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 96 | 97 | >>> for film in doc.MontyPythonFilms.Film[:3]: 98 | ... print(film['year'] + ' : ' + film.Title.text) 99 | 1971 : And Now for Something Completely Different 100 | 1974 : Monty Python and the Holy Grail 101 | 1979 : Monty Python's Life of Brian 102 | 103 | You can also use more explicit (non-magical) methods to traverse the DOM:: 104 | 105 | >>> for film in doc.child('MontyPythonFilms').children('Film')[:3]: 106 | ... print(film.attributes['year'] + ' : ' + film.children.first.text) 107 | 1971 : And Now for Something Completely Different 108 | 1974 : Monty Python and the Holy Grail 109 | 1979 : Monty Python's Life of Brian 110 | 111 | The *xml4h* builder makes programmatic document creation simple, with a 112 | method-chaining feature that allows for expressive but sparse code that mirrors 113 | the document itself. Here is the code to build part of the above XML document:: 114 | 115 | >>> b = (xml4h.build('MontyPythonFilms') 116 | ... 
.attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'}) 117 | ... .element('Film') 118 | ... .attributes({'year': 1971}) 119 | ... .element('Title') 120 | ... .text('And Now for Something Completely Different') 121 | ... .up() 122 | ... .elem('Description').t( 123 | ... "A collection of sketches from the first and second TV" 124 | ... " series of Monty Python's Flying Circus purposely" 125 | ... " re-enacted and shot for film." 126 | ... ).up() 127 | ... .up() 128 | ... ) 129 | 130 | >>> # A builder object can be re-used, and has short method aliases 131 | >>> b = (b.e('Film') 132 | ... .attrs(year=1974) 133 | ... .e('Title').t('Monty Python and the Holy Grail').up() 134 | ... .e('Description').t( 135 | ... "King Arthur and his knights embark on a low-budget search" 136 | ... " for the Holy Grail, encountering humorous obstacles along" 137 | ... " the way. Some of these turned into standalone sketches." 138 | ... ).up() 139 | ... .up() 140 | ... ) 141 | 142 | Pretty-print your XML document with *xml4h*'s writer implementation with 143 | methods to write content to a stream or get the content as text with flexible 144 | formatting options:: 145 | 146 | >>> print(b.xml_doc(indent=4, newline=True)) # doctest: +ELLIPSIS 147 | 148 | 149 | 150 | And Now for Something Completely Different 151 | A collection of sketches from ... 152 | 153 | 154 | Monty Python and the Holy Grail 155 | King Arthur and his knights embark ... 156 | 157 | 158 | 159 | 160 | 161 | Why use *xml4h*? 162 | ---------------- 163 | 164 | Python has three popular libraries for working with XML, none of which are 165 | particularly easy to use: 166 | 167 | - `xml.dom.minidom `_ 168 | is a light-weight, moderately-featured implementation of the W3C DOM 169 | that is included in the standard library. Unfortunately the W3C DOM API is 170 | verbose, clumsy, and not very pythonic, and the *minidom* does not support 171 | XPath expressions. 
172 | - `xml.etree.ElementTree `_ 173 | is a fast hierarchical data container that is included in the standard 174 | library and can be used to represent XML, mostly. The API is fairly pythonic 175 | and supports some basic XPath features, but it lacks some DOM traversal 176 | niceties you might expect (e.g. to get an element's parent) and when using it 177 | you often feel like you're working with something subtly different from XML, 178 | because you are. 179 | - `lxml `_ is a fast, full-featured XML library with an API 180 | based on ElementTree but extended. It is your best choice for doing serious 181 | work with XML in Python but it is not included in the standard library, it 182 | can be difficult to install, and it gives you the same it's-XML-but-not-quite 183 | feeling as its ElementTree forebear. 184 | 185 | Given these three options it can be difficult to choose which library to use, 186 | especially if you're new to XML processing in Python and haven't already 187 | used (struggled with) any of them. 188 | 189 | In the past your best bet would have been to go with *lxml* for the most 190 | flexibility, even though it might be overkill, because at least then you 191 | wouldn't have to rewrite your code if you later find you need XPath support or 192 | powerful DOM traversal methods. 193 | 194 | This is where *xml4h* comes in. It provides an abstraction layer over 195 | the existing XML libraries, taking advantage of their power while offering an 196 | improved API and tool set. 197 | 198 | 199 | Development Status: beta 200 | ------------------------ 201 | 202 | Currently *xml4h* includes adapter implementations for three of the main XML 203 | processing Python libraries. 204 | 205 | If you have *lxml* available (highly recommended) it will use that, otherwise 206 | it will fall back to use the *(c)ElementTree* then the *minidom* libraries. 207 | 208 | 209 | 210 | History 211 | ------- 212 | 213 | 1.0 214 | ...
215 | 216 | - Add support for Python 3 (3.5+) 217 | - Dropped support for Python versions before 2.7. 218 | - Fix node namespace prefix values for lxml adapter. 219 | - Improve builder's ``up()`` method to accept and distinguish between a count 220 | of parents to step up, or the name of a target ancestor node. 221 | - Add ``xml()`` and ``xml_doc()`` methods to document builder to more easily 222 | get string content from it, without resorting to the write methods. 223 | - The ``write()`` and ``write_doc()`` methods no longer send output to 224 | ``sys.stdout`` by default. The user must explicitly provide a target writer 225 | object, and hopefully be more mindful of the need to set up encoding correctly 226 | when providing a text stream object. 227 | - Handling of redundant Element namespace prefixes is now more consistent: we 228 | always strip the prefix when the element has an `xmlns` attribute defining 229 | the same namespace URI. 230 | 231 | 0.2.0 232 | ..... 233 | 234 | - Add adapter for the *(c)ElementTree* library versions included as standard 235 | with Python 2.7+. 236 | - Improved "magical" node traversal to work with lowercase tag names without 237 | always needing a trailing underscore. See also improved docs. 238 | - Fixes for: potential errors ASCII-encoding nodes as strings; default XPath 239 | namespace from document node; lookup precedence of xmlns attributes. 240 | 241 | 242 | 0.1.0 243 | ..... 244 | 245 | - Initial alpha release with support for *lxml* and *minidom* libraries. 246 | -------------------------------------------------------------------------------- /docs/advanced.rst: -------------------------------------------------------------------------------- 1 | ======== 2 | Advanced 3 | ======== 4 | 5 | 6 | .. _xml4h-namespaces: 7 | 8 | Namespaces 9 | ========== 10 | 11 | *xml4h* supports using XML namespaces in a number of ways, and tries to make 12 | this sometimes complex and fiddly aspect of XML a little easier to deal with. 
13 | 14 | Namespace URIs 15 | -------------- 16 | 17 | XML document nodes can be associated with a *namespace URI* which uniquely 18 | identifies the namespace. At bottom a URI is really just a name to identify 19 | the namespace, which may or may not point at an actual resource. 20 | 21 | Namespace URIs are the core piece of the namespacing puzzle; everything else is 22 | extra. 23 | 24 | Namespace URI values are assigned to a node in one of three ways: 25 | 26 | - an ``xmlns`` attribute on an element assigns a *namespace URI* to that 27 | element, and may also define a shorthand *prefix* for the namespace:: 28 | 29 | 30 | 31 | .. note:: 32 | Technically the ``xmlns`` attribute must itself also be in the special XML 33 | namespacing namespace http://www.w3.org/2000/xmlns/. You needn't care 34 | about this. 35 | 36 | - a tag or attribute name includes a *prefix* alias portion that specifies the 37 | namespace the item belongs to:: 38 | 39 | 40 | 41 | A prefix alias can be defined using an "xmlns" attribute as described above, 42 | or by using the Builder :meth:`~xml4h.Builder.ns_prefix` or Node 43 | :meth:`~xml4h.nodes.Node.set_ns_prefix` methods. 44 | 45 | - in an apparent effort to reduce confusion around namespace URIs and prefixes, 46 | some XML libraries avoid prefix aliases altogether and instead require you to 47 | specify the full *namespace URI* as a prefix to tag and attribute names 48 | using a special syntax with braces:: 49 | 50 | >>> tagname = '{urn:example-uri}YetAnotherWayToNamespace' 51 | 52 | .. note:: 53 | In the author's opinion, using a non-standard way to define namespaces 54 | does not reduce confusion. *xml4h* supports this approach technically but 55 | not philosophically.
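To make the brace syntax above concrete, here is a minimal sketch using the standard library's *ElementTree*, which is one of the libraries that names namespaced tags this way. The ``split_qualified_tag`` helper is hypothetical, written only for this illustration, and is not part of *xml4h* or *ElementTree*.

```python
import xml.etree.ElementTree as ET

# ElementTree identifies a namespaced tag as '{namespace-uri}local-name'
tagname = '{urn:example-uri}YetAnotherWayToNamespace'
elem = ET.Element(tagname)


def split_qualified_tag(tag):
    """Hypothetical helper: split '{uri}local' into (uri, local).

    Returns (None, tag) when the tag has no brace-delimited URI.
    """
    if tag.startswith('{'):
        uri, _, local = tag[1:].partition('}')
        return uri, local
    return None, tag


ns_uri, local_name = split_qualified_tag(elem.tag)

# When serialised, ElementTree swaps the braces for a generated prefix alias
serialized = ET.tostring(elem, encoding='unicode')
```

The serialised text no longer contains braces: the URI reappears in an ``xmlns`` declaration with an automatically generated prefix, which shows that the brace form is an in-memory naming convention rather than XML syntax.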
56 | 57 | *xml4h* allows you to assign namespace URIs to document nodes when using the 58 | Builder:: 59 | 60 | >>> # Assign a default namespace with ns_uri 61 | >>> import xml4h 62 | >>> b = xml4h.build('Doc', ns_uri='ns-uri') 63 | >>> root = b.root 64 | 65 | >>> # Descendants without a namespace inherit their ancestor's default one 66 | >>> elem1 = b.elem('Elem1').dom_element 67 | >>> elem1.namespace_uri 68 | 'ns-uri' 69 | 70 | >>> # Define a prefix alias to assign a new or existing namespace URI 71 | >>> elem2 = b.ns_prefix('my-ns', 'second-ns-uri') \ 72 | ... .elem('my-ns:Elem2').dom_element 73 | >>> print(root.xml()) 74 | 75 | 76 | 77 | 78 | 79 | >>> # Or use the explicit URI prefix approach, if you must 80 | >>> elem3 = b.elem('{third-ns-uri}Elem3').dom_element 81 | >>> elem3.namespace_uri 82 | 'third-ns-uri' 83 | 84 | And when adding nodes with the API:: 85 | 86 | >>> # Define the ns_uri argument when creating a new element 87 | >>> elem4 = root.add_element('Elem4', ns_uri='fourth-ns-uri') 88 | 89 | >>> # Attributes can be namespaced too 90 | >>> elem4.set_attributes({'my-ns:attr1': 'value'}) 91 | 92 | >>> print(elem4.xml()) 93 | 94 | 95 | 96 | Filtering by Namespace 97 | ---------------------- 98 | 99 | *xml4h* allows you to find and filter nodes based on their namespace. 100 | 101 | The :meth:`~xml4h.nodes.Node.find` method takes an ``ns_uri`` keyword argument to 102 | return only elements in that namespace:: 103 | 104 | >>> # By default, find ignores namespaces... 
105 | >>> [n.local_name for n in root.find()] 106 | ['Elem1', 'Elem2', 'Elem3', 'Elem4'] 107 | >>> # ...but will filter by namespace URI if you wish 108 | >>> [n.local_name for n in root.find(ns_uri='fourth-ns-uri')] 109 | ['Elem4'] 110 | 111 | Similarly, a node's children listing can be filtered:: 112 | 113 | >>> len(root.children) 114 | 4 115 | >>> root.children(ns_uri='ns-uri') 116 | [] 117 | 118 | XPath queries can also filter by namespace, but the 119 | :meth:`~xml4h.nodes.Node.xpath` method needs to be given a dictionary mapping 120 | of prefix aliases to URIs:: 121 | 122 | >>> root.xpath('//ns4:*', namespaces={'ns4': 'fourth-ns-uri'}) 123 | [] 124 | 125 | .. note:: 126 | Normally, because XPath queries rely on namespace prefix aliases, they 127 | cannot find namespaced nodes in the default namespace which has an "empty" 128 | prefix name. *xml4h* works around this limitation by providing the special 129 | empty/default prefix alias '_'. 130 | 131 | 132 | Element Names: Local and Prefix Components 133 | ------------------------------------------ 134 | 135 | When you use a namespace prefix alias to define the namespace an element or 136 | attribute belongs to, the name of that node will be made up of two components: 137 | 138 | - *prefix* - the namespace alias. 139 | - *local* - the real name of the node, without the namespace alias. 140 | 141 | *xml4h* makes the full (qualified) name, and the two components, available as 142 | node attributes:: 143 | 144 | >>> # Elem2's namespace was defined earlier using a prefix alias 145 | >>> elem2 146 | 147 | 148 | >>> # The full node name... 149 | >>> elem2.name 150 | 'my-ns:Elem2' 151 | >>> # ...comprises a prefix... 
152 | >>> elem2.prefix 153 | 'my-ns' 154 | >>> # ...and a local name component 155 | >>> elem2.local_name 156 | 'Elem2' 157 | 158 | >>> # Here is an element without a prefix alias 159 | >>> elem1.name 160 | 'Elem1' 161 | >>> elem1.prefix is None 162 | True 163 | >>> elem1.local_name 164 | 'Elem1' 165 | 166 | 167 | .. _xml-lib-architecture: 168 | 169 | *xml4h* Architecture 170 | ==================== 171 | 172 | To best understand the *xml4h* library and to use it appropriately in demanding 173 | situations, you should appreciate what the library is not. 174 | 175 | *xml4h* is not a full-fledged XML library in its own right, far from it. 176 | Instead of implementing low-level document parsing and manipulation tools, it 177 | operates as an abstraction layer on top of the pre-existing XML processing 178 | libraries you already know. 179 | 180 | This means the improved API and tool suite provided by *xml4h* work by 181 | mediating operations you perform, asking the underlying XML library to do the 182 | work, and packaging up the results of this work as wrapped *xml4h* objects. 183 | 184 | This approach has a number of implications, good and bad. 185 | 186 | On the good side: 187 | 188 | - you can start using and benefiting from *xml4h* in existing projects that 189 | already use a supported XML library without any impact; it can fit right in. 190 | - *xml4h* can take advantage of the existing powerful and fast XML libraries to 191 | do its work. 192 | - by providing an abstraction layer over multiple libraries, *xml4h* can make 193 | it (relatively) easy to switch the underlying library without you needing to 194 | rewrite your own XML handling code. 195 | - by building on the shoulders of giants, *xml4h* itself can remain relatively 196 | lightweight and focussed on simplicity and usability. 197 | - the author of *xml4h* does not have to write XML-handling code in C...
198 | 199 | On the bad side: 200 | 201 | - if the underlying XML libraries available in the Python environment do not 202 | support a feature (like XPath querying) then that feature will not be 203 | available in *xml4h*. 204 | - *xml4h* cannot provide radical new XML processing features, since the bulk of 205 | its work must be done by the underlying library. 206 | - the abstraction layer *xml4h* uses to do its work requires more resources 207 | than it would to use the underlying library directly, so if you absolutely 208 | need maximal speed or minimal memory use the library might prove too 209 | expensive. 210 | - *xml4h* sometimes needs to jump through some hoops to maintain the shared 211 | abstraction interface over multiple libraries, which means extra work is 212 | done in Python instead of by the underlying library code in C. 213 | 214 | The author believes the benefits of using *xml4h* outweigh the drawbacks in 215 | the majority of real-world situations, or he wouldn't have created the library 216 | in the first place, but ultimately it is up to you to decide where you should 217 | or should not use it. 218 | 219 | 220 | .. _xml-lib-adapters: 221 | 222 | Library Adapters 223 | ---------------- 224 | 225 | To provide an abstraction layer over multiple underlying XML libraries, *xml4h* 226 | uses an "adapter" mechanism to mediate operations on documents. There is an 227 | adapter implementation for each library *xml4h* can work with, each of which 228 | extends the :class:`~xml4h.impls.interface.XmlImplAdapter` class. This base 229 | class includes some standard behaviour, and defines the interface for adapter 230 | implementations (to the extent you can define such interfaces in Python).
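The adapter idea described above can be sketched roughly as follows, using only the standard library. Every class and method name here (``SketchAdapter``, ``parse_string``, ``root_name``, ``child_names``, ``describe``) is hypothetical and chosen for illustration; this is not the actual ``XmlImplAdapter`` interface, only the general shape of a base class with shared behaviour and per-library subclasses.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET


class SketchAdapter(object):
    """Hypothetical base class: shared behaviour plus the interface to fill in."""

    def parse_string(self, xml_text):
        raise NotImplementedError

    def root_name(self, doc):
        raise NotImplementedError

    def child_names(self, doc):
        raise NotImplementedError

    def describe(self, xml_text):
        # Shared behaviour, written once in terms of the interface above
        doc = self.parse_string(xml_text)
        return (self.root_name(doc), self.child_names(doc))


class MinidomSketchAdapter(SketchAdapter):
    """Performs the operations with the W3C-style minidom API."""

    def parse_string(self, xml_text):
        return xml.dom.minidom.parseString(xml_text)

    def root_name(self, doc):
        return doc.documentElement.tagName

    def child_names(self, doc):
        return [node.tagName for node in doc.documentElement.childNodes
                if node.nodeType == node.ELEMENT_NODE]


class ElementTreeSketchAdapter(SketchAdapter):
    """Performs the same operations with the ElementTree API."""

    def parse_string(self, xml_text):
        return ET.fromstring(xml_text)

    def root_name(self, doc):
        return doc.tag

    def child_names(self, doc):
        return [child.tag for child in doc]


SAMPLE = '<Films><Film/><Film/></Films>'
results = [adapter.describe(SAMPLE)
           for adapter in (MinidomSketchAdapter(), ElementTreeSketchAdapter())]
```

Code that calls only ``describe()`` gets the same answer from either adapter, which is the property that lets calling code stay unaware of the library doing the work.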
231 | 232 | The current version of *xml4h* includes adapter implementations for the three 233 | main XML processing libraries for Python: 234 | 235 | - :class:`~xml4h.impls.lxml_etree.LXMLAdapter` works with the excellent 236 | `lxml `_ library which is very full-featured and fast, but 237 | which is not included in the standard library. 238 | - :class:`~xml4h.impls.xml_etree_elementtree.cElementTreeAdapter` and 239 | :class:`~xml4h.impls.xml_etree_elementtree.ElementTreeAdapter` work with the 240 | *ElementTree* libraries included with the standard library of Python versions 241 | 2.7 and later. *ElementTree* is fast and includes support for some basic 242 | XPath expressions. If the C-based version of ElementTree is available, the 243 | former adapter is made available and should be used for best performance. 244 | - :class:`~xml4h.impls.xml_dom_minidom.XmlDomImplAdapter` works with the 245 | `minidom `_ W3C-style 246 | XML library included with the standard library. This library is always 247 | available but is slower and has fewer features than alternative libraries 248 | (e.g. no support for XPath) 249 | 250 | The adapter layer allows the rest of the *xml4h* library code to remain almost 251 | entirely oblivious to the underlying XML library that happens to be available 252 | at the time. The *xml4h* Builder, Node objects, writer etc. call adapter 253 | methods to perform document operations, and the adapter is responsible for 254 | doing the necessary work with the underlying library. 255 | 256 | 257 | .. _best-adapter: 258 | 259 | "Best" Adapter 260 | -------------- 261 | 262 | While *xml4h* can work with multiple underlying XML libraries, some of these 263 | libraries are better (faster, more fully-featured) than others so it would be 264 | smart to use the best of the libraries available. 
265 | 266 | *xml4h* does exactly that: unless you explicitly choose an adapter (see below) 267 | *xml4h* will find the supported libraries in the Python environment and choose 268 | the "best" adapter for you in the circumstances. 269 | 270 | Here is the list of libraries *xml4h* will choose from, best to least-best: 271 | 272 | - *lxml* 273 | - *(c)ElementTree* 274 | - *ElementTree* 275 | - *minidom* 276 | 277 | The :attr:`xml4h.best_adapter` attribute stores the adapter class that *xml4h* 278 | considers to be the best. 279 | 280 | .. note:: 281 | You cannot always rely on *xml4h* to choose the right underlying XML library 282 | for your needs. For cases where you need to use a specific library, such as 283 | when you have a pre-parsed document object, see `wrap-unwrap-nodes`_. 284 | 285 | 286 | Choose Your Own Adapter 287 | ----------------------- 288 | 289 | By default, *xml4h* will choose an adapter and underlying XML library 290 | implementation that it considers the best available. However, in some cases you 291 | may need to have full control over which underlying implementation *xml4h* 292 | uses, perhaps because you will use features of the underlying XML 293 | implementation later on, or because you need the performance characteristics 294 | only available in a particular library. 295 | 296 | For these situations it is possible to tell *xml4h* which adapter 297 | implementation, and therefore which underlying XML library, it should use. 298 | 299 | To use a specific adapter implementation when parsing a document, or when 300 | creating a new document using the builder, simply provide the optional 301 | ``adapter`` keyword argument to the relevant method: 302 | 303 | - Parsing:: 304 | 305 | >>> # Explicitly use the minidom adapter to parse a document 306 | >>> minidom_doc = xml4h.parse('tests/data/monty_python_films.xml', 307 | ... 
adapter=xml4h.XmlDomImplAdapter) 308 | >>> minidom_doc.root.impl_node #doctest:+ELLIPSIS 309 | >> # Explicitly use the lxml adapter to build a document 314 | >>> lxml_b = xml4h.build('MyDoc', adapter=xml4h.LXMLAdapter) 315 | >>> lxml_b.root.impl_node #doctest:+ELLIPSIS 316 | >> # Use xml4h with a cElementTree document object 321 | >>> import xml.etree.ElementTree as ET 322 | >>> et_doc = ET.parse('tests/data/monty_python_films.xml') 323 | >>> et_doc #doctest:+ELLIPSIS 324 | >> doc = xml4h.cElementTreeAdapter.wrap_document(et_doc) 326 | >>> doc.root 327 | 328 | 329 | 330 | Check Feature Support 331 | ..................... 332 | 333 | Because not all underlying XML libraries support all the features exposed by 334 | *xml4h*, the library includes a simple mechanism to check whether a given 335 | feature is available in the current Python environment or with the current 336 | adapter. 337 | 338 | To check for feature support call the :meth:`~xml4h.nodes.Node.has_feature` 339 | method on a document node, or 340 | :meth:`~xml4h.impls.interface.XmlImplAdapter.has_feature` on an adapter class. 341 | 342 | List of features that are not available in all adapters: 343 | 344 | - ``xpath`` - Can perform XPath queries using the 345 | :meth:`~xml4h.nodes.Node.xpath` method. 346 | - More to come later, probably... 347 | 348 | For example, here is how you would test for XPath support in the *minidom* 349 | adapter, which doesn't include it:: 350 | 351 | >>> minidom_doc.root.has_feature('xpath') 352 | False 353 | 354 | If you forget to check for a feature and use it anyway, you will get 355 | a :class:`~xml4h.exceptions.FeatureUnavailableException`:: 356 | 357 | >>> try: 358 | ... minidom_doc.root.xpath('//*') 359 | ... except Exception as e: 360 | ... e #doctest:+ELLIPSIS 361 | FeatureUnavailableException('xpath'... 
362 | 363 | 364 | Adapter & Implementation Quirks 365 | ------------------------------- 366 | 367 | Although *xml4h* aims to provide a seamless abstraction over underlying XML 368 | library implementations this isn't always possible, or is only possible by 369 | performing lots of extra work that affects performance. This section describes 370 | some implementation-specific quirks or differences you may encounter. 371 | 372 | .. note:: 373 | This set of quirks is almost certainly incomplete, please report issues you 374 | find so they can either be fixed (in the best case) or captured here as 375 | known trouble-spots. 376 | 377 | LXMLAdapter - *lxml* 378 | .................... 379 | 380 | - *lxml* does not have full support for CDATA nodes, which devolve into plain 381 | text node values when written (by *xml4h* or by *lxml*'s writer). 382 | - Namespaces defined by adding ``xmlns`` element attributes are not properly 383 | represented in the underlying implementation due to the *lxml* library's 384 | immutable ``nsmap`` namespace map. Such namespaces are written correctly 385 | by the *xml4h* writer, but to avoid quirks it is best to specify the namespace 386 | when creating nodes by setting the ``ns_uri`` keyword argument. 387 | - When *xml4h* writes *lxml*-based documents with namespaces, some node tag 388 | names may have unnecessary namespace prefix aliases. 389 | 390 | (c)ElementTreeAdapter - *ElementTree* 391 | ..................................... 392 | 393 | - Only the versions of (c)ElementTree included with Python version 2.7 and 394 | later are supported. 395 | - *ElementTree* supports only a very limited subset of XPath for querying, so 396 | although the ``has_feature('xpath')`` check returns ``True`` don't expect to 397 | get the full power of XPath when you use this adapter. 398 | - *ElementTree* does not have full support for CDATA nodes, which devolve into 399 | plain text node values when written (by *xml4h* or by *ElementTree*'s writer).
400 | - Because *ElementTree* doesn't retain information about a node's parent, 401 | *xml4h* needs to build and maintain its own records of which nodes are 402 | parents of which children. This extra overhead might harm performance or 403 | memory usage. 404 | - *ElementTree* doesn't normally remember explicit namespace definition 405 | directives when parsing a document. *xml4h* works around this when it is 406 | asked to parse XML data, but if you parse data outside of *xml4h* then use 407 | the library on the resultant document the namespace definitions will get 408 | messed up. 409 | 410 | XmlDomImplAdapter - *minidom* 411 | ............................. 412 | 413 | - No support for performing XPath queries. 414 | - Slower than alternative C-based implementations. 415 | -------------------------------------------------------------------------------- /docs/api.rst: -------------------------------------------------------------------------------- 1 | === 2 | API 3 | === 4 | 5 | 6 | Main Interface 7 | -------------- 8 | 9 | .. automodule:: xml4h 10 | :members: parse, build, best_adapter 11 | 12 | 13 | Builder 14 | ------- 15 | 16 | .. automodule:: xml4h.builder 17 | :members: 18 | 19 | 20 | Writer 21 | ------ 22 | 23 | .. automodule:: xml4h.writer 24 | :members: 25 | 26 | 27 | .. _api-nodes: 28 | 29 | DOM Nodes API 30 | ------------- 31 | 32 | .. automodule:: xml4h.nodes 33 | :members: 34 | :special-members: 35 | :private-members: 36 | 37 | 38 | XML Library Adapters 39 | --------------------- 40 | 41 | .. automodule:: xml4h.impls.interface 42 | :members: 43 | 44 | .. automodule:: xml4h.impls.lxml_etree 45 | :members: 46 | 47 | .. automodule:: xml4h.impls.xml_etree_elementtree 48 | :members: 49 | 50 | .. automodule:: xml4h.impls.xml_dom_minidom 51 | :members: 52 | 53 | 54 | Custom Exceptions 55 | ----------------- 56 | 57 | .. 
automodule:: xml4h.exceptions 58 | :members: 59 | -------------------------------------------------------------------------------- /docs/builder.rst: -------------------------------------------------------------------------------- 1 | .. _builder: 2 | 3 | ======= 4 | Builder 5 | ======= 6 | 7 | *xml4h* includes a document builder tool that makes it easy to create valid, 8 | well-formed XML documents using relatively sparse python code. It makes it so 9 | easy to create XML that you will no longer be tempted to cobble together 10 | documents with error-prone methods like manual string concatenation or a 11 | templating library. 12 | 13 | Internally, the builder uses the DOM-building features of an underlying XML 14 | library which means it is (almost) impossible to construct an invalid document. 15 | 16 | Here is some example code to build a document about Monty Python films:: 17 | 18 | >>> import xml4h 19 | >>> xmlb = (xml4h.build('MontyPythonFilms') 20 | ... .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'}) 21 | ... .element('Film') 22 | ... .attributes({'year': 1971}) 23 | ... .element('Title') 24 | ... .text('And Now for Something Completely Different') 25 | ... .up() 26 | ... .elem('Description').t( 27 | ... "A collection of sketches from the first and second TV" 28 | ... " series of Monty Python's Flying Circus purposely" 29 | ... " re-enacted and shot for film.") 30 | ... .up() 31 | ... .up() 32 | ... .elem('Film') 33 | ... .attrs(year=1974) 34 | ... .e('Title') 35 | ... .t('Monty Python and the Holy Grail') 36 | ... .up() 37 | ... .e('Description').t( 38 | ... "King Arthur and his knights embark on a low-budget search" 39 | ... " for the Holy Grail, encountering humorous obstacles along" 40 | ... " the way. Some of these turned into standalone sketches." 41 | ... ).up() 42 | ... 
) 43 | 44 | The code above produces the following XML document (abbreviated):: 45 | 46 | >>> print(xmlb.xml_doc(indent=True)) # doctest:+ELLIPSIS 47 | 48 | 49 | 50 | And Now for Something Completely Different 51 | A collection of sketches from the first and second... 52 | 53 | 54 | Monty Python and the Holy Grail 55 | King Arthur and his knights embark on a low-budget... 56 | 57 | 58 | 59 | 60 | 61 | Getting Started 62 | --------------- 63 | 64 | You typically create a new XML document builder by calling the 65 | :func:`xml4h.build` function with the name of the root element:: 66 | 67 | >>> root_b = xml4h.build('RootElement') 68 | 69 | The function returns a :class:`~xml4h.builder.Builder` object that represents 70 | the *RootElement* and allows you to manipulate this element's attributes 71 | or to add child elements. 72 | 73 | Once you have the first builder instance, every action you perform to add 74 | content to the XML document will return another instance of the Builder class:: 75 | 76 | >>> # Add attributes to the root element's Builder 77 | >>> root_b = root_b.attributes({'a': 1, 'b': 2}, c=3) 78 | 79 | >>> root_b #doctest:+ELLIPSIS 80 | >> root_b.dom_element 86 | 87 | 88 | >>> root_b.dom_element.attributes 89 | 90 | 91 | When you add a new child element, the result is a builder instance representing 92 | that child element, *not the original element*:: 93 | 94 | >>> child1_b = root_b.element('ChildElement1') 95 | >>> child2_b = root_b.element('ChildElement2') 96 | 97 | >>> # The element method returns a Builder wrapping the new child element 98 | >>> child2_b.dom_element 99 | 100 | >>> child2_b.dom_element.parent 101 | 102 | 103 | This feature of the builder can be a little confusing, but it allows for the 104 | very convenient method-chaining feature that gives the builder its power. 105 | 106 | 107 | .. 
_builder-method-chaining: 108 | 109 | Method Chaining 110 | --------------- 111 | 112 | Because every builder method that adds content to the XML document returns 113 | a builder instance representing the nearest (or newest) element, you can 114 | chain together many method calls to construct your document without any 115 | need for intermediate variables. 116 | 117 | For example, the example code in the previous section used the variables 118 | ``root_b``, ``child1_b`` and ``child2_b`` to represent builder instances but 119 | this is not necessary. Here is how you can use method-chaining to build the 120 | same document with less code:: 121 | 122 | >>> b = (xml4h 123 | ... .build('RootElement').attributes({'a': 1, 'b': 2}, c=3) 124 | ... .element('ChildElement1').up() # NOTE the up() method 125 | ... .element('ChildElement2') 126 | ... ) 127 | 128 | >>> print(b.xml_doc(indent=4)) 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | Notice how you can use chained method calls to write code with a structure 137 | that mirrors that of the XML document you want to produce? This makes it 138 | much easier to spot errors in your code than it would be if you were to 139 | concatenate strings. 140 | 141 | .. note:: 142 | 143 | It is a good idea to wrap the :func:`~xml4h.build` function call and all 144 | following chained methods in parentheses, so you don't need to put 145 | backslash (\\) characters at the end of every line. 146 | 147 | The code above introduces a very important builder method: 148 | :meth:`~xml4h.builder.Builder.up`. This method returns a builder instance 149 | representing the current element's parent, or indeed any ancestor. 150 | 151 | Without the ``up()`` method, every time you created a child element with the 152 | builder you would end up deeper in the document structure with no way to return 153 | to prior elements to add sibling nodes or hierarchies. 
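To make the role of ``up()`` concrete, here is a minimal sketch of how a chaining builder with an ``up()`` method could work. It is an illustration only: ``MiniBuilder`` and its ``node`` and ``parent`` attributes are hypothetical names built on the standard library's *ElementTree*, and this is not how *xml4h* itself is implemented.

```python
import xml.etree.ElementTree as ET


class MiniBuilder:
    """Toy chaining builder (hypothetical; not the xml4h Builder class)."""

    def __init__(self, node, parent=None):
        self.node = node      # the element this builder wraps
        self.parent = parent  # builder wrapping the parent element, if any

    def element(self, name):
        # Adding a child returns a builder for the *child*, not for self.
        return MiniBuilder(ET.SubElement(self.node, name), parent=self)

    def text(self, value):
        # Setting text keeps the builder on the current element.
        self.node.text = str(value)
        return self

    def up(self, target=1):
        builder = self
        if isinstance(target, int):
            # Climb a fixed number of levels, stopping at the root.
            for _ in range(target):
                if builder.parent is None:
                    break
                builder = builder.parent
        else:
            # Climb to the nearest ancestor with this tag name,
            # or stop at the root if no ancestor matches.
            while builder.parent is not None:
                builder = builder.parent
                if builder.node.tag == target:
                    break
        return builder


b = (MiniBuilder(ET.Element('Root'))
     .element('Child1').text('one').up()   # back to Root
     .element('Child2').element('Grandchild')
     )
print(b.node.tag)             # the deepest element added so far
print(b.up(2).node.tag)       # two levels up
print(b.up('Root').node.tag)  # named ancestor
```

Because every call returns a builder, the chain can keep going; ``up()`` is what lets the chain move back towards the root so that ``Child2`` becomes a sibling of ``Child1`` rather than its child.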
154 | 155 | To help reduce the number of ``up()`` method calls you need to include in 156 | your code, this method can also jump up multiple levels or to a named ancestor 157 | element:: 158 | 159 | >>> # A builder that references a deeply-nested element: 160 | >>> deep_b = (xml4h.build('Root') 161 | ... .element('Deep') 162 | ... .element('AndDeeper') 163 | ... .element('AndDeeperStill') 164 | ... .element('UntilWeGetThere') 165 | ... ) 166 | >>> deep_b.dom_element 167 | 168 | 169 | >>> # Jump up 4 levels, back to the root element 170 | >>> deep_b.up(4).dom_element 171 | 172 | 173 | >>> # Jump up to a named ancestor element 174 | >>> deep_b.up('Root').dom_element 175 | 176 | 177 | .. note:: 178 | To avoid making subtle errors in your document's structure, we recommend you 179 | use :meth:`~xml4h.builder.Builder.up` calls to return up one level for every 180 | :meth:`~xml4h.builder.Builder.element` method (or alias) you call. 181 | 182 | 183 | Shorthand Methods 184 | ----------------- 185 | 186 | To make your XML-producing code even less verbose and quicker to type, the 187 | builder has shorthand "alias" methods corresponding to the full names. 188 | 189 | For example, instead of calling ``element()`` to create a new 190 | child element, you can instead use the equivalent ``elem()`` or ``e()`` 191 | methods. Similarly, instead of typing ``attributes()`` you can use ``attrs()`` 192 | or ``a()``. 
193 | 194 | Here are the methods and method aliases for adding content to an XML document: 195 | 196 | =================== ========================== ================ 197 | XML Node Created Builder method Aliases 198 | =================== ========================== ================ 199 | Element ``element`` ``elem``, ``e`` 200 | Attribute ``attributes`` ``attrs``, ``a`` 201 | Text ``text`` ``t`` 202 | CDATA ``cdata`` ``data``, ``d`` 203 | Comment ``comment`` ``c`` 204 | Process Instruction ``processing_instruction`` ``inst``, ``i`` 205 | =================== ========================== ================ 206 | 207 | These shorthand method aliases are convenient and lead to even less cruft 208 | around the actual XML content you are interested in. But on the other hand 209 | they are much less explicit than the longer versions, so use them judiciously. 210 | 211 | 212 | Access the DOM 213 | -------------- 214 | 215 | The XML builder is merely a layer of convenience methods that sits on the 216 | :mod:`xml4h.nodes` DOM API. This means you can quickly access the underlying 217 | nodes from a builder if you need to inspect them or manipulate them in a 218 | way the builder doesn't allow: 219 | 220 | - The :attr:`~xml4h.builder.Builder.dom_element` attribute returns a builder's 221 | underlying :class:`~xml4h.nodes.Element` 222 | - The :attr:`~xml4h.builder.Builder.root` attribute returns the document's 223 | root element. 224 | - The :attr:`~xml4h.builder.Builder.document` attribute returns a builder's 225 | underlying :class:`~xml4h.nodes.Document`. 226 | 227 | See the :ref:`api-nodes` documentation to find out how to work with DOM 228 | element nodes once you get them. 229 | 230 | 231 | Building on an Existing DOM 232 | --------------------------- 233 | 234 | When you are building an XML document from scratch you will generally use 235 | the :func:`~xml4h.build` function described in `Getting Started`_. 
However, 236 | what if you want to add content to an XML document DOM you have already parsed? 237 | 238 | To wrap an :class:`~xml4h.nodes.Element` DOM node with a builder you simply 239 | provide the element node to the same :func:`~xml4h.build` function used previously and 240 | it will do the right thing. 241 | 242 | Here is an example of parsing an existing XML document, locating an element 243 | of interest, constructing a builder from that element, and adding some new 244 | content. Luckily, the code is simpler than that description... 245 | 246 | :: 247 | 248 | >>> # Parse an XML document 249 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 250 | 251 | >>> # Find an Element node of interest 252 | >>> lob_film_elem = doc.MontyPythonFilms.Film[2] 253 | >>> lob_film_elem.Title.text 254 | "Monty Python's Life of Brian" 255 | 256 | >>> # Construct a builder from the element 257 | >>> lob_builder = xml4h.build(lob_film_elem) 258 | 259 | >>> # Add content 260 | >>> b = (lob_builder.attrs(stars=5) 261 | ... .elem('Review').t('One of my favourite films!').up()) 262 | 263 | >>> # See the results 264 | >>> print(lob_builder.xml()) # doctest:+ELLIPSIS 265 | 266 | Monty Python's Life of Brian 267 | Brian is born on the first Christmas, in the stable... 268 | One of my favourite films! 269 | 270 | 271 | 272 | Hydra-Builder 273 | ------------- 274 | 275 | Because each builder class instance is independent, an advanced technique for 276 | constructing complex documents is to use multiple builders anchored at 277 | different places in the DOM. In some situations, the ability to add content 278 | to different places in the same document can be very handy.
279 | 280 | Here is a trivial example of this technique:: 281 | 282 | >>> # Create two Elements in a doc to store even or odd numbers 283 | >>> odd_b = xml4h.build('EvenAndOdd').elem('Odd') 284 | >>> even_b = odd_b.up().elem('Even') 285 | 286 | >>> # Populate the numbers from a loop 287 | >>> for i in range(1, 11): # doctest:+ELLIPSIS 288 | ... if i % 2 == 0: 289 | ... even_b.elem('Number').text(i) 290 | ... else: 291 | ... odd_b.elem('Number').text(i) 292 | <... 293 | 294 | >>> # Check the final document 295 | >>> print(odd_b.xml_doc(indent=True)) 296 | 297 | 298 | 299 | 1 300 | 3 301 | 5 302 | 7 303 | 9 304 | 305 | 306 | 2 307 | 4 308 | 6 309 | 8 310 | 10 311 | 312 | 313 | 314 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # xml4h documentation build configuration file, created by 4 | # sphinx-quickstart on Thu Aug 30 22:29:54 2012. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 13 | 14 | import sys, os 15 | from xml4h import __version__ 16 | 17 | # If extensions (or modules to document with autodoc) are in another directory, 18 | # add these directories to sys.path here. If the directory is relative to the 19 | # documentation root, use os.path.abspath to make it absolute, like shown here. 20 | #sys.path.insert(0, os.path.abspath('.')) 21 | 22 | # -- General configuration ----------------------------------------------------- 23 | 24 | # If your documentation needs a minimal Sphinx version, state it here. 25 | #needs_sphinx = '1.0' 26 | 27 | # Add any Sphinx extension module names here, as strings. 
They can be extensions 28 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 29 | extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode'] 30 | 31 | # Add any paths that contain templates here, relative to this directory. 32 | templates_path = ['_templates'] 33 | 34 | # The suffix of source filenames. 35 | source_suffix = '.rst' 36 | 37 | # The encoding of source files. 38 | #source_encoding = 'utf-8-sig' 39 | 40 | # The master toctree document. 41 | master_doc = 'index' 42 | 43 | # General information about the project. 44 | project = 'xml4h' 45 | copyright = '2020, James Murty' 46 | 47 | # The version info for the project you're documenting, acts as replacement for 48 | # |version| and |release|, also used in various other places throughout the 49 | # built documents. 50 | # 51 | # The short X.Y version. 52 | version = __version__ 53 | # The full version, including alpha/beta/rc tags. 54 | release = version 55 | 56 | # The language for content autogenerated by Sphinx. Refer to documentation 57 | # for a list of supported languages. 58 | #language = None 59 | 60 | # There are two options for replacing |today|: either, you set today to some 61 | # non-false value, then it is used: 62 | #today = '' 63 | # Else, today_fmt is used as the format for a strftime call. 64 | #today_fmt = '%B %d, %Y' 65 | 66 | # List of patterns, relative to source directory, that match files and 67 | # directories to ignore when looking for source files. 68 | exclude_patterns = ['_build'] 69 | 70 | # The reST default role (used for this markup: `text`) to use for all documents. 71 | #default_role = None 72 | 73 | # If true, '()' will be appended to :func: etc. cross-reference text. 74 | #add_function_parentheses = True 75 | 76 | # If true, the current module name will be prepended to all description 77 | # unit titles (such as .. function::). 78 | #add_module_names = True 79 | 80 | # If true, sectionauthor and moduleauthor directives will be shown in the 81 | # output. 
They are ignored by default. 82 | #show_authors = False 83 | 84 | # The name of the Pygments (syntax highlighting) style to use. 85 | pygments_style = 'sphinx' 86 | 87 | # A list of ignored prefixes for module index sorting. 88 | #modindex_common_prefix = [] 89 | 90 | 91 | # -- Options for HTML output --------------------------------------------------- 92 | 93 | # The theme to use for HTML and HTML Help pages. See the documentation for 94 | # a list of builtin themes. 95 | html_theme = 'default' 96 | 97 | # Theme options are theme-specific and customize the look and feel of a theme 98 | # further. For a list of options available for each theme, see the 99 | # documentation. 100 | #html_theme_options = {} 101 | 102 | # Add any paths that contain custom themes here, relative to this directory. 103 | #html_theme_path = [] 104 | 105 | # The name for this set of Sphinx documents. If None, it defaults to 106 | # " v documentation". 107 | #html_title = None 108 | 109 | # A shorter title for the navigation bar. Default is the same as html_title. 110 | #html_short_title = None 111 | 112 | # The name of an image file (relative to this directory) to place at the top 113 | # of the sidebar. 114 | #html_logo = None 115 | 116 | # The name of an image file (within the static path) to use as favicon of the 117 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 118 | # pixels large. 119 | #html_favicon = None 120 | 121 | # Add any paths that contain custom static files (such as style sheets) here, 122 | # relative to this directory. They are copied after the builtin static files, 123 | # so a file named "default.css" will overwrite the builtin "default.css". 124 | html_static_path = ['_static'] 125 | 126 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 127 | # using the given strftime format. 
128 | #html_last_updated_fmt = '%b %d, %Y' 129 | 130 | # If true, SmartyPants will be used to convert quotes and dashes to 131 | # typographically correct entities. 132 | #html_use_smartypants = True 133 | 134 | # Custom sidebar templates, maps document names to template names. 135 | #html_sidebars = {} 136 | 137 | # Additional templates that should be rendered to pages, maps page names to 138 | # template names. 139 | #html_additional_pages = {} 140 | 141 | # If false, no module index is generated. 142 | #html_domain_indices = True 143 | 144 | # If false, no index is generated. 145 | #html_use_index = True 146 | 147 | # If true, the index is split into individual pages for each letter. 148 | #html_split_index = False 149 | 150 | # If true, links to the reST sources are added to the pages. 151 | #html_show_sourcelink = True 152 | 153 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 154 | #html_show_sphinx = True 155 | 156 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 157 | #html_show_copyright = True 158 | 159 | # If true, an OpenSearch description file will be output, and all pages will 160 | # contain a tag referring to it. The value of this option must be the 161 | # base URL from which the finished HTML is served. 162 | #html_use_opensearch = '' 163 | 164 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 165 | #html_file_suffix = None 166 | 167 | # Output file base name for HTML help builder. 168 | htmlhelp_basename = 'xml4hdoc' 169 | 170 | 171 | # -- Options for LaTeX output -------------------------------------------------- 172 | 173 | latex_elements = { 174 | # The paper size ('letterpaper' or 'a4paper'). 175 | #'papersize': 'letterpaper', 176 | 177 | # The font size ('10pt', '11pt' or '12pt'). 178 | #'pointsize': '10pt', 179 | 180 | # Additional stuff for the LaTeX preamble. 181 | #'preamble': '', 182 | } 183 | 184 | # Grouping the document tree into LaTeX files. 
List of tuples 185 | # (source start file, target name, title, author, documentclass [howto/manual]). 186 | latex_documents = [ 187 | ('index', 'xml4h.tex', 'xml4h Documentation', 188 | 'James Murty', 'manual'), 189 | ] 190 | 191 | # The name of an image file (relative to this directory) to place at the top of 192 | # the title page. 193 | #latex_logo = None 194 | 195 | # For "manual" documents, if this is true, then toplevel headings are parts, 196 | # not chapters. 197 | #latex_use_parts = False 198 | 199 | # If true, show page references after internal links. 200 | #latex_show_pagerefs = False 201 | 202 | # If true, show URL addresses after external links. 203 | #latex_show_urls = False 204 | 205 | # Documents to append as an appendix to all manuals. 206 | #latex_appendices = [] 207 | 208 | # If false, no module index is generated. 209 | #latex_domain_indices = True 210 | 211 | 212 | # -- Options for manual page output -------------------------------------------- 213 | 214 | # One entry per manual page. List of tuples 215 | # (source start file, name, description, authors, manual section). 216 | man_pages = [ 217 | ('index', 'xml4h', 'xml4h Documentation', 218 | ['James Murty'], 1) 219 | ] 220 | 221 | # If true, show URL addresses after external links. 222 | #man_show_urls = False 223 | 224 | 225 | # -- Options for Texinfo output ------------------------------------------------ 226 | 227 | # Grouping the document tree into Texinfo files. List of tuples 228 | # (source start file, target name, title, author, 229 | # dir menu entry, description, category) 230 | texinfo_documents = [ 231 | ('index', 'xml4h', 'xml4h Documentation', 232 | 'James Murty', 'xml4h', 'One line description of project.', 233 | 'Miscellaneous'), 234 | ] 235 | 236 | # Documents to append as an appendix to all manuals. 237 | #texinfo_appendices = [] 238 | 239 | # If false, no module index is generated. 
240 | #texinfo_domain_indices = True 241 | 242 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 243 | #texinfo_show_urls = 'footnote' 244 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. xml4h documentation master file, created by 2 | sphinx-quickstart on Thu Aug 30 22:29:54 2012. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | .. include:: ../README.rst 7 | 8 | 9 | ========== 10 | User Guide 11 | ========== 12 | 13 | .. toctree:: 14 | :maxdepth: 3 15 | 16 | parser 17 | builder 18 | writer 19 | nodes 20 | advanced 21 | api 22 | 23 | 24 | ================== 25 | Indices and tables 26 | ================== 27 | 28 | * :ref:`genindex` 29 | * :ref:`modindex` 30 | * :ref:`search` 31 | 32 | -------------------------------------------------------------------------------- /docs/nodes.rst: -------------------------------------------------------------------------------- 1 | ========= 2 | DOM Nodes 3 | ========= 4 | 5 | *xml4h* provides node objects and convenience methods that make it easier to 6 | work with an in-memory XML document object model (DOM). 7 | 8 | This section of the document covers the main features of *xml4h* nodes. 9 | For the full API-level documentation see :ref:`api-nodes`. 10 | 11 | .. _node-traversal: 12 | 13 | Traversing Nodes 14 | ---------------- 15 | 16 | *xml4h* aims to provide a simple and intuitive API for traversing and 17 | manipulating the XML DOM. To that end it includes a number of convenience 18 | methods for performing common tasks: 19 | 20 | - Get the :class:`~xml4h.nodes.Document` or root :class:`~xml4h.nodes.Element` 21 | from any node via the ``document`` and ``root`` attributes respectively. 
22 | - You can get the ``name`` attribute of nodes that have a name, or look up 23 | the different name components with ``prefix`` to get the namespace prefix 24 | (if any) and ``local_name`` to get the name portion without the prefix. 25 | - Nodes that have a value expose it via the ``value`` attribute. 26 | - A node's ``parent`` attribute returns its parent, while the ``ancestors`` 27 | attribute returns a list containing its parent, grand-parent, 28 | great-grand-parent etc. 29 | - A node's ``children`` attribute returns the child nodes that belong to it, 30 | while the ``siblings`` attribute returns all other nodes that belong to its 31 | parent. You can also get the ``siblings_before`` or ``siblings_after`` the 32 | current node. 33 | - Look up a node's namespace URI with ``namespace_uri`` or the alias 34 | ``ns_uri``. 35 | - Check what type of :class:`~xml4h.nodes.Node` you have with Boolean 36 | attributes like ``is_element``, ``is_text``, ``is_entity`` etc. 37 | 38 | 39 | .. _magical-node-traversal: 40 | 41 | "Magical" Node Traversal 42 | ------------------------ 43 | 44 | To make it easy to traverse XML documents with a known structure *xml4h* 45 | performs some minor magic when you look up attributes or keys on Document 46 | and Element nodes. If you like, you can take advantage of magical traversal 47 | to avoid peppering your code with ``find`` and ``xpath`` searches, or with 48 | ``child`` and ``children`` node attribute lookups. 49 | 50 | The principle is simple: 51 | 52 | - Child elements are available as Python attributes of the parent element 53 | class. 54 | - XML element attributes are available as a Python dict in the owning element. 
55 | 56 | Here is an example of retrieving information from our Monty Python films 57 | document using element names as Python attributes (``MontyPythonFilms``, 58 | ``Film``, ``Title``) and XML attribute names as Python keys (``year``):: 59 | 60 | >>> # Parse an example XML document about Monty Python films 61 | >>> import xml4h 62 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 63 | 64 | >>> for film in doc.MontyPythonFilms.Film: 65 | ... print(film['year'] + ' : ' + film.Title.text) # doctest:+ELLIPSIS 66 | 1971 : And Now for Something Completely Different 67 | 1974 : Monty Python and the Holy Grail 68 | ... 69 | 70 | Python class attribute lookups of child elements work very well when your XML 71 | document contains only camel-case tag names ``LikeThisOne`` or ``LikeThat``. 72 | However, if your document contains lower-case tag names there is a chance the 73 | element names will clash with existing Python attribute or method names in the 74 | *xml4h* classes. 75 | 76 | To work around this potential issue you can add an underscore (``_``) 77 | character at the end of a magical attribute lookup to avoid the naming clash; 78 | *xml4h* will remove that character before looking for a child element. For 79 | example, to look up a child of the element ``elem1`` which is named ``child``, 80 | the code ``elem1.child_`` will return the child element whereas ``elem1.child`` 81 | would access the :meth:`~xml4h.nodes.Node.child` Node method instead. 82 | 83 | .. note:: 84 | Not all XML child element tag names are accessible using magical traversal. 85 | Names with leading underscore characters will not work, and nor will names 86 | containing hyphens because they are not valid Python attribute names. If you 87 | have to deal with XML names like this use the full API methods like 88 | :meth:`~xml4h.nodes.Node.child` and :meth:`~xml4h.nodes.Node.children` 89 | instead. 
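The lookup rules above can be sketched with Python's ``__getattr__`` and ``__getitem__`` hooks. This is a simplified illustration under stated assumptions, not the actual *xml4h* code: ``MagicElement`` is a hypothetical wrapper around a standard library *ElementTree* element, and it ignores namespaces and repeated child elements.

```python
import xml.etree.ElementTree as ET


class MagicElement:
    """Toy wrapper (hypothetical; not xml4h's Element class)."""

    def __init__(self, node):
        self._node = node

    @property
    def text(self):
        return self._node.text

    def _first_child(self, tag):
        # Return the first direct child with the given tag, or None.
        return next((c for c in self._node if c.tag == tag), None)

    def child(self, name):
        # Explicit lookup: works for any tag name, including hyphenated ones.
        found = self._first_child(name)
        return MagicElement(found) if found is not None else None

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails, so real
        # attributes and methods such as child() always take precedence.
        if name.startswith('_'):
            raise AttributeError(name)
        # Strip one trailing underscore used to dodge a naming clash.
        tag = name[:-1] if name.endswith('_') else name
        found = self._first_child(tag)
        if found is None:
            raise AttributeError(name)
        return MagicElement(found)

    def __getitem__(self, key):
        # XML attributes are exposed dict-style.
        return self._node.attrib[key]


film = MagicElement(ET.fromstring(
    '<Film year="1979"><Title>Life of Brian</Title>'
    '<child>clashing name</child><film-id>3</film-id></Film>'))

print(film.Title.text)             # child element via attribute lookup
print(film['year'])                # XML attribute via key lookup
print(film.child_.text)            # trailing underscore avoids child()
print(film.child('film-id').text)  # hyphenated names need the explicit API
```

The sketch also shows why the naming clash arises: ``film.child`` resolves to the ordinary method before ``__getattr__`` is ever consulted, so only ``film.child_`` reaches the child element lookup.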
90 | 91 | All the gory details about how magical traversal works are documented at 92 | :class:`~xml4h.nodes.NodeAttrAndChildElementLookupsMixin`. Depending on how 93 | you feel about magical behaviour this feature might feel like a great 94 | convenience, or black magic that makes you wary. The right attitude probably 95 | lies somewhere in the middle... 96 | 97 | .. warning:: 98 | The behaviour of namespaced XML elements and attributes is inconsistent. 99 | You can do magical traversal of elements regardless of what namespace the 100 | elements are in, but to look up XML attributes with a namespace prefix 101 | you must include that prefix in the name e.g. ``prefix:attribute-name``. 102 | 103 | 104 | Searching with Find and XPath 105 | ----------------------------- 106 | 107 | There are two ways to search for elements within an *xml4h* document: ``find`` 108 | and ``xpath``. 109 | 110 | The find methods provided by the library are easy to use but can only perform 111 | relatively simple searches that return :class:`~xml4h.nodes.Element` results, 112 | whereas you need to be familiar with XPath query syntax to search effectively 113 | with the ``xpath`` method but you can perform more complex searches and get 114 | results other than just elements. 115 | 116 | Find Methods 117 | ............ 118 | 119 | *xml4h* provides three different find methods: 120 | 121 | - :meth:`~xml4h.nodes.Node.find` searches descendants of the current node for 122 | elements matching the given constraints. You can search by element name, 123 | by namespace URI, or with no constraints at all:: 124 | 125 | >>> # Find ALL elements in the document 126 | >>> elems = doc.find() 127 | >>> [e.name for e in elems] # doctest:+ELLIPSIS 128 | ['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',... 
129 | 130 | >>> # Find the seven <Film> elements in the XML document 131 | >>> film_elems = doc.find('Film') 132 | >>> [e.Title.text for e in film_elems] # doctest:+ELLIPSIS 133 | ['And Now for Something Completely Different', 'Monty Python and the Holy Grail',... 134 | 135 | Note that the :meth:`~xml4h.nodes.Node.find` method only finds descendants 136 | of the node you run it on:: 137 | 138 | >>> # Find <Title> elements in a single <Film> element; there's only one 139 | >>> film_elem = doc.find('Film', first_only=True) 140 | >>> film_elem.find('Title') 141 | [<xml4h.nodes.Element: "Title">] 142 | 143 | - :meth:`~xml4h.nodes.Node.find_first` searches descendants of the current 144 | node but only returns the first result element, not a list. If there are no 145 | matching element results this method returns *None*:: 146 | 147 | >>> # Find the first <Film> element in the document 148 | >>> doc.find_first('Film') 149 | <xml4h.nodes.Element: "Film"> 150 | 151 | >>> # Search for an element that does not exist 152 | >>> print(doc.find_first('OopsWrongName')) 153 | None 154 | 155 | If you were paying attention you may have noticed in the example above that 156 | you can make the :meth:`~xml4h.nodes.Node.find` method do exactly the same thing 157 | as :meth:`~xml4h.nodes.Node.find_first` by passing the keyword argument 158 | ``first_only=True``. 159 | 160 | - :meth:`~xml4h.nodes.Node.find_doc` is a convenience method that searches the 161 | entire document no matter which node you run it on:: 162 | 163 | >>> # Normal find only searches descendants of the current node 164 | >>> len(film_elem.find('Title')) 165 | 1 166 | 167 | >>> # find_doc searches the entire document 168 | >>> len(film_elem.find_doc('Title')) 169 | 7 170 | 171 | This method is exactly like calling ``xml4h_node.document.find()``, which is 172 | actually what happens behind the scenes. 173 | 174 | XPath Querying 175 | ..............
176 | 177 | *xml4h* provides a single XPath search method which is available on 178 | :class:`~xml4h.nodes.Document` and :class:`~xml4h.nodes.Element` nodes: 179 | 180 | :meth:`~xml4h.nodes.XPathMixin.xpath` takes an XPath query string and returns 181 | the result which may be a list of elements, a list of attributes, a list of 182 | values, or a single value. The result depends entirely on the kind of query you 183 | perform. 184 | 185 | .. note:: 186 | XPath querying is currently only available if you use the *lxml* or 187 | *ElementTree* implementation libraries. You can check whether the XPath 188 | feature is available with :meth:`~xml4h.nodes.Node.has_feature`. 189 | 190 | .. note:: 191 | Although *ElementTree* supports XPath queries, this support is 192 | `very limited <http://effbot.org/zone/element-xpath.htm>`_ and most of the 193 | example XPath queries below **will not work**. If you want to use XPath, you 194 | should install *lxml* for better support. 195 | 196 | XPath queries are powerful and complex so we cannot describe them in detail 197 | here, but we can at least present some useful examples. Here are queries that 198 | perform the same work as the find queries we saw above:: 199 | 200 | >>> # Query for ALL elements in the document 201 | >>> elems = doc.xpath('//*') # doctest:+ELLIPSIS 202 | >>> [e.name for e in elems] # doctest:+ELLIPSIS 203 | ['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',... 204 | 205 | >>> # Query for the seven <Film> elements in the XML document 206 | >>> film_elems = doc.xpath('//Film') 207 | >>> [e.Title.text for e in film_elems] # doctest:+ELLIPSIS 208 | ['And Now for Something Completely Different', 'Monty Python and the Holy Grail',... 
209 | 210 | >>> # Query for the first <Film> element in the document (returns list) 211 | >>> doc.xpath('//Film[1]') 212 | [<xml4h.nodes.Element: "Film">] 213 | 214 | >>> # Query for <Title> elements in a single <Film> element; there's only one 215 | >>> film_elem = doc.xpath('Film[1]')[0] 216 | >>> film_elem.xpath('Title') 217 | [<xml4h.nodes.Element: "Title">] 218 | 219 | You can also do things with XPath queries that you simply cannot with the 220 | *find* method, such as find all the attributes of a certain name or apply 221 | rich constraints to the query:: 222 | 223 | >>> # Query for all year attributes 224 | >>> doc.xpath('//@year') 225 | ['1971', '1974', '1979', '1982', '1983', '2009', '2012'] 226 | 227 | >>> # Query for the title of the film released in 1982 228 | >>> doc.xpath('//Film[@year="1982"]/Title/text()') 229 | ['Monty Python Live at the Hollywood Bowl'] 230 | 231 | 232 | Namespaces and XPath 233 | .................... 234 | 235 | Finally, let's discuss how you can run XPath queries on documents with 236 | namespaces, because unfortunately this is not a simple subject. 237 | 238 | First, you need to understand that if you are working with a namespaced 239 | document your XPath queries must refer to those namespaces or they will not 240 | find anything:: 241 | 242 | >>> # Parse a namespaced version of the Monty Python Films doc 243 | >>> ns_doc = xml4h.parse('tests/data/monty_python_films.ns.xml') 244 | >>> print(ns_doc.xml()) #doctest:+ELLIPSIS 245 | <?xml version="1.0" encoding="utf-8"?> 246 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python" xmlns="uri:monty-python" xmlns:work="uri:artistic-work"> 247 | <work:Film year="1971"> 248 | <Title>And Now for Something Completely Different 249 | ... 250 | 251 | >>> # XPath queries without prefixes won't find namespaced elements 252 | >>> ns_doc.xpath('//Film') 253 | [] 254 | 255 | To refer to namespaced nodes in your query the namespace must have a prefix 256 | alias assigned to it. 
You can specify prefixes when you call the *xpath* method 257 | by providing a ``namespaces`` keyword argument with a dictionary of 258 | alias-to-URI mappings:: 259 | 260 | >>> # Specify explicit prefix alias mappings 261 | >>> films = ns_doc.xpath('//x:Film', namespaces={'x': 'uri:artistic-work'}) 262 | >>> len(films) 263 | 7 264 | 265 | Or, preferably, if your document node already has prefix mappings you can use 266 | them directly:: 267 | 268 | >>> # Our root node already has a 'work' prefix defined... 269 | >>> ns_doc.root['xmlns:work'] 270 | 'uri:artistic-work' 271 | 272 | >>> # ...so we can use this prefix directly 273 | >>> films = ns_doc.root.xpath('//work:Film') 274 | >>> len(films) 275 | 7 276 | 277 | Another gotcha is when a document has a default namespace. The default 278 | namespace applies to every descendent node without its own namespace, but XPath 279 | doesn't have a good way of dealing with this since there is no such thing as 280 | a "default namespace" prefix alias. 281 | 282 | *xml4h* helps out by providing just such an alias: the underscore (``_``):: 283 | 284 | >>> # Our document root has a default namespace 285 | >>> ns_doc.root.ns_uri 286 | 'uri:monty-python' 287 | 288 | >>> # You need a prefix alias that refers to the default namespace 289 | >>> ns_doc.xpath('//Title') 290 | [] 291 | 292 | >>> # You could specify it explicitly... 293 | >>> titles = ns_doc.xpath('//x:Title', 294 | ... namespaces={'x': ns_doc.root.ns_uri}) 295 | >>> len(titles) 296 | 7 297 | 298 | >>> # ...or use xml4h's special default namespace prefix: _ 299 | >>> titles = ns_doc.xpath('//_:Title') 300 | >>> len(titles) 301 | 7 302 | 303 | 304 | Filtering Node Lists 305 | -------------------- 306 | 307 | Many *xml4h* node attributes return a list of nodes as a 308 | :class:`~xml4h.nodes.NodeList` object which confers some special filtering 309 | powers. 
You get this special node list object from attributes like 310 | ``children``, ``ancestors``, and ``siblings``, and from the ``find`` search 311 | method if it has element results. 312 | 313 | Here are some examples of how you can easily filter a 314 | :class:`~xml4h.nodes.NodeList` to get just the 315 | nodes you need: 316 | 317 | - Get the first child node using the ``filter`` method:: 318 | 319 | >>> # Filter to get just the first child 320 | >>> doc.root.children.filter(first_only=True) 321 | <xml4h.nodes.Element: "Film"> 322 | 323 | >>> # The document has 7 element children of the root 324 | >>> len(doc.root.children) 325 | 7 326 | 327 | - Get the first child node by treating ``children`` as a callable:: 328 | 329 | >>> doc.root.children(first_only=True) 330 | <xml4h.nodes.Element: "Film"> 331 | 332 | When you treat the node list as a callable it calls the ``filter`` method 333 | behind the scenes, but since doing it the callable way is quicker and 334 | clearer in code we will use that approach from now on. 335 | 336 | - Get the first child node with the ``child`` filtering method, which accepts 337 | the same constraints as the ``filter`` method:: 338 | 339 | >>> doc.root.child() 340 | <xml4h.nodes.Element: "Film"> 341 | 342 | >>> # Apply filtering with child 343 | >>> print(doc.root.child('WrongName')) 344 | None 345 | 346 | - Get the first of a set of children with the ``first`` attribute:: 347 | 348 | >>> doc.root.children.first 349 | <xml4h.nodes.Element: "Film"> 350 | 351 | 352 | - Filter the node list by name:: 353 | 354 | >>> for n in doc.root.children('Film'): 355 | ... print(n.Title.text) 356 | And Now for Something Completely Different 357 | Monty Python and the Holy Grail 358 | Monty Python's Life of Brian 359 | Monty Python Live at the Hollywood Bowl 360 | Monty Python's The Meaning of Life 361 | Monty Python: Almost the Truth (The Lawyer's Cut) 362 | A Liar's Autobiography: Volume IV 363 | 364 | >>> len(doc.root.children('WrongName')) 365 | 0 366 | 367 | .. note:: 368 | Passing a node name as the first argument will match the *local* name of 369 | a node.
You can match the full node name, which might include a prefix 370 | for example, with a call like: ``.children(name='SomeName')``. 371 | 372 | - Filter with a custom function:: 373 | 374 | >>> # Filter to films released in the year 1979 375 | >>> for n in doc.root.children('Film', 376 | ... filter_fn=lambda node: node.attributes['year'] == '1979'): 377 | ... print(n.Title.text) 378 | Monty Python's Life of Brian 379 | 380 | 381 | Manipulating Nodes and Elements 382 | ------------------------------- 383 | 384 | *xml4h* provides simple methods to manipulate the structure and content of an 385 | XML DOM. The methods available depend on the kind of node you are interacting 386 | with, and by far the majority are for working with 387 | :class:`~xml4h.nodes.Element` nodes. 388 | 389 | 390 | Delete a Node 391 | ............. 392 | 393 | Any node can be removed from its owner document with 394 | :meth:`~xml4h.nodes.Node.delete`:: 395 | 396 | >>> # Before deleting a Film element there are 7 films 397 | >>> len(doc.MontyPythonFilms.Film) 398 | 7 399 | 400 | >>> doc.MontyPythonFilms.children('Film')[-1].delete() 401 | >>> len(doc.MontyPythonFilms.Film) 402 | 6 403 | 404 | .. note:: 405 | By default deleting a node also destroys it, but it can optionally be left 406 | intact after removal from the document by including the ``destroy=False`` 407 | option. 408 | 409 | Name and Value Attributes 410 | ......................... 411 | 412 | Many nodes have low-level name and value properties that can be read from and 413 | written to. Nodes with names and values include Text, CDATA, Comment, 414 | ProcessingInstruction, Attribute, and Element nodes.
415 | 416 | Here is an example of accessing the low-level name and value properties of a 417 | Text node:: 418 | 419 | >>> text_node = doc.MontyPythonFilms.child('Film').child('Title').child() 420 | >>> text_node.is_text 421 | True 422 | 423 | >>> text_node.name 424 | '#text' 425 | >>> text_node.value 426 | 'And Now for Something Completely Different' 427 | 428 | And here is the same for an Attribute node:: 429 | 430 | >>> # Access the name/value properties of an Attribute node 431 | >>> year_attr = doc.MontyPythonFilms.child('Film').attribute_node('year') 432 | >>> year_attr.is_attribute 433 | True 434 | 435 | >>> year_attr.name 436 | 'year' 437 | >>> year_attr.value 438 | '1971' 439 | 440 | The name attribute of a node is not necessarily a plain string. In the case of 441 | nodes within a defined namespace, the ``name`` attribute may comprise two 442 | components: a ``prefix`` that represents the namespace, and a ``local_name`` 443 | which is the plain name of the node ignoring the namespace. For more 444 | information on namespaces see :ref:`xml4h-namespaces`. 445 | 446 | Import a Node and its Descendants 447 | ................................. 448 | 449 | In addition to manipulating nodes in a single XML document directly, you can 450 | also import a node (and all its descendants) from another document using a node 451 | clone or transplant operation. 452 | 453 | There are two ways to import a node and its descendants: 454 | 455 | - Use the :meth:`~xml4h.nodes.Node.clone_node` Node method or 456 | :meth:`~xml4h.builder.Builder.clone` Builder method to copy a node into your 457 | document without removing it from its original document. 458 | - Use the :meth:`~xml4h.nodes.Node.transplant_node` Node method or 459 | :meth:`~xml4h.builder.Builder.transplant` Builder method to transplant a node 460 | into your document and remove it from its original document.
461 | 462 | Here is an example of transplanting a node into a document (which also happens 463 | to undo the damage we did to our example DOM in the ``delete()`` example 464 | above):: 465 | 466 | >>> # Build a new document containing a Film element 467 | >>> film_builder = (xml4h.build('DeletedFilm') 468 | ... .element('Film').attrs(year='1971') 469 | ... .element('Title') 470 | ... .text('And Now for Something Completely Different').up() 471 | ... .element('Description').text( 472 | ... "A collection of sketches from the first and second TV" 473 | ... " series of Monty Python's Flying Circus purposely" 474 | ... " re-enacted and shot for film.") 475 | ... ) 476 | 477 | >>> # Transplant the Film element from the new document 478 | >>> node_to_transplant = film_builder.root.child('Film') 479 | >>> doc.MontyPythonFilms.transplant_node(node_to_transplant) 480 | >>> len(doc.MontyPythonFilms.Film) 481 | 7 482 | 483 | When you transplant a node from another document it is removed from that 484 | document:: 485 | 486 | >>> # After transplanting the Film node it is no longer in the original doc 487 | >>> len(film_builder.root.find('Film')) 488 | 0 489 | 490 | If you need to leave the original document unchanged when importing a node use 491 | the clone methods instead. 492 | 493 | Working with Elements 494 | ..................... 495 | 496 | Element nodes have the most methods to access and manipulate their content, 497 | which is fitting since this is the most useful type of node and you will deal 498 | with elements regularly. 499 | 500 | The leaf elements in XML documents often have one or more 501 | :class:`~xml4h.nodes.Text` node children that contain the element's data 502 | content. 
While you could iterate over such text nodes as child nodes, *xml4h* 503 | provides the more convenient text accessors you would expect:: 504 | 505 | >>> title_elem = doc.MontyPythonFilms.Film[0].Title 506 | >>> orig_title = title_elem.text 507 | >>> orig_title 508 | 'And Now for Something Completely Different' 509 | 510 | >>> title_elem.text = 'A new, and wrong, title' 511 | >>> title_elem.text 512 | 'A new, and wrong, title' 513 | 514 | >>> # Let's put it back the way it was... 515 | >>> title_elem.text = orig_title 516 | 517 | Elements also have attributes that can be manipulated in a number of ways. 518 | 519 | Look up an element's attributes with: 520 | 521 | - the :meth:`~xml4h.nodes.Element.attributes` attribute (or aliases ``attrib`` 522 | and ``attrs``) that return an ordered dictionary of attribute names and 523 | values:: 524 | 525 | >>> film_elem = doc.MontyPythonFilms.Film[0] 526 | >>> film_elem.attributes 527 | 528 | 529 | - or by obtaining an element's attributes as :class:`~xml4h.nodes.Attribute` 530 | nodes, though that is only likely to be useful in unusual circumstances:: 531 | 532 | >>> film_elem.attribute_nodes 533 | [] 534 | 535 | >>> # Get a specific attribute node by name or namespace URI 536 | >>> film_elem.attribute_node('year') 537 | 538 | 539 | - and there's also the "magical" keyword lookup technique discussed in 540 | :ref:`magical-node-traversal` for quickly grabbing attribute values. 541 | 542 | Set attribute values with: 543 | 544 | - the :meth:`~xml4h.nodes.Element.set_attributes` method, which allows you to 545 | add attributes without replacing existing ones. 
This method also supports 546 | defining XML attributes as a dictionary, list of name/value pairs, or 547 | keyword arguments:: 548 | 549 | >>> # Set/add attributes as a dictionary 550 | >>> film_elem.set_attributes({'a1': 'v1'}) 551 | 552 | >>> # Set/add attributes as a list of name/value pairs 553 | >>> film_elem.set_attributes([('a2', 'v2')]) 554 | 555 | >>> # Set/add attributes as keyword arguments 556 | >>> film_elem.set_attributes(a3='v3', a4=4) 557 | 558 | >>> film_elem.attributes 559 | 560 | 561 | - the setter version of the :attr:`~xml4h.nodes.Element.attributes` attribute, 562 | which replaces any existing attributes with the new set:: 563 | 564 | >>> film_elem.attributes = {'year': '1971', 'note': 'funny'} 565 | >>> film_elem.attributes 566 | 567 | 568 | Delete attributes from an element by: 569 | 570 | - using Python's delete-in-dict technique:: 571 | 572 | >>> del(film_elem.attributes['note']) 573 | >>> film_elem.attributes 574 | 575 | 576 | - or by calling the ``delete()`` method on an :class:`~xml4h.nodes.Attribute` 577 | node. 578 | 579 | Finally, the :class:`~xml4h.nodes.Element` class provides a number of methods 580 | for programmatically adding child nodes, for cases where you would rather work 581 | directly with nodes instead of using a :ref:`builder`. 582 | 583 | The most complex of these methods is :meth:`~xml4h.nodes.Element.add_element`, 584 | which allows you to add a named child element and, optionally, to set the new 585 | element's namespace, text content, and attributes all at the same time. Let's 586 | try an example:: 587 | 588 | >>> # Add a Film element with an attribute 589 | >>> new_film_elem = doc.MontyPythonFilms.add_element( 590 | ... 'Film', attributes={'year': 'never'}) 591 | 592 | >>> # Add a Description element with text content 593 | >>> desc_elem = new_film_elem.add_element( 594 | ...
'Description', text='Just testing...') 595 | 596 | >>> # Add a Title element with text *before* the description element 597 | >>> title_elem = desc_elem.add_element( 598 | ... 'Title', text='The Film that Never Was', before_this_element=True) 599 | 600 | >>> print(doc.MontyPythonFilms.Film[-1].xml()) 601 | 602 | The Film that Never Was 603 | Just testing... 604 | 605 | 606 | There are similar methods for handling simpler cases like adding text nodes, 607 | comments etc. Here is an example of adding text nodes:: 608 | 609 | >>> # Add a text node 610 | >>> title_elem = doc.MontyPythonFilms.Film[-1].Title 611 | >>> title_elem.add_text(', and Never Will Be') 612 | 613 | >>> title_elem.text 614 | 'The Film that Never Was, and Never Will Be' 615 | 616 | Refer to the :class:`~xml4h.nodes.Element` documentation for more information 617 | about the other methods for adding nodes. 618 | 619 | 620 | .. _wrap-unwrap-nodes: 621 | 622 | Wrapping and Unwrapping *xml4h* Nodes 623 | ------------------------------------- 624 | 625 | You can easily convert to or from *xml4h*'s wrapped version of an 626 | implementation node. For example, if you prefer the *lxml* library's 627 | `ElementMaker `_ document builder 628 | approach to the :ref:`xml4h Builder `, you can create a document 629 | in *lxml*... 630 | 631 | :: 632 | 633 | >>> from lxml.builder import ElementMaker 634 | >>> E = ElementMaker() 635 | >>> lxml_doc = E.DocRoot( 636 | ... E.Item( 637 | ... E.Name('Item 1'), 638 | ... E.Value('Value 1') 639 | ... ), 640 | ... E.Item( 641 | ... E.Name('Item 2'), 642 | ... E.Value('Value 2') 643 | ... ) 644 | ... 
) 645 | >>> lxml_doc # doctest:+ELLIPSIS 646 | >> # Convert lxml Document to xml4h version 652 | >>> xml4h_doc = xml4h.LXMLAdapter.wrap_document(lxml_doc) 653 | >>> xml4h_doc.children 654 | [, ] 655 | 656 | >>> # Get an element within the lxml document 657 | >>> lxml_elem = list(lxml_doc)[0] 658 | >>> lxml_elem # doctest:+ELLIPSIS 659 | >> # Convert lxml Element to xml4h version 662 | >>> xml4h_elem = xml4h.LXMLAdapter.wrap_node(lxml_elem, lxml_doc) 663 | >>> xml4h_elem # doctest:+ELLIPSIS 664 | 665 | 666 | You can reach the underlying XML implementation document or node at any time 667 | from an *xml4h* node:: 668 | 669 | >>> # Get an xml4h node's underlying implementation node 670 | >>> xml4h_elem.impl_node # doctest:+ELLIPSIS 671 | >> xml4h_elem.impl_node == lxml_elem 673 | True 674 | 675 | >>> # Get the underlying implementatation document from any node 676 | >>> xml4h_elem.impl_document # doctest:+ELLIPSIS 677 | >> xml4h_elem.impl_document == lxml_doc 679 | True 680 | 681 | -------------------------------------------------------------------------------- /docs/parser.rst: -------------------------------------------------------------------------------- 1 | ====== 2 | Parser 3 | ====== 4 | 5 | The *xml4h* parser is a simple wrapper around the parser provided by an 6 | underlying :ref:`XML library implementation `. 7 | 8 | .. 
_parser-parse: 9 | 10 | Parse function 11 | -------------- 12 | 13 | To parse XML documents with *xml4h* you feed the :func:`xml4h.parse` function 14 | an XML text document in one of three forms: 15 | 16 | - A file-like object:: 17 | 18 | >>> import xml4h 19 | 20 | >>> xml_file = open('tests/data/monty_python_films.xml', 'rb') 21 | >>> doc = xml4h.parse(xml_file) 22 | 23 | >>> doc.MontyPythonFilms 24 | 25 | 26 | - A file path string:: 27 | 28 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 29 | 30 | >>> doc.root['source'] 31 | 'http://en.wikipedia.org/wiki/Monty_Python' 32 | 33 | - A string containing literal XML content:: 34 | 35 | >>> xml_file = open('tests/data/monty_python_films.xml', 'rb') 36 | >>> xml_text = xml_file.read() 37 | >>> doc = xml4h.parse(xml_text) 38 | 39 | >>> len(doc.find('Film')) 40 | 7 41 | 42 | .. note:: The :func:`~xml4h.parse` method distinguishes between a file path 43 | string and an XML text string by looking for a ``<`` character 44 | in the value. 45 | 46 | 47 | Stripping of Whitespace Nodes 48 | ----------------------------- 49 | 50 | By default the *parse* method ignores whitespace nodes in the XML document 51 | -- or more accurately, it does extra work to remove these nodes after the 52 | document has been parsed by the underlying XML library. 53 | 54 | Whitespace nodes are rarely interesting, since they are usually the result of 55 | XML content that has been serialized with extra whitespace to make it more 56 | readable to humans. 
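To make the idea of stripping whitespace nodes after parsing concrete, here is a minimal sketch written against the standard library's ``xml.dom.minidom`` rather than *xml4h* itself. The helper name ``strip_whitespace_text_nodes`` and the sample XML are hypothetical, and this is not necessarily how *xml4h* implements the feature internally.

```python
from xml.dom import minidom


def strip_whitespace_text_nodes(node):
    """Recursively remove text nodes that contain only whitespace.

    Hypothetical helper for illustration; not part of xml4h's API.
    """
    # Iterate over a copy because childNodes is mutated while looping
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_text_nodes(child)


# Sample document serialized with indentation, as a human would write it
xml_text = "<Films>\n  <Film>\n    <Title>Holy Grail</Title>\n  </Film>\n</Films>"
dom = minidom.parseString(xml_text)

# Count the root's children before and after stripping
children_before = len(dom.documentElement.childNodes)
strip_whitespace_text_nodes(dom)
children_after = len(dom.documentElement.childNodes)
print(children_before, children_after)
```

Text nodes with real content, such as the film title, are left alone. Only nodes that are empty once stripped of whitespace are removed.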
57 | 58 | However if you need to keep these nodes, or if you want to avoid the extra 59 | processing overhead when parsing large documents, you can disable this 60 | feature by passing in the ``ignore_whitespace_text_nodes=False`` flag:: 61 | 62 | >>> # Strip whitespace nodes from document 63 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 64 | 65 | >>> # No excess text nodes (XML doc lists 7 films) 66 | >>> len(doc.MontyPythonFilms.children) 67 | 7 68 | >>> doc.MontyPythonFilms.children[0] 69 | 70 | 71 | 72 | >>> # Don't strip whitespace nodes 73 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml', 74 | ... ignore_whitespace_text_nodes=False) 75 | 76 | >>> # An extra text node is present 77 | >>> len(doc.MontyPythonFilms.children) 78 | 8 79 | >>> doc.MontyPythonFilms.children[0] 80 | 81 | -------------------------------------------------------------------------------- /docs/writer.rst: -------------------------------------------------------------------------------- 1 | ====== 2 | Writer 3 | ====== 4 | 5 | The *xml4h* writer produces serialized XML text documents formatted more 6 | traditionally – and in our opinion more correctly – than the other Python XML 7 | libraries. 8 | 9 | .. _writer-write-methods: 10 | 11 | Write methods 12 | ------------- 13 | 14 | To write out an XML document with *xml4h* you will generally use the 15 | :meth:`~xml4h.nodes.Node.write` or :meth:`~xml4h.nodes.Node.write_doc` methods 16 | available on any *xml4h* node. 17 | 18 | The writer methods require a file or any IO stream object as the first 19 | argument, and will automatically handle text or binary IO streams. 
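As a rough illustration of what handling text or binary IO streams can involve, the sketch below picks between writing ``str`` and writing encoded bytes depending on the stream type. The ``write_xml`` helper is hypothetical and uses only the standard library. It shows the general technique and is not *xml4h*'s actual implementation.

```python
import io


def write_xml(xml_text, stream, encoding="utf-8"):
    """Write XML text to a text or binary stream (hypothetical helper)."""
    if isinstance(stream, io.TextIOBase):
        # Text streams accept str directly
        stream.write(xml_text)
    else:
        # Binary streams need encoded bytes
        stream.write(xml_text.encode(encoding))


xml_text = u"<DocRoot><Elem1>T\u00e9st</Elem1></DocRoot>"

# The same call works for both kinds of stream
text_out = io.StringIO()
bytes_out = io.BytesIO()
write_xml(xml_text, text_out)
write_xml(xml_text, bytes_out)
```

With *xml4h* you do not need such a helper. You pass either kind of stream straight to the writer methods described here.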
20 | 21 | The :meth:`~xml4h.nodes.Node.write` method outputs the current node and any 22 | descendants:: 23 | 24 | >>> import xml4h 25 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml') 26 | >>> first_film_elem = doc.find('Film')[0] 27 | 28 | >>> # Write XML node to stdout 29 | >>> import sys 30 | >>> first_film_elem.write(sys.stdout, indent=True) # doctest:+ELLIPSIS 31 | 32 | And Now for Something Completely Different 33 | A collection of sketches from the first and second... 34 | 35 | 36 | The :meth:`~xml4h.nodes.Node.write_doc` method outputs the entire document no 37 | matter which node you call it on:: 38 | 39 | >>> first_film_elem.write_doc(sys.stdout, indent=True) # doctest:+ELLIPSIS 40 | 41 | 42 | 43 | And Now for Something Completely Different 44 | A collection of sketches from the first and second... 45 | 46 | ... 47 | 48 | To send output to a file:: 49 | 50 | >>> # Write to a file 51 | >>> with open('/tmp/example.xml', 'wb') as f: 52 | ... first_film_elem.write_doc(f) 53 | 54 | .. _writer-xml-methods: 55 | 56 | Get XML as a string 57 | ------------------- 58 | 59 | Because you will often want to generate a string of XML content directly, 60 | *xml4h* includes the convenience methods :meth:`~xml4h.nodes.Node.xml` 61 | and :meth:`~xml4h.nodes.Node.xml_doc` to do this easily. 62 | 63 | The :meth:`~xml4h.nodes.Node.xml` method works like the *write* method and 64 | will return a string of XML content including the current node and its 65 | descendants:: 66 | 67 | >>> print(first_film_elem.xml()) # doctest:+ELLIPSIS 68 | 69 | And Now for Something Completely... 
70 | 71 | The :meth:`~xml4h.nodes.Node.xml_doc` method works like the *write_doc* 72 | method and returns a string for the whole document:: 73 | 74 | >>> print(first_film_elem.xml_doc()) # doctest:+ELLIPSIS 75 | <?xml version="1.0" encoding="utf-8"?> 76 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python"> 77 | <Film year="1971"> 78 | <Title>And Now for Something Completely Different</Title> 79 | <Description>A collection of sketches from the first and second... 80 | </Film> 81 | ... 82 | 83 | .. note:: 84 | *xml4h* assumes that when you directly generate an XML string with these 85 | methods it is intended for human consumption, so it applies pretty-print 86 | formatting by default. 87 | 88 | 89 | .. _writer-formatting: 90 | 91 | Format Output 92 | ------------- 93 | 94 | The *write* and *xml* methods accept a range of formatting options to control 95 | how XML content is serialized. These are useful if you expect a human to read 96 | the resulting data. 97 | 98 | For the full range of formatting options see the code documentation for 99 | :meth:`~xml4h.nodes.Node.write` and :meth:`~xml4h.nodes.Node.xml` et al., 100 | but here are some pointers to get you started: 101 | 102 | - Set ``indent=True`` to write a pretty-printed XML document with four space 103 | characters for indentation and ``\n`` for newlines. 104 | - To use a tab character for indenting and ``\r\n`` for newlines: 105 | ``indent='\t', newline='\r\n'``. 106 | - *xml4h* writes *utf-8*-encoded documents by default; to write with a 107 | different encoding: ``encoding='iso-8859-1'``. 108 | - To avoid outputting the XML declaration when writing a document: 109 | ``omit_declaration=True``.
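For comparison, the standard library's ``xml.dom.minidom`` exposes similar formatting controls through ``toprettyxml``. The sketch below uses only minidom calls and a made-up sample document. It is a point of reference for what each option does, not a demonstration of *xml4h*'s own API, whose option names are the ones listed above.

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<Films><Film year='1971'><Title>Grail</Title></Film></Films>")

# Tab indentation, Windows newlines, and a non-default encoding.
# Passing an encoding makes minidom return bytes rather than str.
xml_bytes = dom.toprettyxml(indent="\t", newl="\r\n", encoding="iso-8859-1")

# Serializing the root element instead of the document leaves out the
# XML declaration, roughly the effect of omitting the declaration.
xml_no_decl = dom.documentElement.toprettyxml(indent="    ")

print(xml_bytes.decode("iso-8859-1"))
print(xml_no_decl)
```

Note that minidom calls its newline option ``newl``, whereas the *xml4h* option shown in the list above is ``newline``.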
110 | 111 | 112 | Write using the underlying implementation 113 | ----------------------------------------- 114 | 115 | Because *xml4h* sits on top of an underlying 116 | :ref:`XML library implementation ` you can use that 117 | library's serialization methods if you prefer, and if you don't mind having 118 | some implementation-specific code. 119 | 120 | For example, if you are using *lxml* as the underlying library you can use 121 | its serialisation methods by accessing the implementation node:: 122 | 123 | >>> # Get the implementation root node, in this case an lxml node 124 | >>> lxml_root_node = first_film_elem.root.impl_node 125 | >>> type(lxml_root_node) # doctest:+ELLIPSIS 126 | <... 'lxml.etree._Element'> 127 | 128 | >>> # Use lxml features as normal; xml4h is no longer in the picture 129 | >>> from lxml import etree 130 | >>> xml_bytes = etree.tostring( 131 | ... lxml_root_node, encoding='utf-8', xml_declaration=True, pretty_print=True) 132 | >>> print(xml_bytes.decode('utf-8')) # doctest:+ELLIPSIS 133 | 134 | And Now for Something Completely Different 135 | A collection of sketches from the first and second... 136 | 137 | Monty Python and the Holy Grail 138 | King Arthur and his knights embark on a low-budget... 139 | 140 | ... 141 | 142 | .. note:: 143 | The output from *lxml* is a little quirky, at least on the author's machine. 144 | Note for example the single-quote characters in the XML declaration, and 145 | the missing newline and indent before the first ```` element. 
But 146 | don't worry, that's why you have *xml4h* ;) 147 | -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | # Nose for running tests 2 | six 3 | nose 4 | coverage 5 | tox 6 | sphinx 7 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import xml4h 5 | 6 | try: 7 | from setuptools import setup 8 | except ImportError: 9 | from distutils.core import setup 10 | 11 | setup( 12 | name=xml4h.__title__, 13 | version=xml4h.__version__, 14 | description='XML for Humans in Python', 15 | long_description=open('README.rst').read(), 16 | long_description_content_type='text/x-rst', 17 | author='James Murty', 18 | author_email='james@murty.co', 19 | url='https://github.com/jmurty/xml4h', 20 | packages=[ 21 | 'xml4h', 22 | 'xml4h.impls', 23 | ], 24 | package_dir={'xml4h': 'xml4h'}, 25 | package_data={'': ['README.rst', 'LICENSE']}, 26 | include_package_data=True, 27 | install_requires=[ 28 | 'six', 29 | ], 30 | license='MIT License', 31 | # http://pypi.python.org/pypi?%3Aaction=list_classifiers 32 | classifiers=[ 33 | 'Development Status :: 4 - Beta', 34 | 'Intended Audience :: Developers', 35 | 'Topic :: Text Processing :: Markup :: XML', 36 | 'Natural Language :: English', 37 | 'License :: OSI Approved :: MIT License', 38 | 'Programming Language :: Python', 39 | 'Programming Language :: Python :: 2.7', 40 | 'Programming Language :: Python :: 3.5', 41 | 'Programming Language :: Python :: 3.6', 42 | 'Programming Language :: Python :: 3.7', 43 | 'Programming Language :: Python :: 3.8', 44 | ], 45 | test_suite='tests', 46 | ) 47 | -------------------------------------------------------------------------------- /tests/__init__.py: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmurty/xml4h/83bc0a91afe5d6e17d6c99ec43dc0aec9593cc06/tests/__init__.py -------------------------------------------------------------------------------- /tests/data/example_doc.small.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | -------------------------------------------------------------------------------- /tests/data/example_doc.unicode.xml: -------------------------------------------------------------------------------- 1 | 2 | <جذر xmlns="urn:default" xmlns:důl="urn:custom"> 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | -------------------------------------------------------------------------------- /tests/data/monty_python_films.ns.xml: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | And Now for Something Completely Different 6 | A collection of sketches from the first and second TV series of Monty Python's Flying Circus purposely re-enacted and shot for film. 7 | 8 | 9 | Monty Python and the Holy Grail 10 | King Arthur and his knights embark on a low-budget search for the Holy Grail, encountering humorous obstacles along the way. Some of these turned into standalone sketches. 11 | 12 | 13 | Monty Python's Life of Brian 14 | Brian is born on the first Christmas, in the stable next to Jesus'. He spends his life being mistaken for a messiah. 15 | 16 | 17 | Monty Python Live at the Hollywood Bowl 18 | A videotape recording directed by Ian MacNaughton of a live performance of sketches. Originally intended for a TV/video special. Transferred to 35mm and given a limited theatrical release in the US. 19 | 20 | 21 | Monty Python's The Meaning of Life 22 | An examination of the meaning of life in a series of sketches from conception to death and beyond. 
23 | 24 | 25 | Monty Python: Almost the Truth (The Lawyer's Cut) 26 | This film features interviews with all the surviving Python members, along with archive representation for the late Graham Chapman. 27 | 28 | 29 | A Liar's Autobiography: Volume IV 30 | This is an animated film which is based on the memoir of the late Monty Python member, Graham Chapman. 31 | 32 | 33 | -------------------------------------------------------------------------------- /tests/data/monty_python_films.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | And Now for Something Completely Different 5 | A collection of sketches from the first and second TV series of Monty Python's Flying Circus purposely re-enacted and shot for film. 6 | 7 | 8 | Monty Python and the Holy Grail 9 | King Arthur and his knights embark on a low-budget search for the Holy Grail, encountering humorous obstacles along the way. Some of these turned into standalone sketches. 10 | 11 | 12 | Monty Python's Life of Brian 13 | Brian is born on the first Christmas, in the stable next to Jesus'. He spends his life being mistaken for a messiah. 14 | 15 | 16 | Monty Python Live at the Hollywood Bowl 17 | A videotape recording directed by Ian MacNaughton of a live performance of sketches. Originally intended for a TV/video special. Transferred to 35mm and given a limited theatrical release in the US. 18 | 19 | 20 | Monty Python's The Meaning of Life 21 | An examination of the meaning of life in a series of sketches from conception to death and beyond. 22 | 23 | 24 | Monty Python: Almost the Truth (The Lawyer's Cut) 25 | This film features interviews with all the surviving Python members, along with archive representation for the late Graham Chapman. 26 | 27 | 28 | A Liar's Autobiography: Volume IV 29 | This is an animated film which is based on the memoir of the late Monty Python member, Graham Chapman. 
30 | 31 | 32 | -------------------------------------------------------------------------------- /tests/test_parser.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import six 3 | import unittest 4 | import os 5 | import re 6 | 7 | import xml4h 8 | 9 | 10 | class TestParserBasics(unittest.TestCase): 11 | 12 | @property 13 | def small_xml_file_path(self): 14 | return os.path.join( 15 | os.path.dirname(__file__), 'data/example_doc.small.xml') 16 | 17 | def test_parse_with_default_parser(self): 18 | # Explicit use of default/best adapter 19 | dom = xml4h.parse(self.small_xml_file_path, adapter=xml4h.best_adapter) 20 | self.assertEqual(8, len(dom.find())) 21 | # Implicit use of default/best adapter 22 | dom = xml4h.parse(self.small_xml_file_path) 23 | self.assertEqual(8, len(dom.find())) 24 | self.assertEqual(xml4h.best_adapter, dom.adapter_class) 25 | 26 | 27 | class BaseParserTest(object): 28 | """ 29 | Tests to exercise parsing across all xml4h implementations. 
30 | """ 31 | 32 | @property 33 | def small_xml_file_path(self): 34 | return os.path.join( 35 | os.path.dirname(__file__), 'data/example_doc.small.xml') 36 | 37 | @property 38 | def unicode_xml_file_path(self): 39 | return os.path.join( 40 | os.path.dirname(__file__), 'data/example_doc.unicode.xml') 41 | 42 | def parse(self, xml_str): 43 | return xml4h.parse(xml_str, adapter=self.adapter) 44 | 45 | def test_auto_detect_filename_or_xml_data(self): 46 | # String with a '<' is parsed as literal XML data 47 | dom = self.parse('\n\n\tcontent') 48 | self.assertEqual(2, len(dom.find())) 49 | # String without a '<' is treated as a file path -- invalid path 50 | self.assertRaises(IOError, self.parse, 'not/a/real/file/path') 51 | # String without a '<' is treated as a file path -- valid path 52 | self.parse(self.small_xml_file_path) 53 | 54 | def test_parse_file(self): 55 | wrapped_doc = self.parse(self.small_xml_file_path) 56 | self.assertIsInstance(wrapped_doc, xml4h.nodes.Document) 57 | self.assertEqual(8, len(wrapped_doc.find())) 58 | # Check element namespaces 59 | self.assertEqual( 60 | ['DocRoot', 'NSDefaultImplicit', 'NSDefaultExplicit', 61 | 'Attrs1', 'Attrs2'], 62 | [n.name for n in wrapped_doc.find(ns_uri='urn:default')]) 63 | self.assertEqual( 64 | ['urn:custom', 'urn:custom', 'urn:custom'], 65 | [n.namespace_uri for n in wrapped_doc.find(ns_uri='urn:custom')]) 66 | # We test local name, not full name, here as different XML libraries 67 | # retain (or not) different literal element prefixes differently. 
68 | self.assertEqual( 69 | ['NSCustomExplicit', 70 | 'NSCustomWithPrefixImplicit', 71 | 'NSCustomWithPrefixExplicit'], 72 | [n.local_name for n in wrapped_doc.find(ns_uri='urn:custom')]) 73 | # Check namespace attributes 74 | self.assertEqual( 75 | [xml4h.nodes.Node.XMLNS_URI, xml4h.nodes.Node.XMLNS_URI], 76 | [n.namespace_uri for n in wrapped_doc.root.attribute_nodes]) 77 | attrs1_elem = wrapped_doc.find_first('Attrs1') 78 | self.assertNotEqual(None, attrs1_elem) 79 | self.assertEqual([None], 80 | [n.namespace_uri for n in attrs1_elem.attribute_nodes]) 81 | attrs2_elem = wrapped_doc.find_first('Attrs2') 82 | self.assertEqual(['urn:custom'], 83 | [n.namespace_uri for n in attrs2_elem.attribute_nodes]) 84 | 85 | def test_roundtrip(self): 86 | orig_xml = open(self.small_xml_file_path).read() 87 | # We discard semantically unnecessary namespace prefixes on 88 | # element names. 89 | orig_xml = re.sub( 90 | '', 91 | '', orig_xml) 92 | if self.adapter == xml4h.LXMLAdapter: 93 | # lxml parser does not make it possible to retain semantically 94 | # unnecessary 'xmlns' namespace definitions in all elements. 95 | # It's not worth failing the roundtrip test just for this 96 | orig_xml = re.sub( 97 | '', 98 | '', orig_xml) 99 | doc = self.parse(self.small_xml_file_path) 100 | roundtrip_xml = doc.xml_doc() 101 | self.assertEqual(six.text_type(orig_xml), roundtrip_xml) 102 | 103 | def test_unicode(self): 104 | # NOTE lxml doesn't support unicode namespace URIs? 
105 | doc = self.parse(self.unicode_xml_file_path) 106 | self.assertEqual(u'جذر', doc.root.name) 107 | self.assertEqual(u'urn:default', doc.root.attributes['xmlns']) 108 | self.assertEqual(u'urn:custom', doc.root.attributes[u'xmlns:důl']) 109 | self.assertEqual(5, len(doc.find(ns_uri=u'urn:default'))) 110 | self.assertEqual(3, len(doc.find(ns_uri=u'urn:custom'))) 111 | self.assertEqual(u'1', doc.find_first(u'yếutố1').attributes[u'תכונה']) 112 | self.assertEqual(u'tvö', 113 | doc.find_first(u'yếutố2').attributes[u'důl:עודתכונה']) 114 | 115 | 116 | class TestXmlDomParser(unittest.TestCase, BaseParserTest): 117 | 118 | @property 119 | def adapter(self): 120 | return xml4h.XmlDomImplAdapter 121 | 122 | 123 | class TestLXMLEtreeParser(unittest.TestCase, BaseParserTest): 124 | 125 | @property 126 | def adapter(self): 127 | if not xml4h.LXMLAdapter.is_available(): 128 | self.skipTest("lxml library is not installed") 129 | return xml4h.LXMLAdapter 130 | 131 | 132 | class TestElementTreeEtreeParser(unittest.TestCase, BaseParserTest): 133 | 134 | @property 135 | def adapter(self): 136 | if not xml4h.ElementTreeAdapter.is_available(): 137 | self.skipTest( 138 | "ElementTree library is not installed or is outdated") 139 | return xml4h.ElementTreeAdapter 140 | 141 | 142 | class TestcElementTreeEtreeParser(unittest.TestCase, BaseParserTest): 143 | 144 | @property 145 | def adapter(self): 146 | if not xml4h.cElementTreeAdapter.is_available(): 147 | self.skipTest( 148 | "cElementTree library is not installed or is outdated") 149 | return xml4h.cElementTreeAdapter 150 | -------------------------------------------------------------------------------- /tests/test_writer.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import unittest 3 | import functools 4 | import six 5 | 6 | import xml4h 7 | 8 | 9 | class BaseWriterTest(object): 10 | 11 | @property 12 | def my_builder(self): 13 | return 
functools.partial(xml4h.build, adapter=self.adapter) 14 | 15 | def setUp(self): 16 | # Create test document 17 | self.builder = ( 18 | self.my_builder('DocRoot') 19 | .element('Elem1').text(u'默认جذ').up() 20 | .element('Elem2')) 21 | # Handy IO writer 22 | self.iobytes = six.BytesIO() 23 | 24 | def test_write_defaults(self): 25 | """ Default write output is utf-8 with no pretty-printing """ 26 | xml = ( 27 | u'' 28 | u'' 29 | u'默认جذ' 30 | u'' 31 | u'' 32 | ) 33 | io_string = six.StringIO() 34 | self.builder.write_doc(io_string) 35 | if six.PY2: 36 | self.assertEqual(xml.encode('utf-8'), io_string.getvalue()) 37 | else: 38 | self.assertEqual(xml, io_string.getvalue()) 39 | 40 | def test_write_current_node_and_descendents(self): 41 | self.builder.dom_element.write(self.iobytes) 42 | self.assertEqual(b'', self.iobytes.getvalue()) 43 | 44 | def test_write_utf8_by_default(self): 45 | # Default write output is utf-8, with no pretty-printing 46 | xml = ( 47 | u'' 48 | u'' 49 | u'默认جذ' 50 | u'' 51 | u'' 52 | ) 53 | self.builder.dom_element.write_doc(self.iobytes) 54 | self.assertEqual(xml.encode('utf-8'), self.iobytes.getvalue()) 55 | 56 | def test_write_utf16(self): 57 | xml = ( 58 | u'' 59 | u'' 60 | u'默认جذ' 61 | u'' 62 | u'' 63 | ) 64 | self.builder.dom_element.write_doc(self.iobytes, encoding='utf-16') 65 | self.assertEqual(xml.encode('utf-16'), self.iobytes.getvalue()) 66 | 67 | def test_write_latin1_with_illegal_characters(self): 68 | self.assertRaises(UnicodeEncodeError, 69 | self.builder.dom_element.write_doc, 70 | self.iobytes, encoding='latin1', indent=2) 71 | 72 | def test_write_latin1(self): 73 | # Create latin1-friendly test document 74 | self.builder = ( 75 | self.my_builder('DocRoot') 76 | .element('Elem1').text(u'Tést çæsè').up() 77 | .element('Elem2')) 78 | self.builder.dom_element.write_doc(self.iobytes, encoding='latin1') 79 | self.assertEqual( 80 | u'' 81 | u'' 82 | u'Tést çæsè' 83 | u'' 84 | u''.encode('latin1'), 85 | self.iobytes.getvalue()) 86 | 87 | 
def test_with_no_encoding(self): 88 | """No encoding writes python unicode""" 89 | xml = ( 90 | u'' 91 | u'' 92 | u'默认جذ' 93 | u'' 94 | u'' 95 | ) 96 | io_string = six.StringIO() 97 | self.builder.dom_element.write_doc(io_string, encoding=None) 98 | # NOTE Exact test, no encoding of comparison XML doc string 99 | self.assertEqual(xml, io_string.getvalue()) 100 | 101 | def test_omit_declaration(self): 102 | self.builder.dom_element.write_doc(self.iobytes, 103 | omit_declaration=True) 104 | self.assertEqual( 105 | u'' 106 | u'默认جذ' 107 | u'' 108 | u''.encode('utf-8'), 109 | self.iobytes.getvalue()) 110 | 111 | def test_default_indent_and_newline(self): 112 | """Default indent of 4 spaces with newlines when indent=True""" 113 | self.builder.dom_element.write_doc(self.iobytes, indent=True) 114 | self.assertEqual( 115 | u'\n' 116 | u'\n' 117 | u' 默认جذ\n' 118 | u' \n' 119 | u'\n'.encode('utf-8'), 120 | self.iobytes.getvalue()) 121 | 122 | def test_custom_indent_and_newline(self): 123 | self.builder.dom_element.write_doc(self.iobytes, 124 | indent=8, newline='\t') 125 | self.assertEqual( 126 | u'\t' 127 | u'\t' 128 | u' 默认جذ\t' 129 | u' \t' 130 | u'\t'.encode('utf-8'), 131 | self.iobytes.getvalue()) 132 | 133 | 134 | class TestXmlDomBuilder(BaseWriterTest, unittest.TestCase): 135 | """ 136 | Tests building with the standard library xml.dom module, or with any 137 | library that augments/clobbers this module. 138 | """ 139 | 140 | @property 141 | def adapter(self): 142 | return xml4h.XmlDomImplAdapter 143 | 144 | 145 | class TestLXMLEtreeBuilder(BaseWriterTest, unittest.TestCase): 146 | """ 147 | Tests building with the lxml (lxml.etree) library. 
148 | """ 149 | 150 | @property 151 | def adapter(self): 152 | if not xml4h.LXMLAdapter.is_available(): 153 | self.skipTest("lxml library is not installed") 154 | return xml4h.LXMLAdapter 155 | 156 | 157 | class TestElementTreeBuilder(BaseWriterTest, unittest.TestCase): 158 | """ 159 | Tests building with the xml.etree.ElementTree library. 160 | """ 161 | 162 | @property 163 | def adapter(self): 164 | if not xml4h.ElementTreeAdapter.is_available(): 165 | self.skipTest( 166 | "ElementTree library is not installed or is outdated") 167 | return xml4h.ElementTreeAdapter 168 | 169 | 170 | class TestcElementTreeBuilder(BaseWriterTest, unittest.TestCase): 171 | """ 172 | Tests building with the xml.etree.cElementTree library. 173 | """ 174 | 175 | @property 176 | def adapter(self): 177 | if not xml4h.cElementTreeAdapter.is_available(): 178 | self.skipTest( 179 | "cElementTree library is not installed or is outdated") 180 | return xml4h.cElementTreeAdapter 181 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist=py27,py35,py36,py37,py38,without-lxml 3 | 4 | [testenv] 5 | deps= 6 | six 7 | nose 8 | coverage 9 | lxml 10 | commands= 11 | python -m nose --with-coverage --cover-package=xml4h --with-doctest --include=docs --doctest-extension=.rst 12 | 13 | ; Run reduced tests to ensure xml4h works when lxml isn't installed 14 | [testenv:without-lxml] 15 | deps= 16 | six 17 | nose 18 | coverage 19 | commands= 20 | python -m nose 21 | -------------------------------------------------------------------------------- /xml4h/__init__.py: -------------------------------------------------------------------------------- 1 | import six 2 | 3 | import xml4h 4 | 5 | # Make commonly-used classes and functions available in xml4h module 6 | from xml4h.impls.xml_dom_minidom import XmlDomImplAdapter 7 | from xml4h.impls.xml_etree_elementtree import ( 8 | 
ElementTreeAdapter, cElementTreeAdapter) 9 | from xml4h.impls.lxml_etree import LXMLAdapter 10 | from xml4h.builder import Builder 11 | from xml4h.writer import write_node 12 | 13 | 14 | __title__ = 'xml4h' 15 | __version__ = '1.0' 16 | 17 | 18 | # List of xml4h adapter classes, in order of preference 19 | _ADAPTER_CLASSES = [ 20 | LXMLAdapter, 21 | cElementTreeAdapter, 22 | ElementTreeAdapter, 23 | XmlDomImplAdapter] 24 | 25 | _ADAPTERS_AVAILABLE = [] 26 | _ADAPTERS_UNAVAILABLE = [] 27 | 28 | for impl_class in _ADAPTER_CLASSES: 29 | if impl_class.is_available(): 30 | _ADAPTERS_AVAILABLE.append(impl_class) 31 | else: 32 | _ADAPTERS_UNAVAILABLE.append(impl_class) 33 | 34 | 35 | best_adapter = _ADAPTERS_AVAILABLE[0] 36 | """ 37 | The :ref:`best adapter available ` in the Python environment. 38 | This adapter is the default when parsing or creating XML documents, 39 | unless overridden by passing a specific adapter class. 40 | """ 41 | 42 | 43 | def parse( 44 | to_parse, ignore_whitespace_text_nodes=True, adapter=None 45 | ): 46 | """ 47 | Parse an XML document into an *xml4h*-wrapped DOM representation 48 | using an underlying XML library implementation. 49 | 50 | :param to_parse: an XML document file, document bytes, or the 51 | path to an XML file. If a bytes value is given that contains 52 | a ``<`` character it is treated as literal XML data, otherwise 53 | a bytes value is treated as a file path. 54 | :type to_parse: a file-like object or string 55 | :param bool ignore_whitespace_text_nodes: if ``True`` pure whitespace 56 | nodes are stripped from the parsed document, since these are 57 | usually noise introduced by XML docs serialized to be human-friendly. 58 | :param adapter: the *xml4h* implementation adapter class used to parse 59 | the document and to interact with the resulting nodes. 60 | If None, :attr:`best_adapter` will be used. 
61 | :type adapter: adapter class or None 62 | 63 | :return: an :class:`xml4h.nodes.Document` node representing the 64 | parsed document. 65 | 66 | Delegates to an adapter's :meth:`~xml4h.impls.interface.parse_string` or 67 | :meth:`~xml4h.impls.interface.parse_file` implementation. 68 | """ 69 | if adapter is None: 70 | adapter = best_adapter 71 | if isinstance(to_parse, six.binary_type) and b'<' in to_parse: 72 | return adapter.parse_bytes(to_parse, ignore_whitespace_text_nodes) 73 | elif isinstance(to_parse, six.string_types) and '<' in to_parse: 74 | return adapter.parse_string(to_parse, ignore_whitespace_text_nodes) 75 | else: 76 | return adapter.parse_file(to_parse, ignore_whitespace_text_nodes) 77 | 78 | 79 | def build(tagname_or_element, ns_uri=None, adapter=None): 80 | """ 81 | Return a :class:`~xml4h.builder.Builder` that represents an element in 82 | a new or existing XML DOM and provides "chainable" methods focussed 83 | specifically on adding XML content. 84 | 85 | :param tagname_or_element: a string name for the root node of a 86 | new XML document, or an :class:`~xml4h.nodes.Element` node in an 87 | existing document. 88 | :type tagname_or_element: string or :class:`~xml4h.nodes.Element` node 89 | :param ns_uri: a namespace URI to apply to the new root node. This 90 | argument has no effect when this method is acting on an existing element. 91 | :type ns_uri: string or None 92 | :param adapter: the *xml4h* implementation adapter class used to 93 | interact with the document DOM nodes. 94 | If None, :attr:`best_adapter` will be used. 95 | :type adapter: adapter class or None 96 | 97 | :return: a :class:`~xml4h.builder.Builder` instance that represents an 98 | :class:`~xml4h.nodes.Element` node in an XML DOM. 
99 | """ 100 | if adapter is None: 101 | adapter = best_adapter 102 | if isinstance(tagname_or_element, six.string_types): 103 | doc = adapter.create_document( 104 | tagname_or_element, ns_uri=ns_uri) 105 | element = doc.root 106 | elif isinstance(tagname_or_element, xml4h.nodes.Element): 107 | element = tagname_or_element 108 | else: 109 | raise xml4h.exceptions.IncorrectArgumentTypeException( 110 | tagname_or_element, [str, xml4h.nodes.Element]) 111 | return Builder(element) 112 | -------------------------------------------------------------------------------- /xml4h/builder.py: -------------------------------------------------------------------------------- 1 | """ 2 | Builder is a utility class that makes it easy to create valid, well-formed 3 | XML documents using relatively sparse Python code. The builder class works 4 | by wrapping an :class:`xml4h.nodes.Element` node to provide "chainable" 5 | methods focussed specifically on adding XML content. 6 | 7 | Each method that adds content returns a Builder instance representing the 8 | current or the newly-added element. Behind the scenes, the builder uses the 9 | :mod:`xml4h.nodes` node traversal and manipulation methods to add content 10 | directly to the underlying DOM. 11 | 12 | You will not generally create Builder instances directly, but will instead 13 | call the :func:`xml4h.build` function with the name for a new root element 14 | or with an existing :class:`xml4h.nodes.Element` node. 15 | """ 16 | import xml4h 17 | 18 | 19 | class Builder(object): 20 | """ 21 | Builder class that wraps an :class:`xml4h.nodes.Element` node with methods 22 | for adding XML content to an underlying DOM. 23 | """ 24 | 25 | def __init__(self, element): 26 | """ 27 | Create a Builder representing an xml4h Element node. 
28 | 29 | :param element: Element node to represent 30 | :type element: :class:`xml4h.nodes.Element` 31 | """ 32 | if not isinstance(element, xml4h.nodes.Element): 33 | raise ValueError( 34 | "Builder can only be created with an %s.%s instance, not %s" 35 | % (xml4h.nodes.Element.__module__, 36 | xml4h.nodes.Element.__name__, 37 | element)) 38 | self._element = element 39 | 40 | @property 41 | def dom_element(self): 42 | """ 43 | :return: the :class:`xml4h.nodes.Element` node represented by this 44 | Builder. 45 | """ 46 | return self._element 47 | 48 | @property 49 | def document(self): 50 | """ 51 | :return: the :class:`xml4h.nodes.Document` node that contains the 52 | element represented by this Builder. 53 | """ 54 | return self._element.document 55 | 56 | @property 57 | def root(self): 58 | """ 59 | :return: the :class:`xml4h.nodes.Element` root node ancestor of the 60 | element represented by this Builder 61 | """ 62 | return self._element.root 63 | 64 | def find(self, **kwargs): 65 | """ 66 | Find descendants of the element represented by this builder that 67 | match the given constraints. 68 | 69 | :return: a list of :class:`xml4h.nodes.Element` nodes 70 | 71 | Delegates to :meth:`xml4h.nodes.Node.find` 72 | """ 73 | return self._element.find(**kwargs) 74 | 75 | def find_doc(self, **kwargs): 76 | """ 77 | Find nodes in this element's owning :class:`xml4h.nodes.Document` 78 | that match the given constraints. 79 | 80 | :return: a list of :class:`xml4h.nodes.Element` nodes 81 | 82 | Delegates to :meth:`xml4h.nodes.Node.find_doc`. 83 | """ 84 | return self._element.find_doc(**kwargs) 85 | 86 | def write(self, *args, **kwargs): 87 | """ 88 | Write XML bytes for the element represented by this builder. 89 | 90 | Delegates to :meth:`xml4h.nodes.Node.write`. 91 | """ 92 | self.dom_element.write(*args, **kwargs) 93 | 94 | def write_doc(self, *args, **kwargs): 95 | """ 96 | Write XML bytes for the Document containing the element 97 | represented by this builder. 
98 | 99 | Delegates to :meth:`xml4h.nodes.Node.write_doc`. 100 | """ 101 | self.dom_element.write_doc(*args, **kwargs) 102 | 103 | def xml(self, **kwargs): 104 | """ 105 | :return: XML string for the element represented by this builder. 106 | 107 | Delegates to :meth:`xml4h.nodes.Node.xml`. 108 | """ 109 | return self.dom_element.xml(**kwargs) 110 | 111 | def xml_doc(self, **kwargs): 112 | """ 113 | :return: XML string for the Document containing the element represented 114 | by this builder. 115 | 116 | Delegates to :meth:`xml4h.nodes.Node.xml_doc`. 117 | """ 118 | return self.dom_element.xml_doc(**kwargs) 119 | 120 | def up(self, count_or_element_name=1): 121 | """ 122 | :return: a builder representing an ancestor of the current element, 123 | by default the parent element. 124 | 125 | :param count_or_element_name: 126 | when an integer, return the n'th ancestor element up to the 127 | document's root element. 128 | when a string, return the nearest ancestor element with that name, 129 | or the document's root element if there are no matching ancestors. 130 | Defaults to integer value 1 which means the immediate parent. 
131 | :type count_or_element_name: integer or string 132 | """ 133 | elem = self._element 134 | to_count = to_name = None 135 | if isinstance(count_or_element_name, int): 136 | to_count = count_or_element_name 137 | else: 138 | to_name = count_or_element_name 139 | up_count = 0 140 | while True: 141 | # Don't go up beyond the document root 142 | if elem.is_root or elem.parent is None: 143 | break 144 | # Go up to element's parent 145 | elem = elem.parent 146 | # If we have a name to match and it matches, stop 147 | if to_name: 148 | if elem.name == to_name: 149 | break 150 | continue 151 | # If we have a count to reach and have reached it, stop 152 | up_count += 1 153 | if up_count >= to_count: 154 | break 155 | return Builder(elem) 156 | 157 | def transplant(self, node): 158 | """ 159 | Transplant a node from another document to become a child of 160 | the :class:`xml4h.nodes.Element` node represented by this Builder. 161 | 162 | :return: a new Builder that represents the current element \ 163 | (not the transplanted node). 164 | 165 | Delegates to :meth:`xml4h.nodes.Node.transplant_node`. 166 | """ 167 | self._element.transplant_node(node) 168 | return self 169 | 170 | def clone(self, node): 171 | """ 172 | Clone a node from another document to become a child of 173 | the :class:`xml4h.nodes.Element` node represented by this Builder. 174 | 175 | :return: a new Builder that represents the current element \ 176 | (not the cloned node). 177 | 178 | Delegates to :meth:`xml4h.nodes.Node.clone_node`. 179 | """ 180 | self._element.clone_node(node) 181 | return self 182 | 183 | def element(self, *args, **kwargs): 184 | """ 185 | Add a child element to the :class:`xml4h.nodes.Element` node 186 | represented by this Builder. 187 | 188 | :return: a new Builder that represents the child element. 189 | 190 | Delegates to :meth:`xml4h.nodes.Element.add_element`. 
191 | """ 192 | child_element = self._element.add_element(*args, **kwargs) 193 | return Builder(child_element) 194 | 195 | elem = element # Alias 196 | """Alias of :meth:`element`""" 197 | 198 | e = element # Alias 199 | """Alias of :meth:`element`""" 200 | 201 | def attributes(self, *args, **kwargs): 202 | """ 203 | Add one or more attributes to the :class:`xml4h.nodes.Element` node 204 | represented by this Builder. 205 | 206 | :return: the current Builder. 207 | 208 | Delegates to :meth:`xml4h.nodes.Element.set_attributes`. 209 | """ 210 | self._element.set_attributes(*args, **kwargs) 211 | return self 212 | 213 | attrs = attributes # Alias 214 | """Alias of :meth:`attributes`""" 215 | 216 | a = attributes # Alias 217 | """Alias of :meth:`attributes`""" 218 | 219 | def text(self, text): 220 | """ 221 | Add a text node to the :class:`xml4h.nodes.Element` node 222 | represented by this Builder. 223 | 224 | :return: the current Builder. 225 | 226 | Delegates to :meth:`xml4h.nodes.Element.add_text`. 227 | """ 228 | self._element.add_text(text) 229 | return self 230 | 231 | t = text # Alias 232 | """Alias of :meth:`text`""" 233 | 234 | def comment(self, text): 235 | """ 236 | Add a comment node to the :class:`xml4h.nodes.Element` node 237 | represented by this Builder. 238 | 239 | :return: the current Builder. 240 | 241 | Delegates to :meth:`xml4h.nodes.Element.add_comment`. 242 | """ 243 | self._element.add_comment(text) 244 | return self 245 | 246 | c = comment # Alias 247 | """Alias of :meth:`comment`""" 248 | 249 | def processing_instruction(self, target, data): 250 | """ 251 | Add a processing instruction node to the :class:`xml4h.nodes.Element` 252 | node represented by this Builder. 253 | 254 | :return: the current Builder. 255 | 256 | Delegates to :meth:`xml4h.nodes.Element.add_instruction`. 
257 | """ 258 | self._element.add_instruction(target, data) 259 | return self 260 | 261 | instruction = processing_instruction # Alias 262 | """Alias of :meth:`processing_instruction`""" 263 | 264 | i = instruction # Alias 265 | """Alias of :meth:`processing_instruction`""" 266 | 267 | def cdata(self, text): 268 | """ 269 | Add a CDATA node to the :class:`xml4h.nodes.Element` node 270 | represented by this Builder. 271 | 272 | :return: the current Builder. 273 | 274 | Delegates to :meth:`xml4h.nodes.Element.add_cdata`. 275 | """ 276 | self._element.add_cdata(text) 277 | return self 278 | 279 | data = cdata # Alias 280 | """Alias of :meth:`cdata`""" 281 | 282 | d = cdata # Alias 283 | """Alias of :meth:`cdata`""" 284 | 285 | def ns_prefix(self, prefix, ns_uri): 286 | """ 287 | Set the namespace prefix of the :class:`xml4h.nodes.Element` node 288 | represented by this Builder. 289 | 290 | :return: the current Builder. 291 | 292 | Delegates to :meth:`xml4h.nodes.Element.set_ns_prefix`. 293 | """ 294 | self._element.set_ns_prefix(prefix, ns_uri) 295 | return self 296 | -------------------------------------------------------------------------------- /xml4h/exceptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Custom *xml4h* exceptions. 3 | """ 4 | 5 | 6 | class Xml4hException(Exception): 7 | """ 8 | Base exception class for all non-standard exceptions raised by *xml4h*. 9 | """ 10 | pass 11 | 12 | 13 | class Xml4hImplementationBug(Xml4hException): 14 | """ 15 | *xml4h* implementation has a bug, probably. 16 | """ 17 | pass 18 | 19 | 20 | class FeatureUnavailableException(Xml4hException): 21 | """ 22 | User has attempted to use a feature that is available in some *xml4h* 23 | implementations/adapters, but is not available in the current one. 
24 | """ 25 | pass 26 | 27 | 28 | class IncorrectArgumentTypeException(ValueError, Xml4hException): 29 | """ 30 | Richer flavour of a ValueError that describes exactly what argument 31 | types are expected. 32 | """ 33 | 34 | def __init__(self, arg, expected_types): 35 | msg = ('Argument %s is not one of the expected types: %s' 36 | % (arg, expected_types)) 37 | super(IncorrectArgumentTypeException, self).__init__(msg) 38 | 39 | 40 | class UnknownNamespaceException(ValueError, Xml4hException): 41 | """ 42 | User has attempted to refer to an unknown or undeclared namespace by 43 | prefix or URI. 44 | """ 45 | pass 46 | -------------------------------------------------------------------------------- /xml4h/impls/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmurty/xml4h/83bc0a91afe5d6e17d6c99ec43dc0aec9593cc06/xml4h/impls/__init__.py -------------------------------------------------------------------------------- /xml4h/impls/interface.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import six 3 | 4 | from xml4h import nodes, exceptions 5 | 6 | 7 | @six.add_metaclass(abc.ABCMeta) 8 | class XmlImplAdapter(object): 9 | """ 10 | Base class that defines how *xml4h* interacts with an underlying XML 11 | library that the adapter "wraps" to provide additional (or at least 12 | different) functionality. 13 | 14 | This class should be treated as an abstract class. It provides some 15 | common implementation code used by all *xml4h* adapter implementations, 16 | but mostly it sketches out the methods the real implementation subclasses 17 | must provide. 18 | """ 19 | 20 | # List of extra features supported (or not) by an adapter implementation 21 | SUPPORTED_FEATURES = { 22 | 'xpath': False, 23 | } 24 | 25 | @classmethod 26 | def has_feature(cls, feature_name): 27 | """ 28 | :return: *True* if a named feature is supported by this adapter. 
29 | """ 30 | return cls.SUPPORTED_FEATURES.get(feature_name.lower(), False) 31 | 32 | @classmethod 33 | def ignore_whitespace_text_nodes(cls, wrapped_node): 34 | """ 35 | Find and delete any text nodes containing nothing but whitespace 36 | in the given node and its descendants. 37 | 38 | This is useful for cleaning up excess low-value text nodes in a 39 | document DOM after parsing a pretty-printed XML document. 40 | """ 41 | for child in wrapped_node.children: 42 | if child.is_text and child.value.strip() == '': 43 | child.delete() 44 | else: 45 | cls.ignore_whitespace_text_nodes(child) 46 | 47 | @classmethod 48 | def create_document(cls, root_tagname, ns_uri=None, **kwargs): 49 | # Use implementation's method to create base document and root element 50 | impl_doc = cls.new_impl_document(root_tagname, ns_uri, **kwargs) 51 | adapter = cls(impl_doc) 52 | wrapped_doc = nodes.Document(impl_doc, adapter) 53 | # Automatically add namespace URI to root Element as attribute 54 | if ns_uri is not None: 55 | adapter.set_node_attribute_value(wrapped_doc.root.impl_node, 56 | 'xmlns', ns_uri, ns_uri=nodes.Node.XMLNS_URI) 57 | return wrapped_doc 58 | 59 | @classmethod 60 | def wrap_document(cls, document_node): 61 | adapter = cls(document_node) 62 | return nodes.Document(document_node, adapter) 63 | 64 | @classmethod 65 | def wrap_node(cls, node, document, adapter=None): 66 | if node is None: 67 | return None 68 | if adapter is None: 69 | adapter = cls(document) 70 | impl_class = adapter.map_node_to_class(node) 71 | return impl_class(node, adapter) 72 | 73 | @classmethod 74 | @abc.abstractmethod 75 | def is_available(cls): 76 | """ 77 | :return: *True* if this adapter's underlying XML library is available \ 78 | in the Python environment. 
79 | """ 80 | raise NotImplementedError("Implementation missing for %s" % cls) 81 | 82 | @classmethod 83 | @abc.abstractmethod 84 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True): 85 | raise NotImplementedError("Implementation missing for %s" % cls) 86 | 87 | @classmethod 88 | @abc.abstractmethod 89 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True): 90 | raise NotImplementedError("Implementation missing for %s" % cls) 91 | 92 | @classmethod 93 | @abc.abstractmethod 94 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True): 95 | raise NotImplementedError("Implementation missing for %s" % cls) 96 | 97 | def __init__(self, document): 98 | if not isinstance(document, object): 99 | raise exceptions.IncorrectArgumentTypeException( 100 | document, [object]) 101 | self._impl_document = document 102 | self._auto_ns_prefix_count = 0 103 | self.clear_caches() 104 | 105 | def clear_caches(self): 106 | """ 107 | Clear any in-adapter cached data, for cases where cached data could 108 | become outdated e.g. by making DOM changes directly outside of *xml4h*. 109 | 110 | This is a no-op if the implementing adapter has no cached data. 
111 | """ 112 | pass 113 | 114 | @property 115 | def impl_document(self): 116 | return self._impl_document 117 | 118 | @property 119 | def impl_root_element(self): 120 | return self.get_impl_root(self.impl_document) 121 | 122 | def get_ns_uri_for_prefix(self, node, prefix): 123 | if prefix == 'xmlns': 124 | return nodes.Node.XMLNS_URI 125 | elif prefix is None: 126 | attr_name = 'xmlns' 127 | else: 128 | attr_name = 'xmlns:%s' % prefix 129 | uri = self.lookup_ns_uri_by_attr_name(node, attr_name) 130 | if uri is None: 131 | if attr_name == 'xmlns': 132 | # Default namespace URI 133 | return nodes.Node.XMLNS_URI 134 | raise exceptions.UnknownNamespaceException( 135 | "Unknown namespace URI for attribute name '%s'" % attr_name) 136 | return uri 137 | 138 | def get_ns_prefix_for_uri(self, node, uri, auto_generate_prefix=False): 139 | if uri == nodes.Node.XMLNS_URI: 140 | return 'xmlns' 141 | prefix = self.lookup_ns_prefix_for_uri(node, uri) 142 | if not prefix and auto_generate_prefix: 143 | prefix = 'autoprefix%d' % self._auto_ns_prefix_count 144 | self._auto_ns_prefix_count += 1 145 | return prefix 146 | 147 | def get_ns_info_from_node_name(self, name, impl_node): 148 | """ 149 | Return a three-element tuple with the prefix, local name, and namespace 150 | URI for the given element/attribute name (in the context of the given 151 | node's hierarchy). If the name has no associated prefix or namespace 152 | information, None is returned for those tuple members. 
153 | """ 154 | if '}' in name: 155 | ns_uri, name = name.split('}') 156 | ns_uri = ns_uri[1:] 157 | prefix = self.get_ns_prefix_for_uri(impl_node, ns_uri) 158 | elif ':' in name: 159 | prefix, name = name.split(':') 160 | ns_uri = self.get_ns_uri_for_prefix(impl_node, prefix) 161 | if ns_uri is None: 162 | raise exceptions.UnknownNamespaceException( 163 | "Prefix '%s' does not have a defined namespace URI" 164 | % prefix) 165 | else: 166 | prefix, ns_uri = None, None 167 | return prefix, name, ns_uri 168 | 169 | # Utility implementation methods 170 | 171 | @classmethod 172 | @abc.abstractmethod 173 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs): 174 | raise NotImplementedError("Implementation missing for %s" % cls) 175 | 176 | @abc.abstractmethod 177 | def map_node_to_class(self, node): 178 | raise NotImplementedError("Implementation missing for %s" % self) 179 | 180 | @abc.abstractmethod 181 | def get_impl_root(self, node): 182 | raise NotImplementedError("Implementation missing for %s" % self) 183 | 184 | # Document implementation methods 185 | 186 | @abc.abstractmethod 187 | def new_impl_element(self, tagname, ns_uri=None, parent=None): 188 | raise NotImplementedError("Implementation missing for %s" % self) 189 | 190 | @abc.abstractmethod 191 | def new_impl_text(self, text): 192 | raise NotImplementedError("Implementation missing for %s" % self) 193 | 194 | @abc.abstractmethod 195 | def new_impl_comment(self, text): 196 | raise NotImplementedError("Implementation missing for %s" % self) 197 | 198 | @abc.abstractmethod 199 | def new_impl_instruction(self, target, data): 200 | raise NotImplementedError("Implementation missing for %s" % self) 201 | 202 | @abc.abstractmethod 203 | def new_impl_cdata(self, text): 204 | raise NotImplementedError("Implementation missing for %s" % self) 205 | 206 | @abc.abstractmethod 207 | def find_node_elements(self, node, name='*', ns_uri='*'): 208 | """ 209 | :return: element node descendents of the given node 
that match the \ 210 | search constraints. 211 | 212 | :param node: a node object from the underlying XML library. 213 | :param string name: only elements with a matching name will be 214 | returned. If the value is ``*`` all names will match. 215 | :param string ns_uri: only elements with a matching namespace URI 216 | will be returned. If the value is ``*`` all namespaces will match. 217 | """ 218 | raise NotImplementedError("Implementation missing for %s" % self) 219 | 220 | def xpath_on_node(self, node, xpath, **kwargs): 221 | if not self.has_feature('xpath'): 222 | raise exceptions.FeatureUnavailableException('xpath') 223 | 224 | # Node implementation methods 225 | 226 | @abc.abstractmethod 227 | def get_node_namespace_uri(self, node): 228 | raise NotImplementedError("Implementation missing for %s" % self) 229 | 230 | @abc.abstractmethod 231 | def set_node_namespace_uri(self, node, ns_uri): 232 | raise NotImplementedError("Implementation missing for %s" % self) 233 | 234 | @abc.abstractmethod 235 | def get_node_parent(self, node): 236 | raise NotImplementedError("Implementation missing for %s" % self) 237 | 238 | @abc.abstractmethod 239 | def get_node_children(self, node): 240 | raise NotImplementedError("Implementation missing for %s" % self) 241 | 242 | @abc.abstractmethod 243 | def get_node_name(self, node): 244 | raise NotImplementedError("Implementation missing for %s" % self) 245 | 246 | @abc.abstractmethod 247 | def get_node_local_name(self, node): 248 | raise NotImplementedError("Implementation missing for %s" % self) 249 | 250 | @abc.abstractmethod 251 | def get_node_name_prefix(self, node): 252 | raise NotImplementedError("Implementation missing for %s" % self) 253 | 254 | @abc.abstractmethod 255 | def get_node_value(self, node): 256 | raise NotImplementedError("Implementation missing for %s" % self) 257 | 258 | @abc.abstractmethod 259 | def set_node_value(self, node, value): 260 | raise NotImplementedError("Implementation missing for %s" % self) 261 
| 262 | @abc.abstractmethod 263 | def get_node_text(self, node): 264 | raise NotImplementedError("Implementation missing for %s" % self) 265 | 266 | @abc.abstractmethod 267 | def set_node_text(self, node, text): 268 | raise NotImplementedError("Implementation missing for %s" % self) 269 | 270 | @abc.abstractmethod 271 | def get_node_attributes(self, element, ns_uri=None): 272 | raise NotImplementedError("Implementation missing for %s" % self) 273 | 274 | @abc.abstractmethod 275 | def has_node_attribute(self, element, name, ns_uri=None): 276 | raise NotImplementedError("Implementation missing for %s" % self) 277 | 278 | @abc.abstractmethod 279 | def get_node_attribute_node(self, element, name, ns_uri=None): 280 | raise NotImplementedError("Implementation missing for %s" % self) 281 | 282 | @abc.abstractmethod 283 | def get_node_attribute_value(self, element, name, ns_uri=None): 284 | raise NotImplementedError("Implementation missing for %s" % self) 285 | 286 | @abc.abstractmethod 287 | def set_node_attribute_value(self, element, name, value, ns_uri=None): 288 | raise NotImplementedError("Implementation missing for %s" % self) 289 | 290 | @abc.abstractmethod 291 | def remove_node_attribute(self, element, name, ns_uri=None): 292 | raise NotImplementedError("Implementation missing for %s" % self) 293 | 294 | @abc.abstractmethod 295 | def add_node_child(self, parent, child, before_sibling=None): 296 | raise NotImplementedError("Implementation missing for %s" % self) 297 | 298 | @abc.abstractmethod 299 | def import_node(self, parent, node, original_parent=None, clone=False): 300 | raise NotImplementedError("Implementation missing for %s" % self) 301 | 302 | @abc.abstractmethod 303 | def clone_node(self, node, deep=True): 304 | raise NotImplementedError("Implementation missing for %s" % self) 305 | 306 | @abc.abstractmethod 307 | def remove_node_child(self, parent, child, destroy_node=True): 308 | raise NotImplementedError("Implementation missing for %s" % self) 309 | 310 
| @abc.abstractmethod 311 | def lookup_ns_uri_by_attr_name(self, node, name): 312 | raise NotImplementedError("Implementation missing for %s" % self) 313 | 314 | @abc.abstractmethod 315 | def lookup_ns_prefix_for_uri(self, node, uri): 316 | raise NotImplementedError("Implementation missing for %s" % self) 317 | -------------------------------------------------------------------------------- /xml4h/impls/lxml_etree.py: -------------------------------------------------------------------------------- 1 | import re 2 | import copy 3 | 4 | from xml4h.impls.interface import XmlImplAdapter 5 | from xml4h import nodes, exceptions 6 | 7 | try: 8 | from lxml import etree 9 | except ImportError: 10 | pass 11 | 12 | 13 | class LXMLAdapter(XmlImplAdapter): 14 | """ 15 | Adapter to the `lxml `_ XML library implementation. 16 | """ 17 | 18 | SUPPORTED_FEATURES = { 19 | 'xpath': True, 20 | } 21 | 22 | @classmethod 23 | def is_available(cls): 24 | try: 25 | etree.Element 26 | return True 27 | except: 28 | return False 29 | 30 | @classmethod 31 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True): 32 | impl_root_elem = etree.fromstring(xml_str) 33 | wrapped_doc = LXMLAdapter.wrap_document(impl_root_elem.getroottree()) 34 | if ignore_whitespace_text_nodes: 35 | cls.ignore_whitespace_text_nodes(wrapped_doc) 36 | return wrapped_doc 37 | 38 | @classmethod 39 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True): 40 | return LXMLAdapter.parse_string(xml_bytes, ignore_whitespace_text_nodes) 41 | 42 | @classmethod 43 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True): 44 | impl_doc = etree.parse(xml_file) 45 | wrapped_doc = LXMLAdapter.wrap_document(impl_doc) 46 | if ignore_whitespace_text_nodes: 47 | cls.ignore_whitespace_text_nodes(wrapped_doc) 48 | return wrapped_doc 49 | 50 | @classmethod 51 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs): 52 | root_nsmap = {} 53 | if ns_uri is not None: 54 | root_nsmap[None] = ns_uri 55 | 
else: 56 | ns_uri = nodes.Node.XMLNS_URI 57 | root_nsmap[None] = ns_uri 58 | root_elem = etree.Element('{%s}%s' % (ns_uri, root_tagname), 59 | nsmap=root_nsmap) 60 | doc = etree.ElementTree(root_elem) 61 | return doc 62 | 63 | def map_node_to_class(self, node): 64 | if isinstance(node, etree._ProcessingInstruction): 65 | return nodes.ProcessingInstruction 66 | elif isinstance(node, etree._Comment): 67 | return nodes.Comment 68 | elif isinstance(node, etree._ElementTree): 69 | return nodes.Document 70 | elif isinstance(node, etree._Element): 71 | return nodes.Element 72 | elif isinstance(node, LXMLAttribute): 73 | return nodes.Attribute 74 | elif isinstance(node, LXMLText): 75 | if node.is_cdata: 76 | return nodes.CDATA 77 | else: 78 | return nodes.Text 79 | raise exceptions.Xml4hImplementationBug( 80 | 'Unrecognized type for implementation node: %s' % node) 81 | 82 | def get_impl_root(self, node): 83 | return self._impl_document.getroot() 84 | 85 | # Document implementation methods 86 | 87 | def new_impl_element(self, tagname, ns_uri=None, parent=None): 88 | if ns_uri is not None: 89 | if ':' in tagname: 90 | tagname = tagname.split(':')[1] 91 | my_nsmap = {None: ns_uri} 92 | # Add any xmlns attribute prefix mappings from parent's document 93 | # TODO This doesn't seem to help 94 | curr_node = parent 95 | while curr_node.__class__ == etree._Element: 96 | for n, v in list(curr_node.attrib.items()): 97 | if '{%s}' % nodes.Node.XMLNS_URI in n: 98 | _, prefix = n.split('}') 99 | my_nsmap[prefix] = v 100 | curr_node = self.get_node_parent(curr_node) 101 | return etree.Element('{%s}%s' % (ns_uri, tagname), nsmap=my_nsmap) 102 | else: 103 | return etree.Element(tagname) 104 | 105 | def new_impl_text(self, text): 106 | return LXMLText(text) 107 | 108 | def new_impl_comment(self, text): 109 | return etree.Comment(text) 110 | 111 | def new_impl_instruction(self, target, data): 112 | return etree.ProcessingInstruction(target, data) 113 | 114 | def new_impl_cdata(self, text): 
115 | return LXMLText(text, is_cdata=True) 116 | 117 | def find_node_elements(self, node, name='*', ns_uri='*'): 118 | # TODO Any proper way to find namespaced elements by name? 119 | name_match_nodes = node.getiterator() 120 | # Filter nodes by name and ns_uri if necessary 121 | results = [] 122 | for n in name_match_nodes: 123 | # Ignore the current node 124 | if n == node: 125 | continue 126 | # Ignore non-Elements 127 | if not n.__class__ == etree._Element: 128 | continue 129 | if ns_uri != '*' and self.get_node_namespace_uri(n) != ns_uri: 130 | continue 131 | if name != '*' and self.get_node_local_name(n) != name: 132 | continue 133 | results.append(n) 134 | return results 135 | find_node_elements.__doc__ = XmlImplAdapter.find_node_elements.__doc__ 136 | 137 | def xpath_on_node(self, node, xpath, **kwargs): 138 | """ 139 | Return result of performing the given XPath query on the given node. 140 | 141 | All known namespace prefix-to-URI mappings in the document are 142 | automatically included in the XPath invocation. 143 | 144 | If an empty/default namespace (i.e. None) is defined, this is 145 | converted to the prefix name '_' so it can be used despite empty 146 | namespace prefixes being unsupported by XPath. 
147 | """ 148 | if isinstance(node, etree._ElementTree): 149 | # Document node lxml.etree._ElementTree has no nsmap, lookup root 150 | root = self.get_impl_root(node) 151 | namespaces_dict = root.nsmap.copy() 152 | else: 153 | namespaces_dict = node.nsmap.copy() 154 | if 'namespaces' in kwargs: 155 | namespaces_dict.update(kwargs['namespaces']) 156 | # Empty namespace prefix is not supported, convert to '_' prefix 157 | if None in namespaces_dict: 158 | default_ns_uri = namespaces_dict.pop(None) 159 | namespaces_dict['_'] = default_ns_uri 160 | # Include XMLNS namespace if it's not already defined 161 | if not 'xmlns' in namespaces_dict: 162 | namespaces_dict['xmlns'] = nodes.Node.XMLNS_URI 163 | return node.xpath(xpath, namespaces=namespaces_dict) 164 | 165 | # Node implementation methods 166 | 167 | def get_node_namespace_uri(self, node): 168 | if '}' in node.tag: 169 | return node.tag.split('}')[0][1:] 170 | elif isinstance(node, LXMLAttribute): 171 | return node.namespace_uri 172 | elif isinstance(node, etree._ElementTree): 173 | return None 174 | elif isinstance(node, etree._Element): 175 | qname, ns_uri = self._unpack_name(node.tag, node)[:2] 176 | return ns_uri 177 | else: 178 | return None 179 | 180 | def set_node_namespace_uri(self, node, ns_uri): 181 | node.nsmap[None] = ns_uri 182 | 183 | def get_node_parent(self, node): 184 | if isinstance(node, etree._ElementTree): 185 | return None 186 | else: 187 | parent = node.getparent() 188 | # Return ElementTree as root element's parent 189 | if parent is None: 190 | return self.impl_document 191 | return parent 192 | 193 | def get_node_children(self, node): 194 | if isinstance(node, etree._ElementTree): 195 | children = [node.getroot()] 196 | else: 197 | if not hasattr(node, 'getchildren'): 198 | return [] 199 | children = node.getchildren() 200 | # Hack to treat text attribute as child text nodes 201 | if node.text is not None: 202 | children.insert(0, LXMLText(node.text, parent=node)) 203 | return children 
204 | 205 | def get_node_name(self, node): 206 | if isinstance(node, etree._Comment): 207 | return '#comment' 208 | elif isinstance(node, etree._ProcessingInstruction): 209 | return node.target 210 | prefix = self.get_node_name_prefix(node) 211 | local_name = self.get_node_local_name(node) 212 | if prefix is not None: 213 | return '%s:%s' % (prefix, local_name) 214 | else: 215 | return local_name 216 | 217 | def get_node_local_name(self, node): 218 | return re.sub('{.*}', '', node.tag) 219 | 220 | def get_node_name_prefix(self, node): 221 | # Believe non-Element nodes that have a prefix set (e.g. LXMLAttribute) 222 | if node.prefix and not isinstance(node, etree._Element): 223 | return node.prefix 224 | # Derive prefix by unpacking node name 225 | qname, ns_uri, prefix, local_name = self._unpack_name(node.tag, node) 226 | if prefix: 227 | # Don't add unnecessary excess namespace prefixes for elements 228 | # with a local default namespace declaration 229 | xmlns_val = self.get_node_attribute_value(node, 'xmlns') 230 | if xmlns_val == ns_uri: 231 | return None 232 | # Don't add unnecessary excess namespace prefixes for default ns 233 | if prefix == 'xmlns': 234 | return None 235 | else: 236 | return prefix 237 | else: 238 | return None 239 | 240 | def get_node_value(self, node): 241 | if isinstance(node, (etree._ProcessingInstruction, etree._Comment)): 242 | return node.text 243 | elif hasattr(node, 'value'): 244 | return node.value 245 | else: 246 | return node.text 247 | 248 | def set_node_value(self, node, value): 249 | if hasattr(node, 'value'): 250 | node.value = value 251 | else: 252 | self.set_node_text(node, value) 253 | 254 | def get_node_text(self, node): 255 | return node.text 256 | 257 | def set_node_text(self, node, text): 258 | node.text = text 259 | 260 | def get_node_attributes(self, element, ns_uri=None): 261 | # TODO: Filter by ns_uri 262 | attribs_by_qname = {} 263 | for n, v in list(element.attrib.items()): 264 | qname, ns_uri, prefix, local_name 
= self._unpack_name(n, element) 265 | attribs_by_qname[qname] = LXMLAttribute( 266 | qname, ns_uri, prefix, local_name, v, element) 267 | # Include namespace declarations, which we also treat as attributes 268 | if element.nsmap: 269 | for n, v in list(element.nsmap.items()): 270 | # Only add namespace as attribute if not defined in ancestors 271 | # and not the global xmlns namespace 272 | if (self._is_ns_in_ancestor(element, n, v) 273 | or v == nodes.Node.XMLNS_URI): 274 | continue 275 | if n is None: 276 | ns_attr_name = 'xmlns' 277 | else: 278 | ns_attr_name = 'xmlns:%s' % n 279 | qname, ns_uri, prefix, local_name = self._unpack_name( 280 | ns_attr_name, element) 281 | attribs_by_qname[qname] = LXMLAttribute( 282 | qname, ns_uri, prefix, local_name, v, element) 283 | return list(attribs_by_qname.values()) 284 | 285 | def has_node_attribute(self, element, name, ns_uri=None): 286 | return name in [a.qname for a 287 | in self.get_node_attributes(element, ns_uri)] 288 | 289 | def get_node_attribute_node(self, element, name, ns_uri=None): 290 | for attr in self.get_node_attributes(element, ns_uri): 291 | if attr.qname == name: 292 | return attr 293 | return None 294 | 295 | def get_node_attribute_value(self, element, name, ns_uri=None): 296 | if ns_uri is not None: 297 | prefix = self.lookup_ns_prefix_for_uri(element, ns_uri) 298 | name = '%s:%s' % (prefix, name) 299 | for attr in self.get_node_attributes(element, ns_uri): 300 | if attr.qname == name: 301 | return attr.value 302 | return None 303 | 304 | def set_node_attribute_value(self, element, name, value, ns_uri=None): 305 | prefix = None 306 | if ':' in name: 307 | prefix, name = name.split(':') 308 | if ns_uri is None and prefix is not None: 309 | ns_uri = self.lookup_ns_uri_by_attr_name(element, prefix) 310 | if ns_uri is not None: 311 | name = '{%s}%s' % (ns_uri, name) 312 | if name.startswith('{%s}' % nodes.Node.XMLNS_URI): 313 | if element.nsmap.get(name) != value: 314 | # Ideally we would apply namespace 
(xmlns) attributes to the 315 | # element's `nsmap` only, but the lxml/etree nsmap attribute 316 | # is immutable and there's no non-hacky way around this. 317 | # TODO Is there a better way? 318 | pass 319 | if name.split('}')[1] == 'xmlns': 320 | # Hack to remove namespace URI from 'xmlns' attributes so 321 | # the name is just a simple string 322 | name = 'xmlns' 323 | element.attrib[name] = value 324 | else: 325 | element.attrib[name] = value 326 | 327 | def remove_node_attribute(self, element, name, ns_uri=None): 328 | if ns_uri is not None: 329 | name = '{%s}%s' % (ns_uri, name) 330 | elif ':' in name: 331 | prefix, name = name.split(':') 332 | if prefix == 'xmlns': 333 | name = '{%s}%s' % (nodes.Node.XMLNS_URI, name) 334 | else: 335 | name = '{%s}%s' % (element.nsmap[prefix], name) 336 | if name in element.attrib: 337 | del(element.attrib[name]) 338 | 339 | def add_node_child(self, parent, child, before_sibling=None): 340 | if isinstance(child, LXMLText): 341 | # Add text values directly to parent's 'text' attribute 342 | if parent.text is not None: 343 | parent.text = parent.text + child.text 344 | else: 345 | parent.text = child.text 346 | return None 347 | else: 348 | if before_sibling is not None: 349 | offset = 0 350 | for c in parent.getchildren(): 351 | if c == before_sibling: 352 | break 353 | offset += 1 354 | parent.insert(offset, child) 355 | else: 356 | parent.append(child) 357 | return child 358 | 359 | def import_node(self, parent, node, original_parent=None, clone=False): 360 | original_node = node 361 | if clone: 362 | node = self.clone_node(node) 363 | self.add_node_child(parent, node) 364 | # Hack to remove text node content from original parent by manually 365 | # deleting matching text content 366 | if not clone and isinstance(original_node, LXMLText): 367 | original_parent = self.get_node_parent(original_node) 368 | if original_parent.text == original_node.text: 369 | # Must set to None if there would be no remaining text, 370 | # 
otherwise parent element won't realise it's empty 371 | original_parent.text = None 372 | else: 373 | original_parent.text = \ 374 | original_parent.text.replace(original_node.text, '', 1) 375 | 376 | def clone_node(self, node, deep=True): 377 | if deep: 378 | return copy.deepcopy(node) 379 | else: 380 | return copy.copy(node) 381 | 382 | def remove_node_child(self, parent, child, destroy_node=True): 383 | if isinstance(child, LXMLText): 384 | parent.text = None 385 | return 386 | parent.remove(child) 387 | if destroy_node: 388 | child.clear() 389 | return None 390 | else: 391 | return child 392 | 393 | def lookup_ns_uri_by_attr_name(self, node, name): 394 | ns_name = None 395 | if name == 'xmlns': 396 | ns_name = None 397 | elif name.startswith('xmlns:'): 398 | _, ns_name = name.split(':') 399 | if ns_name in node.nsmap: 400 | return node.nsmap[ns_name] 401 | # If namespace is not in `nsmap` it may be in an XML DOM attribute 402 | # TODO Generalize this block 403 | curr_node = node 404 | while (curr_node is not None 405 | and curr_node.__class__ != etree._ElementTree): 406 | uri = self.get_node_attribute_value(curr_node, name) 407 | if uri is not None: 408 | return uri 409 | curr_node = self.get_node_parent(curr_node) 410 | return None 411 | 412 | def lookup_ns_prefix_for_uri(self, node, uri): 413 | if uri == nodes.Node.XMLNS_URI: 414 | return 'xmlns' 415 | result = None 416 | if hasattr(node, 'nsmap') and uri in list(node.nsmap.values()): 417 | for n, v in list(node.nsmap.items()): 418 | if v == uri: 419 | result = n 420 | break 421 | # TODO This is a slow hack necessary due to lxml's immutable nsmap 422 | if result is None or re.match(r'ns\d', result): 423 | # We either have no namespace prefix in the nsmap, in which case we 424 | # will try looking for a matching xmlns attribute, or we have 425 | # a namespace prefix that was probably assigned automatically by 426 | # lxml and we'd rather use a human-assigned prefix if available.
427 | curr_node = node # self.get_node_parent(node) 428 | while curr_node.__class__ == etree._Element: 429 | for n, v in list(curr_node.attrib.items()): 430 | if v == uri and ('{%s}' % nodes.Node.XMLNS_URI) in n: 431 | result = n.split('}')[1] 432 | return result 433 | curr_node = self.get_node_parent(curr_node) 434 | return result 435 | 436 | def _unpack_name(self, name, node): 437 | qname = prefix = local_name = ns_uri = None 438 | if name == 'xmlns': 439 | # Namespace URI of 'xmlns' is a constant 440 | ns_uri = nodes.Node.XMLNS_URI 441 | elif '}' in name: 442 | # Namespace URI is contained in {}, find URI's defined prefix 443 | ns_uri, local_name = name.split('}') 444 | ns_uri = ns_uri[1:] 445 | prefix = self.lookup_ns_prefix_for_uri(node, ns_uri) 446 | elif ':' in name: 447 | # Namespace prefix is before ':', find prefix's defined URI 448 | prefix, local_name = name.split(':') 449 | if prefix == 'xmlns': 450 | # All 'xmlns' attributes are in XMLNS URI by definition 451 | ns_uri = nodes.Node.XMLNS_URI 452 | else: 453 | ns_uri = self.lookup_ns_uri_by_attr_name(node, prefix) 454 | # Catch case where a prefix other than 'xmlns' points at XMLNS URI 455 | if name != 'xmlns' and ns_uri == nodes.Node.XMLNS_URI: 456 | prefix = 'xmlns' 457 | # Construct fully-qualified name from prefix + local names 458 | if prefix is not None: 459 | qname = '%s:%s' % (prefix, local_name) 460 | else: 461 | qname = local_name = name 462 | return (qname, ns_uri, prefix, local_name) 463 | 464 | def _is_ns_in_ancestor(self, node, name, value): 465 | """ 466 | Return True if the given namespace name/value is defined in an 467 | ancestor of the given node, meaning that the given node need not 468 | have its own attributes to apply that namespacing. 
469 | """ 470 | curr_node = self.get_node_parent(node) 471 | while curr_node.__class__ == etree._Element: 472 | if (hasattr(curr_node, 'nsmap') 473 | and curr_node.nsmap.get(name) == value): 474 | return True 475 | for n, v in list(curr_node.attrib.items()): 476 | if v == value and '{%s}' % nodes.Node.XMLNS_URI in n: 477 | return True 478 | curr_node = self.get_node_parent(curr_node) 479 | return False 480 | 481 | 482 | class LXMLText(object): 483 | 484 | def __init__(self, text, parent=None, is_cdata=False): 485 | self._text = text 486 | self._parent = parent 487 | self._is_cdata = is_cdata 488 | 489 | @property 490 | def is_cdata(self): 491 | return self._is_cdata 492 | 493 | @property 494 | def value(self): 495 | return self._text 496 | 497 | text = value # Alias 498 | 499 | def getparent(self): 500 | return self._parent 501 | 502 | @property 503 | def prefix(self): 504 | return None 505 | 506 | @property 507 | def tag(self): 508 | if self.is_cdata: 509 | return "#cdata-section" 510 | else: 511 | return "#text" 512 | 513 | 514 | class LXMLAttribute(object): 515 | 516 | def __init__(self, qname, ns_uri, prefix, local_name, value, element): 517 | self._qname, self._ns_uri, self._prefix, self._local_name = ( 518 | qname, ns_uri, prefix, local_name) 519 | self._value, self._element = (value, element) 520 | 521 | def getroottree(self): 522 | return self._element.getroottree() 523 | 524 | @property 525 | def qname(self): 526 | return self._qname 527 | 528 | @property 529 | def namespace_uri(self): 530 | return self._ns_uri 531 | 532 | @property 533 | def prefix(self): 534 | return self._prefix 535 | 536 | @property 537 | def local_name(self): 538 | return self._local_name 539 | 540 | @property 541 | def value(self): 542 | return self._value 543 | 544 | name = tag = local_name # Alias 545 | -------------------------------------------------------------------------------- /xml4h/impls/xml_dom_minidom.py: 
-------------------------------------------------------------------------------- 1 | from six import StringIO, BytesIO 2 | 3 | from xml4h.impls.interface import XmlImplAdapter 4 | from xml4h import nodes, exceptions 5 | 6 | import xml.dom 7 | import xml.dom.minidom 8 | 9 | 10 | class XmlDomImplAdapter(XmlImplAdapter): 11 | """ 12 | Adapter to the 13 | `minidom `_ XML 14 | library implementation. 15 | """ 16 | 17 | @classmethod 18 | def is_available(cls): 19 | try: 20 | xml.dom.Node 21 | return True 22 | except: 23 | return False 24 | 25 | @classmethod 26 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True): 27 | return cls.parse_file(StringIO(xml_str), ignore_whitespace_text_nodes) 28 | 29 | @classmethod 30 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True): 31 | return cls.parse_file(BytesIO(xml_bytes), ignore_whitespace_text_nodes) 32 | 33 | @classmethod 34 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True): 35 | impl_doc = xml.dom.minidom.parse(xml_file) 36 | wrapped_doc = XmlDomImplAdapter.wrap_document(impl_doc) 37 | if ignore_whitespace_text_nodes: 38 | cls.ignore_whitespace_text_nodes(wrapped_doc) 39 | return wrapped_doc 40 | 41 | @classmethod 42 | def new_impl_document(cls, root_tagname, ns_uri=None, 43 | doctype=None, impl_features=None): 44 | # Create DOM implementation factory 45 | if impl_features is None: 46 | impl_features = [] 47 | factory = xml.dom.getDOMImplementation('minidom', impl_features) 48 | # Create Document from factory 49 | doc = factory.createDocument(ns_uri, root_tagname, doctype) 50 | return doc 51 | 52 | def map_node_to_class(self, impl_node): 53 | try: 54 | return { 55 | xml.dom.Node.ELEMENT_NODE: nodes.Element, 56 | xml.dom.Node.ATTRIBUTE_NODE: nodes.Attribute, 57 | xml.dom.Node.TEXT_NODE: nodes.Text, 58 | xml.dom.Node.CDATA_SECTION_NODE: nodes.CDATA, 59 | # EntityReference not supported by minidom 60 | #xml.dom.Node.ENTITY_REFERENCE: nodes.EntityReference, 61 | 
xml.dom.Node.ENTITY_NODE: nodes.Entity, 62 | xml.dom.Node.PROCESSING_INSTRUCTION_NODE: 63 | nodes.ProcessingInstruction, 64 | xml.dom.Node.COMMENT_NODE: nodes.Comment, 65 | xml.dom.Node.DOCUMENT_NODE: nodes.Document, 66 | xml.dom.Node.DOCUMENT_TYPE_NODE: nodes.DocumentType, 67 | xml.dom.Node.DOCUMENT_FRAGMENT_NODE: nodes.DocumentFragment, 68 | xml.dom.Node.NOTATION_NODE: nodes.Notation, 69 | }[impl_node.nodeType] 70 | except KeyError: 71 | raise exceptions.Xml4hImplementationBug( 72 | 'Unrecognized type for implementation node: %s' % impl_node) 73 | 74 | def get_impl_root(self, node): 75 | return node.documentElement 76 | 77 | def new_impl_element(self, tagname, ns_uri=None, parent=None): 78 | return self.impl_document.createElementNS(ns_uri, tagname) 79 | 80 | def new_impl_text(self, text): 81 | return self.impl_document.createTextNode(text) 82 | 83 | def new_impl_comment(self, text): 84 | return self.impl_document.createComment(text) 85 | 86 | def new_impl_instruction(self, target, data): 87 | return self.impl_document.createProcessingInstruction(target, data) 88 | 89 | def new_impl_cdata(self, text): 90 | return self.impl_document.createCDATASection(text) 91 | 92 | def find_node_elements(self, node, name='*', ns_uri='*'): 93 | return node.getElementsByTagNameNS(ns_uri, name) 94 | 95 | def get_node_namespace_uri(self, node): 96 | return node.namespaceURI 97 | 98 | def set_node_namespace_uri(self, node, ns_uri): 99 | node.namespaceURI = ns_uri 100 | 101 | def get_node_parent(self, element): 102 | return element.parentNode 103 | 104 | def get_node_children(self, element): 105 | return element.childNodes 106 | 107 | def get_node_name(self, node): 108 | if node.nodeType not in ( 109 | xml.dom.Node.ELEMENT_NODE, xml.dom.Node.ATTRIBUTE_NODE 110 | ): 111 | return node.nodeName 112 | # Special handling of node names for Element and Attribute nodes where 113 | # we want to exclude the namespace prefix in some cases 114 | prefix = self.get_node_name_prefix(node) 115 | 
local_name = self.get_node_local_name(node) 116 | if prefix is not None: 117 | return '%s:%s' % (prefix, local_name) 118 | else: 119 | return local_name 120 | 121 | def get_node_local_name(self, node): 122 | return node.localName 123 | 124 | def get_node_name_prefix(self, node): 125 | prefix = node.prefix 126 | # Don't add unnecessary excess namespace prefixes for elements 127 | # with a local default namespace declaration 128 | if prefix and node.nodeType == xml.dom.Node.ELEMENT_NODE: 129 | xmlns_val = self.get_node_attribute_value(node, 'xmlns') 130 | if xmlns_val == self.get_node_namespace_uri(node): 131 | return None 132 | return prefix 133 | 134 | def get_node_value(self, node): 135 | return node.nodeValue 136 | 137 | def set_node_value(self, node, value): 138 | node.nodeValue = value 139 | 140 | def get_node_text(self, node): 141 | """ 142 | Return the concatenated value of all Text node children of this element 143 | """ 144 | text_children = [n.nodeValue for n in self.get_node_children(node) 145 | if n.nodeType == xml.dom.Node.TEXT_NODE] 146 | if text_children: 147 | return ''.join(text_children) 148 | else: 149 | return None 150 | 151 | def set_node_text(self, node, text): 152 | """ 153 | Set text value as sole Text child node of element; any existing 154 | Text nodes are removed 155 | """ 156 | # Remove any existing Text node children; iterate over a copy because removal mutates childNodes 157 | for child in list(self.get_node_children(node)): 158 | if child.nodeType == xml.dom.Node.TEXT_NODE: 159 | self.remove_node_child(node, child, True) 160 | if text is not None: 161 | text_node = self.new_impl_text(text) 162 | self.add_node_child(node, text_node) 163 | 164 | def get_node_attributes(self, element, ns_uri=None): 165 | attr_nodes = [] 166 | if not element.attributes: 167 | return attr_nodes 168 | for attr_name in list(element.attributes.keys()): 169 | if self.has_node_attribute(element, attr_name, ns_uri): 170 | attr_nodes.append( 171 | self.get_node_attribute_node(element, attr_name, ns_uri)) 172 | return attr_nodes
173 | 174 | def has_node_attribute(self, element, name, ns_uri=None): 175 | if ns_uri is not None: 176 | return element.hasAttributeNS(ns_uri, name) 177 | else: 178 | return element.hasAttribute(name) 179 | 180 | def get_node_attribute_node(self, element, name, ns_uri=None): 181 | if ns_uri is not None: 182 | return element.getAttributeNodeNS(ns_uri, name) 183 | else: 184 | return element.getAttributeNode(name) 185 | 186 | def get_node_attribute_value(self, element, name, ns_uri=None): 187 | if isinstance(element, xml.dom.minidom.Document): 188 | return None 189 | if ns_uri is not None: 190 | result = element.getAttributeNS(ns_uri, name) 191 | else: 192 | result = element.getAttribute(name) 193 | # Minidom returns empty string for non-existent nodes, correct this 194 | if result == '' and not name in list(element.attributes.keys()): 195 | return None 196 | return result 197 | 198 | def set_node_attribute_value(self, element, name, value, ns_uri=None): 199 | element.setAttributeNS(ns_uri, name, value) 200 | 201 | def remove_node_attribute(self, element, name, ns_uri=None): 202 | if ns_uri is not None: 203 | element.removeAttributeNS(ns_uri, name) 204 | else: 205 | element.removeAttribute(name) 206 | 207 | def add_node_child(self, parent, child, before_sibling=None): 208 | if before_sibling is not None: 209 | parent.insertBefore(child, before_sibling) 210 | else: 211 | parent.appendChild(child) 212 | 213 | def import_node(self, parent, node, original_parent=None, clone=False): 214 | if clone: 215 | node = self.clone_node(node) 216 | self.add_node_child(parent, node) 217 | 218 | def clone_node(self, node, deep=True): 219 | return node.cloneNode(deep) 220 | 221 | def remove_node_child(self, parent, child, destroy_node=True): 222 | parent.removeChild(child) 223 | if destroy_node: 224 | child.unlink() 225 | return None 226 | else: 227 | return child 228 | 229 | def lookup_ns_uri_by_attr_name(self, node, name): 230 | curr_node = node 231 | while curr_node is not None: 232 
| value = self.get_node_attribute_value(curr_node, name) 233 | if value is not None: 234 | return value 235 | curr_node = self.get_node_parent(curr_node) 236 | return None 237 | 238 | def lookup_ns_prefix_for_uri(self, node, uri): 239 | curr_node = node 240 | while curr_node: 241 | attrs = self.get_node_attributes(curr_node) 242 | for attr in attrs: 243 | if attr.value == uri: 244 | if ':' in attr.name: 245 | return attr.name.split(':')[1] 246 | else: 247 | return attr.name 248 | curr_node = self.get_node_parent(curr_node) 249 | return None 250 | -------------------------------------------------------------------------------- /xml4h/impls/xml_etree_elementtree.py: -------------------------------------------------------------------------------- 1 | import re 2 | import copy 3 | 4 | import six 5 | 6 | from xml4h.impls.interface import XmlImplAdapter 7 | from xml4h import nodes, exceptions 8 | 9 | # Import the pure-Python ElementTree implementation, if possible 10 | try: 11 | import xml.etree.ElementTree as PythonET 12 | # Re-import non-C ElementTree with a definitive name, for cases where we 13 | # must explicitly use non-C-based elements of ElementTree. 14 | import xml.etree.ElementTree as BaseET 15 | except ImportError: 16 | pass 17 | 18 | # Import the C-based ElementTree implementation, if possible 19 | try: 20 | import xml.etree.cElementTree as cET 21 | except ImportError: 22 | pass 23 | 24 | 25 | class ElementTreeAdapter(XmlImplAdapter): 26 | """ 27 | Adapter to the 28 | `ElementTree `_ 29 | XML library. 30 | 31 | This code *must* work with either the base ElementTree pure-Python 32 | implementation or the C-based cElementTree implementation, since it is 33 | reused in the `cElementTree` class defined below. 34 | """ 35 | 36 | ET = PythonET # Use the pure-Python implementation 37 | 38 | SUPPORTED_FEATURES = { 39 | 'xpath': True, 40 | } 41 | 42 | @classmethod 43 | def is_available(cls): 44 | # Is a vital piece of the ElementTree module available at all?
45 | try: 46 | cls.ET.Element 47 | except: 48 | return False 49 | # We only support ElementTree version 1.3+ 50 | from distutils.version import StrictVersion 51 | return StrictVersion(BaseET.VERSION) >= StrictVersion('1.3') 52 | 53 | @classmethod 54 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True): 55 | return cls.parse_file( 56 | six.StringIO(xml_str), 57 | ignore_whitespace_text_nodes=ignore_whitespace_text_nodes) 58 | 59 | @classmethod 60 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True): 61 | return cls.parse_file( 62 | six.BytesIO(xml_bytes), 63 | ignore_whitespace_text_nodes=ignore_whitespace_text_nodes) 64 | 65 | @classmethod 66 | def parse_file(cls, xml_file_path, ignore_whitespace_text_nodes=True): 67 | # To retain explicit xmlns namespace definition attributes, we need to 68 | # manually add these elements to the parsed DOM as we go using 69 | # iterative parsing per: 70 | # effbot.org/zone/element-namespaces.htm#preserving-existing-namespace-attributes 71 | events = ('start', 'start-ns') 72 | impl_root = None 73 | ns_list = [] 74 | for event, node in cls.ET.iterparse(xml_file_path, events): 75 | if event == 'start-ns': 76 | # Track namespaces as nodes declared 77 | ns_list.append(node) 78 | elif event == 'start': 79 | # Recognise and retain root node 80 | if impl_root is None: 81 | impl_root = node 82 | # Add xmlns attributes for each namespace declared 83 | for ns_prefix, ns_uri in ns_list: 84 | if ns_prefix: 85 | attr_name = 'xmlns:%s' % ns_prefix 86 | else: 87 | attr_name = 'xmlns' 88 | node.set(attr_name, ns_uri) 89 | # Reset namespace list now the corresponding attributes exist 90 | ns_list = [] 91 | 92 | impl_doc = cls.ET.ElementTree(impl_root) 93 | wrapped_doc = cls.wrap_document(impl_doc) 94 | if ignore_whitespace_text_nodes: 95 | cls.ignore_whitespace_text_nodes(wrapped_doc) 96 | return wrapped_doc 97 | 98 | @classmethod 99 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs): 100 | root_nsmap = 
{} 101 | if ns_uri is not None: 102 | root_nsmap[None] = ns_uri 103 | else: 104 | ns_uri = nodes.Node.XMLNS_URI 105 | root_nsmap[None] = ns_uri 106 | root_elem = cls.ET.Element('{%s}%s' % (ns_uri, root_tagname)) 107 | doc = cls.ET.ElementTree(root_elem) 108 | return doc 109 | 110 | # This method is called by interface super-class's __init__ 111 | def clear_caches(self): 112 | self.CACHED_ANCESTRY_DICT = {} 113 | 114 | def _lookup_node_parent(self, node): 115 | """ 116 | Return the parent of the given node, based on an internal dictionary 117 | mapping of child nodes to the child's parent required since 118 | ElementTree doesn't make info about node ancestry/parentage available. 119 | """ 120 | # Basic caching of our internal ancestry dict to help performance 121 | if not node in self.CACHED_ANCESTRY_DICT: 122 | # Given node isn't in cached ancestry dictionary, rebuild this now 123 | ancestry_dict = dict( 124 | (c, p) for p in self._impl_document.getiterator() for c in p) 125 | self.CACHED_ANCESTRY_DICT = ancestry_dict 126 | return self.CACHED_ANCESTRY_DICT[node] 127 | 128 | def _is_node_an_element(self, node): 129 | """ 130 | Return True if the given node is an ElementTree Element, a fact that 131 | can be tricky to determine if the cElementTree implementation is 132 | used. 
        """
        # Try the simplest approach first, works for plain old ElementTree
        if isinstance(node, BaseET.Element):
            return True
        # For cElementTree we need to be more cunning (or find a better way)
        if hasattr(node, 'makeelement') \
                and isinstance(node.tag, six.string_types):
            return True
        return False

    def map_node_to_class(self, node):
        if isinstance(node, BaseET.ElementTree):
            return nodes.Document
        elif node.tag == BaseET.ProcessingInstruction:
            return nodes.ProcessingInstruction
        elif node.tag == BaseET.Comment:
            return nodes.Comment
        elif isinstance(node, ETAttribute):
            return nodes.Attribute
        elif isinstance(node, ElementTreeText):
            if node.is_cdata:
                return nodes.CDATA
            else:
                return nodes.Text
        elif self._is_node_an_element(node):
            return nodes.Element
        raise exceptions.Xml4hImplementationBug(
            'Unrecognized type for implementation node: %s' % node)

    def get_impl_root(self, node):
        return self._impl_document.getroot()

    # Document implementation methods

    def new_impl_element(self, tagname, ns_uri=None, parent=None):
        if ns_uri is not None:
            if ':' in tagname:
                tagname = tagname.split(':')[1]
            element = self.ET.Element('{%s}%s' % (ns_uri, tagname))
            return element
        else:
            return self.ET.Element(tagname)

    def new_impl_text(self, text):
        return ElementTreeText(text)

    def new_impl_comment(self, text):
        return self.ET.Comment(text)

    def new_impl_instruction(self, target, data):
        return self.ET.ProcessingInstruction(target, data)

    def new_impl_cdata(self, text):
        return ElementTreeText(text, is_cdata=True)

    def find_node_elements(self, node, name='*', ns_uri='*'):
        # TODO Any proper way to find namespaced elements by name?
        name_match_nodes = node.getiterator()
        # Filter nodes by name and ns_uri if necessary
        results = []
        for n in name_match_nodes:
            # Ignore the current node
            if n == node:
                continue
            # Ignore non-Elements
            if not isinstance(n.tag, six.string_types):
                continue
            if ns_uri != '*' and self.get_node_namespace_uri(n) != ns_uri:
                continue
            if name != '*' and self.get_node_local_name(n) != name:
                continue
            results.append(n)
        return results
    find_node_elements.__doc__ = XmlImplAdapter.find_node_elements.__doc__

    def xpath_on_node(self, node, xpath, **kwargs):
        """
        Return result of performing the given XPath query on the given node.

        All known namespace prefix-to-URI mappings in the document are
        automatically included in the XPath invocation.

        If an empty/default namespace (i.e. None) is defined, this is
        converted to the prefix name '_' so it can be used despite empty
        namespace prefixes being unsupported by XPath.
        """
        namespaces_dict = {}
        if 'namespaces' in kwargs:
            namespaces_dict.update(kwargs['namespaces'])
        # Empty namespace prefix is not supported, convert to '_' prefix
        if None in namespaces_dict:
            default_ns_uri = namespaces_dict.pop(None)
            namespaces_dict['_'] = default_ns_uri
        # If no default namespace URI defined, use root's namespace (if any)
        if '_' not in namespaces_dict:
            root = self.get_impl_root(node)
            qname, ns_uri, prefix, local_name = self._unpack_name(
                root.tag, root)
            if ns_uri:
                namespaces_dict['_'] = ns_uri
        # Include XMLNS namespace if it's not already defined
        if 'xmlns' not in namespaces_dict:
            namespaces_dict['xmlns'] = nodes.Node.XMLNS_URI
        return node.findall(xpath, namespaces_dict)

    # Node implementation methods

    def get_node_namespace_uri(self, node):
        if '}' in node.tag:
            return node.tag.split('}')[0][1:]
        elif isinstance(node, ETAttribute):
            return node.namespace_uri
        elif self._is_node_an_element(node):
            qname, ns_uri = self._unpack_name(node.tag, node)[:2]
            return ns_uri
        else:
            return None

    def set_node_namespace_uri(self, node, ns_uri):
        qname, orig_ns_uri, prefix, local_name = self._unpack_name(
            node.tag, node)
        node.tag = '{%s}%s' % (ns_uri, local_name)

    def get_node_parent(self, node):
        parent = None
        # Root document has no parent
        if isinstance(node, BaseET.ElementTree):
            pass
        elif hasattr(node, 'getparent'):
            parent = node.getparent()
        # Return ElementTree as root element's parent
        elif node == self.get_impl_root(node):
            parent = self._impl_document
        else:
            parent = self._lookup_node_parent(node)
        return parent

    def get_node_children(self, node):
        if isinstance(node, BaseET.ElementTree):
            children = [node.getroot()]
        else:
            if not hasattr(node, 'getchildren'):
                return []
            children = list(node.getchildren())
            # Hack to treat text attribute as child text nodes
            if node.text is not None:
                children.insert(0, ElementTreeText(node.text, parent=node))
        return children

    def get_node_name(self, node):
        if node.tag == BaseET.Comment:
            return '#comment'
        elif node.tag == BaseET.ProcessingInstruction:
            name, target = node.text.split(' ')
            return name
        prefix = self.get_node_name_prefix(node)
        if prefix is not None:
            return '%s:%s' % (prefix, self.get_node_local_name(node))
        else:
            return self.get_node_local_name(node)

    def get_node_local_name(self, node):
        return re.sub('{.*}', '', node.tag)

    def get_node_name_prefix(self, node):
        # Ignore non-elements
        if not isinstance(node.tag, six.string_types):
            return None
        # Believe nodes that have their own prefix (likely only ETAttribute)
        prefix = getattr(node, 'prefix', None)
        if prefix:
            return prefix
        # Derive prefix by unpacking node name
        qname, ns_uri, prefix, local_name = self._unpack_name(node.tag, node)
        if prefix:
            # Don't add unnecessary excess namespace prefixes for elements
            # with a local default namespace declaration
            if node.attrib.get('xmlns') == ns_uri:
                return None
            # Don't add unnecessary excess namespace prefixes for default ns
            elif prefix == 'xmlns':
                return None
            else:
                return prefix
        else:
            return None

    def get_node_value(self, node):
        if node.tag == BaseET.ProcessingInstruction:
            name, target = node.text.split(' ')
            return target
        elif node.tag == BaseET.Comment:
            return node.text
        elif hasattr(node, 'value'):
            return node.value
        else:
            return node.text

    def set_node_value(self, node, value):
        if hasattr(node, 'value'):
            node.value = value
        else:
            self.set_node_text(node, value)

    def get_node_text(self, node):
        return node.text

    def set_node_text(self, node, text):
        node.text = text

    def get_node_attributes(self, element, ns_uri=None):
        # TODO: Filter by ns_uri
        attribs_by_qname = {}
        for n, v in list(element.attrib.items()):
            qname, ns_uri, prefix, local_name = self._unpack_name(n, element)
            attribs_by_qname[qname] = ETAttribute(
                qname, ns_uri, prefix, local_name, v, element)
        return list(attribs_by_qname.values())

    def has_node_attribute(self, element, name, ns_uri=None):
        return name in [a.qname for a
                        in self.get_node_attributes(element, ns_uri)]

    def get_node_attribute_node(self, element, name, ns_uri=None):
        for attr in self.get_node_attributes(element, ns_uri):
            if attr.qname == name:
                return attr
        return None

    def get_node_attribute_value(self, element, name, ns_uri=None):
        if ns_uri is not None:
            prefix = self.lookup_ns_prefix_for_uri(element, ns_uri)
            name = '%s:%s' % (prefix, name)
        for attr in self.get_node_attributes(element, ns_uri):
            if attr.qname == name:
                return attr.value
        return None

    def set_node_attribute_value(self, element, name, value, ns_uri=None):
        prefix = None
        if ':' in name:
            prefix, name = name.split(':')
        if ns_uri is None and prefix is not None:
            ns_uri = self.lookup_ns_uri_by_attr_name(element, prefix)
        if ns_uri is not None:
            name = '{%s}%s' % (ns_uri, name)
            if name.startswith('{%s}' % nodes.Node.XMLNS_URI):
                if name.split('}')[1] == 'xmlns':
                    # Hack to remove namespace URI from 'xmlns' attributes so
                    # the name is just a simple string
                    name = 'xmlns'
            element.attrib[name] = value
        else:
            element.attrib[name] = value

    def remove_node_attribute(self, element, name, ns_uri=None):
        if ns_uri is not None:
            name = '{%s}%s' % (ns_uri, name)
        elif ':' in name:
            prefix, local_name = name.split(':')
            if prefix != 'xmlns':
                ns_attr_name = 'xmlns:%s' % prefix
                ns_uri = self.lookup_ns_uri_by_attr_name(element, ns_attr_name)
                name = '{%s}%s' % (ns_uri, local_name)
        if name in element.attrib:
            del element.attrib[name]

    def add_node_child(self, parent, child, before_sibling=None):
        if isinstance(child, ElementTreeText):
            # Add text values directly to parent's 'text' attribute
            if parent.text is not None:
                parent.text = parent.text + child.text
            else:
                parent.text = child.text
            self.CACHED_ANCESTRY_DICT[child] = parent
            return None
        else:
            if before_sibling is not None:
                offset = 0
                for c in parent.getchildren():
                    if c == before_sibling:
                        break
                    offset += 1
                parent.insert(offset, child)
            else:
                parent.append(child)
            self.CACHED_ANCESTRY_DICT[child] = parent
            return child

    def import_node(self, parent, node, original_parent=None, clone=False):
        original_node = node
        # We always clone for (c)ElementTree adapter so we can remove original
        # if necessary
        node = self.clone_node(node)
        self.add_node_child(parent, node)
        # Hack to remove text node content from original parent by manually
        # deleting matching text content
        if not clone:
            if isinstance(original_node, ElementTreeText):
                original_parent = self.get_node_parent(original_node)
                if original_parent.text == original_node.text:
                    # Must set to None if there would be no remaining text,
                    # otherwise parent element won't realise it's empty
                    original_parent.text = None
                else:
                    original_parent.text = \
                        original_parent.text.replace(original_node.text, '', 1)
            else:
                original_parent.remove(original_node)

    def clone_node(self, node, deep=True):
        if deep:
            return copy.deepcopy(node)
        else:
            return copy.copy(node)

    def remove_node_child(self, parent, child, destroy_node=True):
        if isinstance(child, ElementTreeText):
            child._parent.text = None
            return
        parent.remove(child)
        if destroy_node:
            child.clear()
            return None
        else:
            return child

    def lookup_ns_uri_by_attr_name(self, node, name):
        curr_node = node
        while (curr_node is not None
                and not isinstance(curr_node, BaseET.ElementTree)):
            uri = self.get_node_attribute_value(curr_node, name)
            if uri is not None:
                return uri
            curr_node = self.get_node_parent(curr_node)
        return None

    def lookup_ns_prefix_for_uri(self, node, uri):
        if uri == nodes.Node.XMLNS_URI:
            return 'xmlns'
        result = None
        # Lookup namespace URI in ET's awful global namespace/prefix registry
        if hasattr(BaseET, '_namespace_map') and uri in BaseET._namespace_map:
            result = BaseET._namespace_map[uri]
            if result == '':
                result = None
        # Raw string avoids an invalid '\d' escape sequence in the pattern
        if result is None or re.match(r'ns\d', result):
            # We either have no namespace prefix in the global mapping, in
            # which case we will try looking for a matching xmlns attribute,
            # or we have a namespace prefix that was probably assigned
            # automatically by ElementTree and we'd rather use a
            # human-assigned prefix if available.
            curr_node = node
            while self._is_node_an_element(curr_node):
                for n, v in list(curr_node.attrib.items()):
                    if v == uri:
                        if n.startswith('xmlns:'):
                            result = n.split(':')[1]
                            return result
                        elif n.startswith('{%s}' % nodes.Node.XMLNS_URI):
                            result = n.split('}')[1]
                            return result
                curr_node = self.get_node_parent(curr_node)
        return result

    def _unpack_name(self, name, node):
        qname = prefix = local_name = ns_uri = None
        if name == 'xmlns':
            # Namespace URI of 'xmlns' is a constant
            ns_uri = nodes.Node.XMLNS_URI
        elif '}' in name:
            # Namespace URI is contained in {}, find URI's defined prefix
            ns_uri, local_name = name.split('}')
            ns_uri = ns_uri[1:]
            prefix = self.lookup_ns_prefix_for_uri(node, ns_uri)
        elif ':' in name:
            # Namespace prefix is before ':', find prefix's defined URI
            prefix, local_name = name.split(':')
            if prefix == 'xmlns':
                # All 'xmlns' attributes are in XMLNS URI by definition
                ns_uri = nodes.Node.XMLNS_URI
            else:
                ns_uri = self.lookup_ns_uri_by_attr_name(node, prefix)
        # Catch case where a prefix other than 'xmlns' points at XMLNS URI
        if name != 'xmlns' and ns_uri == nodes.Node.XMLNS_URI:
            prefix = 'xmlns'
        # Construct fully-qualified name from prefix + local names
        if prefix is not None:
            qname = '%s:%s' % (prefix, local_name)
        else:
            qname = local_name = name
        return (qname, ns_uri, prefix, local_name)


class ElementTreeText(object):

    def __init__(self, text, parent=None, is_cdata=False):
        self._text = text
        self._parent = parent
        self._is_cdata = is_cdata

    @property
    def is_cdata(self):
        return self._is_cdata

    @property
    def value(self):
        return self._text

    text = value  # Alias

    def getparent(self):
        return self._parent

    @property
    def prefix(self):
        return None

    @property
    def tag(self):
        if self.is_cdata:
            return "#cdata-section"
        else:
            return "#text"


class ETAttribute(object):

    def __init__(self, qname, ns_uri, prefix, local_name, value, element):
        self._qname, self._ns_uri, self._prefix, self._local_name = (
            qname, ns_uri, prefix, local_name)
        self._value, self._element = (value, element)

    def getroottree(self):
        return self._element.getroottree()

    @property
    def qname(self):
        return self._qname

    @property
    def namespace_uri(self):
        return self._ns_uri

    @property
    def prefix(self):
        return self._prefix

    @property
    def local_name(self):
        return self._local_name

    @property
    def value(self):
        return self._value

    name = tag = local_name  # Alias


class cElementTreeAdapter(ElementTreeAdapter):
    """
    Adapter to the C-based implementation of the
    `ElementTree `_
    XML library.
    """

    ET = cET  # Use the C-based implementation

    @classmethod
    def is_available(cls):
        if not super(cElementTreeAdapter, cls).is_available():
            return False
        # We only support cElementTree version 1.0.6+
        from distutils.version import StrictVersion
        return StrictVersion(cls.ET.VERSION) >= StrictVersion('1.0.6')

--------------------------------------------------------------------------------
/xml4h/writer.py:
--------------------------------------------------------------------------------
"""
Writer to serialize XML DOM documents or sections to text.
"""
# This implementation is adapted (heavily) from the standard library method
# xml.dom.minidom.writexml
import six

import codecs

from xml4h import exceptions


def write_node(node, writer, encoding='utf-8', indent=0, newline='',
        omit_declaration=False, node_depth=0, quote_char='"'):
    """
    Serialize an *xml4h* DOM node and its descendants to text, writing
    the output to the given *writer*.

    :param node: the DOM node whose content and descendants will
        be serialized.
    :type node: an :class:`xml4h.nodes.Node` or subclass
    :param writer: a file or stream to which XML text is written.
    :type writer: a file, stream, etc
    :param string encoding: the character encoding for serialized text.
    :param indent: indentation prefix to apply to descendant nodes for
        pretty-printing. The value can take many forms:

        - *int*: the number of spaces to indent. 0 means no indent.
        - *string*: a literal prefix for indented nodes, such as ``\\t``.
        - *bool*: no indent if *False*, four spaces indent if *True*.
        - *None*: no indent.
    :type indent: string, int, bool, or None
    :param newline: the string value used to separate lines of output.
        The value can take a number of forms:

        - *string*: the literal newline value, such as ``\\n`` or ``\\r``.
          An empty string means no newline.
        - *bool*: no newline if *False*, ``\\n`` newline if *True*.
        - *None*: no newline.
    :type newline: string, bool, or None
    :param boolean omit_declaration: if *True* the XML declaration header
        is omitted, otherwise it is included. Note that the declaration is
        only output when serializing an :class:`xml4h.nodes.Document` node.
    :param int node_depth: the indentation level to start at, such as 2 to
        indent output as if the given *node* has two ancestors.
        This parameter will only be useful if you need to output XML text
        fragments that can be assembled into a document. This parameter
        has no effect unless indentation is applied.
    :param string quote_char: the character that delimits quoted content.
        You should never need to mess with this.
    """
    def _sanitize_write_value(value):
        """Return XML-encoded value."""
        if not value:
            return value
        return (value
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace("\"", "&quot;")
            .replace(">", "&gt;")
        )

    def _write_node_impl(node, node_depth):
        """
        Internal write implementation that does the real work while keeping
        track of node depth.
        """
        # Output document declaration if we're outputting the whole doc
        if node.is_document:
            if not omit_declaration:
                writer.write(
                    '%s' % newline)
            for child in node.children:
                _write_node_impl(child,
                    node_depth)  # node_depth not incremented
            writer.write(newline)
        elif node.is_document_type:
            writer.write("")
        elif node.is_text:
            writer.write(
                _sanitize_write_value(node.value)
            )
        elif node.is_cdata:
            if ']]>' in node.value:
                raise ValueError("']]>' is not allowed in CDATA node value")
            writer.write(
                "<![CDATA[%s]]>" % node.value
            )
        #elif node.is_entity_reference:  # TODO
        elif node.is_entity:
            writer.write(newline + indent * node_depth)
            writer.write(""
                % (node.name, quote_char, node.value, quote_char)
            )
        elif node.is_processing_instruction:
            writer.write(newline + indent * node_depth)
            writer.write("<?%s %s?>" % (node.target, node.data))
        elif node.is_comment:
            if '--' in node.value:
                raise ValueError("'--' is not allowed in COMMENT node value")
            writer.write("<!--%s-->" % node.value)
        elif node.is_notation:
            writer.write(newline + indent * node_depth)
            writer.write(""
                    % (quote_char, node.external_id,
                    quote_char))
            elif node.is_system_identifier:
                writer.write(" system %s%s%s %s%s%s>"
                    % (quote_char, node.external_id, quote_char,
                       quote_char, node.uri, quote_char))
        elif node.is_attribute:
            writer.write(
                " %s=%s" % (node.name, quote_char)
            )
            writer.write(
                _sanitize_write_value(node.value)
            )
            writer.write(quote_char)
        elif node.is_element:
            # Only need a preceding newline if we're in a sub-element
            if node_depth > 0:
                writer.write(newline)
            writer.write(indent * node_depth)
            writer.write("<" + node.name)

            for attr in node.attribute_nodes:
                _write_node_impl(attr, node_depth)
            if node.children:
                found_indented_child = False
                writer.write(">")
                for child in node.children:
                    _write_node_impl(child, node_depth + 1)
                    if not (child.is_text
                            or child.is_comment
                            or child.is_cdata):
                        found_indented_child = True
                if found_indented_child:
                    writer.write(newline + indent * node_depth)
                writer.write('</%s>' % node.name)
            else:
                writer.write('/>')
        else:
            raise exceptions.Xml4hImplementationBug(
                'Cannot write node with class: %s' % node.__class__)

    # Sanitize whitespace parameters
    if indent is True:
        indent = ' ' * 4
    elif indent is None or indent is False:
        # None is documented as "no indent", so treat it like False
        indent = ''
    elif isinstance(indent, int):
        indent = ' ' * indent
    # If indent but no newline set, always apply a newline (it makes sense)
    if indent and not newline:
        newline = True

    if newline is None or newline is False:
        newline = ''
    elif newline is True:
        newline = '\n'

    # If we have a target encoding and are writing to a binary IO stream, wrap
    # the writer with an encoding writer to produce the correct bytes.
    # We detect binary IO streams by:
    # - Python 3: the *absence* of the `encoding` attribute that is present on
    #   `io.TextIOBase`-derived objects
    # - Python 2: the *absence* of the `encode` attribute that is present on
    #   `StringIO` objects
    if (
        encoding
        and not hasattr(writer, 'encoding')
        and not hasattr(writer, 'encode')
    ):
        writer = codecs.getwriter(encoding)(writer)

    # Do the business...
    _write_node_impl(node, node_depth)

--------------------------------------------------------------------------------
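The two supporting steps in `write_node` (escaping values and wrapping binary streams) can be tried in isolation. The following is a minimal standalone sketch using only the standard library, not xml4h code: the names `sanitize_write_value` and `wrap_writer` are hypothetical stand-ins for the nested helper and the inline stream check, and the entity replacements are assumed to be the standard XML ones.

```python
import codecs
import io


def sanitize_write_value(value):
    """Return value with XML-special characters replaced by entities.

    Hypothetical standalone stand-in for the nested helper in write_node.
    """
    if not value:
        return value
    # '&' is replaced first so the entities added afterwards are not
    # escaped a second time.
    return (value
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace('"', "&quot;")
            .replace(">", "&gt;"))


def wrap_writer(writer, encoding="utf-8"):
    """Wrap binary streams in an encoding writer; pass text streams through.

    Hypothetical helper mirroring the attribute-based check in write_node:
    a stream with neither an `encoding` nor an `encode` attribute is
    assumed to be binary.
    """
    if (encoding
            and not hasattr(writer, "encoding")
            and not hasattr(writer, "encode")):
        return codecs.getwriter(encoding)(writer)
    return writer


# A binary stream gets wrapped, so text written to it arrives as bytes.
binary_out = io.BytesIO()
wrap_writer(binary_out).write(sanitize_write_value('a < b & "c"'))

# A text stream already has an `encoding` attribute and is used directly.
text_out = io.StringIO()
wrap_writer(text_out).write(sanitize_write_value("caf\u00e9 > tea"))
```

Checking attributes instead of concrete types keeps the detection working for file-like objects that do not inherit from the `io` base classes, at the cost of misclassifying any binary stream that happens to define an `encoding` attribute.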