├── .coveragerc
├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.rst
├── docs
│   ├── advanced.rst
│   ├── api.rst
│   ├── builder.rst
│   ├── conf.py
│   ├── index.rst
│   ├── nodes.rst
│   ├── parser.rst
│   └── writer.rst
├── requirements-dev.txt
├── setup.py
├── tests
│   ├── __init__.py
│   ├── data
│   │   ├── example_doc.small.xml
│   │   ├── example_doc.unicode.xml
│   │   ├── monty_python_films.ns.xml
│   │   └── monty_python_films.xml
│   ├── test_builder.py
│   ├── test_nodes.py
│   ├── test_parser.py
│   └── test_writer.py
├── tox.ini
└── xml4h
    ├── __init__.py
    ├── builder.py
    ├── exceptions.py
    ├── impls
    │   ├── __init__.py
    │   ├── interface.py
    │   ├── lxml_etree.py
    │   ├── xml_dom_minidom.py
    │   └── xml_etree_elementtree.py
    ├── nodes.py
    └── writer.py
/.coveragerc:
--------------------------------------------------------------------------------
1 | [report]
2 | show_missing = 1
3 | exclude_lines =
4 |     pragma: no cover
5 |     raise NotImplementedError
6 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Build artifacts
2 | dist
3 | build
4 | xml4h.egg-info
5 |
6 | # Sphinx documentation
7 | docs/_*
8 | docs/.*
9 |
10 | # Nosetests coverage report
11 | .coverage
12 |
13 | # Tox virtualenvs
14 | .tox/
15 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 | - "2.7"
4 | - "3.5"
5 | - "3.6"
6 | - "3.7"
7 | - "3.8"
8 | install: pip install tox-travis
9 | script: tox
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2013 James Murty.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.rst LICENSE requirements-dev.txt
2 | recursive-include tests *.py *.xml
3 | recursive-include docs *.rst
4 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | ===============================
2 | xml4h: XML for Humans in Python
3 | ===============================
4 |
5 | *xml4h* is an MIT-licensed library for Python that makes it easier to work with XML.
6 |
7 | This library exists because Python is awesome, XML is everywhere, and combining
8 | the two should be a pleasure but often is not. With *xml4h*, it can be easy.
9 |
10 | As of version 1.0 *xml4h* supports Python versions 2.7 and 3.5+.
11 |
12 |
13 | Features
14 | --------
15 |
16 | *xml4h* is a simplification layer over existing Python XML processing libraries
17 | such as *lxml*, *ElementTree* and the *minidom*. It provides:
18 |
19 | - a rich pythonic API to traverse and manipulate the XML DOM.
20 | - a document builder to simply and safely construct complex documents with
21 | minimal code.
22 | - a writer that serialises XML documents with the structure and format that you
23 | expect, unlike the machine- but not human-friendly output you tend to get
24 | from other libraries.
25 |
26 | The *xml4h* abstraction layer also offers some other benefits, beyond a nice
27 | API and tool set:
28 |
29 | - A common interface to different underlying XML libraries, so code written
30 | against *xml4h* need not be rewritten if you switch implementations.
31 | - You can easily move between *xml4h* and the underlying implementation: parse
32 | your document using the fastest implementation, manipulate the DOM with
33 | human-friendly code using *xml4h*, then get back to the underlying
34 | implementation if you need to.
35 |
36 |
37 | Installation
38 | ------------
39 |
40 | Install *xml4h* with pip::
41 |
42 | $ pip install xml4h
43 |
44 | Or install the tarball manually with::
45 |
46 | $ python setup.py install
47 |
48 |
49 | Links
50 | -----
51 |
52 | - GitHub for source code and issues: https://github.com/jmurty/xml4h
53 | - ReadTheDocs for documentation: https://xml4h.readthedocs.org
54 | - Install from the Python Package Index: https://pypi.python.org/pypi/xml4h
55 |
56 |
57 | Introduction
58 | ------------
59 |
60 | With *xml4h* you can easily parse XML files and access their data.
61 |
62 | Let's start with an example XML document::
63 |
64 | $ cat tests/data/monty_python_films.xml
65 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
66 |     <Film year="1971">
67 |         <Title>And Now for Something Completely Different</Title>
68 |         <Description>
69 |             A collection of sketches from the first and second TV series of
70 |             Monty Python's Flying Circus purposely re-enacted and shot for film.
71 |         </Description>
72 |     </Film>
73 |     <Film year="1974">
74 |         <Title>Monty Python and the Holy Grail</Title>
75 |         <Description>
76 |             King Arthur and his knights embark on a low-budget search for
77 |             the Holy Grail, encountering humorous obstacles along the way.
78 |             Some of these turned into standalone sketches.
79 |         </Description>
80 |     </Film>
81 |     <Film year="1979">
82 |         <Title>Monty Python's Life of Brian</Title>
83 |         <Description>
84 |             Brian is born on the first Christmas, in the stable next to
85 |             Jesus'. He spends his life being mistaken for a messiah.
86 |         </Description>
87 |     </Film>
88 |     <... more Film elements here ...>
89 | </MontyPythonFilms>
90 |
91 | With *xml4h* you can parse the XML file and use "magical" element and attribute
92 | lookups to read data::
93 |
94 | >>> import xml4h
95 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
96 |
97 | >>> for film in doc.MontyPythonFilms.Film[:3]:
98 | ...     print(film['year'] + ' : ' + film.Title.text)
99 | 1971 : And Now for Something Completely Different
100 | 1974 : Monty Python and the Holy Grail
101 | 1979 : Monty Python's Life of Brian
102 |
103 | You can also use more explicit (non-magical) methods to traverse the DOM::
104 |
105 | >>> for film in doc.child('MontyPythonFilms').children('Film')[:3]:
106 | ...     print(film.attributes['year'] + ' : ' + film.children.first.text)
107 | 1971 : And Now for Something Completely Different
108 | 1974 : Monty Python and the Holy Grail
109 | 1979 : Monty Python's Life of Brian
110 |
111 | The *xml4h* builder makes programmatic document creation simple, with a
112 | method-chaining feature that allows for expressive but sparse code that mirrors
113 | the document itself. Here is the code to build part of the above XML document::
114 |
115 | >>> b = (xml4h.build('MontyPythonFilms')
116 | ... .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'})
117 | ... .element('Film')
118 | ... .attributes({'year': 1971})
119 | ... .element('Title')
120 | ... .text('And Now for Something Completely Different')
121 | ... .up()
122 | ... .elem('Description').t(
123 | ... "A collection of sketches from the first and second TV"
124 | ... " series of Monty Python's Flying Circus purposely"
125 | ... " re-enacted and shot for film."
126 | ... ).up()
127 | ... .up()
128 | ... )
129 |
130 | >>> # A builder object can be re-used, and has short method aliases
131 | >>> b = (b.e('Film')
132 | ... .attrs(year=1974)
133 | ... .e('Title').t('Monty Python and the Holy Grail').up()
134 | ... .e('Description').t(
135 | ... "King Arthur and his knights embark on a low-budget search"
136 | ... " for the Holy Grail, encountering humorous obstacles along"
137 | ... " the way. Some of these turned into standalone sketches."
138 | ... ).up()
139 | ... .up()
140 | ... )
141 |
142 | Pretty-print your XML document with *xml4h*'s writer implementation, which
143 | has methods to write content to a stream or to get the content as text, with
144 | flexible formatting options::
145 |
146 | >>> print(b.xml_doc(indent=4, newline=True)) # doctest: +ELLIPSIS
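147 | <?xml version="1.0" encoding="utf-8"?>
148 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
149 |     <Film year="1971">
150 |         <Title>And Now for Something Completely Different</Title>
151 |         <Description>A collection of sketches from ...</Description>
152 |     </Film>
153 |     <Film year="1974">
154 |         <Title>Monty Python and the Holy Grail</Title>
155 |         <Description>King Arthur and his knights embark ...</Description>
156 |     </Film>
157 | </MontyPythonFilms>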
158 |
159 |
160 |
161 | Why use *xml4h*?
162 | ----------------
163 |
164 | Python has three popular libraries for working with XML, none of which are
165 | particularly easy to use:
166 |
167 | - `xml.dom.minidom `_
168 | is a light-weight, moderately-featured implementation of the W3C DOM
169 | that is included in the standard library. Unfortunately the W3C DOM API is
170 | verbose, clumsy, and not very pythonic, and the *minidom* does not support
171 | XPath expressions.
172 | - `xml.etree.ElementTree `_
173 | is a fast hierarchical data container that is included in the standard
174 | library and can be used to represent XML, mostly. The API is fairly pythonic
175 | and supports some basic XPath features, but it lacks some DOM traversal
176 | niceties you might expect (e.g. to get an element's parent) and when using it
177 | you often feel like you're working with something subtly different from XML,
178 | because you are.
179 | - `lxml `_ is a fast, full-featured XML library with an API
180 | based on ElementTree but extended. It is your best choice for doing serious
181 | work with XML in Python but it is not included in the standard library, it
182 | can be difficult to install, and it gives you the same it's-XML-but-not-quite
183 | feeling as its ElementTree forebear.
184 |
185 | Given these three options it can be difficult to choose which library to use,
186 | especially if you're new to XML processing in Python and haven't already
187 | used (struggled with) any of them.
188 |
189 | In the past your best bet would have been to go with *lxml* for the most
190 | flexibility, even though it might be overkill, because at least then you
191 | wouldn't have to rewrite your code if you later find you need XPath support or
192 | powerful DOM traversal methods.
193 |
194 | This is where *xml4h* comes in. It provides an abstraction layer over
195 | the existing XML libraries, taking advantage of their power while offering an
196 | improved API and tool set.
197 |
198 |
199 | Development Status: beta
200 | ------------------------
201 |
202 | Currently *xml4h* includes adapter implementations for three of the main XML
203 | processing Python libraries.
204 |
205 | If you have *lxml* available (highly recommended) it will use that, otherwise
206 | it will fall back to use the *(c)ElementTree* then the *minidom* libraries.
207 |
208 |
209 |
210 | History
211 | -------
212 |
213 | 1.0
214 | ...
215 |
216 | - Add support for Python 3 (3.5+).
217 | - Drop support for Python versions before 2.7.
218 | - Fix node namespace prefix values for lxml adapter.
219 | - Improve builder's ``up()`` method to accept and distinguish between a count
220 | of parents to step up, or the name of a target ancestor node.
221 | - Add ``xml()`` and ``xml_doc()`` methods to document builder to more easily
222 | get string content from it, without resorting to the write methods.
223 | - The ``write()`` and ``write_doc()`` methods no longer send output to
224 | ``sys.stdout`` by default. The user must explicitly provide a target writer
225 | object, and hopefully be more mindful of the need to set up encoding correctly
226 | when providing a text stream object.
227 | - Handling of redundant Element namespace prefixes is now more consistent: we
228 | always strip the prefix when the element has an `xmlns` attribute defining
229 | the same namespace URI.
230 |
231 | 0.2.0
232 | .....
233 |
234 | - Add adapter for the *(c)ElementTree* library versions included as standard
235 | with Python 2.7+.
236 | - Improved "magical" node traversal to work with lowercase tag names without
237 | always needing a trailing underscore. See also improved docs.
238 | - Fixes for: potential errors ASCII-encoding nodes as strings; default XPath
239 | namespace from document node; lookup precedence of xmlns attributes.
240 |
241 |
242 | 0.1.0
243 | .....
244 |
245 | - Initial alpha release with support for *lxml* and *minidom* libraries.
246 |
--------------------------------------------------------------------------------
/docs/advanced.rst:
--------------------------------------------------------------------------------
1 | ========
2 | Advanced
3 | ========
4 |
5 |
6 | .. _xml4h-namespaces:
7 |
8 | Namespaces
9 | ==========
10 |
11 | *xml4h* supports using XML namespaces in a number of ways, and tries to make
12 | this sometimes complex and fiddly aspect of XML a little easier to deal with.
13 |
14 | Namespace URIs
15 | --------------
16 |
17 | XML document nodes can be associated with a *namespace URI* which uniquely
18 | identifies the namespace. At bottom a URI is really just a name to identify
19 | the namespace, which may or may not point at an actual resource.
20 |
21 | Namespace URIs are the core piece of the namespacing puzzle; everything else
22 | is an extra.
23 |
24 | Namespace URI values are assigned to a node in one of three ways:
25 |
26 | - an ``xmlns`` attribute on an element assigns a *namespace URI* to that
27 | element, and may also define a shorthand *prefix* for the namespace::
28 |
29 |
30 |
31 | .. note::
32 | Technically the ``xmlns`` attribute must itself also be in the special XML
33 | namespacing namespace http://www.w3.org/2000/xmlns/. You needn't care
34 | about this.
35 |
36 | - a tag or attribute name includes a *prefix* alias portion that specifies the
37 | namespace the item belongs to::
38 |
39 |
40 |
41 | A prefix alias can be defined using an "xmlns" attribute as described above,
42 | or by using the Builder :meth:`~xml4h.Builder.ns_prefix` or Node
43 | :meth:`~xml4h.nodes.Node.set_ns_prefix` methods.
44 |
45 | - in an apparent effort to reduce confusion around namespace URIs and prefixes,
46 | some XML libraries avoid prefix aliases altogether and instead require you to
47 | specify the full *namespace URI* as a prefix to tag and attribute names
48 | using a special syntax with braces::
49 |
50 | >>> tagname = '{urn:example-uri}YetAnotherWayToNamespace'
51 |
52 | .. note::
53 | In the author's opinion, using a non-standard way to define namespaces
54 | does not reduce confusion. *xml4h* supports this approach technically but
55 | not philosophically.
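
To make the brace syntax concrete, here is a minimal sketch that uses only the standard library's ElementTree, not *xml4h* itself. The document, the URIs and the ``split_qualified`` helper are invented for illustration. It shows that ElementTree reports both a default-namespaced element and a prefixed element with the full namespace URI in braces.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one default namespace and one prefixed namespace.
XML = (
    '<Doc xmlns="urn:example-uri" xmlns:my="urn:second-uri">'
    '<Elem1/><my:Elem2/>'
    '</Doc>'
)

root = ET.fromstring(XML)


def split_qualified(tag):
    """Split an ElementTree '{uri}local' tag into (uri, local)."""
    if tag.startswith('{'):
        uri, local = tag[1:].split('}', 1)
        return uri, local
    return None, tag


names = [split_qualified(child.tag) for child in root]
print(root.tag)  # {urn:example-uri}Doc
print(names)     # [('urn:example-uri', 'Elem1'), ('urn:second-uri', 'Elem2')]
```

Note that the prefix alias ``my`` does not survive parsing here; only the URI does, which is why this notation can feel removed from the XML as written.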
56 |
57 | *xml4h* allows you to assign namespace URIs to document nodes when using the
58 | Builder::
59 |
60 | >>> # Assign a default namespace with ns_uri
61 | >>> import xml4h
62 | >>> b = xml4h.build('Doc', ns_uri='ns-uri')
63 | >>> root = b.root
64 |
65 | >>> # Descendants without a namespace inherit their ancestor's default one
66 | >>> elem1 = b.elem('Elem1').dom_element
67 | >>> elem1.namespace_uri
68 | 'ns-uri'
69 |
70 | >>> # Define a prefix alias to assign a new or existing namespace URI
71 | >>> elem2 = b.ns_prefix('my-ns', 'second-ns-uri') \
72 | ... .elem('my-ns:Elem2').dom_element
73 | >>> print(root.xml())
74 |
75 |
76 |
77 |
78 |
79 | >>> # Or use the explicit URI prefix approach, if you must
80 | >>> elem3 = b.elem('{third-ns-uri}Elem3').dom_element
81 | >>> elem3.namespace_uri
82 | 'third-ns-uri'
83 |
84 | And when adding nodes with the API::
85 |
86 | >>> # Define the ns_uri argument when creating a new element
87 | >>> elem4 = root.add_element('Elem4', ns_uri='fourth-ns-uri')
88 |
89 | >>> # Attributes can be namespaced too
90 | >>> elem4.set_attributes({'my-ns:attr1': 'value'})
91 |
92 | >>> print(elem4.xml())
93 |
94 |
95 |
96 | Filtering by Namespace
97 | ----------------------
98 |
99 | *xml4h* allows you to find and filter nodes based on their namespace.
100 |
101 | The :meth:`~xml4h.nodes.Node.find` method takes a ``ns_uri`` keyword argument to
102 | return only elements in that namespace::
103 |
104 | >>> # By default, find ignores namespaces...
105 | >>> [n.local_name for n in root.find()]
106 | ['Elem1', 'Elem2', 'Elem3', 'Elem4']
107 | >>> # ...but will filter by namespace URI if you wish
108 | >>> [n.local_name for n in root.find(ns_uri='fourth-ns-uri')]
109 | ['Elem4']
110 |
111 | Similarly, a node's children listing can be filtered::
112 |
113 | >>> len(root.children)
114 | 4
115 | >>> root.children(ns_uri='ns-uri')
116 | []
117 |
118 | XPath queries can also filter by namespace, but the
119 | :meth:`~xml4h.nodes.Node.xpath` method needs to be given a dictionary mapping
120 | of prefix aliases to URIs::
121 |
122 | >>> root.xpath('//ns4:*', namespaces={'ns4': 'fourth-ns-uri'})
123 | []
124 |
125 | .. note::
126 | Normally, because XPath queries rely on namespace prefix aliases, they
127 | cannot find namespaced nodes in the default namespace which has an "empty"
128 | prefix name. *xml4h* works around this limitation by providing the special
129 | empty/default prefix alias '_'.
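
The default-namespace limitation can be seen with the standard library alone. The sketch below uses ElementTree's limited XPath support rather than *xml4h*, and the document and the ``d`` and ``ns4`` aliases are assumptions made for the example. An unprefixed query finds nothing in the default namespace, while a query using an alias chosen by the caller succeeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical document loosely modelled on the namespaces used above.
XML = (
    '<Doc xmlns="ns-uri" xmlns:my-ns="second-ns-uri">'
    '<Elem1/><my-ns:Elem2/><Elem4 xmlns="fourth-ns-uri"/>'
    '</Doc>'
)
root = ET.fromstring(XML)

# Aliases are local to the query; they need not match the document's prefixes.
ns = {'d': 'ns-uri', 'ns4': 'fourth-ns-uri'}

unprefixed = [e.tag for e in root.findall('Elem1')]          # no namespace given
default = [e.tag for e in root.findall('d:Elem1', ns)]       # alias for default ns
fourth = [e.tag for e in root.findall('.//ns4:Elem4', ns)]

print(unprefixed)  # []
print(default)     # ['{ns-uri}Elem1']
print(fourth)      # ['{fourth-ns-uri}Elem4']
```

The ``d`` alias plays a similar role to the special ``_`` alias described in the note: it gives the otherwise unnamed default namespace a name the query can use.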
130 |
131 |
132 | Element Names: Local and Prefix Components
133 | ------------------------------------------
134 |
135 | When you use a namespace prefix alias to define the namespace an element or
136 | attribute belongs to, the name of that node will be made up of two components:
137 |
138 | - *prefix* - the namespace alias.
139 | - *local* - the real name of the node, without the namespace alias.
140 |
141 | *xml4h* makes the full (qualified) name, and the two components, available as
142 | node attributes::
143 |
144 | >>> # Elem2's namespace was defined earlier using a prefix alias
145 | >>> elem2
146 |
147 |
148 | >>> # The full node name...
149 | >>> elem2.name
150 | 'my-ns:Elem2'
151 | >>> # ...comprises a prefix...
152 | >>> elem2.prefix
153 | 'my-ns'
154 | >>> # ...and a local name component
155 | >>> elem2.local_name
156 | 'Elem2'
157 |
158 | >>> # Here is an element without a prefix alias
159 | >>> elem1.name
160 | 'Elem1'
161 | >>> elem1.prefix is None
162 | True
163 | >>> elem1.local_name
164 | 'Elem1'
165 |
166 |
167 | .. _xml-lib-architecture:
168 |
169 | *xml4h* Architecture
170 | ====================
171 |
172 | To best understand the *xml4h* library and to use it appropriately in demanding
173 | situations, you should appreciate what the library is not.
174 |
175 | *xml4h* is not a full-fledged XML library in its own right, far from it.
176 | Instead of implementing low-level document parsing and manipulation tools, it
177 | operates as an abstraction layer on top of the pre-existing XML processing
178 | libraries you already know.
179 |
180 | This means the improved API and tool suite provided by *xml4h* work by
181 | mediating operations you perform, asking the underlying XML library to do the
182 | work, and packaging up the results of this work as wrapped *xml4h* objects.
183 |
184 | This approach has a number of implications, good and bad.
185 |
186 | On the good side:
187 |
188 | - you can start using and benefiting from *xml4h* in existing projects that
189 | already use a supported XML library without any disruption; it fits right in.
190 | - *xml4h* can take advantage of the existing powerful and fast XML libraries to
191 | do its work.
192 | - by providing an abstraction layer over multiple libraries, *xml4h* can make
193 | it (relatively) easy to switch the underlying library without you needing to
194 | rewrite your own XML handling code.
195 | - by building on the shoulders of giants, *xml4h* itself can remain relatively
196 | lightweight and focussed on simplicity and usability.
197 | - the author of *xml4h* does not have to write XML-handling code in C...
198 |
199 | On the bad side:
200 |
201 | - if the underlying XML libraries available in the Python environment do not
202 | support a feature (like XPath querying) then that feature will not be
203 | available in *xml4h*.
204 | - *xml4h* cannot provide radical new XML processing features, since the bulk of
205 | its work must be done by the underlying library.
206 | - the abstraction layer *xml4h* uses to do its work requires more resources
207 | than it would to use the underlying library directly, so if you absolutely
208 | need maximal speed or minimal memory use the library might prove too
209 | expensive.
210 | - *xml4h* sometimes needs to jump through some hoops to maintain the shared
211 | abstraction interface over multiple libraries, which means extra work is
212 | done in Python instead of by the underlying library code in C.
213 |
214 | The author believes the benefits of using *xml4h* outweigh the drawbacks in
215 | the majority of real-world situations, or he wouldn't have created the library
216 | in the first place, but ultimately it is up to you to decide where you should
217 | or should not use it.
218 |
219 |
220 | .. _xml-lib-adapters:
221 |
222 | Library Adapters
223 | ----------------
224 |
225 | To provide an abstraction layer over multiple underlying XML libraries, *xml4h*
226 | uses an "adapter" mechanism to mediate operations on documents. There is an
227 | adapter implementation for each library *xml4h* can work with, each of which
228 | extends the :class:`~xml4h.impls.interface.XmlImplAdapter` class. This base
229 | class includes some standard behaviour, and defines the interface for adapter
230 | implementations (to the extent you can define such interfaces in Python).
231 |
232 | The current version of *xml4h* includes adapter implementations for the three
233 | main XML processing libraries for Python:
234 |
235 | - :class:`~xml4h.impls.lxml_etree.LXMLAdapter` works with the excellent
236 | `lxml `_ library which is very full-featured and fast, but
237 | which is not included in the standard library.
238 | - :class:`~xml4h.impls.xml_etree_elementtree.cElementTreeAdapter` and
239 | :class:`~xml4h.impls.xml_etree_elementtree.ElementTreeAdapter` work with the
240 | *ElementTree* libraries included with the standard library of Python versions
241 | 2.7 and later. *ElementTree* is fast and includes support for some basic
242 | XPath expressions. If the C-based version of ElementTree is available, the
243 | former adapter is made available and should be used for best performance.
244 | - :class:`~xml4h.impls.xml_dom_minidom.XmlDomImplAdapter` works with the
245 | `minidom `_ W3C-style
246 | XML library included with the standard library. This library is always
247 | available but is slower and has fewer features than alternative libraries
248 | (e.g. no support for XPath).
249 |
250 | The adapter layer allows the rest of the *xml4h* library code to remain almost
251 | entirely oblivious to the underlying XML library that happens to be available
252 | at the time. The *xml4h* Builder, Node objects, writer etc. call adapter
253 | methods to perform document operations, and the adapter is responsible for
254 | doing the necessary work with the underlying library.
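
The adapter idea can be sketched in a few lines. The code below is not the *xml4h* implementation: ``SketchAdapter``, its two subclasses and their method names are hypothetical and far smaller than the real ``XmlImplAdapter`` interface. It only illustrates how calling code can stay unaware of which standard-library XML package does the work.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class SketchAdapter(ABC):
    """Hypothetical, heavily simplified stand-in for an adapter base class."""

    @classmethod
    @abstractmethod
    def parse_string(cls, text):
        """Return the underlying library's root element."""

    @classmethod
    @abstractmethod
    def child_names(cls, impl_node):
        """Return the tag names of an element's child elements."""


class MinidomSketchAdapter(SketchAdapter):
    @classmethod
    def parse_string(cls, text):
        return xml.dom.minidom.parseString(text).documentElement

    @classmethod
    def child_names(cls, impl_node):
        # minidom mixes text and element nodes, so filter for elements.
        return [n.tagName for n in impl_node.childNodes
                if n.nodeType == n.ELEMENT_NODE]


class ElementTreeSketchAdapter(SketchAdapter):
    @classmethod
    def parse_string(cls, text):
        return ET.fromstring(text)

    @classmethod
    def child_names(cls, impl_node):
        # Iterating an ElementTree element yields its child elements.
        return [child.tag for child in impl_node]


def list_children(adapter, text):
    """Library-agnostic caller: it only talks to the adapter interface."""
    root = adapter.parse_string(text)
    return adapter.child_names(root)


XML = '<Films><Film/><Film/><Note/></Films>'
results = [list_children(adapter, XML)
           for adapter in (MinidomSketchAdapter, ElementTreeSketchAdapter)]
print(results)  # [['Film', 'Film', 'Note'], ['Film', 'Film', 'Note']]
```

Both adapters produce the same answer even though the underlying node objects are of unrelated types, which is the property the rest of the library relies on.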
255 |
256 |
257 | .. _best-adapter:
258 |
259 | "Best" Adapter
260 | --------------
261 |
262 | While *xml4h* can work with multiple underlying XML libraries, some of these
263 | libraries are better (faster, more fully-featured) than others so it would be
264 | smart to use the best of the libraries available.
265 |
266 | *xml4h* does exactly that: unless you explicitly choose an adapter (see below)
267 | *xml4h* will find the supported libraries in the Python environment and choose
268 | the "best" adapter for you in the circumstances.
269 |
270 | Here is the list of libraries *xml4h* will choose from, best to least-best:
271 |
272 | - *lxml*
273 | - *(c)ElementTree*
274 | - *ElementTree*
275 | - *minidom*
276 |
277 | The :attr:`xml4h.best_adapter` attribute stores the adapter class that *xml4h*
278 | considers to be the best.
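
One plausible way to implement this kind of selection is to probe for importable modules in preference order. The sketch below is an assumption about the general technique, not a copy of how *xml4h* sets ``best_adapter``; the ``PREFERENCE`` tuple, ``is_importable`` and ``pick_best`` are hypothetical names.

```python
import importlib.util

# Module names in preference order, following the list above. On Python 3
# the C-accelerated ElementTree is used automatically by xml.etree.ElementTree.
PREFERENCE = ('lxml.etree', 'xml.etree.ElementTree', 'xml.dom.minidom')


def is_importable(module_name):
    """Report whether a module can be found without fully relying on it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (for example lxml) is not installed.
        return False


def pick_best(candidates=PREFERENCE, available=is_importable):
    """Return the first candidate module name that is available."""
    for name in candidates:
        if available(name):
            return name
    raise RuntimeError('no supported XML library found')


best = pick_best()
print(best)
```

The result depends on the environment: it is ``lxml.etree`` where *lxml* is installed and ``xml.etree.ElementTree`` otherwise. Passing a custom ``available`` callable makes the fallback order easy to exercise in tests.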
279 |
280 | .. note::
281 | You cannot always rely on *xml4h* to choose the right underlying XML library
282 | for your needs. For cases where you need to use a specific library, such as
283 | when you have a pre-parsed document object, see `wrap-unwrap-nodes`_.
284 |
285 |
286 | Choose Your Own Adapter
287 | -----------------------
288 |
289 | By default, *xml4h* will choose an adapter and underlying XML library
290 | implementation that it considers the best available. However, in some cases you
291 | may need to have full control over which underlying implementation *xml4h*
292 | uses, perhaps because you will use features of the underlying XML
293 | implementation later on, or because you need the performance characteristics
294 | only available in a particular library.
295 |
296 | For these situations it is possible to tell *xml4h* which adapter
297 | implementation, and therefore which underlying XML library, it should use.
298 |
299 | To use a specific adapter implementation when parsing a document, or when
300 | creating a new document using the builder, simply provide the optional
301 | ``adapter`` keyword argument to the relevant method:
302 |
303 | - Parsing::
304 |
305 | >>> # Explicitly use the minidom adapter to parse a document
306 | >>> minidom_doc = xml4h.parse('tests/data/monty_python_films.xml',
307 | ... adapter=xml4h.XmlDomImplAdapter)
308 | >>> minidom_doc.root.impl_node #doctest:+ELLIPSIS
309 | <DOM Element: MontyPythonFilms at ...
310 |
311 | - Building::
312 |
313 | >>> # Explicitly use the lxml adapter to build a document
314 | >>> lxml_b = xml4h.build('MyDoc', adapter=xml4h.LXMLAdapter)
315 | >>> lxml_b.root.impl_node #doctest:+ELLIPSIS
316 | <Element ...
317 |
318 | - Manipulating::
319 |
320 | >>> # Use xml4h with a cElementTree document object
321 | >>> import xml.etree.ElementTree as ET
322 | >>> et_doc = ET.parse('tests/data/monty_python_films.xml')
323 | >>> et_doc #doctest:+ELLIPSIS
324 | <xml.etree.ElementTree.ElementTree object at ...
325 | >>> doc = xml4h.cElementTreeAdapter.wrap_document(et_doc)
326 | >>> doc.root
327 |
328 |
329 |
330 | Check Feature Support
331 | .....................
332 |
333 | Because not all underlying XML libraries support all the features exposed by
334 | *xml4h*, the library includes a simple mechanism to check whether a given
335 | feature is available in the current Python environment or with the current
336 | adapter.
337 |
338 | To check for feature support call the :meth:`~xml4h.nodes.Node.has_feature`
339 | method on a document node, or
340 | :meth:`~xml4h.impls.interface.XmlImplAdapter.has_feature` on an adapter class.
341 |
342 | List of features that are not available in all adapters:
343 |
344 | - ``xpath`` - Can perform XPath queries using the
345 | :meth:`~xml4h.nodes.Node.xpath` method.
346 | - More to come later, probably...
347 |
348 | For example, here is how you would test for XPath support in the *minidom*
349 | adapter, which doesn't include it::
350 |
351 | >>> minidom_doc.root.has_feature('xpath')
352 | False
353 |
354 | If you forget to check for a feature and use it anyway, you will get
355 | a :class:`~xml4h.exceptions.FeatureUnavailableException`::
356 |
357 | >>> try:
358 | ...     minidom_doc.root.xpath('//*')
359 | ... except Exception as e:
360 | ...     e #doctest:+ELLIPSIS
361 | FeatureUnavailableException('xpath'...
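
The check-then-use pattern might look roughly like the following. Everything here is a hypothetical stand-in written for illustration: ``FeatureUnavailable``, ``SketchAdapter``, ``SUPPORTED_FEATURES`` and ``query`` are not names from *xml4h*, and the real library's internals may differ.

```python
class FeatureUnavailable(Exception):
    """Hypothetical stand-in for an unsupported-feature error."""


class SketchAdapter:
    # Subclasses override this with the feature names they support.
    SUPPORTED_FEATURES = frozenset()

    @classmethod
    def has_feature(cls, name):
        return name.lower() in cls.SUPPORTED_FEATURES


class NoXPathAdapter(SketchAdapter):
    pass


class XPathAdapter(SketchAdapter):
    SUPPORTED_FEATURES = frozenset(['xpath'])


def query(adapter, expression):
    """Refuse early, with a clear error, when the feature is missing."""
    if not adapter.has_feature('xpath'):
        raise FeatureUnavailable('xpath')
    return 'would evaluate %s' % expression


print(XPathAdapter.has_feature('xpath'))    # True
print(NoXPathAdapter.has_feature('xpath'))  # False
```

Checking with ``has_feature`` before calling lets code choose a fallback, such as a manual tree walk, instead of handling the exception after the fact.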
362 |
363 |
364 | Adapter & Implementation Quirks
365 | -------------------------------
366 |
367 | Although *xml4h* aims to provide a seamless abstraction over underlying XML
368 | library implementations this isn't always possible, or is only possible by
369 | performing lots of extra work that affects performance. This section describes
370 | some implementation-specific quirks or differences you may encounter.
371 |
372 | .. note::
373 | This set of quirks is almost certainly incomplete, please report issues you
374 | find so they can either be fixed (in the best case) or captured here as
375 | known trouble-spots.
376 |
377 | LXMLAdapter - *lxml*
378 | ....................
379 |
380 | - *lxml* does not have full support for CDATA nodes, which devolve into plain
381 | text node values when written (by *xml4h* or by *lxml*'s writer).
382 | - Namespaces defined by adding ``xmlns`` element attributes are not properly
383 | represented in the underlying implementation due to the *lxml* library's
384 | immutable ``nsmap`` namespace map. Such namespaces are written correctly
385 | by the *xml4h* writer, but to avoid quirks it is best to specify the namespace
386 | when creating nodes by setting the ``ns_uri`` keyword argument.
387 | - When *xml4h* writes *lxml*-based documents with namespaces, some node tag
388 | names may have unnecessary namespace prefix aliases.
389 |
390 | (c)ElementTreeAdapter - *ElementTree*
391 | .....................................
392 |
393 | - Only the versions of (c)ElementTree included with Python version 2.7 and
394 | later are supported.
395 | - *ElementTree* supports only a very limited subset of XPath for querying, so
396 | although the ``has_feature('xpath')`` check returns ``True`` don't expect to
397 | get the full power of XPath when you use this adapter.
398 | - *ElementTree* does not have full support for CDATA nodes, which devolve into
399 | plain text node values when written (by *xml4h* or by *ElementTree*'s writer).
400 | - Because *ElementTree* doesn't retain information about a node's parent,
401 | *xml4h* needs to build and maintain its own records of which nodes are
402 | parents of which children. This extra overhead might harm performance or
403 | memory usage.
404 | - *ElementTree* doesn't normally remember explicit namespace definition
405 | directives when parsing a document. *xml4h* works around this when it is
406 | asked to parse XML data, but if you parse data outside of *xml4h* then use
407 | the library on the resultant document the namespace definitions will get
408 | messed up.
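
The parent-tracking point above can be illustrated with a common standard-library idiom. This is a sketch of the general technique, not the bookkeeping code *xml4h* actually uses, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<Films><Film><Title>Holy Grail</Title></Film></Films>')

# ElementTree elements have no parent pointer, so build a child -> parent map.
# The root element has no entry because nothing contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

title = root.find('./Film/Title')

# Walk upwards from the Title element using the map.
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

print(ancestors)  # ['Film', 'Films']
```

The map costs one dictionary entry per element and goes stale whenever the tree is modified, which is the kind of overhead the quirk describes.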
409 |
410 | XmlDomImplAdapter - *minidom*
411 | .............................
412 |
413 | - No support for performing XPath queries.
414 | - Slower than alternative C-based implementations.
415 |
--------------------------------------------------------------------------------
/docs/api.rst:
--------------------------------------------------------------------------------
1 | ===
2 | API
3 | ===
4 |
5 |
6 | Main Interface
7 | --------------
8 |
9 | .. automodule:: xml4h
10 | :members: parse, build, best_adapter
11 |
12 |
13 | Builder
14 | -------
15 |
16 | .. automodule:: xml4h.builder
17 | :members:
18 |
19 |
20 | Writer
21 | ------
22 |
23 | .. automodule:: xml4h.writer
24 | :members:
25 |
26 |
27 | .. _api-nodes:
28 |
29 | DOM Nodes API
30 | -------------
31 |
32 | .. automodule:: xml4h.nodes
33 | :members:
34 | :special-members:
35 | :private-members:
36 |
37 |
38 | XML Library Adapters
39 | --------------------
40 |
41 | .. automodule:: xml4h.impls.interface
42 | :members:
43 |
44 | .. automodule:: xml4h.impls.lxml_etree
45 | :members:
46 |
47 | .. automodule:: xml4h.impls.xml_etree_elementtree
48 | :members:
49 |
50 | .. automodule:: xml4h.impls.xml_dom_minidom
51 | :members:
52 |
53 |
54 | Custom Exceptions
55 | -----------------
56 |
57 | .. automodule:: xml4h.exceptions
58 | :members:
59 |
--------------------------------------------------------------------------------
/docs/builder.rst:
--------------------------------------------------------------------------------
1 | .. _builder:
2 |
3 | =======
4 | Builder
5 | =======
6 |
7 | *xml4h* includes a document builder tool that makes it easy to create valid,
8 | well-formed XML documents using relatively sparse Python code. It makes it so
9 | easy to create XML that you will no longer be tempted to cobble together
10 | documents with error-prone methods like manual string concatenation or a
11 | templating library.
12 |
13 | Internally, the builder uses the DOM-building features of an underlying XML
14 | library which means it is (almost) impossible to construct an invalid document.
15 |
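To illustrate why DOM-based construction avoids malformed output, here is a
minimal sketch that uses only the standard library's ``xml.dom.minidom``
rather than *xml4h* itself. The element name and text are invented for the
example; the point is only that the DOM layer escapes special characters that
naive string concatenation leaves untouched.

```python
from xml.dom import minidom

title = 'Fish & Chips <deluxe>'

# Naive string concatenation: special characters are not escaped,
# so the result is not well-formed XML.
naive = '<Title>' + title + '</Title>'

# DOM construction: the library escapes the text when serializing.
doc = minidom.Document()
elem = doc.createElement('Title')
elem.appendChild(doc.createTextNode(title))
doc.appendChild(elem)
safe = elem.toxml()

print(naive)
print(safe)

# The DOM-built string parses back to the original text
round_trip = minidom.parseString(safe).documentElement.firstChild.data
print(round_trip == title)
```
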
16 | Here is some example code to build a document about Monty Python films::
17 |
18 | >>> import xml4h
19 | >>> xmlb = (xml4h.build('MontyPythonFilms')
20 | ... .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'})
21 | ... .element('Film')
22 | ... .attributes({'year': 1971})
23 | ... .element('Title')
24 | ... .text('And Now for Something Completely Different')
25 | ... .up()
26 | ... .elem('Description').t(
27 | ... "A collection of sketches from the first and second TV"
28 | ... " series of Monty Python's Flying Circus purposely"
29 | ... " re-enacted and shot for film.")
30 | ... .up()
31 | ... .up()
32 | ... .elem('Film')
33 | ... .attrs(year=1974)
34 | ... .e('Title')
35 | ... .t('Monty Python and the Holy Grail')
36 | ... .up()
37 | ... .e('Description').t(
38 | ... "King Arthur and his knights embark on a low-budget search"
39 | ... " for the Holy Grail, encountering humorous obstacles along"
40 | ... " the way. Some of these turned into standalone sketches."
41 | ... ).up()
42 | ... )
43 |
44 | The code above produces the following XML document (abbreviated)::
45 |
46 | >>> print(xmlb.xml_doc(indent=True)) # doctest:+ELLIPSIS
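47 | <?xml version="1.0" encoding="utf-8"?>
48 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
49 |     <Film year="1971">
50 |         <Title>And Now for Something Completely Different</Title>
51 |         <Description>A collection of sketches from the first and second...</Description>
52 |     </Film>
53 |     <Film year="1974">
54 |         <Title>Monty Python and the Holy Grail</Title>
55 |         <Description>King Arthur and his knights embark on a low-budget...</Description>
56 |     </Film>
57 | </MontyPythonFilms>
58 | <BLANKLINE>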
59 |
60 |
61 | Getting Started
62 | ---------------
63 |
64 | You typically create a new XML document builder by calling the
65 | :func:`xml4h.build` function with the name of the root element::
66 |
67 | >>> root_b = xml4h.build('RootElement')
68 |
69 | The function returns a :class:`~xml4h.builder.Builder` object that represents
70 | the *RootElement* and allows you to manipulate this element's attributes
71 | or to add child elements.
72 |
73 | Once you have the first builder instance, every action you perform to add
74 | content to the XML document will return another instance of the Builder class::
75 |
76 | >>> # Add attributes to the root element's Builder
77 | >>> root_b = root_b.attributes({'a': 1, 'b': 2}, c=3)
78 |
79 | >>> root_b #doctest:+ELLIPSIS
80 | >>> root_b.dom_element
86 | <xml4h.nodes.Element: "RootElement">
87 |
88 | >>> root_b.dom_element.attributes
89 |
90 |
91 | When you add a new child element, the result is a builder instance representing
92 | that child element, *not the original element*::
93 |
94 | >>> child1_b = root_b.element('ChildElement1')
95 | >>> child2_b = root_b.element('ChildElement2')
96 |
97 | >>> # The element method returns a Builder wrapping the new child element
98 | >>> child2_b.dom_element
99 | <xml4h.nodes.Element: "ChildElement2">
100 | >>> child2_b.dom_element.parent
101 | <xml4h.nodes.Element: "RootElement">
102 |
103 | This feature of the builder can be a little confusing, but it allows for the
104 | very convenient method-chaining feature that gives the builder its power.
105 |
106 |
107 | .. _builder-method-chaining:
108 |
109 | Method Chaining
110 | ---------------
111 |
112 | Because every builder method that adds content to the XML document returns
113 | a builder instance representing the nearest (or newest) element, you can
114 | chain together many method calls to construct your document without any
115 | need for intermediate variables.
116 |
117 | For example, the example code in the previous section used the variables
118 | ``root_b``, ``child1_b`` and ``child2_b`` to represent builder instances but
119 | this is not necessary. Here is how you can use method-chaining to build the
120 | same document with less code::
121 |
122 | >>> b = (xml4h
123 | ... .build('RootElement').attributes({'a': 1, 'b': 2}, c=3)
124 | ... .element('ChildElement1').up() # NOTE the up() method
125 | ... .element('ChildElement2')
126 | ... )
127 |
128 | >>> print(b.xml_doc(indent=4))
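129 | <?xml version="1.0" encoding="utf-8"?>
130 | <RootElement a="1" b="2" c="3">
131 |     <ChildElement1/>
132 |     <ChildElement2/>
133 | </RootElement>
134 | <BLANKLINE>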
135 |
136 | Notice how you can use chained method calls to write code with a structure
137 | that mirrors that of the XML document you want to produce? This makes it
138 | much easier to spot errors in your code than it would be if you were to
139 | concatenate strings.
140 |
141 | .. note::
142 |
143 | It is a good idea to wrap the :func:`~xml4h.build` function call and all
144 | following chained methods in parentheses, so you don't need to put
145 | backslash (\\) characters at the end of every line.
146 |
147 | The code above introduces a very important builder method:
148 | :meth:`~xml4h.builder.Builder.up`. This method returns a builder instance
149 | representing the current element's parent, or indeed any ancestor.
150 |
151 | Without the ``up()`` method, every time you created a child element with the
152 | builder you would end up deeper in the document structure with no way to return
153 | to prior elements to add sibling nodes or hierarchies.
154 |
155 | To help reduce the number of ``up()`` method calls you need to include in
156 | your code, this method can also jump up multiple levels or to a named ancestor
157 | element::
158 |
159 | >>> # A builder that references a deeply-nested element:
160 | >>> deep_b = (xml4h.build('Root')
161 | ... .element('Deep')
162 | ... .element('AndDeeper')
163 | ... .element('AndDeeperStill')
164 | ... .element('UntilWeGetThere')
165 | ... )
166 | >>> deep_b.dom_element
167 | <xml4h.nodes.Element: "UntilWeGetThere">
168 |
169 | >>> # Jump up 4 levels, back to the root element
170 | >>> deep_b.up(4).dom_element
171 | <xml4h.nodes.Element: "Root">
172 |
173 | >>> # Jump up to a named ancestor element
174 | >>> deep_b.up('Root').dom_element
175 | <xml4h.nodes.Element: "Root">
176 |
177 | .. note::
178 | To avoid making subtle errors in your document's structure, we recommend you
179 | use :meth:`~xml4h.builder.Builder.up` calls to return up one level for every
180 | :meth:`~xml4h.builder.Builder.element` method (or alias) you call.
181 |
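The navigation behaviour described above can be sketched with a small,
hypothetical ``MiniBuilder`` class built on the standard library's
``ElementTree``. This is not *xml4h*'s implementation; the class name, the
``node`` and ``parent`` attributes, and the fallback of stopping at the root
are all assumptions made for illustration. It only shows the idea that
``element()`` returns a builder for the new child while ``up()`` walks back
towards the root by count or by name.

```python
import xml.etree.ElementTree as ET


class MiniBuilder:
    """Hypothetical, simplified builder; not xml4h's implementation."""

    def __init__(self, node, parent=None):
        self.node = node      # wrapped ElementTree element
        self.parent = parent  # MiniBuilder for the parent element, or None

    def element(self, name):
        # Return a builder for the new child, so chained calls descend
        return MiniBuilder(ET.SubElement(self.node, name), parent=self)

    def up(self, count_or_name=1):
        builder = self
        if isinstance(count_or_name, int):
            # Climb a fixed number of levels, stopping at the root
            for _ in range(count_or_name):
                if builder.parent is None:
                    break
                builder = builder.parent
        else:
            # Climb until an ancestor with the given name, or the root
            while builder.parent is not None:
                builder = builder.parent
                if builder.node.tag == count_or_name:
                    break
        return builder


b = (MiniBuilder(ET.Element('Root'))
     .element('ChildElement1').up()
     .element('ChildElement2'))

root = b.up('Root').node
print([child.tag for child in root])
```

Without the ``up()`` call in the chain, ``ChildElement2`` would be created
inside ``ChildElement1`` rather than beside it.
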
182 |
183 | Shorthand Methods
184 | -----------------
185 |
186 | To make your XML-producing code even less verbose and quicker to type, the
187 | builder has shorthand "alias" methods corresponding to the full names.
188 |
189 | For example, instead of calling ``element()`` to create a new
190 | child element, you can instead use the equivalent ``elem()`` or ``e()``
191 | methods. Similarly, instead of typing ``attributes()`` you can use ``attrs()``
192 | or ``a()``.
193 |
194 | Here are the methods and method aliases for adding content to an XML document:
195 |
196 | ====================== ========================== ================
197 | XML Node Created       Builder method             Aliases
198 | ====================== ========================== ================
199 | Element                ``element``                ``elem``, ``e``
200 | Attribute              ``attributes``             ``attrs``, ``a``
201 | Text                   ``text``                   ``t``
202 | CDATA                  ``cdata``                  ``data``, ``d``
203 | Comment                ``comment``                ``c``
204 | Processing Instruction ``processing_instruction`` ``inst``, ``i``
205 | ====================== ========================== ================
206 |
207 | These shorthand method aliases are convenient and lead to even less cruft
208 | around the actual XML content you are interested in. But on the other hand
209 | they are much less explicit than the longer versions, so use them judiciously.
210 |
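For readers curious how such aliases can be provided, one common Python idiom
is to bind additional class attributes to the same function object. The sketch
below uses a hypothetical ``AliasDemo`` class; whether *xml4h* defines its
aliases this way is an assumption, not something stated in this guide.

```python
class AliasDemo:
    """Hypothetical class showing method aliases; not part of xml4h."""

    def __init__(self):
        self.calls = []

    def element(self, name):
        # Record the call and return self so calls can be chained
        self.calls.append(name)
        return self

    # Binding extra names to the same function object creates aliases
    elem = element
    e = element


demo = AliasDemo().element('A').elem('B').e('C')
print(demo.calls)
print(AliasDemo.e is AliasDemo.element)
```
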
211 |
212 | Access the DOM
213 | --------------
214 |
215 | The XML builder is merely a layer of convenience methods that sits on the
216 | :mod:`xml4h.nodes` DOM API. This means you can quickly access the underlying
217 | nodes from a builder if you need to inspect them or manipulate them in a
218 | way the builder doesn't allow:
219 |
220 | - The :attr:`~xml4h.builder.Builder.dom_element` attribute returns a builder's
221 | underlying :class:`~xml4h.nodes.Element`.
222 | - The :attr:`~xml4h.builder.Builder.root` attribute returns the document's
223 | root element.
224 | - The :attr:`~xml4h.builder.Builder.document` attribute returns a builder's
225 | underlying :class:`~xml4h.nodes.Document`.
226 |
227 | See the :ref:`api-nodes` documentation to find out how to work with DOM
228 | element nodes once you get them.
229 |
230 |
231 | Building on an Existing DOM
232 | ---------------------------
233 |
234 | When you are building an XML document from scratch you will generally use
235 | the :func:`~xml4h.build` function described in `Getting Started`_. However,
236 | what if you want to add content to a parsed XML document DOM you have already?
237 |
238 | To wrap an :class:`~xml4h.nodes.Element` DOM node with a builder you simply
239 | provide the element node to the same :func:`~xml4h.build` function used previously and
240 | it will do the right thing.
241 |
242 | Here is an example of parsing an existing XML document, locating an element
243 | of interest, constructing a builder from that element, and adding some new
244 | content. Luckily, the code is simpler than that description...
245 |
246 | ::
247 |
248 | >>> # Parse an XML document
249 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
250 |
251 | >>> # Find an Element node of interest
252 | >>> lob_film_elem = doc.MontyPythonFilms.Film[2]
253 | >>> lob_film_elem.Title.text
254 | "Monty Python's Life of Brian"
255 |
256 | >>> # Construct a builder from the element
257 | >>> lob_builder = xml4h.build(lob_film_elem)
258 |
259 | >>> # Add content
260 | >>> b = (lob_builder.attrs(stars=5)
261 | ... .elem('Review').t('One of my favourite films!').up())
262 |
263 | >>> # See the results
264 | >>> print(lob_builder.xml()) # doctest:+ELLIPSIS
265 |
266 | Monty Python's Life of Brian
267 | Brian is born on the first Christmas, in the stable...
268 | One of my favourite films!
269 |
270 |
271 |
272 | Hydra-Builder
273 | -------------
274 |
275 | Because each builder class instance is independent, an advanced technique for
276 | constructing complex documents is to use multiple builders anchored at
277 | different places in the DOM. In some situations, the ability to add content
278 | to different places in the same document can be very handy.
279 |
280 | Here is a trivial example of this technique::
281 |
282 | >>> # Create two Elements in a doc to store even or odd numbers
283 | >>> odd_b = xml4h.build('EvenAndOdd').elem('Odd')
284 | >>> even_b = odd_b.up().elem('Even')
285 |
286 | >>> # Populate the numbers from a loop
287 | >>> for i in range(1, 11): # doctest:+ELLIPSIS
288 | ... if i % 2 == 0:
289 | ... even_b.elem('Number').text(i)
290 | ... else:
291 | ... odd_b.elem('Number').text(i)
292 | <...
293 |
294 | >>> # Check the final document
295 | >>> print(odd_b.xml_doc(indent=True))
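296 | <?xml version="1.0" encoding="utf-8"?>
297 | <EvenAndOdd>
298 |     <Odd>
299 |         <Number>1</Number>
300 |         <Number>3</Number>
301 |         <Number>5</Number>
302 |         <Number>7</Number>
303 |         <Number>9</Number>
304 |     </Odd>
305 |     <Even>
306 |         <Number>2</Number>
307 |         <Number>4</Number>
308 |         <Number>6</Number>
309 |         <Number>8</Number>
310 |         <Number>10</Number>
311 |     </Even>
312 | </EvenAndOdd>
313 | <BLANKLINE>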
314 |
--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # xml4h documentation build configuration file, created by
4 | # sphinx-quickstart on Thu Aug 30 22:29:54 2012.
5 | #
6 | # This file is execfile()d with the current directory set to its containing dir.
7 | #
8 | # Note that not all possible configuration values are present in this
9 | # autogenerated file.
10 | #
11 | # All configuration values have a default; values that are commented out
12 | # serve to show the default.
13 |
14 | import sys, os
15 | from xml4h import __version__
16 |
17 | # If extensions (or modules to document with autodoc) are in another directory,
18 | # add these directories to sys.path here. If the directory is relative to the
19 | # documentation root, use os.path.abspath to make it absolute, like shown here.
20 | #sys.path.insert(0, os.path.abspath('.'))
21 |
22 | # -- General configuration -----------------------------------------------------
23 |
24 | # If your documentation needs a minimal Sphinx version, state it here.
25 | #needs_sphinx = '1.0'
26 |
27 | # Add any Sphinx extension module names here, as strings. They can be extensions
28 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
29 | extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode']
30 |
31 | # Add any paths that contain templates here, relative to this directory.
32 | templates_path = ['_templates']
33 |
34 | # The suffix of source filenames.
35 | source_suffix = '.rst'
36 |
37 | # The encoding of source files.
38 | #source_encoding = 'utf-8-sig'
39 |
40 | # The master toctree document.
41 | master_doc = 'index'
42 |
43 | # General information about the project.
44 | project = 'xml4h'
45 | copyright = '2020, James Murty'
46 |
47 | # The version info for the project you're documenting, acts as replacement for
48 | # |version| and |release|, also used in various other places throughout the
49 | # built documents.
50 | #
51 | # The short X.Y version.
52 | version = __version__
53 | # The full version, including alpha/beta/rc tags.
54 | release = version
55 |
56 | # The language for content autogenerated by Sphinx. Refer to documentation
57 | # for a list of supported languages.
58 | #language = None
59 |
60 | # There are two options for replacing |today|: either, you set today to some
61 | # non-false value, then it is used:
62 | #today = ''
63 | # Else, today_fmt is used as the format for a strftime call.
64 | #today_fmt = '%B %d, %Y'
65 |
66 | # List of patterns, relative to source directory, that match files and
67 | # directories to ignore when looking for source files.
68 | exclude_patterns = ['_build']
69 |
70 | # The reST default role (used for this markup: `text`) to use for all documents.
71 | #default_role = None
72 |
73 | # If true, '()' will be appended to :func: etc. cross-reference text.
74 | #add_function_parentheses = True
75 |
76 | # If true, the current module name will be prepended to all description
77 | # unit titles (such as .. function::).
78 | #add_module_names = True
79 |
80 | # If true, sectionauthor and moduleauthor directives will be shown in the
81 | # output. They are ignored by default.
82 | #show_authors = False
83 |
84 | # The name of the Pygments (syntax highlighting) style to use.
85 | pygments_style = 'sphinx'
86 |
87 | # A list of ignored prefixes for module index sorting.
88 | #modindex_common_prefix = []
89 |
90 |
91 | # -- Options for HTML output ---------------------------------------------------
92 |
93 | # The theme to use for HTML and HTML Help pages. See the documentation for
94 | # a list of builtin themes.
95 | html_theme = 'default'
96 |
97 | # Theme options are theme-specific and customize the look and feel of a theme
98 | # further. For a list of options available for each theme, see the
99 | # documentation.
100 | #html_theme_options = {}
101 |
102 | # Add any paths that contain custom themes here, relative to this directory.
103 | #html_theme_path = []
104 |
105 | # The name for this set of Sphinx documents. If None, it defaults to
106 | # "<project> v<release> documentation".
107 | #html_title = None
108 |
109 | # A shorter title for the navigation bar. Default is the same as html_title.
110 | #html_short_title = None
111 |
112 | # The name of an image file (relative to this directory) to place at the top
113 | # of the sidebar.
114 | #html_logo = None
115 |
116 | # The name of an image file (within the static path) to use as favicon of the
117 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
118 | # pixels large.
119 | #html_favicon = None
120 |
121 | # Add any paths that contain custom static files (such as style sheets) here,
122 | # relative to this directory. They are copied after the builtin static files,
123 | # so a file named "default.css" will overwrite the builtin "default.css".
124 | html_static_path = ['_static']
125 |
126 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
127 | # using the given strftime format.
128 | #html_last_updated_fmt = '%b %d, %Y'
129 |
130 | # If true, SmartyPants will be used to convert quotes and dashes to
131 | # typographically correct entities.
132 | #html_use_smartypants = True
133 |
134 | # Custom sidebar templates, maps document names to template names.
135 | #html_sidebars = {}
136 |
137 | # Additional templates that should be rendered to pages, maps page names to
138 | # template names.
139 | #html_additional_pages = {}
140 |
141 | # If false, no module index is generated.
142 | #html_domain_indices = True
143 |
144 | # If false, no index is generated.
145 | #html_use_index = True
146 |
147 | # If true, the index is split into individual pages for each letter.
148 | #html_split_index = False
149 |
150 | # If true, links to the reST sources are added to the pages.
151 | #html_show_sourcelink = True
152 |
153 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
154 | #html_show_sphinx = True
155 |
156 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
157 | #html_show_copyright = True
158 |
159 | # If true, an OpenSearch description file will be output, and all pages will
160 | # contain a <link> tag referring to it. The value of this option must be the
161 | # base URL from which the finished HTML is served.
162 | #html_use_opensearch = ''
163 |
164 | # This is the file name suffix for HTML files (e.g. ".xhtml").
165 | #html_file_suffix = None
166 |
167 | # Output file base name for HTML help builder.
168 | htmlhelp_basename = 'xml4hdoc'
169 |
170 |
171 | # -- Options for LaTeX output --------------------------------------------------
172 |
173 | latex_elements = {
174 | # The paper size ('letterpaper' or 'a4paper').
175 | #'papersize': 'letterpaper',
176 |
177 | # The font size ('10pt', '11pt' or '12pt').
178 | #'pointsize': '10pt',
179 |
180 | # Additional stuff for the LaTeX preamble.
181 | #'preamble': '',
182 | }
183 |
184 | # Grouping the document tree into LaTeX files. List of tuples
185 | # (source start file, target name, title, author, documentclass [howto/manual]).
186 | latex_documents = [
187 | ('index', 'xml4h.tex', 'xml4h Documentation',
188 | 'James Murty', 'manual'),
189 | ]
190 |
191 | # The name of an image file (relative to this directory) to place at the top of
192 | # the title page.
193 | #latex_logo = None
194 |
195 | # For "manual" documents, if this is true, then toplevel headings are parts,
196 | # not chapters.
197 | #latex_use_parts = False
198 |
199 | # If true, show page references after internal links.
200 | #latex_show_pagerefs = False
201 |
202 | # If true, show URL addresses after external links.
203 | #latex_show_urls = False
204 |
205 | # Documents to append as an appendix to all manuals.
206 | #latex_appendices = []
207 |
208 | # If false, no module index is generated.
209 | #latex_domain_indices = True
210 |
211 |
212 | # -- Options for manual page output --------------------------------------------
213 |
214 | # One entry per manual page. List of tuples
215 | # (source start file, name, description, authors, manual section).
216 | man_pages = [
217 | ('index', 'xml4h', 'xml4h Documentation',
218 | ['James Murty'], 1)
219 | ]
220 |
221 | # If true, show URL addresses after external links.
222 | #man_show_urls = False
223 |
224 |
225 | # -- Options for Texinfo output ------------------------------------------------
226 |
227 | # Grouping the document tree into Texinfo files. List of tuples
228 | # (source start file, target name, title, author,
229 | # dir menu entry, description, category)
230 | texinfo_documents = [
231 | ('index', 'xml4h', 'xml4h Documentation',
232 | 'James Murty', 'xml4h', 'One line description of project.',
233 | 'Miscellaneous'),
234 | ]
235 |
236 | # Documents to append as an appendix to all manuals.
237 | #texinfo_appendices = []
238 |
239 | # If false, no module index is generated.
240 | #texinfo_domain_indices = True
241 |
242 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
243 | #texinfo_show_urls = 'footnote'
244 |
--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
1 | .. xml4h documentation master file, created by
2 | sphinx-quickstart on Thu Aug 30 22:29:54 2012.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 |
6 | .. include:: ../README.rst
7 |
8 |
9 | ==========
10 | User Guide
11 | ==========
12 |
13 | .. toctree::
14 | :maxdepth: 3
15 |
16 | parser
17 | builder
18 | writer
19 | nodes
20 | advanced
21 | api
22 |
23 |
24 | ==================
25 | Indices and tables
26 | ==================
27 |
28 | * :ref:`genindex`
29 | * :ref:`modindex`
30 | * :ref:`search`
31 |
32 |
--------------------------------------------------------------------------------
/docs/nodes.rst:
--------------------------------------------------------------------------------
1 | =========
2 | DOM Nodes
3 | =========
4 |
5 | *xml4h* provides node objects and convenience methods that make it easier to
6 | work with an in-memory XML document object model (DOM).
7 |
8 | This section of the document covers the main features of *xml4h* nodes.
9 | For the full API-level documentation see :ref:`api-nodes`.
10 |
11 | .. _node-traversal:
12 |
13 | Traversing Nodes
14 | ----------------
15 |
16 | *xml4h* aims to provide a simple and intuitive API for traversing and
17 | manipulating the XML DOM. To that end it includes a number of convenience
18 | methods for performing common tasks:
19 |
20 | - Get the :class:`~xml4h.nodes.Document` or root :class:`~xml4h.nodes.Element`
21 | from any node via the ``document`` and ``root`` attributes respectively.
22 | - You can get the ``name`` attribute of nodes that have a name, or look up
23 | the different name components with ``prefix`` to get the namespace prefix
24 | (if any) and ``local_name`` to get the name portion without the prefix.
25 | - Nodes that have a value expose it via the ``value`` attribute.
26 | - A node's ``parent`` attribute returns its parent, while the ``ancestors``
27 | attribute returns a list containing its parent, grand-parent,
28 | great-grand-parent etc.
29 | - A node's ``children`` attribute returns the child nodes that belong to it,
30 | while the ``siblings`` attribute returns all other nodes that belong to its
31 | parent. You can also get the ``siblings_before`` or ``siblings_after`` the
32 | current node.
33 | - Look up a node's namespace URI with ``namespace_uri`` or the alias
34 | ``ns_uri``.
35 | - Check what type of :class:`~xml4h.nodes.Node` you have with Boolean
36 | attributes like ``is_element``, ``is_text``, ``is_entity`` etc.
37 |
38 |
39 | .. _magical-node-traversal:
40 |
41 | "Magical" Node Traversal
42 | ------------------------
43 |
44 | To make it easy to traverse XML documents with a known structure *xml4h*
45 | performs some minor magic when you look up attributes or keys on Document
46 | and Element nodes. If you like, you can take advantage of magical traversal
47 | to avoid peppering your code with ``find`` and ``xpath`` searches, or with
48 | ``child`` and ``children`` node attribute lookups.
49 |
50 | The principle is simple:
51 |
52 | - Child elements are available as Python attributes of the parent element
53 | class.
54 | - XML element attributes are available as a Python dict in the owning element.
55 |
56 | Here is an example of retrieving information from our Monty Python films
57 | document using element names as Python attributes (``MontyPythonFilms``,
58 | ``Film``, ``Title``) and XML attribute names as Python keys (``year``)::
59 |
60 | >>> # Parse an example XML document about Monty Python films
61 | >>> import xml4h
62 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
63 |
64 | >>> for film in doc.MontyPythonFilms.Film:
65 | ... print(film['year'] + ' : ' + film.Title.text) # doctest:+ELLIPSIS
66 | 1971 : And Now for Something Completely Different
67 | 1974 : Monty Python and the Holy Grail
68 | ...
69 |
70 | Python class attribute lookups of child elements work very well when your XML
71 | document contains only camel-case tag names ``LikeThisOne`` or ``LikeThat``.
72 | However, if your document contains lower-case tag names there is a chance the
73 | element names will clash with existing Python attribute or method names in the
74 | *xml4h* classes.
75 |
76 | To work around this potential issue you can add an underscore (``_``)
77 | character at the end of a magical attribute lookup to avoid the naming clash;
78 | *xml4h* will remove that character before looking for a child element. For
79 | example, to look up a child of the element ``elem1`` which is named ``child``,
80 | the code ``elem1.child_`` will return the child element whereas ``elem1.child``
81 | would access the :meth:`~xml4h.nodes.Node.child` Node method instead.
82 |
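The trailing-underscore rule can be sketched with Python's ``__getattr__``
hook, which runs only when normal attribute lookup fails. The
``MagicElement`` class below is hypothetical and wraps a standard library
``ElementTree`` element; the real *xml4h* lookup logic is more involved, so
treat this purely as an illustration of the naming rule.

```python
import xml.etree.ElementTree as ET


class MagicElement:
    """Hypothetical wrapper illustrating the lookup rule; not xml4h code."""

    def __init__(self, node):
        self._node = node

    @property
    def text(self):
        return self._node.text

    def child(self, name):
        # An ordinary method that an XML element named "child" clashes with
        found = self._node.find(name)
        return None if found is None else MagicElement(found)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        if name.startswith('_'):
            raise AttributeError(name)
        # Strip one trailing underscore before searching for a child element
        tag = name[:-1] if name.endswith('_') else name
        found = self._node.find(tag)
        if found is None:
            raise AttributeError(name)
        return MagicElement(found)


elem1 = MagicElement(
    ET.fromstring('<elem1><child>hello</child><Title>T</Title></elem1>'))

print(elem1.Title.text)        # no clash, plain attribute lookup works
print(elem1.child_.text)       # underscore avoids the clash with child()
print(callable(elem1.child))   # without it you get the method instead
```
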
83 | .. note::
84 | Not all XML child element tag names are accessible using magical traversal.
85 | Names with leading underscore characters will not work, nor will names
86 | containing hyphens because they are not valid Python attribute names. If you
87 | have to deal with XML names like this use the full API methods like
88 | :meth:`~xml4h.nodes.Node.child` and :meth:`~xml4h.nodes.Node.children`
89 | instead.
90 |
91 | All the gory details about how magical traversal works are documented at
92 | :class:`~xml4h.nodes.NodeAttrAndChildElementLookupsMixin`. Depending on how
93 | you feel about magical behaviour this feature might feel like a great
94 | convenience, or black magic that makes you wary. The right attitude probably
95 | lies somewhere in the middle...
96 |
97 | .. warning::
98 | The behaviour of namespaced XML elements and attributes is inconsistent.
99 | You can do magical traversal of elements regardless of what namespace the
100 | elements are in, but to look up XML attributes with a namespace prefix
101 | you must include that prefix in the name e.g. ``prefix:attribute-name``.
102 |
103 |
104 | Searching with Find and XPath
105 | -----------------------------
106 |
107 | There are two ways to search for elements within an *xml4h* document: ``find``
108 | and ``xpath``.
109 |
110 | The find methods provided by the library are easy to use but can only perform
111 | relatively simple searches that return :class:`~xml4h.nodes.Element` results.
112 | The ``xpath`` method requires familiarity with XPath query syntax, but in
113 | return it can perform more complex searches and produce results other than
114 | just elements.
115 |
116 | Find Methods
117 | ............
118 |
119 | *xml4h* provides three different find methods:
120 |
121 | - :meth:`~xml4h.nodes.Node.find` searches descendants of the current node for
122 | elements matching the given constraints. You can search by element name,
123 | by namespace URI, or with no constraints at all::
124 |
125 | >>> # Find ALL elements in the document
126 | >>> elems = doc.find()
127 | >>> [e.name for e in elems] # doctest:+ELLIPSIS
128 | ['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',...
129 |
130 | >>> # Find the seven <Film> elements in the XML document
131 | >>> film_elems = doc.find('Film')
132 | >>> [e.Title.text for e in film_elems] # doctest:+ELLIPSIS
133 | ['And Now for Something Completely Different', 'Monty Python and the Holy Grail',...
134 |
135 | Note that the :meth:`~xml4h.nodes.Node.find` method only finds descendants
136 | of the node you run it on::
137 |
138 | >>> # Find <Title> elements in a single <Film> element; there's only one
139 | >>> film_elem = doc.find('Film', first_only=True)
140 | >>> film_elem.find('Title')
141 | [<xml4h.nodes.Element: "Title">]
142 |
143 | - :meth:`~xml4h.nodes.Node.find_first` searches descendants of the current
144 | node but only returns the first result element, not a list. If there are no
145 | matching element results this method returns *None*::
146 |
147 | >>> # Find the first <Film> element in the document
148 | >>> doc.find_first('Film')
149 | <xml4h.nodes.Element: "Film">
150 |
151 | >>> # Search for an element that does not exist
152 | >>> print(doc.find_first('OopsWrongName'))
153 | None
154 |
155 | If you were paying attention you may have noticed in the example above that
156 | you can make the :meth:`~xml4h.nodes.Node.find` method do exactly the same thing
157 | as :meth:`~xml4h.nodes.Node.find_first` by passing the keyword argument
158 | ``first_only=True``.
159 |
160 | - :meth:`~xml4h.nodes.Node.find_doc` is a convenience method that searches the
161 | entire document no matter which node you run it on::
162 |
163 | >>> # Normal find only searches descendants of the current node
164 | >>> len(film_elem.find('Title'))
165 | 1
166 |
167 | >>> # find_doc searches the entire document
168 | >>> len(film_elem.find_doc('Title'))
169 | 7
170 |
171 | This method is exactly like calling ``xml4h_node.document.find()``, which is
172 | actually what happens behind the scenes.
173 |
174 | XPath Querying
175 | ..............
176 |
177 | *xml4h* provides a single XPath search method which is available on
178 | :class:`~xml4h.nodes.Document` and :class:`~xml4h.nodes.Element` nodes:
179 |
180 | :meth:`~xml4h.nodes.XPathMixin.xpath` takes an XPath query string and returns
181 | the result which may be a list of elements, a list of attributes, a list of
182 | values, or a single value. The result depends entirely on the kind of query you
183 | perform.
184 |
185 | .. note::
186 | XPath querying is currently only available if you use the *lxml* or
187 | *ElementTree* implementation libraries. You can check whether the XPath
188 | feature is available with :meth:`~xml4h.nodes.Node.has_feature`.
189 |
190 | .. note::
191 | Although *ElementTree* supports XPath queries, this support is
192 | very limited and most of the
193 | example XPath queries below **will not work**. If you want to use XPath, you
194 | should install *lxml* for better support.
195 |
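To give a feel for what the limited *ElementTree* support covers, here is a
small sketch that uses the standard library's ``xml.etree.ElementTree``
directly rather than going through *xml4h*. The two-film document is sample
data made up for the example. Attribute predicates work, whereas attribute and
``text()`` selections are outside the supported subset and have to be read
from the matched elements instead.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<MontyPythonFilms>'
    '<Film year="1979"><Title>Monty Python\'s Life of Brian</Title></Film>'
    '<Film year="1982">'
    '<Title>Monty Python Live at the Hollywood Bowl</Title>'
    '</Film>'
    '</MontyPythonFilms>'
)

# Supported: descendant search with an attribute-value predicate
titles = [t.text for t in root.findall(".//Film[@year='1982']/Title")]
print(titles)

# Not in ElementTree's subset: selecting attributes or text() nodes.
# Read the values from the matched elements instead.
years = [film.get('year') for film in root.findall('.//Film')]
print(years)
```
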
196 | XPath queries are powerful and complex so we cannot describe them in detail
197 | here, but we can at least present some useful examples. Here are queries that
198 | perform the same work as the find queries we saw above::
199 |
200 | >>> # Query for ALL elements in the document
201 | >>> elems = doc.xpath('//*') # doctest:+ELLIPSIS
202 | >>> [e.name for e in elems] # doctest:+ELLIPSIS
203 | ['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',...
204 |
205 | >>> # Query for the seven <Film> elements in the XML document
206 | >>> film_elems = doc.xpath('//Film')
207 | >>> [e.Title.text for e in film_elems] # doctest:+ELLIPSIS
208 | ['And Now for Something Completely Different', 'Monty Python and the Holy Grail',...
209 |
210 | >>> # Query for the first <Film> element in the document (returns list)
211 | >>> doc.xpath('//Film[1]')
212 | [<xml4h.nodes.Element: "Film">]
213 |
214 | >>> # Query for <Title> elements in a single <Film> element; there's only one
215 | >>> film_elem = doc.xpath('Film[1]')[0]
216 | >>> film_elem.xpath('Title')
217 | [<xml4h.nodes.Element: "Title">]
218 |
219 | You can also do things with XPath queries that you simply cannot with the
220 | *find* method, such as find all the attributes of a certain name or apply
221 | rich constraints to the query::
222 |
223 | >>> # Query for all year attributes
224 | >>> doc.xpath('//@year')
225 | ['1971', '1974', '1979', '1982', '1983', '2009', '2012']
226 |
227 | >>> # Query for the title of the film released in 1982
228 | >>> doc.xpath('//Film[@year="1982"]/Title/text()')
229 | ['Monty Python Live at the Hollywood Bowl']
230 |
231 |
232 | Namespaces and XPath
233 | ....................
234 |
235 | Finally, let's discuss how you can run XPath queries on documents with
236 | namespaces, because unfortunately this is not a simple subject.
237 |
238 | First, you need to understand that if you are working with a namespaced
239 | document your XPath queries must refer to those namespaces or they will not
240 | find anything::
241 |
242 | >>> # Parse a namespaced version of the Monty Python Films doc
243 | >>> ns_doc = xml4h.parse('tests/data/monty_python_films.ns.xml')
244 | >>> print(ns_doc.xml()) #doctest:+ELLIPSIS
245 |
246 |
247 |
248 | And Now for Something Completely Different
249 | ...
250 |
251 | >>> # XPath queries without prefixes won't find namespaced elements
252 | >>> ns_doc.xpath('//Film')
253 | []
254 |
255 | To refer to namespaced nodes in your query the namespace must have a prefix
256 | alias assigned to it. You can specify prefixes when you call the *xpath* method
257 | by providing a ``namespaces`` keyword argument with a dictionary of
258 | alias-to-URI mappings::
259 |
260 | >>> # Specify explicit prefix alias mappings
261 | >>> films = ns_doc.xpath('//x:Film', namespaces={'x': 'uri:artistic-work'})
262 | >>> len(films)
263 | 7
264 |
265 | Or, preferably, if your document node already has prefix mappings you can use
266 | them directly::
267 |
268 | >>> # Our root node already has a 'work' prefix defined...
269 | >>> ns_doc.root['xmlns:work']
270 | 'uri:artistic-work'
271 |
272 | >>> # ...so we can use this prefix directly
273 | >>> films = ns_doc.root.xpath('//work:Film')
274 | >>> len(films)
275 | 7
276 |
277 | Another gotcha is when a document has a default namespace. The default
278 | namespace applies to every descendant node without its own namespace, but XPath
279 | doesn't have a good way of dealing with this since there is no such thing as
280 | a "default namespace" prefix alias.
281 |
282 | *xml4h* helps out by providing just such an alias: the underscore (``_``)::
283 |
284 | >>> # Our document root has a default namespace
285 | >>> ns_doc.root.ns_uri
286 | 'uri:monty-python'
287 |
288 | >>> # You need a prefix alias that refers to the default namespace
289 | >>> ns_doc.xpath('//Title')
290 | []
291 |
292 | >>> # You could specify it explicitly...
293 | >>> titles = ns_doc.xpath('//x:Title',
294 | ... namespaces={'x': ns_doc.root.ns_uri})
295 | >>> len(titles)
296 | 7
297 |
298 | >>> # ...or use xml4h's special default namespace prefix: _
299 | >>> titles = ns_doc.xpath('//_:Title')
300 | >>> len(titles)
301 | 7
302 |
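The prefix-alias idea is not specific to *xml4h*. As a rough illustration only,
here is the same mapping technique using the standard library's
``xml.etree.ElementTree`` instead of *xml4h*. The sample XML and the aliases
``x`` and ``d`` are made up for this sketch; only the two namespace URIs come
from the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on the namespaced films document
XML_TEXT = """\
<MontyPythonFilms xmlns="uri:monty-python" xmlns:work="uri:artistic-work">
    <work:Film year="1971"><Title>And Now for Something Completely Different</Title></work:Film>
    <work:Film year="1974"><Title>Monty Python and the Holy Grail</Title></work:Film>
</MontyPythonFilms>
"""

root = ET.fromstring(XML_TEXT)

# Without a prefix the path only matches elements that have no namespace
unprefixed = root.findall('.//Film')

# Any alias works, as long as it maps to the right namespace URI
aliases = {'x': 'uri:artistic-work', 'd': 'uri:monty-python'}
films = root.findall('.//x:Film', namespaces=aliases)

# Elements in the default namespace also need an alias of their own
titles = root.findall('.//d:Title', namespaces=aliases)

print(len(unprefixed), len(films), len(titles))
```

The alias names are arbitrary; what matters is that each one maps to the URI
the document actually uses.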
303 |
304 | Filtering Node Lists
305 | --------------------
306 |
307 | Many *xml4h* node attributes return a list of nodes as a
308 | :class:`~xml4h.nodes.NodeList` object which confers some special filtering
309 | powers. You get this special node list object from attributes like
310 | ``children``, ``ancestors``, and ``siblings``, and from the ``find`` search
311 | method if it has element results.
312 |
313 | Here are some examples of how you can easily filter a
314 | :class:`~xml4h.nodes.NodeList` to get just the
315 | nodes you need:
316 |
317 | - Get the first child node using the ``filter`` method::
318 |
319 | >>> # Filter to get just the first child
320 | >>> doc.root.children.filter(first_only=True)
321 | <xml4h.nodes.Element: "Film">
322 |
323 | >>> # The document has 7 element children of the root
324 | >>> len(doc.root.children)
325 | 7
326 |
327 | - Get the first child node by treating ``children`` as a callable::
328 |
329 | >>> doc.root.children(first_only=True)
330 | <xml4h.nodes.Element: "Film">
331 |
332 | When you treat the node list as a callable it calls the ``filter`` method
333 | behind the scenes, but since doing it the callable way is quicker and
334 | clearer in code we will use that approach from now on.
335 |
336 | - Get the first child node with the ``child`` filtering method, which accepts
337 | the same constraints as the ``filter`` method::
338 |
339 | >>> doc.root.child()
340 | <xml4h.nodes.Element: "Film">
341 |
342 | >>> # Apply filtering with child
343 | >>> print(doc.root.child('WrongName'))
344 | None
345 |
346 | - Get the first of a set of children with the ``first`` attribute::
347 |
348 | >>> doc.root.children.first
349 | <xml4h.nodes.Element: "Film">
350 |
351 |
352 | - Filter the node list by name::
353 |
354 | >>> for n in doc.root.children('Film'):
355 | ... print(n.Title.text)
356 | And Now for Something Completely Different
357 | Monty Python and the Holy Grail
358 | Monty Python's Life of Brian
359 | Monty Python Live at the Hollywood Bowl
360 | Monty Python's The Meaning of Life
361 | Monty Python: Almost the Truth (The Lawyer's Cut)
362 | A Liar's Autobiography: Volume IV
363 |
364 | >>> len(doc.root.children('WrongName'))
365 | 0
366 |
367 | .. note::
368 | Passing a node name as the first argument will match the *local* name of
369 | a node. You can match the full node name, which might include a prefix
370 | for example, with a call like: ``.children(name='SomeName')``.
371 |
372 | - Filter with a custom function::
373 |
374 | >>> # Filter to films released in the year 1979
375 | >>> for n in doc.root.children('Film',
376 | ... filter_fn=lambda node: node.attributes['year'] == '1979'):
377 | ... print(n.Title.text)
378 | Monty Python's Life of Brian
379 |
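To make the "callable node list" idea concrete, here is a minimal sketch of a
list subclass whose call operator delegates to a ``filter`` method. The
``FilterableList`` class is hypothetical and is not *xml4h*'s ``NodeList``
implementation; plain dictionaries stand in for nodes.

```python
class FilterableList(list):
    """Hypothetical stand-in for a node list; not xml4h's NodeList."""

    def filter(self, name=None, filter_fn=None, first_only=False):
        matches = [
            item for item in self
            if (name is None or item.get('name') == name)
            and (filter_fn is None or filter_fn(item))
        ]
        if first_only:
            # Return a single item, or None when nothing matches
            return matches[0] if matches else None
        return FilterableList(matches)

    def __call__(self, *args, **kwargs):
        # Calling the list is shorthand for calling filter()
        return self.filter(*args, **kwargs)

    @property
    def first(self):
        return self[0] if self else None


# Plain dicts stand in for nodes in this sketch
nodes = FilterableList([
    {'name': 'Film', 'year': '1971'},
    {'name': 'Film', 'year': '1979'},
    {'name': 'Note', 'year': None},
])

films = nodes('Film')
from_1979 = nodes('Film', filter_fn=lambda n: n['year'] == '1979')
missing = nodes('WrongName', first_only=True)
```

Returning another ``FilterableList`` from ``filter`` is what lets filters be
chained or followed by ``first``.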
380 |
381 | Manipulating Nodes and Elements
382 | -------------------------------
383 |
384 | *xml4h* provides simple methods to manipulate the structure and content of an
385 | XML DOM. The methods available depend on the kind of node you are interacting
386 | with, and by far the majority are for working with
387 | :class:`~xml4h.nodes.Element` nodes.
388 |
389 |
390 | Delete a Node
391 | .............
392 |
393 | Any node can be removed from its owner document with
394 | :meth:`~xml4h.nodes.Node.delete`::
395 |
396 | >>> # Before deleting a Film element there are 7 films
397 | >>> len(doc.MontyPythonFilms.Film)
398 | 7
399 |
400 | >>> doc.MontyPythonFilms.children('Film')[-1].delete()
401 | >>> len(doc.MontyPythonFilms.Film)
402 | 6
403 |
404 | .. note::
405 | By default deleting a node also destroys it, but it can optionally be left
406 | intact after removal from the document by including the ``destroy=False``
407 | option.
408 |
409 | Name and Value Attributes
410 | .........................
411 |
412 | Many nodes have low-level name and value properties that can be read from and
413 | written to. Nodes with names and values include Text, CDATA, Comment,
414 | ProcessingInstruction, Attribute, and Element nodes.
415 |
416 | Here is an example of accessing the low-level name and value properties of a
417 | Text node::
418 |
419 | >>> text_node = doc.MontyPythonFilms.child('Film').child('Title').child()
420 | >>> text_node.is_text
421 | True
422 |
423 | >>> text_node.name
424 | '#text'
425 | >>> text_node.value
426 | 'And Now for Something Completely Different'
427 |
428 | And here is the same for an Attribute node::
429 |
430 | >>> # Access the name/value properties of an Attribute node
431 | >>> year_attr = doc.MontyPythonFilms.child('Film').attribute_node('year')
432 | >>> year_attr.is_attribute
433 | True
434 |
435 | >>> year_attr.name
436 | 'year'
437 | >>> year_attr.value
438 | '1971'
439 |
440 | The name attribute of a node is not necessarily a plain string. For nodes
441 | within a defined namespace the ``name`` attribute may comprise two
442 | components: a ``prefix`` that represents the namespace, and a ``local_name``
443 | which is the plain name of the node ignoring the namespace. For more
444 | information on namespaces see :ref:`xml4h-namespaces`.
445 |
446 | Import a Node and its Descendants
447 | .................................
448 |
449 | In addition to manipulating nodes in a single XML document directly, you can
450 | also import a node (and all its descendants) from another document using a node
451 | clone or transplant operation.
452 |
453 | There are two ways to import a node and its descendants:
454 |
455 | - Use the :meth:`~xml4h.nodes.Node.clone_node` Node method or
456 | :meth:`~xml4h.builder.Builder.clone` Builder method to copy a node into your
457 | document without removing it from its original document.
458 | - Use the :meth:`~xml4h.nodes.Node.transplant_node` Node method or
459 | :meth:`~xml4h.builder.Builder.transplant` Builder method to transplant a node
460 | into your document and remove it from its original document.
461 |
462 | Here is an example of transplanting a node into a document (which also happens
463 | to undo the damage we did to our example DOM in the ``delete()`` example
464 | above)::
465 |
466 | >>> # Build a new document containing a Film element
467 | >>> film_builder = (xml4h.build('DeletedFilm')
468 | ... .element('Film').attrs(year='1971')
469 | ... .element('Title')
470 | ... .text('And Now for Something Completely Different').up()
471 | ... .element('Description').text(
472 | ... "A collection of sketches from the first and second TV"
473 | ... " series of Monty Python's Flying Circus purposely"
474 | ... " re-enacted and shot for film.")
475 | ... )
476 |
477 | >>> # Transplant the Film element from the new document
478 | >>> node_to_transplant = film_builder.root.child('Film')
479 | >>> doc.MontyPythonFilms.transplant_node(node_to_transplant)
480 | >>> len(doc.MontyPythonFilms.Film)
481 | 7
482 |
483 | When you transplant a node from another document it is removed from that
484 | document::
485 |
486 | >>> # After transplanting the Film node it is no longer in the original doc
487 | >>> len(film_builder.root.find('Film'))
488 | 0
489 |
490 | If you need to leave the original document unchanged when importing a node use
491 | the clone methods instead.
492 |
493 | Working with Elements
494 | .....................
495 |
496 | Element nodes have the most methods to access and manipulate their content,
497 | which is fitting since this is the most useful type of node and you will deal
498 | with elements regularly.
499 |
500 | The leaf elements in XML documents often have one or more
501 | :class:`~xml4h.nodes.Text` node children that contain the element's data
502 | content. While you could iterate over such text nodes as child nodes, *xml4h*
503 | provides the more convenient text accessors you would expect::
504 |
505 | >>> title_elem = doc.MontyPythonFilms.Film[0].Title
506 | >>> orig_title = title_elem.text
507 | >>> orig_title
508 | 'And Now for Something Completely Different'
509 |
510 | >>> title_elem.text = 'A new, and wrong, title'
511 | >>> title_elem.text
512 | 'A new, and wrong, title'
513 |
514 | >>> # Let's put it back the way it was...
515 | >>> title_elem.text = orig_title
516 |
517 | Elements also have attributes that can be manipulated in a number of ways.
518 |
519 | Look up an element's attributes with:
520 |
521 | - the :attr:`~xml4h.nodes.Element.attributes` attribute (or aliases ``attrib``
522 | and ``attrs``) that returns an ordered dictionary of attribute names and
523 | values::
524 |
525 | >>> film_elem = doc.MontyPythonFilms.Film[0]
526 | >>> film_elem.attributes
527 |
528 |
529 | - or by obtaining an element's attributes as :class:`~xml4h.nodes.Attribute`
530 | nodes, though that is only likely to be useful in unusual circumstances::
531 |
532 | >>> film_elem.attribute_nodes
533 | [<xml4h.nodes.Attribute: "year">]
534 |
535 | >>> # Get a specific attribute node by name or namespace URI
536 | >>> film_elem.attribute_node('year')
537 | <xml4h.nodes.Attribute: "year">
538 |
539 | - and there's also the "magical" keyword lookup technique discussed in
540 | :ref:`magical-node-traversal` for quickly grabbing attribute values.
541 |
542 | Set attribute values with:
543 |
544 | - the :meth:`~xml4h.nodes.Element.set_attributes` method, which allows you to
545 | add attributes without replacing existing ones. This method also supports
546 | defining XML attributes as a dictionary, list of name/value pairs, or
547 | keyword arguments::
548 |
549 | >>> # Set/add attributes as a dictionary
550 | >>> film_elem.set_attributes({'a1': 'v1'})
551 |
552 | >>> # Set/add attributes as a list of name/value pairs
553 | >>> film_elem.set_attributes([('a2', 'v2')])
554 |
555 | >>> # Set/add attributes as keyword arguments
556 | >>> film_elem.set_attributes(a3='v3', a4=4)
557 |
558 | >>> film_elem.attributes
559 |
560 |
561 | - the setter version of the :attr:`~xml4h.nodes.Element.attributes` attribute,
562 | which replaces any existing attributes with the new set::
563 |
564 | >>> film_elem.attributes = {'year': '1971', 'note': 'funny'}
565 | >>> film_elem.attributes
566 |
567 |
568 | Delete attributes from an element by:
569 |
570 | - using Python's delete-in-dict technique::
571 |
572 | >>> del(film_elem.attributes['note'])
573 | >>> film_elem.attributes
574 |
575 |
576 | - or by calling the ``delete()`` method on an :class:`~xml4h.nodes.Attribute`
577 | node.
578 |
579 | Finally, the :class:`~xml4h.nodes.Element` class provides a number of methods
580 | for programmatically adding child nodes, for cases where you would rather work
581 | directly with nodes instead of using a :ref:`builder`.
582 |
583 | The most complex of these methods is :meth:`~xml4h.nodes.Element.add_element`
584 | which allows you to add a named child element, and optionally to set the new
585 | element's namespace, text content, and attributes all at the same time. Let's
586 | try an example::
587 |
588 | >>> # Add a Film element with an attribute
589 | >>> new_film_elem = doc.MontyPythonFilms.add_element(
590 | ... 'Film', attributes={'year': 'never'})
591 |
592 | >>> # Add a Description element with text content
593 | >>> desc_elem = new_film_elem.add_element(
594 | ... 'Description', text='Just testing...')
595 |
596 | >>> # Add a Title element with text *before* the description element
597 | >>> title_elem = desc_elem.add_element(
598 | ... 'Title', text='The Film that Never Was', before_this_element=True)
599 |
600 | >>> print(doc.MontyPythonFilms.Film[-1].xml())
601 | <Film year="never">
602 |     <Title>The Film that Never Was</Title>
603 |     <Description>Just testing...</Description>
604 | </Film>
605 |
606 | There are similar methods for handling simpler cases like adding text nodes,
607 | comments etc. Here is an example of adding text nodes::
608 |
609 | >>> # Add a text node
610 | >>> title_elem = doc.MontyPythonFilms.Film[-1].Title
611 | >>> title_elem.add_text(', and Never Will Be')
612 |
613 | >>> title_elem.text
614 | 'The Film that Never Was, and Never Will Be'
615 |
616 | Refer to the :class:`~xml4h.nodes.Element` documentation for more information
617 | about the other methods for adding nodes.
618 |
619 |
620 | .. _wrap-unwrap-nodes:
621 |
622 | Wrapping and Unwrapping *xml4h* Nodes
623 | -------------------------------------
624 |
625 | You can easily convert to or from *xml4h*'s wrapped version of an
626 | implementation node. For example, if you prefer the *lxml* library's
627 | ``ElementMaker`` document builder
628 | approach to the :ref:`xml4h Builder <builder>`, you can create a document
629 | in *lxml*...
630 |
631 | ::
632 |
633 | >>> from lxml.builder import ElementMaker
634 | >>> E = ElementMaker()
635 | >>> lxml_doc = E.DocRoot(
636 | ... E.Item(
637 | ... E.Name('Item 1'),
638 | ... E.Value('Value 1')
639 | ... ),
640 | ... E.Item(
641 | ... E.Name('Item 2'),
642 | ... E.Value('Value 2')
643 | ... )
644 | ... )
645 | >>> lxml_doc # doctest:+ELLIPSIS
646 | <Element DocRoot at ...>
647 |
651 | >>> # Convert lxml Document to xml4h version
652 | >>> xml4h_doc = xml4h.LXMLAdapter.wrap_document(lxml_doc)
653 | >>> xml4h_doc.children
654 | [<xml4h.nodes.Element: "Item">, <xml4h.nodes.Element: "Item">]
655 |
656 | >>> # Get an element within the lxml document
657 | >>> lxml_elem = list(lxml_doc)[0]
658 | >>> lxml_elem # doctest:+ELLIPSIS
659 | <Element Item at ...>
660 |
661 | >>> # Convert lxml Element to xml4h version
662 | >>> xml4h_elem = xml4h.LXMLAdapter.wrap_node(lxml_elem, lxml_doc)
663 | >>> xml4h_elem # doctest:+ELLIPSIS
664 | <xml4h.nodes.Element: "Item">
665 |
666 | You can reach the underlying XML implementation document or node at any time
667 | from an *xml4h* node::
668 |
669 | >>> # Get an xml4h node's underlying implementation node
670 | >>> xml4h_elem.impl_node # doctest:+ELLIPSIS
671 | <Element Item at ...>
672 | >>> xml4h_elem.impl_node == lxml_elem
673 | True
674 |
675 | >>> # Get the underlying implementation document from any node
676 | >>> xml4h_elem.impl_document # doctest:+ELLIPSIS
677 | <Element DocRoot at ...>
678 | >>> xml4h_elem.impl_document == lxml_doc
679 | True
680 |
681 |
--------------------------------------------------------------------------------
/docs/parser.rst:
--------------------------------------------------------------------------------
1 | ======
2 | Parser
3 | ======
4 |
5 | The *xml4h* parser is a simple wrapper around the parser provided by an
6 | underlying :ref:`XML library implementation `.
7 |
8 | .. _parser-parse:
9 |
10 | Parse function
11 | --------------
12 |
13 | To parse XML documents with *xml4h* you feed the :func:`xml4h.parse` function
14 | an XML text document in one of three forms:
15 |
16 | - A file-like object::
17 |
18 | >>> import xml4h
19 |
20 | >>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
21 | >>> doc = xml4h.parse(xml_file)
22 |
23 | >>> doc.MontyPythonFilms
24 | <xml4h.nodes.Element: "MontyPythonFilms">
25 |
26 | - A file path string::
27 |
28 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
29 |
30 | >>> doc.root['source']
31 | 'http://en.wikipedia.org/wiki/Monty_Python'
32 |
33 | - A string containing literal XML content::
34 |
35 | >>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
36 | >>> xml_text = xml_file.read()
37 | >>> doc = xml4h.parse(xml_text)
38 |
39 | >>> len(doc.find('Film'))
40 | 7
41 |
42 | .. note:: The :func:`~xml4h.parse` function distinguishes between a file path
43 | string and an XML text string by looking for a ``<`` character
44 | in the value.
45 |
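The three input forms and the ``<`` check described in the note can be
sketched roughly as follows. This is not *xml4h*'s actual code: the
``parse_any`` helper is hypothetical and uses the standard library's
``xml.etree.ElementTree`` purely to illustrate the dispatch logic.

```python
import io
import xml.etree.ElementTree as ET


def parse_any(source):
    """Hypothetical dispatcher: file-like object, XML text, or file path."""
    if hasattr(source, 'read'):
        # File-like object: let the parser read from it directly
        return ET.parse(source).getroot()
    # Accept bytes as well as str when checking for markup
    marker = b'<' if isinstance(source, bytes) else '<'
    if marker in source:
        # Contains '<', so treat the value as literal XML content
        return ET.fromstring(source)
    # No '<' present, so treat the value as a file path
    return ET.parse(source).getroot()


from_text = parse_any('<Root><Item/></Root>')
from_stream = parse_any(io.BytesIO(b'<Root><Item/></Root>'))
```

One consequence of this heuristic is that a string without a ``<`` character
is always treated as a path, so a mistyped path fails with a file error rather
than an XML parse error.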
46 |
47 | Stripping of Whitespace Nodes
48 | -----------------------------
49 |
50 | By default the *parse* method ignores whitespace nodes in the XML document
51 | -- or more accurately, it does extra work to remove these nodes after the
52 | document has been parsed by the underlying XML library.
53 |
54 | Whitespace nodes are rarely interesting, since they are usually the result of
55 | XML content that has been serialized with extra whitespace to make it more
56 | readable to humans.
57 |
58 | However if you need to keep these nodes, or if you want to avoid the extra
59 | processing overhead when parsing large documents, you can disable this
60 | feature by passing in the ``ignore_whitespace_text_nodes=False`` flag::
61 |
62 | >>> # Strip whitespace nodes from document
63 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
64 |
65 | >>> # No excess text nodes (XML doc lists 7 films)
66 | >>> len(doc.MontyPythonFilms.children)
67 | 7
68 | >>> doc.MontyPythonFilms.children[0]
69 | <xml4h.nodes.Element: "Film">
70 |
71 |
72 | >>> # Don't strip whitespace nodes
73 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml',
74 | ... ignore_whitespace_text_nodes=False)
75 |
76 | >>> # An extra text node is present
77 | >>> len(doc.MontyPythonFilms.children)
78 | 8
79 | >>> doc.MontyPythonFilms.children[0]
80 |
81 |
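For a sense of what the whitespace-stripping step involves, here is a minimal
sketch using the standard library's ``xml.dom.minidom``. The
``strip_whitespace_text_nodes`` helper is hypothetical and only approximates
the idea; it is not the code *xml4h* runs.

```python
from xml.dom import minidom


def strip_whitespace_text_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Copy the child list because we mutate it while iterating
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_text_nodes(child)


dom = minidom.parseString('<Films>\n    <Film/>\n    <Film/>\n</Films>')
root = dom.documentElement

before = len(root.childNodes)   # two Film elements plus three whitespace text nodes
strip_whitespace_text_nodes(root)
after = len(root.childNodes)    # only the two Film elements remain
```

The extra pass over the whole tree is the processing overhead mentioned above,
which is why skipping it can help with large documents.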
--------------------------------------------------------------------------------
/docs/writer.rst:
--------------------------------------------------------------------------------
1 | ======
2 | Writer
3 | ======
4 |
5 | The *xml4h* writer produces serialized XML text documents formatted more
6 | traditionally – and in our opinion more correctly – than the other Python XML
7 | libraries.
8 |
9 | .. _writer-write-methods:
10 |
11 | Write methods
12 | -------------
13 |
14 | To write out an XML document with *xml4h* you will generally use the
15 | :meth:`~xml4h.nodes.Node.write` or :meth:`~xml4h.nodes.Node.write_doc` methods
16 | available on any *xml4h* node.
17 |
18 | The writer methods require a file or any IO stream object as the first
19 | argument, and will automatically handle text or binary IO streams.
20 |
21 | The :meth:`~xml4h.nodes.Node.write` method outputs the current node and any
22 | descendants::
23 |
24 | >>> import xml4h
25 | >>> doc = xml4h.parse('tests/data/monty_python_films.xml')
26 | >>> first_film_elem = doc.find('Film')[0]
27 |
28 | >>> # Write XML node to stdout
29 | >>> import sys
30 | >>> first_film_elem.write(sys.stdout, indent=True) # doctest:+ELLIPSIS
31 |
32 | And Now for Something Completely Different
33 | A collection of sketches from the first and second...
34 |
35 |
36 | The :meth:`~xml4h.nodes.Node.write_doc` method outputs the entire document no
37 | matter which node you call it on::
38 |
39 | >>> first_film_elem.write_doc(sys.stdout, indent=True) # doctest:+ELLIPSIS
40 |
41 |
42 |
43 | And Now for Something Completely Different
44 | A collection of sketches from the first and second...
45 |
46 | ...
47 |
48 | To send output to a file::
49 |
50 | >>> # Write to a file
51 | >>> with open('/tmp/example.xml', 'wb') as f:
52 | ... first_film_elem.write_doc(f)
53 |
54 | .. _writer-xml-methods:
55 |
56 | Get XML as a string
57 | -------------------
58 |
59 | Because you will often want to generate a string of XML content directly,
60 | *xml4h* includes the convenience methods :meth:`~xml4h.nodes.Node.xml`
61 | and :meth:`~xml4h.nodes.Node.xml_doc` to do this easily.
62 |
63 | The :meth:`~xml4h.nodes.Node.xml` method works like the *write* method and
64 | will return a string of XML content including the current node and its
65 | descendants::
66 |
67 | >>> print(first_film_elem.xml()) # doctest:+ELLIPSIS
68 |
69 | And Now for Something Completely...
70 |
71 | The :meth:`~xml4h.nodes.Node.xml_doc` method works like the *write_doc*
72 | method and returns a string for the whole document::
73 |
74 | >>> print(first_film_elem.xml_doc()) # doctest:+ELLIPSIS
75 |
76 |
77 |
78 | And Now for Something Completely Different
79 | A collection of sketches from the first and second...
80 |
81 | ...
82 |
83 | .. note::
84 | *xml4h* assumes that when you directly generate an XML string with these
85 | methods it is intended for human consumption, so it applies pretty-print
86 | formatting by default.
87 |
88 |
89 | .. _writer-formatting:
90 |
91 | Format Output
92 | -------------
93 |
94 | The *write* and *xml* methods accept a range of formatting options to control
95 | how XML content is serialized. These are useful if you expect a human to read
96 | the resulting data.
97 |
98 | For the full range of formatting options see the code documentation for
99 | :meth:`~xml4h.nodes.Node.write` and :meth:`~xml4h.nodes.Node.xml` et al.
100 | but here are some pointers to get you started:
101 |
102 | - Set ``indent=True`` to write a pretty-printed XML document with four space
103 | characters for indentation and ``\n`` for newlines.
104 | - To use a tab character for indenting and ``\r\n`` for newlines:
105 | ``indent='\t', newline='\r\n'``.
106 | - *xml4h* writes *utf-8*-encoded documents by default; to write with a
107 | different encoding: ``encoding='iso-8859-1'``.
108 | - To avoid outputting the XML declaration when writing a document:
109 | ``omit_declaration=True``.
110 |
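For comparison, the standard library's ``xml.dom.minidom`` exposes similar
knobs under different names. The sketch below uses ``toprettyxml`` with its
``indent``, ``newl`` and ``encoding`` parameters; it illustrates the same kinds
of options and says nothing about how *xml4h* implements them.

```python
from xml.dom import minidom

dom = minidom.parseString('<Films><Film year="1971"/></Films>')

# Tab indents, Windows-style newlines, and a non-default encoding.
# With an explicit encoding, minidom returns bytes.
xml_bytes = dom.toprettyxml(indent='\t', newl='\r\n', encoding='iso-8859-1')

# Serializing an element rather than the document omits the XML declaration.
# Without an encoding, minidom returns text.
xml_text = dom.documentElement.toprettyxml(indent='    ')
```

Note the naming difference: *minidom* spells the newline option ``newl``,
whereas the *xml4h* option shown above is ``newline``.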
111 |
112 | Write using the underlying implementation
113 | -----------------------------------------
114 |
115 | Because *xml4h* sits on top of an underlying
116 | :ref:`XML library implementation ` you can use that
117 | library's serialization methods if you prefer, and if you don't mind having
118 | some implementation-specific code.
119 |
120 | For example, if you are using *lxml* as the underlying library you can use
121 | its serialisation methods by accessing the implementation node::
122 |
123 | >>> # Get the implementation root node, in this case an lxml node
124 | >>> lxml_root_node = first_film_elem.root.impl_node
125 | >>> type(lxml_root_node) # doctest:+ELLIPSIS
126 | <... 'lxml.etree._Element'>
127 |
128 | >>> # Use lxml features as normal; xml4h is no longer in the picture
129 | >>> from lxml import etree
130 | >>> xml_bytes = etree.tostring(
131 | ... lxml_root_node, encoding='utf-8', xml_declaration=True, pretty_print=True)
132 | >>> print(xml_bytes.decode('utf-8')) # doctest:+ELLIPSIS
133 |
134 | And Now for Something Completely Different
135 | A collection of sketches from the first and second...
136 |
137 | Monty Python and the Holy Grail
138 | King Arthur and his knights embark on a low-budget...
139 |
140 | ...
141 |
142 | .. note::
143 | The output from *lxml* is a little quirky, at least on the author's machine.
144 | Note for example the single-quote characters in the XML declaration, and
145 | the missing newline and indent before the first ``<Film>`` element. But
146 | don't worry, that's why you have *xml4h* ;)
147 |
--------------------------------------------------------------------------------
/requirements-dev.txt:
--------------------------------------------------------------------------------
1 | # Nose for running tests
2 | six
3 | nose
4 | coverage
5 | tox
6 | sphinx
7 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | import xml4h
5 |
6 | try:
7 |     from setuptools import setup
8 | except ImportError:
9 |     from distutils.core import setup
10 |
11 | setup(
12 |     name=xml4h.__title__,
13 |     version=xml4h.__version__,
14 |     description='XML for Humans in Python',
15 |     long_description=open('README.rst').read(),
16 |     long_description_content_type='text/x-rst',
17 |     author='James Murty',
18 |     author_email='james@murty.co',
19 |     url='https://github.com/jmurty/xml4h',
20 |     packages=[
21 |         'xml4h',
22 |         'xml4h.impls',
23 |     ],
24 |     package_dir={'xml4h': 'xml4h'},
25 |     package_data={'': ['README.rst', 'LICENSE']},
26 |     include_package_data=True,
27 |     install_requires=[
28 |         'six',
29 |     ],
30 |     license='MIT License',
31 |     # http://pypi.python.org/pypi?%3Aaction=list_classifiers
32 |     classifiers=[
33 |         'Development Status :: 4 - Beta',
34 |         'Intended Audience :: Developers',
35 |         'Topic :: Text Processing :: Markup :: XML',
36 |         'Natural Language :: English',
37 |         'License :: OSI Approved :: MIT License',
38 |         'Programming Language :: Python',
39 |         'Programming Language :: Python :: 2.7',
40 |         'Programming Language :: Python :: 3.5',
41 |         'Programming Language :: Python :: 3.6',
42 |         'Programming Language :: Python :: 3.7',
43 |         'Programming Language :: Python :: 3.8',
44 |     ],
45 |     test_suite='tests',
46 | )
47 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jmurty/xml4h/83bc0a91afe5d6e17d6c99ec43dc0aec9593cc06/tests/__init__.py
--------------------------------------------------------------------------------
/tests/data/example_doc.small.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
--------------------------------------------------------------------------------
/tests/data/example_doc.unicode.xml:
--------------------------------------------------------------------------------
1 |
2 | <جذر xmlns="urn:default" xmlns:důl="urn:custom">
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 | </جذر>
11 |
--------------------------------------------------------------------------------
/tests/data/monty_python_films.ns.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 |
5 | And Now for Something Completely Different
6 | A collection of sketches from the first and second TV series of Monty Python's Flying Circus purposely re-enacted and shot for film.
7 |
8 |
9 | Monty Python and the Holy Grail
10 | King Arthur and his knights embark on a low-budget search for the Holy Grail, encountering humorous obstacles along the way. Some of these turned into standalone sketches.
11 |
12 |
13 | Monty Python's Life of Brian
14 | Brian is born on the first Christmas, in the stable next to Jesus'. He spends his life being mistaken for a messiah.
15 |
16 |
17 | Monty Python Live at the Hollywood Bowl
18 | A videotape recording directed by Ian MacNaughton of a live performance of sketches. Originally intended for a TV/video special. Transferred to 35mm and given a limited theatrical release in the US.
19 |
20 |
21 | Monty Python's The Meaning of Life
22 | An examination of the meaning of life in a series of sketches from conception to death and beyond.
23 |
24 |
25 | Monty Python: Almost the Truth (The Lawyer's Cut)
26 | This film features interviews with all the surviving Python members, along with archive representation for the late Graham Chapman.
27 |
28 |
29 | A Liar's Autobiography: Volume IV
30 | This is an animated film which is based on the memoir of the late Monty Python member, Graham Chapman.
31 |
32 |
33 |
--------------------------------------------------------------------------------
/tests/data/monty_python_films.xml:
--------------------------------------------------------------------------------
1 |
2 | <MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
3 |     <Film year="1971">
4 |         <Title>And Now for Something Completely Different</Title>
5 |         <Description>A collection of sketches from the first and second TV series of Monty Python's Flying Circus purposely re-enacted and shot for film.</Description>
6 |     </Film>
7 |     <Film year="1974">
8 |         <Title>Monty Python and the Holy Grail</Title>
9 |         <Description>King Arthur and his knights embark on a low-budget search for the Holy Grail, encountering humorous obstacles along the way. Some of these turned into standalone sketches.</Description>
10 |     </Film>
11 |     <Film year="1979">
12 |         <Title>Monty Python's Life of Brian</Title>
13 |         <Description>Brian is born on the first Christmas, in the stable next to Jesus'. He spends his life being mistaken for a messiah.</Description>
14 |     </Film>
15 |     <Film year="1982">
16 |         <Title>Monty Python Live at the Hollywood Bowl</Title>
17 |         <Description>A videotape recording directed by Ian MacNaughton of a live performance of sketches. Originally intended for a TV/video special. Transferred to 35mm and given a limited theatrical release in the US.</Description>
18 |     </Film>
19 |     <Film year="1983">
20 |         <Title>Monty Python's The Meaning of Life</Title>
21 |         <Description>An examination of the meaning of life in a series of sketches from conception to death and beyond.</Description>
22 |     </Film>
23 |     <Film year="2009">
24 |         <Title>Monty Python: Almost the Truth (The Lawyer's Cut)</Title>
25 |         <Description>This film features interviews with all the surviving Python members, along with archive representation for the late Graham Chapman.</Description>
26 |     </Film>
27 |     <Film year="2012">
28 |         <Title>A Liar's Autobiography: Volume IV</Title>
29 |         <Description>This is an animated film which is based on the memoir of the late Monty Python member, Graham Chapman.</Description>
30 |     </Film>
31 | </MontyPythonFilms>
32 |
--------------------------------------------------------------------------------
/tests/test_parser.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import six
3 | import unittest
4 | import os
5 | import re
6 |
7 | import xml4h
8 |
9 |
10 | class TestParserBasics(unittest.TestCase):
11 |
12 | @property
13 | def small_xml_file_path(self):
14 | return os.path.join(
15 | os.path.dirname(__file__), 'data/example_doc.small.xml')
16 |
17 | def test_parse_with_default_parser(self):
18 | # Explicit use of default/best adapter
19 | dom = xml4h.parse(self.small_xml_file_path, adapter=xml4h.best_adapter)
20 | self.assertEqual(8, len(dom.find()))
21 | # Implicit use of default/best adapter
22 | dom = xml4h.parse(self.small_xml_file_path)
23 | self.assertEqual(8, len(dom.find()))
24 | self.assertEqual(xml4h.best_adapter, dom.adapter_class)
25 |
26 |
27 | class BaseParserTest(object):
28 | """
29 | Tests to exercise parsing across all xml4h implementations.
30 | """
31 |
32 | @property
33 | def small_xml_file_path(self):
34 | return os.path.join(
35 | os.path.dirname(__file__), 'data/example_doc.small.xml')
36 |
37 | @property
38 | def unicode_xml_file_path(self):
39 | return os.path.join(
40 | os.path.dirname(__file__), 'data/example_doc.unicode.xml')
41 |
42 | def parse(self, xml_str):
43 | return xml4h.parse(xml_str, adapter=self.adapter)
44 |
45 | def test_auto_detect_filename_or_xml_data(self):
46 | # String with a '<' is parsed as literal XML data
47 | dom = self.parse('\n\n\tcontent')
48 | self.assertEqual(2, len(dom.find()))
49 | # String without a '<' is treated as a file path -- invalid path
50 | self.assertRaises(IOError, self.parse, 'not/a/real/file/path')
51 | # String without a '<' is treated as a file path -- valid path
52 | self.parse(self.small_xml_file_path)
53 |
54 | def test_parse_file(self):
55 | wrapped_doc = self.parse(self.small_xml_file_path)
56 | self.assertIsInstance(wrapped_doc, xml4h.nodes.Document)
57 | self.assertEqual(8, len(wrapped_doc.find()))
58 | # Check element namespaces
59 | self.assertEqual(
60 | ['DocRoot', 'NSDefaultImplicit', 'NSDefaultExplicit',
61 | 'Attrs1', 'Attrs2'],
62 | [n.name for n in wrapped_doc.find(ns_uri='urn:default')])
63 | self.assertEqual(
64 | ['urn:custom', 'urn:custom', 'urn:custom'],
65 | [n.namespace_uri for n in wrapped_doc.find(ns_uri='urn:custom')])
66 | # We test local name, not full name, here as different XML libraries
67 | # retain (or not) different literal element prefixes differently.
68 | self.assertEqual(
69 | ['NSCustomExplicit',
70 | 'NSCustomWithPrefixImplicit',
71 | 'NSCustomWithPrefixExplicit'],
72 | [n.local_name for n in wrapped_doc.find(ns_uri='urn:custom')])
73 | # Check namespace attributes
74 | self.assertEqual(
75 | [xml4h.nodes.Node.XMLNS_URI, xml4h.nodes.Node.XMLNS_URI],
76 | [n.namespace_uri for n in wrapped_doc.root.attribute_nodes])
77 | attrs1_elem = wrapped_doc.find_first('Attrs1')
78 | self.assertNotEqual(None, attrs1_elem)
79 | self.assertEqual([None],
80 | [n.namespace_uri for n in attrs1_elem.attribute_nodes])
81 | attrs2_elem = wrapped_doc.find_first('Attrs2')
82 | self.assertEqual(['urn:custom'],
83 | [n.namespace_uri for n in attrs2_elem.attribute_nodes])
84 |
85 | def test_roundtrip(self):
86 | orig_xml = open(self.small_xml_file_path).read()
87 | # We discard semantically unnecessary namespace prefixes on
88 | # element names.
89 | orig_xml = re.sub(
90 | '',
91 | '', orig_xml)
92 | if self.adapter == xml4h.LXMLAdapter:
93 | # lxml parser does not make it possible to retain semantically
94 | # unnecessary 'xmlns' namespace definitions in all elements.
95 | # It's not worth failing the roundtrip test just for this
96 | orig_xml = re.sub(
97 | '',
98 | '', orig_xml)
99 | doc = self.parse(self.small_xml_file_path)
100 | roundtrip_xml = doc.xml_doc()
101 | self.assertEqual(six.text_type(orig_xml), roundtrip_xml)
102 |
103 | def test_unicode(self):
104 | # NOTE lxml doesn't support unicode namespace URIs?
105 | doc = self.parse(self.unicode_xml_file_path)
106 | self.assertEqual(u'جذر', doc.root.name)
107 | self.assertEqual(u'urn:default', doc.root.attributes['xmlns'])
108 | self.assertEqual(u'urn:custom', doc.root.attributes[u'xmlns:důl'])
109 | self.assertEqual(5, len(doc.find(ns_uri=u'urn:default')))
110 | self.assertEqual(3, len(doc.find(ns_uri=u'urn:custom')))
111 | self.assertEqual(u'1', doc.find_first(u'yếutố1').attributes[u'תכונה'])
112 | self.assertEqual(u'tvö',
113 | doc.find_first(u'yếutố2').attributes[u'důl:עודתכונה'])
114 |
115 |
116 | class TestXmlDomParser(unittest.TestCase, BaseParserTest):
117 |
118 | @property
119 | def adapter(self):
120 | return xml4h.XmlDomImplAdapter
121 |
122 |
123 | class TestLXMLEtreeParser(unittest.TestCase, BaseParserTest):
124 |
125 | @property
126 | def adapter(self):
127 | if not xml4h.LXMLAdapter.is_available():
128 | self.skipTest("lxml library is not installed")
129 | return xml4h.LXMLAdapter
130 |
131 |
132 | class TestElementTreeEtreeParser(unittest.TestCase, BaseParserTest):
133 |
134 | @property
135 | def adapter(self):
136 | if not xml4h.ElementTreeAdapter.is_available():
137 | self.skipTest(
138 | "ElementTree library is not installed or is outdated")
139 | return xml4h.ElementTreeAdapter
140 |
141 |
142 | class TestcElementTreeEtreeParser(unittest.TestCase, BaseParserTest):
143 |
144 | @property
145 | def adapter(self):
146 | if not xml4h.cElementTreeAdapter.is_available():
147 | self.skipTest(
148 | "cElementTree library is not installed or is outdated")
149 | return xml4h.cElementTreeAdapter
150 |
--------------------------------------------------------------------------------
/tests/test_writer.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import unittest
3 | import functools
4 | import six
5 |
6 | import xml4h
7 |
8 |
9 | class BaseWriterTest(object):
10 |
11 | @property
12 | def my_builder(self):
13 | return functools.partial(xml4h.build, adapter=self.adapter)
14 |
15 | def setUp(self):
16 | # Create test document
17 | self.builder = (
18 | self.my_builder('DocRoot')
19 | .element('Elem1').text(u'默认جذ').up()
20 | .element('Elem2'))
21 | # Handy IO writer
22 | self.iobytes = six.BytesIO()
23 |
24 | def test_write_defaults(self):
25 | """ Default write output is utf-8 with no pretty-printing """
26 | xml = (
27 |             u'<?xml version="1.0" encoding="utf-8"?>'
28 |             u'<DocRoot>'
29 |             u'<Elem1>默认جذ</Elem1>'
30 |             u'<Elem2/>'
31 |             u'</DocRoot>'
32 | )
33 | io_string = six.StringIO()
34 | self.builder.write_doc(io_string)
35 | if six.PY2:
36 | self.assertEqual(xml.encode('utf-8'), io_string.getvalue())
37 | else:
38 | self.assertEqual(xml, io_string.getvalue())
39 |
40 | def test_write_current_node_and_descendents(self):
41 | self.builder.dom_element.write(self.iobytes)
42 |         self.assertEqual(b'<Elem2/>', self.iobytes.getvalue())
43 |
44 | def test_write_utf8_by_default(self):
45 | # Default write output is utf-8, with no pretty-printing
46 | xml = (
47 |             u'<?xml version="1.0" encoding="utf-8"?>'
48 |             u'<DocRoot>'
49 |             u'<Elem1>默认جذ</Elem1>'
50 |             u'<Elem2/>'
51 |             u'</DocRoot>'
52 | )
53 | self.builder.dom_element.write_doc(self.iobytes)
54 | self.assertEqual(xml.encode('utf-8'), self.iobytes.getvalue())
55 |
56 | def test_write_utf16(self):
57 | xml = (
58 |             u'<?xml version="1.0" encoding="utf-16"?>'
59 |             u'<DocRoot>'
60 |             u'<Elem1>默认جذ</Elem1>'
61 |             u'<Elem2/>'
62 |             u'</DocRoot>'
63 | )
64 | self.builder.dom_element.write_doc(self.iobytes, encoding='utf-16')
65 | self.assertEqual(xml.encode('utf-16'), self.iobytes.getvalue())
66 |
67 | def test_write_latin1_with_illegal_characters(self):
68 | self.assertRaises(UnicodeEncodeError,
69 | self.builder.dom_element.write_doc,
70 | self.iobytes, encoding='latin1', indent=2)
71 |
72 | def test_write_latin1(self):
73 | # Create latin1-friendly test document
74 | self.builder = (
75 | self.my_builder('DocRoot')
76 | .element('Elem1').text(u'Tést çæsè').up()
77 | .element('Elem2'))
78 | self.builder.dom_element.write_doc(self.iobytes, encoding='latin1')
79 | self.assertEqual(
80 |             u'<?xml version="1.0" encoding="latin1"?>'
81 |             u'<DocRoot>'
82 |             u'<Elem1>Tést çæsè</Elem1>'
83 |             u'<Elem2/>'
84 |             u'</DocRoot>'.encode('latin1'),
85 | self.iobytes.getvalue())
86 |
87 | def test_with_no_encoding(self):
88 | """No encoding writes python unicode"""
89 | xml = (
90 |             u'<?xml version="1.0"?>'
91 |             u'<DocRoot>'
92 |             u'<Elem1>默认جذ</Elem1>'
93 |             u'<Elem2/>'
94 |             u'</DocRoot>'
95 | )
96 | io_string = six.StringIO()
97 | self.builder.dom_element.write_doc(io_string, encoding=None)
98 | # NOTE Exact test, no encoding of comparison XML doc string
99 | self.assertEqual(xml, io_string.getvalue())
100 |
101 | def test_omit_declaration(self):
102 | self.builder.dom_element.write_doc(self.iobytes,
103 | omit_declaration=True)
104 | self.assertEqual(
105 |             u'<DocRoot>'
106 |             u'<Elem1>默认جذ</Elem1>'
107 |             u'<Elem2/>'
108 |             u'</DocRoot>'.encode('utf-8'),
109 | self.iobytes.getvalue())
110 |
111 | def test_default_indent_and_newline(self):
112 | """Default indent of 4 spaces with newlines when indent=True"""
113 | self.builder.dom_element.write_doc(self.iobytes, indent=True)
114 | self.assertEqual(
115 |             u'<?xml version="1.0" encoding="utf-8"?>\n'
116 |             u'<DocRoot>\n'
117 |             u'    <Elem1>默认جذ</Elem1>\n'
118 |             u'    <Elem2/>\n'
119 |             u'</DocRoot>\n'.encode('utf-8'),
120 | self.iobytes.getvalue())
121 |
122 | def test_custom_indent_and_newline(self):
123 | self.builder.dom_element.write_doc(self.iobytes,
124 | indent=8, newline='\t')
125 | self.assertEqual(
126 |             u'<?xml version="1.0" encoding="utf-8"?>\t'
127 |             u'<DocRoot>\t'
128 |             u'        <Elem1>默认جذ</Elem1>\t'
129 |             u'        <Elem2/>\t'
130 |             u'</DocRoot>\t'.encode('utf-8'),
131 | self.iobytes.getvalue())
132 |
133 |
134 | class TestXmlDomBuilder(BaseWriterTest, unittest.TestCase):
135 | """
136 | Tests building with the standard library xml.dom module, or with any
137 | library that augments/clobbers this module.
138 | """
139 |
140 | @property
141 | def adapter(self):
142 | return xml4h.XmlDomImplAdapter
143 |
144 |
145 | class TestLXMLEtreeBuilder(BaseWriterTest, unittest.TestCase):
146 | """
147 | Tests building with the lxml (lxml.etree) library.
148 | """
149 |
150 | @property
151 | def adapter(self):
152 | if not xml4h.LXMLAdapter.is_available():
153 | self.skipTest("lxml library is not installed")
154 | return xml4h.LXMLAdapter
155 |
156 |
157 | class TestElementTreeBuilder(BaseWriterTest, unittest.TestCase):
158 | """
159 | Tests building with the xml.etree.ElementTree library.
160 | """
161 |
162 | @property
163 | def adapter(self):
164 | if not xml4h.ElementTreeAdapter.is_available():
165 | self.skipTest(
166 | "ElementTree library is not installed or is outdated")
167 | return xml4h.ElementTreeAdapter
168 |
169 |
170 | class TestcElementTreeBuilder(BaseWriterTest, unittest.TestCase):
171 |     """
172 |     Tests building with the xml.etree.cElementTree library.
173 |     """
174 |
175 |     @property
176 |     def adapter(self):
177 |         if not xml4h.cElementTreeAdapter.is_available():
178 |             self.skipTest(
179 |                 "cElementTree library is not installed or is outdated")
180 |         return xml4h.cElementTreeAdapter
181 |
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist=py27,py35,py36,py37,py38,without-lxml
3 |
4 | [testenv]
5 | deps=
6 | six
7 | nose
8 | coverage
9 | lxml
10 | commands=
11 | python -m nose --with-coverage --cover-package=xml4h --with-doctest --include=docs --doctest-extension=.rst
12 |
13 | ; Run reduced tests to ensure xml4h works when lxml isn't installed
14 | [testenv:without-lxml]
15 | deps=
16 | six
17 | nose
18 | coverage
19 | commands=
20 | python -m nose
21 |
--------------------------------------------------------------------------------
/xml4h/__init__.py:
--------------------------------------------------------------------------------
1 | import six
2 |
3 | import xml4h
4 |
5 | # Make commonly-used classes and functions available in xml4h module
6 | from xml4h.impls.xml_dom_minidom import XmlDomImplAdapter
7 | from xml4h.impls.xml_etree_elementtree import (
8 | ElementTreeAdapter, cElementTreeAdapter)
9 | from xml4h.impls.lxml_etree import LXMLAdapter
10 | from xml4h.builder import Builder
11 | from xml4h.writer import write_node
12 |
13 |
14 | __title__ = 'xml4h'
15 | __version__ = '1.0'
16 |
17 |
18 | # List of xml4h adapter classes, in order of preference
19 | _ADAPTER_CLASSES = [
20 | LXMLAdapter,
21 | cElementTreeAdapter,
22 | ElementTreeAdapter,
23 | XmlDomImplAdapter]
24 |
25 | _ADAPTERS_AVAILABLE = []
26 | _ADAPTERS_UNAVAILABLE = []
27 |
28 | for impl_class in _ADAPTER_CLASSES:
29 | if impl_class.is_available():
30 | _ADAPTERS_AVAILABLE.append(impl_class)
31 | else:
32 | _ADAPTERS_UNAVAILABLE.append(impl_class)
33 |
34 |
35 | best_adapter = _ADAPTERS_AVAILABLE[0]
36 | """
37 | The best adapter available in the Python environment.
38 | This adapter is the default when parsing or creating XML documents,
39 | unless overridden by passing a specific adapter class.
40 | """
41 |
42 |
43 | def parse(
44 | to_parse, ignore_whitespace_text_nodes=True, adapter=None
45 | ):
46 | """
47 | Parse an XML document into an *xml4h*-wrapped DOM representation
48 | using an underlying XML library implementation.
49 |
50 |     :param to_parse: an XML document file, document string or bytes, or
51 |         the path to an XML file. A string or bytes value that contains
52 |         a ``<`` character is treated as literal XML data, otherwise
53 |         the value is treated as a file path.
54 |     :type to_parse: a file-like object, string, or bytes
55 | :param bool ignore_whitespace_text_nodes: if ``True`` pure whitespace
56 | nodes are stripped from the parsed document, since these are
57 | usually noise introduced by XML docs serialized to be human-friendly.
58 | :param adapter: the *xml4h* implementation adapter class used to parse
59 | the document and to interact with the resulting nodes.
60 | If None, :attr:`best_adapter` will be used.
61 | :type adapter: adapter class or None
62 |
63 | :return: an :class:`xml4h.nodes.Document` node representing the
64 | parsed document.
65 |
66 | Delegates to an adapter's :meth:`~xml4h.impls.interface.parse_string` or
67 | :meth:`~xml4h.impls.interface.parse_file` implementation.
68 | """
69 | if adapter is None:
70 | adapter = best_adapter
71 | if isinstance(to_parse, six.binary_type) and b'<' in to_parse:
72 | return adapter.parse_bytes(to_parse, ignore_whitespace_text_nodes)
73 | elif isinstance(to_parse, six.string_types) and '<' in to_parse:
74 | return adapter.parse_string(to_parse, ignore_whitespace_text_nodes)
75 | else:
76 | return adapter.parse_file(to_parse, ignore_whitespace_text_nodes)
77 |
78 |
79 | def build(tagname_or_element, ns_uri=None, adapter=None):
80 | """
81 | Return a :class:`~xml4h.builder.Builder` that represents an element in
82 | a new or existing XML DOM and provides "chainable" methods focussed
83 | specifically on adding XML content.
84 |
85 | :param tagname_or_element: a string name for the root node of a
86 | new XML document, or an :class:`~xml4h.nodes.Element` node in an
87 | existing document.
88 | :type tagname_or_element: string or :class:`~xml4h.nodes.Element` node
89 | :param ns_uri: a namespace URI to apply to the new root node. This
90 |         argument has no effect when this method is acting on an element.
91 | :type ns_uri: string or None
92 | :param adapter: the *xml4h* implementation adapter class used to
93 | interact with the document DOM nodes.
94 | If None, :attr:`best_adapter` will be used.
95 | :type adapter: adapter class or None
96 |
97 | :return: a :class:`~xml4h.builder.Builder` instance that represents an
98 | :class:`~xml4h.nodes.Element` node in an XML DOM.
99 | """
100 | if adapter is None:
101 | adapter = best_adapter
102 | if isinstance(tagname_or_element, six.string_types):
103 | doc = adapter.create_document(
104 | tagname_or_element, ns_uri=ns_uri)
105 | element = doc.root
106 | elif isinstance(tagname_or_element, xml4h.nodes.Element):
107 | element = tagname_or_element
108 | else:
109 | raise xml4h.exceptions.IncorrectArgumentTypeException(
110 | tagname_or_element, [str, xml4h.nodes.Element])
111 | return Builder(element)
112 |
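The input auto-detection performed by `parse` above can be illustrated without any XML library installed. The sketch below is a stand-alone approximation, not part of *xml4h*: `classify_parse_input` is a hypothetical helper, and it uses the built-in `bytes` and `str` types in place of `six.binary_type` and `six.string_types`, so it assumes Python 3.

```python
def classify_parse_input(to_parse):
    """Return the name of the adapter method parse() would delegate to.

    Hypothetical helper for illustration only; not part of xml4h.
    """
    if isinstance(to_parse, bytes) and b'<' in to_parse:
        # Bytes containing '<' are treated as literal XML data
        return 'parse_bytes'
    elif isinstance(to_parse, str) and '<' in to_parse:
        # Text containing '<' is treated as literal XML data
        return 'parse_string'
    else:
        # Everything else (paths, file-like objects) is handed to parse_file
        return 'parse_file'


print(classify_parse_input(b'<Root/>'))
print(classify_parse_input('<Root/>'))
print(classify_parse_input('tests/data/example_doc.small.xml'))
```

One consequence of this heuristic is that a file path containing a `<` character would be treated as XML data rather than as a path.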
--------------------------------------------------------------------------------
/xml4h/builder.py:
--------------------------------------------------------------------------------
1 | """
2 | Builder is a utility class that makes it easy to create valid, well-formed
3 | XML documents using relatively sparse python code. The builder class works
4 | by wrapping an :class:`xml4h.nodes.Element` node to provide "chainable"
5 | methods focussed specifically on adding XML content.
6 |
7 | Each method that adds content returns a Builder instance representing the
8 | current or the newly-added element. Behind the scenes, the builder uses the
9 | :mod:`xml4h.nodes` node traversal and manipulation methods to add content
10 | directly to the underlying DOM.
11 |
12 | You will not generally create Builder instances directly, but will instead
13 | call the :func:`xml4h.build` function with the name for a new root element
14 | or with an existing :class:`xml4h.nodes.Element` node.
15 | """
16 | import xml4h
17 |
18 |
19 | class Builder(object):
20 | """
21 | Builder class that wraps an :class:`xml4h.nodes.Element` node with methods
22 | for adding XML content to an underlying DOM.
23 | """
24 |
25 | def __init__(self, element):
26 | """
27 | Create a Builder representing an xml4h Element node.
28 |
29 | :param element: Element node to represent
30 | :type element: :class:`xml4h.nodes.Element`
31 | """
32 | if not isinstance(element, xml4h.nodes.Element):
33 | raise ValueError(
34 | "Builder can only be created with an %s.%s instance, not %s"
35 | % (xml4h.nodes.Element.__module__,
36 | xml4h.nodes.Element.__name__,
37 | element))
38 | self._element = element
39 |
40 | @property
41 | def dom_element(self):
42 | """
43 | :return: the :class:`xml4h.nodes.Element` node represented by this
44 | Builder.
45 | """
46 | return self._element
47 |
48 | @property
49 | def document(self):
50 | """
51 | :return: the :class:`xml4h.nodes.Document` node that contains the
52 | element represented by this Builder.
53 | """
54 | return self._element.document
55 |
56 | @property
57 | def root(self):
58 | """
59 | :return: the :class:`xml4h.nodes.Element` root node ancestor of the
60 | element represented by this Builder
61 | """
62 | return self._element.root
63 |
64 | def find(self, **kwargs):
65 | """
66 | Find descendants of the element represented by this builder that
67 | match the given constraints.
68 |
69 | :return: a list of :class:`xml4h.nodes.Element` nodes
70 |
71 | Delegates to :meth:`xml4h.nodes.Node.find`
72 | """
73 | return self._element.find(**kwargs)
74 |
75 | def find_doc(self, **kwargs):
76 | """
77 | Find nodes in this element's owning :class:`xml4h.nodes.Document`
78 | that match the given constraints.
79 |
80 | :return: a list of :class:`xml4h.nodes.Element` nodes
81 |
82 | Delegates to :meth:`xml4h.nodes.Node.find_doc`.
83 | """
84 | return self._element.find_doc(**kwargs)
85 |
86 | def write(self, *args, **kwargs):
87 | """
88 | Write XML bytes for the element represented by this builder.
89 |
90 | Delegates to :meth:`xml4h.nodes.Node.write`.
91 | """
92 | self.dom_element.write(*args, **kwargs)
93 |
94 | def write_doc(self, *args, **kwargs):
95 | """
96 | Write XML bytes for the Document containing the element
97 | represented by this builder.
98 |
99 | Delegates to :meth:`xml4h.nodes.Node.write_doc`.
100 | """
101 | self.dom_element.write_doc(*args, **kwargs)
102 |
103 | def xml(self, **kwargs):
104 | """
105 | :return: XML string for the element represented by this builder.
106 |
107 | Delegates to :meth:`xml4h.nodes.Node.xml`.
108 | """
109 | return self.dom_element.xml(**kwargs)
110 |
111 | def xml_doc(self, **kwargs):
112 | """
113 | :return: XML string for the Document containing the element represented
114 | by this builder.
115 |
116 | Delegates to :meth:`xml4h.nodes.Node.xml_doc`.
117 | """
118 | return self.dom_element.xml_doc(**kwargs)
119 |
120 | def up(self, count_or_element_name=1):
121 | """
122 | :return: a builder representing an ancestor of the current element,
123 | by default the parent element.
124 |
125 | :param count_or_element_name:
126 | when an integer, return the n'th ancestor element up to the
127 | document's root element.
128 | when a string, return the nearest ancestor element with that name,
129 | or the document's root element if there are no matching ancestors.
130 | Defaults to integer value 1 which means the immediate parent.
131 | :type count_or_element_name: integer or string
132 | """
133 | elem = self._element
134 | to_count = to_name = None
135 | if isinstance(count_or_element_name, int):
136 | to_count = count_or_element_name
137 | else:
138 | to_name = count_or_element_name
139 | up_count = 0
140 | while True:
141 | # Don't go up beyond the document root
142 | if elem.is_root or elem.parent is None:
143 | break
144 | # Go up to element's parent
145 | elem = elem.parent
146 | # If we have a name to match and it matches, stop
147 | if to_name:
148 | if elem.name == to_name:
149 | break
150 | continue
151 | # If we have a count to reach and have reached it, stop
152 | up_count += 1
153 | if up_count >= to_count:
154 | break
155 | return Builder(elem)
156 |
157 | def transplant(self, node):
158 | """
159 | Transplant a node from another document to become a child of
160 | the :class:`xml4h.nodes.Element` node represented by this Builder.
161 |
162 |         :return: the current Builder, which represents the current element \
163 |                  (not the transplanted node).
164 |
165 | Delegates to :meth:`xml4h.nodes.Node.transplant_node`.
166 | """
167 | self._element.transplant_node(node)
168 | return self
169 |
170 | def clone(self, node):
171 | """
172 | Clone a node from another document to become a child of
173 | the :class:`xml4h.nodes.Element` node represented by this Builder.
174 |
175 |         :return: the current Builder, which represents the current element \
176 |                  (not the cloned node).
177 |
178 | Delegates to :meth:`xml4h.nodes.Node.clone_node`.
179 | """
180 | self._element.clone_node(node)
181 | return self
182 |
183 | def element(self, *args, **kwargs):
184 | """
185 | Add a child element to the :class:`xml4h.nodes.Element` node
186 | represented by this Builder.
187 |
188 | :return: a new Builder that represents the child element.
189 |
190 | Delegates to :meth:`xml4h.nodes.Element.add_element`.
191 | """
192 | child_element = self._element.add_element(*args, **kwargs)
193 | return Builder(child_element)
194 |
195 | elem = element # Alias
196 | """Alias of :meth:`element`"""
197 |
198 | e = element # Alias
199 | """Alias of :meth:`element`"""
200 |
201 | def attributes(self, *args, **kwargs):
202 | """
203 | Add one or more attributes to the :class:`xml4h.nodes.Element` node
204 | represented by this Builder.
205 |
206 | :return: the current Builder.
207 |
208 | Delegates to :meth:`xml4h.nodes.Element.set_attributes`.
209 | """
210 | self._element.set_attributes(*args, **kwargs)
211 | return self
212 |
213 | attrs = attributes # Alias
214 | """Alias of :meth:`attributes`"""
215 |
216 | a = attributes # Alias
217 | """Alias of :meth:`attributes`"""
218 |
219 | def text(self, text):
220 | """
221 | Add a text node to the :class:`xml4h.nodes.Element` node
222 | represented by this Builder.
223 |
224 | :return: the current Builder.
225 |
226 | Delegates to :meth:`xml4h.nodes.Element.add_text`.
227 | """
228 | self._element.add_text(text)
229 | return self
230 |
231 | t = text # Alias
232 | """Alias of :meth:`text`"""
233 |
234 | def comment(self, text):
235 | """
236 |         Add a comment node to the :class:`xml4h.nodes.Element` node
237 | represented by this Builder.
238 |
239 | :return: the current Builder.
240 |
241 | Delegates to :meth:`xml4h.nodes.Element.add_comment`.
242 | """
243 | self._element.add_comment(text)
244 | return self
245 |
246 | c = comment # Alias
247 | """Alias of :meth:`comment`"""
248 |
249 | def processing_instruction(self, target, data):
250 | """
251 | Add a processing instruction node to the :class:`xml4h.nodes.Element`
252 | node represented by this Builder.
253 |
254 | :return: the current Builder.
255 |
256 | Delegates to :meth:`xml4h.nodes.Element.add_instruction`.
257 | """
258 | self._element.add_instruction(target, data)
259 | return self
260 |
261 | instruction = processing_instruction # Alias
262 | """Alias of :meth:`processing_instruction`"""
263 |
264 | i = instruction # Alias
265 | """Alias of :meth:`processing_instruction`"""
266 |
267 | def cdata(self, text):
268 | """
269 | Add a CDATA node to the :class:`xml4h.nodes.Element` node
270 | represented by this Builder.
271 |
272 | :return: the current Builder.
273 |
274 | Delegates to :meth:`xml4h.nodes.Element.add_cdata`.
275 | """
276 | self._element.add_cdata(text)
277 | return self
278 |
279 | data = cdata # Alias
280 | """Alias of :meth:`cdata`"""
281 |
282 | d = cdata # Alias
283 | """Alias of :meth:`cdata`"""
284 |
285 | def ns_prefix(self, prefix, ns_uri):
286 | """
287 | Set the namespace prefix of the :class:`xml4h.nodes.Element` node
288 | represented by this Builder.
289 |
290 | :return: the current Builder.
291 |
292 | Delegates to :meth:`xml4h.nodes.Element.set_ns_prefix`.
293 | """
294 | self._element.set_ns_prefix(prefix, ns_uri)
295 | return self
296 |
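The ancestor-walking rules of `Builder.up` above can be tried out in isolation. The sketch below mirrors that loop against a minimal stand-in node type; `SketchNode` and the free function `up` are hypothetical names used only for illustration, and a node with no parent is assumed to be the document root.

```python
class SketchNode(object):
    """Minimal stand-in for an element node: just a name and a parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def up(elem, count_or_element_name=1):
    """Walk towards the root, following the same rules as Builder.up()."""
    to_count = to_name = None
    if isinstance(count_or_element_name, int):
        to_count = count_or_element_name
    else:
        to_name = count_or_element_name
    up_count = 0
    # Never go up beyond the root (the node without a parent)
    while elem.parent is not None:
        elem = elem.parent
        if to_name:
            # Name mode: stop at the nearest ancestor with that name
            if elem.name == to_name:
                break
            continue
        # Count mode: stop once enough levels have been climbed
        up_count += 1
        if up_count >= to_count:
            break
    return elem


root = SketchNode('DocRoot')
films = SketchNode('Films', root)
film = SketchNode('Film', films)
title = SketchNode('Title', film)

print(up(title).name)
print(up(title, 2).name)
print(up(title, 99).name)
print(up(title, 'Films').name)
print(up(title, 'NoSuchName').name)
```

In both modes the walk stops at the root, so an oversized count or an unmatched name returns the root element rather than raising an error.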
--------------------------------------------------------------------------------
/xml4h/exceptions.py:
--------------------------------------------------------------------------------
1 | """
2 | Custom *xml4h* exceptions.
3 | """
4 |
5 |
6 | class Xml4hException(Exception):
7 | """
8 | Base exception class for all non-standard exceptions raised by *xml4h*.
9 | """
10 | pass
11 |
12 |
13 | class Xml4hImplementationBug(Xml4hException):
14 | """
15 | *xml4h* implementation has a bug, probably.
16 | """
17 | pass
18 |
19 |
20 | class FeatureUnavailableException(Xml4hException):
21 | """
22 | User has attempted to use a feature that is available in some *xml4h*
23 | implementations/adapters, but is not available in the current one.
24 | """
25 | pass
26 |
27 |
28 | class IncorrectArgumentTypeException(ValueError, Xml4hException):
29 | """
30 | Richer flavour of a ValueError that describes exactly what argument
31 | types are expected.
32 | """
33 |
34 | def __init__(self, arg, expected_types):
35 | msg = ('Argument %s is not one of the expected types: %s'
36 | % (arg, expected_types))
37 | super(IncorrectArgumentTypeException, self).__init__(msg)
38 |
39 |
40 | class UnknownNamespaceException(ValueError, Xml4hException):
41 | """
42 | User has attempted to refer to an unknown or undeclared namespace by
43 | prefix or URI.
44 | """
45 | pass
46 |
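Because `IncorrectArgumentTypeException` and `UnknownNamespaceException` inherit from both `ValueError` and `Xml4hException`, calling code can catch them through either base class. The sketch below redefines two of the classes from the listing above as local stand-ins, so that it runs without *xml4h* installed, and shows both styles of handler.

```python
class Xml4hException(Exception):
    """Local stand-in for xml4h.exceptions.Xml4hException."""


class IncorrectArgumentTypeException(ValueError, Xml4hException):
    """Local stand-in that mirrors the class in the listing above."""

    def __init__(self, arg, expected_types):
        msg = ('Argument %s is not one of the expected types: %s'
               % (arg, expected_types))
        super(IncorrectArgumentTypeException, self).__init__(msg)


# Generic handler: code that only knows about ValueError still works
try:
    raise IncorrectArgumentTypeException(42, [str])
except ValueError as e:
    caught_as_value_error = str(e)

# Library-specific handler: catch anything raised by xml4h itself
try:
    raise IncorrectArgumentTypeException(42, [str])
except Xml4hException as e:
    caught_as_xml4h = str(e)

print(caught_as_value_error)
print(caught_as_value_error == caught_as_xml4h)
```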
--------------------------------------------------------------------------------
/xml4h/impls/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jmurty/xml4h/83bc0a91afe5d6e17d6c99ec43dc0aec9593cc06/xml4h/impls/__init__.py
--------------------------------------------------------------------------------
/xml4h/impls/interface.py:
--------------------------------------------------------------------------------
1 | import abc
2 | import six
3 |
4 | from xml4h import nodes, exceptions
5 |
6 |
7 | @six.add_metaclass(abc.ABCMeta)
8 | class XmlImplAdapter(object):
9 | """
10 | Base class that defines how *xml4h* interacts with an underlying XML
11 |     library that the adapter "wraps" to provide additional (or at least
12 | different) functionality.
13 |
14 | This class should be treated as an abstract class. It provides some
15 | common implementation code used by all *xml4h* adapter implementations,
16 |     but mostly it sketches out the methods the real implementation subclasses
17 | must provide.
18 | """
19 |
20 | # List of extra features supported (or not) by an adapter implementation
21 | SUPPORTED_FEATURES = {
22 | 'xpath': False,
23 | }
24 |
25 | @classmethod
26 | def has_feature(cls, feature_name):
27 | """
28 | :return: *True* if a named feature is supported by this adapter.
29 | """
30 | return cls.SUPPORTED_FEATURES.get(feature_name.lower(), False)
31 |
32 | @classmethod
33 | def ignore_whitespace_text_nodes(cls, wrapped_node):
34 | """
35 |         Find and delete any text nodes containing nothing but whitespace
36 |         in the given node and its descendants.
37 |
38 | This is useful for cleaning up excess low-value text nodes in a
39 | document DOM after parsing a pretty-printed XML document.
40 | """
41 | for child in wrapped_node.children:
42 | if child.is_text and child.value.strip() == '':
43 | child.delete()
44 | else:
45 | cls.ignore_whitespace_text_nodes(child)
46 |
47 | @classmethod
48 | def create_document(cls, root_tagname, ns_uri=None, **kwargs):
49 | # Use implementation's method to create base document and root element
50 | impl_doc = cls.new_impl_document(root_tagname, ns_uri, **kwargs)
51 | adapter = cls(impl_doc)
52 | wrapped_doc = nodes.Document(impl_doc, adapter)
53 | # Automatically add namespace URI to root Element as attribute
54 | if ns_uri is not None:
55 | adapter.set_node_attribute_value(wrapped_doc.root.impl_node,
56 | 'xmlns', ns_uri, ns_uri=nodes.Node.XMLNS_URI)
57 | return wrapped_doc
58 |
59 | @classmethod
60 | def wrap_document(cls, document_node):
61 | adapter = cls(document_node)
62 | return nodes.Document(document_node, adapter)
63 |
64 | @classmethod
65 | def wrap_node(cls, node, document, adapter=None):
66 | if node is None:
67 | return None
68 | if adapter is None:
69 | adapter = cls(document)
70 | impl_class = adapter.map_node_to_class(node)
71 | return impl_class(node, adapter)
72 |
73 | @classmethod
74 | @abc.abstractmethod
75 | def is_available(cls):
76 | """
77 | :return: *True* if this adapter's underlying XML library is available \
78 | in the Python environment.
79 | """
80 | raise NotImplementedError("Implementation missing for %s" % cls)
81 |
82 | @classmethod
83 | @abc.abstractmethod
84 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True):
85 | raise NotImplementedError("Implementation missing for %s" % cls)
86 |
87 | @classmethod
88 | @abc.abstractmethod
89 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True):
90 | raise NotImplementedError("Implementation missing for %s" % cls)
91 |
92 | @classmethod
93 | @abc.abstractmethod
94 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True):
95 | raise NotImplementedError("Implementation missing for %s" % cls)
96 |
97 | def __init__(self, document):
98 | if not isinstance(document, object):
99 | raise exceptions.IncorrectArgumentTypeException(
100 | document, [object])
101 | self._impl_document = document
102 | self._auto_ns_prefix_count = 0
103 | self.clear_caches()
104 |
105 |     def clear_caches(self):
106 | """
107 | Clear any in-adapter cached data, for cases where cached data could
108 | become outdated e.g. by making DOM changes directly outside of *xml4h*.
109 |
110 | This is a no-op if the implementing adapter has no cached data.
111 | """
112 | pass
113 |
114 | @property
115 | def impl_document(self):
116 | return self._impl_document
117 |
118 | @property
119 | def impl_root_element(self):
120 | return self.get_impl_root(self.impl_document)
121 |
122 | def get_ns_uri_for_prefix(self, node, prefix):
123 | if prefix == 'xmlns':
124 | return nodes.Node.XMLNS_URI
125 | elif prefix is None:
126 | attr_name = 'xmlns'
127 | else:
128 | attr_name = 'xmlns:%s' % prefix
129 | uri = self.lookup_ns_uri_by_attr_name(node, attr_name)
130 | if uri is None:
131 | if attr_name == 'xmlns':
132 | # Default namespace URI
133 | return nodes.Node.XMLNS_URI
134 | raise exceptions.UnknownNamespaceException(
135 | "Unknown namespace URI for attribute name '%s'" % attr_name)
136 | return uri
137 |
138 | def get_ns_prefix_for_uri(self, node, uri, auto_generate_prefix=False):
139 | if uri == nodes.Node.XMLNS_URI:
140 | return 'xmlns'
141 | prefix = self.lookup_ns_prefix_for_uri(node, uri)
142 | if not prefix and auto_generate_prefix:
143 | prefix = 'autoprefix%d' % self._auto_ns_prefix_count
144 | self._auto_ns_prefix_count += 1
145 | return prefix
146 |
147 | def get_ns_info_from_node_name(self, name, impl_node):
148 | """
149 | Return a three-element tuple with the prefix, local name, and namespace
150 | URI for the given element/attribute name (in the context of the given
151 | node's hierarchy). If the name has no associated prefix or namespace
152 |         information, None is returned for those tuple members.
153 | """
154 | if '}' in name:
155 | ns_uri, name = name.split('}')
156 | ns_uri = ns_uri[1:]
157 | prefix = self.get_ns_prefix_for_uri(impl_node, ns_uri)
158 | elif ':' in name:
159 | prefix, name = name.split(':')
160 | ns_uri = self.get_ns_uri_for_prefix(impl_node, prefix)
161 | if ns_uri is None:
162 | raise exceptions.UnknownNamespaceException(
163 | "Prefix '%s' does not have a defined namespace URI"
164 | % prefix)
165 | else:
166 | prefix, ns_uri = None, None
167 | return prefix, name, ns_uri
168 |
169 | # Utility implementation methods
170 |
171 | @classmethod
172 | @abc.abstractmethod
173 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs):
174 | raise NotImplementedError("Implementation missing for %s" % cls)
175 |
176 | @abc.abstractmethod
177 | def map_node_to_class(self, node):
178 | raise NotImplementedError("Implementation missing for %s" % self)
179 |
180 | @abc.abstractmethod
181 | def get_impl_root(self, node):
182 | raise NotImplementedError("Implementation missing for %s" % self)
183 |
184 | # Document implementation methods
185 |
186 | @abc.abstractmethod
187 | def new_impl_element(self, tagname, ns_uri=None, parent=None):
188 | raise NotImplementedError("Implementation missing for %s" % self)
189 |
190 | @abc.abstractmethod
191 | def new_impl_text(self, text):
192 | raise NotImplementedError("Implementation missing for %s" % self)
193 |
194 | @abc.abstractmethod
195 | def new_impl_comment(self, text):
196 | raise NotImplementedError("Implementation missing for %s" % self)
197 |
198 | @abc.abstractmethod
199 | def new_impl_instruction(self, target, data):
200 | raise NotImplementedError("Implementation missing for %s" % self)
201 |
202 | @abc.abstractmethod
203 | def new_impl_cdata(self, text):
204 | raise NotImplementedError("Implementation missing for %s" % self)
205 |
206 | @abc.abstractmethod
207 | def find_node_elements(self, node, name='*', ns_uri='*'):
208 | """
209 | :return: element node descendents of the given node that match the \
210 | search constraints.
211 |
212 | :param node: a node object from the underlying XML library.
213 | :param string name: only elements with a matching name will be
214 | returned. If the value is ``*`` all names will match.
215 | :param string ns_uri: only elements with a matching namespace URI
216 | will be returned. If the value is ``*`` all namespaces will match.
217 | """
218 | raise NotImplementedError("Implementation missing for %s" % self)
219 |
220 | def xpath_on_node(self, node, xpath, **kwargs):
221 | if not self.has_feature('xpath'):
222 | raise exceptions.FeatureUnavailableException('xpath')
223 |
224 | # Node implementation methods
225 |
226 | @abc.abstractmethod
227 | def get_node_namespace_uri(self, node):
228 | raise NotImplementedError("Implementation missing for %s" % self)
229 |
230 | @abc.abstractmethod
231 | def set_node_namespace_uri(self, node, ns_uri):
232 | raise NotImplementedError("Implementation missing for %s" % self)
233 |
234 | @abc.abstractmethod
235 | def get_node_parent(self, node):
236 | raise NotImplementedError("Implementation missing for %s" % self)
237 |
238 | @abc.abstractmethod
239 | def get_node_children(self, node):
240 | raise NotImplementedError("Implementation missing for %s" % self)
241 |
242 | @abc.abstractmethod
243 | def get_node_name(self, node):
244 | raise NotImplementedError("Implementation missing for %s" % self)
245 |
246 | @abc.abstractmethod
247 | def get_node_local_name(self, node):
248 | raise NotImplementedError("Implementation missing for %s" % self)
249 |
250 | @abc.abstractmethod
251 | def get_node_name_prefix(self, node):
252 | raise NotImplementedError("Implementation missing for %s" % self)
253 |
254 | @abc.abstractmethod
255 | def get_node_value(self, node):
256 | raise NotImplementedError("Implementation missing for %s" % self)
257 |
258 | @abc.abstractmethod
259 | def set_node_value(self, node, value):
260 | raise NotImplementedError("Implementation missing for %s" % self)
261 |
262 | @abc.abstractmethod
263 | def get_node_text(self, node):
264 | raise NotImplementedError("Implementation missing for %s" % self)
265 |
266 | @abc.abstractmethod
267 | def set_node_text(self, node, text):
268 | raise NotImplementedError("Implementation missing for %s" % self)
269 |
270 | @abc.abstractmethod
271 | def get_node_attributes(self, element, ns_uri=None):
272 | raise NotImplementedError("Implementation missing for %s" % self)
273 |
274 | @abc.abstractmethod
275 | def has_node_attribute(self, element, name, ns_uri=None):
276 | raise NotImplementedError("Implementation missing for %s" % self)
277 |
278 | @abc.abstractmethod
279 | def get_node_attribute_node(self, element, name, ns_uri=None):
280 | raise NotImplementedError("Implementation missing for %s" % self)
281 |
282 | @abc.abstractmethod
283 | def get_node_attribute_value(self, element, name, ns_uri=None):
284 | raise NotImplementedError("Implementation missing for %s" % self)
285 |
286 | @abc.abstractmethod
287 | def set_node_attribute_value(self, element, name, value, ns_uri=None):
288 | raise NotImplementedError("Implementation missing for %s" % self)
289 |
290 | @abc.abstractmethod
291 | def remove_node_attribute(self, element, name, ns_uri=None):
292 | raise NotImplementedError("Implementation missing for %s" % self)
293 |
294 | @abc.abstractmethod
295 | def add_node_child(self, parent, child, before_sibling=None):
296 | raise NotImplementedError("Implementation missing for %s" % self)
297 |
298 | @abc.abstractmethod
299 | def import_node(self, parent, node, original_parent=None, clone=False):
300 | raise NotImplementedError("Implementation missing for %s" % self)
301 |
302 | @abc.abstractmethod
303 | def clone_node(self, node, deep=True):
304 | raise NotImplementedError("Implementation missing for %s" % self)
305 |
306 | @abc.abstractmethod
307 | def remove_node_child(self, parent, child, destroy_node=True):
308 | raise NotImplementedError("Implementation missing for %s" % self)
309 |
310 | @abc.abstractmethod
311 | def lookup_ns_uri_by_attr_name(self, node, name):
312 | raise NotImplementedError("Implementation missing for %s" % self)
313 |
314 | @abc.abstractmethod
315 | def lookup_ns_prefix_for_uri(self, node, uri):
316 | raise NotImplementedError("Implementation missing for %s" % self)
317 |
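
The interface above pairs abstract methods that every adapter must implement with an optional feature (`xpath`) that is refused unless the adapter declares support for it. The following is a minimal, self-contained sketch of that pattern, not xml4h's actual code: `MiniAdapter`, `DictAdapter` and `FeatureUnavailable` are hypothetical names, and the body of `has_feature` is an assumption, since its real implementation is not shown in this part of the file.

```python
import abc


class FeatureUnavailable(Exception):
    """Hypothetical stand-in for xml4h's FeatureUnavailableException."""


class MiniAdapter(abc.ABC):
    """Hypothetical, cut-down analogue of an XML implementation adapter."""

    # Subclasses advertise optional capabilities here.
    SUPPORTED_FEATURES = {}

    @classmethod
    def has_feature(cls, name):
        # Assumed behaviour: a feature is available only if explicitly
        # declared with a truthy value.
        return bool(cls.SUPPORTED_FEATURES.get(name, False))

    @abc.abstractmethod
    def get_node_text(self, node):
        """Return the text content of an implementation node."""

    def xpath_on_node(self, node, xpath):
        # Refuse the call unless the subclass declares XPath support.
        if not self.has_feature('xpath'):
            raise FeatureUnavailable('xpath')
        raise NotImplementedError("Implementation missing for %s" % self)


class DictAdapter(MiniAdapter):
    """Toy adapter whose "nodes" are plain dicts; declares no features."""

    def get_node_text(self, node):
        return node.get('text')


adapter = DictAdapter()
text = adapter.get_node_text({'text': 'hello'})
```

Because `get_node_text` is abstract, `MiniAdapter` itself cannot be instantiated, while `DictAdapter` can; calling `adapter.xpath_on_node(...)` raises `FeatureUnavailable` because the toy adapter does not list `xpath` in `SUPPORTED_FEATURES`.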
--------------------------------------------------------------------------------
/xml4h/impls/lxml_etree.py:
--------------------------------------------------------------------------------
1 | import re
2 | import copy
3 |
4 | from xml4h.impls.interface import XmlImplAdapter
5 | from xml4h import nodes, exceptions
6 |
7 | try:
8 | from lxml import etree
9 | except ImportError:
10 | pass
11 |
12 |
13 | class LXMLAdapter(XmlImplAdapter):
14 | """
15 | Adapter to the `lxml `_ XML library implementation.
16 | """
17 |
18 | SUPPORTED_FEATURES = {
19 | 'xpath': True,
20 | }
21 |
22 | @classmethod
23 | def is_available(cls):
24 | try:
25 | etree.Element
26 | return True
27 | except NameError:  # etree is undefined when the lxml import failed
28 | return False
29 |
30 | @classmethod
31 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True):
32 | impl_root_elem = etree.fromstring(xml_str)
33 | wrapped_doc = LXMLAdapter.wrap_document(impl_root_elem.getroottree())
34 | if ignore_whitespace_text_nodes:
35 | cls.ignore_whitespace_text_nodes(wrapped_doc)
36 | return wrapped_doc
37 |
38 | @classmethod
39 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True):
40 | return LXMLAdapter.parse_string(xml_bytes, ignore_whitespace_text_nodes)
41 |
42 | @classmethod
43 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True):
44 | impl_doc = etree.parse(xml_file)
45 | wrapped_doc = LXMLAdapter.wrap_document(impl_doc)
46 | if ignore_whitespace_text_nodes:
47 | cls.ignore_whitespace_text_nodes(wrapped_doc)
48 | return wrapped_doc
49 |
50 | @classmethod
51 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs):
52 | root_nsmap = {}
53 | if ns_uri is not None:
54 | root_nsmap[None] = ns_uri
55 | else:
56 | ns_uri = nodes.Node.XMLNS_URI
57 | root_nsmap[None] = ns_uri
58 | root_elem = etree.Element('{%s}%s' % (ns_uri, root_tagname),
59 | nsmap=root_nsmap)
60 | doc = etree.ElementTree(root_elem)
61 | return doc
62 |
63 | def map_node_to_class(self, node):
64 | if isinstance(node, etree._ProcessingInstruction):
65 | return nodes.ProcessingInstruction
66 | elif isinstance(node, etree._Comment):
67 | return nodes.Comment
68 | elif isinstance(node, etree._ElementTree):
69 | return nodes.Document
70 | elif isinstance(node, etree._Element):
71 | return nodes.Element
72 | elif isinstance(node, LXMLAttribute):
73 | return nodes.Attribute
74 | elif isinstance(node, LXMLText):
75 | if node.is_cdata:
76 | return nodes.CDATA
77 | else:
78 | return nodes.Text
79 | raise exceptions.Xml4hImplementationBug(
80 | 'Unrecognized type for implementation node: %s' % node)
81 |
82 | def get_impl_root(self, node):
83 | return self._impl_document.getroot()
84 |
85 | # Document implementation methods
86 |
87 | def new_impl_element(self, tagname, ns_uri=None, parent=None):
88 | if ns_uri is not None:
89 | if ':' in tagname:
90 | tagname = tagname.split(':')[1]
91 | my_nsmap = {None: ns_uri}
92 | # Add any xmlns attribute prefix mappings from parent's document
93 | # TODO This doesn't seem to help
94 | curr_node = parent
95 | while curr_node.__class__ == etree._Element:
96 | for n, v in list(curr_node.attrib.items()):
97 | if '{%s}' % nodes.Node.XMLNS_URI in n:
98 | _, prefix = n.split('}')
99 | my_nsmap[prefix] = v
100 | curr_node = self.get_node_parent(curr_node)
101 | return etree.Element('{%s}%s' % (ns_uri, tagname), nsmap=my_nsmap)
102 | else:
103 | return etree.Element(tagname)
104 |
105 | def new_impl_text(self, text):
106 | return LXMLText(text)
107 |
108 | def new_impl_comment(self, text):
109 | return etree.Comment(text)
110 |
111 | def new_impl_instruction(self, target, data):
112 | return etree.ProcessingInstruction(target, data)
113 |
114 | def new_impl_cdata(self, text):
115 | return LXMLText(text, is_cdata=True)
116 |
117 | def find_node_elements(self, node, name='*', ns_uri='*'):
118 | # TODO Any proper way to find namespaced elements by name?
119 | name_match_nodes = node.iter()  # getiterator() is deprecated
120 | # Filter nodes by name and ns_uri if necessary
121 | results = []
122 | for n in name_match_nodes:
123 | # Ignore the current node
124 | if n == node:
125 | continue
126 | # Ignore non-Elements
127 | if not n.__class__ == etree._Element:
128 | continue
129 | if ns_uri != '*' and self.get_node_namespace_uri(n) != ns_uri:
130 | continue
131 | if name != '*' and self.get_node_local_name(n) != name:
132 | continue
133 | results.append(n)
134 | return results
135 | find_node_elements.__doc__ = XmlImplAdapter.find_node_elements.__doc__
136 |
137 | def xpath_on_node(self, node, xpath, **kwargs):
138 | """
139 | Return result of performing the given XPath query on the given node.
140 |
141 | All known namespace prefix-to-URI mappings in the document are
142 | automatically included in the XPath invocation.
143 |
144 | If an empty/default namespace (i.e. None) is defined, this is
145 | converted to the prefix name '_' so it can be used despite empty
146 | namespace prefixes being unsupported by XPath.
147 | """
148 | if isinstance(node, etree._ElementTree):
149 | # Document node lxml.etree._ElementTree has no nsmap, lookup root
150 | root = self.get_impl_root(node)
151 | namespaces_dict = root.nsmap.copy()
152 | else:
153 | namespaces_dict = node.nsmap.copy()
154 | if 'namespaces' in kwargs:
155 | namespaces_dict.update(kwargs['namespaces'])
156 | # Empty namespace prefix is not supported, convert to '_' prefix
157 | if None in namespaces_dict:
158 | default_ns_uri = namespaces_dict.pop(None)
159 | namespaces_dict['_'] = default_ns_uri
160 | # Include XMLNS namespace if it's not already defined
161 | if 'xmlns' not in namespaces_dict:
162 | namespaces_dict['xmlns'] = nodes.Node.XMLNS_URI
163 | return node.xpath(xpath, namespaces=namespaces_dict)
164 |
165 | # Node implementation methods
166 |
167 | def get_node_namespace_uri(self, node):
168 | if '}' in node.tag:
169 | return node.tag.split('}')[0][1:]
170 | elif isinstance(node, LXMLAttribute):
171 | return node.namespace_uri
172 | elif isinstance(node, etree._ElementTree):
173 | return None
174 | elif isinstance(node, etree._Element):
175 | qname, ns_uri = self._unpack_name(node.tag, node)[:2]
176 | return ns_uri
177 | else:
178 | return None
179 |
180 | def set_node_namespace_uri(self, node, ns_uri):
181 | node.nsmap[None] = ns_uri
182 |
183 | def get_node_parent(self, node):
184 | if isinstance(node, etree._ElementTree):
185 | return None
186 | else:
187 | parent = node.getparent()
188 | # Return ElementTree as root element's parent
189 | if parent is None:
190 | return self.impl_document
191 | return parent
192 |
193 | def get_node_children(self, node):
194 | if isinstance(node, etree._ElementTree):
195 | children = [node.getroot()]
196 | else:
197 | if not hasattr(node, 'getchildren'):
198 | return []
199 | children = node.getchildren()
200 | # Hack to treat text attribute as child text nodes
201 | if node.text is not None:
202 | children.insert(0, LXMLText(node.text, parent=node))
203 | return children
204 |
205 | def get_node_name(self, node):
206 | if isinstance(node, etree._Comment):
207 | return '#comment'
208 | elif isinstance(node, etree._ProcessingInstruction):
209 | return node.target
210 | prefix = self.get_node_name_prefix(node)
211 | local_name = self.get_node_local_name(node)
212 | if prefix is not None:
213 | return '%s:%s' % (prefix, local_name)
214 | else:
215 | return local_name
216 |
217 | def get_node_local_name(self, node):
218 | return re.sub('{.*}', '', node.tag)
219 |
220 | def get_node_name_prefix(self, node):
221 | # Believe non-Element nodes that have a prefix set (e.g. LXMLAttribute)
222 | if node.prefix and not isinstance(node, etree._Element):
223 | return node.prefix
224 | # Derive prefix by unpacking node name
225 | qname, ns_uri, prefix, local_name = self._unpack_name(node.tag, node)
226 | if prefix:
227 | # Don't add unnecessary excess namespace prefixes for elements
228 | # with a local default namespace declaration
229 | xmlns_val = self.get_node_attribute_value(node, 'xmlns')
230 | if xmlns_val == ns_uri:
231 | return None
232 | # Don't add unnecessary excess namespace prefixes for default ns
233 | if prefix == 'xmlns':
234 | return None
235 | else:
236 | return prefix
237 | else:
238 | return None
239 |
240 | def get_node_value(self, node):
241 | if isinstance(node, (etree._ProcessingInstruction, etree._Comment)):
242 | return node.text
243 | elif hasattr(node, 'value'):
244 | return node.value
245 | else:
246 | return node.text
247 |
248 | def set_node_value(self, node, value):
249 | if hasattr(node, 'value'):
250 | node.value = value
251 | else:
252 | self.set_node_text(node, value)
253 |
254 | def get_node_text(self, node):
255 | return node.text
256 |
257 | def set_node_text(self, node, text):
258 | node.text = text
259 |
260 | def get_node_attributes(self, element, ns_uri=None):
261 | # TODO: Filter by ns_uri
262 | attribs_by_qname = {}
263 | for n, v in list(element.attrib.items()):
264 | qname, ns_uri, prefix, local_name = self._unpack_name(n, element)
265 | attribs_by_qname[qname] = LXMLAttribute(
266 | qname, ns_uri, prefix, local_name, v, element)
267 | # Include namespace declarations, which we also treat as attributes
268 | if element.nsmap:
269 | for n, v in list(element.nsmap.items()):
270 | # Only add namespace as attribute if not defined in ancestors
271 | # and not the global xmlns namespace
272 | if (self._is_ns_in_ancestor(element, n, v)
273 | or v == nodes.Node.XMLNS_URI):
274 | continue
275 | if n is None:
276 | ns_attr_name = 'xmlns'
277 | else:
278 | ns_attr_name = 'xmlns:%s' % n
279 | qname, ns_uri, prefix, local_name = self._unpack_name(
280 | ns_attr_name, element)
281 | attribs_by_qname[qname] = LXMLAttribute(
282 | qname, ns_uri, prefix, local_name, v, element)
283 | return list(attribs_by_qname.values())
284 |
285 | def has_node_attribute(self, element, name, ns_uri=None):
286 | return name in [a.qname for a
287 | in self.get_node_attributes(element, ns_uri)]
288 |
289 | def get_node_attribute_node(self, element, name, ns_uri=None):
290 | for attr in self.get_node_attributes(element, ns_uri):
291 | if attr.qname == name:
292 | return attr
293 | return None
294 |
295 | def get_node_attribute_value(self, element, name, ns_uri=None):
296 | if ns_uri is not None:
297 | prefix = self.lookup_ns_prefix_for_uri(element, ns_uri)
298 | name = '%s:%s' % (prefix, name)
299 | for attr in self.get_node_attributes(element, ns_uri):
300 | if attr.qname == name:
301 | return attr.value
302 | return None
303 |
304 | def set_node_attribute_value(self, element, name, value, ns_uri=None):
305 | prefix = None
306 | if ':' in name:
307 | prefix, name = name.split(':')
308 | if ns_uri is None and prefix is not None:
309 | ns_uri = self.lookup_ns_uri_by_attr_name(element, prefix)
310 | if ns_uri is not None:
311 | name = '{%s}%s' % (ns_uri, name)
312 | if name.startswith('{%s}' % nodes.Node.XMLNS_URI):
313 | if element.nsmap.get(name) != value:
314 | # Ideally we would apply namespace (xmlns) attributes to the
315 | # element's `nsmap` only, but the lxml/etree nsmap attribute
316 | # is immutable and there's no non-hacky way around this.
317 | # TODO Is there a better way?
318 | pass
319 | if name.split('}')[1] == 'xmlns':
320 | # Hack to remove namespace URI from 'xmlns' attributes so
321 | # the name is just a simple string
322 | name = 'xmlns'
323 | element.attrib[name] = value
324 | else:
325 | element.attrib[name] = value
326 |
327 | def remove_node_attribute(self, element, name, ns_uri=None):
328 | if ns_uri is not None:
329 | name = '{%s}%s' % (ns_uri, name)
330 | elif ':' in name:
331 | prefix, name = name.split(':')
332 | if prefix == 'xmlns':
333 | name = '{%s}%s' % (nodes.Node.XMLNS_URI, name)
334 | else:
335 | name = '{%s}%s' % (element.nsmap[prefix], name)
336 | if name in element.attrib:
337 | del element.attrib[name]
338 |
339 | def add_node_child(self, parent, child, before_sibling=None):
340 | if isinstance(child, LXMLText):
341 | # Add text values directly to parent's 'text' attribute
342 | if parent.text is not None:
343 | parent.text = parent.text + child.text
344 | else:
345 | parent.text = child.text
346 | return None
347 | else:
348 | if before_sibling is not None:
349 | offset = 0
350 | for c in parent.getchildren():
351 | if c == before_sibling:
352 | break
353 | offset += 1
354 | parent.insert(offset, child)
355 | else:
356 | parent.append(child)
357 | return child
358 |
359 | def import_node(self, parent, node, original_parent=None, clone=False):
360 | original_node = node
361 | if clone:
362 | node = self.clone_node(node)
363 | self.add_node_child(parent, node)
364 | # Hack to remove text node content from original parent by manually
365 | # deleting matching text content
366 | if not clone and isinstance(original_node, LXMLText):
367 | original_parent = self.get_node_parent(original_node)
368 | if original_parent.text == original_node.text:
369 | # Must set to None if there would be no remaining text,
370 | # otherwise parent element won't realise it's empty
371 | original_parent.text = None
372 | else:
373 | original_parent.text = \
374 | original_parent.text.replace(original_node.text, '', 1)
375 |
376 | def clone_node(self, node, deep=True):
377 | if deep:
378 | return copy.deepcopy(node)
379 | else:
380 | return copy.copy(node)
381 |
382 | def remove_node_child(self, parent, child, destroy_node=True):
383 | if isinstance(child, LXMLText):
384 | parent.text = None
385 | return
386 | parent.remove(child)
387 | if destroy_node:
388 | child.clear()
389 | return None
390 | else:
391 | return child
392 |
393 | def lookup_ns_uri_by_attr_name(self, node, name):
394 | ns_name = None
395 | if name == 'xmlns':
396 | ns_name = None
397 | elif name.startswith('xmlns:'):
398 | _, ns_name = name.split(':')
399 | if ns_name in node.nsmap:
400 | return node.nsmap[ns_name]
401 | # If namespace is not in `nsmap` it may be in an XML DOM attribute
402 | # TODO Generalize this block
403 | curr_node = node
404 | while (curr_node is not None
405 | and curr_node.__class__ != etree._ElementTree):
406 | uri = self.get_node_attribute_value(curr_node, name)
407 | if uri is not None:
408 | return uri
409 | curr_node = self.get_node_parent(curr_node)
410 | return None
411 |
412 | def lookup_ns_prefix_for_uri(self, node, uri):
413 | if uri == nodes.Node.XMLNS_URI:
414 | return 'xmlns'
415 | result = None
416 | if hasattr(node, 'nsmap') and uri in list(node.nsmap.values()):
417 | for n, v in list(node.nsmap.items()):
418 | if v == uri:
419 | result = n
420 | break
421 | # TODO This is a slow hack necessary due to lxml's immutable nsmap
422 | if result is None or re.match(r'ns\d', result):
423 | # We either have no namespace prefix in the nsmap, in which case we
424 | # will try looking for a matching xmlns attribute, or we have
425 | # a namespace prefix that was probably assigned automatically by
426 | # lxml and we'd rather use a human-assigned prefix if available.
427 | curr_node = node # self.get_node_parent(node)
428 | while curr_node.__class__ == etree._Element:
429 | for n, v in list(curr_node.attrib.items()):
430 | if v == uri and ('{%s}' % nodes.Node.XMLNS_URI) in n:
431 | result = n.split('}')[1]
432 | return result
433 | curr_node = self.get_node_parent(curr_node)
434 | return result
435 |
436 | def _unpack_name(self, name, node):
437 | qname = prefix = local_name = ns_uri = None
438 | if name == 'xmlns':
439 | # Namespace URI of 'xmlns' is a constant
440 | ns_uri = nodes.Node.XMLNS_URI
441 | elif '}' in name:
442 | # Namespace URI is contained in {}, find URI's defined prefix
443 | ns_uri, local_name = name.split('}')
444 | ns_uri = ns_uri[1:]
445 | prefix = self.lookup_ns_prefix_for_uri(node, ns_uri)
446 | elif ':' in name:
447 | # Namespace prefix is before ':', find prefix's defined URI
448 | prefix, local_name = name.split(':')
449 | if prefix == 'xmlns':
450 | # All 'xmlns' attributes are in XMLNS URI by definition
451 | ns_uri = nodes.Node.XMLNS_URI
452 | else:
453 | ns_uri = self.lookup_ns_uri_by_attr_name(node, prefix)
454 | # Catch case where a prefix other than 'xmlns' points at XMLNS URI
455 | if name != 'xmlns' and ns_uri == nodes.Node.XMLNS_URI:
456 | prefix = 'xmlns'
457 | # Construct fully-qualified name from prefix + local names
458 | if prefix is not None:
459 | qname = '%s:%s' % (prefix, local_name)
460 | else:
461 | qname = local_name = name
462 | return (qname, ns_uri, prefix, local_name)
463 |
464 | def _is_ns_in_ancestor(self, node, name, value):
465 | """
466 | Return True if the given namespace name/value is defined in an
467 | ancestor of the given node, meaning that the given node need not
468 | have its own attributes to apply that namespacing.
469 | """
470 | curr_node = self.get_node_parent(node)
471 | while curr_node.__class__ == etree._Element:
472 | if (hasattr(curr_node, 'nsmap')
473 | and curr_node.nsmap.get(name) == value):
474 | return True
475 | for n, v in list(curr_node.attrib.items()):
476 | if v == value and '{%s}' % nodes.Node.XMLNS_URI in n:
477 | return True
478 | curr_node = self.get_node_parent(curr_node)
479 | return False
480 |
481 |
482 | class LXMLText(object):
483 |
484 | def __init__(self, text, parent=None, is_cdata=False):
485 | self._text = text
486 | self._parent = parent
487 | self._is_cdata = is_cdata
488 |
489 | @property
490 | def is_cdata(self):
491 | return self._is_cdata
492 |
493 | @property
494 | def value(self):
495 | return self._text
496 |
497 | text = value # Alias
498 |
499 | def getparent(self):
500 | return self._parent
501 |
502 | @property
503 | def prefix(self):
504 | return None
505 |
506 | @property
507 | def tag(self):
508 | if self.is_cdata:
509 | return "#cdata-section"
510 | else:
511 | return "#text"
512 |
513 |
514 | class LXMLAttribute(object):
515 |
516 | def __init__(self, qname, ns_uri, prefix, local_name, value, element):
517 | self._qname, self._ns_uri, self._prefix, self._local_name = (
518 | qname, ns_uri, prefix, local_name)
519 | self._value, self._element = (value, element)
520 |
521 | def getroottree(self):
522 | return self._element.getroottree()
523 |
524 | @property
525 | def qname(self):
526 | return self._qname
527 |
528 | @property
529 | def namespace_uri(self):
530 | return self._ns_uri
531 |
532 | @property
533 | def prefix(self):
534 | return self._prefix
535 |
536 | @property
537 | def local_name(self):
538 | return self._local_name
539 |
540 | @property
541 | def value(self):
542 | return self._value
543 |
544 | name = tag = local_name # Alias
545 |
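
The `xpath_on_node` docstring in the adapter above describes how a default (empty-prefix) namespace is exposed under the `_` prefix, because XPath cannot address an empty prefix. The sketch below reproduces that mapping step in isolation. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies; `build_xpath_namespaces` is a hypothetical helper name and the `urn:example:films` URI is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The standard namespace URI reserved for xmlns declarations.
XMLNS_URI = 'http://www.w3.org/2000/xmlns/'


def build_xpath_namespaces(nsmap, extra=None):
    """Return a prefix-to-URI dict usable in a path query (hypothetical)."""
    namespaces = dict(nsmap)
    if extra:
        namespaces.update(extra)
    # An empty prefix cannot be used in a path expression, so expose the
    # default namespace under the '_' prefix instead.
    if None in namespaces:
        namespaces['_'] = namespaces.pop(None)
    # Make the xmlns namespace available unless the caller defined it.
    namespaces.setdefault('xmlns', XMLNS_URI)
    return namespaces


root = ET.fromstring(
    '<films xmlns="urn:example:films"><film>Life of Brian</film></films>')
ns = build_xpath_namespaces({None: 'urn:example:films'})
titles = [el.text for el in root.findall('_:film', ns)]
```

With this mapping, a query can name elements in the default namespace as `_:film` instead of spelling out the `{urn:example:films}film` form.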
--------------------------------------------------------------------------------
/xml4h/impls/xml_dom_minidom.py:
--------------------------------------------------------------------------------
1 | from six import StringIO, BytesIO
2 |
3 | from xml4h.impls.interface import XmlImplAdapter
4 | from xml4h import nodes, exceptions
5 |
6 | import xml.dom
7 | import xml.dom.minidom
8 |
9 |
10 | class XmlDomImplAdapter(XmlImplAdapter):
11 | """
12 | Adapter to the
13 | `minidom `_ XML
14 | library implementation.
15 | """
16 |
17 | @classmethod
18 | def is_available(cls):
19 | try:
20 | xml.dom.Node
21 | return True
22 | except:
23 | return False
24 |
25 | @classmethod
26 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True):
27 | return cls.parse_file(StringIO(xml_str), ignore_whitespace_text_nodes)
28 |
29 | @classmethod
30 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True):
31 | return cls.parse_file(BytesIO(xml_bytes), ignore_whitespace_text_nodes)
32 |
33 | @classmethod
34 | def parse_file(cls, xml_file, ignore_whitespace_text_nodes=True):
35 | impl_doc = xml.dom.minidom.parse(xml_file)
36 | wrapped_doc = XmlDomImplAdapter.wrap_document(impl_doc)
37 | if ignore_whitespace_text_nodes:
38 | cls.ignore_whitespace_text_nodes(wrapped_doc)
39 | return wrapped_doc
40 |
41 | @classmethod
42 | def new_impl_document(cls, root_tagname, ns_uri=None,
43 | doctype=None, impl_features=None):
44 | # Create DOM implementation factory
45 | if impl_features is None:
46 | impl_features = []
47 | factory = xml.dom.getDOMImplementation('minidom', impl_features)
48 | # Create Document from factory
49 | doc = factory.createDocument(ns_uri, root_tagname, doctype)
50 | return doc
51 |
52 | def map_node_to_class(self, impl_node):
53 | try:
54 | return {
55 | xml.dom.Node.ELEMENT_NODE: nodes.Element,
56 | xml.dom.Node.ATTRIBUTE_NODE: nodes.Attribute,
57 | xml.dom.Node.TEXT_NODE: nodes.Text,
58 | xml.dom.Node.CDATA_SECTION_NODE: nodes.CDATA,
59 | # EntityReference not supported by minidom
60 | #xml.dom.Node.ENTITY_REFERENCE: nodes.EntityReference,
61 | xml.dom.Node.ENTITY_NODE: nodes.Entity,
62 | xml.dom.Node.PROCESSING_INSTRUCTION_NODE:
63 | nodes.ProcessingInstruction,
64 | xml.dom.Node.COMMENT_NODE: nodes.Comment,
65 | xml.dom.Node.DOCUMENT_NODE: nodes.Document,
66 | xml.dom.Node.DOCUMENT_TYPE_NODE: nodes.DocumentType,
67 | xml.dom.Node.DOCUMENT_FRAGMENT_NODE: nodes.DocumentFragment,
68 | xml.dom.Node.NOTATION_NODE: nodes.Notation,
69 | }[impl_node.nodeType]
70 | except KeyError:
71 | raise exceptions.Xml4hImplementationBug(
72 | 'Unrecognized type for implementation node: %s' % impl_node)
73 |
74 | def get_impl_root(self, node):
75 | return node.documentElement
76 |
77 | def new_impl_element(self, tagname, ns_uri=None, parent=None):
78 | return self.impl_document.createElementNS(ns_uri, tagname)
79 |
80 | def new_impl_text(self, text):
81 | return self.impl_document.createTextNode(text)
82 |
83 | def new_impl_comment(self, text):
84 | return self.impl_document.createComment(text)
85 |
86 | def new_impl_instruction(self, target, data):
87 | return self.impl_document.createProcessingInstruction(target, data)
88 |
89 | def new_impl_cdata(self, text):
90 | return self.impl_document.createCDATASection(text)
91 |
92 | def find_node_elements(self, node, name='*', ns_uri='*'):
93 | return node.getElementsByTagNameNS(ns_uri, name)
94 |
95 | def get_node_namespace_uri(self, node):
96 | return node.namespaceURI
97 |
98 | def set_node_namespace_uri(self, node, ns_uri):
99 | node.namespaceURI = ns_uri
100 |
101 | def get_node_parent(self, element):
102 | return element.parentNode
103 |
104 | def get_node_children(self, element):
105 | return element.childNodes
106 |
107 | def get_node_name(self, node):
108 | if node.nodeType not in (
109 | xml.dom.Node.ELEMENT_NODE, xml.dom.Node.ATTRIBUTE_NODE
110 | ):
111 | return node.nodeName
112 | # Special handling of node names for Element and Attribute nodes where
113 | # we want to exclude the namespace prefix in some cases
114 | prefix = self.get_node_name_prefix(node)
115 | local_name = self.get_node_local_name(node)
116 | if prefix is not None:
117 | return '%s:%s' % (prefix, local_name)
118 | else:
119 | return local_name
120 |
121 | def get_node_local_name(self, node):
122 | return node.localName
123 |
124 | def get_node_name_prefix(self, node):
125 | prefix = node.prefix
126 | # Don't add unnecessary excess namespace prefixes for elements
127 | # with a local default namespace declaration
128 | if prefix and node.nodeType == xml.dom.Node.ELEMENT_NODE:
129 | xmlns_val = self.get_node_attribute_value(node, 'xmlns')
130 | if xmlns_val == self.get_node_namespace_uri(node):
131 | return None
132 | return prefix
133 |
134 | def get_node_value(self, node):
135 | return node.nodeValue
136 |
137 | def set_node_value(self, node, value):
138 | node.nodeValue = value
139 |
140 | def get_node_text(self, node):
141 | """
142 | Return concatenated value of all text node children of this element
143 | """
144 | text_children = [n.nodeValue for n in self.get_node_children(node)
145 | if n.nodeType == xml.dom.Node.TEXT_NODE]
146 | if text_children:
147 | return ''.join(text_children)
148 | else:
149 | return None
150 |
151 | def set_node_text(self, node, text):
152 | """
153 | Set text value as sole Text child node of element; any existing
154 | Text nodes are removed
155 | """
156 | # Remove existing Text children; copy first, as removal mutates the list
157 | for child in list(self.get_node_children(node)):
158 | if child.nodeType == xml.dom.Node.TEXT_NODE:
159 | self.remove_node_child(node, child, True)
160 | if text is not None:
161 | text_node = self.new_impl_text(text)
162 | self.add_node_child(node, text_node)
163 |
164 | def get_node_attributes(self, element, ns_uri=None):
165 | attr_nodes = []
166 | if not element.attributes:
167 | return attr_nodes
168 | for attr_name in list(element.attributes.keys()):
169 | if self.has_node_attribute(element, attr_name, ns_uri):
170 | attr_nodes.append(
171 | self.get_node_attribute_node(element, attr_name, ns_uri))
172 | return attr_nodes
173 |
174 | def has_node_attribute(self, element, name, ns_uri=None):
175 | if ns_uri is not None:
176 | return element.hasAttributeNS(ns_uri, name)
177 | else:
178 | return element.hasAttribute(name)
179 |
180 | def get_node_attribute_node(self, element, name, ns_uri=None):
181 | if ns_uri is not None:
182 | return element.getAttributeNodeNS(ns_uri, name)
183 | else:
184 | return element.getAttributeNode(name)
185 |
186 | def get_node_attribute_value(self, element, name, ns_uri=None):
187 | if isinstance(element, xml.dom.minidom.Document):
188 | return None
189 | if ns_uri is not None:
190 | result = element.getAttributeNS(ns_uri, name)
191 | else:
192 | result = element.getAttribute(name)
193 | # Minidom returns empty string for non-existent nodes, correct this
194 | if result == '' and name not in list(element.attributes.keys()):
195 | return None
196 | return result
197 |
198 | def set_node_attribute_value(self, element, name, value, ns_uri=None):
199 | element.setAttributeNS(ns_uri, name, value)
200 |
201 | def remove_node_attribute(self, element, name, ns_uri=None):
202 | if ns_uri is not None:
203 | element.removeAttributeNS(ns_uri, name)
204 | else:
205 | element.removeAttribute(name)
206 |
207 | def add_node_child(self, parent, child, before_sibling=None):
208 | if before_sibling is not None:
209 | parent.insertBefore(child, before_sibling)
210 | else:
211 | parent.appendChild(child)
212 |
213 | def import_node(self, parent, node, original_parent=None, clone=False):
214 | if clone:
215 | node = self.clone_node(node)
216 | self.add_node_child(parent, node)
217 |
218 | def clone_node(self, node, deep=True):
219 | return node.cloneNode(deep)
220 |
221 | def remove_node_child(self, parent, child, destroy_node=True):
222 | parent.removeChild(child)
223 | if destroy_node:
224 | child.unlink()
225 | return None
226 | else:
227 | return child
228 |
229 | def lookup_ns_uri_by_attr_name(self, node, name):
230 | curr_node = node
231 | while curr_node is not None:
232 | value = self.get_node_attribute_value(curr_node, name)
233 | if value is not None:
234 | return value
235 | curr_node = self.get_node_parent(curr_node)
236 | return None
237 |
238 | def lookup_ns_prefix_for_uri(self, node, uri):
239 | curr_node = node
240 | while curr_node:
241 | attrs = self.get_node_attributes(curr_node)
242 | for attr in attrs:
243 | if attr.value == uri:
244 | if ':' in attr.name:
245 | return attr.name.split(':')[1]
246 | else:
247 | return attr.name
248 | curr_node = self.get_node_parent(curr_node)
249 | return None
250 |
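
`get_node_attribute_value` in the adapter above corrects for minidom returning an empty string when an attribute does not exist. A minimal sketch of the same distinction, using only the standard library, is shown here; `attribute_or_none` is a hypothetical helper and not part of xml4h.

```python
import xml.dom.minidom


def attribute_or_none(element, name):
    """Return the attribute value, or None when it is absent (hypothetical)."""
    # minidom's getAttribute() returns '' for a missing attribute, so test
    # for presence explicitly before trusting an empty result.
    if not element.hasAttribute(name):
        return None
    return element.getAttribute(name)


doc = xml.dom.minidom.parseString('<film title="" year="1979"/>')
elem = doc.documentElement

year = attribute_or_none(elem, 'year')        # present, non-empty
title = attribute_or_none(elem, 'title')      # present but empty
director = attribute_or_none(elem, 'director')  # absent
```

Checking for presence first keeps an attribute that is genuinely set to an empty string (`title`) distinguishable from one that is missing (`director`).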
--------------------------------------------------------------------------------
/xml4h/impls/xml_etree_elementtree.py:
--------------------------------------------------------------------------------
1 | import re
2 | import copy
3 |
4 | import six
5 |
6 | from xml4h.impls.interface import XmlImplAdapter
7 | from xml4h import nodes, exceptions
8 |
9 | # Import the pure-Python ElementTree implementation, if possible
10 | try:
11 | import xml.etree.ElementTree as PythonET
12 | # Re-import non-C ElementTree with a definitive name, for cases where we
13 | # must explicitly use non-C-based elements of ElementTree.
14 | import xml.etree.ElementTree as BaseET
15 | except ImportError:
16 | pass
17 |
18 | # Import the C-based ElementTree implementation, if possible
19 | try:
20 | import xml.etree.cElementTree as cET
21 | except ImportError:
22 | pass
23 |
24 |
25 | class ElementTreeAdapter(XmlImplAdapter):
26 | """
27 | Adapter to the
28 | `ElementTree `_
29 | XML library.
30 |
31 | This code *must* work with either the base ElementTree pure python
32 | implementation or the C-based cElementTree implementation, since it is
33 | reused in the `cElementTree` class defined below.
34 | """
35 |
36 | ET = PythonET # Use the pure-Python implementation
37 |
38 | SUPPORTED_FEATURES = {
39 | 'xpath': True,
40 | }
41 |
42 | @classmethod
43 | def is_available(cls):
44 | # Is vital piece of ElementTree module available at all?
45 | try:
46 | cls.ET.Element
47 | except Exception:
48 | return False
49 | # We only support ElementTree version 1.3+
50 | from distutils.version import StrictVersion
51 | return StrictVersion(BaseET.VERSION) >= StrictVersion('1.3')
52 |
53 | @classmethod
54 | def parse_string(cls, xml_str, ignore_whitespace_text_nodes=True):
55 | return cls.parse_file(
56 | six.StringIO(xml_str),
57 | ignore_whitespace_text_nodes=ignore_whitespace_text_nodes)
58 |
59 | @classmethod
60 | def parse_bytes(cls, xml_bytes, ignore_whitespace_text_nodes=True):
61 | return cls.parse_file(
62 | six.BytesIO(xml_bytes),
63 | ignore_whitespace_text_nodes=ignore_whitespace_text_nodes)
64 |
65 | @classmethod
66 | def parse_file(cls, xml_file_path, ignore_whitespace_text_nodes=True):
67 | # To retain explicit xmlns namespace definition attributes, we need to
68 | # manually add these elements to the parsed DOM as we go using
69 | # iterative parsing per:
70 | # effbot.org/zone/element-namespaces.htm#preserving-existing-namespace-attributes
71 | events = ('start', 'start-ns')
72 | impl_root = None
73 | ns_list = []
74 | for event, node in cls.ET.iterparse(xml_file_path, events):
75 | if event == 'start-ns':
76 | # Track namespaces as nodes declared
77 | ns_list.append(node)
78 | elif event == 'start':
79 | # Recognise and retain root node
80 | if impl_root is None:
81 | impl_root = node
82 | # Add xmlns attributes for each namespace declared
83 | for ns_prefix, ns_uri in ns_list:
84 | if ns_prefix:
85 | attr_name = 'xmlns:%s' % ns_prefix
86 | else:
87 | attr_name = 'xmlns'
88 | node.set(attr_name, ns_uri)
89 | # Reset namespace list now the corresponding attributes exist
90 | ns_list = []
91 |
92 | impl_doc = cls.ET.ElementTree(impl_root)
93 | wrapped_doc = cls.wrap_document(impl_doc)
94 | if ignore_whitespace_text_nodes:
95 | cls.ignore_whitespace_text_nodes(wrapped_doc)
96 | return wrapped_doc
97 |
98 | @classmethod
99 | def new_impl_document(cls, root_tagname, ns_uri=None, **kwargs):
100 | root_nsmap = {}
101 | if ns_uri is not None:
102 | root_nsmap[None] = ns_uri
103 | else:
104 | ns_uri = nodes.Node.XMLNS_URI
105 | root_nsmap[None] = ns_uri
106 | root_elem = cls.ET.Element('{%s}%s' % (ns_uri, root_tagname))
107 | doc = cls.ET.ElementTree(root_elem)
108 | return doc
109 |
110 | # This method is called by interface super-class's __init__
111 | def clear_caches(self):
112 | self.CACHED_ANCESTRY_DICT = {}
113 |
114 | def _lookup_node_parent(self, node):
115 | """
116 | Return the parent of the given node, based on an internal dictionary
117 | mapping child nodes to their parents. This mapping is required since
118 | ElementTree doesn't make info about node ancestry/parentage available.
119 | """
120 | # Basic caching of our internal ancestry dict to help performance
121 | if not node in self.CACHED_ANCESTRY_DICT:
122 | # Given node isn't in cached ancestry dictionary, rebuild this now
123 | ancestry_dict = dict(
124 | (c, p) for p in self._impl_document.iter() for c in p)
125 | self.CACHED_ANCESTRY_DICT = ancestry_dict
126 | return self.CACHED_ANCESTRY_DICT[node]
127 |
128 | def _is_node_an_element(self, node):
129 | """
130 | Return True if the given node is an ElementTree Element, a fact that
131 | can be tricky to determine if the cElementTree implementation is
132 | used.
133 | """
134 | # Try the simplest approach first, works for plain old ElementTree
135 | if isinstance(node, BaseET.Element):
136 | return True
137 | # For cElementTree we need to be more cunning (or find a better way)
138 | if hasattr(node, 'makeelement') \
139 | and isinstance(node.tag, six.string_types):
140 | return True
141 |
142 | def map_node_to_class(self, node):
143 | if isinstance(node, BaseET.ElementTree):
144 | return nodes.Document
145 | elif node.tag == BaseET.ProcessingInstruction:
146 | return nodes.ProcessingInstruction
147 | elif node.tag == BaseET.Comment:
148 | return nodes.Comment
149 | elif isinstance(node, ETAttribute):
150 | return nodes.Attribute
151 | elif isinstance(node, ElementTreeText):
152 | if node.is_cdata:
153 | return nodes.CDATA
154 | else:
155 | return nodes.Text
156 | elif self._is_node_an_element(node):
157 | return nodes.Element
158 | raise exceptions.Xml4hImplementationBug(
159 | 'Unrecognized type for implementation node: %s' % node)
160 |
161 | def get_impl_root(self, node):
162 | return self._impl_document.getroot()
163 |
164 | # Document implementation methods
165 |
166 | def new_impl_element(self, tagname, ns_uri=None, parent=None):
167 | if ns_uri is not None:
168 | if ':' in tagname:
169 | tagname = tagname.split(':')[1]
170 | element = self.ET.Element('{%s}%s' % (ns_uri, tagname))
171 | return element
172 | else:
173 | return self.ET.Element(tagname)
174 |
175 | def new_impl_text(self, text):
176 | return ElementTreeText(text)
177 |
178 | def new_impl_comment(self, text):
179 | return self.ET.Comment(text)
180 |
181 | def new_impl_instruction(self, target, data):
182 | return self.ET.ProcessingInstruction(target, data)
183 |
184 | def new_impl_cdata(self, text):
185 | return ElementTreeText(text, is_cdata=True)
186 |
187 | def find_node_elements(self, node, name='*', ns_uri='*'):
188 | # TODO Any proper way to find namespaced elements by name?
189 | name_match_nodes = node.iter()
190 | # Filter nodes by name and ns_uri if necessary
191 | results = []
192 | for n in name_match_nodes:
193 | # Ignore the current node
194 | if n == node:
195 | continue
196 | # Ignore non-Elements
197 | if not isinstance(n.tag, six.string_types):
198 | continue
199 | if ns_uri != '*' and self.get_node_namespace_uri(n) != ns_uri:
200 | continue
201 | if name != '*' and self.get_node_local_name(n) != name:
202 | continue
203 | results.append(n)
204 | return results
205 | find_node_elements.__doc__ = XmlImplAdapter.find_node_elements.__doc__
206 |
207 | def xpath_on_node(self, node, xpath, **kwargs):
208 | """
209 | Return result of performing the given XPath query on the given node.
210 |
211 | All known namespace prefix-to-URI mappings in the document are
212 | automatically included in the XPath invocation.
213 |
214 | If an empty/default namespace (i.e. None) is defined, this is
215 | converted to the prefix name '_' so it can be used despite empty
216 | namespace prefixes being unsupported by XPath.
217 | """
218 | namespaces_dict = {}
219 | if 'namespaces' in kwargs:
220 | namespaces_dict.update(kwargs['namespaces'])
221 | # Empty namespace prefix is not supported, convert to '_' prefix
222 | if None in namespaces_dict:
223 | default_ns_uri = namespaces_dict.pop(None)
224 | namespaces_dict['_'] = default_ns_uri
225 | # If no default namespace URI defined, use root's namespace (if any)
226 | if not '_' in namespaces_dict:
227 | root = self.get_impl_root(node)
228 | qname, ns_uri, prefix, local_name = self._unpack_name(
229 | root.tag, root)
230 | if ns_uri:
231 | namespaces_dict['_'] = ns_uri
232 | # Include XMLNS namespace if it's not already defined
233 | if not 'xmlns' in namespaces_dict:
234 | namespaces_dict['xmlns'] = nodes.Node.XMLNS_URI
235 | return node.findall(xpath, namespaces_dict)
236 |
237 | # Node implementation methods
238 |
239 | def get_node_namespace_uri(self, node):
240 | if '}' in node.tag:
241 | return node.tag.split('}')[0][1:]
242 | elif isinstance(node, ETAttribute):
243 | return node.namespace_uri
244 | elif self._is_node_an_element(node):
245 | qname, ns_uri = self._unpack_name(node.tag, node)[:2]
246 | return ns_uri
247 | else:
248 | return None
249 |
250 | def set_node_namespace_uri(self, node, ns_uri):
251 | qname, orig_ns_uri, prefix, local_name = self._unpack_name(
252 | node.tag, node)
253 | node.tag = '{%s}%s' % (ns_uri, local_name)
254 |
255 | def get_node_parent(self, node):
256 | parent = None
257 | # Root document has no parent
258 | if isinstance(node, BaseET.ElementTree):
259 | pass
260 | elif hasattr(node, 'getparent'):
261 | parent = node.getparent()
262 | # Return ElementTree as root element's parent
263 | elif node == self.get_impl_root(node):
264 | parent = self._impl_document
265 | else:
266 | parent = self._lookup_node_parent(node)
267 | return parent
268 |
269 | def get_node_children(self, node):
270 | if isinstance(node, BaseET.ElementTree):
271 | children = [node.getroot()]
272 | else:
273 | if not hasattr(node, 'getchildren'):
274 | return []
275 | children = list(node.getchildren())
276 | # Hack to treat text attribute as child text nodes
277 | if node.text is not None:
278 | children.insert(0, ElementTreeText(node.text, parent=node))
279 | return children
280 |
281 | def get_node_name(self, node):
282 | if node.tag == BaseET.Comment:
283 | return '#comment'
284 | elif node.tag == BaseET.ProcessingInstruction:
285 | name, target = node.text.split(' ', 1)
286 | return name
287 | prefix = self.get_node_name_prefix(node)
288 | if prefix is not None:
289 | return '%s:%s' % (prefix, self.get_node_local_name(node))
290 | else:
291 | return self.get_node_local_name(node)
292 |
293 | def get_node_local_name(self, node):
294 | return re.sub('{.*}', '', node.tag)
295 |
296 | def get_node_name_prefix(self, node):
297 | # Ignore non-elements
298 | if not isinstance(node.tag, six.string_types):
299 | return None
300 | # Believe nodes that have their own prefix (likely only ETAttribute)
301 | prefix = getattr(node, 'prefix', None)
302 | if prefix:
303 | return prefix
304 | # Derive prefix by unpacking node name
305 | qname, ns_uri, prefix, local_name = self._unpack_name(node.tag, node)
306 | if prefix:
307 | # Don't add unnecessary excess namespace prefixes for elements
308 | # with a local default namespace declaration
309 | if node.attrib.get('xmlns') == ns_uri:
310 | return None
311 | # Don't add unnecessary excess namespace prefixes for default ns
312 | elif prefix == 'xmlns':
313 | return None
314 | else:
315 | return prefix
316 | else:
317 | return None
318 |
319 | def get_node_value(self, node):
320 | if node.tag == BaseET.ProcessingInstruction:
321 | name, target = node.text.split(' ', 1)
322 | return target
323 | elif node.tag == BaseET.Comment:
324 | return node.text
325 | elif hasattr(node, 'value'):
326 | return node.value
327 | else:
328 | return node.text
329 |
330 | def set_node_value(self, node, value):
331 | if hasattr(node, 'value'):
332 | node.value = value
333 | else:
334 | self.set_node_text(node, value)
335 |
336 | def get_node_text(self, node):
337 | return node.text
338 |
339 | def set_node_text(self, node, text):
340 | node.text = text
341 |
342 | def get_node_attributes(self, element, ns_uri=None):
343 | # TODO: Filter by ns_uri
344 | attribs_by_qname = {}
345 | for n, v in list(element.attrib.items()):
346 | qname, ns_uri, prefix, local_name = self._unpack_name(n, element)
347 | attribs_by_qname[qname] = ETAttribute(
348 | qname, ns_uri, prefix, local_name, v, element)
349 | return list(attribs_by_qname.values())
350 |
351 | def has_node_attribute(self, element, name, ns_uri=None):
352 | return name in [a.qname for a
353 | in self.get_node_attributes(element, ns_uri)]
354 |
355 | def get_node_attribute_node(self, element, name, ns_uri=None):
356 | for attr in self.get_node_attributes(element, ns_uri):
357 | if attr.qname == name:
358 | return attr
359 | return None
360 |
361 | def get_node_attribute_value(self, element, name, ns_uri=None):
362 | if ns_uri is not None:
363 | prefix = self.lookup_ns_prefix_for_uri(element, ns_uri)
364 | name = '%s:%s' % (prefix, name)
365 | for attr in self.get_node_attributes(element, ns_uri):
366 | if attr.qname == name:
367 | return attr.value
368 | return None
369 |
370 | def set_node_attribute_value(self, element, name, value, ns_uri=None):
371 | prefix = None
372 | if ':' in name:
373 | prefix, name = name.split(':')
374 | if ns_uri is None and prefix is not None:
375 | ns_uri = self.lookup_ns_uri_by_attr_name(element, prefix)
376 | if ns_uri is not None:
377 | name = '{%s}%s' % (ns_uri, name)
378 | if name.startswith('{%s}' % nodes.Node.XMLNS_URI):
379 | if name.split('}')[1] == 'xmlns':
380 | # Hack to remove namespace URI from 'xmlns' attributes so
381 | # the name is just a simple string
382 | name = 'xmlns'
383 | element.attrib[name] = value
384 | else:
385 | element.attrib[name] = value
386 |
387 | def remove_node_attribute(self, element, name, ns_uri=None):
388 | if ns_uri is not None:
389 | name = '{%s}%s' % (ns_uri, name)
390 | elif ':' in name:
391 | prefix, local_name = name.split(':')
392 | if prefix != 'xmlns':
393 | ns_attr_name = 'xmlns:%s' % prefix
394 | ns_uri = self.lookup_ns_uri_by_attr_name(element, ns_attr_name)
395 | name = '{%s}%s' % (ns_uri, local_name)
396 | if name in element.attrib:
397 | del(element.attrib[name])
398 |
399 | def add_node_child(self, parent, child, before_sibling=None):
400 | if isinstance(child, ElementTreeText):
401 | # Add text values directly to parent's 'text' attribute
402 | if parent.text is not None:
403 | parent.text = parent.text + child.text
404 | else:
405 | parent.text = child.text
406 | self.CACHED_ANCESTRY_DICT[child] = parent
407 | return None
408 | else:
409 | if before_sibling is not None:
410 | offset = 0
411 | for c in parent.getchildren():
412 | if c == before_sibling:
413 | break
414 | offset += 1
415 | parent.insert(offset, child)
416 | else:
417 | parent.append(child)
418 | self.CACHED_ANCESTRY_DICT[child] = parent
419 | return child
420 |
421 | def import_node(self, parent, node, original_parent=None, clone=False):
422 | original_node = node
423 | # We always clone for (c)ElementTree adapter so we can remove original
424 | # if necessary
425 | node = self.clone_node(node)
426 | self.add_node_child(parent, node)
427 | # Hack to remove text node content from original parent by manually
428 | # deleting matching text content
429 | if not clone:
430 | if isinstance(original_node, ElementTreeText):
431 | original_parent = self.get_node_parent(original_node)
432 | if original_parent.text == original_node.text:
433 | # Must set to None if there would be no remaining text,
434 | # otherwise parent element won't realise it's empty
435 | original_parent.text = None
436 | else:
437 | original_parent.text = \
438 | original_parent.text.replace(original_node.text, '', 1)
439 | else:
440 | original_parent.remove(original_node)
441 |
442 | def clone_node(self, node, deep=True):
443 | if deep:
444 | return copy.deepcopy(node)
445 | else:
446 | return copy.copy(node)
447 |
448 | def remove_node_child(self, parent, child, destroy_node=True):
449 | if isinstance(child, ElementTreeText):
450 | child._parent.text = None
451 | return
452 | parent.remove(child)
453 | if destroy_node:
454 | child.clear()
455 | return None
456 | else:
457 | return child
458 |
459 | def lookup_ns_uri_by_attr_name(self, node, name):
460 | curr_node = node
461 | while (curr_node is not None
462 | and not isinstance(curr_node, BaseET.ElementTree)):
463 | uri = self.get_node_attribute_value(curr_node, name)
464 | if uri is not None:
465 | return uri
466 | curr_node = self.get_node_parent(curr_node)
467 | return None
468 |
469 | def lookup_ns_prefix_for_uri(self, node, uri):
470 | if uri == nodes.Node.XMLNS_URI:
471 | return 'xmlns'
472 | result = None
473 | # Lookup namespace URI in ET's awful global namespace/prefix registry
474 | if hasattr(BaseET, '_namespace_map') and uri in BaseET._namespace_map:
475 | result = BaseET._namespace_map[uri]
476 | if result == '':
477 | result = None
478 | if result is None or re.match(r'ns\d', result):
479 | # We either have no namespace prefix in the global mapping, in
480 | # which case we will try looking for a matching xmlns attribute,
481 | # or we have a namespace prefix that was probably assigned
482 | # automatically by ElementTree and we'd rather use a
483 | # human-assigned prefix if available.
484 | curr_node = node
485 | while self._is_node_an_element(curr_node):
486 | for n, v in list(curr_node.attrib.items()):
487 | if v == uri:
488 | if n.startswith('xmlns:'):
489 | result = n.split(':')[1]
490 | return result
491 | elif n.startswith('{%s}' % nodes.Node.XMLNS_URI):
492 | result = n.split('}')[1]
493 | return result
494 | curr_node = self.get_node_parent(curr_node)
495 | return result
496 |
497 | def _unpack_name(self, name, node):
498 | qname = prefix = local_name = ns_uri = None
499 | if name == 'xmlns':
500 | # Namespace URI of 'xmlns' is a constant
501 | ns_uri = nodes.Node.XMLNS_URI
502 | elif '}' in name:
503 | # Namespace URI is contained in {}, find URI's defined prefix
504 | ns_uri, local_name = name.split('}')
505 | ns_uri = ns_uri[1:]
506 | prefix = self.lookup_ns_prefix_for_uri(node, ns_uri)
507 | elif ':' in name:
508 | # Namespace prefix is before ':', find prefix's defined URI
509 | prefix, local_name = name.split(':')
510 | if prefix == 'xmlns':
511 | # All 'xmlns' attributes are in XMLNS URI by definition
512 | ns_uri = nodes.Node.XMLNS_URI
513 | else:
514 | ns_uri = self.lookup_ns_uri_by_attr_name(node, prefix)
515 | # Catch case where a prefix other than 'xmlns' points at XMLNS URI
516 | if name != 'xmlns' and ns_uri == nodes.Node.XMLNS_URI:
517 | prefix = 'xmlns'
518 | # Construct fully-qualified name from prefix + local names
519 | if prefix is not None:
520 | qname = '%s:%s' % (prefix, local_name)
521 | else:
522 | qname = local_name = name
523 | return (qname, ns_uri, prefix, local_name)
524 |
525 |
526 | class ElementTreeText(object):
527 |
528 | def __init__(self, text, parent=None, is_cdata=False):
529 | self._text = text
530 | self._parent = parent
531 | self._is_cdata = is_cdata
532 |
533 | @property
534 | def is_cdata(self):
535 | return self._is_cdata
536 |
537 | @property
538 | def value(self):
539 | return self._text
540 |
541 | text = value # Alias
542 |
543 | def getparent(self):
544 | return self._parent
545 |
546 | @property
547 | def prefix(self):
548 | return None
549 |
550 | @property
551 | def tag(self):
552 | if self.is_cdata:
553 | return "#cdata-section"
554 | else:
555 | return "#text"
556 |
557 |
558 | class ETAttribute(object):
559 |
560 | def __init__(self, qname, ns_uri, prefix, local_name, value, element):
561 | self._qname, self._ns_uri, self._prefix, self._local_name = (
562 | qname, ns_uri, prefix, local_name)
563 | self._value, self._element = (value, element)
564 |
565 | def getroottree(self):
566 | return self._element.getroottree()
567 |
568 | @property
569 | def qname(self):
570 | return self._qname
571 |
572 | @property
573 | def namespace_uri(self):
574 | return self._ns_uri
575 |
576 | @property
577 | def prefix(self):
578 | return self._prefix
579 |
580 | @property
581 | def local_name(self):
582 | return self._local_name
583 |
584 | @property
585 | def value(self):
586 | return self._value
587 |
588 | name = tag = local_name # Alias
589 |
590 |
591 | class cElementTreeAdapter(ElementTreeAdapter):
592 | """
593 | Adapter to the C-based implementation of the
594 | `ElementTree `_
595 | XML library.
596 | """
597 |
598 | ET = cET # Use the C-based implementation
599 |
600 | @classmethod
601 | def is_available(cls):
602 | if not super(cElementTreeAdapter, cls).is_available():
603 | return False
604 | # We only support cElementTree version 1.0.6+
605 | from distutils.version import StrictVersion
606 | return StrictVersion(cls.ET.VERSION) >= StrictVersion('1.0.6')
607 |
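The namespace-preserving approach in `parse_file` above can be sketched with nothing but the standard library. This is an illustration under assumptions rather than the adapter's code: the helper name `parse_keeping_xmlns` and the sample XML are hypothetical, and it relies on `iterparse` reporting `start-ns` events before the `start` event of the element that declares them.

```python
import io
import xml.etree.ElementTree as ET


def parse_keeping_xmlns(xml_bytes):
    """Parse XML, copying each namespace declaration onto its declaring element."""
    root = None
    pending = []  # (prefix, uri) pairs seen since the last 'start' event
    events = ('start', 'start-ns')
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == 'start-ns':
            pending.append(payload)
            continue
        # event == 'start': payload is the element that was just opened
        if root is None:
            root = payload
        for prefix, uri in pending:
            attr_name = 'xmlns:%s' % prefix if prefix else 'xmlns'
            payload.set(attr_name, uri)
        pending = []
    return root


root = parse_keeping_xmlns(
    b'<films xmlns="urn:a" xmlns:b="urn:b"><b:film/></films>')
print(root.tag)
print(sorted(root.attrib.items()))
```

Without this step ElementTree folds the declarations into `{uri}name` tags and the original `xmlns` attributes are no longer visible on the parsed elements.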
--------------------------------------------------------------------------------
/xml4h/writer.py:
--------------------------------------------------------------------------------
1 | """
2 | Writer to serialize XML DOM documents or sections to text.
3 | """
4 | # This implementation is adapted (heavily) from the standard library method
5 | # xml.dom.minidom.writexml
6 | import six
7 |
8 | import codecs
9 |
10 | from xml4h import exceptions
11 |
12 |
13 | def write_node(node, writer, encoding='utf-8', indent=0, newline='',
14 | omit_declaration=False, node_depth=0, quote_char='"'):
15 | """
16 | Serialize an *xml4h* DOM node and its descendants to text, writing
17 | the output to the given *writer*.
18 |
19 | :param node: the DOM node whose content and descendants will
20 | be serialized.
21 | :type node: an :class:`xml4h.nodes.Node` or subclass
22 | :param writer: a file or stream to which XML text is written.
23 | :type writer: a file, stream, etc
24 | :param string encoding: the character encoding for serialized text.
25 | :param indent: indentation prefix to apply to descendent nodes for
26 | pretty-printing. The value can take many forms:
27 |
28 | - *int*: the number of spaces to indent. 0 means no indent.
29 | - *string*: a literal prefix for indented nodes, such as ``\\t``.
30 | - *bool*: no indent if *False*, four spaces indent if *True*.
31 | - *None*: no indent.
32 | :type indent: string, int, bool, or None
33 | :param newline: the string value used to separate lines of output.
34 | The value can take a number of forms:
35 |
36 | - *string*: the literal newline value, such as ``\\n`` or ``\\r``.
37 | An empty string means no newline.
38 | - *bool*: no newline if *False*, ``\\n`` newline if *True*.
39 | - *None*: no newline.
40 | :type newline: string, bool, or None
41 | :param boolean omit_declaration: if *True* the XML declaration header
42 | is omitted, otherwise it is included. Note that the declaration is
43 | only output when serializing an :class:`xml4h.nodes.Document` node.
44 | :param int node_depth: the indentation level to start at, such as 2 to
45 | indent output as if the given *node* has two ancestors.
46 | This parameter will only be useful if you need to output XML text
47 | fragments that can be assembled into a document. This parameter
48 | has no effect unless indentation is applied.
49 | :param string quote_char: the character that delimits quoted content.
50 | You should never need to mess with this.
51 | """
52 | def _sanitize_write_value(value):
53 | """Return XML-encoded value."""
54 | if not value:
55 | return value
56 | return (value
57 | .replace("&", "&amp;")
58 | .replace("<", "&lt;")
59 | .replace("\"", "&quot;")
60 | .replace(">", "&gt;")
61 | )
62 |
63 | def _write_node_impl(node, node_depth):
64 | """
65 | Internal write implementation that does the real work while keeping
66 | track of node depth.
67 | """
68 | # Output document declaration if we're outputting the whole doc
69 | if node.is_document:
70 | if not omit_declaration:
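71 | writer.write(
72 | '<?xml version=%s1.0%s' % (quote_char, quote_char))
73 | if encoding is not None:
74 | writer.write(' encoding=%s%s%s'
75 | % (quote_char, encoding, quote_char))
76 | writer.write('?>%s' % newline)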
77 | for child in node.children:
78 | _write_node_impl(child,
79 | node_depth) # node_depth not incremented
80 | writer.write(newline)
81 | elif node.is_document_type:
82 | writer.write("")
93 | elif node.is_text:
94 | writer.write(
95 | _sanitize_write_value(node.value)
96 | )
97 | elif node.is_cdata:
98 | if ']]>' in node.value:
99 | raise ValueError("']]>' is not allowed in CDATA node value")
100 | writer.write(
101 | "<![CDATA[%s]]>" % node.value
102 | )
103 | #elif node.is_entity_reference: # TODO
104 | elif node.is_entity:
105 | writer.write(newline + indent * node_depth)
106 | writer.write(""
111 | % (node.name, quote_char, node.value, quote_char)
112 | )
113 | elif node.is_processing_instruction:
114 | writer.write(newline + indent * node_depth)
115 | writer.write("<?%s %s?>" % (node.target, node.data))
116 | elif node.is_comment:
117 | if '--' in node.value:
118 | raise ValueError("'--' is not allowed in COMMENT node value")
119 | writer.write("<!--%s-->" % node.value)
120 | elif node.is_notation:
121 | writer.write(newline + indent * node_depth)
122 | writer.write(""
125 | % (quote_char, node.external_id, quote_char))
126 | elif node.is_system_identifier:
127 | writer.write(" system %s%s%s %s%s%s>"
128 | % (quote_char, node.external_id, quote_char,
129 | quote_char, node.uri, quote_char))
130 | elif node.is_attribute:
131 | writer.write(
132 | " %s=%s" % (node.name, quote_char)
133 | )
134 | writer.write(
135 | _sanitize_write_value(node.value)
136 | )
137 | writer.write(quote_char)
138 | elif node.is_element:
139 | # Only need a preceding newline if we're in a sub-element
140 | if node_depth > 0:
141 | writer.write(newline)
142 | writer.write(indent * node_depth)
143 | writer.write("<" + node.name)
144 |
145 | for attr in node.attribute_nodes:
146 | _write_node_impl(attr, node_depth)
147 | if node.children:
148 | found_indented_child = False
149 | writer.write(">")
150 | for child in node.children:
151 | _write_node_impl(child, node_depth + 1)
152 | if not (child.is_text
153 | or child.is_comment
154 | or child.is_cdata):
155 | found_indented_child = True
156 | if found_indented_child:
157 | writer.write(newline + indent * node_depth)
158 | writer.write('</%s>' % node.name)
159 | else:
160 | writer.write('/>')
161 | else:
162 | raise exceptions.Xml4hImplementationBug(
163 | 'Cannot write node with class: %s' % node.__class__)
164 |
165 | # Sanitize whitespace parameters
166 | if indent is True:
167 | indent = ' ' * 4
168 | elif indent is False or indent is None:
169 | indent = ''
170 | elif isinstance(indent, int):
171 | indent = ' ' * indent
172 | # If indent but no newline set, always apply a newline (it makes sense)
173 | if indent and not newline:
174 | newline = True
175 |
176 | if newline is None or newline is False:
177 | newline = ''
178 | elif newline is True:
179 | newline = '\n'
180 |
181 | # If we have a target encoding and are writing to a binary IO stream, wrap
182 | # the writer with an encoding writer to produce the correct bytes.
183 | # We detect binary IO streams by:
184 | # - Python 3: the *absence* of the `encoding` attribute that is present on
185 | # `io.TextIOBase`-derived objects
186 | # - Python 2: the *absence* of the `encode` attribute that is present on
187 | # `StringIO` objects
188 | if (
189 | encoding
190 | and not hasattr(writer, 'encoding')
191 | and not hasattr(writer, 'encode')
192 | ):
193 | writer = codecs.getwriter(encoding)(writer)
194 |
195 | # Do the business...
196 | _write_node_impl(node, node_depth)
197 |
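The handling of the `indent` and `newline` parameters described in the `write_node` docstring can be exercised on its own. The sketch below is a hypothetical standalone helper (`normalize_whitespace_params` is not part of xml4h) that follows the same order of checks as the code above and, as an assumption based on the docstring, treats `indent=None` as no indent.

```python
def normalize_whitespace_params(indent, newline):
    """Return (indent, newline) as plain strings, following write_node's rules."""
    # Booleans are checked first because bool is a subclass of int
    if indent is True:
        indent = ' ' * 4
    elif indent is False or indent is None:
        indent = ''
    elif isinstance(indent, int):
        indent = ' ' * indent
    # Indentation without line breaks makes little sense, so force a newline
    if indent and not newline:
        newline = True
    if newline is None or newline is False:
        newline = ''
    elif newline is True:
        newline = '\n'
    return indent, newline


print(repr(normalize_whitespace_params(True, None)))
print(repr(normalize_whitespace_params('\t', '\r\n')))
```

Resolving both values to strings up front lets the serializer write `newline + indent * node_depth` without any further type checks.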
--------------------------------------------------------------------------------