'
48 | xs = self.hxs_cls(text)
49 | self.assertEqual(xs.select('//input[@name="a"]/@name="a"').extract(), ['1'])
50 | self.assertEqual(xs.select('//input[@name="a"]/@name="n"').extract(), ['0'])
51 |
52 | def test_extractor_xml_html(self):
53 | '''Test that XML and HTML XPathExtractors behave differently.'''
54 | # some text which is parsed differently by XML and HTML flavors
55 | text = '
'
140 |
141 | >>> Url(css='a', count=1).parse(content)
142 | '/test'
143 |
144 | >>> Url(css='a', count=1).parse(content, url='http://github.com/Mimino666')
145 | 'http://github.com/test' # absolute URL. Told ya!
146 |
147 | >>> Prefix(css='#main', children=[
148 | ... Url(css='a', count=1)
149 | ... ]).parse(content, url='http://github.com/Mimino666') # you can pass url also to ancestor's parse(). It will propagate down.
150 | 'http://github.com/test'
151 |
152 |
153 | --------
154 | DateTime
155 | --------
156 |
157 | **Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), ``format`` (**required**), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"_text"``), `callback`_ (optional), `namespaces`_ (optional)
158 |
159 | Returns the ``datetime.datetime`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format)``.
160 |
161 | ``format`` syntax is described in the `Python documentation <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior>`_.
162 |
163 | If ``callback`` is specified, it is called *after* the datetime objects are constructed.
164 |
165 | Example:
166 |
167 | .. code-block:: python
168 |
169 | >>> from xextract import DateTime
170 | >>> DateTime(css='span', count=1, format='%d.%m.%Y %H:%M').parse('<span>24.12.2015 5:30</span>')
171 | datetime.datetime(2015, 12, 24, 5, 30)
172 |
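A minimal sketch of the ``callback`` behaviour described above: the callback receives the already constructed ``datetime.datetime`` object. The markup and the year-extracting callback are made up for illustration.

.. code-block:: python

>>> DateTime(css='span', count=1, format='%d.%m.%Y %H:%M',
...          callback=lambda dt: dt.year).parse('<span>24.12.2015 5:30</span>')
2015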
173 |
174 | ----
175 | Date
176 | ----
177 |
178 | **Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), ``format`` (**required**), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"_text"``), `callback`_ (optional), `namespaces`_ (optional)
179 |
180 | Returns the ``datetime.date`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format).date()``.
181 |
182 | ``format`` syntax is described in the `Python documentation <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior>`_.
183 |
184 | If ``callback`` is specified, it is called *after* the datetime objects are constructed.
185 |
186 | Example:
187 |
188 | .. code-block:: python
189 |
190 | >>> from xextract import Date
191 | >>> Date(css='span', count=1, format='%d.%m.%Y').parse('<span>24.12.2015</span>')
192 | datetime.date(2015, 12, 24)
193 |
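A minimal sketch, assuming made-up markup, of combining ``Date`` with ``count='+'`` so that a list of ``datetime.date`` objects is returned:

.. code-block:: python

>>> Date(css='li', count='+', format='%Y-%m-%d').parse('<ul><li>2015-01-01</li><li>2015-06-15</li></ul>')
[datetime.date(2015, 1, 1), datetime.date(2015, 6, 15)]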
194 |
195 | -------
196 | Element
197 | -------
198 |
199 | **Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `count`_ (optional, default ``"*"``), `callback`_ (optional), `namespaces`_ (optional)
200 |
201 | Returns the lxml instance (``lxml.etree._Element``) of the matched element(s).
202 | If you use an xpath expression that matches the text content of the element (e.g. ``text()`` or ``@attr``), unicode is returned.
203 |
204 | If ``callback`` is specified, it is called with ``lxml.etree._Element`` instance.
205 |
206 | Example:
207 |
208 | .. code-block:: python
209 |
210 | >>> from xextract import Element
211 | >>> Element(css='span', count=1).parse('<span>Hello</span>')
212 | <Element span at 0x...>
213 |
214 | >>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
215 | 'Hello'
216 |
217 | # same as above
218 | >>> Element(xpath='//span/text()', count=1).parse('<span>Hello</span>')
219 | 'Hello'
220 |
221 |
222 | -----
223 | Group
224 | -----
225 |
226 | **Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `children`_ (**required**), `count`_ (optional, default ``"*"``), `callback`_ (optional), `namespaces`_ (optional)
227 |
228 | For each element matched by the css/xpath selector, returns a dictionary containing the data extracted by the parsers listed in the ``children`` parameter.
229 | All parsers listed in the ``children`` parameter **must** have ``name`` specified - this is then used as the key in the dictionary.
230 |
231 | Typical use case for this parser is when you want to parse structured data, e.g. list of user profiles, where each profile contains fields like name, address, etc. Use ``Group`` parser to group the fields of each user profile together.
232 |
233 | If ``callback`` is specified, it is called with the dictionary of parsed children values.
234 |
235 | Example:
236 |
237 | .. code-block:: python
238 |
239 | >>> from xextract import Group, String
240 | >>> content = '<ul><li id="id1">michal</li><li id="id2">peter</li></ul>'
241 |
242 | >>> Group(css='li', count=2, children=[
243 | ... String(name='id', xpath='self::*', count=1, attr='id'),
244 | ... String(name='name', xpath='self::*', count=1)
245 | ... ]).parse(content)
246 | [{'name': 'michal', 'id': 'id1'},
247 | {'name': 'peter', 'id': 'id2'}]
248 |
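A minimal sketch of the ``callback`` behaviour described above, reusing the ``content`` from the previous example; the callback receives the dictionary of parsed children values (the formatting itself is made up for illustration):

.. code-block:: python

>>> Group(css='li', count=2, callback=lambda d: '%s (%s)' % (d['name'], d['id']), children=[
...     String(name='id', xpath='self::*', count=1, attr='id'),
...     String(name='name', xpath='self::*', count=1)
... ]).parse(content)
['michal (id1)', 'peter (id2)']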
249 |
250 | ------
251 | Prefix
252 | ------
253 |
254 | **Parameters**: `css / xpath`_ (optional, default ``"self::*"``), `children`_ (**required**), `namespaces`_ (optional)
255 |
256 | This parser doesn't actually parse any data on its own. Instead, use it when many of your parsers share the same css/xpath selector prefix.
257 |
258 | The ``Prefix`` parser always returns a single dictionary containing the data extracted by the parsers listed in the ``children`` parameter.
259 | All parsers listed in the ``children`` parameter **must** have ``name`` specified - this is then used as the key in the dictionary.
260 |
261 | Example:
262 |
263 | .. code-block:: python
264 |
265 | # instead of...
266 | >>> String(css='#main .name').parse(...)
267 | >>> String(css='#main .date').parse(...)
268 |
269 | # ...you can use
270 | >>> from xextract import Prefix
271 | >>> Prefix(css='#main', children=[
272 | ... String(name="name", css='.name'),
273 | ... String(name="date", css='.date')
274 | ... ]).parse(...)
275 |
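A minimal, self-contained variant with made-up markup, showing that ``Prefix`` returns a single dictionary keyed by the children's ``name``:

.. code-block:: python

>>> content = '<div id="main"><span class="name">Michal</span><span class="date">24.12.2015</span></div>'
>>> Prefix(css='#main', children=[
...     String(name='name', css='.name', count=1),
...     String(name='date', css='.date', count=1)
... ]).parse(content)
{'name': 'Michal', 'date': '24.12.2015'}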
276 |
277 | =================
278 | Parser parameters
279 | =================
280 |
281 | ----
282 | name
283 | ----
284 |
285 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
286 |
287 | **Default value**: ``None``
288 |
289 | If specified, then the extracted data will be returned in a dictionary, with the ``name`` as the key and the data as the value.
290 |
291 | All parsers listed in the ``children`` parameter of a ``Group`` or ``Prefix`` parser **must** have ``name`` specified.
292 | If multiple children parsers have the same ``name``, the behavior is undefined.
293 |
294 | Example:
295 |
296 | .. code-block:: python
297 |
298 | # when `name` is not specified, raw value is returned
299 | >>> String(css='span', count=1).parse('<span>Hello!</span>')
300 | 'Hello!'
301 |
302 | # when `name` is specified, dictionary is returned with `name` as the key
303 | >>> String(name='message', css='span', count=1).parse('<span>Hello!</span>')
304 | {'message': 'Hello!'}
305 |
306 |
307 | -----------
308 | css / xpath
309 | -----------
310 |
311 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_
312 |
313 | **Default value (xpath)**: ``"self::*"``
314 |
315 | Use either ``css`` or ``xpath`` parameter (but not both) to select the elements from which to extract the data.
316 |
317 | Under the hood css selectors are translated into equivalent xpath selectors.
318 |
319 | For the children of ``Prefix`` or ``Group`` parsers, the elements are selected relative to the elements matched by the parent parser.
320 |
321 | Example:
322 |
323 | .. code-block:: python
324 |
325 | Prefix(xpath='//*[@id="profile"]', children=[
326 | # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
327 | String(name='name', css='.name', count=1),
328 |
329 | # equivalent to: //*[@id="profile"]/*[@class="title"]
330 | String(name='title', xpath='*[@class="title"]', count=1),
331 |
332 | # equivalent to: //*[@class="subtitle"]
333 | String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
334 | ])
335 |
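A minimal, runnable sketch of the css-to-xpath translation and the equivalences claimed in the comments above, using made-up markup; both parsers below select the same element:

.. code-block:: python

>>> content = '<div id="profile"><span class="name">John</span></div>'
>>> String(css='#profile .name', count=1).parse(content)
'John'
>>> String(xpath='//*[@id="profile"]/descendant-or-self::*[@class="name"]', count=1).parse(content)
'John'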
336 |
337 | -----
338 | count
339 | -----
340 |
341 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
342 |
343 | **Default value**: ``"*"``
344 |
345 | ``count`` specifies the expected number of elements to be matched with css/xpath selector. It serves two purposes:
346 |
347 | 1. Number of matched elements is checked against the ``count`` parameter. If the number of elements doesn't match the expected count, ``xextract.parsers.ParsingError`` exception is raised. This way you will be notified when the website changes its structure.
348 | 2. It tells the parser whether to return a single extracted value or a list of values. See the table below.
349 |
350 | The syntax for ``count`` mimics regular expressions.
351 | You can pass the value either as a string, a single integer, or a tuple of two integers.
352 |
353 | Depending on the value of ``count``, the parser returns either a single extracted value or a list of values.
354 |
355 | +-------------------+-----------------------------------------------+-----------------------------+
356 | | Value of ``count``| Meaning | Extracted data |
357 | +===================+===============================================+=============================+
358 | | ``"*"`` (default) | Zero or more elements. | List of values |
359 | +-------------------+-----------------------------------------------+-----------------------------+
360 | | ``"+"`` | One or more elements. | List of values |
361 | +-------------------+-----------------------------------------------+-----------------------------+
362 | | ``"?"`` | Zero or one element. | Single value or ``None`` |
363 | +-------------------+-----------------------------------------------+-----------------------------+
364 | | ``num`` | Exactly ``num`` elements. | ``num`` == 0: ``None`` |
365 | | | | |
366 | | | You can pass either string or integer. | ``num`` == 1: Single value |
367 | | | | |
368 | | | | ``num`` > 1: List of values |
369 | +-------------------+-----------------------------------------------+-----------------------------+
370 | | ``(num1, num2)`` | Number of elements has to be between | List of values |
371 | | | ``num1`` and ``num2``, inclusive. | |
372 | | | | |
373 | | | You can pass either a string or 2-tuple. | |
374 | +-------------------+-----------------------------------------------+-----------------------------+
375 |
376 | Example:
377 |
378 | .. code-block:: python
379 |
380 | >>> String(css='.full-name', count=1).parse(content) # return single value
381 | 'John Rambo'
382 |
383 | >>> String(css='.full-name', count='1').parse(content) # same as above
384 | 'John Rambo'
385 |
386 | >>> String(css='.full-name', count=(1,2)).parse(content) # return list of values
387 | ['John Rambo']
388 |
389 | >>> String(css='.full-name', count='1,2').parse(content) # same as above
390 | ['John Rambo']
391 |
392 | >>> String(css='.middle-name', count='?').parse(content) # return single value or None
393 | None
394 |
395 | >>> String(css='.job-titles', count='+').parse(content) # return list of values
396 | ['President', 'US Senator', 'State Senator', 'Senior Lecturer in Law']
397 |
398 | >>> String(css='.friends', count='*').parse(content) # return possibly empty list of values
399 | []
400 |
401 | >>> String(css='.friends', count='+').parse(content) # raise exception, when no elements are matched
402 | xextract.parsers.ParsingError: Parser String matched 0 elements ("+" expected).
403 |
404 |
405 | ----
406 | attr
407 | ----
408 |
409 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_
410 |
411 | **Default value**: ``"href"`` for ``Url`` parser. ``"_text"`` otherwise.
412 |
413 | Use ``attr`` parameter to specify what data to extract from the matched element.
414 |
415 | +-------------------+-----------------------------------------------------+
416 | | Value of ``attr`` | Meaning |
417 | +===================+=====================================================+
418 | | ``"_text"`` | Extract the text content of the matched element. |
419 | +-------------------+-----------------------------------------------------+
420 | | ``"_all_text"`` | Extract and concatenate the text content of |
421 | | | the matched element and all its descendants. |
422 | +-------------------+-----------------------------------------------------+
423 | | ``"_name"`` | Extract tag name of the matched element. |
424 | +-------------------+-----------------------------------------------------+
425 | | ``att_name`` | Extract the value out of ``att_name`` attribute of |
426 | | | the matched element. |
427 | | | |
428 | | | If such attribute doesn't exist, empty string is |
429 | | | returned. |
430 | +-------------------+-----------------------------------------------------+
431 |
432 | Example:
433 |
434 | .. code-block:: python
435 |
436 | >>> from xextract import String, Url
437 | >>> content = '<span class="name">Barack <b>Obama</b> III.</span><a href="/test">Link</a>'
438 |
439 | >>> String(css='.name', count=1).parse(content) # default attr is "_text"
440 | 'Barack III.'
441 |
442 | >>> String(css='.name', count=1, attr='_text').parse(content) # same as above
443 | 'Barack III.'
444 |
445 | >>> String(css='.name', count=1, attr='_all_text').parse(content) # all text
446 | 'Barack Obama III.'
447 |
448 | >>> String(css='.name', count=1, attr='_name').parse(content) # tag name
449 | 'span'
450 |
451 | >>> Url(css='a', count='1').parse(content) # Url extracts href by default
452 | '/test'
453 |
454 | >>> String(css='a', count='1', attr='id').parse(content) # non-existent attributes return empty string
455 | ''
456 |
457 |
458 | --------
459 | callback
460 | --------
461 |
462 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
463 |
464 | Provides an easy way to post-process extracted values.
465 | It should be a function that takes a single argument, the extracted value, and returns the postprocessed value.
466 |
467 | Example:
468 |
469 | .. code-block:: python
470 |
471 | >>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
472 | [1, 2]
473 |
474 | >>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
475 | 'Hello'
476 |
477 | --------
478 | children
479 | --------
480 |
481 | **Parsers**: `Group`_, `Prefix`_
482 |
483 | Specifies the children parsers for the ``Group`` and ``Prefix`` parsers.
484 | All parsers listed in the ``children`` parameter **must** have ``name`` specified.
485 |
486 | Css/xpath selectors in the children parsers are relative to the selectors specified in the parent parser.
487 |
488 | Example:
489 |
490 | .. code-block:: python
491 |
492 | Prefix(xpath='//*[@id="profile"]', children=[
493 | # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
494 | String(name='name', css='.name', count=1),
495 |
496 | # equivalent to: //*[@id="profile"]/*[@class="title"]
497 | String(name='title', xpath='*[@class="title"]', count=1),
498 |
499 | # equivalent to: //*[@class="subtitle"]
500 | String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
501 | ])
502 |
503 | ----------
504 | namespaces
505 | ----------
506 |
507 | **Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_
508 |
509 | When parsing XML documents that contain namespaces, pass a dictionary mapping namespace prefixes to namespace URIs.
510 | Then use the fully qualified name for elements in the xpath selector, in the form ``"prefix:element"``.
511 |
512 | For the moment, you **cannot use the default (empty) namespace prefix** in xpath selectors (see the `lxml docs <https://lxml.de/xpathxslt.html>`_ for more information). Just map the default namespace to an arbitrary prefix.
513 |
514 | Example:
515 |
516 | .. code-block:: python
517 |
518 | >>> content = '''<?xml version="1.0"?>
519 | ... <movie xmlns="http://imdb.com/ns/">
520 | ...   <title>The Shawshank Redemption</title>
521 | ...   <year>1994</year>
522 | ... </movie>'''
523 | >>> nsmap = {'imdb': 'http://imdb.com/ns/'} # use arbitrary prefix for default namespace
524 |
525 | >>> Prefix(xpath='//imdb:movie', namespaces=nsmap, children=[ # pass namespaces to the outermost parser
526 | ... String(name='title', xpath='imdb:title', count=1),
527 | ... String(name='year', xpath='imdb:year', count=1)
528 | ... ]).parse(content)
529 | {'title': 'The Shawshank Redemption', 'year': '1994'}
530 |
531 |
532 | ====================
533 | HTML vs. XML parsing
534 | ====================
535 |
536 | To extract data from an HTML or XML document, simply call the ``parse()`` method of the parser:
537 |
538 | .. code-block:: python
539 |
540 | >>> from xextract import *
541 | >>> parser = Prefix(..., children=[...])
542 | >>> extracted_data = parser.parse(content)
543 |
544 |
545 | ``content`` can be either a byte string or a unicode string containing the document.
546 |
547 | Under the hood **xextract** uses either ``lxml.etree.XMLParser`` or ``lxml.etree.HTMLParser`` to parse the document.
548 | To select the parser, **xextract** looks for the ``<?xml`` declaration at the beginning of the content: if it is found, ``XMLParser`` is used, otherwise ``HTMLParser``.
549 |
550 | To force one parser or the other, call ``parse_xml()`` or ``parse_html()`` instead of ``parse()``:
551 |
552 | .. code-block:: python
553 |
554 | >>> parser.parse_html(content) # force lxml.etree.HTMLParser
555 | >>> parser.parse_xml(content) # force lxml.etree.XMLParser
556 |
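A minimal sketch of when forcing matters, assuming ``parse_xml()`` is available on any parser just like ``parse()``: the snippet below has no ``<?xml`` declaration, so plain ``parse()`` would fall back to the HTML parser, while ``parse_xml()`` makes the XML parser explicit. The markup is made up for illustration.

.. code-block:: python

>>> from xextract import String
>>> xml_content = '<movies><movie year="1994">The Shawshank Redemption</movie></movies>'
>>> String(xpath='//movie', count=1, attr='year').parse_xml(xml_content)
'1994'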
--------------------------------------------------------------------------------