├── .gitignore
├── README
├── readability
│   ├── BeautifulSoup.py
│   ├── __init__.py
│   ├── page_parser.py
│   ├── readability.py
│   └── url_helpers.py
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 |
--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
1 | This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0
2 |
3 | This is a Python port of a Ruby port of arc90's readability project
4 |
5 | http://lab.arc90.com/experiments/readability/
6 |
7 | Given an HTML document, it pulls out the main body text and cleans it up.
8 |
9 | Ruby port by starrhorne and iterationlabs
10 | Python port by gfxmonk
11 |
12 | This port uses BeautifulSoup for the HTML parsing. That means it can be
13 | a little slow, but it will work on Google App Engine (unlike libxml-based
14 | libraries).
15 |
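For a concrete idea of what "pulls out the main body text" means, here is a
minimal, illustrative sketch of the arc90-style idea (score candidate blocks
by how much text they hold and keep the best one), written against the bundled
BeautifulSoup module. It is not this package's actual entry point or algorithm;
the real logic lives in readability/readability.py and does considerably more
(tag/class weighting, link density, cleanup):

    # Illustrative sketch only -- not the API of this package.
    from readability.BeautifulSoup import BeautifulSoup

    def naive_main_text(html):
        soup = BeautifulSoup(html)
        best, best_len = None, 0
        for candidate in soup.findAll(['div', 'td']):
            # crude score: total amount of text contained in the block
            text = ''.join(candidate.findAll(text=True))
            if len(text) > best_len:
                best, best_len = candidate, len(text)
        return best

    # naive_main_text(open('page.html').read()) returns the text-heaviest block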
16 |
17 | **note**: I don't currently have any plans for using or improving this
18 | library, and it's far from perfect (slow, and almost certainly buggy).
19 | So if you do something cool with it or have a better tool that does
20 | the same job, please let me know and I can link to it from here.
21 |
22 | If you're looking for alternatives / forks, here's the list so far:
23 | - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
24 | - https://github.com/buriy/python-readability
25 |
--------------------------------------------------------------------------------
/readability/BeautifulSoup.py:
--------------------------------------------------------------------------------
1 | """Beautiful Soup
2 | Elixir and Tonic
3 | "The Screen-Scraper's Friend"
4 | http://www.crummy.com/software/BeautifulSoup/
5 |
6 | Beautiful Soup parses a (possibly invalid) XML or HTML document into a
7 | tree representation. It provides methods and Pythonic idioms that make
8 | it easy to navigate, search, and modify the tree.
9 |
10 | A well-formed XML/HTML document yields a well-formed data
11 | structure. An ill-formed XML/HTML document yields a correspondingly
12 | ill-formed data structure. If your document is only locally
13 | well-formed, you can use this library to find and process the
14 | well-formed part of it.
15 |
16 | Beautiful Soup works with Python 2.2 and up. It has no external
17 | dependencies, but you'll have more success at converting data to UTF-8
18 | if you also install these three packages:
19 |
20 | * chardet, for auto-detecting character encodings
21 | http://chardet.feedparser.org/
22 | * cjkcodecs and iconv_codec, which add more encodings to the ones supported
23 | by stock Python.
24 | http://cjkpython.i18n.org/
25 |
26 | Beautiful Soup defines classes for two main parsing strategies:
27 |
28 | * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
29 | language that kind of looks like XML.
30 |
31 | * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
32 | or invalid. This class has web browser-like heuristics for
33 | obtaining a sensible parse tree in the face of common HTML errors.
34 |
35 | Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
36 | the encoding of an HTML or XML document, and converting it to
37 | Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
38 |
39 | For more than you ever wanted to know about Beautiful Soup, see the
40 | documentation:
41 | http://www.crummy.com/software/BeautifulSoup/documentation.html
42 |
43 | Here, have some legalese:
44 |
45 | Copyright (c) 2004-2009, Leonard Richardson
46 |
47 | All rights reserved.
48 |
49 | Redistribution and use in source and binary forms, with or without
50 | modification, are permitted provided that the following conditions are
51 | met:
52 |
53 | * Redistributions of source code must retain the above copyright
54 | notice, this list of conditions and the following disclaimer.
55 |
56 | * Redistributions in binary form must reproduce the above
57 | copyright notice, this list of conditions and the following
58 | disclaimer in the documentation and/or other materials provided
59 | with the distribution.
60 |
61 |   * Neither the name of the Beautiful Soup Consortium and All
62 | Night Kosher Bakery nor the names of its contributors may be
63 | used to endorse or promote products derived from this software
64 | without specific prior written permission.
65 |
66 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
67 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
68 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
69 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
70 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
71 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
72 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
73 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
74 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
75 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
76 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
77 |
78 | """
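A minimal sketch of the parse / navigate / search / modify cycle the docstring
above describes. The HTML input is invented for illustration, and the import
path assumes this file is importable as a module named BeautifulSoup:

    # Usage sketch (Python 2).
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<html><body><p id="a">One</p><p>Two</p></body></html>')
    first = soup.find('p', {'id': 'a'})
    print first.string                 # search   -> One
    first.extract()                    # modify   -> remove it from the tree
    print soup.body.renderContents()   # navigate -> <p>Two</p>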
79 | from __future__ import generators
80 |
81 | __author__ = "Leonard Richardson (leonardr@segfault.org)"
82 | __version__ = "3.1.0.1"
83 | __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
84 | __license__ = "New-style BSD"
85 |
86 | import codecs
87 | import markupbase
88 | import types
89 | import re
90 | from HTMLParser import HTMLParser, HTMLParseError
91 | try:
92 | from htmlentitydefs import name2codepoint
93 | except ImportError:
94 | name2codepoint = {}
95 | try:
96 | set
97 | except NameError:
98 | from sets import Set as set
99 |
100 | #These hacks make Beautiful Soup able to parse XML with namespaces
101 | markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
102 |
103 | DEFAULT_OUTPUT_ENCODING = "utf-8"
104 |
105 | # First, the classes that represent markup elements.
106 |
107 | def sob(unicode, encoding):
108 | """Returns either the given Unicode string or its encoding."""
109 | if encoding is None:
110 | return unicode
111 | else:
112 | return unicode.encode(encoding)
113 |
114 | class PageElement:
115 | """Contains the navigational information for some part of the page
116 | (either a tag or a piece of text)"""
117 |
118 | def setup(self, parent=None, previous=None):
119 | """Sets up the initial relations between this element and
120 | other elements."""
121 | self.parent = parent
122 | self.previous = previous
123 | self.next = None
124 | self.previousSibling = None
125 | self.nextSibling = None
126 | if self.parent and self.parent.contents:
127 | self.previousSibling = self.parent.contents[-1]
128 | self.previousSibling.nextSibling = self
129 |
130 | def replaceWith(self, replaceWith):
131 | oldParent = self.parent
132 | myIndex = self.parent.contents.index(self)
133 | if hasattr(replaceWith, 'parent') and replaceWith.parent == self.parent:
134 | # We're replacing this element with one of its siblings.
135 | index = self.parent.contents.index(replaceWith)
136 | if index and index < myIndex:
137 | # Furthermore, it comes before this element. That
138 | # means that when we extract it, the index of this
139 | # element will change.
140 | myIndex = myIndex - 1
141 | self.extract()
142 | oldParent.insert(myIndex, replaceWith)
143 |
144 | def extract(self):
145 | """Destructively rips this element out of the tree."""
146 | if self.parent:
147 | try:
148 | self.parent.contents.remove(self)
149 | except ValueError:
150 | pass
151 |
152 | #Find the two elements that would be next to each other if
153 | #this element (and any children) hadn't been parsed. Connect
154 | #the two.
155 | lastChild = self._lastRecursiveChild()
156 | nextElement = lastChild.next
157 |
158 | if self.previous:
159 | self.previous.next = nextElement
160 | if nextElement:
161 | nextElement.previous = self.previous
162 | self.previous = None
163 | lastChild.next = None
164 |
165 | self.parent = None
166 | if self.previousSibling:
167 | self.previousSibling.nextSibling = self.nextSibling
168 | if self.nextSibling:
169 | self.nextSibling.previousSibling = self.previousSibling
170 | self.previousSibling = self.nextSibling = None
171 | return self
172 |
173 | def _lastRecursiveChild(self):
174 | "Finds the last element beneath this object to be parsed."
175 | lastChild = self
176 | while hasattr(lastChild, 'contents') and lastChild.contents:
177 | lastChild = lastChild.contents[-1]
178 | return lastChild
179 |
180 | def insert(self, position, newChild):
181 | if (isinstance(newChild, basestring)
182 | or isinstance(newChild, unicode)) \
183 | and not isinstance(newChild, NavigableString):
184 | newChild = NavigableString(newChild)
185 |
186 | position = min(position, len(self.contents))
187 | if hasattr(newChild, 'parent') and newChild.parent != None:
188 | # We're 'inserting' an element that's already one
189 | # of this object's children.
190 | if newChild.parent == self:
191 | index = self.find(newChild)
192 | if index and index < position:
193 | # Furthermore we're moving it further down the
194 | # list of this object's children. That means that
195 | # when we extract this element, our target index
196 | # will jump down one.
197 | position = position - 1
198 | newChild.extract()
199 |
200 | newChild.parent = self
201 | previousChild = None
202 | if position == 0:
203 | newChild.previousSibling = None
204 | newChild.previous = self
205 | else:
206 | previousChild = self.contents[position-1]
207 | newChild.previousSibling = previousChild
208 | newChild.previousSibling.nextSibling = newChild
209 | newChild.previous = previousChild._lastRecursiveChild()
210 | if newChild.previous:
211 | newChild.previous.next = newChild
212 |
213 | newChildsLastElement = newChild._lastRecursiveChild()
214 |
215 | if position >= len(self.contents):
216 | newChild.nextSibling = None
217 |
218 | parent = self
219 | parentsNextSibling = None
220 | while not parentsNextSibling:
221 | parentsNextSibling = parent.nextSibling
222 | parent = parent.parent
223 | if not parent: # This is the last element in the document.
224 | break
225 | if parentsNextSibling:
226 | newChildsLastElement.next = parentsNextSibling
227 | else:
228 | newChildsLastElement.next = None
229 | else:
230 | nextChild = self.contents[position]
231 | newChild.nextSibling = nextChild
232 | if newChild.nextSibling:
233 | newChild.nextSibling.previousSibling = newChild
234 | newChildsLastElement.next = nextChild
235 |
236 | if newChildsLastElement.next:
237 | newChildsLastElement.next.previous = newChildsLastElement
238 | self.contents.insert(position, newChild)
239 |
240 | def append(self, tag):
241 | """Appends the given tag to the contents of this tag."""
242 | self.insert(len(self.contents), tag)
243 |
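The methods above (insert, append, extract, replaceWith) are the tree-mutation
primitives. A small sketch, on an invented document, of how they combine:

    # Mutation sketch (Python 2, invented input).
    from BeautifulSoup import BeautifulSoup, Tag, NavigableString

    soup = BeautifulSoup('<body><b>old</b></body>')
    new_tag = Tag(soup, 'i')                  # a fresh, unattached tag
    new_tag.append(NavigableString('new'))    # append() -> insert at the end
    soup.b.replaceWith(new_tag)               # extract <b>, put <i> in its place
    print soup.body.renderContents()          # -> <i>new</i>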
244 | def findNext(self, name=None, attrs={}, text=None, **kwargs):
245 | """Returns the first item that matches the given criteria and
246 | appears after this Tag in the document."""
247 | return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
248 |
249 | def findAllNext(self, name=None, attrs={}, text=None, limit=None,
250 | **kwargs):
251 | """Returns all items that match the given criteria and appear
252 | after this Tag in the document."""
253 | return self._findAll(name, attrs, text, limit, self.nextGenerator,
254 | **kwargs)
255 |
256 | def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
257 | """Returns the closest sibling to this Tag that matches the
258 | given criteria and appears after this Tag in the document."""
259 | return self._findOne(self.findNextSiblings, name, attrs, text,
260 | **kwargs)
261 |
262 | def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
263 | **kwargs):
264 | """Returns the siblings of this Tag that match the given
265 | criteria and appear after this Tag in the document."""
266 | return self._findAll(name, attrs, text, limit,
267 | self.nextSiblingGenerator, **kwargs)
268 | fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
269 |
270 | def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
271 | """Returns the first item that matches the given criteria and
272 | appears before this Tag in the document."""
273 | return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
274 |
275 | def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
276 | **kwargs):
277 | """Returns all items that match the given criteria and appear
278 | before this Tag in the document."""
279 | return self._findAll(name, attrs, text, limit, self.previousGenerator,
280 | **kwargs)
281 | fetchPrevious = findAllPrevious # Compatibility with pre-3.x
282 |
283 | def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
284 | """Returns the closest sibling to this Tag that matches the
285 | given criteria and appears before this Tag in the document."""
286 | return self._findOne(self.findPreviousSiblings, name, attrs, text,
287 | **kwargs)
288 |
289 | def findPreviousSiblings(self, name=None, attrs={}, text=None,
290 | limit=None, **kwargs):
291 | """Returns the siblings of this Tag that match the given
292 | criteria and appear before this Tag in the document."""
293 | return self._findAll(name, attrs, text, limit,
294 | self.previousSiblingGenerator, **kwargs)
295 | fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
296 |
297 | def findParent(self, name=None, attrs={}, **kwargs):
298 | """Returns the closest parent of this Tag that matches the given
299 | criteria."""
300 | # NOTE: We can't use _findOne because findParents takes a different
301 | # set of arguments.
302 | r = None
303 | l = self.findParents(name, attrs, 1)
304 | if l:
305 | r = l[0]
306 | return r
307 |
308 | def findParents(self, name=None, attrs={}, limit=None, **kwargs):
309 | """Returns the parents of this Tag that match the given
310 | criteria."""
311 |
312 | return self._findAll(name, attrs, None, limit, self.parentGenerator,
313 | **kwargs)
314 | fetchParents = findParents # Compatibility with pre-3.x
315 |
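The find*/findAll* family above searches in a direction (forward, backward,
siblings, parents) from a starting element rather than inside it. A short
sketch with an invented fragment:

    # Directional search sketch (Python 2).
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<div><h2>Title</h2><p>First</p><p>Second</p></div>')
    h2 = soup.find('h2')
    print h2.findNext('p').string                             # -> First
    print h2.findNextSibling('p').string                      # -> First
    print soup.find(text='Second').findPrevious('h2').string  # -> Title
    print h2.findParent('div').name                           # -> div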
316 | #These methods do the real heavy lifting.
317 |
318 | def _findOne(self, method, name, attrs, text, **kwargs):
319 | r = None
320 | l = method(name, attrs, text, 1, **kwargs)
321 | if l:
322 | r = l[0]
323 | return r
324 |
325 | def _findAll(self, name, attrs, text, limit, generator, **kwargs):
326 | "Iterates over a generator looking for things that match."
327 |
328 | if isinstance(name, SoupStrainer):
329 | strainer = name
330 | else:
331 | # Build a SoupStrainer
332 | strainer = SoupStrainer(name, attrs, text, **kwargs)
333 | results = ResultSet(strainer)
334 | g = generator()
335 | while True:
336 | try:
337 | i = g.next()
338 | except StopIteration:
339 | break
340 | if i:
341 | found = strainer.search(i)
342 | if found:
343 | results.append(found)
344 | if limit and len(results) >= limit:
345 | break
346 | return results
347 |
348 | #These Generators can be used to navigate starting from both
349 | #NavigableStrings and Tags.
350 | def nextGenerator(self):
351 | i = self
352 | while i:
353 | i = i.next
354 | yield i
355 |
356 | def nextSiblingGenerator(self):
357 | i = self
358 | while i:
359 | i = i.nextSibling
360 | yield i
361 |
362 | def previousGenerator(self):
363 | i = self
364 | while i:
365 | i = i.previous
366 | yield i
367 |
368 | def previousSiblingGenerator(self):
369 | i = self
370 | while i:
371 | i = i.previousSibling
372 | yield i
373 |
374 | def parentGenerator(self):
375 | i = self
376 | while i:
377 | i = i.parent
378 | yield i
379 |
380 | # Utility methods
381 | def substituteEncoding(self, str, encoding=None):
382 | encoding = encoding or "utf-8"
383 | return str.replace("%SOUP-ENCODING%", encoding)
384 |
385 | def toEncoding(self, s, encoding=None):
386 | """Encodes an object to a string in some encoding, or to Unicode.
387 |         """
388 | if isinstance(s, unicode):
389 | if encoding:
390 | s = s.encode(encoding)
391 | elif isinstance(s, str):
392 | if encoding:
393 | s = s.encode(encoding)
394 | else:
395 | s = unicode(s)
396 | else:
397 | if encoding:
398 | s = self.toEncoding(str(s), encoding)
399 | else:
400 | s = unicode(s)
401 | return s
402 |
403 | class NavigableString(unicode, PageElement):
404 |
405 | def __new__(cls, value):
406 | """Create a new NavigableString.
407 |
408 | When unpickling a NavigableString, this method is called with
409 | the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
410 | passed in to the superclass's __new__ or the superclass won't know
411 | how to handle non-ASCII characters.
412 | """
413 | if isinstance(value, unicode):
414 | return unicode.__new__(cls, value)
415 | return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
416 |
417 | def __getnewargs__(self):
418 | return (unicode(self),)
419 |
420 | def __getattr__(self, attr):
421 | """text.string gives you text. This is for backwards
422 | compatibility for Navigable*String, but for CData* it lets you
423 | get the string without the CData wrapper."""
424 | if attr == 'string':
425 | return self
426 | else:
427 | raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
428 |
429 | def encode(self, encoding=DEFAULT_OUTPUT_ENCODING):
430 | return self.decode().encode(encoding)
431 |
432 | def decodeGivenEventualEncoding(self, eventualEncoding):
433 | return self
434 |
435 | class CData(NavigableString):
436 |
437 | def decodeGivenEventualEncoding(self, eventualEncoding):
438 |         return u'<![CDATA[' + self + u']]>'
439 |
440 | class ProcessingInstruction(NavigableString):
441 |
442 | def decodeGivenEventualEncoding(self, eventualEncoding):
443 | output = self
444 | if u'%SOUP-ENCODING%' in output:
445 | output = self.substituteEncoding(output, eventualEncoding)
446 |         return u'<?' + output + u'?>'
447 |
448 | class Comment(NavigableString):
449 | def decodeGivenEventualEncoding(self, eventualEncoding):
450 |         return u'<!--' + self + u'-->'
451 |
452 | class Declaration(NavigableString):
453 | def decodeGivenEventualEncoding(self, eventualEncoding):
454 |         return u'<!' + self + u'>'
455 |
456 | class Tag(PageElement):
457 |
458 | """Represents a found HTML tag with its attributes and contents."""
459 |
460 | def _invert(h):
461 | "Cheap function to invert a hash."
462 | i = {}
463 | for k,v in h.items():
464 | i[v] = k
465 | return i
466 |
467 | XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
468 | "quot" : '"',
469 | "amp" : "&",
470 | "lt" : "<",
471 | "gt" : ">" }
472 |
473 | XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
474 |
475 | def _convertEntities(self, match):
476 | """Used in a call to re.sub to replace HTML, XML, and numeric
477 | entities with the appropriate Unicode characters. If HTML
478 | entities are being converted, any unrecognized entities are
479 | escaped."""
480 | x = match.group(1)
481 | if self.convertHTMLEntities and x in name2codepoint:
482 | return unichr(name2codepoint[x])
483 | elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
484 | if self.convertXMLEntities:
485 | return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
486 | else:
487 | return u'&%s;' % x
488 | elif len(x) > 0 and x[0] == '#':
489 | # Handle numeric entities
490 | if len(x) > 1 and x[1] == 'x':
491 | return unichr(int(x[2:], 16))
492 | else:
493 | return unichr(int(x[1:]))
494 |
495 | elif self.escapeUnrecognizedEntities:
496 | return u'&%s;' % x
497 | else:
498 | return u'&%s;' % x
499 |
500 | def __init__(self, parser, name, attrs=None, parent=None,
501 | previous=None):
502 | "Basic constructor."
503 |
504 | # We don't actually store the parser object: that lets extracted
505 | # chunks be garbage-collected
506 | self.parserClass = parser.__class__
507 | self.isSelfClosing = parser.isSelfClosingTag(name)
508 | self.name = name
509 | if attrs == None:
510 | attrs = []
511 | self.attrs = attrs
512 | self.contents = []
513 | self.setup(parent, previous)
514 | self.hidden = False
515 | self.containsSubstitutions = False
516 | self.convertHTMLEntities = parser.convertHTMLEntities
517 | self.convertXMLEntities = parser.convertXMLEntities
518 | self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
519 |
520 | def convert(kval):
521 | "Converts HTML, XML and numeric entities in the attribute value."
522 | k, val = kval
523 | if val is None:
524 | return kval
525 | return (k, re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
526 | self._convertEntities, val))
527 | self.attrs = map(convert, self.attrs)
528 |
529 | def get(self, key, default=None):
530 | """Returns the value of the 'key' attribute for the tag, or
531 | the value given for 'default' if it doesn't have that
532 | attribute."""
533 | return self._getAttrMap().get(key, default)
534 |
535 | def has_key(self, key):
536 | return self._getAttrMap().has_key(key)
537 |
538 | def __getitem__(self, key):
539 | """tag[key] returns the value of the 'key' attribute for the tag,
540 | and throws an exception if it's not there."""
541 | return self._getAttrMap()[key]
542 |
543 | def __iter__(self):
544 | "Iterating over a tag iterates over its contents."
545 | return iter(self.contents)
546 |
547 | def __len__(self):
548 | "The length of a tag is the length of its list of contents."
549 | return len(self.contents)
550 |
551 | def __contains__(self, x):
552 | return x in self.contents
553 |
554 | def __nonzero__(self):
555 | "A tag is non-None even if it has no contents."
556 | return True
557 |
558 | def __setitem__(self, key, value):
559 | """Setting tag[key] sets the value of the 'key' attribute for the
560 | tag."""
561 | self._getAttrMap()
562 | self.attrMap[key] = value
563 | found = False
564 | for i in range(0, len(self.attrs)):
565 | if self.attrs[i][0] == key:
566 | self.attrs[i] = (key, value)
567 | found = True
568 | if not found:
569 | self.attrs.append((key, value))
570 | self._getAttrMap()[key] = value
571 |
572 | def __delitem__(self, key):
573 | "Deleting tag[key] deletes all 'key' attributes for the tag."
574 | for item in self.attrs:
575 | if item[0] == key:
576 | self.attrs.remove(item)
577 | #We don't break because bad HTML can define the same
578 | #attribute multiple times.
579 | self._getAttrMap()
580 | if self.attrMap.has_key(key):
581 | del self.attrMap[key]
582 |
583 | def __call__(self, *args, **kwargs):
584 | """Calling a tag like a function is the same as calling its
585 | findAll() method. Eg. tag('a') returns a list of all the A tags
586 | found within this tag."""
587 | return apply(self.findAll, args, kwargs)
588 |
589 | def __getattr__(self, tag):
590 | #print "Getattr %s.%s" % (self.__class__, tag)
591 | if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
592 | return self.find(tag[:-3])
593 | elif tag.find('__') != 0:
594 | return self.find(tag)
595 | raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)
596 |
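Both shorthands defined above in action: calling a tag invokes findAll(), and
attribute-style access (optionally with a 'Tag' suffix) is a find() by tag
name. Invented input:

    # Shorthand sketch (Python 2).
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<body><a href="/x">x</a><a href="/y">y</a></body>')
    print [a['href'] for a in soup('a')]   # same as soup.findAll('a') -> ['/x', '/y']
    print soup.bodyTag.aTag['href']        # first <body>, then its first <a> -> /x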
597 | def __eq__(self, other):
598 | """Returns true iff this tag has the same name, the same attributes,
599 | and the same contents (recursively) as the given tag.
600 |
601 | NOTE: right now this will return false if two tags have the
602 | same attributes in a different order. Should this be fixed?"""
603 | if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
604 | return False
605 | for i in range(0, len(self.contents)):
606 | if self.contents[i] != other.contents[i]:
607 | return False
608 | return True
609 |
610 | def __ne__(self, other):
611 | """Returns true iff this tag is not identical to the other tag,
612 | as defined in __eq__."""
613 | return not self == other
614 |
615 | def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
616 | """Renders this tag as a string."""
617 | return self.decode(eventualEncoding=encoding)
618 |
619 | BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
620 | + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
621 | + ")")
622 |
623 | def _sub_entity(self, x):
624 | """Used with a regular expression to substitute the
625 | appropriate XML entity for an XML special character."""
626 | return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
627 |
628 | def __unicode__(self):
629 | return self.decode()
630 |
631 | def __str__(self):
632 | return self.encode()
633 |
634 | def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
635 | prettyPrint=False, indentLevel=0):
636 | return self.decode(prettyPrint, indentLevel, encoding).encode(encoding)
637 |
638 | def decode(self, prettyPrint=False, indentLevel=0,
639 | eventualEncoding=DEFAULT_OUTPUT_ENCODING):
640 | """Returns a string or Unicode representation of this tag and
641 | its contents. To get Unicode, pass None for encoding."""
642 |
643 | attrs = []
644 | if self.attrs:
645 | for key, val in self.attrs:
646 | fmt = '%s="%s"'
647 | if isString(val):
648 | if (self.containsSubstitutions
649 | and eventualEncoding is not None
650 | and '%SOUP-ENCODING%' in val):
651 | val = self.substituteEncoding(val, eventualEncoding)
652 |
653 | # The attribute value either:
654 | #
655 | # * Contains no embedded double quotes or single quotes.
656 | # No problem: we enclose it in double quotes.
657 | # * Contains embedded single quotes. No problem:
658 | # double quotes work here too.
659 | # * Contains embedded double quotes. No problem:
660 | # we enclose it in single quotes.
661 | # * Embeds both single _and_ double quotes. This
662 | # can't happen naturally, but it can happen if
663 | # you modify an attribute value after parsing
664 | # the document. Now we have a bit of a
665 | # problem. We solve it by enclosing the
666 | # attribute in single quotes, and escaping any
667 | # embedded single quotes to XML entities.
668 | if '"' in val:
669 | fmt = "%s='%s'"
670 | if "'" in val:
671 | # TODO: replace with apos when
672 | # appropriate.
673 | val = val.replace("'", "&squot;")
674 |
675 | # Now we're okay w/r/t quotes. But the attribute
676 | # value might also contain angle brackets, or
677 | # ampersands that aren't part of entities. We need
678 | # to escape those to XML entities too.
679 | val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
680 | if val is None:
681 | # Handle boolean attributes.
682 | decoded = key
683 | else:
684 | decoded = fmt % (key, val)
685 | attrs.append(decoded)
686 | close = ''
687 | closeTag = ''
688 | if self.isSelfClosing:
689 | close = ' /'
690 | else:
691 |             closeTag = '</%s>' % self.name
692 |
693 | indentTag, indentContents = 0, 0
694 | if prettyPrint:
695 | indentTag = indentLevel
696 | space = (' ' * (indentTag-1))
697 | indentContents = indentTag + 1
698 | contents = self.decodeContents(prettyPrint, indentContents,
699 | eventualEncoding)
700 | if self.hidden:
701 | s = contents
702 | else:
703 | s = []
704 | attributeString = ''
705 | if attrs:
706 | attributeString = ' ' + ' '.join(attrs)
707 | if prettyPrint:
708 | s.append(space)
709 | s.append('<%s%s%s>' % (self.name, attributeString, close))
710 | if prettyPrint:
711 | s.append("\n")
712 | s.append(contents)
713 | if prettyPrint and contents and contents[-1] != "\n":
714 | s.append("\n")
715 | if prettyPrint and closeTag:
716 | s.append(space)
717 | s.append(closeTag)
718 | if prettyPrint and closeTag and self.nextSibling:
719 | s.append("\n")
720 | s = ''.join(s)
721 | return s
722 |
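A sketch of the attribute re-quoting and escaping rules spelled out in the
comments above, using an invented document:

    # Output-escaping sketch (Python 2).
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<a href="/x">link</a>')
    soup.a['title'] = 'say "hi" & wave'    # value added after parsing
    print str(soup.a)
    # -> <a href="/x" title='say "hi" &amp; wave'>link</a>
    # (double quotes in the value force single-quoted output, and the bare
    #  ampersand is escaped to &amp;)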
723 | def decompose(self):
724 | """Recursively destroys the contents of this tree."""
725 | contents = [i for i in self.contents]
726 | for i in contents:
727 | if isinstance(i, Tag):
728 | i.decompose()
729 | else:
730 | i.extract()
731 | self.extract()
732 |
733 | def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
734 | return self.encode(encoding, True)
735 |
736 | def encodeContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
737 | prettyPrint=False, indentLevel=0):
738 | return self.decodeContents(prettyPrint, indentLevel).encode(encoding)
739 |
740 | def decodeContents(self, prettyPrint=False, indentLevel=0,
741 | eventualEncoding=DEFAULT_OUTPUT_ENCODING):
742 | """Renders the contents of this tag as a string in the given
743 |         encoding. If encoding is None, returns a Unicode string."""
744 | s=[]
745 | for c in self:
746 | text = None
747 | if isinstance(c, NavigableString):
748 | text = c.decodeGivenEventualEncoding(eventualEncoding)
749 | elif isinstance(c, Tag):
750 | s.append(c.decode(prettyPrint, indentLevel, eventualEncoding))
751 | if text and prettyPrint:
752 | text = text.strip()
753 | if text:
754 | if prettyPrint:
755 | s.append(" " * (indentLevel-1))
756 | s.append(text)
757 | if prettyPrint:
758 | s.append("\n")
759 | return ''.join(s)
760 |
761 | #Soup methods
762 |
763 | def find(self, name=None, attrs={}, recursive=True, text=None,
764 | **kwargs):
765 | """Return only the first child of this Tag matching the given
766 | criteria."""
767 | r = None
768 | l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
769 | if l:
770 | r = l[0]
771 | return r
772 | findChild = find
773 |
774 | def findAll(self, name=None, attrs={}, recursive=True, text=None,
775 | limit=None, **kwargs):
776 | """Extracts a list of Tag objects that match the given
777 | criteria. You can specify the name of the Tag and any
778 | attributes you want the Tag to have.
779 |
780 | The value of a key-value pair in the 'attrs' map can be a
781 | string, a list of strings, a regular expression object, or a
782 | callable that takes a string and returns whether or not the
783 | string matches for some custom definition of 'matches'. The
784 | same is true of the tag name."""
785 | generator = self.recursiveChildGenerator
786 | if not recursive:
787 | generator = self.childGenerator
788 | return self._findAll(name, attrs, text, limit, generator, **kwargs)
789 | findChildren = findAll
790 |
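The docstring above lists the kinds of matchers findAll() accepts for names
and attribute values (strings, lists, regular expressions, callables, and
keyword arguments). One sketch covering each, on an invented fragment:

    # Matching sketch (Python 2).
    import re
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<a class="ext" href="http://x">x</a>'
                         '<a class="int" href="/y">y</a><b>z</b>')
    print len(soup.findAll('a'))                           # name string    -> 2
    print len(soup.findAll(['a', 'b']))                    # list of names  -> 3
    print soup.findAll('a', {'class': 'ext'})[0]['href']   # attrs dict     -> http://x
    print soup.findAll(href=re.compile('^http:'))[0]['href']     # regex keyword
    print len(soup.findAll(lambda tag: len(tag.attrs) == 2))     # callable -> 2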
791 | # Pre-3.x compatibility methods. Will go away in 4.0.
792 | first = find
793 | fetch = findAll
794 |
795 | def fetchText(self, text=None, recursive=True, limit=None):
796 | return self.findAll(text=text, recursive=recursive, limit=limit)
797 |
798 | def firstText(self, text=None, recursive=True):
799 | return self.find(text=text, recursive=recursive)
800 |
801 | # 3.x compatibility methods. Will go away in 4.0.
802 | def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
803 | prettyPrint=False, indentLevel=0):
804 | if encoding is None:
805 | return self.decodeContents(prettyPrint, indentLevel, encoding)
806 | else:
807 | return self.encodeContents(encoding, prettyPrint, indentLevel)
808 |
809 |
810 | #Private methods
811 |
812 | def _getAttrMap(self):
813 | """Initializes a map representation of this tag's attributes,
814 | if not already initialized."""
815 | if not getattr(self, 'attrMap'):
816 | self.attrMap = {}
817 | for (key, value) in self.attrs:
818 | self.attrMap[key] = value
819 | return self.attrMap
820 |
821 | #Generator methods
822 | def recursiveChildGenerator(self):
823 | if not len(self.contents):
824 | raise StopIteration
825 | stopNode = self._lastRecursiveChild().next
826 | current = self.contents[0]
827 | while current is not stopNode:
828 | yield current
829 | current = current.next
830 |
831 | def childGenerator(self):
832 | if not len(self.contents):
833 | raise StopIteration
834 | current = self.contents[0]
835 | while current:
836 | yield current
837 | current = current.nextSibling
838 | raise StopIteration
839 |
840 | # Next, a couple classes to represent queries and their results.
841 | class SoupStrainer:
842 | """Encapsulates a number of ways of matching a markup element (tag or
843 | text)."""
844 |
845 | def __init__(self, name=None, attrs={}, text=None, **kwargs):
846 | self.name = name
847 | if isString(attrs):
848 | kwargs['class'] = attrs
849 | attrs = None
850 | if kwargs:
851 | if attrs:
852 | attrs = attrs.copy()
853 | attrs.update(kwargs)
854 | else:
855 | attrs = kwargs
856 | self.attrs = attrs
857 | self.text = text
858 |
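A SoupStrainer can be handed directly to findAll() in place of a name, or
passed as the parseOnlyThese argument of the soup constructor (used further
down in this file) so that non-matching top-level elements are never built at
all. Sketch with invented input:

    # SoupStrainer sketch (Python 2).
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    links = SoupStrainer('a', href=True)    # only <a> tags that carry an href
    soup = BeautifulSoup('<p>hi <a href="/x">x</a> <a name="y">y</a></p>',
                         parseOnlyThese=links)
    print [tag['href'] for tag in soup.findAll('a')]   # -> ['/x']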
859 | def __str__(self):
860 | if self.text:
861 | return self.text
862 | else:
863 | return "%s|%s" % (self.name, self.attrs)
864 |
865 | def searchTag(self, markupName=None, markupAttrs={}):
866 | found = None
867 | markup = None
868 | if isinstance(markupName, Tag):
869 | markup = markupName
870 | markupAttrs = markup
871 | callFunctionWithTagData = callable(self.name) \
872 | and not isinstance(markupName, Tag)
873 |
874 | if (not self.name) \
875 | or callFunctionWithTagData \
876 | or (markup and self._matches(markup, self.name)) \
877 | or (not markup and self._matches(markupName, self.name)):
878 | if callFunctionWithTagData:
879 | match = self.name(markupName, markupAttrs)
880 | else:
881 | match = True
882 | markupAttrMap = None
883 | for attr, matchAgainst in self.attrs.items():
884 | if not markupAttrMap:
885 | if hasattr(markupAttrs, 'get'):
886 | markupAttrMap = markupAttrs
887 | else:
888 | markupAttrMap = {}
889 | for k,v in markupAttrs:
890 | markupAttrMap[k] = v
891 | attrValue = markupAttrMap.get(attr)
892 | if not self._matches(attrValue, matchAgainst):
893 | match = False
894 | break
895 | if match:
896 | if markup:
897 | found = markup
898 | else:
899 | found = markupName
900 | return found
901 |
902 | def search(self, markup):
903 | #print 'looking for %s in %s' % (self, markup)
904 | found = None
905 | # If given a list of items, scan it for a text element that
906 | # matches.
907 | if isList(markup) and not isinstance(markup, Tag):
908 | for element in markup:
909 | if isinstance(element, NavigableString) \
910 | and self.search(element):
911 | found = element
912 | break
913 | # If it's a Tag, make sure its name or attributes match.
914 | # Don't bother with Tags if we're searching for text.
915 | elif isinstance(markup, Tag):
916 | if not self.text:
917 | found = self.searchTag(markup)
918 | # If it's text, make sure the text matches.
919 | elif isinstance(markup, NavigableString) or \
920 | isString(markup):
921 | if self._matches(markup, self.text):
922 | found = markup
923 | else:
924 | raise Exception, "I don't know how to match against a %s" \
925 | % markup.__class__
926 | return found
927 |
928 | def _matches(self, markup, matchAgainst):
929 | #print "Matching %s against %s" % (markup, matchAgainst)
930 | result = False
931 | if matchAgainst == True and type(matchAgainst) == types.BooleanType:
932 | result = markup != None
933 | elif callable(matchAgainst):
934 | result = matchAgainst(markup)
935 | else:
936 | #Custom match methods take the tag as an argument, but all
937 | #other ways of matching match the tag name as a string.
938 | if isinstance(markup, Tag):
939 | markup = markup.name
940 | if markup is not None and not isString(markup):
941 | markup = unicode(markup)
942 | #Now we know that chunk is either a string, or None.
943 | if hasattr(matchAgainst, 'match'):
944 | # It's a regexp object.
945 | result = markup and matchAgainst.search(markup)
946 | elif (isList(matchAgainst)
947 | and (markup is not None or not isString(matchAgainst))):
948 | result = markup in matchAgainst
949 | elif hasattr(matchAgainst, 'items'):
950 | result = markup.has_key(matchAgainst)
951 | elif matchAgainst and isString(markup):
952 | if isinstance(markup, unicode):
953 | matchAgainst = unicode(matchAgainst)
954 | else:
955 | matchAgainst = str(matchAgainst)
956 |
957 | if not result:
958 | result = matchAgainst == markup
959 | return result
960 |
961 | class ResultSet(list):
962 | """A ResultSet is just a list that keeps track of the SoupStrainer
963 | that created it."""
964 | def __init__(self, source):
965 | list.__init__([])
966 | self.source = source
967 |
968 | # Now, some helper functions.
969 |
970 | def isList(l):
971 | """Convenience method that works with all 2.x versions of Python
972 | to determine whether or not something is listlike."""
973 | return ((hasattr(l, '__iter__') and not isString(l))
974 | or (type(l) in (types.ListType, types.TupleType)))
975 |
976 | def isString(s):
977 | """Convenience method that works with all 2.x versions of Python
978 | to determine whether or not something is stringlike."""
979 | try:
980 | return isinstance(s, unicode) or isinstance(s, basestring)
981 | except NameError:
982 | return isinstance(s, str)
983 |
984 | def buildTagMap(default, *args):
985 | """Turns a list of maps, lists, or scalars into a single map.
986 | Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
987 | NESTING_RESET_TAGS maps out of lists and partial maps."""
988 | built = {}
989 | for portion in args:
990 | if hasattr(portion, 'items'):
991 | #It's a map. Merge it.
992 | for k,v in portion.items():
993 | built[k] = v
994 | elif isList(portion) and not isString(portion):
995 | #It's a list. Map each item to the default.
996 | for k in portion:
997 | built[k] = default
998 | else:
999 | #It's a scalar. Map it to the default.
1000 | built[portion] = default
1001 | return built
1002 |
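A quick sketch of what buildTagMap() produces from a mix of lists and maps,
assuming the helper is imported from this module:

    # buildTagMap sketch (Python 2).
    from BeautifulSoup import buildTagMap

    print buildTagMap(None, ['br', 'hr'], {'p': ['p']})
    # -> {'br': None, 'hr': None, 'p': ['p']}  (dict ordering may vary)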
1003 | # Now, the parser classes.
1004 |
1005 | class HTMLParserBuilder(HTMLParser):
1006 |
1007 | def __init__(self, soup):
1008 | HTMLParser.__init__(self)
1009 | self.soup = soup
1010 |
1011 | # We inherit feed() and reset().
1012 |
1013 | def handle_starttag(self, name, attrs):
1014 | if name == 'meta':
1015 | self.soup.extractCharsetFromMeta(attrs)
1016 | else:
1017 | self.soup.unknown_starttag(name, attrs)
1018 |
1019 | def handle_endtag(self, name):
1020 | self.soup.unknown_endtag(name)
1021 |
1022 | def handle_data(self, content):
1023 | self.soup.handle_data(content)
1024 |
1025 | def _toStringSubclass(self, text, subclass):
1026 | """Adds a certain piece of text to the tree as a NavigableString
1027 | subclass."""
1028 | self.soup.endData()
1029 | self.handle_data(text)
1030 | self.soup.endData(subclass)
1031 |
1032 | def handle_pi(self, text):
1033 | """Handle a processing instruction as a ProcessingInstruction
1034 | object, possibly one with a %SOUP-ENCODING% slot into which an
1035 | encoding will be plugged later."""
1036 | if text[:3] == "xml":
1037 | text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
1038 | self._toStringSubclass(text, ProcessingInstruction)
1039 |
1040 | def handle_comment(self, text):
1041 | "Handle comments as Comment objects."
1042 | self._toStringSubclass(text, Comment)
1043 |
1044 | def handle_charref(self, ref):
1045 | "Handle character references as data."
1046 | if self.soup.convertEntities:
1047 | data = unichr(int(ref))
1048 | else:
1049 | data = '%s;' % ref
1050 | self.handle_data(data)
1051 |
1052 | def handle_entityref(self, ref):
1053 | """Handle entity references as data, possibly converting known
1054 | HTML and/or XML entity references to the corresponding Unicode
1055 | characters."""
1056 | data = None
1057 | if self.soup.convertHTMLEntities:
1058 | try:
1059 | data = unichr(name2codepoint[ref])
1060 | except KeyError:
1061 | pass
1062 |
1063 | if not data and self.soup.convertXMLEntities:
1064 | data = self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
1065 |
1066 | if not data and self.soup.convertHTMLEntities and \
1067 | not self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
1068 | # TODO: We've got a problem here. We're told this is
1069 | # an entity reference, but it's not an XML entity
1070 | # reference or an HTML entity reference. Nonetheless,
1071 | # the logical thing to do is to pass it through as an
1072 | # unrecognized entity reference.
1073 | #
1074 | # Except: when the input is "&carol;" this function
1075 | # will be called with input "carol". When the input is
1076 | # "AT&T", this function will be called with input
1077 | # "T". We have no way of knowing whether a semicolon
1078 | # was present originally, so we don't know whether
1079 | # this is an unknown entity or just a misplaced
1080 | # ampersand.
1081 | #
1082 | # The more common case is a misplaced ampersand, so I
1083 | # escape the ampersand and omit the trailing semicolon.
1084 | data = "&%s" % ref
1085 | if not data:
1086 | # This case is different from the one above, because we
1087 | # haven't already gone through a supposedly comprehensive
1088 | # mapping of entities to Unicode characters. We might not
1089 | # have gone through any mapping at all. So the chances are
1090 | # very high that this is a real entity, and not a
1091 | # misplaced ampersand.
1092 | data = "&%s;" % ref
1093 | self.handle_data(data)
1094 |
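How the entity handling above plays out for a caller depends on the
convertEntities setting; HTML_ENTITIES is a class constant on
BeautifulStoneSoup that the constructor compares against further down in this
file. Invented input, with repr() used to sidestep console encoding:

    # Entity-conversion sketch (Python 2).
    from BeautifulSoup import BeautifulSoup

    raw = '<p>Caf&eacute; &amp; bar</p>'
    print repr(BeautifulSoup(raw).p.string)
    # -> u'Caf&eacute; &amp; bar'   (entities passed through as data)
    converted = BeautifulSoup(raw, convertEntities=BeautifulSoup.HTML_ENTITIES)
    print repr(converted.p.string)
    # -> u'Caf\xe9 & bar'           (known entities become Unicode characters)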
1095 | def handle_decl(self, data):
1096 | "Handle DOCTYPEs and the like as Declaration objects."
1097 | self._toStringSubclass(data, Declaration)
1098 |
1099 | def parse_declaration(self, i):
1100 | """Treat a bogus SGML declaration as raw data. Treat a CDATA
1101 | declaration as a CData object."""
1102 | j = None
1103 |         if self.rawdata[i:i+9] == '<![CDATA[':
1104 |             k = self.rawdata.find(']]>', i)
1105 | if k == -1:
1106 | k = len(self.rawdata)
1107 | data = self.rawdata[i+9:k]
1108 | j = k+3
1109 | self._toStringSubclass(data, CData)
1110 | else:
1111 | try:
1112 | j = HTMLParser.parse_declaration(self, i)
1113 | except HTMLParseError:
1114 | toHandle = self.rawdata[i:]
1115 | self.handle_data(toHandle)
1116 | j = i + len(toHandle)
1117 | return j
1118 |
1119 |
1120 | class BeautifulStoneSoup(Tag):
1121 |
1122 | """This class contains the basic parser and search code. It defines
1123 | a parser that knows nothing about tag behavior except for the
1124 | following:
1125 |
1126 | You can't close a tag without closing all the tags it encloses.
1127 |     That is, "<foo><bar></foo>" actually means "<foo><bar></bar></foo>".
1186 |         <br/> (No space between name of closing tag and tag close)
1187 |         <! --Comment--> (Extraneous whitespace in declaration)
1188 |
1189 | You can pass in a custom list of (RE object, replace method)
1190 | tuples to get Beautiful Soup to scrub your input the way you
1191 | want."""
1192 |
1193 | self.parseOnlyThese = parseOnlyThese
1194 | self.fromEncoding = fromEncoding
1195 | self.smartQuotesTo = smartQuotesTo
1196 | self.convertEntities = convertEntities
1197 | # Set the rules for how we'll deal with the entities we
1198 | # encounter
1199 | if self.convertEntities:
1200 | # It doesn't make sense to convert encoded characters to
1201 | # entities even while you're converting entities to Unicode.
1202 | # Just convert it all to Unicode.
1203 | self.smartQuotesTo = None
1204 | if convertEntities == self.HTML_ENTITIES:
1205 | self.convertXMLEntities = False
1206 | self.convertHTMLEntities = True
1207 | self.escapeUnrecognizedEntities = True
1208 | elif convertEntities == self.XHTML_ENTITIES:
1209 | self.convertXMLEntities = True
1210 | self.convertHTMLEntities = True
1211 | self.escapeUnrecognizedEntities = False
1212 | elif convertEntities == self.XML_ENTITIES:
1213 | self.convertXMLEntities = True
1214 | self.convertHTMLEntities = False
1215 | self.escapeUnrecognizedEntities = False
1216 | else:
1217 | self.convertXMLEntities = False
1218 | self.convertHTMLEntities = False
1219 | self.escapeUnrecognizedEntities = False
1220 |
1221 | self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
1222 | self.builder = builder(self)
1223 | self.reset()
1224 |
1225 | if hasattr(markup, 'read'): # It's a file-type object.
1226 | markup = markup.read()
1227 | self.markup = markup
1228 | self.markupMassage = markupMassage
1229 | try:
1230 | self._feed(isHTML=isHTML)
1231 | except StopParsing:
1232 | pass
1233 | self.markup = None # The markup can now be GCed.
1234 | self.builder = None # So can the builder.
1235 |
1236 | def _feed(self, inDocumentEncoding=None, isHTML=False):
1237 | # Convert the document to Unicode.
1238 | markup = self.markup
1239 | if isinstance(markup, unicode):
1240 | if not hasattr(self, 'originalEncoding'):
1241 | self.originalEncoding = None
1242 | else:
1243 | dammit = UnicodeDammit\
1244 | (markup, [self.fromEncoding, inDocumentEncoding],
1245 | smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
1246 | markup = dammit.unicode
1247 | self.originalEncoding = dammit.originalEncoding
1248 | self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
1249 | if markup:
1250 | if self.markupMassage:
1251 | if not isList(self.markupMassage):
1252 | self.markupMassage = self.MARKUP_MASSAGE
1253 | for fix, m in self.markupMassage:
1254 | markup = fix.sub(m, markup)
1255 | # TODO: We get rid of markupMassage so that the
1256 | # soup object can be deepcopied later on. Some
1257 | # Python installations can't copy regexes. If anyone
1258 | # was relying on the existence of markupMassage, this
1259 | # might cause problems.
1260 | del(self.markupMassage)
1261 | self.builder.reset()
1262 |
1263 | self.builder.feed(markup)
1264 | # Close out any unfinished strings and close all the open tags.
1265 | self.endData()
1266 | while self.currentTag.name != self.ROOT_TAG_NAME:
1267 | self.popTag()
1268 |
1269 | def isSelfClosingTag(self, name):
1270 | """Returns true iff the given string is the name of a
1271 | self-closing tag according to this parser."""
1272 | return self.SELF_CLOSING_TAGS.has_key(name) \
1273 | or self.instanceSelfClosingTags.has_key(name)
1274 |
1275 | def reset(self):
1276 | Tag.__init__(self, self, self.ROOT_TAG_NAME)
1277 | self.hidden = 1
1278 | self.builder.reset()
1279 | self.currentData = []
1280 | self.currentTag = None
1281 | self.tagStack = []
1282 | self.quoteStack = []
1283 | self.pushTag(self)
1284 |
1285 | def popTag(self):
1286 | tag = self.tagStack.pop()
1287 | # Tags with just one string-owning child get the child as a
1288 | # 'string' property, so that soup.tag.string is shorthand for
1289 | # soup.tag.contents[0]
1290 | if len(self.currentTag.contents) == 1 and \
1291 | isinstance(self.currentTag.contents[0], NavigableString):
1292 | self.currentTag.string = self.currentTag.contents[0]
1293 |
1294 | #print "Pop", tag.name
1295 | if self.tagStack:
1296 | self.currentTag = self.tagStack[-1]
1297 | return self.currentTag
1298 |
1299 | def pushTag(self, tag):
1300 | #print "Push", tag.name
1301 | if self.currentTag:
1302 | self.currentTag.contents.append(tag)
1303 | self.tagStack.append(tag)
1304 | self.currentTag = self.tagStack[-1]
1305 |
1306 | def endData(self, containerClass=NavigableString):
1307 | if self.currentData:
1308 | currentData = u''.join(self.currentData)
1309 | if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
1310 | not set([tag.name for tag in self.tagStack]).intersection(
1311 | self.PRESERVE_WHITESPACE_TAGS)):
1312 | if '\n' in currentData:
1313 | currentData = '\n'
1314 | else:
1315 | currentData = ' '
1316 | self.currentData = []
1317 | if self.parseOnlyThese and len(self.tagStack) <= 1 and \
1318 | (not self.parseOnlyThese.text or \
1319 | not self.parseOnlyThese.search(currentData)):
1320 | return
1321 | o = containerClass(currentData)
1322 | o.setup(self.currentTag, self.previous)
1323 | if self.previous:
1324 | self.previous.next = o
1325 | self.previous = o
1326 | self.currentTag.contents.append(o)
1327 |
1328 |
1329 | def _popToTag(self, name, inclusivePop=True):
1330 | """Pops the tag stack up to and including the most recent
1331 | instance of the given tag. If inclusivePop is false, pops the tag
1332 |         stack up to but *not* including the most recent instance of
1333 | the given tag."""
1334 | #print "Popping to %s" % name
1335 | if name == self.ROOT_TAG_NAME:
1336 | return
1337 |
1338 | numPops = 0
1339 | mostRecentTag = None
1340 | for i in range(len(self.tagStack)-1, 0, -1):
1341 | if name == self.tagStack[i].name:
1342 | numPops = len(self.tagStack)-i
1343 | break
1344 | if not inclusivePop:
1345 | numPops = numPops - 1
1346 |
1347 | for i in range(0, numPops):
1348 | mostRecentTag = self.popTag()
1349 | return mostRecentTag
1350 |
1351 | def _smartPop(self, name):
1352 |
1353 | """We need to pop up to the previous tag of this type, unless
1354 | one of this tag's nesting reset triggers comes between this
1355 | tag and the previous tag of this type, OR unless this tag is a
1356 | generic nesting trigger and another generic nesting trigger
1357 | comes between this tag and the previous tag of this type.
1358 |
1359 | Examples:
1360 |          <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
1361 |          <p>Foo<table>Bar *<p>* should pop to 'p', not 'table'
1362 |          <p>Foo<table><tr>Bar *<p>* should pop to 'p', not 'tr'
1363 | 
1364 |          <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
1365 |          <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
1366 |          <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
1367 |         """
1368 |
1369 | nestingResetTriggers = self.NESTABLE_TAGS.get(name)
1370 | isNestable = nestingResetTriggers != None
1371 | isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
1372 | popTo = None
1373 | inclusive = True
1374 | for i in range(len(self.tagStack)-1, 0, -1):
1375 | p = self.tagStack[i]
1376 | if (not p or p.name == name) and not isNestable:
1377 | #Non-nestable tags get popped to the top or to their
1378 |                 #last occurrence.
1379 | popTo = name
1380 | break
1381 | if (nestingResetTriggers != None
1382 | and p.name in nestingResetTriggers) \
1383 | or (nestingResetTriggers == None and isResetNesting
1384 | and self.RESET_NESTING_TAGS.has_key(p.name)):
1385 |
1386 | #If we encounter one of the nesting reset triggers
1387 | #peculiar to this tag, or we encounter another tag
1388 | #that causes nesting to reset, pop up to but not
1389 | #including that tag.
1390 | popTo = p.name
1391 | inclusive = False
1392 | break
1393 | p = p.parent
1394 | if popTo:
1395 | self._popToTag(popTo, inclusive)
1396 |
1397 | def unknown_starttag(self, name, attrs, selfClosing=0):
1398 | #print "Start tag %s: %s" % (name, attrs)
1399 | if self.quoteStack:
1400 | #This is not a real tag.
1401 | #print "<%s> is not real!" % name
1402 | attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs))
1403 | self.handle_data('<%s%s>' % (name, attrs))
1404 | return
1405 | self.endData()
1406 |
1407 | if not self.isSelfClosingTag(name) and not selfClosing:
1408 | self._smartPop(name)
1409 |
1410 | if self.parseOnlyThese and len(self.tagStack) <= 1 \
1411 | and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
1412 | return
1413 |
1414 | tag = Tag(self, name, attrs, self.currentTag, self.previous)
1415 | if self.previous:
1416 | self.previous.next = tag
1417 | self.previous = tag
1418 | self.pushTag(tag)
1419 | if selfClosing or self.isSelfClosingTag(name):
1420 | self.popTag()
1421 | if name in self.QUOTE_TAGS:
1422 | #print "Beginning quote (%s)" % name
1423 | self.quoteStack.append(name)
1424 | self.literal = 1
1425 | return tag
1426 |
1427 | def unknown_endtag(self, name):
1428 | #print "End tag %s" % name
1429 | if self.quoteStack and self.quoteStack[-1] != name:
1430 | #This is not a real end tag.
1431 |             #print "</%s> is not real!" % name
1432 |             self.handle_data('</%s>' % name)
1433 | return
1434 | self.endData()
1435 | self._popToTag(name)
1436 | if self.quoteStack and self.quoteStack[-1] == name:
1437 | self.quoteStack.pop()
1438 | self.literal = (len(self.quoteStack) > 0)
1439 |
1440 | def handle_data(self, data):
1441 | self.currentData.append(data)
1442 |
1443 | def extractCharsetFromMeta(self, attrs):
1444 | self.unknown_starttag('meta', attrs)
1445 |
1446 |
1447 | class BeautifulSoup(BeautifulStoneSoup):
1448 |
1449 | """This parser knows the following facts about HTML:
1450 |
1451 | * Some tags have no closing tag and should be interpreted as being
1452 | closed as soon as they are encountered.
1453 |
1454 | * The text inside some tags (ie. 'script') may contain tags which
1455 | are not really part of the document and which should be parsed
1456 | as text, not tags. If you want to parse the text as tags, you can
1457 | always fetch it and parse it explicitly.
1458 |
1459 | * Tag nesting rules:
1460 |
1461 |     Most tags can't be nested at all. For instance, the occurrence of
1462 |     a <p> tag should implicitly close the previous <p> tag.
1463 | 
1464 |      <p>Para1<p>Para2
1465 |       should be transformed into:
1466 |      <p>Para1</p><p>Para2
1467 | 
1468 |     Some tags can be nested arbitrarily. For instance, the occurrence
1469 |     of a <blockquote> tag should _not_ implicitly close the previous
1470 |     <blockquote> tag.
1471 | 
1472 |      Alice said: <blockquote>Bob said: <blockquote>Blah
1473 |       should NOT be transformed into:
1474 |      Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
1475 | 
1476 |     Some tags can be nested, but the nesting is reset by the
1477 |     interposition of other tags. For instance, a <tr> tag should