├── README.rst
├── build_dict.py
├── data
    └── dict.pkl
├── doc
    ├── api-objects.txt
    ├── build_dict-module.html
    ├── build_dict-pysrc.html
    ├── class-tree.html
    ├── class_hierarchy_for_faststemme.gif
    ├── class_hierarchy_for_htmlreader.gif
    ├── class_hierarchy_for_multitag.gif
    ├── class_hierarchy_for_naiverater.gif
    ├── class_hierarchy_for_rater.gif
    ├── class_hierarchy_for_reader.gif
    ├── class_hierarchy_for_simpleread.gif
    ├── class_hierarchy_for_stemmer.gif
    ├── class_hierarchy_for_tag.gif
    ├── class_hierarchy_for_unicoderea.gif
    ├── crarr.png
    ├── epydoc.css
    ├── epydoc.js
    ├── extras-module.html
    ├── extras-pysrc.html
    ├── extras.FastStemmer-class.html
    ├── extras.HTMLReader-class.html
    ├── extras.NaiveRater-class.html
    ├── extras.SimpleReader-class.html
    ├── extras.UnicodeReader-class.html
    ├── frames.html
    ├── help.html
    ├── identifier-index.html
    ├── index.html
    ├── module-tree.html
    ├── redirect.html
    ├── tagger-module.html
    ├── tagger-pysrc.html
    ├── tagger.MultiTag-class.html
    ├── tagger.Rater-class.html
    ├── tagger.Reader-class.html
    ├── tagger.Stemmer-class.html
    ├── tagger.Tag-class.html
    ├── tagger.Tagger-class.html
    ├── toc-build_dict-module.html
    ├── toc-everything.html
    ├── toc-extras-module.html
    ├── toc-tagger-module.html
    └── toc.html
├── extras.py
├── tagger.py
├── test_ui.py
└── tests
    ├── bbc1.txt
    ├── bbc2.txt
    ├── bbc3.txt
    ├── guardian1.txt
    ├── guardian2.txt
    ├── post1.txt
    ├── wikipedia1.txt
    ├── wikipedia2.txt
    └── wikipedia3.txt


/README.rst:
--------------------------------------------------------------------------------
======
tagger
======

Module for extracting tags from text documents.

Copyright (C) 2011 by Alessandro Presta

Configuration
=============

Dependencies:
python2.7, stemming, nltk (optional), lxml (optional), tkinter (optional)

You can install the stemming package with::

    $ easy_install stemming

Usage
=====

Tagging a text document from Python::

    import pickle
    import tagger
    weights = pickle.load(open('data/dict.pkl', 'rb')) # or your own dictionary
    myreader = tagger.Reader() # or your own reader class
    mystemmer = tagger.Stemmer() # or your own stemmer class
    myrater = tagger.Rater(weights) # or your own... (you get the idea)
    mytagger = tagger.Tagger(myreader, mystemmer, myrater)
    best_3_tags = mytagger(text_string, 3)

Running the module as a script::

    $ ./tagger.py

Example::

    $ ./tagger.py tests/*
    Loading dictionary...
    Tags for tests/bbc1.txt :
    ['bin laden', 'obama', 'pakistan', 'killed', 'raid']
    Tags for tests/bbc2.txt :
    ['jo yeates', 'bristol', 'vincent tabak', 'murder', 'strangled']
    Tags for tests/bbc3.txt :
    ['snp', 'party', 'election', 'scottish', 'labour']
    Tags for tests/guardian1.txt :
    ['bin laden', 'al-qaida', 'killed', 'pakistan', 'al-fawwaz']
    Tags for tests/guardian2.txt :
    ['clegg', 'tory', 'lib dem', 'party', 'coalition']
    Tags for tests/post1.txt :
    ['sony', 'stolen', 'playstation network', 'hacker attack', 'lawsuit']
    Tags for tests/wikipedia1.txt :
    ['universe', 'anthropic principle', 'observed', 'cosmological', 'theory']
    Tags for tests/wikipedia2.txt :
    ['beetroot', 'beet', 'betaine', 'blood pressure', 'dietary nitrate']
    Tags for tests/wikipedia3.txt :
    ['the lounge lizards', 'jazz', 'john lurie', 'musical', 'albums']

A brief explanation
===================

Extracting tags from a text document involves at least three steps: splitting the document into words, grouping together
variants of the same word, and ranking them according to their relevance.
These three tasks are carried out respectively by the **Reader**, **Stemmer** and **Rater** classes, and their work is put together by the **Tagger** class.

A **Reader** object may accept as input a document in some format, perform some normalisation of the text (such as turning everything into lower case), analyse the structure of the phrases and punctuation, and return a list of words respecting the order in the text, perhaps with some additional information such as which ones look like proper nouns, or are at the end of a phrase.
A very straightforward way of doing this would be to just match all the words with a regular expression, and this is indeed what the **SimpleReader** class does.

The **Stemmer** tries to recognise the root of a word, in order to identify slightly different forms. This is already quite a complicated task, and it's clearly language-specific.
The *stem* module in the NLTK package provides algorithms for many languages
and integrates nicely with the tagger::

    import nltk
    from tagger import Stemmer
    # an English stemmer using Lancaster's algorithm
    mystemmer = Stemmer(nltk.stem.LancasterStemmer)
    # an Italian stemmer
    class MyItalianStemmer(Stemmer):
        def __init__(self):
            Stemmer.__init__(self, nltk.stem.ItalianStemmer)
        def preprocess(self, string):
            # do something with the string before passing it to nltk's stemmer
            return string

The **Rater** takes the list of words contained in the document, together with any additional information gathered at the previous stages, and returns a list of tags (i.e. words or small units of text) ordered by some idea of "relevance".

It turns out that just working on the information contained in the document itself is not enough, because it says nothing about the frequency of a term in the language.
For this reason, an early "off-line" phase of the algorithm consists of analysing a *corpus* (i.e. a sample of documents written in the same language) to build a dictionary of known words. This is taken care of by the **build_dict()** function.
It is advised to build your own dictionaries, and the **build_dict_from_nltk()** function in the *extras* module enables you to use the corpora included in NLTK::

    build_dict_from_nltk(output_file, nltk.corpus.brown,
                         nltk.corpus.stopwords.words('english'), measure='ICF')

So far, we may define the relevance of a word as the product of two distinct functions: one that depends on the document itself, and one that depends on the corpus.
A standard measure in information retrieval is TF-IDF (*term frequency-inverse
document frequency*): the frequency of the word in the document multiplied by
the (logarithm of) the inverse of its frequency in the corpus (i.e. the cardinality of the corpus divided by the number of documents where the word is found).
If we treat the whole corpus as a single document, and count the total occurrences of the term instead, we obtain ICF (*inverse collection frequency*).
Both of these are implemented in the *build_dict* module, and any other reasonable measure should be fine, provided that it is normalised in the interval [0,1]. The dictionary is passed to the **Rater** object as the *weights* argument in its constructor.
We might also want to define the first term of the product in a different way, and this is done by overriding the **rate_tags()** method (which by default calculates TF for each word and multiplies it by its weight)::

    class MyRater(Rater):
        def rate_tags(self, tags):
            # set each tag's rating as you wish
            pass

If we were not too picky about the results, these few bits would already make an acceptable tagger.
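The weighting described above can be illustrated with a small, self-contained sketch. It is not the module's own code: ``idf_weights`` mirrors the 'IDF' formula used by **build_dict()**, while ``rate_by_tf_idf`` and the toy corpus are hypothetical names and data introduced only for illustration, and giving unknown words a weight of 1.0 is an assumption of this sketch.

```python
import collections
import math


def idf_weights(corpus):
    """Weight each word by log(N / (df + 1)) / log(N), the 'IDF' formula."""
    corpus_size = float(len(corpus))
    scale = math.log(corpus_size)  # needs at least two documents
    doc_count = collections.Counter()
    for doc in corpus:
        doc_count.update(set(doc))  # count each word once per document
    return dict((w, math.log(corpus_size / (cnt + 1)) / scale)
                for w, cnt in doc_count.items())


def rate_by_tf_idf(document, weights):
    """Hypothetical helper: rank words by term frequency times weight."""
    term_count = collections.Counter(document)
    total = float(len(document))
    # Assumption: words missing from the dictionary get weight 1.0.
    ratings = dict((w, cnt / total * weights.get(w, 1.0))
                   for w, cnt in term_count.items())
    return sorted(ratings, key=ratings.get, reverse=True)


# Toy corpus of already-stemmed documents (illustrative data only).
corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'ran'],
    ['the', 'cat', 'ran'],
    ['a', 'bird', 'sang'],
]
weights = idf_weights(corpus)
ranked = rate_by_tf_idf(['the', 'bird', 'the', 'cat'], weights)
print(ranked)
```

In this toy run the most frequent word in the document ("the") ends up last, because it appears in three of the four corpus documents and its weight drops to zero, which is the effect the corpus-dependent term is meant to have.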
However, tags formed only by single words are quite limited: while "obama" and "barack obama" are both reasonable tags (and it is quite easy to treat cases like this in order to regard them as equal), having "laden" and "bin" as two separate tags is misleading and definitely not acceptable.
Compare the results on the same document using the **NaiveRater** class (defined in the module *extras*) instead of the standard one.

The *multitag_size* parameter in the **Rater**'s constructor defines the maximum number of words that can constitute a tag. Multitags are generated in the **create_multitags()** method; if additional information about the position of a word in the phrase is available (i.e. the **terminal** member of the class **Tag**), this can be done in a more accurate way.
The rating of a **MultiTag** is computed from the ratings of its unit tags.
By default, the **combined_rating()** method uses the geometric mean, with a special treatment of proper nouns if that information is available too (in the **proper** member).
This method can be overridden too, so there is room for experimentation.

With a few "common sense" heuristics the results are greatly improved.
The final stage of the default rating algorithm involves discarding redundant tags (i.e. tags that contain or are contained in other, less relevant tags).

It should be stressed that the default implementation doesn't make any assumption about the type of document that is being tagged (except for it being written in English) or about the kinds of tags that should be given priority (which sometimes can be a matter of taste or depend on the particular task we are using the tags for).
With some additional assumptions and an accurate treatment of corner cases, the tagger can be tailored to suit the user's needs.

This is proof-of-concept software and extensive experimentation is encouraged.
The design of the base classes should allow for this, and the few examples in the *extras* module are a good starting point for customising the algorithm.

--------------------------------------------------------------------------------
/build_dict.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Copyright (C) 2011 by Alessandro Presta

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE


'''
Usage: build_dict.py -o -s
'''

from tagger import Stemmer
from extras import SimpleReader


def build_dict(corpus, stopwords=None, measure='IDF'):
    '''
    @param corpus:    a list of documents, represented as lists of (stemmed)
                      words
    @param stopwords: the list of (stemmed) words that should have zero weight
    @param measure:   the measure used to compute the weights ('IDF'
                      i.e. 'inverse document frequency' or 'ICF' i.e.
                      'inverse collection frequency'; defaults to 'IDF')

    @returns: a dictionary of weights in the interval [0,1]
    '''

    import collections
    import math

    dictionary = {}

    if measure == 'ICF':
        # treat the whole corpus as one document and count every occurrence
        words = [w for doc in corpus for w in doc]

        term_count = collections.Counter(words)
        total_count = float(len(words))
        scale = math.log(total_count)

        for w, cnt in term_count.iteritems():
            dictionary[w] = math.log(total_count / (cnt + 1)) / scale

    elif measure == 'IDF':
        corpus_size = float(len(corpus))
        scale = math.log(corpus_size)

        term_count = collections.defaultdict(int)

        # count the number of documents in which each word appears
        for doc in corpus:
            words = set(doc)
            for w in words:
                term_count[w] += 1

        for w, cnt in term_count.iteritems():
            dictionary[w] = math.log(corpus_size / (cnt + 1)) / scale

    # stopwords always get zero weight
    if stopwords:
        for w in stopwords:
            dictionary[w] = 0.0

    return dictionary


def build_dict_from_files(output_file, corpus_files, stopwords_file=None,
                          reader=SimpleReader(), stemmer=Stemmer(),
                          measure='IDF', verbose=False):
    '''
    @param output_file:    the binary stream where the dictionary should be
                           saved
    @param corpus_files:   a list of streams with words to process
    @param stopwords_file: a stream containing a list of stopwords
    @param reader:         the L{Reader} object to be used
    @param stemmer:        the L{Stemmer} object to be used
    @param measure:        the measure used to compute the weights ('IDF'
                           i.e. 'inverse document frequency' or 'ICF' i.e.
                           'inverse collection frequency'; defaults to 'IDF')
    @param verbose:        whether information on the progress should be
                           printed on screen
    '''

    import pickle

    if verbose: print 'Processing corpus...'
    corpus = []
    for doc in corpus_files:
        corpus.append(reader(doc.read()))
    corpus = [[w.stem for w in map(stemmer, doc)] for doc in corpus]

    stopwords = None
    if stopwords_file:
        if verbose: print 'Processing stopwords...'
        stopwords = reader(stopwords_file.read())
        stopwords = [w.stem for w in map(stemmer, stopwords)]

    if verbose: print 'Building dictionary... '
    dictionary = build_dict(corpus, stopwords, measure)
    # -1 selects the highest pickle protocol available
    pickle.dump(dictionary, output_file, -1)


if __name__ == '__main__':

    import getopt
    import sys

    try:
        # expects the -o option first, then -s, then the corpus files
        options = getopt.getopt(sys.argv[1:], 'o:s:')
        output_file = options[0][0][1]
        stopwords_file = options[0][1][1]
        corpus = options[1]
    except (getopt.GetoptError, IndexError):
        print __doc__
        sys.exit(1)

    corpus = [open(doc, 'r') for doc in corpus]
    stopwords_file = open(stopwords_file, 'r')
    output_file = open(output_file, 'wb')

    build_dict_from_files(output_file, corpus, stopwords_file, verbose=True)

    output_file.close()
    stopwords_file.close()
    for doc in corpus:
        doc.close()


--------------------------------------------------------------------------------
/data/dict.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/data/dict.pkl


--------------------------------------------------------------------------------
/doc/api-objects.txt:
--------------------------------------------------------------------------------
build_dict build_dict-module.html
build_dict.build_dict_from_files build_dict-module.html#build_dict_from_files
build_dict.build_dict build_dict-module.html#build_dict
build_dict.__package__ build_dict-module.html#__package__
extras extras-module.html
extras.__package__ extras-module.html#__package__
extras.build_dict_from_nltk extras-module.html#build_dict_from_nltk
tagger tagger-module.html
tagger.__package__ tagger-module.html#__package__
extras.FastStemmer extras.FastStemmer-class.html
tagger.Stemmer.preprocess tagger.Stemmer-class.html#preprocess
tagger.Stemmer.__call__ tagger.Stemmer-class.html#__call__
extras.FastStemmer.__init__ extras.FastStemmer-class.html#__init__
tagger.Stemmer.match_contractions tagger.Stemmer-class.html#match_contractions 15 | extras.HTMLReader extras.HTMLReader-class.html 16 | tagger.Reader.match_phrases tagger.Reader-class.html#match_phrases 17 | tagger.Reader.match_words tagger.Reader-class.html#match_words 18 | tagger.Reader.match_paragraphs tagger.Reader-class.html#match_paragraphs 19 | tagger.Reader.preprocess tagger.Reader-class.html#preprocess 20 | extras.HTMLReader.__call__ extras.HTMLReader-class.html#__call__ 21 | tagger.Reader.match_apostrophes tagger.Reader-class.html#match_apostrophes 22 | extras.NaiveRater extras.NaiveRater-class.html 23 | extras.NaiveRater.__call__ extras.NaiveRater-class.html#__call__ 24 | tagger.Rater.create_multitags tagger.Rater-class.html#create_multitags 25 | tagger.Rater.__init__ tagger.Rater-class.html#__init__ 26 | tagger.Rater.rate_tags tagger.Rater-class.html#rate_tags 27 | extras.SimpleReader extras.SimpleReader-class.html 28 | tagger.Reader.match_phrases tagger.Reader-class.html#match_phrases 29 | tagger.Reader.match_words tagger.Reader-class.html#match_words 30 | tagger.Reader.match_paragraphs tagger.Reader-class.html#match_paragraphs 31 | tagger.Reader.preprocess tagger.Reader-class.html#preprocess 32 | extras.SimpleReader.__call__ extras.SimpleReader-class.html#__call__ 33 | tagger.Reader.match_apostrophes tagger.Reader-class.html#match_apostrophes 34 | extras.UnicodeReader extras.UnicodeReader-class.html 35 | tagger.Reader.match_phrases tagger.Reader-class.html#match_phrases 36 | tagger.Reader.match_words tagger.Reader-class.html#match_words 37 | tagger.Reader.match_paragraphs tagger.Reader-class.html#match_paragraphs 38 | tagger.Reader.preprocess tagger.Reader-class.html#preprocess 39 | extras.UnicodeReader.__call__ extras.UnicodeReader-class.html#__call__ 40 | tagger.Reader.match_apostrophes tagger.Reader-class.html#match_apostrophes 41 | tagger.MultiTag tagger.MultiTag-class.html 42 | tagger.MultiTag.combined_rating 
tagger.MultiTag-class.html#combined_rating 43 | tagger.Tag.__repr__ tagger.Tag-class.html#__repr__ 44 | tagger.Tag.__hash__ tagger.Tag-class.html#__hash__ 45 | tagger.Tag.__lt__ tagger.Tag-class.html#__lt__ 46 | tagger.Tag.__eq__ tagger.Tag-class.html#__eq__ 47 | tagger.MultiTag.__init__ tagger.MultiTag-class.html#__init__ 48 | tagger.Rater tagger.Rater-class.html 49 | tagger.Rater.__call__ tagger.Rater-class.html#__call__ 50 | tagger.Rater.create_multitags tagger.Rater-class.html#create_multitags 51 | tagger.Rater.__init__ tagger.Rater-class.html#__init__ 52 | tagger.Rater.rate_tags tagger.Rater-class.html#rate_tags 53 | tagger.Reader tagger.Reader-class.html 54 | tagger.Reader.match_phrases tagger.Reader-class.html#match_phrases 55 | tagger.Reader.match_words tagger.Reader-class.html#match_words 56 | tagger.Reader.match_paragraphs tagger.Reader-class.html#match_paragraphs 57 | tagger.Reader.__call__ tagger.Reader-class.html#__call__ 58 | tagger.Reader.preprocess tagger.Reader-class.html#preprocess 59 | tagger.Reader.match_apostrophes tagger.Reader-class.html#match_apostrophes 60 | tagger.Stemmer tagger.Stemmer-class.html 61 | tagger.Stemmer.preprocess tagger.Stemmer-class.html#preprocess 62 | tagger.Stemmer.__call__ tagger.Stemmer-class.html#__call__ 63 | tagger.Stemmer.__init__ tagger.Stemmer-class.html#__init__ 64 | tagger.Stemmer.match_contractions tagger.Stemmer-class.html#match_contractions 65 | tagger.Tag tagger.Tag-class.html 66 | tagger.Tag.__hash__ tagger.Tag-class.html#__hash__ 67 | tagger.Tag.__lt__ tagger.Tag-class.html#__lt__ 68 | tagger.Tag.__eq__ tagger.Tag-class.html#__eq__ 69 | tagger.Tag.__repr__ tagger.Tag-class.html#__repr__ 70 | tagger.Tag.__init__ tagger.Tag-class.html#__init__ 71 | tagger.Tagger tagger.Tagger-class.html 72 | tagger.Tagger.__call__ tagger.Tagger-class.html#__call__ 73 | tagger.Tagger.__init__ tagger.Tagger-class.html#__init__ 74 | -------------------------------------------------------------------------------- 
/doc/class-tree.html:
--------------------------------------------------------------------------------
Class Hierarchy

• tagger.Rater: Class for estimating the relevance of tags
    • extras.NaiveRater: Rater subclass that just ranks single-word tags by their frequency and weight
• tagger.Reader: Class for parsing a string of text to obtain tags
    • extras.SimpleReader: Reader subclass that doesn't perform any advanced analysis of the text
    • extras.UnicodeReader: Reader subclass that converts Unicode strings to a close ASCII representation
        • extras.HTMLReader: Reader subclass that can parse HTML code from the input
• tagger.Stemmer: Class for extracting the stem of a word
    • extras.FastStemmer: Stemmer subclass that uses a much faster, but less correct algorithm
• tagger.Tag: General class for tags (small units of text)
    • tagger.MultiTag: Class for aggregates of tags (usually next to each other in the document)
• tagger.Tagger: Master class for tagging text documents


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_faststemme.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_faststemme.gif


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_htmlreader.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_htmlreader.gif


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_multitag.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_multitag.gif


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_naiverater.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_naiverater.gif


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_rater.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_rater.gif


--------------------------------------------------------------------------------
/doc/class_hierarchy_for_reader.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_reader.gif -------------------------------------------------------------------------------- /doc/class_hierarchy_for_simpleread.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_simpleread.gif -------------------------------------------------------------------------------- /doc/class_hierarchy_for_stemmer.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_stemmer.gif -------------------------------------------------------------------------------- /doc/class_hierarchy_for_tag.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_tag.gif -------------------------------------------------------------------------------- /doc/class_hierarchy_for_unicoderea.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/class_hierarchy_for_unicoderea.gif -------------------------------------------------------------------------------- /doc/crarr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kevinmcmahon/tagger/69432971b6af1f2649f2af5b623f156ffeb0d1c3/doc/crarr.png -------------------------------------------------------------------------------- /doc/epydoc.js: -------------------------------------------------------------------------------- 1 | function toggle_private() { 2 | // Search for any private/public links 
on this page. Store 3 | // their old text in "cmd," so we will know what action to 4 | // take; and change their text to the opposite action. 5 | var cmd = "?"; 6 | var elts = document.getElementsByTagName("a"); 7 | for(var i=0; i...
"; 127 | elt.innerHTML = s; 128 | } 129 | } 130 | 131 | function toggle(id) { 132 | elt = document.getElementById(id+"-toggle"); 133 | if (elt.innerHTML == "-") 134 | collapse(id); 135 | else 136 | expand(id); 137 | return false; 138 | } 139 | 140 | function highlight(id) { 141 | var elt = document.getElementById(id+"-def"); 142 | if (elt) elt.className = "py-highlight-hdr"; 143 | var elt = document.getElementById(id+"-expanded"); 144 | if (elt) elt.className = "py-highlight"; 145 | var elt = document.getElementById(id+"-collapsed"); 146 | if (elt) elt.className = "py-highlight"; 147 | } 148 | 149 | function num_lines(s) { 150 | var n = 1; 151 | var pos = s.indexOf("\n"); 152 | while ( pos > 0) { 153 | n += 1; 154 | pos = s.indexOf("\n", pos+1); 155 | } 156 | return n; 157 | } 158 | 159 | // Collapse all blocks that mave more than `min_lines` lines. 160 | function collapse_all(min_lines) { 161 | var elts = document.getElementsByTagName("div"); 162 | for (var i=0; i 0) 166 | if (elt.id.substring(split, elt.id.length) == "-expanded") 167 | if (num_lines(elt.innerHTML) > min_lines) 168 | collapse(elt.id.substring(0, split)); 169 | } 170 | } 171 | 172 | function expandto(href) { 173 | var start = href.indexOf("#")+1; 174 | if (start != 0 && start != href.length) { 175 | if (href.substring(start, href.length) != "-") { 176 | collapse_all(4); 177 | pos = href.indexOf(".", start); 178 | while (pos != -1) { 179 | var id = href.substring(start, pos); 180 | expand(id); 181 | pos = href.indexOf(".", pos+1); 182 | } 183 | var id = href.substring(start, href.length); 184 | expand(id); 185 | highlight(id); 186 | } 187 | } 188 | } 189 | 190 | function kill_doclink(id) { 191 | var parent = document.getElementById(id); 192 | parent.removeChild(parent.childNodes.item(0)); 193 | } 194 | function auto_kill_doclink(ev) { 195 | if (!ev) var ev = window.event; 196 | if (!this.contains(ev.toElement)) { 197 | var parent = document.getElementById(this.parentID); 198 | 
parent.removeChild(parent.childNodes.item(0)); 199 | } 200 | } 201 | 202 | function doclink(id, name, targets_id) { 203 | var elt = document.getElementById(id); 204 | 205 | // If we already opened the box, then destroy it. 206 | // (This case should never occur, but leave it in just in case.) 207 | if (elt.childNodes.length > 1) { 208 | elt.removeChild(elt.childNodes.item(0)); 209 | } 210 | else { 211 | // The outer box: relative + inline positioning. 212 | var box1 = document.createElement("div"); 213 | box1.style.position = "relative"; 214 | box1.style.display = "inline"; 215 | box1.style.top = 0; 216 | box1.style.left = 0; 217 | 218 | // A shadow for fun 219 | var shadow = document.createElement("div"); 220 | shadow.style.position = "absolute"; 221 | shadow.style.left = "-1.3em"; 222 | shadow.style.top = "-1.3em"; 223 | shadow.style.background = "#404040"; 224 | 225 | // The inner box: absolute positioning. 226 | var box2 = document.createElement("div"); 227 | box2.style.position = "relative"; 228 | box2.style.border = "1px solid #a0a0a0"; 229 | box2.style.left = "-.2em"; 230 | box2.style.top = "-.2em"; 231 | box2.style.background = "white"; 232 | box2.style.padding = ".3em .4em .3em .4em"; 233 | box2.style.fontStyle = "normal"; 234 | box2.onmouseout=auto_kill_doclink; 235 | box2.parentID = id; 236 | 237 | // Get the targets 238 | var targets_elt = document.getElementById(targets_id); 239 | var targets = targets_elt.getAttribute("targets"); 240 | var links = ""; 241 | target_list = targets.split(","); 242 | for (var i=0; i" + 246 | target[0] + ""; 247 | } 248 | 249 | // Put it all together. 250 | elt.insertBefore(box1, elt.childNodes.item(0)); 251 | //box1.appendChild(box2); 252 | box1.appendChild(shadow); 253 | shadow.appendChild(box2); 254 | box2.innerHTML = 255 | "Which "+name+" do you want to see documentation for?" 
+ 256 | ""; 261 | } 262 | return false; 263 | } 264 | 265 | function get_anchor() { 266 | var href = location.href; 267 | var start = href.indexOf("#")+1; 268 | if ((start != 0) && (start != href.length)) 269 | return href.substring(start, href.length); 270 | } 271 | function redirect_url(dottedName) { 272 | // Scan through each element of the "pages" list, and check 273 | // if "name" matches with any of them. 274 | for (var i=0; i-m" or "-c"; 277 | // extract the portion & compare it to dottedName. 278 | var pagename = pages[i].substring(0, pages[i].length-2); 279 | if (pagename == dottedName.substring(0,pagename.length)) { 280 | 281 | // We've found a page that matches `dottedName`; 282 | // construct its URL, using leftover `dottedName` 283 | // content to form an anchor. 284 | var pagetype = pages[i].charAt(pages[i].length-1); 285 | var url = pagename + ((pagetype=="m")?"-module.html": 286 | "-class.html"); 287 | if (dottedName.length > pagename.length) 288 | url += "#" + dottedName.substring(pagename.length+1, 289 | dottedName.length); 290 | return url; 291 | } 292 | } 293 | } 294 | -------------------------------------------------------------------------------- /doc/extras-module.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 45 | 56 | 57 |
41 | 42 | Module extras 43 | 44 | 46 | 47 | 48 | 50 | 54 |
[hide private]
[frames] | no frames]
55 |
58 | 59 |

Module extras

source code

60 | 61 | 62 | 64 | 65 | 76 | 77 | 78 | 85 | 86 | 87 | 93 | 94 | 95 | 102 | 103 | 104 | 111 | 112 | 113 | 120 | 121 |
66 | 67 | 68 | 69 | 73 | 74 |
Classes[hide private]
75 |
79 |   80 | 81 | UnicodeReader
82 | Reader subclass that converts Unicode strings to a close ASCII 83 | representation 84 |
88 |   89 | 90 | HTMLReader
91 | Reader subclass that can parse HTML code from the input 92 |
96 |   97 | 98 | SimpleReader
99 | Reader subclass that doesn't perform any advanced analysis of the 100 | text 101 |
105 |   106 | 107 | FastStemmer
108 | Stemmer subclass that uses a much faster, but less correct 109 | algorithm 110 |
114 |   115 | 116 | NaiveRater
117 | Rater subclass that jusk ranks single-word tags by their frequency 118 | and weight 119 |
122 | 123 | 124 | 126 | 127 | 138 | 139 | 140 | 159 | 160 |
128 | 129 | 130 | 131 | 135 | 136 |
Functions[hide private]
137 |
141 |   142 | 143 | 144 | 145 | 151 | 155 | 156 |
build_dict_from_nltk(output_file, 146 | corpus=None, 147 | stopwords=None, 148 | stemmer=Stemmer(), 149 | measure='IDF', 150 | verbose=False) 152 | source code 153 | 154 |
157 | 158 |
161 | 162 | 163 | 165 | 166 | 177 | 178 | 179 | 184 | 185 |
167 | 168 | 169 | 170 | 174 | 175 |
Variables
176 |
180 |   181 | 182 | __package__ = None 183 |
186 | 187 | 188 | 190 | 191 | 202 | 203 |
192 | 193 | 194 | 195 | 199 | 200 |
Function Details
201 |
204 | 205 |
206 | 208 |
209 | 210 | 221 |
211 |

build_dict_from_nltk(output_file, 212 | corpus=None, 213 | stopwords=None, 214 | stemmer=Stemmer(), 215 | measure='IDF', 216 | verbose=False) 217 |

218 |
source code  220 |
222 | 223 | 224 |
225 |
Parameters:
226 |
    227 |
  • output_file - the binary stream where the dictionary should be saved
  • 228 |
  • corpus - the NLTK corpus to use (defaults to nltk.corpus.reuters)
  • 229 |
  • stopwords - a list of (not stemmed) stopwords (defaults to 230 | nltk.corpus.reuters.words('stopwords'))
  • 231 |
  • stemmer - the Stemmer 232 | object to be used
  • 233 |
  • measure - the measure used to compute the weights ('IDF' i.e. 'inverse 234 | document frequency' or 'ICF' i.e. 'inverse collection frequency'; 235 | defaults to 'IDF')
  • 236 |
  • verbose - whether information on the progress should be printed on screen
  • 237 |
238 |
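The two values accepted by `measure` can be illustrated with a small, self-contained sketch. It uses the textbook definitions of inverse document frequency and inverse collection frequency on a toy corpus; the function names `idf_weights` and `icf_weights` are hypothetical, and the exact formulas used by `build_dict_from_nltk` are an assumption, since the documentation above does not state them.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Textbook IDF: log(N / number of documents containing the term)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(float(n_docs) / df) for term, df in doc_freq.items()}


def icf_weights(documents):
    """Textbook ICF: log(total tokens / occurrences of the term in the collection)."""
    coll_freq = Counter(token for doc in documents for token in doc)
    total = sum(coll_freq.values())
    return {term: math.log(float(total) / cf) for term, cf in coll_freq.items()}


# Toy corpus of already-tokenized documents (illustrative data only).
docs = [["obama", "raid", "raid"], ["raid", "election"], ["election", "party"]]
idf = idf_weights(docs)
icf = icf_weights(docs)
```

IDF ignores how often a term repeats inside one document, while ICF counts every occurrence, so a word repeated many times in a single document is penalized more by ICF than by IDF.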
239 |
240 |
241 |
242 | 243 | 245 | 246 | 247 | 248 | 250 | 251 | 252 | 254 | 255 | 256 | 258 | 259 | 260 | 265 | 266 | 267 | 268 | 269 | 272 | 276 | 277 |
278 | 279 | 288 | 289 | 290 | -------------------------------------------------------------------------------- /doc/extras.FastStemmer-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras.FastStemmer 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module extras :: 43 | Class FastStemmer 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class FastStemmer

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 |
68 |
69 |
70 |

Stemmer subclass that uses a much faster, but less correct 71 | algorithm

72 | 73 | 74 | 75 | 77 | 78 | 89 | 90 | 91 | 108 | 109 | 110 | 116 | 117 |
79 | 80 | 81 | 82 | 86 | 87 |
Instance Methods
88 |
92 |   93 | 94 | 95 | 96 | 100 | 104 | 105 |
__init__(self)
97 | Returns: 98 | a new Stemmer 99 | object
101 | source code 102 | 103 |
106 | 107 |
111 |

Inherited from tagger.Stemmer: 112 | __call__, 113 | preprocess 114 |

115 |
118 | 119 | 120 | 122 | 123 | 134 | 135 | 136 | 141 | 142 |
124 | 125 | 126 | 127 | 131 | 132 |
Class Variables
133 |
137 |

Inherited from tagger.Stemmer: 138 | match_contractions 139 |

140 |
143 | 144 | 145 | 147 | 148 | 159 | 160 |
149 | 150 | 151 | 152 | 156 | 157 |
Method Details
158 |
161 | 162 |
163 | 165 |
166 | 167 | 174 |
168 |

__init__(self) 169 |
(Constructor) 170 |

171 |
source code  173 |
175 | 176 | 177 |
178 |
Parameters:
179 |
    180 |
  • stemmer - an object or module with a 'stem' method (defaults to 181 | stemming.porter2)
  • 182 |
183 |
Returns:
184 |
a new Stemmer object
186 |
Overrides: 187 | tagger.Stemmer.__init__ 188 |
(inherited documentation)
189 | 190 |
191 |
192 |
193 |
194 | 195 | 197 | 198 | 199 | 200 | 202 | 203 | 204 | 206 | 207 | 208 | 210 | 211 | 212 | 217 | 218 | 219 | 220 | 221 | 224 | 228 | 229 |
230 | 231 | 240 | 241 | 242 | -------------------------------------------------------------------------------- /doc/extras.HTMLReader-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras.HTMLReader 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module extras :: 43 | Class HTMLReader 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class HTMLReader

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 | 68 |
69 |
70 |
71 |

Reader subclass that can parse HTML code from the input

72 | 73 | 74 | 75 | 77 | 78 | 89 | 90 | 91 | 108 | 109 | 110 | 115 | 116 |
79 | 80 | 81 | 82 | 86 | 87 |
Instance Methods
88 |
92 |   93 | 94 | 95 | 96 | 100 | 104 | 105 |
__call__(self, 97 | html)
98 | Returns: 99 | a list of tags respecting the order in the text
101 | source code 102 | 103 |
106 | 107 |
111 |

Inherited from tagger.Reader: 112 | preprocess 113 |

114 |
117 | 118 | 119 | 121 | 122 | 133 | 134 | 135 | 143 | 144 |
123 | 124 | 125 | 126 | 130 | 131 |
Class Variables
132 |
136 |

Inherited from tagger.Reader: 137 | match_apostrophes, 138 | match_paragraphs, 139 | match_phrases, 140 | match_words 141 |

142 |
145 | 146 | 147 | 149 | 150 | 161 | 162 |
151 | 152 | 153 | 154 | 158 | 159 |
Method Details
160 |
163 | 164 |
165 | 167 |
168 | 169 | 177 |
170 |

__call__(self, 171 | html) 172 |
(Call operator) 173 |

174 |
source code  176 |
178 | 179 | 180 |
181 |
Parameters:
182 |
    183 |
  • text - the string of text to be tagged
  • 184 |
185 |
Returns:
186 |
a list of tags respecting the order in the text
187 |
Overrides: 188 | tagger.Reader.__call__ 189 |
(inherited documentation)
190 | 191 |
192 |
193 |
194 |
195 | 196 | 198 | 199 | 200 | 201 | 203 | 204 | 205 | 207 | 208 | 209 | 211 | 212 | 213 | 218 | 219 | 220 | 221 | 222 | 225 | 229 | 230 |
231 | 232 | 241 | 242 | 243 | -------------------------------------------------------------------------------- /doc/extras.NaiveRater-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras.NaiveRater 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module extras :: 43 | Class NaiveRater 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class NaiveRater

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 |
68 |
69 |
70 |

Rater subclass that just ranks single-word tags by their frequency and weight

72 | 73 | 74 | 75 | 77 | 78 | 89 | 90 | 91 | 108 | 109 | 110 | 117 | 118 |
79 | 80 | 81 | 82 | 86 | 87 |
Instance Methods
88 |
92 |   93 | 94 | 95 | 96 | 100 | 104 | 105 |
__call__(self, 97 | tags)
98 | Returns: 99 | a list of unique (multi)tags sorted by relevance
101 | source code 102 | 103 |
106 | 107 |
111 |

Inherited from tagger.Rater: 112 | __init__, 113 | create_multitags, 114 | rate_tags 115 |

116 |
119 | 120 | 121 | 123 | 124 | 135 | 136 |
125 | 126 | 127 | 128 | 132 | 133 |
Method Details
134 |
137 | 138 |
139 | 141 |
142 | 143 | 151 |
144 |

__call__(self, 145 | tags) 146 |
(Call operator) 147 |

148 |
source code  150 |
152 | 153 | 154 |
155 |
Parameters:
156 |
    157 |
  • tags - a list of (preferably stemmed) tags
  • 158 |
159 |
Returns:
160 |
a list of unique (multi)tags sorted by relevance
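A minimal sketch of the idea behind a naive rater follows. It assumes that "frequency and weight" are combined as relative frequency multiplied by a dictionary weight; that combination, the function name `naive_rate`, and the `default_weight` parameter are all assumptions made for illustration, not the library's documented behaviour.

```python
from collections import Counter


def naive_rate(stems, weights, default_weight=1.0):
    """Rank unique single-word stems by (relative frequency * weight), best first."""
    if not stems:
        return []
    counts = Counter(stems)
    total = float(len(stems))
    scores = {
        stem: (count / total) * weights.get(stem, default_weight)
        for stem, count in counts.items()
    }
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda stem: (-scores[stem], stem))


ranked = naive_rate(
    ["raid", "raid", "obama", "the", "the", "the"],
    {"raid": 2.0, "obama": 3.0, "the": 0.0},
)
```

A weight of zero for a stopword such as "the" pushes it to the bottom of the ranking regardless of how often it occurs.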
161 |
Overrides: 162 | tagger.Rater.__call__ 163 |
(inherited documentation)
164 | 165 |
166 |
167 |
168 |
169 | 170 | 172 | 173 | 174 | 175 | 177 | 178 | 179 | 181 | 182 | 183 | 185 | 186 | 187 | 192 | 193 | 194 | 195 | 196 | 199 | 203 | 204 |
205 | 206 | 215 | 216 | 217 | -------------------------------------------------------------------------------- /doc/extras.SimpleReader-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras.SimpleReader 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module extras :: 43 | Class SimpleReader 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class SimpleReader

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 |
68 |
69 |
70 |

Reader subclass that doesn't perform any advanced analysis of the 71 | text

72 | 73 | 74 | 75 | 77 | 78 | 89 | 90 | 91 | 108 | 109 | 110 | 115 | 116 |
79 | 80 | 81 | 82 | 86 | 87 |
Instance Methods
88 |
92 |   93 | 94 | 95 | 96 | 100 | 104 | 105 |
__call__(self, 97 | text)
98 | Returns: 99 | a list of tags respecting the order in the text
101 | source code 102 | 103 |
106 | 107 |
111 |

Inherited from tagger.Reader: 112 | preprocess 113 |

114 |
117 | 118 | 119 | 121 | 122 | 133 | 134 | 135 | 143 | 144 |
123 | 124 | 125 | 126 | 130 | 131 |
Class Variables
132 |
136 |

Inherited from tagger.Reader: 137 | match_apostrophes, 138 | match_paragraphs, 139 | match_phrases, 140 | match_words 141 |

142 |
145 | 146 | 147 | 149 | 150 | 161 | 162 |
151 | 152 | 153 | 154 | 158 | 159 |
Method Details
160 |
163 | 164 |
165 | 167 |
168 | 169 | 177 |
170 |

__call__(self, 171 | text) 172 |
(Call operator) 173 |

174 |
source code  176 |
178 | 179 | 180 |
181 |
Parameters:
182 |
    183 |
  • text - the string of text to be tagged
  • 184 |
185 |
Returns:
186 |
a list of tags respecting the order in the text
187 |
Overrides: 188 | tagger.Reader.__call__ 189 |
(inherited documentation)
190 | 191 |
192 |
193 |
194 |
195 | 196 | 198 | 199 | 200 | 201 | 203 | 204 | 205 | 207 | 208 | 209 | 211 | 212 | 213 | 218 | 219 | 220 | 221 | 222 | 225 | 229 | 230 |
231 | 232 | 241 | 242 | 243 | -------------------------------------------------------------------------------- /doc/extras.UnicodeReader-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | extras.UnicodeReader 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module extras :: 43 | Class UnicodeReader 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class UnicodeReader

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 | 68 |
69 |
70 |
71 |

Reader subclass that converts Unicode strings to a close ASCII 72 | representation

73 | 74 | 75 | 76 | 78 | 79 | 90 | 91 | 92 | 109 | 110 | 111 | 116 | 117 |
80 | 81 | 82 | 83 | 87 | 88 |
Instance Methods
89 |
93 |   94 | 95 | 96 | 97 | 101 | 105 | 106 |
__call__(self, 98 | text)
99 | Returns: 100 | a list of tags respecting the order in the text
102 | source code 103 | 104 |
107 | 108 |
112 |

Inherited from tagger.Reader: 113 | preprocess 114 |

115 |
118 | 119 | 120 | 122 | 123 | 134 | 135 | 136 | 144 | 145 |
124 | 125 | 126 | 127 | 131 | 132 |
Class Variables
133 |
137 |

Inherited from tagger.Reader: 138 | match_apostrophes, 139 | match_paragraphs, 140 | match_phrases, 141 | match_words 142 |

143 |
146 | 147 | 148 | 150 | 151 | 162 | 163 |
152 | 153 | 154 | 155 | 159 | 160 |
Method Details
161 |
164 | 165 |
166 | 168 |
169 | 170 | 178 |
171 |

__call__(self, 172 | text) 173 |
(Call operator) 174 |

175 |
source code  177 |
179 | 180 | 181 |
182 |
Parameters:
183 |
    184 |
  • text - the string of text to be tagged
  • 185 |
186 |
Returns:
187 |
a list of tags respecting the order in the text
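One common way to obtain a "close ASCII representation" of Unicode text is to decompose accented characters and drop the combining marks. The sketch below shows that technique with the standard library; whether `UnicodeReader` works this way is an assumption, and the helper name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate a Unicode string with ASCII by stripping diacritics."""
    # NFKD splits e.g. U+00E9 into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops everything that is not ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


plain = to_ascii(u"caf\u00e9 in Z\u00fcrich")
```

Characters that have no ASCII decomposition are silently dropped by this approach, so it suits tag extraction better than text that must round-trip.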
188 |
Overrides: 189 | tagger.Reader.__call__ 190 |
(inherited documentation)
191 | 192 |
193 |
194 |
195 |
196 | 197 | 199 | 200 | 201 | 202 | 204 | 205 | 206 | 208 | 209 | 210 | 212 | 213 | 214 | 219 | 220 | 221 | 222 | 223 | 226 | 230 | 231 |
232 | 233 | 242 | 243 | 244 | -------------------------------------------------------------------------------- /doc/frames.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger 7 | 8 | 9 | 10 | 12 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /doc/help.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | Help 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 41 | 52 | 53 |
  42 | 43 | 44 | 46 | 50 |
51 |
54 | 55 |

API Documentation

56 | 57 |

This document contains the API (Application Programming Interface) 58 | documentation for tagger. Documentation for the Python 59 | objects defined by the project is divided into separate pages for each 60 | package, module, and class. The API documentation also includes two 61 | pages containing information about the project as a whole: a trees 62 | page, and an index page.

63 | 64 |

Object Documentation

65 | 66 |

Each Package Documentation page contains:

67 |
    68 |
  • A description of the package.
  • 69 |
  • A list of the modules and sub-packages contained by the 70 | package.
  • 71 |
  • A summary of the classes defined by the package.
  • 72 |
  • A summary of the functions defined by the package.
  • 73 |
  • A summary of the variables defined by the package.
  • 74 |
  • A detailed description of each function defined by the 75 | package.
  • 76 |
  • A detailed description of each variable defined by the 77 | package.
  • 78 |
79 | 80 |

Each Module Documentation page contains:

81 |
    82 |
  • A description of the module.
  • 83 |
  • A summary of the classes defined by the module.
  • 84 |
  • A summary of the functions defined by the module.
  • 85 |
  • A summary of the variables defined by the module.
  • 86 |
  • A detailed description of each function defined by the 87 | module.
  • 88 |
  • A detailed description of each variable defined by the 89 | module.
  • 90 |
91 | 92 |

Each Class Documentation page contains:

93 |
    94 |
  • A class inheritance diagram.
  • 95 |
  • A list of known subclasses.
  • 96 |
  • A description of the class.
  • 97 |
  • A summary of the methods defined by the class.
  • 98 |
  • A summary of the instance variables defined by the class.
  • 99 |
  • A summary of the class (static) variables defined by the 100 | class.
  • 101 |
  • A detailed description of each method defined by the 102 | class.
  • 103 |
  • A detailed description of each instance variable defined by the 104 | class.
  • 105 |
  • A detailed description of each class (static) variable defined 106 | by the class.
  • 107 |
108 | 109 |

Project Documentation

110 | 111 |

The Trees page contains the module and class hierarchies:

112 |
    113 |
  • The module hierarchy lists every package and module, with 114 | modules grouped into packages. At the top level, and within each 115 | package, modules and sub-packages are listed alphabetically.
  • 116 |
  • The class hierarchy lists every class, grouped by base 117 | class. If a class has more than one base class, then it will be 118 | listed under each base class. At the top level, and under each base 119 | class, classes are listed alphabetically.
  • 120 |
121 | 122 |

The Index page contains indices of terms and 123 | identifiers:

124 |
    125 |
  • The term index lists every term indexed by any object's 126 | documentation. For each term, the index provides links to each 127 | place where the term is indexed.
  • 128 |
  • The identifier index lists the (short) name of every package, 129 | module, class, method, function, variable, and parameter. For each 130 | identifier, the index provides a short description, and a link to 131 | its documentation.
  • 132 |
133 | 134 |

The Table of Contents

135 | 136 |

The table of contents occupies the two frames on the left side of 137 | the window. The upper-left frame displays the project 138 | contents, and the lower-left frame displays the module 139 | contents:

140 | 141 | 142 | 143 | 145 | 148 | 149 | 150 | 153 | 154 |
[Diagram: the Project Contents frame (upper left) and the Module Contents frame (lower left) sit beside the API Documentation frame.]

155 | 156 |

The project contents frame contains a list of all packages 157 | and modules that are defined by the project. Clicking on an entry 158 | will display its contents in the module contents frame. Clicking on a 159 | special entry, labeled "Everything," will display the contents of 160 | the entire project.

161 | 162 |

The module contents frame contains a list of every 163 | submodule, class, type, exception, function, and variable defined by a 164 | module or package. Clicking on an entry will display its 165 | documentation in the API documentation frame. Clicking on the name of 166 | the module, at the top of the frame, will display the documentation 167 | for the module itself.

168 | 169 |

The "frames" and "no frames" buttons below the top 170 | navigation bar can be used to control whether the table of contents is 171 | displayed or not.

172 | 173 |

The Navigation Bar

174 | 175 |

A navigation bar is located at the top and bottom of every page. It indicates what type of page you are currently viewing, and allows you to go to related pages. The following table describes the labels on the navigation bar. Note that some labels (such as [Parent]) are not displayed on all pages.

180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 194 | 195 | 196 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 |
Label | Highlighted when... | Links to...
[Parent] | (never highlighted) | the parent of the current package
[Package] | viewing a package | the package containing the current object
[Module] | viewing a module | the module containing the current object
[Class] | viewing a class | the class containing the current object
[Trees] | viewing the trees page | the trees page
[Index] | viewing the index page | the index page
[Help] | viewing the help page | the help page
211 | 212 |

The "show private" and "hide private" buttons below 213 | the top navigation bar can be used to control whether documentation 214 | for private objects is displayed. Private objects are usually defined 215 | as objects whose (short) names begin with a single underscore, but do 216 | not end with an underscore. For example, "_x", 217 | "__pprint", and "epydoc.epytext._tokenize" 218 | are private objects; but "re.sub", 219 | "__init__", and "type_" are not. However, 220 | if a module defines the "__all__" variable, then its 221 | contents are used to decide which objects are private.
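The naming rule described above can be written as a short predicate. This is a sketch of the rule as stated in the paragraph, not epydoc's own code: the function name `is_private` is hypothetical, and the `__all__` override is deliberately not modelled.

```python
def is_private(dotted_name):
    """Return True if the short name starts with '_' and does not end with '_'."""
    # The short name is the last component of a dotted name.
    short_name = dotted_name.rsplit(".", 1)[-1]
    return short_name.startswith("_") and not short_name.endswith("_")


examples = {
    name: is_private(name)
    for name in ["_x", "__pprint", "epydoc.epytext._tokenize",
                 "re.sub", "__init__", "type_"]
}
```

The six names are the examples given in the paragraph, so the results can be checked directly against the text.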

222 | 223 |

A timestamp below the bottom navigation bar indicates when each 224 | page was last updated.

225 | 226 | 228 | 229 | 230 | 231 | 233 | 234 | 235 | 237 | 238 | 239 | 241 | 242 | 243 | 248 | 249 | 250 | 251 | 252 | 255 | 259 | 260 |
261 | 262 | 271 | 272 | 273 | -------------------------------------------------------------------------------- /doc/index.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger 7 | 8 | 9 | 10 | 12 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /doc/module-tree.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | Module Hierarchy 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 41 | 52 | 53 |
  42 | 43 | 44 | 46 | 50 |
51 |
54 |
55 | [ Module Hierarchy 56 | | Class Hierarchy ] 57 |

58 |

Module Hierarchy

59 |
    60 |
  • build_dict: Usage: build_dict.py -o <output file> -s <stopwords 61 | file> <list of files>
  • 62 |
  • extras
  • 63 |
  • tagger: ====== tagger ======
  • 64 |
65 | 66 | 68 | 69 | 70 | 71 | 73 | 74 | 75 | 77 | 78 | 79 | 81 | 82 | 83 | 88 | 89 | 90 | 91 | 92 | 95 | 99 | 100 |
101 | 102 | 111 | 112 | 113 | -------------------------------------------------------------------------------- /doc/redirect.html: -------------------------------------------------------------------------------- 1 | Epydoc Redirect Page 2 | 3 | 4 | 5 | 6 | 7 | 8 | 18 | 19 |

Epydoc Auto-redirect page

20 | 21 |

When javascript is enabled, this page will redirect URLs of 22 | the form redirect.html#dotted.name to the 23 | documentation for the object with the given fully-qualified 24 | dotted name.
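The lookup performed by the redirect page can be sketched in Python as a port of the `redirect_url` JavaScript function from `epydoc.js`. The `pages` list below is hypothetical sample data; in the generated page, each entry is a page name followed by `-m` (module) or `-c` (class).

```python
def redirect_url(dotted_name, pages):
    """Map a dotted name to a documentation URL, or None if nothing matches."""
    for page in pages:
        # Strip the two-character "-m" / "-c" suffix to get the page name.
        pagename, pagetype = page[:-2], page[-1]
        if dotted_name.startswith(pagename):
            suffix = "-module.html" if pagetype == "m" else "-class.html"
            url = pagename + suffix
            # Whatever follows the page name (after the dot) becomes the anchor.
            if len(dotted_name) > len(pagename):
                url += "#" + dotted_name[len(pagename) + 1:]
            return url
    return None


# Hypothetical page list; more specific names come first because the
# first prefix match wins.
pages = ["tagger.Tagger-c", "tagger-m"]
method_url = redirect_url("tagger.Tagger.__call__", pages)
module_url = redirect_url("tagger", pages)
```

Because the match is a plain prefix comparison, the order of `pages` matters: a module entry listed before its classes would capture every name inside it.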

25 |

 

26 | 27 | 36 | 37 | 38 | 39 | -------------------------------------------------------------------------------- /doc/tagger-module.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 45 | 56 | 57 |
41 | 42 | Module tagger 43 | 44 | 46 | 47 | 48 | 50 | 54 |
55 |
58 | 59 |

Module tagger

source code

60 |

====== tagger ======

61 |

Module for extracting tags from text documents.

62 |

Copyright (C) 2011 by Alessandro Presta

63 |

Configuration

64 |

Dependencies: python2.7, stemming, nltk (optional), lxml (optional), 65 | tkinter (optional)

66 |

You can install the stemming package with:

67 |
 68 |    $ easy_install stemming
 69 | 
70 |

Usage

71 |

Tagging a text document from Python:

72 |
   import pickle
   import tagger
   with open('data/dict.pkl', 'rb') as f:
       weights = pickle.load(f)  # or your own dictionary
   myreader = tagger.Reader()  # or your own reader class
   mystemmer = tagger.Stemmer()  # or your own stemmer class
   myrater = tagger.Rater(weights)  # or your own... (you get the idea)
   mytagger = tagger.Tagger(myreader, mystemmer, myrater)
   best_3_tags = mytagger(text_string, 3)
 80 | 
81 |

Running the module as a script:

82 |
 83 |    $ ./tagger.py <text document(s) to tag>
 84 | 
85 |

Example:

86 |
 87 |    $ ./tagger.py tests/*
 88 |    Loading dictionary... 
 89 |    Tags for  tests/bbc1.txt :
 90 |    ['bin laden', 'obama', 'pakistan', 'killed', 'raid']
 91 |    Tags for  tests/bbc2.txt :
 92 |    ['jo yeates', 'bristol', 'vincent tabak', 'murder', 'strangled']
 93 |    Tags for  tests/bbc3.txt :
 94 |    ['snp', 'party', 'election', 'scottish', 'labour']
 95 |    Tags for  tests/guardian1.txt :
 96 |    ['bin laden', 'al-qaida', 'killed', 'pakistan', 'al-fawwaz']
 97 |    Tags for  tests/guardian2.txt :
 98 |    ['clegg', 'tory', 'lib dem', 'party', 'coalition']
 99 |    Tags for  tests/post1.txt :
100 |    ['sony', 'stolen', 'playstation network', 'hacker attack', 'lawsuit']
101 |    Tags for  tests/wikipedia1.txt :
102 |    ['universe', 'anthropic principle', 'observed', 'cosmological', 'theory']
103 |    Tags for  tests/wikipedia2.txt :
104 |    ['beetroot', 'beet', 'betaine', 'blood pressure', 'dietary nitrate']
105 |    Tags for  tests/wikipedia3.txt :
106 |    ['the lounge lizards', 'jazz', 'john lurie', 'musical', 'albums']
107 | 
108 | 109 | 110 | 111 | 113 | 114 | 125 | 126 | 127 | 133 | 134 | 135 | 142 | 143 | 144 | 150 | 151 | 152 | 158 | 159 | 160 | 166 | 167 | 168 | 174 | 175 |
115 | 116 | 117 | 118 | 122 | 123 |
Classes
124 |
128 |   129 | 130 | Tag
131 | General class for tags (small units of text) 132 |
136 |   137 | 138 | MultiTag
139 | Class for aggregates of tags (usually next to each other in the 140 | document) 141 |
145 |   146 | 147 | Reader
148 | Class for parsing a string of text to obtain tags 149 |
153 |   154 | 155 | Stemmer
156 | Class for extracting the stem of a word 157 |
161 |   162 | 163 | Rater
164 | Class for estimating the relevance of tags 165 |
169 |   170 | 171 | Tagger
172 | Master class for tagging text documents 173 |
176 | 177 | 178 | 180 | 181 | 192 | 193 | 194 | 199 | 200 |
182 | 183 | 184 | 185 | 189 | 190 |
Variables
191 |
195 |   196 | 197 | __package__ = None 198 |
201 | 202 | 204 | 205 | 206 | 207 | 209 | 210 | 211 | 213 | 214 | 215 | 217 | 218 | 219 | 224 | 225 | 226 | 227 | 228 | 231 | 235 | 236 |
237 | 238 | 247 | 248 | 249 | -------------------------------------------------------------------------------- /doc/tagger.MultiTag-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger.MultiTag 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module tagger :: 43 | Class MultiTag 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class MultiTag

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 |
68 |
69 |
70 |

Class for aggregates of tags (usually next to each other in the 71 | document)

72 | 73 | 74 | 75 | 77 | 78 | 89 | 90 | 91 | 110 | 111 | 112 | 128 | 129 | 130 | 138 | 139 |
79 | 80 | 81 | 82 | 86 | 87 |
Instance Methods
88 |
92 |   93 | 94 | 95 | 96 | 102 | 106 | 107 |
__init__(self, 97 | tail, 98 | head=None)
99 | Returns: 100 | a new MultiTag 101 | object
103 | source code 104 | 105 |
108 | 109 |
113 |   114 | 115 | 116 | 117 | 120 | 124 | 125 |
combined_rating(self)
118 | Method that computes the multitag's rating from the ratings of unit 119 | subtags
121 | source code 122 | 123 |
126 | 127 |
131 |

Inherited from Tag: 132 | __eq__, 133 | __hash__, 134 | __lt__, 135 | __repr__ 136 |

137 |
140 | 141 | 142 | 144 | 145 | 156 | 157 |
146 | 147 | 148 | 149 | 153 | 154 |
Method Details
155 |
158 | 159 |
160 | 162 |
163 | 164 | 173 |
165 |

__init__(self, 166 | tail, 167 | head=None) 168 |
(Constructor) 169 |

170 |
source code  172 |
174 | 175 | 176 |
177 |
Parameters:
178 |
    179 |
  • tail - the Tag object 180 | to add to the first part (head)
  • 181 |
  • head - the (possibly absent) MultiTag to be extended
  • 183 |
184 |
Returns:
185 |
a new MultiTag object
187 |
Overrides: 188 | Tag.__init__ 189 |
190 |
191 |
192 |
193 | 194 |
195 | 197 |
198 | 199 | 205 |
200 |

combined_rating(self) 201 |

202 |
source code  204 |
206 | 207 |

Method that computes the multitag's rating from the ratings of unit 208 | subtags

209 |

(the default implementation uses the geometric mean - with a special 210 | treatment for proper nouns - but this method can be overridden)

211 |
212 |
Returns:
213 |
the rating of the multitag
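The geometric mean mentioned above can be sketched as follows. The special treatment for proper nouns is not described in this documentation, so it is left out; the function name `geometric_mean` is hypothetical and the code is an illustration of the averaging step only.

```python
import math


def geometric_mean(ratings):
    """Geometric mean of unit-tag ratings, each assumed to lie in [0, 1]."""
    if not ratings:
        raise ValueError("at least one rating is required")
    # A single zero rating forces the whole product, and so the mean, to zero.
    if any(r <= 0 for r in ratings):
        return 0.0
    # Averaging logarithms avoids underflow when many small ratings are multiplied.
    return math.exp(sum(math.log(r) for r in ratings) / len(ratings))


combined = geometric_mean([0.25, 1.0])
```

Unlike an arithmetic mean, the geometric mean is pulled strongly toward the weakest unit tag, so a multitag containing one irrelevant word receives a low rating.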
214 |
215 |
216 |
217 |
218 | 219 | 221 | 222 | 223 | 224 | 226 | 227 | 228 | 230 | 231 | 232 | 234 | 235 | 236 | 241 | 242 | 243 | 244 | 245 | 248 | 252 | 253 |
254 | 255 | 264 | 265 | 266 | -------------------------------------------------------------------------------- /doc/tagger.Tag-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger.Tag 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module tagger :: 43 | Class Tag 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class Tag

source code

61 |
62 |
63 | 64 | 65 | 66 | 67 |
68 |
69 |
70 |

General class for tags (small units of text)

71 | 72 | 73 | 74 | 76 | 77 | 88 | 89 | 90 | 111 | 112 | 113 | 128 | 129 | 130 | 144 | 145 | 146 | 161 | 162 | 163 | 177 | 178 |
78 | 79 | 80 | 81 | 85 | 86 |
Instance Methods
87 |
91 |   92 | 93 | 94 | 95 | 103 | 107 | 108 |
__init__(self, 96 | string, 97 | stem=None, 98 | rating=1.0, 99 | proper=False, 100 | terminal=False)
101 | Returns: 102 | a new Tag object
104 | source code 105 | 106 |
109 | 110 |
114 |   115 | 116 | 117 | 118 | 120 | 124 | 125 |
__eq__(self, 119 | other) 121 | source code 122 | 123 |
126 | 127 |
131 |   132 | 133 | 134 | 135 | 136 | 140 | 141 |
__repr__(self) 137 | source code 138 | 139 |
142 | 143 |
147 |   148 | 149 | 150 | 151 | 153 | 157 | 158 |
__lt__(self, 152 | other) 154 | source code 155 | 156 |
159 | 160 |
164 |   165 | 166 | 167 | 168 | 169 | 173 | 174 |
__hash__(self) 170 | source code 171 | 172 |
175 | 176 |
179 | 180 | 181 | 183 | 184 | 195 | 196 |
185 | 186 | 187 | 188 | 192 | 193 |
Method Details
194 |
197 | 198 |
199 | 201 |
202 | 203 | 215 |
204 |

__init__(self, 205 | string, 206 | stem=None, 207 | rating=1.0, 208 | proper=False, 209 | terminal=False) 210 |
(Constructor) 211 |

212 |
source code  214 |
216 | 217 | 218 |
219 |
Parameters:
220 |
    221 |
  • string - the actual representation of the tag
  • 222 |
  • stem - the internal (usually stemmed) representation; tags with the same 223 | stem are regarded as equal
  • 224 |
  • rating - a measure of the tag's relevance in the interval [0,1]
  • 225 |
  • proper - whether the tag is a proper noun
  • 226 |
  • terminal - set to True if the tag is at the end of a phrase (or otherwise cannot be logically merged with the following one)
  • 228 |
229 |
Returns:
230 |
a new Tag object
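The statement that tags with the same stem are regarded as equal implies that equality and hashing are both based on the stem. The class below is a hypothetical stand-in named `SketchTag`, not the library's `Tag`; in particular, falling back to `string` when no stem is given is an assumption.

```python
class SketchTag(object):
    """Illustrative tag whose identity is determined by its stem."""

    def __init__(self, string, stem=None, rating=1.0, proper=False, terminal=False):
        self.string = string
        self.stem = stem or string  # assumption: default the stem to the string
        self.rating = rating
        self.proper = proper
        self.terminal = terminal

    def __eq__(self, other):
        return self.stem == other.stem

    def __hash__(self):
        # Must agree with __eq__ so equal tags collapse in sets and dict keys.
        return hash(self.stem)

    def __repr__(self):
        return repr(self.string)


a = SketchTag("running", stem="run")
b = SketchTag("runs", stem="run")
unique = {a, b}
```

Keeping `__hash__` consistent with `__eq__` is what lets a rater deduplicate tags simply by putting them in a set or using them as dictionary keys.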
231 |
232 |
233 |
234 |
235 | 236 | 238 | 239 | 240 | 241 | 243 | 244 | 245 | 247 | 248 | 249 | 251 | 252 | 253 | 258 | 259 | 260 | 261 | 262 | 265 | 269 | 270 |
271 | 272 | 281 | 282 | 283 | -------------------------------------------------------------------------------- /doc/tagger.Tagger-class.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | tagger.Tagger 7 | 8 | 9 | 10 | 11 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 23 | 25 | 26 | 27 | 29 | 30 | 31 | 36 | 37 | 38 | 39 | 40 | 46 | 57 | 58 |
41 | 42 | Module tagger :: 43 | Class Tagger 44 | 45 | 47 | 48 | 49 | 51 | 55 |
56 |
59 | 60 |

Class Tagger

source code

61 |

Master class for tagging text documents

62 |

(this is a simple interface that should allow convenient 63 | experimentation by using different classes as building blocks)

64 | 65 | 66 | 67 | 69 | 70 | 81 | 82 | 83 | 103 | 104 | 105 | 121 | 122 |
71 | 72 | 73 | 74 | 78 | 79 |
Instance Methods
80 |
84 |   85 | 86 | 87 | 88 | 95 | 99 | 100 |
__init__(self, 89 | reader, 90 | stemmer, 91 | rater)
92 | Returns: 93 | a new Tagger 94 | object
96 | source code 97 | 98 |
101 | 102 |
106 |   107 | 108 | 109 | 110 | 113 | 117 | 118 |
__call__(self, 111 | text, 112 | tags_number=5) 114 | source code 115 | 116 |
119 | 120 |
123 | 124 | 125 | 127 | 128 | 139 | 140 |
129 | 130 | 131 | 132 | 136 | 137 |
Method Details
138 |
141 | 142 |
143 | 145 |
146 | 147 | 157 |
148 |

__init__(self, 149 | reader, 150 | stemmer, 151 | rater) 152 |
(Constructor) 153 |

154 |
source code  156 |
158 | 159 | 160 |
161 |
Parameters:
162 |
    163 |
  • reader - a Reader 164 | object
  • 165 |
  • stemmer - a Stemmer 166 | object
  • 167 |
  • rater - a Rater object
  • 168 |
169 |
Returns:
170 |
a new Tagger 171 | object
172 |
173 |
174 |
175 | 176 |
177 | 179 |
180 | 181 | 190 |
182 |

__call__(self, 183 | text, 184 | tags_number=5) 185 |
(Call operator) 186 |

187 |
source code  189 |
191 | 192 | 193 |
194 |
Parameters:
195 |
    196 |
  • text - the string of text to be tagged
  • 197 |
  • tags_number - number of best tags to be returned 198 |

    Returns: a list of (hopefully) relevant tags

  • 199 |
200 |
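The building-block design can be sketched with three toy callables wired into a pipeline. Everything here is hypothetical: `ToyTagger`, `toy_reader`, `toy_stemmer` and `toy_rater` only mimic the reader, stemmer and rater roles, and the real classes may pass richer tag objects between stages.

```python
from collections import Counter


class ToyTagger(object):
    """Chains a reader, a stemmer and a rater, mirroring the documented interface."""

    def __init__(self, reader, stemmer, rater):
        self.reader = reader
        self.stemmer = stemmer
        self.rater = rater

    def __call__(self, text, tags_number=5):
        words = self.reader(text)                   # split the text into words
        stems = [self.stemmer(w) for w in words]    # group word variants together
        return self.rater(stems)[:tags_number]      # keep the best-rated tags


def toy_reader(text):
    return text.lower().split()


def toy_stemmer(word):
    return word.rstrip("s")  # crude plural stripping, for illustration only


def toy_rater(stems):
    counts = Counter(stems)
    return sorted(counts, key=lambda s: (-counts[s], s))


best = ToyTagger(toy_reader, toy_stemmer, toy_rater)("Beets beet beetroot jazz", 2)
```

Because each stage is just a callable, any one of them can be swapped for a different implementation without touching the other two.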
201 |
202 |
203 |
204 | 205 | 207 | 208 | 209 | 210 | 212 | 213 | 214 | 216 | 217 | 218 | 220 | 221 | 222 | 227 | 228 | 229 | 230 | 231 | 234 | 238 | 239 |
240 | 241 | 250 | 251 | 252 | -------------------------------------------------------------------------------- /doc/toc-build_dict-module.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | build_dict 7 | 8 | 9 | 10 | 11 | 13 |

Module build_dict

14 |
15 |

Functions

16 | build_dict
build_dict_from_files

Variables

19 | __package__

21 | [hide private] 23 | 24 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /doc/toc-everything.html: -------------------------------------------------------------------------------- 1 | 2 | 4 | 5 | 6 | Everything 7 | 8 | 9 | 10 | 11 | 13 |

Everything

14 |
15 |

All Classes

16 | extras.FastStemmer
extras.HTMLReader
extras.NaiveRater
extras.SimpleReader
extras.UnicodeReader
tagger.MultiTag
tagger.Rater
tagger.Reader
tagger.Stemmer
tagger.Tag
tagger.Tagger

All Functions

28 | build_dict.build_dict
build_dict.build_dict_from_files
extras.build_dict_from_nltk

All Variables

32 | build_dict.__package__
extras.__package__
tagger.__package__

-------------------------------------------------------------------------------- /doc/toc-extras-module.html: --------------------------------------------------------------------------------

Module extras

Classes

FastStemmer
HTMLReader
NaiveRater
SimpleReader
UnicodeReader

Functions

build_dict_from_nltk

Variables

__package__

-------------------------------------------------------------------------------- /doc/toc-tagger-module.html: --------------------------------------------------------------------------------

Module tagger

Classes

MultiTag
Rater
Reader
Stemmer
Tag
Tagger

Variables

__package__

-------------------------------------------------------------------------------- /doc/toc.html: --------------------------------------------------------------------------------

Table of Contents

Everything

Modules

build_dict
extras
tagger

-------------------------------------------------------------------------------- /extras.py: --------------------------------------------------------------------------------
# Copyright (C) 2011 by Alessandro Presta

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE


from tagger import *


class UnicodeReader(Reader):
    '''
    Reader subclass that converts Unicode strings to a close ASCII
    representation
    '''

    def __call__(self, text):
        import unicodedata

        # Python 2 only: unicode() is not a builtin in Python 3
        text = unicode(text)
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
        return Reader.__call__(self, text)


class HTMLReader(UnicodeReader):
    '''
    Reader subclass that can parse HTML code from the input
    '''

    def __call__(self, html):
        import lxml.html

        text = lxml.html.fromstring(html).text_content().encode('utf-8')
        return UnicodeReader.__call__(self, text)


class SimpleReader(Reader):
    '''
    Reader subclass that doesn't perform any advanced analysis of the text
    '''

    def __call__(self, text):
        text = text.lower()
        text = self.preprocess(text)
        words = self.match_words.findall(text)
        tags = [Tag(w) for w in words]
        return tags


class FastStemmer(Stemmer):
    '''
    Stemmer subclass that uses a much faster, but less correct algorithm
    '''

    def __init__(self):
        from stemming import porter

        Stemmer.__init__(self, porter)


class NaiveRater(Rater):
    '''
    Rater subclass that just ranks single-word tags by their frequency and
    weight
    '''

    def __call__(self, tags):
        self.rate_tags(tags)
        # we still get rid of one-character tags
        unique_tags = set(t for t in tags if len(t.string) > 1)
        return sorted(unique_tags)


def build_dict_from_nltk(output_file, corpus=None, stopwords=None,
                         stemmer=Stemmer(), measure='IDF', verbose=False):
    '''
    @param output_file: the binary stream where the dictionary should be saved
    @param corpus:      the NLTK corpus to use (defaults to nltk.corpus.reuters)
    @param stopwords:   a list of (not stemmed) stopwords (defaults to
                        nltk.corpus.reuters.words('stopwords'))
    @param stemmer:     the L{Stemmer} object to be used
    @param measure:     the measure used to compute the weights ('IDF'
                        i.e. 'inverse document frequency' or 'ICF' i.e.
                        'inverse collection frequency'; defaults to 'IDF')
    @param verbose:     whether information on the progress should be printed
                        on screen
    '''

    from build_dict import build_dict
    import nltk
    import pickle

    if not (corpus and stopwords):
        nltk.download('reuters')

    corpus = corpus or nltk.corpus.reuters
    stopwords = stopwords or nltk.corpus.reuters.words('stopwords')

    corpus_list = []

    if verbose: print 'Processing corpus...'
    for file in corpus.fileids():
        doc = [stemmer(Tag(w.lower())).stem for w in corpus.words(file)
               if w[0].isalpha()]
        corpus_list.append(doc)

    if verbose: print 'Processing stopwords...'
    stopwords = [stemmer(Tag(w.lower())).stem for w in stopwords]

    if verbose: print 'Building dictionary... '
    dictionary = build_dict(corpus_list, stopwords, measure)
    # -1 selects the highest pickle protocol available
    pickle.dump(dictionary, output_file, -1)

-------------------------------------------------------------------------------- /test_ui.py: --------------------------------------------------------------------------------
#!/usr/bin/env python

from Tkinter import *
import tkMessageBox
import ScrolledText

from tagger import *
from extras import UnicodeReader

import pickle

with open('data/dict.pkl', 'rb') as f:
    weights = pickle.load(f)
tagger = Tagger(UnicodeReader(), Stemmer(), Rater(weights))

top = Tk()
top.title('tagger')

st = ScrolledText.ScrolledText(top)
st.pack()

def tag():
    tags = tagger(st.get(1.0, END))
    output = ', '.join(t.string for t in tags)
    tkMessageBox.showinfo('Tags:', output)
    st.delete(1.0, END)

b = Button(top, text='TAG!', command=tag)

b.pack()
top.mainloop()

-------------------------------------------------------------------------------- /tests/bbc1.txt: -------------------------------------------------------------------------------- 1 | Obama lays wreath at Ground Zero 2 | US President Barack Obama is visiting Ground Zero, the site of the 9/11 attacks in New York, four days after US forces killed al-Qaeda head Osama Bin Laden in Pakistan. 3 | Bin Laden was believed to be the mastermind of the 9/11 attacks in 2001. 4 | Mr Obama laid a wreath in memory of the nearly 3,000 victims and spoke to relatives at the site. 5 | He earlier told New York firefighters: "When we say we will never forget, we mean what we say." 6 | The visit comes a day after the US president said graphic photographs of Bin Laden's dead body would not be made public.
7 | The al-Qaeda leader was killed by US special forces in northern Pakistan on Monday. His body was then buried at sea from a US aircraft carrier. 8 | The Pakistani military on Thursday admitted "shortcomings" for failing to locate Bin Laden and has said it will launch an investigation. 9 | But it also warned it would review co-operation with the US if there were any more unilateral raids such as the one that killed Bin Laden. 10 | Moment of reflection 11 | Mr Obama's first stop in New York was a fire station in midtown Manhattan. He told firefighters: "We are going to make sure that the perpetrators of that horrible act will see justice." 12 | Mr Obama met families of the victims at the site of Ground Zero, where he laid a wreath in red, white and blue. 13 | He made no public comments at the scene. 14 | The BBC's Barbara Plett, in New York, says the Obama administration is very sensitive to accusations that the president is politicising his visit. 15 | Thousands of people gathered at Ground Zero on Sunday night, waving flags and climbing street signs, as the news emerged that Bin Laden had been killed. 16 | On Monday, Mr Obama said he had made it his top national security priority to find Bin Laden. 17 | Ground Zero is now a building site, with construction scheduled for completion in 2013. 18 | As well as several office towers, the area will also house the National September 11 Memorial & Museum, which comprises a museum, waterfalls and a park. 19 | Mr Obama has decided not to publish photos of Bin Laden's body, saying the images could pose a national security risk. 20 | "It is important for us to make sure that very graphic photos of somebody who was shot in the head are not floating around as an incitement to additional violence, as a propaganda tool. That's not who we are," Mr Obama said. 21 | Our correspondent says there are mixed feelings in New York about the decision not to publish the pictures. 
While some want proof that it was Bin Laden who was killed, for others, the photos would reopen painful memories. 22 | Meanwhile, White House officials have again changed their account of the raid on Bin Laden's compound. 23 | They told respected media outlets, including the New York Times, that only one individual - a courier for Bin Laden - fired at US special forces. He was killed at the start of the raid. 24 | The Al Qaeda leader was elsewhere in compound and unarmed when killed. 25 | As recently as Tuesday, the President's spokesman had spoken of a "highly volatile" fire-fight, lasting throughout the forty minute operation. And the White House had given the impression that an armed Osama Bin Laden had been killed in a shoot-out. 26 | Critics have raised concerns about the legality of the operation, after the US revised its account to acknowledge Bin Laden was unarmed when shot dead. 27 | But the US has said Bin Laden was a lawful military target, whose killing was "an act of national self-defence". 28 | Anger in Pakistan 29 | In Pakistan, there are continuing recriminations over the failure to arrest or locate Bin Laden. 30 | A senior Pakistani military official said one of Bin Laden's wives told investigators she had been living in the same room for five years, along with her husband. 31 | On Thursday, the head of Pakistan's diplomatic service, Salman Bashir, again dismissed allegations his country's secret services had links to al-Qaeda, and said the investigation into the presence of Bin Laden in Abbottabad would reveal what failures there were. 32 | Pakistan's army has long been seen as the most effective institution in an unstable country. However, Pakistani public opinion has been critical of the perceived violation of national sovereignty by the US raid. 
33 | Referring to the raid, Chief of Army Staff Gen Ashfaq Kayani was quoted by the Reuters news agency as saying: "Any similar action violating the sovereignty of Pakistan will warrant a review on the level of military/intelligence cooperation with the United States." 34 | The BBC's Syed Shoaib Hasan, in Rawalpindi, said Pakistan's forces again denied all prior knowledge of the raid and Bin Laden's whereabouts. 35 | However, he says there were some contradictions: the military claimed Pakistan contributed information leading to the capture of Bin Laden by providing details of phone calls made from Pakistan to Saudi Arabia. 36 | The calls - from a Saudi man - concerned financial transactions. The last one was traced to Bin Laden's compound, the military said. 37 | However, our correspondent in Rawalpindi says the military did not explain why, after this call, the compound was not raided or investigated. 38 | -------------------------------------------------------------------------------- /tests/bbc2.txt: -------------------------------------------------------------------------------- 1 | Jo Yeates: Vincent Tabak admits manslaughter 2 | A man charged with the murder of Jo Yeates, whose body was found near Bristol on Christmas Day, has admitted manslaughter but denied murder. 3 | Vincent Tabak, 33, is accused of killing the landscape architect who disappeared on 17 December after going for drinks with colleagues. 4 | The prosecution has refused to accept Tabak's manslaughter plea and a murder trial will go ahead on 4 October. 5 | Tabak, a Dutch national, lived next to Jo Yeates in Clifton, Bristol. 6 | He appeared via video link from HMP Long Lartin in Worcestershire. 7 | Vincent Tabak spoke only to confirm his name and to say he was content for proceedings to continue in English without an interpreter, before entering his pleas. 8 | He was remanded in custody to face trial at Bristol Crown Court. 
9 | Strangled 10 | Miss Yeates's parents David and Theresa were in court for the hearing. 11 | Their daughter, who grew up in Ampfield, Hampshire, spent the evening of 17 December in the Bristol Ram with colleagues before visiting Tesco Express to buy a pizza on her way home to the Clifton area of Bristol. 12 | She was reported missing by her boyfriend Greg Reardon on 19 December. 13 | Jo Yeates's snow-covered body was found in Longwood Lane, Failand, by dog walkers on Christmas Day, eight days after she was reported missing. 14 | A post-mortem test revealed she had been strangled. 15 | Her funeral took place at St Mark's Church in Hampshire, where she was christened, on 11 February. 16 | Her body was carried into the church in a wicker coffin in front of her parents, who said her death had been "traumatic". 17 | Miss Yeates's coffin was adorned with daffodils, small sunflowers and assorted other flowers. 18 | -------------------------------------------------------------------------------- /tests/bbc3.txt: -------------------------------------------------------------------------------- 1 | Scottish election: SNP profile 2 | 3 | By Andrew Black 4 | Political reporter, BBC Scotland 5 | Over the years, the world has seen the rise and fall of many single-issue groups and minor parties, yet only a handful go on to achieve their goals. 6 | The Scottish National Party is one of those. 7 | The story of the SNP is one of success and failure, highs and lows, rogues and visionaries - but, most of all, it's the story of a party which started life on the fringes and moved in to claim political success. 8 | Despite the party's turbulent history, it is now set to realise its vision for an independence referendum, after first emerging as the government of Scotland in 2007. 9 | The SNP's spring conference in March last year was a bittersweet one. 10 | The party was on a high, but it was also mourning the death, at 86, of one of its former leaders, Billy Wolfe. 
11 | This was the man who finally transformed the SNP into a serious party, guiding it to its greatest Westminster electoral success in 1974. 12 | The case for Scottish home rule goes right back to its unification with England in 1707. 13 | The view that the Scots who put their names to the Act of Union had been bribed, famously spurred Robert Burns to write: "We are bought and sold for English gold. Such a parcel of rogues in a nation." 14 | Many years later, the realisation that a pro-independence, election-fighting party was the way to go eventually led to the creation in 1934 of the Scottish National Party, through the amalgamation of the Scottish Party and the National Party of Scotland. 15 | Election-fighting 16 | But for years the SNP struggled to make an impact, party due to the on-going debate between those who wanted to concentrate on independence - the fundamentalists - and those who wanted to achieve it through policies such as devolution - the gradualists. 17 | The young Nationalist party's other problem was that, put simply, it just was not any good at fighting elections, because of its lack of funding, organisation and policies beyond independence. 18 | In its first test, the 1935 General Election, the SNP contested eight seats and won none. 19 | It was not until a decade later and the Motherwell and Wishaw by-election when the party finally got a break. 20 | When the contest was announced following the death of sitting Labour MP James Walker, the Nationalists sent in one of their up-and-comers, Robert McIntyre, to fight the seat. 21 | After standing largely on a platform of Labour failures in post-war reconstruction, the SNP took the seat with 50% of the vote, but lost it months later in the election. 22 | Even though this brief victory provided much excitement over what the party could achieve, it failed to make progress in subsequent elections and disquiet set in. 
23 | But it was this disquiet which forced the party to reorganise - a move which would help the SNP to one of its most famous achievements. 24 | The Hamilton by-election should have been a breeze for Labour, but, as the party's vote collapsed, the SNP's Winnie Ewing romped home on 46%, famously declaring: "Stop the world, Scotland wants to get on." 25 | The 1970s was the decade of boom and bust for the SNP. They failed to hang on in Hamilton, but 1970 brought the SNP its first UK election gain, in the Western Isles. 26 | That same decade also saw the beginnings of the party's "It's Scotland's Oil" strategy, which sought to demonstrate Scotland was seeing little direct benefit of the tax wealth brought by North Sea oil. 27 | More success followed in 1973, when Margo MacDonald, "the blonde bombshell" won the Glasgow Govan by-election and, the following year, an under-fire Tory government called an election, which it lost. 28 | The SNP gained six seats and retained the Western Isles, but lost Govan - however, there were to be further gains. 29 | With Labour in power as a minority government, the party had little choice but to call a second election in 1974 - but not before committing to support for a Scottish Assembly. 30 | Even so, the SNP gained a further four seats, hitting its all time Westminster high of 11. 31 | Despite the success, tensions began to develop between those in the SNP who were elected and those who were not. 32 | 'Tartan Tories' 33 | Then came 1979 - the year which provided two killer blows to the SNP. 34 | Not only did Scots voters fail to support the establishment of a Scottish Assembly in a referendum, but Margaret Thatcher's Tories swept to power - meaning the constitutional issue was not only off the table, but had been completely blown out of the water. 35 | The SNP had also come under a period of heavy fire from rival parties, portrayed by Labour as the "Tartan Tories" and "Separatists" by the Conservatives. 
36 | With a post-election SNP slashed back to two MPs, the party needed a serious jump-start - but that jump-start dragged the party into a period which could have finished it off for good. 37 | The start of the 80s was a torrid time for the SNP. Many in the party felt bitter that it had come so far but was now practically back at square one in terms of its performance and the independence argument. 38 | The deep divisions gave rise to two notorious splinter groups. 39 | One was the ultra-nationalist group Siol Nan Gaidheal - branded "proto fascists" by the then SNP leader Gordon Wilson - whose members used to march around in Highland dress. 40 | The other was the Interim Committee for Political Discussion - more commonly known as the '79 Group. 41 | Formed to sharpen the party's message and appeal to dissident Labour voters, the group also embarked on a campaign of civil disobedience, spearheaded by the former Labour MP Jim Sillars, who had founded the Scottish Labour Party before defecting to the SNP in 1980. 42 | The campaign took a radical turn when Sillars, with several other group members, broke into Edinburgh's old Royal High School building. 43 | Then, a leak of '79 Group minutes to the media raised the prospect of links with the Provisional Sinn Fein. 44 | Despite claims the leaked version was inaccurate, Mr Wilson had had enough. 45 | His view that the party had to unite or die led to a ban on organised groups, but when the '79 Group refused to go quietly, seven of its members were briefly expelled from the party. 46 | They included Scotland's future justice secretary, Kenny MacAskill, and one Alex Salmond. 47 | The 1987 election saw another bad SNP performance. The party emerged with only three seats - but with the collapse in the Conservative vote, the constitutional issue was back. 48 | The SNP needed new blood at the top, and it came in the form of Alex Salmond. 
49 | Despite previous form with the '79 Group, Mr Salmond had risen through the SNP ranks, becoming MP for Banff and Buchan and deputy leader of the party. 50 | Independence case 51 | Mr Salmond did not have his work to seek on becoming leader in 1990. 52 | As well as having to deal with on-going issues over the party's independence policy - future minister Alex Neil had declared Scotland would be "free by '93" - there was an election to fight. 53 | In 1992, the SNP increased its vote, but the party was only able to retain the three seats it already had, and lost Govan, which Mr Sillars re-took for the party in a 1988 by-election. 54 | Mr Salmond moved to modernise the SNP, repositioning the party as more socially democratic and pro-European and pushing the economic case for independence. 55 | Labour's commitment to a Scottish Parliament, delivered in 1999, was both a blessing and a curse for the Nationalists. 56 | Although devolution presented a great opportunity for the SNP, many questioned how relevant a pro-independence party would be - George Robertson famously quipped devolution would "kill nationalism stone dead". 57 | The SNP won 35 seats in the first election and also had two MEPs and six MPs. 58 | But the best it could manage in 1999 was becoming the main opposition to the Labour-Lib Dem coalition government. 59 | Mr Salmond's decision to quit as leader and an MSP came as a surprise. 60 | Despite much speculation over his reasons for returning to Westminster, ultimately, after a decade in the job, he decided it was time to step aside. 61 | His successor in 2000 was John Swinney, but, despite being among the party's brightest talent, his four-year tenure was plagued by dissenters from within. 62 | The party dropped a seat in 2001, and, despite a slick 2003 election campaign, the SNP once again ended up as the opposition. 
63 | Later that year, a little-known SNP activist called Bill Wilson challenged Mr Swinney for the leadership, accusing him of ducking responsibility for a "plummeting" SNP vote. 64 | Mr Swinney won a decisive victory but was left weakened, and, at Holyrood, SNP MSPs Bruce McFee and Adam Ingram declared they would not support Mr Swinney in a leadership ballot. 65 | Another, Campbell Martin, was flung out of the party after bosses found his criticism of the Swinney leadership damaged its interests in the run-up to the SNP's poor European election showing in 2004, where it failed to overtake Labour. 66 | Mr Swinney quit as leader, accepting responsibility for failing to sell the party's message - but warned SNP members over the damage caused by "the loose and dangerous talk of the few". 67 | Close result 68 | When the leaderless party turned to Mr Salmond, he drew on a quote from US civil war leader General Sherman to declare: "If nominated I'll decline. If drafted I'll defer. And if elected I'll resign." 69 | Then, in a move almost as surprising as his decision to quit, Mr Salmond launched a successful leadership campaign on a joint ticket with Nicola Sturgeon, winning a decisive victory. 70 | Nobody thought the 2007 Scottish election result could be so close. 71 | In the end, the SNP won the election by one seat, while Mr Salmond returned to Holyrood as MSP for Gordon. 72 | With the SNP's pro-independence stance ruling out a coalition, the party forged ahead as a minority government. 73 | The SNP government had promised to seek consensus on an issue-by-issue basis, but when the opposition parties thought the government was being disingenuous, they converged to reject the Scottish budget in 2009. 74 | It was passed on the second attempt, but served as a reminder to the SNP the delicate position it was in. 
75 | Other key manifesto commitments also ran into trouble - plans to replace council tax with local income tax were dropped due to lack of support, while ambitious plans to cut class sizes in the early primary school years ran into problems. 76 | Eventually, the bill on an independence referendum was dropped. 77 | Such is life in minority government. 78 | Although the SNP's focus had become the Scottish government, it was keen not to lose sight of its status beyond the Holyrood bubble and, in 2009, won the largest share of the Scottish vote in the European election for the first time. 79 | Continuing its knack for winning safe Labour seats in by-elections, the SNP delivered a crushing blow to Labour, winning Glasgow East by overturning a majority of 13,507 to win by just a few hundred votes. 80 | But the party failed to repeat this success a few months later in the Glenrothes by-election and, later, in Glasgow North East. 81 | In a story that bore echoes of the past for the SNP, the 2010 UK election saw Labour regain Glasgow East, while the Nationalists concluded that, with a resurgent Tory party on course for victory, Scots voters came out in their droves to back Labour. 82 | The 2011 Holyrood election was Labour's to lose. In the event, that is exactly what happened. 83 | Despite polls predicting a Labour lead over the SNP of up to 15 points, the Nationalists threw themselves into the campaign. 84 | They say their positive campaign, versus Labour's negativity, was what won it for them. 85 | The SNP's 2007 win was rightly described as a historic one - but, four years later, the has re-written the history books again. 86 | Its jaw-dropping victory means it will be Scotland's first majority government - and the independence referendum will happen. 87 | The SNP has truly come a long way since the fringes of 1934. 
88 | -------------------------------------------------------------------------------- /tests/guardian1.txt: -------------------------------------------------------------------------------- 1 | Osama bin Laden death: Al-Qaida vows to carry out revenge attacks on US 2 | White House says US is being 'extremely vigilant' after al-Qaida declares Bin Laden's death a curse on the US 3 | 4 | Al-Qaida conceded in an 11-paragraph statement that Osama Bin Laden had been killed. Photograph: Rahimullah Yousafzai/AP 5 | Al-Qaida has vowed to carry out revenge attacks on the US and its allies over the killing of Osama bin Laden, warning that celebrations in the west would be replaced by sorrow and blood. 6 | 7 | The statement on a jihadist website was the first by al-Qaida since Bin Laden's death, which it said would become "a curse that hunts the Americans and their collaborators, and chases them outside and inside their country". 8 | 9 | The 11-paragraph statement, dated Tuesday, confirmed that Bin Laden was dead, disappointing conspiracy theorists who refuse to believe he has been killed. 10 | 11 | The White House spokesman, Jay Carney, said of the al-Qaida statement: "We are aware of it. What it does, obviously, is acknowledge the obvious, which is that Osama bin Laden was killed on Sunday night by US forces. We're being extremely vigilant. We're quite aware of the potential for activity and are highly vigilant on that matter for that reason. US security, both at home and at embassies and bases overseas, has been on high alert since Sunday." 12 | 13 | The Department of Homeland Security has warned US train operators to be especially careful after officials said that among computers, hard disks and other material taken from the Abbottabad compound they found a vague plan to attack the US rail network on this year's 10th anniversary of 9/11. One proposal was to demolish part of a rail track so that a train would fall into a river or valley, according to US officials. 
14 | 15 | Carney said: "One of the things we saw, I think, was the notice that DHS put out with regard to the information collected about the consideration at least of a terrorist plot against American railways back in February of 2010. 16 | 17 | "The fact that the world's most wanted terrorist might have been considering further terror plots against the United States is not a surprise, but it reminds us, of course, that we need to remain ever vigilant." 18 | 19 | In its statement, al-Qaida said: "We stress that the blood of the holy warrior sheikh Osama bin Laden, God bless him, is precious to us and to all Muslims and will not go in vain. We will remain, God willing, a curse chasing the Americans and their agents, following them outside and inside their countries. 20 | 21 | "Soon, God willing, their happiness will turn to sadness, their blood will be mingled with their tears." 22 | 23 | It said Bin Laden's death would not deflect al-Qaida from its war against the US and its allies, which include the Pakistani government. It called on Pakistan to rise up against the "traitors". 24 | 25 | The discovery of Bin Laden's hide–away so close to the capital, Islamabad, has strained relations between the US and Pakistan. Carl Levin, chairman of the Senate armed services committee and a Democrat, ordered an investigation into whether the Pakistani government and intelligence services knew of his whereabouts. "We need these questions about whether or not the top level of the Pakistan government knew or was told by the ISI, their intelligence service, about anything about this suspicious activity for years in a very, very centralised place," Levin said. 26 | 27 | The senator, who is usually guarded in his public statements, hinted that he believed some senior figures in Pakistani intelligence knew where Bin Laden was hiding – comments that will further inflame the Pakistani government. 
28 | 29 | "I think at high levels – high levels being the intelligence service – they knew it," Levin said. "I can't prove it. I just think it's counterintuitive not to." 30 | 31 | He raised doubts about continuing the billions of dollars in aid to Pakistan, which requires congressional approval. 32 | 33 | The Obama administration so far has been reluctant to criticise the Pakistani government and has opted instead to stress the positive aspects of the ties. The strategy seems to be to try to use Pakistan's embarrassment to prise out other al-Qaida or Taliban figures who may be living in Pakistan, such as the Taliban leader, Mullah Omar, and Bin Laden's deputy, Ayman al-Zawahiri. 34 | 35 | United Nations human rights investigators have called on Washington to disclose whether there had been any plan to capture Bin Laden. While they acknowledged the difficulties involved in such terrorist-related missions, they raised questions about the legality of the killing. 36 | 37 | The UN's special rapporteur on extrajudicial, summary or arbitrary executions, Christof Heyns, and the special rapporteur on the promotion and protection of human rights and fundamental freedoms while countering terrorism, Martin Scheinin, said the US "should disclose the supporting facts to allow an assessment in terms of international human rights law standards", adding: "For instance, it will be particularly important to know if the planning of the mission allowed an effort to capture Bin Laden." There has been relatively little debate in the US so far about the legality of the raid. 38 | 39 | Meanwhile the New York Times reported that the US may have targeted one of the men named as a possible successor to Bin Laden. Quoting American officials, the paper said a missile strike from an American military drone in a remote region of Yemen on Thursday was aimed at killing Anwar al-Awlaki, the radical American-born cleric believed to be hiding in the country. 
40 | 41 | Separately it was reported that a Saudi man accused of conspiring with Bin Laden in the bombings of two US embassies expects to be extradited in the next few months to face charges after more than 12 years in British custody, according to documents which have emerged from a US court. 42 | 43 | Prosecutors in New York have charged Khalid al-Fawwaz with helping al-Qaida to orchestrate the 1998 car bombings of the US embassies in Kenya and Tanzania, which killed 224 people. 44 | 45 | A letter from a lawyer seeking to be appointed as al-Fawwaz's US defence counsel, said: "He [al-Fawwaz] anticipates extradition from the United Kingdom to the United States within the next few months to face these charges." 46 | -------------------------------------------------------------------------------- /tests/guardian2.txt: -------------------------------------------------------------------------------- 1 | After election battle, bruised Clegg seeks distance from Tory partners 2 | There is no serious move against the Lib Dem leader so far – but he is under pressure to make his party stand out again 3 | 4 | The local election rout and AV defeat leaves Liberal Democrat leader Nick Clegg facing the extraordinary challenge of turning around his party's fortunes by the next election. Photograph: Toby Melville/Reuters 5 | Nick Clegg has moved to reassure shattered Liberal Democrats that he could engineer a political recovery by the time of the next election, and that the current mood of anger at his "betrayal" would dissipate over two to three years. 6 | 7 | As the economy recovers, voters will slowly and grudgingly recognise the party's difficult role in saving the country from crisis, he feels. 8 | 9 | Senior party figures were predicting the Lib Dems would suffer at least two more years in the doldrums. 
Despite the loss of more than 500 English council seats, 13 Scottish parliament seats and one in the Welsh assembly, there is no sign of any half-serious move against Clegg's leadership. The judgment across the party remains that Clegg had no choice but to form a coalition with the Tories last May. 10 | 11 | But the pressure he has been under from within his party for months to highlight its distinctiveness is now increasing. It will be a key test of Clegg's still developing leadership how he balances these competing demands. 12 | 13 | In the first sign that Clegg understands he needs to do more to distance his party from its coalition partners, he argued in a change of tone that one role for his party would be to protect the country from a return to the unfairness of Thatcherism. He is implying that the ideology of an unalloyed majority Conservative administration would be well to the right. 14 | 15 | The Lib Dem federal committee will meet shortly to set out the specific ways in which it expects the party to do more to differentiate itself from the Tories in line with a lengthy motion passed at the party's spring conference in Sheffield. 16 | 17 | Senior party figures almost universally predicted the Lib Dems were now entering "a transactional business relationship" with their coalition partners – a phrase first urged on the deputy prime minister by Vince Cable, the business secretary, last autumn. 18 | 19 | Many Lib Dem activists have been demanding distinctiveness and a willingness by their leadership to spell out what they have stopped, as well as what they have achieved, by being in coalition. Clegg will explain more about his approach to coalition politics in a speech next week. 20 | 21 | Cable warned at his party conference in the autumn that "to hold our own we need to maintain our party's identity and our authentic voice".
22 | 23 | That point was echoed strongly by Evan Harris, the former MP and vice-chairman of the party's federal policy committee and the authentic voice of the social liberal left in the party. 24 | 25 | He said no one was turning on Clegg inside the party, but his approach had to be less "collegiate" towards the prime minister. "He has got to deliver a strategy change, which is to do more than point out what we have achieved but also point out what bits of the programme come from a Conservative philosophy that we do not share," Harris said. 26 | 27 | Disgruntled senior Lib Dems, knowing the beating was coming, were privately spitting at what they regard as an ill-judged attempt by some in Clegg's circle to project the coalition as some kind of new ideological fusion of JS Mill and Friedrich Hayek. 28 | 29 | One Clegg aide said: "We are not going to behave like an opposition in the government, but we will have greater latitude to talk about when we disagree, as we already have over multiculturalism. That set the template. We are not going suddenly to Defcon 2 [a reference to the US armed forces expression for a defence-ready condition], or have poisonous rubbish briefed into the papers. 30 | 31 | "We cannot play silly buggers or spring surprises on the Tories. It has to be managed and agreed. Nor can you have a German system where ministers from different sides of the coalition go on television to set out their differences. We are not ready for that culturally." 32 | 33 | But it was being argued that too many Lib Dem ministers in departments of state have been happy, in the words of the former leader Sir Menzies Campbell, to give the impression that "they get on like a house on fire with their Tory secretaries of state". 34 | 35 | One Lib Dem official admitted the "power dynamics" of the coalition had changed after Thursday's vote. "I guess people that voted Tory last time sort of knew what they were getting.
They were hardly surprised by what Cameron did, and did not feel the need to punish him," the official said. 36 | 37 | "If you had to say on the basis of these results which party is going to form an overall majority at the next election, it is the Conservatives." 38 | 39 | For the moment, Clegg will just have to go through the grind of government as it undertakes plans to shake up health, the police, welfare and the House of Lords and introduces a more radical form of recall for MPs guilty of wrongdoing than previously envisaged. 40 | 41 | There is a deeper malaise for his party, however. The battle wounds of the alternative vote referendum campaign are going to leave permanent scars. For some, the anger is directed at Cameron for implicitly endorsing attacks by the NOtoAV campaign that trashed Clegg's leadership. 42 | 43 | Neil O'Brien, director of right-leaning thinktank Policy Exchange, said: "The real threat is that the coalition will be crippled inside. When trust and goodwill flow, the coalition can make real progress. Each partner is prepared to swallow decisions that are unpalatable, knowing that, in the not too distant future, the favour will be returned. That could get harder now." 44 | 45 | There is a deeper disappointment for some. One wing of the Tory party and the Liberal Democrats had been harbouring hopes that Cameron could actually have embraced the alternative vote, as Michael Gove, his education secretary, intended to do. In 1911, the Tory party had split on Lords reform between the hedgers and ditchers. There was optimistic talk in some circles that Cameron would prove to be a hedger. 46 | 47 | If AV went through, it would have been possible for the Tories and Lib Dems to come to second preference arrangements at the next election. Cameron would then have achieved a realignment of politics. He chose to retain the existing divisions. 
48 | -------------------------------------------------------------------------------- /tests/post1.txt: -------------------------------------------------------------------------------- 1 | Sony aims to fully restore PlayStation Network, down by hacker attack, by end of May 2 | 3 | By Associated Press, Updated: Tuesday, May 10, 11:56 AM 4 | 5 | TOKYO — Sony said Tuesday it aims to fully restore its PlayStation Network, shut down after a massive security breach affecting over 100 million online accounts, by the end of May. 6 | 7 | Sony also confirmed that personal data from 24.6 million user accounts was stolen in the hacker attack last month. Personal data, including credit card numbers, might have been stolen from another 77 million PlayStation accounts, said Sony Computer Entertainment Inc. spokesman Satoshi Fukuoka. 8 | 9 | He said Sony has not received any reports of illegal uses of stolen information, and the company is continuing its probe into the hacker attack. He declined to give details on the investigation. 10 | 11 | Sony shut down the PlayStation network, a system that links gamers worldwide in live play, on April 20 after discovering the security breach. The network also allows users to upgrade and download games and other content. 12 | 13 | Sony was under heavy criticism over its handling of the network intrusion. The company did not notify consumers of the breach until April 26 even though it began investigating unusual activity on the network since April 19. 14 | 15 | Last month, U.S. lawyers filed a lawsuit against Sony on behalf of lead plaintiff Kristopher Johns for negligent protection of personal data and failure to inform players in a timely fashion that their credit card information may have been stolen. The lawsuit seeks class-action status. 16 | 17 | Fukuoka declined to comment on the lawsuit. 
18 | -------------------------------------------------------------------------------- /tests/wikipedia2.txt: -------------------------------------------------------------------------------- 1 | Beetroot 2 | From Wikipedia, the free encyclopedia 3 | 4 | Beetroots at a grocery store 5 | 6 | The beetroot, also known as the table beet, garden beet, red beet or informally simply as beet, is one of the many cultivated varieties of beets (Beta vulgaris) and arguably the most commonly encountered variety in North America and Britain. 7 | 8 | Uses 9 | 10 | As a root vegetable 11 | 12 | The usually deep-red roots of beetroot are eaten boiled either as a cooked vegetable, or cold as a salad after cooking and adding oil and vinegar, or raw and shredded, either alone or combined with any salad vegetable. A large proportion of the commercial production is processed into boiled and sterilised beets or into pickles. In Eastern Europe beet soup, such as borscht, is a popular dish. Yellow-coloured beetroots are grown on a very small scale for home consumption.[1] 13 | 14 | As a leaf vegetable 15 | 16 | The green leafy portion of the beet is also edible. It is most commonly served boiled or steamed, in which case it has a taste and texture similar to spinach. 17 | 18 | Health benefits 19 | 20 | Beetroots are a rich source of potent antioxidants and nutrients, including magnesium, sodium, potassium and vitamin C, and betaine, which is important for cardiovascular health. It functions by acting with other nutrients to reduce the concentration of homocysteine, a homologue of the naturally occurring amino acid cysteine, which can be harmful to blood vessels and thus contribute to the development of heart disease, stroke, and peripheral vascular disease. 
Betaine functions in conjunction with S-adenosylmethionine, folic acid, and vitamins B6 and B12 to carry out this function.[2] 21 | 22 | Additionally, several preliminary studies on both rats and humans have shown betaine may protect against liver disease, particularly the build up of fatty deposits in the liver caused by alcohol abuse, protein deficiency, or diabetes, among other causes. The nutrient also helps individuals with hypochlorhydria, a condition causing abnormally low levels of stomach acid, by increasing stomach acidity.[2] 23 | 24 | Beetroot juice has been shown to lower blood pressure and thus help prevent cardiovascular problems. Research published in the American Heart Association journal Hypertension showed drinking 500 ml of beetroot juice led to a reduction in blood pressure within one hour. The reduction was more pronounced after three to four hours, and was measurable up to 24 hours after drinking the juice. The effect is attributed to the high nitrate content of the beetroot. The study correlated high nitrate concentrations in the blood following ingestion of the beetroot juice and the drop in blood pressure. Dietary nitrate, such as that found in the beetroot, is thought to be a source for the biological messenger nitric oxide, which is used by the endothelium to signal smooth muscle, triggering it to relax. This induces vasodilation and increased blood flow.[3] 25 | 26 | Other studies have found the positive effects beetroot juice can have on human exercise and performances. In studies conducted by the Exeter University, scientists found cyclists who drank a half-litre of beetroot juice several hours before setting off were able to ride up to 20 per cent longer than those who drank a placebo blackcurrant juice. [4] 27 | 28 | As a dye 29 | 30 | Betanin, obtained from the roots, is used industrially as red food colourants, e.g. 
to improve the color and flavor of tomato paste, sauces, desserts, jams and jellies, ice cream, sweets and breakfast cereals.[1] Within older bulbs of beetroot, the colour is a deep crimson and the flesh is much softer. Beetroot dye may also be used in ink. 31 | 32 | Betanin is not broken down in the body, and in higher concentration can temporarily cause urine (termed beeturia) and stool to assume a reddish color. This effect can cause distress and concern due to the visual similarity to bloody stools or urine, but is completely harmless and will subside once the food is out of the system. 33 | 34 | As a traditional remedy 35 | 36 | It is a rich source of the element boron. Field Marshal Montgomery is reputed to have exhorted his troops to 'take favours in the beetroot fields', a euphemism for visiting prostitutes.[5] From the Middle Ages, beetroot was used as a treatment for a variety of conditions, especially illnesses relating to digestion and the blood. Bartolomeo Platina recommended taking beetroot with garlic to nullify the effects of 'garlic-breath'.[6] 37 | 38 | References 39 | 40 | ^ a b Grubben, G.J.H. & Denton, O.A. (2004) Plant Resources of Tropical Africa 2. Vegetables. PROTA Foundation, Wageningen; Backhuys, Leiden; CTA, Wageningen. 41 | ^ a b A.D.A.M., Inc., ed. (2002), Betaine, University of Maryland Medical Center 42 | ^ Webb, Andrew J.; Nakul Patel; Stavros Loukogeorgakis; Mike Okorie; Zainab Aboud; Shivani Misra; Rahim Rashid; Philip Miall; John Deanfield; Nigel Benjamin; Raymond MacAllister; Adrian J. Hobbs; Amrita Ahluwalia; Patel, N; Loukogeorgakis, S; Okorie, M; Aboud, Z; Misra, S; Rashid, R; Miall, P et al. (2008), "Acute Blood Pressure Lowering, Vasoprotective, and Antiplatelet Properties of Dietary Nitrate via Bioconversion to Nitrite", Hypertension 51 (3): 784–790, doi:10.1161/HYPERTENSIONAHA.107.103523, PMC 2839282, PMID 18250365 43 | ^ "Beet your personal best". Sydney Morning Herald. 4 October 2010. Retrieved 5 October 2010.
44 | ^ Stephen Nottingham (2004) (E-book), Beetroot 45 | ^ Platina De Honesta Voluptate et Valetudine, 3.14 46 | -------------------------------------------------------------------------------- /tests/wikipedia3.txt: -------------------------------------------------------------------------------- 1 | The Lounge Lizards 2 | From Wikipedia, the free encyclopedia 3 | 4 | The Lounge Lizards are a jazz group formed in 1978 by saxophone player John Lurie. 5 | Initially a tongue in cheek "fake jazz" combo, drawing on punk rock and no wave as much as jazz, The Lounge Lizards have since become respected for their creative and distinctive sound. 6 | Contents [hide] 7 | 1 History 8 | 2 Past members 9 | 3 Discography 10 | 3.1 Studio albums 11 | 3.2 Live albums 12 | [edit]History 13 | 14 | The Lounge Lizards were founded on June 4th, 1979 with John Lurie, his brother Evan (piano and organ), Arto Lindsay (guitar), Steve Piccolo (bass guitar), and Anton Fier (drums). They were initially a punk or fake jazz group but soon evolved into something quite special. Taking music from all corners of the globe and synthesizing it into something truly organic and unique. The New York Times' Robert Palmer wrote (October 7, 1986): "The Lounge Lizards are not faking anything. They have staked their claim to a musical territory that lies somewhere west of Charles Mingus and east of Bernard Hermann and made it their own." 15 | In 1984 the new group consists of Erik Sanko, Curtis Fowlkes, Marc Ribot, Evan Lurie, Roy Nathanson, Dougie Bowne, E.J. Rodriguez. This edition of The Lounge Lizards recorded three albums in two years, and demonstrated John Lurie's increasingly sophisticated and multi-layered compositions that often stray rather far from conventional jazz: He was able to integrate elements of various world musics (he often favors tango-inspired passages in his songs), which retain a distinctive flavor, but avoid gimmickry. One critic notes traces of "Erik Satie and Kurt Weill" [1].
16 | John Lurie formed a new version of The Lounge Lizards in the early 1990s; prominent members included Steven Bernstein (trumpet), Michael Blake (saxophone), Oren Bloedow (bass guitar), David Tronzo (guitar), Calvin Weston (drums) and Billy Martin (percussion). 17 | Recent years have found The Lounge Lizards less active; John Lurie has been increasingly occupied with painting, while Evan Lurie has worked on The Backyardigans, a children's show that highlights multiple musical genres. 18 | --------------------------------------------------------------------------------