├── .gitignore
├── .travis.yml
├── CONTRIBUTING.rst
├── HISTORY.rst
├── LICENSE
├── MANIFEST.in
├── README.rst
├── dev-requirements.txt
├── run_tests.py
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   └── test_taggers.py
├── textblob_aptagger
│   ├── __init__.py
│   ├── _perceptron.py
│   ├── compat.py
│   ├── taggers.py
│   └── trontagger-0.1.0.pickle
└── tox.ini
/.gitignore:
--------------------------------------------------------------------------------
1 | ########## Generated by gig 0.1.0 ###########
2 |
3 | ### Python ###
4 | *.py[cod]
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Packages
10 | *.egg
11 | *.egg-info
12 | dist
13 | build
14 | eggs
15 | parts
16 | bin
17 | var
18 | sdist
19 | develop-eggs
20 | .installed.cfg
21 | lib
22 | lib64
23 | __pycache__
24 |
25 | # Installer logs
26 | pip-log.txt
27 |
28 | # Unit test / coverage reports
29 | .coverage
30 | .tox
31 | nosetests.xml
32 |
33 | # Translations
34 | *.mo
35 |
36 | # Mr Developer
37 | .mr.developer.cfg
38 | .project
39 | .pydevproject
40 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | # Config file for automatic testing at travis-ci.org
2 |
3 | language: python
4 |
5 | python:
6 | - "3.3"
7 | - "2.7"
8 | - "2.6"
9 | - "pypy"
10 |
11 | before_install:
12 | - "wget https://s3.amazonaws.com/textblob/nltk_data.tar.gz"
13 | - "tar -xzvf nltk_data.tar.gz -C ~"
14 |
15 | install:
16 | - pip install -U .
17 | - curl https://raw.github.com/sloria/TextBlob/master/download_corpora_lite.py | python
18 |
19 | script: python run_tests.py
20 |
--------------------------------------------------------------------------------
/CONTRIBUTING.rst:
--------------------------------------------------------------------------------
1 | Contributing guidelines
2 | =======================
3 |
4 | In General
5 | ----------
6 |
7 | - `PEP 8`_, when sensible.
8 | - Test ruthlessly. Write docs for new features.
9 | - Even more important than Test-Driven Development--*Human-Driven Development*.
10 |
11 | .. _`PEP 8`: http://www.python.org/dev/peps/pep-0008/
12 |
--------------------------------------------------------------------------------
/HISTORY.rst:
--------------------------------------------------------------------------------
1 | Changelog
2 | ---------
3 |
4 | 0.3.0 (unreleased)
5 | ++++++++++++++++++
6 |
7 | * Compatibility with TextBlob>=0.9.0.
8 |
9 | 0.2.0 (10/21/2013)
10 | ++++++++++++++++++
11 |
12 | * Compatibility with TextBlob>=0.8.0.
13 |
14 | 0.1.0 (09/25/2013)
15 | ++++++++++++++++++
16 |
17 | * First stable release.
18 | * Ports the ``PerceptronTagger`` from TextBlob 0.6.3.
19 |
20 |
21 | 0.0.1 (09/22/2013)
22 | ++++++++++++++++++
23 |
24 | * Experimental release.
25 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2013 Matthew Honnibal
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19 | THE SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include *.rst LICENSE *.txt *.ini setup.cfg
2 | include textblob_aptagger/*.pickle
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | =================
2 | textblob-aptagger
3 | =================
4 |
5 | **As of TextBlob 0.11.0, TextBlob uses NLTK's averaged perceptron tagger by default. This package is no longer necessary.**
6 |
7 | .. image:: https://badge.fury.io/py/textblob-aptagger.png
8 | :target: http://badge.fury.io/py/textblob-aptagger
9 | :alt: Latest version
10 |
11 | .. image:: https://travis-ci.org/sloria/textblob-aptagger.png?branch=master
12 | :target: https://travis-ci.org/sloria/textblob-aptagger
13 | :alt: Travis-CI
14 |
15 | A fast and accurate part-of-speech tagger based on the Averaged Perceptron. For use with `TextBlob`_.
16 |
17 | Implementation by Matthew Honnibal, a.k.a. *syllog1sm*. Read more about it `here <http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/>`_.
18 |
19 | Install
20 | -------
21 |
22 | If you have pip installed (you should), run ::
23 |
24 | $ pip install -U textblob-aptagger
25 |
26 | Usage
27 | -----
28 | .. code-block:: python
29 |
30 | >>> from textblob import TextBlob
31 | >>> from textblob_aptagger import PerceptronTagger
32 | >>> blob = TextBlob("Simple is better than complex.", pos_tagger=PerceptronTagger())
33 | >>> blob.tags
34 | [('Simple', u'NN'), ('is', u'VBZ'), ('better', u'JJR'), ('than', u'IN'), ('complex', u'JJ')]
35 |
36 | Requirements
37 | ------------
38 |
39 | - Python >= 2.6 or >= 3.3
40 |
41 | License
42 | -------
43 |
44 | MIT licensed. See the bundled ``LICENSE`` file for more details.
45 |
46 | .. _TextBlob: https://textblob.readthedocs.org/
47 |
--------------------------------------------------------------------------------
/dev-requirements.txt:
--------------------------------------------------------------------------------
1 | nose>=1.3.0
2 | tox>=1.5.0
3 | sphinx
4 | wheel
5 |
--------------------------------------------------------------------------------
/run_tests.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | '''
4 | The main test runner script, adapted from TextBlob.
5 |
6 | Usage: ::
7 |
8 | python run_tests.py
9 |
10 | Skip slow tests: ::
11 |
12 | python run_tests.py fast
13 | '''
14 | from __future__ import unicode_literals
15 | import nose
16 | import sys
17 | from textblob_aptagger.compat import PY2, PY26
18 |
19 |
20 | def main():
21 | args = get_argv()
22 | success = nose.run(argv=args)
23 | sys.exit(0) if success else sys.exit(1)
24 |
25 |
26 | def get_argv():
27 | args = [sys.argv[0], ]
28 | attr_conditions = [] # Use nose's attribselect plugin to filter tests
29 | if "force-all" in sys.argv:
30 | # Don't exclude any tests
31 | return args
32 | if PY26:
33 | # Exclude tests that don't work on python2.6
34 | attr_conditions.append("not py27_only")
35 | if not PY2:
36 | # Exclude tests that only work on python2
37 | attr_conditions.append("not py2_only")
38 | if "fast" in sys.argv:
39 | attr_conditions.append("not slow")
40 |
41 | attr_expression = " and ".join(attr_conditions)
42 | if attr_expression:
43 | args.extend(["-A", attr_expression])
44 | return args
45 |
46 | if __name__ == '__main__':
47 | main()
48 |
--------------------------------------------------------------------------------
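The test selection in ``run_tests.py`` above works by assembling a nose ``-A`` attribute expression from the command-line flags and the running Python version. A standalone sketch of that logic (the ``build_nose_args`` helper is hypothetical; the version checks are passed in as flags instead of being read from ``compat``):

```python
def build_nose_args(argv, py2=False, py26=False):
    """Mirror run_tests.get_argv: build nose's -A attribute expression."""
    args = [argv[0]]
    if "force-all" in argv:
        return args  # don't exclude any tests
    conditions = []
    if py26:
        conditions.append("not py27_only")  # skip tests that need Python 2.7+
    if not py2:
        conditions.append("not py2_only")   # skip Python-2-only tests
    if "fast" in argv:
        conditions.append("not slow")
    expr = " and ".join(conditions)
    if expr:
        args.extend(["-A", expr])
    return args

print(build_nose_args(["run_tests.py", "fast"]))
# ['run_tests.py', '-A', 'not py2_only and not slow']
```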
/setup.cfg:
--------------------------------------------------------------------------------
1 | [wheel]
2 | universal = 1
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import re
3 | import sys
4 | import subprocess
5 | from setuptools import setup
6 |
7 | packages = ['textblob_aptagger']
8 | requires = ["textblob>=0.9.0"]
9 |
10 | PUBLISH_CMD = "python setup.py register sdist bdist_wheel upload"
11 | TEST_PUBLISH_CMD = 'python setup.py register -r test sdist bdist_wheel upload -r test'
12 | TEST_CMD = 'python run_tests.py'
13 |
14 |
15 | def find_version(fname):
16 | '''Attempts to find the version number in the file named ``fname``.
17 | Raises RuntimeError if not found.
18 | '''
19 | version = ''
20 | with open(fname, 'r') as fp:
21 | reg = re.compile(r'__version__ = [\'"]([^\'"]*)[\'"]')
22 | for line in fp:
23 | m = reg.match(line)
24 | if m:
25 | version = m.group(1)
26 | break
27 | if not version:
28 | raise RuntimeError('Cannot find version information')
29 | return version
30 |
31 | __version__ = find_version("textblob_aptagger/__init__.py")
32 |
33 | if 'publish' in sys.argv:
34 | try:
35 | __import__('wheel')
36 | except ImportError:
37 | print("wheel required. Run `pip install wheel`.")
38 | sys.exit(1)
39 | status = subprocess.call(PUBLISH_CMD, shell=True)
40 | sys.exit(status)
41 |
42 | if 'publish_test' in sys.argv:
43 | try:
44 | __import__('wheel')
45 | except ImportError:
46 | print("wheel required. Run `pip install wheel`.")
47 | sys.exit(1)
48 | status = subprocess.call(TEST_PUBLISH_CMD, shell=True)
49 | sys.exit(status)
50 |
51 | if 'run_tests' in sys.argv:
52 | try:
53 | __import__('nose')
54 | except ImportError:
55 | print('nose required. Run `pip install nose`.')
56 | sys.exit(1)
57 |
58 | status = subprocess.call(TEST_CMD, shell=True)
59 | sys.exit(status)
60 |
61 | def read(fname):
62 | with open(fname) as fp:
63 | content = fp.read()
64 | return content
65 |
66 | setup(
67 | name='textblob-aptagger',
68 | version=__version__,
69 | description='A fast and accurate part-of-speech tagger for TextBlob.',
70 | long_description=(read("README.rst") + '\n\n' +
71 | read("HISTORY.rst")),
72 | author='Steven Loria',
73 | author_email='sloria1@gmail.com',
74 | url='https://github.com/sloria/textblob-aptagger',
75 | packages=packages,
76 | package_dir={'textblob_aptagger': 'textblob_aptagger'},
77 | include_package_data=True,
78 | package_data={
79 | "textblob_aptagger": ["*.pickle"]
80 | },
81 | install_requires=requires,
82 | license=read("LICENSE"),
83 | zip_safe=False,
84 | keywords='textblob_aptagger',
85 | classifiers=[
86 | 'Development Status :: 2 - Pre-Alpha',
87 | 'Intended Audience :: Developers',
88 | 'License :: OSI Approved :: MIT License',
89 | 'Natural Language :: English',
90 | "Programming Language :: Python :: 2",
91 | 'Programming Language :: Python :: 2.6',
92 | 'Programming Language :: Python :: 2.7',
93 | 'Programming Language :: Python :: 3',
94 | 'Programming Language :: Python :: 3.3',
95 | ],
96 | test_suite='tests',
97 | tests_require=['nose'],
98 | )
99 |
--------------------------------------------------------------------------------
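The ``find_version`` helper in ``setup.py`` above extracts the version string with a regex so the package module never has to be imported at install time. A standalone sketch of the same idea (the name ``find_version_in_text`` is hypothetical; it operates on file contents rather than a path):

```python
import re

# Same pattern as setup.find_version: capture whatever sits inside
# quotes in a line like: __version__ = '0.3.0-dev'
VERSION_RE = re.compile(r'__version__ = [\'"]([^\'"]*)[\'"]')

def find_version_in_text(text):
    """Return the first __version__ value found, or raise RuntimeError."""
    for line in text.splitlines():
        m = VERSION_RE.match(line)
        if m:
            return m.group(1)
    raise RuntimeError('Cannot find version information')

print(find_version_in_text("__version__ = '0.3.0-dev'\n"))  # 0.3.0-dev
```

Matching line-by-line with ``match`` (anchored at the line start) avoids false positives from, say, a docstring that merely mentions ``__version__``.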
/tests/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
--------------------------------------------------------------------------------
/tests/test_taggers.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | from __future__ import unicode_literals
3 | import unittest
4 | from nose.tools import * # PEP8 asserts
5 | from nose.plugins.attrib import attr
6 |
7 | from textblob.base import BaseTagger
8 | from textblob.blob import TextBlob
9 | from textblob.exceptions import MissingCorpusError
10 | from textblob_aptagger import PerceptronTagger
11 |
12 | class TestPerceptronTagger(unittest.TestCase):
13 |
14 | def setUp(self):
15 | self.text = ("Simple is better than complex. "
16 | "Complex is better than complicated.")
17 | self.tagger = PerceptronTagger(load=False)
18 |
19 | def test_init(self):
20 | tagger = PerceptronTagger(load=False)
21 | assert_true(isinstance(tagger, BaseTagger))
22 |
23 | def test_train(self):
24 | sentences = _read_tagged(_wsj_train)
25 | nr_iter = 5
26 | self.tagger.train(sentences, nr_iter=nr_iter)
27 | nr_words = sum(len(words) for words, tags in sentences)
28 | # Check that the model has 'ticked over' once per instance
29 | assert_equal(nr_words * nr_iter, self.tagger.model.i)
30 | # Check that the tagger has a class for every seen tag
31 | tag_set = set()
32 | for _, tags in sentences:
33 | tag_set.update(tags)
34 | assert_equal(len(tag_set), len(self.tagger.model.classes))
35 | for tag in tag_set:
36 | assert_true(tag in self.tagger.model.classes)
37 |
38 | @attr("slow")
39 | def test_tag(self):
40 | trained_tagger = PerceptronTagger()
41 | tokens = trained_tagger.tag(self.text)
42 | assert_equal([w for w, t in tokens],
43 | ['Simple', 'is', 'better', 'than', 'complex', '.', 'Complex', 'is',
44 | 'better', 'than', 'complicated', '.'])
45 |
46 | @attr("slow")
47 | def test_tag_textblob(self):
48 | trained_tagger = PerceptronTagger()
49 | blob = TextBlob(self.text, pos_tagger=trained_tagger)
50 | # Punctuation is excluded
51 | assert_equal([w for w, t in blob.tags],
52 | ['Simple', 'is', 'better', 'than', 'complex', 'Complex', 'is',
53 | 'better', 'than', 'complicated'])
54 |
55 | def test_loading_missing_file_raises_missing_corpus_exception(self):
56 | tagger = PerceptronTagger(load=False)
57 | assert_raises(MissingCorpusError, tagger.load, 'missing.pickle')
58 |
59 |
60 | def _read_tagged(text, sep='|'):
61 | sentences = []
62 | for sent in text.split('\n'):
63 | tokens = []
64 | tags = []
65 | for token in sent.split():
66 | word, pos = token.split(sep)
67 | tokens.append(word)
68 | tags.append(pos)
69 | sentences.append((tokens, tags))
70 | return sentences
71 |
72 | _wsj_train = ("Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, will|MD "
73 | "join|VB the|DT board|NN as|IN a|DT nonexecutive|JJ director|NN "
74 | "Nov.|NNP 29|CD .|.\nMr.|NNP Vinken|NNP is|VBZ chairman|NN of|IN "
75 | "Elsevier|NNP N.V.|NNP ,|, the|DT Dutch|NNP publishing|VBG "
76 | "group|NN .|. Rudolph|NNP Agnew|NNP ,|, 55|CD years|NNS old|JJ "
77 | "and|CC former|JJ chairman|NN of|IN Consolidated|NNP Gold|NNP "
78 | "Fields|NNP PLC|NNP ,|, was|VBD named|VBN a|DT nonexecutive|JJ "
79 | "director|NN of|IN this|DT British|JJ industrial|JJ conglomerate|NN "
80 | ".|.\nA|DT form|NN of|IN asbestos|NN once|RB used|VBN to|TO make|VB "
81 | "Kent|NNP cigarette|NN filters|NNS has|VBZ caused|VBN a|DT high|JJ "
82 | "percentage|NN of|IN cancer|NN deaths|NNS among|IN a|DT group|NN "
83 | "of|IN workers|NNS exposed|VBN to|TO it|PRP more|RBR than|IN "
84 | "30|CD years|NNS ago|IN ,|, researchers|NNS reported|VBD .|.")
85 |
86 |
87 | if __name__ == '__main__':
88 | unittest.main()
89 |
--------------------------------------------------------------------------------
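The ``_read_tagged`` helper in the test module above parses the pipe-delimited ``word|TAG`` corpus format used by ``_wsj_train``. A standalone copy of the same parsing (renamed ``read_tagged`` here to avoid the private name):

```python
def read_tagged(text, sep='|'):
    # One sentence per line; tokens are "word|TAG" pairs separated by spaces
    sentences = []
    for sent in text.split('\n'):
        tokens, tags = [], []
        for token in sent.split():
            word, pos = token.split(sep)
            tokens.append(word)
            tags.append(pos)
        sentences.append((tokens, tags))
    return sentences

sample = "Simple|NN is|VBZ better|JJR\nthan|IN complex|JJ .|."
print(read_tagged(sample))
```

Each sentence comes back as a ``(words, tags)`` pair, which is exactly the shape ``PerceptronTagger.train`` expects.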
/textblob_aptagger/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''textblob-aptagger
3 |
4 | A TextBlob extension that adds the `PerceptronTagger`, a fast and accurate
5 | part-of-speech tagger based on the Averaged Perceptron algorithm.
6 | '''
7 | from __future__ import absolute_import
8 | from textblob_aptagger.taggers import PerceptronTagger
9 |
10 | __version__ = '0.3.0-dev'
11 | __license__ = "MIT"
12 |
--------------------------------------------------------------------------------
/textblob_aptagger/_perceptron.py:
--------------------------------------------------------------------------------
1 | """
2 | Averaged perceptron classifier. Implementation geared for simplicity rather than
3 | efficiency.
4 | """
5 | from collections import defaultdict
6 | import pickle
7 | import random
8 |
9 |
10 | class AveragedPerceptron(object):
11 |
12 | '''An averaged perceptron, as implemented by Matthew Honnibal.
13 |
14 | See more implementation details here:
15 | http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
16 | '''
17 |
18 | def __init__(self):
19 | # Each feature gets its own weight vector, so weights is a dict-of-dicts
20 | self.weights = {}
21 | self.classes = set()
22 | # The accumulated values, for the averaging. These will be keyed by
23 | # feature/clas tuples
24 | self._totals = defaultdict(int)
25 | # The last time the feature was changed, for the averaging. Also
26 | # keyed by feature/clas tuples
27 | # (tstamps is short for timestamps)
28 | self._tstamps = defaultdict(int)
29 | # Number of instances seen
30 | self.i = 0
31 |
32 | def predict(self, features):
33 | '''Dot-product the features and current weights and return the best label.'''
34 | scores = defaultdict(float)
35 | for feat, value in features.items():
36 | if feat not in self.weights or value == 0:
37 | continue
38 | weights = self.weights[feat]
39 | for label, weight in weights.items():
40 | scores[label] += value * weight
41 | # Do a secondary alphabetic sort, for stability
42 | return max(self.classes, key=lambda label: (scores[label], label))
43 |
44 | def update(self, truth, guess, features):
45 | '''Update the feature weights.'''
46 | def upd_feat(c, f, w, v):
47 | param = (f, c)
48 | self._totals[param] += (self.i - self._tstamps[param]) * w
49 | self._tstamps[param] = self.i
50 | self.weights[f][c] = w + v
51 |
52 | self.i += 1
53 | if truth == guess:
54 | return None
55 | for f in features:
56 | weights = self.weights.setdefault(f, {})
57 | upd_feat(truth, f, weights.get(truth, 0.0), 1.0)
58 | upd_feat(guess, f, weights.get(guess, 0.0), -1.0)
59 | return None
60 |
61 | def average_weights(self):
62 | '''Average weights from all iterations.'''
63 | for feat, weights in self.weights.items():
64 | new_feat_weights = {}
65 | for clas, weight in weights.items():
66 | param = (feat, clas)
67 | total = self._totals[param]
68 | total += (self.i - self._tstamps[param]) * weight
69 | averaged = round(total / float(self.i), 3)
70 | if averaged:
71 | new_feat_weights[clas] = averaged
72 | self.weights[feat] = new_feat_weights
73 | return None
74 |
75 | def save(self, path):
76 | '''Save the pickled model weights.'''
77 | return pickle.dump(dict(self.weights), open(path, 'wb'))  # pickle requires binary mode
78 |
79 | def load(self, path):
80 | '''Load the pickled model weights.'''
81 | self.weights = pickle.load(open(path, 'rb'))  # read in binary mode
82 | return None
83 |
84 |
85 | def train(nr_iter, examples):
86 | '''Return an averaged perceptron model trained on ``examples`` for
87 | ``nr_iter`` iterations.
88 | '''
89 | model = AveragedPerceptron()
90 | for i in range(nr_iter):
91 | random.shuffle(examples)
92 | for features, class_ in examples:
93 | # predict() already returns the highest-scoring label
94 | guess = model.predict(features)
95 | if guess != class_:
96 | model.update(class_, guess, features)
97 | model.average_weights()
98 | return model
99 |
--------------------------------------------------------------------------------
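The averaging trick in ``AveragedPerceptron`` above is compact but subtle: each ``(feature, class)`` pair accumulates its weight multiplied by how many update steps that weight survived, and the final weight is that total divided by the step count. A condensed, runnable illustration of the same bookkeeping (the ``TinyAveragedPerceptron`` class and the toy features are made up for this sketch):

```python
from collections import defaultdict

class TinyAveragedPerceptron:
    """Condensed copy of _perceptron.AveragedPerceptron's core logic."""
    def __init__(self, classes):
        self.classes = set(classes)
        self.weights = {}                 # feature -> {class: weight}
        self._totals = defaultdict(int)   # (feature, class) -> accumulated weight
        self._tstamps = defaultdict(int)  # (feature, class) -> step of last change
        self.i = 0                        # number of instances seen

    def predict(self, features):
        scores = defaultdict(float)
        for feat, value in features.items():
            for label, weight in self.weights.get(feat, {}).items():
                scores[label] += value * weight
        # Secondary alphabetic sort, for stable tie-breaking
        return max(self.classes, key=lambda label: (scores[label], label))

    def update(self, truth, guess, features):
        def upd(c, f, w, v):
            param = (f, c)
            # Credit the old weight for every step it was in effect
            self._totals[param] += (self.i - self._tstamps[param]) * w
            self._tstamps[param] = self.i
            self.weights[f][c] = w + v
        self.i += 1
        if truth == guess:
            return
        for f in features:
            w = self.weights.setdefault(f, {})
            upd(truth, f, w.get(truth, 0.0), 1.0)
            upd(guess, f, w.get(guess, 0.0), -1.0)

    def average_weights(self):
        for feat, weights in self.weights.items():
            averaged = {}
            for clas, w in weights.items():
                param = (feat, clas)
                total = self._totals[param] + (self.i - self._tstamps[param]) * w
                val = round(total / float(self.i), 3)
                if val:
                    averaged[clas] = val
            self.weights[feat] = averaged

# Toy binary task: one indicative feature per class
examples = [({'has_digit': 1}, 'NUM'), ({'lower': 1}, 'WORD')] * 5
model = TinyAveragedPerceptron({'NUM', 'WORD'})
for features, label in examples:
    guess = model.predict(features)
    model.update(label, guess, features)  # update() is a no-op when correct
model.average_weights()
print(model.predict({'has_digit': 1}))  # NUM
```

After one early mistake, the ``has_digit -> NUM`` weight of 1.0 survives the remaining steps, so its average settles at 0.9 over the 10 instances.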
/textblob_aptagger/compat.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import sys
3 |
4 | PY2 = sys.version_info[0] == 2
5 | PY26 = PY2 and sys.version_info[1] < 7
6 |
7 | if PY2:
8 | text_type = unicode
9 | binary_type = str
10 | string_types = (str, unicode)
11 | unicode = unicode
12 | basestring = basestring
13 | else:
14 | text_type = str
15 | binary_type = bytes
16 | string_types = (str,)
17 | unicode = str
18 | basestring = (str, bytes)
--------------------------------------------------------------------------------
/textblob_aptagger/taggers.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | from __future__ import absolute_import
3 | import os
4 | import random
5 | from collections import defaultdict
6 | import pickle
7 | import logging
8 |
9 | from textblob.base import BaseTagger
10 | from textblob.tokenizers import WordTokenizer, SentenceTokenizer
11 | from textblob.exceptions import MissingCorpusError
12 | from textblob_aptagger._perceptron import AveragedPerceptron
13 |
14 | PICKLE = "trontagger-0.1.0.pickle"
15 |
16 |
17 | class PerceptronTagger(BaseTagger):
18 |
19 | '''Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.
20 |
21 | See more implementation details here:
22 | http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
23 |
24 | :param load: Load the pickled model upon instantiation.
25 | '''
26 |
27 | START = ['-START-', '-START2-']
28 | END = ['-END-', '-END2-']
29 | AP_MODEL_LOC = os.path.join(os.path.dirname(__file__), PICKLE)
30 |
31 | def __init__(self, load=True):
32 | self.model = AveragedPerceptron()
33 | self.tagdict = {}
34 | self.classes = set()
35 | if load:
36 | self.load(self.AP_MODEL_LOC)
37 |
38 | def tag(self, corpus, tokenize=True):
39 | '''Tags a string `corpus`.'''
40 | # Assume untokenized corpus has \n between sentences and ' ' between words
41 | s_split = SentenceTokenizer().tokenize if tokenize else lambda t: t.split('\n')
42 | w_split = WordTokenizer().tokenize if tokenize else lambda s: s.split()
43 | def split_sents(corpus):
44 | for s in s_split(corpus):
45 | yield w_split(s)
46 |
47 | prev, prev2 = self.START
48 | tokens = []
49 | for words in split_sents(corpus):
50 | context = self.START + [self._normalize(w) for w in words] + self.END
51 | for i, word in enumerate(words):
52 | tag = self.tagdict.get(word)
53 | if not tag:
54 | features = self._get_features(i, word, context, prev, prev2)
55 | tag = self.model.predict(features)
56 | tokens.append((word, tag))
57 | prev2 = prev
58 | prev = tag
59 | return tokens
60 |
61 | def train(self, sentences, save_loc=None, nr_iter=5):
62 | '''Train a model from sentences, and save it at ``save_loc``. ``nr_iter``
63 | controls the number of Perceptron training iterations.
64 |
65 | :param sentences: A list of (words, tags) tuples.
66 | :param save_loc: If not ``None``, saves a pickled model in this location.
67 | :param nr_iter: Number of training iterations.
68 | '''
69 | self._make_tagdict(sentences)
70 | self.model.classes = self.classes
71 | for iter_ in range(nr_iter):
72 | c = 0
73 | n = 0
74 | for words, tags in sentences:
75 | prev, prev2 = self.START
76 | context = self.START + [self._normalize(w) for w in words] \
77 | + self.END
78 | for i, word in enumerate(words):
79 | guess = self.tagdict.get(word)
80 | if not guess:
81 | feats = self._get_features(i, word, context, prev, prev2)
82 | guess = self.model.predict(feats)
83 | self.model.update(tags[i], guess, feats)
84 | prev2 = prev
85 | prev = guess
86 | c += guess == tags[i]
87 | n += 1
88 | random.shuffle(sentences)
89 | logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
90 | self.model.average_weights()
91 | # Pickle as a binary file
92 | if save_loc is not None:
93 | pickle.dump((self.model.weights, self.tagdict, self.classes),
94 | open(save_loc, 'wb'), -1)
95 | return None
96 |
97 | def load(self, loc):
98 | '''Load a pickled model.'''
99 | try:
100 | w_td_c = pickle.load(open(loc, 'rb'))
101 | except IOError:
102 | msg = ("Missing trontagger.pickle file.")
103 | raise MissingCorpusError(msg)
104 | self.model.weights, self.tagdict, self.classes = w_td_c
105 | self.model.classes = self.classes
106 | return None
107 |
108 | def _normalize(self, word):
109 | '''Normalization used in pre-processing.
110 |
111 | - Words are lowercased; internally hyphenated words become !HYPHEN
112 | - Four-digit numbers (typically years) are represented as !YEAR
113 | - Other tokens starting with a digit are represented as !DIGITS
114 |
115 | :rtype: str
116 | '''
117 | if '-' in word and word[0] != '-':
118 | return '!HYPHEN'
119 | elif word.isdigit() and len(word) == 4:
120 | return '!YEAR'
121 | elif word[0].isdigit():
122 | return '!DIGITS'
123 | else:
124 | return word.lower()
125 |
126 | def _get_features(self, i, word, context, prev, prev2):
127 | '''Map tokens into a feature representation, implemented as a
128 | {hashable: float} dict. If the features change, a new model must be
129 | trained.
130 | '''
131 | def add(name, *args):
132 | features[' '.join((name,) + tuple(args))] += 1
133 |
134 | i += len(self.START)
135 | features = defaultdict(int)
136 | # It's useful to have a constant feature, which acts sort of like a prior
137 | add('bias')
138 | add('i suffix', word[-3:])
139 | add('i pref1', word[0])
140 | add('i-1 tag', prev)
141 | add('i-2 tag', prev2)
142 | add('i tag+i-2 tag', prev, prev2)
143 | add('i word', context[i])
144 | add('i-1 tag+i word', prev, context[i])
145 | add('i-1 word', context[i-1])
146 | add('i-1 suffix', context[i-1][-3:])
147 | add('i-2 word', context[i-2])
148 | add('i+1 word', context[i+1])
149 | add('i+1 suffix', context[i+1][-3:])
150 | add('i+2 word', context[i+2])
151 | return features
152 |
153 | def _make_tagdict(self, sentences):
154 | '''Make a tag dictionary for single-tag words.'''
155 | counts = defaultdict(lambda: defaultdict(int))
156 | for words, tags in sentences:
157 | for word, tag in zip(words, tags):
158 | counts[word][tag] += 1
159 | self.classes.add(tag)
160 | freq_thresh = 20
161 | ambiguity_thresh = 0.97
162 | for word, tag_freqs in counts.items():
163 | tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
164 | n = sum(tag_freqs.values())
165 | # Don't add rare words to the tag dictionary
166 | # Only add quite unambiguous words
167 | if n >= freq_thresh and (float(mode) / n) >= ambiguity_thresh:
168 | self.tagdict[word] = tag
169 |
170 |
171 | def _pc(n, d):
172 | return (float(n) / d) * 100
173 |
--------------------------------------------------------------------------------
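The ``_normalize`` pre-processing in ``taggers.py`` above is easy to verify in isolation. A standalone copy of its logic (the module-level ``normalize`` name is used here just for illustration):

```python
def normalize(word):
    # Same rules as PerceptronTagger._normalize
    if '-' in word and word[0] != '-':
        return '!HYPHEN'
    elif word.isdigit() and len(word) == 4:
        return '!YEAR'
    elif word[0].isdigit():
        return '!DIGITS'
    else:
        return word.lower()

for w in ("Nov.", "1989", "30", "non-executive"):
    print(w, "->", normalize(w))
```

Collapsing years, digit strings, and hyphenated compounds into single pseudo-tokens keeps the feature space small without losing the tag-relevant signal.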
/textblob_aptagger/trontagger-0.1.0.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sloria/textblob-aptagger/fb98bbd16a83650cab4819c4b89f0973e60fb3fe/textblob_aptagger/trontagger-0.1.0.pickle
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist = py26,py27,py33
3 | [testenv]
4 | deps=nose
5 | commands=
6 | python run_tests.py
7 |
--------------------------------------------------------------------------------