├── .gitignore ├── README.md ├── default_model_test.py ├── detectormorse │   ├── __init__.py │   ├── __main__.py │   ├── detector.py │   ├── models │   │   └── DM-wsj.json.gz │   └── ptbtokenizer.py ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.py[co] 3 | *.log 4 | *.sh 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Detector Morse 2 | ============== 3 | 4 | Detector Morse is a program for sentence boundary detection (henceforth, SBD), also known as sentence segmentation. Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank: 5 | 6 |     Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain 7 |     steady at about 1,200 cars in 1990. 8 | 9 | This sentence contains four periods, but only the last denotes a sentence boundary. The first period in `U.S.` is unambiguously part of an acronym, not a sentence boundary; the same is true of the period in expressions like `$12.53`. But the periods at the end of `Inc.` and `U.S.` could easily denote a sentence boundary. Humans use the local context to determine that neither period denotes a sentence boundary (e.g., the selectional properties of the verb _expect_ are not satisfied if there is a sentence boundary immediately after `U.S.`). Detector Morse uses artisanal, handcrafted contextual features and low-impact, leave-no-trace machine learning methods to detect sentence boundaries automatically. 10 | 11 | SBD is one of the earliest steps in many natural language processing pipelines. Since errors at this step are likely to propagate, SBD is an important---albeit overlooked---problem in natural language processing. 12 | 13 | Detector Morse has been tested on CPython 3.4 and PyPy3 (2.3.1, corresponding to Python 3.2); the latter is much faster. Detector Morse depends on the Python module `nlup` (which in turn relies on `jsonpickle`) to (de)serialize models. For the versions used, see `requirements.txt`. 14 | 15 | Installation 16 | ============ 17 | 18 | ``` 19 | pip install detectormorse 20 | ``` 21 | 22 | Usage 23 | ===== 24 | 25 | ``` 26 | Detector Morse, by Kyle Gorman 27 | 28 | usage: python -m detectormorse [-h] [-v | -V] (-t TRAIN | -r [READ]) 29 |                                (-s SEGMENT | -w WRITE | -e EVALUATE) 30 |                                [-E EPOCHS] [-C] [--preserve-whitespace] 31 | 32 | Detector Morse 33 | 34 | optional arguments: 35 |   -h, --help            show this help message and exit 36 |   -v, --verbose         enable verbose output 37 |   -V, --really-verbose  enable even more verbose output 38 |   -t TRAIN, --train TRAIN 39 |                         training data 40 |   -r [READ], --read [READ] 41 |                         read in a serialized model from a path or read the 42 |                         default model if no path is specified 43 |   -s SEGMENT, --segment SEGMENT 44 |                         segment sentences 45 |   -w WRITE, --write WRITE 46 |                         write out serialized model 47 |   -e EVALUATE, --evaluate EVALUATE 48 |                         evaluate on segmented data 49 |   -E EPOCHS, --epochs EPOCHS 50 |                         # of epochs (default: 20) 51 |   -C, --nocase          disable case features 52 |   --preserve-whitespace 53 |                         preserve whitespace when segmenting 54 | ``` 55 | 56 | Files used for training (`-t`/`--train`) and evaluation (`-e`/`--evaluate`) should contain one sentence per line; newline characters are otherwise ignored. 57 | 58 | When segmenting a file (`-s`/`--segment`), DetectorMorse simply inserts a newline after predicted sentence boundaries that aren't already marked by one. All other newline characters are passed through, unmolested. 59 | 60 | The included `DM-wsj.json.gz` is a segmenter model trained on the Wall St. Journal portion of the Penn Treebank. This model can be loaded using `detector.default_model()` or by specifying `-r` with no path at the command line.
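61 | 62 | For example, the following loads the bundled model and segments a string directly from Python; it uses only the `default_model()` and `segments()` calls exercised in `default_model_test.py` (the sample string here is arbitrary): 63 | 64 | ``` 65 | from detectormorse import detector 66 | 67 | dm = detector.default_model() 68 | for sentence in dm.segments("Mr. Smith arrived at 10 a.m. He left at noon."): 69 |     print(sentence) 70 | ```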
71 | 72 | Method 73 | ====== 74 | 75 | See [this blog post](http://www.wellformedness.com/blog/simpler-sentence-boundary-detection/). 76 | 77 | Caveats 78 | ======= 79 | 80 | DetectorMorse processes text by reading the entire file into memory. This means it will not work with files that won't fit into the available RAM. The easiest way to get around this is to import the `Detector` class in your own Python script and feed it the text in manageable pieces, as in the sketch below. 81 |
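82 | Here is a minimal sketch that streams a large file one blank-line-delimited paragraph at a time; the filename and the paragraph convention are illustrative assumptions (adapt them to your data), and it presumes no sentence crosses a paragraph break: 83 | 84 | ``` 85 | from detectormorse import detector 86 | 87 | dm = detector.default_model()  # or detector.Detector.load("path/to/model.json.gz") 88 | buffered = [] 89 | with open("huge.txt", encoding="utf-8") as source:  # assumed input file 90 |     for line in source: 91 |         if line.strip(): 92 |             buffered.append(line) 93 |             continue 94 |         # Blank line: flush the buffered paragraph through the segmenter. 95 |         for sentence in dm.segments("".join(buffered)): 96 |             print(sentence) 97 |         buffered = [] 98 |     if buffered:  # flush any final, unterminated paragraph 99 |         for sentence in dm.segments("".join(buffered)): 100 |             print(sentence) 101 | ```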
102 | 103 | Exciting extras! 104 | ================ 105 | 106 | I've included a Perl script `untokenize.pl` which attempts to invert the Penn Treebank tokenization process. Tokenization is an inherently "lossy" procedure, so there is no guarantee that the output is exactly how it appeared in the WSJ. But the rules appear to be correct and produce sane text, and I have used it for all experiments. **Update (2015-02-10): I've removed this script; I just use the Stanford tokenizer for this purpose, now.** 107 | -------------------------------------------------------------------------------- /default_model_test.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from detectormorse import detector 4 | 5 | SENT_ROLLS = """Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain 6 | steady at about 1,200 cars in 1990.""" 7 | SENT_TORTURE = "Dr. F. Jones M.D. doesn't have a Ph.D. and never went to N. Korea." 8 | SENT_BOTH_SPACE = " ".join((SENT_ROLLS, SENT_TORTURE)) 9 | SENT_BOTH_NEWLINE = "\n".join((SENT_ROLLS, SENT_TORTURE)) 10 | SENT_BOTH_EXTRA_WHITESPACE = ' ' + SENT_ROLLS + ' ' + SENT_TORTURE + ' ' 11 | 12 | 13 | @pytest.fixture(scope='module') 14 | def default_model(): 15 |     return detector.default_model() 16 | 17 | 18 | def test_single_sentence(default_model): 19 |     """A single sentence is segmented as a single sentence.""" 20 |     sents = list(default_model.segments(SENT_ROLLS)) 21 |     assert len(sents) == 1 22 |     # Note that this confirms that newline is passed through without issue 23 |     assert sents[0] == SENT_ROLLS 24 | 25 | def test_two_sentences_space(default_model): 26 |     """Two sentences joined by space are segmented as two sentences.""" 27 |     sents = list(default_model.segments(SENT_BOTH_SPACE)) 28 |     assert len(sents) == 2 29 |     assert sents == [SENT_ROLLS, SENT_TORTURE] 30 | 31 | 32 | def test_two_sentences_newline(default_model): 33 |     """Two sentences joined by newline are segmented as two sentences.""" 34 |     sents = list(default_model.segments(SENT_BOTH_NEWLINE)) 35 |     assert len(sents) == 2 36 |     assert sents == [SENT_ROLLS, SENT_TORTURE] 37 | 38 | 39 | def test_strip(default_model): 40 |     """Strip non-initial whitespace by default.""" 41 |     sents = list(default_model.segments(SENT_BOTH_EXTRA_WHITESPACE)) 42 |     assert len(sents) == 2 43 |     assert sents[0] == ' ' + SENT_ROLLS 44 |     assert sents[1] == SENT_TORTURE 45 | 46 | 47 | def test_strip_false(default_model): 48 |     """Preserve all whitespace when strip=False.""" 49 |     sents = list(default_model.segments(SENT_BOTH_EXTRA_WHITESPACE, strip=False)) 50 |     assert len(sents) == 2 51 |     assert sents[0] == ' ' + SENT_ROLLS + ' ' 52 |     assert sents[1] == SENT_TORTURE + ' ' 53 | -------------------------------------------------------------------------------- /detectormorse/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | 23 | from .detector import Detector, slurp 24 | from .ptbtokenizer import word_tokenize 25 | -------------------------------------------------------------------------------- /detectormorse/__main__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 22 | 23 | import logging 24 | 25 | from argparse import ArgumentParser 26 | 27 | 28 | from .detector import Detector, slurp, EPOCHS, default_model 29 | 30 | # Sentinel for reading in the default model 31 | _READ_DEFAULT = object() 32 | 33 | 34 | LOGGING_FMT = "%(message)s" 35 | 36 | 37 | argparser = ArgumentParser(prog="python -m detectormorse", 38 | description="Detector Morse") 39 | vrb_group = argparser.add_mutually_exclusive_group() 40 | vrb_group.add_argument("-v", "--verbose", action="store_true", 41 | help="enable verbose output") 42 | vrb_group.add_argument("-V", "--really-verbose", action="store_true", 43 | help="enable even more verbose output") 44 | inp_group = argparser.add_mutually_exclusive_group(required=True) 45 | inp_group.add_argument("-t", "--train", help="training data") 46 | inp_group.add_argument("-r", "--read", nargs='?', const=_READ_DEFAULT, 47 | help="read in a serialized model from a path or read " 48 | "the default model if no path is specified") 49 | out_group = argparser.add_mutually_exclusive_group(required=True) 50 | out_group.add_argument("-s", "--segment", help="segment sentences") 51 | out_group.add_argument("-w", "--write", 52 | help="write out serialized model") 53 | out_group.add_argument("-e", "--evaluate", 54 | help="evaluate on segmented data") 55 | argparser.add_argument("-E", "--epochs", type=int, default=EPOCHS, 56 | help="# of epochs (default: {})".format(EPOCHS)) 57 | argparser.add_argument("-C", "--nocase", action="store_true", 58 | help="disable case features") 59 | argparser.add_argument("--preserve-whitespace", action="store_true", 60 | help="preserve whitespace when segmenting") 61 | args = argparser.parse_args() 62 | # verbosity block 63 | if args.really_verbose: 64 | logging.basicConfig(format=LOGGING_FMT, level="DEBUG") 65 | elif args.verbose: 66 | logging.basicConfig(format=LOGGING_FMT, level="INFO") 67 | else: 68 | logging.basicConfig(format=LOGGING_FMT) 69 | # input block 70 | detector = None 71 | if args.train: 72 | logging.info("Training model on '{}'.".format(args.train)) 73 | detector = Detector(slurp(args.train), epochs=args.epochs, 74 | nocase=args.nocase) 75 | elif args.read: 76 | if args.read is _READ_DEFAULT: 77 | # If sentinel specified, load default model 78 | logging.info("Reading default model.") 79 | detector = default_model() 80 | else: 81 | # Otherwise load normally 82 | logging.info("Reading pretrained model '{}'.".format(args.read)) 83 | detector = Detector.load(args.read) 84 | # output block 85 | if args.segment: 86 | logging.info("Segmenting '{}'.".format(args.segment)) 87 | print("\n".join(detector.segments(slurp(args.segment), 88 | strip=not args.preserve_whitespace))) 89 | if args.write: 90 | logging.info("Writing model to '{}'.".format(args.write)) 91 | detector.dump(args.write) 92 | elif args.evaluate: 93 | logging.info("Evaluating model on '{}'.".format(args.evaluate)) 94 | cx = detector.evaluate(slurp(args.evaluate)) 95 | if args.verbose or args.really_verbose: 96 | cx.pprint() 97 | print(cx.summary) 98 | -------------------------------------------------------------------------------- /detectormorse/detector.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, 
modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | import logging 23 | from collections import namedtuple 24 | from re import finditer, match, search 25 | 26 | import pkg_resources 27 | from nlup import case_feature, isnumberlike, listify, \ 28 | BinaryAveragedPerceptron, BinaryConfusion, JSONable 29 | 30 | from .ptbtokenizer import word_tokenize 31 | 32 | # FIXME(kbg) can surely avoid full-blown tokenization 33 | 34 | 35 | # defaults 36 | 37 | NOCASE = False # disable case-based features? 38 | EPOCHS = 20 # number of epochs (iterations for classifier training) 39 | BUFSIZE = 32 # for reading in left and right contexts...see below 40 | CLIP = 8 # clip numerical count feature values 41 | DEFAULT_MODEL = 'DM-wsj.json.gz' 42 | 43 | # character classes 44 | 45 | VOWELS = frozenset("AEIOUY") 46 | 47 | # token classes 48 | 49 | QUOTE_TOKEN = "*QUOTE*" 50 | NUMBER_TOKEN = "*NUMBER*" 51 | 52 | # regexes 53 | 54 | PUNCT = r"((\.+)|([!?]))" 55 | TARGET = PUNCT + r"(['`\")}\]]*)(\s+)" 56 | 57 | LTOKEN = r"(\S+)\s*$" 58 | RTOKEN = r"^\s*(\S+)" 59 | NEWLINE = r"^\s*[\r\n]+\s*$" 60 | 61 | QUOTE = r"^['`\"]+$" 62 | 63 | # other 64 | 65 | Observation = namedtuple("Observation", ["L", "P", "R", "B", "end"]) 66 | 67 | 68 | def slurp(filename, encoding='utf-8'): 69 | """ 70 | Given a `filename` string, slurp the whole file into a string 71 | """ 72 | with open(filename, encoding=encoding) as source: 73 | return source.read() 74 | 75 | 76 | def load_from_resource(name): 77 | """ 78 | Return a Detector loaded from resource with the specified name. 79 | 80 | The model name must match a filename existing under /models 81 | in this package. 82 | """ 83 | # Note that you do not want os.path.join here as all resource paths 84 | # use forward slash 85 | filename = pkg_resources.resource_filename(__name__, 'models/' + name) 86 | return Detector.load(filename) 87 | 88 | 89 | def default_model(): 90 | """ 91 | Return a Detector loaded from the default model. 92 | 93 | Currently, the default model is trained on WSJ. 
94 | """ 95 | return load_from_resource(DEFAULT_MODEL) 96 | 97 | 98 | class Detector(JSONable): 99 | 100 | def __init__(self, text=None, nocase=NOCASE, epochs=EPOCHS, 101 | classifier=BinaryAveragedPerceptron, **kwargs): 102 | self.classifier = classifier(**kwargs) 103 | self.nocase = nocase 104 | if text: 105 | self.fit(text, epochs) 106 | 107 | def __repr__(self): 108 | return "{}(classifier={!r})".format(self.__class__.__name__, 109 | self.classifier) 110 | 111 | # identify candidate regions 112 | 113 | @staticmethod 114 | def candidates(text): 115 | """ 116 | Given a `text` string, get candidates and context for feature 117 | extraction and classification 118 | """ 119 | for Pmatch in finditer(TARGET, text): 120 | # the punctuation mark itself 121 | P = Pmatch.group(1) 122 | # is it a boundary? 123 | B = bool(match(NEWLINE, Pmatch.group(5))) 124 | # L & R 125 | start = Pmatch.start() 126 | end = Pmatch.end() 127 | Lmatch = search(LTOKEN, text[max(0, start - BUFSIZE):start]) 128 | if not Lmatch: # this happens when a line begins with '.' 129 | continue 130 | L = word_tokenize(" " + Lmatch.group(1))[-1] 131 | Rmatch = search(RTOKEN, text[end:end + BUFSIZE]) 132 | if not Rmatch: # this happens at the end of the file, usually 133 | continue 134 | R = word_tokenize(Rmatch.group(1) + " ")[0] 135 | # complete observation 136 | yield Observation(L, P, R, B, end) 137 | 138 | # extract features 139 | 140 | @listify 141 | def extract_one(self, L, P, R): 142 | """ 143 | Given left context `L`, punctuation mark `P`, and right context 144 | R`, extract features. Probability distributions for any 145 | quantile-based features will not be modified. 146 | """ 147 | yield "*bias*" 148 | # L feature(s) 149 | if match(QUOTE, L): 150 | L = QUOTE_TOKEN 151 | elif isnumberlike(L): 152 | L = NUMBER_TOKEN 153 | else: 154 | yield "len(L)={}".format(min(len(L), CLIP)) 155 | if "." in L: 156 | yield "L:*period*" 157 | if not self.nocase: 158 | cf = case_feature(L) 159 | if cf: 160 | yield "L:{}'".format(cf) 161 | L = L.upper() 162 | if not any(char in VOWELS for char in L): 163 | yield "L:*no-vowel*" 164 | L_feat = "L='{}'".format(L) 165 | yield L_feat 166 | # P feature(s) 167 | yield "P='{}'".format(P) 168 | # R feature(s) 169 | if match(QUOTE, R): 170 | R = QUOTE_TOKEN 171 | elif isnumberlike(R): 172 | R = NUMBER_TOKEN 173 | else: 174 | if not self.nocase: 175 | cf = case_feature(R) 176 | if cf: 177 | yield "R:{}'".format(cf) 178 | R = R.upper() 179 | R_feat = "R='{}'".format(R) 180 | yield R_feat 181 | # the combined L,R feature 182 | yield "{},{}".format(L_feat, R_feat) 183 | 184 | # actual detector operations 185 | 186 | def fit(self, text, epochs=EPOCHS): 187 | """ 188 | Given a string `text`, use it to train the segmentation classifier 189 | for `epochs` iterations. 190 | """ 191 | logging.debug("Extracting features and classifications.") 192 | Phi = [] 193 | Y = [] 194 | for (L, P, R, gold, _) in Detector.candidates(text): 195 | Phi.append(self.extract_one(L, P, R)) 196 | Y.append(gold) 197 | self.classifier.fit(Y, Phi, epochs) 198 | logging.debug("Fitting complete.") 199 | 200 | def predict(self, L, P, R): 201 | """ 202 | Given an left context `L`, punctuation mark `P`, and right context 203 | `R`, return True iff this observation is hypothesized to be a 204 | sentence boundary. 
205 | """ 206 | phi = self.extract_one(L, P, R) 207 | return self.classifier.predict(phi) 208 | 209 | def segments(self, text, strip=True): 210 | """ 211 | Given a string of `text`, return a generator yielding each 212 | hypothesized sentence string 213 | """ 214 | start = 0 215 | for (L, P, R, B, end) in Detector.candidates(text): 216 | if self.predict(L, P, R): 217 | sent = text[start:end] 218 | if strip: 219 | sent = sent.rstrip() 220 | yield sent 221 | start = end 222 | # otherwise, there's probably not a sentence boundary here 223 | sent = text[start:] 224 | if strip: 225 | sent = sent.rstrip() 226 | yield sent 227 | 228 | def evaluate(self, text): 229 | """ 230 | Given a string of `text`, compute confusion matrix for the 231 | classification task. 232 | """ 233 | cx = BinaryConfusion() 234 | for (L, P, R, gold, _) in Detector.candidates(text): 235 | guess = self.predict(L, P, R) 236 | cx.update(gold, guess) 237 | if not gold and guess: 238 | logging.debug("False pos.: L='{}', R='{}'.".format(L, R)) 239 | elif gold and not guess: 240 | logging.debug("False neg.: L='{}', R='{}'.".format(L, R)) 241 | return cx 242 | -------------------------------------------------------------------------------- /detectormorse/models/DM-wsj.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cslu-nlp/DetectorMorse/d381926ff39f69f1f6f2088342d3543138e3cb83/detectormorse/models/DM-wsj.json.gz -------------------------------------------------------------------------------- /detectormorse/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | 23 | """ 24 | Penn Treebank tokenizer, adapted from `nltk.tokenize.treebank.py`, which 25 | in turn is adapted from an infamous sed script by Robert McIntyre. Even 26 | ignoring the reduced import overhead, this is about half again faster than 27 | the NLTK version; don't ask me why. 28 | 29 | >>> s = '''Good muffins cost $3.88\\nin New York. Please buy me\\ntwo of them.\\nThanks.''' 30 | >>> word_tokenize(s) 31 | ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.'] 32 | >>> s = "They'll save and invest more." 
33 | >>> word_tokenize(s) 34 | ['They', "'ll", 'save', 'and', 'invest', 'more', '.'] 35 | """ 36 | 37 | from re import sub 38 | 39 | 40 | RULES1 = [  # starting quotes 41 |     (r'^\"', r'``'), 42 |     (r'(``)', r' \1 '), 43 |     (r'([ (\[{<])"', r'\1 `` '), 44 |     # punctuation 45 |     (r'([:,])([^\d])', r' \1 \2'), 46 |     (r'\.\.\.', r' ... '), 47 |     (r'[;@#$%&]', r' \g<0> '), 48 |     (r'([^\.])(\.)([\]\)}>"\']*)\s*$', r'\1 \2\3 '), 49 |     (r'[?!]', r' \g<0> '), 50 |     (r"([^'])' ", r"\1 ' "), 51 |     # parens, brackets, etc. 52 |     (r'[\]\[\(\)\{\}\<\>]', r' \g<0> '), 53 |     (r'--', r' -- ')] 54 | 55 | # ending quotes 56 | RULES2 = [(r'"', " '' "), 57 |           (r'(\S)(\'\')', r'\1 \2 ')] 58 | 59 | # all replaced with r"\1 \2 " 60 | CONTRACTIONS = [r"(?i)([^' ])('S|'M|'D|') ", 61 |                 r"(?i)([^' ])('LL|'RE|'VE|N'T) ", 62 |                 r"(?i)\b(CAN)(NOT)\b", 63 |                 r"(?i)\b(D)('YE)\b", 64 |                 r"(?i)\b(GIM)(ME)\b", 65 |                 r"(?i)\b(GON)(NA)\b", 66 |                 r"(?i)\b(GOT)(TA)\b", 67 |                 r"(?i)\b(LEM)(ME)\b", 68 |                 r"(?i)\b(MOR)('N)\b", 69 |                 r"(?i)\b(WAN)(NA) ", 70 |                 r"(?i) ('T)(IS)\b", 71 |                 r"(?i) ('T)(WAS)\b"] 72 | 73 | 74 | def word_tokenize(text): 75 |     """ 76 |     Split string `text` into word tokens using the Penn Treebank rules 77 |     """ 78 |     for (regexp, replacement) in RULES1: 79 |         text = sub(regexp, replacement, text) 80 |     # add extra space to make things easier 81 |     text = " " + text + " " 82 |     for (regexp, replacement) in RULES2: 83 |         text = sub(regexp, replacement, text) 84 |     for regexp in CONTRACTIONS: 85 |         text = sub(regexp, r"\1 \2 ", text) 86 |     # split and return 87 |     return text.split() 88 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | nlup>=0.7 2 | setuptools 3 | pytest 4 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from io import open 4 | from os import path 5 | from setuptools import setup 6 | 7 | this_directory = path.abspath(path.dirname(__file__)) 8 | with open(path.join(this_directory, "README.md"), encoding="utf8") as f: 9 |     long_description = f.read() 10 | 11 | setup(name="DetectorMorse", 12 |       version="0.4.1", 13 |       description="DetectorMorse, a sentence splitter", 14 |       long_description=long_description, 15 |       long_description_content_type="text/markdown", 16 |       author="Kyle Gorman", 17 |       author_email="kylebgorman@gmail.com", 18 |       packages=["detectormorse"], 19 |       package_data={ 20 |           "detectormorse": ["models/*"], 21 |       }, 22 |       install_requires=[ 23 |           "nlup>=0.7", 24 |           "setuptools",  # For pkg_resources. 25 |       ], 26 |       test_suite="default_model_test", 27 |       ) 28 | --------------------------------------------------------------------------------