├── .gitignore ├── README.md ├── default_model_test.py ├── detectormorse │   ├── __init__.py │   ├── __main__.py │   ├── detector.py │   ├── models │   │   └── DM-wsj.json.gz │   └── ptbtokenizer.py ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.py[co] 3 | *.log 4 | *.sh 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Detector Morse 2 | ============== 3 | 4 | Detector Morse is a program for sentence boundary detection (henceforth, SBD), also known as sentence segmentation. Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank: 5 | 6 |     Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain 7 |     steady at about 1,200 cars in 1990. 8 | 9 | This sentence contains four periods, but only the last denotes a sentence boundary. The first period in `U.S.` is unambiguously part of an acronym, not a sentence boundary; the same is true of the period in expressions like `$12.53`. But the periods at the end of `Inc.` and `U.S.` could easily denote a sentence boundary. Humans use the local context to determine that neither period denotes a sentence boundary (e.g., the selectional properties of the verb _expect_ are not satisfied if there is a sentence boundary immediately after `U.S.`). Detector Morse uses artisanal, handcrafted contextual features and low-impact, leave-no-trace machine learning methods to detect sentence boundaries automatically. 10 | 11 | SBD is one of the earliest steps in many natural language processing pipelines. Since errors at this step are likely to propagate, SBD is an important---albeit overlooked---problem in natural language processing. 12 | 13 | Detector Morse has been tested on CPython 3.4 and PyPy3 (2.3.1, corresponding to Python 3.2); the latter is much faster. Detector Morse depends on the Python module `nlup` (which in turn relies on `jsonpickle`) to (de)serialize models. For the versions used, see `requirements.txt`. 14 | 15 | Installation 16 | ============ 17 | 18 | ``` 19 | pip install detectormorse 20 | ``` 21 | 22 | Usage 23 | ===== 24 | 25 | ``` 26 | Detector Morse, by Kyle Gorman 27 | 28 | usage: python -m detectormorse [-h] [-v | -V] (-t TRAIN | -r [READ]) 29 |                                (-s SEGMENT | -w WRITE | -e EVALUATE) 30 |                                [-E EPOCHS] [-C] [--preserve-whitespace] 31 | 32 | Detector Morse 33 | 34 | optional arguments: 35 |   -h, --help            show this help message and exit 36 |   -v, --verbose         enable verbose output 37 |   -V, --really-verbose  enable even more verbose output 38 |   -t TRAIN, --train TRAIN 39 |                         training data 40 |   -r [READ], --read [READ] 41 |                         read in a serialized model from a path or read the 42 |                         default model if no path is specified 43 |   -s SEGMENT, --segment SEGMENT 44 |                         segment sentences 45 |   -w WRITE, --write WRITE 46 |                         write out serialized model 47 |   -e EVALUATE, --evaluate EVALUATE 48 |                         evaluate on segmented data 49 |   -E EPOCHS, --epochs EPOCHS 50 |                         # of epochs (default: 20) 51 |   -C, --nocase          disable case features 52 |   --preserve-whitespace 53 |                         preserve whitespace when segmenting 54 | ``` 55 | 56 | Files used for training (`-t`/`--train`) and evaluation (`-e`/`--evaluate`) should contain one sentence per line; newline characters are otherwise ignored. 57 | 58 | When segmenting a file (`-s`/`--segment`), DetectorMorse simply inserts a newline after predicted sentence boundaries that aren't already marked by one. All other newline characters are passed through, unmolested. 59 | 60 | The included `DM-wsj.json.gz` is a segmenter model trained on the Wall St. Journal portion of the Penn Treebank. This model can be loaded using `detector.default_model()` or by specifying `-r` with no path at the command line.
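61 | 62 | For example, the following loads the bundled model and segments a string directly from Python; it uses only the `default_model()` and `segments()` calls exercised in `default_model_test.py` (the sample string here is arbitrary): 63 | 64 | ``` 65 | from detectormorse import detector 66 | 67 | dm = detector.default_model() 68 | for sentence in dm.segments("Mr. Smith arrived at 10 a.m. He left at noon."): 69 |     print(sentence) 70 | ```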
71 | 72 | Method 73 | ====== 74 | 75 | See [this blog post](http://www.wellformedness.com/blog/simpler-sentence-boundary-detection/). 76 | 77 | Caveats 78 | ======= 79 | 80 | DetectorMorse processes text by reading the entire file into memory. This means it will not work with files that won't fit into the available RAM. The easiest way to get around this is to import the `Detector` class in your own Python script and feed it the text in manageable pieces, as in the sketch below. 81 |
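82 | Here is a minimal sketch that streams a large file one blank-line-delimited paragraph at a time; the filename and the paragraph convention are illustrative assumptions (adapt them to your data), and it presumes no sentence crosses a paragraph break: 83 | 84 | ``` 85 | from detectormorse import detector 86 | 87 | dm = detector.default_model()  # or detector.Detector.load("path/to/model.json.gz") 88 | buffered = [] 89 | with open("huge.txt", encoding="utf-8") as source:  # assumed input file 90 |     for line in source: 91 |         if line.strip(): 92 |             buffered.append(line) 93 |             continue 94 |         # Blank line: flush the buffered paragraph through the segmenter. 95 |         for sentence in dm.segments("".join(buffered)): 96 |             print(sentence) 97 |         buffered = [] 98 |     if buffered:  # flush any final, unterminated paragraph 99 |         for sentence in dm.segments("".join(buffered)): 100 |             print(sentence) 101 | ```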
102 | 103 | Exciting extras! 104 | ================ 105 | 106 | I've included a Perl script `untokenize.pl` which attempts to invert the Penn Treebank tokenization process. Tokenization is an inherently "lossy" procedure, so there is no guarantee that the output is exactly how it appeared in the WSJ. But the rules appear to be correct and produce sane text, and I have used it for all experiments. **Update (2015-02-10): I've removed this script; I just use the Stanford tokenizer for this purpose, now.** 107 | -------------------------------------------------------------------------------- /default_model_test.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from detectormorse import detector 4 | 5 | SENT_ROLLS = """Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain 6 | steady at about 1,200 cars in 1990.""" 7 | SENT_TORTURE = "Dr. F. Jones M.D. doesn't have a Ph.D. and never went to N. Korea." 8 | SENT_BOTH_SPACE = " ".join((SENT_ROLLS, SENT_TORTURE)) 9 | SENT_BOTH_NEWLINE = "\n".join((SENT_ROLLS, SENT_TORTURE)) 10 | SENT_BOTH_EXTRA_WHITESPACE = ' ' + SENT_ROLLS + ' ' + SENT_TORTURE + ' ' 11 | 12 | 13 | @pytest.fixture(scope='module') 14 | def default_model(): 15 |     return detector.default_model() 16 | 17 | 18 | def test_single_sentence(default_model): 19 |     """A single sentence is segmented as a single sentence.""" 20 |     sents = list(default_model.segments(SENT_ROLLS)) 21 |     assert len(sents) == 1 22 |     # Note that this confirms that newline is passed through without issue 23 |     assert sents[0] == SENT_ROLLS 24 | 25 | def test_two_sentences_space(default_model): 26 |     """Two sentences joined by space are segmented as two sentences.""" 27 |     sents = list(default_model.segments(SENT_BOTH_SPACE)) 28 |     assert len(sents) == 2 29 |     assert sents == [SENT_ROLLS, SENT_TORTURE] 30 | 31 | 32 | def test_two_sentences_newline(default_model): 33 |     """Two sentences joined by newline are segmented as two sentences.""" 34 |     sents = list(default_model.segments(SENT_BOTH_NEWLINE)) 35 |     assert len(sents) == 2 36 |     assert sents == [SENT_ROLLS, SENT_TORTURE] 37 | 38 | 39 | def test_strip(default_model): 40 |     """Strip non-initial whitespace by default.""" 41 |     sents = list(default_model.segments(SENT_BOTH_EXTRA_WHITESPACE)) 42 |     assert len(sents) == 2 43 |     assert sents[0] == ' ' + SENT_ROLLS 44 |     assert sents[1] == SENT_TORTURE 45 | 46 | 47 | def test_strip_false(default_model): 48 |     """Preserve all whitespace when strip=False.""" 49 |     sents = list(default_model.segments(SENT_BOTH_EXTRA_WHITESPACE, strip=False)) 50 |     assert len(sents) == 2 51 |     assert sents[0] == ' ' + SENT_ROLLS + ' ' 52 |     assert sents[1] == SENT_TORTURE + ' ' 53 | -------------------------------------------------------------------------------- /detectormorse/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | 23 | from .detector import Detector, slurp 24 | from .ptbtokenizer import word_tokenize 25 | -------------------------------------------------------------------------------- /detectormorse/__main__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 22 | 23 | import logging 24 | 25 | from argparse import ArgumentParser 26 | 27 | 28 | from .detector import Detector, slurp, EPOCHS, default_model 29 | 30 | # Sentinel for reading in the default model 31 | _READ_DEFAULT = object() 32 | 33 | 34 | LOGGING_FMT = "%(message)s" 35 | 36 | 37 | argparser = ArgumentParser(prog="python -m detectormorse", 38 | description="Detector Morse") 39 | vrb_group = argparser.add_mutually_exclusive_group() 40 | vrb_group.add_argument("-v", "--verbose", action="store_true", 41 | help="enable verbose output") 42 | vrb_group.add_argument("-V", "--really-verbose", action="store_true", 43 | help="enable even more verbose output") 44 | inp_group = argparser.add_mutually_exclusive_group(required=True) 45 | inp_group.add_argument("-t", "--train", help="training data") 46 | inp_group.add_argument("-r", "--read", nargs='?', const=_READ_DEFAULT, 47 | help="read in a serialized model from a path or read " 48 | "the default model if no path is specified") 49 | out_group = argparser.add_mutually_exclusive_group(required=True) 50 | out_group.add_argument("-s", "--segment", help="segment sentences") 51 | out_group.add_argument("-w", "--write", 52 | help="write out serialized model") 53 | out_group.add_argument("-e", "--evaluate", 54 | help="evaluate on segmented data") 55 | argparser.add_argument("-E", "--epochs", type=int, default=EPOCHS, 56 | help="# of epochs (default: {})".format(EPOCHS)) 57 | argparser.add_argument("-C", "--nocase", action="store_true", 58 | help="disable case features") 59 | argparser.add_argument("--preserve-whitespace", action="store_true", 60 | help="preserve whitespace when segmenting") 61 | args = argparser.parse_args() 62 | # verbosity block 63 | if args.really_verbose: 64 | logging.basicConfig(format=LOGGING_FMT, level="DEBUG") 65 | elif args.verbose: 66 | logging.basicConfig(format=LOGGING_FMT, level="INFO") 67 | else: 68 | logging.basicConfig(format=LOGGING_FMT) 69 | # input block 70 | detector = None 71 | if args.train: 72 | logging.info("Training model on '{}'.".format(args.train)) 73 | detector = Detector(slurp(args.train), epochs=args.epochs, 74 | nocase=args.nocase) 75 | elif args.read: 76 | if args.read is _READ_DEFAULT: 77 | # If sentinel specified, load default model 78 | logging.info("Reading default model.") 79 | detector = default_model() 80 | else: 81 | # Otherwise load normally 82 | logging.info("Reading pretrained model '{}'.".format(args.read)) 83 | detector = Detector.load(args.read) 84 | # output block 85 | if args.segment: 86 | logging.info("Segmenting '{}'.".format(args.segment)) 87 | print("\n".join(detector.segments(slurp(args.segment), 88 | strip=not args.preserve_whitespace))) 89 | if args.write: 90 | logging.info("Writing model to '{}'.".format(args.write)) 91 | detector.dump(args.write) 92 | elif args.evaluate: 93 | logging.info("Evaluating model on '{}'.".format(args.evaluate)) 94 | cx = detector.evaluate(slurp(args.evaluate)) 95 | if args.verbose or args.really_verbose: 96 | cx.pprint() 97 | print(cx.summary) 98 | -------------------------------------------------------------------------------- /detectormorse/detector.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, 
modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | import logging 23 | from collections import namedtuple 24 | from re import finditer, match, search 25 | 26 | import pkg_resources 27 | from nlup import case_feature, isnumberlike, listify, \ 28 | BinaryAveragedPerceptron, BinaryConfusion, JSONable 29 | 30 | from .ptbtokenizer import word_tokenize 31 | 32 | # FIXME(kbg) can surely avoid full-blown tokenization 33 | 34 | 35 | # defaults 36 | 37 | NOCASE = False # disable case-based features? 38 | EPOCHS = 20 # number of epochs (iterations for classifier training) 39 | BUFSIZE = 32 # for reading in left and right contexts...see below 40 | CLIP = 8 # clip numerical count feature values 41 | DEFAULT_MODEL = 'DM-wsj.json.gz' 42 | 43 | # character classes 44 | 45 | VOWELS = frozenset("AEIOUY") 46 | 47 | # token classes 48 | 49 | QUOTE_TOKEN = "*QUOTE*" 50 | NUMBER_TOKEN = "*NUMBER*" 51 | 52 | # regexes 53 | 54 | PUNCT = r"((\.+)|([!?]))" 55 | TARGET = PUNCT + r"(['`\")}\]]*)(\s+)" 56 | 57 | LTOKEN = r"(\S+)\s*$" 58 | RTOKEN = r"^\s*(\S+)" 59 | NEWLINE = r"^\s*[\r\n]+\s*$" 60 | 61 | QUOTE = r"^['`\"]+$" 62 | 63 | # other 64 | 65 | Observation = namedtuple("Observation", ["L", "P", "R", "B", "end"]) 66 | 67 | 68 | def slurp(filename, encoding='utf-8'): 69 | """ 70 | Given a `filename` string, slurp the whole file into a string 71 | """ 72 | with open(filename, encoding=encoding) as source: 73 | return source.read() 74 | 75 | 76 | def load_from_resource(name): 77 | """ 78 | Return a Detector loaded from resource with the specified name. 79 | 80 | The model name must match a filename existing under /models 81 | in this package. 82 | """ 83 | # Note that you do not want os.path.join here as all resource paths 84 | # use forward slash 85 | filename = pkg_resources.resource_filename(__name__, 'models/' + name) 86 | return Detector.load(filename) 87 | 88 | 89 | def default_model(): 90 | """ 91 | Return a Detector loaded from the default model. 92 | 93 | Currently, the default model is trained on WSJ. 
94 | """ 95 | return load_from_resource(DEFAULT_MODEL) 96 | 97 | 98 | class Detector(JSONable): 99 | 100 | def __init__(self, text=None, nocase=NOCASE, epochs=EPOCHS, 101 | classifier=BinaryAveragedPerceptron, **kwargs): 102 | self.classifier = classifier(**kwargs) 103 | self.nocase = nocase 104 | if text: 105 | self.fit(text, epochs) 106 | 107 | def __repr__(self): 108 | return "{}(classifier={!r})".format(self.__class__.__name__, 109 | self.classifier) 110 | 111 | # identify candidate regions 112 | 113 | @staticmethod 114 | def candidates(text): 115 | """ 116 | Given a `text` string, get candidates and context for feature 117 | extraction and classification 118 | """ 119 | for Pmatch in finditer(TARGET, text): 120 | # the punctuation mark itself 121 | P = Pmatch.group(1) 122 | # is it a boundary? 123 | B = bool(match(NEWLINE, Pmatch.group(5))) 124 | # L & R 125 | start = Pmatch.start() 126 | end = Pmatch.end() 127 | Lmatch = search(LTOKEN, text[max(0, start - BUFSIZE):start]) 128 | if not Lmatch: # this happens when a line begins with '.' 129 | continue 130 | L = word_tokenize(" " + Lmatch.group(1))[-1] 131 | Rmatch = search(RTOKEN, text[end:end + BUFSIZE]) 132 | if not Rmatch: # this happens at the end of the file, usually 133 | continue 134 | R = word_tokenize(Rmatch.group(1) + " ")[0] 135 | # complete observation 136 | yield Observation(L, P, R, B, end) 137 | 138 | # extract features 139 | 140 | @listify 141 | def extract_one(self, L, P, R): 142 | """ 143 | Given left context `L`, punctuation mark `P`, and right context 144 | R`, extract features. Probability distributions for any 145 | quantile-based features will not be modified. 146 | """ 147 | yield "*bias*" 148 | # L feature(s) 149 | if match(QUOTE, L): 150 | L = QUOTE_TOKEN 151 | elif isnumberlike(L): 152 | L = NUMBER_TOKEN 153 | else: 154 | yield "len(L)={}".format(min(len(L), CLIP)) 155 | if "." in L: 156 | yield "L:*period*" 157 | if not self.nocase: 158 | cf = case_feature(L) 159 | if cf: 160 | yield "L:{}'".format(cf) 161 | L = L.upper() 162 | if not any(char in VOWELS for char in L): 163 | yield "L:*no-vowel*" 164 | L_feat = "L='{}'".format(L) 165 | yield L_feat 166 | # P feature(s) 167 | yield "P='{}'".format(P) 168 | # R feature(s) 169 | if match(QUOTE, R): 170 | R = QUOTE_TOKEN 171 | elif isnumberlike(R): 172 | R = NUMBER_TOKEN 173 | else: 174 | if not self.nocase: 175 | cf = case_feature(R) 176 | if cf: 177 | yield "R:{}'".format(cf) 178 | R = R.upper() 179 | R_feat = "R='{}'".format(R) 180 | yield R_feat 181 | # the combined L,R feature 182 | yield "{},{}".format(L_feat, R_feat) 183 | 184 | # actual detector operations 185 | 186 | def fit(self, text, epochs=EPOCHS): 187 | """ 188 | Given a string `text`, use it to train the segmentation classifier 189 | for `epochs` iterations. 190 | """ 191 | logging.debug("Extracting features and classifications.") 192 | Phi = [] 193 | Y = [] 194 | for (L, P, R, gold, _) in Detector.candidates(text): 195 | Phi.append(self.extract_one(L, P, R)) 196 | Y.append(gold) 197 | self.classifier.fit(Y, Phi, epochs) 198 | logging.debug("Fitting complete.") 199 | 200 | def predict(self, L, P, R): 201 | """ 202 | Given an left context `L`, punctuation mark `P`, and right context 203 | `R`, return True iff this observation is hypothesized to be a 204 | sentence boundary. 
205 | """ 206 | phi = self.extract_one(L, P, R) 207 | return self.classifier.predict(phi) 208 | 209 | def segments(self, text, strip=True): 210 | """ 211 | Given a string of `text`, return a generator yielding each 212 | hypothesized sentence string 213 | """ 214 | start = 0 215 | for (L, P, R, B, end) in Detector.candidates(text): 216 | if self.predict(L, P, R): 217 | sent = text[start:end] 218 | if strip: 219 | sent = sent.rstrip() 220 | yield sent 221 | start = end 222 | # otherwise, there's probably not a sentence boundary here 223 | sent = text[start:] 224 | if strip: 225 | sent = sent.rstrip() 226 | yield sent 227 | 228 | def evaluate(self, text): 229 | """ 230 | Given a string of `text`, compute confusion matrix for the 231 | classification task. 232 | """ 233 | cx = BinaryConfusion() 234 | for (L, P, R, gold, _) in Detector.candidates(text): 235 | guess = self.predict(L, P, R) 236 | cx.update(gold, guess) 237 | if not gold and guess: 238 | logging.debug("False pos.: L='{}', R='{}'.".format(L, R)) 239 | elif gold and not guess: 240 | logging.debug("False neg.: L='{}', R='{}'.".format(L, R)) 241 | return cx 242 | -------------------------------------------------------------------------------- /detectormorse/models/DM-wsj.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cslu-nlp/DetectorMorse/d381926ff39f69f1f6f2088342d3543138e3cb83/detectormorse/models/DM-wsj.json.gz -------------------------------------------------------------------------------- /detectormorse/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2014 Kyle Gorman 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a 4 | # copy of this software and associated documentation files (the 5 | # "Software"), to deal in the Software without restriction, including 6 | # without limitation the rights to use, copy, modify, merge, publish, 7 | # distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so, subject to 9 | # the following conditions: 10 | # 11 | # The above copyright notice and this permission notice shall be included 12 | # in all copies or substantial portions of the Software. 13 | # 14 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 15 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | 22 | 23 | """ 24 | Penn Treebank tokenizer, adapted from `nltk.tokenize.treebank.py`, which 25 | in turn is adapted from an infamous sed script by Robert McIntyre. Even 26 | ignoring the reduced import overhead, this is about half again faster than 27 | the NLTK version; don't ask me why. 28 | 29 | >>> s = '''Good muffins cost $3.88\\nin New York. Please buy me\\ntwo of them.\\nThanks.''' 30 | >>> word_tokenize(s) 31 | ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.'] 32 | >>> s = "They'll save and invest more." 
33 | >>> word_tokenize(s) 34 | ['They', "'ll", 'save', 'and', 'invest', 'more', '.'] 35 | """ 36 | 37 | from re import sub 38 | 39 | 40 | RULES1 = [  # starting quotes 41 |     (r'^\"', r'``'), 42 |     (r'(``)', r' \1 '), 43 |     (r'([ (\[{<])"', r'\1 `` '), 44 |     # punctuation 45 |     (r'([:,])([^\d])', r' \1 \2'), 46 |     (r'\.\.\.', r' ... '), 47 |     (r'[;@#$%&]', r' \g<0> '), 48 |     (r'([^\.])(\.)([\]\)}>"\']*)\s*$', r'\1 \2\3 '), 49 |     (r'[?!]', r' \g<0> '), 50 |     (r"([^'])' ", r"\1 ' "), 51 |     # parens, brackets, etc. 52 |     (r'[\]\[\(\)\{\}\<\>]', r' \g<0> '), 53 |     (r'--', r' -- ')] 54 | 55 | # ending quotes 56 | RULES2 = [(r'"', " '' "), 57 |           (r'(\S)(\'\')', r'\1 \2 ')] 58 | 59 | # all replaced with r"\1 \2 " 60 | CONTRACTIONS = [r"(?i)([^' ])('S|'M|'D|') ", 61 |                 r"(?i)([^' ])('LL|'RE|'VE|N'T) ", 62 |                 r"(?i)\b(CAN)(NOT)\b", 63 |                 r"(?i)\b(D)('YE)\b", 64 |                 r"(?i)\b(GIM)(ME)\b", 65 |                 r"(?i)\b(GON)(NA)\b", 66 |                 r"(?i)\b(GOT)(TA)\b", 67 |                 r"(?i)\b(LEM)(ME)\b", 68 |                 r"(?i)\b(MOR)('N)\b", 69 |                 r"(?i)\b(WAN)(NA) ", 70 |                 r"(?i) ('T)(IS)\b", 71 |                 r"(?i) ('T)(WAS)\b"] 72 | 73 | 74 | def word_tokenize(text): 75 |     """ 76 |     Split string `text` into word tokens using the Penn Treebank rules 77 |     """ 78 |     for (regexp, replacement) in RULES1: 79 |         text = sub(regexp, replacement, text) 80 |     # add extra space to make things easier 81 |     text = " " + text + " " 82 |     for (regexp, replacement) in RULES2: 83 |         text = sub(regexp, replacement, text) 84 |     for regexp in CONTRACTIONS: 85 |         text = sub(regexp, r"\1 \2 ", text) 86 |     # split and return 87 |     return text.split() 88 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | nlup>=0.7 2 | setuptools 3 | pytest 4 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from io import open 4 | from os import path 5 | from setuptools import setup 6 | 7 | this_directory = path.abspath(path.dirname(__file__)) 8 | with open(path.join(this_directory, "README.md"), encoding="utf8") as f: 9 |     long_description = f.read() 10 | 11 | setup(name="DetectorMorse", 12 |       version="0.4.1", 13 |       description="DetectorMorse, a sentence splitter", 14 |       long_description=long_description, 15 |       long_description_content_type="text/markdown", 16 |       author="Kyle Gorman", 17 |       author_email="kylebgorman@gmail.com", 18 |       packages=["detectormorse"], 19 |       package_data={ 20 |           "detectormorse": ["models/*"], 21 |       }, 22 |       install_requires=[ 23 |           "nlup>=0.7", 24 |           "setuptools",  # For pkg_resources. 25 |       ], 26 |       test_suite="default_model_test", 27 |       ) 28 | --------------------------------------------------------------------------------