, 1 April 1989
333 | Ty Coon, President of Vice
334 |
335 | This General Public License does not permit incorporating your program into
336 | proprietary programs. If your program is a subroutine library, you may
337 | consider it more useful to permit linking proprietary applications with the
338 | library. If this is what you want to do, use the GNU Lesser General
339 | Public License instead of this License.
340 |
--------------------------------------------------------------------------------
/stanford_corenlp_python/README.md:
--------------------------------------------------------------------------------
1 | # Python interface to Stanford Core NLP tools v3.4.1
2 |
3 | This is a Python wrapper for Stanford University's NLP group's Java-based [CoreNLP tools](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.
4 |
5 |
6 | * Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, [named-entity recognition](http://en.wikipedia.org/wiki/Named-entity_recognition), and [coreference resolution](http://en.wikipedia.org/wiki/Coreference).
7 | * Runs a JSON-RPC server that wraps the Java server and outputs JSON.
8 | * Outputs parse trees which can be used by [nltk](http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html).
9 |
10 |
11 | It depends on [pexpect](http://www.noah.org/wiki/pexpect) and includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
12 |
13 | It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 3.4.1** released 2014-08-27.
14 |
15 | ## Download and Usage
16 |
17 | To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the compressed file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run. In other words:
18 |
19 | sudo pip install pexpect unidecode
20 | git clone git://github.com/dasmith/stanford-corenlp-python.git
21 | cd stanford-corenlp-python
22 | wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
23 | unzip stanford-corenlp-full-2014-08-27.zip
24 |
25 | Then launch the server:
26 |
27 | python corenlp.py
28 |
29 | Optionally, you can specify a host or port:
30 |
31 | python corenlp.py -H 0.0.0.0 -p 3456
32 |
33 | That will run a public JSON-RPC server on port 3456.
34 |
35 | Assuming you are running on port 8080, the code in `client.py` shows an example parse:
36 |
37 | import jsonrpc
38 | from simplejson import loads
39 | server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
40 | jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
41 |
42 | result = loads(server.parse("Hello world! It is so beautiful."))
43 | print "Result", result
44 |
45 | That returns a dictionary containing the keys `sentences` and `coref`. The key `sentences` contains a list of dictionaries for each sentence, which contain `parsetree`, `text`, `tuples` containing the dependencies, and `words`, containing information about parts of speech, recognized named-entities, etc:
46 |
47 | {u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
48 | u'text': u'Hello world!',
49 | u'tuples': [[u'dep', u'world', u'Hello'],
50 | [u'root', u'ROOT', u'world']],
51 | u'words': [[u'Hello',
52 | {u'CharacterOffsetBegin': u'0',
53 | u'CharacterOffsetEnd': u'5',
54 | u'Lemma': u'hello',
55 | u'NamedEntityTag': u'O',
56 | u'PartOfSpeech': u'UH'}],
57 | [u'world',
58 | {u'CharacterOffsetBegin': u'6',
59 | u'CharacterOffsetEnd': u'11',
60 | u'Lemma': u'world',
61 | u'NamedEntityTag': u'O',
62 | u'PartOfSpeech': u'NN'}],
63 | [u'!',
64 | {u'CharacterOffsetBegin': u'11',
65 | u'CharacterOffsetEnd': u'12',
66 | u'Lemma': u'!',
67 | u'NamedEntityTag': u'O',
68 | u'PartOfSpeech': u'.'}]]},
69 | {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
70 | u'text': u'It is so beautiful.',
71 | u'tuples': [[u'nsubj', u'beautiful', u'It'],
72 | [u'cop', u'beautiful', u'is'],
73 | [u'advmod', u'beautiful', u'so'],
74 | [u'root', u'ROOT', u'beautiful']],
75 | u'words': [[u'It',
76 | {u'CharacterOffsetBegin': u'14',
77 | u'CharacterOffsetEnd': u'16',
78 | u'Lemma': u'it',
79 | u'NamedEntityTag': u'O',
80 | u'PartOfSpeech': u'PRP'}],
81 | [u'is',
82 | {u'CharacterOffsetBegin': u'17',
83 | u'CharacterOffsetEnd': u'19',
84 | u'Lemma': u'be',
85 | u'NamedEntityTag': u'O',
86 | u'PartOfSpeech': u'VBZ'}],
87 | [u'so',
88 | {u'CharacterOffsetBegin': u'20',
89 | u'CharacterOffsetEnd': u'22',
90 | u'Lemma': u'so',
91 | u'NamedEntityTag': u'O',
92 | u'PartOfSpeech': u'RB'}],
93 | [u'beautiful',
94 | {u'CharacterOffsetBegin': u'23',
95 | u'CharacterOffsetEnd': u'32',
96 | u'Lemma': u'beautiful',
97 | u'NamedEntityTag': u'O',
98 | u'PartOfSpeech': u'JJ'}],
99 | [u'.',
100 | {u'CharacterOffsetBegin': u'32',
101 | u'CharacterOffsetEnd': u'33',
102 | u'Lemma': u'.',
103 | u'NamedEntityTag': u'O',
104 | u'PartOfSpeech': u'.'}]]}],
105 | u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
106 |
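The nested structure above can be walked with plain dict and list access. As a sketch, here is a hypothetical helper (not part of the library) that pulls `(token, part-of-speech)` pairs out of a result shaped like the example, using a trimmed-down literal in place of live server output:

```python
def pos_tags(result):
    """Return a list of (token, POS) pairs for each sentence in a parse result."""
    return [[(token, attrs["PartOfSpeech"]) for token, attrs in sentence["words"]]
            for sentence in result["sentences"]]

# Trimmed-down stand-in for the server output shown above.
result = {"sentences": [
    {"text": "Hello world!",
     "words": [["Hello", {"PartOfSpeech": "UH", "Lemma": "hello"}],
               ["world", {"PartOfSpeech": "NN", "Lemma": "world"}],
               ["!", {"PartOfSpeech": ".", "Lemma": "!"}]]}]}

print(pos_tags(result))  # [[('Hello', 'UH'), ('world', 'NN'), ('!', '.')]]
```

The same pattern works for `Lemma`, `NamedEntityTag`, or the character offsets, since every word entry is a `[token, attributes]` pair.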
107 | To use it in a regular script (useful for debugging), load the module instead:
108 |
109 | from corenlp import *
110 | corenlp = StanfordCoreNLP() # wait a few minutes...
111 | corenlp.parse("Parse this sentence.")
112 |
113 | The server, `StanfordCoreNLP()`, takes an optional argument `corenlp_path` which specifies the path to the jar files. The default value is `StanfordCoreNLP(corenlp_path="./stanford-corenlp-full-2014-08-27/")`.
114 |
115 | ## Coreference Resolution
116 |
117 | The library supports [coreference resolution](http://en.wikipedia.org/wiki/Coreference), which means pronouns can be "dereferenced." If an entry in the `coref` list is `[u'Hello world', 0, 1, 0, 2]`, the numbers mean:
118 |
119 | * 0 = The reference appears in the 0th sentence (e.g. "Hello world")
120 | * 1 = The 2nd token, "world", is the [headword](http://en.wikipedia.org/wiki/Head_%28linguistics%29) of the mention
121 | * 0 = 'Hello world' begins at the 0th token in the sentence
122 | * 2 = 'Hello world' ends before the 2nd token in the sentence.
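Put together, those indices let you recover a mention's span and headword from the tokenized sentence. A minimal illustrative helper (hypothetical, not part of the library):

```python
def mention_span(sentence_tokens, mention):
    """Resolve a coref mention entry against its sentence's token list.

    `mention` is (text, sentence_index, head_index, start, end), the shape of
    an entry in the `coref` list; indices are 0-based and `end` is exclusive,
    as described above.
    """
    text, sent_i, head_i, start, end = mention
    return sentence_tokens[start:end], sentence_tokens[head_i]

tokens = ["Hello", "world", "!"]
span, head = mention_span(tokens, ("Hello world", 0, 1, 0, 2))
print(span, head)  # ['Hello', 'world'] world
```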
123 |
124 |
135 |
136 |
137 | ## Questions
138 |
139 | **Stanford CoreNLP tools require a large amount of free memory**. Java 5+ uses about 50% more RAM on 64-bit machines than 32-bit machines. 32-bit machine users can lower the memory requirements by changing `-Xmx3g` to `-Xmx2g` or even less.
140 | If pexpect times out while loading models, check to make sure you have enough memory and can run the server alone without your kernel killing the Java process:
141 |
142 | java -cp stanford-corenlp-3.4.1.jar:stanford-corenlp-3.4.1-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
143 |
144 | You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available [on my webpage](http://web.media.mit.edu/~dustin)).
145 |
146 |
147 | # License & Contributors
148 |
149 | This is free and open source software and has benefited from the contribution and feedback of others. Like Stanford's CoreNLP tools, it is covered under the [GNU General Public License v2+](http://www.gnu.org/licenses/gpl-2.0.html), which in short means that modifications to this program must maintain the same free and open source distribution policy.
150 |
151 | I gratefully welcome bug fixes and new features. If you have forked this repository, please submit a [pull request](https://help.github.com/articles/using-pull-requests/) so others can benefit from your contributions. This project has already benefited from contributions from these members of the open source community:
152 |
153 | * [Emilio Monti](https://github.com/emilmont)
154 | * [Justin Cheng](https://github.com/jcccf)
155 | * Abhaya Agarwal
156 |
157 | *Thank you!*
158 |
159 | ## Related Projects
160 |
161 | Maintainers of the Core NLP library at Stanford keep an [updated list of wrappers and extensions](http://nlp.stanford.edu/software/corenlp.shtml#Extensions). See Brendan O'Connor's [stanford_corenlp_pywrapper](https://github.com/brendano/stanford_corenlp_pywrapper) for a different approach more suited to batch processing.
162 |
--------------------------------------------------------------------------------
/stanford_corenlp_python/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/stanford_corenlp_python/__init__.py
--------------------------------------------------------------------------------
/stanford_corenlp_python/client.py:
--------------------------------------------------------------------------------
1 | import json
2 | from jsonrpc import ServerProxy, JsonRpc20, TransportTcpIp
3 | from pprint import pprint
4 |
5 | class StanfordNLP:
6 | def __init__(self):
7 | self.server = ServerProxy(JsonRpc20(),
8 | TransportTcpIp(addr=("127.0.0.1", 8080)))
9 |
10 | def parse(self, text):
11 | return json.loads(self.server.parse(text))
12 |
13 | nlp = StanfordNLP()
14 | result = nlp.parse("Hello world! It is so beautiful.")
15 | pprint(result)
16 |
17 | from nltk.tree import Tree
18 | tree = Tree.parse(result['sentences'][0]['parsetree'])
19 | pprint(tree)
20 |
--------------------------------------------------------------------------------
/stanford_corenlp_python/corenlp.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #
3 | # corenlp - Python interface to Stanford Core NLP tools
4 | # Copyright (c) 2014 Dustin Smith
5 | # https://github.com/dasmith/stanford-corenlp-python
6 | #
7 | # This program is free software; you can redistribute it and/or
8 | # modify it under the terms of the GNU General Public License
9 | # as published by the Free Software Foundation; either version 2
10 | # of the License, or (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU General Public License
18 | # along with this program; if not, write to the Free Software
19 | # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
20 |
21 | import json
22 | import optparse
23 | import os, re, sys, time, traceback
24 | import jsonrpc, pexpect
25 | from progressbar import ProgressBar, Fraction
26 | import logging
27 |
28 |
29 | VERBOSE = True
30 |
31 | STATE_START, STATE_TEXT, STATE_WORDS, STATE_TREE, STATE_DEPENDENCY, STATE_COREFERENCE = 0, 1, 2, 3, 4, 5
32 | WORD_PATTERN = re.compile(r'\[([^\]]+)\]')
33 | CR_PATTERN = re.compile(r"\((\d*),(\d*),\[(\d*),(\d*)\]\) -> \((\d*),(\d*),\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"")
34 |
35 | # initialize logger
36 | logging.basicConfig(level=logging.INFO)
37 | logger = logging.getLogger(__name__)
38 |
39 |
40 | def remove_id(word):
41 | """Removes the numeric suffix from the parsed recognized words: e.g. 'word-2' > 'word' """
42 | return word.count("-") == 0 and word or word[0:word.rindex("-")]
43 |
44 |
45 | def parse_bracketed(s):
46 | '''Parse word features [abc=... def = ...]
47 | Also manages to parse out features that have XML within them
48 | '''
49 | word = None
50 | attrs = {}
51 | temp = {}
52 | # Substitute XML tags, to replace them later
53 | for i, tag in enumerate(re.findall(r"(<[^<>]+>.*<\/[^<>]+>)", s)):
54 | temp["^^^%d^^^" % i] = tag
55 | s = s.replace(tag, "^^^%d^^^" % i)
56 | # Load key-value pairs, substituting as necessary
57 | for attr, val in re.findall(r"([^=\s]*)=([^=\s]*)", s):
58 | if val in temp:
59 | val = temp[val]
60 | if attr == 'Text':
61 | word = val
62 | else:
63 | attrs[attr] = val
64 | return (word, attrs)
65 |
66 |
67 | def parse_parser_results(text):
68 | """ This is the nasty bit of code to interact with the command-line
69 | interface of the CoreNLP tools. Takes a string of the parser results
70 | and then returns a Python list of dictionaries, one for each parsed
71 | sentence.
72 | """
73 | results = {"sentences": []}
74 | state = STATE_START
75 | for line in text.encode('utf-8').split("\n"):
76 | line = line.strip()
77 |
78 | if line.startswith("Sentence #"):
79 | sentence = {'words':[], 'parsetree':[], 'dependencies':[]}
80 | results["sentences"].append(sentence)
81 | state = STATE_TEXT
82 |
83 | elif state == STATE_TEXT:
84 | sentence['text'] = line
85 | state = STATE_WORDS
86 |
87 | elif state == STATE_WORDS:
88 | if not line.startswith("[Text="):
89 | raise Exception('Parse error. Could not find "[Text=" in: %s' % line)
90 | for s in WORD_PATTERN.findall(line):
91 | sentence['words'].append(parse_bracketed(s))
92 | state = STATE_TREE
93 |
94 | elif state == STATE_TREE:
95 | if len(line) == 0:
96 | state = STATE_DEPENDENCY
97 | sentence['parsetree'] = " ".join(sentence['parsetree'])
98 | else:
99 | sentence['parsetree'].append(line)
100 |
101 | elif state == STATE_DEPENDENCY:
102 | if len(line) == 0:
103 | state = STATE_COREFERENCE
104 | else:
105 | split_entry = re.split(r"\(|, ", line[:-1])
106 | if len(split_entry) == 3:
107 | rel, left, right = map(lambda x: remove_id(x), split_entry)
108 | sentence['dependencies'].append(tuple([rel,left,right]))
109 |
110 | elif state == STATE_COREFERENCE:
111 | if "Coreference set" in line:
112 | if 'coref' not in results:
113 | results['coref'] = []
114 | coref_set = []
115 | results['coref'].append(coref_set)
116 | else:
117 | for src_i, src_pos, src_l, src_r, sink_i, sink_pos, sink_l, sink_r, src_word, sink_word in CR_PATTERN.findall(line):
118 | src_i, src_pos, src_l, src_r = int(src_i)-1, int(src_pos)-1, int(src_l)-1, int(src_r)-1
119 | sink_i, sink_pos, sink_l, sink_r = int(sink_i)-1, int(sink_pos)-1, int(sink_l)-1, int(sink_r)-1
120 | coref_set.append(((src_word, src_i, src_pos, src_l, src_r), (sink_word, sink_i, sink_pos, sink_l, sink_r)))
121 |
122 | return results
123 |
124 |
125 | class StanfordCoreNLP(object):
126 | """
127 | Command-line interaction with Stanford's CoreNLP java utilities.
128 | Can be run as a JSON-RPC server or imported as a module.
129 | """
130 | def __init__(self, corenlp_path=None):
131 | """
132 | Checks the location of the jar files.
133 | Spawns the server as a process.
134 | """
135 | jars = ["stanford-corenlp-3.6.0.jar",
136 | "stanford-corenlp-3.6.0-models.jar",
137 | "joda-time.jar",
138 | "xom.jar",
139 | "jollyday.jar",
140 | "slf4j-api.jar"]
141 |
142 | # if CoreNLP libraries are in a different directory,
143 | # change the corenlp_path variable to point to them
144 | if not corenlp_path:
145 | #corenlp_path = "./stanford-corenlp-full-2014-08-27/"
146 | corenlp_path = "stanford-corenlp-full/"
147 |
148 | java_path = "java"
149 | classname = "edu.stanford.nlp.pipeline.StanfordCoreNLP"
150 | # include the properties file, so you can change defaults
151 | # but any changes in output format will break parse_parser_results()
152 | props = "-props default.properties"
153 |
154 | # add and check classpaths
155 | jars = [corenlp_path + jar for jar in jars]
156 | for jar in jars:
157 | if not os.path.exists(jar):
158 | logger.error("Error! Cannot locate %s" % jar)
159 | sys.exit(1)
160 |
161 | # spawn the server
162 | start_corenlp = "%s -Xmx1800m -cp %s %s %s" % (java_path, ':'.join(jars), classname, props)
163 | if VERBOSE:
164 | logger.debug(start_corenlp)
165 | self.corenlp = pexpect.spawn(start_corenlp)
166 |
167 | # show progress bar while loading the models
168 | widgets = ['Loading Models: ', Fraction()]
169 | pbar = ProgressBar(widgets=widgets, maxval=5, force_update=True).start()
170 | self.corenlp.expect("done.", timeout=20) # Load pos tagger model (~5sec)
171 | pbar.update(1)
172 | self.corenlp.expect("done.", timeout=200) # Load NER-all classifier (~33sec)
173 | pbar.update(2)
174 | self.corenlp.expect("done.", timeout=600) # Load NER-muc classifier (~60sec)
175 | pbar.update(3)
176 | self.corenlp.expect("done.", timeout=600) # Load CoNLL classifier (~50sec)
177 | pbar.update(4)
178 | self.corenlp.expect("done.", timeout=200) # Loading PCFG (~3sec)
179 | pbar.update(5)
180 | self.corenlp.expect("Entering interactive shell.")
181 | pbar.finish()
182 |
183 | def _parse(self, text):
184 | """
185 | This is the core interaction with the parser.
186 |
187 | It returns a Python data-structure, while the parse()
188 | function returns a JSON object
189 | """
190 | # clean up anything leftover
191 | while True:
192 | try:
193 | self.corenlp.read_nonblocking(4000, 0.3)
194 | except pexpect.TIMEOUT:
195 | break
196 |
197 | self.corenlp.sendline(text)
198 |
199 | # How much time should we give the parser to parse it?
200 | # the idea here is that you increase the timeout as a
201 | # function of the text's length.
202 | # anything longer than 5 seconds requires that you also
203 | # increase timeout=5 in jsonrpc.py
204 | max_expected_time = min(40, 3 + len(text) / 20.0)
205 | end_time = time.time() + max_expected_time
206 |
207 | incoming = ""
208 | while True:
209 | # Time left, read more data
210 | try:
211 | incoming += self.corenlp.read_nonblocking(2000, 1)
212 | if "\nNLP>" in incoming:
213 | break
214 | time.sleep(0.0001)
215 | except pexpect.TIMEOUT:
216 | if end_time - time.time() < 0:
217 | logger.error("Error: Timeout with input '%s'" % (incoming))
218 | return {'error': "timed out after %f seconds" % max_expected_time}
219 | else:
220 | continue
221 | except pexpect.EOF:
222 | break
223 |
224 | if VERBOSE:
225 | logger.debug("%s\n%s" % ('='*40, incoming))
226 | try:
227 | results = parse_parser_results(incoming)
228 | except Exception, e:
229 | if VERBOSE:
230 | logger.debug(traceback.format_exc())
231 | raise e
232 |
233 | return results
234 |
235 | def parse(self, text):
236 | """
237 | This function takes a text string, sends it to the Stanford parser,
238 | reads in the result, parses the results and returns a list
239 | with one dictionary entry for each parsed sentence, in JSON format.
240 | """
241 | response = self._parse(text)
242 | logger.debug("Response: '%s'" % (response))
243 | return json.dumps(response)
244 |
245 |
246 | if __name__ == '__main__':
247 | """
248 | The code below starts a JSON-RPC server
249 | """
250 | parser = optparse.OptionParser(usage="%prog [OPTIONS]")
251 | parser.add_option('-p', '--port', default='8080',
252 | help='Port to serve on (default: 8080)')
253 | parser.add_option('-H', '--host', default='127.0.0.1',
254 | help='Host to serve on (default: 127.0.0.1. Use 0.0.0.0 to make public)')
255 | options, args = parser.parse_args()
256 | server = jsonrpc.Server(jsonrpc.JsonRpc20(),
257 | jsonrpc.TransportTcpIp(addr=(options.host, int(options.port))))
258 |
259 | nlp = StanfordCoreNLP()
260 | server.register_function(nlp.parse)
261 |
262 | logger.info('Serving on http://%s:%s' % (options.host, options.port))
263 | server.serve()
264 |
--------------------------------------------------------------------------------
/stanford_corenlp_python/default.properties:
--------------------------------------------------------------------------------
1 | annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
2 |
3 | # A true-casing annotator is also available (see below)
4 | #annotators = tokenize, ssplit, pos, lemma, truecase
5 |
6 | # A simple regex NER annotator is also available
7 | # annotators = tokenize, ssplit, regexner
8 |
9 | #Use these as EOS punctuation and discard them from the actual sentence content
10 | #These are HTML tags that get expanded internally to correct syntax, e.g., from "p" to "<p>", "</p>" etc.
11 | #Will have no effect if the "cleanxml" annotator is used
12 | #ssplit.htmlBoundariesToDiscard = p,text
13 |
14 | #
15 | # None of these paths are necessary anymore: we load all models from the JAR file
16 | #
17 |
18 | #pos.model = /u/nlp/data/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger
19 | ## slightly better model but much slower:
20 | ##pos.model = /u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/bidirectional-distsim-wsj-0-18.tagger
21 |
22 | #ner.model.3class = /u/nlp/data/ner/goodClassifiers/all.3class.distsim.crf.ser.gz
23 | #ner.model.7class = /u/nlp/data/ner/goodClassifiers/muc.distsim.crf.ser.gz
24 | #ner.model.MISCclass = /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz
25 |
26 | #regexner.mapping = /u/nlp/data/TAC-KBP2010/sentence_extraction/type_map_clean
27 | #regexner.ignorecase = false
28 |
29 | #nfl.gazetteer = /scr/nlp/data/machine-reading/Machine_Reading_P1_Reading_Task_V2.0/data/SportsDomain/NFLScoring_UseCase/NFLgazetteer.txt
30 | #nfl.relation.model = /scr/nlp/data/ldc/LDC2009E112/Machine_Reading_P1_NFL_Scoring_Training_Data_V1.2/models/nfl_relation_model.ser
31 | #nfl.entity.model = /scr/nlp/data/ldc/LDC2009E112/Machine_Reading_P1_NFL_Scoring_Training_Data_V1.2/models/nfl_entity_model.ser
32 | #printable.relation.beam = 20
33 |
34 | #parser.model = /u/nlp/data/lexparser/englishPCFG.ser.gz
35 |
36 | #srl.verb.args=/u/kristina/srl/verbs.core_args
37 | #srl.model.cls=/u/nlp/data/srl/trainedModels/englishPCFG/cls/train.ann
38 | #srl.model.id=/u/nlp/data/srl/trainedModels/englishPCFG/id/train.ann
39 |
40 | #coref.model=/u/nlp/rte/resources/anno/coref/corefClassifierAll.March2009.ser.gz
41 | #coref.name.dir=/u/nlp/data/coref/
42 | #wordnet.dir=/u/nlp/data/wordnet/wordnet-3.0-prolog
43 |
44 | #dcoref.demonym = /scr/heeyoung/demonyms.txt
45 | #dcoref.animate = /scr/nlp/data/DekangLin-Animacy-Gender/Animacy/animate.unigrams.txt
46 | #dcoref.inanimate = /scr/nlp/data/DekangLin-Animacy-Gender/Animacy/inanimate.unigrams.txt
47 | #dcoref.male = /scr/nlp/data/Bergsma-Gender/male.unigrams.txt
48 | #dcoref.neutral = /scr/nlp/data/Bergsma-Gender/neutral.unigrams.txt
49 | #dcoref.female = /scr/nlp/data/Bergsma-Gender/female.unigrams.txt
50 | #dcoref.plural = /scr/nlp/data/Bergsma-Gender/plural.unigrams.txt
51 | #dcoref.singular = /scr/nlp/data/Bergsma-Gender/singular.unigrams.txt
52 |
53 |
54 | # This is the regular expression that describes which xml tags to keep
55 | # the text from. In order to turn on the xml removal, add cleanxml
56 | # to the list of annotators above after "tokenize".
57 | #clean.xmltags = .*
58 | # A set of tags which will force the end of a sentence. HTML example:
59 | # you would not want to end on <i>, but you would want to end on <p>.
60 | # Once again, a regular expression.
61 | # (Blank means there are no sentence enders.)
62 | #clean.sentenceendingtags =
63 | # Whether or not to allow malformed xml
64 | # StanfordCoreNLP.properties
65 | #wordnet.dir=models/wordnet-3.0-prolog
66 |
--------------------------------------------------------------------------------
/stanford_corenlp_python/progressbar.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: iso-8859-1 -*-
3 | #
4 | # progressbar - Text progressbar library for python.
5 | # Copyright (c) 2005 Nilton Volpato
6 | #
7 | # This library is free software; you can redistribute it and/or
8 | # modify it under the terms of the GNU Lesser General Public
9 | # License as published by the Free Software Foundation; either
10 | # version 2.1 of the License, or (at your option) any later version.
11 | #
12 | # This library is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
15 | # Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public
18 | # License along with this library; if not, write to the Free Software
19 | # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
20 |
21 |
22 | """Text progressbar library for python.
23 |
24 | This library provides a text mode progressbar. This is typically used
25 | to display the progress of a long running operation, providing a
26 | visual clue that processing is underway.
27 |
28 | The ProgressBar class manages the progress, and the format of the line
29 | is given by a number of widgets. A widget is an object that may
30 | display differently depending on the state of the progress. There are
31 | three types of widget:
32 | - a string, which always shows itself;
33 | - a ProgressBarWidget, which may return a different value every time
34 | its update method is called; and
35 | - a ProgressBarWidgetHFill, which is like ProgressBarWidget, except it
36 | expands to fill the remaining width of the line.
37 |
38 | The progressbar module is very easy to use, yet very powerful. It
39 | automatically supports features like auto-resizing when available.
40 | """
41 |
42 | __author__ = "Nilton Volpato"
43 | __author_email__ = "first-name dot last-name @ gmail.com"
44 | __date__ = "2006-05-07"
45 | __version__ = "2.2"
46 |
47 | # Changelog
48 | #
49 | # 2006-05-07: v2.2 fixed bug in windows
50 | # 2005-12-04: v2.1 autodetect terminal width, added start method
51 | # 2005-12-04: v2.0 everything is now a widget (wow!)
52 | # 2005-12-03: v1.0 rewrite using widgets
53 | # 2005-06-02: v0.5 rewrite
54 | # 2004-??-??: v0.1 first version
55 |
56 | import sys
57 | import time
58 | from array import array
59 | try:
60 | from fcntl import ioctl
61 | import termios
62 | except ImportError:
63 | pass
64 | import signal
65 |
66 |
67 | class ProgressBarWidget(object):
68 | """This is an element of ProgressBar formatting.
69 |
70 | The ProgressBar object will call its update method when an update
71 | is needed. Its size may change between calls, but the results will
72 | not be good if the size changes drastically and repeatedly.
73 | """
74 | def update(self, pbar):
75 | """Returns the string representing the widget.
76 |
77 | The parameter pbar is a reference to the calling ProgressBar,
78 | where one can access attributes of the class for knowing how
79 | the update must be made.
80 |
81 | At least this function must be overridden."""
82 | pass
83 |
84 |
85 | class ProgressBarWidgetHFill(object):
86 | """This is a variable width element of ProgressBar formatting.
87 |
88 | The ProgressBar object will call its update method, informing the
89 | width this object must be made. This is like TeX \\hfill; it will
90 | expand to fill the line. You can use more than one in the same
91 | line, and they will all have the same width, and together will
92 | fill the line.
93 | """
94 | def update(self, pbar, width):
95 | """Returns the string representing the widget.
96 |
97 | The parameter pbar is a reference to the calling ProgressBar,
98 | where one can access attributes of the class for knowing how
99 | the update must be made. The parameter width is the total
100 | horizontal width the widget must have.
101 |
102 | At least this function must be overridden."""
103 | pass
104 |
105 |
106 | class ETA(ProgressBarWidget):
107 | "Widget for the Estimated Time of Arrival"
108 | def format_time(self, seconds):
109 | return time.strftime('%H:%M:%S', time.gmtime(seconds))
110 |
111 | def update(self, pbar):
112 | if pbar.currval == 0:
113 | return 'ETA: --:--:--'
114 | elif pbar.finished:
115 | return 'Time: %s' % self.format_time(pbar.seconds_elapsed)
116 | else:
117 | elapsed = pbar.seconds_elapsed
118 | eta = elapsed * pbar.maxval / pbar.currval - elapsed
119 | return 'ETA: %s' % self.format_time(eta)
120 |
121 |
122 | class FileTransferSpeed(ProgressBarWidget):
123 | "Widget for showing the transfer speed (useful for file transfers)."
124 | def __init__(self):
125 | self.fmt = '%6.2f %s'
126 | self.units = ['B', 'K', 'M', 'G', 'T', 'P']
127 |
128 | def update(self, pbar):
129 | if pbar.seconds_elapsed < 2e-6: # == 0:
130 | bps = 0.0
131 | else:
132 | bps = float(pbar.currval) / pbar.seconds_elapsed
133 | spd = bps
134 | for u in self.units:
135 | if spd < 1000:
136 | break
137 | spd /= 1000
138 | return self.fmt % (spd, u + '/s')
139 |
140 |
141 | class RotatingMarker(ProgressBarWidget):
142 | "A rotating marker for filling the bar of progress."
143 | def __init__(self, markers='|/-\\'):
144 | self.markers = markers
145 | self.curmark = -1
146 |
147 | def update(self, pbar):
148 | if pbar.finished:
149 | return self.markers[0]
150 | self.curmark = (self.curmark + 1) % len(self.markers)
151 | return self.markers[self.curmark]
152 |
153 |
154 | class Percentage(ProgressBarWidget):
155 | "Just the percentage done."
156 | def update(self, pbar):
157 | return '%3d%%' % pbar.percentage()
158 |
159 |
160 | class Fraction(ProgressBarWidget):
161 | "Just the fraction done."
162 | def update(self, pbar):
163 | return "%d/%d" % (pbar.currval, pbar.maxval)
164 |
165 |
166 | class Bar(ProgressBarWidgetHFill):
167 | "The bar of progress. It will stretch to fill the line."
168 | def __init__(self, marker='#', left='|', right='|'):
169 | self.marker = marker
170 | self.left = left
171 | self.right = right
172 |
173 | def _format_marker(self, pbar):
174 | if isinstance(self.marker, (str, unicode)):
175 | return self.marker
176 | else:
177 | return self.marker.update(pbar)
178 |
179 | def update(self, pbar, width):
180 | percent = pbar.percentage()
181 | cwidth = width - len(self.left) - len(self.right)
182 | marked_width = int(percent * cwidth / 100)
183 | m = self._format_marker(pbar)
184 | bar = (self.left + (m * marked_width).ljust(cwidth) + self.right)
185 | return bar
186 |
187 |
188 | class ReverseBar(Bar):
189 | "The reverse bar of progress, or bar of regress. :)"
190 | def update(self, pbar, width):
191 | percent = pbar.percentage()
192 | cwidth = width - len(self.left) - len(self.right)
193 | marked_width = int(percent * cwidth / 100)
194 | m = self._format_marker(pbar)
195 | bar = (self.left + (m * marked_width).rjust(cwidth) + self.right)
196 | return bar
197 |
198 | default_widgets = [Percentage(), ' ', Bar()]
199 |
200 |
201 | class ProgressBar(object):
202 | """This is the ProgressBar class, it updates and prints the bar.
203 |
204 | The term_width parameter may be an integer. Or None, in which case
205 | it will try to guess it; if it fails it will default to 80 columns.
206 |
207 | The simple use is like this:
208 | >>> pbar = ProgressBar().start()
209 | >>> for i in xrange(100):
210 | ... # do something
211 | ... pbar.update(i+1)
212 | ...
213 | >>> pbar.finish()
214 |
215 | But anything you want to do is possible (well, almost anything).
216 | You can supply different widgets of any type in any order. And you
217 | can even write your own widgets! There are many widgets already
218 | shipped and you should experiment with them.
219 |
220 | When implementing a widget update method you may access any
221 | attribute or function of the ProgressBar object calling the
222 | widget's update method. The most important attributes you would
223 | like to access are:
224 | - currval: current value of the progress, 0 <= currval <= maxval
225 | - maxval: maximum (and final) value of the progress
226 | - finished: True if the bar has finished (reached 100%), False o/w
227 | - start_time: first time update() method of ProgressBar was called
228 | - seconds_elapsed: seconds elapsed since start_time
229 | - percentage(): percentage of the progress (this is a method)
230 | """
231 | def __init__(self, maxval=100, widgets=default_widgets, term_width=None,
232 | fd=sys.stderr, force_update=False):
233 | assert maxval > 0
234 | self.maxval = maxval
235 | self.widgets = widgets
236 | self.fd = fd
237 | self.signal_set = False
238 | if term_width is None:
239 | try:
240 | self.handle_resize(None, None)
241 | signal.signal(signal.SIGWINCH, self.handle_resize)
242 | self.signal_set = True
243 |             except Exception:
244 | self.term_width = 79
245 | else:
246 | self.term_width = term_width
247 |
248 | self.currval = 0
249 | self.finished = False
250 | self.prev_percentage = -1
251 | self.start_time = None
252 | self.seconds_elapsed = 0
253 | self.force_update = force_update
254 |
255 | def handle_resize(self, signum, frame):
256 | h, w = array('h', ioctl(self.fd, termios.TIOCGWINSZ, '\0' * 8))[:2]
257 | self.term_width = w
258 |
259 | def percentage(self):
260 | "Returns the percentage of the progress."
261 | return self.currval * 100.0 / self.maxval
262 |
263 | def _format_widgets(self):
264 | r = []
265 | hfill_inds = []
266 | num_hfill = 0
267 | currwidth = 0
268 | for i, w in enumerate(self.widgets):
269 | if isinstance(w, ProgressBarWidgetHFill):
270 | r.append(w)
271 | hfill_inds.append(i)
272 | num_hfill += 1
273 | elif isinstance(w, (str, unicode)):
274 | r.append(w)
275 | currwidth += len(w)
276 | else:
277 | weval = w.update(self)
278 | currwidth += len(weval)
279 | r.append(weval)
280 | for iw in hfill_inds:
281 | r[iw] = r[iw].update(self,
282 | (self.term_width - currwidth) / num_hfill)
283 | return r
284 |
285 | def _format_line(self):
286 | return ''.join(self._format_widgets()).ljust(self.term_width)
287 |
288 | def _need_update(self):
289 | if self.force_update:
290 | return True
291 | return int(self.percentage()) != int(self.prev_percentage)
292 |
293 | def reset(self):
294 | if not self.finished and self.start_time:
295 | self.finish()
296 | self.finished = False
297 | self.currval = 0
298 | self.start_time = None
299 |         self.seconds_elapsed = 0
300 |         self.prev_percentage = -1
301 | return self
302 |
303 | def update(self, value):
304 | "Updates the progress bar to a new value."
305 | assert 0 <= value <= self.maxval
306 | self.currval = value
307 | if not self._need_update() or self.finished:
308 | return
309 | if not self.start_time:
310 | self.start_time = time.time()
311 | self.seconds_elapsed = time.time() - self.start_time
312 | self.prev_percentage = self.percentage()
313 | if value != self.maxval:
314 | self.fd.write(self._format_line() + '\r')
315 | else:
316 | self.finished = True
317 | self.fd.write(self._format_line() + '\n')
318 |
319 | def start(self):
320 | """Start measuring time, and prints the bar at 0%.
321 |
322 | It returns self so you can use it like this:
323 | >>> pbar = ProgressBar().start()
324 | >>> for i in xrange(100):
325 | ... # do something
326 | ... pbar.update(i+1)
327 | ...
328 | >>> pbar.finish()
329 | """
330 | self.update(0)
331 | return self
332 |
333 | def finish(self):
334 | """Used to tell the progress is finished."""
335 | self.update(self.maxval)
336 | if self.signal_set:
337 | signal.signal(signal.SIGWINCH, signal.SIG_DFL)
338 |
339 |
340 | def example1():
341 | widgets = ['Test: ', Percentage(), ' ', Bar(marker=RotatingMarker()),
342 | ' ', ETA(), ' ', FileTransferSpeed()]
343 | pbar = ProgressBar(widgets=widgets, maxval=10000000).start()
344 | for i in range(1000000):
345 | # do something
346 | pbar.update(10 * i + 1)
347 | pbar.finish()
348 | return pbar
349 |
350 |
351 | def example2():
352 | class CrazyFileTransferSpeed(FileTransferSpeed):
353 | "It's bigger between 45 and 80 percent"
354 | def update(self, pbar):
355 | if 45 < pbar.percentage() < 80:
356 | return 'Bigger Now ' + FileTransferSpeed.update(self, pbar)
357 | else:
358 | return FileTransferSpeed.update(self, pbar)
359 |
360 | widgets = [CrazyFileTransferSpeed(), ' <<<',
361 | Bar(), '>>> ', Percentage(), ' ', ETA()]
362 | pbar = ProgressBar(widgets=widgets, maxval=10000000)
363 | # maybe do something
364 | pbar.start()
365 | for i in range(2000000):
366 | # do something
367 | pbar.update(5 * i + 1)
368 | pbar.finish()
369 | return pbar
370 |
371 |
372 | def example3():
373 | widgets = [Bar('>'), ' ', ETA(), ' ', ReverseBar('<')]
374 | pbar = ProgressBar(widgets=widgets, maxval=10000000).start()
375 | for i in range(1000000):
376 | # do something
377 | pbar.update(10 * i + 1)
378 | pbar.finish()
379 | return pbar
380 |
381 |
382 | def example4():
383 | widgets = ['Test: ', Percentage(), ' ',
384 | Bar(marker='0', left='[', right=']'),
385 | ' ', ETA(), ' ', FileTransferSpeed()]
386 | pbar = ProgressBar(widgets=widgets, maxval=500)
387 | pbar.start()
388 | for i in range(100, 500 + 1, 50):
389 | time.sleep(0.2)
390 | pbar.update(i)
391 | pbar.finish()
392 | return pbar
393 |
394 |
395 | def example5():
396 | widgets = ['Test: ', Fraction(), ' ', Bar(marker=RotatingMarker()),
397 | ' ', ETA(), ' ', FileTransferSpeed()]
398 | pbar = ProgressBar(widgets=widgets, maxval=10, force_update=True).start()
399 | for i in range(1, 11):
400 | # do something
401 | time.sleep(0.5)
402 | pbar.update(i)
403 | pbar.finish()
404 | return pbar
405 |
406 |
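def example6():
    # A sketch of a widget written from scratch (the class name
    # StepCounter is ours, not part of this library). It illustrates the
    # ProgressBar attributes documented in the class docstring above:
    # currval, maxval and seconds_elapsed.
    class StepCounter(ProgressBarWidget):
        "Shows 'current/max' plus the elapsed seconds."
        def update(self, pbar):
            return '%d/%d (%.1fs)' % (pbar.currval, pbar.maxval,
                                      pbar.seconds_elapsed)

    widgets = ['Test: ', StepCounter(), ' ', Bar()]
    pbar = ProgressBar(widgets=widgets, maxval=10, force_update=True).start()
    for i in range(1, 11):
        # do something
        time.sleep(0.1)
        pbar.update(i)
    pbar.finish()
    return pbar
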
407 | def main():
408 | example1()
409 | print
410 | example2()
411 | print
412 | example3()
413 | print
414 | example4()
415 | print
416 | example5()
417 | print
418 |
419 | if __name__ == '__main__':
420 | main()
421 |
--------------------------------------------------------------------------------