├── .gitignore
├── README.md
├── pom.xml
└── src
    ├── main
    │   ├── java
    │   │   └── com
    │   │       └── graphaware
    │   │           └── nlp
    │   │               └── processor
    │   │                   └── opennlp
    │   │                       ├── OpenNLPAnnotation.java
    │   │                       ├── OpenNLPPipeline.java
    │   │                       ├── OpenNLPTextProcessor.java
    │   │                       ├── PipelineBuilder.java
    │   │                       └── model
    │   │                           ├── NERModelTool.java
    │   │                           ├── OpenNLPGenericModelTool.java
    │   │                           └── SentimentModelTool.java
    │   └── resources
    │       └── com
    │           └── graphaware
    │               └── nlp
    │                   └── processor
    │                       └── opennlp
    │                           ├── en-chunker.bin
    │                           ├── en-lemmatizer.dict
    │                           ├── en-ner-date.bin
    │                           ├── en-ner-location.bin
    │                           ├── en-ner-money.bin
    │                           ├── en-ner-organization.bin
    │                           ├── en-ner-percentage.bin
    │                           ├── en-ner-percentage_money.test
    │                           ├── en-ner-person.bin
    │                           ├── en-ner-person.test
    │                           ├── en-ner-person_organization_location_date.test
    │                           ├── en-ner-time.bin
    │                           ├── en-pos-maxent.bin
    │                           ├── en-sent.bin
    │                           ├── en-sentiment-tweets_toy.bin
    │                           ├── en-token.bin
    │                           └── sentiment_tweets.train
    └── test
        ├── java
        │   └── com
        │       └── graphaware
        │           └── nlp
        │               └── processor
        │                   └── opennlp
        │                       ├── OpenNLPIntegrationTest.java
        │                       ├── OpenNLPPipelineTest.java
        │                       ├── TestOpenNLP.java
        │                       ├── TextProcessorTest.java
        │                       ├── conceptnet5
        │                       │   └── ConceptNet5ImporterTest.java
        │                       ├── model
        │                       │   └── CustomSentimentModelIntegrationTest.java
        │                       └── procedure
        │                           └── ProcedureTest.java
        └── resources
            └── import
                └── sentiment_tweets.train
/.gitignore:
--------------------------------------------------------------------------------
1 | target/
2 | *iml
3 | dependency-reduced-pom.xml
4 | **/.DS_Store
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | GraphAware Neo4j NLP - OpenNLP - RETIRED
2 | ==========================
3 |
4 | ## GraphAware Neo4j NLP - OpenNLP Has Been Retired
5 | As of May 2021, this [repository has been retired](https://graphaware.com/framework/2021/05/06/from-graphaware-framework-to-graphaware-hume.html).
6 |
7 | ---
8 |
9 | GraphAware NLP Using OpenNLP
10 | ==========================================
11 |
12 | Getting the Software
13 | ---------------------
14 |
15 | ### Server Mode
16 | When using Neo4j in standalone server mode, you will need the GraphAware Neo4j Framework and GraphAware NLP .jar files (both of which you can download here) dropped into the `plugins/` directory of your Neo4j installation. Finally, append the following to the `neo4j.conf` file in the `config/` directory:
17 |
18 | ```
19 | dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
20 | com.graphaware.runtime.enabled=true
21 |
22 | com.graphaware.module.NLP.2=com.graphaware.nlp.module.NLPBootstrapper
23 | ```
24 |
25 | ### For Developers
26 | This package is an extension of GraphAware NLP, which therefore needs to be packaged and installed beforehand. No other dependencies are required.
27 |
28 | ```
29 | cd neo4j-nlp
30 | mvn clean install
31 | cp target/graphaware-nlp-1.0-SNAPSHOT.jar /plugins
32 |
33 | cd ../neo4j-nlp-opennlp
34 | mvn clean package
35 | cp target/nlp-opennlp-1.0.0-SNAPSHOT.jar /plugins
36 | ```
37 |
38 |
39 | Introduction and How-To
40 | -------------------------
41 |
42 | The Apache OpenNLP library provides basic features for processing natural language text: sentence segmentation, tokenization, lemmatization, part-of-speech tagging, named entity identification, chunking, parsing and sentiment analysis. OpenNLP support is implemented by extending the general GraphAware NLP package with extra parameters:
43 |
44 | ### Tag Extraction / Annotations
45 | ```
46 | // Annotate the news
47 | MATCH (n:News)
48 | CALL ga.nlp.annotate({text:n.text, id: n.uuid, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", pipeline: "tokenizer"}) YIELD result
49 | MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)
50 | RETURN n, result
51 | ```
52 |
53 | Available parameters are:
54 | * the same ones as described in parent class GraphAware NLP
55 | * `sentimentProbabilityThr` (optional, default *0.7*): if assigned sentiment label has confidence smaller than this threshold, set sentiment to *Neutral*
56 | * `customProject` (optional): add user trained/provided models associated with specified project, see paragraph *Customizing pipeline models*
57 |
58 | Available pipelines:
59 | * `tokenizer` - tokenization, lemmatization, stop-words removal, part-of-speech tagging (POS), named entity recognition (NER)
60 | * `sentiment` - tokenization, sentiment analysis
61 | * `tokenizerAndSentiment` - tokenization, lemmatization, stop-words removal, POS tagging, NER, sentiment analysis
62 | * `phrase` (not supported yet) - tokenization, stop-words removal, relations, sentiment analysis
63 |
64 | ### Sentiment Analysis
65 | The current implementation of sentiment analysis is just a toy - it relies on a file with 100 labeled Twitter samples which are used to build a model when Neo4j starts (the general recommendation for the number of training samples is 10k and more). The current model supports only three options - Positive, Neutral, Negative - which are chosen based on the highest probability (the algorithm returns an array of probabilities for each category). If the highest probability is less than 70% (the default value, which can be customized using the parameter *sentimentProbabilityThr*), the category is not regarded as trustworthy and is set to Neutral instead.
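The thresholding described above can be sketched as follows (a minimal illustration with hypothetical names, not the actual GraphAware implementation; the real model defines its own label order):

```java
public class SentimentThresholdSketch {

    // Hypothetical category order; the real model defines its own labels.
    static final String[] CATEGORIES = {"Negative", "Neutral", "Positive"};

    // Pick the category with the highest probability; fall back to "Neutral"
    // when the winning probability is below the confidence threshold.
    static String classify(double[] probabilities, double threshold) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return probabilities[best] < threshold ? "Neutral" : CATEGORIES[best];
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[]{0.1, 0.1, 0.8}, 0.7)); // Positive
        System.out.println(classify(new double[]{0.2, 0.3, 0.5}, 0.7)); // Neutral (below threshold)
    }
}
```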
66 |
67 | The sentiment analysis can be run either as part of the annotation (see the paragraph above) or as an independent procedure (see the command below) which takes in an AnnotatedText node, analyzes all attached sentences and adds to each of them a label corresponding to its sentiment.
68 |
69 | ```
70 | MATCH (a:AnnotatedText {id: {id}})
71 | CALL ga.nlp.sentiment({node:a, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor"}) YIELD result
72 | MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence)
73 | RETURN labels(s) as labels
74 | ```
75 |
76 |
77 |
78 |
79 | ## BETA
80 | ### Customizing pipeline models
81 | To add a new customized model (currently NER and Sentiment), use Cypher:
82 | ```
83 | CALL ga.nlp.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "component-en", alg: "sentiment", inputFile: "" [, lang: "en", trainingParameters: {......}]})
84 | ```
85 | * `alg` (case insensitive) specifies which algorithm is about to be trained; currently available algs: `NER`, `sentiment`
86 | * `modelIdentifier` is an arbitrary string that provides a unique identifier of the model that you want to train (it is used, e.g., for saving the model into a .bin file)
87 | * `inputFile` is path to the training data file
88 | * `lang` (default is "en") specifies the language
89 | * `textProcessor` - desired text processor
90 | * **training parameters** (defined in `com.graphaware.nlp.util.GenericModelParameters`) are optional and are not universal (some might be specific to only certain Text Processor):
91 | * *iter* - number of iterations
92 | * *cutoff* - useful for reducing the size of n-gram models; a threshold for n-gram occurrences/frequencies in the training dataset
93 | * *threads* - provides support for multi-threading
94 | * *entityType* - name type to use for NER training, by default all entities (classes such as "Person", "Date", ...) present in provided training file are used
95 | * *nFolds* - parameter for cross-validation procedure (default is 10), see paragraph *Validation*
96 | * *trainerAlg* - specific for OpenNLP
97 | * *trainerType* - specific for OpenNLP
98 |
99 | The trained model is saved to a binary file in Neo4j's `import/` directory with name format: `-.bin`, so there is no need to train the same model again when you restart Neo4j. A cross-validation method is used to evaluate the model, see paragraph *Validation*.
100 | * `NER` - default models (Person, Location, Organization, Date, Time, Money, Percentage) plus all registered customized models are used when invoking `ga.nlp.annotate()` (see example below)
101 | * `Sentiment` - sentiment analysis is run only once (user-trained one has a priority over the default one)
102 |
103 | **Training/testing example:**
104 | Training:
105 | ```
106 | CALL ga.nlp.processor.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.train", trainingParameters: {iter: 10}})
107 | ```
108 |
109 | Testing the new model:
110 | ```
111 | CALL ga.nlp.processor.test({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.test"})
112 | ```
113 |
114 | **Usage of new models:**
115 | To use custom models, one needs to assign them to a pipeline, for example:
116 |
117 | ```
118 | CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', name: 'customPipeline', processingSteps: {tokenize: true, ner: true, dependency: false, customSentiment: "test"}})
119 | ```
120 | * `customSentiment` - string value which is the identifier that you chose for your custom model
121 | * `customNER` - string value which is the identifier that you chose for your custom model; if you want to use more models, separate them by ",", for example: `customNER: "component-en,chemical-en,testing-model"`
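The comma-separated `customNER` value is split into individual model identifiers; a stdlib-only sketch of that split-and-trim (illustrative, not the pipeline's exact code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CustomModelListSketch {

    // Split a comma-separated model list such as "component-en, chemical-en",
    // trimming whitespace around each identifier and dropping empty entries.
    static List<String> parseModels(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parseModels("component-en, chemical-en,testing-model"));
    }
}
```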
122 |
123 | ```
124 | // Example of a text to analyze
125 | CREATE (l:Lesson {lesson: "Power system distribution at Kennedy Space Center (KSC) consists primarily of high-voltage, underground cables. These cables include approximately 5000 splices. Splice failures result in arc flash events that are extremely hazardous to personnel in the vicinity of the arc flash. Some construction and maintenance tasks cannot be performed effectively in the required personal protective equipment (PPE), and de-energizing the cables is not feasible due to cost, lost productivity, and safety risk to others implementing the required outages. To verify alternate and effective mitigations, arc flash testing was conducted in a controlled environment. The arc flash effects were greater than expected. Testing also demonstrated the addition of neutral grounding resistors (NGRs) would result in substantial reductions to arc flash effects. As a result, NGRs are being installed on KSC primary substation transformers. The presence of the NGRs, enable usage of less cumbersome PPE. Laboratory testing revealed higher than anticipated safety risks from a potential arc-flash event in a manhole environment when conducted at KSC's unreduced fault current levels. The safety risks included bright flash, excessive sound, and smoke. Due to these findings and absence of other mitigations installed at the time, manhole entries require full arc-flash PPE. Furthermore, manhole entries were temporarily restricted to short duration inspections until further mitigations could be implemented. With installation of neutral grounding resistors (NGRs) on substation transformers, the flash, sound and flame energy was reduced. The hazard reduction was so substantial that the required PPE would be less cumbersome and enable effective performance of maintenance tasks in the energized configuration."})
126 |
127 | WITH l
128 |
129 | // Annotate it and use newly trained NER model(s)
130 | CALL ga.nlp.annotate({text:l.lesson, id: l.uuid, pipeline: "customPipeline"}) YIELD result
131 | MERGE (l)-[:HAS_ANNOTATED_TEXT]->(result)
132 | RETURN l, result;
133 | ```
134 |
135 | **Format of training datasets:**
136 | * `NER`
137 | * one sentence per line
138 | * one empty line between two different texts (paragraphs)
139 | * there must be a space before and after each `<START:...>` and `<END>` tag
140 | * training data must not contain HTML markup (such as `H<sub>2</sub>O`); **TO DO:** check whether text on which the NER model is deployed needs to be manually stripped of HTML markup or whether it is ignored automatically
141 | * Example (categories "person", "organization", "location"):
142 | ```
143 | Theresa May has said she will form a government with the support of the Democratic Unionists that can provide "certainty" for the future.
144 | Speaking after visiting Buckingham Palace , she said only her party had the "legitimacy" to govern after winning the most seats and votes.
145 | In a short statement outside Downing Street , which followed a 25-minute audience with The Queen , Mrs May said she intended to form a government which could "provide certainty and lead Britain forward at this critical time for our country".
146 |
147 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017 .
148 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years.
149 | ```
150 | * `sentiment` - two columns separated by whitespace (a tab): the first column is a category as an integer (0=VeryNegative, 1=Negative, 2=Neutral, 3=Positive, 4=VeryPositive), the second column is a sentence; example:
151 | ```
152 | 3 Watching a nice movie
153 | 1 The painting is ugly, will return it tomorrow...
154 | 3 One of the best soccer games, worth seeing it
155 | 3 Very tasty, not only for vegetarians
156 | 1 Damn..the train is late again...
157 | ```
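A minimal reader for the sentiment format above might look like this (a hypothetical helper, shown only to make the two-column layout concrete; it is not part of the package):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class SentimentTrainFormatSketch {

    static final String[] LABELS =
            {"VeryNegative", "Negative", "Neutral", "Positive", "VeryPositive"};

    // Parse one "<category><tab><sentence>" line of the training file.
    static Map.Entry<String, String> parseLine(String line) {
        String[] parts = line.split("\\s+", 2); // category, then the sentence
        return new SimpleEntry<>(LABELS[Integer.parseInt(parts[0])], parts[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = parseLine("3\tWatching a nice movie");
        System.out.println(e.getKey() + " -> " + e.getValue()); // Positive -> Watching a nice movie
    }
}
```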
158 |
159 | **Validation/testing:**
160 |
161 | Evaluation of the new model is performed automatically when invoking the procedure `ga.nlp.train()`. The evaluation uses the OpenNLP cross-validation method: validation runs *n*-fold times on the same training file, but each time selects a different set of training and testing data with the sample size ratio *train:test = (n-1):1*. Validation measures (Precision, Recall, F-Measure) are pooled together and returned to the user as a result.
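The (n-1):1 ratio falls directly out of the fold partitioning; a stdlib-only sketch (not OpenNLP's own cross-validator) of how samples split in each round:

```java
import java.util.ArrayList;
import java.util.List;

public class CrossValidationSketch {

    // Partition sample indices into n folds round-robin; in round k, fold k
    // is the test set and the remaining (n-1) folds form the training set.
    static List<List<Integer>> folds(int samples, int n) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < samples; i++) {
            folds.get(i % n).add(i);
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> f = folds(100, 10);
        int test = f.get(0).size();     // 10 samples held out in round 0
        int train = 100 - test;         // 90 samples used for training
        System.out.println(train + ":" + test); // 90:10, i.e. (n-1):1
    }
}
```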
162 |
163 | The following procedure can be invoked to test already existing models:
164 | ```
165 | CALL ga.nlp.test({[project: "myXYProject",] alg: "NER", model: "location", file: "" [, lang: "en"]})
166 | ```
167 | Parameters
168 | * `project` (optional) allows you to specify which of the existing models to test (otherwise the default is used)
169 | * `alg` (case insensitive) specifies which algorithm is about to be tested; currently available algs: `NER`, `sentiment`
170 | * `model` is an arbitrary string that provides, in combination with `alg` (and with `project` if it's specified), a unique identifier of the model that you want to test
171 | * `file` is a path to the test file
172 | * `lang` specifies the language
173 |
174 |
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 | 4.0.0
4 |
5 | com.graphaware.neo4j
6 | nlp-opennlp
7 | 3.3.2.52.7-SNAPSHOT
8 |
9 |
10 |
11 | GNU General Public License, version 3
12 | http://www.gnu.org/licenses/gpl-3.0.txt
13 | repo
14 |
15 |
16 |
17 | GraphAware OpenNLP Integration
18 | OpenNLP integration into GraphAware NLP
19 | https://graphaware.com
20 |
21 |
22 | scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git
23 | scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git
24 | git@github.com:graphaware/neo4j-nlp-opennlp.git
25 | HEAD
26 |
27 |
28 |
29 |
30 | alenegro
31 | Alessandro Negro
32 | alessandro@graphaware.com
33 |
34 |
35 | ikwattro
36 | Christophe Willemsen
37 | christophe@graphaware.com
38 |
39 |
40 | vlasta-kus
41 | Vlastimil Kus
42 | vlasta@graphaware.com
43 |
44 |
45 | graphaware
46 | GraphAware
47 | nlp@graphaware.com
48 |
49 |
50 |
51 | 2015
52 |
53 |
54 | GitHub
55 | https://github.com/graphaware/neo4j-nlp-opennlp/issues
56 |
57 |
58 |
59 | Graph Aware Limited
60 | https://graphaware.com
61 |
62 |
63 |
64 | UTF-8
65 | 1.9.0
66 | 3.4.7.52
67 | ${graphaware.version}.18
68 | 3.4.7
69 | 1.8
70 | 1.8
71 | 3.4.9.52.16-SNAPSHOT
72 |
73 |
74 |
75 |
76 | com.graphaware.neo4j
77 | nlp
78 | ${nlp.version}
79 | provided
80 |
81 |
82 |
83 | org.apache.opennlp
84 | opennlp-tools
85 | ${open.nlp.version}
86 |
87 |
88 |
89 | org.slf4j
90 | slf4j-api
91 | 1.7.21
92 |
93 |
94 |
95 | junit
96 | junit
97 | 4.12
98 |
99 |
100 |
101 | org.slf4j
102 | slf4j-simple
103 | 1.7.21
104 |
105 |
106 |
107 | junit
108 | junit
109 | 4.12
110 | test
111 |
112 |
113 |
114 | com.graphaware.neo4j
115 | runtime
116 | ${graphaware.version}
117 | test
118 |
119 |
120 |
121 | com.graphaware.neo4j
122 | server
123 | ${graphaware.version}
124 | test
125 |
126 |
127 |
128 | com.graphaware.neo4j
129 | tests
130 | ${graphaware.version}
131 | test
132 |
133 |
134 |
135 | com.graphaware.neo4j
136 | resttest
137 | test
138 | ${resttest.version}
139 |
140 |
141 |
142 | com.graphaware.neo4j
143 | nlp
144 | ${nlp.version}
145 | test-jar
146 | test
147 |
148 |
149 |
150 | com.sun.jersey
151 | jersey-server
152 | 1.19.1
153 | test
154 |
155 |
156 |
157 |
158 |
159 |
160 |
161 | ossrh
162 | https://oss.sonatype.org/content/repositories/snapshots
163 |
164 |
165 | ossrh
166 | https://oss.sonatype.org/service/local/staging/deploy/maven2/
167 |
168 |
169 |
170 |
171 |
172 | release
173 |
174 |
175 | performRelease
176 | true
177 |
178 |
179 |
180 |
181 |
182 | org.apache.maven.plugins
183 | maven-gpg-plugin
184 | 1.5
185 |
186 |
187 | sign-artifacts
188 | verify
189 |
190 | sign
191 |
192 |
193 |
194 |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | maven-compiler-plugin
204 | 3.5.1
205 |
206 | 1.8
207 | 1.8
208 |
209 |
210 |
211 | maven-shade-plugin
212 | 2.4.3
213 |
214 |
215 | package
216 |
217 | shade
218 |
219 |
222 |
223 |
224 |
225 |
226 | org.apache.maven.plugins
227 | maven-surefire-plugin
228 | 2.20
229 |
230 | -Xmx8g
231 |
232 |
233 |
234 |
235 |
236 |
237 |
--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPAnnotation.java:
--------------------------------------------------------------------------------
1 | /*
2 | * To change this license header, choose License Headers in Project Properties.
3 | * To change this template file, choose Tools | Templates
4 | * and open the template in the editor.
5 | */
6 | package com.graphaware.nlp.processor.opennlp;
7 |
8 | import com.graphaware.nlp.util.OptionalNLPParameters;
9 | import java.util.ArrayList;
10 | import java.util.Arrays;
11 | import java.util.Collection;
12 | import java.util.HashMap;
13 | import java.util.HashSet;
14 | import java.util.List;
15 | import java.util.Map;
16 | import java.util.Set;
17 | import java.util.stream.Collectors;
18 | import opennlp.tools.util.Span;
19 |
20 | public class OpenNLPAnnotation {
21 |
22 | private static final double DEFAULT_SENTIMENT_PROBTHR = 0.7;
23 |
24 | private final String text;
25 | private List<Sentence> sentences;
26 | public static final String DEFAULT_LEMMA_OPEN_NLP = "O";
27 | public Map<String, String> otherParams;
28 |
29 | public OpenNLPAnnotation(String text, Map<String, String> otherParams) {
30 | this.text = text;
31 | this.otherParams = otherParams;
32 | }
33 |
34 | public OpenNLPAnnotation(String text) {
35 | this(text, null);
36 | }
37 |
38 | public String getText() {
39 | return text;
40 | }
41 |
42 | public void setSentences(Span[] sentencesArray) {
43 | sentences = new ArrayList<>();
44 | for (Span sentence : sentencesArray) {
45 | sentences.add(new Sentence(sentence, getText()));
46 | }
47 | }
48 |
49 | public List<Sentence> getSentences() {
50 | return sentences;
51 | }
52 |
53 | public double getSentimentProb() {
54 | if (otherParams != null && otherParams.containsKey(OptionalNLPParameters.SENTIMENT_PROB_THR)) {
55 | return Double.parseDouble(otherParams.get(OptionalNLPParameters.SENTIMENT_PROB_THR));
56 | }
57 | return DEFAULT_SENTIMENT_PROBTHR;
58 | }
59 |
60 | public Token getToken(String token, String lemma) {
61 | return new Token(token, lemma);
62 | }
63 |
64 | class Sentence {
65 |
66 | private final Span sentence;
67 | private final String sentenceText;
68 | private String sentenceSentiment;
69 | private List<Integer> nounphrases;
70 | private String[] words;
71 | private Span[] wordSpans;
72 | private String[] posTags;
73 | private String[] lemmas;
74 | private final Map<String, Token> tokens;
75 | private Span[] chunks;
76 | private String[] chunkStrings;
77 | private String[] chunkSentiments;
78 | private final String defaultStringValue = "-"; // @Deprecated
79 |
80 | public Sentence(Span sentence, String text) {
81 | this.sentence = sentence;
82 | this.sentenceText = String.valueOf(sentence.getCoveredText(text));
83 | this.tokens = new HashMap<>();
84 | }
85 |
86 | public void addPhraseIndex(int phraseIndex) {
87 | if (this.nounphrases == null) {
88 | this.nounphrases = new ArrayList<>();
89 | }
90 | this.nounphrases.add(phraseIndex);
91 | }
92 |
93 | public Span getSentenceSpan() {
94 | return this.sentence;
95 | }
96 |
97 | public String getSentence() {
98 | return this.sentenceText;
99 | }
100 |
101 | public String getSentiment() {
102 | return this.sentenceSentiment;
103 | }
104 |
105 | public void setSentiment(String sent) {
106 | this.sentenceSentiment = sent;
107 | }
108 |
109 | public String[] getWords() {
110 | return words;
111 | }
112 |
113 | public void setWords(String[] words) {
114 | this.words = words;
115 | }
116 |
117 | public Span[] getWordSpans() {
118 | return this.wordSpans;
119 | }
120 |
121 | public void setWordSpans(Span[] spans) {
122 | this.wordSpans = spans;
123 | }
124 |
125 | public void setWordsAndSpans(Span[] spans) {
126 | if (spans == null) {
127 | this.wordSpans = null;
128 | this.words = null;
129 | return;
130 | }
131 | this.wordSpans = spans;
132 | this.words = Arrays.asList(spans).stream()
133 | .map(span -> String.valueOf(span.getCoveredText(sentenceText)))
134 | .collect(Collectors.toList()).toArray(new String[wordSpans.length]);
135 | }
136 |
137 | public int getWordStart(int idx) {
138 | if (this.wordSpans.length > idx) {
139 | return this.wordSpans[idx].getStart();
140 | }
141 | return -1;
142 | }
143 |
144 | public int getWordEnd(int idx) {
145 | if (this.wordSpans.length > idx) {
146 | return this.wordSpans[idx].getEnd();
147 | }
148 | return -1;
149 | }
150 |
151 | public String[] getPosTags() {
152 | return this.posTags;
153 | }
154 |
155 | public void setPosTags(String[] posTags) {
156 | this.posTags = posTags;
157 | }
158 |
159 | public Span[] getChunks() {
160 | return this.chunks;
161 | }
162 |
163 | public void setChunks(Span[] chunks) {
164 | this.chunks = chunks;
165 | }
166 |
167 | public String[] getChunkStrings() {
168 | return this.chunkStrings;
169 | }
170 |
171 | public void setChunkStrings(String[] chunkStrings) {
172 | this.chunkStrings = chunkStrings;
173 | }
174 |
175 | public String[] getChunkSentiments() {
176 | return this.chunkSentiments;
177 | }
178 |
179 | public void setChunkSentiments(String[] sents) {
180 | if (sents == null) {
181 | return;
182 | }
183 | if (sents.length != this.chunks.length) {
184 | return;
185 | }
186 | this.chunkSentiments = sents;
187 | }
188 |
189 | // @Deprecated
190 | // public void setDefaultChunks() {
191 | // this.chunks = new Span[this.words.length];
192 | // Arrays.fill(this.chunks, new Span(0, 0));
193 | // this.chunkStrings = new String[this.words.length];
194 | // Arrays.fill(this.chunkStrings, defaultStringValue);
195 | // this.nounphrases = new ArrayList<>();
196 | // }
197 |
198 | public List<Integer> getPhrasesIndex() {
199 | //if (nounphrases==null)
200 | //return new ArrayList();
201 | return nounphrases;
202 | }
203 |
204 | public Collection<Token> getTokens() {
205 | return this.tokens.values();
206 | }
207 |
208 | public String[] getLemmas() {
209 | return this.lemmas;
210 | }
211 |
212 | public void setLemmas(String[] lemmas) {
213 | if (this.words == null || lemmas == null) {
214 | return;
215 | }
216 | if (this.words.length != lemmas.length) // ... something is wrong
217 | {
218 | return;
219 | }
220 | this.lemmas = lemmas;
221 | }
222 |
223 | protected Token getToken(String value, String lemma) {
224 | Token token;
225 | if (tokens.containsKey(value)) {
226 | token = tokens.get(value);
227 | } else {
228 | token = new Token(value, lemma);
229 | tokens.put(value, token);
230 | }
231 | return token;
232 | }
233 | }
234 |
235 | class Token {
236 |
237 | private final String token;
238 | private final Set<String> tokenPOS;
239 | private final String tokenLemmas;
240 | private final Set<String> tokenNEs;
241 | private final List<Span> tokenSpans;
242 |
243 | public Token(String token, String lemma) {
244 | this.token = token;
245 | this.tokenLemmas = lemma;
246 | this.tokenNEs = new HashSet<>();
247 | this.tokenPOS = new HashSet<>();
248 | this.tokenSpans = new ArrayList<>();
249 | }
250 |
251 | public List<Span> getTokenSpans() {
252 | return tokenSpans;
253 | }
254 |
255 | public String getToken() {
256 | return token;
257 | }
258 |
259 | public void addTokenSpans(Span tokenSpans) {
260 | this.tokenSpans.add(tokenSpans);
261 | }
262 |
263 | public Collection<String> getTokenPOS() {
264 | return tokenPOS;
265 | }
266 |
267 | public void addTokenPOS(Collection<String> tokenPOSes) {
268 | this.tokenPOS.addAll(tokenPOSes);
269 | }
270 |
271 | public void addTokenPOS(String tokenPOS) {
272 | this.tokenPOS.add(tokenPOS);
273 | }
274 |
275 | public String getTokenLemmas() {
276 | return tokenLemmas;
277 | }
278 |
279 | public Collection<String> getTokenNEs() {
280 | return tokenNEs;
281 | }
282 |
283 | public void addTokenNE(String ne) {
284 | this.tokenNEs.add(ne);
285 | }
286 |
287 | }
288 | }
289 |
--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPPipeline.java:
--------------------------------------------------------------------------------
1 | /*
2 | * To change this license header, choose License Headers in Project Properties.
3 | * To change this template file, choose Tools | Templates
4 | * and open the template in the editor.
5 | */
6 | package com.graphaware.nlp.processor.opennlp;
7 |
8 | import com.graphaware.nlp.processor.opennlp.model.NERModelTool;
9 | import com.graphaware.nlp.processor.opennlp.model.SentimentModelTool;
10 | import com.graphaware.nlp.processor.AbstractTextProcessor;
11 | import static com.graphaware.nlp.processor.opennlp.OpenNLPAnnotation.DEFAULT_LEMMA_OPEN_NLP;
12 | import java.io.File;
13 | import java.io.FileInputStream;
14 | import java.io.FileOutputStream;
15 | import java.io.BufferedOutputStream;
16 | import java.io.FileNotFoundException;
17 | import java.io.IOException;
18 | import java.io.InputStream;
19 | import java.lang.reflect.Constructor;
20 | import java.lang.reflect.InvocationTargetException;
21 | import java.net.URI;
22 | import java.net.URISyntaxException;
23 | import java.util.Properties;
24 | import java.util.HashMap;
25 | import java.util.Arrays;
26 | import java.util.ArrayList;
27 | import java.util.HashSet;
28 | import java.util.List;
29 | import java.util.Map;
30 | import java.util.Set;
31 | import java.util.concurrent.atomic.AtomicInteger;
32 | import java.util.stream.Collectors;
33 | import opennlp.tools.chunker.ChunkerME;
34 | import opennlp.tools.chunker.ChunkerModel;
35 | import opennlp.tools.postag.POSModel;
36 | import opennlp.tools.postag.POSTaggerME;
37 | import opennlp.tools.sentdetect.SentenceDetectorME;
38 | import opennlp.tools.sentdetect.SentenceModel;
39 | import opennlp.tools.tokenize.TokenizerME;
40 | import opennlp.tools.tokenize.TokenizerModel;
41 | import opennlp.tools.namefind.TokenNameFinderModel;
42 | import opennlp.tools.namefind.NameFinderME;
43 | import opennlp.tools.lemmatizer.DictionaryLemmatizer; // needs OpenNLP >=1.7
44 | //import opennlp.tools.lemmatizer.SimpleLemmatizer; // for OpenNLP < 1.7
45 | import opennlp.tools.doccat.DoccatModel;
46 | import opennlp.tools.doccat.DocumentCategorizerME;
47 | import opennlp.tools.util.Span;
48 | import opennlp.tools.util.model.BaseModel;
49 | import org.slf4j.Logger;
50 | import org.slf4j.LoggerFactory;
51 |
52 | public class OpenNLPPipeline {
53 |
54 | protected static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class);
55 |
56 | public static final String DEFAULT_BACKGROUND_SYMBOL = "O";
57 |
58 | protected static final String IMPORT_DIRECTORY = "import/";
59 |
60 | protected static final String PROPERTY_PATH_CHUNKER_MODEL = "chuncker";
61 | protected static final String PROPERTY_PATH_POS_TAGGER_MODEL = "pos";
62 | protected static final String PROPERTY_PATH_SENTENCE_MODEL = "sentence";
63 | protected static final String PROPERTY_PATH_TOKENIZER_MODEL = "tokenizer";
64 | protected static final String PROPERTY_PATH_LEMMATIZER_MODEL = "lemmatizer";
65 | protected static final String PROPERTY_PATH_SENTIMENT_MODEL = "sentiment";
66 |
67 | protected static final String PROPERTY_DEFAULT_CHUNKER_MODEL = "en-chunker.bin";
68 | protected static final String PROPERTY_DEFAULT_POS_TAGGER_MODEL = "en-pos-maxent.bin";
69 | protected static final String PROPERTY_DEFAULT_SENTENCE_MODEL = "en-sent.bin";
70 | protected static final String PROPERTY_DEFAULT_TOKENIZER_MODEL = "en-token.bin";
71 | protected static final String PROPERTY_DEFAULT_LEMMATIZER_MODEL = "en-lemmatizer.dict";
72 | protected static final String PROPERTY_DEFAULT_SENTIMENT_MODEL = "en-sentiment-tweets_toy.bin";
73 |
74 | protected static final String DEFAULT_PROJECT_VALUE = "default";
75 |
76 | protected final List<String> annotators;
77 | protected final List<String> stopWords;
78 |
79 | protected TokenizerME wordBreaker;
80 | protected POSTaggerME posme;
81 | protected ChunkerME chunkerME;
82 | protected SentenceDetectorME sentenceDetector;
83 | protected DictionaryLemmatizer lemmaDetector; // needs OpenNLP >=1.7
84 |
85 | protected Map<String, String> customNeModels = new HashMap<>();
86 | protected Map<String, String> customSentimentModels = new HashMap<>();
87 |
88 | protected Map<String, NameFinderME> nameDetectors = new HashMap<>();
89 | //protected Map sentimentDetectors = new HashMap<>();
90 | protected DocumentCategorizerME sentimentDetector;
91 |
92 | protected static Map<String, String> BASIC_NE_MODEL;
93 |
94 | static {
95 | BASIC_NE_MODEL = new HashMap<>();
96 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-person", "en-ner-person.bin");
97 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-date", "en-ner-date.bin");
98 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-location", "en-ner-location.bin");
99 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-time", "en-ner-time.bin");
100 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-organization", "en-ner-organization.bin");
101 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-money", "en-ner-money.bin");
102 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-percentage", "en-ner-percentage.bin");
103 | }
104 |
105 | public OpenNLPPipeline(Properties properties) {
106 | findModelFiles(IMPORT_DIRECTORY);
107 | this.annotators = Arrays.asList(properties.getProperty("annotators", "").split(",")).stream().map(str -> str.trim()).collect(Collectors.toList());
108 | this.stopWords = Arrays.asList(properties.getProperty("stopword", "").split(",")).stream().map(str -> str.trim().toLowerCase()).collect(Collectors.toList());
109 | init(properties);
110 | }
111 |
112 | private void init(Properties properties) {
113 | try {
114 | setSentenceSplitter(properties);
115 | setTokenizer(properties);
116 | setPosTagger(properties);
117 | setChunker(properties);
118 | loadNamedEntitiesFinders(properties);
119 | setLemmatizer(properties);
120 | setCategorizer(properties);
121 |
122 | } catch (IOException e) {
123 | LOG.error("Could not initialize OpenNLP models: " + e.getMessage());
124 | throw new RuntimeException("Could not initialize OpenNLP models", e);
125 | }
126 | }
127 |
128 | private void setChunker(Properties properties) throws FileNotFoundException {
129 | InputStream is = getInputStream(properties, PROPERTY_PATH_CHUNKER_MODEL, PROPERTY_DEFAULT_CHUNKER_MODEL);
130 | ChunkerModel chunkerModel = loadModel(ChunkerModel.class, is);
131 | closeInputStream(is, PROPERTY_PATH_CHUNKER_MODEL);
132 | chunkerME = new ChunkerME(chunkerModel);
133 | }
134 |
135 | private void setPosTagger(Properties properties) throws FileNotFoundException {
136 | InputStream is = getInputStream(properties, PROPERTY_PATH_POS_TAGGER_MODEL, PROPERTY_DEFAULT_POS_TAGGER_MODEL);
137 | POSModel pm = loadModel(POSModel.class, is);
138 | closeInputStream(is, PROPERTY_PATH_POS_TAGGER_MODEL);
139 | posme = new POSTaggerME(pm);
140 | }
141 |
142 | private void setTokenizer(Properties properties) throws FileNotFoundException {
143 | InputStream is = getInputStream(properties, PROPERTY_PATH_TOKENIZER_MODEL, PROPERTY_DEFAULT_TOKENIZER_MODEL);
144 | TokenizerModel tm = loadModel(TokenizerModel.class, is);
145 | closeInputStream(is, PROPERTY_PATH_TOKENIZER_MODEL);
146 | wordBreaker = new TokenizerME(tm);
147 | }
148 |
149 | private void setSentenceSplitter(Properties properties) throws FileNotFoundException {
150 | InputStream is = getInputStream(properties, PROPERTY_PATH_SENTENCE_MODEL, PROPERTY_DEFAULT_SENTENCE_MODEL);
151 | SentenceModel sentenceModel = loadModel(SentenceModel.class, is);
152 | closeInputStream(is, PROPERTY_PATH_SENTENCE_MODEL);
153 | sentenceDetector = new SentenceDetectorME(sentenceModel);
154 | }
155 |
156 | private void loadNamedEntitiesFinders(Properties properties) throws FileNotFoundException {
157 | // Default NE models
158 | BASIC_NE_MODEL.entrySet().stream().forEach((item) -> {
159 | InputStream is = getInputStream(properties, item.getKey(), item.getValue());
160 | if (is != null) {
161 | TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is);
162 | closeInputStream(is, item.getKey());
163 | nameDetectors.put(item.getKey(), new NameFinderME(nameModel));
164 | }
165 | });
166 |
167 | // Custom NE models (in the `import/` dir of the Neo4j installation)
168 | if (properties.containsKey("customNEs")) {
169 | List<String> requiredModels = Arrays.stream(properties.getProperty("customNEs").split(",")).map(String::trim).collect(Collectors.toList());
170 | for (String key: requiredModels) {
171 | if (!customNeModels.containsKey(key)) {
172 | LOG.error("Custom NE model " + key + " not found!");
173 | throw new RuntimeException("Custom NE model " + key + " not found!");
174 | }
175 | LOG.info("Extracting custom NER model: " + key);
176 | InputStream is = new FileInputStream(new File(customNeModels.get(key)));
177 | TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is);
178 | closeInputStream(is, key);
179 | nameDetectors.put(key, new NameFinderME(nameModel));
180 | LOG.info("Custom NER model " + key + " loaded for this pipeline.");
181 | }
182 | }
183 | }
184 |
185 | private void setLemmatizer(Properties properties) throws FileNotFoundException, IOException {
186 | InputStream is = getInputStream(properties, PROPERTY_PATH_LEMMATIZER_MODEL, PROPERTY_DEFAULT_LEMMATIZER_MODEL);
187 | lemmaDetector = new DictionaryLemmatizer(is);
188 | closeInputStream(is, PROPERTY_PATH_LEMMATIZER_MODEL);
189 | }
190 |
191 | private void setCategorizer(Properties properties) throws FileNotFoundException {
192 | // Default sentiment model
193 | if (!properties.containsKey("customSentiment")) {
194 | InputStream is = getInputStream(properties, PROPERTY_PATH_SENTIMENT_MODEL, PROPERTY_DEFAULT_SENTIMENT_MODEL);
195 | if (is != null) {
196 | DoccatModel doccatModel = loadModel(DoccatModel.class, is);
197 | closeInputStream(is, PROPERTY_PATH_SENTIMENT_MODEL);
198 | //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, new DocumentCategorizerME(doccatModel));
199 | sentimentDetector = new DocumentCategorizerME(doccatModel);
200 | } else {
201 | LOG.warn("No default sentiment detector available (input stream is null).");
202 | //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, null);
203 | sentimentDetector = null;
204 | }
205 | }
206 | // Custom sentiment model (currently only one is possible)
207 | else {
208 | String customModel = properties.getProperty("customSentiment");
209 | LOG.info("Extracting custom sentiment model: " + customModel);
210 | if (!customSentimentModels.containsKey(customModel)) {
211 | LOG.error("Custom sentiment model " + customModel + " not found!");
212 | throw new RuntimeException("Custom sentiment model " + customModel + " not found!");
213 | }
214 | try {
215 | InputStream is = new FileInputStream(new File(customSentimentModels.get(customModel)));
216 | // FileInputStream throws FileNotFoundException for a missing file rather than
217 | // returning null, so the IOException handler below covers that case.
220 | DoccatModel doccatModel = loadModel(DoccatModel.class, is);
221 | closeInputStream(is, customSentimentModels.get(customModel));
222 | //sentimentDetectors.put(customModel, new DocumentCategorizerME(doccatModel));
223 | sentimentDetector = new DocumentCategorizerME(doccatModel);
224 | LOG.info("Custom sentiment model " + customModel + " loaded for this pipeline.");
225 | } catch (IOException ex) {
226 | LOG.error("Error while opening file " + customSentimentModels.get(customModel), ex);
227 | }
228 | }
229 | }
230 |
231 | public void annotate(OpenNLPAnnotation document) {
232 | String text = document.getText();
233 | try {
234 | Span[] sentences = sentenceDetector.sentPosDetect(text);
235 | document.setSentences(sentences);
236 | document.getSentences().stream()
237 | .forEach((OpenNLPAnnotation.Sentence sentence) -> {
238 | if (annotators.contains("tokenize") && wordBreaker != null) {
239 | Span[] wordSpans = wordBreaker.tokenizePos(sentence.getSentence());
240 | if (wordSpans != null && wordSpans.length > 0) {
241 | sentence.setWordsAndSpans(wordSpans);
242 |
243 | if (annotators.contains("pos") && posme != null) {
244 | String[] posTags = posme.tag(sentence.getWords());
245 | sentence.setPosTags(posTags);
246 | if (annotators.contains("lemma")) {
247 | String[] finLemmas = lemmaDetector.lemmatize(sentence.getWords(), posTags);
248 | sentence.setLemmas(finLemmas);
249 | }
250 |
251 | //FIXME: this is wrong
252 | // if (annotators.contains("relation")) {
253 | // Span[] chunks = chunkerME.chunkAsSpans(sentence.getWords(), posTags);
254 | // sentence.setChunks(chunks);
255 | // LOG.info("Found " + chunks.length + " phrases.");
256 | // String[] chunkStrings = Span.spansToStrings(chunks, sentence.getWords());
257 | // sentence.setChunkStrings(chunkStrings);
258 | // List chunkSentiments = new ArrayList<>();
259 | // for (int i = 0; i < chunks.length; i++) {
260 | // sentence.addPhraseIndex(i);
261 | // }
262 | // if (!chunkSentiments.isEmpty()) {
263 | // sentence.setChunkSentiments(chunkSentiments.toArray(new String[chunkSentiments.size()]));
264 | // }
265 | // }
266 | }
267 |
268 | Map<Integer, List<Span>> nerOccurrences = new HashMap<>();
269 | if (annotators.contains("ner") && sentence.getWords() != null) {
270 |
271 | // Named Entities identification; needs to be performed after lemmas and POS (see implementation of Sentence.addNamedEntities())
272 | BASIC_NE_MODEL.keySet().stream().forEach((modelKey) -> {
273 | if (!nameDetectors.containsKey(modelKey)) {
274 | LOG.warn("NER model with key " + modelKey + " not available.");
275 | } else {
276 | List<Span> ners = Arrays.asList(nameDetectors.get(modelKey).find(sentence.getWords()));
277 | addNer(ners, nerOccurrences);
278 | }
279 | });
280 |
281 | if (!customNeModels.isEmpty()) {
282 | for (String key : customNeModels.keySet()) {
283 | if (!nameDetectors.containsKey(key)) {
284 | LOG.warn("Custom NER model with key " + key + " not available.");
285 | continue;
286 | }
287 | if (key.split("-").length == 0) {
288 | continue;
289 | }
290 | LOG.info("Running custom NER: " + key);
291 | List<Span> ners = Arrays.asList(nameDetectors.get(key).find(sentence.getWords()));
292 | addNer(ners, nerOccurrences);
293 | }
294 | }
295 | }
296 | processTokens(sentence, nerOccurrences);
297 | }
298 | }
299 | if (sentence.getWords() != null && sentence.getWords().length > 0) {
300 | if (annotators.contains("sentiment") && sentimentDetector != null) {
301 | double[] outcomes = sentimentDetector.categorize(sentence.getWords());
302 | String category = sentimentDetector.getBestCategory(outcomes);
303 | if (Arrays.stream(outcomes).max().getAsDouble() < document.getSentimentProb()) {
304 | category = "2";
305 | }
306 | sentence.setSentiment(category);
307 | LOG.info("Sentiment results: sentence = " + sentence.getSentence() + "; category = " + category + "; outcomes = " + Arrays.toString(outcomes));
308 | }
309 | }
310 | });
311 |
312 | // if (annotators.contains("ner")) {
313 | // for (String key : BASIC_NE_MODEL.keySet()) {
314 | // if (nameDetectors.containsKey(key)) {
315 | // nameDetectors.get(key).clearAdaptiveData();
316 | // }
317 | // }
318 | // if (customProject != null) {
319 | // for (String key : customNeModels.keySet()) {
320 | // if (nameDetectors.containsKey(key)) {
321 | // nameDetectors.get(key).clearAdaptiveData();
322 | // }
323 | // }
324 | // }
325 | // }
326 | } catch (Exception ex) {
327 | LOG.error("Error processing sentence for text: " + text, ex);
328 | throw new RuntimeException("Error processing sentence for text: " + text, ex);
329 | }
330 | }
331 |
332 | protected void addNer(List<Span> ners, Map<Integer, List<Span>> nerOccurrences) {
333 | if (ners != null && !ners.isEmpty()) {
334 | ners.stream().forEach((ner) -> {
335 | List<Span> currentNer = nerOccurrences.get(ner.getStart());
336 | if (currentNer == null) {
337 | currentNer = new ArrayList<>();
338 | nerOccurrences.put(ner.getStart(), currentNer);
339 | }
340 | currentNer.add(ner);
341 | });
342 | }
343 | }
344 |
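The bookkeeping in `addNer` above (indexing every detected `Span` by its start token so hits from several NER models can be merged later) can be sketched as a standalone snippet. The `Span` record here is a hypothetical stand-in for `opennlp.tools.util.Span`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpanGrouping {

    // Hypothetical stand-in for opennlp.tools.util.Span: a start token index and a type.
    record Span(int start, String type) {}

    // Same bookkeeping as addNer(): group spans by their start index so that
    // overlapping hits from several NER models end up in one bucket per token.
    static Map<Integer, List<Span>> groupByStart(List<Span> spans) {
        Map<Integer, List<Span>> byStart = new HashMap<>();
        for (Span s : spans) {
            byStart.computeIfAbsent(s.start(), k -> new ArrayList<>()).add(s);
        }
        return byStart;
    }

    public static void main(String[] args) {
        List<Span> found = List.of(
                new Span(0, "person"), new Span(0, "organization"), new Span(3, "date"));
        Map<Integer, List<Span>> grouped = groupByStart(found);
        System.out.println(grouped.get(0).size()); // two models fired on token 0
        System.out.println(grouped.get(3).get(0).type()); // date
    }
}
```

`computeIfAbsent` replaces the explicit null check used in `addNer` but is behaviorally equivalent.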
345 | public String train(String alg, String modelId, String fileTrain, String lang, Map<String, String> params) {
346 | String fileOut = createModelFileName(lang, alg, modelId);
347 | String newKey = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase();
348 | String result = "";
349 |
350 | if (alg.equalsIgnoreCase("ner")) {
351 | NERModelTool nerModel = new NERModelTool(fileTrain, modelId, lang, params);
352 | nerModel.train();
353 | result = nerModel.validate();
354 | nerModel.saveModel(fileOut);
355 | // incorporate this model to the OpenNLPPipeline
356 | if (nerModel.getModel() != null) {
357 | customNeModels.put(newKey, fileOut);
358 | /*if (!nameDetectors.containsKey(newKey)) {
359 | nameDetectors.put(newKey, new NameFinderME((TokenNameFinderModel) nerModel.getModel()));
360 | }*/
361 | }
362 | }
363 | else if (alg.equalsIgnoreCase("sentiment")) {
364 | SentimentModelTool sentModel = new SentimentModelTool(fileTrain, modelId, lang, params);
365 | sentModel.train();
366 | result = sentModel.validate();
367 | String[] dirPathSplit = fileTrain.split(File.separator);
368 | String fileOutToUse;
369 | if (dirPathSplit.length > 2) {
370 | StringBuilder sb = new StringBuilder();
371 | for (int i = 0; i < dirPathSplit.length - 2; ++i) {
372 | sb.append(dirPathSplit[i]).append(File.separator);
373 | }
374 | fileOutToUse = sb.toString() + fileOut;
375 | } else {
376 | fileOutToUse = fileOut;
377 | }
378 | LOG.info("Saving model to " + fileOutToUse);
379 | sentModel.saveModel(fileOutToUse);
380 | // incorporate this model to the OpenNLPPipeline
381 | if (sentModel.getModel() != null) {
382 | customSentimentModels.put(newKey, fileOutToUse);
383 | //sentimentDetectors.put(newKey, new DocumentCategorizerME((DoccatModel) sentModel.getModel()));
384 | }
385 | } else {
386 | throw new UnsupportedOperationException("Undefined training procedure for algorithm " + alg);
387 | }
388 |
389 | return result;
390 | }
391 |
392 | public String test(String alg, String modelId, String file, String lang) {
393 | String modelID = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase();
394 | String result = "failure";
395 |
396 | if (alg.equalsIgnoreCase("ner")) {
397 | if (customNeModels.containsKey(modelID)) {
398 | LOG.info("Testing NER model: " + modelID);
399 |
400 | TokenNameFinderModel nameModel;
401 | try {
402 | // Load model
403 | InputStream is = new FileInputStream(new File(customNeModels.get(modelID)));
404 | nameModel = loadModel(TokenNameFinderModel.class, is);
405 | closeInputStream(is, modelID);
406 | } catch (Exception e) {
407 | throw new RuntimeException("Loading custom NER model " + modelID + " failed: ", e);
408 | }
409 |
410 | NERModelTool nerModel = new NERModelTool();
411 | result = nerModel.test(file, new NameFinderME(nameModel));
412 | } else
413 | LOG.error("Required NER model doesn't exist: " + modelID);
414 | }
415 | else if (alg.equalsIgnoreCase("sentiment")) {
416 | if (customSentimentModels.containsKey(modelID)) {
417 | LOG.info("Testing sentiment model: " + modelID);
418 |
419 | DoccatModel doccatModel;
420 | try {
421 | // Load model
422 | InputStream is = new FileInputStream(new File(customSentimentModels.get(modelID)));
423 | doccatModel = loadModel(DoccatModel.class, is);
424 | closeInputStream(is, customSentimentModels.get(modelID));
425 | } catch (Exception e) {
426 | throw new RuntimeException("Loading custom sentiment model " + modelID + " failed: ", e);
427 | }
428 |
429 | SentimentModelTool sentModel = new SentimentModelTool();
430 | result = sentModel.test(file, new DocumentCategorizerME(doccatModel));
431 | } else
432 | LOG.error("Required sentiment model doesn't exist: " + modelID);
433 | } else {
434 | throw new UnsupportedOperationException("Undefined testing procedure for algorithm " + alg);
435 | }
436 | return result;
437 | }
438 |
439 | private void processTokens(OpenNLPAnnotation.Sentence sentence, Map<Integer, List<Span>> nerOccurrences) {
440 | if (sentence.getWords() == null) {
441 | return;
442 | }
443 | String[] words = sentence.getWords();
444 | String[] lemmas = sentence.getLemmas();
445 | String[] posTags = sentence.getPosTags();
446 | Span[] wordSpans = sentence.getWordSpans();
447 |
448 | for (int i = 0; i < words.length; i++) {
449 | if (nerOccurrences != null && nerOccurrences.containsKey(i)) {
450 | List<Span> ners = nerOccurrences.get(i);
451 | final int startSpan = wordSpans[i].getStart();
452 | AtomicInteger index = new AtomicInteger(i);
453 | ners.forEach(ne -> {
454 | String value = "";
455 | String lemma = "";
456 | String type = ne.getType().toUpperCase();
457 | Set<String> posSet = new HashSet<>();
458 | int endSpan = startSpan;
459 | for (int j = ne.getStart(); j < ne.getEnd(); j++) {
460 | value += " " + words[j].trim();
461 | lemma += " " + (lemmas[j].equals(DEFAULT_LEMMA_OPEN_NLP) ? words[j].toLowerCase().trim() : lemmas[j].trim());
462 | posSet.add(posTags[j]);
463 | endSpan = wordSpans[j].getEnd();
464 | if (index.get() < j) {
465 | index.set(j);
466 | }
467 | }
468 |
469 | value = value.trim();
470 | lemma = lemma.trim();
471 | //check stopwords
472 | if (isNotStopWord(lemma)) {
473 | OpenNLPAnnotation.Token token = sentence.getToken(value, lemma);
474 | token.addTokenNE(type);
475 | token.addTokenPOS(posSet);
476 | token.addTokenSpans(new Span(startSpan, endSpan));
477 | }
478 | });
479 | i = index.get();
480 | } else {
481 | String value = words[i].trim();
482 | String lemma = lemmas[i].equals(DEFAULT_LEMMA_OPEN_NLP) ? words[i].toLowerCase() : lemmas[i].trim();
483 | String ne = DEFAULT_BACKGROUND_SYMBOL;
484 | String pos = posTags[i];
485 | Set<String> posSet = new HashSet<>();
486 | if (isNotStopWord(lemma)) {
487 | OpenNLPAnnotation.Token token = sentence.getToken(value, lemma);
488 | token.addTokenNE(ne);
489 | posSet.add(pos);
490 | token.addTokenPOS(posSet);
491 | token.addTokenSpans(wordSpans[i]);
492 | }
493 | }
494 | }
495 | }
496 |
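The inner loop of `processTokens` concatenates the words covered by a named-entity span into a single token value. A minimal, OpenNLP-free sketch of that merge (method and class names are illustrative):

```java
public class NeMerge {

    // Join the words covered by a named-entity span [start, end) into one value,
    // trimming each word, as the loop over ne.getStart()..ne.getEnd() does
    // in processTokens().
    static String mergeValue(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int j = start; j < end; j++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(words[j].trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"Barack", "Obama", "visited", "Prague"};
        System.out.println(mergeValue(words, 0, 2)); // Barack Obama
    }
}
```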
497 | private boolean isNotStopWord(String value) {
498 | return !annotators.contains("stopword") || !stopWords.contains(value.toLowerCase());
499 | }
500 |
501 | private void findModelFiles(String path) {
502 | if (path == null || path.length() == 0) {
503 | LOG.error("Scanning for model files: wrong path specified.");
504 | return;
505 | }
506 |
507 | File folder = new File(path);
508 | File[] listOfFiles = folder.listFiles();
509 | if (listOfFiles == null) {
510 | return;
511 | }
512 |
513 | if (!path.endsWith("/")) {
514 | path += "/";
515 | }
517 |
518 | for (File file : listOfFiles) {
519 | if (!file.isFile()) {
520 | continue;
521 | }
522 | String name = file.getName();
523 | String[] sp = name.split("-");
524 | if (sp.length < 2) {
525 | continue;
526 | }
527 | if (!name.endsWith(".bin")) {
528 | continue;
529 | }
530 | LOG.info("Custom models: Found file " + name);
531 |
532 | String alg = sp[0].toLowerCase();
533 |
534 | String modelId = sp[1];
535 | // handles the case where a user-defined model ID itself contains "-"
536 | for (int j = 2; j < sp.length; j++)
537 | modelId += "-" + sp[j];
538 | modelId = modelId.substring(0, modelId.length() - 4).toLowerCase(); // remove ".bin"
539 | //modelId = lang + "-" + modelId;
540 |
541 | LOG.info("Registering model name for algorithm " + alg + " under the key " + modelId);
542 | if (alg.equals("ner")) {
543 | customNeModels.put(modelId, path + name);
544 | } else if (alg.equals("sentiment")) {
545 | customSentimentModels.put(modelId, path + name);
546 | }
547 | }
548 | }
549 |
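A standalone sketch of the filename convention `findModelFiles` scans for: `<alg>-<modelId>.bin`, split on the first dash so that a model ID may itself contain dashes. Class and method names here are illustrative:

```java
import java.util.Optional;

public class ModelFileNameParser {

    // Parse "<alg>-<modelId>.bin" into {alg, modelId}, both lowercased;
    // returns empty for names that don't match the convention.
    static Optional<String[]> parse(String fileName) {
        if (!fileName.endsWith(".bin")) {
            return Optional.empty();
        }
        String base = fileName.substring(0, fileName.length() - 4);
        int dash = base.indexOf('-');
        if (dash < 1 || dash == base.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new String[]{
                base.substring(0, dash).toLowerCase(),
                base.substring(dash + 1).toLowerCase()});
    }

    public static void main(String[] args) {
        String[] parsed = parse("ner-My-Custom-Model.bin").orElseThrow();
        System.out.println(parsed[0]); // ner
        System.out.println(parsed[1]); // my-custom-model
        System.out.println(parse("README.md").isPresent()); // false
    }
}
```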
550 | private <T> T loadModel(Class<T> clazz, InputStream in) {
551 | try {
552 | Constructor<T> modelConstructor = clazz.getConstructor(InputStream.class);
553 | T model = modelConstructor.newInstance(in);
554 | return model;
555 | } catch (NoSuchMethodException | SecurityException | InstantiationException | IllegalAccessException | IllegalArgumentException | InvocationTargetException ex) {
556 | LOG.error("Error while initializing model of class: " + clazz, ex);
557 | throw new RuntimeException("Error while initializing model of class: " + clazz, ex);
558 | }
559 | }
560 |
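`loadModel` relies on every OpenNLP model class exposing a public `(InputStream)` constructor and instantiates it reflectively. The same pattern in isolation, with a toy model class standing in for the OpenNLP types:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.lang.reflect.Constructor;

public class ReflectiveLoader {

    // Toy model type standing in for OpenNLP's TokenNameFinderModel, DoccatModel, etc.:
    // all that matters here is the public (InputStream) constructor.
    public static class ToyModel {
        final int firstByte;
        public ToyModel(InputStream in) throws Exception { firstByte = in.read(); }
    }

    // Same pattern as loadModel(): look up the InputStream constructor
    // and instantiate the model class generically.
    static <T> T load(Class<T> clazz, InputStream in) {
        try {
            Constructor<T> ctor = clazz.getConstructor(InputStream.class);
            return ctor.newInstance(in);
        } catch (ReflectiveOperationException ex) {
            throw new RuntimeException("Error while initializing model of class: " + clazz, ex);
        }
    }

    public static void main(String[] args) {
        ToyModel m = load(ToyModel.class, new ByteArrayInputStream(new byte[]{42}));
        System.out.println(m.firstByte); // 42
    }
}
```

Catching `ReflectiveOperationException` covers the same checked exceptions the original enumerates individually.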
561 | private void saveModel(BaseModel model, String file) {
562 | if (model == null) {
563 | LOG.error("Can't save training results to " + file + ": model is null");
564 | return;
565 | }
566 | try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(file))) {
567 | model.serialize(modelOut);
568 | } catch (IOException ex) {
569 | LOG.error("Error saving model to file " + file, ex);
570 | throw new RuntimeException("Error saving model to file " + file, ex);
571 | }
572 | }
577 |
578 | private InputStream getInputStream(Properties properties, String property, String defaultValue) {
579 | String path = defaultValue;
580 | if (properties != null) {
581 | path = properties.getProperty(property, defaultValue);
582 | }
583 | InputStream is;
584 | try {
585 | if (path.startsWith("file://")) {
586 | is = new FileInputStream(new File(new URI(path)));
587 | } else if (path.startsWith("/")) {
588 | is = new FileInputStream(new File(path));
589 | } else {
590 | is = this.getClass().getResourceAsStream(path);
591 | }
592 | } catch (FileNotFoundException | URISyntaxException ex) {
593 | LOG.error("Error while loading model from path: " + path, ex);
594 | throw new RuntimeException("Error while loading model from path: " + path, ex);
595 | }
596 | return is;
597 | }
598 |
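`getInputStream` distinguishes three lookup schemes for a model path: `file://` URIs and absolute paths are read from disk, anything else is treated as a classpath resource. That dispatch, extracted as a pure function (the enum and its names are illustrative):

```java
import java.net.URI;

public class ModelPathResolver {

    enum Source { FILE_URI, ABSOLUTE_PATH, CLASSPATH }

    // The branching logic of getInputStream(), without the I/O:
    // "file://" must be checked before the leading-slash test.
    static Source classify(String path) {
        if (path.startsWith("file://")) {
            return Source.FILE_URI;
        }
        if (path.startsWith("/")) {
            return Source.ABSOLUTE_PATH;
        }
        return Source.CLASSPATH;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(classify("file:///models/en-sent.bin")); // FILE_URI
        System.out.println(classify("/models/en-sent.bin"));        // ABSOLUTE_PATH
        System.out.println(classify("en-sent.bin"));                // CLASSPATH
        // A file:// string still needs URI parsing before File construction:
        System.out.println(new URI("file:///models/en-sent.bin").getPath()); // /models/en-sent.bin
    }
}
```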
599 | private void closeInputStream(InputStream is, String name) {
600 | try {
601 | if (is != null) {
602 | is.close();
603 | }
604 | } catch (IOException ex) {
605 | LOG.warn("Attempt to close stream for " + name + " model failed.", ex);
606 | }
607 | }
609 |
610 | private String createModelFileName(String lang, String alg, String model) {
611 | String delim = "-";
612 | //String name = "import/" + lang.toLowerCase() + delim + alg.toLowerCase();
613 | String name = "import/" + alg.toLowerCase();
614 | if (model != null && model.length() > 0) {
615 | name += delim + model.toLowerCase();
616 | }
619 | name += ".bin";
620 | return name;
621 | }
622 |
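`createModelFileName` builds the on-disk name that `findModelFiles` later parses back: `import/<alg>[-<modelId>].bin`, lowercased. A self-contained mirror of that naming scheme (class name illustrative):

```java
public class ModelNameBuilder {

    // Mirrors createModelFileName(): "import/<alg>[-<modelId>].bin", lowercased.
    static String buildName(String alg, String modelId) {
        String name = "import/" + alg.toLowerCase();
        if (modelId != null && !modelId.isEmpty()) {
            name += "-" + modelId.toLowerCase();
        }
        return name + ".bin";
    }

    public static void main(String[] args) {
        System.out.println(buildName("NER", "MyModel")); // import/ner-mymodel.bin
        System.out.println(buildName("sentiment", ""));  // import/sentiment.bin
    }
}
```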
623 |
624 | /*class ImprovisedInputStreamFactory implements InputStreamFactory {
625 | private File inputSourceFile;
626 | private String inputSourceStr;
627 |
628 | ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) {
629 | this.inputSourceFile = null;
630 | this.inputSourceStr = defaultValue;
631 | if (properties!=null) this.inputSourceStr = properties.getProperty(property, defaultValue);
632 |
633 | try {
634 | if (this.inputSourceStr.startsWith("file://"))
635 | this.inputSourceFile = new File(new URI(this.inputSourceStr));
636 | else if (this.inputSourceStr.startsWith("/"))
637 | this.inputSourceFile = new File(this.inputSourceStr);
638 | } catch (Exception ex) {
639 | LOG.error("Error while loading model from " + this.inputSourceStr);
640 | throw new RuntimeException("Error while loading model from " + this.inputSourceStr);
641 | }
642 | }
643 |
644 | @Override
645 | public InputStream createInputStream() throws IOException {
646 | LOG.debug("Creating input stream from " + this.inputSourceFile.getPath());
647 | //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath());
648 | return new FileInputStream(this.inputSourceFile.getPath());
649 | }
650 | }*/
651 |
652 | public Properties getProperties() {
653 | return new Properties(); // to be implemented
654 | }
655 | }
656 |
--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPTextProcessor.java:
--------------------------------------------------------------------------------
1 | /*
2 | * Copyright (c) 2013-2016 GraphAware
3 | *
4 | * This file is part of the GraphAware Framework.
5 | *
6 | * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of
7 | * the GNU General Public License as published by the Free Software Foundation, either
8 | * version 3 of the License, or (at your option) any later version.
9 | *
10 | * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
11 | * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
12 | * See the GNU General Public License for more details. You should have received a copy of
13 | * the GNU General Public License along with this program. If not, see
14 | * .
15 | */
16 | package com.graphaware.nlp.processor.opennlp;
17 |
18 | import com.graphaware.nlp.annotation.NLPTextProcessor;
19 | import com.graphaware.nlp.domain.*;
20 | import com.graphaware.nlp.dsl.request.PipelineSpecification;
21 | import com.graphaware.nlp.processor.AbstractTextProcessor;
22 |
23 | import java.util.*;
24 | import java.util.concurrent.atomic.AtomicInteger;
25 | import java.util.stream.Collectors;
26 |
27 | import com.graphaware.nlp.util.Timer;
28 | import opennlp.tools.util.Span;
29 | import org.jetbrains.annotations.NotNull;
30 | import org.slf4j.Logger;
31 | import org.slf4j.LoggerFactory;
32 |
33 | @NLPTextProcessor(name = "OpenNLPTextProcessor")
34 | public class OpenNLPTextProcessor extends AbstractTextProcessor {
35 |
36 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPTextProcessor.class);
37 |
38 | private static final String CORE_PIPELINE_NAME = "OpenNLP.CORE";
39 | public static final String TOKENIZER = "tokenizer";
40 | public static final String SENTIMENT = "sentiment";
41 |
42 | private final Map<String, OpenNLPPipeline> pipelines = new HashMap<>();
43 |
44 |
45 | @Override
46 | public void init() {
47 | }
48 |
49 | @Override
50 | public String getAlias() {
51 | return "opennlp";
52 | }
53 |
54 | @Override
55 | public String override() {
56 | return null;
57 | }
58 |
59 | public OpenNLPPipeline getPipeline(String name) {
60 | if (name == null || name.isEmpty()) {
61 | name = TOKENIZER;
62 | LOG.debug("Using default pipeline: " + name);
63 | }
64 | OpenNLPPipeline pipeline = getOpenNLPPipeline(name);
65 | return pipeline;
66 | }
67 |
68 | private void checkPipelineExistOrCreate(PipelineSpecification pipelineSpecification) {
69 | if (!pipelines.containsKey(pipelineSpecification.getName())) {
70 | createPipeline(pipelineSpecification);
71 | }
72 | }
73 |
74 | /* private void createFullPipeline() {
75 | OpenNLPPipeline pipeline = new PipelineBuilder()
76 | .tokenize()
77 | .extractNEs()
78 | .defaultStopWordAnnotator()
79 | .extractRelations()
80 | .extractSentiment()
81 | .threadNumber(6)
82 | .build();
83 | pipelines.put(CORE_PIPELINE_NAME, pipeline);
84 | }
85 |
86 | private void createTokenizerPipeline() {
87 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
88 | pipelines.put(TOKENIZER, pipeline);
89 | }
90 |
91 | private void createSentimentPipeline() {
92 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
93 | pipelines.put(SENTIMENT, pipeline);
94 | }
95 |
96 | private void createTokenizerAndSentimentPipeline() {
97 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
98 | pipelines.put(TOKENIZER_AND_SENTIMENT, pipeline);
99 | }
100 |
101 | private void createPhrasePipeline() {
102 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
103 | pipelines.put(PHRASE, pipeline);
104 | }*/
105 |
106 | @Override
107 | public AnnotatedText annotateText(String text, String lang, PipelineSpecification pipelineSpecification) {
108 | Timer timer = Timer.start();
109 | checkPipelineExistOrCreate(pipelineSpecification);
110 | timer.lap("pipeline check");
111 | OpenNLPPipeline pipeline = pipelines.get(pipelineSpecification.getName());
112 | OpenNLPAnnotation document = new OpenNLPAnnotation(text, Collections.emptyMap());
113 | pipeline.annotate(document);
114 |
115 | AnnotatedText result = new AnnotatedText();
116 | List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
117 | final AtomicInteger sentenceSequence = new AtomicInteger(0);
118 | sentences.stream().forEach((sentence) -> {
119 | int sentenceNumber = sentenceSequence.getAndIncrement();
120 | final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber);
121 | extractTokens(lang, sentence, newSentence);
122 | if (pipelineSpecification.hasProcessingStep(STEP_SENTIMENT)) {
123 | extractSentiment(sentence, newSentence);
124 | }
125 | if (pipelineSpecification.hasProcessingStep(STEP_PHRASE)) {
126 | extractPhrases(sentence, newSentence);
127 | }
128 | result.addSentence(newSentence);
129 | });
130 |
131 | return result;
132 | }
133 |
134 | protected Map<String, Object> getPipelineProperties(OpenNLPPipeline pipeline) {
135 | Map<String, Object> options = new HashMap<>();
136 | for (Object o : pipeline.getProperties().keySet()) {
137 | if (o instanceof String) {
138 | options.put(o.toString(), pipeline.getProperties().getProperty(o.toString()));
139 | }
140 | }
141 |
142 | return options;
143 | }
144 |
145 | protected Map<String, Boolean> buildSpecifications(List<String> actives) {
146 | List<String> all = Arrays.asList("tokenize", "ner", "cleanxml", "truecase", "dependency", "relations", "checkLemmaIsStopWord", "coref", "sentiment", "phrase", "customSentiment", "customNER");
147 | Map<String, Boolean> specs = new HashMap<>();
148 | all.forEach(s -> {
149 | specs.put(s, actives.contains(s));
150 | });
151 |
152 | return specs;
153 | }
154 |
155 |
156 | /* @Override
157 | public AnnotatedText annotateText(String text, String name, String lang, Map otherParams) {
158 | if (name.length() == 0) {
159 | name = TOKENIZER;
160 | LOG.info("Using default pipeline: " + name);
161 | }
162 | OpenNLPPipeline pipeline = pipelines.get(name);
163 | if (pipeline == null) {
164 | throw new RuntimeException("Pipeline: " + name + " doesn't exist");
165 | }
166 | OpenNLPAnnotation document = new OpenNLPAnnotation(text, otherParams);
167 | pipeline.annotate(document);
168 | // LOG.info("Annotation for id " + id + " finished.");
169 |
170 | AnnotatedText result = new AnnotatedText();
171 | List sentences = document.getSentences();
172 | final AtomicInteger sentenceSequence = new AtomicInteger(0);
173 | sentences.stream().forEach((sentence) -> {
174 | int sentenceNumber = sentenceSequence.getAndIncrement();
175 | // String sentenceId = id + "_" + sentenceNumber;
176 | final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber);
177 | extractTokens(lang, sentence, newSentence);
178 | extractSentiment(sentence, newSentence);
179 | extractPhrases(sentence, newSentence);
180 | result.addSentence(newSentence);
181 | });
182 | //extractRelationship(result, sentences, document);
183 | return result;
184 | }
185 | */
186 | private void extractPhrases(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) {
187 | if (sentence.getPhrasesIndex() == null) {
188 | LOG.warn("extractPhrases(): phrases index empty, aborting extraction");
189 | return;
190 | }
191 | sentence.getPhrasesIndex().forEach(index -> {
192 | Span chunk = sentence.getChunks()[index];
193 | String chunkString = sentence.getChunkStrings()[index];
194 | newSentence.addPhraseOccurrence(chunk.getStart(), chunk.getEnd(), new Phrase(chunkString, chunk.getType()));
195 | });
196 | }
197 |
198 | private void extractSentiment(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) {
199 | int score = -1;
200 | if (sentence.getSentiment() != null) { // && !sentence.getSentiment().equals("-")) {
201 | try {
202 | score = Integer.valueOf(sentence.getSentiment());
203 | } catch (NumberFormatException ex) {
204 | LOG.error("NumberFormatException: error extracting sentiment " + sentence.getSentiment() + " as a number.", ex);
205 | }
206 | }
207 | newSentence.setSentiment(score);
208 | }
209 |
210 | private void extractTokens(String lang, OpenNLPAnnotation.Sentence sentence, final Sentence newSentence) {
211 | Collection<OpenNLPAnnotation.Token> tokens = sentence.getTokens();
212 | tokens.stream().filter((token) -> token != null /*&& checkLemmaIsValid(token.getToken())*/).forEach((token) -> {
213 | Tag newTag = getTag(token, lang);
214 | if (newTag != null) {
215 | Tag tagInSentence = newSentence.addTag(newTag);
216 | token.getTokenSpans().stream().forEach((span) -> {
217 | newSentence.addTagOccurrence(span.getStart(), span.getEnd(), token.getToken(), tagInSentence);
218 | });
219 | }
220 | });
221 | }
222 |
223 | // private void extractRelationship(AnnotatedText annotatedText, List sentences, Annotation document) {
224 | // Map corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
225 | // if (corefChains != null) {
226 | // for (CorefChain chain : corefChains.values()) {
227 | // CorefChain.CorefMention representative = chain.getRepresentativeMention();
228 | // int representativeSenteceNumber = representative.sentNum - 1;
229 | // List representativeTokens = sentences.get(representativeSenteceNumber).get(CoreAnnotations.TokensAnnotation.class);
230 | // int beginPosition = representativeTokens.get(representative.startIndex - 1).beginPosition();
231 | // int endPosition = representativeTokens.get(representative.endIndex - 2).endPosition();
232 | // Phrase representativePhraseOccurrence = annotatedText.getSentences().get(representativeSenteceNumber).getPhraseOccurrence(beginPosition, endPosition);
233 | // if (representativePhraseOccurrence == null) {
234 | // LOG.warn("Representative Phrase not found: " + representative.mentionSpan);
235 | // }
236 | // for (CorefChain.CorefMention mention : chain.getMentionsInTextualOrder()) {
237 | // if (mention == representative) {
238 | // continue;
239 | // }
240 | // int mentionSentenceNumber = mention.sentNum - 1;
241 | //
242 | // List mentionTokens = sentences.get(mentionSentenceNumber).get(CoreAnnotations.TokensAnnotation.class);
243 | // int beginPositionMention = mentionTokens.get(mention.startIndex - 1).beginPosition();
244 | // int endPositionMention = mentionTokens.get(mention.endIndex - 2).endPosition();
245 | // Phrase mentionPhraseOccurrence = annotatedText.getSentences().get(mentionSentenceNumber).getPhraseOccurrence(beginPositionMention, endPositionMention);
246 | // if (mentionPhraseOccurrence == null) {
247 | // LOG.warn("Mention Phrase not found: " + mention.mentionSpan);
248 | // }
249 | // if (representativePhraseOccurrence != null
250 | // && mentionPhraseOccurrence != null) {
251 | // mentionPhraseOccurrence.setReference(representativePhraseOccurrence);
252 | // }
253 | // }
254 | // }
255 | // }
256 | // }
257 | @Override
258 | public Tag annotateSentence(String text, String lang, PipelineSpecification pipelineSpecification) {
259 | // Annotation document = new Annotation(text);
260 | // pipelines.get(SENTIMENT).annotate(document);
261 | // List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
262 | // Optional<CoreMap> sentence = sentences.stream().findFirst();
263 | // if (sentence.isPresent()) {
264 | // Optional<Tag> oTag = sentence.get().get(CoreAnnotations.TokensAnnotation.class).stream()
265 | // .map((token) -> getTag(token))
266 | // .filter((tag) -> (tag != null) && checkPunctuation(tag.getLemma()))
267 | // .findFirst();
268 | // if (oTag.isPresent()) {
269 | // return oTag.get();
270 | // }
271 | // }
272 | return null;
273 | }
274 |
275 | @Override
276 | public Tag annotateTag(String text, String lang, PipelineSpecification pipelineSpecification) {
277 | OpenNLPAnnotation document = new OpenNLPAnnotation(text);
278 | final OpenNLPPipeline openNLPPipeline = getOpenNLPPipeline(pipelineSpecification.getName());
279 | openNLPPipeline.annotate(document);
280 | List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
281 | if (sentences != null && !sentences.isEmpty()) {
282 | if (sentences.size() > 1) {
283 | throw new RuntimeException("More than one sentence");
284 | }
285 | Collection<OpenNLPAnnotation.Token> tokens = sentences.get(0).getTokens();
286 | if (tokens != null && tokens.size() == 1) {
287 | OpenNLPAnnotation.Token token = tokens.iterator().next();
288 | Tag newTag = getTag(token, lang);
289 | return newTag;
290 | } else if (tokens != null && tokens.size() > 1) {
291 | OpenNLPAnnotation.Token token = document.getToken(text, text);
292 | Tag newTag = getTag(token, lang);
293 | return newTag;
294 | }
295 | }
296 | return null;
297 | }
298 |
299 | @NotNull
300 | private OpenNLPPipeline getOpenNLPPipeline(String name) {
301 | final OpenNLPPipeline openNLPPipeline = pipelines.get(name);
302 | if (openNLPPipeline == null) {
303 | throw new RuntimeException("Pipeline " + name + " doesn't exist");
304 | }
305 | return openNLPPipeline;
306 | }
307 |
308 | private Tag getTag(OpenNLPAnnotation.Token token, String lang) {
309 | List<String> pos = new ArrayList<>();
310 | List<String> ne = new ArrayList<>();
311 | String lemma = token.getTokenLemmas();
312 | pos.addAll(token.getTokenPOS());
313 | ne.addAll(token.getTokenNEs());
314 |
315 | // apply lemma validity check (to all words in case of NamedEntities)
316 | lemma = Arrays.stream(lemma.split(" ")).filter(str -> checkLemmaIsValid(str)).collect(Collectors.joining(" "));
317 | if (lemma.isEmpty()) // Collectors.joining() never returns null
318 | return null;
319 |
320 | Tag tag = new Tag(lemma, lang);
321 | tag.setPos(pos);
322 | tag.setNe(ne);
323 | LOG.info("POS: " + pos + " ne: " + ne + " lemma: " + lemma);
324 | return tag;
325 | }
326 |
327 | private List<Tag> annotateTagsAux(String text, String lang, OpenNLPPipeline pipeline) {
328 | List<Tag> result = new ArrayList<>();
329 | OpenNLPAnnotation document = new OpenNLPAnnotation(text);
330 | pipeline.annotate(document);
331 | List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
332 | if (sentences != null && !sentences.isEmpty()) {
333 | if (sentences.size() > 1) {
334 | throw new RuntimeException("More than one sentence");
335 | }
336 | Collection<OpenNLPAnnotation.Token> tokens = sentences.get(0).getTokens();
337 | if (tokens != null && tokens.size() > 0) {
338 | tokens.stream().forEach((token) -> {
339 | Tag newTag = getTag(token, lang);
340 | if (newTag != null)
341 | result.add(newTag);
342 | });
343 | return result;
344 | }
345 | }
346 | return null;
347 | }
348 |
349 | @Override
350 | public List<Tag> annotateTags(String text, String lang, PipelineSpecification pipelineSpecification) {
351 | return annotateTagsAux(text, lang, getOpenNLPPipeline(pipelineSpecification.getName()));
352 | }
353 |
354 | public List<Tag> annotateTags(String text, String lang) {
355 | return annotateTagsAux(text, lang, getOpenNLPPipeline(TOKENIZER));
356 | }
357 |
358 | @Override
359 | public AnnotatedText sentiment(AnnotatedText annotated) {
360 | OpenNLPPipeline pipeline = getOpenNLPPipeline(SENTIMENT);
361 | annotated.getSentences().stream().forEach(item -> { // don't use parallelStream(), it crashes with the current content of the body
362 | OpenNLPAnnotation document = new OpenNLPAnnotation(item.getSentence());
363 | pipeline.annotate(document);
364 |
365 | List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
366 | Optional<OpenNLPAnnotation.Sentence> sentence = sentences.stream().findFirst();
367 | if (sentence.isPresent()) { // findFirst() never returns null, so no extra null check is needed
368 | extractSentiment(sentence.get(), item);
369 | }
370 | });
371 |
372 | return annotated;
373 | }
374 |
375 | @Override
376 | public String train(String alg, String modelId, String file, String lang, Map params) {
377 | // training could be done directly here, but it is better to keep everything related to the model implementation in one class, so we delegate to the pipeline
378 | OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER);
379 | return pipeline.train(alg, modelId, file, lang, params);
380 | }
381 |
382 | @Override
383 | public String test(String alg, String modelId, String file, String lang) {
384 | OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER);
385 | return pipeline.test(alg, modelId, file, lang);
386 |
387 | }
388 |
389 | class TokenHolder {
390 |
391 | private String ne;
392 | private StringBuilder sb;
393 | private int beginPosition;
394 | private int endPosition;
395 |
396 | public TokenHolder() {
397 | reset();
398 | }
399 |
400 | public String getNe() {
401 | return ne;
402 | }
403 |
404 | public String getToken() {
405 | if (sb == null) {
406 | return " - ";
407 | }
408 | return sb.toString();
409 | }
410 |
411 | public int getBeginPosition() {
412 | return beginPosition;
413 | }
414 |
415 | public int getEndPosition() {
416 | return endPosition;
417 | }
418 |
419 | public void setNe(String ne) {
420 | this.ne = ne;
421 | }
422 |
423 | public void updateToken(String tknStr) {
424 | this.sb.append(tknStr);
425 | }
426 |
427 | public void setBeginPosition(int beginPosition) {
428 | if (this.beginPosition < 0) {
429 | this.beginPosition = beginPosition;
430 | }
431 | }
432 |
433 | public void setEndPosition(int endPosition) {
434 | this.endPosition = endPosition;
435 | }
436 |
437 | public final void reset() {
438 | sb = new StringBuilder();
439 | beginPosition = -1;
440 | endPosition = -1;
441 | }
442 | }
443 |
444 | class PhraseHolder implements Comparable<PhraseHolder> {
445 |
446 | private StringBuilder sb;
447 | private int beginPosition;
448 | private int endPosition;
449 |
450 | public PhraseHolder() {
451 | reset();
452 | }
453 |
454 | public String getPhrase() {
455 | if (sb == null) {
456 | return " - ";
457 | }
458 | return sb.toString();
459 | }
460 |
461 | public int getBeginPosition() {
462 | return beginPosition;
463 | }
464 |
465 | public int getEndPosition() {
466 | return endPosition;
467 | }
468 |
469 | public void updatePhrase(String tknStr) {
470 | this.sb.append(tknStr);
471 | }
472 |
473 | public void setBeginPosition(int beginPosition) {
474 | if (this.beginPosition < 0) {
475 | this.beginPosition = beginPosition;
476 | }
477 | }
478 |
479 | public void setEndPosition(int endPosition) {
480 | this.endPosition = endPosition;
481 | }
482 |
483 | public final void reset() {
484 | sb = new StringBuilder();
485 | beginPosition = -1;
486 | endPosition = -1;
487 | }
488 |
489 | @Override
490 | public boolean equals(Object o) {
491 | if (!(o instanceof PhraseHolder)) {
492 | return false;
493 | }
494 | PhraseHolder otherObject = (PhraseHolder) o;
495 | if (this.sb != null
496 | && otherObject.sb != null
497 | && this.sb.toString().equals(otherObject.sb.toString())
498 | && this.beginPosition == otherObject.beginPosition
499 | && this.endPosition == otherObject.endPosition) {
500 | return true;
501 | }
502 | return false;
503 | }
504 |
505 | @Override
506 | public int compareTo(PhraseHolder o) {
507 | if (o == null) {
508 | return 1;
509 | }
510 | if (this.equals(o)) {
511 | return 0;
512 | } else if (this.beginPosition > o.beginPosition) {
513 | return 1;
514 | } else if (this.beginPosition == o.beginPosition) {
515 | if (this.endPosition > o.endPosition) {
516 | return 1;
517 | }
518 | }
519 | return -1;
520 | }
521 | }
522 |
523 | @Override
524 | public List<String> getPipelines() {
525 | return new ArrayList<>(pipelines.keySet());
526 | }
527 |
528 | @Override
529 | public boolean checkPipeline(String name) {
530 | return pipelines.containsKey(name);
531 | }
532 |
533 | @Override
534 | public void createPipeline(PipelineSpecification pipelineSpecification) {
535 | //TODO add validation
536 | String name = pipelineSpecification.getName();
537 | PipelineBuilder pipelineBuilder = new PipelineBuilder();
538 | List<String> specActive = new ArrayList<>();
539 | List<String> stopwordsList;
540 |
541 | if (pipelineSpecification.hasProcessingStep("tokenize", true)) {
542 | pipelineBuilder.tokenize();
543 | specActive.add("tokenize");
544 | }
545 |
546 | if (pipelineSpecification.hasProcessingStep("ner", true)) {
547 | pipelineBuilder.extractNEs();
548 | specActive.add("ner");
549 | }
550 |
551 | String stopWords = pipelineSpecification.getStopWords() != null ? pipelineSpecification.getStopWords() : "default";
552 | boolean checkLemma = pipelineSpecification.hasProcessingStep("checkLemmaIsStopWord");
553 | if (checkLemma) {
554 | specActive.add("checkLemmaIsStopWord");
555 | }
556 |
557 | if (stopWords.equalsIgnoreCase("default")) {
558 | pipelineBuilder.defaultStopWordAnnotator();
559 | stopwordsList = PipelineBuilder.getDefaultStopwords();
560 | } else {
561 | pipelineBuilder.customStopWordAnnotator(stopWords);
562 | stopwordsList = PipelineBuilder.getCustomStopwordsList(stopWords);
563 | }
564 |
565 | if (pipelineSpecification.hasProcessingStep("sentiment")) {
566 | pipelineBuilder.extractSentiment();
567 | specActive.add("sentiment");
568 | }
569 | if (pipelineSpecification.hasProcessingStep("coref")) {
570 | pipelineBuilder.extractCoref();
571 | specActive.add("coref");
572 | }
573 | if (pipelineSpecification.hasProcessingStep("relations")) {
574 | pipelineBuilder.extractRelations();
575 | specActive.add("relations");
576 | }
577 | if (pipelineSpecification.hasProcessingStep("customNER")) {
578 | if (!specActive.contains("ner")) {
579 | pipelineBuilder.extractNEs();
580 | specActive.add("ner");
581 | }
582 | specActive.add("customNER");
583 | pipelineBuilder.extractCustomNEs(pipelineSpecification.getProcessingStepAsString("customNER"));
584 | }
585 | if (pipelineSpecification.hasProcessingStep("customSentiment")) {
586 | if (!specActive.contains("sentiment")) {
587 | pipelineBuilder.extractSentiment();
588 | specActive.add("sentiment");
589 | }
590 | specActive.add("customSentiment");
591 | pipelineBuilder.extractCustomSentiment(pipelineSpecification.getProcessingStepAsString("customSentiment"));
592 | }
593 | Long threadNumber = pipelineSpecification.getThreadNumber() != 0 ? pipelineSpecification.getThreadNumber() : 4L;
594 | pipelineBuilder.threadNumber(threadNumber.intValue());
595 |
596 | OpenNLPPipeline pipeline = pipelineBuilder.build();
597 | pipelines.put(name, pipeline);
598 | }
599 |
600 |
601 | @Override
602 | public void removePipeline(String name) {
603 | if (!pipelines.containsKey(name)) {
604 | throw new RuntimeException("No pipeline found with name: " + name);
605 | }
606 | pipelines.remove(name);
607 | }
608 | }
609 |
--------------------------------------------------------------------------------
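The `TokenHolder`/`PhraseHolder` classes above share one position-tracking idea: the begin offset is set only once (on the first appended token), while the end offset is advanced on every append, so a holder ends up spanning a whole multi-token entity such as a two-word named entity. A minimal, self-contained sketch of that pattern (class and method names here are illustrative, not from the original source):

```java
// Sketch of the position-tracking pattern used by TokenHolder/PhraseHolder:
// begin is latched on the first token, end always advances.
public class SpanHolderDemo {
    private final StringBuilder sb = new StringBuilder();
    private int beginPosition = -1;
    private int endPosition = -1;

    public void append(String token, int begin, int end) {
        sb.append(token);
        if (beginPosition < 0) {   // keep only the first begin offset
            beginPosition = begin;
        }
        endPosition = end;         // always advance to the latest end offset
    }

    public String getText() { return sb.toString(); }
    public int getBeginPosition() { return beginPosition; }
    public int getEndPosition() { return endPosition; }

    public static void main(String[] args) {
        SpanHolderDemo holder = new SpanHolderDemo();
        holder.append("New ", 0, 3);
        holder.append("York", 4, 8);
        // the holder now covers the full two-token span
        System.out.println(holder.getText() + " [" + holder.getBeginPosition() + "," + holder.getEndPosition() + "]");
    }
}
```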
/src/main/java/com/graphaware/nlp/processor/opennlp/PipelineBuilder.java:
--------------------------------------------------------------------------------
1 | /*
2 | * To change this license header, choose License Headers in Project Properties.
3 | * To change this template file, choose Tools | Templates
4 | * and open the template in the editor.
5 | */
6 | package com.graphaware.nlp.processor.opennlp;
7 |
8 | import java.util.ArrayList;
9 | import java.util.Arrays;
10 | import java.util.List;
11 | import java.util.Properties;
12 |
13 | class PipelineBuilder {
14 |
15 | private static final String CUSTOM_STOP_WORD_LIST = "start,starts,period,periods,a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,o,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with";
16 |
17 | private final Properties properties = new Properties();
18 | private final StringBuilder annotators = new StringBuilder(); //basics annotators
19 | private int threadsNumber = 4;
20 |
21 | private void checkForExistingAnnotators() {
22 | if (annotators.length() > 0) {
23 | annotators.append(", ");
24 | }
25 | }
26 |
27 | public PipelineBuilder tokenize() {
28 | checkForExistingAnnotators();
29 | annotators.append("tokenize, pos, lemma");
30 | return this;
31 | }
32 |
33 | public PipelineBuilder extractNEs() {
34 | checkForExistingAnnotators();
35 | annotators.append("ner");
36 | return this;
37 | }
38 |
39 | public PipelineBuilder extractSentiment() {
40 | checkForExistingAnnotators();
41 | annotators.append("sentiment");
42 | return this;
43 | }
44 |
45 | public PipelineBuilder extractRelations() {
46 | checkForExistingAnnotators();
47 | annotators.append("relation");
48 | return this;
49 | }
50 |
51 | public PipelineBuilder extractCoref() {
52 | return this;
53 | }
54 |
55 | public PipelineBuilder extractCustomNEs(String ners) {
56 | properties.setProperty("customNEs", ners);
57 | return this;
58 | }
59 |
60 | public PipelineBuilder extractCustomSentiment(String sent) {
61 | properties.setProperty("customSentiment", sent);
62 | return this;
63 | }
64 |
65 | public PipelineBuilder defaultStopWordAnnotator() {
66 | checkForExistingAnnotators();
67 | annotators.append("stopword");
68 | properties.setProperty("stopword", CUSTOM_STOP_WORD_LIST);
69 | return this;
70 | }
71 |
72 | public PipelineBuilder customStopWordAnnotator(String customStopWordList) {
73 | checkForExistingAnnotators();
74 | String stopWordList;
75 | if (annotators.indexOf("stopword") >= 0) {
76 | String alreadyExistingStopWordList = properties.getProperty("stopword");
77 | stopWordList = alreadyExistingStopWordList + "," + customStopWordList;
78 | } else {
79 | annotators.append("stopword");
80 | stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList;
81 | }
82 | properties.setProperty("stopword", stopWordList);
83 | return this;
84 | }
85 |
86 | public PipelineBuilder stopWordAnnotator(Properties properties) {
87 | return this;
88 | }
89 |
90 | public PipelineBuilder threadNumber(int threads) {
91 | this.threadsNumber = threads;
92 | return this;
93 | }
94 |
95 | public OpenNLPPipeline build() {
96 | properties.setProperty("annotators", annotators.toString());
97 | properties.setProperty("threads", String.valueOf(threadsNumber));
98 | OpenNLPPipeline pipeline = new OpenNLPPipeline(properties);
99 | return pipeline;
100 | }
101 |
102 | public static List<String> getDefaultStopwords() {
103 | List<String> stopwords = new ArrayList<>();
104 | Arrays.stream(CUSTOM_STOP_WORD_LIST.split(",")).forEach(s -> {
105 | stopwords.add(s.trim());
106 | });
107 |
108 | return stopwords;
109 | }
110 |
111 | public static List<String> getCustomStopwordsList(String customStopWordList) {
112 | String stopWordList;
113 | if (customStopWordList.startsWith("+")) {
114 | stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList.replace("+,", "").replace("+", "");
115 | } else {
116 | stopWordList = customStopWordList;
117 | }
118 |
119 | List<String> list = new ArrayList<>();
120 | Arrays.stream(stopWordList.split(",")).forEach(s -> {
121 | list.add(s.trim());
122 | });
123 |
124 | return list;
125 | }
126 | }
127 |
--------------------------------------------------------------------------------
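`PipelineBuilder.getCustomStopwordsList` implements a small convention: a custom list starting with `+` is appended to the built-in defaults, otherwise it replaces them. A standalone sketch of that merge logic (the class and method names are hypothetical, and `DEFAULTS` is shortened here for illustration; the real `CUSTOM_STOP_WORD_LIST` is longer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the "+"-prefix stop-word convention from PipelineBuilder.
public class StopwordListDemo {
    // shortened stand-in for CUSTOM_STOP_WORD_LIST
    static final String DEFAULTS = "a,an,and,the";

    public static List<String> customStopwords(String custom) {
        // "+..." means "extend the defaults"; anything else replaces them
        String merged = custom.startsWith("+")
                ? DEFAULTS + "," + custom.replace("+,", "").replace("+", "")
                : custom;
        List<String> list = new ArrayList<>();
        Arrays.stream(merged.split(",")).forEach(s -> list.add(s.trim()));
        return list;
    }

    public static void main(String[] args) {
        System.out.println(customStopwords("+,foo"));   // extends the defaults
        System.out.println(customStopwords("foo,bar")); // replaces them entirely
    }
}
```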
/src/main/java/com/graphaware/nlp/processor/opennlp/model/NERModelTool.java:
--------------------------------------------------------------------------------
1 | /*
2 | *
3 | *
4 | */
5 | package com.graphaware.nlp.processor.opennlp.model;
6 |
7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
8 | import java.io.IOException;
9 | import java.util.Map;
10 |
11 | import opennlp.tools.namefind.TokenNameFinderFactory;
12 | import opennlp.tools.namefind.TokenNameFinderCrossValidator;
13 | import opennlp.tools.namefind.TokenNameFinderEvaluator;
14 | import opennlp.tools.namefind.NameSample;
15 | import opennlp.tools.namefind.NameFinderME;
16 | import opennlp.tools.namefind.NameSampleDataStream;
17 | import opennlp.tools.namefind.TokenNameFinderModel;
18 |
19 | import opennlp.tools.util.ObjectStream;
20 |
21 | import com.graphaware.nlp.util.GenericModelParameters;
22 |
23 | import org.slf4j.Logger;
24 | import org.slf4j.LoggerFactory;
25 |
26 | public class NERModelTool extends OpenNLPGenericModelTool {
27 |
28 | private String entityType;
29 | private static final String MODEL_NAME = "NER";
30 |
31 | private static final Logger LOG = LoggerFactory.getLogger(NERModelTool.class);
32 |
33 | public NERModelTool(String fileIn, String modelDescr, String lang, Map<String, Object> params) {
34 | super(fileIn, modelDescr, lang, params);
35 | this.entityType = null; // train only specific named entity; null = train all entities present in the training set
36 | if (params != null) {
37 | if (params.containsKey(GenericModelParameters.TRAIN_ENTITYTYPE)) {
38 | this.entityType = (String) params.get(GenericModelParameters.TRAIN_ENTITYTYPE);
39 | }
40 | }
41 | }
42 |
43 | public NERModelTool(String fileIn, String modelDescr, String lang) {
44 | this(fileIn, modelDescr, lang, null);
45 | }
46 |
47 | public NERModelTool() {
48 | super();
49 | }
50 |
51 | public void train() {
52 | try (ObjectStream<String> lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) {
53 | LOG.info("Training of " + MODEL_NAME + " started ...");
54 | this.model = NameFinderME.train(lang, entityType, sampleStream, trainParams, new TokenNameFinderFactory());
55 | } catch (IOException ex) {
56 | LOG.error("Error while opening training file: " + fileIn, ex);
57 | throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex);
58 | } catch (Exception ex) {
59 | LOG.error("Error while training " + MODEL_NAME + " model " + modelDescr, ex);
60 | throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex);
61 | }
62 | }
63 |
64 | public String validate() {
65 | String result = "";
66 | if (this.fileValidate == null) {
67 | //List<EvaluationMonitor<NameSample>> listeners = new LinkedList<EvaluationMonitor<NameSample>>();
68 | try (ObjectStream<String> lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) {
69 | LOG.info("Validation of " + MODEL_NAME + " started ...");
70 | // Using CrossValidator
71 | TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator(lang, entityType, trainParams, null);
72 | // the second argument of 'evaluate()' gives the number of folds (n), i.e. the number of times training and testing will be run (with the data split train:test = (n-1):1)
73 | evaluator.evaluate(sampleStream, nFolds);
74 | result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure())
75 | + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore())
76 | + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")";
77 | LOG.info("Validation: " + result);
78 | } catch (IOException ex) {
79 | LOG.error("Error while opening training file: " + fileIn, ex);
80 | throw new RuntimeException("IOError while evaluating " + MODEL_NAME + " model " + modelDescr, ex);
81 | } catch (Exception ex) {
82 | LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex);
83 | throw new RuntimeException("Error while evaluating " + MODEL_NAME + " model " + modelDescr, ex);
84 | }
85 | } else {
86 | result = test(this.fileValidate, new NameFinderME((TokenNameFinderModel) model));
87 | }
88 |
89 | return result;
90 | }
91 |
92 | public String test(String file, NameFinderME modelME) {
93 | String result = "";
94 | try (ObjectStream<String> lineStreamValidate = openFile(file); NameSampleDataStream sampleStreamValidate = new NameSampleDataStream(lineStreamValidate)) {
95 | LOG.info("Testing of " + MODEL_NAME + " started ...");
96 | //TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME((TokenNameFinderModel) model));
97 | TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(modelME);
98 | evaluator.evaluate(sampleStreamValidate);
99 | result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure())
100 | + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore())
101 | + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")";
102 | LOG.info("Testing result: " + result);
103 | } catch (IOException ex) {
104 | LOG.error("Error while opening test file: " + file, ex);
105 | throw new RuntimeException("Error while testing " + MODEL_NAME + " model " + modelDescr, ex);
106 | } catch (Exception ex) {
107 | LOG.error("Error while testing " + MODEL_NAME + " model.", ex);
108 | }
109 | return result;
110 | }
111 | }
112 |
--------------------------------------------------------------------------------
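Both `validate()` and `test()` in `NERModelTool` report results as an `F = ... (Precision = ..., Recall = ...)` string formatted with `DecimalFormat("#0.00")`. A small sketch of that summary formatting (class and method names are illustrative; note that the original relies on the JVM's default locale, whereas this sketch pins `Locale.US` so the decimal separator is reproducible):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Sketch of the evaluation summary string built in NERModelTool.
public class EvalFormatDemo {
    // Locale.US pinned here for reproducibility (an assumption; the original uses the default locale)
    private static final DecimalFormat DEC_FORMAT =
            new DecimalFormat("#0.00", DecimalFormatSymbols.getInstance(Locale.US));

    public static String summary(double f, double precision, double recall) {
        return "F = " + DEC_FORMAT.format(f)
                + " (Precision = " + DEC_FORMAT.format(precision)
                + ", Recall = " + DEC_FORMAT.format(recall) + ")";
    }

    public static void main(String[] args) {
        System.out.println(summary(0.88, 0.9, 0.85));
    }
}
```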
/src/main/java/com/graphaware/nlp/processor/opennlp/model/OpenNLPGenericModelTool.java:
--------------------------------------------------------------------------------
1 | /*
2 | *
3 | *
4 | */
5 | package com.graphaware.nlp.processor.opennlp.model;
6 |
7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
8 | import java.io.File;
9 | import java.io.FileInputStream;
10 | import java.io.FileOutputStream;
11 | import java.io.InputStream;
12 | import java.io.BufferedOutputStream;
13 | import java.io.IOException;
14 | import java.net.URI;
15 | import java.util.Properties;
16 | import java.util.Map;
17 | import java.text.DecimalFormat;
18 |
19 | import opennlp.tools.util.PlainTextByLineStream;
20 | import opennlp.tools.util.InputStreamFactory;
21 | import opennlp.tools.util.TrainingParameters;
22 | import opennlp.tools.util.ObjectStream;
23 | import opennlp.tools.util.model.BaseModel;
24 |
25 | import com.graphaware.nlp.util.GenericModelParameters;
26 |
27 | import org.slf4j.Logger;
28 | import org.slf4j.LoggerFactory;
29 |
30 | public class OpenNLPGenericModelTool {
31 |
32 | protected BaseModel model;
33 | protected TrainingParameters trainParams;
34 | protected final String modelDescr;
35 | protected final String lang;
36 | protected final DecimalFormat decFormat;
37 | protected int nFolds;
38 |
39 | protected final String fileIn;
40 | protected String fileValidate;
41 |
42 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPGenericModelTool.class);
43 |
44 | public OpenNLPGenericModelTool(String file, String modelDescr, String lang) {
45 | this.fileValidate = null;
46 | this.fileIn = file;
47 | this.nFolds = 10;
48 | this.modelDescr = modelDescr;
49 | this.lang = lang;
50 | this.decFormat = new DecimalFormat("#0.00"); // for formatting validation results to two decimal places
51 |
52 | this.setDefParams();
53 | }
54 |
55 | public OpenNLPGenericModelTool(String file, String modelDescr, String lang, Map<String, Object> params) {
56 | this(file, modelDescr, lang);
57 | this.setTrainingParameters(params);
58 | }
59 |
60 | /*
61 | * This constructor is needed only for invoking the test() method (the model is then provided as a method argument)
62 | */
63 | public OpenNLPGenericModelTool() {
64 | this(null, null, null);
65 | this.model = null;
66 | }
67 |
68 | // override this method in a subclass if you want different defaults
69 | protected void setDefParams() {
70 | this.trainParams = TrainingParameters.defaultParams();
71 | }
72 |
73 | protected ObjectStream<String> openFile(String fileName) {
74 | if (fileName == null || fileName.isEmpty()) {
75 | LOG.error("File name is null or empty.");
76 | throw new RuntimeException("File name is null or empty: cannot open training data."); // fail fast instead of returning null, which would only surface later as an NPE
77 | }
78 | ObjectStream<String> lStream = null;
79 | try {
80 | ImprovisedInputStreamFactory dataIn = new ImprovisedInputStreamFactory(null, "", fileName);
81 | lStream = new PlainTextByLineStream(dataIn, "UTF-8");
82 | } catch (IOException ex) {
83 | LOG.error("Failure while opening file " + fileName, ex);
84 | throw new RuntimeException("Failure while opening file " + fileName, ex);
85 | }
86 |
87 | if (lStream == null)
88 | throw new RuntimeException("Failure while opening file " + fileName + ": input stream is null.");
89 | return lStream;
90 | }
91 |
92 | private void setTrainingParameters(Map<String, Object> params) {
93 | this.setDefParams();
94 | if (params == null || params.isEmpty()) {
95 | LOG.error("Map of parameters is null or empty. Using default values.");
96 | return;
97 | }
98 |
99 | // now add or override with user-defined parameters
100 | if (params.containsKey(GenericModelParameters.TRAIN_ALG)) {
101 | String val = objectToString(params, GenericModelParameters.TRAIN_ALG);
102 | this.trainParams.put(TrainingParameters.ALGORITHM_PARAM, val); // default: MAXENT
103 | LOG.info("Training parameter " + TrainingParameters.ALGORITHM_PARAM + " set to " + val);
104 | }
105 | if (params.containsKey(GenericModelParameters.TRAIN_TYPE)) {
106 | String val = objectToString(params, GenericModelParameters.TRAIN_TYPE);
107 | this.trainParams.put(TrainingParameters.TRAINER_TYPE_PARAM, val);
108 | LOG.info("Training parameter " + TrainingParameters.TRAINER_TYPE_PARAM + " set to " + val);
109 | }
110 | if (params.containsKey(GenericModelParameters.TRAIN_CUTOFF)) {
111 | String val = objectToString(params, GenericModelParameters.TRAIN_CUTOFF);
112 | this.trainParams.put(TrainingParameters.CUTOFF_PARAM, val);
113 | LOG.info("Training parameter " + TrainingParameters.CUTOFF_PARAM + " set to " + val);
114 | }
115 | if (params.containsKey(GenericModelParameters.TRAIN_ITER)) {
116 | String val = objectToString(params, GenericModelParameters.TRAIN_ITER);
117 | this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, val);
118 | LOG.info("Training parameter " + TrainingParameters.ITERATIONS_PARAM + " set to " + val);
119 | }
120 | if (params.containsKey(GenericModelParameters.TRAIN_THREADS)) {
121 | String val = objectToString(params, GenericModelParameters.TRAIN_THREADS);
122 | this.trainParams.put(TrainingParameters.THREADS_PARAM, val);
123 | LOG.info("Training parameter " + TrainingParameters.THREADS_PARAM + " set to " + val);
124 | }
125 | if (params.containsKey(GenericModelParameters.VALIDATE_FOLDS)) {
126 | this.nFolds = objectToInt(params, GenericModelParameters.VALIDATE_FOLDS);
127 | LOG.info("n-folds for cross-validation set to {}.", this.nFolds); // slf4j uses {} placeholders, not %d
128 | }
129 | if (params.containsKey(GenericModelParameters.VALIDATE_FILE)) {
130 | this.fileValidate = objectToString(params, GenericModelParameters.VALIDATE_FILE);
131 | LOG.info("Using validation file " + fileValidate);
132 | }
133 | }
134 |
135 | private String objectToString(Map<String, Object> params, String key) {
136 | String result = null;
137 | if (params.get(key) instanceof String)
138 | result = (String) params.get(key);
139 | else if (params.get(key) instanceof Long)
140 | result = ((Long) params.get(key)).toString();
141 | else if (params.get(key) instanceof Integer)
142 | result = ((Integer) params.get(key)).toString();
143 | else
144 | throw new RuntimeException("Wrong format of parameter " + key);
145 | return result;
146 | }
147 |
148 | private int objectToInt(Map<String, Object> params, String key) {
149 | int result;
150 | if (params.get(key) instanceof String)
151 | result = Integer.parseInt((String) params.get(key));
152 | else if (params.get(key) instanceof Long)
153 | result = ((Long) params.get(key)).intValue();
154 | else if (params.get(key) instanceof Integer)
155 | result = ((Integer) params.get(key)).intValue();
156 | else
157 | throw new RuntimeException("Wrong format of parameter " + key);
158 | return result;
159 | }
160 |
161 | protected void closeInputFiles() {
162 | // try {
163 | // if (this.lineStream != null) {
164 | // this.lineStream.close();
165 | // }
166 | // } catch (IOException ex) {
167 | // LOG.warn("Attempt to close input line-stream from source file " + this.fileIn + " failed.");
168 | // }
169 | //
170 | // try {
171 | // if (this.lineStreamValidate != null) {
172 | // this.lineStreamValidate.close();
173 | // }
174 | // } catch (IOException ex) {
175 | // LOG.warn("Attempt to close input line-stream from source file " + this.fileValidate + " failed.");
176 | // }
177 | }
178 |
179 | public void saveModel(String file) {
180 | if (this.model == null) {
181 | LOG.error("Cannot save training results to " + file + ": model is null");
182 | return;
183 | }
184 | try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(file))) {
185 | LOG.info("Saving model to file: " + file);
186 | // try-with-resources closes the stream even if serialize() throws
187 | this.model.serialize(modelOut);
188 | } catch (IOException ex) {
189 | LOG.error("Error saving model to file " + file, ex);
190 | throw new RuntimeException("Error saving model to file " + file, ex);
191 | }
192 |
193 |
194 | //this.closeInputFile();
195 | }
196 |
197 | public BaseModel getModel() {
198 | return this.model;
199 | }
200 |
201 | class ImprovisedInputStreamFactory implements InputStreamFactory {
202 |
203 | private File inputSourceFile;
204 | private String inputSourceStr;
205 |
206 | ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) {
207 | this.inputSourceFile = null;
208 | this.inputSourceStr = defaultValue;
209 | if (properties != null) {
210 | this.inputSourceStr = properties.getProperty(property, defaultValue);
211 | }
212 | try {
213 | if (this.inputSourceStr.startsWith("file://")) {
214 | this.inputSourceFile = new File(new URI(this.inputSourceStr.replace("file://", "")));
215 | } else if (this.inputSourceStr.startsWith("/")) {
216 | this.inputSourceFile = new File(this.inputSourceStr);
217 | }
218 | } catch (Exception ex) {
219 | LOG.error("Error while resolving input file " + this.inputSourceStr, ex);
220 | throw new RuntimeException("Error while resolving input file " + this.inputSourceStr, ex);
221 | }
222 | }
223 |
224 | @Override
225 | public InputStream createInputStream() throws IOException {
226 | LOG.debug("Creating input stream from " + this.inputSourceFile.getPath());
227 | //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath());
228 | return new FileInputStream(this.inputSourceFile.getPath());
229 | }
230 |
231 | /*public void closeInputStream() {
232 | try {
233 | if (this.is!=null)
234 | this.is.close();
235 | } catch (IOException ex) {
236 | LOG.warn("Attempt to close input stream failed.");
237 | }
238 | }*/
239 | }
240 |
241 | }
242 |
--------------------------------------------------------------------------------
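`OpenNLPGenericModelTool.objectToInt` (and its `objectToString` twin) normalize loosely typed parameter values, which may arrive as `String`, `Long`, or `Integer`, into a single type before use. A standalone sketch of that coercion (the class and method names are hypothetical):

```java
import java.util.Map;

// Sketch of the parameter coercion in OpenNLPGenericModelTool.objectToInt:
// values may arrive as String, Long, or Integer and are normalized to int.
public class ParamCoercionDemo {
    public static int toInt(Map<String, Object> params, String key) {
        Object v = params.get(key);
        if (v instanceof String)  return Integer.parseInt((String) v);
        if (v instanceof Long)    return ((Long) v).intValue();
        if (v instanceof Integer) return (Integer) v;
        throw new RuntimeException("Wrong format of parameter " + key);
    }

    public static void main(String[] args) {
        System.out.println(toInt(Map.of("folds", "5"), "folds")); // parsed from String
        System.out.println(toInt(Map.of("folds", 5L), "folds"));  // narrowed from Long
    }
}
```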
/src/main/java/com/graphaware/nlp/processor/opennlp/model/SentimentModelTool.java:
--------------------------------------------------------------------------------
1 | /*
2 | *
3 | *
4 | */
5 | package com.graphaware.nlp.processor.opennlp.model;
6 |
7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
8 | import java.io.File;
9 | import java.io.FileOutputStream;
10 | import java.io.InputStream;
11 | import java.io.BufferedOutputStream;
12 | import java.io.IOException;
13 | import java.util.Arrays;
14 | import java.util.Properties;
15 | import java.util.Map;
16 | import java.util.HashMap;
17 | import java.util.Collections;
18 | import java.util.Iterator;
19 | import java.net.URI;
20 |
21 | import opennlp.tools.namefind.TokenNameFinderFactory;
22 | import opennlp.tools.namefind.TokenNameFinderCrossValidator;
23 | import opennlp.tools.namefind.TokenNameFinderEvaluator;
24 | import opennlp.tools.doccat.DoccatFactory;
25 | import opennlp.tools.doccat.DocumentSample;
26 | import opennlp.tools.doccat.DocumentSampleStream;
27 | import opennlp.tools.doccat.DocumentCategorizerME;
28 | import opennlp.tools.doccat.DoccatCrossValidator;
29 | import opennlp.tools.doccat.DocumentCategorizerEvaluator;
30 | import opennlp.tools.doccat.DoccatModel;
31 |
32 | import opennlp.tools.namefind.NameSample;
33 |
34 | import opennlp.tools.util.PlainTextByLineStream;
35 | import opennlp.tools.util.InputStreamFactory;
36 | import opennlp.tools.util.TrainingParameters;
37 | import opennlp.tools.util.ObjectStream;
38 | import opennlp.tools.util.eval.CrossValidationPartitioner;
39 | import opennlp.tools.util.eval.FMeasure;
40 | import opennlp.tools.util.FilterObjectStream;
41 | //import opennlp.tools.util.eval.EvaluationMonitor;
42 |
43 | import com.graphaware.nlp.util.GenericModelParameters;
44 |
45 | import org.slf4j.Logger;
46 | import org.slf4j.LoggerFactory;
47 |
48 | /**
49 | *
50 | * @author vla
51 | */
52 | public class SentimentModelTool extends OpenNLPGenericModelTool {
53 |
54 | private static final Logger LOG = LoggerFactory.getLogger(SentimentModelTool.class);
55 |
56 | private static final String MODEL_NAME = "sentiment";
57 | private static final String DEFAULT_ITER = "30";
58 | private static final String DEFAULT_CUTOFF = "2";
59 |
60 | public SentimentModelTool(String fileIn, String modelDescr, String lang, Map params) {
61 | super(fileIn, modelDescr, lang, params);
62 | }
63 |
64 | public SentimentModelTool(String fileIn, String modelDescr, String lang) {
65 | this(fileIn, modelDescr, lang, null);
66 | }
67 |
68 | public SentimentModelTool() {
69 | super();
70 | }
71 |
72 | // here you can specify default parameters specific to this class
73 | @Override
74 | protected void setDefParams() {
75 | this.trainParams = TrainingParameters.defaultParams();
76 | this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, DEFAULT_ITER);
77 | this.trainParams.put(TrainingParameters.CUTOFF_PARAM, DEFAULT_CUTOFF);
78 | }
79 |
80 | public void train() {
81 | try (ObjectStream lineStream = openFile(fileIn); ObjectStream sampleStream = new DocumentSampleStream(lineStream)) {
82 | LOG.info("Training of " + MODEL_NAME + " started ...");
83 | this.model = DocumentCategorizerME.train(this.lang, sampleStream, trainParams, new DoccatFactory());
84 | } catch (IOException e) {
85 | LOG.error("IOError while training a custom " + MODEL_NAME + " model " + modelDescr, e);
86 | throw new RuntimeException("IOError while training a custom " + MODEL_NAME + " model " + this.modelDescr, e);
87 | }
88 | }
89 |
90 | public String validate() {
91 | String result = "";
92 | if (this.fileValidate == null) {
93 | try (ObjectStream lineStream = openFile(fileIn); ObjectStream sampleStream = new DocumentSampleStream(lineStream)) {
94 | LOG.info("Validation of " + MODEL_NAME + " started ...");
95 | DoccatCrossValidator evaluator = new DoccatCrossValidator(this.lang, this.trainParams, new DoccatFactory());
96 | // the second argument of 'evaluate()' is the number of folds (n), i.e. how many times training and testing are run, with the data split train:test = (n-1):1
97 | evaluator.evaluate(sampleStream, this.nFolds);
98 | result = "Accuracy = " + this.decFormat.format(evaluator.getDocumentAccuracy());
99 | LOG.info("Validation: " + result);
100 | } catch (IOException e) {
101 | LOG.error("Error while opening training file: " + fileIn, e);
102 | throw new RuntimeException("IOError while evaluating a " + MODEL_NAME + " model " + this.modelDescr, e);
103 | } catch (Exception ex) {
104 | LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex);
105 | }
106 | } else {
107 | // Using a separate .test file provided by user
108 | result = test(this.fileValidate, new DocumentCategorizerME((DoccatModel) this.model));
109 | }
110 |
111 | return result;
112 | }
113 |
114 | public String test(String file, DocumentCategorizerME modelME) {
115 | String result = "";
116 | try (ObjectStream lineStream = openFile(file); ObjectStream sampleStreamValidate = new DocumentSampleStream(lineStream)) {
117 | LOG.info("Testing of " + MODEL_NAME + " started ...");
118 | //DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(new DocumentCategorizerME((DoccatModel) this.model));
119 | DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(modelME);
120 | evaluator.evaluate(sampleStreamValidate);
121 | result = "Accuracy = " + this.decFormat.format(evaluator.getAccuracy());
122 | LOG.info("Testing: " + result);
123 | } catch (IOException e) {
124 | LOG.error("Error while opening a test file: " + file, e);
125 | throw new RuntimeException("IOError while testing a " + MODEL_NAME + " model " + this.modelDescr, e);
126 | } catch (Exception ex) {
127 | LOG.error("Error while testing " + MODEL_NAME + " model.", ex);
128 | }
129 | return result;
130 | }
131 | }
132 |
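The train/validate/test flow of `SentimentModelTool` can be sketched as a standalone program against the same OpenNLP `doccat` classes it uses (`DocumentSampleStream`, `DocumentCategorizerME`, `DoccatFactory`, OpenNLP 1.8 API). This is a minimal, hypothetical sketch: the inline four-line training set, its `positive`/`negative` labels, and the class name `SentimentSketch` are invented for illustration, and the cutoff is lowered to 1 so the toy data is not filtered away.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentimentSketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical training data in DocumentSampleStream format:
        // one sample per line, "category<whitespace>tokenized text"
        // (same format as src/main/resources/.../sentiment_tweets.train).
        String data =
                  "positive I love this phone it is great\n"
                + "positive what a great and lovely day\n"
                + "negative I hate this terrible phone\n"
                + "negative what a terrible and awful day\n";

        // Mirrors SentimentModelTool.setDefParams() (30 iterations),
        // but with cutoff 1 so the tiny toy corpus survives filtering.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "30");
        params.put(TrainingParameters.CUTOFF_PARAM, "1");

        DoccatModel model;
        try (ObjectStream<String> lines = new PlainTextByLineStream(
                     () -> new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)),
                     StandardCharsets.UTF_8);
             ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines)) {
            model = DocumentCategorizerME.train("en", samples, params, new DoccatFactory());
        }

        // Categorize a new (pre-tokenized) document and print the best label.
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        double[] outcomes = categorizer.categorize(new String[]{"I", "love", "this", "day"});
        System.out.println(categorizer.getBestCategory(outcomes));
    }
}
```

When no separate `.test` file is supplied, the tool instead estimates accuracy with n-fold cross-validation (`DoccatCrossValidator.evaluate(sampleStream, nFolds)`), which repeats this train/evaluate cycle n times on (n-1):1 splits of the same stream.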
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin
--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage_money.test:
--------------------------------------------------------------------------------
1 | Mr Juncker said the EU was spending on average €27,000 ( £24,000 ; $30,000 ) per soldier on equipment and research, whereas the US was spending €108,000 .
2 | "Together, we spend half as much as the United States, but even then we only achieve 15% of their efficiency," he said.
3 |
4 | Under the plan, €500m will be made available annually after 2020 for joint military research, and another €1bn annually for joint investment and purchases of military equipment, such as drones and helicopters.
5 |
6 | In afternoon trade, sterling was down 1.7% against the dollar at $1.227 .
7 | Against the euro, the pound was down 1.4% at 1.1393 .
8 | This makes imported goods' prices higher and squeezes consumers' ability to spend.
9 |
10 | Housebuilders, including Taylor Wimpey and Persimmon, saw falls of up to 5% , while retail companies' shares also fell.
11 | Next and Marks and Spencer fell more than 3% .
12 |
13 | Ultimately they may have deprived the state of nearly €32bn ( £28bn ; $36bn ).
14 | As the German broadcaster ARD wryly noted, that would have paid for repairs to a lot of schools and bridges.
15 |
16 | The pound has dropped by more than 2% against the dollar, sterling's biggest one-day fall since the Brexit referendum vote last June.
17 |
18 | Japan's benchmark Nikkei 225 stock index closed 0.5% higher and South Korea's Kospi ended the day up 0.8% .
19 |
20 | Ocwen Financial rose 2.7% .
21 | The loan servicer has been under pressure from the Consumer Financial Protection Bureau, which would be seriously weakened under Thursday's measure.
22 |
23 | The deals range from buying UK chip firm ARM Holdings for £24bn ( $32bn ), investing $1bn in satellite startup OneWeb, to setting up a venture fund with Saudi Arabia.
24 |
25 | The ECB now expects growth across the eurozone to be 1.9% in 2017 compared with its March forecast of 1.8% .
26 | It also increased its growth projection for 2018 to 1.8% from 1.7% , and for 2019 to 1.7% from 1.6% .
27 |
28 | At that rate, the 19 countries that use the euro would see growth at 2.3% this year, nearly double the rate of the US, which is on course to grow 1.2% .
29 |
30 | Although all EU countries are required to observe the 3% limit, only the 19 countries that use the euro as a currency can be fined.
31 |
32 | According to the Office for National Statistics (ONS), manufacturing production grew 0.2% from the month before in April, rebounding from the 0.6% decline recorded in the previous month but falling short of expectations for a 0.8% increase.
33 |
34 | At 1:59pm BST, the Brent front month futures contract for August delivery was down another 0.75% or 36 cents at $47.70 per barrel, with the global proxy benchmark having breached the psychological $50 -level on Monday.
35 |
36 | Concurrently, the West Texas Intermediate (WTI) was down 0.94% or 43 cents to $45.29 per barrel, after US Energy Information Administration said the country's stockpiles rose by 3.3m barrels last week, against market estimates for a 3.5m-barrel drop.
37 |
38 | A breakdown below $45 should open a path lower towards $44 .
39 | Furthermore, with the oversupply woes still a dominant theme in the oil markets and Opec's efforts to stabilise the markets disrupted by US Shale production, WTI Crude may receive further punishment.
40 |
41 | A diamond ring that was initially bought at a car boot sale for £10 has been auctioned off for £656,750 in London.
42 |
43 | The owner was unaware the "exceptionally-sized" stone was instead a 26-carat diamond, which she wore for almost two decades and which fetched almost double the £350,000 it was expected to be sold for.
44 |
45 | A Cartier diamond brooch owned by the late Margaret Thatcher, which was estimated to fetch £35,000 was sold for £81,250 .
46 |
47 | French beauty group L'Oreal said it has entered into "exclusive discussions" to sell the natural products cosmetics business for an enterprise value of €1bn ( $1.1bn ), after buying it eleven years ago.
48 |
49 | British multinational utility firm Centrica has announced that it intends to sell 60% of its stake in a joint venture oil and gas exploration and production enterprise to a consortium of firms.
50 | The deal is expected to cost £240m .
51 |
52 | Centrica's cash flows also jumped by nearly 130% to £2bn in 2016.
53 | Its operations are mainly confined to Europe and North America.
54 |
55 | MIE Holdings is a Hong Kong Stock Exchange-listed oil and gas firm.
56 | It managed to bring its net loss down by 13% to 1.3 billion renminbi during the 2016 financial year.
57 |
58 | The US Central Intelligence Agency (CIA) has estimated that after Liechtenstein it is Qatar who has the world's second largest GDP per capita with a value of $129,700 .
59 | Moreover, the Sovereign Wealth Fund Institute has ranked the Qatar Investment Authority as the world's 9th largest sovereign wealth fund, with a total asset value of $335bn .
60 |
61 | Data released by Halifax on 7 June showed UK house prices rose 3.3% year-on-year in May, following a 3.8% increase in April.
62 |
63 | The national sales gauge improved slightly to -8% in May from -9% in the previous month.
64 |
65 | At 4:51pm BST, the Brent front month futures contract for August delivery was down 3.49% or $1.75 at $48.37 per barrel, with the global proxy benchmark having breached the psychological