├── .gitignore ├── README.md ├── pom.xml └── src ├── main ├── java │ └── com │ │ └── graphaware │ │ └── nlp │ │ └── processor │ │ └── opennlp │ │ ├── OpenNLPAnnotation.java │ │ ├── OpenNLPPipeline.java │ │ ├── OpenNLPTextProcessor.java │ │ ├── PipelineBuilder.java │ │ └── model │ │ ├── NERModelTool.java │ │ ├── OpenNLPGenericModelTool.java │ │ └── SentimentModelTool.java └── resources │ └── com │ └── graphaware │ └── nlp │ └── processor │ └── opennlp │ ├── en-chunker.bin │ ├── en-lemmatizer.dict │ ├── en-ner-date.bin │ ├── en-ner-location.bin │ ├── en-ner-money.bin │ ├── en-ner-organization.bin │ ├── en-ner-percentage.bin │ ├── en-ner-percentage_money.test │ ├── en-ner-person.bin │ ├── en-ner-person.test │ ├── en-ner-person_organization_location_date.test │ ├── en-ner-time.bin │ ├── en-pos-maxent.bin │ ├── en-sent.bin │ ├── en-sentiment-tweets_toy.bin │ ├── en-token.bin │ └── sentiment_tweets.train └── test ├── java └── com │ └── graphaware │ └── nlp │ └── processor │ └── opennlp │ ├── OpenNLPIntegrationTest.java │ ├── OpenNLPPipelineTest.java │ ├── TestOpenNLP.java │ ├── TextProcessorTest.java │ ├── conceptnet5 │ └── ConceptNet5ImporterTest.java │ ├── model │ └── CustomSentimentModelIntegrationTest.java │ └── procedure │ └── ProcedureTest.java └── resources └── import └── sentiment_tweets.train /.gitignore: -------------------------------------------------------------------------------- 1 | target/ 2 | *iml 3 | dependency-reduced-pom.xml 4 | **/.DS_Store 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | GraphAware Neo4j NLP - OpenNLP - RETIRED 2 | ========================== 3 | 4 | ## GraphAware Neo4j NLP - OpenNLP Has Been Retired 5 | As of May 2021, this [repository has been retired](https://graphaware.com/framework/2021/05/06/from-graphaware-framework-to-graphaware-hume.html). 
6 | 7 | --- 8 | 9 | GraphAware NLP Using OpenNLP 10 | ========================================== 11 | 12 | Getting the Software 13 | --------------------- 14 | 15 | ### Server Mode 16 | When using Neo4j in standalone server mode, you will need the GraphAware Neo4j Framework and GraphAware NLP .jar files (both of which you can download here) dropped into the plugins directory of your Neo4j installation. Finally, the following needs to be appended to the `neo4j.conf` file in the `config/` directory: 17 | 18 | ``` 19 | dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware 20 | com.graphaware.runtime.enabled=true 21 | 22 | com.graphaware.module.NLP.2=com.graphaware.nlp.module.NLPBootstrapper 23 | ``` 24 | 25 | ### For Developers 26 | This package is an extension of GraphAware NLP, which therefore needs to be packaged and installed beforehand. No other dependencies are required. 27 | 28 | ``` 29 | cd neo4j-nlp 30 | mvn clean install 31 | cp target/graphaware-nlp-1.0-SNAPSHOT.jar /plugins 32 | 33 | cd ../neo4j-nlp-opennlp 34 | mvn clean package 35 | cp target/nlp-opennlp-1.0.0-SNAPSHOT.jar /plugins 36 | ``` 37 | 38 | 39 | Introduction and How-To 40 | ------------------------- 41 | 42 | The Apache OpenNLP library provides basic features for processing natural language text: sentence segmentation, tokenization, lemmatization, part-of-speech tagging, named entity recognition, chunking, parsing and sentiment analysis. 
OpenNLP support is implemented by extending the general GraphAware NLP package with extra parameters: 43 | 44 | ### Tag Extraction / Annotations 45 | ``` 46 | #Annotate the news 47 | MATCH (n:News) 48 | CALL ga.nlp.annotate({text:n.text, id: n.uuid, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", pipeline: "tokenizer"}) YIELD result 49 | MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result) 50 | RETURN n, result 51 | ``` 52 | 53 | Available parameters are: 54 | * the same ones as described in the parent GraphAware NLP package 55 | * `sentimentProbabilityThr` (optional, default *0.7*): if the assigned sentiment label has a confidence smaller than this threshold, the sentiment is set to *Neutral* 56 | * `customProject` (optional): add user-trained/provided models associated with the specified project; see paragraph *Customizing pipeline models* 57 | 58 | Available pipelines: 59 | * `tokenizer` - tokenization, lemmatization, stop-words removal, part-of-speech tagging (POS), named entity recognition (NER) 60 | * `sentiment` - tokenization, sentiment analysis 61 | * `tokenizerAndSentiment` - tokenization, lemmatization, stop-words removal, POS tagging, NER, sentiment analysis 62 | * `phrase` (not supported yet) - tokenization, stop-words removal, relations, sentiment analysis 63 | 64 | ### Sentiment Analysis 65 | The current implementation of sentiment analysis is just a toy: it relies on a file with 100 labeled Twitter samples which are used to build a model when Neo4j starts (the general recommendation is 10k or more training samples). The current model supports only three options - Positive, Neutral, Negative - which are chosen based on the highest probability (the algorithm returns an array of probabilities for each category). If the highest probability is less than 70% (the default value, which can be customized using the parameter *sentimentProbabilityThr*), the category is not regarded as trustworthy and the sentiment is set to Neutral instead. 
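The probability-threshold fallback described above can be sketched in Java as follows. This is an illustrative sketch only, not the module's actual implementation; the class name, method name and category order are hypothetical (the real order depends on the trained model's labels):

```java
public class SentimentThreshold {

    // Hypothetical category order; the real order depends on the trained model's labels.
    private static final String[] CATEGORIES = {"Positive", "Neutral", "Negative"};

    // Pick the most probable category; fall back to "Neutral" when the classifier's
    // confidence (the highest probability) is below the threshold (default 0.7).
    public static String pickLabel(double[] probabilities, double threshold) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return probabilities[best] < threshold ? "Neutral" : CATEGORIES[best];
    }

    public static void main(String[] args) {
        // A confident classification keeps its label
        System.out.println(pickLabel(new double[]{0.85, 0.10, 0.05}, 0.7));
        // A low-confidence classification falls back to Neutral
        System.out.println(pickLabel(new double[]{0.40, 0.35, 0.25}, 0.7));
    }
}
```

The `sentimentProbabilityThr` parameter corresponds to the `threshold` argument in this sketch.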
66 | 67 | The sentiment analysis can be run either as part of the annotation (see the paragraph above) or as an independent procedure (see the command below) which takes in AnnotatedText nodes, analyzes all attached sentences and adds to each of them a label corresponding to its sentiment. 68 | 69 | ``` 70 | MATCH (a:AnnotatedText {id: {id}}) 71 | CALL ga.nlp.sentiment({node:a, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor"}) YIELD result 72 | MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) 73 | RETURN labels(s) as labels 74 | ``` 75 | 76 | 77 | 78 | 79 | ## BETA 80 | ### Customizing pipeline models 81 | A new customized model (currently NER and sentiment) can be added via Cypher: 82 | ``` 83 | CALL ga.nlp.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "component-en", alg: "sentiment", inputFile: "" [, lang: "en", trainingParameters: {......}]}) 84 | ``` 85 | * `alg` (case insensitive) specifies which algorithm is to be trained; currently available algs: `NER`, `sentiment` 86 | * `modelIdentifier` is an arbitrary string that provides a unique identifier of the model that you want to train (it will be used, e.g., for saving the model into a .bin file) 87 | * `inputFile` is the path to the training data file 88 | * `lang` (default is "en") specifies the language 89 | * `textProcessor` - the desired text processor 90 | * **training parameters** (defined in `com.graphaware.nlp.util.GenericModelParameters`) are optional and are not universal (some might be specific to only a certain text processor): 91 | * *iter* - number of iterations 92 | * *cutoff* - useful for reducing the size of n-gram models; it is a threshold on n-gram occurrences/frequencies in the training dataset 93 | * *threads* - provides support for multi-threading 94 | * *entityType* - name type to use for NER training, by default all entities (classes such as "Person", "Date", ...) 
present in the provided training file are used 95 | * *nFolds* - parameter for the cross-validation procedure (default is 10), see paragraph *Validation* 96 | * *trainerAlg* - specific to OpenNLP 97 | * *trainerType* - specific to OpenNLP 98 | 99 | The trained model is saved to a binary file in Neo4j's `import/` directory with the name format `-.bin`, so there is no need to train the same model again when you restart Neo4j. A cross-validation method is used to evaluate the model, see paragraph *Validation*. 100 | * `NER` - default models (Person, Location, Organization, Date, Time, Money, Percentage) plus all registered customized models are used when invoking `ga.nlp.annotate()` (see the example below) 101 | * `Sentiment` - sentiment analysis is run only once (a user-trained model has priority over the default one) 102 | 103 | **Training/testing example:** 104 | Training: 105 | ``` 106 | CALL ga.nlp.processor.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.train", trainingParameters: {iter: 10}}) 107 | ``` 108 | 109 | Testing the new model: 110 | ``` 111 | CALL ga.nlp.processor.test({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.test"}) 112 | ``` 113 | 114 | **Usage of new models:** 115 | To use custom models, one needs to assign them to a pipeline, for example: 116 | 117 | ``` 118 | CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', name: 'customPipeline', processingSteps: {tokenize: true, ner: true, dependency: false, customSentiment: }}) 119 | ``` 120 | * `customSentiment` - a string value which is the identifier that you chose for your custom model 121 | * `customNER` - string value which is the identifier that you chose for your 
custom model; if you want to use more models, separate them by ",", for example: `customNER: "component-en,chemical-en,testing-model"` 122 | 123 | ``` 124 | # Example of a text to analyze 125 | CREATE (l:Lesson {lesson: "Power system distribution at Kennedy Space Center (KSC) consists primarily of high-voltage, underground cables. These cables include approximately 5000 splices. Splice failures result in arc flash events that are extremely hazardous to personnel in the vicinity of the arc flash. Some construction and maintenance tasks cannot be performed effectively in the required personal protective equipment (PPE), and de-energizing the cables is not feasible due to cost, lost productivity, and safety risk to others implementing the required outages. To verify alternate and effective mitigations, arc flash testing was conducted in a controlled environment. The arc flash effects were greater than expected. Testing also demonstrated the addition of neutral grounding resistors (NGRs) would result in substantial reductions to arc flash effects. As a result, NGRs are being installed on KSC primary substation transformers. The presence of the NGRs enables usage of less cumbersome PPE. 
Laboratory testing revealed higher than anticipated safety risks from a potential arc-flash event in a manhole environment when conducted at KSC's unreduced fault current levels. The safety risks included bright flash, excessive sound, and smoke. Due to these findings and the absence of other mitigations installed at the time, manhole entries require full arc-flash PPE. Furthermore, manhole entries were temporarily restricted to short duration inspections until further mitigations could be implemented. With installation of neutral grounding resistors (NGRs) on substation transformers, the flash, sound and flame energy was reduced. The hazard reduction was so substantial that the required PPE would be less cumbersome and enable effective performance of maintenance tasks in the energized configuration."}) 126 | 127 | WITH l 128 | 129 | # Annotate it and use newly trained NER model(s) 130 | CALL ga.nlp.annotate({text:l.lesson, id: l.uuid, pipeline: "customPipeline"}) YIELD result 131 | MERGE (l)-[:HAS_ANNOTATED_TEXT]->(result) 132 | RETURN l, result; 133 | ``` 134 | 135 | **Format of training datasets:** 136 | * `NER` 137 | * one sentence per line 138 | * one empty line between two different texts (paragraphs) 139 | * there must be a space before and after each `<START:...>` and `<END>` statement 140 | * training data must not contain HTML symbols (such as `H<sub>2</sub>O`); **TO DO:** check whether text on which a NER model is deployed needs to be manually stripped of HTML symbols or whether they are ignored automatically 141 | * Example (categories "person", "organization", "location"): 142 | ``` 143 | Theresa May has said she will form a government with the support of the Democratic Unionists that can provide "certainty" for the future. 144 | Speaking after visiting Buckingham Palace , she said only her party had the "legitimacy" to govern after winning the most seats and votes. 
145 | In a short statement outside Downing Street , which followed a 25-minute audience with The Queen , Mrs May said she intended to form a government which could "provide certainty and lead Britain forward at this critical time for our country". 146 | 147 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017 . 148 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years. 149 | ``` 150 | * `sentiment` - two columns separated by whitespace (a tab): the first column is the category as an integer (0=VeryNegative, 1=Negative, 2=Neutral, 3=Positive, 4=VeryPositive), the second column is a sentence; example: 151 | ``` 152 | 3 Watching a nice movie 153 | 1 The painting is ugly, will return it tomorrow... 154 | 3 One of the best soccer games, worth seeing it 155 | 3 Very tasty, not only for vegetarians 156 | 1 Damn..the train is late again... 157 | ``` 158 | 159 | **Validation/testing:** 160 | 161 | Evaluation of the new model is performed automatically when invoking the procedure `ga.nlp.train()`, using the OpenNLP cross-validation method: validation runs *n*-fold times on the same training file, each time selecting a different set of training and testing data with a sample size ratio of *train:test = (n-1):1*. Validation measures (Precision, Recall, F-Measure) are pooled together and returned to the user as a result. 
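For reference, one line of the two-column sentiment training format above can be parsed with a small helper like the following. This is a hypothetical sketch, not part of the module; `SentimentTrainLine` is an invented name:

```java
public class SentimentTrainLine {

    final int label;        // 0=VeryNegative, 1=Negative, 2=Neutral, 3=Positive, 4=VeryPositive
    final String sentence;  // the labeled training sentence

    SentimentTrainLine(int label, String sentence) {
        this.label = label;
        this.sentence = sentence;
    }

    // Parse "<label><tab><sentence>": split on the first run of whitespace only,
    // so the sentence itself may contain further spaces.
    static SentimentTrainLine parse(String line) {
        String[] parts = line.split("\\s+", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two columns: " + line);
        }
        int label = Integer.parseInt(parts[0]);
        if (label < 0 || label > 4) {
            throw new IllegalArgumentException("Sentiment label must be 0..4: " + line);
        }
        return new SentimentTrainLine(label, parts[1]);
    }

    public static void main(String[] args) {
        SentimentTrainLine t = parse("3\tWatching a nice movie");
        System.out.println(t.label + " -> " + t.sentence);
    }
}
```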
162 | 163 | The following procedure can be invoked to test already existing models: 164 | ``` 165 | CALL ga.nlp.test({[project: "myXYProject",] alg: "NER", model: "location", file: "" [, lang: "en"]}) 166 | ``` 167 | Parameters: 168 | * `project` (optional) allows you to specify which of the existing models to test (otherwise the default is used) 169 | * `alg` (case insensitive) specifies which algorithm is to be tested; currently available algs: `NER`, `sentiment` 170 | * `model` is an arbitrary string that provides, in combination with `alg` (and with `project` if it's specified), a unique identifier of the model that you want to test 171 | * `file` is a path to the test file 172 | * `lang` specifies the language 173 | 174 | -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4.0.0 4 | 5 | com.graphaware.neo4j 6 | nlp-opennlp 7 | 3.3.2.52.7-SNAPSHOT 8 | 9 | 10 | 11 | GNU General Public License, version 3 12 | http://www.gnu.org/licenses/gpl-3.0.txt 13 | repo 14 | 15 | 16 | 17 | GraphAware OpenNLP Integration 18 | OpenNLP integration into GraphAware NLP 19 | https://graphaware.com 20 | 21 | 22 | scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git 23 | scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git 24 | git@github.com:graphaware/neo4j-nlp-opennlp.git 25 | HEAD 26 | 27 | 28 | 29 | 30 | alenegro 31 | Alessandro Negro 32 | alessandro@graphaware.com 33 | 34 | 35 | ikwattro 36 | Christophe Willemsen 37 | christophe@graphaware.com 38 | 39 | 40 | vlasta-kus 41 | Vlastimil Kus 42 | vlasta@graphaware.com 43 | 44 | 45 | graphaware 46 | GraphAware 47 | nlp@graphaware.com 48 | 49 | 50 | 51 | 2015 52 | 53 | 54 | GitHub 55 | https://github.com/graphaware/neo4j-nlp-opennlp/issues 56 | 57 | 58 | 59 | Graph Aware Limited 60 | https://graphaware.com 61 | 62 | 63 | 64 | UTF-8 65 | 1.9.0 66 | 3.4.7.52 67 | ${graphaware.version}.18 68 | 
3.4.7 69 | 1.8 70 | 1.8 71 | 3.4.9.52.16-SNAPSHOT 72 | 73 | 74 | 75 | 76 | com.graphaware.neo4j 77 | nlp 78 | ${nlp.version} 79 | provided 80 | 81 | 82 | 83 | org.apache.opennlp 84 | opennlp-tools 85 | ${open.nlp.version} 86 | 87 | 88 | 89 | org.slf4j 90 | slf4j-api 91 | 1.7.21 92 | 93 | 94 | 95 | junit 96 | junit 97 | 4.12 98 | 99 | 100 | 101 | org.slf4j 102 | slf4j-simple 103 | 1.7.21 104 | 105 | 106 | 107 | junit 108 | junit 109 | 4.12 110 | test 111 | 112 | 113 | 114 | com.graphaware.neo4j 115 | runtime 116 | ${graphaware.version} 117 | test 118 | 119 | 120 | 121 | com.graphaware.neo4j 122 | server 123 | ${graphaware.version} 124 | test 125 | 126 | 127 | 128 | com.graphaware.neo4j 129 | tests 130 | ${graphaware.version} 131 | test 132 | 133 | 134 | 135 | com.graphaware.neo4j 136 | resttest 137 | test 138 | ${resttest.version} 139 | 140 | 141 | 142 | com.graphaware.neo4j 143 | nlp 144 | ${nlp.version} 145 | test-jar 146 | test 147 | 148 | 149 | 150 | com.sun.jersey 151 | jersey-server 152 | 1.19.1 153 | test 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | ossrh 162 | https://oss.sonatype.org/content/repositories/snapshots 163 | 164 | 165 | ossrh 166 | https://oss.sonatype.org/service/local/staging/deploy/maven2/ 167 | 168 | 169 | 170 | 171 | 172 | release 173 | 174 | 175 | performRelease 176 | true 177 | 178 | 179 | 180 | 181 | 182 | org.apache.maven.plugins 183 | maven-gpg-plugin 184 | 1.5 185 | 186 | 187 | sign-artifacts 188 | verify 189 | 190 | sign 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | maven-compiler-plugin 204 | 3.5.1 205 | 206 | 1.8 207 | 1.8 208 | 209 | 210 | 211 | maven-shade-plugin 212 | 2.4.3 213 | 214 | 215 | package 216 | 217 | shade 218 | 219 | 222 | 223 | 224 | 225 | 226 | org.apache.maven.plugins 227 | maven-surefire-plugin 228 | 2.20 229 | 230 | -Xmx8g 231 | 232 | 233 | 234 | 235 | 236 | 237 | -------------------------------------------------------------------------------- 
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPAnnotation.java: -------------------------------------------------------------------------------- 1 | /* 2 | * To change this license header, choose License Headers in Project Properties. 3 | * To change this template file, choose Tools | Templates 4 | * and open the template in the editor. 5 | */ 6 | package com.graphaware.nlp.processor.opennlp; 7 | 8 | import com.graphaware.nlp.util.OptionalNLPParameters; 9 | import java.util.ArrayList; 10 | import java.util.Arrays; 11 | import java.util.Collection; 12 | import java.util.HashMap; 13 | import java.util.HashSet; 14 | import java.util.List; 15 | import java.util.Map; 16 | import java.util.Set; 17 | import java.util.stream.Collectors; 18 | import opennlp.tools.util.Span; 19 | 20 | public class OpenNLPAnnotation { 21 | 22 | private static final double DEFAULT_SENTIMENT_PROBTHR = 0.7; 23 | 24 | private final String text; 25 | private List sentences; 26 | public static final String DEFAULT_LEMMA_OPEN_NLP = "O"; 27 | public Map otherParams; 28 | 29 | public OpenNLPAnnotation(String text, Map otherParams) { 30 | this.text = text; 31 | this.otherParams = otherParams; 32 | } 33 | 34 | public OpenNLPAnnotation(String text) { 35 | this(text, null); 36 | } 37 | 38 | public String getText() { 39 | return text; 40 | } 41 | 42 | public void setSentences(Span[] sentencesArray) { 43 | sentences = new ArrayList<>(); 44 | for (Span sentence : sentencesArray) { 45 | sentences.add(new Sentence(sentence, getText())); 46 | } 47 | } 48 | 49 | public List getSentences() { 50 | return sentences; 51 | } 52 | 53 | public double getSentimentProb() { 54 | if (otherParams != null && otherParams.containsKey(OptionalNLPParameters.SENTIMENT_PROB_THR)) { 55 | return Double.parseDouble(otherParams.get(OptionalNLPParameters.SENTIMENT_PROB_THR)); 56 | } 57 | return DEFAULT_SENTIMENT_PROBTHR; 58 | } 59 | 60 | public Token getToken(String token, String lemma) { 61 | return new Token(token, 
lemma); 62 | } 63 | 64 | class Sentence { 65 | 66 | private final Span sentence; 67 | private final String sentenceText; 68 | private String sentenceSentiment; 69 | private List nounphrases; 70 | private String[] words; 71 | private Span[] wordSpans; 72 | private String[] posTags; 73 | private String[] lemmas; 74 | private final Map tokens; 75 | private Span[] chunks; 76 | private String[] chunkStrings; 77 | private String[] chunkSentiments; 78 | private final String defaultStringValue = "-"; // @Deprecated 79 | 80 | public Sentence(Span sentence, String text) { 81 | this.sentence = sentence; 82 | this.sentenceText = String.valueOf(sentence.getCoveredText(text)); 83 | this.tokens = new HashMap<>(); 84 | } 85 | 86 | public void addPhraseIndex(int phraseINdex) { 87 | if (this.nounphrases == null) { 88 | this.nounphrases = new ArrayList<>(); 89 | } 90 | this.nounphrases.add(phraseINdex); 91 | } 92 | 93 | public Span getSentenceSpan() { 94 | return this.sentence; 95 | } 96 | 97 | public String getSentence() { 98 | return this.sentenceText; 99 | } 100 | 101 | public String getSentiment() { 102 | return this.sentenceSentiment; 103 | } 104 | 105 | public void setSentiment(String sent) { 106 | this.sentenceSentiment = sent; 107 | } 108 | 109 | public String[] getWords() { 110 | return words; 111 | } 112 | 113 | public void setWords(String[] words) { 114 | this.words = words; 115 | } 116 | 117 | public Span[] getWordSpans() { 118 | return this.wordSpans; 119 | } 120 | 121 | public void setWordSpans(Span[] spans) { 122 | this.wordSpans = spans; 123 | } 124 | 125 | public void setWordsAndSpans(Span[] spans) { 126 | if (spans == null) { 127 | this.wordSpans = null; 128 | this.words = null; 129 | return; 130 | } 131 | this.wordSpans = spans; 132 | this.words = Arrays.asList(spans).stream() 133 | .map(span -> String.valueOf(span.getCoveredText(sentenceText))) 134 | .collect(Collectors.toList()).toArray(new String[wordSpans.length]); 135 | } 136 | 137 | public int 
getWordStart(int idx) { 138 | if (this.wordSpans.length > idx) { 139 | return this.wordSpans[idx].getStart(); 140 | } 141 | return -1; 142 | } 143 | 144 | public int getWordEnd(int idx) { 145 | if (this.wordSpans.length > idx) { 146 | return this.wordSpans[idx].getEnd(); 147 | } 148 | return -1; 149 | } 150 | 151 | public String[] getPosTags() { 152 | return this.posTags; 153 | } 154 | 155 | public void setPosTags(String[] posTags) { 156 | this.posTags = posTags; 157 | } 158 | 159 | public Span[] getChunks() { 160 | return this.chunks; 161 | } 162 | 163 | public void setChunks(Span[] chunks) { 164 | this.chunks = chunks; 165 | } 166 | 167 | public String[] getChunkStrings() { 168 | return this.chunkStrings; 169 | } 170 | 171 | public void setChunkStrings(String[] chunkStrings) { 172 | this.chunkStrings = chunkStrings; 173 | } 174 | 175 | public String[] getChunkSentiments() { 176 | return this.chunkSentiments; 177 | } 178 | 179 | public void setChunkSentiments(String[] sents) { 180 | if (sents == null) { 181 | return; 182 | } 183 | if (sents.length != this.chunks.length) { 184 | return; 185 | } 186 | this.chunkSentiments = sents; 187 | } 188 | 189 | // @Deprecated 190 | // public void setDefaultChunks() { 191 | // this.chunks = new Span[this.words.length]; 192 | // Arrays.fill(this.chunks, new Span(0, 0)); 193 | // this.chunkStrings = new String[this.words.length]; 194 | // Arrays.fill(this.chunkStrings, defaultStringValue); 195 | // this.nounphrases = new ArrayList<>(); 196 | // } 197 | 198 | public List getPhrasesIndex() { 199 | //if (nounphrases==null) 200 | //return new ArrayList(); 201 | return nounphrases; 202 | } 203 | 204 | public Collection getTokens() { 205 | return this.tokens.values(); 206 | } 207 | 208 | public String[] getLemmas() { 209 | return this.lemmas; 210 | } 211 | 212 | public void setLemmas(String[] lemmas) { 213 | if (this.words == null || lemmas == null) { 214 | return; 215 | } 216 | if (this.words.length != lemmas.length) // ... 
something is wrong 217 | { 218 | return; 219 | } 220 | this.lemmas = lemmas; 221 | } 222 | 223 | protected Token getToken(String value, String lemma) { 224 | Token token; 225 | if (tokens.containsKey(value)) { 226 | token = tokens.get(value); 227 | } else { 228 | token = new Token(value, lemma); 229 | tokens.put(value, token); 230 | } 231 | return token; 232 | } 233 | } 234 | 235 | class Token { 236 | 237 | private final String token; 238 | private final Set tokenPOS; 239 | private final String tokenLemmas; 240 | private final Set tokenNEs; 241 | private final List tokenSpans; 242 | 243 | public Token(String token, String lemma) { 244 | this.token = token; 245 | this.tokenLemmas = lemma; 246 | this.tokenNEs = new HashSet<>(); 247 | this.tokenPOS = new HashSet<>(); 248 | this.tokenSpans = new ArrayList<>(); 249 | } 250 | 251 | public List getTokenSpans() { 252 | return tokenSpans; 253 | } 254 | 255 | public String getToken() { 256 | return token; 257 | } 258 | 259 | public void addTokenSpans(Span tokenSpans) { 260 | this.tokenSpans.add(tokenSpans); 261 | } 262 | 263 | public Collection getTokenPOS() { 264 | return tokenPOS; 265 | } 266 | 267 | public void addTokenPOS(Collection tokenPOSes) { 268 | this.tokenPOS.addAll(tokenPOSes); 269 | } 270 | 271 | public void addTokenPOS(String tokenPOS) { 272 | this.tokenPOS.add(tokenPOS); 273 | } 274 | 275 | public String getTokenLemmas() { 276 | return tokenLemmas; 277 | } 278 | 279 | public Collection getTokenNEs() { 280 | return tokenNEs; 281 | } 282 | 283 | public void addTokenNE(String ne) { 284 | this.tokenNEs.add(ne); 285 | } 286 | 287 | } 288 | } 289 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPPipeline.java: -------------------------------------------------------------------------------- 1 | /* 2 | * To change this license header, choose License Headers in Project Properties. 
3 | * To change this template file, choose Tools | Templates 4 | * and open the template in the editor. 5 | */ 6 | package com.graphaware.nlp.processor.opennlp; 7 | 8 | import com.graphaware.nlp.processor.opennlp.model.NERModelTool; 9 | import com.graphaware.nlp.processor.opennlp.model.SentimentModelTool; 10 | import com.graphaware.nlp.processor.AbstractTextProcessor; 11 | import static com.graphaware.nlp.processor.opennlp.OpenNLPAnnotation.DEFAULT_LEMMA_OPEN_NLP; 12 | import java.io.File; 13 | import java.io.FileInputStream; 14 | import java.io.FileOutputStream; 15 | import java.io.BufferedOutputStream; 16 | import java.io.FileNotFoundException; 17 | import java.io.IOException; 18 | import java.io.InputStream; 19 | import java.lang.reflect.Constructor; 20 | import java.lang.reflect.InvocationTargetException; 21 | import java.net.URI; 22 | import java.net.URISyntaxException; 23 | import java.util.Properties; 24 | import java.util.HashMap; 25 | import java.util.Arrays; 26 | import java.util.ArrayList; 27 | import java.util.HashSet; 28 | import java.util.List; 29 | import java.util.Map; 30 | import java.util.Set; 31 | import java.util.concurrent.atomic.AtomicInteger; 32 | import java.util.stream.Collectors; 33 | import opennlp.tools.chunker.ChunkerME; 34 | import opennlp.tools.chunker.ChunkerModel; 35 | import opennlp.tools.postag.POSModel; 36 | import opennlp.tools.postag.POSTaggerME; 37 | import opennlp.tools.sentdetect.SentenceDetectorME; 38 | import opennlp.tools.sentdetect.SentenceModel; 39 | import opennlp.tools.tokenize.TokenizerME; 40 | import opennlp.tools.tokenize.TokenizerModel; 41 | import opennlp.tools.namefind.TokenNameFinderModel; 42 | import opennlp.tools.namefind.NameFinderME; 43 | import opennlp.tools.lemmatizer.DictionaryLemmatizer; // needs OpenNLP >=1.7 44 | //import opennlp.tools.lemmatizer.SimpleLemmatizer; // for OpenNLP < 1.7 45 | import opennlp.tools.doccat.DoccatModel; 46 | import opennlp.tools.doccat.DocumentCategorizerME; 47 | import 
opennlp.tools.util.Span; 48 | import opennlp.tools.util.model.BaseModel; 49 | import org.slf4j.Logger; 50 | import org.slf4j.LoggerFactory; 51 | 52 | public class OpenNLPPipeline { 53 | 54 | protected static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class); 55 | 56 | public static final String DEFAULT_BACKGROUND_SYMBOL = "O"; 57 | 58 | protected static final String IMPORT_DIRECTORY = "import/"; 59 | 60 | protected static final String PROPERTY_PATH_CHUNKER_MODEL = "chuncker"; 61 | protected static final String PROPERTY_PATH_POS_TAGGER_MODEL = "pos"; 62 | protected static final String PROPERTY_PATH_SENTENCE_MODEL = "sentence"; 63 | protected static final String PROPERTY_PATH_TOKENIZER_MODEL = "tokenizer"; 64 | protected static final String PROPERTY_PATH_LEMMATIZER_MODEL = "lemmatizer"; 65 | protected static final String PROPERTY_PATH_SENTIMENT_MODEL = "sentiment"; 66 | 67 | protected static final String PROPERTY_DEFAULT_CHUNKER_MODEL = "en-chunker.bin"; 68 | protected static final String PROPERTY_DEFAULT_POS_TAGGER_MODEL = "en-pos-maxent.bin"; 69 | protected static final String PROPERTY_DEFAULT_SENTENCE_MODEL = "en-sent.bin"; 70 | protected static final String PROPERTY_DEFAULT_TOKENIZER_MODEL = "en-token.bin"; 71 | protected static final String PROPERTY_DEFAULT_LEMMATIZER_MODEL = "en-lemmatizer.dict"; 72 | protected static final String PROPERTY_DEFAULT_SENTIMENT_MODEL = "en-sentiment-tweets_toy.bin"; 73 | 74 | protected static final String DEFAULT_PROJECT_VALUE = "default"; 75 | 76 | protected final List annotators; 77 | protected final List stopWords; 78 | 79 | protected TokenizerME wordBreaker; 80 | protected POSTaggerME posme; 81 | protected ChunkerME chunkerME; 82 | protected SentenceDetectorME sentenceDetector; 83 | protected DictionaryLemmatizer lemmaDetector; // needs OpenNLP >=1.7 84 | 85 | protected Map customNeModels = new HashMap<>(); 86 | protected Map customSentimentModels = new HashMap<>(); 87 | 88 | protected Map nameDetectors = new 
HashMap<>(); 89 | //protected Map sentimentDetectors = new HashMap<>(); 90 | protected DocumentCategorizerME sentimentDetector; 91 | 92 | protected static Map BASIC_NE_MODEL; 93 | 94 | { 95 | BASIC_NE_MODEL = new HashMap<>(); 96 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-person", "en-ner-person.bin"); 97 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-date", "en-ner-date.bin"); 98 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-location", "en-ner-location.bin"); 99 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-time", "en-ner-time.bin"); 100 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-organization", "en-ner-organization.bin"); 101 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-money", "en-ner-money.bin"); 102 | BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-percentage", "en-ner-percentage.bin"); 103 | } 104 | 105 | public OpenNLPPipeline(Properties properties) { 106 | findModelFiles(IMPORT_DIRECTORY); 107 | this.annotators = Arrays.asList(properties.getProperty("annotators", "").split(",")).stream().map(str -> str.trim()).collect(Collectors.toList()); 108 | this.stopWords = Arrays.asList(properties.getProperty("stopword", "").split(",")).stream().map(str -> str.trim().toLowerCase()).collect(Collectors.toList()); 109 | init(properties); 110 | } 111 | 112 | private void init(Properties properties) { 113 | try { 114 | setSenteceSplitter(properties); 115 | setTokenizer(properties); 116 | setPosTagger(properties); 117 | setChuncker(properties); 118 | loadNamedEntitiesFinders(properties); 119 | setLemmatizer(properties); 120 | setCategorizer(properties); 121 | 122 | } catch (IOException e) { 123 | LOG.error("Could not initialize OpenNLP models: " + e.getMessage()); 124 | throw new RuntimeException("Could not initialize OpenNLP models", e); 125 | } 126 | } 127 | 128 | private void setChuncker(Properties properties) throws FileNotFoundException { 129 | InputStream is = getInputStream(properties, PROPERTY_PATH_CHUNKER_MODEL, PROPERTY_DEFAULT_CHUNKER_MODEL); 130 | 
ChunkerModel chunkerModel = loadModel(ChunkerModel.class, is); 131 | closeInputStream(is, PROPERTY_PATH_CHUNKER_MODEL); 132 | chunkerME = new ChunkerME(chunkerModel); 133 | } 134 | 135 | private void setPosTagger(Properties properties) throws FileNotFoundException { 136 | InputStream is = getInputStream(properties, PROPERTY_PATH_POS_TAGGER_MODEL, PROPERTY_DEFAULT_POS_TAGGER_MODEL); 137 | POSModel pm = loadModel(POSModel.class, is); 138 | closeInputStream(is, PROPERTY_PATH_POS_TAGGER_MODEL); 139 | posme = new POSTaggerME(pm); 140 | } 141 | 142 | private void setTokenizer(Properties properties) throws FileNotFoundException { 143 | InputStream is = getInputStream(properties, PROPERTY_PATH_TOKENIZER_MODEL, PROPERTY_DEFAULT_TOKENIZER_MODEL); 144 | TokenizerModel tm = loadModel(TokenizerModel.class, is); 145 | closeInputStream(is, PROPERTY_PATH_TOKENIZER_MODEL); 146 | wordBreaker = new TokenizerME(tm); 147 | } 148 | 149 | private void setSenteceSplitter(Properties properties) throws FileNotFoundException { 150 | InputStream is = getInputStream(properties, PROPERTY_PATH_SENTENCE_MODEL, PROPERTY_DEFAULT_SENTENCE_MODEL); 151 | SentenceModel sentenceModel = loadModel(SentenceModel.class, is); 152 | closeInputStream(is, PROPERTY_PATH_SENTENCE_MODEL); 153 | sentenceDetector = new SentenceDetectorME(sentenceModel); 154 | } 155 | 156 | private void loadNamedEntitiesFinders(Properties properties) throws FileNotFoundException { 157 | // Default NE models 158 | BASIC_NE_MODEL.entrySet().stream().forEach((item) -> { 159 | InputStream is = getInputStream(properties, item.getKey(), item.getValue()); 160 | if (!(is == null)) { 161 | TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is); 162 | closeInputStream(is, item.getKey()); 163 | nameDetectors.put(item.getKey(), new NameFinderME(nameModel)); 164 | } 165 | }); 166 | 167 | // Custom NE models (in the `import/` dir of the Neo4j installation) 168 | if (properties.containsKey("customNEs")) { 169 | List 
requiredModels = Arrays.asList(properties.getProperty("customNEs").split(",")).stream().map(str -> str.trim()).collect(Collectors.toList()); 170 | for (String key: requiredModels) { 171 | if (!customNeModels.containsKey(key)) { 172 | LOG.error("Custom NE model " + key + " not found!"); 173 | throw new RuntimeException("Custom NE model " + key + " not found!"); 174 | } 175 | LOG.info("Extracting custom NER model: " + key); 176 | InputStream is = new FileInputStream(new File(customNeModels.get(key))); 177 | TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is); 178 | closeInputStream(is, key); 179 | nameDetectors.put(key, new NameFinderME(nameModel)); 180 | LOG.info("Custom NER model " + key + " loaded for this pipeline."); 181 | } 182 | } 183 | } 184 | 185 | private void setLemmatizer(Properties properties) throws FileNotFoundException, IOException { 186 | InputStream is = getInputStream(properties, PROPERTY_PATH_LEMMATIZER_MODEL, PROPERTY_DEFAULT_LEMMATIZER_MODEL); 187 | lemmaDetector = new DictionaryLemmatizer(is); 188 | closeInputStream(is, PROPERTY_PATH_LEMMATIZER_MODEL); 189 | } 190 | 191 | private void setCategorizer(Properties properties) throws FileNotFoundException { 192 | // Default sentiment model 193 | if (!properties.containsKey("customSentiment")) { 194 | InputStream is = getInputStream(properties, PROPERTY_PATH_SENTIMENT_MODEL, PROPERTY_DEFAULT_SENTIMENT_MODEL); 195 | if (is != null) { 196 | DoccatModel doccatModel = loadModel(DoccatModel.class, is); 197 | closeInputStream(is, PROPERTY_PATH_SENTIMENT_MODEL); 198 | //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, new DocumentCategorizerME(doccatModel)); 199 | sentimentDetector = new DocumentCategorizerME(doccatModel); 200 | } else { 201 | LOG.warn("No default sentiment detector available (input stream is null)."); 202 | //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, null); 203 | sentimentDetector = null; 204 | } 205 | } 206 | // Custom sentiment model (currently only one is
possible) 207 | else { 208 | String customModel = properties.getProperty("customSentiment"); 209 | LOG.info("Extracting custom sentiment model: " + customModel); 210 | if (!customSentimentModels.containsKey(customModel)) { 211 | LOG.error("Custom sentiment model " + customModel + " not found!"); 212 | throw new RuntimeException("Custom sentiment model " + customModel + " not found!"); 213 | } 214 | try { 215 | InputStream is = new FileInputStream(new File(customSentimentModels.get(customModel))); 216 | if (is == null) { 217 | LOG.error("Custom sentiment model: input stream is null"); 218 | return; 219 | } 220 | DoccatModel doccatModel = loadModel(DoccatModel.class, is); 221 | closeInputStream(is, customSentimentModels.get(customModel)); 222 | //sentimentDetectors.put(customModel, new DocumentCategorizerME(doccatModel)); 223 | sentimentDetector = new DocumentCategorizerME(doccatModel); 224 | LOG.info("Custom sentiment model " + customModel + " loaded for this pipeline."); 225 | } catch (IOException ex) { 226 | LOG.error("Error while opening file " + customSentimentModels.get(customModel), ex); 227 | } 228 | } 229 | } 230 | 231 | public void annotate(OpenNLPAnnotation document) { 232 | String text = document.getText(); 233 | try { 234 | Span sentences[] = sentenceDetector.sentPosDetect(text); 235 | document.setSentences(sentences); 236 | document.getSentences().stream() 237 | .forEach((OpenNLPAnnotation.Sentence sentence) -> { 238 | if (annotators.contains("tokenize") && wordBreaker != null) { 239 | Span[] wordSpans = wordBreaker.tokenizePos(sentence.getSentence()); 240 | if (wordSpans != null && wordSpans.length > 0) { 241 | sentence.setWordsAndSpans(wordSpans); 242 | 243 | if (annotators.contains("pos") && posme != null) { 244 | String[] posTags = posme.tag(sentence.getWords()); 245 | sentence.setPosTags(posTags); 246 | if (annotators.contains("lemma")) { 247 | String[] finLemmas = lemmaDetector.lemmatize(sentence.getWords(), posTags); 248 | 
sentence.setLemmas(finLemmas); 249 | } 250 | 251 | //FIXME: this is wrong 252 | // if (annotators.contains("relation")) { 253 | // Span[] chunks = chunkerME.chunkAsSpans(sentence.getWords(), posTags); 254 | // sentence.setChunks(chunks); 255 | // LOG.info("Found " + chunks.length + " phrases."); 256 | // String[] chunkStrings = Span.spansToStrings(chunks, sentence.getWords()); 257 | // sentence.setChunkStrings(chunkStrings); 258 | // List chunkSentiments = new ArrayList<>(); 259 | // for (int i = 0; i < chunks.length; i++) { 260 | // sentence.addPhraseIndex(i); 261 | // } 262 | // if (!chunkSentiments.isEmpty()) { 263 | // sentence.setChunkSentiments(chunkSentiments.toArray(new String[chunkSentiments.size()])); 264 | // } 265 | // } 266 | } 267 | 268 | Map> nerOccurrences = new HashMap<>(); 269 | if (annotators.contains("ner") && sentence.getWords() != null) { 270 | 271 | // Named Entities identification; needs to be performed after lemmas and POS (see implementation of Sentence.addNamedEntities()) 272 | BASIC_NE_MODEL.keySet().stream().forEach((modelKey) -> { 273 | if (!nameDetectors.containsKey(modelKey)) { 274 | LOG.warn("NER model with key " + modelKey + " not available."); 275 | } else { 276 | List ners = Arrays.asList(nameDetectors.get(modelKey).find(sentence.getWords())); 277 | addNer(ners, nerOccurrences); 278 | } 279 | }); 280 | 281 | if (!customNeModels.isEmpty()) { 282 | for (String key : customNeModels.keySet()) { 283 | if (!nameDetectors.containsKey(key)) { 284 | LOG.warn("Custom NER model with key " + key + " not available."); 285 | continue; 286 | } 287 | if (key.split("-").length == 0) { 288 | continue; 289 | } 290 | LOG.info("Running custom NER: " + key); 291 | List ners = Arrays.asList(nameDetectors.get(key).find(sentence.getWords())); 292 | addNer(ners, nerOccurrences); 293 | } 294 | } 295 | } 296 | processTokens(sentence, nerOccurrences); 297 | } 298 | } 299 | if (sentence.getWords() != null && sentence.getWords().length > 0) { 300 | if 
(annotators.contains("sentiment") && sentimentDetector != null) { 301 | double[] outcomes = sentimentDetector.categorize(sentence.getWords()); 302 | String category = sentimentDetector.getBestCategory(outcomes); 303 | if (Arrays.stream(outcomes).max().getAsDouble() < document.getSentimentProb()) { 304 | category = "2"; 305 | } 306 | sentence.setSentiment(category); 307 | LOG.info("Sentiment results: sentence = " + sentence.getSentence() + "; category = " + category + "; outcomes = " + Arrays.toString(outcomes)); 308 | } 309 | } 310 | }); 311 | 312 | // if (annotators.contains("ner")) { 313 | // for (String key : BASIC_NE_MODEL.keySet()) { 314 | // if (nameDetectors.containsKey(key)) { 315 | // nameDetectors.get(key).clearAdaptiveData(); 316 | // } 317 | // } 318 | // if (customProject != null) { 319 | // for (String key : customNeModels.keySet()) { 320 | // if (nameDetectors.containsKey(key)) { 321 | // nameDetectors.get(key).clearAdaptiveData(); 322 | // } 323 | // } 324 | // } 325 | // } 326 | } catch (Exception ex) { 327 | LOG.error("Error processing sentence for text: " + text, ex); 328 | throw new RuntimeException("Error processing sentence for text: " + text, ex); 329 | } 330 | } 331 | 332 | protected void addNer(List ners, Map> nerOccurrences) { 333 | if (ners != null && !ners.isEmpty()) { 334 | ners.stream().forEach((ner) -> { 335 | List currentNer = nerOccurrences.get(ner.getStart()); 336 | if (currentNer == null) { 337 | currentNer = new ArrayList<>(); 338 | nerOccurrences.put(ner.getStart(), currentNer); 339 | } 340 | currentNer.add(ner); 341 | }); 342 | } 343 | } 344 | 345 | public String train(String alg, String modelId, String fileTrain, String lang, Map params) { 346 | String fileOut = createModelFileName(lang, alg, modelId); 347 | String newKey = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase(); 348 | String result = ""; 349 | 350 | if (alg.toLowerCase().equals("ner")) { 351 | NERModelTool nerModel = new NERModelTool(fileTrain, modelId, lang, 
params); 352 | nerModel.train(); 353 | result = nerModel.validate(); 354 | nerModel.saveModel(fileOut); 355 | // incorporate this model into the OpenNLPPipeline 356 | if (nerModel.getModel() != null) { 357 | customNeModels.put(newKey, fileOut); 358 | /*if (!nameDetectors.containsKey(newKey)) { 359 | nameDetectors.put(newKey, new NameFinderME((TokenNameFinderModel) nerModel.getModel())); 360 | }*/ 361 | } 362 | } 363 | else if (alg.toLowerCase().equals("sentiment")) { 364 | SentimentModelTool sentModel = new SentimentModelTool(fileTrain, modelId, lang, params); 365 | sentModel.train(); 366 | result = sentModel.validate(); 367 | String[] dirPathSplit = fileTrain.split(File.separator); 368 | String fileOutToUse; 369 | if (dirPathSplit.length > 2) { 370 | StringBuilder sb = new StringBuilder(); 371 | for (int i = 0; i < dirPathSplit.length - 2; ++i) { 372 | sb.append(dirPathSplit[i]).append(File.separator); 373 | } 374 | fileOutToUse = sb.toString() + fileOut; 375 | } else { 376 | fileOutToUse = fileOut; 377 | } 378 | LOG.info("Saving model to " + fileOutToUse); 379 | sentModel.saveModel(fileOutToUse); 380 | // incorporate this model into the OpenNLPPipeline 381 | if (sentModel.getModel() != null) { 382 | customSentimentModels.put(newKey, fileOutToUse); 383 | //sentimentDetectors.put(newKey, new DocumentCategorizerME((DoccatModel) sentModel.getModel())); 384 | } 385 | } else { 386 | throw new UnsupportedOperationException("Undefined training procedure for algorithm " + alg); 387 | } 388 | 389 | return result; 390 | } 391 | 392 | public String test(String alg, String modelId, String file, String lang) { 393 | String modelID = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase(); 394 | String result = "failure"; 395 | 396 | if (alg.toLowerCase().equals("ner")) { 397 | if (customNeModels.containsKey(modelID)) { 398 | LOG.info("Testing NER model: " + modelID); 399 | 400 | TokenNameFinderModel nameModel; 401 | try { 402 | // Load model 403 | InputStream is = new
FileInputStream(new File(customNeModels.get(modelID))); 404 | nameModel = loadModel(TokenNameFinderModel.class, is); 405 | closeInputStream(is, modelID); 406 | } catch (Exception e) { 407 | throw new RuntimeException("Loading custom NER model " + modelID + " failed: ", e); 408 | } 409 | 410 | NERModelTool nerModel = new NERModelTool(); 411 | result = nerModel.test(file, new NameFinderME(nameModel)); 412 | } else 413 | LOG.error("Required NER model doesn't exist: " + modelID); 414 | } 415 | else if (alg.toLowerCase().equals("sentiment")) { 416 | if (customSentimentModels.containsKey(modelID)) { 417 | LOG.info("Testing sentiment model: " + modelID); 418 | 419 | DoccatModel doccatModel; 420 | try { 421 | // Load model 422 | InputStream is = new FileInputStream(new File(customSentimentModels.get(modelID))); 423 | doccatModel = loadModel(DoccatModel.class, is); 424 | closeInputStream(is, customSentimentModels.get(modelID)); 425 | } catch (Exception e) { 426 | throw new RuntimeException("Loading custom sentiment model " + modelID + " failed: ", e); 427 | } 428 | 429 | SentimentModelTool sentModel = new SentimentModelTool(); 430 | result = sentModel.test(file, new DocumentCategorizerME(doccatModel)); 431 | } else 432 | LOG.error("Required sentiment model doesn't exist: " + modelID); 433 | } else { 434 | throw new UnsupportedOperationException("Undefined testing procedure for algorithm " + alg); 435 | } 436 | return result; 437 | } 438 | 439 | private void processTokens(OpenNLPAnnotation.Sentence sentence, Map> nerOccurrences) { 440 | if (sentence.getWords() == null) { 441 | return; 442 | } 443 | String[] words = sentence.getWords(); 444 | String[] lemmas = sentence.getLemmas(); 445 | String[] posTags = sentence.getPosTags(); 446 | Span[] wordSpans = sentence.getWordSpans(); 447 | 448 | for (int i = 0; i < words.length; i++) { 449 | if (nerOccurrences != null && nerOccurrences.containsKey(i)) { 450 | List ners = nerOccurrences.get(i); 451 | final int startSpan =
wordSpans[i].getStart(); 452 | AtomicInteger index = new AtomicInteger(i); 453 | ners.forEach(ne -> { 454 | String value = ""; 455 | String lemma = ""; 456 | String type = ne.getType().toUpperCase(); 457 | Set posSet = new HashSet<>(); 458 | int endSpan = startSpan; 459 | for (int j = ne.getStart(); j < ne.getEnd(); j++) { 460 | value += " " + words[j].trim(); 461 | lemma += " " + (lemmas[j].equals(DEFAULT_LEMMA_OPEN_NLP) ? words[j].toLowerCase().trim() : lemmas[j].trim()); 462 | posSet.add(posTags[j]); 463 | endSpan = wordSpans[j].getEnd(); 464 | if (index.get() < j) { 465 | index.set(j); 466 | } 467 | } 468 | 469 | value = value.trim(); 470 | lemma = lemma.trim(); 471 | //check stopwords 472 | if (isNotStopWord(lemma)) { 473 | OpenNLPAnnotation.Token token = sentence.getToken(value, lemma); 474 | token.addTokenNE(type); 475 | token.addTokenPOS(posSet); 476 | token.addTokenSpans(new Span(startSpan, endSpan)); 477 | } 478 | }); 479 | i = index.get(); 480 | } else { 481 | String value = words[i].trim(); 482 | String lemma = lemmas[i].equals(DEFAULT_LEMMA_OPEN_NLP) ? 
words[i].toLowerCase() : lemmas[i].trim(); 483 | String ne = DEFAULT_BACKGROUND_SYMBOL; 484 | String pos = posTags[i]; 485 | Set posSet = new HashSet<>(); 486 | if (isNotStopWord(lemma)) { 487 | OpenNLPAnnotation.Token token = sentence.getToken(value, lemma); 488 | token.addTokenNE(ne); 489 | posSet.add(pos); 490 | token.addTokenPOS(posSet); 491 | token.addTokenSpans(wordSpans[i]); 492 | } 493 | } 494 | } 495 | } 496 | 497 | private boolean isNotStopWord(String value) { 498 | return !annotators.contains("stopword") || !stopWords.contains(value.toLowerCase()); 499 | } 500 | 501 | private void findModelFiles(String path) { 502 | if (path == null || path.length() == 0) { 503 | LOG.error("Scanning for model files: wrong path specified."); 504 | return; 505 | } 506 | 507 | File folder = new File(path); 508 | File[] listOfFiles = folder.listFiles(); 509 | if (listOfFiles == null) { 510 | return; 511 | } 512 | 513 | String p = path; 514 | if (p.charAt(p.length() - 1) != "/".charAt(0)) { 515 | path += "/"; 516 | } 517 | 518 | for (int i = 0; i < listOfFiles.length; i++) { 519 | if (!listOfFiles[i].isFile()) { 520 | continue; 521 | } 522 | String name = listOfFiles[i].getName(); 523 | String[] sp = name.split("-"); 524 | if (sp.length < 2) { 525 | continue; 526 | } 527 | if (!name.substring(name.length() - 4).equals(".bin")) { 528 | continue; 529 | } 530 | LOG.info("Custom models: Found file " + name); 531 | 532 | String alg = sp[0].toLowerCase(); 533 | 534 | String modelId = sp[1]; 535 | // this is useful in case user-defined model ID contained symbol "-" 536 | for (int j = 2; j < sp.length; j++) 537 | modelId += "-" + sp[j]; 538 | modelId = modelId.substring(0, modelId.length() - 4).toLowerCase(); // remove ".bin" 539 | //modelId = lang + "-" + modelId; 540 | 541 | LOG.info("Registering model name for algorithm " + alg + " under the key " + modelId); 542 | if (alg.equals("ner")) { 543 | customNeModels.put(modelId, path + name); 544 | } else if (alg.equals("sentiment")) { 
545 | customSentimentModels.put(modelId, path + name); 546 | } 547 | } 548 | } 549 | 550 | private T loadModel(Class clazz, InputStream in) { 551 | try { 552 | Constructor modelConstructor = clazz.getConstructor(InputStream.class); 553 | T model = modelConstructor.newInstance(in); 554 | return model; 555 | } catch (NoSuchMethodException | SecurityException | InstantiationException | IllegalAccessException | IllegalArgumentException | InvocationTargetException ex) { 556 | LOG.error("Error while initializing model of class: " + clazz, ex); 557 | throw new RuntimeException("Error while initializing model of class: " + clazz, ex); 558 | } 559 | } 560 | 561 | private void saveModel(BaseModel model, String file) { 562 | if (model == null) { 563 | LOG.error("Can't save training results to " + file + ": model is null"); 564 | return; 565 | } 566 | BufferedOutputStream modelOut = null; 567 | try { 568 | modelOut = new BufferedOutputStream(new FileOutputStream(file)); 569 | model.serialize(modelOut); 570 | modelOut.close(); 571 | } catch (IOException ex) { 572 | LOG.error("Error saving model to file " + file, ex); 573 | throw new RuntimeException("Error saving model to file " + file, ex); 574 | } 575 | return; 576 | } 577 | 578 | private InputStream getInputStream(Properties properties, String property, String defaultValue) { 579 | String path = defaultValue; 580 | if (properties != null) { 581 | path = properties.getProperty(property, defaultValue); 582 | } 583 | InputStream is; 584 | try { 585 | if (path.startsWith("file://")) { 586 | is = new FileInputStream(new File(new URI(path))); 587 | } else if (path.startsWith("/")) { 588 | is = new FileInputStream(new File(path)); 589 | } else { 590 | is = this.getClass().getResourceAsStream(path); 591 | } 592 | } catch (FileNotFoundException | URISyntaxException ex) { 593 | LOG.error("Error while loading model from path: " + path, ex); 594 | throw new RuntimeException("Error while loading model from path: " + path, ex); 595 | }
596 | return is; 597 | } 598 | 599 | private void closeInputStream(InputStream is, String name) { 600 | try { 601 | if (is != null) { 602 | is.close(); 603 | } 604 | } catch (IOException ex) { 605 | LOG.warn("Attempt to close stream for " + name + " model failed."); 606 | } 607 | return; 608 | } 609 | 610 | private String createModelFileName(String lang, String alg, String model) { 611 | String delim = "-"; 612 | //String name = "import/" + lang.toLowerCase() + delim + alg.toLowerCase(); 613 | String name = "import/" + alg.toLowerCase(); 614 | if (model != null) { 615 | if (model.length() > 0) { 616 | name += delim + model.toLowerCase(); 617 | } 618 | } 619 | name += ".bin"; 620 | return name; 621 | } 622 | 623 | 624 | /*class ImprovisedInputStreamFactory implements InputStreamFactory { 625 | private File inputSourceFile; 626 | private String inputSourceStr; 627 | 628 | ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) { 629 | this.inputSourceFile = null; 630 | this.inputSourceStr = defaultValue; 631 | if (properties!=null) this.inputSourceStr = properties.getProperty(property, defaultValue); 632 | 633 | try { 634 | if (this.inputSourceStr.startsWith("file://")) 635 | this.inputSourceFile = new File(new URI(this.inputSourceStr)); 636 | else if (this.inputSourceStr.startsWith("/")) 637 | this.inputSourceFile = new File(this.inputSourceStr); 638 | } catch (Exception ex) { 639 | LOG.error("Error while loading model from " + this.inputSourceStr); 640 | throw new RuntimeException("Error while loading model from " + this.inputSourceStr); 641 | } 642 | } 643 | 644 | @Override 645 | public InputStream createInputStream() throws IOException { 646 | LOG.debug("Creating input stream from " + this.inputSourceFile.getPath()); 647 | //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath()); 648 | return new FileInputStream(this.inputSourceFile.getPath()); 649 | } 650 | }*/ 651 | 652 | public Properties
getProperties() { 653 | return new Properties();//to be implemented 654 | } 655 | } 656 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPTextProcessor.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright (c) 2013-2016 GraphAware 3 | * 4 | * This file is part of the GraphAware Framework. 5 | * 6 | * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of 7 | * the GNU General Public License as published by the Free Software Foundation, either 8 | * version 3 of the License, or (at your option) any later version. 9 | * 10 | * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 11 | * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 12 | * See the GNU General Public License for more details. You should have received a copy of 13 | * the GNU General Public License along with this program. If not, see 14 | * . 
15 | */ 16 | package com.graphaware.nlp.processor.opennlp; 17 | 18 | import com.graphaware.nlp.annotation.NLPTextProcessor; 19 | import com.graphaware.nlp.domain.*; 20 | import com.graphaware.nlp.dsl.request.PipelineSpecification; 21 | import com.graphaware.nlp.processor.AbstractTextProcessor; 22 | 23 | import java.util.*; 24 | import java.util.concurrent.atomic.AtomicInteger; 25 | import java.util.stream.Collectors; 26 | 27 | import com.graphaware.nlp.util.Timer; 28 | import opennlp.tools.util.Span; 29 | import org.jetbrains.annotations.NotNull; 30 | import org.slf4j.Logger; 31 | import org.slf4j.LoggerFactory; 32 | 33 | @NLPTextProcessor(name = "OpenNLPTextProcessor") 34 | public class OpenNLPTextProcessor extends AbstractTextProcessor { 35 | 36 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPTextProcessor.class); 37 | 38 | private static final String CORE_PIPELINE_NAME = "OpenNLP.CORE"; 39 | public static final String TOKENIZER = "tokenizer"; 40 | public static final String SENTIMENT = "sentiment"; 41 | 42 | private final Map pipelines = new HashMap<>(); 43 | 44 | 45 | @Override 46 | public void init() { 47 | } 48 | 49 | @Override 50 | public String getAlias() { 51 | return "opennlp"; 52 | } 53 | 54 | @Override 55 | public String override() { 56 | return null; 57 | } 58 | 59 | public OpenNLPPipeline getPipeline(String name) { 60 | if (name == null || name.isEmpty()) { 61 | name = TOKENIZER; 62 | LOG.debug("Using default pipeline: " + name); 63 | } 64 | OpenNLPPipeline pipeline = getOpenNLPPipeline(name); 65 | return pipeline; 66 | } 67 | 68 | private void checkPipelineExistOrCreate(PipelineSpecification pipelineSpecification) { 69 | if (!pipelines.containsKey(pipelineSpecification.getName())) { 70 | createPipeline(pipelineSpecification); 71 | } 72 | } 73 | 74 | /* private void createFullPipeline() { 75 | OpenNLPPipeline pipeline = new PipelineBuilder() 76 | .tokenize() 77 | .extractNEs() 78 | .defaultStopWordAnnotator() 79 | 
.extractRelations() 80 | .extractSentiment() 81 | .threadNumber(6) 82 | .build(); 83 | pipelines.put(CORE_PIPELINE_NAME, pipeline); 84 | } 85 | 86 | private void createTokenizerPipeline() { 87 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME); 88 | pipelines.put(TOKENIZER, pipeline); 89 | } 90 | 91 | private void createSentimentPipeline() { 92 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME); 93 | pipelines.put(SENTIMENT, pipeline); 94 | } 95 | 96 | private void createTokenizerAndSentimentPipeline() { 97 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME); 98 | pipelines.put(TOKENIZER_AND_SENTIMENT, pipeline); 99 | } 100 | 101 | private void createPhrasePipeline() { 102 | OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME); 103 | pipelines.put(PHRASE, pipeline); 104 | }*/ 105 | 106 | @Override 107 | public AnnotatedText annotateText(String text, String lang, PipelineSpecification pipelineSpecification) { 108 | Timer timer = Timer.start(); 109 | checkPipelineExistOrCreate(pipelineSpecification); 110 | timer.lap("pipeline check"); 111 | OpenNLPPipeline pipeline = pipelines.get(pipelineSpecification.getName()); 112 | OpenNLPAnnotation document = new OpenNLPAnnotation(text, Collections.EMPTY_MAP); 113 | pipeline.annotate(document); 114 | 115 | AnnotatedText result = new AnnotatedText(); 116 | List sentences = document.getSentences(); 117 | final AtomicInteger sentenceSequence = new AtomicInteger(0); 118 | sentences.stream().forEach((sentence) -> { 119 | int sentenceNumber = sentenceSequence.getAndIncrement(); 120 | final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber); 121 | extractTokens(lang, sentence, newSentence); 122 | if (pipelineSpecification.hasProcessingStep(STEP_SENTIMENT)) { 123 | extractSentiment(sentence, newSentence); 124 | } 125 | if (pipelineSpecification.hasProcessingStep(STEP_PHRASE)) { 126 | extractPhrases(sentence, newSentence); 127 | } 128 | result.addSentence(newSentence); 
129 | }); 130 | 131 | return result; 132 | } 133 | 134 | protected Map getPipelineProperties(OpenNLPPipeline pipeline) { 135 | Map options = new HashMap<>(); 136 | for (Object o : pipeline.getProperties().keySet()) { 137 | if (o instanceof String) { 138 | options.put(o.toString(), pipeline.getProperties().getProperty(o.toString())); 139 | } 140 | } 141 | 142 | return options; 143 | } 144 | 145 | protected Map buildSpecifications(List actives) { 146 | List all = Arrays.asList("tokenize", "ner", "cleanxml", "truecase", "dependency", "relations", "checkLemmaIsStopWord", "coref", "sentiment", "phrase", "customSentiment", "customNER"); 147 | Map specs = new HashMap<>(); 148 | all.forEach(s -> { 149 | specs.put(s, actives.contains(s)); 150 | }); 151 | 152 | return specs; 153 | } 154 | 155 | 156 | /* @Override 157 | public AnnotatedText annotateText(String text, String name, String lang, Map otherParams) { 158 | if (name.length() == 0) { 159 | name = TOKENIZER; 160 | LOG.info("Using default pipeline: " + name); 161 | } 162 | OpenNLPPipeline pipeline = pipelines.get(name); 163 | if (pipeline == null) { 164 | throw new RuntimeException("Pipeline: " + name + " doesn't exist"); 165 | } 166 | OpenNLPAnnotation document = new OpenNLPAnnotation(text, otherParams); 167 | pipeline.annotate(document); 168 | // LOG.info("Annotation for id " + id + " finished."); 169 | 170 | AnnotatedText result = new AnnotatedText(); 171 | List sentences = document.getSentences(); 172 | final AtomicInteger sentenceSequence = new AtomicInteger(0); 173 | sentences.stream().forEach((sentence) -> { 174 | int sentenceNumber = sentenceSequence.getAndIncrement(); 175 | // String sentenceId = id + "_" + sentenceNumber; 176 | final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber); 177 | extractTokens(lang, sentence, newSentence); 178 | extractSentiment(sentence, newSentence); 179 | extractPhrases(sentence, newSentence); 180 | result.addSentence(newSentence); 181 | }); 182 | 
//extractRelationship(result, sentences, document); 183 | return result; 184 | } 185 | */ 186 | private void extractPhrases(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) { 187 | if (sentence.getPhrasesIndex() == null) { 188 | LOG.warn("extractPhrases(): phrases index empty, aborting extraction"); 189 | return; 190 | } 191 | sentence.getPhrasesIndex().forEach(index -> { 192 | Span chunk = sentence.getChunks()[index]; 193 | String chunkString = sentence.getChunkStrings()[index]; 194 | newSentence.addPhraseOccurrence(chunk.getStart(), chunk.getEnd(), new Phrase(chunkString, chunk.getType())); 195 | }); 196 | } 197 | 198 | private void extractSentiment(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) { 199 | int score = -1; 200 | if (sentence.getSentiment() != null) { // && !sentence.getSentiment().equals("-")) { 201 | try { 202 | score = Integer.valueOf(sentence.getSentiment()); 203 | } catch (NumberFormatException ex) { 204 | LOG.error("NumberFormatException: error extracting sentiment " + sentence.getSentiment() + " as a number.", ex); 205 | } 206 | } 207 | newSentence.setSentiment(score); 208 | } 209 | 210 | private void extractTokens(String lang, OpenNLPAnnotation.Sentence sentence, final Sentence newSentence) { 211 | Collection tokens = sentence.getTokens(); 212 | tokens.stream().filter((token) -> token != null /*&& checkLemmaIsValid(token.getToken())*/).forEach((token) -> { 213 | Tag newTag = getTag(token, lang); 214 | if (newTag != null) { 215 | Tag tagInSentence = newSentence.addTag(newTag); 216 | token.getTokenSpans().stream().forEach((span) -> { 217 | newSentence.addTagOccurrence(span.getStart(), span.getEnd(), token.getToken(), tagInSentence); 218 | }); 219 | } 220 | }); 221 | } 222 | 223 | // private void extractRelationship(AnnotatedText annotatedText, List sentences, Annotation document) { 224 | // Map corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class); 225 | // if (corefChains != null) { 226 | // for 
(CorefChain chain : corefChains.values()) { 227 | // CorefChain.CorefMention representative = chain.getRepresentativeMention(); 228 | // int representativeSenteceNumber = representative.sentNum - 1; 229 | // List representativeTokens = sentences.get(representativeSenteceNumber).get(CoreAnnotations.TokensAnnotation.class); 230 | // int beginPosition = representativeTokens.get(representative.startIndex - 1).beginPosition(); 231 | // int endPosition = representativeTokens.get(representative.endIndex - 2).endPosition(); 232 | // Phrase representativePhraseOccurrence = annotatedText.getSentences().get(representativeSenteceNumber).getPhraseOccurrence(beginPosition, endPosition); 233 | // if (representativePhraseOccurrence == null) { 234 | // LOG.warn("Representative Phrase not found: " + representative.mentionSpan); 235 | // } 236 | // for (CorefChain.CorefMention mention : chain.getMentionsInTextualOrder()) { 237 | // if (mention == representative) { 238 | // continue; 239 | // } 240 | // int mentionSentenceNumber = mention.sentNum - 1; 241 | // 242 | // List mentionTokens = sentences.get(mentionSentenceNumber).get(CoreAnnotations.TokensAnnotation.class); 243 | // int beginPositionMention = mentionTokens.get(mention.startIndex - 1).beginPosition(); 244 | // int endPositionMention = mentionTokens.get(mention.endIndex - 2).endPosition(); 245 | // Phrase mentionPhraseOccurrence = annotatedText.getSentences().get(mentionSentenceNumber).getPhraseOccurrence(beginPositionMention, endPositionMention); 246 | // if (mentionPhraseOccurrence == null) { 247 | // LOG.warn("Mention Phrase not found: " + mention.mentionSpan); 248 | // } 249 | // if (representativePhraseOccurrence != null 250 | // && mentionPhraseOccurrence != null) { 251 | // mentionPhraseOccurrence.setReference(representativePhraseOccurrence); 252 | // } 253 | // } 254 | // } 255 | // } 256 | // } 257 | @Override 258 | public Tag annotateSentence(String text, String lang, PipelineSpecification pipelineSpecification) { 
259 | // Annotation document = new Annotation(text); 260 | // pipelines.get(SENTIMENT).annotate(document); 261 | // List sentences = document.get(CoreAnnotations.SentencesAnnotation.class); 262 | // Optional sentence = sentences.stream().findFirst(); 263 | // if (sentence.isPresent()) { 264 | // Optional oTag = sentence.get().get(CoreAnnotations.TokensAnnotation.class).stream() 265 | // .map((token) -> getTag(token)) 266 | // .filter((tag) -> (tag != null) && checkPunctuation(tag.getLemma())) 267 | // .findFirst(); 268 | // if (oTag.isPresent()) { 269 | // return oTag.get(); 270 | // } 271 | // } 272 | return null; 273 | } 274 | 275 | @Override 276 | public Tag annotateTag(String text, String lang, PipelineSpecification pipelineSpecification) { 277 | OpenNLPAnnotation document = new OpenNLPAnnotation(text); 278 | final OpenNLPPipeline openNLPPipeline = getOpenNLPPipeline(pipelineSpecification.getName()); 279 | openNLPPipeline.annotate(document); 280 | List sentences = document.getSentences(); 281 | if (sentences != null && !sentences.isEmpty()) { 282 | if (sentences.size() > 1) { 283 | throw new RuntimeException("More than one sentence"); 284 | } 285 | Collection tokens = sentences.get(0).getTokens(); 286 | if (tokens != null && tokens.size() == 1) { 287 | OpenNLPAnnotation.Token token = tokens.iterator().next(); 288 | Tag newTag = getTag(token, lang); 289 | return newTag; 290 | } else if (tokens != null && tokens.size() > 1) { 291 | OpenNLPAnnotation.Token token = document.getToken(text, text); 292 | Tag newTag = getTag(token, lang); 293 | return newTag; 294 | } 295 | } 296 | return null; 297 | } 298 | 299 | @NotNull 300 | private OpenNLPPipeline getOpenNLPPipeline(String name) { 301 | final OpenNLPPipeline openNLPPipeline = pipelines.get(name); 302 | if (openNLPPipeline == null) { 303 | throw new RuntimeException("Pipeline " + name + " doesn't exist"); 304 | } 305 | return openNLPPipeline; 306 | } 307 | 308 | private Tag getTag(OpenNLPAnnotation.Token token, 
String lang) { 309 | List pos = new ArrayList<>(); 310 | List ne = new ArrayList<>(); 311 | String lemma = token.getTokenLemmas(); 312 | pos.addAll(token.getTokenPOS()); 313 | ne.addAll(token.getTokenNEs()); 314 | 315 | // apply lemma validity check (to all words in case of NamedEntities) 316 | lemma = Arrays.asList(lemma.split(" ")).stream().filter(str -> checkLemmaIsValid(str)).collect(Collectors.joining(" ")); 317 | if (lemma == null || lemma.length() == 0) 318 | return null; 319 | 320 | Tag tag = new Tag(lemma, lang); 321 | tag.setPos(pos); 322 | tag.setNe(ne); 323 | LOG.info("POS: " + pos + " ne: " + ne + " lemma: " + lemma); 324 | return tag; 325 | } 326 | 327 | private List annotateTagsAux(String text, String lang, OpenNLPPipeline pipeline) { 328 | List result = new ArrayList<>(); 329 | OpenNLPAnnotation document = new OpenNLPAnnotation(text); 330 | pipeline.annotate(document); 331 | List sentences = document.getSentences(); 332 | if (sentences != null && !sentences.isEmpty()) { 333 | if (sentences.size() > 1) { 334 | throw new RuntimeException("More than one sentence"); 335 | } 336 | Collection tokens = sentences.get(0).getTokens(); 337 | if (tokens != null && tokens.size() > 0) { 338 | tokens.stream().forEach((token) -> { 339 | Tag newTag = getTag(token, lang); 340 | if (newTag != null) 341 | result.add(newTag); 342 | }); 343 | return result; 344 | } 345 | } 346 | return null; 347 | } 348 | 349 | @Override 350 | public List annotateTags(String text, String lang, PipelineSpecification pipelineSpecification) { 351 | return annotateTagsAux(text, lang, getOpenNLPPipeline(pipelineSpecification.getName())); 352 | } 353 | 354 | public List annotateTags(String text, String lang) { 355 | return annotateTagsAux(text, lang, getOpenNLPPipeline(TOKENIZER)); 356 | } 357 | 358 | @Override 359 | public AnnotatedText sentiment(AnnotatedText annotated) { 360 | OpenNLPPipeline pipeline = getOpenNLPPipeline(SENTIMENT); 361 | annotated.getSentences().stream().forEach(item -> { 
// don't use parallelStream(), it crashes with the current content of the body 362 | OpenNLPAnnotation document = new OpenNLPAnnotation(item.getSentence()); 363 | pipeline.annotate(document); 364 | 365 | List sentences = document.getSentences(); 366 | Optional sentence = sentences.stream().findFirst(); 367 | if (sentence != null && sentence.isPresent()) { 368 | extractSentiment(sentence.get(), item); 369 | } 370 | }); 371 | 372 | return annotated; 373 | } 374 | 375 | @Override 376 | public String train(String alg, String modelId, String file, String lang, Map params) { 377 | // training could be done directly here, but it's better to have everything related to model implementation in one class, therefore ... 378 | OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER); 379 | return pipeline.train(alg, modelId, file, lang, params); 380 | } 381 | 382 | @Override 383 | public String test(String alg, String modelId, String file, String lang) { 384 | OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER); 385 | return pipeline.test(alg, modelId, file, lang); 386 | 387 | } 388 | 389 | class TokenHolder { 390 | 391 | private String ne; 392 | private StringBuilder sb; 393 | private int beginPosition; 394 | private int endPosition; 395 | 396 | public TokenHolder() { 397 | reset(); 398 | } 399 | 400 | public String getNe() { 401 | return ne; 402 | } 403 | 404 | public String getToken() { 405 | if (sb == null) { 406 | return " - "; 407 | } 408 | return sb.toString(); 409 | } 410 | 411 | public int getBeginPosition() { 412 | return beginPosition; 413 | } 414 | 415 | public int getEndPosition() { 416 | return endPosition; 417 | } 418 | 419 | public void setNe(String ne) { 420 | this.ne = ne; 421 | } 422 | 423 | public void updateToken(String tknStr) { 424 | this.sb.append(tknStr); 425 | } 426 | 427 | public void setBeginPosition(int beginPosition) { 428 | if (this.beginPosition < 0) { 429 | this.beginPosition = beginPosition; 430 | } 431 | } 432 | 433 | public void 
setEndPosition(int endPosition) { 434 | this.endPosition = endPosition; 435 | } 436 | 437 | public final void reset() { 438 | sb = new StringBuilder(); 439 | beginPosition = -1; 440 | endPosition = -1; 441 | } 442 | } 443 | 444 | class PhraseHolder implements Comparable { 445 | 446 | private StringBuilder sb; 447 | private int beginPosition; 448 | private int endPosition; 449 | 450 | public PhraseHolder() { 451 | reset(); 452 | } 453 | 454 | public String getPhrase() { 455 | if (sb == null) { 456 | return " - "; 457 | } 458 | return sb.toString(); 459 | } 460 | 461 | public int getBeginPosition() { 462 | return beginPosition; 463 | } 464 | 465 | public int getEndPosition() { 466 | return endPosition; 467 | } 468 | 469 | public void updatePhrase(String tknStr) { 470 | this.sb.append(tknStr); 471 | } 472 | 473 | public void setBeginPosition(int beginPosition) { 474 | if (this.beginPosition < 0) { 475 | this.beginPosition = beginPosition; 476 | } 477 | } 478 | 479 | public void setEndPosition(int endPosition) { 480 | this.endPosition = endPosition; 481 | } 482 | 483 | public final void reset() { 484 | sb = new StringBuilder(); 485 | beginPosition = -1; 486 | endPosition = -1; 487 | } 488 | 489 | @Override 490 | public boolean equals(Object o) { 491 | if (!(o instanceof PhraseHolder)) { 492 | return false; 493 | } 494 | PhraseHolder otherObject = (PhraseHolder) o; 495 | if (this.sb != null 496 | && otherObject.sb != null 497 | && this.sb.toString().equals(otherObject.sb.toString()) 498 | && this.beginPosition == otherObject.beginPosition 499 | && this.endPosition == otherObject.endPosition) { 500 | return true; 501 | } 502 | return false; 503 | } 504 | 505 | @Override 506 | public int compareTo(PhraseHolder o) { 507 | if (o == null) { 508 | return 1; 509 | } 510 | if (this.equals(o)) { 511 | return 0; 512 | } else if (this.beginPosition > o.beginPosition) { 513 | return 1; 514 | } else if (this.beginPosition == o.beginPosition) { 515 | if (this.endPosition > 
o.endPosition) { 516 | return 1; 517 | } 518 | } 519 | return -1; 520 | } 521 | } 522 | 523 | @Override 524 | public List getPipelines() { 525 | return new ArrayList<>(pipelines.keySet()); 526 | } 527 | 528 | @Override 529 | public boolean checkPipeline(String name) { 530 | return pipelines.containsKey(name); 531 | } 532 | 533 | @Override 534 | public void createPipeline(PipelineSpecification pipelineSpecification) { 535 | //TODO add validation 536 | String name = pipelineSpecification.getName(); 537 | PipelineBuilder pipelineBuilder = new PipelineBuilder(); 538 | List specActive = new ArrayList<>(); 539 | List stopwordsList; 540 | 541 | if (pipelineSpecification.hasProcessingStep("tokenize", true)) { 542 | pipelineBuilder.tokenize(); 543 | specActive.add("tokenize"); 544 | } 545 | 546 | if (pipelineSpecification.hasProcessingStep("ner", true)) { 547 | pipelineBuilder.extractNEs(); 548 | specActive.add("ner"); 549 | } 550 | 551 | String stopWords = pipelineSpecification.getStopWords() != null ? 
pipelineSpecification.getStopWords() : "default"; 552 | boolean checkLemma = pipelineSpecification.hasProcessingStep("checkLemmaIsStopWord"); 553 | if (checkLemma) { 554 | specActive.add("checkLemmaIsStopWord"); 555 | } 556 | 557 | if (stopWords.equalsIgnoreCase("default")) { 558 | pipelineBuilder.defaultStopWordAnnotator(); 559 | stopwordsList = PipelineBuilder.getDefaultStopwords(); 560 | } else { 561 | pipelineBuilder.customStopWordAnnotator(stopWords); 562 | stopwordsList = PipelineBuilder.getCustomStopwordsList(stopWords); 563 | } 564 | 565 | if (pipelineSpecification.hasProcessingStep("sentiment")) { 566 | pipelineBuilder.extractSentiment(); 567 | specActive.add("sentiment"); 568 | } 569 | if (pipelineSpecification.hasProcessingStep("coref")) { 570 | pipelineBuilder.extractCoref(); 571 | specActive.add("coref"); 572 | } 573 | if (pipelineSpecification.hasProcessingStep("relations")) { 574 | pipelineBuilder.extractRelations(); 575 | specActive.add("relations"); 576 | } 577 | if (pipelineSpecification.hasProcessingStep("customNER")) { 578 | if (!specActive.contains("ner")) { 579 | pipelineBuilder.extractNEs(); 580 | specActive.add("ner"); 581 | } 582 | specActive.add("customNER"); 583 | pipelineBuilder.extractCustomNEs(pipelineSpecification.getProcessingStepAsString("customNER")); 584 | } 585 | if (pipelineSpecification.hasProcessingStep("customSentiment")) { 586 | if (!specActive.contains("sentiment")) { 587 | pipelineBuilder.extractSentiment(); 588 | specActive.add("sentiment"); 589 | } 590 | specActive.add("customSentiment"); 591 | pipelineBuilder.extractCustomSentiment(pipelineSpecification.getProcessingStepAsString("customSentiment")); 592 | } 593 | Long threadNumber = pipelineSpecification.getThreadNumber() != 0 ? 
pipelineSpecification.getThreadNumber() : 4L; 594 | pipelineBuilder.threadNumber(threadNumber.intValue()); 595 | 596 | OpenNLPPipeline pipeline = pipelineBuilder.build(); 597 | pipelines.put(name, pipeline); 598 | } 599 | 600 | 601 | @Override 602 | public void removePipeline(String name) { 603 | if (!pipelines.containsKey(name)) { 604 | throw new RuntimeException("No pipeline found with name: " + name); 605 | } 606 | pipelines.remove(name); 607 | } 608 | } 609 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/PipelineBuilder.java: -------------------------------------------------------------------------------- 1 | /* 2 | * To change this license header, choose License Headers in Project Properties. 3 | * To change this template file, choose Tools | Templates 4 | * and open the template in the editor. 5 | */ 6 | package com.graphaware.nlp.processor.opennlp; 7 | 8 | import java.util.ArrayList; 9 | import java.util.Arrays; 10 | import java.util.List; 11 | import java.util.Properties; 12 | 13 | class PipelineBuilder { 14 | 15 | private static final String CUSTOM_STOP_WORD_LIST = "start,starts,period,periods,a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,o,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with"; 16 | 17 | private final Properties properties = new Properties(); 18 | private final StringBuilder annotators = new StringBuilder(); //basics annotators 19 | private int threadsNumber = 4; 20 | 21 | private void checkForExistingAnnotators() { 22 | if (annotators.toString().length() > 0) { 23 | annotators.append(", "); 24 | } 25 | } 26 | 27 | public PipelineBuilder tokenize() { 28 | checkForExistingAnnotators(); 29 | annotators.append("tokenize, pos, lemma"); 30 | return this; 31 | } 32 | 33 | public PipelineBuilder extractNEs() { 34 | checkForExistingAnnotators(); 35 | annotators.append("ner"); 36 | return this; 37 | } 38 | 39 | public 
PipelineBuilder extractSentiment() { 40 | checkForExistingAnnotators(); 41 | annotators.append("sentiment"); 42 | return this; 43 | } 44 | 45 | public PipelineBuilder extractRelations() { 46 | checkForExistingAnnotators(); 47 | annotators.append("relation"); 48 | return this; 49 | } 50 | 51 | public PipelineBuilder extractCoref() { 52 | return this; 53 | } 54 | 55 | public PipelineBuilder extractCustomNEs(String ners) { 56 | properties.setProperty("customNEs", ners); 57 | return this; 58 | } 59 | 60 | public PipelineBuilder extractCustomSentiment(String sent) { 61 | properties.setProperty("customSentiment", sent); 62 | return this; 63 | } 64 | 65 | public PipelineBuilder defaultStopWordAnnotator() { 66 | checkForExistingAnnotators(); 67 | annotators.append("stopword"); 68 | properties.setProperty("stopword", CUSTOM_STOP_WORD_LIST); 69 | return this; 70 | } 71 | 72 | public PipelineBuilder customStopWordAnnotator(String customStopWordList) { 73 | checkForExistingAnnotators(); 74 | String stopWordList; 75 | if (annotators.indexOf("stopword") >= 0) { 76 | String alreadyexistingStopWordList = properties.getProperty("stopword"); 77 | stopWordList = alreadyexistingStopWordList + "," + customStopWordList; 78 | } else { 79 | annotators.append("stopword"); 80 | stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList; 81 | } 82 | properties.setProperty("stopword", stopWordList); 83 | return this; 84 | } 85 | 86 | public PipelineBuilder stopWordAnnotator(Properties properties) { 87 | return this; 88 | } 89 | 90 | public PipelineBuilder threadNumber(int threads) { 91 | this.threadsNumber = threads; 92 | return this; 93 | } 94 | 95 | public OpenNLPPipeline build() { 96 | properties.setProperty("annotators", annotators.toString()); 97 | properties.setProperty("threads", String.valueOf(threadsNumber)); 98 | OpenNLPPipeline pipeline = new OpenNLPPipeline(properties); 99 | return pipeline; 100 | } 101 | 102 | public static List getDefaultStopwords() { 103 | List stopwords = 
new ArrayList<>(); 104 | Arrays.stream(CUSTOM_STOP_WORD_LIST.split(",")).forEach(s -> { 105 | stopwords.add(s.trim()); 106 | }); 107 | 108 | return stopwords; 109 | } 110 | 111 | public static List getCustomStopwordsList(String customStopWordList) { 112 | String stopWordList; 113 | if (customStopWordList.startsWith("+")) { 114 | stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList.replace("+,", "").replace("+", ""); 115 | } else { 116 | stopWordList = customStopWordList; 117 | } 118 | 119 | List list = new ArrayList<>(); 120 | Arrays.stream(stopWordList.split(",")).forEach(s -> { 121 | list.add(s.trim()); 122 | }); 123 | 124 | return list; 125 | } 126 | } 127 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/model/NERModelTool.java: -------------------------------------------------------------------------------- 1 | /* 2 | * 3 | * 4 | */ 5 | package com.graphaware.nlp.processor.opennlp.model; 6 | 7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline; 8 | import java.io.IOException; 9 | import java.util.Map; 10 | 11 | import opennlp.tools.namefind.TokenNameFinderFactory; 12 | import opennlp.tools.namefind.TokenNameFinderCrossValidator; 13 | import opennlp.tools.namefind.TokenNameFinderEvaluator; 14 | import opennlp.tools.namefind.NameSample; 15 | import opennlp.tools.namefind.NameFinderME; 16 | import opennlp.tools.namefind.NameSampleDataStream; 17 | import opennlp.tools.namefind.TokenNameFinderModel; 18 | 19 | import opennlp.tools.util.ObjectStream; 20 | 21 | import com.graphaware.nlp.util.GenericModelParameters; 22 | 23 | import org.slf4j.Logger; 24 | import org.slf4j.LoggerFactory; 25 | 26 | public class NERModelTool extends OpenNLPGenericModelTool { 27 | 28 | private String entityType; 29 | private static final String MODEL_NAME = "NER"; 30 | 31 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class); 32 | 33 | public 
NERModelTool(String fileIn, String modelDescr, String lang, Map params) { 34 | super(fileIn, modelDescr, lang, params); 35 | this.entityType = null; // train only specific named entity; null = train all entities present in the training set 36 | if (params != null) { 37 | if (params.containsKey(GenericModelParameters.TRAIN_ENTITYTYPE)) { 38 | this.entityType = (String) params.get(GenericModelParameters.TRAIN_ENTITYTYPE); 39 | } 40 | } 41 | } 42 | 43 | public NERModelTool(String fileIn, String modelDescr, String lang) { 44 | this(fileIn, modelDescr, lang, null); 45 | } 46 | 47 | public NERModelTool() { 48 | super(); 49 | } 50 | 51 | public void train() { 52 | try (ObjectStream lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) { 53 | LOG.info("Training of " + MODEL_NAME + " started ..."); 54 | this.model = NameFinderME.train(lang, entityType, sampleStream, trainParams, new TokenNameFinderFactory()); 55 | } catch (IOException ex) { 56 | LOG.error("Error while opening training file: " + fileIn, ex); 57 | throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex); 58 | } catch (Exception ex) { 59 | LOG.error("Error while training " + MODEL_NAME + " model " + modelDescr); 60 | throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex); 61 | } 62 | } 63 | 64 | public String validate() { 65 | String result = ""; 66 | if (this.fileValidate == null) { 67 | //List> listeners = new LinkedList>(); 68 | try (ObjectStream lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) { 69 | LOG.info("Validation of " + MODEL_NAME + " started ..."); 70 | // Using CrossValidator 71 | TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator(lang, entityType, trainParams, null); 72 | // the second argument of 'evaluate()' gives number of folds (n), i.e. 
number of times the training-testing will be run (with data splitting train:test = (n-1):1) 73 | evaluator.evaluate(sampleStream, nFolds); 74 | result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure()) 75 | + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore()) 76 | + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")"; 77 | LOG.info("Validation: " + result); 78 | } catch (IOException ex) { 79 | LOG.error("Error while opening training file: " + fileIn, ex); 80 | throw new RuntimeException("IOError while evaluating " + MODEL_NAME + " model " + modelDescr, ex); 81 | } catch (Exception ex) { 82 | LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex); 83 | throw new RuntimeException("Error while evaluating " + MODEL_NAME + " model " + modelDescr, ex); 84 | } 85 | } else { 86 | result = test(this.fileValidate, new NameFinderME((TokenNameFinderModel) model)); 87 | } 88 | 89 | return result; 90 | } 91 | 92 | public String test(String file, NameFinderME modelME) { 93 | String result = ""; 94 | try (ObjectStream lineStreamValidate = openFile(file); NameSampleDataStream sampleStreamValidate = new NameSampleDataStream(lineStreamValidate)) { 95 | LOG.info("Testing of " + MODEL_NAME + " started ..."); 96 | //TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME((TokenNameFinderModel) model)); 97 | TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(modelME); 98 | evaluator.evaluate(sampleStreamValidate); 99 | result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure()) 100 | + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore()) 101 | + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")"; 102 | LOG.info("Testing result: " + result); 103 | } catch (IOException ex) { 104 | LOG.error("Error while opening test file: " + file, ex); 105 | throw new RuntimeException("Error while testing " + 
MODEL_NAME + " model " + modelDescr, ex); 106 | } catch (Exception ex) { 107 | LOG.error("Error while testing " + this.MODEL_NAME + " model.", ex); 108 | } 109 | return result; 110 | } 111 | } 112 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/model/OpenNLPGenericModelTool.java: -------------------------------------------------------------------------------- 1 | /* 2 | * 3 | * 4 | */ 5 | package com.graphaware.nlp.processor.opennlp.model; 6 | 7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline; 8 | import java.io.File; 9 | import java.io.FileInputStream; 10 | import java.io.FileOutputStream; 11 | import java.io.InputStream; 12 | import java.io.BufferedOutputStream; 13 | import java.io.IOException; 14 | import java.net.URI; 15 | import java.util.Properties; 16 | import java.util.Map; 17 | import java.text.DecimalFormat; 18 | 19 | import opennlp.tools.util.PlainTextByLineStream; 20 | import opennlp.tools.util.InputStreamFactory; 21 | import opennlp.tools.util.TrainingParameters; 22 | import opennlp.tools.util.ObjectStream; 23 | import opennlp.tools.util.model.BaseModel; 24 | 25 | import com.graphaware.nlp.util.GenericModelParameters; 26 | 27 | import org.slf4j.Logger; 28 | import org.slf4j.LoggerFactory; 29 | 30 | public class OpenNLPGenericModelTool { 31 | 32 | protected BaseModel model; 33 | protected TrainingParameters trainParams; 34 | protected final String modelDescr; 35 | protected final String lang; 36 | protected final DecimalFormat decFormat; 37 | protected int nFolds; 38 | 39 | protected final String fileIn; 40 | protected String fileValidate; 41 | 42 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class); 43 | 44 | public OpenNLPGenericModelTool(String file, String modelDescr, String lang) { 45 | this.fileValidate = null; 46 | this.fileIn = file; 47 | this.nFolds = 10; 48 | this.modelDescr = modelDescr; 49 | this.lang = lang; 50 
| this.decFormat = new DecimalFormat("#0.00"); // for formatting validation results to two decimal places 51 | 52 | this.setDefParams(); 53 | } 54 | 55 | public OpenNLPGenericModelTool(String file, String modelDescr, String lang, Map params) { 56 | this(file, modelDescr, lang); 57 | this.setTrainingParameters(params); 58 | } 59 | 60 | /* 61 | * This constructor is needed only for invoking the test() method (the model is provided as an argument of test() ) 62 | */ 63 | public OpenNLPGenericModelTool() { 64 | this(null, null, null); 65 | this.model = null; 66 | } 67 | 68 | // override this method in a subclass if you want different defaults 69 | protected void setDefParams() { 70 | this.trainParams = TrainingParameters.defaultParams(); 71 | } 72 | 73 | protected ObjectStream openFile(String fileName) { 74 | if (fileName == null || fileName.isEmpty()) { 75 | LOG.error("File name is null or empty."); 76 | return null; 77 | } 78 | ObjectStream lStream = null; 79 | try { 80 | ImprovisedInputStreamFactory dataIn = new ImprovisedInputStreamFactory(null, "", fileName); 81 | lStream = new PlainTextByLineStream(dataIn, "UTF-8"); 82 | } catch (IOException ex) { 83 | LOG.error("Failure while opening file " + fileName, ex); 84 | throw new RuntimeException("Failure while opening file " + fileName, ex); 85 | } 86 | 87 | if (lStream == null) 88 | throw new RuntimeException("Failure while opening file " + fileName + ": input stream is null."); 89 | return lStream; 90 | } 91 | 92 | private void setTrainingParameters(Map params) { 93 | this.setDefParams(); 94 | if (params == null || params.isEmpty()) { 95 | LOG.error("Map of parameters is null or empty. 
Using default values."); 96 | return; 97 | } 98 | 99 | // now add/override with user-defined parameters 100 | if (params.containsKey(GenericModelParameters.TRAIN_ALG)) { 101 | String val = objectToString(params, GenericModelParameters.TRAIN_ALG); 102 | this.trainParams.put(TrainingParameters.ALGORITHM_PARAM, val); // default: MAXENT 103 | LOG.info("Training parameter " + TrainingParameters.ALGORITHM_PARAM + " set to " + val); 104 | } 105 | if (params.containsKey(GenericModelParameters.TRAIN_TYPE)) { 106 | String val = objectToString(params, GenericModelParameters.TRAIN_TYPE); 107 | this.trainParams.put(TrainingParameters.TRAINER_TYPE_PARAM, val); 108 | LOG.info("Training parameter " + TrainingParameters.TRAINER_TYPE_PARAM + " set to " + val); 109 | } 110 | if (params.containsKey(GenericModelParameters.TRAIN_CUTOFF)) { 111 | String val = objectToString(params, GenericModelParameters.TRAIN_CUTOFF); 112 | this.trainParams.put(TrainingParameters.CUTOFF_PARAM, val); 113 | LOG.info("Training parameter " + TrainingParameters.CUTOFF_PARAM + " set to " + val); 114 | } 115 | if (params.containsKey(GenericModelParameters.TRAIN_ITER)) { 116 | String val = objectToString(params, GenericModelParameters.TRAIN_ITER); 117 | this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, val); 118 | LOG.info("Training parameter " + TrainingParameters.ITERATIONS_PARAM + " set to " + val); 119 | } 120 | if (params.containsKey(GenericModelParameters.TRAIN_THREADS)) { 121 | String val = objectToString(params, GenericModelParameters.TRAIN_THREADS); 122 | this.trainParams.put(TrainingParameters.THREADS_PARAM, val); 123 | LOG.info("Training parameter " + TrainingParameters.THREADS_PARAM + " set to " + val); 124 | } 125 | if (params.containsKey(GenericModelParameters.VALIDATE_FOLDS)) { 126 | this.nFolds = objectToInt(params, GenericModelParameters.VALIDATE_FOLDS); 127 | LOG.info("n-folds for cross-validation set to {}.", this.nFolds); 128 | } 129 | if 
(params.containsKey(GenericModelParameters.VALIDATE_FILE)) { 130 | this.fileValidate = objectToString(params, GenericModelParameters.VALIDATE_FILE); 131 | LOG.info("Using validation file " + fileValidate); 132 | } 133 | } 134 | 135 | private String objectToString(Map params, String key) { 136 | String result = null; 137 | if (params.get(key) instanceof String) 138 | result = (String) params.get(key); 139 | else if (params.get(key) instanceof Long) 140 | result = ((Long) params.get(key)).toString(); 141 | else if (params.get(key) instanceof Integer) 142 | result = ((Integer) params.get(key)).toString(); 143 | else 144 | throw new RuntimeException("Wrong format of parameter " + key); 145 | return result; 146 | } 147 | 148 | private int objectToInt(Map params, String key) { 149 | int result; 150 | if (params.get(key) instanceof String) 151 | result = Integer.parseInt((String) params.get(key)); 152 | else if (params.get(key) instanceof Long) 153 | result = ((Long) params.get(key)).intValue(); 154 | else if (params.get(key) instanceof Integer) 155 | result = ((Integer) params.get(key)).intValue(); 156 | else 157 | throw new RuntimeException("Wrong format of parameter " + key); 158 | return result; 159 | } 160 | 161 | protected void closeInputFiles() { 162 | // try { 163 | // if (this.lineStream != null) { 164 | // this.lineStream.close(); 165 | // } 166 | // } catch (IOException ex) { 167 | // LOG.warn("Attempt to close input line-stream from source file " + this.fileIn + " failed."); 168 | // } 169 | // 170 | // try { 171 | // if (this.lineStreamValidate != null) { 172 | // this.lineStreamValidate.close(); 173 | // } 174 | // } catch (IOException ex) { 175 | // LOG.warn("Attempt to close input line-stream from source file " + this.fileValidate + " failed."); 176 | // } 177 | } 178 | 179 | public void saveModel(String file) { 180 | if (this.model == null) { 181 | LOG.error("Can't save training results to " + file + ": model is null"); 182 | return; 183 | } 184 | try { 
185 | LOG.info("Saving model to file: " + file); 186 | BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(file)); 187 | this.model.serialize(modelOut); 188 | modelOut.close(); 189 | } catch (IOException ex) { 190 | LOG.error("Error saving model to file " + file, ex); 191 | throw new RuntimeException("Error saving model to file " + file, ex); 192 | } 193 | 194 | //this.closeInputFile(); 195 | } 196 | 197 | public BaseModel getModel() { 198 | return this.model; 199 | } 200 | 201 | class ImprovisedInputStreamFactory implements InputStreamFactory { 202 | 203 | private File inputSourceFile; 204 | private String inputSourceStr; 205 | 206 | ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) { 207 | this.inputSourceFile = null; 208 | this.inputSourceStr = defaultValue; 209 | if (properties != null) { 210 | this.inputSourceStr = properties.getProperty(property, defaultValue); 211 | } 212 | try { 213 | if (this.inputSourceStr.startsWith("file://")) { 214 | this.inputSourceFile = new File(new URI(this.inputSourceStr.replace("file://", ""))); 215 | } else if (this.inputSourceStr.startsWith("/")) { 216 | this.inputSourceFile = new File(this.inputSourceStr); 217 | } 218 | } catch (Exception ex) { 219 | LOG.error("Error while loading model from " + this.inputSourceStr, ex); 220 | throw new RuntimeException("Error while loading model from " + this.inputSourceStr, ex); 221 | } 222 | } 223 | 224 | @Override 225 | public InputStream createInputStream() throws IOException { 226 | LOG.debug("Creating input stream from " + this.inputSourceFile.getPath()); 227 | //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath()); 228 | return new FileInputStream(this.inputSourceFile.getPath()); 229 | } 230 | 231 | /*public void closeInputStream() { 232 | try { 233 | if (this.is!=null) 234 | this.is.close(); 235 | } catch (IOException ex) { 236 | LOG.warn("Attempt to close input stream failed."); 237 | } 238 | 
}*/ 239 | } 240 | 241 | } 242 | -------------------------------------------------------------------------------- /src/main/java/com/graphaware/nlp/processor/opennlp/model/SentimentModelTool.java: -------------------------------------------------------------------------------- 1 | /* 2 | * 3 | * 4 | */ 5 | package com.graphaware.nlp.processor.opennlp.model; 6 | 7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline; 8 | import java.io.File; 9 | import java.io.FileOutputStream; 10 | import java.io.InputStream; 11 | import java.io.BufferedOutputStream; 12 | import java.io.IOException; 13 | import java.util.Arrays; 14 | import java.util.Properties; 15 | import java.util.Map; 16 | import java.util.HashMap; 17 | import java.util.Collections; 18 | import java.util.Iterator; 19 | import java.net.URI; 20 | 21 | import opennlp.tools.namefind.TokenNameFinderFactory; 22 | import opennlp.tools.namefind.TokenNameFinderCrossValidator; 23 | import opennlp.tools.namefind.TokenNameFinderEvaluator; 24 | import opennlp.tools.doccat.DoccatFactory; 25 | import opennlp.tools.doccat.DocumentSample; 26 | import opennlp.tools.doccat.DocumentSampleStream; 27 | import opennlp.tools.doccat.DocumentCategorizerME; 28 | import opennlp.tools.doccat.DoccatCrossValidator; 29 | import opennlp.tools.doccat.DocumentCategorizerEvaluator; 30 | import opennlp.tools.doccat.DoccatModel; 31 | 32 | import opennlp.tools.namefind.NameSample; 33 | 34 | import opennlp.tools.util.PlainTextByLineStream; 35 | import opennlp.tools.util.InputStreamFactory; 36 | import opennlp.tools.util.TrainingParameters; 37 | import opennlp.tools.util.ObjectStream; 38 | import opennlp.tools.util.eval.CrossValidationPartitioner; 39 | import opennlp.tools.util.eval.FMeasure; 40 | import opennlp.tools.util.FilterObjectStream; 41 | //import opennlp.tools.util.eval.EvaluationMonitor; 42 | 43 | import com.graphaware.nlp.util.GenericModelParameters; 44 | 45 | import org.slf4j.Logger; 46 | import org.slf4j.LoggerFactory; 47 | 48 | 
/** 49 | * 50 | * @author vla 51 | */ 52 | public class SentimentModelTool extends OpenNLPGenericModelTool { 53 | 54 | private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class); 55 | 56 | private static final String MODEL_NAME = "sentiment"; 57 | private static final String DEFAULT_ITER = "30"; 58 | private static final String DEFAULT_CUTOFF = "2"; 59 | 60 | public SentimentModelTool(String fileIn, String modelDescr, String lang, Map params) { 61 | super(fileIn, modelDescr, lang, params); 62 | } 63 | 64 | public SentimentModelTool(String fileIn, String modelDescr, String lang) { 65 | this(fileIn, modelDescr, lang, null); 66 | } 67 | 68 | public SentimentModelTool() { 69 | super(); 70 | } 71 | 72 | // here you can specify default parameters specific to this class 73 | @Override 74 | protected void setDefParams() { 75 | this.trainParams = TrainingParameters.defaultParams(); 76 | this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, DEFAULT_ITER); 77 | this.trainParams.put(TrainingParameters.CUTOFF_PARAM, DEFAULT_CUTOFF); 78 | } 79 | 80 | public void train() { 81 | try (ObjectStream lineStream = openFile(fileIn); ObjectStream sampleStream = new DocumentSampleStream(lineStream)) { 82 | LOG.info("Training of " + MODEL_NAME + " started ..."); 83 | this.model = DocumentCategorizerME.train("en", sampleStream, trainParams, new DoccatFactory()); 84 | } catch (IOException e) { 85 | LOG.error("IOError while training a custom " + MODEL_NAME + " model " + modelDescr, e); 86 | throw new RuntimeException("IOError while training a custom " + MODEL_NAME + " model " + this.modelDescr, e); 87 | } 88 | } 89 | 90 | public String validate() { 91 | String result = ""; 92 | if (this.fileValidate == null) { 93 | try (ObjectStream lineStream = openFile(fileIn); ObjectStream sampleStream = new DocumentSampleStream(lineStream)) { 94 | LOG.info("Validation of " + MODEL_NAME + " started ..."); 95 | DoccatCrossValidator evaluator = new DoccatCrossValidator(this.lang, 
this.trainParams, new DoccatFactory()); 96 | // the second argument of 'evaluate()' gives the number of folds (n): the number of times training-testing will be run (with data split train:test = (n-1):1) 97 | evaluator.evaluate(sampleStream, this.nFolds); 98 | result = "Accuracy = " + this.decFormat.format(evaluator.getDocumentAccuracy()); 99 | LOG.info("Validation: " + result); 100 | } catch (IOException e) { 101 | LOG.error("Error while opening training file: " + fileIn, e); 102 | throw new RuntimeException("IOError while evaluating a " + MODEL_NAME + " model " + this.modelDescr, e); 103 | } catch (Exception ex) { 104 | LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex); 105 | } 106 | } else { 107 | // Using a separate .test file provided by the user 108 | result = test(this.fileValidate, new DocumentCategorizerME((DoccatModel) this.model)); 109 | } 110 | 111 | return result; 112 | } 113 | 114 | public String test(String file, DocumentCategorizerME modelME) { 115 | String result = ""; 116 | try (ObjectStream<String> lineStream = openFile(file); ObjectStream<DocumentSample> sampleStreamValidate = new DocumentSampleStream(lineStream)) { 117 | LOG.info("Testing of " + MODEL_NAME + " started ..."); 118 | DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(modelME); 119 | evaluator.evaluate(sampleStreamValidate); 120 | result = "Accuracy = " + this.decFormat.format(evaluator.getAccuracy()); 121 | LOG.info("Testing: " + result); 122 | } catch (IOException e) { 123 | LOG.error("Error while opening a test file: " + file, e); 124 | throw new RuntimeException("IOError while testing a " + MODEL_NAME + " model " + this.modelDescr, e); 125 | } catch (Exception ex) { 126 | LOG.error("Error while testing " + MODEL_NAME + " model.", ex); 127 | } 128 | return result; 129 | } 130 | } 131 | 
-------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage_money.test: -------------------------------------------------------------------------------- 1 | Mr Juncker said the EU was spending on average €27,000 ( £24,000 ; $30,000 ) per soldier on equipment and research, whereas the US was spending €108,000 . 2 | "Together, we spend half as much as the United States, but even then we only achieve 15% of their efficiency," he said. 3 | 4 | Under the plan, €500m will be made available annually after 2020 for joint military research, and another €1bn annually for joint investment and purchases of military equipment, such as drones and helicopters. 5 | 6 | In afternoon trade, sterling was down 1.7% against the dollar at $1.227 . 7 | Against the euro, the pound was down 1.4% at 1.1393.6. 8 | This makes imported goods' prices higher and squeezes consumers' ability to spend. 9 | 10 | Housebuilders, including Taylor Wimpey and Persimmon, saw falls of up to 5% , while retail companies' shares also fell. 11 | Next and Marks and Spencer fell more than 3% . 12 | 13 | Ultimately they may have deprived the state of nearly €32bn ( £28bn ; $36bn ). 14 | As the German broadcaster ARD wryly noted, that would have paid for repairs to a lot of schools and bridges. 
15 | 16 | The pound has dropped by more than 2% against the dollar, sterling's biggest one-day fall since the Brexit referendum vote last June. 17 | 18 | Japan's benchmark Nikkei 225 stock index closed 0.5% higher and South Korea's Kospi cended the day up 0.8% . 19 | 20 | Ocwen Financial rose 2.7% . 21 | The loan servicer has been under pressure from the Consumer Financial Protection Bureau, which would be seriously weakened under Thursday's measure. 22 | 23 | The deals range from buying UK chip firm ARM Holdings for £24bn ( $32bn ), investing $1bn in satellite startup OneWeb, to setting up a venture fund with Saudi Arabia. 24 | 25 | The ECB now expects growth across the eurozone to be 1.9% in 2017 compared with its March forecast of 1.8% . 26 | It also increased its growth projection for 2018 to 1.8% from 1.7% , and for 2019 to 1.7% from 1.6% . 27 | 28 | At that rate, the 19 countries that use the euro would see growth at 2.3% this year, nearly double the rate of the US, which is on course to grow 1.2% . 29 | 30 | Although all EU countries are required to observe the 3% limit, only the 19 countries that use the euro as a currency can be fined. 31 | 32 | According to the Office for National Statistics (ONS), manufacturing production grew 0.2% from the month before in April, rebounding from the 0.6% decline recorded in the previous month but falling short of expectations for a 0.8% increase. 33 | 34 | At 1:59pm BST, the Brent front month futures contract for August delivery was down another 0.75% or 36 cents at $47.70 per barrel, with the global proxy benchmark having breached the psychological $50 -level on Monday. 35 | 36 | Concurrently, the West Texas Intermediate (WTI) was down 0.94% or 43 cents to $45.29 per barrel, after US Energy Information Administration said the country's stockpiles rose by 3.3m barrels last week, against market estimates for a 3.5m-barrel drop. 37 | 38 | A breakdown below $45 should open a path lower towards $44 . 
39 | Furthermore, with the oversupply woes still a dominant theme in the oil markets and Opec's efforts to stabilise the markets disrupted by US Shale production, WTI Crude may receive further punishment. 40 | 41 | A diamond ring that was initially bought at a car boot sales for £10 has been auctioned off for £656,750 in London. 42 | 43 | The owner was unaware the "exceptionally-sized" stone was instead a 26-carat diamond, which she wore for almost two decades and which fetched almost double the £350,000 it was expected to be sold for. 44 | 45 | A Cartier diamond brooch owned by the late Margaret Thatcher, which estimated to fetch £35,000 was sold for £81,250 . 46 | 47 | French beauty group L'Oreal said it has entered into "exclusive discussions" to sell the natural products cosmetics business for an enterprise value of €1bn ( $1.1bn ), after buying it eleven years ago. 48 | 49 | British multinational utility firm Centrica has announced that intends to sell 60% of its stake in a joint venture oil and gas exploration and production enterprise to a consortium of firms. 50 | The deal is expected to cost £240m . 51 | 52 | Centrica's cash flows also jumped by nearly 130% to £2bn in 2016. 53 | Its operations are mainly confined to Europe and North America. 54 | 55 | MIE Holdings is a Hong Kong Stock Exchange-listed oil and gas firm. 56 | It managed to bring its net loss down by 13% to 1.3 billion renminbi during the 2016 financial year. 57 | 58 | The US Central Intelligence Agency (CIA) has estimated that after Liechtenstein it is Qatar who has the world's second largest GDP per capita with a value of $129,700 . 59 | Moreover, the Sovereign Wealth Fund Institute has ranked the Qatar Investment Authority as the world's 9th largest sovereign wealth fund, with a total asset value of $335bn . 60 | 61 | Data released by Halifax on 7 June showed UK house prices rose 3.3% year-on-year in May, following a 3.8% increase in April. 
62 | 63 | The national sales gauge improved slightly to -8% in May from -9% in the previous month. 64 | 65 | At 4:51pm BST, the Brent front month futures contract for August delivery was down 3.49% or $1.75 at $48.37 per barrel, with the global proxy benchmark having breached the psychological $50 -level on Monday. 66 | 67 | Concurrently, the West Texas Intermediate (WTI) was down 4.25% or $2.05 to $46.11 per barrel, having fallen as low as $45.92 in late European trading, after US Energy Information Administration said the country's stockpiles rose by 3.3m barrels last week, against market estimates for a 3.5m-barrel drop. 68 | 69 | The goods trade deficit with the rest of the world narrowed to £10.4bn from £12bn the month before, as import levels of mechanical machinery, oil and cars fell during the period. 70 | 71 | The overall trade deficit - covering goods and services - narrowed to £2.1bn in April from £3.9bn the month before. 72 | Economists had expected the deficit to amount to £3.5bn . 73 | 74 | The pound lost more than 2 cents against the dollar within seconds of the exit poll result, falling from $1.2955 to $1.2752 late Thursday. 75 | 76 | The General Administration of Customs revealed on Thursday that China's exports increased by 8.7% year-on-year in May, beating forecasts of a 7.2% increase. 77 | The country's trade surplus now amounts to $40.8bn ( £31bn ). 78 | 79 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017. 80 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years. 
81 | -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.test: -------------------------------------------------------------------------------- 1 | Theresa May has said she will form a government with the support of the Democratic Unionists that can provide "certainty" for the future. 2 | Speaking after visiting Buckingham Palace, she said only her party had the "legitimacy" to govern after winning the most seats and votes. 3 | In a short statement outside Downing Street, which followed a 25-minute audience with The Queen , Mrs May said she intended to form a government which could "provide certainty and lead Britain forward at this critical time for our country". 4 | 5 | The BBC's Laura Kuenssberg said the PM had returned to No 10 a "diminished figure", having ended up with 12 fewer seats than when she called the election in April. 6 | She had called the election with the stated reason that it would strengthen her hand in negotiations for the UK to leave the EU - the talks are due to start on 19 June. 7 | 8 | The general election has ended in a hung Parliament, where no party has the 326 seats needed to get an overall majority in the House of Commons. 9 | So what happens now? 10 | Who is the prime minister? 11 | Theresa May remains prime minister. 12 | She aims to form a minority government, working with the Democratic Unionist Party. 
13 | 14 | The Labour leader does not have to wait until Mrs May has exhausted all her options before he starts trying to put a deal of his own together. 15 | He can hold talks with potential partners at the same time as Mrs May . 16 | They may even be talking to the same people. 17 | 18 | Labour had a majority of three after the 1974 general election - but this had vanished by 1977, and it stayed in power thanks to a "pact" with the Liberal Party. 19 | And John Major's Conservative government started out with a majority of 21 in 1992 but was a minority government by the 1997 general election. 20 | 21 | On 8 May 2013, one week before the Pakistani election, the third author, in his keynote address at the Sentiment Analysis Symposium, forecast the winner of the Pakistani election. 22 | The chart in Figure 1 shows varying sentiment on the candidates for prime minister of Pakistan in that election. 23 | The next day, the BBC’s Owen Bennett Jones , reporting from Islamabad, wrote an article titled “Pakistan Elections: Five Reasons Why the Vote is Unpredictable.” 24 | 25 | At the moment the first deadline is Tuesday 13 June, when the new Parliament meets for the first time. 26 | Mrs May has until this date to put together a deal to keep herself in power or resign, according to official guidance issued by the Cabinet Office. 27 | If she were to resign, Mrs May must be clear that Jeremy Corbyn can form a government and that she can't. 28 | She is entitled to wait until the new Parliament to see if she has the confidence of the House of Commons. 29 | 30 | Japan's parliament has passed a one-off bill to allow Emperor Akihito to abdicate, the first emperor to do so in 200 years. 31 | The 83-year-old said last year that his age and health were making it hard for him to fulfil his official duties. 32 | But there was no provision under existing law for him to stand down. 
33 | The government will now begin the process of arranging his abdication, expected to happen in late 2018, and the handover to Crown Prince Naruhito . 34 | 35 | Germany has called for diplomatic efforts to resolve a growing crisis over Qatar, which is accused by four Arab neighbours of funding terrorism. 36 | Saudi Arabia, the United Arab Emirates (UAE), Egypt and Bahrain cut travel and diplomatic ties with Qatar on Monday. 37 | Speaking after hosting his Qatari counterpart on Friday, German Foreign Minister Sigmar Gabriel called for the "sea and air blockades" to be lifted. 38 | 39 | Mr Gabriel met Saudi Foreign Minister Adil al-Ahmad al-Jubayr two days ago, and said all parties were seeking "to avoid further escalation". 40 | Then on Friday, Mr Gabriel spoke to Qatari Foreign Minister Sheikh Mohammed bin Abdulrahman al-Thani in the northern German town of Wolfenbuettel. 41 | 42 | On Friday Saudi Arabia and its three allies issued a list of 49 people - including Muslim Brotherhood spiritual leader Yusuf al-Qaradawi - and 12 Qatar-backed charities and groups accused of links with militants. 43 | On Thursday, Qatar's Sheikh Mohammed said his country had been isolated "because we are successful and progressive" and called his country "a platform for peace not terrorism". 44 | 45 | US President Donald Trump has urged Nato allies to boost defence spending. 46 | Last month German Chancellor Angela Merkel said Europe could no longer "completely depend" on the US and UK , following the election of President Trump and the triggering of Brexit. 47 | 48 | The UK has long been one of the strongest voices in the EU against any moves towards forming a European army. 49 | The UK says the EU must not duplicate Nato's role as the main pillar of European defence. 50 | However, Mr Trump's criticisms of Nato have raised questions about the US commitment to defending Europe. 
51 | 52 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017. 53 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years. 54 | 55 | The General Administration of Customs revealed on Thursday that China's exports increased by 8.7% year-on-year in May, beating forecasts of a 7.2% increase. 56 | The country's trade surplus now amounts to $40.8bn (£31bn). 57 | 58 | Furthermore, the European Central Bank is scheduled to hold its monetary policy meeting today as well. 59 | Expectations point towards the Bank reaffirming its decision to maintain a loose monetary policy, but spectators will carefully scrutinize ECB President Mario Draghi and his team's rhetoric in order to get an indication of when interest rates could possibly rise. 60 | 61 | James Trescothick , senior global strategist, said there is the potential that Opec's agreement to limit current oil production could collapse. 62 | If that happens oil prices could potentially fall. 63 | However, Qatar is one of the smallest oil producers in Opec, with estimated proven reserves of 25bn barrels which is dwarfed by Saudi Arabia's 266bn barrels. 64 | Trescothick believes the main danger to the region and indeed to oil prices is that increased tension could lead Qatar to reach out even further to Iran for support which would no doubt sour diplomatic relations further. 65 | -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person_organization_location_date.test: -------------------------------------------------------------------------------- 1 | Theresa May has said she will form a government with the support of the Democratic Unionists that can provide "certainty" for the future. 
2 | Speaking after visiting Buckingham Palace , she said only her party had the "legitimacy" to govern after winning the most seats and votes. 3 | In a short statement outside Downing Street , which followed a 25-minute audience with The Queen , Mrs May said she intended to form a government which could "provide certainty and lead Britain forward at this critical time for our country". 4 | 5 | The BBC's Laura Kuenssberg said the PM had returned to No 10 a "diminished figure", having ended up with 12 fewer seats than when she called the election in April . 6 | She had called the election with the stated reason that it would strengthen her hand in negotiations for the UK to leave the EU - the talks are due to start on 19 June . 7 | 8 | The general election has ended in a hung Parliament , where no party has the 326 seats needed to get an overall majority in the House of Commons . 9 | So what happens now? 10 | Who is the prime minister? 11 | Theresa May remains prime minister. 12 | She aims to form a minority government, working with the Democratic Unionist Party . 13 | 14 | The Labour leader does not have to wait until Mrs May has exhausted all her options before he starts trying to put a deal of his own together. 15 | He can hold talks with potential partners at the same time as Mrs May . 16 | They may even be talking to the same people. 17 | 18 | Labour had a majority of three after the 1974 general election - but this had vanished by 1977 , and it stayed in power thanks to a "pact" with the Liberal Party . 19 | And John Major's Conservative government started out with a majority of 21 in 1992 but was a minority government by the 1997 general election. 20 | 21 | On 8 May 2013 , one week before the Pakistani election, the third author, in his keynote address at the Sentiment Analysis Symposium , forecast the winner of the Pakistani election. 22 | The chart in Figure 1 shows varying sentiment on the candidates for prime minister of Pakistan in that election. 
23 | The next day, the BBC’s Owen Bennett Jones , reporting from Islamabad , wrote an article titled “ Pakistan Elections: Five Reasons Why the Vote is Unpredictable.” 24 | 25 | At the moment the first deadline is Tuesday 13 June , when the new Parliament meets for the first time. 26 | Mrs May has until this date to put together a deal to keep herself in power or resign, according to official guidance issued by the Cabinet Office . 27 | If she were to resign, Mrs May must be clear that Jeremy Corbyn can form a government and that she can't. 28 | She is entitled to wait until the new Parliament to see if she has the confidence of the House of Commons . 29 | 30 | Japan's parliament has passed a one-off bill to allow Emperor Akihito to abdicate, the first emperor to do so in 200 years. 31 | The 83-year-old said last year that his age and health were making it hard for him to fulfil his official duties. 32 | But there was no provision under existing law for him to stand down. 33 | The government will now begin the process of arranging his abdication, expected to happen in late 2018 , and the handover to Crown Prince Naruhito . 34 | 35 | Germany has called for diplomatic efforts to resolve a growing crisis over Qatar , which is accused by four Arab neighbours of funding terrorism. 36 | Saudi Arabia , the United Arab Emirates ( UAE ), Egypt and Bahrain cut travel and diplomatic ties with Qatar on Monday . 37 | Speaking after hosting his Qatari counterpart on Friday , German Foreign Minister Sigmar Gabriel called for the "sea and air blockades" to be lifted. 38 | 39 | Mr Gabriel met Saudi Foreign Minister Adil al-Ahmad al-Jubayr two days ago , and said all parties were seeking "to avoid further escalation". 40 | Then on Friday , Mr Gabriel spoke to Qatari Foreign Minister Sheikh Mohammed bin Abdulrahman al-Thani in the northern German town of Wolfenbuettel . 
41 | 42 | On Friday Saudi Arabia and its three allies issued a list of 49 people - including Muslim Brotherhood spiritual leader Yusuf al-Qaradawi - and 12 Qatar-backed charities and groups accused of links with militants. 43 | On Thursday , Qatar's Sheikh Mohammed said his country had been isolated "because we are successful and progressive" and called his country "a platform for peace not terrorism". 44 | 45 | US President Donald Trump has urged Nato allies to boost defence spending. 46 | Last month German Chancellor Angela Merkel said Europe could no longer "completely depend" on the US and UK , following the election of President Trump and the triggering of Brexit. 47 | 48 | The UK has long been one of the strongest voices in the EU against any moves towards forming a European army . 49 | The UK says the EU must not duplicate Nato's role as the main pillar of European defence. 50 | However, Mr Trump's criticisms of Nato have raised questions about the US commitment to defending Europe . 51 | 52 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017 . 53 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years. 54 | 55 | The General Administration of Customs revealed on Thursday that China's exports increased by 8.7% year-on-year in May , beating forecasts of a 7.2% increase. 56 | The country's trade surplus now amounts to $40.8bn (£31bn). 57 | 58 | Furthermore, the European Central Bank is scheduled to hold its monetary policy meeting today as well. 59 | Expectations point towards the Bank reaffirming its decision to maintain a loose monetary policy, but spectators will carefully scrutinize ECB President Mario Draghi and his team's rhetoric in order to get an indication of when interest rates could possibly rise. 
60 | 61 | James Trescothick , senior global strategist, said there is the potential that Opec's agreement to limit current oil production could collapse. 62 | If that happens oil prices could potentially fall. 63 | However, Qatar is one of the smallest oil producers in Opec , with estimated proven reserves of 25bn barrels which is dwarfed by Saudi Arabia's 266bn barrels. 64 | Trescothick believes the main danger to the region and indeed to oil prices is that increased tension could lead Qatar to reach out even further to Iran for support which would no doubt sour diplomatic relations further. 65 | -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-time.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-time.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-pos-maxent.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-pos-maxent.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-sent.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sent.bin -------------------------------------------------------------------------------- 
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sentiment-tweets_toy.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sentiment-tweets_toy.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/en-token.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-token.bin -------------------------------------------------------------------------------- /src/main/resources/com/graphaware/nlp/processor/opennlp/sentiment_tweets.train: -------------------------------------------------------------------------------- 1 | 3 Watching a nice movie 2 | 1 The painting is ugly, will return it tomorrow... 3 | 3 One of the best soccer games, worth seeing it 4 | 3 Very tasty, not only for vegetarians 5 | 3 Super party! 6 | 1 Too early to travel..need a coffee 7 | 1 Damn..the train is late again... 8 | 1 Bad news, my flight just got cancelled. 9 | 3 Happy birthday mr. president 10 | 3 Just watch it. Respect. 11 | 3 Wonderful sunset. 12 | 3 Bravo, first title in 2014! 13 | 1 Had a bad evening, need urgently a beer. 14 | 1 I put on weight again 15 | 3 On today's show we met Angela, a woman with an amazing story 16 | 3 I fell in love again 17 | 1 I lost my keys 18 | 3 On a trip to Iceland 19 | 3 Happy in Berlin 20 | 1 I hate Mondays 21 | 3 Love the new book I reveived for Christmas 22 | 1 He killed our good mood 23 | 3 I am in good spirits again 24 | 3 This guy creates the most awesome pics ever 25 | 1 The dark side of a selfie. 26 | 3 Cool! John is back! 
27 | 3 Many rooms and many hopes for new residents 28 | 1 False hopes for the people attending the meeting 29 | 3 I set my new year's resolution 30 | 1 The ugliest car ever! 31 | 1 Feeling bored 32 | 1 Need urgently a pause 33 | 3 Nice to see Ana made it 34 | 3 My dream came true 35 | 1 I didn't see that one coming 36 | 1 Sorry mate, there is no more room for you 37 | 1 Who could have possibly done this? 38 | 3 I won the challenge 39 | 1 I feel bad for what I did 40 | 3 I had a great time tonight 41 | 3 It was a lot of fun 42 | 3 Thank you Molly making this possible 43 | 1 I just did a big mistake 44 | 3 I love it!! 45 | 1 I never loved so hard in my life 46 | 1 I hate you Mike!! 47 | 1 I hate to say goodbye 48 | 3 Lovely! 49 | 3 Like and share if you feel the same 50 | 1 Never try this at home 51 | 1 Don't spoil it! 52 | 3 I love rock and roll 53 | 1 The more I hear you, the more annoyed I get 54 | 3 Finnaly passed my exam! 55 | 3 Lovely kittens 56 | 1 I just lost my appetite 57 | 1 Sad end for this movie 58 | 1 Lonely, I am so lonely 59 | 3 Beautiful morning 60 | 3 She is amazing 61 | 3 Enjoying some time with my friends 62 | 3 Special thanks to Marty 63 | 3 Thanks God I left on time 64 | 3 Greateful for a wonderful meal 65 | 3 So happy to be home 66 | 1 Hate to wait on a long queue 67 | 1 No cab available 68 | 1 Electricity outage, this is a nightmare 69 | 1 Nobody to ask about directions 70 | 3 Great game! 71 | 3 Nice trip 72 | 3 I just received a pretty flower 73 | 3 Excellent idea 74 | 3 Got a new watch. Feeling happy 75 | 1 I feel sick 76 | 1 I am very tired 77 | 3 Such a good taste 78 | 1 Such a bad taste 79 | 3 Enjoying brunch 80 | 1 I don't recommend this restaurant 81 | 3 Thank you mom for supporting me 82 | 1 I will never ever call you again 83 | 1 I just got kicked out of the contest 84 | 3 Smiling 85 | 1 Big pain to see my team loosing 86 | 1 Bitter defeat tonight 87 | 1 My bike was stollen 88 | 3 Great to see you! 
89 | 1 I lost every hope for seeing him again 90 | 3 Nice dress! 91 | 3 Stop wasting my time 92 | 3 I have a great idea 93 | 3 Excited to go to the pub 94 | 3 Feeling proud 95 | 3 Cute bunnies 96 | 1 Cold winter ahead 97 | 1 Hopless struggle.. 98 | 1 Ugly hat 99 | 3 Big hug and lots of love 100 | 3 I hope you have a wonderful celebration 101 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/OpenNLPIntegrationTest.java: -------------------------------------------------------------------------------- 1 | package com.graphaware.nlp.processor.opennlp; 2 | 3 | import com.graphaware.nlp.NLPIntegrationTest; 4 | import com.graphaware.nlp.dsl.AbstractDSL; 5 | import org.neo4j.kernel.impl.proc.Procedures; 6 | import org.reflections.Reflections; 7 | 8 | import java.util.Set; 9 | 10 | public class OpenNLPIntegrationTest extends NLPIntegrationTest { 11 | 12 | @Override 13 | protected void registerProceduresAndFunctions(Procedures procedures) throws Exception { 14 | super.registerProceduresAndFunctions(procedures); 15 | Reflections reflections = new Reflections("com.graphaware.nlp.dsl"); 16 | Set<Class<? extends AbstractDSL>> cls = reflections.getSubTypesOf(AbstractDSL.class); 17 | for (Class<? extends AbstractDSL> c : cls) { 18 | procedures.registerProcedure(c); 19 | procedures.registerFunction(c); 20 | } 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/OpenNLPPipelineTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * To change this license header, choose License Headers in Project Properties. 3 | * To change this template file, choose Tools | Templates 4 | * and open the template in the editor. 
5 | */ 6 | package com.graphaware.nlp.processor.opennlp; 7 | 8 | import java.io.FileInputStream; 9 | import java.io.IOException; 10 | import java.io.InputStream; 11 | import opennlp.tools.namefind.NameFinderME; 12 | import opennlp.tools.namefind.TokenNameFinderModel; 13 | import opennlp.tools.tokenize.Tokenizer; 14 | import opennlp.tools.tokenize.TokenizerME; 15 | import opennlp.tools.tokenize.TokenizerModel; 16 | import opennlp.tools.util.Span; 17 | import org.junit.After; 18 | import org.junit.AfterClass; 19 | import org.junit.Before; 20 | import org.junit.BeforeClass; 21 | import org.junit.Test; 22 | 23 | /** 24 | * 25 | * @author ale 26 | */ 27 | public class OpenNLPPipelineTest { 28 | 29 | public OpenNLPPipelineTest() { 30 | } 31 | 32 | @BeforeClass 33 | public static void setUpClass() { 34 | } 35 | 36 | @AfterClass 37 | public static void tearDownClass() { 38 | } 39 | 40 | @Before 41 | public void setUp() { 42 | } 43 | 44 | @After 45 | public void tearDown() { 46 | } 47 | 48 | /** 49 | * Test of annotate method, of class OpenNLPPipeline. 50 | */ 51 | @Test 52 | public void testAnnotate() { 53 | String text = "Hello Dralyn. 
Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office."; 54 | OpenNLPAnnotation document = new OpenNLPAnnotation(text); 55 | OpenNLPPipeline instance = new PipelineBuilder() 56 | .tokenize() 57 | /*.extractPos() 58 | .extractRelations()*/ 59 | .build(); 60 | instance.annotate(document); 61 | 62 | document.getSentences().forEach((sentence) -> { 63 | System.out.println(">>>" + sentence.getSentence()); 64 | if (sentence.getPhrasesIndex() != null) { 65 | sentence.getPhrasesIndex().forEach((phrase) -> { 66 | System.out.println(">>>" + sentence.getChunkStrings()[phrase]); 67 | }); 68 | } 69 | }); 70 | } 71 | 72 | @Test 73 | public void testAnnotateNER() { 74 | String sentence = "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office."; 75 | 76 | OpenNLPAnnotation document = new OpenNLPAnnotation(sentence); 77 | OpenNLPPipeline instance = new PipelineBuilder() 78 | .tokenize() 79 | /*.extractPos() 80 | .extractRelations()*/ 81 | .build(); 82 | instance.annotate(document); 83 | 84 | document.getSentences().forEach((item) -> { 85 | item.getTokens().stream().forEach((token) -> { 86 | System.out.println(token.getTokenPOS() + " " + token.getToken() + " " + token.getTokenNEs()); 87 | }); 88 | }); 89 | 90 | InputStream modelInToken = null; 91 | InputStream modelIn = null; 92 | 93 | try { 94 | 95 | //1. convert sentence into tokens 96 | modelInToken = this.getClass().getResourceAsStream("en-token.bin"); 97 | TokenizerModel modelToken = new TokenizerModel(modelInToken); 98 | Tokenizer tokenizer = new TokenizerME(modelToken); 99 | String tokens[] = tokenizer.tokenize(sentence); 100 | 101 | //2.
find names 102 | modelIn = this.getClass().getResourceAsStream("en-ner-person.bin"); 103 | TokenNameFinderModel model = new TokenNameFinderModel(modelIn); 104 | NameFinderME nameFinder = new NameFinderME(model); 105 | 106 | Span nameSpans[] = nameFinder.find(tokens); 107 | 108 | //find probabilities for names 109 | double[] spanProbs = nameFinder.probs(nameSpans); 110 | 111 | //3. print names; a Span's end index is exclusive, so join every covered token instead of assuming two-token names 112 | for (int i = 0; i < nameSpans.length; i++) { 113 | System.out.println("Span: " + nameSpans[i].toString()); 114 | System.out.println("Covered text is: " + String.join(" ", java.util.Arrays.copyOfRange(tokens, nameSpans[i].getStart(), nameSpans[i].getEnd()))); 115 | System.out.println("Probability is: " + spanProbs[i]); 116 | } 117 | } catch (Exception ex) { 118 | ex.printStackTrace(); 119 | } finally { 120 | try { 121 | if (modelInToken != null) { 122 | modelInToken.close(); 123 | } 124 | } catch (IOException e) { 125 | e.printStackTrace(); 126 | } 127 | try { 128 | if (modelIn != null) { 129 | modelIn.close(); 130 | } 131 | } catch (IOException e) { 132 | e.printStackTrace(); 133 | } 134 | } 135 | } 136 | 137 | @Test 138 | public void testStopWordsAnnotate() { 139 | String text = "Hello Dralyn.
Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office."; 140 | OpenNLPAnnotation document = new OpenNLPAnnotation(text); 141 | OpenNLPPipeline instance = new PipelineBuilder() 142 | .tokenize() 143 | .customStopWordAnnotator("hello,is,and,of,the,to") 144 | /*.extractPos() 145 | .extractRelations()*/ 146 | .build(); 147 | instance.annotate(document); 148 | 149 | document.getSentences().forEach((sentence) -> { 150 | System.out.println(">>>" + sentence.getSentence()); 151 | if (sentence.getTokens() != null) { 152 | sentence.getTokens().forEach((token) -> { 153 | System.out.print(" " + token); 154 | }); 155 | System.out.print("\n "); 156 | } 157 | }); 158 | } 159 | 160 | } 161 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/TestOpenNLP.java: -------------------------------------------------------------------------------- 1 | //package com.graphaware.nlp.processor.opennlp; 2 | // 3 | //import com.graphaware.nlp.processor.opennlp.OpenNLPPhraseProcessor; 4 | //import org.junit.Test; 5 | //import org.slf4j.Logger; 6 | //import org.slf4j.LoggerFactory; 7 | // 8 | //import java.util.List; 9 | // 10 | ///** 11 | // * Created by michael kilgore on 10/2/16. 
12 | // * Beginning test class for OpenNLP Phrases 13 | // */ 14 | //public class TestOpenNLP 15 | //{ 16 | // private static final Logger LOG = LoggerFactory.getLogger(TestOpenNLP.class); 17 | // 18 | // public TestOpenNLP() 19 | // { 20 | // } 21 | // 22 | // @Test 23 | // public void testPhrase() 24 | // { 25 | // LOG.info("starting test"); 26 | // String workingDir = System.getProperty("user.dir"); 27 | // OpenNLPPhraseProcessor openNLP = new OpenNLPPhraseProcessor(); 28 | // openNLP.init(workingDir+"/"); 29 | // List phrases = openNLP.processForPhrases("Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office."); 30 | // for (String phrase : phrases) 31 | // LOG.info(phrase); 32 | // } 33 | //} 34 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/TextProcessorTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright (c) 2013-2016 GraphAware 3 | * 4 | * This file is part of the GraphAware Framework. 5 | * 6 | * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of 7 | * the GNU General Public License as published by the Free Software Foundation, either 8 | * version 3 of the License, or (at your option) any later version. 9 | * 10 | * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 11 | * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 12 | * See the GNU General Public License for more details. You should have received a copy of 13 | * the GNU General Public License along with this program. If not, see 14 | * .
15 | */ 16 | package com.graphaware.nlp.processor.opennlp; 17 | 18 | import com.graphaware.nlp.domain.AnnotatedText; 19 | import com.graphaware.nlp.domain.Sentence; 20 | import com.graphaware.nlp.domain.Tag; 21 | import com.graphaware.nlp.dsl.request.PipelineSpecification; 22 | import com.graphaware.nlp.processor.AbstractTextProcessor; 23 | import com.graphaware.nlp.processor.TextProcessor; 24 | import com.graphaware.nlp.util.ServiceLoader; 25 | import com.graphaware.nlp.util.TestAnnotatedText; 26 | import com.graphaware.test.integration.EmbeddedDatabaseIntegrationTest; 27 | 28 | import java.util.Collections; 29 | import java.util.HashMap; 30 | import java.util.Map; 31 | 32 | import org.junit.BeforeClass; 33 | import org.junit.Test; 34 | import org.neo4j.graphdb.Node; 35 | import org.neo4j.graphdb.QueryExecutionException; 36 | import org.neo4j.graphdb.ResourceIterator; 37 | import org.neo4j.graphdb.Result; 38 | import org.neo4j.graphdb.Transaction; 39 | 40 | import static com.graphaware.nlp.util.TagUtils.newTag; 41 | import static org.junit.Assert.assertEquals; 42 | import static org.junit.Assert.assertFalse; 43 | import static org.junit.Assert.assertNull; 44 | 45 | public class TextProcessorTest extends OpenNLPIntegrationTest { 46 | 47 | private static TextProcessor textProcessor; 48 | private static final String TEXT_PROCESSOR = "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor"; 49 | private static PipelineSpecification PIPELINE_DEFAULT; 50 | 51 | @BeforeClass 52 | public static void init() { 53 | textProcessor = ServiceLoader.loadTextProcessor(TEXT_PROCESSOR); 54 | textProcessor.init(); 55 | Map processingSteps = new HashMap<>(); 56 | processingSteps.put(AbstractTextProcessor.STEP_TOKENIZE, true); 57 | processingSteps.put(AbstractTextProcessor.STEP_NER, true); 58 | PipelineSpecification pipelineSpecification = new PipelineSpecification("default", OpenNLPTextProcessor.class.getName(), processingSteps, null, 1L, Collections.emptyList(), 
Collections.emptyList()); 59 | PIPELINE_DEFAULT = pipelineSpecification; 60 | textProcessor.createPipeline(PIPELINE_DEFAULT); 61 | } 62 | 63 | @Test 64 | public void testAnnotatedText() { 65 | AnnotatedText annotatedText = textProcessor.annotateText("On 8 May 2013, " 66 | + "one week before the Pakistani election, the third author, " 67 | + "in his keynote address at the Sentiment Analysis Symposium, " 68 | + "forecast the winner of the Pakistani election. The chart " 69 | + "in Figure 1 shows varying sentiment on the candidates for " 70 | + "prime minister of Pakistan in that election. The next day, " 71 | + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote " 72 | + "an article titled “Pakistan Elections: Five Reasons Why the " 73 | + "Vote is Unpredictable,”1 in which he claimed that the election " 74 | + "was too close to call. It was not, and despite his being in Pakistan, " 75 | + "the outcome of the election was exactly as we predicted.", "en", PIPELINE_DEFAULT); 76 | 77 | TestAnnotatedText test = new TestAnnotatedText(annotatedText); 78 | test.assertSentencesCount(4); 79 | test.assertTagsCountInSentence(15, 0); 80 | test.assertTagsCountInSentence(11, 1); 81 | test.assertTagsCountInSentence(22, 2);//(24, 2); // it's 22 because `"Pakistan` & `"1` are not lemmatized by OpenNLP and checkLemmaIsValid() removes non-lemmatized version because of symbols `"` 82 | test.assertTagsCountInSentence(8, 3);//(9, 3); // it's 8 because OpenNLP has "be" among stopwords 83 | 84 | test.assertTag(newTag("pakistan", Collections.singletonList("LOCATION"), Collections.emptyList())); 85 | test.assertTag(newTag("show", Collections.emptyList(), Collections.singletonList("VBZ"))); 86 | 87 | } 88 | 89 | @Test 90 | public void testLemmaLowerCasing() { 91 | AnnotatedText annotateText = textProcessor.annotateText( 92 | "Collibra’s Data Governance Innovation: Enabling Data as a Strategic Asset", 93 | "en", PIPELINE_DEFAULT); 94 | 95 | assertEquals(1, 
annotateText.getSentences().size()); 96 | assertEquals("governance", annotateText.getSentences().get(0).getTagOccurrence(16).getLemma()); 97 | } 98 | 99 | private void checkLocation(String location) throws QueryExecutionException { 100 | try (Transaction tx = getDatabase().beginTx()) { 101 | ResourceIterator rowIterator = getTagsIterator(location); 102 | Node pakistanNode = (Node) rowIterator.next(); 103 | assertFalse(rowIterator.hasNext()); 104 | String[] neList = (String[]) pakistanNode.getProperty("ne"); 105 | assertEquals(neList[0], "location"); 106 | tx.success(); 107 | } 108 | } 109 | 110 | private void checkVerb(String verb) throws QueryExecutionException { 111 | try (Transaction tx = getDatabase().beginTx()) { 112 | ResourceIterator rowIterator = getTagsIterator(verb); 113 | Node pakistanNode = (Node) rowIterator.next(); 114 | assertFalse(rowIterator.hasNext()); 115 | String[] posL = (String[]) pakistanNode.getProperty("pos"); 116 | assertEquals(posL[0], "VBZ"); 117 | tx.success(); 118 | } 119 | } 120 | 121 | private ResourceIterator getTagsIterator(String value) throws QueryExecutionException { 122 | Map params = new HashMap<>(); 123 | params.put("value", value); 124 | Result pakistan = getDatabase().execute("MATCH (n:Tag {value: {value}}) return n", params); 125 | ResourceIterator rowIterator = pakistan.columnAs("n"); 126 | return rowIterator; 127 | } 128 | 129 | @Test 130 | public void testAnnotatedTag() { 131 | Tag annotateTag = textProcessor.annotateTag("winners", "en", PIPELINE_DEFAULT); 132 | assertEquals(annotateTag.getLemma(), "winner"); 133 | } 134 | 135 | // @Test 136 | // public void testAnnotationAndConcept() { 137 | // // ConceptNet5Importer.Builder() - arguments need fixing 138 | // /*TextProcessor textProcessor = ServiceLoader.loadTextProcessor("com.graphaware.nlp.processor.stanford.StanfordTextProcessor"); 139 | // ConceptNet5Importer conceptnet5Importer = new ConceptNet5Importer.Builder("http://conceptnet5.media.mit.edu/data/5.4", 
textProcessor) 140 | // .build(); 141 | // String text = "Say hi to Christophe"; 142 | // AnnotatedText annotateText = textProcessor.annotateText(text, 1, 0, "en", false); 143 | // List nodes = new ArrayList<>(); 144 | // try (Transaction beginTx = getDatabase().beginTx()) { 145 | // Node annotatedNode = annotateText.storeOnGraph(getDatabase(), false); 146 | // Map params = new HashMap<>(); 147 | // params.put("id", annotatedNode.getId()); 148 | // Result queryRes = getDatabase().execute("MATCH (n:AnnotatedText)-[*..2]->(t:Tag) where id(n) = {id} return t", params); 149 | // ResourceIterator tags = queryRes.columnAs("t"); 150 | // while (tags.hasNext()) { 151 | // Node tag = tags.next(); 152 | // nodes.add(tag); 153 | // List conceptTags = conceptnet5Importer.importHierarchy(Tag.createTag(tag), "en"); 154 | // conceptTags.stream().forEach((newTag) -> { 155 | // nodes.add(newTag.storeOnGraph(getDatabase(), false)); 156 | // }); 157 | // } 158 | // beginTx.success(); 159 | // }*/ 160 | // } 161 | 162 | //@Test 163 | public void testSentiment() { 164 | AnnotatedText annotateText = textProcessor.annotateText( 165 | "I really hate to study at Stanford, " 166 | + "it was a waste of time, I'll never be there again", "en", PIPELINE_DEFAULT); 167 | assertEquals(1, annotateText.getSentences().size()); 168 | assertEquals(0, annotateText.getSentences().get(0).getSentiment()); 169 | 170 | annotateText = textProcessor.annotateText( 171 | "It was really horrible to study at Stanford", "en", PIPELINE_DEFAULT); 172 | assertEquals(1, annotateText.getSentences().size()); 173 | assertEquals(1, annotateText.getSentences().get(0).getSentiment()); 174 | 175 | annotateText = textProcessor.annotateText("I studied at Stanford", "en", PIPELINE_DEFAULT); 176 | assertEquals(1, annotateText.getSentences().size()); 177 | assertEquals(2, annotateText.getSentences().get(0).getSentiment()); 178 | 179 | annotateText = textProcessor.annotateText("I liked to study at Stanford", "en", 
PIPELINE_DEFAULT); 180 | assertEquals(1, annotateText.getSentences().size()); 181 | assertEquals(3, annotateText.getSentences().get(0).getSentiment()); 182 | 183 | annotateText = textProcessor.annotateText( 184 | "I liked so much to study at Stanford, I enjoyed my time there, I would recommend every body", 185 | "en", PIPELINE_DEFAULT); 186 | assertEquals(1, annotateText.getSentences().size()); 187 | assertEquals(4, annotateText.getSentences().get(0).getSentiment()); 188 | } 189 | 190 | @Test 191 | public void testAnnotatedTextWithPosition() { 192 | AnnotatedText annotateText = textProcessor.annotateText("On 8 May 2013, " 193 | + "one week before the Pakistani election, the third author, " 194 | + "in his keynote address at the Sentiment Analysis Symposium, " 195 | + "forecast the winner of the Pakistani election. The chart " 196 | + "in Figure 1 shows varying sentiment on the candidates for " 197 | + "prime minister of Pakistan in that election. The next day, " 198 | + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote " 199 | + "an article titled “Pakistan Elections: Five Reasons Why the " 200 | + "Vote is Unpredictable,”1 in which he claimed that the election " 201 | + "was too close to call. 
It was not, and despite his being in Pakistan, " 202 | + "the outcome of the election was exactly as we predicted.", "en", PIPELINE_DEFAULT); 203 | 204 | assertEquals(4, annotateText.getSentences().size()); 205 | Sentence sentence1 = annotateText.getSentences().get(0); 206 | assertEquals(15, sentence1.getTags().size()); 207 | 208 | assertNull(sentence1.getTagOccurrence(0)); 209 | assertEquals("8", sentence1.getTagOccurrence(3).getLemma()); 210 | assertEquals("may 2013", sentence1.getTagOccurrence(5).getLemma()); 211 | assertEquals("May 2013", sentence1.getTagOccurrences().get(5).get(0).getValue()); 212 | assertEquals("one", sentence1.getTagOccurrence(15).getLemma()); 213 | assertEquals("before", sentence1.getTagOccurrence(24).getLemma()); 214 | assertEquals("third", sentence1.getTagOccurrence(59).getLemma()); 215 | //assertEquals("sentiment analysis symposium", sentence1.getTagOccurrence(103).getLemma()); 216 | assertEquals("forecast", sentence1.getTagOccurrence(133).getLemma()); 217 | assertNull(sentence1.getTagOccurrence(184)); 218 | 219 | Sentence sentence2 = annotateText.getSentences().get(1); 220 | assertEquals("show", sentence2.getTagOccurrence(22).getLemma()); 221 | assertEquals("shows", sentence2.getTagOccurrences().get(22).get(0).getValue()); 222 | // assertTrue(sentence1.getPhraseOccurrence(99).contains(new Phrase("the Sentiment Analysis Symposium"))); 223 | // assertTrue(sentence1.getPhraseOccurrence(103).contains(new Phrase("Sentiment"))); 224 | // assertTrue(sentence1.getPhraseOccurrence(113).contains(new Phrase("Analysis"))); 225 | // 226 | // //his(76)-> the third author(54) 227 | // assertTrue(sentence1.getPhraseOccurrence(55).get(1).getContent().equalsIgnoreCase("the third author")); 228 | // Sentence sentence2 = annotateText.getSentences().get(1); 229 | // assertEquals("chart", sentence2.getTagOccurrence(184).getLemma()); 230 | // assertEquals("Figure", sentence2.getTagOccurrence(193).getLemma()); 231 | } 232 | 233 | @Test 234 | public void 
testAnnotatedShortText() { 235 | AnnotatedText annotateText = textProcessor.annotateText( 236 | "Fixing Batch Endpoint Logging Problem", "en", PIPELINE_DEFAULT); 237 | 238 | assertEquals(1, annotateText.getSentences().size()); 239 | // 240 | // GraphPersistence peristence = new LocalGraphDatabase(getDatabase()); 241 | // peristence.persistOnGraph(annotateText, false); 242 | 243 | } 244 | 245 | @Test 246 | public void testAnnotatedShortText2() { 247 | AnnotatedText annotateText = textProcessor.annotateText( 248 | "Importing CSV data does nothing", "en", PIPELINE_DEFAULT); 249 | assertEquals(1, annotateText.getSentences().size()); 250 | // GraphPersistence peristence = new LocalGraphDatabase(getDatabase()); 251 | // peristence.persistOnGraph(annotateText, false); 252 | } 253 | } 254 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/conceptnet5/ConceptNet5ImporterTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright (c) 2013-2016 GraphAware 3 | * 4 | * This file is part of the GraphAware Framework. 5 | * 6 | * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of 7 | * the GNU General Public License as published by the Free Software Foundation, either 8 | * version 3 of the License, or (at your option) any later version. 9 | * 10 | * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 11 | * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 12 | * See the GNU General Public License for more details. You should have received a copy of 13 | * the GNU General Public License along with this program. If not, see 14 | * . 
15 | */ 16 | package com.graphaware.nlp.processor.opennlp.conceptnet5; 17 | 18 | import com.graphaware.nlp.domain.Tag; 19 | import com.graphaware.nlp.processor.TextProcessor; 20 | import com.graphaware.nlp.util.ServiceLoader; 21 | import java.util.Arrays; 22 | import java.util.List; 23 | import static org.junit.Assert.assertEquals; 24 | import org.junit.Test; 25 | 26 | public class ConceptNet5ImporterTest { 27 | 28 | private static final String TEXT_PROCESSOR = "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor"; 29 | 30 | public ConceptNet5ImporterTest() { 31 | } 32 | 33 | // /** 34 | // * Test of importHierarchy method, of class ConceptNet5Importer. 35 | // */ 36 | //// @Test 37 | //// public void testImportHierarchy() { 38 | //// TextProcessor textProcessor = ServiceLoader.loadTextProcessor(TEXT_PROCESSOR); 39 | //// //ConceptNet5Importer instance = new ConceptNet5Importer.Builder("http://conceptnet5.media.mit.edu/data/5.4", textProcessor).build(); 40 | //// ConceptNet5Importer instance = new ConceptNet5Importer.Builder("http://api.conceptnet.io").build(); 41 | //// String lang = "en"; 42 | //// Tag source = textProcessor.annotateTag("circuit", lang); 43 | //// List result = instance.importHierarchy(source, lang, true, 2, textProcessor, Arrays.asList("IsA"), Arrays.asList("NN")); 44 | //// assertEquals(4, result.size()); 45 | //// //assertEquals(expResult, result); 46 | //// // TODO review the generated test code and remove the default call to fail. 
47 | //// //fail("The test case is a prototype."); 48 | // } 49 | 50 | } 51 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/model/CustomSentimentModelIntegrationTest.java: -------------------------------------------------------------------------------- 1 | package com.graphaware.nlp.processor.opennlp.model; 2 | 3 | import com.graphaware.nlp.processor.opennlp.OpenNLPIntegrationTest; 4 | import org.junit.Test; 5 | 6 | import java.util.Map; 7 | 8 | import static org.junit.Assert.*; 9 | 10 | public class CustomSentimentModelIntegrationTest extends OpenNLPIntegrationTest { 11 | 12 | //@Test 13 | public void testTrainCustomModelWithProcedure() { 14 | String p = getClass().getClassLoader().getResource("import/sentiment_tweets.train").getPath(); 15 | String q = "CALL ga.nlp.processor.train({textProcessor: \"com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor\", modelIdentifier: \"component-en\", alg: \"sentiment\", inputFile: \""+p+"\" , lang: \"en\"})"; 16 | executeInTransaction(q, (result -> { 17 | assertTrue(result.hasNext()); 18 | })); 19 | 20 | String addPipelineQuery = "CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', name: 'customSentiment', processingSteps: {tokenize: true, ner: true, sentiment: true, dependency: false, customSentiment: \"component-en\"}})"; 21 | executeInTransaction(addPipelineQuery, emptyConsumer()); 22 | 23 | String insertQ = "CREATE (tweet:Tweet) SET tweet.text = \"African American unemployment is the lowest ever recorded in our country. The Hispanic unemployment rate dropped a full point in the last year and is close to the lowest in recorded history. 
Dems did nothing for you but get your vote!\"\n" + 24 | "WITH tweet\n" + 25 | "CALL ga.nlp.annotate({text:tweet.text, id:id(tweet), pipeline:\"customSentiment\", checkLanguage:false})\n" + 26 | "YIELD result\n" + 27 | "MERGE (tweet)-[:HAS_ANNOTATED_TEXT]->(result)"; 28 | executeInTransaction(insertQ, emptyConsumer()); 29 | executeInTransaction("MATCH (n:Sentence) RETURN ANY(x IN labels(n) WHERE x IN ['Positive','Very Positive','Neutral']) AS hasSentiment", (result -> { 30 | assertTrue(result.hasNext()); 31 | while (result.hasNext()) { 32 | Map record = result.next(); 33 | assertTrue((boolean) record.get("hasSentiment")); 34 | } 35 | })); 36 | } 37 | 38 | } 39 | -------------------------------------------------------------------------------- /src/test/java/com/graphaware/nlp/processor/opennlp/procedure/ProcedureTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright (c) 2013-2016 GraphAware 3 | * 4 | * This file is part of the GraphAware Framework. 5 | * 6 | * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of 7 | * the GNU General Public License as published by the Free Software Foundation, either 8 | * version 3 of the License, or (at your option) any later version. 9 | * 10 | * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 11 | * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 12 | * See the GNU General Public License for more details. You should have received a copy of 13 | * the GNU General Public License along with this program. If not, see 14 | * . 
15 | */ 16 | package com.graphaware.nlp.processor.opennlp.procedure; 17 | 18 | import com.graphaware.nlp.module.NLPConfiguration; 19 | import com.graphaware.nlp.module.NLPModule; 20 | import com.graphaware.nlp.processor.opennlp.OpenNLPIntegrationTest; 21 | import com.graphaware.runtime.GraphAwareRuntime; 22 | import com.graphaware.runtime.GraphAwareRuntimeFactory; 23 | import com.graphaware.test.integration.GraphAwareIntegrationTest; 24 | import java.util.HashMap; 25 | import java.util.List; 26 | import java.util.Map; 27 | import static org.junit.Assert.assertEquals; 28 | import static org.junit.Assert.assertFalse; 29 | import static org.junit.Assert.assertTrue; 30 | import org.junit.Test; 31 | import org.neo4j.graphdb.Node; 32 | import org.neo4j.graphdb.ResourceIterator; 33 | import org.neo4j.graphdb.Result; 34 | import org.neo4j.graphdb.Transaction; 35 | 36 | public class ProcedureTest extends OpenNLPIntegrationTest { 37 | 38 | 39 | private static final String TEXT = "On 8 May 2013, " 40 | + "one week before the Pakistani election, the third author, " 41 | + "in his keynote address at the Sentiment Analysis Symposium, " 42 | + "forecast the winner of the Pakistani election. The chart " 43 | + "in Figure 1 shows varying sentiment on the candidates for " 44 | + "prime minister of Pakistan in that election. The next day, " 45 | + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote " 46 | + "an article titled “Pakistan Elections: Five Reasons Why the " 47 | + "Vote is Unpredictable,”1 in which he claimed that the election " 48 | + "was too close to call. It was not, and despite his being in Pakistan, " 49 | + "the outcome of the election was exactly as we predicted."; 50 | 51 | private static final String TEXT_IT = "Questo è un semplice testo in italiano"; 52 | private static final String TEXT_FR = "Ceci est un texte simple en français"; 53 | 54 | private static final String SHORT_TEXT_1 = "You knew China's cities were growing. 
But the real numbers are stunning http://wef.ch/29IxY7w #China"; 55 | private static final String SHORT_TEXT_2 = "Globalization for the 99%: can we make it work for all?"; 56 | private static final String SHORT_TEXT_3 = "This organisation increased productivity, happiness and trust with just one change http://wef.ch/29PeKxF "; 57 | private static final String SHORT_TEXT_4 = "In pictures: The high-tech villages that live off the grid http://wef.ch/29xuRh8 "; 58 | private static final String SHORT_TEXT_5 = "The 10 countries best prepared for the new digital economy http://wef.ch/2a8DNug "; 59 | private static final String SHORT_TEXT_6 = "This is how to limit damage to the #euro after #Brexit, say economists http://wef.ch/29GGVzG "; 60 | private static final String SHORT_TEXT_7 = "The office jobs that could see you earning nearly 50% less than some of your co-workers http://wef.ch/29P9biE "; 61 | private static final String SHORT_TEXT_8 = "Which nationalities have the best quality of life? http://wef.ch/29uDfwV"; 62 | private static final String SHORT_TEXT_9 = "It’s 9,000km away, but #Brexit has hit #Japan hard http://wef.ch/29P92eQ #economics"; 63 | private static final String SHORT_TEXT_10 = "Which is the world’s fastest-growing large economy? 
Clue: it’s not #China http://wef.ch/29xuXFd #economics"; 64 | 65 | //@Test 66 | public void overallTest() { 67 | GraphAwareRuntime gaRuntime = GraphAwareRuntimeFactory.createRuntime(getDatabase()); 68 | gaRuntime.registerModule(new NLPModule("NLP", NLPConfiguration.defaultConfiguration(), getDatabase())); 69 | gaRuntime.start(); 70 | gaRuntime.waitUntilStarted(); 71 | testAnnotatedText(); 72 | // clean(); 73 | // testAnnotatedTextWithSentiment(); 74 | // clean(); 75 | // testAnnotatedTextAndSentiment(); 76 | // clean(); 77 | // testAnnotatedTextOnMultiple(); 78 | // clean(); 79 | // testConceptText(); 80 | // clean(); 81 | // testLanguageDetection(); 82 | // clean(); 83 | // testSupportedLanguage(); 84 | // clean(); 85 | // testFilter(); 86 | // clean(); 87 | // testGetProceduresManagement(); 88 | // clean(); 89 | } 90 | 91 | private void clean() { 92 | try (Transaction tx = getDatabase().beginTx()) { 93 | getDatabase().execute("MATCH (n) DETACH DELETE n"); 94 | tx.success(); 95 | } 96 | } 97 | 98 | public void testAnnotatedText() { 99 | try (Transaction tx = getDatabase().beginTx()) { 100 | String id = "id1"; 101 | Map params = new HashMap<>(); 102 | params.put("value", TEXT); 103 | params.put("id", id); 104 | Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 105 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 106 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 107 | + "return result", params); 108 | ResourceIterator rowIterator = news.columnAs("result"); 109 | assertTrue(rowIterator.hasNext()); 110 | Node resultNode = (Node) rowIterator.next(); 111 | assertEquals(resultNode.getProperty("id"), id); 112 | params.clear(); 113 | params.put("id", id); 114 | Result tags = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence)-[:HAS_TAG]->(result:Tag) RETURN result", params); 115 | rowIterator = 
tags.columnAs("result"); 116 | assertTrue(rowIterator.hasNext()); 117 | 118 | Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params); 119 | rowIterator = sentences.columnAs("result"); 120 | assertTrue(rowIterator.hasNext()); 121 | int countSentence = 0; 122 | while (rowIterator.hasNext()) { 123 | List next = (List) rowIterator.next(); 124 | assertEquals(next.size(), 1); 125 | countSentence++; 126 | } 127 | 128 | sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:FIRST_SENTENCE|NEXT_SENTENCE*..]->(s:Sentence) RETURN labels(s) as result", params); 129 | rowIterator = sentences.columnAs("result"); 130 | assertTrue(rowIterator.hasNext()); 131 | int newCountSentence = 0; 132 | while (rowIterator.hasNext()) { 133 | List next = (List) rowIterator.next(); 134 | assertEquals(next.size(), 1); 135 | newCountSentence++; 136 | } 137 | assertEquals(countSentence, newCountSentence); 138 | tx.success(); 139 | } 140 | } 141 | 142 | public void testAnnotatedTextWithSentiment() { 143 | try (Transaction tx = getDatabase().beginTx()) { 144 | String id = "id1"; 145 | Map params = new HashMap<>(); 146 | params.put("value", TEXT); 147 | params.put("id", id); 148 | Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 149 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', pipeline: \"tokenizerAndSentiment\"}) YIELD result\n" 150 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 151 | + "return result", params); 152 | ResourceIterator rowIterator = news.columnAs("result"); 153 | assertTrue(rowIterator.hasNext()); 154 | Node resultNode = (Node) rowIterator.next(); 155 | assertEquals(resultNode.getProperty("id"), id); 156 | params.clear(); 157 | params.put("id", id); 158 | Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id:
{id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params); 159 | rowIterator = sentences.columnAs("result"); 160 | assertTrue(rowIterator.hasNext()); 161 | while (rowIterator.hasNext()) { 162 | List next = (List) rowIterator.next(); 163 | assertEquals(next.size(), 2); 164 | } 165 | tx.success(); 166 | } 167 | } 168 | 169 | public void testAnnotatedTextAndSentiment() { 170 | try (Transaction tx = getDatabase().beginTx()) { 171 | String id = "id1"; 172 | Map params = new HashMap<>(); 173 | params.put("value", TEXT); 174 | params.put("id", id); 175 | Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 176 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, store: true, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 177 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 178 | + "return result", params); 179 | ResourceIterator rowIterator = news.columnAs("result"); 180 | assertTrue(rowIterator.hasNext()); 181 | Node resultNode = (Node) rowIterator.next(); 182 | assertEquals(resultNode.getProperty("id"), id); 183 | params.clear(); 184 | params.put("id", id); 185 | Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}}) WITH a " 186 | + "CALL ga.nlp.sentiment({node:a}) YIELD result " 187 | + "MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) " 188 | + "return labels(s) as labels", params); 189 | rowIterator = sentences.columnAs("labels"); 190 | assertTrue(rowIterator.hasNext()); 191 | int i = 0; 192 | while (rowIterator.hasNext()) { 193 | List next = (List) rowIterator.next(); 194 | assertEquals(next.size(), 2); 195 | i++; 196 | } 197 | assertEquals(4, i); 198 | //Execute again for checking the number of sentences 199 | sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}}) WITH a " 200 | + "CALL ga.nlp.sentiment({node:a}) YIELD result " 201 | + "MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) " 202 | + "return labels(s) as labels", params); 203 
| rowIterator = sentences.columnAs("labels"); 204 | assertTrue(rowIterator.hasNext()); 205 | i = 0; 206 | while (rowIterator.hasNext()) { 207 | List next = (List) rowIterator.next(); 208 | assertEquals(next.size(), 2); 209 | i++; 210 | } 211 | assertEquals(4, i); 212 | tx.success(); 213 | } 214 | } 215 | 216 | public void testAnnotatedTextOnMultiple() { 217 | try (Transaction tx = getDatabase().beginTx()) { 218 | String id = "id1"; 219 | Map params = new HashMap<>(); 220 | params.put("value", SHORT_TEXT_1); 221 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 222 | 223 | params.put("value", SHORT_TEXT_2); 224 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 225 | 226 | params.put("value", SHORT_TEXT_3); 227 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 228 | 229 | params.put("value", SHORT_TEXT_4); 230 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 231 | 232 | params.put("value", SHORT_TEXT_5); 233 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 234 | 235 | params.put("value", SHORT_TEXT_6); 236 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 237 | 238 | params.put("value", SHORT_TEXT_7); 239 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 240 | 241 | params.put("value", SHORT_TEXT_8); 242 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 243 | 244 | params.put("value", SHORT_TEXT_9); 245 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 246 | 247 | params.put("value", SHORT_TEXT_10); 248 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 249 | 250 | getDatabase().execute("MERGE (n:Tweet {id:1})", params); 251 | 252 | //Test for filter based on language 253 | params.put("value", TEXT_IT); 254 | getDatabase().execute("MERGE (n:Tweet {text: {value}})", params); 255 | 256 | Result sentences = getDatabase().execute("MATCH (a:Tweet) WITH a\n" 257 | + "WITH collect(a) AS aa\n" 258 
| + "UNWIND aa AS a\n" 259 | + "CALL ga.nlp.annotate({text:a.text, id: id(a), textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result WITH result as at " 260 | + "MERGE (a)-[:HAS_ANNOTATED_TEXT]->(at) WITH at " 261 | + "MATCH (at)-[:CONTAINS_SENTENCE]->(result) " 262 | + "RETURN result", params); 263 | ResourceIterator rowIterator = sentences.columnAs("result"); 264 | assertTrue(rowIterator.hasNext()); 265 | int i = 0; 266 | while (rowIterator.hasNext()) { 267 | rowIterator.next(); 268 | i++; 269 | } 270 | assertEquals(13, i); 271 | tx.success(); 272 | } 273 | } 274 | 275 | public void testConceptText() { 276 | try (Transaction tx = getDatabase().beginTx()) { 277 | String id = "id1"; 278 | Map params = new HashMap<>(); 279 | params.put("value", TEXT); 280 | params.put("id", id); 281 | Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 282 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 283 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 284 | + "return result", params); 285 | ResourceIterator rowIterator = news.columnAs("result"); 286 | assertTrue(rowIterator.hasNext()); 287 | Node resultNode = (Node) rowIterator.next(); 288 | assertEquals(resultNode.getProperty("id"), id); 289 | params.clear(); 290 | params.put("id", id); 291 | Result tags = getDatabase().execute( 292 | "MATCH (a:AnnotatedText) " 293 | + "CALL ga.nlp.concept({node:a, depth: 2}) YIELD result\n" 294 | + "return result;", params); 295 | rowIterator = tags.columnAs("result"); 296 | //assertTrue(rowIterator.hasNext()); 297 | tx.success(); 298 | } 299 | } 300 | 301 | public void testLanguageDetection() { 302 | try (Transaction tx = getDatabase().beginTx()) { 303 | Map params = new HashMap<>(); 304 | params.put("value", TEXT); 305 | Result result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n" 306 | + "return result", 
params); 307 | ResourceIterator rowIterator = result.columnAs("result"); 308 | assertTrue(rowIterator.hasNext()); 309 | String resultNode = (String) rowIterator.next(); 310 | assertEquals("en", resultNode); 311 | 312 | params.put("value", TEXT_IT); 313 | result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n" 314 | + "return result", params); 315 | rowIterator = result.columnAs("result"); 316 | assertTrue(rowIterator.hasNext()); 317 | resultNode = (String) rowIterator.next(); 318 | assertEquals("it", resultNode); 319 | 320 | params.put("value", TEXT_FR); 321 | result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n" 322 | + "return result", params); 323 | rowIterator = result.columnAs("result"); 324 | assertTrue(rowIterator.hasNext()); 325 | resultNode = (String) rowIterator.next(); 326 | assertEquals("fr", resultNode); 327 | 328 | tx.success(); 329 | } 330 | } 331 | 332 | public void testSupportedLanguage() { 333 | try (Transaction tx = getDatabase().beginTx()) { 334 | String id = "id1"; 335 | Map params = new HashMap<>(); 336 | params.put("value", TEXT_IT); 337 | params.put("id", id); 338 | Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 339 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 340 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 341 | + "return result", params); 342 | ResourceIterator rowIterator = news.columnAs("result"); 343 | assertFalse(rowIterator.hasNext()); 344 | tx.success(); 345 | } 346 | } 347 | 348 | public void testFilter() { 349 | try (Transaction tx = getDatabase().beginTx()) { 350 | String id = "id1"; 351 | Map params = new HashMap<>(); 352 | params.put("value", TEXT); 353 | params.put("filter", "Owen Bennett Jones/PERSON"); 354 | Result news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n" 355 | + "return 
result", params); 356 | ResourceIterator rowIterator = news.columnAs("result"); 357 | assertTrue(rowIterator.hasNext()); 358 | Boolean resultNode = (Boolean) rowIterator.next(); 359 | assertEquals(true, resultNode); 360 | 361 | params.clear(); 362 | params.put("value", SHORT_TEXT_1); 363 | params.put("filter", "China/PERSON"); 364 | news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n" 365 | + "return result", params); 366 | rowIterator = news.columnAs("result"); 367 | assertTrue(rowIterator.hasNext()); 368 | resultNode = (Boolean) rowIterator.next(); 369 | assertEquals(false, resultNode); 370 | 371 | 372 | params.clear(); 373 | params.put("value", TEXT); 374 | params.put("filter", "Owen Bennett Jones/PERSON, BBC, Pakistan/LOCATION"); 375 | news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n" 376 | + "return result", params); 377 | rowIterator = news.columnAs("result"); 378 | assertTrue(rowIterator.hasNext()); 379 | resultNode = (Boolean) rowIterator.next(); 380 | assertEquals(true, resultNode); 381 | tx.success(); 382 | } 383 | } 384 | 385 | public void testGetProceduresManagement() { 386 | try (Transaction tx = getDatabase().beginTx()) { 387 | Result res = getDatabase().execute("CALL ga.nlp.getProcessors() YIELD class\n" 388 | + "return class"); 389 | ResourceIterator rowIterator = res.columnAs("class"); 390 | assertTrue(rowIterator.hasNext()); 391 | String resultNode = (String) rowIterator.next(); 392 | assertEquals("com.graphaware.nlp.processor.stanford.StanfordTextProcessor", resultNode); 393 | tx.success(); 394 | } 395 | try (Transaction tx = getDatabase().beginTx()) { 396 | Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor'}) YIELD result\n" 397 | + "return result"); 398 | ResourceIterator rowIterator = res.columnAs("result"); 399 | assertTrue(rowIterator.hasNext()); 400 | // String 
resultNode = (String) rowIterator.next(); 401 | // assertEquals("com.graphaware.nlp.processor.stanford.StanfordTextProcessor", resultNode); 402 | tx.success(); 403 | } 404 | 405 | try (Transaction tx = getDatabase().beginTx()) { 406 | Result res = getDatabase().execute("CALL ga.nlp.addPipeline({" 407 | + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', " 408 | + "name: 'testPipe', " 409 | + "stopWords: 'class,instance,issue', " 410 | + "threadNumber: 5}) " 411 | + "YIELD result\n" 412 | + "return result"); 413 | ResourceIterator rowIterator = res.columnAs("result"); 414 | assertTrue(rowIterator.hasNext()); 415 | String resultNode = (String) rowIterator.next(); 416 | assertEquals("succeess", resultNode); 417 | tx.success(); 418 | } 419 | 420 | try (Transaction tx = getDatabase().beginTx()) { 421 | Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 422 | + "return result"); 423 | ResourceIterator rowIterator = res.columnAs("result"); 424 | 425 | boolean found = false; 426 | while (rowIterator.hasNext()) { 427 | String resultNode = (String) rowIterator.next(); 428 | if (resultNode.equalsIgnoreCase("testPipe")) { 429 | found = true; 430 | } 431 | } 432 | assertTrue(found); 433 | tx.success(); 434 | } 435 | 436 | try (Transaction tx = getDatabase().beginTx()) { 437 | String id = "id1"; 438 | Map params = new HashMap<>(); 439 | params.put("value", TEXT); 440 | params.put("id", id); 441 | Result res = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n" 442 | + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', pipeline: 'testPipe'}) YIELD result\n" 443 | + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n" 444 | + "return result", params); 445 | ResourceIterator rowIterator = res.columnAs("result"); 446 | assertTrue(rowIterator.hasNext()); 447 | Node resultNode = 
(Node) rowIterator.next(); 448 | assertEquals(resultNode.getProperty("id"), id); 449 | params.clear(); 450 | params.put("id", id); 451 | Result tags = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence)-[:HAS_TAG]->(result:Tag) RETURN result", params); 452 | rowIterator = tags.columnAs("result"); 453 | assertTrue(rowIterator.hasNext()); 454 | 455 | Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params); 456 | rowIterator = sentences.columnAs("result"); 457 | assertTrue(rowIterator.hasNext()); 458 | int countSentence = 0; 459 | while (rowIterator.hasNext()) { 460 | List next = (List) rowIterator.next(); 461 | assertEquals(1, next.size()); 462 | countSentence++; 463 | } 464 | 465 | sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:FIRST_SENTENCE|NEXT_SENTENCE*..]->(s:Sentence) RETURN labels(s) as result", params); 466 | rowIterator = sentences.columnAs("result"); 467 | assertTrue(rowIterator.hasNext()); 468 | int newCountSentence = 0; 469 | while (rowIterator.hasNext()) { 470 | List next = (List) rowIterator.next(); 471 | assertEquals(next.size(), 1); 472 | newCountSentence++; 473 | } 474 | assertEquals(countSentence, newCountSentence); 475 | tx.success(); 476 | } 477 | 478 | try (Transaction tx = getDatabase().beginTx()) { 479 | Result res = getDatabase().execute("CALL ga.nlp.removePipeline({" 480 | + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', " 481 | + "pipeline: 'testPipe'}) " 482 | + "YIELD result\n" 483 | + "return result"); 484 | ResourceIterator rowIterator = res.columnAs("result"); 485 | assertTrue(rowIterator.hasNext()); 486 | String resultNode = (String) rowIterator.next(); 487 | assertEquals("succeess", resultNode); 488 | tx.success(); 489 | } 490 | 491 | try (Transaction tx = getDatabase().beginTx()) { 492 | Result res = getDatabase().execute("CALL 
ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 493 | + "return result"); 494 | ResourceIterator rowIterator = res.columnAs("result"); 495 | 496 | boolean found = false; 497 | while (rowIterator.hasNext()) { 498 | String resultNode = (String) rowIterator.next(); 499 | if (resultNode.equalsIgnoreCase("testPipe")) { 500 | found = true; 501 | } 502 | } 503 | assertTrue(!found); 504 | tx.success(); 505 | } 506 | 507 | try (Transaction tx = getDatabase().beginTx()) { 508 | Result res = getDatabase().execute("CALL ga.nlp.addPipeline({" 509 | + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', " 510 | + "name: 'testPipe', " 511 | + "stopWords: 'class,instance,issue', " 512 | + "threadNumber: 5}) " 513 | + "YIELD result\n" 514 | + "return result"); 515 | ResourceIterator rowIterator = res.columnAs("result"); 516 | assertTrue(rowIterator.hasNext()); 517 | String resultNode = (String) rowIterator.next(); 518 | assertEquals("succeess", resultNode); 519 | tx.success(); 520 | } 521 | 522 | try (Transaction tx = getDatabase().beginTx()) { 523 | Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n" 524 | + "return result"); 525 | ResourceIterator rowIterator = res.columnAs("result"); 526 | 527 | boolean found = false; 528 | while (rowIterator.hasNext()) { 529 | String resultNode = (String) rowIterator.next(); 530 | if (resultNode.equalsIgnoreCase("testPipe")) { 531 | found = true; 532 | } 533 | } 534 | assertTrue(found); 535 | tx.success(); 536 | } 537 | } 538 | } 539 | -------------------------------------------------------------------------------- /src/test/resources/import/sentiment_tweets.train: -------------------------------------------------------------------------------- 1 | 3 Watching a nice movie 2 | 1 The painting is ugly, will return it tomorrow... 
3 | 3 One of the best soccer games, worth seeing it 4 | 3 Very tasty, not only for vegetarians 5 | 3 Super party! 6 | 1 Too early to travel..need a coffee 7 | 1 Damn..the train is late again... 8 | 1 Bad news, my flight just got cancelled. 9 | 3 Happy birthday mr. president 10 | 3 Just watch it. Respect. 11 | 3 Wonderful sunset. 12 | 3 Bravo, first title in 2014! 13 | 1 Had a bad evening, need urgently a beer. 14 | 1 I put on weight again 15 | 3 On today's show we met Angela, a woman with an amazing story 16 | 3 I fell in love again 17 | 1 I lost my keys 18 | 3 On a trip to Iceland 19 | 3 Happy in Berlin 20 | 1 I hate Mondays 21 | 3 Love the new book I reveived for Christmas 22 | 1 He killed our good mood 23 | 3 I am in good spirits again 24 | 3 This guy creates the most awesome pics ever 25 | 1 The dark side of a selfie. 26 | 3 Cool! John is back! 27 | 3 Many rooms and many hopes for new residents 28 | 1 False hopes for the people attending the meeting 29 | 3 I set my new year's resolution 30 | 1 The ugliest car ever! 31 | 1 Feeling bored 32 | 1 Need urgently a pause 33 | 3 Nice to see Ana made it 34 | 3 My dream came true 35 | 1 I didn't see that one coming 36 | 1 Sorry mate, there is no more room for you 37 | 1 Who could have possibly done this? 38 | 3 I won the challenge 39 | 1 I feel bad for what I did 40 | 3 I had a great time tonight 41 | 3 It was a lot of fun 42 | 3 Thank you Molly making this possible 43 | 1 I just did a big mistake 44 | 3 I love it!! 45 | 1 I never loved so hard in my life 46 | 1 I hate you Mike!! 47 | 1 I hate to say goodbye 48 | 3 Lovely! 49 | 3 Like and share if you feel the same 50 | 1 Never try this at home 51 | 1 Don't spoil it! 52 | 3 I love rock and roll 53 | 1 The more I hear you, the more annoyed I get 54 | 3 Finnaly passed my exam! 
55 | 3 Lovely kittens 56 | 1 I just lost my appetite 57 | 1 Sad end for this movie 58 | 1 Lonely, I am so lonely 59 | 3 Beautiful morning 60 | 3 She is amazing 61 | 3 Enjoying some time with my friends 62 | 3 Special thanks to Marty 63 | 3 Thanks God I left on time 64 | 3 Greateful for a wonderful meal 65 | 3 So happy to be home 66 | 1 Hate to wait on a long queue 67 | 1 No cab available 68 | 1 Electricity outage, this is a nightmare 69 | 1 Nobody to ask about directions 70 | 3 Great game! 71 | 3 Nice trip 72 | 3 I just received a pretty flower 73 | 3 Excellent idea 74 | 3 Got a new watch. Feeling happy 75 | 1 I feel sick 76 | 1 I am very tired 77 | 3 Such a good taste 78 | 1 Such a bad taste 79 | 3 Enjoying brunch 80 | 1 I don't recommend this restaurant 81 | 3 Thank you mom for supporting me 82 | 1 I will never ever call you again 83 | 1 I just got kicked out of the contest 84 | 3 Smiling 85 | 1 Big pain to see my team loosing 86 | 1 Bitter defeat tonight 87 | 1 My bike was stollen 88 | 3 Great to see you! 89 | 1 I lost every hope for seeing him again 90 | 3 Nice dress! 91 | 3 Stop wasting my time 92 | 3 I have a great idea 93 | 3 Excited to go to the pub 94 | 3 Feeling proud 95 | 3 Cute bunnies 96 | 1 Cold winter ahead 97 | 1 Hopless struggle.. 98 | 1 Ugly hat 99 | 3 Big hug and lots of love 100 | 3 I hope you have a wonderful celebration 101 | --------------------------------------------------------------------------------