├── .gitignore
├── README.md
├── pom.xml
└── src
    ├── main
        ├── java
        │   └── com
        │   │   └── graphaware
        │   │       └── nlp
        │   │           └── processor
        │   │               └── opennlp
        │   │                   ├── OpenNLPAnnotation.java
        │   │                   ├── OpenNLPPipeline.java
        │   │                   ├── OpenNLPTextProcessor.java
        │   │                   ├── PipelineBuilder.java
        │   │                   └── model
        │   │                       ├── NERModelTool.java
        │   │                       ├── OpenNLPGenericModelTool.java
        │   │                       └── SentimentModelTool.java
        └── resources
        │   └── com
        │       └── graphaware
        │           └── nlp
        │               └── processor
        │                   └── opennlp
        │                       ├── en-chunker.bin
        │                       ├── en-lemmatizer.dict
        │                       ├── en-ner-date.bin
        │                       ├── en-ner-location.bin
        │                       ├── en-ner-money.bin
        │                       ├── en-ner-organization.bin
        │                       ├── en-ner-percentage.bin
        │                       ├── en-ner-percentage_money.test
        │                       ├── en-ner-person.bin
        │                       ├── en-ner-person.test
        │                       ├── en-ner-person_organization_location_date.test
        │                       ├── en-ner-time.bin
        │                       ├── en-pos-maxent.bin
        │                       ├── en-sent.bin
        │                       ├── en-sentiment-tweets_toy.bin
        │                       ├── en-token.bin
        │                       └── sentiment_tweets.train
    └── test
        ├── java
            └── com
            │   └── graphaware
            │       └── nlp
            │           └── processor
            │               └── opennlp
            │                   ├── OpenNLPIntegrationTest.java
            │                   ├── OpenNLPPipelineTest.java
            │                   ├── TestOpenNLP.java
            │                   ├── TextProcessorTest.java
            │                   ├── conceptnet5
            │                       └── ConceptNet5ImporterTest.java
            │                   ├── model
            │                       └── CustomSentimentModelIntegrationTest.java
            │                   └── procedure
            │                       └── ProcedureTest.java
        └── resources
            └── import
                └── sentiment_tweets.train


/.gitignore:
--------------------------------------------------------------------------------
1 | target/
2 | *iml
3 | dependency-reduced-pom.xml
4 | **/.DS_Store
5 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | GraphAware Neo4j NLP - OpenNLP - RETIRED
  2 | ==========================
  3 | 
  4 | ## GraphAware Neo4j NLP - OpenNLP Has Been Retired
  5 | As of May 2021, this [repository has been retired](https://graphaware.com/framework/2021/05/06/from-graphaware-framework-to-graphaware-hume.html).
  6 | 
  7 | ---
  8 | 
  9 | GraphAware NLP Using OpenNLP
 10 | ==========================================
 11 | 
 12 | Getting the Software
 13 | ---------------------
 14 | 
 15 | ### Server Mode
 16 | When using Neo4j in the <a href="http://docs.neo4j.org/chunked/stable/server-installation.html" target="_blank">standalone server</a> mode, you will need the <a href="https://github.com/graphaware/neo4j-framework" target="_blank">GraphAware Neo4j Framework</a> and <a href="https://github.com/graphaware/neo4j-nlp" target="_blank">GraphAware NLP</a> .jar files (both of which you can download here) dropped into the plugins directory of your Neo4j installation. Finally, the following needs to be appended to the `neo4j.conf` file in the `config/` directory:
 17 | 
 18 | ```
 19 |   dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
 20 |   com.graphaware.runtime.enabled=true
 21 | 
 22 |   com.graphaware.module.NLP.2=com.graphaware.nlp.module.NLPBootstrapper
 23 | ```
 24 | 
 25 | ### For Developers
 26 | This package is an extention of the <a href="https://github.com/graphaware/neo4j-nlp" target="_blank">GraphAware NLP</a>, which therefore needs to be packaged and installed beforehand. No other dependencies required.
 27 | 
 28 | ```
 29 |   cd neo4j-nlp
 30 |   mvn clean install
 31 |   cp target/graphaware-nlp-1.0-SNAPSHOT.jar <YOUR_NEO4J_DIR>/plugins
 32 | 
 33 |   cd ../neo4j-nlp-opennlp
 34 |   mvn clean package
 35 |   cp target/nlp-opennlp-1.0.0-SNAPSHOT.jar <YOUR_NEO4J_DIR>/plugins
 36 | ```
 37 | 
 38 | 
 39 | Introduction and How-To
 40 | -------------------------
 41 | 
 42 | The Apache OpenNLP library provides basic features for processing natural language text: sentence segmentation, tokenization, lemmatization, part-of-speach tagging, named entities identification, chunking, parsing and sentiment analysis. OpenNLP is implemented by extending the general <a href="https://github.com/graphaware/neo4j-nlp" target="_blank">GraphAware NLP</a> package with extra parameters:
 43 | 
 44 | ### Tag Extraction / Annotations
 45 | ```
 46 | #Annotate the news
 47 | MATCH (n:News)
 48 | CALL ga.nlp.annotate({text:n.text, id: n.uuid, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", pipeline: "tokenizer"}) YIELD result
 49 | MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)
 50 | RETURN n, result
 51 | ```
 52 | 
 53 | Available parameters are:
 54 |   * the same ones as described in <a href="https://github.com/graphaware/neo4j-nlp" target="_blank">parent class GraphAware NLP</a>
 55 |   * `sentimentProbabilityThr` (optional, default *0.7*): if assigned sentiment label has confidence smaller than this threshold, set sentiment to *Neutral*
 56 |   * `customProject` (optional): add user trained/provided models associated with specified project, see paragraph *Customizing pipeline models*
 57 | 
 58 | Available pipelines:
 59 |   * `tokenizer` - tokenization, lemmatization, stop-words removal, part-of-speach tagging (POS), named entity recognition (NER)
 60 |   * `sentiment` - tokenization, sentiment analysis
 61 |   * `tokenizerAndSentiment` - tokenization, lemmatization, stop-words removal, POS tagging, NER, sentiment analysis
 62 |   * `phrase` (not supported yet) - tokenization, stop-words removal, relations, sentiment analysis
 63 | 
 64 | ### Sentiment Analysis
 65 | The current implementation of a sentiment analysis is just a toy - it relies on a file with 100 labeled twitter samples which are used to build a model when Neo4j starts (general recommendation for number of training samples is 10k and more). The current model supports only three options - Positive, Neutral, Negative - which are chosen based on the highest probability (the algorithm returns an array of probabilities for each category). If the highest probability is less then 70% (default value which can be customized by using parameter *sentimentProbabilityThr*), the category is not regarded trustworthy and is set to Neutral instead.
 66 | 
 67 | The sentiment analysis can be run either as part of the annotation (see paragraph above) or as an independent procedure (see command below) which takes in AnnotatedText nodes, analyzes all attached sentences and adds to them a label corresponding to its sentiment.
 68 | 
 69 | ```
 70 | MATCH (a:AnnotatedText {id: {id}})
 71 | CALL ga.nlp.sentiment({node:a, textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor"}) YIELD result
 72 | MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) 
 73 | RETURN labels(s) as labels
 74 | ```
 75 | 
 76 | 
 77 | 
 78 | 
 79 | ## BETA
 80 | ### Customizing pipeline models
 81 | To add new customized model (currenlty NER and Sentiment), one can do it via Cypher:
 82 | ```
 83 | CALL ga.nlp.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "component-en", alg: "sentiment", inputFile: "<path_to_your_training_file>" [, lang: "en", trainingParameters: {...<training-parameters-of-choice>...}]})
 84 | ```
 85 |   * `alg` (case insensitive) specifies which algorithm is about to be trained; currently available algs: `NER`, `sentiment`
 86 |   * `modelIdentifier` is an arbitrary string that provides a unique identifier of the model that you want to train (will be used for e.g. saving it into .bin file)
 87 |   * `inputFile` is path to the training data file
 88 |   * `lang` (default is "en") specifies the language
 89 |   * `textProcessor` - desired text processor
 90 |   * **training parameters** (defined in `com.graphaware.nlp.util.GenericModelParameters`) are optional and are not universal (some might be specific to only certain Text Processor):
 91 |     * *iter* - number of iterations
 92 |     * *cutoff* - useful for reducing the size of n-gram models, it's a threashold for n-gram occurrences/frequences in the training dataset
 93 |     * *threads* - provides support for multi-threading
 94 |     * *entityType* - name type to use for NER training, by default all entities (classes such as "Person", "Date", ...) present in provided training file are used
 95 |     * *nFolds* - parameter for cross-validation procedure (default is 10), see paragraph *Validation*
 96 |     * *trainerAlg* - specific for OpenNLP
 97 |     * *trainerType* - specific for OpenNLP
 98 | 
 99 | The trained model is saved to a binary file in Neo4j's `import/` directory with name format: `<alg>-<modelIdentifier>.bin`, so no need to train the same model again when you restart Neo4j. Cross-validation method is used to evaluate the model, see paragraph *Validation*.
100 |   * `NER` - default models (Person, Location, Organization, Date, Time, Money, Percentage) plus all registered customized models are used when invoking `ga.nlp.annotate()` (see example below)
101 |   * `Sentiment` - sentiment analysis is run only once (user-trained one has a priority over the default one)
102 | 
103 | **Training/testing example:**
104 | Trainig:
105 | ```
106 | CALL ga.nlp.processor.train({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.train", trainingParameters: {iter: 10}})
107 | ```
108 | 
109 | Testing the new model:
110 | ```
111 | CALL ga.nlp.processor.test({textProcessor: "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor", modelIdentifier: "test", alg: "sentiment", inputFile: "/Users/doctor-who/Documents/workspace/datasets/sentiment_tweets.test})
112 | ```
113 | 
114 | **Usage of new models:**
115 | To use custom models, one needs to assign them to a pipeline, for example:
116 | 
117 | ```
118 | CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', name: 'customPipeline', processingSteps: {tokenize: true, ner: true, dependency: false, customSentiment: <modelIdentifier>}})
119 | ```
120 |   * `customSentiment` - string value wich is the identifier that you chose for your custom model
121 |   * `customNER` - string value which is the identifier that you chose for your custom model; if you want to use more models, separate them by ",", for example: `customNER: "component-en,chemical-en,testing-model"`
122 | 
123 | ```
124 | # Example of a text to analyze
125 | CREATE (l:Lesson {lesson: "Power system distribution at Kennedy Space Center (KSC) consists primarily of high-voltage, underground cables. These cables include approximately 5000 splices.ľ Splice failures result in arc flash events that are extremely hazardous to personnel in the vicinity of the arc flash. Some construction and maintenance tasks cannot be performed effectively in the required personal protective equipment (PPE), and de-energizing the cables is not feasible due to cost, lost productivity, and safety risk to others implementing the required outages. To verify alternate and effective mitigations, arc flash testing was conducted in a controlled environment. The arc flash effects were greater than expected. Testing also demonstrated the addition of neutral grounding resistors (NGRs) would result in substantial reductions to arc flash effects. As a result, NGRs are being installed on KSC primary substation transformers. The presence of the NGRs, enable usage of less cumbersome PPE.  Laboratory testing revealed higher than anticipated safety risks from a potential arc-flash event in a manhole environment when conducted at KSCęs unreduced fault current levels.ľ The safety risks included bright flash, excessive sound, and smoke.ľľ Due to these findings and absence of other mitigations installed at the time, manhole entries require full arc-flash PPE.ľ Furthermore, manhole entries were temporarily restricted to short duration inspections until further mitigations could be implemented.ľ With installation of neutral grounding resistors (NGRs) on substation transformers, the flash, sound and flame energy was reduced.ľ The hazard reduction was so substantial that the required PPE would be less cumbersome and enable effective performance of maintenance tasks in the energized configuration."})
126 | 
127 | WITH l
128 | 
129 | # Annotate it and use newly trained NER model(s)
130 | CALL ga.nlp.annotate({text:l.lesson, id: l.uuid, pipeline: "customPipeline"}) YIELD result
131 | MERGE (l)-[:HAS_ANNOTATED_TEXT]->(result)
132 | RETURN l, result;
133 | ```
134 | 
135 | **Format of training datasets:**
136 |   * `NER`
137 |     * one sentence per line
138 |     * one empty line between two different texts (paragraphs)
139 |     * there must be a space before and after each `<START:my_category>` and `<END>` statement
140 |     * training data must not contain HTML symbols (such as `H<sub>2</sub>O`); **TO DO:** check whether text on which NER model is deployed needs to be manually deprived of HTML symbols or whether they are ignored automatically
141 |     * Example: categories "person", "organization", "location"):
142 |     ```
143 |     <START:person> Theresa May <END> has said she will form a government with the support of the <START:organization> Democratic Unionists <END> that can provide "certainty" for the future.
144 |     Speaking after visiting <START:location> Buckingham Palace <END> , she said only her party had the "legitimacy" to govern after winning the most seats and votes.
145 |     In a short statement outside <START:location> Downing Street <END> , which followed a 25-minute audience with <START:person> The Queen <END> , Mrs <START:person> May <END> said she intended to form a government which could "provide certainty and lead <START:location> Britain <END> forward at this critical time for our country".
146 | 
147 |     The <START:organization> Cabinet Office <END> revealed on <START:date> Wednesday <END> that <START:location> Japan's <END> GDP grew by 0.3% during the <START:date> first quarter of 2017 <END> .
148 |     Although the reading missed a forecast of 0.6% growth, <START:location> Japan's <END> economy continued to expand in five consecutive quarters, the country's highest streak in three years.
149 |     ```
150 |   * `sentiment` - two columns separated by a white space (tab): the first column is a category as integer (0=VeryNegative, 1=Negative, 2=Neutral, 3=Positive, 4=VeryPositive), the second column is a sentence; example:
151 |     ```
152 |     3   Watching a nice movie
153 |     1   The painting is ugly, will return it tomorrow...
154 |     3   One of the best soccer games, worth seeing it
155 |     3   Very tasty, not only for vegetarians
156 |     1   Damn..the train is late again...
157 |     ```
158 | 
159 | **Validation/testing:**
160 | 
161 | Evaluation of the new model is performed automatically when invoking procedure `ga.nlp.train()`. The evaluation of the new model is performed using OpenNLP cross-validation method: validation runs *n*-fold times on the same training file, but each time selecting different set of trainig and testing data with the sample size ratio of *train:test = (n-1):1*. Validation measures (Precision, Recall, F-Measure) are pooled together and returned to the user as a result.
162 | 
163 | The following procedure can be invoked to test already existing models:
164 | ```
165 | CALL ga.nlp.test({[project: "myXYProject",] alg: "NER", model: "location", file: "<path_to_your_training_file>" [, lang: "en"]})
166 | ```
167 | Parameters
168 |   * `project` (optional) allows to specify which of the existing models we want to test (otherwise it uses the default)
169 |   * `alg` (case insensitive) specifies which algorithm is about to be trained; currently available algs: `NER`, `sentiment`
170 |   * `model` is an arbitrary string that provides, in combination with `alg` (and with `project` if it's specified), a unique identifier of the model that you want to 
171 |   * `file` is a path to the test file
172 |   * `lang` specified the language
173 | 
174 | 


--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" encoding="UTF-8"?>
  2 | <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  3 |   <modelVersion>4.0.0</modelVersion>
  4 | 
  5 |   <groupId>com.graphaware.neo4j</groupId>
  6 |   <artifactId>nlp-opennlp</artifactId>
  7 |   <version>3.3.2.52.7-SNAPSHOT</version>
  8 | 
  9 |   <licenses>
 10 |     <license>
 11 |       <name>GNU General Public License, version 3</name>
 12 |       <url>http://www.gnu.org/licenses/gpl-3.0.txt</url>
 13 |       <distribution>repo</distribution>
 14 |     </license>
 15 |   </licenses>
 16 | 
 17 |   <name>GraphAware OpenNLP Integration</name>
 18 |   <description>OpenNLP integration into GraphAware NLP</description>
 19 |   <url>https://graphaware.com</url>
 20 | 
 21 |   <scm>
 22 |     <connection>scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git</connection>
 23 |     <developerConnection>scm:git:git@github.com:graphaware/neo4j-nlp-opennlp.git</developerConnection>
 24 |     <url>git@github.com:graphaware/neo4j-nlp-opennlp.git</url>
 25 |     <tag>HEAD</tag>
 26 |   </scm>
 27 | 
 28 |   <developers>
 29 |     <developer>
 30 |       <id>alenegro</id>
 31 |       <name>Alessandro Negro</name>
 32 |       <email>alessandro@graphaware.com</email>
 33 |     </developer>
 34 |     <developer>
 35 |       <id>ikwattro</id>
 36 |       <name>Christophe Willemsen</name>
 37 |       <email>christophe@graphaware.com</email>
 38 |     </developer>
 39 |     <developer>
 40 |       <id>vlasta-kus</id>
 41 |       <name>Vlastimil Kus</name>
 42 |       <email>vlasta@graphaware.com</email>
 43 |     </developer>
 44 |     <developer>
 45 |       <id>graphaware</id>
 46 |       <name>GraphAware</name>
 47 |       <email>nlp@graphaware.com</email>
 48 |     </developer>
 49 |   </developers>
 50 | 
 51 |   <inceptionYear>2015</inceptionYear>
 52 | 
 53 |   <issueManagement>
 54 |     <system>GitHub</system>
 55 |     <url>https://github.com/graphaware/neo4j-nlp-opennlp/issues</url>
 56 |   </issueManagement>
 57 | 
 58 |   <organization>
 59 |     <name>Graph Aware Limited</name>
 60 |     <url>https://graphaware.com</url>
 61 |   </organization>
 62 | 
 63 |   <properties>
 64 |       <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
 65 |       <open.nlp.version>1.9.0</open.nlp.version>
 66 |       <graphaware.version>3.4.7.52</graphaware.version>
 67 |       <resttest.version>${graphaware.version}.18</resttest.version>
 68 |       <neo4j.version>3.4.7</neo4j.version>
 69 |       <maven.compiler.source>1.8</maven.compiler.source>
 70 |       <maven.compiler.target>1.8</maven.compiler.target>
 71 |       <nlp.version>3.4.9.52.16-SNAPSHOT</nlp.version>
 72 |   </properties>
 73 | 
 74 |   <dependencies>
 75 |     <dependency>
 76 |       <groupId>com.graphaware.neo4j</groupId>
 77 |       <artifactId>nlp</artifactId>
 78 |       <version>${nlp.version}</version>
 79 |       <scope>provided</scope>
 80 |     </dependency>
 81 | 
 82 |     <dependency>
 83 |       <groupId>org.apache.opennlp</groupId>
 84 |       <artifactId>opennlp-tools</artifactId>
 85 |       <version>${open.nlp.version}</version>
 86 |     </dependency>
 87 |     <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
 88 |     <dependency>
 89 |       <groupId>org.slf4j</groupId>
 90 |       <artifactId>slf4j-api</artifactId>
 91 |       <version>1.7.21</version>
 92 |     </dependency>
 93 |     <!-- https://mvnrepository.com/artifact/junit/junit -->
 94 |     <dependency>
 95 |       <groupId>junit</groupId>
 96 |       <artifactId>junit</artifactId>
 97 |       <version>4.12</version>
 98 |     </dependency>
 99 |     <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->
100 |     <dependency>
101 |       <groupId>org.slf4j</groupId>
102 |       <artifactId>slf4j-simple</artifactId>
103 |       <version>1.7.21</version>
104 |     </dependency>
105 | 
106 |     <dependency>
107 |       <groupId>junit</groupId>
108 |       <artifactId>junit</artifactId>
109 |       <version>4.12</version>
110 |       <scope>test</scope>
111 |     </dependency>
112 | 
113 |     <dependency>
114 |       <groupId>com.graphaware.neo4j</groupId>
115 |       <artifactId>runtime</artifactId>
116 |       <version>${graphaware.version}</version>
117 |       <scope>test</scope>
118 |     </dependency>
119 | 
120 |     <dependency>
121 |       <groupId>com.graphaware.neo4j</groupId>
122 |       <artifactId>server</artifactId>
123 |       <version>${graphaware.version}</version>
124 |       <scope>test</scope>
125 |     </dependency>
126 | 
127 |     <dependency>
128 |       <groupId>com.graphaware.neo4j</groupId>
129 |       <artifactId>tests</artifactId>
130 |       <version>${graphaware.version}</version>
131 |       <scope>test</scope>
132 |     </dependency>
133 | 
134 |     <dependency>
135 |       <groupId>com.graphaware.neo4j</groupId>
136 |       <artifactId>resttest</artifactId>
137 |       <scope>test</scope>
138 |       <version>${resttest.version}</version>
139 |     </dependency>
140 | 
141 |     <dependency>
142 |       <groupId>com.graphaware.neo4j</groupId>
143 |       <artifactId>nlp</artifactId>
144 |       <version>${nlp.version}</version>
145 |       <type>test-jar</type>
146 |       <scope>test</scope>
147 |     </dependency>
148 | 
149 |     <dependency>
150 |       <groupId>com.sun.jersey</groupId>
151 |       <artifactId>jersey-server</artifactId>
152 |       <version>1.19.1</version>
153 |       <scope>test</scope>
154 |     </dependency>
155 | 
156 |   </dependencies>
157 | 
158 | 
159 |   <distributionManagement>
160 |     <snapshotRepository>
161 |       <id>ossrh</id>
162 |       <url>https://oss.sonatype.org/content/repositories/snapshots</url>
163 |     </snapshotRepository>
164 |     <repository>
165 |       <id>ossrh</id>
166 |       <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
167 |     </repository>
168 |   </distributionManagement>
169 | 
170 |   <profiles>
171 |     <profile>
172 |       <id>release</id>
173 |       <activation>
174 |         <property>
175 |           <name>performRelease</name>
176 |           <value>true</value>
177 |         </property>
178 |       </activation>
179 |       <build>
180 |         <plugins>
181 |           <plugin>
182 |             <groupId>org.apache.maven.plugins</groupId>
183 |             <artifactId>maven-gpg-plugin</artifactId>
184 |             <version>1.5</version>
185 |             <executions>
186 |               <execution>
187 |                 <id>sign-artifacts</id>
188 |                 <phase>verify</phase>
189 |                 <goals>
190 |                   <goal>sign</goal>
191 |                 </goals>
192 |               </execution>
193 |             </executions>
194 |           </plugin>
195 |         </plugins>
196 |       </build>
197 |     </profile>
198 |   </profiles>
199 | 
200 |   <build>
201 |     <plugins>
202 |       <plugin>
203 |         <artifactId>maven-compiler-plugin</artifactId>
204 |         <version>3.5.1</version>
205 |         <configuration>
206 |           <source>1.8</source>
207 |           <target>1.8</target>
208 |         </configuration>
209 |       </plugin>
210 |       <plugin>
211 |         <artifactId>maven-shade-plugin</artifactId>
212 |         <version>2.4.3</version>
213 |         <executions>
214 |           <execution>
215 |             <phase>package</phase>
216 |             <goals>
217 |               <goal>shade</goal>
218 |             </goals>
219 |             <!--configuration>
220 |               <minimizeJar>true</minimizeJar>
221 |             </configuration-->
222 |           </execution>
223 |         </executions>
224 |       </plugin>
225 |       <plugin>
226 |         <groupId>org.apache.maven.plugins</groupId>
227 |         <artifactId>maven-surefire-plugin</artifactId>
228 |         <version>2.20</version>
229 |         <configuration>
230 |           <argLine>-Xmx8g</argLine>
231 |         </configuration>
232 |       </plugin>
233 |     </plugins>
234 |   </build>
235 | 
236 | </project>
237 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPAnnotation.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * To change this license header, choose License Headers in Project Properties.
  3 |  * To change this template file, choose Tools | Templates
  4 |  * and open the template in the editor.
  5 |  */
  6 | package com.graphaware.nlp.processor.opennlp;
  7 | 
  8 | import com.graphaware.nlp.util.OptionalNLPParameters;
  9 | import java.util.ArrayList;
 10 | import java.util.Arrays;
 11 | import java.util.Collection;
 12 | import java.util.HashMap;
 13 | import java.util.HashSet;
 14 | import java.util.List;
 15 | import java.util.Map;
 16 | import java.util.Set;
 17 | import java.util.stream.Collectors;
 18 | import opennlp.tools.util.Span;
 19 | 
 20 | public class OpenNLPAnnotation {
 21 | 
 22 |     private static final double DEFAULT_SENTIMENT_PROBTHR = 0.7;
 23 | 
 24 |     private final String text;
 25 |     private List<Sentence> sentences;
 26 |     public static final String DEFAULT_LEMMA_OPEN_NLP = "O";
 27 |     public Map<String, String> otherParams;
 28 | 
 29 |     public OpenNLPAnnotation(String text, Map<String, String> otherParams) {
 30 |         this.text = text;
 31 |         this.otherParams = otherParams;
 32 |     }
 33 | 
 34 |     public OpenNLPAnnotation(String text) {
 35 |         this(text, null);
 36 |     }
 37 | 
 38 |     public String getText() {
 39 |         return text;
 40 |     }
 41 | 
 42 |     public void setSentences(Span[] sentencesArray) {
 43 |         sentences = new ArrayList<>();
 44 |         for (Span sentence : sentencesArray) {
 45 |             sentences.add(new Sentence(sentence, getText()));
 46 |         }
 47 |     }
 48 | 
 49 |     public List<Sentence> getSentences() {
 50 |         return sentences;
 51 |     }
 52 | 
 53 |     public double getSentimentProb() {
 54 |         if (otherParams != null && otherParams.containsKey(OptionalNLPParameters.SENTIMENT_PROB_THR)) {
 55 |             return Double.parseDouble(otherParams.get(OptionalNLPParameters.SENTIMENT_PROB_THR));
 56 |         }
 57 |         return DEFAULT_SENTIMENT_PROBTHR;
 58 |     }
 59 |     
 60 |     public Token getToken(String token, String lemma) {
 61 |         return new Token(token, lemma);
 62 |     }
 63 | 
 64 |     class Sentence {
 65 | 
 66 |         private final Span sentence;
 67 |         private final String sentenceText;
 68 |         private String sentenceSentiment;
 69 |         private List<Integer> nounphrases;
 70 |         private String[] words;
 71 |         private Span[] wordSpans;
 72 |         private String[] posTags;
 73 |         private String[] lemmas;
 74 |         private final Map<String, Token> tokens;
 75 |         private Span[] chunks;
 76 |         private String[] chunkStrings;
 77 |         private String[] chunkSentiments;
 78 |         private final String defaultStringValue = "-"; // @Deprecated
 79 | 
 80 |         public Sentence(Span sentence, String text) {
 81 |             this.sentence = sentence;
 82 |             this.sentenceText = String.valueOf(sentence.getCoveredText(text));
 83 |             this.tokens = new HashMap<>();
 84 |         }
 85 | 
 86 |         public void addPhraseIndex(int phraseINdex) {
 87 |             if (this.nounphrases == null) {
 88 |                 this.nounphrases = new ArrayList<>();
 89 |             }
 90 |             this.nounphrases.add(phraseINdex);
 91 |         }
 92 | 
 93 |         public Span getSentenceSpan() {
 94 |             return this.sentence;
 95 |         }
 96 | 
 97 |         public String getSentence() {
 98 |             return this.sentenceText;
 99 |         }
100 | 
101 |         public String getSentiment() {
102 |             return this.sentenceSentiment;
103 |         }
104 | 
105 |         public void setSentiment(String sent) {
106 |             this.sentenceSentiment = sent;
107 |         }
108 | 
109 |         public String[] getWords() {
110 |             return words;
111 |         }
112 | 
113 |         public void setWords(String[] words) {
114 |             this.words = words;
115 |         }
116 | 
117 |         public Span[] getWordSpans() {
118 |             return this.wordSpans;
119 |         }
120 | 
121 |         public void setWordSpans(Span[] spans) {
122 |             this.wordSpans = spans;
123 |         }
124 | 
125 |         public void setWordsAndSpans(Span[] spans) {
126 |             if (spans == null) {
127 |                 this.wordSpans = null;
128 |                 this.words = null;
129 |                 return;
130 |             }
131 |             this.wordSpans = spans;
132 |             this.words = Arrays.asList(spans).stream()
133 |                     .map(span -> String.valueOf(span.getCoveredText(sentenceText)))
134 |                     .collect(Collectors.toList()).toArray(new String[wordSpans.length]);
135 |         }
136 | 
137 |         public int getWordStart(int idx) {
138 |             if (this.wordSpans.length > idx) {
139 |                 return this.wordSpans[idx].getStart();
140 |             }
141 |             return -1;
142 |         }
143 | 
144 |         public int getWordEnd(int idx) {
145 |             if (this.wordSpans.length > idx) {
146 |                 return this.wordSpans[idx].getEnd();
147 |             }
148 |             return -1;
149 |         }
150 | 
151 |         public String[] getPosTags() {
152 |             return this.posTags;
153 |         }
154 | 
155 |         public void setPosTags(String[] posTags) {
156 |             this.posTags = posTags;
157 |         }
158 | 
159 |         public Span[] getChunks() {
160 |             return this.chunks;
161 |         }
162 | 
163 |         public void setChunks(Span[] chunks) {
164 |             this.chunks = chunks;
165 |         }
166 | 
167 |         public String[] getChunkStrings() {
168 |             return this.chunkStrings;
169 |         }
170 | 
171 |         public void setChunkStrings(String[] chunkStrings) {
172 |             this.chunkStrings = chunkStrings;
173 |         }
174 | 
175 |         public String[] getChunkSentiments() {
176 |             return this.chunkSentiments;
177 |         }
178 | 
179 |         public void setChunkSentiments(String[] sents) {
180 |             if (sents == null) {
181 |                 return;
182 |             }
183 |             if (sents.length != this.chunks.length) {
184 |                 return;
185 |             }
186 |             this.chunkSentiments = sents;
187 |         }
188 | 
189 | //        @Deprecated
190 | //        public void setDefaultChunks() {
191 | //            this.chunks = new Span[this.words.length];
192 | //            Arrays.fill(this.chunks, new Span(0, 0));
193 | //            this.chunkStrings = new String[this.words.length];
194 | //            Arrays.fill(this.chunkStrings, defaultStringValue);
195 | //            this.nounphrases = new ArrayList<>();
196 | //        }
197 | 
198 |         public List<Integer> getPhrasesIndex() {
199 |             //if (nounphrases==null)
200 |             //return new ArrayList<Integer>();
201 |             return nounphrases;
202 |         }
203 | 
204 |         public Collection<Token> getTokens() {
205 |             return this.tokens.values();
206 |         }
207 | 
208 |         public String[] getLemmas() {
209 |             return this.lemmas;
210 |         }
211 | 
212 |         public void setLemmas(String[] lemmas) {
213 |             if (this.words == null || lemmas == null) {
214 |                 return;
215 |             }
216 |             if (this.words.length != lemmas.length) // ... something is wrong
217 |             {
218 |                 return;
219 |             }
220 |             this.lemmas = lemmas;
221 |         }
222 | 
223 |         protected Token getToken(String value, String lemma) {
224 |             Token token;
225 |             if (tokens.containsKey(value)) {
226 |                 token = tokens.get(value);
227 |             } else {
228 |                 token = new Token(value, lemma);
229 |                 tokens.put(value, token);
230 |             }
231 |             return token;
232 |         }
233 |     }
234 | 
235 |     class Token {
236 | 
237 |         private final String token;
238 |         private final Set<String> tokenPOS;
239 |         private final String tokenLemmas;
240 |         private final Set<String> tokenNEs;
241 |         private final List<Span> tokenSpans;
242 | 
243 |         public Token(String token, String lemma) {
244 |             this.token = token;
245 |             this.tokenLemmas = lemma;
246 |             this.tokenNEs = new HashSet<>();
247 |             this.tokenPOS = new HashSet<>();
248 |             this.tokenSpans = new ArrayList<>();
249 |         }
250 | 
251 |         public List<Span> getTokenSpans() {
252 |             return tokenSpans;
253 |         }
254 | 
255 |         public String getToken() {
256 |             return token;
257 |         }
258 | 
259 |         public void addTokenSpans(Span tokenSpans) {
260 |             this.tokenSpans.add(tokenSpans);
261 |         }
262 | 
263 |         public Collection<String> getTokenPOS() {
264 |             return tokenPOS;
265 |         }
266 | 
267 |         public void addTokenPOS(Collection<String> tokenPOSes) {
268 |             this.tokenPOS.addAll(tokenPOSes);
269 |         }
270 |         
271 |         public void addTokenPOS(String tokenPOS) {
272 |             this.tokenPOS.add(tokenPOS);
273 |         }
274 | 
275 |         public String getTokenLemmas() {
276 |             return tokenLemmas;
277 |         }
278 | 
279 |         public Collection<String> getTokenNEs() {
280 |             return tokenNEs;
281 |         }
282 | 
283 |         public void addTokenNE(String ne) {
284 |             this.tokenNEs.add(ne);
285 |         }
286 | 
287 |     }
288 | }
289 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPPipeline.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * To change this license header, choose License Headers in Project Properties.
  3 |  * To change this template file, choose Tools | Templates
  4 |  * and open the template in the editor.
  5 |  */
  6 | package com.graphaware.nlp.processor.opennlp;
  7 | 
  8 | import com.graphaware.nlp.processor.opennlp.model.NERModelTool;
  9 | import com.graphaware.nlp.processor.opennlp.model.SentimentModelTool;
 10 | import com.graphaware.nlp.processor.AbstractTextProcessor;
 11 | import static com.graphaware.nlp.processor.opennlp.OpenNLPAnnotation.DEFAULT_LEMMA_OPEN_NLP;
 12 | import java.io.File;
 13 | import java.io.FileInputStream;
 14 | import java.io.FileOutputStream;
 15 | import java.io.BufferedOutputStream;
 16 | import java.io.FileNotFoundException;
 17 | import java.io.IOException;
 18 | import java.io.InputStream;
 19 | import java.lang.reflect.Constructor;
 20 | import java.lang.reflect.InvocationTargetException;
 21 | import java.net.URI;
 22 | import java.net.URISyntaxException;
 23 | import java.util.Properties;
 24 | import java.util.HashMap;
 25 | import java.util.Arrays;
 26 | import java.util.ArrayList;
 27 | import java.util.HashSet;
 28 | import java.util.List;
 29 | import java.util.Map;
 30 | import java.util.Set;
 31 | import java.util.concurrent.atomic.AtomicInteger;
 32 | import java.util.stream.Collectors;
 33 | import opennlp.tools.chunker.ChunkerME;
 34 | import opennlp.tools.chunker.ChunkerModel;
 35 | import opennlp.tools.postag.POSModel;
 36 | import opennlp.tools.postag.POSTaggerME;
 37 | import opennlp.tools.sentdetect.SentenceDetectorME;
 38 | import opennlp.tools.sentdetect.SentenceModel;
 39 | import opennlp.tools.tokenize.TokenizerME;
 40 | import opennlp.tools.tokenize.TokenizerModel;
 41 | import opennlp.tools.namefind.TokenNameFinderModel;
 42 | import opennlp.tools.namefind.NameFinderME;
 43 | import opennlp.tools.lemmatizer.DictionaryLemmatizer; // needs OpenNLP >=1.7
 44 | //import opennlp.tools.lemmatizer.SimpleLemmatizer;   // for OpenNLP < 1.7
 45 | import opennlp.tools.doccat.DoccatModel;
 46 | import opennlp.tools.doccat.DocumentCategorizerME;
 47 | import opennlp.tools.util.Span;
 48 | import opennlp.tools.util.model.BaseModel;
 49 | import org.slf4j.Logger;
 50 | import org.slf4j.LoggerFactory;
 51 | 
 52 | public class OpenNLPPipeline {
 53 | 
 54 |     protected static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class);
 55 | 
 56 |     public static final String DEFAULT_BACKGROUND_SYMBOL = "O";
 57 | 
 58 |     protected static final String IMPORT_DIRECTORY = "import/";
 59 | 
 60 |     protected static final String PROPERTY_PATH_CHUNKER_MODEL = "chuncker";
 61 |     protected static final String PROPERTY_PATH_POS_TAGGER_MODEL = "pos";
 62 |     protected static final String PROPERTY_PATH_SENTENCE_MODEL = "sentence";
 63 |     protected static final String PROPERTY_PATH_TOKENIZER_MODEL = "tokenizer";
 64 |     protected static final String PROPERTY_PATH_LEMMATIZER_MODEL = "lemmatizer";
 65 |     protected static final String PROPERTY_PATH_SENTIMENT_MODEL = "sentiment";
 66 | 
 67 |     protected static final String PROPERTY_DEFAULT_CHUNKER_MODEL = "en-chunker.bin";
 68 |     protected static final String PROPERTY_DEFAULT_POS_TAGGER_MODEL = "en-pos-maxent.bin";
 69 |     protected static final String PROPERTY_DEFAULT_SENTENCE_MODEL = "en-sent.bin";
 70 |     protected static final String PROPERTY_DEFAULT_TOKENIZER_MODEL = "en-token.bin";
 71 |     protected static final String PROPERTY_DEFAULT_LEMMATIZER_MODEL = "en-lemmatizer.dict";
 72 |     protected static final String PROPERTY_DEFAULT_SENTIMENT_MODEL = "en-sentiment-tweets_toy.bin";
 73 | 
 74 |     protected static final String DEFAULT_PROJECT_VALUE = "default";
 75 | 
 76 |     protected final List<String> annotators;
 77 |     protected final List<String> stopWords;
 78 | 
 79 |     protected TokenizerME wordBreaker;
 80 |     protected POSTaggerME posme;
 81 |     protected ChunkerME chunkerME;
 82 |     protected SentenceDetectorME sentenceDetector;
 83 |     protected DictionaryLemmatizer lemmaDetector; // needs OpenNLP >=1.7
 84 | 
 85 |     protected Map<String, String> customNeModels = new HashMap<>();
 86 |     protected Map<String, String> customSentimentModels = new HashMap<>();
 87 | 
 88 |     protected Map<String, NameFinderME> nameDetectors = new HashMap<>();
 89 |     //protected Map<String, DocumentCategorizerME> sentimentDetectors = new HashMap<>();
 90 |     protected DocumentCategorizerME sentimentDetector;
 91 | 
 92 |     protected static Map<String, String> BASIC_NE_MODEL;
 93 | 
 94 |     {
 95 |         BASIC_NE_MODEL = new HashMap<>();
 96 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-person", "en-ner-person.bin");
 97 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-date", "en-ner-date.bin");
 98 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-location", "en-ner-location.bin");
 99 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-time", "en-ner-time.bin");
100 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-organization", "en-ner-organization.bin");
101 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-money", "en-ner-money.bin");
102 |         BASIC_NE_MODEL.put(DEFAULT_PROJECT_VALUE + "-percentage", "en-ner-percentage.bin");
103 |     }
104 | 
105 |     public OpenNLPPipeline(Properties properties) {
106 |         findModelFiles(IMPORT_DIRECTORY);
107 |         this.annotators = Arrays.asList(properties.getProperty("annotators", "").split(",")).stream().map(str -> str.trim()).collect(Collectors.toList());
108 |         this.stopWords = Arrays.asList(properties.getProperty("stopword", "").split(",")).stream().map(str -> str.trim().toLowerCase()).collect(Collectors.toList());
109 |         init(properties);
110 |     }
111 | 
112 |     private void init(Properties properties) {
113 |         try {
114 |             setSenteceSplitter(properties);
115 |             setTokenizer(properties);
116 |             setPosTagger(properties);
117 |             setChuncker(properties);
118 |             loadNamedEntitiesFinders(properties);
119 |             setLemmatizer(properties);
120 |             setCategorizer(properties);
121 | 
122 |         } catch (IOException e) {
123 |             LOG.error("Could not initialize OpenNLP models: " + e.getMessage());
124 |             throw new RuntimeException("Could not initialize OpenNLP models", e);
125 |         }
126 |     }
127 | 
128 |     private void setChuncker(Properties properties) throws FileNotFoundException {
129 |         InputStream is = getInputStream(properties, PROPERTY_PATH_CHUNKER_MODEL, PROPERTY_DEFAULT_CHUNKER_MODEL);
130 |         ChunkerModel chunkerModel = loadModel(ChunkerModel.class, is);
131 |         closeInputStream(is, PROPERTY_PATH_CHUNKER_MODEL);
132 |         chunkerME = new ChunkerME(chunkerModel);
133 |     }
134 | 
135 |     private void setPosTagger(Properties properties) throws FileNotFoundException {
136 |         InputStream is = getInputStream(properties, PROPERTY_PATH_POS_TAGGER_MODEL, PROPERTY_DEFAULT_POS_TAGGER_MODEL);
137 |         POSModel pm = loadModel(POSModel.class, is);
138 |         closeInputStream(is, PROPERTY_PATH_POS_TAGGER_MODEL);
139 |         posme = new POSTaggerME(pm);
140 |     }
141 | 
142 |     private void setTokenizer(Properties properties) throws FileNotFoundException {
143 |         InputStream is = getInputStream(properties, PROPERTY_PATH_TOKENIZER_MODEL, PROPERTY_DEFAULT_TOKENIZER_MODEL);
144 |         TokenizerModel tm = loadModel(TokenizerModel.class, is);
145 |         closeInputStream(is, PROPERTY_PATH_TOKENIZER_MODEL);
146 |         wordBreaker = new TokenizerME(tm);
147 |     }
148 | 
149 |     private void setSenteceSplitter(Properties properties) throws FileNotFoundException {
150 |         InputStream is = getInputStream(properties, PROPERTY_PATH_SENTENCE_MODEL, PROPERTY_DEFAULT_SENTENCE_MODEL);
151 |         SentenceModel sentenceModel = loadModel(SentenceModel.class, is);
152 |         closeInputStream(is, PROPERTY_PATH_SENTENCE_MODEL);
153 |         sentenceDetector = new SentenceDetectorME(sentenceModel);
154 |     }
155 | 
156 |     private void loadNamedEntitiesFinders(Properties properties) throws FileNotFoundException {
157 |         // Default NE models
158 |         BASIC_NE_MODEL.entrySet().stream().forEach((item) -> {
159 |             InputStream is = getInputStream(properties, item.getKey(), item.getValue());
160 |             if (!(is == null)) {
161 |                 TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is);
162 |                 closeInputStream(is, item.getKey());
163 |                 nameDetectors.put(item.getKey(), new NameFinderME(nameModel));
164 |             }
165 |         });
166 | 
167 |         // Custom NE models (in the `import/` dir of the Neo4j installation)
168 |         if (properties.containsKey("customNEs")) {
169 |             List<String> requiredModels = Arrays.asList(properties.getProperty("customNEs").split(",")).stream().map(str -> str.trim()).collect(Collectors.toList());
170 |             for (String key: requiredModels) {
171 |                 if (!customNeModels.containsKey(key)) {
172 |                     LOG.error("Custom NE model " + key + " not found!");
173 |                     throw new RuntimeException("Custom NE model " + key + " not found!");
174 |                 }
175 |                 LOG.info("Extracting custom NER model: " + key);
176 |                 InputStream is = new FileInputStream(new File(customNeModels.get(key)));
177 |                 TokenNameFinderModel nameModel = loadModel(TokenNameFinderModel.class, is);
178 |                 closeInputStream(is, key);
179 |                 nameDetectors.put(key, new NameFinderME(nameModel));
180 |                 LOG.info("Custom NER model " + key + " loaded for this pipeline.");
181 |             }
182 |         }
183 |     }
184 | 
185 |     private void setLemmatizer(Properties properties) throws FileNotFoundException, IOException {
186 |         InputStream is = getInputStream(properties, PROPERTY_PATH_LEMMATIZER_MODEL, PROPERTY_DEFAULT_LEMMATIZER_MODEL);
187 |         lemmaDetector = new DictionaryLemmatizer(is);
188 |         closeInputStream(is, PROPERTY_PATH_LEMMATIZER_MODEL);
189 |     }
190 | 
191 |     private void setCategorizer(Properties properties) throws FileNotFoundException {
192 |         // Default sentiment model
193 |         if (!properties.containsKey("customSentiment")) {
194 |             InputStream is = getInputStream(properties, PROPERTY_PATH_SENTIMENT_MODEL, PROPERTY_DEFAULT_SENTIMENT_MODEL);
195 |             if (is != null) {
196 |                 DoccatModel doccatModel = loadModel(DoccatModel.class, is);
197 |                 closeInputStream(is, PROPERTY_PATH_SENTIMENT_MODEL);
198 |                 //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, new DocumentCategorizerME(doccatModel));
199 |                 sentimentDetector = new DocumentCategorizerME(doccatModel);
200 |             } else {
201 |                 LOG.warn("No default sentiment detector available (input stream is null).");
202 |                 //sentimentDetectors.put(DEFAULT_PROJECT_VALUE, null);
203 |                 sentenceDetector = null;
204 |             }
205 |         }
206 |         // Custom sentiment model (currently only one is possible)
207 |         else {
208 |             String customModel = properties.getProperty("customSentiment");
209 |             LOG.info("Extracting custom sentiment model: " + customModel);
210 |             if (!customSentimentModels.containsKey(customModel)) {
211 |                 LOG.error("Custom sentiment model " + customModel + " not found!");
212 |                 throw new RuntimeException("Custom sentiment model " + customModel + " not found!");
213 |             }
214 |             try {
215 |                 InputStream is = new FileInputStream(new File(customSentimentModels.get(customModel)));
216 |                 if (is == null) {
217 |                     LOG.error("Custom sentiment model: input stream is null");
218 |                     return;
219 |                 }
220 |                 DoccatModel doccatModel = loadModel(DoccatModel.class, is);
221 |                 closeInputStream(is, customSentimentModels.get(customModel));
222 |                 //sentimentDetectors.put(customModel, new DocumentCategorizerME(doccatModel));
223 |                 sentimentDetector = new DocumentCategorizerME(doccatModel);
224 |                 LOG.info("Custom sentiment model " + customModel + " loaded for this pipeline.");
225 |             } catch (IOException ex) {
226 |                 LOG.error("Error while opening file " + customSentimentModels.get(customModel), ex);
227 |             }
228 |         }
229 |     }
230 | 
231 |     public void annotate(OpenNLPAnnotation document) {
232 |         String text = document.getText();
233 |         try {
234 |             Span sentences[] = sentenceDetector.sentPosDetect(text);
235 |             document.setSentences(sentences);
236 |             document.getSentences().stream()
237 |                     .forEach((OpenNLPAnnotation.Sentence sentence) -> {
238 |                         if (annotators.contains("tokenize") && wordBreaker != null) {
239 |                             Span[] wordSpans = wordBreaker.tokenizePos(sentence.getSentence());
240 |                             if (wordSpans != null && wordSpans.length > 0) {
241 |                                 sentence.setWordsAndSpans(wordSpans);
242 | 
243 |                                 if (annotators.contains("pos") && posme != null) {
244 |                                     String[] posTags = posme.tag(sentence.getWords());
245 |                                     sentence.setPosTags(posTags);
246 |                                     if (annotators.contains("lemma")) {
247 |                                         String[] finLemmas = lemmaDetector.lemmatize(sentence.getWords(), posTags);
248 |                                         sentence.setLemmas(finLemmas);
249 |                                     }
250 | 
251 |                                     //FIXME: this is wrong
252 | //                        if (annotators.contains("relation")) {
253 | //                            Span[] chunks = chunkerME.chunkAsSpans(sentence.getWords(), posTags);
254 | //                            sentence.setChunks(chunks);
255 | //                            LOG.info("Found " + chunks.length + " phrases.");
256 | //                            String[] chunkStrings = Span.spansToStrings(chunks, sentence.getWords());
257 | //                            sentence.setChunkStrings(chunkStrings);
258 | //                            List<String> chunkSentiments = new ArrayList<>();
259 | //                            for (int i = 0; i < chunks.length; i++) {
260 | //                                sentence.addPhraseIndex(i);
261 | //                            }
262 | //                            if (!chunkSentiments.isEmpty()) {
263 | //                                sentence.setChunkSentiments(chunkSentiments.toArray(new String[chunkSentiments.size()]));
264 | //                            }
265 | //                        } 
266 |                                 }
267 | 
268 |                                 Map<Integer, List<Span>> nerOccurrences = new HashMap<>();
269 |                                 if (annotators.contains("ner") && sentence.getWords() != null) {
270 | 
271 |                                     // Named Entities identification; needs to be performed after lemmas and POS (see implementation of Sentence.addNamedEntities())
272 |                                     BASIC_NE_MODEL.keySet().stream().forEach((modelKey) -> {
273 |                                         if (!nameDetectors.containsKey(modelKey)) {
274 |                                             LOG.warn("NER model with key " + modelKey + " not available.");
275 |                                         } else {
276 |                                             List<Span> ners = Arrays.asList(nameDetectors.get(modelKey).find(sentence.getWords()));
277 |                                             addNer(ners, nerOccurrences);
278 |                                         }
279 |                                     });
280 | 
281 |                                     if (!customNeModels.isEmpty()) {
282 |                                         for (String key : customNeModels.keySet()) {
283 |                                             if (!nameDetectors.containsKey(key)) {
284 |                                                 LOG.warn("Custom NER model with key " + key + " not available.");
285 |                                                 continue;
286 |                                             }
287 |                                             if (key.split("-").length == 0) {
288 |                                                 continue;
289 |                                             }
290 |                                             LOG.info("Running custom NER: " + key);
291 |                                             List ners = Arrays.asList(nameDetectors.get(key).find(sentence.getWords()));
292 |                                             addNer(ners, nerOccurrences);
293 |                                         }
294 |                                     }
295 |                                 }
296 |                                 processTokens(sentence, nerOccurrences);
297 |                             }
298 |                         }
299 |                         if (sentence.getWords() != null && sentence.getWords().length > 0) {
300 |                             if (annotators.contains("sentiment") && sentimentDetector != null) {
301 |                                 double[] outcomes = sentimentDetector.categorize(sentence.getWords());
302 |                                 String category = sentimentDetector.getBestCategory(outcomes);
303 |                                 if (Arrays.stream(outcomes).max().getAsDouble() < document.getSentimentProb()) {
304 |                                     category = "2";
305 |                                 }
306 |                                 sentence.setSentiment(category);
307 |                                 LOG.info("Sentiment results: sentence = " + sentence.getSentence() + "; category = " + category + "; outcomes = " + Arrays.toString(outcomes));
308 |                             }
309 |                         }
310 |                     });
311 | 
312 | //            if (annotators.contains("ner")) {
313 | //                for (String key : BASIC_NE_MODEL.keySet()) {
314 | //                    if (nameDetectors.containsKey(key)) {
315 | //                        nameDetectors.get(key).clearAdaptiveData();
316 | //                    }
317 | //                }
318 | //                if (customProject != null) {
319 | //                    for (String key : customNeModels.keySet()) {
320 | //                        if (nameDetectors.containsKey(key)) {
321 | //                            nameDetectors.get(key).clearAdaptiveData();
322 | //                        }
323 | //                    }
324 | //                }
325 | //            }
326 |         } catch (Exception ex) {
327 |             LOG.error("Error processing sentence for text: " + text, ex);
328 |             throw new RuntimeException("Error processing sentence for text: " + text, ex);
329 |         }
330 |     }
331 | 
332 |     protected void addNer(List<Span> ners, Map<Integer, List<Span>> nerOccurrences) {
333 |         if (ners != null && !ners.isEmpty()) {
334 |             ners.stream().forEach((ner) -> {
335 |                 List<Span> currentNer = nerOccurrences.get(ner.getStart());
336 |                 if (currentNer == null) {
337 |                     currentNer = new ArrayList<>();
338 |                     nerOccurrences.put(ner.getStart(), currentNer);
339 |                 }
340 |                 currentNer.add(ner);
341 |             });
342 |         }
343 |     }
344 | 
345 |     public String train(String alg, String modelId, String fileTrain, String lang, Map<String, Object> params) {
346 |         String fileOut = createModelFileName(lang, alg, modelId);
347 |         String newKey = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase();
348 |         String result = "";
349 | 
350 |         if (alg.toLowerCase().equals("ner")) {
351 |             NERModelTool nerModel = new NERModelTool(fileTrain, modelId, lang, params);
352 |             nerModel.train();
353 |             result = nerModel.validate();
354 |             nerModel.saveModel(fileOut);
355 |             // incorporate this model to the OpenNLPPipeline
356 |             if (nerModel.getModel() != null) {
357 |                 customNeModels.put(newKey, fileOut);
358 |                 /*if (!nameDetectors.containsKey(newKey)) {
359 |                     nameDetectors.put(newKey, new NameFinderME((TokenNameFinderModel) nerModel.getModel()));
360 |                 }*/
361 |             }
362 |         }
363 |         else if (alg.toLowerCase().equals("sentiment")) {
364 |             SentimentModelTool sentModel = new SentimentModelTool(fileTrain, modelId, lang, params);
365 |             sentModel.train();
366 |             result = sentModel.validate();
367 |             String[] dirPathSplit = fileTrain.split(File.separator);
368 |             String fileOutToUse;
369 |             if (dirPathSplit.length > 2) {
370 |                 StringBuilder sb = new StringBuilder("");
371 |                 for (int i = 0; i < dirPathSplit.length -2; ++i) {
372 |                     sb.append(dirPathSplit[i]).append(File.separator);
373 |                 }
374 |                 fileOutToUse = sb.toString() + fileOut;
375 |             } else {
376 |                 fileOutToUse = fileOut;
377 |             }
378 |             System.out.println("Saving model to " + fileOutToUse);
379 |             sentModel.saveModel(fileOutToUse);
380 |             // incorporate this model to the OpenNLPPipeline
381 |             if (sentModel.getModel() != null) {
382 |                 customSentimentModels.put(newKey, fileOutToUse);
383 |                 //sentimentDetectors.put(newKey, new DocumentCategorizerME((DoccatModel) sentModel.getModel()));
384 |             }
385 |         } else {
386 |             throw new UnsupportedOperationException("Undefined training procedure for algorithm " + alg);
387 |         }
388 | 
389 |         return result;
390 |     }
391 | 
392 |     public String test(String alg, String modelId, String file, String lang) {
393 |         String modelID = /*lang.toLowerCase() + "-" +*/ modelId.toLowerCase();
394 |         String result = "failure";
395 | 
396 |         if (alg.toLowerCase().equals("ner")) {
397 |           if (customNeModels.containsKey(modelID)) {
398 |             LOG.info("Testing NER model: " + modelID);
399 | 
400 |             TokenNameFinderModel nameModel;
401 |             try {
402 |                 // Load model
403 |                 InputStream is = new FileInputStream(new File(customNeModels.get(modelID)));
404 |                 nameModel = loadModel(TokenNameFinderModel.class, is);
405 |                 closeInputStream(is, modelID);
406 |             } catch (Exception e) {
407 |                 throw new RuntimeException("Loading custom sentiment model " + modelID + " failed: ", e);
408 |             }
409 | 
410 |             NERModelTool nerModel = new NERModelTool();
411 |             result = nerModel.test(file, new NameFinderME(nameModel));
412 |           } else
413 |             LOG.error("Required NER model doesn't exist: " + modelID);
414 |         }
415 |         else if (alg.toLowerCase().equals("sentiment")) {
416 |           if (customSentimentModels.containsKey(modelID)) {
417 |             LOG.info("Testing sentiment model: " + modelID);
418 | 
419 |             DoccatModel doccatModel;
420 |             try {
421 |                 // Load model
422 |                 InputStream is = new FileInputStream(new File(customSentimentModels.get(modelID)));
423 |                 doccatModel = loadModel(DoccatModel.class, is);
424 |                 closeInputStream(is, customSentimentModels.get(modelID));
425 |             } catch (Exception e) {
426 |                 throw new RuntimeException("Loading custom sentiment model " + modelID + " failed: ", e);
427 |             }
428 | 
429 |             SentimentModelTool sentModel = new SentimentModelTool();
430 |             result = sentModel.test(file, new DocumentCategorizerME(doccatModel));
431 |           } else
432 |             LOG.error("Required sentiment model doesn't exist: " + modelID);
433 |         } else {
434 |             throw new UnsupportedOperationException("Undefined training procedure for algorithm " + alg);
435 |         }
436 |         return result;
437 |     }
438 | 
439 |     private void processTokens(OpenNLPAnnotation.Sentence sentence, Map<Integer, List<Span>> nerOccurrences) {
440 |         if (sentence.getWords() == null) {
441 |             return;
442 |         }
443 |         String[] words = sentence.getWords();
444 |         String[] lemmas = sentence.getLemmas();
445 |         String[] posTags = sentence.getPosTags();
446 |         Span[] wordSpans = sentence.getWordSpans();
447 | 
448 |         for (int i = 0; i < words.length; i++) {
449 |             if (nerOccurrences != null && nerOccurrences.containsKey(i)) {
450 |                 List<Span> ners = nerOccurrences.get(i);
451 |                 final int startSpan = wordSpans[i].getStart();
452 |                 AtomicInteger index = new AtomicInteger(i);
453 |                 ners.forEach(ne -> {
454 |                     String value = "";
455 |                     String lemma = "";
456 |                     String type = ne.getType().toUpperCase();
457 |                     Set<String> posSet = new HashSet<>();
458 |                     int endSpan = startSpan;
459 |                     for (int j = ne.getStart(); j < ne.getEnd(); j++) {
460 |                         value += " " + words[j].trim();
461 |                         lemma += " " + (lemmas[j].equals(DEFAULT_LEMMA_OPEN_NLP) ? words[j].toLowerCase().trim() : lemmas[j].trim());
462 |                         posSet.add(posTags[j]);
463 |                         endSpan = wordSpans[j].getEnd();
464 |                         if (index.get() < j) {
465 |                             index.set(j);
466 |                         }
467 |                     }
468 | 
469 |                     value = value.trim();
470 |                     lemma = lemma.trim();
471 |                     //check stopwords
472 |                     if (isNotStopWord(lemma)) {
473 |                         OpenNLPAnnotation.Token token = sentence.getToken(value, lemma);
474 |                         token.addTokenNE(type);
475 |                         token.addTokenPOS(posSet);
476 |                         token.addTokenSpans(new Span(startSpan, endSpan));
477 |                     }
478 |                 });
479 |                 i = index.get();
480 |             } else {
481 |                 String value = words[i].trim();
482 |                 String lemma = lemmas[i].equals(DEFAULT_LEMMA_OPEN_NLP) ? words[i].toLowerCase() : lemmas[i].trim();
483 |                 String ne = DEFAULT_BACKGROUND_SYMBOL;
484 |                 String pos = posTags[i];
485 |                 Set<String> posSet = new HashSet<>();
486 |                 if (isNotStopWord(lemma)) {
487 |                     OpenNLPAnnotation.Token token = sentence.getToken(value, lemma);
488 |                     token.addTokenNE(ne);
489 |                     posSet.add(pos);
490 |                     token.addTokenPOS(posSet);
491 |                     token.addTokenSpans(wordSpans[i]);
492 |                 }
493 |             }
494 |         }
495 |     }
496 | 
497 |     private boolean isNotStopWord(String value) {
498 |         return !annotators.contains("stopword") || !stopWords.contains(value.toLowerCase());
499 |     }
500 | 
501 |     private void findModelFiles(String path) {
502 |         if (path == null || path.length() == 0) {
503 |             LOG.error("Scanning for model files: wrong path specified.");
504 |             return;
505 |         }
506 | 
507 |         File folder = new File(path);
508 |         File[] listOfFiles = folder.listFiles();
509 |         if (listOfFiles == null) {
510 |             return;
511 |         }
512 | 
513 |         String p = path;
514 |         if (p.charAt(p.length() - 1) != "/".charAt(0)) {
515 |             path += "/";
516 |         }
517 | 
518 |         for (int i = 0; i < listOfFiles.length; i++) {
519 |             if (!listOfFiles[i].isFile()) {
520 |                 continue;
521 |             }
522 |             String name = listOfFiles[i].getName();
523 |             String[] sp = name.split("-");
524 |             if (sp.length < 2) {
525 |                 continue;
526 |             }
527 |             if (!name.substring(name.length() - 4).equals(".bin")) {
528 |                 continue;
529 |             }
530 |             LOG.info("Custom models: Found file " + name);
531 | 
532 |             String alg  = sp[0].toLowerCase();
533 | 
534 |             String modelId = sp[1];
535 |             // this is useful in case user-defined model ID contained symbol "-"
536 |             for (int j = 2; j < sp.length; j++)
537 |                 modelId += "-" + sp[j];
538 |             modelId = modelId.substring(0, modelId.length() - 4).toLowerCase(); // remove ".bin"
539 |             //modelId = lang + "-" + modelId;
540 | 
541 |             LOG.info("Registering model name for algorithm " + alg + " under the key " + modelId);
542 |             if (alg.equals("ner")) {
543 |                 customNeModels.put(modelId, path + name);
544 |             } else if (alg.equals("sentiment")) {
545 |                 customSentimentModels.put(modelId, path + name);
546 |             }
547 |         }
548 |     }
549 | 
550 |     private <T extends BaseModel> T loadModel(Class<T> clazz, InputStream in) {
551 |         try {
552 |             Constructor<T> modelConstructor = clazz.getConstructor(InputStream.class);
553 |             T model = modelConstructor.newInstance(in);
554 |             return model;
555 |         } catch (NoSuchMethodException | SecurityException | InstantiationException | IllegalAccessException | IllegalArgumentException | InvocationTargetException ex) {
556 |             LOG.error("Error while initializing model of class: " + clazz, ex);
557 |             throw new RuntimeException("Error while initializing model of class: " + clazz, ex);
558 |         }
559 |     }
560 | 
561 |     private void saveModel(BaseModel model, String file) {
562 |         if (model == null) {
563 |             LOG.error("Can't save training results to a " + file + ": model is null");
564 |             return;
565 |         }
566 |         BufferedOutputStream modelOut = null;
567 |         try {
568 |             modelOut = new BufferedOutputStream(new FileOutputStream(file));
569 |             model.serialize(modelOut);
570 |             modelOut.close();
571 |         } catch (IOException ex) {
572 |             LOG.error("Error saving model to file " + file, ex);
573 |             throw new RuntimeException("Error saving model to file " + file, ex);
574 |         }
575 |         return;
576 |     }
577 | 
578 |     private InputStream getInputStream(Properties properties, String property, String defaultValue) {
579 |         String path = defaultValue;
580 |         if (properties != null) {
581 |             path = properties.getProperty(property, defaultValue);
582 |         }
583 |         InputStream is;
584 |         try {
585 |             if (path.startsWith("file://")) {
586 |                 is = new FileInputStream(new File(new URI(path)));
587 |             } else if (path.startsWith("/")) {
588 |                 is = new FileInputStream(new File(path));
589 |             } else {
590 |                 is = this.getClass().getResourceAsStream(path);
591 |             }
592 |         } catch (FileNotFoundException | URISyntaxException ex) {
593 |             LOG.error("Error while loading model from path: " + path, ex);
594 |             throw new RuntimeException("Error while loading model from path: " + path, ex);
595 |         }
596 |         return is;
597 |     }
598 | 
599 |     private void closeInputStream(InputStream is, String name) {
600 |         try {
601 |             if (is != null) {
602 |                 is.close();
603 |             }
604 |         } catch (IOException ex) {
605 |             LOG.warn("Attept to close stream for " + name + " model failed.");
606 |         }
607 |         return;
608 |     }
609 | 
610 |     private String createModelFileName(String lang, String alg, String model) {
611 |         String delim = "-";
612 |         //String name = "import/" + lang.toLowerCase() + delim + alg.toLowerCase();
613 |         String name = "import/" + alg.toLowerCase();
614 |         if (model != null) {
615 |             if (model.length() > 0) {
616 |                 name += delim + model.toLowerCase();
617 |             }
618 |         }
619 |         name += ".bin";
620 |         return name;
621 |     }
622 | 
623 | 
624 |     /*class ImprovisedInputStreamFactory implements InputStreamFactory {
625 |       private File inputSourceFile;
626 |       private String inputSourceStr;
627 | 
628 |       ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) {
629 |         this.inputSourceFile = null;
630 |         this.inputSourceStr = defaultValue;
631 |         if (properties!=null) this.inputSourceStr = properties.getProperty(property, defaultValue);
632 | 
633 |         try {
634 |           if (this.inputSourceStr.startsWith("file://"))
635 |             this.inputSourceFile = new File(new URI(this.inputSourceStr));
636 |           else if (this.inputSourceStr.startsWith("/"))
637 |             this.inputSourceFile = new File(this.inputSourceStr);
638 |         } catch (Exception ex) {
639 |           LOG.error("Error while loading model from " + this.inputSourceStr);
640 |           throw new RuntimeException("Error while loading model from " + this.inputSourceStr);
641 |         }
642 |       }
643 | 
644 |       @Override
645 |       public InputStream createInputStream() throws IOException {
646 |         LOG.debug("Creating input stream from " + this.inputSourceFile.getPath());
647 |         //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath());
648 |         return new FileInputStream(this.inputSourceFile.getPath());
649 |       }
650 |     }*/
651 | 
652 |     public Properties getProperties() {
653 |         return new Properties();//to be implemented
654 |     }
655 | }
656 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/OpenNLPTextProcessor.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * Copyright (c) 2013-2016 GraphAware
  3 |  *
  4 |  * This file is part of the GraphAware Framework.
  5 |  *
  6 |  * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of
  7 |  * the GNU General Public License as published by the Free Software Foundation, either
  8 |  * version 3 of the License, or (at your option) any later version.
  9 |  *
 10 |  * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
 11 |  * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 12 |  * See the GNU General Public License for more details. You should have received a copy of
 13 |  * the GNU General Public License along with this program.  If not, see
 14 |  * <http://www.gnu.org/licenses/>.
 15 |  */
 16 | package com.graphaware.nlp.processor.opennlp;
 17 | 
 18 | import com.graphaware.nlp.annotation.NLPTextProcessor;
 19 | import com.graphaware.nlp.domain.*;
 20 | import com.graphaware.nlp.dsl.request.PipelineSpecification;
 21 | import com.graphaware.nlp.processor.AbstractTextProcessor;
 22 | 
 23 | import java.util.*;
 24 | import java.util.concurrent.atomic.AtomicInteger;
 25 | import java.util.stream.Collectors;
 26 | 
 27 | import com.graphaware.nlp.util.Timer;
 28 | import opennlp.tools.util.Span;
 29 | import org.jetbrains.annotations.NotNull;
 30 | import org.slf4j.Logger;
 31 | import org.slf4j.LoggerFactory;
 32 | 
 33 | @NLPTextProcessor(name = "OpenNLPTextProcessor")
 34 | public class OpenNLPTextProcessor extends AbstractTextProcessor {
 35 | 
 36 |     private static final Logger LOG = LoggerFactory.getLogger(OpenNLPTextProcessor.class);
 37 | 
 38 |     private static final String CORE_PIPELINE_NAME = "OpenNLP.CORE";
 39 |     public static final String TOKENIZER = "tokenizer";
 40 |     public static final String SENTIMENT = "sentiment";
 41 | 
 42 |     private final Map<String, OpenNLPPipeline> pipelines = new HashMap<>();
 43 | 
 44 | 
 45 |     @Override
 46 |     public void init() {
 47 |     }
 48 | 
 49 |     @Override
 50 |     public String getAlias() {
 51 |         return "opennlp";
 52 |     }
 53 | 
 54 |     @Override
 55 |     public String override() {
 56 |         return null;
 57 |     }
 58 | 
 59 |     public OpenNLPPipeline getPipeline(String name) {
 60 |         if (name == null || name.isEmpty()) {
 61 |             name = TOKENIZER;
 62 |             LOG.debug("Using default pipeline: " + name);
 63 |         }
 64 |         OpenNLPPipeline pipeline = getOpenNLPPipeline(name);
 65 |         return pipeline;
 66 |     }
 67 | 
 68 |     private void checkPipelineExistOrCreate(PipelineSpecification pipelineSpecification) {
 69 |         if (!pipelines.containsKey(pipelineSpecification.getName())) {
 70 |             createPipeline(pipelineSpecification);
 71 |         }
 72 |     }
 73 | 
 74 | /*    private void createFullPipeline() {
 75 |         OpenNLPPipeline pipeline = new PipelineBuilder()
 76 |                 .tokenize()
 77 |                 .extractNEs()
 78 |                 .defaultStopWordAnnotator()
 79 |                 .extractRelations()
 80 |                 .extractSentiment()
 81 |                 .threadNumber(6)
 82 |                 .build();
 83 |         pipelines.put(CORE_PIPELINE_NAME, pipeline);
 84 |     }
 85 | 
 86 |     private void createTokenizerPipeline() {
 87 |         OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
 88 |         pipelines.put(TOKENIZER, pipeline);
 89 |     }
 90 | 
 91 |     private void createSentimentPipeline() {
 92 |         OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
 93 |         pipelines.put(SENTIMENT, pipeline);
 94 |     }
 95 | 
 96 |     private void createTokenizerAndSentimentPipeline() {
 97 |         OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
 98 |         pipelines.put(TOKENIZER_AND_SENTIMENT, pipeline);
 99 |     }
100 | 
101 |     private void createPhrasePipeline() {
102 |         OpenNLPPipeline pipeline = pipelines.get(CORE_PIPELINE_NAME);
103 |         pipelines.put(PHRASE, pipeline);
104 |     }*/
105 | 
106 |     @Override
107 |     public AnnotatedText annotateText(String text, String lang, PipelineSpecification pipelineSpecification) {
108 |         Timer timer = Timer.start();
109 |         checkPipelineExistOrCreate(pipelineSpecification);
110 |         timer.lap("pipeline check");
111 |         OpenNLPPipeline pipeline = pipelines.get(pipelineSpecification.getName());
112 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text, Collections.EMPTY_MAP);
113 |         pipeline.annotate(document);
114 | 
115 |         AnnotatedText result = new AnnotatedText();
116 |         List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
117 |         final AtomicInteger sentenceSequence = new AtomicInteger(0);
118 |         sentences.stream().forEach((sentence) -> {
119 |             int sentenceNumber = sentenceSequence.getAndIncrement();
120 |             final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber);
121 |             extractTokens(lang, sentence, newSentence);
122 |             if (pipelineSpecification.hasProcessingStep(STEP_SENTIMENT)) {
123 |                 extractSentiment(sentence, newSentence);
124 |             }
125 |             if (pipelineSpecification.hasProcessingStep(STEP_PHRASE)) {
126 |                 extractPhrases(sentence, newSentence);
127 |             }
128 |             result.addSentence(newSentence);
129 |         });
130 | 
131 |         return result;
132 |     }
133 | 
134 |     protected Map<String, Object> getPipelineProperties(OpenNLPPipeline pipeline) {
135 |         Map<String, Object> options = new HashMap<>();
136 |         for (Object o : pipeline.getProperties().keySet()) {
137 |             if (o instanceof String) {
138 |                 options.put(o.toString(), pipeline.getProperties().getProperty(o.toString()));
139 |             }
140 |         }
141 | 
142 |         return options;
143 |     }
144 | 
145 |     protected Map<String, Object> buildSpecifications(List<String> actives) {
146 |         List<String> all = Arrays.asList("tokenize", "ner", "cleanxml", "truecase", "dependency", "relations", "checkLemmaIsStopWord", "coref", "sentiment", "phrase", "customSentiment", "customNER");
147 |         Map<String, Object> specs = new HashMap<>();
148 |         all.forEach(s -> {
149 |             specs.put(s, actives.contains(s));
150 |         });
151 | 
152 |         return specs;
153 |     }
154 | 
155 | 
156 | /*    @Override
157 |     public AnnotatedText annotateText(String text, String name, String lang, Map<String, String> otherParams) {
158 |         if (name.length() == 0) {
159 |             name = TOKENIZER;
160 |             LOG.info("Using default pipeline: " + name);
161 |         }
162 |         OpenNLPPipeline pipeline = pipelines.get(name);
163 |         if (pipeline == null) {
164 |             throw new RuntimeException("Pipeline: " + name + " doesn't exist");
165 |         }
166 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text, otherParams);
167 |         pipeline.annotate(document);
168 | //        LOG.info("Annotation for id " + id + " finished.");
169 | 
170 |         AnnotatedText result = new AnnotatedText();
171 |         List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
172 |         final AtomicInteger sentenceSequence = new AtomicInteger(0);
173 |         sentences.stream().forEach((sentence) -> {
174 |             int sentenceNumber = sentenceSequence.getAndIncrement();
175 | //            String sentenceId = id + "_" + sentenceNumber;
176 |             final Sentence newSentence = new Sentence(sentence.getSentence(), sentenceNumber);
177 |             extractTokens(lang, sentence, newSentence);
178 |             extractSentiment(sentence, newSentence);
179 |             extractPhrases(sentence, newSentence);
180 |             result.addSentence(newSentence);
181 |         });
182 |         //extractRelationship(result, sentences, document);
183 |         return result;
184 |     }
185 | */
186 |     private void extractPhrases(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) {
187 |         if (sentence.getPhrasesIndex() == null) {
188 |             LOG.warn("extractPhrases(): phrases index empty, aborting extraction");
189 |             return;
190 |         }
191 |         sentence.getPhrasesIndex().forEach(index -> {
192 |             Span chunk = sentence.getChunks()[index];
193 |             String chunkString = sentence.getChunkStrings()[index];
194 |             newSentence.addPhraseOccurrence(chunk.getStart(), chunk.getEnd(), new Phrase(chunkString, chunk.getType()));
195 |         });
196 |     }
197 | 
198 |     private void extractSentiment(OpenNLPAnnotation.Sentence sentence, Sentence newSentence) {
199 |         int score = -1;
200 |         if (sentence.getSentiment() != null) { // && !sentence.getSentiment().equals("-")) {
201 |             try {
202 |                 score = Integer.valueOf(sentence.getSentiment());
203 |             } catch (NumberFormatException ex) {
204 |                 LOG.error("NumberFormatException: error extracting sentiment " + sentence.getSentiment() + " as a number.", ex);
205 |             }
206 |         }
207 |         newSentence.setSentiment(score);
208 |     }
209 | 
210 |     private void extractTokens(String lang, OpenNLPAnnotation.Sentence sentence, final Sentence newSentence) {
211 |         Collection<OpenNLPAnnotation.Token> tokens = sentence.getTokens();
212 |         tokens.stream().filter((token) -> token != null /*&& checkLemmaIsValid(token.getToken())*/).forEach((token) -> {
213 |             Tag newTag = getTag(token, lang);
214 |             if (newTag != null) {
215 |                 Tag tagInSentence = newSentence.addTag(newTag);
216 |                 token.getTokenSpans().stream().forEach((span) -> {
217 |                     newSentence.addTagOccurrence(span.getStart(), span.getEnd(), token.getToken(), tagInSentence);
218 |                 });
219 |             }
220 |         });
221 |     }
222 | 
223 |     //    private void extractRelationship(AnnotatedText annotatedText, List<CoreMap> sentences, Annotation document) {
224 | //        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
225 | //        if (corefChains != null) {
226 | //            for (CorefChain chain : corefChains.values()) {
227 | //                CorefChain.CorefMention representative = chain.getRepresentativeMention();
228 | //                int representativeSenteceNumber = representative.sentNum - 1;
229 | //                List<CoreLabel> representativeTokens = sentences.get(representativeSenteceNumber).get(CoreAnnotations.TokensAnnotation.class);
230 | //                int beginPosition = representativeTokens.get(representative.startIndex - 1).beginPosition();
231 | //                int endPosition = representativeTokens.get(representative.endIndex - 2).endPosition();
232 | //                Phrase representativePhraseOccurrence = annotatedText.getSentences().get(representativeSenteceNumber).getPhraseOccurrence(beginPosition, endPosition);
233 | //                if (representativePhraseOccurrence == null) {
234 | //                    LOG.warn("Representative Phrase not found: " + representative.mentionSpan);
235 | //                }
236 | //                for (CorefChain.CorefMention mention : chain.getMentionsInTextualOrder()) {
237 | //                    if (mention == representative) {
238 | //                        continue;
239 | //                    }
240 | //                    int mentionSentenceNumber = mention.sentNum - 1;
241 | //
242 | //                    List<CoreLabel> mentionTokens = sentences.get(mentionSentenceNumber).get(CoreAnnotations.TokensAnnotation.class);
243 | //                    int beginPositionMention = mentionTokens.get(mention.startIndex - 1).beginPosition();
244 | //                    int endPositionMention = mentionTokens.get(mention.endIndex - 2).endPosition();
245 | //                    Phrase mentionPhraseOccurrence = annotatedText.getSentences().get(mentionSentenceNumber).getPhraseOccurrence(beginPositionMention, endPositionMention);
246 | //                    if (mentionPhraseOccurrence == null) {
247 | //                        LOG.warn("Mention Phrase not found: " + mention.mentionSpan);
248 | //                    }
249 | //                    if (representativePhraseOccurrence != null
250 | //                            && mentionPhraseOccurrence != null) {
251 | //                        mentionPhraseOccurrence.setReference(representativePhraseOccurrence);
252 | //                    }
253 | //                }
254 | //            }
255 | //        }
256 | //    }
257 |     @Override
258 |     public Tag annotateSentence(String text, String lang, PipelineSpecification pipelineSpecification) {
259 | //        Annotation document = new Annotation(text);
260 | //        pipelines.get(SENTIMENT).annotate(document);
261 | //        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
262 | //        Optional<CoreMap> sentence = sentences.stream().findFirst();
263 | //        if (sentence.isPresent()) {
264 | //            Optional<Tag> oTag = sentence.get().get(CoreAnnotations.TokensAnnotation.class).stream()
265 | //                    .map((token) -> getTag(token))
266 | //                    .filter((tag) -> (tag != null) && checkPunctuation(tag.getLemma()))
267 | //                    .findFirst();
268 | //            if (oTag.isPresent()) {
269 | //                return oTag.get();
270 | //            }
271 | //        }
272 |         return null;
273 |     }
274 | 
275 |     @Override
276 |     public Tag annotateTag(String text, String lang, PipelineSpecification pipelineSpecification) {
277 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text);
278 |         final OpenNLPPipeline openNLPPipeline = getOpenNLPPipeline(pipelineSpecification.getName());
279 |         openNLPPipeline.annotate(document);
280 |         List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
281 |         if (sentences != null && !sentences.isEmpty()) {
282 |             if (sentences.size() > 1) {
283 |                 throw new RuntimeException("More than one sentence");
284 |             }
285 |             Collection<OpenNLPAnnotation.Token> tokens = sentences.get(0).getTokens();
286 |             if (tokens != null && tokens.size() == 1) {
287 |                 OpenNLPAnnotation.Token token = tokens.iterator().next();
288 |                 Tag newTag = getTag(token, lang);
289 |                 return newTag;
290 |             } else if (tokens != null && tokens.size() > 1) {
291 |                 OpenNLPAnnotation.Token token = document.getToken(text, text);
292 |                 Tag newTag = getTag(token, lang);
293 |                 return newTag;
294 |             }
295 |         }
296 |         return null;
297 |     }
298 | 
299 |     @NotNull
300 |     private OpenNLPPipeline getOpenNLPPipeline(String name) {
301 |         final OpenNLPPipeline openNLPPipeline = pipelines.get(name);
302 |         if (openNLPPipeline == null) {
303 |             throw new RuntimeException("Pipeline " + name + " doesn't exist");
304 |         }
305 |         return openNLPPipeline;
306 |     }
307 | 
308 |     private Tag getTag(OpenNLPAnnotation.Token token, String lang) {
309 |         List<String> pos = new ArrayList<>();
310 |         List<String> ne = new ArrayList<>();
311 |         String lemma = token.getTokenLemmas();
312 |         pos.addAll(token.getTokenPOS());
313 |         ne.addAll(token.getTokenNEs());
314 | 
315 |         // apply lemma validity check (to all words in case of NamedEntities)
316 |         lemma = Arrays.asList(lemma.split(" ")).stream().filter(str -> checkLemmaIsValid(str)).collect(Collectors.joining(" "));
317 |         if (lemma == null || lemma.length() == 0)
318 |             return null;
319 | 
320 |         Tag tag = new Tag(lemma, lang);
321 |         tag.setPos(pos);
322 |         tag.setNe(ne);
323 |         LOG.info("POS: " + pos + " ne: " + ne + " lemma: " + lemma);
324 |         return tag;
325 |     }
326 | 
327 |     private List<Tag> annotateTagsAux(String text, String lang, OpenNLPPipeline pipeline) {
328 |         List<Tag> result = new ArrayList<>();
329 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text);
330 |         pipeline.annotate(document);
331 |         List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
332 |         if (sentences != null && !sentences.isEmpty()) {
333 |             if (sentences.size() > 1) {
334 |                 throw new RuntimeException("More than one sentence");
335 |             }
336 |             Collection<OpenNLPAnnotation.Token> tokens = sentences.get(0).getTokens();
337 |             if (tokens != null && tokens.size() > 0) {
338 |                 tokens.stream().forEach((token) -> {
339 |                     Tag newTag = getTag(token, lang);
340 |                     if (newTag != null)
341 |                         result.add(newTag);
342 |                 });
343 |                 return result;
344 |             }
345 |         }
346 |         return null;
347 |     }
348 | 
349 |     @Override
350 |     public List<Tag> annotateTags(String text, String lang, PipelineSpecification pipelineSpecification) {
351 |         return annotateTagsAux(text, lang, getOpenNLPPipeline(pipelineSpecification.getName()));
352 |     }
353 | 
354 |     public List<Tag> annotateTags(String text, String lang) {
355 |         return annotateTagsAux(text, lang, getOpenNLPPipeline(TOKENIZER));
356 |     }
357 | 
358 |     @Override
359 |     public AnnotatedText sentiment(AnnotatedText annotated) {
360 |         OpenNLPPipeline pipeline = getOpenNLPPipeline(SENTIMENT);
361 |         annotated.getSentences().stream().forEach(item -> { // don't use parallelStream(), it crashes with the current content of the body
362 |             OpenNLPAnnotation document = new OpenNLPAnnotation(item.getSentence());
363 |             pipeline.annotate(document);
364 | 
365 |             List<OpenNLPAnnotation.Sentence> sentences = document.getSentences();
366 |             Optional<OpenNLPAnnotation.Sentence> sentence = sentences.stream().findFirst();
367 |             if (sentence != null && sentence.isPresent()) {
368 |                 extractSentiment(sentence.get(), item);
369 |             }
370 |         });
371 | 
372 |         return annotated;
373 |     }
374 | 
375 |     @Override
376 |     public String train(String alg, String modelId, String file, String lang, Map<String, Object> params) {
377 |         // training could be done directly here, but it's better to have everything related to model implementation in one class, therefore ...
378 |         OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER);
379 |         return pipeline.train(alg, modelId, file, lang, params);
380 |     }
381 | 
382 |     @Override
383 |     public String test(String alg, String modelId, String file, String lang) {
384 |         OpenNLPPipeline pipeline = getOpenNLPPipeline(TOKENIZER);
385 |         return pipeline.test(alg, modelId, file, lang);
386 | 
387 |     }
388 | 
389 |     class TokenHolder {
390 | 
391 |         private String ne;
392 |         private StringBuilder sb;
393 |         private int beginPosition;
394 |         private int endPosition;
395 | 
396 |         public TokenHolder() {
397 |             reset();
398 |         }
399 | 
400 |         public String getNe() {
401 |             return ne;
402 |         }
403 | 
404 |         public String getToken() {
405 |             if (sb == null) {
406 |                 return " - ";
407 |             }
408 |             return sb.toString();
409 |         }
410 | 
411 |         public int getBeginPosition() {
412 |             return beginPosition;
413 |         }
414 | 
415 |         public int getEndPosition() {
416 |             return endPosition;
417 |         }
418 | 
419 |         public void setNe(String ne) {
420 |             this.ne = ne;
421 |         }
422 | 
423 |         public void updateToken(String tknStr) {
424 |             this.sb.append(tknStr);
425 |         }
426 | 
427 |         public void setBeginPosition(int beginPosition) {
428 |             if (this.beginPosition < 0) {
429 |                 this.beginPosition = beginPosition;
430 |             }
431 |         }
432 | 
433 |         public void setEndPosition(int endPosition) {
434 |             this.endPosition = endPosition;
435 |         }
436 | 
437 |         public final void reset() {
438 |             sb = new StringBuilder();
439 |             beginPosition = -1;
440 |             endPosition = -1;
441 |         }
442 |     }
443 | 
444 |     class PhraseHolder implements Comparable<PhraseHolder> {
445 | 
446 |         private StringBuilder sb;
447 |         private int beginPosition;
448 |         private int endPosition;
449 | 
450 |         public PhraseHolder() {
451 |             reset();
452 |         }
453 | 
454 |         public String getPhrase() {
455 |             if (sb == null) {
456 |                 return " - ";
457 |             }
458 |             return sb.toString();
459 |         }
460 | 
461 |         public int getBeginPosition() {
462 |             return beginPosition;
463 |         }
464 | 
465 |         public int getEndPosition() {
466 |             return endPosition;
467 |         }
468 | 
469 |         public void updatePhrase(String tknStr) {
470 |             this.sb.append(tknStr);
471 |         }
472 | 
473 |         public void setBeginPosition(int beginPosition) {
474 |             if (this.beginPosition < 0) {
475 |                 this.beginPosition = beginPosition;
476 |             }
477 |         }
478 | 
479 |         public void setEndPosition(int endPosition) {
480 |             this.endPosition = endPosition;
481 |         }
482 | 
483 |         public final void reset() {
484 |             sb = new StringBuilder();
485 |             beginPosition = -1;
486 |             endPosition = -1;
487 |         }
488 | 
489 |         @Override
490 |         public boolean equals(Object o) {
491 |             if (!(o instanceof PhraseHolder)) {
492 |                 return false;
493 |             }
494 |             PhraseHolder otherObject = (PhraseHolder) o;
495 |             if (this.sb != null
496 |                     && otherObject.sb != null
497 |                     && this.sb.toString().equals(otherObject.sb.toString())
498 |                     && this.beginPosition == otherObject.beginPosition
499 |                     && this.endPosition == otherObject.endPosition) {
500 |                 return true;
501 |             }
502 |             return false;
503 |         }
504 | 
505 |         @Override
506 |         public int compareTo(PhraseHolder o) {
507 |             if (o == null) {
508 |                 return 1;
509 |             }
510 |             if (this.equals(o)) {
511 |                 return 0;
512 |             } else if (this.beginPosition > o.beginPosition) {
513 |                 return 1;
514 |             } else if (this.beginPosition == o.beginPosition) {
515 |                 if (this.endPosition > o.endPosition) {
516 |                     return 1;
517 |                 }
518 |             }
519 |             return -1;
520 |         }
521 |     }
522 | 
523 |     @Override
524 |     public List<String> getPipelines() {
525 |         return new ArrayList<>(pipelines.keySet());
526 |     }
527 | 
528 |     @Override
529 |     public boolean checkPipeline(String name) {
530 |         return pipelines.containsKey(name);
531 |     }
532 | 
533 |     @Override
534 |     public void createPipeline(PipelineSpecification pipelineSpecification) {
535 |         //TODO add validation
536 |         String name = pipelineSpecification.getName();
537 |         PipelineBuilder pipelineBuilder = new PipelineBuilder();
538 |         List<String> specActive = new ArrayList<>();
539 |         List<String> stopwordsList;
540 | 
541 |         if (pipelineSpecification.hasProcessingStep("tokenize", true)) {
542 |             pipelineBuilder.tokenize();
543 |             specActive.add("tokenize");
544 |         }
545 | 
546 |         if (pipelineSpecification.hasProcessingStep("ner", true)) {
547 |             pipelineBuilder.extractNEs();
548 |             specActive.add("ner");
549 |         }
550 | 
551 |         String stopWords = pipelineSpecification.getStopWords() != null ? pipelineSpecification.getStopWords() : "default";
552 |         boolean checkLemma = pipelineSpecification.hasProcessingStep("checkLemmaIsStopWord");
553 |         if (checkLemma) {
554 |             specActive.add("checkLemmaIsStopWord");
555 |         }
556 | 
557 |         if (stopWords.equalsIgnoreCase("default")) {
558 |             pipelineBuilder.defaultStopWordAnnotator();
559 |             stopwordsList = PipelineBuilder.getDefaultStopwords();
560 |         } else {
561 |             pipelineBuilder.customStopWordAnnotator(stopWords);
562 |             stopwordsList = PipelineBuilder.getCustomStopwordsList(stopWords);
563 |         }
564 | 
565 |         if (pipelineSpecification.hasProcessingStep("sentiment")) {
566 |             pipelineBuilder.extractSentiment();
567 |             specActive.add("sentiment");
568 |         }
569 |         if (pipelineSpecification.hasProcessingStep("coref")) {
570 |             pipelineBuilder.extractCoref();
571 |             specActive.add("coref");
572 |         }
573 |         if (pipelineSpecification.hasProcessingStep("relations")) {
574 |             pipelineBuilder.extractRelations();
575 |             specActive.add("relations");
576 |         }
577 |         if (pipelineSpecification.hasProcessingStep("customNER")) {
578 |             if (!specActive.contains("ner")) {
579 |                 pipelineBuilder.extractNEs();
580 |                 specActive.add("ner");
581 |             }
582 |             specActive.add("customNER");
583 |             pipelineBuilder.extractCustomNEs(pipelineSpecification.getProcessingStepAsString("customNER"));
584 |         }
585 |         if (pipelineSpecification.hasProcessingStep("customSentiment")) {
586 |             if (!specActive.contains("sentiment")) {
587 |                 pipelineBuilder.extractSentiment();
588 |                 specActive.add("sentiment");
589 |             }
590 |             specActive.add("customSentiment");
591 |             pipelineBuilder.extractCustomSentiment(pipelineSpecification.getProcessingStepAsString("customSentiment"));
592 |         }
593 |         Long threadNumber = pipelineSpecification.getThreadNumber() != 0 ? pipelineSpecification.getThreadNumber() : 4L;
594 |         pipelineBuilder.threadNumber(threadNumber.intValue());
595 | 
596 |         OpenNLPPipeline pipeline = pipelineBuilder.build();
597 |         pipelines.put(name, pipeline);
598 |     }
599 | 
600 | 
601 |     @Override
602 |     public void removePipeline(String name) {
603 |         if (!pipelines.containsKey(name)) {
604 |             throw new RuntimeException("No pipeline found with name: " + name);
605 |         }
606 |         pipelines.remove(name);
607 |     }
608 | }
609 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/PipelineBuilder.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * To change this license header, choose License Headers in Project Properties.
  3 |  * To change this template file, choose Tools | Templates
  4 |  * and open the template in the editor.
  5 |  */
  6 | package com.graphaware.nlp.processor.opennlp;
  7 | 
  8 | import java.util.ArrayList;
  9 | import java.util.Arrays;
 10 | import java.util.List;
 11 | import java.util.Properties;
 12 | 
 13 | class PipelineBuilder {
 14 | 
 15 |     private static final String CUSTOM_STOP_WORD_LIST = "start,starts,period,periods,a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,o,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with";
 16 | 
 17 |     private final Properties properties = new Properties();
 18 |     private final StringBuilder annotators = new StringBuilder(); //basics annotators
 19 |     private int threadsNumber = 4;
 20 | 
 21 |     private void checkForExistingAnnotators() {
 22 |         if (annotators.toString().length() > 0) {
 23 |             annotators.append(", ");
 24 |         }
 25 |     }
 26 | 
 27 |     public PipelineBuilder tokenize() {
 28 |         checkForExistingAnnotators();
 29 |         annotators.append("tokenize, pos, lemma");
 30 |         return this;
 31 |     }
 32 | 
 33 |     public PipelineBuilder extractNEs() {
 34 |         checkForExistingAnnotators();
 35 |         annotators.append("ner");
 36 |         return this;
 37 |     }
 38 | 
 39 |     public PipelineBuilder extractSentiment() {
 40 |         checkForExistingAnnotators();
 41 |         annotators.append("sentiment");
 42 |         return this;
 43 |     }
 44 | 
 45 |     public PipelineBuilder extractRelations() {
 46 |         checkForExistingAnnotators();
 47 |         annotators.append("relation");
 48 |         return this;
 49 |     }
 50 | 
 51 |     public PipelineBuilder extractCoref() {
 52 |         return this;
 53 |     }
 54 | 
 55 |     public PipelineBuilder extractCustomNEs(String ners) {
 56 |         properties.setProperty("customNEs", ners);
 57 |         return this;
 58 |     }
 59 | 
 60 |     public PipelineBuilder extractCustomSentiment(String sent) {
 61 |         properties.setProperty("customSentiment", sent);
 62 |         return this;
 63 |     }
 64 | 
 65 |     public PipelineBuilder defaultStopWordAnnotator() {
 66 |         checkForExistingAnnotators();
 67 |         annotators.append("stopword");
 68 |         properties.setProperty("stopword", CUSTOM_STOP_WORD_LIST);
 69 |         return this;
 70 |     }
 71 | 
 72 |     public PipelineBuilder customStopWordAnnotator(String customStopWordList) {
 73 |         checkForExistingAnnotators();
 74 |         String stopWordList;
 75 |         if (annotators.indexOf("stopword") >= 0) {
 76 |             String alreadyexistingStopWordList = properties.getProperty("stopword");
 77 |             stopWordList = alreadyexistingStopWordList + "," + customStopWordList;
 78 |         } else {
 79 |             annotators.append("stopword");
 80 |             stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList;
 81 |         }
 82 |         properties.setProperty("stopword", stopWordList);
 83 |         return this;
 84 |     }
 85 | 
 86 |     public PipelineBuilder stopWordAnnotator(Properties properties) {
 87 |         return this;
 88 |     }
 89 | 
 90 |     public PipelineBuilder threadNumber(int threads) {
 91 |         this.threadsNumber = threads;
 92 |         return this;
 93 |     }
 94 | 
 95 |     public OpenNLPPipeline build() {
 96 |         properties.setProperty("annotators", annotators.toString());
 97 |         properties.setProperty("threads", String.valueOf(threadsNumber));
 98 |         OpenNLPPipeline pipeline = new OpenNLPPipeline(properties);
 99 |         return pipeline;
100 |     }
101 |     
102 |     public static List<String> getDefaultStopwords() {
103 |         List<String> stopwords = new ArrayList<>();
104 |         Arrays.stream(CUSTOM_STOP_WORD_LIST.split(",")).forEach(s -> {
105 |             stopwords.add(s.trim());
106 |         });
107 | 
108 |         return stopwords;
109 |     }
110 | 
111 |     public static List<String> getCustomStopwordsList(String customStopWordList) {
112 |         String stopWordList;
113 |         if (customStopWordList.startsWith("+")) {
114 |             stopWordList = CUSTOM_STOP_WORD_LIST + "," + customStopWordList.replace("+,", "").replace("+", "");
115 |         } else {
116 |             stopWordList = customStopWordList;
117 |         }
118 | 
119 |         List<String> list = new ArrayList<>();
120 |         Arrays.stream(stopWordList.split(",")).forEach(s -> {
121 |             list.add(s.trim());
122 |         });
123 | 
124 |         return list;
125 |     }
126 | }
127 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/model/NERModelTool.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  *
  3 |  *
  4 |  */
  5 | package com.graphaware.nlp.processor.opennlp.model;
  6 | 
  7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
  8 | import java.io.IOException;
  9 | import java.util.Map;
 10 | 
 11 | import opennlp.tools.namefind.TokenNameFinderFactory;
 12 | import opennlp.tools.namefind.TokenNameFinderCrossValidator;
 13 | import opennlp.tools.namefind.TokenNameFinderEvaluator;
 14 | import opennlp.tools.namefind.NameSample;
 15 | import opennlp.tools.namefind.NameFinderME;
 16 | import opennlp.tools.namefind.NameSampleDataStream;
 17 | import opennlp.tools.namefind.TokenNameFinderModel;
 18 | 
 19 | import opennlp.tools.util.ObjectStream;
 20 | 
 21 | import com.graphaware.nlp.util.GenericModelParameters;
 22 | 
 23 | import org.slf4j.Logger;
 24 | import org.slf4j.LoggerFactory;
 25 | 
 26 | public class NERModelTool extends OpenNLPGenericModelTool {
 27 | 
 28 |     private String entityType;
 29 |     private static final String MODEL_NAME = "NER";
 30 | 
 31 |     private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class);
 32 | 
 33 |     public NERModelTool(String fileIn, String modelDescr, String lang, Map<String, Object> params) {
 34 |         super(fileIn, modelDescr, lang, params);
 35 |         this.entityType = null; // train only specific named entity; null = train all entities present in the training set
 36 |         if (params != null) {
 37 |             if (params.containsKey(GenericModelParameters.TRAIN_ENTITYTYPE)) {
 38 |                 this.entityType = (String) params.get(GenericModelParameters.TRAIN_ENTITYTYPE);
 39 |             }
 40 |         }
 41 |     }
 42 | 
 43 |     public NERModelTool(String fileIn, String modelDescr, String lang) {
 44 |         this(fileIn, modelDescr, lang, null);
 45 |     }
 46 | 
 47 |     public NERModelTool() {
 48 |       super();
 49 |     }
 50 | 
 51 |     public void train() {
 52 |         try (ObjectStream<String> lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) {
 53 |             LOG.info("Training of " + MODEL_NAME + " started ...");
 54 |             this.model = NameFinderME.train(lang, entityType, sampleStream, trainParams, new TokenNameFinderFactory());
 55 |         } catch (IOException ex) {
 56 |             LOG.error("Error while opening training file: " + fileIn, ex);
 57 |             throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex);
 58 |         } catch (Exception ex) {
 59 |             LOG.error("Error while training " + MODEL_NAME + " model " + modelDescr);
 60 |             throw new RuntimeException("Error while training " + MODEL_NAME + " model " + this.modelDescr, ex);
 61 |         }
 62 |     }
 63 | 
 64 |     public String validate() {
 65 |         String result = "";
 66 |         if (this.fileValidate == null) {
 67 |             //List<EvaluationMonitor<NameSample>> listeners = new LinkedList<EvaluationMonitor<NameSample>>();
 68 |             try (ObjectStream<String> lineStream = openFile(fileIn); NameSampleDataStream sampleStream = new NameSampleDataStream(lineStream)) {
 69 |                 LOG.info("Validation of " + MODEL_NAME + " started ...");
 70 |                 // Using CrossValidator
 71 |                 TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator(lang, entityType, trainParams, null);
 72 |                 // the second argument of 'evaluate()' gives number of folds (n), i.e. number of times the training-testing will be run (with data splitting train:test = (n-1):1)
 73 |                 evaluator.evaluate(sampleStream, nFolds);
 74 |                 result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure())
 75 |                         + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore())
 76 |                         + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")";
 77 |                 LOG.info("Validation: " + result);
 78 |             } catch (IOException ex) {
 79 |                 LOG.error("Error while opening training file: " + fileIn, ex);
 80 |                 throw new RuntimeException("IOError while evaluating " + MODEL_NAME + " model " + modelDescr, ex);
 81 |             } catch (Exception ex) {
 82 |                 LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex);
 83 |                 throw new RuntimeException("Error while evaluating " + MODEL_NAME + " model " + modelDescr, ex);
 84 |             }
 85 |         } else {
 86 |           result = test(this.fileValidate, new NameFinderME((TokenNameFinderModel) model));
 87 |         }
 88 | 
 89 |         return result;
 90 |     }
 91 | 
 92 |     public String test(String file, NameFinderME modelME) {
 93 |         String result = "";
 94 |         try (ObjectStream<String> lineStreamValidate = openFile(file); NameSampleDataStream sampleStreamValidate = new NameSampleDataStream(lineStreamValidate)) {
 95 |             LOG.info("Testing of " + MODEL_NAME + " started ...");
 96 |             //TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME((TokenNameFinderModel) model));
 97 |             TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(modelME);
 98 |             evaluator.evaluate(sampleStreamValidate);
 99 |             result = "F = " + decFormat.format(evaluator.getFMeasure().getFMeasure())
100 |                     + " (Precision = " + decFormat.format(evaluator.getFMeasure().getPrecisionScore())
101 |                     + ", Recall = " + decFormat.format(evaluator.getFMeasure().getRecallScore()) + ")";
102 |             LOG.info("Testing result: " + result);
103 |         } catch (IOException ex) {
104 |             LOG.error("Error while opening test file: " + file, ex);
105 |             throw new RuntimeException("Error while testing " + MODEL_NAME + " model " + modelDescr, ex);
106 |         } catch (Exception ex) {
107 |             LOG.error("Error while testing " + this.MODEL_NAME + " model.", ex);
108 |         }
109 |         return result;
110 |     }
111 | }
112 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/model/OpenNLPGenericModelTool.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  *
  3 |  *
  4 |  */
  5 | package com.graphaware.nlp.processor.opennlp.model;
  6 | 
  7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
  8 | import java.io.File;
  9 | import java.io.FileInputStream;
 10 | import java.io.FileOutputStream;
 11 | import java.io.InputStream;
 12 | import java.io.BufferedOutputStream;
 13 | import java.io.IOException;
 14 | import java.net.URI;
 15 | import java.util.Properties;
 16 | import java.util.Map;
 17 | import java.text.DecimalFormat;
 18 | 
 19 | import opennlp.tools.util.PlainTextByLineStream;
 20 | import opennlp.tools.util.InputStreamFactory;
 21 | import opennlp.tools.util.TrainingParameters;
 22 | import opennlp.tools.util.ObjectStream;
 23 | import opennlp.tools.util.model.BaseModel;
 24 | 
 25 | import com.graphaware.nlp.util.GenericModelParameters;
 26 | 
 27 | import org.slf4j.Logger;
 28 | import org.slf4j.LoggerFactory;
 29 | 
 30 | public class OpenNLPGenericModelTool {
 31 | 
 32 |     protected BaseModel model;
 33 |     protected TrainingParameters trainParams;
 34 |     protected final String modelDescr;
 35 |     protected final String lang;
 36 |     protected final DecimalFormat decFormat;
 37 |     protected int nFolds;
 38 | 
 39 |     protected final String fileIn;
 40 |     protected String fileValidate;
 41 | 
 42 |     private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class);
 43 | 
 44 |     public OpenNLPGenericModelTool(String file, String modelDescr, String lang) {
 45 |         this.fileValidate = null;
 46 |         this.fileIn = file;
 47 |         this.nFolds = 10;
 48 |         this.modelDescr = modelDescr;
 49 |         this.lang = lang;
 50 |         this.decFormat = new DecimalFormat("#0.00"); // for formating validation results with precision 2 decimals
 51 | 
 52 |         this.setDefParams();
 53 |     }
 54 | 
 55 |     public OpenNLPGenericModelTool(String file, String modelDescr, String lang, Map<String, Object> params) {
 56 |         this(file, modelDescr, lang);
 57 |         this.setTrainingParameters(params);
 58 |     }
 59 | 
 60 |     /*
 61 |      * This constructor needed for invoking test() method only (model is provided as an argument of train() )
 62 |      */
 63 |     public OpenNLPGenericModelTool() {
 64 |         this(null, null, null);
 65 |         this.model = null;
 66 |     }
 67 | 
 68 |     // override this method in your child-class if you want different defaults
 69 |     protected void setDefParams() {
 70 |         this.trainParams = TrainingParameters.defaultParams();
 71 |     }
 72 | 
 73 |     protected ObjectStream<String> openFile(String fileName) {
 74 |         if (fileName == null || fileName.isEmpty()) {
 75 |             LOG.error("File name is null or empty.");
 76 |             return null;
 77 |         }
 78 |         ObjectStream<String> lStream = null;
 79 |         try {
 80 |             ImprovisedInputStreamFactory dataIn = new ImprovisedInputStreamFactory(null, "", fileName);
 81 |             lStream = new PlainTextByLineStream(dataIn, "UTF-8");
 82 |         } catch (IOException ex) {
 83 |             LOG.error("Failure while opening file " + fileName, ex);
 84 |             throw new RuntimeException("Failure while opening file " + fileName, ex);
 85 |         }
 86 | 
 87 |         if (lStream == null)
 88 |             throw new RuntimeException("Failure while opening file " + fileName + ": input stream is null.");
 89 |         return lStream;
 90 |     }
 91 | 
 92 |     private void setTrainingParameters(Map<String, Object> params) {
 93 |         this.setDefParams();
 94 |         if (params == null || params.isEmpty()) {
 95 |             LOG.error("Map of parameters is null or empty. Using default values.");
 96 |             return;
 97 |         }
 98 | 
 99 |         // now add/override-by user-defined parameters
100 |         if (params.containsKey(GenericModelParameters.TRAIN_ALG)) {
101 |             String val = objectToString(params, GenericModelParameters.TRAIN_ALG);
102 |             this.trainParams.put(TrainingParameters.ALGORITHM_PARAM, val); // default: MAXENT
103 |             LOG.info("Training parameter " + TrainingParameters.ALGORITHM_PARAM + " set to " + val);
104 |         }
105 |         if (params.containsKey(GenericModelParameters.TRAIN_TYPE)) {
106 |             String val = objectToString(params, GenericModelParameters.TRAIN_TYPE);
107 |             this.trainParams.put(TrainingParameters.TRAINER_TYPE_PARAM, val);
108 |             LOG.info("Training parameter " + TrainingParameters.TRAINER_TYPE_PARAM + " set to " + val);
109 |         }
110 |         if (params.containsKey(GenericModelParameters.TRAIN_CUTOFF)) {
111 |             String val = objectToString(params, GenericModelParameters.TRAIN_CUTOFF);
112 |             this.trainParams.put(TrainingParameters.CUTOFF_PARAM, val);
113 |             LOG.info("Training parameter " + TrainingParameters.CUTOFF_PARAM + " set to " + val);
114 |         }
115 |         if (params.containsKey(GenericModelParameters.TRAIN_ITER)) {
116 |             String val = objectToString(params, GenericModelParameters.TRAIN_ITER);
117 |             this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, val);
118 |             LOG.info("Training parameter " + TrainingParameters.ITERATIONS_PARAM + " set to " + val);
119 |         }
120 |         if (params.containsKey(GenericModelParameters.TRAIN_THREADS)) {
121 |             String val = objectToString(params, GenericModelParameters.TRAIN_THREADS);
122 |             this.trainParams.put(TrainingParameters.THREADS_PARAM, val);
123 |             LOG.info("Training parameter " + TrainingParameters.THREADS_PARAM + " set to " + val);
124 |         }
125 |         if (params.containsKey(GenericModelParameters.VALIDATE_FOLDS)) {
126 |             this.nFolds = objectToInt(params, GenericModelParameters.VALIDATE_FOLDS);
127 |             LOG.info("n-folds for crossvalidation set to %d.", this.nFolds);
128 |         }
129 |         if (params.containsKey(GenericModelParameters.VALIDATE_FILE)) {
130 |             this.fileValidate = objectToString(params, GenericModelParameters.VALIDATE_FILE);
131 |             LOG.info("Using valudation file " + fileValidate);
132 |         }
133 |     }
134 | 
135 |     private String objectToString(Map<String, Object> params, String key) {
136 |         String result = null;
137 |         if (params.get(key) instanceof String)
138 |             result = (String) params.get(key);
139 |         else if (params.get(key) instanceof Long)
140 |             result = ((Long) params.get(key)).toString();
141 |         else if (params.get(key) instanceof Integer)
142 |             result = ((Integer) params.get(key)).toString();
143 |         else
144 |             throw new RuntimeException("Wrong format of parameter " + key);
145 |         return result;
146 |     }
147 | 
148 |     private int objectToInt(Map<String, Object> params, String key) {
149 |         int result;
150 |         if (params.get(key) instanceof String)
151 |             result = Integer.parseInt((String) params.get(key));
152 |         else if (params.get(key) instanceof Long)
153 |             result = ((Long) params.get(key)).intValue();
154 |         else if (params.get(key) instanceof Integer)
155 |             result = ((Integer) params.get(key)).intValue();
156 |         else
157 |             throw new RuntimeException("Wrong format of parameter " + key);
158 |         return result;
159 |     }
160 | 
161 |     protected void closeInputFiles() {
162 | //        try {
163 | //            if (this.lineStream != null) {
164 | //                this.lineStream.close();
165 | //            }
166 | //        } catch (IOException ex) {
167 | //            LOG.warn("Attept to close input line-stream from source file " + this.fileIn + " failed.");
168 | //        }
169 | //
170 | //        try {
171 | //            if (this.lineStreamValidate != null) {
172 | //                this.lineStreamValidate.close();
173 | //            }
174 | //        } catch (IOException ex) {
175 | //            LOG.warn("Attept to close input line-stream from source file " + this.fileValidate + " failed.");
176 | //        }
177 |     }
178 | 
179 |     public void saveModel(String file) {
180 |         if (this.model == null) {
181 |             LOG.error("Can't save training results to a " + file + ": model is null");
182 |             return;
183 |         }
184 |         try {
185 |             LOG.info("Saving model to file: " + file);
186 |             BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(file));
187 |             this.model.serialize(modelOut);
188 |             modelOut.close();
189 |         } catch (IOException ex) {
190 |             LOG.error("Error saving model to file " + file, ex);
191 |             throw new RuntimeException("Error saving model to file " + file, ex);
192 |         }
193 | 
194 |         //this.closeInputFile();
195 |     }
196 | 
197 |     public BaseModel getModel() {
198 |         return this.model;
199 |     }
200 | 
201 |     class ImprovisedInputStreamFactory implements InputStreamFactory {
202 | 
203 |         private File inputSourceFile;
204 |         private String inputSourceStr;
205 | 
206 |         ImprovisedInputStreamFactory(Properties properties, String property, String defaultValue) {
207 |             this.inputSourceFile = null;
208 |             this.inputSourceStr = defaultValue;
209 |             if (properties != null) {
210 |                 this.inputSourceStr = properties.getProperty(property, defaultValue);
211 |             }
212 |             try {
213 |                 if (this.inputSourceStr.startsWith("file://")) {
214 |                     this.inputSourceFile = new File(new URI(this.inputSourceStr.replace("file://", "")));
215 |                 } else if (this.inputSourceStr.startsWith("/")) {
216 |                     this.inputSourceFile = new File(this.inputSourceStr);
217 |                 }
218 |             } catch (Exception ex) {
219 |                 LOG.error("Error while loading model from " + this.inputSourceStr);
220 |                 throw new RuntimeException("Error while loading model from " + this.inputSourceStr);
221 |             }
222 |         }
223 | 
224 |         @Override
225 |         public InputStream createInputStream() throws IOException {
226 |             LOG.debug("Creating input stream from " + this.inputSourceFile.getPath());
227 |             //return getClass().getClassLoader().getResourceAsStream(this.inputSourceFile.getPath());
228 |             return new FileInputStream(this.inputSourceFile.getPath());
229 |         }
230 | 
231 |         /*public void closeInputStream() {
232 |       try {
233 |         if (this.is!=null)
234 |         this.is.close();
235 |       } catch (IOException ex) {
236 |         LOG.warn("Attept to close input stream failed.");
237 |       }
238 |     }*/
239 |     }
240 | 
241 | }
242 | 


--------------------------------------------------------------------------------
/src/main/java/com/graphaware/nlp/processor/opennlp/model/SentimentModelTool.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  *
  3 |  *
  4 |  */
  5 | package com.graphaware.nlp.processor.opennlp.model;
  6 | 
  7 | import com.graphaware.nlp.processor.opennlp.OpenNLPPipeline;
  8 | import java.io.File;
  9 | import java.io.FileOutputStream;
 10 | import java.io.InputStream;
 11 | import java.io.BufferedOutputStream;
 12 | import java.io.IOException;
 13 | import java.util.Arrays;
 14 | import java.util.Properties;
 15 | import java.util.Map;
 16 | import java.util.HashMap;
 17 | import java.util.Collections;
 18 | import java.util.Iterator;
 19 | import java.net.URI;
 20 | 
 21 | import opennlp.tools.namefind.TokenNameFinderFactory;
 22 | import opennlp.tools.namefind.TokenNameFinderCrossValidator;
 23 | import opennlp.tools.namefind.TokenNameFinderEvaluator;
 24 | import opennlp.tools.doccat.DoccatFactory;
 25 | import opennlp.tools.doccat.DocumentSample;
 26 | import opennlp.tools.doccat.DocumentSampleStream;
 27 | import opennlp.tools.doccat.DocumentCategorizerME;
 28 | import opennlp.tools.doccat.DoccatCrossValidator;
 29 | import opennlp.tools.doccat.DocumentCategorizerEvaluator;
 30 | import opennlp.tools.doccat.DoccatModel;
 31 | 
 32 | import opennlp.tools.namefind.NameSample;
 33 | 
 34 | import opennlp.tools.util.PlainTextByLineStream;
 35 | import opennlp.tools.util.InputStreamFactory;
 36 | import opennlp.tools.util.TrainingParameters;
 37 | import opennlp.tools.util.ObjectStream;
 38 | import opennlp.tools.util.eval.CrossValidationPartitioner;
 39 | import opennlp.tools.util.eval.FMeasure;
 40 | import opennlp.tools.util.FilterObjectStream;
 41 | //import opennlp.tools.util.eval.EvaluationMonitor;
 42 | 
 43 | import com.graphaware.nlp.util.GenericModelParameters;
 44 | 
 45 | import org.slf4j.Logger;
 46 | import org.slf4j.LoggerFactory;
 47 | 
 48 | /**
 49 |  *
 50 |  * @author vla
 51 |  */
 52 | public class SentimentModelTool extends OpenNLPGenericModelTool {
 53 | 
 54 |     private static final Logger LOG = LoggerFactory.getLogger(OpenNLPPipeline.class);
 55 | 
 56 |     private static final String MODEL_NAME = "sentiment";
 57 |     private static final String DEFAULT_ITER = "30";
 58 |     private static final String DEFAULT_CUTOFF = "2";
 59 | 
 60 |     public SentimentModelTool(String fileIn, String modelDescr, String lang, Map<String, Object> params) {
 61 |         super(fileIn, modelDescr, lang, params);
 62 |     }
 63 | 
 64 |     public SentimentModelTool(String fileIn, String modelDescr, String lang) {
 65 |         this(fileIn, modelDescr, lang, null);
 66 |     }
 67 | 
 68 |     public SentimentModelTool() {
 69 |         super();
 70 |     }
 71 | 
 72 |     // here you can specify default parameters specific to this class
 73 |     @Override
 74 |     protected void setDefParams() {
 75 |         this.trainParams = TrainingParameters.defaultParams();
 76 |         this.trainParams.put(TrainingParameters.ITERATIONS_PARAM, DEFAULT_ITER);
 77 |         this.trainParams.put(TrainingParameters.CUTOFF_PARAM, DEFAULT_CUTOFF);
 78 |     }
 79 | 
 80 |     public void train() {
 81 |         try (ObjectStream<String> lineStream = openFile(fileIn); ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream)) {
 82 |             LOG.info("Training of " + MODEL_NAME + " started ...");
 83 |             this.model = DocumentCategorizerME.train("en", sampleStream, trainParams, new DoccatFactory());
 84 |         } catch (IOException e) {
 85 |             LOG.error("IOError while training a custom " + MODEL_NAME + " model " + modelDescr, e);
 86 |             throw new RuntimeException("IOError while training a custom " + MODEL_NAME + " model " + this.modelDescr, e);
 87 |         }
 88 |     }
 89 | 
 90 |     public String validate() {
 91 |         String result = "";
 92 |         if (this.fileValidate == null) {
 93 |             try (ObjectStream<String> lineStream = openFile(fileIn); ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream)) {
 94 |                 LOG.info("Validation of " + MODEL_NAME + " started ...");
 95 |                 DoccatCrossValidator evaluator = new DoccatCrossValidator(this.lang, this.trainParams, new DoccatFactory());
 96 |                 // the second argument of 'evaluate()' gives number of folds (n): number of times the training-testing will be run (with data splitting train:test = (n-1):1)
 97 |                 evaluator.evaluate(sampleStream, this.nFolds);
 98 |                 result = "Accuracy = " + this.decFormat.format(evaluator.getDocumentAccuracy());
 99 |                 LOG.info("Validation: " + result);
100 |             } catch (IOException e) {
101 |                 LOG.error("Error while opening training file: " + fileIn, e);
102 |                 throw new RuntimeException("IOError while evaluating a " + MODEL_NAME + " model " + this.modelDescr, e);
103 |             } catch (Exception ex) {
104 |                 LOG.error("Error while evaluating " + MODEL_NAME + " model.", ex);
105 |             }
106 |         } else {
107 |             // Using a separate .test file provided by user
108 |             result = test(this.fileValidate, new DocumentCategorizerME((DoccatModel) this.model));
109 |         }
110 | 
111 |         return result;
112 |     }
113 | 
114 |     public String test(String file, DocumentCategorizerME modelME) {
115 |         String result = "";
116 |         try (ObjectStream<String> lineStream = openFile(file); ObjectStream<DocumentSample> sampleStreamValidate = new DocumentSampleStream(lineStream)) {
117 |             LOG.info("Testing of " + MODEL_NAME + " started ...");
118 |             //DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(new DocumentCategorizerME((DoccatModel) this.model));
119 |             DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(modelME);
120 |             evaluator.evaluate(sampleStreamValidate);
121 |             result = "Accuracy = " + this.decFormat.format(evaluator.getAccuracy());
122 |             LOG.info("Validation: " + result);
123 |         } catch (IOException e) {
124 |             LOG.error("Error while opening a test file: " + file, e);
125 |             throw new RuntimeException("IOError while testing a " + MODEL_NAME + " model " + this.modelDescr, e);
126 |         } catch (Exception ex) {
127 |             LOG.error("Error while testing " + MODEL_NAME + " model.", ex);
128 |         }
129 |         return result;
130 |     }
131 | }
132 | 


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-chunker.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-date.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-location.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-money.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-organization.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-percentage_money.test:
--------------------------------------------------------------------------------
 1 | Mr Juncker said the EU was spending on average <START:money> €27,000 <END> ( <START:money> £24,000 <END> ; <START:money> $30,000 <END> ) per soldier on equipment and research, whereas the US was spending <START:money> €108,000 <END>.
 2 | "Together, we spend half as much as the United States, but even then we only achieve <START:percentage> 15% <END> of their efficiency," he said.
 3 | 
 4 | Under the plan, <START:money> €500m <END> will be made available annually after 2020 for joint military research, and another <START:money> €1bn <END> annually for joint investment and purchases of military equipment, such as drones and helicopters.
 5 | 
 6 | In afternoon trade, sterling was down <START:percentage> 1.7% <END> against the dollar at <START:money> $1.227 <END> .
 7 | Against the euro, the pound was down <START:percentage> 1.4% <END> at 1.1393.6.
 8 | This makes imported goods' prices higher and squeezes consumers' ability to spend.
 9 | 
10 | Housebuilders, including Taylor Wimpey and Persimmon, saw falls of up to <START:percentage> 5% <END> , while retail companies' shares also fell.
11 | Next and Marks and Spencer fell more than <START:percentage> 3% <END> .
12 | 
13 | Ultimately they may have deprived the state of nearly <START:money> €32bn <END> ( <START:money> £28bn <END> ; <START:money> $36bn <END> ).
14 | As the German broadcaster ARD wryly noted, that would have paid for repairs to a lot of schools and bridges.
15 | 
16 | The pound has dropped by more than <START:percentage> 2% <END> against the dollar, sterling's biggest one-day fall since the Brexit referendum vote last June.
17 | 
18 | Japan's benchmark Nikkei 225 stock index closed <START:percentage> 0.5% <END> higher and South Korea's Kospi cended the day up <START:percentage> 0.8% <END> .
19 | 
20 | Ocwen Financial rose <START:percentage> 2.7% <END> .
21 | The loan servicer has been under pressure from the Consumer Financial Protection Bureau, which would be seriously weakened under Thursday's measure.
22 | 
23 | The deals range from buying UK chip firm ARM Holdings for <START:money> £24bn <END> ( <START:money> $32bn <END> ), investing <START:money> $1bn <END> in satellite startup OneWeb, to setting up a venture fund with Saudi Arabia.
24 | 
25 | The ECB now expects growth across the eurozone to be <START:percentage> 1.9% <END> in 2017 compared with its March forecast of <START:percentage> 1.8% <END> .
26 | It also increased its growth projection for 2018 to <START:percentage> 1.8% <END> from <START:percentage> 1.7% <END> , and for 2019 to <START:percentage> 1.7% <END> from <START:percentage> 1.6% <END> .
27 | 
28 | At that rate, the 19 countries that use the euro would see growth at <START:percentage> 2.3% <END> this year, nearly double the rate of the US, which is on course to grow <START:percentage> 1.2% <END> .
29 | 
30 | Although all EU countries are required to observe the <START:percentage> 3% <END> limit, only the 19 countries that use the euro as a currency can be fined.
31 | 
32 | According to the Office for National Statistics (ONS), manufacturing production grew <START:percentage> 0.2% <END> from the month before in April, rebounding from the <START:percentage> 0.6% <END> decline recorded in the previous month but falling short of expectations for a <START:percentage> 0.8% <END> increase.
33 | 
34 | At 1:59pm BST, the Brent front month futures contract for August delivery was down another 0.75% or 36 cents at $47.70 per barrel, with the global proxy benchmark having breached the psychological <START:money> $50 <END> -level on Monday.
35 | 
36 | Concurrently, the West Texas Intermediate (WTI) was down <START:percentage> 0.94% <END> or <START:money> 43 cents <END> to <START:money> $45.29 <END> per barrel, after US Energy Information Administration said the country's stockpiles rose by 3.3m barrels last week, against market estimates for a 3.5m-barrel drop.
37 | 
38 | A breakdown below <START:money> $45 <END> should open a path lower towards <START:money> $44 <END> .
39 | Furthermore, with the oversupply woes still a dominant theme in the oil markets and Opec's efforts to stabilise the markets disrupted by US Shale production, WTI Crude may receive further punishment.
40 | 
41 | A diamond ring that was initially bought at a car boot sales for <START:money> £10 <END> has been auctioned off for <START:money> £656,750 <END> in London.
42 | 
43 | The owner was unaware the "exceptionally-sized" stone was instead a 26-carat diamond, which she wore for almost two decades and which fetched almost double the <START:money> £350,000 <END> it was expected to be sold for.
44 | 
45 | A Cartier diamond brooch owned by the late Margaret Thatcher, which estimated to fetch <START:money> £35,000 <END> was sold for <START:money> £81,250 <END> .
46 | 
47 | French beauty group L'Oreal said it has entered into "exclusive discussions" to sell the natural products cosmetics business for an enterprise value of <START:money> €1bn <END> ( <START:money> $1.1bn <END> ), after buying it eleven years ago.
48 | 
49 | British multinational utility firm Centrica has announced that intends to sell <START:percentage> 60% <END> of its stake in a joint venture oil and gas exploration and production enterprise to a consortium of firms.
50 | The deal is expected to cost <START:money> £240m <END>.
51 | 
52 | Centrica's cash flows also jumped by nearly <START:percentage> 130% <END> to <START:money> £2bn <END> in 2016.
53 | Its operations are mainly confined to Europe and North America.
54 | 
55 | MIE Holdings is a Hong Kong Stock Exchange-listed oil and gas firm.
56 | It managed to bring its net loss down by <START:percentage> 13% <END> to <START:money> 1.3 billion renminbi <END> during the 2016 financial year.
57 | 
58 | The US Central Intelligence Agency (CIA) has estimated that after Liechtenstein it is Qatar who has the world's second largest GDP per capita with a value of <START:money> $129,700 <END> .
59 | Moreover, the Sovereign Wealth Fund Institute has ranked the Qatar Investment Authority as the world's 9th largest sovereign wealth fund, with a total asset value of <START:money> $335bn <END> . 
60 | 
61 | Data released by Halifax on 7 June showed UK house prices rose <START:percentage> 3.3% <END> year-on-year in May, following a <START:percentage> 3.8% <END> increase in April.
62 | 
63 | The national sales gauge improved slightly to <START:percentage> -8% <END> in May from <START:percentage> -9% <END> in the previous month.
64 | 
65 | At 4:51pm BST, the Brent front month futures contract for August delivery was down <START:percentage> 3.49% <END> or <START:money> $1.75 <END> at <START:money> $48.37 <END> per barrel, with the global proxy benchmark having breached the psychological <START:money> $50 <END> -level on Monday.
66 | 
67 | Concurrently, the West Texas Intermediate (WTI) was down <START:percentage> 4.25% <END> or <START:money> $2.05 <END> to <START:money> $46.11 <END> per barrel, having fallen as low as <START:money> $45.92 <END> in late European trading, after US Energy Information Administration said the country's stockpiles rose by 3.3m barrels last week, against market estimates for a 3.5m-barrel drop.
68 | 
69 | The goods trade deficit with the rest of the world narrowed to <START:money> £10.4bn <END> from <START:money> £12bn <END> the month before, as import levels of mechanical machinery, oil and cars fell during the period.
70 | 
71 | The overall trade deficit - covering goods and services - narrowed to <START:money> £2.1bn <END> in April from <START:money> £3.9bn <END> the month before.
72 | Economists had expected the deficit to amount to <START:money> £3.5bn <END> .
73 | 
74 | The pound lost more than <START:money> 2 cents <END> against the dollar within seconds of the exit poll result, falling from <START:money> $1.2955 <END> to <START:money> $1.2752 <END> late Thursday.
75 | 
76 | The General Administration of Customs revealed on Thursday that China's exports increased by <START:percentage> 8.7% <END> year-on-year in May, beating forecasts of a <START:percentage> 7.2% <END> increase.
77 | The country's trade surplus now amounts to <START:money> $40.8bn <END> ( <START:money> £31bn <END> ).
78 | 
79 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by <START:percentage> 0.3% <END> during the first quarter of 2017.
80 | Although the reading missed a forecast of <START:percentage> 0.6% <END> growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years.
81 | 


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person.test:
--------------------------------------------------------------------------------
 1 | <START:person> Theresa May <END> has said she will form a government with the support of the Democratic Unionists that can provide "certainty" for the future.
 2 | Speaking after visiting Buckingham Palace, she said only her party had the "legitimacy" to govern after winning the most seats and votes.
 3 | In a short statement outside Downing Street, which followed a 25-minute audience with <START:person> The Queen <END> , Mrs <START:person> May <END> said she intended to form a government which could "provide certainty and lead Britain forward at this critical time for our country".
 4 | 
 5 | The BBC's <START:person> Laura Kuenssberg <END> said the PM had returned to No 10 a "diminished figure", having ended up with 12 fewer seats than when she called the election in April.
 6 | She had called the election with the stated reason that it would strengthen her hand in negotiations for the UK to leave the EU - the talks are due to start on 19 June.
 7 | 
 8 | The general election has ended in a hung Parliament, where no party has the 326 seats needed to get an overall majority in the House of Commons.
 9 | So what happens now?
10 | Who is the prime minister?
11 |  <START:person> Theresa May <END> remains prime minister.
12 | She aims to form a minority government, working with the Democratic Unionist Party.
13 | 
14 | The Labour leader does not have to wait until Mrs <START:person> May <END> has exhausted all her options before he starts trying to put a deal of his own together.
15 | He can hold talks with potential partners at the same time as Mrs <START:person> May <END> .
16 | They may even be talking to the same people.
17 | 
18 | Labour had a majority of three after the 1974 general election - but this had vanished by 1977, and it stayed in power thanks to a "pact" with the Liberal Party.
19 | And <START:person> John Major's <END> Conservative government started out with a majority of 21 in 1992 but was a minority government by the 1997 general election.
20 | 
21 | On 8 May 2013, one week before the Pakistani election, the third author, in his keynote address at the Sentiment Analysis Symposium, forecast the winner of the Pakistani election.
22 | The chart in Figure 1 shows varying sentiment on the candidates for prime minister of Pakistan in that election.
23 | The next day, the BBC’s <START:person> Owen Bennett Jones <END> , reporting from Islamabad, wrote an article titled “Pakistan Elections: Five Reasons Why the Vote is Unpredictable.”
24 | 
25 | At the moment the first deadline is Tuesday 13 June, when the new Parliament meets for the first time.
26 |  Mrs <START:person> May <END> has until this date to put together a deal to keep herself in power or resign, according to official guidance issued by the Cabinet Office.
27 | If she were to resign, Mrs <START:person> May <END> must be clear that <START:person> Jeremy Corbyn <END> can form a government and that she can't.
28 | She is entitled to wait until the new Parliament to see if she has the confidence of the House of Commons.
29 | 
30 | Japan's parliament has passed a one-off bill to allow Emperor <START:person> Akihito <END> to abdicate, the first emperor to do so in 200 years.
31 | The 83-year-old said last year that his age and health were making it hard for him to fulfil his official duties.
32 | But there was no provision under existing law for him to stand down.
33 | The government will now begin the process of arranging his abdication, expected to happen in late 2018, and the handover to <START:person> Crown Prince Naruhito <END> .
34 | 
35 | Germany has called for diplomatic efforts to resolve a growing crisis over Qatar, which is accused by four Arab neighbours of funding terrorism.
36 | Saudi Arabia, the United Arab Emirates (UAE), Egypt and Bahrain cut travel and diplomatic ties with Qatar on Monday.
37 | Speaking after hosting his Qatari counterpart on Friday, German Foreign Minister <START:person> Sigmar Gabriel <END> called for the "sea and air blockades" to be lifted.
38 | 
39 | Mr <START:person> Gabriel <END> met Saudi Foreign Minister <START:person> Adil al-Ahmad al-Jubayr <END> two days ago, and said all parties were seeking "to avoid further escalation".
40 | Then on Friday, Mr <START:person> Gabriel <END> spoke to Qatari Foreign Minister <START:person> Sheikh Mohammed bin Abdulrahman al-Thani <END> in the northern German town of Wolfenbuettel.
41 | 
42 | On Friday Saudi Arabia and its three allies issued a list of 49 people - including Muslim Brotherhood spiritual leader <START:person> Yusuf al-Qaradawi <END> - and 12 Qatar-backed charities and groups accused of links with militants.
43 | On Thursday, Qatar's <START:person> Sheikh Mohammed <END> said his country had been isolated "because we are successful and progressive" and called his country "a platform for peace not terrorism".
44 | 
45 | US President <START:person> Donald Trump <END> has urged Nato allies to boost defence spending.
46 | Last month German Chancellor <START:person> Angela Merkel <END> said Europe could no longer "completely depend" on the US and UK , following the election of President <START:person> Trump <END> and the triggering of Brexit.
47 | 
48 | The UK has long been one of the strongest voices in the EU against any moves towards forming a European army.
49 | The UK says the EU must not duplicate Nato's role as the main pillar of European defence.
50 | However, Mr <START:person> Trump's <END> criticisms of Nato have raised questions about the US commitment to defending Europe.
51 | 
52 | The Cabinet Office revealed on Wednesday that Japan's GDP grew by 0.3% during the first quarter of 2017.
53 | Although the reading missed a forecast of 0.6% growth, Japan's economy continued to expand in five consecutive quarters, the country's highest streak in three years.
54 | 
55 | The General Administration of Customs revealed on Thursday that China's exports increased by 8.7% year-on-year in May, beating forecasts of a 7.2% increase.
56 | The country's trade surplus now amounts to $40.8bn (£31bn).
57 | 
58 | Furthermore, the European Central Bank is scheduled to hold its monetary policy meeting today as well.
59 | Expectations point towards the Bank reaffirming its decision to maintain a loose monetary policy, but spectators will carefully scrutinize ECB President <START:person> Mario Draghi <END> and his team's rhetoric in order to get an indication of when interest rates could possibly rise.
60 | 
61 | <START:person> James Trescothick <END> , senior global strategist, said there is the potential that Opec's agreement to limit current oil production could collapse.
62 | If that happens oil prices could potentially fall.
63 | However, Qatar is one of the smallest oil producers in Opec, with estimated proven reserves of 25bn barrels which is dwarfed by Saudi Arabia's 266bn barrels.
64 | <START:person> Trescothick <END> believes the main danger to the region and indeed to oil prices is that increased tension could lead Qatar to reach out even further to Iran for support which would no doubt sour diplomatic relations further.
65 | 


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-person_organization_location_date.test:
--------------------------------------------------------------------------------
 1 | <START:person> Theresa May <END> has said she will form a government with the support of the <START:organization> Democratic Unionists <END> that can provide "certainty" for the future.
 2 | Speaking after visiting <START:location> Buckingham Palace <END> , she said only her party had the "legitimacy" to govern after winning the most seats and votes.
 3 | In a short statement outside <START:location> Downing Street <END> , which followed a 25-minute audience with <START:person> The Queen <END> , Mrs <START:person> May <END> said she intended to form a government which could "provide certainty and lead <START:location> Britain <END> forward at this critical time for our country".
 4 | 
 5 | The <START:organization> BBC's <END> <START:person> Laura Kuenssberg <END> said the PM had returned to No 10 a "diminished figure", having ended up with 12 fewer seats than when she called the election in <START:date> April <END> .
 6 | She had called the election with the stated reason that it would strengthen her hand in negotiations for the <START:location> UK <END> to leave the <START:location> EU <END> - the talks are due to start on <START:date> 19 June <END> .
 7 | 
 8 | The general election has ended in a hung <START:organization> Parliament <END> , where no party has the 326 seats needed to get an overall majority in the <START:location> House of Commons <END> .
 9 | So what happens now?
10 | Who is the prime minister?
11 |  <START:person> Theresa May <END> remains prime minister.
12 | She aims to form a minority government, working with the <START:organization> Democratic Unionist Party <END> .
13 | 
14 | The <START:organization> Labour <END> leader does not have to wait until Mrs <START:person> May <END> has exhausted all her options before he starts trying to put a deal of his own together.
15 | He can hold talks with potential partners at the same time as Mrs <START:person> May <END> .
16 | They may even be talking to the same people.
17 | 
18 | Labour had a majority of three after the <START:date> 1974 <END> general election - but this had vanished by <START:date> 1977 <END> , and it stayed in power thanks to a "pact" with the <START:organization> Liberal Party <END> .
19 | And <START:person> John Major's <END> Conservative government started out with a majority of 21 in <START:date> 1992 <END> but was a minority government by the <START:date> 1997 <END> general election.
20 | 
21 | On <START:date> 8 May 2013 <END> , one week before the <START:location> Pakistani <END> election, the third author, in his keynote address at the <START:organization> Sentiment Analysis Symposium <END> , forecast the winner of the <START:location> Pakistani <END> election.
22 | The chart in Figure 1 shows varying sentiment on the candidates for prime minister of <START:location> Pakistan <END> in that election.
23 | The next day, the <START:organization> BBC’s <END> <START:person> Owen Bennett Jones <END> , reporting from <START:location> Islamabad <END> , wrote an article titled “ <START:location> Pakistan <END> Elections: Five Reasons Why the Vote is Unpredictable.”
24 | 
25 | At the moment the first deadline is <START:date> Tuesday <END> <START:date> 13 June <END> , when the new <START:organization> Parliament <END> meets for the first time.
26 |  Mrs <START:person> May <END> has until this date to put together a deal to keep herself in power or resign, according to official guidance issued by the <START:organization> Cabinet Office <END> .
27 | If she were to resign, Mrs <START:person> May <END> must be clear that <START:person> Jeremy Corbyn <END> can form a government and that she can't.
28 | She is entitled to wait until the new <START:organization> Parliament <END> to see if she has the confidence of the <START:organization> House of Commons <END> .
29 | 
30 |  <START:location> Japan's <END> parliament has passed a one-off bill to allow Emperor <START:person> Akihito <END> to abdicate, the first emperor to do so in 200 years.
31 | The 83-year-old said last year that his age and health were making it hard for him to fulfil his official duties.
32 | But there was no provision under existing law for him to stand down.
33 | The government will now begin the process of arranging his abdication, expected to happen in late <START:date> 2018 <END> , and the handover to <START:person> Crown Prince Naruhito <END> .
34 | 
35 |  <START:location> Germany <END> has called for diplomatic efforts to resolve a growing crisis over <START:location> Qatar <END> , which is accused by four Arab neighbours of funding terrorism.
36 | <START:location> Saudi Arabia <END> , the <START:location> United Arab Emirates <END> ( <START:location> UAE <END> ), <START:location> Egypt <END> and <START:location> Bahrain <END> cut travel and diplomatic ties with <START:location> Qatar <END> on <START:date> Monday <END> .
37 | Speaking after hosting his Qatari counterpart on <START:date> Friday <END> , <START:location> German <END> Foreign Minister <START:person> Sigmar Gabriel <END> called for the "sea and air blockades" to be lifted.
38 | 
39 | Mr <START:person> Gabriel <END> met Saudi Foreign Minister <START:person> Adil al-Ahmad al-Jubayr <END> <START:date> two days ago <END> , and said all parties were seeking "to avoid further escalation".
40 | Then on <START:date> Friday <END> , Mr <START:person> Gabriel <END> spoke to Qatari Foreign Minister <START:person> Sheikh Mohammed bin Abdulrahman al-Thani <END> in the northern German town of <START:location> Wolfenbuettel <END> .
41 | 
42 | On <START:date> Friday <END> <START:location> Saudi Arabia <END> and its three allies issued a list of 49 people - including <START:organization> Muslim Brotherhood <END> spiritual leader <START:person> Yusuf al-Qaradawi <END> - and 12 Qatar-backed charities and groups accused of links with militants.
43 | On <START:date> Thursday <END> , Qatar's <START:person> Sheikh Mohammed <END> said his country had been isolated "because we are successful and progressive" and called his country "a platform for peace not terrorism".
44 | 
45 | <START:location> US <END> President <START:person> Donald Trump <END> has urged <START:organization> Nato <END> allies to boost defence spending.
46 | Last month German Chancellor <START:person> Angela Merkel <END> said <START:location> Europe <END> could no longer "completely depend" on the <START:location> US <END> and <START:location> UK <END> , following the election of President <START:person> Trump <END> and the triggering of Brexit.
47 | 
48 | The <START:location> UK <END> has long been one of the strongest voices in the <START:location> EU <END> against any moves towards forming a <START:organization> European army <END> .
49 | The <START:location> UK <END> says the <START:location> EU <END> must not duplicate <START:organization> Nato's <END> role as the main pillar of European defence.
50 | However, Mr <START:person> Trump's <END> criticisms of <START:organization> Nato <END> have raised questions about the US commitment to defending <START:location> Europe <END> .
51 | 
52 | The <START:organization> Cabinet Office <END> revealed on <START:date> Wednesday <END> that <START:location> Japan's <END> GDP grew by 0.3% during the <START:date> first quarter of 2017 <END> .
53 | Although the reading missed a forecast of 0.6% growth, <START:location> Japan's <END> economy continued to expand in five consecutive quarters, the country's highest streak in three years.
54 | 
55 | The General Administration of Customs revealed on <START:date> Thursday <END> that <START:location> China's <END> exports increased by 8.7% year-on-year in <START:date> May <END> , beating forecasts of a 7.2% increase.
56 | The country's trade surplus now amounts to $40.8bn (£31bn).
57 | 
58 | Furthermore, the <START:organization> European Central Bank <END> is scheduled to hold its monetary policy meeting today as well.
59 | Expectations point towards the <START:organization> Bank <END> reaffirming its decision to maintain a loose monetary policy, but spectators will carefully scrutinize <START:organization> ECB <END> President <START:person> Mario Draghi <END> and his team's rhetoric in order to get an indication of when interest rates could possibly rise.
60 | 
61 | <START:person> James Trescothick <END> , senior global strategist, said there is the potential that <START:organization> Opec's <END> agreement to limit current oil production could collapse.
62 | If that happens oil prices could potentially fall.
63 | However, <START:location> Qatar <END> is one of the smallest oil producers in <START:organization> Opec <END> , with estimated proven reserves of 25bn barrels which is dwarfed by <START:location> Saudi Arabia's <END> 266bn barrels.
64 | <START:person> Trescothick <END> believes the main danger to the region and indeed to oil prices is that increased tension could lead <START:location> Qatar <END> to reach out even further to <START:location> Iran <END> for support which would no doubt sour diplomatic relations further.
65 | 


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-time.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-ner-time.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-pos-maxent.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-pos-maxent.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sent.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sent.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sentiment-tweets_toy.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-sentiment-tweets_toy.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/en-token.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/graphaware/neo4j-nlp-opennlp/e918bb45f27a00b6a285ea16fcec326609fcdef1/src/main/resources/com/graphaware/nlp/processor/opennlp/en-token.bin


--------------------------------------------------------------------------------
/src/main/resources/com/graphaware/nlp/processor/opennlp/sentiment_tweets.train:
--------------------------------------------------------------------------------
  1 | 3	Watching a nice movie
  2 | 1	The painting is ugly, will return it tomorrow...
  3 | 3	One of the best soccer games, worth seeing it
  4 | 3	Very tasty, not only for vegetarians
  5 | 3	Super party!
  6 | 1	Too early to travel..need a coffee
  7 | 1	Damn..the train is late again...
  8 | 1	Bad news, my flight just got cancelled.
  9 | 3	Happy birthday mr. president
 10 | 3	Just watch it. Respect.
 11 | 3	Wonderful sunset.
 12 | 3	Bravo, first title in 2014!
 13 | 1	Had a bad evening, need urgently a beer.
 14 | 1	I put on weight again
 15 | 3	On today's show we met Angela, a woman with an amazing story
 16 | 3	I fell in love again
 17 | 1	I lost my keys
 18 | 3	On a trip to Iceland
 19 | 3	Happy in Berlin
 20 | 1	I hate Mondays
 21 | 3	Love the new book I reveived for Christmas
 22 | 1	He killed our good mood
 23 | 3	I am in good spirits again
 24 | 3	This guy creates the most awesome pics ever 
 25 | 1	The dark side of a selfie.
 26 | 3	Cool! John is back!
 27 | 3	Many rooms and many hopes for new residents
 28 | 1	False hopes for the people attending the meeting
 29 | 3	I set my new year's resolution
 30 | 1	The ugliest car ever!
 31 | 1	Feeling bored
 32 | 1	Need urgently a pause
 33 | 3	Nice to see Ana made it
 34 | 3	My dream came true
 35 | 1	I didn't see that one coming
 36 | 1	Sorry mate, there is no more room for you
 37 | 1	Who could have possibly done this?
 38 | 3	I won the challenge
 39 | 1	I feel bad for what I did		
 40 | 3	I had a great time tonight
 41 | 3	It was a lot of fun
 42 | 3	Thank you Molly making this possible
 43 | 1	I just did a big mistake
 44 | 3	I love it!!
 45 | 1	I never loved so hard in my life
 46 | 1	I hate you Mike!!
 47 | 1	I hate to say goodbye
 48 | 3	Lovely!
 49 | 3	Like and share if you feel the same
 50 | 1	Never try this at home
 51 | 1	Don't spoil it!
 52 | 3	I love rock and roll
 53 | 1	The more I hear you, the more annoyed I get
 54 | 3	Finnaly passed my exam!
 55 | 3	Lovely kittens
 56 | 1	I just lost my appetite
 57 | 1	Sad end for this movie
 58 | 1	Lonely, I am so lonely
 59 | 3	Beautiful morning
 60 | 3	She is amazing
 61 | 3	Enjoying some time with my friends
 62 | 3	Special thanks to Marty
 63 | 3	Thanks God I left on time
 64 | 3	Greateful for a wonderful meal
 65 | 3	So happy to be home
 66 | 1	Hate to wait on a long queue		
 67 | 1	No cab available
 68 | 1	Electricity outage, this is a nightmare
 69 | 1	Nobody to ask about directions
 70 | 3	Great game!
 71 | 3	Nice trip
 72 | 3	I just received a pretty flower
 73 | 3	Excellent idea
 74 | 3	Got a new watch. Feeling happy
 75 | 1	I feel sick
 76 | 1	I am very tired
 77 | 3	Such a good taste 
 78 | 1	Such a bad taste
 79 | 3	Enjoying brunch
 80 | 1	I don't recommend this restaurant
 81 | 3	Thank you mom for supporting me
 82 | 1	I will never ever call you again
 83 | 1	I just got kicked out of the contest
 84 | 3	Smiling
 85 | 1	Big pain to see my team loosing
 86 | 1	Bitter defeat tonight
 87 | 1	My bike was stollen
 88 | 3	Great to see you!
 89 | 1	I lost every hope for seeing him again
 90 | 3	Nice dress!
 91 | 3	Stop wasting my time
 92 | 3	I have a great idea
 93 | 3	Excited to go to the pub
 94 | 3	Feeling proud
 95 | 3	Cute bunnies
 96 | 1	Cold winter ahead
 97 | 1	Hopless struggle..
 98 | 1	Ugly hat
 99 | 3	Big hug and lots of love
100 | 3	I hope you have a wonderful celebration
101 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/OpenNLPIntegrationTest.java:
--------------------------------------------------------------------------------
 1 | package com.graphaware.nlp.processor.opennlp;
 2 | 
 3 | import com.graphaware.nlp.NLPIntegrationTest;
 4 | import com.graphaware.nlp.dsl.AbstractDSL;
 5 | import org.neo4j.kernel.impl.proc.Procedures;
 6 | import org.reflections.Reflections;
 7 | 
 8 | import java.util.Set;
 9 | 
10 | public class OpenNLPIntegrationTest extends NLPIntegrationTest {
11 | 
12 |     @Override
13 |     protected void registerProceduresAndFunctions(Procedures procedures) throws Exception {
14 |         super.registerProceduresAndFunctions(procedures);
15 |         Reflections reflections = new Reflections("com.graphaware.nlp.dsl");
16 |         Set<Class<? extends AbstractDSL>> cls = reflections.getSubTypesOf(AbstractDSL.class);
17 |         for (Class c : cls) {
18 |             procedures.registerProcedure(c);
19 |             procedures.registerFunction(c);
20 |         }
21 |     }
22 | }
23 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/OpenNLPPipelineTest.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * To change this license header, choose License Headers in Project Properties.
  3 |  * To change this template file, choose Tools | Templates
  4 |  * and open the template in the editor.
  5 |  */
  6 | package com.graphaware.nlp.processor.opennlp;
  7 | 
  8 | import java.io.FileInputStream;
  9 | import java.io.IOException;
 10 | import java.io.InputStream;
 11 | import opennlp.tools.namefind.NameFinderME;
 12 | import opennlp.tools.namefind.TokenNameFinderModel;
 13 | import opennlp.tools.tokenize.Tokenizer;
 14 | import opennlp.tools.tokenize.TokenizerME;
 15 | import opennlp.tools.tokenize.TokenizerModel;
 16 | import opennlp.tools.util.Span;
 17 | import org.junit.After;
 18 | import org.junit.AfterClass;
 19 | import org.junit.Before;
 20 | import org.junit.BeforeClass;
 21 | import org.junit.Test;
 22 | 
 23 | /**
 24 |  *
 25 |  * @author ale
 26 |  */
 27 | public class OpenNLPPipelineTest {
 28 | 
 29 |     public OpenNLPPipelineTest() {
 30 |     }
 31 | 
 32 |     @BeforeClass
 33 |     public static void setUpClass() {
 34 |     }
 35 | 
 36 |     @AfterClass
 37 |     public static void tearDownClass() {
 38 |     }
 39 | 
 40 |     @Before
 41 |     public void setUp() {
 42 |     }
 43 | 
 44 |     @After
 45 |     public void tearDown() {
 46 |     }
 47 | 
 48 |     /**
 49 |      * Test of annotate method, of class OpenNLPPipeline.
 50 |      */
 51 |     @Test
 52 |     public void testAnnotate() {
 53 |         String text = "Hello Dralyn. Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
 54 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text);
 55 |         OpenNLPPipeline instance = new PipelineBuilder()
 56 |                 .tokenize()
 57 |                 /*.extractPos()
 58 |                 .extractRelations()*/
 59 |                 .build();
 60 |         instance.annotate(document);
 61 | 
 62 |         document.getSentences().forEach((sentence) -> {
 63 |             System.out.println(">>>" + sentence.getSentence());
 64 |             if (sentence.getPhrasesIndex() != null) {
 65 |                 sentence.getPhrasesIndex().forEach((phrase) -> {
 66 |                     System.out.println(">>>" + sentence.getChunkStrings()[phrase]);
 67 |                 });
 68 |             }
 69 |         });
 70 |     }
 71 | 
 72 |     @Test
 73 |     public void testAnnotateNER() {
 74 |         String sentence = "Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
 75 |         
 76 |         OpenNLPAnnotation document = new OpenNLPAnnotation(sentence);
 77 |         OpenNLPPipeline instance = new PipelineBuilder()
 78 |                 .tokenize()
 79 |                 /*.extractPos()
 80 |                 .extractRelations()*/
 81 |                 .build();
 82 |         instance.annotate(document);
 83 | 
 84 |         document.getSentences().forEach((item) -> {
 85 |             item.getTokens().stream().forEach((token) -> {
 86 |                 System.out.println("" + token.getTokenPOS()+ " " + token.getToken() + " - " + token.getToken() + " " + token.getTokenNEs());
 87 |             });
 88 |         });
 89 | 
 90 |         InputStream modelInToken = null;
 91 |         InputStream modelIn = null;
 92 | 
 93 |         try {
 94 | 
 95 |             //1. convert sentence into tokens
 96 |             modelInToken = this.getClass().getResourceAsStream("en-token.bin");
 97 |             TokenizerModel modelToken = new TokenizerModel(modelInToken);
 98 |             Tokenizer tokenizer = new TokenizerME(modelToken);
 99 |             String tokens[] = tokenizer.tokenize(sentence);
100 | 
101 |             //2. find names
102 |             modelIn = this.getClass().getResourceAsStream("en-ner-person.bin");
103 |             TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
104 |             NameFinderME nameFinder = new NameFinderME(model);
105 | 
106 |             Span nameSpans[] = nameFinder.find(tokens);
107 | 
108 |             //find probabilities for names
109 |             double[] spanProbs = nameFinder.probs(nameSpans);
110 | 
111 |             //3. print names
112 |             for (int i = 0; i < nameSpans.length; i++) {
113 |                 System.out.println("Span: " + nameSpans[i].toString());
114 |                 System.out.println("Covered text is: " + tokens[nameSpans[i].getStart()] + " " + tokens[nameSpans[i].getStart() + 1]);
115 |                 System.out.println("Probability is: " + spanProbs[i]);
116 |             }
117 |         } catch (Exception ex) {
118 |             ex.printStackTrace();
119 |         } finally {
120 |             try {
121 |                 if (modelInToken != null) {
122 |                     modelInToken.close();
123 |                 }
124 |             } catch (IOException e) {
125 |                 e.printStackTrace();
126 |             };
127 |             try {
128 |                 if (modelIn != null) {
129 |                     modelIn.close();
130 |                 }
131 |             } catch (IOException e) {
132 |                 e.printStackTrace();
133 |             };
134 |         }
135 |     }
136 | 
137 |     @Test
138 |     public void testStopWordsAnnotate() {
139 |         String text = "Hello Dralyn. Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
140 |         OpenNLPAnnotation document = new OpenNLPAnnotation(text);
141 |         OpenNLPPipeline instance = new PipelineBuilder()
142 |                 .tokenize()
143 |                 .customStopWordAnnotator("hello,is,and,of,the,to")
144 |                 /*.extractPos()
145 |                 .extractRelations()*/
146 |                 .build();
147 |         instance.annotate(document);
148 | 
149 |         document.getSentences().forEach((sentence) -> {
150 |             System.out.println(">>>" + sentence.getSentence());
151 |             if (sentence.getTokens() != null) {
152 |                 sentence.getTokens().forEach((token) -> {
153 |                     System.out.print(" " + token);
154 |                 });
155 |                 System.out.print("\n ");
156 |             }
157 |         });
158 |     }
159 | 
160 | }
161 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/TestOpenNLP.java:
--------------------------------------------------------------------------------
 1 | //package com.graphaware.nlp.processor.opennlp;
 2 | //
 3 | //import com.graphaware.nlp.processor.opennlp.OpenNLPPhraseProcessor;
 4 | //import org.junit.Test;
 5 | //import org.slf4j.Logger;
 6 | //import org.slf4j.LoggerFactory;
 7 | //
 8 | //import java.util.List;
 9 | //
10 | ///**
11 | // * Created by michael kilgore on 10/2/16.
12 | // * Begining test class for OpenNLP Phrases
13 | // */
14 | //public class TestOpenNLP
15 | //{
16 | //    private static final Logger LOG = LoggerFactory.getLogger(TestOpenNLP.class);
17 | //
18 | //    public TestOpenNLP()
19 | //    {
20 | //    }
21 | //
22 | //    @Test
23 | //    public void testPhrase()
24 | //    {
25 | //        LOG.info("starting test");
26 | //        String workingDir = System.getProperty("user.dir");
27 | //        OpenNLPPhraseProcessor openNLP = new OpenNLPPhraseProcessor();
28 | //        openNLP.init(workingDir+"/");
29 | //        List<String> phrases = openNLP.processForPhrases("Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.");
30 | //        for (String phrase : phrases)
31 | //            LOG.info(phrase);
32 | //    }
33 | //}
34 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/TextProcessorTest.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * Copyright (c) 2013-2016 GraphAware
  3 |  *
  4 |  * This file is part of the GraphAware Framework.
  5 |  *
  6 |  * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of
  7 |  * the GNU General Public License as published by the Free Software Foundation, either
  8 |  * version 3 of the License, or (at your option) any later version.
  9 |  *
 10 |  * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
 11 |  * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 12 |  * See the GNU General Public License for more details. You should have received a copy of
 13 |  * the GNU General Public License along with this program.  If not, see
 14 |  * <http://www.gnu.org/licenses/>.
 15 |  */
 16 | package com.graphaware.nlp.processor.opennlp;
 17 | 
 18 | import com.graphaware.nlp.domain.AnnotatedText;
 19 | import com.graphaware.nlp.domain.Sentence;
 20 | import com.graphaware.nlp.domain.Tag;
 21 | import com.graphaware.nlp.dsl.request.PipelineSpecification;
 22 | import com.graphaware.nlp.processor.AbstractTextProcessor;
 23 | import com.graphaware.nlp.processor.TextProcessor;
 24 | import com.graphaware.nlp.util.ServiceLoader;
 25 | import com.graphaware.nlp.util.TestAnnotatedText;
 26 | import com.graphaware.test.integration.EmbeddedDatabaseIntegrationTest;
 27 | 
 28 | import java.util.Collections;
 29 | import java.util.HashMap;
 30 | import java.util.Map;
 31 | 
 32 | import org.junit.BeforeClass;
 33 | import org.junit.Test;
 34 | import org.neo4j.graphdb.Node;
 35 | import org.neo4j.graphdb.QueryExecutionException;
 36 | import org.neo4j.graphdb.ResourceIterator;
 37 | import org.neo4j.graphdb.Result;
 38 | import org.neo4j.graphdb.Transaction;
 39 | 
 40 | import static com.graphaware.nlp.util.TagUtils.newTag;
 41 | import static org.junit.Assert.assertEquals;
 42 | import static org.junit.Assert.assertFalse;
 43 | import static org.junit.Assert.assertNull;
 44 | 
 45 | public class TextProcessorTest extends OpenNLPIntegrationTest {
 46 | 
 47 |     private static TextProcessor textProcessor;
 48 |     private static final String TEXT_PROCESSOR = "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor";
 49 |     private static PipelineSpecification PIPELINE_DEFAULT;
 50 | 
 51 |     @BeforeClass
 52 |     public static void init() {
 53 |         textProcessor = ServiceLoader.loadTextProcessor(TEXT_PROCESSOR);
 54 |         textProcessor.init();
 55 |         Map<String, Object> processingSteps = new HashMap<>();
 56 |         processingSteps.put(AbstractTextProcessor.STEP_TOKENIZE, true);
 57 |         processingSteps.put(AbstractTextProcessor.STEP_NER, true);
 58 |         PipelineSpecification pipelineSpecification = new PipelineSpecification("default", OpenNLPTextProcessor.class.getName(), processingSteps, null, 1L, Collections.emptyList(), Collections.emptyList());
 59 |         PIPELINE_DEFAULT = pipelineSpecification;
 60 |         textProcessor.createPipeline(PIPELINE_DEFAULT);
 61 |     }
 62 | 
 63 |     @Test
 64 |     public void testAnnotatedText() {
 65 |         AnnotatedText annotatedText = textProcessor.annotateText("On 8 May 2013, "
 66 |                 + "one week before the Pakistani election, the third author, "
 67 |                 + "in his keynote address at the Sentiment Analysis Symposium, "
 68 |                 + "forecast the winner of the Pakistani election. The chart "
 69 |                 + "in Figure 1 shows varying sentiment on the candidates for "
 70 |                 + "prime minister of Pakistan in that election. The next day, "
 71 |                 + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote "
 72 |                 + "an article titled “Pakistan Elections: Five Reasons Why the "
 73 |                 + "Vote is Unpredictable,”1 in which he claimed that the election "
 74 |                 + "was too close to call. It was not, and despite his being in Pakistan, "
 75 |                 + "the outcome of the election was exactly as we predicted.", "en", PIPELINE_DEFAULT);
 76 | 
 77 |         TestAnnotatedText test = new TestAnnotatedText(annotatedText);
 78 |         test.assertSentencesCount(4);
 79 |         test.assertTagsCountInSentence(15, 0);
 80 |         test.assertTagsCountInSentence(11, 1);
 81 |         test.assertTagsCountInSentence(22, 2);//(24, 2); // it's 22 because `"Pakistan` & `"1` are not lemmatized by OpenNLP and checkLemmaIsValid() removes non-lemmatized version because of symbols `"`
 82 |         test.assertTagsCountInSentence(8, 3);//(9, 3); // it's 8 because OpenNLP has "be" among stopwords
 83 | 
 84 |         test.assertTag(newTag("pakistan", Collections.singletonList("LOCATION"), Collections.emptyList()));
 85 |         test.assertTag(newTag("show", Collections.emptyList(), Collections.singletonList("VBZ")));
 86 | 
 87 |     }
 88 | 
 89 |     @Test
 90 |     public void testLemmaLowerCasing() {
 91 |         AnnotatedText annotateText = textProcessor.annotateText(
 92 |                 "Collibra’s Data Governance Innovation: Enabling Data as a Strategic Asset",
 93 |                 "en", PIPELINE_DEFAULT);
 94 | 
 95 |         assertEquals(1, annotateText.getSentences().size());
 96 |         assertEquals("governance", annotateText.getSentences().get(0).getTagOccurrence(16).getLemma());
 97 |     }
 98 | 
 99 |     private void checkLocation(String location) throws QueryExecutionException {
100 |         try (Transaction tx = getDatabase().beginTx()) {
101 |             ResourceIterator<Object> rowIterator = getTagsIterator(location);
102 |             Node pakistanNode = (Node) rowIterator.next();
103 |             assertFalse(rowIterator.hasNext());
104 |             String[] neList = (String[]) pakistanNode.getProperty("ne");
105 |             assertEquals(neList[0], "location");
106 |             tx.success();
107 |         }
108 |     }
109 | 
110 |     private void checkVerb(String verb) throws QueryExecutionException {
111 |         try (Transaction tx = getDatabase().beginTx()) {
112 |             ResourceIterator<Object> rowIterator = getTagsIterator(verb);
113 |             Node pakistanNode = (Node) rowIterator.next();
114 |             assertFalse(rowIterator.hasNext());
115 |             String[] posL = (String[]) pakistanNode.getProperty("pos");
116 |             assertEquals(posL[0], "VBZ");
117 |             tx.success();
118 |         }
119 |     }
120 | 
121 |     private ResourceIterator<Object> getTagsIterator(String value) throws QueryExecutionException {
122 |         Map<String, Object> params = new HashMap<>();
123 |         params.put("value", value);
124 |         Result pakistan = getDatabase().execute("MATCH (n:Tag {value: {value}}) return n", params);
125 |         ResourceIterator<Object> rowIterator = pakistan.columnAs("n");
126 |         return rowIterator;
127 |     }
128 | 
129 |     @Test
130 |     public void testAnnotatedTag() {
131 |         Tag annotateTag = textProcessor.annotateTag("winners", "en", PIPELINE_DEFAULT);
132 |         assertEquals(annotateTag.getLemma(), "winner");
133 |     }
134 | 
135 | //    @Test
136 | //    public void testAnnotationAndConcept() {
137 | //        // ConceptNet5Importer.Builder() - arguments need fixing
138 | //        /*TextProcessor textProcessor = ServiceLoader.loadTextProcessor("com.graphaware.nlp.processor.stanford.StanfordTextProcessor");
139 | //        ConceptNet5Importer conceptnet5Importer = new ConceptNet5Importer.Builder("http://conceptnet5.media.mit.edu/data/5.4", textProcessor)
140 | //                .build();
141 | //        String text = "Say hi to Christophe";
142 | //        AnnotatedText annotateText = textProcessor.annotateText(text, 1, 0, "en", false);
143 | //        List<Node> nodes = new ArrayList<>();
144 | //        try (Transaction beginTx = getDatabase().beginTx()) {
145 | //            Node annotatedNode = annotateText.storeOnGraph(getDatabase(), false);
146 | //            Map<String, Object> params = new HashMap<>();
147 | //            params.put("id", annotatedNode.getId());
148 | //            Result queryRes = getDatabase().execute("MATCH (n:AnnotatedText)-[*..2]->(t:Tag) where id(n) = {id} return t", params);
149 | //            ResourceIterator<Node> tags = queryRes.columnAs("t");
150 | //            while (tags.hasNext()) {
151 | //                Node tag = tags.next();
152 | //                nodes.add(tag);
153 | //                List<Tag> conceptTags = conceptnet5Importer.importHierarchy(Tag.createTag(tag), "en");
154 | //                conceptTags.stream().forEach((newTag) -> {
155 | //                    nodes.add(newTag.storeOnGraph(getDatabase(), false));
156 | //                });
157 | //            }
158 | //            beginTx.success();
159 | //        }*/
160 | //    }
161 | 
162 |     //@Test
163 |     public void testSentiment() {
164 |         AnnotatedText annotateText = textProcessor.annotateText(
165 |                 "I really hate to study at Stanford, "
166 |                         + "it was a waste of time, I'll never be there again", "en", PIPELINE_DEFAULT);
167 |         assertEquals(1, annotateText.getSentences().size());
168 |         assertEquals(0, annotateText.getSentences().get(0).getSentiment());
169 | 
170 |         annotateText = textProcessor.annotateText(
171 |                 "It was really horrible to study at Stanford", "en", PIPELINE_DEFAULT);
172 |         assertEquals(1, annotateText.getSentences().size());
173 |         assertEquals(1, annotateText.getSentences().get(0).getSentiment());
174 | 
175 |         annotateText = textProcessor.annotateText("I studied at Stanford", "en", PIPELINE_DEFAULT);
176 |         assertEquals(1, annotateText.getSentences().size());
177 |         assertEquals(2, annotateText.getSentences().get(0).getSentiment());
178 | 
179 |         annotateText = textProcessor.annotateText("I liked to study at Stanford", "en", PIPELINE_DEFAULT);
180 |         assertEquals(1, annotateText.getSentences().size());
181 |         assertEquals(3, annotateText.getSentences().get(0).getSentiment());
182 | 
183 |         annotateText = textProcessor.annotateText(
184 |                 "I liked so much to study at Stanford, I enjoyed my time there, I would recommend every body",
185 |                 "en", PIPELINE_DEFAULT);
186 |         assertEquals(1, annotateText.getSentences().size());
187 |         assertEquals(4, annotateText.getSentences().get(0).getSentiment());
188 |     }
189 | 
190 |     @Test
191 |     public void testAnnotatedTextWithPosition() {
192 |         AnnotatedText annotateText = textProcessor.annotateText("On 8 May 2013, "
193 |                 + "one week before the Pakistani election, the third author, "
194 |                 + "in his keynote address at the Sentiment Analysis Symposium, "
195 |                 + "forecast the winner of the Pakistani election. The chart "
196 |                 + "in Figure 1 shows varying sentiment on the candidates for "
197 |                 + "prime minister of Pakistan in that election. The next day, "
198 |                 + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote "
199 |                 + "an article titled “Pakistan Elections: Five Reasons Why the "
200 |                 + "Vote is Unpredictable,”1 in which he claimed that the election "
201 |                 + "was too close to call. It was not, and despite his being in Pakistan, "
202 |                 + "the outcome of the election was exactly as we predicted.", "en", PIPELINE_DEFAULT);
203 | 
204 |         assertEquals(4, annotateText.getSentences().size());
205 |         Sentence sentence1 = annotateText.getSentences().get(0);
206 |         assertEquals(15, sentence1.getTags().size());
207 | 
208 |         assertNull(sentence1.getTagOccurrence(0));
209 |         assertEquals("8", sentence1.getTagOccurrence(3).getLemma());
210 |         assertEquals("may 2013", sentence1.getTagOccurrence(5).getLemma());
211 |         assertEquals("May 2013", sentence1.getTagOccurrences().get(5).get(0).getValue());
212 |         assertEquals("one", sentence1.getTagOccurrence(15).getLemma());
213 |         assertEquals("before", sentence1.getTagOccurrence(24).getLemma());
214 |         assertEquals("third", sentence1.getTagOccurrence(59).getLemma());
215 |         //assertEquals("sentiment analysis symposium", sentence1.getTagOccurrence(103).getLemma());
216 |         assertEquals("forecast", sentence1.getTagOccurrence(133).getLemma());
217 |         assertNull(sentence1.getTagOccurrence(184));
218 | 
219 |         Sentence sentence2 = annotateText.getSentences().get(1);
220 |         assertEquals("show", sentence2.getTagOccurrence(22).getLemma());
221 |         assertEquals("shows", sentence2.getTagOccurrences().get(22).get(0).getValue());
222 | //        assertTrue(sentence1.getPhraseOccurrence(99).contains(new Phrase("the Sentiment Analysis Symposium")));
223 | //        assertTrue(sentence1.getPhraseOccurrence(103).contains(new Phrase("Sentiment")));
224 | //        assertTrue(sentence1.getPhraseOccurrence(113).contains(new Phrase("Analysis")));
225 | //
226 | //        //his(76)-> the third author(54)
227 | //        assertTrue(sentence1.getPhraseOccurrence(55).get(1).getContent().equalsIgnoreCase("the third author"));
228 | //        Sentence sentence2 = annotateText.getSentences().get(1);
229 | //        assertEquals("chart", sentence2.getTagOccurrence(184).getLemma());
230 | //        assertEquals("Figure", sentence2.getTagOccurrence(193).getLemma());
231 |     }
232 | 
233 |     @Test
234 |     public void testAnnotatedShortText() {
235 |         AnnotatedText annotateText = textProcessor.annotateText(
236 |                 "Fixing Batch Endpoint Logging Problem", "en", PIPELINE_DEFAULT);
237 | 
238 |         assertEquals(1, annotateText.getSentences().size());
239 | //
240 | //        GraphPersistence peristence = new LocalGraphDatabase(getDatabase());
241 | //        peristence.persistOnGraph(annotateText, false);
242 | 
243 |     }
244 | 
245 |     @Test
246 |     public void testAnnotatedShortText2() {
247 |         AnnotatedText annotateText = textProcessor.annotateText(
248 |                 "Importing CSV data does nothing", "en", PIPELINE_DEFAULT);
249 |         assertEquals(1, annotateText.getSentences().size());
250 | //        GraphPersistence peristence = new LocalGraphDatabase(getDatabase());
251 | //        peristence.persistOnGraph(annotateText, false);
252 |     }
253 | }
254 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/conceptnet5/ConceptNet5ImporterTest.java:
--------------------------------------------------------------------------------
 1 | /*
 2 |  * Copyright (c) 2013-2016 GraphAware
 3 |  *
 4 |  * This file is part of the GraphAware Framework.
 5 |  *
 6 |  * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of
 7 |  * the GNU General Public License as published by the Free Software Foundation, either
 8 |  * version 3 of the License, or (at your option) any later version.
 9 |  *
10 |  * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
11 |  * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
12 |  * See the GNU General Public License for more details. You should have received a copy of
13 |  * the GNU General Public License along with this program.  If not, see
14 |  * <http://www.gnu.org/licenses/>.
15 |  */
16 | package com.graphaware.nlp.processor.opennlp.conceptnet5;
17 | 
18 | import com.graphaware.nlp.domain.Tag;
19 | import com.graphaware.nlp.processor.TextProcessor;
20 | import com.graphaware.nlp.util.ServiceLoader;
21 | import java.util.Arrays;
22 | import java.util.List;
23 | import static org.junit.Assert.assertEquals;
24 | import org.junit.Test;
25 | 
26 | public class ConceptNet5ImporterTest {
27 | 
28 |     private static final String TEXT_PROCESSOR = "com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor";
29 | 
30 |     public ConceptNet5ImporterTest() {
31 |     }
32 | 
33 | //    /**
34 | //     * Test of importHierarchy method, of class ConceptNet5Importer.
35 | //     */
36 | ////    @Test
37 | ////    public void testImportHierarchy() {
38 | ////        TextProcessor textProcessor = ServiceLoader.loadTextProcessor(TEXT_PROCESSOR);
39 | ////        //ConceptNet5Importer instance = new ConceptNet5Importer.Builder("http://conceptnet5.media.mit.edu/data/5.4", textProcessor).build();
40 | ////        ConceptNet5Importer instance = new ConceptNet5Importer.Builder("http://api.conceptnet.io").build();
41 | ////        String lang = "en";
42 | ////        Tag source = textProcessor.annotateTag("circuit", lang);
43 | ////        List<Tag> result = instance.importHierarchy(source, lang, true, 2, textProcessor, Arrays.asList("IsA"), Arrays.asList("NN"));
44 | ////        assertEquals(4, result.size());
45 | ////        //assertEquals(expResult, result);
46 | ////        // TODO review the generated test code and remove the default call to fail.
47 | ////        //fail("The test case is a prototype.");
48 | //    }
49 | 
50 | }
51 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/model/CustomSentimentModelIntegrationTest.java:
--------------------------------------------------------------------------------
 1 | package com.graphaware.nlp.processor.opennlp.model;
 2 | 
 3 | import com.graphaware.nlp.processor.opennlp.OpenNLPIntegrationTest;
 4 | import org.junit.Test;
 5 | 
 6 | import java.util.Map;
 7 | 
 8 | import static org.junit.Assert.*;
 9 | 
10 | public class CustomSentimentModelIntegrationTest extends OpenNLPIntegrationTest {
11 | 
12 |     //@Test
13 |     public void testTrainCustomModelWithProcedure() {
14 |         String p = getClass().getClassLoader().getResource("import/sentiment_tweets.train").getPath();
15 |         String q = "CALL ga.nlp.processor.train({textProcessor: \"com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor\", modelIdentifier: \"component-en\", alg: \"sentiment\", inputFile: \""+p+"\" , lang: \"en\"})";
16 |         executeInTransaction(q, (result -> {
17 |             assertTrue(result.hasNext());
18 |         }));
19 | 
20 |         String addPipelineQuery = "CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', name: 'customSentiment', processingSteps: {tokenize: true, ner: true, sentiment: true, dependency: false, customSentiment: \"component-en\"}})";
21 |         executeInTransaction(addPipelineQuery, emptyConsumer());
22 | 
23 |         String insertQ = "CREATE (tweet:Tweet) SET tweet.text = \"African American unemployment is the lowest ever recorded in our country. The Hispanic unemployment rate dropped a full point in the last year and is close to the lowest in recorded history. Dems did nothing for you but get your vote!\"\n" +
24 |                 "WITH tweet\n" +
25 |                 "CALL ga.nlp.annotate({text:tweet.text, id:id(tweet), pipeline:\"customSentiment\", checkLanguage:false})\n" +
26 |                 "YIELD result\n" +
27 |                 "MERGE (tweet)-[:HAS_ANNOTATED_TEXT]->(result)";
28 |         executeInTransaction(insertQ, emptyConsumer());
29 |         executeInTransaction("MATCH (n:Sentence) RETURN ANY(x IN labels(n) WHERE x IN ['Positive','Very Positive','Neutral']) AS hasSentiment", (result -> {
30 |             assertTrue(result.hasNext());
31 |             while (result.hasNext()) {
32 |                 Map<String, Object> record = result.next();
33 |                 assertTrue((boolean) record.get("hasSentiment"));
34 |             }
35 |         }));
36 |     }
37 | 
38 | }
39 | 


--------------------------------------------------------------------------------
/src/test/java/com/graphaware/nlp/processor/opennlp/procedure/ProcedureTest.java:
--------------------------------------------------------------------------------
  1 | /*
  2 |  * Copyright (c) 2013-2016 GraphAware
  3 |  *
  4 |  * This file is part of the GraphAware Framework.
  5 |  *
  6 |  * GraphAware Framework is free software: you can redistribute it and/or modify it under the terms of
  7 |  * the GNU General Public License as published by the Free Software Foundation, either
  8 |  * version 3 of the License, or (at your option) any later version.
  9 |  *
 10 |  * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
 11 |  * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 12 |  * See the GNU General Public License for more details. You should have received a copy of
 13 |  * the GNU General Public License along with this program.  If not, see
 14 |  * <http://www.gnu.org/licenses/>.
 15 |  */
 16 | package com.graphaware.nlp.processor.opennlp.procedure;
 17 | 
 18 | import com.graphaware.nlp.module.NLPConfiguration;
 19 | import com.graphaware.nlp.module.NLPModule;
 20 | import com.graphaware.nlp.processor.opennlp.OpenNLPIntegrationTest;
 21 | import com.graphaware.runtime.GraphAwareRuntime;
 22 | import com.graphaware.runtime.GraphAwareRuntimeFactory;
 23 | import com.graphaware.test.integration.GraphAwareIntegrationTest;
 24 | import java.util.HashMap;
 25 | import java.util.List;
 26 | import java.util.Map;
 27 | import static org.junit.Assert.assertEquals;
 28 | import static org.junit.Assert.assertFalse;
 29 | import static org.junit.Assert.assertTrue;
 30 | import org.junit.Test;
 31 | import org.neo4j.graphdb.Node;
 32 | import org.neo4j.graphdb.ResourceIterator;
 33 | import org.neo4j.graphdb.Result;
 34 | import org.neo4j.graphdb.Transaction;
 35 | 
 36 | public class ProcedureTest extends OpenNLPIntegrationTest {
 37 |     
 38 | 
 39 |     private static final String TEXT = "On 8 May 2013, "
 40 |             + "one week before the Pakistani election, the third author, "
 41 |             + "in his keynote address at the Sentiment Analysis Symposium, "
 42 |             + "forecast the winner of the Pakistani election. The chart "
 43 |             + "in Figure 1 shows varying sentiment on the candidates for "
 44 |             + "prime minister of Pakistan in that election. The next day, "
 45 |             + "the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote "
 46 |             + "an article titled “Pakistan Elections: Five Reasons Why the "
 47 |             + "Vote is Unpredictable,”1 in which he claimed that the election "
 48 |             + "was too close to call. It was not, and despite his being in Pakistan, "
 49 |             + "the outcome of the election was exactly as we predicted.";
 50 | 
 51 |     private static final String TEXT_IT = "Questo è un semplice testo in italiano";
 52 |     private static final String TEXT_FR = "Ceci est un texte simple en français";
 53 | 
 54 |     private static final String SHORT_TEXT_1 = "You knew China's cities were growing. But the real numbers are stunning http://wef.ch/29IxY7w  #China";
 55 |     private static final String SHORT_TEXT_2 = "Globalization for the 99%: can we make it work for all?";
 56 |     private static final String SHORT_TEXT_3 = "This organisation increased productivity, happiness and trust with just one change http://wef.ch/29PeKxF ";
 57 |     private static final String SHORT_TEXT_4 = "In pictures: The high-tech villages that live off the grid http://wef.ch/29xuRh8 ";
 58 |     private static final String SHORT_TEXT_5 = "The 10 countries best prepared for the new digital economy http://wef.ch/2a8DNug ";
 59 |     private static final String SHORT_TEXT_6 = "This is how to limit damage to the #euro after #Brexit, say economists http://wef.ch/29GGVzG ";
 60 |     private static final String SHORT_TEXT_7 = "The office jobs that could see you earning nearly 50% less than some of your co-workers http://wef.ch/29P9biE ";
 61 |     private static final String SHORT_TEXT_8 = "Which nationalities have the best quality of life? http://wef.ch/29uDfwV";
 62 |     private static final String SHORT_TEXT_9 = "It’s 9,000km away, but #Brexit has hit #Japan hard http://wef.ch/29P92eQ  #economics";
 63 |     private static final String SHORT_TEXT_10 = "Which is the world’s fastest-growing large economy? Clue: it’s not #China http://wef.ch/29xuXFd  #economics";
 64 | 
 65 |     //@Test
 66 |     public void overallTest() {
 67 |         GraphAwareRuntime gaRuntime = GraphAwareRuntimeFactory.createRuntime(getDatabase());
 68 |         gaRuntime.registerModule(new NLPModule("NLP", NLPConfiguration.defaultConfiguration(), getDatabase()));
 69 |         gaRuntime.start();
 70 |         gaRuntime.waitUntilStarted();
 71 |         testAnnotatedText();
 72 | //        clean();
 73 | //        testAnnotatedTextWithSentiment();
 74 | //        clean();
 75 | //        testAnnotatedTextAndSentiment();
 76 | //        clean();
 77 | //        testAnnotatedTextOnMultiple();
 78 | //        clean();
 79 | //        testConceptText();
 80 | //        clean();
 81 | //        testLanguageDetection();
 82 | //        clean();
 83 | //        testSupportedLanguage();
 84 | //        clean();
 85 | //        testFilter();
 86 | //        clean();
 87 | //        testGetProceduresManagement();
 88 | //        clean();
 89 |     }
 90 |     
 91 |     private void clean() {
 92 |         try (Transaction tx = getDatabase().beginTx()) {
 93 |             getDatabase().execute("MATCH (n) DETACH DELETE n");
 94 |             tx.success();
 95 |         }
 96 |     }
 97 |     
 98 |     public void testAnnotatedText() {
 99 |         try (Transaction tx = getDatabase().beginTx()) {
100 |             String id = "id1";
101 |             Map<String, Object> params = new HashMap<>();
102 |             params.put("value", TEXT);
103 |             params.put("id", id);
104 |             Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
105 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n"
106 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
107 |                     + "return result", params);
108 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
109 |             assertTrue(rowIterator.hasNext());
110 |             Node resultNode = (Node) rowIterator.next();
111 |             assertEquals(resultNode.getProperty("id"), id);
112 |             params.clear();
113 |             params.put("id", id);
114 |             Result tags = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence)-[:HAS_TAG]->(result:Tag) RETURN result", params);
115 |             rowIterator = tags.columnAs("result");
116 |             assertTrue(rowIterator.hasNext());
117 | 
118 |             Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params);
119 |             rowIterator = sentences.columnAs("result");
120 |             assertTrue(rowIterator.hasNext());
121 |             int countSentence = 0;
122 |             while (rowIterator.hasNext()) {
123 |                 List<Object> next = (List) rowIterator.next();
124 |                 assertEquals(next.size(), 1);
125 |                 countSentence++;
126 |             }
127 |             
128 |             sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:FIRST_SENTENCE|NEXT_SENTENCE*..]->(s:Sentence) RETURN labels(s) as result", params);
129 |             rowIterator = sentences.columnAs("result");
130 |             assertTrue(rowIterator.hasNext());
131 |             int newCountSentence = 0;
132 |             while (rowIterator.hasNext()) {
133 |                 List<Object> next = (List) rowIterator.next();
134 |                 assertEquals(next.size(), 1);
135 |                 newCountSentence++;
136 |             }
137 |             assertEquals(countSentence, newCountSentence);
138 |             tx.success();
139 |         }
140 |     }
141 | 
142 |     public void testAnnotatedTextWithSentiment() {
143 |         try (Transaction tx = getDatabase().beginTx()) {
144 |             String id = "id1";
145 |             Map<String, Object> params = new HashMap<>();
146 |             params.put("value", TEXT);
147 |             params.put("id", id);
148 |             Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
149 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}, pipeline: \"tokenizerAndSentiment\"}) YIELD result\n"
150 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
151 |                     + "return result", params);
152 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
153 |             assertTrue(rowIterator.hasNext());
154 |             Node resultNode = (Node) rowIterator.next();
155 |             assertEquals(resultNode.getProperty("id"), id);
156 |             params.clear();
157 |             params.put("id", id);
158 |             Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params);
159 |             rowIterator = sentences.columnAs("result");
160 |             assertTrue(rowIterator.hasNext());
161 |             while (rowIterator.hasNext()) {
162 |                 List<Object> next = (List) rowIterator.next();
163 |                 assertEquals(next.size(), 2);
164 |             }
165 |             tx.success();
166 |         }
167 |     }
168 | 
169 |     public void testAnnotatedTextAndSentiment() {
170 |         try (Transaction tx = getDatabase().beginTx()) {
171 |             String id = "id1";
172 |             Map<String, Object> params = new HashMap<>();
173 |             params.put("value", TEXT);
174 |             params.put("id", id);
175 |             Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
176 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, store: true, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}}) YIELD result\n"
177 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
178 |                     + "return result", params);
179 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
180 |             assertTrue(rowIterator.hasNext());
181 |             Node resultNode = (Node) rowIterator.next();
182 |             assertEquals(resultNode.getProperty("id"), id);
183 |             params.clear();
184 |             params.put("id", id);
185 |             Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}}) WITH a "
186 |                     + "CALL ga.nlp.sentiment({node:a}) YIELD result "
187 |                     + "MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) "
188 |                     + "return labels(s) as labels", params);
189 |             rowIterator = sentences.columnAs("labels");
190 |             assertTrue(rowIterator.hasNext());
191 |             int i = 0;
192 |             while (rowIterator.hasNext()) {
193 |                 List<Object> next = (List) rowIterator.next();
194 |                 assertEquals(next.size(), 2);
195 |                 i++;
196 |             }
197 |             assertEquals(4, i);
198 |             //Execute again for checking the number of senteces
199 |             sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}}) WITH a "
200 |                     + "CALL ga.nlp.sentiment({node:a}) YIELD result "
201 |                     + "MATCH (result)-[:CONTAINS_SENTENCE]->(s:Sentence) "
202 |                     + "return labels(s) as labels", params);
203 |             rowIterator = sentences.columnAs("labels");
204 |             assertTrue(rowIterator.hasNext());
205 |             i = 0;
206 |             while (rowIterator.hasNext()) {
207 |                 List<Object> next = (List) rowIterator.next();
208 |                 assertEquals(next.size(), 2);
209 |                 i++;
210 |             }
211 |             assertEquals(4, i);
212 |             tx.success();
213 |         }
214 |     }
215 | 
216 |     public void testAnnotatedTextOnMultiple() {
217 |         try (Transaction tx = getDatabase().beginTx()) {
218 |             String id = "id1";
219 |             Map<String, Object> params = new HashMap<>();
220 |             params.put("value", SHORT_TEXT_1);
221 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
222 | 
223 |             params.put("value", SHORT_TEXT_2);
224 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
225 | 
226 |             params.put("value", SHORT_TEXT_3);
227 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
228 | 
229 |             params.put("value", SHORT_TEXT_4);
230 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
231 | 
232 |             params.put("value", SHORT_TEXT_5);
233 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
234 | 
235 |             params.put("value", SHORT_TEXT_6);
236 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
237 | 
238 |             params.put("value", SHORT_TEXT_7);
239 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
240 | 
241 |             params.put("value", SHORT_TEXT_8);
242 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
243 | 
244 |             params.put("value", SHORT_TEXT_9);
245 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
246 | 
247 |             params.put("value", SHORT_TEXT_10);
248 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
249 | 
250 |             getDatabase().execute("MERGE (n:Tweet {id:1})", params);
251 | 
252 |             //Test for filter based on language
253 |             params.put("value", TEXT_IT);
254 |             getDatabase().execute("MERGE (n:Tweet {text: {value}})", params);
255 | 
256 |             Result sentences = getDatabase().execute("MATCH (a:Tweet) WITH a\n"
257 |                     + "WITH collect(a) AS aa\n"
258 |                     + "UNWIND aa AS a\n"
259 |                     + "CALL ga.nlp.annotate({text:a.text, id: id(a), textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}}) YIELD result WITH result as at "
260 |                     + "MERGE (a)-[:HAS_ANNOTATED_TEXT]->(at) WITH at "
261 |                     + "MATCH (at)-[:CONTAINS_SENTENCE]->(result) "
262 |                     + "RETURN result", params);
263 |             ResourceIterator<Object> rowIterator = sentences.columnAs("result");
264 |             assertTrue(rowIterator.hasNext());
265 |             int i = 0;
266 |             while (rowIterator.hasNext()) {
267 |                 rowIterator.next();
268 |                 i++;
269 |             }
270 |             assertEquals(13, i);
271 |             tx.success();
272 |         }
273 |     }
274 | 
275 |     public void testConceptText() {
276 |         try (Transaction tx = getDatabase().beginTx()) {
277 |             String id = "id1";
278 |             Map<String, Object> params = new HashMap<>();
279 |             params.put("value", TEXT);
280 |             params.put("id", id);
281 |             Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
282 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}}) YIELD result\n"
283 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
284 |                     + "return result", params);
285 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
286 |             assertTrue(rowIterator.hasNext());
287 |             Node resultNode = (Node) rowIterator.next();
288 |             assertEquals(resultNode.getProperty("id"), id);
289 |             params.clear();
290 |             params.put("id", id);
291 |             Result tags = getDatabase().execute(
292 |                     "MATCH (a:AnnotatedText) "
293 |                     + "CALL ga.nlp.concept({node:a, depth: 2}) YIELD result\n"
294 |                     + "return result;", params);
295 |             rowIterator = tags.columnAs("result");
296 |             //assertTrue(rowIterator.hasNext());
297 |             tx.success();
298 |         }
299 |     }
300 | 
301 |     public void testLanguageDetection() {
302 |         try (Transaction tx = getDatabase().beginTx()) {
303 |             Map<String, Object> params = new HashMap<>();
304 |             params.put("value", TEXT);
305 |             Result result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n"
306 |                     + "return result", params);
307 |             ResourceIterator<Object> rowIterator = result.columnAs("result");
308 |             assertTrue(rowIterator.hasNext());
309 |             String resultNode = (String) rowIterator.next();
310 |             assertEquals("en", resultNode);
311 | 
312 |             params.put("value", TEXT_IT);
313 |             result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n"
314 |                     + "return result", params);
315 |             rowIterator = result.columnAs("result");
316 |             assertTrue(rowIterator.hasNext());
317 |             resultNode = (String) rowIterator.next();
318 |             assertEquals("it", resultNode);
319 | 
320 |             params.put("value", TEXT_FR);
321 |             result = getDatabase().execute("CALL ga.nlp.language({text:{value}}) YIELD result\n"
322 |                     + "return result", params);
323 |             rowIterator = result.columnAs("result");
324 |             assertTrue(rowIterator.hasNext());
325 |             resultNode = (String) rowIterator.next();
326 |             assertEquals("fr", resultNode);
327 | 
328 |             tx.success();
329 |         }
330 |     }
331 | 
332 |     public void testSupportedLanguage() {
333 |         try (Transaction tx = getDatabase().beginTx()) {
334 |             String id = "id1";
335 |             Map<String, Object> params = new HashMap<>();
336 |             params.put("value", TEXT_IT);
337 |             params.put("id", id);
338 |             Result news = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
339 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}}) YIELD result\n"
340 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
341 |                     + "return result", params);
342 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
343 |             assertFalse(rowIterator.hasNext());
344 |             tx.success();
345 |         }
346 |     }
347 | 
348 |     public void testFilter() {
349 |         try (Transaction tx = getDatabase().beginTx()) {
350 |             String id = "id1";
351 |             Map<String, Object> params = new HashMap<>();
352 |             params.put("value", TEXT);
353 |             params.put("filter", "Owen Bennett Jones/PERSON");
354 |             Result news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n"
355 |                     + "return result", params);
356 |             ResourceIterator<Object> rowIterator = news.columnAs("result");
357 |             assertTrue(rowIterator.hasNext());
358 |             Boolean resultNode = (Boolean) rowIterator.next();
359 |             assertEquals(true, resultNode);
360 |             
361 |             params.clear();
362 |             params.put("value", SHORT_TEXT_1);
363 |             params.put("filter", "China/PERSON");
364 |             news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n"
365 |                     + "return result", params);
366 |             rowIterator = news.columnAs("result");
367 |             assertTrue(rowIterator.hasNext());
368 |             resultNode = (Boolean) rowIterator.next();
369 |             assertEquals(false, resultNode);
370 |             
371 |             
372 |             params.clear();
373 |             params.put("value", TEXT);
374 |             params.put("filter", "Owen Bennett Jones/PERSON, BBC, Pakistan/LOCATION");
375 |             news = getDatabase().execute("CALL ga.nlp.filter({text:{value}, filter: {filter}}) YIELD result\n"
376 |                     + "return result", params);
377 |             rowIterator = news.columnAs("result");
378 |             assertTrue(rowIterator.hasNext());
379 |             resultNode = (Boolean) rowIterator.next();
380 |             assertEquals(true, resultNode);
381 |             tx.success();
382 |         }
383 |     }
384 |     
385 |     public void testGetProceduresManagement() {
386 |         try (Transaction tx = getDatabase().beginTx()) {
387 |             Result res = getDatabase().execute("CALL ga.nlp.getProcessors() YIELD class\n"
388 |                     + "return class");
389 |             ResourceIterator<Object> rowIterator = res.columnAs("class");
390 |             assertTrue(rowIterator.hasNext());
391 |             String resultNode = (String) rowIterator.next();
392 |             assertEquals("com.graphaware.nlp.processor.stanford.StanfordTextProcessor", resultNode);
393 |             tx.success();
394 |         }
395 |         try (Transaction tx = getDatabase().beginTx()) {
396 |             Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor'}) YIELD result\n"
397 |                     + "return result");
398 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
399 |             assertTrue(rowIterator.hasNext());
400 | //            String resultNode = (String) rowIterator.next();
401 | //            assertEquals("com.graphaware.nlp.processor.stanford.StanfordTextProcessor", resultNode);
402 |             tx.success();
403 |         }
404 |         
405 |         try (Transaction tx = getDatabase().beginTx()) {
406 |             Result res = getDatabase().execute("CALL ga.nlp.addPipeline({"
407 |                     + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', "
408 |                     + "name: 'testPipe', "
409 |                     + "stopWords: 'class,instance,issue', "
410 |                     + "threadNumber: 5}) "
411 |                     + "YIELD result\n"
412 |                     + "return result");
413 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
414 |             assertTrue(rowIterator.hasNext());
415 |             String resultNode = (String) rowIterator.next();
416 |             assertEquals("succeess", resultNode);
417 |             tx.success();
418 |         }
419 |         
420 |         try (Transaction tx = getDatabase().beginTx()) {
421 |             Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n"
422 |                     + "return result");
423 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
424 |             
425 |             boolean found = false;
426 |             while (rowIterator.hasNext()) {
427 |                 String resultNode = (String) rowIterator.next();
428 |                 if (resultNode.equalsIgnoreCase("testPipe")) {
429 |                     found = true;
430 |                 }
431 |             }            
432 |             assertTrue(found);
433 |             tx.success();
434 |         }
435 |         
436 |         try (Transaction tx = getDatabase().beginTx()) {
437 |             String id = "id1";
438 |             Map<String, Object> params = new HashMap<>();
439 |             params.put("value", TEXT);
440 |             params.put("id", id);
441 |             Result res = getDatabase().execute("MERGE (n:News {text: {value}}) WITH n\n"
442 |                     + "CALL ga.nlp.annotate({text:n.text, id: {id}, textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', pipeline: 'testPipe'}) YIELD result\n"
443 |                     + "MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)\n"
444 |                     + "return result", params);
445 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
446 |             assertTrue(rowIterator.hasNext());
447 |             Node resultNode = (Node) rowIterator.next();
448 |             assertEquals(resultNode.getProperty("id"), id);
449 |             params.clear();
450 |             params.put("id", id);
451 |             Result tags = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence)-[:HAS_TAG]->(result:Tag) RETURN result", params);
452 |             rowIterator = tags.columnAs("result");
453 |             assertTrue(rowIterator.hasNext());
454 | 
455 |             Result sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:CONTAINS_SENTENCE]->(s:Sentence) RETURN labels(s) as result", params);
456 |             rowIterator = sentences.columnAs("result");
457 |             assertTrue(rowIterator.hasNext());
458 |             int countSentence = 0;
459 |             while (rowIterator.hasNext()) {
460 |                 List<Object> next = (List) rowIterator.next();
461 |                 assertEquals(1, next.size());
462 |                 countSentence++;
463 |             }
464 |             
465 |             sentences = getDatabase().execute("MATCH (a:AnnotatedText {id: {id}})-[:FIRST_SENTENCE|NEXT_SENTENCE*..]->(s:Sentence) RETURN labels(s) as result", params);
466 |             rowIterator = sentences.columnAs("result");
467 |             assertTrue(rowIterator.hasNext());
468 |             int newCountSentence = 0;
469 |             while (rowIterator.hasNext()) {
470 |                 List<Object> next = (List) rowIterator.next();
471 |                 assertEquals(next.size(), 1);
472 |                 newCountSentence++;
473 |             }
474 |             assertEquals(countSentence, newCountSentence);
475 |             tx.success();
476 |         }
477 |         
478 |         try (Transaction tx = getDatabase().beginTx()) {
479 |             Result res = getDatabase().execute("CALL ga.nlp.removePipeline({"
480 |                     + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', "
481 |                     + "pipeline: 'testPipe'}) "
482 |                     + "YIELD result\n"
483 |                     + "return result");
484 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
485 |             assertTrue(rowIterator.hasNext());
486 |             String resultNode = (String) rowIterator.next();
487 |             assertEquals("succeess", resultNode);
488 |             tx.success();
489 |         }
490 |         
491 |         try (Transaction tx = getDatabase().beginTx()) {
492 |             Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n"
493 |                     + "return result");
494 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
495 |             
496 |             boolean found = false;
497 |             while (rowIterator.hasNext()) {
498 |                 String resultNode = (String) rowIterator.next();
499 |                 if (resultNode.equalsIgnoreCase("testPipe")) {
500 |                     found = true;
501 |                 }
502 |             }            
503 |             assertTrue(!found);
504 |             tx.success();
505 |         }
506 |         
507 |         try (Transaction tx = getDatabase().beginTx()) {
508 |             Result res = getDatabase().execute("CALL ga.nlp.addPipeline({"
509 |                     + "textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor', "
510 |                     + "name: 'testPipe', "
511 |                     + "stopWords: 'class,instance,issue', "
512 |                     + "threadNumber: 5}) "
513 |                     + "YIELD result\n"
514 |                     + "return result");
515 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
516 |             assertTrue(rowIterator.hasNext());
517 |             String resultNode = (String) rowIterator.next();
518 |             assertEquals("succeess", resultNode);
519 |             tx.success();
520 |         }
521 |         
522 |         try (Transaction tx = getDatabase().beginTx()) {
523 |             Result res = getDatabase().execute("CALL ga.nlp.getPipelines({textProcessor: 'com.graphaware.nlp.processor.opennlp.OpenNLPTextProcessor'}) YIELD result\n"
524 |                     + "return result");
525 |             ResourceIterator<Object> rowIterator = res.columnAs("result");
526 |             
527 |             boolean found = false;
528 |             while (rowIterator.hasNext()) {
529 |                 String resultNode = (String) rowIterator.next();
530 |                 if (resultNode.equalsIgnoreCase("testPipe")) {
531 |                     found = true;
532 |                 }
533 |             }            
534 |             assertTrue(found);
535 |             tx.success();
536 |         }
537 |     }
538 | }
539 | 


--------------------------------------------------------------------------------
/src/test/resources/import/sentiment_tweets.train:
--------------------------------------------------------------------------------
  1 | 3	Watching a nice movie
  2 | 1	The painting is ugly, will return it tomorrow...
  3 | 3	One of the best soccer games, worth seeing it
  4 | 3	Very tasty, not only for vegetarians
  5 | 3	Super party!
  6 | 1	Too early to travel..need a coffee
  7 | 1	Damn..the train is late again...
  8 | 1	Bad news, my flight just got cancelled.
  9 | 3	Happy birthday mr. president
 10 | 3	Just watch it. Respect.
 11 | 3	Wonderful sunset.
 12 | 3	Bravo, first title in 2014!
 13 | 1	Had a bad evening, need urgently a beer.
 14 | 1	I put on weight again
 15 | 3	On today's show we met Angela, a woman with an amazing story
 16 | 3	I fell in love again
 17 | 1	I lost my keys
 18 | 3	On a trip to Iceland
 19 | 3	Happy in Berlin
 20 | 1	I hate Mondays
 21 | 3	Love the new book I reveived for Christmas
 22 | 1	He killed our good mood
 23 | 3	I am in good spirits again
 24 | 3	This guy creates the most awesome pics ever 
 25 | 1	The dark side of a selfie.
 26 | 3	Cool! John is back!
 27 | 3	Many rooms and many hopes for new residents
 28 | 1	False hopes for the people attending the meeting
 29 | 3	I set my new year's resolution
 30 | 1	The ugliest car ever!
 31 | 1	Feeling bored
 32 | 1	Need urgently a pause
 33 | 3	Nice to see Ana made it
 34 | 3	My dream came true
 35 | 1	I didn't see that one coming
 36 | 1	Sorry mate, there is no more room for you
 37 | 1	Who could have possibly done this?
 38 | 3	I won the challenge
 39 | 1	I feel bad for what I did		
 40 | 3	I had a great time tonight
 41 | 3	It was a lot of fun
 42 | 3	Thank you Molly making this possible
 43 | 1	I just did a big mistake
 44 | 3	I love it!!
 45 | 1	I never loved so hard in my life
 46 | 1	I hate you Mike!!
 47 | 1	I hate to say goodbye
 48 | 3	Lovely!
 49 | 3	Like and share if you feel the same
 50 | 1	Never try this at home
 51 | 1	Don't spoil it!
 52 | 3	I love rock and roll
 53 | 1	The more I hear you, the more annoyed I get
 54 | 3	Finnaly passed my exam!
 55 | 3	Lovely kittens
 56 | 1	I just lost my appetite
 57 | 1	Sad end for this movie
 58 | 1	Lonely, I am so lonely
 59 | 3	Beautiful morning
 60 | 3	She is amazing
 61 | 3	Enjoying some time with my friends
 62 | 3	Special thanks to Marty
 63 | 3	Thanks God I left on time
 64 | 3	Greateful for a wonderful meal
 65 | 3	So happy to be home
 66 | 1	Hate to wait on a long queue		
 67 | 1	No cab available
 68 | 1	Electricity outage, this is a nightmare
 69 | 1	Nobody to ask about directions
 70 | 3	Great game!
 71 | 3	Nice trip
 72 | 3	I just received a pretty flower
 73 | 3	Excellent idea
 74 | 3	Got a new watch. Feeling happy
 75 | 1	I feel sick
 76 | 1	I am very tired
 77 | 3	Such a good taste 
 78 | 1	Such a bad taste
 79 | 3	Enjoying brunch
 80 | 1	I don't recommend this restaurant
 81 | 3	Thank you mom for supporting me
 82 | 1	I will never ever call you again
 83 | 1	I just got kicked out of the contest
 84 | 3	Smiling
 85 | 1	Big pain to see my team loosing
 86 | 1	Bitter defeat tonight
 87 | 1	My bike was stollen
 88 | 3	Great to see you!
 89 | 1	I lost every hope for seeing him again
 90 | 3	Nice dress!
 91 | 3	Stop wasting my time
 92 | 3	I have a great idea
 93 | 3	Excited to go to the pub
 94 | 3	Feeling proud
 95 | 3	Cute bunnies
 96 | 1	Cold winter ahead
 97 | 1	Hopless struggle..
 98 | 1	Ugly hat
 99 | 3	Big hug and lots of love
100 | 3	I hope you have a wonderful celebration
101 | 


--------------------------------------------------------------------------------