├── .gitignore
├── .travis.yml
├── Gemfile
├── LICENSE
├── README.md
├── Rakefile
├── bin
    ├── Utils.java
    └── utils.jar
├── lib
    ├── open-nlp.rb
    └── open-nlp
    │   ├── base.rb
    │   ├── bindings.rb
    │   ├── classes.rb
    │   ├── config.rb
    │   └── version.rb
├── open-nlp.gemspec
└── spec
    ├── english_spec.rb
    ├── sample.txt
    └── spec_helper.rb


/.gitignore:
--------------------------------------------------------------------------------
 1 | *~
 2 | *.lock
 3 | *.DS_Store
 4 | *swp
 5 | *.out
 6 | *.gem
 7 | *.sh
 8 | *.bin
 9 | 
10 | *.xml
11 | 
12 | *.zip
13 | 
14 | *.jar
15 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: ruby 
 2 | rvm:
 3 |   - 1.9.2
 4 |   - 1.9.3
 5 |   - jruby-19mode
 6 | before_install:
 7 |   - export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386"
 8 |   - export "JRUBY_OPTS=--1.9"
 9 |   - gem install bind-it
10 | before_script: 
11 |   - sudo wget http://www.louismullie.com/treat/open-nlp-english.zip
12 |   - sudo unzip -o open-nlp-english.zip -d bin/
13 | script:
14 |   - rake spec
15 |   - gem uninstall bind-it
16 |   - gem uninstall rjb
17 |   - export "JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386"
18 |   - gem install bind-it
19 |   - rake spec


--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 | 
3 | gemspec


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Ruby bindings for the OpenNLP tools.
 2 | 
 3 | This program is free software: you can redistribute it and/or modify
 4 | it under the terms of the GNU General Public License as published by
 5 | the Free Software Foundation, either version 3 of the License, or
 6 | (at your option) any later version.
 7 | 
 8 | This program is distributed in the hope that it will be useful,
 9 | but WITHOUT ANY WARRANTY; without even the implied warranty of
10 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11 | GNU General Public License for more details.
12 | 
13 | This license also applies to the included Stanford CoreNLP files.
14 | 
15 | You should have received a copy of the GNU General Public License
16 | along with this program.  If not, see <http://www.gnu.org/licenses/>.
17 | 
18 | Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2012.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | **Warning: This gem is unmaintained.**
  2 | 
  3 | ### About
  4 | 
  5 | This library provides high-level Ruby bindings to the OpenNLP package, a Java machine learning toolkit for natural language processing (NLP). This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.
  6 | 
  7 | ### Warning
  8 | 
  9 | This package is unmaintained.
 10 | 
 11 | ### Installing
 12 | 
 13 | First, install the gem: `gem install open-nlp`. Then, download [the JARs and English language models](http://louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
 14 | 
 15 | Place the contents of the extracted archive inside the /bin/ folder of the `open-nlp` gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).
 16 | 
 17 | Alternatively, from a terminal window, `cd` to the gem's folder and run:
 18 | 
 19 | ```
 20 | wget http://www.louismullie.com/treat/open-nlp-english.zip
 21 | unzip -o open-nlp-english.zip -d bin/
 22 | ```
 23 | 
 24 | Afterwards, you may individually download the appropriate models for other languages from the [OpenNLP website](http://opennlp.sourceforge.net/models-1.5/).
 25 | 
 26 | ### Configuring
 27 | 
 28 | After installing and requiring the gem (`require 'open-nlp'`), you may want to set some of the following configuration options.
 29 | 
 30 | ```ruby
 31 | # Set an alternative path to look for the JAR files.
 32 | # Default is gem's bin folder.
 33 | OpenNLP.jar_path = '/path_to_jars/'
 34 | 
 35 | # Set an alternative path to look for the model files.
 36 | # Default is gem's bin folder.
 37 | OpenNLP.model_path = '/path_to_models/'
 38 | 
 39 | # Pass some alternative arguments to the Java VM.
 40 | # Default is ['-Xms512M', '-Xmx1024M'].
 41 | OpenNLP.jvm_args = ['-option1', '-option2']
 42 | 
 43 | # Redirect VM output to log.txt
 44 | OpenNLP.log_file = 'log.txt'
 45 | 
 46 | # Set default models for a language (default is :english).
 47 | OpenNLP.use :language
 48 | ```
 49 | 
 50 | ### Examples
 51 | 
 52 | 
 53 | **Simple tokenizer**
 54 | 
 55 | ```ruby
 56 | OpenNLP.load
 57 | 
 58 | sent = "The death of the poet was kept from his poems."
 59 | tokenizer = OpenNLP::SimpleTokenizer.new
 60 | 
 61 | tokens = tokenizer.tokenize(sent).to_a
 62 | # => %w[The death of the poet was kept from his poems .]
 63 | ```
 64 | 
 65 | **Maximum entropy tokenizer, chunker and POS tagger**
 66 | 
 67 | ```ruby
 68 | 
 69 | OpenNLP.load
 70 | 
 71 | chunker   = OpenNLP::ChunkerME.new
 72 | tokenizer = OpenNLP::TokenizerME.new
 73 | tagger    = OpenNLP::POSTaggerME.new
 74 | 
 75 | sent   = "The death of the poet was kept from his poems."
 76 | 
 77 | tokens = tokenizer.tokenize(sent).to_a
 78 | # => %w[The death of the poet was kept from his poems .]
 79 | 
 80 | tags   = tagger.tag(tokens).to_a
 81 | # => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]
 82 | 
 83 | chunks = chunker.chunk(tokens, tags).to_a
 84 | # => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
 85 | ```
 86 | 
 87 | **Abstract Bottom-Up Parser**
 88 | 
 89 | ```ruby
 90 | OpenNLP.load
 91 | 
 92 | sent      = "The death of the poet was kept from his poems."
 93 | parser = OpenNLP::Parser.new
 94 | parse = parser.parse(sent)
 95 | 
 96 | parse.get_text # => sent
 97 | 
 98 | parse.get_span.get_start # => 0
 99 | parse.get_span.get_end # => 46
100 | parse.get_child_count # => 1
101 | 
102 | child = parse.get_children[0]
103 | 
104 | child.text # => "The death of the poet was kept from his poems."
105 | child.get_child_count # => 3
106 | child.get_head_index #=> 5
107 | child.get_type # => "S"
108 | ```
109 | 
110 | **Maximum Entropy Name Finder**
111 | 
112 | ```ruby
113 | OpenNLP.load
114 | 
115 | text = File.read('./spec/sample.txt').gsub("\n", "")
116 | 
117 | tokenizer   = OpenNLP::TokenizerME.new
118 | segmenter   = OpenNLP::SentenceDetectorME.new
119 | ner_models  = ['person', 'time', 'money']
120 | 
121 | ner_finders = ner_models.map do |model|
122 |   OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
123 | end
124 | 
125 | sentences = segmenter.sent_detect(text)
126 | named_entities = []
127 | 
128 | sentences.each do |sentence|
129 | 
130 |   tokens = tokenizer.tokenize(sentence)
131 |   
132 |   ner_models.each_with_index do |model,i|
133 |     finder = ner_finders[i]
134 |     name_spans = finder.find(tokens)
135 |     name_probs = finder.probs()
136 |     name_spans.each_with_index do |name_span,j|
137 |       start = name_span.get_start
138 |       stop  = name_span.get_end-1
139 |       slice = tokens[start..stop].to_a
140 |       prob  = name_probs[j]
141 |       named_entities << [slice, model, prob]
142 |     end
143 |   end
144 | 
145 | end
146 | ```
147 | 
148 | **Loading specific models**
149 | 
150 | Just pass the name of the model file to the constructor. The gem will search for the file in the `OpenNLP.model_path` folder.
151 | 
152 | ```ruby
153 | OpenNLP.load
154 | 
155 | tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
156 | tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
157 | name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
158 | # etc.
159 | ```
160 | 
161 | **Loading specific classes**
162 | 
163 | You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:
164 | 
165 | ```ruby
166 | # Default base class is opennlp.tools.
167 | OpenNLP.load_class('SomeClassName')  
168 | # => OpenNLP::SomeClassName
169 | 
170 | # Here, we specify another base class.
171 | OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
172 | # => OpenNLP::SomeOtherClass
173 | ```
174 | 
175 | **Contributing**
176 | 
177 | Fork the project and send me a pull request! Config updates for other languages are welcome.
178 | 
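
The name finder example in the README above turns each span into a token slice with `tokens[start..stop]`, where `stop` is `get_end-1`, so the span's end index is treated as exclusive. A minimal, JVM-free sketch of that arithmetic is given here; `FakeSpan` and `slice_entities` are hypothetical names introduced only for illustration and are not part of the gem or of OpenNLP.

```ruby
# Hypothetical stand-in for opennlp.tools.util.Span, for illustration only.
FakeSpan = Struct.new(:get_start, :get_end)

# Map each span to the tokens it covers. The end index is treated as
# exclusive, mirroring the `get_end-1` arithmetic in the README example.
def slice_entities(tokens, spans)
  spans.map { |span| tokens[span.get_start...span.get_end] }
end

tokens = %w[Mr. John Smith paid $ 5 million .]
spans  = [FakeSpan.new(1, 3)]

slice_entities(tokens, spans)
# => [["John", "Smith"]]
```

Using an exclusive range (`...`) avoids the separate `- 1` step while selecting the same tokens.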


--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
1 | require 'rspec/core/rake_task'
2 | 
3 | RSpec::Core::RakeTask.new(:spec)
4 | 
5 | task :default => :spec


--------------------------------------------------------------------------------
/bin/Utils.java:
--------------------------------------------------------------------------------
 1 | import java.util.Arrays;
 2 | import java.util.ArrayList;
 3 | import java.lang.String;
 4 | import opennlp.tools.postag.POSTagger;
 5 | import opennlp.tools.chunker.ChunkerME;
 6 | import opennlp.tools.namefind.NameFinderME; // interface instead?
 7 | import opennlp.tools.util.Span;
 8 | 
 9 | // javac -cp '.:opennlp.tools.jar' Utils.java
10 | // jar cf utils.jar Utils.class
11 | public class Utils {
12 | 
13 |     public static String[] tagWithArrayList(POSTagger posTagger, ArrayList[] objectArray) {
14 |       return posTagger.tag(getStringArray(objectArray));
15 |     }
16 | 
17 |     public static Object[] findWithArrayList(NameFinderME nameFinder, ArrayList[] tokens) {
18 |       return nameFinder.find(getStringArray(tokens));
19 |     }
20 | 
21 |     public static Object[] chunkWithArrays(ChunkerME chunker, ArrayList[] tokens, ArrayList[] tags) {
22 |       return chunker.chunk(getStringArray(tokens), getStringArray(tags));
23 |     }
24 | 
25 |     public static String[] getStringArray(ArrayList[] objectArray) {
26 |       String[] stringArray = Arrays.copyOf(objectArray, objectArray.length, String[].class);
27 |       return stringArray;
28 |     }
29 | 
30 | }


--------------------------------------------------------------------------------
/bin/utils.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louismullie/open-nlp/5977c873377ea9f316764b27e672e8b84e0b7c11/bin/utils.jar


--------------------------------------------------------------------------------
/lib/open-nlp.rb:
--------------------------------------------------------------------------------
 1 | module OpenNLP
 2 | 
 3 |   # Library version.
 4 |   require 'open-nlp/version'
 5 | 
 6 |   # Require Java bindings.
 7 |   require 'open-nlp/bindings'
 8 |   
 9 |   # Require Ruby wrappers.
10 |   require 'open-nlp/classes'
11 |   
12 |   # Setup the JVM and load the default JARs.
13 |   def self.load
14 |     OpenNLP::Bindings.bind
15 |   end
16 | 
17 |   # Load a Java class into the OpenNLP
18 |   # namespace (e.g. OpenNLP::Loaded).
19 |   def self.load_class(*args)
20 |     OpenNLP::Bindings.load_class(*args)
21 |   end
22 |   
23 |   # Forwards the handling of missing
24 |   # constants to the Bindings class.
25 |   def self.const_missing(const)
26 |     OpenNLP::Bindings.const_get(const)
27 |   end
28 |   
29 |   # Forward the handling of missing 
30 |   # methods to the Bindings class.
31 |   def self.method_missing(sym, *args, &block)
32 |     OpenNLP::Bindings.send(sym, *args, &block)
33 |   end
34 | 
35 | end
36 | 
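
The `const_missing` and `method_missing` hooks in `lib/open-nlp.rb` are what let callers write `OpenNLP.jar_path = ...` or `OpenNLP::Span` even though those live on `OpenNLP::Bindings`. The pattern can be sketched in isolation as follows. `Facade` and `Backend` are hypothetical names used only for this illustration, and the `respond_to_missing?` hook is an addition that the gem itself does not define.

```ruby
# Hypothetical backend holding the real state (plays the role of Bindings).
module Backend
  Answer = 42

  class << self
    attr_accessor :model_path
  end
end

# Hypothetical facade that forwards unknown constants and methods.
module Facade
  # Look up constants that Facade does not define on the backend.
  def self.const_missing(const)
    Backend.const_get(const)
  end

  # Forward unknown module-level methods (including setters) to the backend.
  def self.method_missing(sym, *args, &block)
    return super unless Backend.respond_to?(sym)
    Backend.send(sym, *args, &block)
  end

  # Keep respond_to? consistent with method_missing.
  def self.respond_to_missing?(sym, include_private = false)
    Backend.respond_to?(sym, include_private) || super
  end
end

Facade.model_path = '/models/'
Facade.model_path  # => "/models/"
Facade::Answer     # => 42
```

Checking `Backend.respond_to?` before forwarding keeps typos on the facade reporting a `NoMethodError` against the facade itself rather than against the backend.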


--------------------------------------------------------------------------------
/lib/open-nlp/base.rb:
--------------------------------------------------------------------------------
 1 | class OpenNLP::Base
 2 | 
 3 |   def initialize(file_or_arg=nil, *args)
 4 | 
 5 |     @proxy_class = OpenNLP::Bindings.const_get(last_name)
 6 | 
 7 |     if requires_model?
 8 |       if !file_or_arg && !has_default_model?
 9 |         raise "No default model files are available for " +
10 |         "class #{last_name}. Please supply a model as " +
11 |         "an argument to the constructor."
12 |       end
13 |       @model = OpenNLP::Bindings.get_model(last_name, file_or_arg)
14 |       @proxy_inst = @proxy_class.new(*([@model] + args))
15 |     else
16 |       @proxy_inst = @proxy_class.new(*([*file_or_arg] + args))
17 |     end
18 | 
19 |   end
20 | 
21 |   def has_default_model?
22 |     name = OpenNLP::Config::ClassToName[last_name]
23 |     !OpenNLP::Config::DefaultModels[name].empty?
24 |   end
25 | 
26 |   def requires_model?
27 |     OpenNLP::Config::RequiresModel.include?(last_name)
28 |   end
29 | 
30 |   def last_name
31 |     self.class.to_s.split('::')[-1]
32 |   end
33 | 
34 | 
35 |   def method_missing(sym, *args, &block)
36 |     @proxy_inst.send(sym, *args, &block)
37 |   end
38 |   
39 |   protected
40 | 
41 |   def get_list(tokens)
42 |     list = OpenNLP::Bindings::ArrayList.new
43 |     tokens.each do |t|
44 |       list.add(OpenNLP::Bindings::String.new(t.to_s))
45 |     end
46 |     list
47 |   end
48 | 
49 | end


--------------------------------------------------------------------------------
/lib/open-nlp/bindings.rb:
--------------------------------------------------------------------------------
  1 | module OpenNLP::Bindings
  2 | 
  3 |   # Require configuration.
  4 |   require 'open-nlp/config'
  5 | 
  6 |   # ############################ #
  7 |   # BindIt Configuration Options #
  8 |   # ############################ #
  9 | 
 10 |   require 'bind-it'
 11 |   extend BindIt::Binding
 12 | 
 13 |   # Load the JVM with a minimum heap size of 512MB,
 14 |   # and a maximum heap size of 1024MB.
 15 |   self.jvm_args = ['-Xms512M', '-Xmx1024M']
 16 | 
 17 |   # Turn logging off by default.
 18 |   self.log_file = nil
 19 | 
 20 |   # Default JARs to load.
 21 |   self.default_jars = [
 22 |     'jwnl-1.3.3.jar',
 23 |     'opennlp-tools-1.5.2-incubating.jar',
 24 |     'opennlp-maxent-3.0.2-incubating.jar',
 25 |     'opennlp-uima-1.5.2-incubating.jar'
 26 |   ]
 27 | 
 28 |   # Default namespace.
 29 |   self.default_namespace = 'opennlp.tools'
 30 | 
 31 |   # Default classes.
 32 |   self.default_classes = [
 33 |     # OpenNLP classes.
 34 |     ['AbstractBottomUpParser', 'opennlp.tools.parser'],
 35 |     ['DocumentCategorizerME', 'opennlp.tools.doccat'],
 36 |     ['ChunkerME', 'opennlp.tools.chunker'],
 37 |     ['DictionaryDetokenizer', 'opennlp.tools.tokenize'],
 38 |     ['NameFinderME', 'opennlp.tools.namefind'],
 39 |     ['Parser', 'opennlp.tools.parser.chunking'],
 40 |     ['Parse', 'opennlp.tools.parser'],
 41 |     ['ParserFactory', 'opennlp.tools.parser'],
 42 |     ['POSTaggerME', 'opennlp.tools.postag'],
 43 |     ['SentenceDetectorME', 'opennlp.tools.sentdetect'],
 44 |     ['SimpleTokenizer', 'opennlp.tools.tokenize'],
 45 |     ['Span', 'opennlp.tools.util'],
 46 |     ['TokenizerME', 'opennlp.tools.tokenize'],
 47 |     
 48 |     # Generic Java classes.
 49 |     ['FileInputStream', 'java.io'],
 50 |     ['String', 'java.lang'],
 51 |     ['ArrayList', 'java.util']
 52 |   ]
 53 |   
 54 |   # Add in Rjb workarounds.
 55 |   unless RUBY_PLATFORM =~ /java/
 56 |     self.default_jars << 'utils.jar'
 57 |     self.default_classes << ['Utils', '']
 58 |   end
 59 | 
 60 |   # ############################ #
 61 |   #   OpenNLP bindings proper    #
 62 |   # ############################ #
 63 | 
 64 |   class <<self
 65 |     # A hash containing loaded models.
 66 |     attr_accessor :models
 67 |     # A hash containing the names of loaded models.
 68 |     attr_accessor :model_files
 69 |     # The folder in which to look for models.
 70 |     attr_accessor :model_path
 71 |     # Store the language currently being used.
 72 |     attr_accessor :language
 73 |   end
 74 | 
 75 |   def self.default_path
 76 |     File.dirname(__FILE__) + '/../../bin/'
 77 |   end
 78 | 
 79 |   # The loaded models.
 80 |   self.models = {}
 81 | 
 82 |   # The names of loaded models.
 83 |   self.model_files = {}
 84 | 
 85 |   # The path in which to look for JAR files, with
 86 |   # a trailing slash (default is gem's bin folder).
 87 |   self.jar_path = self.default_path
 88 | 
 89 |   # The path to the main folder containing the folders
 90 |   # with the individual models inside. By default, this
 91 |   # is the same as the JAR path.
 92 |   self.model_path = self.jar_path
 93 | 
 94 |   # Default the language to English.
 95 |   self.language = :english
 96 | 
 97 |   # Use a given language for default models.
 98 |   def self.use(language)
 99 |     self.language = language
100 |   end
101 | 
102 |   def self.get_model(klass, file=nil)
103 |     name = OpenNLP::Config::ClassToName[klass]
104 |     if !self.language and !file
105 |       raise 'No model file was supplied to the ' +
106 |       'constructor. Please supply a model file ' +
107 |       'or call OpenNLP.use(:some_language), to ' +
108 |       'load the default models for a language.'
109 |     end
110 |     self.load_model(name, file)
111 |     self.models[name]
112 |   end
113 | 
114 |   def self.set_model
115 |     raise 'Not implemented.'
116 |   end
117 | 
118 |   def self.load_model(name, file = nil)
119 |     if self.models[name] && file ==
120 |       self.model_files[name]
121 |       return self.models[name]
122 |     end
123 |     models = OpenNLP::Config::DefaultModels[name]
124 |     file ||= models[self.language]
125 |     path = self.model_path + file
126 |     stream = FileInputStream.new(path)
127 |     klass = OpenNLP::Config::NameToClass[name]
128 |     load_class(*klass) unless const_defined?(klass[0])
129 |     klass = const_get(klass[0])
130 |     model = klass.new(stream)
131 |     self.model_files[name] = file
132 |     self.models[name] = model
133 |   end
134 | 
135 | end
136 | 
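
`load_model` in `lib/open-nlp/bindings.rb` caches one model per name and reuses it only when the requested file matches the file that produced the cached model. A JVM-free sketch of that reuse rule is shown here. `ModelCache` and its loader block are hypothetical and stand in for the `FileInputStream` and model-class calls, which need the OpenNLP JARs. Unlike the original, this sketch resolves the default file name before comparing, so a default model requested twice is loaded only once, and it raises a descriptive error when no default exists.

```ruby
# Hypothetical cache mirroring the reuse rule in load_model.
class ModelCache
  def initialize(defaults, &loader)
    @defaults = defaults   # e.g. { tokenizer: 'en-token.bin' }
    @loader   = loader     # stands in for reading the model file
    @models   = {}
    @files    = {}
  end

  # Return the cached model for `name` unless a different file is requested.
  def fetch(name, file = nil)
    file ||= @defaults.fetch(name) { raise ArgumentError, "no default model for #{name.inspect}" }
    return @models[name] if @models.key?(name) && @files[name] == file
    @files[name]  = file
    @models[name] = @loader.call(file)
  end
end

loads = []
cache = ModelCache.new(tokenizer: 'en-token.bin') do |file|
  loads << file
  "model from #{file}"
end

cache.fetch(:tokenizer)                   # loads en-token.bin
cache.fetch(:tokenizer)                   # served from the cache
cache.fetch(:tokenizer, 'da-token.bin')   # different file, so it reloads
loads # => ["en-token.bin", "da-token.bin"]
```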


--------------------------------------------------------------------------------
/lib/open-nlp/classes.rb:
--------------------------------------------------------------------------------
 1 | require 'open-nlp/base'
 2 | 
 3 | class OpenNLP::SentenceDetectorME < OpenNLP::Base; end
 4 | 
 5 | class OpenNLP::SimpleTokenizer < OpenNLP::Base; end
 6 | 
 7 | class OpenNLP::TokenizerME < OpenNLP::Base; end
 8 | 
 9 | class OpenNLP::POSTaggerME < OpenNLP::Base
10 | 
11 |   unless RUBY_PLATFORM =~ /java/
12 |     def tag(*args)
13 |       @proxy_inst._invoke("tag", "[Ljava.lang.String;", args[0])
14 |     end
15 |   end
16 | 
17 | end
18 | 
19 | class OpenNLP::ChunkerME < OpenNLP::Base
20 | 
21 |   if RUBY_PLATFORM =~ /java/
22 | 
23 |     def chunk(tokens, tags)
24 |       if !tokens.is_a?(Array)
25 |         tokens = tokens.to_a
26 |         tags = tags.to_a
27 |       end
28 |       tokens = tokens.to_java(:String)
29 |       tags = tags.to_java(:String)
30 |       @proxy_inst.chunk(tokens,tags).to_a
31 |     end
32 | 
33 |   else
34 | 
35 |     def chunk(tokens, tags)
36 |       chunks = @proxy_inst._invoke("chunk", "[Ljava.lang.String;[Ljava.lang.String;", tokens, tags)
37 |       chunks.map { |c| c.to_s }
38 |     end
39 | 
40 |   end
41 | 
42 | end
43 | 
44 | class OpenNLP::Parser < OpenNLP::Base
45 | 
46 |   def parse(text)
47 | 
48 |     tokenizer = OpenNLP::TokenizerME.new
49 |     full_span = OpenNLP::Bindings::Span.new(0, text.size)
50 | 
51 |     parse_obj = OpenNLP::Bindings::Parse.new(
52 |     text, full_span, "INC", 1, 0)
53 | 
54 |     tokens = tokenizer.tokenize_pos(text)
55 | 
56 |     tokens.each_with_index do |tok,i|
57 |       start, stop = tok.get_start, tok.get_end
58 |       token = text[start..stop-1]
59 |       span = OpenNLP::Bindings::Span.new(start, stop)
60 |       parse = OpenNLP::Bindings::Parse.new(text, span, "TK", 0, i)
61 |       parse_obj.insert(parse)
62 |     end
63 | 
64 |     @proxy_inst.parse(parse_obj)
65 | 
66 |   end
67 | 
68 | end
69 | 
70 | class OpenNLP::NameFinderME < OpenNLP::Base
71 |   unless RUBY_PLATFORM =~ /java/
72 |     def find(*args)
73 |       @proxy_inst._invoke("find", "[Ljava.lang.String;", args[0])
74 |     end
75 |     def prob(*args)
76 |       @proxy_inst._invoke("probs")
77 |     end
78 |   end
79 | end
80 | 


--------------------------------------------------------------------------------
/lib/open-nlp/config.rb:
--------------------------------------------------------------------------------
 1 | module OpenNLP::Config
 2 |   
 3 |   NameToClass = {
 4 |     categorizer: ['DoccatModel', 'opennlp.tools.doccat'],
 5 |     chunker: ['ChunkerModel', 'opennlp.tools.chunker'],
 6 |     detokenizer: ['DetokenizationDictionary', 'opennlp.tools.tokenize'],
 7 |     name_finder: ['TokenNameFinderModel', 'opennlp.tools.namefind'],
 8 |     parser: ['ParserModel', 'opennlp.tools.parser'],
 9 |     pos_tagger: ['POSModel', 'opennlp.tools.postag'],
10 |     sentence_detector: ['SentenceModel', 'opennlp.tools.sentdetect'],
11 |     tokenizer: ['TokenizerModel', 'opennlp.tools.tokenize']
12 |   }
13 |   
14 |   ClassToName = {
15 |     'ChunkerME' => :chunker,
16 |     'DictionaryDetokenizer' => :detokenizer,
17 |     'DocumentCategorizerME' => :categorizer,
18 |     'NameFinderME' => :name_finder,
19 |     'POSTaggerME' => :pos_tagger,
20 |     'Parser' => :parser,
21 |     'SentenceDetectorME' => :sentence_detector,
22 |     'TokenizerME' => :tokenizer,
23 |   }
24 | 
25 |   DefaultModels = {
26 |     chunker: {
27 |       english: 'en-chunker.bin'
28 |     },
29 |     detokenizer: {
30 |       english: 'en-detokenizer.xml'
31 |     },
32 |     # Intentionally left empty.
33 |     # Available for English, Spanish, Dutch.
34 |     name_finder: {},
35 |     parser: {
36 |       english: 'en-parser-chunking.bin'
37 |     },
38 |     pos_tagger: { 
39 |       english: 'en-pos-maxent.bin',
40 |       danish: 'da-pos-maxent.bin',
41 |       german: 'de-pos-maxent.bin',
42 |       dutch: 'nl-pos-maxent.bin',
43 |       portuguese: 'pt-pos-maxent.bin',
44 |       swedish: 'se-pos-maxent.bin'
45 |     },
46 |     sentence_detector: {
47 |       english: 'en-sent.bin',
48 |       german: 'de-sent.bin',
49 |       danish: 'da-sent.bin',
50 |       dutch: 'nl-sent.bin',
51 |       portuguese: 'pt-sent.bin',
52 |       swedish: 'se-sent.bin'
53 |     },
54 |     tokenizer: {
55 |       english: 'en-token.bin',
56 |       danish: 'da-token.bin',
57 |       german: 'de-token.bin',
58 |       dutch: 'nl-token.bin',
59 |       portuguese: 'pt-token.bin',
60 |       swedish: 'se-token.bin'
61 |     }
62 |   }
63 | 
64 |   # Classes that require a model as first argument to constructor.
65 |   RequiresModel = [
66 |     'SentenceDetectorME', 'NameFinderME', 'DictionaryDetokenizer',
67 |     'TokenizerME', 'ChunkerME', 'POSTaggerME', 'Parser'
68 |   ]
69 | 
70 |   
71 | end
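
The tables in `lib/open-nlp/config.rb` are consulted in two steps: `ClassToName` maps a wrapper class name to a symbolic name, and `DefaultModels` maps that name and a language to a model file. A self-contained sketch of the lookup follows. The two hashes are abridged copies of the tables, and `default_model_file` is a hypothetical helper (the gem performs the lookup inline in `Bindings.load_model`) that raises a descriptive error instead of failing on a missing entry.

```ruby
# Abridged copies of the configuration tables, for illustration only.
CLASS_TO_NAME = {
  'TokenizerME'  => :tokenizer,
  'NameFinderME' => :name_finder
}.freeze

DEFAULT_MODELS = {
  tokenizer:   { english: 'en-token.bin', german: 'de-token.bin' },
  name_finder: {} # no defaults, as in the gem's configuration
}.freeze

# Hypothetical helper: resolve the default model file for a class and language.
def default_model_file(class_name, language)
  name = CLASS_TO_NAME.fetch(class_name) do
    raise ArgumentError, "unknown class #{class_name}"
  end
  models = DEFAULT_MODELS.fetch(name, {})
  models.fetch(language) do
    raise ArgumentError, "no default #{name} model for #{language.inspect}"
  end
end

default_model_file('TokenizerME', :german)
# => "de-token.bin"
```

Calling the helper for `'NameFinderME'` raises an `ArgumentError`, which matches the gem's requirement that name finder models be passed explicitly to the constructor.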


--------------------------------------------------------------------------------
/lib/open-nlp/version.rb:
--------------------------------------------------------------------------------
1 | module OpenNLP
2 |   VERSION = '0.1.5'
3 | end
4 | 


--------------------------------------------------------------------------------
/open-nlp.gemspec:
--------------------------------------------------------------------------------
 1 | $:.push File.expand_path('../lib', __FILE__)
 2 | 
 3 | require 'open-nlp/version'
 4 | 
 5 | Gem::Specification.new do |s|
 6 | 
 7 |   s.name        = 'open-nlp'
 8 |   s.version     = OpenNLP::VERSION
 9 |   s.authors     = ['Louis Mullie']
10 |   s.email       = ['louis.mullie@gmail.com']
11 |   s.homepage    = 'https://github.com/louismullie/open-nlp'
12 |   s.summary     = %q{ Ruby bindings to the OpenNLP Java toolkit. }
13 |   s.description = %q{ Ruby bindings to the OpenNLP tools, a Java machine learning toolkit for natural language processing (NLP). }
14 |   
15 |   # Add all files.
16 |   s.files = Dir['bin/**/*'] + Dir['lib/**/*'] + Dir['spec/**/*'] +  ['README.md', 'LICENSE']
17 |   
18 |   # Runtime dependency.
19 |   s.add_runtime_dependency 'bind-it', '~>0.2.5'
20 |   
21 |   # Development dependency.
22 |   s.add_development_dependency 'rspec'
23 |   
24 | end
25 | 


--------------------------------------------------------------------------------
/spec/english_spec.rb:
--------------------------------------------------------------------------------
  1 | # encoding: utf-8
  2 | require_relative 'spec_helper'
  3 | 
  4 | describe OpenNLP do
  5 |   
  6 |   context "when an unreachable jar_path or model_path is provided" do
  7 |     it "raises an exception when trying to load" do
  8 |       OpenNLP.jar_path = '/unreachable/'
  9 |       OpenNLP::Bindings.jar_path.should eql '/unreachable/'
 10 |       OpenNLP.model_path = '/unreachable/'
 11 |       OpenNLP::Bindings.model_path.should eql '/unreachable/'
 12 |       expect { OpenNLP.load }.to raise_exception
 13 |       OpenNLP.jar_path = OpenNLP.model_path = OpenNLP.default_path
 14 |       expect { OpenNLP.load }.not_to raise_exception
 15 |     end
 16 |   end
 17 |   
 18 |   context "when a constructor is provided with a specific model to load" do
 19 |     it "loads that model, looking for the supplied file relative to OpenNLP.model_path" do
 20 |       
 21 |       OpenNLP.load
 22 |       
 23 |       tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
 24 |       tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
 25 | 
 26 |       sent = "The death of the poet was kept from his poems."
 27 |       tokens = tokenizer.tokenize(sent)
 28 |       tags = tagger.tag(tokens)
 29 |       
 30 |       OpenNLP.models[:pos_tagger].get_pos_model.to_s
 31 |       .index('opennlp.perceptron.PerceptronModel').should_not be_nil
 32 | 
 33 |       tags.to_a.should eql ["DT", "NN", "IN", "DT", "NN", "VBD", "VBN", "IN", "PRP$", "NNS", "."]
 34 | 
 35 |     end
 36 |   end
 37 | 
 38 |   context "when a class is loaded through the #load_class method" do
 39 |     it "loads the class and allows access to it through the global namespace" do
 40 |       OpenNLP.load_class('ChunkSample', 'opennlp.tools.chunker')
 41 |       expect { OpenNLP::ChunkSample }.not_to raise_exception
 42 |     end
 43 |   end
 44 | 
 45 |   context "the maximum entropy chunker is run after tokenization and POS tagging" do
 46 |     it "should find the accurate chunks" do
 47 |       
 48 |       OpenNLP.load
 49 |       
 50 |       chunker   = OpenNLP::ChunkerME.new
 51 |       tokenizer = OpenNLP::TokenizerME.new
 52 |       tagger    = OpenNLP::POSTaggerME.new
 53 | 
 54 |       sent   = "The death of the poet was kept from his poems."
 55 |       tokens = tokenizer.tokenize(sent)
 56 |       tags   = tagger.tag(tokens)
 57 |       
 58 |       chunks = chunker.chunk(tokens, tags)
 59 | 
 60 |       chunks.to_a.should eql %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
 61 |       tokens.to_a.should eql %w[The death of the poet was kept from his poems .]
 62 |       tags.to_a.should eql %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]
 63 |       
 64 |     end
 65 |   end
 66 | 
 67 |   context "the maximum entropy parser is run after tokenization" do
 68 |     it "parses the text accurately" do
 69 |       
 70 |       OpenNLP.load
 71 |       
 72 |       sent      = "The death of the poet was kept from his poems."
 73 |       parser = OpenNLP::Parser.new
 74 |       parse = parser.parse(sent)
 75 | 
 76 |       parse.get_text.should eql sent
 77 | 
 78 |       parse.get_span.get_start.should eql 0
 79 |       parse.get_span.get_end.should eql 46
 80 |       parse.get_span.get_type.should eql nil # ?
 81 |       parse.get_child_count.should eql 1
 82 | 
 83 |       child = parse.get_children[0]
 84 | 
 85 |       child.text.should eql "The death of the poet was kept from his poems."
 86 |       child.get_child_count.should eql 3
 87 |       child.get_head_index.should eql 5
 88 | 
 89 |       child.get_head.get_child_count.should eql 1
 90 |       child.get_type.should eql "S"
 91 | 
 92 |     end
 93 |   end
 94 | 
 95 |   context "the SimpleTokenizer is run" do
 96 |     it "tokenizes the text accurately" do
 97 |       
 98 |       OpenNLP.load
 99 |       
100 |       sent = "The death of the poet was kept from his poems."
101 |       tokenizer = OpenNLP::SimpleTokenizer.new
102 |       tokens = tokenizer.tokenize(sent).to_a
103 |       tokens.should eql %w[The death of the poet was kept from his poems .]
104 |       
105 |     end
106 |   end
107 | 
108 |   context "the maximum entropy sentence detector, tokenizer, POS tagger " +
109 |   "and NER finders are run with the default models for English" do
110 | 
111 |     it "should accurately detect tokens, sentences and named entities" do
112 | 
113 |       OpenNLP.load
114 |       
115 |       text = File.read('./spec/sample.txt').gsub("\n", "")
116 |       
117 |       tokenizer   = OpenNLP::TokenizerME.new
118 |       segmenter   = OpenNLP::SentenceDetectorME.new
119 |       tagger      = OpenNLP::POSTaggerME.new
120 |       ner_models  = ['person', 'time', 'money']
121 | 
122 |       ner_finders = ner_models.map do |model|
123 |         OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
124 |       end
125 | 
126 |       sentences = segmenter.sent_detect(text)
127 |       all_entities, all_tags, all_sentences, all_tokens = [], [], [], []
128 | 
129 |       sentences.each do |sentence|
130 | 
131 |         tokens = tokenizer.tokenize(sentence)
132 |         tags   = tagger.tag(tokens)
133 |         
134 |         ner_models.each_with_index do |model,i|
135 |           finder = ner_finders[i]
136 |           name_spans = finder.find(tokens)
137 |           name_probs = finder.probs()
138 |           name_spans.each_with_index do |name_span,j|
139 |             start = name_span.get_start
140 |             stop  = name_span.get_end-1
141 |             slice = tokens[start..stop].to_a
142 |             prob = name_probs[j]
143 |             all_entities << [slice, model, prob]
144 |           end
145 |         end
146 | 
147 |         all_tokens << tokens.to_a
148 |         all_sentences << sentence
149 |         all_tags << tags.to_a
150 |         
151 |       end
152 | 
153 |       all_tokens.should eql [["To", "describe", "2009", "as", "a", "stellar", "year", "for", "Petrofac", "(", "LON:PFC)", "would", "be", "a", "huge", "understatement", "."], ["The", "group", "finished", "the", "year", "with", "an", "order", "backlog", "twice", "the", "size", "than", "it", "had", "at", "the", "outset", "."], ["The", "group", "has", "since", "been", "awarded", "a", "US", "600", "million", "contract", "and", "spun", "off", "its", "North", "Sea", "assets", "."], ["The", "group", "’s", "recently", "released", "full", "year", "results", "show", "a", "jump", "in", "revenues", ",", "pre-tax", "profits", "and", "order", "backlog", "."], ["Whilst", "group", "revenue", "rose", "by", "10", "%", "from", "$", "3.3", "billion", "to", "$", "3.7", "billion", ",", "pre-tax", "profits", "rose", "by", "25", "%", "from", "$", "358", "million", "to", "$", "448", "million", ".All", "the", "more", "impressive", ",", "the", "group", "’s", "order", "backlog", "doubled", "to", "over", "$", "8", "billion", "paying", "no", "attention", "to", "the", "15", "%", "cut", "in", "capital", "expenditure", "witnessed", "across", "the", "oil", "and", "gas", "industry", "as", "whole", "in", "2009", ".Focussing", "in", "on", "which", "the", "underlying", "performances", "of", "the", "individual", "segments", ",", "the", "group", "cash", "cow", ",", "its", "Engineering", "and", "Construction", "division", ",", "saw", "operating", "profit", "rise", "33", "%", "over", "the", "year", "to", "$", "322", "million", ",", "thanks", "to", "US$", "6.3", "billion", "worth", "of", "new", "contract", "wins", "during", "the", "year", "which", "included", "a", "$", "100", "million", "contract", "with", "Turkmengaz", ",", "the", "Turkmenistan", "national", "energy", "company", "."], ["The", "division", "has", "picked", "up", "in", "2010", "where", "it", "left", "off", "in", "2009", "and", "has", "been", "awarded", "a", "contract", "worth", "more", "than", "US600", "million", "for", "a", "gas", 
"sweetening", "facilities", "project", "by", "Qatar", "Petroleum.Elsewhere", "the", "group", "’s", "Offshore", "Engineering", "&", "Operations", "division", "may", "have", "seen", "a", "pullback", "in", "revenue", "and", "earnings", "vis-a-vis", "2008", ",", "but", "it", "did", "secure", "a", "£75", "million", "contract", "with", "Apache", "to", "provideengineering", "and", "construction", "services", "for", "the", "Forties", "field", "in", "the", "UK", "North", "Sea", "."], ["And", "to", "underscore", "the", "fact", "that", "there", "is", "life", "beyond", "NOC’s", "for", "Petrofac", "(", "LON:PFC)", "the", "division", "was", "awarded", "a", "£100", "million", "5-year", "contract", "by", "BP", "(", "LON:BP.", ")", "to", "deliver", "integrated", "maintenance", "management", "support", "services", "for", "all", "of", "BP", "'s", "UK", "offshore", "assets", "and", "onshore", "Dimlington", "plant", "."], ["The", "laggard", "of", "the", "group", "was", "the", "Engineering", ",", "Training", "Services", "and", "Production", "Solutions", "division", "."], ["The", "business", "suffered", "as", "the", "oil", "price", "tailed", "off", "and", "the", "economic", "outlook", "deteriorated", "forcing", "a", "number", "ofmajor", "customers", "to", "postpone", "early", "stage", "engineering", "studies", "or", "re-phased", "work", "upon", "which", "the", "division", "depends", "."], ["Although", "the", "fall", "in", "activity", "was", "notable", ",", "the", "division’s", "operational", "performance", "in", "service", "operator", "role", "for", "production", "of", "Dubai", "'s", "offshore", "oil", "&", "gas", "proved", "a", "highlight.Energy", "Developments", "meanwhile", "saw", "the", "start", "of", "oil", "production", "from", "the", "West", "Don", "field", "during", "the", "first", "half", "of", "the", "year", "less", "than", "a", "year", "from", "Field", "Development", "Programme", "approval", "."], ["In", "addition", "output", "from", "Don", "Southwest", "field", "began", "in", 
"June", "."], ["Despite", "considerably", "lower", "oil", "prices", "in", "2009", "compared", "to", "the", "prior", "year", ",", "Energy", "Developments", "'", "revenue", "reached", "almost", "US$", "250", "million", "(", "significantly", "higher", "than", "the", "US$", "153", "million", "of", "2008", ")", "due", "not", "only", "to", "the", "‘Don", "fields", "effect", "’", "but", "also", "a", "full", "year", "'s", "contribution", "from", "the", "Chergui", "gas", "plant", ",", "which", "began", "exports", "in", "August", "2008.In", "order", "to", "maximize", "the", "earnings", "potential", "of", "the", "division’s", "North", "Sea", "assets", ",", "including", "the", "Don", "assets", ",", "the", "group", "has", "demerged", "them", "providing", "its", "shareholders", "with", "shares", "in", "a", "newly", "listed", "independent", "exploration", "and", "production", "company", "called", "EnQuest", "(", "LON:ENQ", ")", "."], ["EnQuest", "is", "a", "product", "of", "the", "Petrofac’s", "North", "Sea", "Assets", "with", "those", "off", "of", "Swedish", "explorer", "Lundin", "with", "both", "companies", "divesting", "for", "different", "reasons", "."], ["Upon", "listing", "(", "April", "6th", ")", ",", "Petrofac", "(", "LON:PFC)", "shareholders", "owned", "around", "45", "%", "of", "the", "new", "EnQuest", "entity", "with", "Lundin", "shareholders", "owning", "approximately", "55", "%", "."], ["It", "is", "important", "to", "note", "that", "post", "demerger", "the", "Energy", "Developments", "business", "unit", "is", "still", "a", "key", "constituent", "of", "Petrofac", "'s", "business", "portfolio", ",", "and", "will", "continue", "to", "hold", "significant", "assets", "Tunisia", ",", "Malaysia", ",", "Algeria", "and", "Kyrgyz", "Republic", "-", "sandwiched", "between", "Kazakhstan", "and", "China", "."]]
154 | 
155 |       all_sentences.should eql ["To describe 2009 as a stellar year for Petrofac (LON:PFC) would be a huge understatement.", "The group finished the year with an order backlog twice the size than it had at the outset.", "The group has since been awarded a US 600 million contract and spun off its North Sea assets.", "The group’s recently released full year results show a jump in revenues, pre-tax profits and order backlog.", "Whilst group revenue rose by 10% from $3.3 billion to $3.7 billion, pre-tax profits rose by 25% from $358 million to $448 million.All the more impressive, the group’s order backlog doubled to over $8 billion paying no attention to the 15% cut in capital expenditure witnessed across the oil and gas industry as whole in 2009.Focussing in on which the underlying performances of the individual segments, the group cash cow, its Engineering and Construction division, saw operating profit rise 33% over the year to $322 million, thanks to US$6.3 billion worth of new contract wins during the year which included a $100 million contract with Turkmengaz, the Turkmenistan national energy company.", "The division has picked up in 2010 where it left off in 2009 and has been awarded a contract worth more than US600 million for a gas sweetening facilities project by Qatar Petroleum.Elsewhere the group’s Offshore Engineering & Operations division may have seen a pullback in revenue and earnings vis-a-vis 2008, but it did secure a £75 million contract with Apache to provideengineering and construction services for the Forties field in the UK North Sea.", "And to underscore the fact that there is life beyond NOC’s for Petrofac (LON:PFC) the division was awarded a £100 million 5-year contract by BP (LON:BP.) 
to deliver integrated maintenance management support services for all of BP's UK offshore assets and onshore Dimlington plant.", "The laggard of the group was the Engineering, Training Services and Production Solutions division.", "The business suffered as the oil price tailed off and the economic outlook deteriorated forcing a number ofmajor customers to postpone early stage engineering studies or re-phased work upon which the division depends.", "Although the fall in activity was notable, the division’s operational performance in service operator role for production of Dubai's offshore oil & gas proved a highlight.Energy Developments meanwhile saw the start of oil production from the West Don field during the first half of the year less than a year from Field Development Programme approval.", "In addition output from Don Southwest field began in June.", "Despite considerably lower oil prices in 2009 compared to the prior year, Energy Developments' revenue reached almost US$250 million (significantly higher than the US$153 million of 2008) due not only to the ‘Don fields effect’ but also a full year's contribution from the Chergui gas plant, which began exports in August 2008.In order to maximize the earnings potential of the division’s North Sea assets, including the Don assets, the group has demerged them providing its shareholders with shares in a newly listed independent exploration and production company called EnQuest (LON:ENQ).", "EnQuest is a product of the Petrofac’s North Sea Assets with those off of Swedish explorer Lundin with both companies divesting for different reasons.", "Upon listing (April 6th), Petrofac (LON:PFC) shareholders owned around 45% of the new EnQuest entity with Lundin shareholders owning approximately 55%.", "It is important to note that post demerger the Energy Developments business unit is still a key constituent of Petrofac's business portfolio, and will continue to hold significant assets Tunisia, Malaysia, Algeria and Kyrgyz 
Republic - sandwiched between Kazakhstan and China."]
156 | 
157 |       all_entities.should eql [[["$", "3.3", "billion"], "money", 0.999947714896808], [["$", "3.7", "billion"], "money", 0.9999935750135956], [["$", "358", "million", "to", "$", "448", "million"], "money", 0.9999829748632709], [["$", "8", "billion"], "money", 0.9999374095970546], [["$", "322", "million"], "money", 0.9998880078069273], [["$", "100", "million"], "money", 0.9998309013722682], [["Lundin"], "person", 0.9632801339070118], [["Lundin"], "person", 0.9439806969044655]]
158 |       
159 |       all_tags.should eql [["TO", "VB", "CD", "IN", "DT", "NN", "NN", "IN", "NNP", "-LRB-", "NNP", "MD", "VB", "DT", "JJ", "NN", "."], ["DT", "NN", "VBD", "DT", "NN", "IN", "DT", "NN", "NN", "RB", "DT", "NN", "IN", "PRP", "VBD", "IN", "DT", "NN", "."], ["DT", "NN", "VBZ", "RB", "VBN", "VBN", "DT", "PRP", "CD", "CD", "NN", "CC", "VBD", "RP", "PRP$", "NNP", "NNP", "NNS", "."], ["DT", "NN", "VBD", "RB", "VBN", "JJ", "NN", "NNS", "VBP", "DT", "NN", "IN", "NNS", ",", "JJ", "NNS", "CC", "NN", "NN", "."], ["NNP", "NN", "NN", "VBD", "IN", "CD", "NN", "IN", "$", "CD", "CD", "TO", "$", "CD", "CD", ",", "JJ", "NNS", "VBD", "IN", "CD", "NN", "IN", "$", "CD", "CD", "TO", "$", "CD", "CD", "PDT", "DT", "RBR", "JJ", ",", "DT", "NN", "VBZ", "NN", "NN", "VBD", "TO", "RP", "$", "CD", "CD", "VBG", "DT", "NN", "TO", "DT", "CD", "NN", "NN", "IN", "NN", "NN", "VBN", "IN", "DT", "NN", "CC", "NN", "NN", "IN", "JJ", "IN", "CD", "NN", "IN", "IN", "WDT", "DT", "JJ", "NNS", "IN", "DT", "JJ", "NNS", ",", "DT", "NN", "NN", "NN", ",", "PRP$", "NNP", "CC", "NNP", "NN", ",", "VBD", "NN", "NN", "VB", "CD", "NN", "IN", "DT", "NN", "TO", "$", "CD", "CD", ",", "NNS", "TO", "$", "CD", "CD", "NN", "IN", "JJ", "NN", "VBZ", "IN", "DT", "NN", "WDT", "VBD", "DT", "$", "CD", "CD", "NN", "IN", "NNP", ",", "DT", "NNP", "JJ", "NN", "NN", "."], ["DT", "NN", "VBZ", "VBN", "RP", "IN", "CD", "WRB", "PRP", "VBD", "RP", "IN", "CD", "CC", "VBZ", "VBN", "VBN", "DT", "NN", "NN", "JJR", "IN", "CD", "CD", "IN", "DT", "NN", "VBG", "NNS", "NN", "IN", "NNP", "NNP", "DT", "NN", "JJ", "NNP", "NNP", "CC", "NNP", "NN", "MD", "VB", "VBN", "DT", "NN", "IN", "NN", "CC", "NNS", "NN", "CD", ",", "CC", "PRP", "VBD", "VB", "DT", "CD", "CD", "NN", "IN", "NNP", "TO", "VB", "CC", "NN", "NNS", "IN", "DT", "NNP", "NN", "IN", "DT", "NNP", "NNP", "NNP", "."], ["CC", "TO", "VB", "DT", "NN", "IN", "EX", "VBZ", "NN", "IN", "NNP", "IN", "NNP", "-LRB-", "NNP", "DT", "NN", "VBD", "VBN", "DT", "CD", "CD", "JJ", "NN", "IN", "NNP", "-LRB-", 
"NNP", "-RRB-", "TO", "VB", "JJ", "NN", "NN", "NN", "NNS", "IN", "DT", "IN", "NNP", "POS", "NN", "JJ", "NNS", "CC", "RB", "NNP", "NN", "."], ["DT", "NN", "IN", "DT", "NN", "VBD", "DT", "NNP", ",", "NNP", "NNP", "CC", "NNP", "NNP", "NN", "."], ["DT", "NN", "VBD", "IN", "DT", "NN", "NN", "VBN", "RB", "CC", "DT", "JJ", "NN", "VBD", "VBG", "DT", "NN", "IN", "NNS", "TO", "VB", "JJ", "NN", "NN", "NNS", "CC", "JJ", "NN", "IN", "WDT", "DT", "NN", "VBZ", "."], ["IN", "DT", "NN", "IN", "NN", "VBD", "JJ", ",", "DT", "JJ", "JJ", "NN", "IN", "NN", "NN", "NN", "IN", "NN", "IN", "NNP", "POS", "JJ", "NN", "CC", "NN", "VBD", "DT", "RB", "NNPS", "RB", "VBD", "DT", "NN", "IN", "NN", "NN", "IN", "DT", "NNP", "NNP", "NN", "IN", "DT", "JJ", "NN", "IN", "DT", "NN", "RBR", "IN", "DT", "NN", "IN", "NNP", "NNP", "NNP", "NN", "."], ["IN", "NN", "NN", "IN", "NNP", "NNP", "NN", "VBD", "IN", "NNP", "."], ["IN", "RB", "JJR", "NN", "NNS", "IN", "CD", "VBN", "TO", "DT", "JJ", "NN", ",", "NNP", "NNPS", "POS", "NN", "VBD", "RB", "$", "CD", "CD", "-LRB-", "RB", "JJR", "IN", "DT", "$", "CD", "CD", "IN", "CD", "-RRB-", "RB", "RB", "RB", "TO", "DT", "JJ", "NNS", "NN", ",", "CC", "RB", "DT", "JJ", "NN", "POS", "NN", "IN", "DT", "NNP", "NN", "NN", ",", "WDT", "VBD", "NNS", "IN", "NNP", "IN", "NN", "TO", "VB", "DT", "NNS", "NN", "IN", "DT", "JJ", "NNP", "NNP", "NNS", ",", "VBG", "DT", "NNP", "NNS", ",", "DT", "NN", "VBZ", "VBN", "PRP", "VBG", "PRP$", "NNS", "IN", "NNS", "IN", "DT", "RB", "VBN", "JJ", "NN", "CC", "NN", "NN", "VBD", "NNP", "-LRB-", "NN", "-RRB-", "."], ["NNP", "VBZ", "DT", "NN", "IN", "DT", "NNP", "NNP", "NNP", "NNS", "IN", "DT", "IN", "IN", "JJ", "NN", "NN", "IN", "DT", "NNS", "VBG", "IN", "JJ", "NNS", "."], ["IN", "VBG", "-LRB-", "NNP", "NN", "-RRB-", ",", "NNP", "-LRB-", "NNP", "NNS", "VBD", "IN", "CD", "NN", "IN", "DT", "JJ", "NNP", "NN", "IN", "NNP", "NNS", "VBG", "RB", "CD", "NN", "."], ["PRP", "VBZ", "JJ", "TO", "VB", "IN", "NN", "NN", "DT", "NNP", "NNPS", "NN", "NN", "VBZ", "RB", 
"DT", "JJ", "NN", "IN", "NNP", "POS", "NN", "NN", ",", "CC", "MD", "VB", "TO", "VB", "JJ", "NNS", "NNP", ",", "NNP", ",", "NNP", "CC", "NNP", "NNP", ":", "VBD", "IN", "NNP", "CC", "NNP", "."]]
160 |       
161 |     end
162 |   end
163 | 
164 | end
165 | 
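Note on the `all_entities` expectation above: it compares name-finder probabilities such as `0.999947714896808` with exact equality, which may be brittle if the model files, the JVM, or the bridge (Rjb vs. JRuby) differ slightly. The following is a minimal sketch of a tolerance-based comparison. It assumes entity triples of the form `[tokens, label, probability]` as they appear in the spec. The helper name `entities_match?` and the default tolerance are hypothetical and not part of open-nlp or RSpec.

```ruby
# Hypothetical helper (not part of open-nlp): compares two lists of entity
# triples [tokens, label, probability]. Tokens and labels must be identical;
# probabilities only need to agree within `tolerance`.
def entities_match?(actual, expected, tolerance = 1e-6)
  return false unless actual.size == expected.size

  actual.zip(expected).all? do |(tokens, label, prob), (exp_tokens, exp_label, exp_prob)|
    tokens == exp_tokens &&
      label == exp_label &&
      (prob - exp_prob).abs <= tolerance
  end
end

# Usage with two triples copied from the expectation above; the "actual"
# probabilities are shortened by hand to imitate small numeric drift.
expected = [[["$", "3.3", "billion"], "money", 0.999947714896808],
            [["Lundin"], "person", 0.9632801339070118]]
actual   = [[["$", "3.3", "billion"], "money", 0.9999477148968],
            [["Lundin"], "person", 0.96328013390701]]

puts entities_match?(actual, expected)
```

The boolean result could then be asserted in the spec in place of the exact `eql` check on `all_entities`. The token, sentence, and tag expectations contain no floating-point values and would not need this treatment.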


--------------------------------------------------------------------------------
/spec/sample.txt:
--------------------------------------------------------------------------------
 1 | To describe 2009 as a stellar year for Petrofac (LON:PFC) would be a huge understatement. The group finished the year with an order backlog twice the size than it had at the outset. The group has since been awarded a US 600 million contract and spun off its North Sea assets. 
 2 | The group’s recently released full year results show a jump in revenues, pre-tax profits and order backlog. Whilst group revenue rose by 10% from $3.3 billion to $3.7 billion, pre-tax profits rose by 25% from $358 million to $448 million.
 3 | 
 4 | All the more impressive, the group’s order backlog doubled to over $8 billion paying no attention to the 15% cut in capital expenditure witnessed across the oil and gas industry as whole in 2009.
 5 | 
 6 | Focussing in on which the underlying performances of the individual segments, the group cash cow, its Engineering and Construction division, saw operating profit rise 33% over the year to $322 million, thanks to US$6.3 billion worth of new contract wins during the year which included a $100 million contract with Turkmengaz, the Turkmenistan national energy company. The division has picked up in 2010 where it left off in 2009 and has been awarded a contract worth more than US600 million for a gas sweetening facilities project by Qatar Petroleum.
 7 | 
 8 | Elsewhere the group’s Offshore Engineering & Operations division may have seen a pullback in revenue and earnings vis-a-vis 2008, but it did secure a £75 million contract with Apache to provide
 9 | engineering and construction services for the Forties field in the UK North Sea. And to underscore the fact that there is life beyond NOC’s for Petrofac (LON:PFC) the division was awarded a £100 million 5-year contract by BP (LON:BP.) to deliver integrated maintenance management support services for all of BP's UK offshore assets and onshore Dimlington plant.   
10 | 
11 | The laggard of the group was the Engineering, Training Services and Production Solutions division. The business suffered as the oil price tailed off and the economic outlook deteriorated forcing a number of
12 | major customers to postpone early stage engineering studies or re-phased work upon which the division depends. Although the fall in activity was notable, the division’s operational performance in service operator role for production of Dubai's offshore oil & gas proved a highlight.
13 | 
14 | Energy Developments meanwhile saw the start of oil production from the West Don field during the first half of the year less than a year from Field Development Programme approval. In addition output from Don Southwest field began in June. Despite considerably lower oil prices in 2009 compared to the prior year, Energy Developments' revenue reached almost US$250 million (significantly higher than the US$153 million of 2008) due not only to the ‘Don fields effect’ but also a full year's contribution from the Chergui gas plant, which began exports in August 2008.
15 | 
16 | In order to maximize the earnings potential of the division’s North Sea assets, including the Don assets, the group has demerged them providing its shareholders with shares in a newly listed independent exploration and production company called EnQuest (LON:ENQ). 
17 | 
18 | EnQuest is a product of the Petrofac’s North Sea Assets with those off of Swedish explorer Lundin with both companies divesting for different reasons. Upon listing (April 6th), Petrofac (LON:PFC) shareholders owned around 45% of the new EnQuest entity with Lundin shareholders owning approximately 55%. 
19 | 
20 | It is important to note that post demerger the Energy Developments business unit is still a key constituent of Petrofac's business portfolio, and will continue to hold significant assets Tunisia, Malaysia, Algeria and Kyrgyz Republic - sandwiched between Kazakhstan and China.


--------------------------------------------------------------------------------
/spec/spec_helper.rb:
--------------------------------------------------------------------------------
1 | require 'rspec'
2 | require_relative '../lib/open-nlp'


--------------------------------------------------------------------------------