├── .gitignore
├── .ruby-version
├── .travis.yml
├── FAQ.md
├── Gemfile
├── LICENSE
├── Rakefile
├── contributing.md
├── header.png
├── inbox.md
├── motivation.md
├── pull_request_template.md
├── readme.md
├── ruby.jpg
├── test.png
└── tutorials
    ├── ruby-stemmer.md
    ├── template.md
    ├── tokenizer.md
    ├── tutorial_template.md
    └── weka-jruby.md
/.gitignore:
--------------------------------------------------------------------------------
1 | Gemfile.lock
2 | *json
3 | node_modules
4 | _site
5 | _blog
6 | .sass-cache
7 | .jekyll-metadata
8 | scripts
9 | www
10 |
--------------------------------------------------------------------------------
/.ruby-version:
--------------------------------------------------------------------------------
1 | 2.6.6
2 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: ruby
2 | bundler_args: --without local
3 | rvm:
4 | - 2.2
5 | script: "awesome_bot readme.md motivation.md FAQ.md --allow-dupe"
6 |
--------------------------------------------------------------------------------
/FAQ.md:
--------------------------------------------------------------------------------
1 | # Frequently (not yet) Asked Questions
2 |
3 | ## What is the Awesome Ruby NLP list?
4 |
5 | This list is the _first systematic_ attempt to list NLP and CL related
6 | resources for Ruby. It builds on earlier attempts,
7 | e.g. https://github.com/diasks2/ruby-nlp. We strive to list only
8 | working, high quality libraries. Read [why](motivation.md) this list is vital for
9 | the Ruby community.
10 |
11 | ## Why use Ruby for NLP?
12 |
13 | Everybody uses Python! Nobody hires Ruby developers for NLP tasks.
14 |
15 | To avoid a long discussion we can simply postulate: Ruby and Python are great
16 | programming languages, both very appealing to the community, but with different
17 | history. Everything written in Ruby could have been written in Python.
18 |
19 | Nevertheless we have our personal preferences like dogs over cats,
20 | tea over coffee etc. That's why you can choose the language which matches
21 | your mindset and does not break your mind to get compliant with a programming
22 | language.
23 |
24 | Take Ruby if you're happy with it. Use Python if you like it more. Do whatever
25 | you want and pay for your decisions!
26 |
27 | And if you still hesitate, look at this great
28 | [talk](https://www.youtube.com/watch?v=0D3KfnbTdWw) by Jim Weirich.
29 |
30 | ## Wait ... but Ruby is so slow?
31 |
32 | Ruby **IS** comparable in terms of processing speed with other high level
33 | scripting programming languages like Lua, Perl, Python etc.
34 |
35 | Please look at this comparison:
36 | https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/ruby.html
37 |
38 | ## Hm ... but would I find suitable libraries?
39 |
40 | Python has more! Eventually...
41 |
42 | Please look at the current [list](https://github.com/arbox/nlp-with-ruby),
43 | Ruby is equipped with all important libraries.
44 |
45 | ## Can I write an NLP application at Google's scale with Ruby?
46 |
47 | The answer is simple: "NO". Not in pure Ruby. But you can be very
48 | efficient and use Ruby bindings for Java, C and C++ based libraries.
49 | And sometimes buying newer hardware can be cheaper than writing everything in
50 | C++. It's definitely your choice!
51 |
52 | ## How do you name list items?
53 |
54 | Every library list item is named after the Ruby library. The name is
55 | the exact wording of the `gem install lib` statement (or `gem 'lib'` in your
56 | `Gemfile`) to facilitate search and memorization. That's why the appropriate item
57 | is called `treat` and not `Treat`.
58 |
59 |
60 | [motivation]: https://github.com/arbox/nlp-with-ruby/blob/master/motivation.md
61 |
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | # frozen_string_literal: true
2 | source 'https://rubygems.org'
3 |
4 | gem 'awesome_bot', '~> 1.17'
5 | gem 'rake', '~> 12.3'
6 |
7 | group :local do
8 |   gem 'jekyll', '~> 3.4'
9 | end
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | CC0 1.0 Universal
2 |
3 | Statement of Purpose
4 |
5 | The laws of most jurisdictions throughout the world automatically confer
6 | exclusive Copyright and Related Rights (defined below) upon the creator and
7 | subsequent owner(s) (each and all, an "owner") of an original work of
8 | authorship and/or a database (each, a "Work").
9 |
10 | Certain owners wish to permanently relinquish those rights to a Work for the
11 | purpose of contributing to a commons of creative, cultural and scientific
12 | works ("Commons") that the public can reliably and without fear of later
13 | claims of infringement build upon, modify, incorporate in other works, reuse
14 | and redistribute as freely as possible in any form whatsoever and for any
15 | purposes, including without limitation commercial purposes. These owners may
16 | contribute to the Commons to promote the ideal of a free culture and the
17 | further production of creative, cultural and scientific works, or to gain
18 | reputation or greater distribution for their Work in part through the use and
19 | efforts of others.
20 |
21 | For these and/or other purposes and motivations, and without any expectation
22 | of additional consideration or compensation, the person associating CC0 with a
23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
25 | and publicly distribute the Work under its terms, with knowledge of his or her
26 | Copyright and Related Rights in the Work and the meaning and intended legal
27 | effect of CC0 on those rights.
28 |
29 | 1. Copyright and Related Rights. A Work made available under CC0 may be
30 | protected by copyright and related or neighboring rights ("Copyright and
31 | Related Rights"). Copyright and Related Rights include, but are not limited
32 | to, the following:
33 |
34 | i. the right to reproduce, adapt, distribute, perform, display, communicate,
35 | and translate a Work;
36 |
37 | ii. moral rights retained by the original author(s) and/or performer(s);
38 |
39 | iii. publicity and privacy rights pertaining to a person's image or likeness
40 | depicted in a Work;
41 |
42 | iv. rights protecting against unfair competition in regards to a Work,
43 | subject to the limitations in paragraph 4(a), below;
44 |
45 | v. rights protecting the extraction, dissemination, use and reuse of data in
46 | a Work;
47 |
48 | vi. database rights (such as those arising under Directive 96/9/EC of the
49 | European Parliament and of the Council of 11 March 1996 on the legal
50 | protection of databases, and under any national implementation thereof,
51 | including any amended or successor version of such directive); and
52 |
53 | vii. other similar, equivalent or corresponding rights throughout the world
54 | based on applicable law or treaty, and any national implementations thereof.
55 |
56 | 2. Waiver. To the greatest extent permitted by, but not in contravention of,
57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
59 | and Related Rights and associated claims and causes of action, whether now
60 | known or unknown (including existing as well as future claims and causes of
61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum
62 | duration provided by applicable law or treaty (including future time
63 | extensions), (iii) in any current or future medium and for any number of
64 | copies, and (iv) for any purpose whatsoever, including without limitation
65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
66 | the Waiver for the benefit of each member of the public at large and to the
67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver
68 | shall not be subject to revocation, rescission, cancellation, termination, or
69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work
70 | by the public as contemplated by Affirmer's express Statement of Purpose.
71 |
72 | 3. Public License Fallback. Should any part of the Waiver for any reason be
73 | judged legally invalid or ineffective under applicable law, then the Waiver
74 | shall be preserved to the maximum extent permitted taking into account
75 | Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
76 | is so judged Affirmer hereby grants to each affected person a royalty-free,
77 | non transferable, non sublicensable, non exclusive, irrevocable and
78 | unconditional license to exercise Affirmer's Copyright and Related Rights in
79 | the Work (i) in all territories worldwide, (ii) for the maximum duration
80 | provided by applicable law or treaty (including future time extensions), (iii)
81 | in any current or future medium and for any number of copies, and (iv) for any
82 | purpose whatsoever, including without limitation commercial, advertising or
83 | promotional purposes (the "License"). The License shall be deemed effective as
84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the
85 | License for any reason be judged legally invalid or ineffective under
86 | applicable law, such partial invalidity or ineffectiveness shall not
87 | invalidate the remainder of the License, and in such case Affirmer hereby
88 | affirms that he or she will not (i) exercise any of his or her remaining
89 | Copyright and Related Rights in the Work or (ii) assert any associated claims
90 | and causes of action with respect to the Work, in either case contrary to
91 | Affirmer's express Statement of Purpose.
92 |
93 | 4. Limitations and Disclaimers.
94 |
95 | a. No trademark or patent rights held by Affirmer are waived, abandoned,
96 | surrendered, licensed or otherwise affected by this document.
97 |
98 | b. Affirmer offers the Work as-is and makes no representations or warranties
99 | of any kind concerning the Work, express, implied, statutory or otherwise,
100 | including without limitation warranties of title, merchantability, fitness
101 | for a particular purpose, non infringement, or the absence of latent or
102 | other defects, accuracy, or the present or absence of errors, whether or not
103 | discoverable, all to the greatest extent permissible under applicable law.
104 |
105 | c. Affirmer disclaims responsibility for clearing rights of other persons
106 | that may apply to the Work or any use thereof, including without limitation
107 | any person's Copyright and Related Rights in the Work. Further, Affirmer
108 | disclaims responsibility for obtaining any necessary consents, permissions
109 | or other rights required for any use of the Work.
110 |
111 | d. Affirmer understands and acknowledges that Creative Commons is not a
112 | party to this document and has no duty or obligation with respect to this
113 | CC0 or use of the Work.
114 |
115 | For more information, please see
116 |
117 |
--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
1 | require 'yaml'
2 | require 'rake/clean'
3 | CLEAN.include '*.json'
4 |
5 | namespace :test do
6 |   task :links2 do
7 |     require 'awesome_bot'
8 |     content = File.read('README.md')
9 |     result = AwesomeBot.check(content)
10 |     puts result.success(nil) ? ':)' : ':('
11 |   end
12 |
13 |   CMD_STRING = YAML.load_file('.travis.yml')['script']
14 |   desc 'Test links with AwesomeBot'
15 |   task :links do
16 |     system(CMD_STRING)
17 |   end
18 | end
19 |
20 | desc 'Regenerate the TOC.'
21 | task :toc do
22 |   `node_modules/markdown-toc/cli.js -i README.md`
23 | end
24 |
25 | desc 'Create the www sources.'
26 | task :webgen do
27 |   DOCS_DIR = 'www/_mkdocs_source/'
28 |   SRC_FILES = ['README.md', 'FAQ.md', 'motivation.md']
29 |   SRC_FILES.each do |name|
30 |     nodoc(name)
31 |   end
32 | end
33 |
34 | def nodoc(file)
35 |   lines = File.readlines(file)
36 |
37 |   if file == 'README.md'
38 |     file = 'index.md'
39 |   end
40 |   File.open(DOCS_DIR + file, 'w') do |file|
41 |     lines.each do |line|
42 | unless line =~ /
40 | ## Contents
41 |
42 |
43 |
44 | - [:sparkles: Tutorials](#sparkles-tutorials)
45 | - [NLP Pipeline Subtasks](#nlp-pipeline-subtasks)
46 |   * [Pipeline Generation](#pipeline-generation)
47 |   * [Multipurpose Engines](#multipurpose-engines)
48 |     + [On-line APIs](#on-line-apis)
49 |   * [Language Identification](#language-identification)
50 |   * [Segmentation](#segmentation)
51 |   * [Lexical Processing](#lexical-processing)
52 |     + [Stemming](#stemming)
53 |     + [Lemmatization](#lemmatization)
54 |     + [Lexical Statistics: Counting Types and Tokens](#lexical-statistics-counting-types-and-tokens)
55 |     + [Filtering Stop Words](#filtering-stop-words)
56 |   * [Phrasal Level Processing](#phrasal-level-processing)
57 |   * [Syntactic Processing](#syntactic-processing)
58 |     + [Constituency Parsing](#constituency-parsing)
59 |   * [Semantic Analysis](#semantic-analysis)
60 |   * [Pragmatical Analysis](#pragmatical-analysis)
61 | - [High Level Tasks](#high-level-tasks)
62 |   * [Spelling and Error Correction](#spelling-and-error-correction)
63 |   * [Text Alignment](#text-alignment)
64 |   * [Machine Translation](#machine-translation)
65 |   * [Sentiment Analysis](#sentiment-analysis)
66 |   * [Numbers, Dates, and Time Parsing](#numbers-dates-and-time-parsing)
67 |   * [Named Entity Recognition](#named-entity-recognition)
68 |   * [Text-to-Speech-to-Text](#text-to-speech-to-text)
69 | - [Dialog Agents, Assistants, and Chatbots](#dialog-agents-assistants-and-chatbots)
70 | - [Linguistic Resources](#linguistic-resources)
71 | - [Machine Learning Libraries](#machine-learning-libraries)
72 | - [Data Visualization](#data-visualization)
73 | - [Optical Character Recognition](#optical-character-recognition)
74 | - [Text Extraction](#text-extraction)
75 | - [Full Text Search, Information Retrieval, Indexing](#full-text-search-information-retrieval-indexing)
76 | - [Language Aware String Manipulation](#language-aware-string-manipulation)
77 | - [Articles, Posts, Talks, and Presentations](#articles-posts-talks-and-presentations)
78 | - [Projects and Code Examples](#projects-and-code-examples)
79 | - [Books](#books)
80 | - [Community](#community)
81 | - [Needs your Help!](#needs-your-help)
82 | - [Related Resources](#related-resources)
83 | - [License](#license)
84 |
85 |
86 |
87 |
88 |
89 | ## :sparkles: Tutorials
90 |
91 | Please help us to fill out this section! :smiley:
92 |
93 | ## NLP Pipeline Subtasks
94 |
95 | An NLP Pipeline starts with plain text.
96 |
97 | ### Pipeline Generation
98 |
99 | - [composable_operations](https://github.com/t6d/composable_operations) -
100 | Definition framework for operation pipelines.
101 | - [ruby-spark](https://github.com/ondra-m/ruby-spark) -
102 | Spark bindings with an easy to understand DSL.
103 | - [phobos](https://github.com/phobos/phobos) -
104 | Simplified Ruby Client for [Apache Kafka](https://kafka.apache.org/).
105 | - [parallel](https://github.com/grosser/parallel) -
106 | Supervisor for parallel execution on multiple CPUs or in many threads.
107 | - [pwrake](https://github.com/masa16/pwrake) -
108 | Rake extensions to run local and remote tasks in parallel.
109 |
110 | ### Multipurpose Engines
111 |
112 | - [open-nlp](https://github.com/louismullie/open-nlp) -
113 | Ruby Bindings for the [OpenNLP](https://opennlp.apache.org/) Toolkit.
114 | - [stanford-core-nlp](https://github.com/louismullie/stanford-core-nlp) -
115 | Ruby Bindings for the Stanford [CoreNLP](https://github.com/stanfordnlp/CoreNLP) tools.
116 | - [treat](https://github.com/louismullie/treat) -
117 | Natural Language Processing framework for Ruby (like [NLTK](http://www.nltk.org/) for Python).
118 | - [nlp_toolz](https://github.com/LeFnord/nlp_toolz) -
119 | Wrapper over some [OpenNLP](https://opennlp.apache.org/) classes and
120 | the original [Berkeley Parser](https://github.com/slavpetrov/berkeleyparser).
121 | - [open_nlp](https://github.com/hck/open_nlp) -
122 | JRuby Bindings for the [OpenNLP](https://opennlp.apache.org/) Toolkit.
123 | - [ruby-spacy](https://github.com/yohasebe/ruby-spacy) -
124 | Wrapper module for the spaCy NLP library via [PyCall](https://github.com/mrkn/pycall.rb).
125 |
126 | #### On-line APIs
127 |
128 | - [alchemyapi_ruby](https://github.com/alchemyapi/alchemyapi_ruby) -
129 | Legacy Ruby SDK for AlchemyAPI/Bluemix.
130 | - [wit-ruby](https://github.com/wit-ai/wit-ruby) -
131 | Ruby client library for the [Wit.ai](https://wit.ai/) Language Understanding Platform.
132 | - [wlapi](https://github.com/arbox/wlapi) - Ruby client library for
133 | [Wortschatz Leipzig](http://wortschatz.uni-leipzig.de/de) web services.
134 | - [monkeylearn-ruby](https://github.com/monkeylearn/monkeylearn-ruby) - Sentiment
135 | Analysis, Topic Modelling, Language Detection, Named Entity Recognition via
136 | a Ruby based Web API client.
137 | - [google-cloud-language](https://github.com/googleapis/google-cloud-ruby/tree/master/google-cloud-language) -
138 | Google's Natural Language service API for Ruby.
139 |
140 | ### Language Identification
141 |
142 | Language Identification is one of the first crucial steps in every NLP Pipeline.
143 |
144 | - [scylla](https://github.com/hashwin/scylla) -
145 | Language Categorization and Identification.
146 |
147 | ### Segmentation
148 |
149 | Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation.
150 |
151 | - [tokenizer](https://github.com/arbox/tokenizer) -
152 | Simple multilingual tokenizer.
153 | [[tutorial](tutorials/tokenizer.md)]
154 | - [pragmatic_tokenizer](https://github.com/diasks2/pragmatic_tokenizer) -
155 | Multilingual tokenizer to split a string into tokens.
156 | - [nlp-pure](https://github.com/parhamr/nlp-pure) -
157 | Natural language processing algorithms implemented in pure Ruby with minimal dependencies.
158 | - [textoken](https://github.com/manorie/textoken) -
159 | Simple and customizable text tokenization library.
160 | - [pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter) -
161 | Word Boundary Disambiguation with many cookies.
162 | - [punkt-segmenter](https://github.com/lfcipriani/punkt-segmenter) -
163 | Pure Ruby implementation of the Punkt Segmenter.
164 | - [tactful_tokenizer](https://github.com/zencephalon/Tactful_Tokenizer) -
165 | RegExp based tokenizer for different languages.
166 | - [scalpel](https://github.com/louismullie/scalpel) -
167 | Sentence Boundary Disambiguation tool.
168 |
169 | ### Lexical Processing
170 |
171 | #### Stemming
172 |
173 | Stemming is the term used in information retrieval to describe the process of
174 | reducing word forms to some base representation. Stemming should be distinguished
175 | from [Lemmatization](#lemmatization), since `stems` do not necessarily have a
176 | linguistic motivation.
177 |
178 | - [ruby-stemmer](https://github.com/aurelian/ruby-stemmer) -
179 | Ruby-Stemmer exposes the SnowBall API to Ruby.
180 | - [uea-stemmer](https://github.com/ealdent/uea-stemmer) -
181 | Conservative stemmer for search and indexing.
182 |
183 | #### Lemmatization
184 |
185 | Lemmatization is the process of finding the base form of a word. Lemmas
186 | are often collected in dictionaries.
187 |
188 | - [lemmatizer](https://github.com/yohasebe/lemmatizer) -
189 | WordNet based Lemmatizer for English texts.
190 |
191 | #### Lexical Statistics: Counting Types and Tokens
192 |
193 | - [wc](https://github.com/thesp0nge/wc) -
194 | Facilities to count word occurrences in a text.
195 | - [word_count](https://github.com/AtelierConvivialite/word_count) -
196 | Word counter for `String` and `Hash` objects.
197 | - [words_counted](https://github.com/abitdodgy/words_counted) -
198 | Pure Ruby library counting word statistics with different custom options.
199 |
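To make the distinction between tokens (running words) and types (distinct word forms) concrete, here is a minimal pure Ruby sketch that uses none of the gems above. The helper names `tokens` and `type_frequencies` are hypothetical and exist only for illustration, and the regular expression tokenizer is a deliberately naive assumption rather than a substitute for a real tokenizer.

```ruby
# Naive tokenizer (an assumption for this sketch): lowercase the text and
# keep runs of letters, digits and apostrophes.
def tokens(text)
  text.downcase.scan(/[[:alnum:]']+/)
end

# Build a frequency table that maps each type to the number of its tokens.
def type_frequencies(text)
  tokens(text).each_with_object(Hash.new(0)) do |token, counts|
    counts[token] += 1
  end
end

text = 'The cat sat on the mat. The mat was flat.'
frequencies = type_frequencies(text)

puts "tokens: #{tokens(text).size}"
puts "types:  #{frequencies.size}"
puts "type/token ratio: #{(frequencies.size.to_f / tokens(text).size).round(2)}"
```

The listed gems add options such as custom delimiters and exclusion filters, so prefer them for anything beyond a quick count.
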
200 | #### Filtering Stop Words
201 |
202 | - [stopwords-filter](https://github.com/brenes/stopwords-filter) - Filter and
203 | Stop Word Lexicon based on the SnowBall stemmer.
204 |
205 | ### Phrasal Level Processing
206 |
207 | - [n_gram](https://github.com/reddavis/N-Gram) -
208 | N-Gram generator.
209 | - [ruby-ngram](https://github.com/tkellen/ruby-ngram) -
210 | Break words and phrases into ngrams.
211 | - [raingrams](https://github.com/postmodern/raingrams) -
212 | Flexible and general-purpose ngrams library written in pure Ruby.
213 |
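For readers new to the idea, an n-gram is simply a window of `n` consecutive items (words or characters) sliding over a sequence. The sketch below shows this with the standard `Enumerable#each_cons` method; the `ngrams` helper is a hypothetical name for this example and is not the API of any gem listed above.

```ruby
# Return all contiguous n-grams of an array of items (words, characters, ...).
def ngrams(items, n)
  items.each_cons(n).to_a
end

words = 'natural language processing with ruby'.split

# Word bigrams: each element is an array of two neighbouring words.
ngrams(words, 2).each { |gram| puts gram.join(' ') }

# Character trigrams of a single word.
puts ngrams('ruby'.chars, 3).map(&:join).inspect
```

If the sequence is shorter than `n`, `each_cons` yields nothing and the result is an empty array.
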
214 | ### Syntactic Processing
215 |
216 | #### Constituency Parsing
217 |
218 | - [stanfordparser](https://rubygems.org/gems/stanfordparser) -
219 | Ruby based wrapper for the Stanford Parser.
220 | - [rley](https://github.com/famished-tiger/Rley) -
221 | Pure Ruby implementation of the [Earley](https://en.wikipedia.org/wiki/Earley_parser)
222 | Parsing Algorithm for Context-Free Constituency Grammars.
223 | - [rsyntaxtree](https://github.com/yohasebe/rsyntaxtree) -
224 | Visualization for syntactic trees in Ruby based on [RMagick](https://github.com/rmagick/rmagick).
225 | [dep: [ImageMagick](#imagemagick)]
226 |
227 | ### Semantic Analysis
228 |
229 | - [amatch](https://github.com/flori/amatch) -
230 | Set of five distance types between strings (including Levenshtein, Sellers, Jaro-Winkler, and 'pair distance').
231 | - [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein) -
232 | Calculates edit distance using the Damerau-Levenshtein algorithm.
233 | - [hotwater](https://github.com/colinsurprenant/hotwater) -
234 | Fast Ruby FFI string edit distance algorithms.
235 | - [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi) -
236 | Fast string edit distance computation, using the Damerau-Levenshtein algorithm.
237 | - [tf_idf](https://github.com/reddavis/TF-IDF) -
238 | Term Frequency / Inverse Document Frequency in pure Ruby.
239 | - [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity) -
240 | Calculate the similarity between texts using TF/IDF.
241 |
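Several of the libraries above compute edit distances between strings. As a rough illustration of what such a measure does, the following pure Ruby sketch implements the classic Levenshtein distance (insertions, deletions and substitutions, each with cost 1) using dynamic programming over two rows. The `levenshtein` method name is an assumption for this example; the gems above are typically faster and offer more variants, such as Damerau-Levenshtein with transpositions.

```ruby
# Levenshtein distance between two strings, computed row by row.
# previous[j] holds the distance between the processed prefix of `a`
# and the first j characters of `b`.
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  previous = (0..b_chars.size).to_a

  a_chars.each_with_index do |char_a, i|
    current = [i + 1]
    b_chars.each_with_index do |char_b, j|
      substitution_cost = char_a == char_b ? 0 : 1
      current << [
        previous[j + 1] + 1,             # deletion
        current[j] + 1,                  # insertion
        previous[j] + substitution_cost  # substitution or match
      ].min
    end
    previous = current
  end

  previous.last
end

puts levenshtein('kitten', 'sitting')
puts levenshtein('flaw', 'lawn')
```
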
242 | ### Pragmatical Analysis
243 | - [SentimentLib](https://github.com/nzaillian/sentiment_lib) -
244 | Simple extensible sentiment analysis gem.
245 |
246 | ## High Level Tasks
247 |
248 | ### Spelling and Error Correction
249 |
250 | - [gingerice](https://github.com/subosito/gingerice) -
251 | Spelling and Grammar corrections via the [Ginger](https://www.gingersoftware.com/) API.
252 | - [hunspell-i18n](https://github.com/romanbsd/hunspell) -
253 | Ruby bindings to the standard [Hunspell](https://hunspell.github.io/) Spell Checker.
254 | - [ffi-hunspell](https://github.com/postmodern/ffi-hunspell) -
255 | FFI based Ruby bindings for [Hunspell](https://hunspell.github.io/).
256 | - [hunspell](https://github.com/segabor/Hunspell) -
257 | Ruby bindings to [Hunspell](https://hunspell.github.io/) via Ruby C API.
258 |
259 | ### Text Alignment
260 |
261 | - [alignment](https://github.com/povilasjurcys/alignment) -
262 | Alignment routines for bilingual texts (Gale-Church implementation).
263 |
264 | ### Machine Translation
265 |
266 | - [google-api-client](https://github.com/googleapis/google-api-ruby-client) -
267 | Google API Ruby Client.
268 | - [microsoft_translator](https://github.com/ikayzo/microsoft_translator) -
269 | Ruby client for the Microsoft Translator API.
270 | - [termit](https://github.com/pawurb/termit) -
271 | Google Translate with speech synthesis in your terminal.
272 | - [zipf](https://github.com/pks/zipf) -
273 | Implementation of BLEU and other base algorithms.
274 |
275 | ### Sentiment Analysis
276 |
277 | - [stimmung](https://github.com/pachacamac/stimmung) -
278 | Semantic Polarity based on the
279 | [SentiWS](http://wortschatz.uni-leipzig.de/en/download) lexicon.
280 |
281 | ### Numbers, Dates, and Time Parsing
282 |
283 | - [chronic](https://github.com/mojombo/chronic) -
284 | Pure Ruby natural language date parser.
285 | - [chronic_between](https://github.com/jrobertson/chronic_between) -
286 | Simple Ruby natural language parser for date and time ranges.
287 | - [chronic_duration](https://github.com/henrypoydar/chronic_duration) -
288 | Pure Ruby parser for elapsed time.
289 | - [kronic](https://github.com/xaviershay/kronic) -
290 | Methods for parsing and formatting human readable dates.
291 | - [nickel](https://github.com/iainbeeston/nickel) -
292 | Extracts date, time, and message information from naturally worded text.
293 | - [tickle](https://github.com/yb66/tickle) -
294 | Parser for recurring and repeating events.
295 | - [numerizer](https://github.com/jduff/numerizer) -
296 | Ruby parser for English number expressions.
297 |
298 | ### Named Entity Recognition
299 |
300 | - [ruby-ner](https://github.com/mblongii/ruby-ner) -
301 | Named Entity Recognition with Stanford NER and Ruby.
302 | - [ruby-nlp](https://github.com/tiendung/ruby-nlp) -
303 | Ruby Binding for the Stanford POS Tagger and Named Entity Recognizer.
304 |
305 | ### Text-to-Speech-to-Text
306 |
307 | - [espeak-ruby](https://github.com/dejan/espeak-ruby) -
308 | Small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files.
309 | - [tts](https://github.com/c2h2/tts) -
310 | Text-to-Speech conversion using the Google translate service.
311 | - [att_speech](https://github.com/adhearsion/att_speech) -
312 | Ruby wrapper over the AT&T Speech API for speech to text.
313 | - [pocketsphinx-ruby](https://github.com/watsonbox/pocketsphinx-ruby) -
314 | Pocketsphinx bindings.
315 |
316 | ## Dialog Agents, Assistants, and Chatbots
317 |
318 | - [chatterbot](https://github.com/muffinista/chatterbot) -
319 | Straightforward ruby-based Twitter Bot Framework, using OAuth to authenticate.
320 | - [lita](https://github.com/litaio/lita) -
321 | Highly extensible chat operations bot framework with persistent storage on [Redis](https://redis.io/).
322 |
323 | ## Linguistic Resources
324 |
325 | - [rwordnet](https://github.com/doches/rwordnet) -
326 | Pure Ruby self contained API library for the [Princeton WordNet®](https://wordnet.princeton.edu/).
327 | - [wordnet](https://github.com/ged/ruby-wordnet/blob/master/README.rdoc) -
328 | Performance tuned bindings for the [Princeton WordNet®](https://wordnet.princeton.edu/).
329 |
330 | ## Machine Learning Libraries
331 |
332 | [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) Algorithms
333 | in pure Ruby or written in other programming languages with appropriate bindings
334 | for Ruby.
335 |
336 | For a more up-to-date list, please look at the [Awesome ML with Ruby][ml-with-ruby] list.
337 |
338 | - [rb-libsvm](https://github.com/febeling/rb-libsvm) -
339 | Support Vector Machines with Ruby.
340 | - [weka](https://github.com/paulgoetze/weka-jruby) -
341 | JRuby bindings for Weka, different ML algorithms implemented through Weka.
342 | - [decisiontree](https://github.com/igrigorik/decisiontree) -
343 | Decision Tree ID3 Algorithm in pure Ruby
344 | [[post](https://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/)].
345 | - [rtimbl](https://github.com/maspwr/rtimbl) -
346 | Memory based learners from the Timbl framework.
347 | - [classifier-reborn](https://github.com/jekyll/classifier-reborn) -
348 | General classifier module to allow Bayesian and other types of classifications.
349 | - [lda-ruby](https://github.com/ealdent/lda-ruby) -
350 | Ruby implementation of the [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
351 | (Latent Dirichlet Allocation) for automatic Topic Modelling and Document Clustering.
352 | - [liblinear-ruby-swig](https://github.com/tomz/liblinear-ruby-swig) -
353 | Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification).
354 | - [linnaeus](https://github.com/djcp/linnaeus) -
355 | Redis-backed Bayesian classifier.
356 | - [maxent_string_classifier](https://github.com/mccraigmccraig/maxent_string_classifier) -
357 | JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework.
358 | - [naive_bayes](https://github.com/reddavis/Naive-Bayes) -
359 | Simple Naive Bayes classifier.
360 | - [nbayes](https://github.com/oasic/nbayes) -
361 | Full-featured, Ruby implementation of Naive Bayes.
362 | - [omnicat](https://github.com/mustafaturan/omnicat) -
363 | Generalized rack framework for text classifications.
364 | - [omnicat-bayes](https://github.com/mustafaturan/omnicat-bayes) -
365 | Naive Bayes text classification implementation as an OmniCat classifier strategy.
366 | - [ruby-fann](https://github.com/tangledpath/ruby-fann) -
367 | Ruby bindings to the [Fast Artificial Neural Network Library (FANN)](http://leenissen.dk/fann/wp/).
368 | - [rblearn](https://github.com/himkt/rblearn) - Feature Extraction and Crossvalidation library.
369 |
370 | ## Data Visualization
371 |
372 | Please refer to the [Data Visualization](https://github.com/arbox/data-science-with-ruby#visualization)
373 | section on the [Data Science with Ruby][ds-with-ruby] list.
374 |
375 | ## Optical Character Recognition
376 |
377 | - [tesseract-ocr](https://github.com/meh/ruby-tesseract-ocr) -
378 | FFI based wrapper over the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract).
379 |
380 | ## Text Extraction
381 |
382 | - [yomu](https://github.com/yomurb/yomu) -
383 | Library for extracting text and metadata from files and documents
384 | using the [Apache Tika](https://tika.apache.org/) content analysis toolkit.
385 |
386 | ## Full Text Search, Information Retrieval, Indexing
387 |
388 | - [rsolr](https://github.com/rsolr/rsolr) -
389 | Ruby and Rails client library for [Apache Solr](http://lucene.apache.org/solr/).
390 | - [sunspot](https://github.com/sunspot/sunspot) -
391 | Rails centric client for [Apache Solr](http://lucene.apache.org/solr/).
392 | - [thinking-sphinx](https://github.com/pat/thinking-sphinx) -
393 | [Active Record](https://guides.rubyonrails.org/active_record_basics.html)
394 | plugin for using [Sphinx](http://sphinxsearch.com/) in (not only) Rails based projects.
395 | - [elasticsearch](https://github.com/elastic/elasticsearch-ruby/tree/master/elasticsearch) -
396 | Ruby client and API for [Elasticsearch](https://www.elastic.co/).
397 | - [elasticsearch-rails](https://github.com/elastic/elasticsearch-rails) -
398 | Ruby and Rails integrations for [Elasticsearch](https://www.elastic.co/).
399 | - [google-api-client](https://github.com/googleapis/google-api-ruby-client) -
400 | Ruby API library for [Google](https://developers.google.com/api-client-library/ruby/) services.
401 |
402 | ## Language Aware String Manipulation
403 |
404 | Libraries for language aware string manipulation, i.e. search, pattern matching,
405 | case conversion, transcoding, regular expressions which need information about
406 | the underlying language.
407 |
408 | - [fuzzy_match](https://github.com/seamusabshere/fuzzy_match) -
409 | Fuzzy string comparison with Distance measures and Regular Expression.
410 | - [fuzzy-string-match](https://github.com/kiyoka/fuzzy-string-match) -
411 | Fuzzy string matching library for Ruby.
412 | - [active_support](https://github.com/rails/rails/tree/master/activesupport/lib/active_support) -
413 | RoR `ActiveSupport` gem has various string extensions that can handle case.
414 | - [fuzzy_tools](https://github.com/brianhempel/fuzzy_tools) -
415 | Toolset for fuzzy searches in Ruby tuned for accuracy.
416 | - [u](http://disu.se/software/u-1.0/) -
417 | U extends Ruby’s Unicode support.
418 | - [unicode](https://github.com/blackwinter/unicode) -
419 | Unicode normalization library.
420 | - [CommonRegexRuby](https://github.com/talyssonoc/CommonRegexRuby) -
421 | Find many kinds of common information in a string.
422 | - [regexp-examples](https://github.com/tom-lord/regexp-examples) -
423 | Generate strings that match a given regular expression.
424 | - [verbal_expressions](https://github.com/ryan-endacott/verbal_expressions) -
425 | Make difficult regular expressions easy.
426 | - [translit_kit](https://github.com/AnalyzePlatypus/TranslitKit) -
427 | Transliterate Hebrew & Yiddish text into Latin characters.
428 | - [re2](https://github.com/mudge/re2) -
429 | High-speed regular expression library for text mining and text extraction.
430 | - [regex_sample](https://github.com/mochizukikotaro/regex_sample) -
431 | Sample string generation from a given regular expression.
432 | - [iuliia](https://github.com/adnikiforov/iuliia-rb) -
433 | Transliteration of Cyrillic to Latin in many possible ways (defined by the [reference implementation](https://github.com/nalgeon/iuliia)).
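
As a rough illustration of why such operations need knowledge about the writing system, the sketch below compares strings after Unicode normalization and case folding, using only Ruby's core `String` methods. The helper names `canonical` and `same_text?` are made up for this example and do not come from any of the libraries listed above.

```ruby
# Hypothetical helpers for illustration; not part of any listed gem.

# Bring a string into one canonical shape before comparing it.
def canonical(text)
  # NFC composes "e" + combining acute accent into the single code point "é".
  # downcase(:fold) applies Unicode case folding, e.g. "ß" becomes "ss".
  text.unicode_normalize(:nfc).downcase(:fold)
end

# Compare two strings while ignoring differences in Unicode composition
# and in letter case.
def same_text?(left, right)
  canonical(left) == canonical(right)
end

puts same_text?("Cafe\u0301", "CAFÉ") # combining accent vs. precomposed letter
puts same_text?("Straße", "STRASSE")  # German sharp s vs. its uppercase spelling
```

A plain `==` would treat both pairs as different, which is the gap the libraries in this section aim to close for more demanding cases such as fuzzy matching and transliteration.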
434 |
435 | ## Articles, Posts, Talks, and Presentations
436 |
437 | - 2019
438 | - _Extracting Text From Images Using Ruby_ by [aonemd](https://twitter.com/aonemd)
439 | [[post](https://aonemd.github.io/blog/extracting-text-from-images-using-ruby) |
440 | [code](https://gist.github.com/aonemd/7bb3c4760d9e47a9ce8e270198cb40a0)]
441 | - 2018
442 | - _Natural Language Processing and Tweet Sentiment Analysis_ by [Cassandra Corrales](https://twitter.com/casita305)
443 | [[post](https://medium.com/@cmcorrales3/natural-language-processing-and-tweet-sentiment-analysis-fa1edbb5ddd5)]
444 | - 2017
445 | - _The Google NLP API Meets Ruby_ by [Aja Hammerly](https://twitter.com/the_thagomizer)
446 | [[post](http://www.thagomizer.com/blog/2017/04/13/the-google-nlp-api-meets-ruby.html)]
447 | - _Syntax Isn't Everything: NLP For Rubyists_ by [Aja Hammerly](https://twitter.com/the_thagomizer)
448 | [[slides](http://www.thagomizer.com/files/NLP_RailsConf2017.pdf)]
449 | - _Scientific Computing on JRuby_ by [Prasun Anand](https://twitter.com/prasun_anand)
450 | [[slides](https://www.slideshare.net/PrasunAnand2/fosdem2017-scientific-computing-on-jruby) |
451 | [video](https://ftp.fau.de/fosdem/2017/K.4.201/ruby_scientific_computing_on_jruby.mp4) |
452 | [slides](https://www.slideshare.net/PrasunAnand2/scientific-computing-on-jruby) |
453 | [slides](https://www.slideshare.net/PrasunAnand2/scientific-computation-on-jruby)]
454 | - _Unicode Normalization in Ruby_ by [Starr Horne](https://twitter.com/starrhorne)
455 | [[post](https://blog.honeybadger.io/ruby_unicode_normalization/)]
456 | - 2016
457 | - _Quickly Create a Telegram Bot in Ruby_ by [Ardian Haxha](https://twitter.com/ArdianHaxha)
458 | [[tutorial](https://www.sitepoint.com/quickly-create-a-telegram-bot-in-ruby/)]
459 | - _Deep Learning: An Introduction for Ruby Developers_ by [Geoffrey Litt](https://twitter.com/geoffreylitt)
460 | [[slides](https://speakerdeck.com/geoffreylitt/deep-learning-an-introduction-for-ruby-developers)]
461 | - _How I made a pure-Ruby word2vec program more than 3x faster_ by [Kei Sawada](https://twitter.com/remore)
462 | [[slides](https://speakerdeck.com/remore/how-i-made-a-pure-ruby-word2vec-program-more-than-3x-faster)]
463 | - _Dōmo arigatō, Mr. Roboto: Machine Learning with Ruby_ by [Eric Weinstein](https://twitter.com/ericqweinstein)
464 | [[slides](https://speakerdeck.com/ericqweinstein/domo-arigato-mr-roboto-machine-learning-with-ruby) | [video](https://www.youtube.com/watch?v=T1nFQ49TyeA)]
465 | - 2015
466 | - _N-gram Analysis for Fun and Profit_ by [Jesus Castello](https://github.com/matugm)
467 | [[tutorial](https://www.rubyguides.com/2015/09/ngram-analysis-ruby/)]
468 | - _Machine Learning made simple with Ruby_ by [Lorenzo Masini](https://github.com/rugginoso)
469 | [[tutorial](https://www.leanpanda.com/blog/2015/08/24/machine-learning-automatic-classification/)]
470 | - _Using Ruby Machine Learning to Find Paris Hilton Quotes_ by [Rick Carlino](https://github.com/RickCarlino)
471 | [[tutorial](http://web.archive.org/web/20160414072324/http://datamelon.io/blog/2015/using-ruby-machine-learning-id-paris-hilton-quotes.html)]
472 | - _Exploring Natural Language Processing in Ruby_ by [Kevin Dias](https://github.com/diasks2)
473 | [[slides](https://www.slideshare.net/diasks2/exploring-natural-language-processing-in-ruby)]
476 | - _Practical Data Science in Ruby_ by Bobby Grayson
477 | [[slides](http://slides.com/bobbygrayson/p#/)]
478 | - 2014
479 | - _Natural Language Parsing with Ruby_ by [Glauco Custódio](https://github.com/glaucocustodio)
480 | [[tutorial](http://glaucocustodio.github.io/2014/11/10/natural-language-parsing-with-ruby/)]
481 | - _Demystifying Data Science: Analyzing Conference Talks with Rails and Ngrams_ by
482 | [Todd Schneider](https://github.com/toddwschneider)
483 | [[video](https://www.youtube.com/watch?v=2ZDCxwB29Bg) | [code](https://github.com/Genius/abstractogram)]
484 | - _Natural Language Processing with Ruby_ by [Konstantin Tennhard](https://github.com/t6d)
485 | [[video](https://www.youtube.com/watch?v=5u86qVh8r0M) | [video](https://www.youtube.com/watch?v=oFmy_QBQ5DU) |
486 | [video](https://www.youtube.com/watch?v=sPkeeWnsMn0) |
487 | [slides](http://euruko2013.org/speakers/presentations/natural_language_processing_with_ruby_and_opennlp-tennhard.pdf)]
488 | - 2013
489 | - _How to parse 'go' - Natural Language Processing in Ruby_ by
490 | [Tom Cartwright](https://twitter.com/tomcartwrightuk)
491 | [[slides](https://www.slideshare.net/TomCartwright/natual-language-processing-in-ruby) |
492 | [video](https://skillsmatter.com/skillscasts/4883-how-to-parse-go)]
493 | - _Natural Language Processing in Ruby_ by [Brandon Black](https://twitter.com/brandonmblack)
494 | [[slides](https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby) |
495 | [video](http://confreaks.tv/videos/railsconf2013-natural-language-processing-with-ruby)]
496 | - _Natural Language Processing with Ruby: n-grams_ by [Nathan Kleyn](https://github.com/nathankleyn)
497 | [[tutorial](https://www.sitepoint.com/natural-language-processing-ruby-n-grams/) |
498 | [code](https://github.com/nathankleyn/ruby-nlp)]
499 | - _Seeking Lovecraft, Part 1: An introduction to NLP and the Treat Gem_ by
500 | [Robert Qualls](https://github.com/rlqualls)
501 | [[tutorial](https://www.sitepoint.com/seeking-lovecraft-part-1-an-introduction-to-nlp-and-the-treat-gem/)]
502 | - 2012
503 | - _Machine Learning with Ruby, Part One_ by [Vasily Vasinov](https://twitter.com/vasinov)
504 | [[tutorial](http://www.vasinov.com/blog/machine-learning-with-ruby-part-one/)]
505 | - 2011
506 | - _Ruby one-liners_ by [Benoit Hamelin](https://twitter.com/benoithamelin)
507 | [[post](http://benoithamelin.tumblr.com/ruby1line)]
508 | - _Clustering in Ruby_ by [Colin Drake](https://twitter.com/colinfdrake)
509 | [[post](https://colindrake.me/post/k-means-clustering-in-ruby/)]
510 | - 2010
511 | - _bayes_motel – Bayesian classification for Ruby_ by [Mike Perham](https://twitter.com/mperham)
512 | [[post](http://www.mikeperham.com/2010/04/28/bayes_motel-bayesian-classification-for-ruby/)]
513 | - 2009
514 | - _Porting the UEA-Lite Stemmer to Ruby_ by [Jason Adams](https://twitter.com/ealdent)
515 | [[post](https://ealdent.wordpress.com/2009/07/16/porting-the-uea-lite-stemmer-to-ruby/)]
516 | - _NLP Resources for Ruby_ by [Jason Adams](https://twitter.com/ealdent)
517 | [[post](https://ealdent.wordpress.com/2009/09/13/nlp-resources-for-ruby/)]
518 | - 2008
519 | - _Support Vector Machines (SVM) in Ruby_ by [Ilya Grigorik](https://twitter.com/igrigorik)
520 | [[post](https://www.igvita.com/2008/01/07/support-vector-machines-svm-in-ruby/)]
521 | - _Practical text classification with Ruby_ by [Gleicon Moraes](https://twitter.com/gleicon)
522 | [[post](https://zenmachine.wordpress.com/practical-text-classification-with-ruby/) |
523 | [code](https://github.com/gleicon/zenmachine)]
524 | - 2007
525 | - _Decision Tree Learning in Ruby_ by [Ilya Grigorik](https://twitter.com/igrigorik)
526 | [[post](https://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/)]
527 | - 2006
528 | - _Speak My Language: Natural Language Processing With Ruby_ by [Michael Granger](https://deveiate.org/resume.html)
529 | [[slides](https://deveiate.org/misc/Speak-My-Language.pdf) |
530 | [write-up](http://blog.nicksieger.com/articles/2006/10/22/rubyconf-natural-language-generation-and-processing-in-ruby/) |
531 | [write-up](http://juixe.com/papers/RubyConf2006.pdf)]
532 |
533 | ## Projects and Code Examples
534 |
535 | - [Going the Distance](https://github.com/schneems/going_the_distance) -
536 | Implementations of various distance algorithms with example calculations.
537 | - [Named entity recognition with Stanford NER and Ruby](https://github.com/mblongii/ruby-ner) -
538 | NER Examples in Ruby and Java with some [explanations](https://web.archive.org/web/20120722225402/http://mblongii.com/2012/04/15/named-entity-recognition-with-stanford-ner-and-ruby/).
539 | - [Words Counted](http://rubywordcount.com/) -
540 | examples of customizable word statistics powered by
541 | [words_counted](https://github.com/abitdodgy/words_counted).
542 | - [RSyntaxTree](https://yohasebe.com/rsyntaxtree/) -
543 | Web based demonstration of the syntactic tree visualization.
544 |
545 | ## Books
546 |
547 | - [Miller, Rob](https://twitter.com/robmil/).
548 | _Text Processing with Ruby: Extract Value from the Data That Surrounds You._
549 | Pragmatic Programmers, 2015.
550 | [[link](https://www.amazon.com/Text-Processing-Ruby-Extract-Surrounds/dp/1680500708)]
551 | - [Watson, Mark](https://twitter.com/mark_l_watson).
552 | _Scripting Intelligence: Web 3.0 Information Gathering and Processing._
553 | APRESS, 2010.
554 | [[link](https://www.amazon.de/Scripting-Intelligence-Information-Gathering-Processing/dp/1430223510)]
555 | - [Watson, Mark](https://twitter.com/mark_l_watson).
556 | _Practical Semantic Web and Linked Data Applications._ Lulu, 2010.
557 | [[link](http://www.lulu.com/shop/mark-watson/practical-semantic-web-and-linked-data-applications-java-edition/paperback/product-10915016.html)]
558 |
559 | ## Community
560 |
561 | - [Reddit](https://www.reddit.com/r/LanguageTechnology/search?q=ruby&restrict_sr=on)
562 | - [Stack Overflow](https://stackoverflow.com/search?q=%5Bnlp%5D+and+%5Bruby%5D)
563 | - [Twitter](https://twitter.com/search?q=Ruby%20NLP%20%23ruby%20OR%20%23nlproc%20OR%20%23rubynlp%20OR%20%23nlp&src=typd&lang=en)
564 |
565 | ## Needs your Help!
566 |
567 | All projects in this section are really important for the community but need
568 | more attention. If you have spare time and dedication, please spend some hours
569 | on the code here.
570 |
571 | - [ferret](https://github.com/dbalmain/ferret) -
572 | Information Retrieval in C and Ruby.
573 | - [summarize](https://github.com/ssoper/summarize) -
574 | Ruby native wrapper for [Open Text Summarizer](https://github.com/neopunisher/Open-Text-Summarizer).
575 |
576 | ## Related Resources
577 |
578 | - [Neural Machine Translation Implementations](https://github.com/jonsafari/nmt-list)
579 | - [Awesome Ruby](https://github.com/markets/awesome-ruby#natural-language-processing) -
580 | Among other awesome items a short list of NLP related projects.
581 | - [Ruby NLP](https://github.com/diasks2/ruby-nlp) -
582 | State-of-the-art collection of Ruby libraries for NLP.
583 | - [Speech and Natural Language Processing](https://github.com/edobashira/speech-language-processing) -
584 | General List of NLP related resources (mostly not for Ruby programmers).
585 | - [Scientific Ruby](http://sciruby.com/) -
586 | Linear Algebra, Visualization and Scientific Computing for Ruby.
587 | - [iRuby](https://github.com/SciRuby/iruby) - IRuby kernel for Jupyter (formerly IPython).
588 | - [Awesome OCR](https://github.com/kba/awesome-ocr) -
589 | Multitude of OCR (Optical Character Recognition) resources.
590 | - [Awesome TensorFlow](https://github.com/jtoy/awesome-tensorflow) -
591 | Machine Learning with TensorFlow libraries.
592 | - [ImageMagick](https://imagemagick.org/index.php) -
593 | Software suite for creating, editing, and converting images.
594 |
595 | ## License
596 |
597 | [CC0](https://creativecommons.org/publicdomain/zero/1.0/) `Awesome NLP with Ruby` by [Andrei Beliankou](https://github.com/arbox) and
598 | [Contributors](https://github.com/arbox/nlp-with-ruby/graphs/contributors).
599 |
600 | To the extent possible under law, the person who associated CC0 with
601 | `Awesome NLP with Ruby` has waived all copyright and related or neighboring rights
602 | to `Awesome NLP with Ruby`.
603 |
604 | You should have received a copy of the CC0 legalcode along with this
605 | work. If not, see <https://creativecommons.org/publicdomain/zero/1.0/>.
606 |
607 |
608 | [ruby]: https://www.ruby-lang.org/en/
609 | [motivation]: https://github.com/arbox/nlp-with-ruby/blob/master/motivation.md
610 | [faq]: https://github.com/arbox/nlp-with-ruby/blob/master/FAQ.md
611 | [ds-with-ruby]: https://github.com/arbox/data-science-with-ruby
612 | [ml-with-ruby]: https://github.com/arbox/machine-learning-with-ruby
613 | [change-pr]: https://github.com/RichardLitt/knowledge/blob/master/github/amending-a-commit-guide.md
614 |
--------------------------------------------------------------------------------
/ruby.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arbox/nlp-with-ruby/44db8a3e4c74c2a6fdcc6bef6bd14ed021b7bd24/ruby.jpg
--------------------------------------------------------------------------------
/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arbox/nlp-with-ruby/44db8a3e4c74c2a6fdcc6bef6bd14ed021b7bd24/test.png
--------------------------------------------------------------------------------
/tutorials/ruby-stemmer.md:
--------------------------------------------------------------------------------
1 | # ruby-stemmer
2 |
--------------------------------------------------------------------------------
/tutorials/template.md:
--------------------------------------------------------------------------------
1 | # This is a template
2 |
--------------------------------------------------------------------------------
/tutorials/tokenizer.md:
--------------------------------------------------------------------------------
1 | # Tokenizer
2 |
3 | ## INSTALLATION
4 | `Tokenizer` is provided as a `.gem` package. Simply install it via
5 | [RubyGems](http://rubygems.org/gems/tokenizer).
6 |
7 | To install `tokenizer` issue the following command:
8 |
9 | ``` shell
10 | $ gem install tokenizer
11 | ```
12 |
13 | If you want to do a system-wide installation, do this as root
14 | (possibly using `sudo`).
15 |
16 | Alternatively, use your Gemfile for dependency management.
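
A minimal Gemfile for this could look like the fragment below. It is a sketch that assumes the default RubyGems source and no version constraint; pin a version if your project needs one.

```ruby
# Gemfile (assumed layout; adjust source and version to your project)
source 'https://rubygems.org'

gem 'tokenizer'
```

Running `bundle install` in the same directory then installs the gem together with the project's other dependencies.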
17 |
18 | ## SYNOPSIS
19 |
20 | You can use `Tokenizer` in two ways.
21 | * As a command line tool:
22 |
23 | ``` shell
24 | $ echo 'Hi, ich gehe in die Schule!' | tokenize
25 | ```
26 |
27 | * As a library for embedded tokenization:
28 |
29 | ``` ruby
30 | > require 'tokenizer'
31 | > de_tokenizer = Tokenizer::WhitespaceTokenizer.new
32 | > de_tokenizer.tokenize('Ich gehe in die Schule!')
33 | > => ["Ich", "gehe", "in", "die", "Schule", "!"]
34 | ```
35 |
36 | * Customizable `PRE` and `POST` list
37 |
38 | ``` ruby
39 | > require 'tokenizer'
40 | > de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, { post: Tokenizer::Tokenizer::POST + ['|'] })
41 | > de_tokenizer.tokenize('Ich gehe|in die Schule!')
42 | > => ["Ich", "gehe", "|in", "die", "Schule", "!"]
43 | ```
44 |
45 | See documentation in the `Tokenizer::WhitespaceTokenizer` class for details
46 | on particular methods.
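
To make the idea behind the `POST` list more tangible, here is a small stand-alone sketch of whitespace tokenization that splits trailing punctuation into separate tokens. It is not the gem's implementation: `naive_tokenize` and `POST_PUNCTUATION` are hypothetical names invented for this illustration, and the sketch only covers the simple case of punctuation at the end of a word.

```ruby
# Hypothetical illustration; the tokenizer gem works differently in detail.
POST_PUNCTUATION = ['!', '?', ',', ':', ';', '.'].freeze

def naive_tokenize(text, post: POST_PUNCTUATION)
  # Split on whitespace first, then look at each chunk separately.
  text.split.flat_map do |chunk|
    trailing = []
    # Peel punctuation marks off the end of the chunk one by one,
    # keeping them in their original order.
    while chunk.length > 1 && post.include?(chunk[-1])
      trailing.unshift(chunk[-1])
      chunk = chunk[0...-1]
    end
    [chunk, *trailing]
  end
end

p naive_tokenize('Ich gehe in die Schule!')
```

Passing a different array as `post:` changes which characters are split off, which is the same kind of customization the `post:` option above offers.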
47 |
--------------------------------------------------------------------------------
/tutorials/tutorial_template.md:
--------------------------------------------------------------------------------
1 | # Tutorial
2 |
3 | ![Implemented in pure Ruby][ruby]
4 |
5 | ## Motivation
6 | What tasks can we accomplish with this tool?
7 |
8 | ## Installation
9 | How can we get this tool?
10 |
11 | ## Examples
12 |
13 |
14 | [ruby]: https://img.shields.io/badge/L%3A-Ruby-red.svg
15 | [jruby]: https://img.shields.io/badge/L%3A-JRuby-yellowgreen.svg
16 | [java]: https://img.shields.io/badge/L%3A-Java-yellow.svg
17 | [c]: https://img.shields.io/badge/L%3A-C-brightgreen.svg
18 | [cpp]: https://img.shields.io/badge/L%3A-C%2B%2B-green.svg
19 | [tutorial-present]: https://img.shields.io/badge/Tutorial-%E2%9C%85-green.svg
20 | [tutorial-missing]: https://img.shields.io/badge/Tutorial-%E2%9C%98-lightgrey.svg
21 |
--------------------------------------------------------------------------------
/tutorials/weka-jruby.md:
--------------------------------------------------------------------------------
1 | # Weka with JRuby
2 |
--------------------------------------------------------------------------------