├── .gitignore ├── .ruby-version ├── .travis.yml ├── FAQ.md ├── Gemfile ├── LICENSE ├── Rakefile ├── contributing.md ├── header.png ├── inbox.md ├── motivation.md ├── pull_request_template.md ├── readme.md ├── ruby.jpg ├── test.png └── tutorials ├── ruby-stemmer.md ├── template.md ├── tokenizer.md ├── tutorial_template.md └── weka-jruby.md /.gitignore: -------------------------------------------------------------------------------- 1 | Gemfile.lock 2 | *json 3 | node_modules 4 | _site 5 | _blog 6 | .sass-cache 7 | .jekyll-metadata 8 | scripts 9 | www 10 | -------------------------------------------------------------------------------- /.ruby-version: -------------------------------------------------------------------------------- 1 | 2.6.6 2 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: ruby 2 | bundler_args: --without local 3 | rvm: 4 | - 2.2 5 | script: "awesome_bot readme.md motivation.md FAQ.md --allow-dupe" 6 | -------------------------------------------------------------------------------- /FAQ.md: -------------------------------------------------------------------------------- 1 | # Frequently (not yet) Asked Questions 2 | 3 | ## What is the Awesome Ruby NLP list? 4 | 5 | This list is the _first systematic_ attempt to list NLP and CL related 6 | resources for Ruby. It's based on earlier attempts, 7 | e.g. https://github.com/diasks2/ruby-nlp. We strive to provide a list of only 8 | working, high quality libraries. Read [why](motivation.md) this list is vital for 9 | the Ruby community. 10 | 11 | ## Why use Ruby for NLP? 12 | 13 | Everybody uses Python! Nobody hires Ruby developers for NLP tasks. 14 | 15 | To avoid a long discussion we can simply postulate: Ruby and Python are great 16 | programming languages, both very appealing to the community, but with different 17 | histories.
Everything written in Ruby could have been written in Python. 18 | 19 | Nevertheless, we all have personal preferences, like dogs over cats or 20 | tea over coffee. That's why you can choose the language which matches 21 | your mindset instead of bending your mind to comply with a programming 22 | language. 23 | 24 | Take Ruby if you're happy with it. Use Python if you like it more. Do whatever 25 | you want and pay for your decisions! 26 | 27 | And if you still hesitate, look at this great 28 | [talk](https://www.youtube.com/watch?v=0D3KfnbTdWw) by Jim Weirich. 29 | 30 | ## Wait ... but Ruby is so slow? 31 | 32 | Ruby **IS** comparable in terms of processing speed with other high level 33 | scripting languages like Lua, Perl, Python etc. 34 | 35 | Please look at this comparison: 36 | https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/ruby.html 37 | 38 | ## Hm ... but would I find suitable libraries? 39 | 40 | Python has more! Eventually... 41 | 42 | Please look at the current [list](https://github.com/arbox/nlp-with-ruby): 43 | Ruby is equipped with all important libraries. 44 | 45 | ## Can I write an NLP application on Google's scale with Ruby? 46 | 47 | The answer is simple: "NO". Not in pure Ruby. But you can be very 48 | efficient and use Ruby bindings for Java, C and C++ based libraries. 49 | And sometimes buying newer hardware can be cheaper than writing everything in 50 | C++. It's definitely your choice! 51 | 52 | ## How do you name list items? 53 | 54 | Every library list item is named after the Ruby library. The name is 55 | the exact wording used in the `gem install lib` statement (or `gem 'lib'` in your 56 | `Gemfile`) to facilitate search and memorization. That's why the appropriate item 57 | is called `treat` and not `Treat`.
58 | 59 | 60 | [motivation]: https://github.com/arbox/nlp-with-ruby/blob/master/motivation.md 61 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | # frozen_string_literal: true 2 | source 'https://rubygems.org' 3 | 4 | gem 'awesome_bot', '~> 1.17' 5 | gem 'rake', '~> 12.3' 6 | 7 | group :local do 8 | gem 'jekyll', '~> 3.4' 9 | end 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | CC0 1.0 Universal 2 | 3 | Statement of Purpose 4 | 5 | The laws of most jurisdictions throughout the world automatically confer 6 | exclusive Copyright and Related Rights (defined below) upon the creator and 7 | subsequent owner(s) (each and all, an "owner") of an original work of 8 | authorship and/or a database (each, a "Work"). 9 | 10 | Certain owners wish to permanently relinquish those rights to a Work for the 11 | purpose of contributing to a commons of creative, cultural and scientific 12 | works ("Commons") that the public can reliably and without fear of later 13 | claims of infringement build upon, modify, incorporate in other works, reuse 14 | and redistribute as freely as possible in any form whatsoever and for any 15 | purposes, including without limitation commercial purposes. These owners may 16 | contribute to the Commons to promote the ideal of a free culture and the 17 | further production of creative, cultural and scientific works, or to gain 18 | reputation or greater distribution for their Work in part through the use and 19 | efforts of others. 
20 | 21 | For these and/or other purposes and motivations, and without any expectation 22 | of additional consideration or compensation, the person associating CC0 with a 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 25 | and publicly distribute the Work under its terms, with knowledge of his or her 26 | Copyright and Related Rights in the Work and the meaning and intended legal 27 | effect of CC0 on those rights. 28 | 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be 30 | protected by copyright and related or neighboring rights ("Copyright and 31 | Related Rights"). Copyright and Related Rights include, but are not limited 32 | to, the following: 33 | 34 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 35 | and translate a Work; 36 | 37 | ii. moral rights retained by the original author(s) and/or performer(s); 38 | 39 | iii. publicity and privacy rights pertaining to a person's image or likeness 40 | depicted in a Work; 41 | 42 | iv. rights protecting against unfair competition in regards to a Work, 43 | subject to the limitations in paragraph 4(a), below; 44 | 45 | v. rights protecting the extraction, dissemination, use and reuse of data in 46 | a Work; 47 | 48 | vi. database rights (such as those arising under Directive 96/9/EC of the 49 | European Parliament and of the Council of 11 March 1996 on the legal 50 | protection of databases, and under any national implementation thereof, 51 | including any amended or successor version of such directive); and 52 | 53 | vii. other similar, equivalent or corresponding rights throughout the world 54 | based on applicable law or treaty, and any national implementations thereof. 55 | 56 | 2. Waiver. 
To the greatest extent permitted by, but not in contravention of, 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 59 | and Related Rights and associated claims and causes of action, whether now 60 | known or unknown (including existing as well as future claims and causes of 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 62 | duration provided by applicable law or treaty (including future time 63 | extensions), (iii) in any current or future medium and for any number of 64 | copies, and (iv) for any purpose whatsoever, including without limitation 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 66 | the Waiver for the benefit of each member of the public at large and to the 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 68 | shall not be subject to revocation, rescission, cancellation, termination, or 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 70 | by the public as contemplated by Affirmer's express Statement of Purpose. 71 | 72 | 3. Public License Fallback. Should any part of the Waiver for any reason be 73 | judged legally invalid or ineffective under applicable law, then the Waiver 74 | shall be preserved to the maximum extent permitted taking into account 75 | Affirmer's express Statement of Purpose. 
In addition, to the extent the Waiver 76 | is so judged Affirmer hereby grants to each affected person a royalty-free, 77 | non transferable, non sublicensable, non exclusive, irrevocable and 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration 80 | provided by applicable law or treaty (including future time extensions), (iii) 81 | in any current or future medium and for any number of copies, and (iv) for any 82 | purpose whatsoever, including without limitation commercial, advertising or 83 | promotional purposes (the "License"). The License shall be deemed effective as 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 85 | License for any reason be judged legally invalid or ineffective under 86 | applicable law, such partial invalidity or ineffectiveness shall not 87 | invalidate the remainder of the License, and in such case Affirmer hereby 88 | affirms that he or she will not (i) exercise any of his or her remaining 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims 90 | and causes of action with respect to the Work, in either case contrary to 91 | Affirmer's express Statement of Purpose. 92 | 93 | 4. Limitations and Disclaimers. 94 | 95 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 96 | surrendered, licensed or otherwise affected by this document. 97 | 98 | b. Affirmer offers the Work as-is and makes no representations or warranties 99 | of any kind concerning the Work, express, implied, statutory or otherwise, 100 | including without limitation warranties of title, merchantability, fitness 101 | for a particular purpose, non infringement, or the absence of latent or 102 | other defects, accuracy, or the present or absence of errors, whether or not 103 | discoverable, all to the greatest extent permissible under applicable law. 104 | 105 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 106 | that may apply to the Work or any use thereof, including without limitation 107 | any person's Copyright and Related Rights in the Work. Further, Affirmer 108 | disclaims responsibility for obtaining any necessary consents, permissions 109 | or other rights required for any use of the Work. 110 | 111 | d. Affirmer understands and acknowledges that Creative Commons is not a 112 | party to this document and has no duty or obligation with respect to this 113 | CC0 or use of the Work. 114 | 115 | For more information, please see 116 | 117 | -------------------------------------------------------------------------------- /Rakefile: -------------------------------------------------------------------------------- 1 | require 'yaml' 2 | require 'rake/clean' 3 | CLEAN.include '*.json' 4 | 5 | namespace :test do 6 | task :links2 do 7 | require 'awesome_bot' 8 | content = File.read('README.md') 9 | result = AwesomeBot.check(content) 10 | puts result.success(nil) ? ':)' : ':(' 11 | end 12 | 13 | CMD_STRING = YAML.load_file('.travis.yml')['script'] 14 | desc 'Test links with AwesomeBot' 15 | task :links do 16 | system(CMD_STRING) 17 | end 18 | end 19 | 20 | desc 'Regenerate the TOC.' 21 | task :toc do 22 | `node_modules/markdown-toc/cli.js -i README.md` 23 | end 24 | 25 | desc 'Create the www sources.' 
26 | task :webgen do 27 | DOCS_DIR = 'www/_mkdocs_source/' 28 | SRC_FILES = ['README.md', 'FAQ.md', 'motivation.md'] 29 | SRC_FILES.each do |name| 30 | nodoc(name) 31 | end 32 | end 33 | 34 | def nodoc(file) 35 | lines = File.readlines(file) 36 | 37 | if file == 'README.md' 38 | file = 'index.md' 39 | end 40 | File.open(DOCS_DIR + file, 'w') do |file| 41 | lines.each do |line| 42 | unless line =~ / 40 | ## Contents 41 | 42 | 43 | 44 | - [:sparkles: Tutorials](#sparkles-tutorials) 45 | - [NLP Pipeline Subtasks](#nlp-pipeline-subtasks) 46 | * [Pipeline Generation](#pipeline-generation) 47 | * [Multipurpose Engines](#multipurpose-engines) 48 | + [On-line APIs](#on-line-apis) 49 | * [Language Identification](#language-identification) 50 | * [Segmentation](#segmentation) 51 | * [Lexical Processing](#lexical-processing) 52 | + [Stemming](#stemming) 53 | + [Lemmatization](#lemmatization) 54 | + [Lexical Statistics: Counting Types and Tokens](#lexical-statistics-counting-types-and-tokens) 55 | + [Filtering Stop Words](#filtering-stop-words) 56 | * [Phrasal Level Processing](#phrasal-level-processing) 57 | * [Syntactic Processing](#syntactic-processing) 58 | + [Constituency Parsing](#constituency-parsing) 59 | * [Semantic Analysis](#semantic-analysis) 60 | * [Pragmatical Analysis](#pragmatical-analysis) 61 | - [High Level Tasks](#high-level-tasks) 62 | * [Spelling and Error Correction](#spelling-and-error-correction) 63 | * [Text Alignment](#text-alignment) 64 | * [Machine Translation](#machine-translation) 65 | * [Sentiment Analysis](#sentiment-analysis) 66 | * [Numbers, Dates, and Time Parsing](#numbers-dates-and-time-parsing) 67 | * [Named Entity Recognition](#named-entity-recognition) 68 | * [Text-to-Speech-to-Text](#text-to-speech-to-text) 69 | - [Dialog Agents, Assistants, and Chatbots](#dialog-agents-assistants-and-chatbots) 70 | - [Linguistic Resources](#linguistic-resources) 71 | - [Machine Learning Libraries](#machine-learning-libraries) 72 | - [Data 
Visualization](#data-visualization) 73 | - [Optical Character Recognition](#optical-character-recognition) 74 | - [Text Extraction](#text-extraction) 75 | - [Full Text Search, Information Retrieval, Indexing](#full-text-search-information-retrieval-indexing) 76 | - [Language Aware String Manipulation](#language-aware-string-manipulation) 77 | - [Articles, Posts, Talks, and Presentations](#articles-posts-talks-and-presentations) 78 | - [Projects and Code Examples](#projects-and-code-examples) 79 | - [Books](#books) 80 | - [Community](#community) 81 | - [Needs your Help!](#needs-your-help) 82 | - [Related Resources](#related-resources) 83 | - [License](#license) 84 | 85 | 86 | 87 | 88 | 89 | ## :sparkles: Tutorials 90 | 91 | Please help us to fill out this section! :smiley: 92 | 93 | ## NLP Pipeline Subtasks 94 | 95 | An NLP Pipeline starts with a plain text. 96 | 97 | ### Pipeline Generation 98 | 99 | - [composable_operations](https://github.com/t6d/composable_operations) - 100 | Definition framework for operation pipelines. 101 | - [ruby-spark](https://github.com/ondra-m/ruby-spark) - 102 | Spark bindings with an easy to understand DSL. 103 | - [phobos](https://github.com/phobos/phobos) - 104 | Simplified Ruby Client for [Apache Kafka](https://kafka.apache.org/). 105 | - [parallel](https://github.com/grosser/parallel) - 106 | Supervisor for parallel execution on multiple CPUs or in many threads. 107 | - [pwrake](https://github.com/masa16/pwrake) - 108 | Rake extensions to run local and remote tasks in parallel. 109 | 110 | ### Multipurpose Engines 111 | 112 | - [open-nlp](https://github.com/louismullie/open-nlp) - 113 | Ruby Bindings for the [OpenNLP](https://opennlp.apache.org/) Toolkit. 114 | - [stanford-core-nlp](https://github.com/louismullie/stanford-core-nlp) - 115 | Ruby Bindings for the Stanford [CoreNLP](https://github.com/stanfordnlp/CoreNLP) tools. 
116 | - [treat](https://github.com/louismullie/treat) - 117 | Natural Language Processing framework for Ruby (like [NLTK](http://www.nltk.org/) for Python). 118 | - [nlp_toolz](https://github.com/LeFnord/nlp_toolz) - 119 | Wrapper over some [OpenNLP](https://opennlp.apache.org/) classes and 120 | the original [Berkeley Parser](https://github.com/slavpetrov/berkeleyparser). 121 | - [open_nlp](https://github.com/hck/open_nlp) - 122 | JRuby Bindings for the [OpenNLP](https://opennlp.apache.org/) Toolkit. 123 | - [ruby-spacy](https://github.com/yohasebe/ruby-spacy) — 124 | Wrapper module for spaCy NLP library via [PyCall](https://github.com/mrkn/pycall.rb). 125 | 126 | #### On-line APIs 127 | 128 | - [alchemyapi_ruby](https://github.com/alchemyapi/alchemyapi_ruby) - 129 | Legacy Ruby SDK for AlchemyAPI/Bluemix. 130 | - [wit-ruby](https://github.com/wit-ai/wit-ruby) - 131 | Ruby client library for the [Wit.ai](https://wit.ai/) Language Understanding Platform. 132 | - [wlapi](https://github.com/arbox/wlapi) - Ruby client library for 133 | [Wortschatz Leipzig](http://wortschatz.uni-leipzig.de/de) web services. 134 | - [monkeylearn-ruby](https://github.com/monkeylearn/monkeylearn-ruby) - Sentiment 135 | Analysis, Topic Modelling, Language Detection, Named Entity Recognition via 136 | a Ruby based Web API client. 137 | - [google-cloud-language](https://github.com/googleapis/google-cloud-ruby/tree/master/google-cloud-language) - 138 | Google's Natural Language service API for Ruby. 139 | 140 | ### Language Identification 141 | 142 | Language Identification is one of the first crucial steps in every NLP Pipeline. 143 | 144 | - [scylla](https://github.com/hashwin/scylla) - 145 | Language Categorization and Identification. 146 | 147 | ### Segmentation 148 | 149 | Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation. 150 | 151 | - [tokenizer](https://github.com/arbox/tokenizer) - 152 | Simple multilingual tokenizer. 
[[tutorial](tutorials/tokenizer.md)] 154 | - [pragmatic_tokenizer](https://github.com/diasks2/pragmatic_tokenizer) - 155 | Multilingual tokenizer to split a string into tokens. 156 | - [nlp-pure](https://github.com/parhamr/nlp-pure) - 157 | Natural language processing algorithms implemented in pure Ruby with minimal dependencies. 158 | - [textoken](https://github.com/manorie/textoken) - 159 | Simple and customizable text tokenization library. 160 | - [pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter) - 161 | Word Boundary Disambiguation with many cookies. 162 | - [punkt-segmenter](https://github.com/lfcipriani/punkt-segmenter) - 163 | Pure Ruby implementation of the Punkt Segmenter. 164 | - [tactful_tokenizer](https://github.com/zencephalon/Tactful_Tokenizer) - 165 | RegExp based tokenizer for different languages. 166 | - [scalpel](https://github.com/louismullie/scalpel) - 167 | Sentence Boundary Disambiguation tool. 168 | 169 | ### Lexical Processing 170 | 171 | #### Stemming 172 | 173 | Stemming is the term used in information retrieval to describe the process of 174 | reducing word forms to some base representation. Stemming should be distinguished 175 | from [Lemmatization](#lemmatization) since `stems` do not necessarily have a 176 | linguistic motivation. 177 | 178 | - [ruby-stemmer](https://github.com/aurelian/ruby-stemmer) - 179 | Ruby-Stemmer exposes the SnowBall API to Ruby. 180 | - [uea-stemmer](https://github.com/ealdent/uea-stemmer) - 181 | Conservative stemmer for search and indexing. 182 | 183 | #### Lemmatization 184 | 185 | Lemmatization is the process of finding the base form of a word. Lemmas 186 | are often collected in dictionaries. 187 | 188 | - [lemmatizer](https://github.com/yohasebe/lemmatizer) - 189 | WordNet based Lemmatizer for English texts.
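To make the distinction between stemming and lemmatization concrete, here is a minimal pure Ruby sketch. It does not use any of the gems listed above, and it is not how those gems work internally. The suffix list, the mini dictionary, and the method names `naive_stem` and `naive_lemma` are hypothetical and exist only for illustration.

```ruby
# Toy illustration only: nothing here reflects the API of a listed gem.

# Hypothetical suffix list for a naive stemmer. Real stemmers such as
# Snowball use far richer, language specific rule sets.
SUFFIXES = %w[ies ing ed es s].freeze

# Hypothetical mini dictionary mapping word forms to their lemmas.
LEMMAS = {
  'studies' => 'study',
  'better'  => 'good',
  'went'    => 'go'
}.freeze

# Strip the first matching suffix, keeping at least three characters.
# The result is a base representation and need not be a real word.
def naive_stem(word)
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length > s.length + 2 }
  suffix ? word[0...-suffix.length] : word
end

# Look the word form up in the dictionary and fall back to the form itself.
def naive_lemma(word)
  LEMMAS.fetch(word, word)
end

%w[studies better went].each do |word|
  puts "#{word}: stem=#{naive_stem(word)} lemma=#{naive_lemma(word)}"
end
```

With these toy rules `studies` is stemmed to `stud`, which is not a dictionary word, while its lemma is `study`. Irregular forms such as `went` pass through the stemmer unchanged, but the dictionary lookup maps them to `go`.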
190 | 191 | #### Lexical Statistics: Counting Types and Tokens 192 | 193 | - [wc](https://github.com/thesp0nge/wc) - 194 | Facilities to count word occurrences in a text. 195 | - [word_count](https://github.com/AtelierConvivialite/word_count) - 196 | Word counter for `String` and `Hash` objects. 197 | - [words_counted](https://github.com/abitdodgy/words_counted) - 198 | Pure Ruby library counting word statistics with different custom options. 199 | 200 | #### Filtering Stop Words 201 | 202 | - [stopwords-filter](https://github.com/brenes/stopwords-filter) - Filter and 203 | Stop Word Lexicon based on the SnowBall lemmatizer. 204 | 205 | ### Phrasal Level Processing 206 | 207 | - [n_gram](https://github.com/reddavis/N-Gram) - 208 | N-Gram generator. 209 | - [ruby-ngram](https://github.com/tkellen/ruby-ngram) - 210 | Break words and phrases into ngrams. 211 | - [raingrams](https://github.com/postmodern/raingrams) - 212 | Flexible and general-purpose ngrams library written in pure Ruby. 213 | 214 | ### Syntactic Processing 215 | 216 | #### Constituency Parsing 217 | 218 | - [stanfordparser](https://rubygems.org/gems/stanfordparser) - 219 | Ruby based wrapper for the Stanford Parser. 220 | - [rley](https://github.com/famished-tiger/Rley) - 221 | Pure Ruby implementation of the [Earley](https://en.wikipedia.org/wiki/Earley_parser) 222 | Parsing Algorithm for Context-Free Constituency Grammars. 223 | - [rsyntaxtree](https://github.com/yohasebe/rsyntaxtree) - 224 | Visualization for syntactic trees in Ruby based on [RMagick](https://github.com/rmagick/rmagick). 225 | [dep: [ImageMagick](#imagemagick)] 226 | 227 | ### Semantic Analysis 228 | 229 | - [amatch](https://github.com/flori/amatch) - 230 | Set of five distance types between strings (including Levenshtein, Sellers, Jaro-Winkler, 'pair distance'). 231 | - [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein) - 232 | Calculates edit distance using the Damerau-Levenshtein algorithm. 
233 | - [hotwater](https://github.com/colinsurprenant/hotwater) - 234 | Fast Ruby FFI string edit distance algorithms. 235 | - [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi) - 236 | Fast string edit distance computation, using the Damerau-Levenshtein algorithm. 237 | - [tf_idf](https://github.com/reddavis/TF-IDF) - 238 | Term Frequency / Inverse Document Frequency in pure Ruby. 239 | - [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity) - 240 | Calculate the similarity between texts using TF/IDF. 241 | 242 | ### Pragmatical Analysis 243 | - [SentimentLib](https://github.com/nzaillian/sentiment_lib) - 244 | Simple extensible sentiment analysis gem. 245 | 246 | ## High Level Tasks 247 | 248 | ### Spelling and Error Correction 249 | 250 | - [gingerice](https://github.com/subosito/gingerice) - 251 | Spelling and Grammar corrections via the [Ginger](https://www.gingersoftware.com/) API. 252 | - [hunspell-i18n](https://github.com/romanbsd/hunspell) - 253 | Ruby bindings to the standard [Hunspell](https://hunspell.github.io/) Spell Checker. 254 | - [ffi-hunspell](https://github.com/postmodern/ffi-hunspell) - 255 | FFI based Ruby bindings for [Hunspell](https://hunspell.github.io/). 256 | - [hunspell](https://github.com/segabor/Hunspell) - 257 | Ruby bindings to [Hunspell](https://hunspell.github.io/) via Ruby C API. 258 | 259 | ### Text Alignment 260 | 261 | - [alignment](https://github.com/povilasjurcys/alignment) - 262 | Alignment routines for bilingual texts (Gale-Church implementation). 263 | 264 | ### Machine Translation 265 | 266 | - [google-api-client](https://github.com/googleapis/google-api-ruby-client) - 267 | Google API Ruby Client. 268 | - [microsoft_translator](https://github.com/ikayzo/microsoft_translator) - 269 | Ruby client for the microsoft translator API. 270 | - [termit](https://github.com/pawurb/termit) - 271 | Google Translate with speech synthesis in your terminal. 
272 | - [zipf](https://github.com/pks/zipf) - 273 | Implementation of BLEU and other base algorithms. 274 | 275 | ### Sentiment Analysis 276 | 277 | - [stimmung](https://github.com/pachacamac/stimmung) - 278 | Semantic Polarity based on the 279 | [SentiWS](http://wortschatz.uni-leipzig.de/en/download) lexicon. 280 | 281 | ### Numbers, Dates, and Time Parsing 282 | 283 | - [chronic](https://github.com/mojombo/chronic) - 284 | Pure Ruby natural language date parser. 285 | - [chronic_between](https://github.com/jrobertson/chronic_between) - 286 | Simple Ruby natural language parser for date and time ranges. 287 | - [chronic_duration](https://github.com/henrypoydar/chronic_duration) - 288 | Pure Ruby parser for elapsed time. 289 | - [kronic](https://github.com/xaviershay/kronic) - 290 | Methods for parsing and formatting human readable dates. 291 | - [nickel](https://github.com/iainbeeston/nickel) - 292 | Extracts date, time, and message information from naturally worded text. 293 | - [tickle](https://github.com/yb66/tickle) - 294 | Parser for recurring and repeating events. 295 | - [numerizer](https://github.com/jduff/numerizer) - 296 | Ruby parser for English number expressions. 297 | 298 | ### Named Entity Recognition 299 | 300 | - [ruby-ner](https://github.com/mblongii/ruby-ner) - 301 | Named Entity Recognition with Stanford NER and Ruby. 302 | - [ruby-nlp](https://github.com/tiendung/ruby-nlp) - 303 | Ruby Binding for the Stanford POS Tagger and Named Entity Recognizer. 304 | 305 | ### Text-to-Speech-to-Text 306 | 307 | - [espeak-ruby](https://github.com/dejan/espeak-ruby) - 308 | Small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files. 309 | - [tts](https://github.com/c2h2/tts) - 310 | Text-to-Speech conversion using the Google Translate service. 311 | - [att_speech](https://github.com/adhearsion/att_speech) - 312 | Ruby wrapper over the AT&T Speech API for speech to text.
313 | - [pocketsphinx-ruby](https://github.com/watsonbox/pocketsphinx-ruby) - 314 | Pocketsphinx bindings. 315 | 316 | ## Dialog Agents, Assistants, and Chatbots 317 | 318 | - [chatterbot](https://github.com/muffinista/chatterbot) - 319 | Straightforward Ruby-based Twitter Bot Framework, using OAuth to authenticate. 320 | - [lita](https://github.com/litaio/lita) - 321 | Highly extensible chat operation bot framework with persistent storage on [Redis](https://redis.io/). 322 | 323 | ## Linguistic Resources 324 | 325 | - [rwordnet](https://github.com/doches/rwordnet) - 326 | Pure Ruby self contained API library for the [Princeton WordNet®](https://wordnet.princeton.edu/). 327 | - [wordnet](https://github.com/ged/ruby-wordnet/blob/master/README.rdoc) - 328 | Performance tuned bindings for the [Princeton WordNet®](https://wordnet.princeton.edu/). 329 | 330 | ## Machine Learning Libraries 331 | 332 | [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) Algorithms 333 | in pure Ruby or written in other programming languages with appropriate bindings 334 | for Ruby. 335 | 336 | For a more up-to-date list, please look at the [Awesome ML with Ruby][ml-with-ruby] list. 337 | 338 | - [rb-libsvm](https://github.com/febeling/rb-libsvm) - 339 | Support Vector Machines with Ruby. 340 | - [weka](https://github.com/paulgoetze/weka-jruby) - 341 | JRuby bindings for Weka, different ML algorithms implemented through Weka. 342 | - [decisiontree](https://github.com/igrigorik/decisiontree) - 343 | Decision Tree ID3 Algorithm in pure Ruby 344 | [[post](https://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/)]. 345 | - [rtimbl](https://github.com/maspwr/rtimbl) - 346 | Memory based learners from the Timbl framework. 347 | - [classifier-reborn](https://github.com/jekyll/classifier-reborn) - 348 | General classifier module to allow Bayesian and other types of classifications.
349 | - [lda-ruby](https://github.com/ealdent/lda-ruby) - 350 | Ruby implementation of the [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) 351 | (Latent Dirichlet Allocation) for automatic Topic Modelling and Document Clustering. 352 | - [liblinear-ruby-swig](https://github.com/tomz/liblinear-ruby-swig) - 353 | Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification). 354 | - [linnaeus](https://github.com/djcp/linnaeus) - 355 | Redis-backed Bayesian classifier. 356 | - [maxent_string_classifier](https://github.com/mccraigmccraig/maxent_string_classifier) - 357 | JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework. 358 | - [naive_bayes](https://github.com/reddavis/Naive-Bayes) - 359 | Simple Naive Bayes classifier. 360 | - [nbayes](https://github.com/oasic/nbayes) - 361 | Full-featured, Ruby implementation of Naive Bayes. 362 | - [omnicat](https://github.com/mustafaturan/omnicat) - 363 | Generalized rack framework for text classifications. 364 | - [omnicat-bayes](https://github.com/mustafaturan/omnicat-bayes) - 365 | Naive Bayes text classification implementation as an OmniCat classifier strategy. 366 | - [ruby-fann](https://github.com/tangledpath/ruby-fann) - 367 | Ruby bindings to the [Fast Artificial Neural Network Library (FANN)](http://leenissen.dk/fann/wp/). 368 | - [rblearn](https://github.com/himkt/rblearn) - Feature Extraction and Crossvalidation library. 369 | 370 | ## Data Visualization 371 | 372 | Please refer to the [Data Visualization](https://github.com/arbox/data-science-with-ruby#visualization) 373 | section on the [Data Science with Ruby][ds-with-ruby] list. 374 | 375 | ## Optical Character Recognition 376 | 377 | * [tesseract-ocr](https://github.com/meh/ruby-tesseract-ocr) - 378 | FFI based wrapper over the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract). 
379 | 380 | ## Text Extraction 381 | 382 | - [yomu](https://github.com/yomurb/yomu) - 383 | library for extracting text and metadata from files and documents 384 | using the [Apache Tika](https://tika.apache.org/) content analysis toolkit. 385 | 386 | ## Full Text Search, Information Retrieval, Indexing 387 | 388 | - [rsolr](https://github.com/rsolr/rsolr) - 389 | Ruby and Rails client library for [Apache Solr](http://lucene.apache.org/solr/). 390 | - [sunspot](https://github.com/sunspot/sunspot) - 391 | Rails centric client for [Apache Solr](http://lucene.apache.org/solr/). 392 | - [thinking-sphinx](https://github.com/pat/thinking-sphinx) - 393 | [Active Record](https://guides.rubyonrails.org/active_record_basics.html) 394 | plugin for using [Sphinx](http://sphinxsearch.com/) in (not only) Rails based projects. 395 | - [elasticsearch](https://github.com/elastic/elasticsearch-ruby/tree/master/elasticsearch) - 396 | Ruby client and API for [Elasticsearch](https://www.elastic.co/). 397 | - [elasticsearch-rails](https://github.com/elastic/elasticsearch-rails) - 398 | Ruby and Rails integrations for [Elasticsearch](https://www.elastic.co/). 399 | - [google-api-client](https://github.com/googleapis/google-api-ruby-client) - 400 | Ruby API library for [Google](https://developers.google.com/api-client-library/ruby/) services. 401 | 402 | ## Language Aware String Manipulation 403 | 404 | Libraries for language aware string manipulation, i.e. search, pattern matching, 405 | case conversion, transcoding, regular expressions which need information about 406 | the underlying language. 407 | 408 | - [fuzzy_match](https://github.com/seamusabshere/fuzzy_match) - 409 | Fuzzy string comparison with Distance measures and Regular Expression. 410 | - [fuzzy-string-match](https://github.com/kiyoka/fuzzy-string-match) - 411 | Fuzzy string matching library for Ruby. 
412 | - [active_support](https://github.com/rails/rails/tree/master/activesupport/lib/active_support) - 413 | RoR `ActiveSupport` gem has various string extensions that can handle case. 414 | - [fuzzy_tools](https://github.com/brianhempel/fuzzy_tools) - 415 | Toolset for fuzzy searches in Ruby tuned for accuracy. 416 | - [u](http://disu.se/software/u-1.0/) - 417 | U extends Ruby’s Unicode support. 418 | - [unicode](https://github.com/blackwinter/unicode) - 419 | Unicode normalization library. 420 | - [CommonRegexRuby](https://github.com/talyssonoc/CommonRegexRuby) - 421 | Find many kinds of common information in a string. 422 | - [regexp-examples](https://github.com/tom-lord/regexp-examples) - 423 | Generate strings that match a given regular expression. 424 | - [verbal_expressions](https://github.com/ryan-endacott/verbal_expressions) - 425 | Make difficult regular expressions easy. 426 | - [translit_kit](https://github.com/AnalyzePlatypus/TranslitKit) - 427 | Transliterate Hebrew & Yiddish text into Latin characters. 428 | - [re2](https://github.com/mudge/re2) - 429 | High-speed Regular Expression library for Text Mining and Text Extraction. 430 | - [regex_sample](https://github.com/mochizukikotaro/regex_sample) - 431 | Sample string generation from a given Regular Expression. 432 | - [iuliia](https://github.com/adnikiforov/iuliia-rb) - 433 | Transliteration of Cyrillic to Latin in many possible ways (defined by the [reference implementation](https://github.com/nalgeon/iuliia)).
434 | 435 | ## Articles, Posts, Talks, and Presentations 436 | 437 | - 2019 438 | - _Extracting Text From Images Using Ruby_ by [aonemd](https://twitter.com/aonemd) 439 | [[post](https://aonemd.github.io/blog/extracting-text-from-images-using-ruby) | 440 | [code](https://gist.github.com/aonemd/7bb3c4760d9e47a9ce8e270198cb40a0)] 441 | - 2018 442 | - _Natural Language Processing and Tweet Sentiment Analysis_ by [Cassandra Corrales](https://twitter.com/casita305) 443 | [[post](https://medium.com/@cmcorrales3/natural-language-processing-and-tweet-sentiment-analysis-fa1edbb5ddd5)] 444 | - 2017 445 | - _The Google NLP API Meets Ruby_ by [Aja Hammerly](https://twitter.com/the_thagomizer) 446 | [[post](http://www.thagomizer.com/blog/2017/04/13/the-google-nlp-api-meets-ruby.html)] 447 | - _Syntax Isn't Everything: NLP For Rubyists_ by [Aja Hammerly](https://twitter.com/the_thagomizer) 448 | [[slides](http://www.thagomizer.com/files/NLP_RailsConf2017.pdf)] 449 | - _Scientific Computing on JRuby_ by [Prasun Anand](https://twitter.com/prasun_anand) 450 | [[slides](https://www.slideshare.net/PrasunAnand2/fosdem2017-scientific-computing-on-jruby) | 451 | [video](https://ftp.fau.de/fosdem/2017/K.4.201/ruby_scientific_computing_on_jruby.mp4) | 452 | [slides](https://www.slideshare.net/PrasunAnand2/scientific-computing-on-jruby) | 453 | [slides](https://www.slideshare.net/PrasunAnand2/scientific-computation-on-jruby)] 454 | - _Unicode Normalization in Ruby_ by [Starr Horne](https://twitter.com/starrhorne) 455 | [[post](https://blog.honeybadger.io/ruby_unicode_normalization/)] 456 | - 2016 457 | - _Quickly Create a Telegram Bot in Ruby_ by [Ardian Haxha](https://twitter.com/ArdianHaxha) 458 | [[tutorial](https://www.sitepoint.com/quickly-create-a-telegram-bot-in-ruby/)] 459 | - _Deep Learning: An Introduction for Ruby Developers_ by [Geoffrey Litt](https://twitter.com/geoffreylitt) 460 | 
[[slides](https://speakerdeck.com/geoffreylitt/deep-learning-an-introduction-for-ruby-developers)] 461 | - _How I made a pure-Ruby word2vec program more than 3x faster_ by [Kei Sawada](https://twitter.com/remore) 462 | [[slides](https://speakerdeck.com/remore/how-i-made-a-pure-ruby-word2vec-program-more-than-3x-faster)] 463 | - _Dōmo arigatō, Mr. Roboto: Machine Learning with Ruby_ by [Eric Weinstein](https://twitter.com/ericqweinstein) 464 | [[slides](https://speakerdeck.com/ericqweinstein/domo-arigato-mr-roboto-machine-learning-with-ruby) | [video](https://www.youtube.com/watch?v=T1nFQ49TyeA)] 465 | - 2015 466 | - _N-gram Analysis for Fun and Profit_ by [Jesus Castello](https://github.com/matugm) 467 | [[tutorial](https://www.rubyguides.com/2015/09/ngram-analysis-ruby/)] 468 | - _Machine Learning made simple with Ruby_ by [Lorenzo Masini](https://github.com/rugginoso) 469 | [[tutorial](https://www.leanpanda.com/blog/2015/08/24/machine-learning-automatic-classification/)] 470 | - _Using Ruby Machine Learning to Find Paris Hilton Quotes_ by [Rick Carlino](https://github.com/RickCarlino) 471 | [[tutorial](http://web.archive.org/web/20160414072324/http://datamelon.io/blog/2015/using-ruby-machine-learning-id-paris-hilton-quotes.html)] 472 | - _Exploring Natural Language Processing in Ruby_ by [Kevin Dias](https://github.com/diasks2) 473 | [[slides](https://www.slideshare.net/diasks2/exploring-natural-language-processing-in-ruby)] 474 | - _Machine Learning made simple with Ruby_ by [Lorenzo Masini](https://twitter.com/rugginoso) 475 | [[post](https://www.leanpanda.com/blog/2015/08/24/machine-learning-automatic-classification/)] 476 | - _Practical Data Science in Ruby_ by Bobby Grayson 477 | [[slides](http://slides.com/bobbygrayson/p#/)] 478 | - 2014 479 | - _Natural Language Parsing with Ruby_ by [Glauco Custódio](https://github.com/glaucocustodio) 480 | [[tutorial](http://glaucocustodio.github.io/2014/11/10/natural-language-parsing-with-ruby/)] 481 | - _Demystifying 
Data Science: Analyzing Conference Talks with Rails and Ngrams_ by 482 | [Todd Schneider](https://github.com/toddwschneider) 483 | [[video](https://www.youtube.com/watch?v=2ZDCxwB29Bg) | [code](https://github.com/Genius/abstractogram)] 484 | - _Natural Language Processing with Ruby_ by [Konstantin Tennhard](https://github.com/t6d) 485 | [[video](https://www.youtube.com/watch?v=5u86qVh8r0M) | [video](https://www.youtube.com/watch?v=oFmy_QBQ5DU) | 486 | [video](https://www.youtube.com/watch?v=sPkeeWnsMn0) | 487 | [slides](http://euruko2013.org/speakers/presentations/natural_language_processing_with_ruby_and_opennlp-tennhard.pdf)] 488 | - 2013 489 | - _How to parse 'go' - Natural Language Processing in Ruby_ by 490 | [Tom Cartwright](https://twitter.com/tomcartwrightuk) 491 | [[slides](https://www.slideshare.net/TomCartwright/natual-language-processing-in-ruby) | 492 | [video](https://skillsmatter.com/skillscasts/4883-how-to-parse-go)] 493 | - _Natural Language Processing in Ruby_ by [Brandon Black](https://twitter.com/brandonmblack) 494 | [[slides](https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby) | 495 | [video](http://confreaks.tv/videos/railsconf2013-natural-language-processing-with-ruby)] 496 | - _Natural Language Processing with Ruby: n-grams_ by [Nathan Kleyn](https://github.com/nathankleyn) 497 | [[tutorial](https://www.sitepoint.com/natural-language-processing-ruby-n-grams/) | 498 | [code](https://github.com/nathankleyn/ruby-nlp)] 499 | - _Seeking Lovecraft, Part 1: An introduction to NLP and the Treat Gem_ by 500 | [Robert Qualls](https://github.com/rlqualls) 501 | [[tutorial](https://www.sitepoint.com/seeking-lovecraft-part-1-an-introduction-to-nlp-and-the-treat-gem/)] 502 | - 2012 503 | - _Machine Learning with Ruby, Part One_ by [Vasily Vasinov](https://twitter.com/vasinov) 504 | [[tutorial](http://www.vasinov.com/blog/machine-learning-with-ruby-part-one/)] 505 | - 2011 506 | - _Ruby one-liners_ by [Benoit 
Hamelin](https://twitter.com/benoithamelin) 507 | [[post](http://benoithamelin.tumblr.com/ruby1line)] 508 | - _Clustering in Ruby_ by [Colin Drake](https://twitter.com/colinfdrake) 509 | [[post](https://colindrake.me/post/k-means-clustering-in-ruby/)] 510 | - 2010 511 | - _bayes_motel – Bayesian classification for Ruby_ by [Mike Perham](https://twitter.com/mperham) 512 | [[post](http://www.mikeperham.com/2010/04/28/bayes_motel-bayesian-classification-for-ruby/)] 513 | - 2009 514 | - _Porting the UEA-Lite Stemmer to Ruby_ by [Jason Adams](https://twitter.com/ealdent) 515 | [[post](https://ealdent.wordpress.com/2009/07/16/porting-the-uea-lite-stemmer-to-ruby/)] 516 | - _NLP Resources for Ruby_ by [Jason Adams](https://twitter.com/ealdent) 517 | [[post](https://ealdent.wordpress.com/2009/09/13/nlp-resources-for-ruby/)] 518 | - 2008 519 | - _Support Vector Machines (SVM) in Ruby_ by [Ilya Grigorik](https://twitter.com/igrigorik) 520 | [[post](https://www.igvita.com/2008/01/07/support-vector-machines-svm-in-ruby/)] 521 | - _Practical text classification with Ruby_ by [Gleicon Moraes](https://twitter.com/gleicon) 522 | [[post](https://zenmachine.wordpress.com/practical-text-classification-with-ruby/) | 523 | [code](https://github.com/gleicon/zenmachine)] 524 | - 2007 525 | - _Decision Tree Learning in Ruby_ by [Ilya Grigorik](https://twitter.com/igrigorik) 526 | [[post](https://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/)] 527 | - 2006 528 | - _Speak My Language: Natural Language Processing With Ruby_ by [Michael Granger](https://deveiate.org/resume.html) 529 | [[slides](https://deveiate.org/misc/Speak-My-Language.pdf) | 530 | [write-up](http://blog.nicksieger.com/articles/2006/10/22/rubyconf-natural-language-generation-and-processing-in-ruby/) | 531 | [write-up](http://juixe.com/papers/RubyConf2006.pdf)] 532 | 533 | ## Projects and Code Examples 534 | 535 | - [Going the Distance](https://github.com/schneems/going_the_distance) - 536 | Implementations of 
various distance algorithms with example calculations. 537 | - [Named entity recognition with Stanford NER and Ruby](https://github.com/mblongii/ruby-ner) - 538 | NER Examples in Ruby and Java with some [explanations](https://web.archive.org/web/20120722225402/http://mblongii.com/2012/04/15/named-entity-recognition-with-stanford-ner-and-ruby/). 539 | - [Words Counted](http://rubywordcount.com/) - 540 | examples of customizable word statistics powered by 541 | [words_counted](https://github.com/abitdodgy/words_counted). 542 | - [RSyntaxTree](https://yohasebe.com/rsyntaxtree/) - 543 | Web based demonstration of the syntactic tree visualization. 544 | 545 | ## Books 546 | 547 | - [Miller, Rob](https://twitter.com/robmil/). 548 | _Text Processing with Ruby: Extract Value from the Data That Surrounds You._ 549 | Pragmatic Programmers, 2015. 550 | [[link](https://www.amazon.com/Text-Processing-Ruby-Extract-Surrounds/dp/1680500708)] 551 | - [Watson, Mark](https://twitter.com/mark_l_watson). 552 | _Scripting Intelligence: Web 3.0 Information Gathering and Processing._ 553 | APRESS, 2010. 554 | [[link](https://www.amazon.de/Scripting-Intelligence-Information-Gathering-Processing/dp/1430223510)] 555 | - [Watson, Mark](https://twitter.com/mark_l_watson). 556 | _Practical Semantic Web and Linked Data Applications._ Lulu, 2010. 557 | [[link](http://www.lulu.com/shop/mark-watson/practical-semantic-web-and-linked-data-applications-java-edition/paperback/product-10915016.html)] 558 | 559 | ## Community 560 | 561 | - [Reddit](https://www.reddit.com/r/LanguageTechnology/search?q=ruby&restrict_sr=on) 562 | - [Stack Overflow](https://stackoverflow.com/search?q=%5Bnlp%5D+and+%5Bruby%5D) 563 | - [Twitter](https://twitter.com/search?q=Ruby%20NLP%20%23ruby%20OR%20%23nlproc%20OR%20%23rubynlp%20OR%20%23nlp&src=typd&lang=en) 564 | 565 | ## Needs your Help! 566 | 567 | All projects in this section are really important for the community but need 568 | more attention. 
If you have spare time and dedication, please spend some hours 569 | on the code here. 570 | 571 | - [ferret](https://github.com/dbalmain/ferret) - 572 | Information Retrieval in C and Ruby. 573 | - [summarize](https://github.com/ssoper/summarize) - 574 | Ruby native wrapper for [Open Text Summarizer](https://github.com/neopunisher/Open-Text-Summarizer). 575 | 576 | ## Related Resources 577 | 578 | - [Neural Machine Translation Implementations](https://github.com/jonsafari/nmt-list) 579 | - [Awesome Ruby](https://github.com/markets/awesome-ruby#natural-language-processing) - 580 | Among other awesome items, a short list of NLP related projects. 581 | - [Ruby NLP](https://github.com/diasks2/ruby-nlp) - 582 | State-of-the-art collection of Ruby libraries for NLP. 583 | - [Speech and Natural Language Processing](https://github.com/edobashira/speech-language-processing) - 584 | General list of NLP related resources (mostly not for Ruby programmers). 585 | - [Scientific Ruby](http://sciruby.com/) - 586 | Linear Algebra, Visualization and Scientific Computing for Ruby. 587 | - [iRuby](https://github.com/SciRuby/iruby) - IRuby kernel for Jupyter (formerly IPython). 588 | - [Awesome OCR](https://github.com/kba/awesome-ocr) - 589 | Multitude of OCR (Optical Character Recognition) resources. 590 | - [Awesome TensorFlow](https://github.com/jtoy/awesome-tensorflow) - 591 | Machine Learning with TensorFlow libraries. 592 | - 593 | [ImageMagick](https://imagemagick.org/index.php) 594 | 595 | ## License 596 | 597 | [![Creative Commons Zero 1.0](http://mirrors.creativecommons.org/presskit/buttons/80x15/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0/) `Awesome NLP with Ruby` by [Andrei Beliankou](https://github.com/arbox) and 598 | [Contributors](https://github.com/arbox/nlp-with-ruby/graphs/contributors). 
599 | 600 | To the extent possible under law, the person who associated CC0 with 601 | `Awesome NLP with Ruby` has waived all copyright and related or neighboring rights 602 | to `Awesome NLP with Ruby`. 603 | 604 | You should have received a copy of the CC0 legalcode along with this 605 | work. If not, see . 606 | 607 | 608 | [ruby]: https://www.ruby-lang.org/en/ 609 | [motivation]: https://github.com/arbox/nlp-with-ruby/blob/master/motivation.md 610 | [faq]: https://github.com/arbox/nlp-with-ruby/blob/master/FAQ.md 611 | [ds-with-ruby]: https://github.com/arbox/data-science-with-ruby 612 | [ml-with-ruby]: https://github.com/arbox/machine-learning-with-ruby 613 | [change-pr]: https://github.com/RichardLitt/knowledge/blob/master/github/amending-a-commit-guide.md 614 | -------------------------------------------------------------------------------- /ruby.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arbox/nlp-with-ruby/44db8a3e4c74c2a6fdcc6bef6bd14ed021b7bd24/ruby.jpg -------------------------------------------------------------------------------- /test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arbox/nlp-with-ruby/44db8a3e4c74c2a6fdcc6bef6bd14ed021b7bd24/test.png -------------------------------------------------------------------------------- /tutorials/ruby-stemmer.md: -------------------------------------------------------------------------------- 1 | # ruby-stemmer 2 | -------------------------------------------------------------------------------- /tutorials/template.md: -------------------------------------------------------------------------------- 1 | # This is a template 2 | -------------------------------------------------------------------------------- /tutorials/tokenizer.md: -------------------------------------------------------------------------------- 1 | # Tokenizer 2 | 3 | ## INSTALLATION 4 | 
`Tokenizer` is provided as a `.gem` package. Simply install it via 5 | [RubyGems](http://rubygems.org/gems/tokenizer). 6 | 7 | To install `tokenizer`, issue the following command: 8 | 9 | ``` shell 10 | $ gem install tokenizer 11 | ``` 12 | 13 | If you want to do a system wide installation, do this as root 14 | (possibly using `sudo`). 15 | 16 | Alternatively, use your Gemfile for dependency management. 17 | 18 | ## SYNOPSIS 19 | 20 | You can use `Tokenizer` in two ways. 21 | * As a command line tool: 22 | 23 | ``` shell 24 | $ echo 'Hi, ich gehe in die Schule!' | tokenize 25 | ``` 26 | 27 | * As a library for embedded tokenization: 28 | 29 | ``` ruby 30 | > require 'tokenizer' 31 | > de_tokenizer = Tokenizer::WhitespaceTokenizer.new 32 | > de_tokenizer.tokenize('Ich gehe in die Schule!') 33 | > => ["Ich", "gehe", "in", "die", "Schule", "!"] 34 | ``` 35 | 36 | * Customizable `PRE` and `POST` list 37 | 38 | ``` ruby 39 | > require 'tokenizer' 40 | > de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, { post: Tokenizer::Tokenizer::POST + ['|'] }) 41 | > de_tokenizer.tokenize('Ich gehe|in die Schule!') 42 | > => ["Ich", "gehe", "|in", "die", "Schule", "!"] 43 | ``` 44 | 45 | See documentation in the `Tokenizer::WhitespaceTokenizer` class for details 46 | on particular methods. 47 | -------------------------------------------------------------------------------- /tutorials/tutorial_template.md: -------------------------------------------------------------------------------- 1 | # Tutorial 2 | 3 | ![Implemented in pure Ruby][ruby] 4 | 5 | ## Motivation 6 | What tasks can we accomplish with this tool? 7 | 8 | ## Installation 9 | How can we get this tool? 
10 | 11 | ## Examples 12 | 13 | 14 | [ruby]: https://img.shields.io/badge/L%3A-Ruby-red.svg 15 | [jruby]: https://img.shields.io/badge/L%3A-JRuby-yellowgreen.svg 16 | [java]: https://img.shields.io/badge/L%3A-Java-yellow.svg 17 | [c]: https://img.shields.io/badge/L%3A-C-brightgreen.svg 18 | [cpp]: https://img.shields.io/badge/L%3A-C%2B%2B-green.svg 19 | [tutorial-present]: https://img.shields.io/badge/Tutorial-%E2%9C%85-green.svg 20 | [tutorial-missing]: https://img.shields.io/badge/Tutorial-%E2%9C%98-lightgrey.svg 21 | -------------------------------------------------------------------------------- /tutorials/weka-jruby.md: -------------------------------------------------------------------------------- 1 | # Weka with JRuby 2 | --------------------------------------------------------------------------------