A collection of Natural Language Processing (NLP) Ruby libraries, tools and software. Suggestions and contributions are welcome.
- APIs
- Bitext Alignment
- Books
- Chatterbot
- Classification
- Date and Time
- Error Correction
- Full-Text Search
- Keyword Ranking
- Language Detection
- Machine Learning
- Machine Translation
- Miscellaneous
- Multipurpose Tools
- Named Entity Recognition
- Ngrams
- Parsers
- Part-of-Speech Taggers
- Readability
- Regular Expressions
- Ruby NLP Presentations
- Sentence Segmentation
- Speech-to-Text
- Stemmers
- Stop Words
- Summarization
- Text Extraction
- Text Similarity
- Text-to-Speech
- Tokenizers
- Word Count
Client libraries to various 3rd party NLP API services.
- alchemy_api - provides a client API library for AlchemyAPI's NLP services
- aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API
- biffbot - Ruby gem for Diffbot's APIs that extract Articles, Products, Images, Videos, and Discussions from any web page
- BOTServer - Telegram Bot API Webhooks Framework, for Rubyists
- gengo-ruby - a Ruby library to interface with the Gengo API for translation
- monkeylearn-ruby - build and consume machine learning models for language processing from your Ruby apps
- napi-ruby - a simple Ruby wrapper for the Maluuba nAPI
- poliqarpr - Ruby client for Poliqarp text corpus server
- TelegramBot - a charismatic Ruby client for Telegram's Bot API
- TelegramBotRuby - yet another client for Telegram's Bot API
- telegram-bot-ruby - Ruby wrapper for Telegram's Bot API
- wlapi - Ruby-based API for the Wortschatz Leipzig project
- Text Processing with Ruby by Rob Miller
Bitext alignment is the process of aligning two parallel documents on a segment-by-segment basis. In other words, if you have one document in English and its translation in Spanish, bitext alignment matches each segment from document A with its corresponding translation in document B.
- alignment - alignment functions for corpus linguistics (Gale-Church implementation)
- chatterbot - a straightforward Ruby-based Twitter bot framework, using OAuth to authenticate
- Lita - a chat bot written in Ruby with persistent storage provided by Redis
Classification aims to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort.
- Classifier - a general module to allow Bayesian and other types of classifications
- classifier-reborn - (a fork of cardmagic/classifier) a general classifier module to allow Bayesian and other types of classifications
- Latent Dirichlet Allocation - used to automatically cluster documents into topics
- liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
- linnaeus - a redis-backed Bayesian classifier
- maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
- Naive-Bayes - simple Naive Bayes classifier
- nbayes - a full-featured, Ruby implementation of Naive Bayes
- omnicat - a generalized rack framework for text classifications
- omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy
- stuff-classifier - a library for classifying text into multiple categories
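To make the classification idea above concrete, here is a minimal pure-Ruby sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the API of any gem listed here; the class name `TinyNaiveBayes`, its methods, and the training sentences are all hypothetical and chosen only for illustration.

```ruby
# Illustrative sketch only: a tiny multinomial Naive Bayes text classifier.
# TinyNaiveBayes and its methods are hypothetical names, not a gem's API.
class TinyNaiveBayes
  def initialize
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @doc_counts  = Hash.new(0)
    @vocabulary  = {}
  end

  # Record one labelled training document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log-probability for the text.
  def classify(text)
    words = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    scores = @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      log_prob = Math.log(@doc_counts[label] / total_docs)
      words.each do |word|
        count = @word_counts[label].fetch(word, 0)
        # Add-one (Laplace) smoothing so unseen words do not zero the score.
        log_prob += Math.log((count + 1.0) / (total_words + @vocabulary.size))
      end
      [label, log_prob]
    end
    scores.max_by { |_, score| score }&.first
  end

  private

  # Assumption: lowercase ASCII words are a good enough token for a demo.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyNaiveBayes.new
classifier.train(:spam, "win money now")
classifier.train(:spam, "free money offer")
classifier.train(:ham,  "meeting at noon tomorrow")
classifier.train(:ham,  "lunch meeting notes")

puts classifier.classify("free money")       # prints "spam"
puts classifier.classify("meeting tomorrow") # prints "ham"
```

The libraries above add what this sketch leaves out, such as persistence, better tokenization, and other classification strategies.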
- Chronic - a pure Ruby natural language date parser
- Chronic Between - a simple Ruby natural language parser for date and time ranges
- Chronic Duration - a simple Ruby natural language parser for elapsed time
- Kronic - a dirt simple library for parsing and formatting human readable dates
- Nickel - extracts date, time, and message information from naturally worded text
- Tickle - a natural language parser for recurring events
- Chat Correct - shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence
- gingerice - Ruby wrapper for correcting spelling and grammar mistakes based on the context of complete sentences
- ferret - an information retrieval library in the same vein as Apache Lucene
- ranguba - a project to provide a full-text search system built on Groonga
- graph-rank - Ruby implementation of the PageRank and TextRank algorithms
- highscore - find and rank keywords in text
- Detect Language API Client - detects language of given text and returns detected language codes and scores
- whatlanguage - a language detection library for Ruby that uses bloom filters for speed
- Decision Tree - a Ruby library which implements the ID3 (information gain) algorithm for decision tree learning
- rb-libsvm - implementation of SVM, a machine learning and classification algorithm
- RubyFann - a Ruby gem that binds to FANN (Fast Artificial Neural Network) from within a Ruby/Rails environment
- Google API Client - Google API Ruby Client
- microsoft_translator - Ruby client for the Microsoft Translator API
- termit - Google Translate with speech synthesis in your terminal as a Ruby gem
- gibber - replaces text with nonsensical Latin with a maximum size difference of +/- 30%
- hiatus - a localization QA tool
- language_filter - a Ruby gem to detect and optionally filter multiple categories of language
- Naturally - Natural (version number) sorting with support for legal document numbering, college course codes, and Unicode
- rwordnet - a pure Ruby interface to the WordNet lexical/semantic database
- twitter-text - gem that provides text processing routines for Twitter Tweets
- nameable - a Ruby gem that provides parsing and output of person names, as well as gender and ethnicity matching
- dialable - a Ruby gem that provides parsing and output of North American Numbering Plan (NANP) phone numbers, and includes location and time zones
The following are libraries that integrate multiple NLP tools or functionality.
- nlp - NLP tools for the Polish language
- NlpToolz - basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer and POS tagger, plus the Berkeley Parser
- Open NLP (Ruby bindings)
- Stanford Core NLP (Ruby bindings)
- Treat - natural language processing framework for Ruby
- ve - a linguistic framework that's easy to use
- zipf - a collection of various NLP tools and libraries
- Confidential Info Redactor - a Ruby gem to semi-automatically redact confidential information from a text
- ruby-ner - named entity recognition with Stanford NER and Ruby
- ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer
- N-Gram - N-Gram generator in Ruby
- ngram - break words and phrases into ngrams
- raingrams - a flexible and general-purpose ngrams library written in Ruby
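As a rough illustration of what the n-gram libraries above do, word n-grams can be produced with nothing more than `Enumerable#each_cons` from the Ruby standard library. The helper name `ngrams` and the simple `\w+` tokenization are assumptions made for this sketch, not the interface of any gem listed here.

```ruby
# Illustrative sketch: word-level n-grams using only core Ruby.
# `ngrams` is a hypothetical helper, not part of any listed gem.
def ngrams(text, n)
  # Assumption: runs of word characters are an adequate token for a demo.
  tokens = text.downcase.scan(/\w+/)
  # each_cons yields every run of n consecutive tokens.
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

p ngrams("the quick brown fox", 2)
# => ["the quick", "quick brown", "brown fox"]
```

When the text has fewer than `n` tokens, `each_cons` yields nothing and the result is an empty array.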
A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.
- linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
- Parslet - a small PEG-based parser library
- rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
- Treetop - a Ruby-based parsing DSL based on parsing expression grammars
- engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
- rbtagger - a simple Ruby rule-based part-of-speech tagger
- TreeTagger for Ruby - Ruby-based wrapper for the TreeTagger by Helmut Schmid
- lingua - Lingua::EN::Readability is a Ruby module which calculates statistics on English text
- CommonRegexRuby - find many kinds of common information in a string
- regexp-examples - generate strings that match a given regular expression
- verbal_expressions - make difficult regular expressions easy
- N-gram Analysis for Fun and Profit [tutorial] - Jesus Castello (2015)
- Machine Learning made simple with Ruby [tutorial] - Lorenzo Masini (2015)
- Using Ruby Machine Learning to Find Paris Hilton Quotes [tutorial] - Rick Carlino (2015)
- Exploring Natural Language Processing in Ruby [slides] - Kevin Dias (2015)
- Natural Language Parsing with Ruby [tutorial] - Glauco Custódio (2014)
- Demystifying Data Science (Analyzing Conference Talks with Rails and Ngrams) [video RailsConf 2014 | Repo from the Video] - Todd Schneider (2014)
- Natural Language Processing with Ruby [video ArrrrCamp 2014 | video Ruby Conf India] - Konstantin Tennhard (2014)
- How to parse 'go' - Natural Language Processing in Ruby [slides] - Tom Cartwright (2013)
- Natural Language Processing in Ruby [slides | video] - Brandon Black (2013)
- Natural Language Processing with Ruby: n-grams [tutorial] - Nathan Kleyn (2013)
- A Tour Through Random Ruby [tutorial] - Robert Qualls (2013)
Sentence segmentation (aka sentence boundary disambiguation, sentence boundary detection) is the problem in natural language processing of deciding where sentences begin and end. Sentence segmentation is the foundation of many common NLP tasks (machine translation, bitext alignment, summarization, etc.).
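To see why deciding where sentences end is a real problem, consider the naive approach of splitting after every `.`, `!` or `?` that is followed by whitespace. The helper name `naive_sentences` is hypothetical; the sketch only shows the failure mode that dedicated segmenters are built to handle.

```ruby
# Illustrative sketch: a deliberately naive sentence splitter.
# `naive_sentences` is a hypothetical helper, not a recommended approach.
def naive_sentences(text)
  # Split on whitespace that directly follows ., ! or ? (lookbehind keeps
  # the punctuation attached to the preceding sentence).
  text.split(/(?<=[.!?])\s+/)
end

p naive_sentences("It works. Does it? Yes!")
# => ["It works.", "Does it?", "Yes!"]

# Abbreviations break the naive rule:
p naive_sentences("Dr. Smith arrived at 5 p.m. sharp.")
# => ["Dr.", "Smith arrived at 5 p.m.", "sharp."]
```

The second call splits one sentence into three, which is why abbreviations, initials, and similar cases need explicit handling.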
- att_speech - A Ruby library for consuming the AT&T Speech API for speech to text
- pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx
- Speech2Text - a simple interface for converting audio files to text using the Google Speech to Text API
Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
- Greek stemmer - a Greek stemmer
- Ruby-Stemmer - exposes the Snowball API to Ruby
- Turkish stemmer - a Turkish stemmer
- uea-stemmer - a conservative stemmer for search and indexing
- clarifier
- stopwords - really just a list of stopwords with some helpers
- Stopwords Filter - a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence
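The stop word filtering that the gems above provide can be sketched in a few lines of plain Ruby. The tiny `STOP_WORDS` list and the `remove_stop_words` helper are assumptions for illustration only; real lists are much longer and language-specific.

```ruby
require "set"

# Illustrative sketch: a deliberately tiny, hypothetical stop word list.
STOP_WORDS = Set.new(%w[a an and is of the to]).freeze

# `remove_stop_words` is a hypothetical helper, not a gem's API.
def remove_stop_words(text, stop_words = STOP_WORDS)
  # Assumption: lowercase ASCII words are an adequate token for a demo.
  text.downcase.scan(/[a-z']+/).reject { |word| stop_words.include?(word) }
end

p remove_stop_words("The cat is on the mat")
# => ["cat", "on", "mat"]
```

Note that "on" survives because it is not in this short list, which shows how much the result depends on the list chosen.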
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
- Epitome - a small gem to make your text shorter; an implementation of the LexRank algorithm
- ots - Ruby bindings to open text summarizer
- summarize - Ruby C wrapper for Open Text Summarizer
- docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
- rtesseract - Ruby library for working with the Tesseract OCR
- Ruby Readability - a tool for extracting the primary readable content of a webpage
- ruby-tesseract - binds the TessBaseAPI object through ffi-inline (so it also works on JRuby) and wraps that API in a more Ruby-esque Engine class
- Yomu - a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit
- damerau-levenshtein - calculates edit distance using the Damerau-Levenshtein algorithm
- FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
- fuzzy-string-match - fuzzy string matching library for Ruby
- FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
- Going the Distance - contains scripts that do various distance calculations
- hotwater - Fast Ruby FFI string edit distance algorithms
- levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
- TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
- tf-idf-similarity - calculate the similarity between texts using tf*idf
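Several of the libraries above compute edit distance. As a hedged sketch of the underlying idea, here is plain Levenshtein distance in pure Ruby using the standard two-row dynamic programming approach. This is not Damerau-Levenshtein (it does not treat a transposition as a single edit), and the helper name `levenshtein` is hypothetical rather than any gem's API.

```ruby
# Illustrative sketch: plain Levenshtein distance (insert, delete, substitute).
# Transpositions are NOT counted as one edit, unlike Damerau-Levenshtein.
def levenshtein(a, b)
  # previous[j] holds the distance between the processed prefix of a
  # and the first j characters of b.
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein("kitten", "sitting") # prints 3
puts levenshtein("ca", "ac")          # prints 2 (a transposition costs two edits here)
```

The FFI-backed gems listed above exist because a pure-Ruby loop like this becomes slow on long strings or large batches.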
- espeak-ruby - small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files
- Isabella - a voice-computing assistant built in Ruby
- tts - a Ruby gem for converting text to speech using the Google Translate service
- Jieba - Chinese tokenizer and segmenter (JRuby)
- MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
- NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
- rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
- Textoken - Simple and customizable text tokenization gem
- thailang4r - Thai tokenizer
- tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
- tokenizer - a simple multilingual tokenizer
- wc - a rubygem to count word occurrences in a given text
- word_count - a word counter for String and Hash in Ruby
- Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
- WordsCounted - a highly customisable Ruby text analyser
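The word count discrepancies mentioned above are easy to reproduce. The two hypothetical helpers below count the same string under two common definitions of a "word"; neither is the method used by any gem listed here.

```ruby
# Illustrative sketch: two plausible definitions of "word" disagree.
# Both helper names are hypothetical.
def count_by_whitespace(text)
  # A word is anything separated by whitespace.
  text.split.size
end

def count_by_word_chars(text)
  # A word is a run of letters, digits or underscores, so apostrophes
  # and hyphens split tokens apart.
  text.scan(/\w+/).size
end

text = "Don't stop e-mailing me"
puts count_by_whitespace(text) # prints 4
puts count_by_word_chars(text) # prints 6
```

Contractions and hyphenated words account for the difference here; numbers, dates, and URLs tend to cause similar disagreements between tools.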