tomoto.rb

🍅 tomoto - high performance topic modeling - for Ruby

Installation

Add this line to your application’s Gemfile:

gem "tomoto"

Getting Started

Train a model

model = Tomoto::LDA.new(k: 2)
model.add_doc(["tokens", "from", "document", "one"])
model.add_doc(["tokens", "from", "document", "two"])
model.add_doc(["tokens", "from", "document", "three"])
model.train(100) # iterations
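
Since add_doc takes an array of tokens, raw text has to be split into tokens first. The following is a minimal sketch, assuming a simple lowercase-and-split tokenizer is good enough for your data. The tokenize helper is hypothetical and not part of tomoto:

```ruby
# Hypothetical helper (not part of tomoto): lowercase the text and
# keep runs of letters, digits, and apostrophes as tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

texts = [
  "Tokens from document one.",
  "Tokens from document two!"
]
token_lists = texts.map { |text| tokenize(text) }

# Feed the tokens to a model only when the gem is available.
begin
  require "tomoto"
  model = Tomoto::LDA.new(k: 2)
  token_lists.each { |tokens| model.add_doc(tokens) }
  model.train(100) # iterations
rescue LoadError
  # tomoto is not installed; the tokenizer above still works on its own
end
```

Real corpora usually need more than this (stop word removal, stemming, language-specific splitting), so treat the regular expression as a placeholder.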

Get the summary

model.summary

Get topic words

model.topic_words

Save the model to a file

model.save("model.bin")

Load the model from a file

model = Tomoto::LDA.load("model.bin")

Get topic probabilities for a document

doc = model.docs[0]
doc.topics

Get the number of words for each topic

model.count_by_topics
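
Raw counts are easier to compare as fractions of the corpus. A small sketch, assuming the counts come back as an array of non-negative integers with one entry per topic (an assumption about the return shape, so check it against your version). The topic_shares helper is hypothetical:

```ruby
# Hypothetical helper (not part of tomoto): convert per-topic word
# counts into fractions of the total word count.
def topic_shares(counts)
  total = counts.sum
  # Avoid dividing by zero for an empty or untrained model.
  return counts.map { 0.0 } if total.zero?

  counts.map { |count| count.to_f / total }
end

# Example with made-up counts for a two-topic model.
shares = topic_shares([6, 2])
```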

Get the vocab

model.vocabs

Get the log likelihood per word

model.ll_per_word
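
The log likelihood per word is often reported as perplexity, which is the exponential of its negation (lower is better). A minimal sketch of that conversion; the perplexity_from_ll helper is hypothetical and only wraps the formula:

```ruby
# Hypothetical helper (not part of tomoto): per-word perplexity is
# exp(-ll), where ll is the natural-log likelihood per word.
def perplexity_from_ll(ll_per_word)
  Math.exp(-ll_per_word)
end

# Example with a made-up value; with a trained model you would pass
# model.ll_per_word instead.
perplexity_from_ll(-2.0)
```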

Perform inference for unseen documents

doc = model.make_doc(["unseen", "doc"])
topic_dist, ll = model.infer(doc)
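
A common next step is to pick the most likely topic for the document. The sketch below assumes topic_dist is an array of probabilities indexed by topic id; that shape is an assumption, and the dominant_topic helper is hypothetical:

```ruby
# Hypothetical helper (not part of tomoto): return the index and
# probability of the most likely topic in a distribution.
# Assumes topic_dist is an array of floats indexed by topic id.
def dominant_topic(topic_dist)
  prob, index = topic_dist.each_with_index.max_by { |p, _i| p }
  [index, prob]
end

# Example with a made-up three-topic distribution.
topic_id, prob = dominant_topic([0.2, 0.7, 0.1])
```

For an empty distribution the helper returns [nil, nil], so callers may want to guard against that case.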

Models

Supports:

  • Latent Dirichlet Allocation (LDA)
  • Labeled LDA (LLDA)
  • Partially Labeled LDA (PLDA)
  • Supervised LDA (SLDA)
  • Dirichlet Multinomial Regression (DMR)
  • Generalized Dirichlet Multinomial Regression (GDMR)
  • Hierarchical Dirichlet Process (HDP)
  • Hierarchical LDA (HLDA)
  • Multi Grain LDA (MGLDA)
  • Pachinko Allocation (PA)
  • Hierarchical PA (HPA)
  • Correlated Topic Model (CT)
  • Dynamic Topic Model (DT)

API

This library follows the tomotopy API. There are a few changes to make it more Ruby-like:

  • The get_ prefix has been removed from methods (topic_words instead of get_topic_words)
  • Methods that return booleans use ? instead of is_ (live_topic? instead of is_live_topic)

If a method or option you need isn’t supported, feel free to open an issue.
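
When porting tomotopy code, the two renaming rules above can be applied mechanically. The sketch below only illustrates the rules; the ruby_method_name helper is hypothetical, and a translated name is not a guarantee that the method exists in this library:

```ruby
# Hypothetical helper (not part of tomoto): apply the two naming
# rules to a tomotopy method name.
def ruby_method_name(tomotopy_name)
  case tomotopy_name
  when /\Aget_(.+)\z/
    # get_topic_words -> topic_words
    Regexp.last_match(1)
  when /\Ais_(.+)\z/
    # is_live_topic -> live_topic?
    "#{Regexp.last_match(1)}?"
  else
    # Other names are left unchanged.
    tomotopy_name
  end
end

ruby_method_name("get_topic_words")
ruby_method_name("is_live_topic")
```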

Examples

Performance

tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:

Tomoto.isa

Parallelism

Choose a parallelism algorithm with:

model.train(parallel: :partition)

Supported values are :default, :none, :copy_merge, and :partition.
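
If the value comes from configuration or user input, it can be checked against the supported list before training. A minimal sketch; the check_parallel helper is hypothetical and simply mirrors the four values listed above:

```ruby
# Hypothetical helper (not part of tomoto): validate a parallelism
# option against the values listed in this README.
def check_parallel(mode)
  supported = %i[default none copy_merge partition]
  mode = mode.to_sym
  unless supported.include?(mode)
    raise ArgumentError, "unknown parallel mode: #{mode.inspect}"
  end
  mode
end

# With a model in hand, the checked value can be passed straight through:
#   model.train(parallel: check_parallel("partition"))
check_parallel("partition")
```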

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone --recursive https://github.com/ankane/tomoto-ruby.git
cd tomoto-ruby
bundle install
bundle exec rake compile
bundle exec rake test