Word2Vec

Julia interface to word2vec

Word2Vec takes a text corpus as input and produces the word vectors as output. Training is done using the original C code, other functionalities are pure Julia. See demo for more details.

Release Notes

Installation

Pkg.add("Word2Vec")

Note: Only linux and OS X are supported.

Functions

All exported functions are documented, i.e., we can type ? functionname to get help. For a list of functions, see here.

Examples

We first download some text corpus, for example http://mattmahoney.net/dc/text8.zip.

Suppose the file text8 is stored in the current working directory. We can train the model with the function word2vec.

julia> word2vec("text8", "text8-vec.txt", verbose = true)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 350.44k

Now we can import the word vectors text8-vec.txt to Julia.

julia> model = wordvectors("./text8-vec")
WordVectors 71291 words, 100-element Float64 vectors

The vector representation of a word can be obtained using get_vector.

julia> get_vector(model, "book")'
100-element Array{Float64,1}:
 -0.05446138539336186
  0.001090934639284009
  0.06498087707990222
  ⋮
 -0.0024113040415322516
  0.04755140828570571
  0.039764719065723826

The cosine similarity of book, for example, can be computed using cosine_similar_words.

julia> cosine_similar_words(model, "book")
10-element Array{String,1}:
 "book"
 "books"
 "diary"
 "story"
 "chapter"
 "novel"
 "preface"
 "poem"
 "tale"
 "bible"

Word vectors have many interesting properties. For example, vector("king") - vector("man") + vector("woman") is close to vector("queen").

5-element Array{String,1}:
 "queen"
 "empress"
 "prince"
 "princess"
 "throne"

References

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", In Proceedings of Workshop at ICLR, 2013. [pdf]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality", In Proceedings of NIPS, 2013. [pdf]
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, "Linguistic Regularities in Continuous Space Word Representations", In Proceedings of NAACL HLT, 2013. [pdf]

Acknowledgements

The design of the package is inspired by Daniel Rodriguez (@danielfrg)'s Python word2vec interface.

Reporting Bugs

Please file an issue to report a bug or request a feature.

Name		Name	Last commit message	Last commit date
Latest commit History 97 Commits
.github/workflows		.github/workflows
data		data
docs		docs
examples		examples
src		src
test		test
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Project.toml		Project.toml
README.md		README.md
REQUIRE		REQUIRE

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Word2Vec

Installation

Functions

Examples

References

Acknowledgements

Reporting Bugs

About

Releases 4

Packages

Contributors 5

Languages

License

JuliaText/Word2Vec.jl

Folders and files

Latest commit

History

Repository files navigation

Word2Vec

Installation

Functions

Examples

References

Acknowledgements

Reporting Bugs

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 5

Languages

Packages