fasttext is a Python interface for Facebook's fastText.
fasttext supports Python 2.6 or newer. It requires Cython in order to compile the C++ extension.
pip install fasttext
This package has two main use cases: word representation learning and text classification.
These use cases are described in papers [1] and [2].
In order to learn word vectors, as described in [1], we can use the fasttext.skipgram and fasttext.cbow functions like the following:
import fasttext

# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # list of words in dictionary

# CBOW model
model = fasttext.cbow('data.txt', 'model')
print(model.words)  # list of words in dictionary
where data.txt is a training file containing UTF-8 encoded text.
By default the word vectors take into account character n-grams of 3 to 6 characters.
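The n-gram range is controlled by the minn and maxn parameters listed further below. A minimal sketch of changing it, assuming these parameters are accepted as keyword arguments:

import fasttext

# Use character n-grams of length 2 to 5 instead of the default 3 to 6.
# (minn and maxn are taken from the parameter list below; passing them as
# keyword arguments is an assumption of this sketch.)
model = fasttext.skipgram('data.txt', 'model', minn=2, maxn=5)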
At the end of optimization the program will save two files: model.bin and model.vec.
model.vec is a text file containing the word vectors, one per line.
model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
The binary file can be used later to compute word vectors or to restart the optimization.
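Because model.vec is plain text, it can be inspected without fastText at all. The sketch below assumes the standard fastText .vec layout: a header line with the vocabulary size and the vector dimension, followed by one word and its vector per line:

def load_vec(path):
    """Read a .vec file into a {word: [float, ...]} dict."""
    vectors = {}
    with open(path) as f:
        f.readline()  # skip the "<vocab_size> <dim>" header line
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

vectors = load_vec('model.vec')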
The following fasttext(1) command is equivalent:
# Skipgram model
./fasttext skipgram -input data.txt -output model
# CBOW model
./fasttext cbow -input data.txt -output model
The previously trained model can be used to compute word vectors for out-of-vocabulary words.
print(model.get_vector('king'))  # get the vector of the word 'king'
The following fasttext(1) command is equivalent:
echo "king" | ./fasttext print-vectors model.bin
This will output the vector of the word king to standard output.
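Because vectors are assembled from character n-grams, the same call works for words that never appeared in the training data. A small sketch, reusing the model trained above (the misspelling 'kingdoom' is just a hypothetical out-of-vocabulary input):

# A vector is produced even for out-of-vocabulary words, e.g. misspellings,
# because it is built from the word's character n-grams.
for word in ['king', 'kingly', 'kingdoom']:
    print(model.get_vector(word))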
We can use fasttext.load_model to load a pre-trained model:
model = fasttext.load_model('model.bin')
print(model.words)  # list of words in dictionary
print(model.get_vector('king'))  # get the vector of the word 'king'
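Assuming get_vector returns a plain sequence of floats, standard vector arithmetic applies to the result. A sketch of cosine similarity between two words, using numpy (which is not part of this package):

import numpy as np
import fasttext

model = fasttext.load_model('model.bin')

def cosine_similarity(a, b):
    # Cosine similarity between the vectors of two words.
    va = np.array(model.get_vector(a))
    vb = np.array(model.get_vector(b))
    return np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

print(cosine_similarity('king', 'queen'))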
Work in progress
import fasttext

# Skipgram model
model = fasttext.skipgram(params)
model.words             # list of words in dictionary
model.get_vector(word)  # vector of a given word

# CBOW model
model = fasttext.cbow(params)
model.words
model.get_vector(word)

# Load a pre-trained model
model = fasttext.load_model('model.bin')
model.words
model.get_vector(word)
List of params and their default values (a usage sketch follows this list):
input        training file path
output       output file path
lr           learning rate [0.05]
dim          size of word vectors [100]
ws           size of the context window [5]
epoch        number of epochs [5]
min_count    minimal number of word occurrences [1]
neg          number of negatives sampled [5]
word_ngrams  max length of word ngram [1]
loss         loss function {ns, hs, softmax} [ns]
bucket       number of buckets [2000000]
minn         min length of char ngram [3]
maxn         max length of char ngram [6]
thread       number of threads [12]
verbose      how often to print to stdout [10000]
t            sampling threshold [0.0001]
silent       suppress the log from the C++ extension [1]
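As mentioned above, here is a sketch of overriding several of these defaults, assuming they are accepted as keyword arguments alongside the positional input and output paths:

import fasttext

model = fasttext.skipgram(
    'data.txt', 'model',  # input and output paths
    lr=0.02,              # learning rate
    dim=300,              # size of word vectors
    epoch=10,             # number of epochs
    min_count=5,          # ignore rarer words
    silent=0,             # show the log from the C++ extension
)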
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
(* These authors contributed equally.)
- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group: https://groups.google.com/forum/#!forum/fasttext-library