## Usage

```
$ python3 main.py --embedding_size 100 --context_size 10
$ python3 main.py --mode "are_Similar" --word1 "big" --word2 "long"
$ python3 main.py --mode "get_ClosestWords" --word "big"
$ python3 main.py --mode "analogy" --word1 "big" --word2 "long" --word "small"
$ python3 main.py --mode "plotEmbeddings"
```
## Introduction

GloVe is a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of corpus statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. The name Global Vectors (GloVe) reflects the fact that the model captures global corpus statistics directly.
The relationship between two words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. Compared to the raw probabilities, the ratio is better able to distinguish relevant words from irrelevant ones, and it is also better able to discriminate between the two relevant words.
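As a quick illustration, here is a minimal sketch of this idea; the probe words echo the ice/steam example from the GloVe paper, but the counts below are made up for illustration and are not taken from any corpus:

```python
import numpy as np

# Hypothetical co-occurrence counts: rows are the target words, columns
# are probe words k. The numbers are made up for illustration.
words = ["ice", "steam"]
probes = ["solid", "gas", "water", "fashion"]
X = np.array([
    [190.0, 7.0, 300.0, 2.0],   # counts for "ice"
    [22.0, 80.0, 310.0, 2.0],   # counts for "steam"
])

# P(k | i) = X_ik / X_i, where X_i is the total count for word i.
P = X / X.sum(axis=1, keepdims=True)

# The ratio P(k|ice) / P(k|steam) is large for probes related to ice,
# small for probes related to steam, and near 1 for probes related to
# both ("water") or to neither ("fashion").
for k, probe in enumerate(probes):
    print(f"{probe:>8}: P(k|ice)/P(k|steam) = {P[0, k] / P[1, k]:.2f}")
```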
We denote word vectors by $w$ and separate context word vectors by $\tilde{w}$. We require $F$ to be a homomorphism, so that it relates differences of word vectors to ratios of co-occurrence probabilities:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}}, \qquad P_{ik} = \frac{X_{ik}}{X_i},$$

where word-word co-occurrence counts are denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$, and $X_i = \sum_k X_{ik}$. The solution to these equations is the exponential function, $F = \exp$, which gives $w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$. Absorbing $\log X_i$ into a bias $b_i$, and adding a corresponding bias $\tilde{b}_k$ for $\tilde{w}_k$ to restore symmetry, the equation becomes

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.$$
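To see why $F = \exp$ works, note that exp maps sums to products, so differences of dot products turn into ratios of probabilities. A toy numeric check (with made-up vectors, not trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 5-dimensional word and context vectors, for illustration only.
w_i, w_j, w_k = rng.normal(size=(3, 5))

a = w_i @ w_k   # plays the role of log P_ik
b = w_j @ w_k   # plays the role of log P_jk

# exp is a homomorphism from (R, +) to (R>0, *): F(a - b) = F(a) / F(b),
# so vector differences correspond to probability ratios.
assert np.isclose(np.exp(a - b), np.exp(a) / np.exp(b))
```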
We use a new weighted least squares regression model that addresses these problems. Casting the above equation as a least squares problem and introducing a weighting function $f(X_{ij})$ into the cost function gives us the model

$$J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2,$$
where $V$ is the size of the vocabulary. The proposed weighting function is

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise.} \end{cases}$$
We fix $x_{\max} = 100$ for all our experiments and find that $\alpha = 3/4$ gives a modest improvement over a linear version with $\alpha = 1$.
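Below is a minimal NumPy sketch of the weighting function and the cost $J$; the names (`weight`, `glove_cost`) and the dense-matrix representation are illustrative assumptions, not this repository's actual code:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: (x / x_max)^alpha below x_max, 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_ctx, b, b_ctx, X):
    """Weighted least squares cost J for a dense co-occurrence matrix X.

    W, W_ctx : (V, d) word and context word vectors
    b, b_ctx : (V,)   word and context word biases
    X        : (V, V) co-occurrence counts
    """
    i, j = np.nonzero(X)                        # f(0) = 0, so only observed
    counts = X[i, j]                            # pairs contribute to the sum
    inner = np.einsum("nd,nd->n", W[i], W_ctx[j])
    residual = inner + b[i] + b_ctx[j] - np.log(counts)
    return np.sum(weight(counts) * residual ** 2)
```

Since $f(0) = 0$, zero entries of $X$ drop out of the sum, which is what makes training on only the sparse set of observed co-occurrences tractable.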
GloVe controls for the main sources of variation by fixing the vector length, context window size, corpus, vocabulary size, and, most importantly, the training time.
GloVe is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
The dataset used in this implementation is the same as the one used in the word2vec TensorFlow implementation, so that the results can be compared.