We compute similarity between words using a Skip-gram with Negative Sampling (SGNS) approach. The feed-forward and back-propagation steps, as well as the negative sampling and embedding steps, are implemented manually.
We use the SimLex-999 dataset as a ground-truth table of association scores between word pairs. More information about it can be found here.
For training we use the 1 Billion Words Corpus.
The entire code can be found in skipGram.py.
We use the spaCy `en_core_web_sm` model for tokenization. We then (see the sketch after this list):
- convert tokens to lower case
- lemmatize
- remove punctuation
- filter out stop words and tokens shorter than 3 characters
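
A minimal sketch of this preprocessing step, assuming the `en_core_web_sm` model is installed; the function name `preprocess` is illustrative, and the actual implementation lives in skipGram.py:

```python
import spacy

# Load the small English pipeline; parser and NER are not needed for tokenization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(sentence):
    """Tokenize one sentence and apply the cleaning steps listed above."""
    doc = nlp(sentence.lower())                 # lower-case, then tokenize
    return [tok.lemma_ for tok in doc           # lemmatize
            if not tok.is_punct                 # remove punctuation
            and not tok.is_stop                 # filter stop words
            and len(tok.text) >= 3]             # drop tokens shorter than 3 characters

print(preprocess("The cats were sitting on the mats."))
```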
We model a neural network with one hidden layer and the following loss function:

$$ J(t, c, N) = -\log \sigma\left(w_c^\top w_t\right) - \sum_{n \in N} \log \sigma\left(-w_n^\top w_t\right) $$

where $t$ is the target word, $c$ the context word, $n \in N$ a sampled negative word, $w_x$ the weight vector of word $x$, and $\sigma$ the sigmoid function.
The input to the network is the target word, and the output is, for every word in the vocabulary, the probability that it appears in the target word's context.
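
As an illustration of the loss above, here is a minimal numpy sketch, assuming two weight matrices `W` (target-word embeddings) and `Wp` (context-word embeddings); the names, shapes, and initialization are assumptions, not taken from skipGram.py:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10_000, 100
W = rng.normal(scale=0.1, size=(vocab_size, emb_dim))    # target-word embeddings
Wp = rng.normal(scale=0.1, size=(vocab_size, emb_dim))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(t_id, c_id, neg_ids):
    """Negative-sampling loss for one (target, context, negatives) triple."""
    pos = np.log(sigmoid(Wp[c_id] @ W[t_id]))              # positive pair term
    neg = np.sum(np.log(sigmoid(-Wp[neg_ids] @ W[t_id])))  # negative samples term
    return -(pos + neg)
```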
The `trainWord` function of the `SkipGram` class (a sketch follows the list):
- calculates the softmax output and updates the weights matrix `W`
- computes the partial derivatives required for back-propagation
- performs gradient descent with `alpha=0.01`
- calculates the new value of the loss function
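
A hedged sketch of what such a training step could look like, reusing `W`, `Wp`, `emb_dim`, `sigmoid`, and `sgns_loss` from the previous sketch; this is an illustrative stand-in for `trainWord`, not the exact code from skipGram.py:

```python
alpha = 0.01  # learning rate

def train_word(t_id, c_id, neg_ids):
    """One gradient-descent step on the negative-sampling loss for a single triple."""
    grad_t = np.zeros(emb_dim)

    # Positive (target, context) pair: gradient of -log(sigmoid(Wp[c] . W[t]))
    s = sigmoid(Wp[c_id] @ W[t_id])
    grad_t += (s - 1.0) * Wp[c_id]
    Wp[c_id] -= alpha * (s - 1.0) * W[t_id]

    # Negative samples: gradient of -log(sigmoid(-Wp[n] . W[t]))
    for n_id in neg_ids:
        s_n = sigmoid(Wp[n_id] @ W[t_id])
        grad_t += s_n * Wp[n_id]
        Wp[n_id] -= alpha * s_n * W[t_id]

    # Update the target-word row of W and report the new loss value.
    W[t_id] -= alpha * grad_t
    return sgns_loss(t_id, c_id, neg_ids)
```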
After training for one epoch (training was stopped early due to computational limitations when running on CPU), we save the weights matrix to disk.
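
For illustration, persisting and reloading the matrix could be as simple as the following numpy sketch; the actual file name and format used by skipGram.py may differ:

```python
import numpy as np

W = np.random.default_rng(0).normal(size=(10_000, 100))  # stand-in for the trained matrix

np.save("weights.npy", W)    # after training
W = np.load("weights.npy")   # at prediction time
```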
The idea is that, after training the model and saving the weights matrix, words with high similarity will end up with similar weight vectors (we define similarity based on usage, which in turn is determined by the context words surrounding the target word).
In order to predict, we pass the `--test` argument to the script, which:
- loads the word pairs for which we want to predict similarity
- loads the pre-trained weights matrix
- calls the `similarity` function, which computes, for each word pair, the dot product of the two weight vectors retrieved by id from the weights matrix (see the sketch after this list)
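
A minimal sketch of this prediction path, assuming the weights were saved as weights.npy and that a word-to-id mapping built during training is available; the toy vocabulary below is purely illustrative:

```python
import numpy as np

W = np.load("weights.npy")                     # pre-trained weights matrix
word2id = {"old": 0, "new": 1, "smart": 2}     # toy stand-in for the real vocabulary mapping

def similarity(word1, word2):
    """Dot product of the two words' weight vectors, retrieved by id."""
    return float(W[word2id[word1]] @ W[word2id[word2]])

# For each SimLex-999 pair (w1, w2) we would report similarity(w1, w2).
print(similarity("old", "new"))
```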