skip-gram-neg-sampling

Compute similarity between words using a Skip-gram with Negative Sampling approach. Additionally, we do a manual implementation of the feed-forward and back-propagation steps as well as the negative sampling and embedding steps.

Data

We use the SimLex-999 dataset as a ground-truth table of similarity scores between word pairs.

For training we use the 1 Billion Words Corpus

Solution

The entire code can be found in skipGram.py

PreProcessing

We use spaCy's en_core_web_sm model for tokenization. We then:

  • convert tokens to lower case
  • lemmatize
  • remove punctuation
  • filter out stop words and tokens shorter than 3 characters
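The steps above can be sketched as follows. This is a minimal stand-in using a regex tokenizer and a small illustrative stop-word list rather than the spaCy pipeline the repository actually uses, and it skips lemmatization, which requires the spaCy model:

```python
import re

# Illustrative stop-word list; the real pipeline uses spaCy's.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "that", "this", "over"}

def preprocess(sentence):
    """Lower-case, strip punctuation/digits, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # lower-case + remove punctuation
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]

print(preprocess("The quick, brown fox JUMPED over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```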

SkipGram Training

We model a neural network with 1 hidden layer and the standard negative-sampling loss function:

$$\mathcal{L} = -\log \sigma(w_c \cdot w_t) \; - \sum_{n \in N} \log \sigma(-w_n \cdot w_t)$$

where t is the targeted word, c the context word, n ∈ N the sample of negative words, and w_x the weight (embedding) vector of word x.

The input to the network is the target word, and the output is, for every word in the vocabulary, the probability that it appears in the target word's context.
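The negative-sampling loss for one (target, context) pair can be sketched in NumPy as follows; variable names here are illustrative and not the repository's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, w_neg):
    """Negative-sampling loss for a single (target, context) pair.

    w_t   : (d,)   embedding of the target word
    w_c   : (d,)   embedding of the true context word
    w_neg : (k, d) embeddings of k sampled negative words
    """
    pos = np.log(sigmoid(np.dot(w_c, w_t)))      # reward the true context word
    neg = np.sum(np.log(sigmoid(-w_neg @ w_t)))  # penalise the negative samples
    return -(pos + neg)

rng = np.random.default_rng(0)
loss = sgns_loss(rng.normal(size=10), rng.normal(size=10), rng.normal(size=(5, 10)))
print(loss)  # a positive scalar
```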

The trainWord function of the SkipGram class:

  • calculates the softmax output and updates the weights matrix W
  • computes the partial derivatives required for back-propagation
  • performs gradient descent with alpha = 0.01
  • calculates the new value of the loss function
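A hedged sketch of such an update step, using the gradients of the negative-sampling loss with respect to the embedding vectors (illustrative names, not the repository's trainWord implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, w_neg):
    """Negative-sampling loss for one (target, context) pair."""
    return -(np.log(sigmoid(w_c @ w_t)) + np.sum(np.log(sigmoid(-w_neg @ w_t))))

def train_pair(w_t, w_c, w_neg, alpha=0.01):
    """One gradient-descent step (alpha = 0.01, as in the README)."""
    g_c = sigmoid(w_c @ w_t) - 1.0        # gradient coefficient, true context
    g_n = sigmoid(w_neg @ w_t)            # gradient coefficients, k negatives
    grad_t = g_c * w_c + g_n @ w_neg      # d loss / d w_t
    # Tuple assignment: all right-hand sides use the pre-update vectors.
    w_t, w_c, w_neg = (w_t - alpha * grad_t,
                       w_c - alpha * g_c * w_t,
                       w_neg - alpha * np.outer(g_n, w_t))
    return w_t, w_c, w_neg

rng = np.random.default_rng(1)
w_t, w_c, w_neg = rng.normal(size=10), rng.normal(size=10), rng.normal(size=(5, 10))
before = sgns_loss(w_t, w_c, w_neg)
after = sgns_loss(*train_pair(w_t, w_c, w_neg))
print(before, after)  # the loss should decrease after one step
```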

After training for one epoch (we stopped there due to the computational limitations of running on CPU), we save the weights matrix to disk.
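Persisting the learned matrix can be as simple as the following; the file name is hypothetical and the repository's own save format may differ:

```python
import os
import tempfile
import numpy as np

# Toy weights matrix of shape (vocab_size, embedding_dim)
W = np.random.default_rng(0).normal(size=(1000, 100))

path = os.path.join(tempfile.gettempdir(), "sg_weights.npy")  # hypothetical path
np.save(path, W)            # write the weights matrix to disk
W_loaded = np.load(path)    # restore it for the --test phase
print(np.allclose(W, W_loaded))  # True
```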

Similarity Scores

The idea is that, after training the model and saving the weights matrix, words with high similarity will have similar weight vectors. (Note that we define similarity based on usage, which in turn is determined by the context words surrounding the target word.)

In order to predict, we pass the --test argument to the script, which:

  • loads the word pairs for which we want to predict similarity
  • loads the pre-trained weights matrix
  • calls the similarity function, which computes the similarity of each word pair as the dot product of the two weight vectors (retrieved by id from the weights matrix)
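A sketch of that similarity computation; the word-to-row lookup `word2id` is a hypothetical dict, and the repository's similarity function may additionally normalise the vectors (cosine similarity is a common variant):

```python
import numpy as np

def similarity(word1, word2, W, word2id):
    """Dot product of the two words' weight vectors (rows of W)."""
    v1 = W[word2id[word1]]
    v2 = W[word2id[word2]]
    return float(np.dot(v1, v2))

# Toy example with a hypothetical 3-word vocabulary
word2id = {"cat": 0, "dog": 1, "car": 2}
W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
print(similarity("cat", "dog", W, word2id))  # 0.9
print(similarity("cat", "car", W, word2id))  # 0.0
```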
