skip-gram-neg-sampling

Compute similarity between words using a Skip-gram with Negative Sampling approach. Additionally, we do a manual implementation of the feed-forward and back-propagation steps as well as the negative sampling and embedding steps.

Data

We use the SimLex-999 dataset as a ground-truth table of similarity scores between word pairs.

For training we use the 1 Billion Words Corpus

Solution

The entire code can be found in skipGram.py

PreProcessing

We use spaCy's en_core_web_sm model for tokenization. We then:

  • convert tokens to lower case
  • lemmatize
  • remove punctuation
  • filter out stop words and tokens shorter than 3 characters
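The steps above can be sketched as follows. This is a minimal stand-in using a regex tokenizer and a small illustrative stop-word list rather than the spaCy pipeline the repository actually uses, and it skips lemmatization, which requires the spaCy model:

```python
import re

# Illustrative stop-word list; the real pipeline uses spaCy's.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "that", "this", "over"}

def preprocess(sentence):
    """Lower-case, strip punctuation/digits, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # lower-case + remove punctuation
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]

print(preprocess("The quick, brown fox JUMPED over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```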

SkipGram Training

We model a neural network with 1 hidden layer and the standard negative-sampling loss function:

$$\mathcal{L} = -\log \sigma(w_c \cdot w_t) \; - \sum_{n \in N} \log \sigma(-w_n \cdot w_t)$$

where t is the targeted word, c the context word, n ∈ N the sample of negative words, and w_x the weight (embedding) vector of word x.

The input to the network is the target word, and the output is, for every word in the vocabulary, the probability that it appears in the target word's context.
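The negative-sampling loss for one (target, context) pair can be sketched in NumPy as follows; variable names here are illustrative and not the repository's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, w_neg):
    """Negative-sampling loss for a single (target, context) pair.

    w_t   : (d,)   embedding of the target word
    w_c   : (d,)   embedding of the true context word
    w_neg : (k, d) embeddings of k sampled negative words
    """
    pos = np.log(sigmoid(np.dot(w_c, w_t)))      # reward the true context word
    neg = np.sum(np.log(sigmoid(-w_neg @ w_t)))  # penalise the negative samples
    return -(pos + neg)

rng = np.random.default_rng(0)
loss = sgns_loss(rng.normal(size=10), rng.normal(size=10), rng.normal(size=(5, 10)))
print(loss)  # a positive scalar
```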

The trainWord function of the SkipGram class:

  • calculates the softmax output and updates the weights matrix W
  • computes the partial derivatives required for back-propagation
  • performs gradient descent with alpha = 0.01
  • calculates the new value of the loss function
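A hedged sketch of such an update step, using the gradients of the negative-sampling loss with respect to the embedding vectors (illustrative names, not the repository's trainWord implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, w_neg):
    """Negative-sampling loss for one (target, context) pair."""
    return -(np.log(sigmoid(w_c @ w_t)) + np.sum(np.log(sigmoid(-w_neg @ w_t))))

def train_pair(w_t, w_c, w_neg, alpha=0.01):
    """One gradient-descent step (alpha = 0.01, as in the README)."""
    g_c = sigmoid(w_c @ w_t) - 1.0        # gradient coefficient, true context
    g_n = sigmoid(w_neg @ w_t)            # gradient coefficients, k negatives
    grad_t = g_c * w_c + g_n @ w_neg      # d loss / d w_t
    # Tuple assignment: all right-hand sides use the pre-update vectors.
    w_t, w_c, w_neg = (w_t - alpha * grad_t,
                       w_c - alpha * g_c * w_t,
                       w_neg - alpha * np.outer(g_n, w_t))
    return w_t, w_c, w_neg

rng = np.random.default_rng(1)
w_t, w_c, w_neg = rng.normal(size=10), rng.normal(size=10), rng.normal(size=(5, 10))
before = sgns_loss(w_t, w_c, w_neg)
after = sgns_loss(*train_pair(w_t, w_c, w_neg))
print(before, after)  # the loss should decrease after one step
```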

After training for one epoch (we stopped there due to the computational limitations of running on CPU), we save the weights matrix to disk.
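Persisting the learned matrix can be as simple as the following; the file name is hypothetical and the repository's own save format may differ:

```python
import os
import tempfile
import numpy as np

# Toy weights matrix of shape (vocab_size, embedding_dim)
W = np.random.default_rng(0).normal(size=(1000, 100))

path = os.path.join(tempfile.gettempdir(), "sg_weights.npy")  # hypothetical path
np.save(path, W)            # write the weights matrix to disk
W_loaded = np.load(path)    # restore it for the --test phase
print(np.allclose(W, W_loaded))  # True
```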

Similarity Scores

The idea is that, after training the model and saving the weights matrix, words with high similarity will have similar weight vectors. (Note that we define similarity based on usage, which in turn is determined by the context words surrounding the target word.)

In order to predict, we pass the --test argument to the script, which:

  • loads the word pairs for which we want to predict similarity
  • loads the pre-trained weights matrix
  • calls the similarity function, which computes the similarity of each word pair as the dot product of the two weight vectors (retrieved by id from the weights matrix)
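A sketch of that similarity computation; the word-to-row lookup `word2id` is a hypothetical dict, and the repository's similarity function may additionally normalise the vectors (cosine similarity is a common variant):

```python
import numpy as np

def similarity(word1, word2, W, word2id):
    """Dot product of the two words' weight vectors (rows of W)."""
    v1 = W[word2id[word1]]
    v2 = W[word2id[word2]]
    return float(np.dot(v1, v2))

# Toy example with a hypothetical 3-word vocabulary
word2id = {"cat": 0, "dog": 1, "car": 2}
W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
print(similarity("cat", "dog", W, word2id))  # 0.9
print(similarity("cat", "car", W, word2id))  # 0.0
```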
