I am puzzled about what exactly a TCM (term co-occurrence matrix) is. The documentation of create_tcm just says that
This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.
and that its value is
dgTMatrix TCM matrix
Pennington, Socher and Manning, when introducing GloVe, define
Let the matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$.
My reading is that this matrix should be symmetric, i.e., $X_{ij} = X_{ji}$, if the context is symmetric and the weights are 1. However, consider a very simple example with window 1:
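library(text2vec)

# A single document with a palindromic token sequence
doc <- c("a b c b a")
it <- itoken(doc)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# Window of 1 on both sides, every co-occurrence weighted 1
tcm <- create_tcm(it,
                  vectorizer,
                  skip_grams_window = 1,
                  skip_grams_window_context = "symmetric",
                  weights = 1)
tcm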
This results in
3 x 3 sparse Matrix of class "dgTMatrix"
  c a b
c . . 2
a . . 2
b . . .
This is clearly not symmetric, e.g., there is no context for word "b". The rest of it makes sense: "c" has two "b"s as context, and so does "a".
Does the returned TCM only fill out the upper triangle? This seems to be confirmed by the documentation for coherence.
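To make my expectation concrete, here is a minimal base R sketch that counts co-occurrences naively for the same document, using the same vocabulary order (c, a, b) as the output above. It does not use text2vec at all; `count_cooccurrences` is a hypothetical helper written only for this illustration, and it is not meant to reflect how create_tcm is implemented.

```r
# Naive symmetric co-occurrence counter.
# Hypothetical helper for illustration only; not part of text2vec.
count_cooccurrences <- function(tokens, vocab, window = 1L) {
  X <- matrix(0, nrow = length(vocab), ncol = length(vocab),
              dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Positions within `window` of i, clipped to the document, excluding i itself
    ctx <- setdiff(seq(max(1L, i - window), min(n, i + window)), i)
    for (j in ctx) {
      # Word tokens[j] occurs once more in the context of word tokens[i]
      X[tokens[i], tokens[j]] <- X[tokens[i], tokens[j]] + 1
    }
  }
  X
}

tokens <- strsplit("a b c b a", " ", fixed = TRUE)[[1]]
X <- count_cooccurrences(tokens, vocab = c("c", "a", "b"))

# The full count matrix is symmetric
isSymmetric(X)

# Zero out the lower triangle to keep only the upper one
X_upper <- X
X_upper[lower.tri(X_upper)] <- 0
X_upper
```

In the full matrix, "b" has two "c"s and two "a"s as context, and the counts are mirrored across the diagonal. Keeping only the upper triangle leaves exactly the two entries that create_tcm returned (c-b and a-b, both equal to 2), which is what makes me suspect that only the upper triangle is stored.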
I am happy to contribute PRs and such, but would like to hear from you before I do.