Hi,
Can someone explain how the following code works under the hood:
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# load a BERT model
embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")
# create a sentence
sentence = Sentence('The grass is green. The roses are red.')
# embed the sentence
embedding.embed(sentence)
print(sentence.embedding)
When we call embed on a Sentence object, does it transform each token into a vector and then take the average of those vectors, or does it use another strategy to generate the final document embedding?
That is, in my example sentence (or document), would the resulting vector be:
(embedding of the token "the" + embedding of the token "grass" + ... + embedding of the token "red") / 8 (dimension-wise)?
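For concreteness, the averaging I have in mind would be something like the following sketch, which embeds the individual tokens with TransformerWordEmbeddings and averages them dimension-wise. This only illustrates the question; it is not necessarily what TransformerDocumentEmbeddings does internally:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# embed each token individually
word_embedding = TransformerWordEmbeddings("dbmdz/bert-base-german-uncased")
sentence = Sentence('The grass is green. The roses are red.')
word_embedding.embed(sentence)

# dimension-wise average over all token vectors
token_vectors = torch.stack([token.embedding for token in sentence])
mean_vector = token_vectors.mean(dim=0)
print(mean_vector.shape)  # e.g. torch.Size([768]) for a BERT base model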
Hi @teoML
According to the BERT paper, each input sequence starts with a [CLS] token whose final hidden state is meant to represent the whole sequence for classification tasks.
The default behavior here uses this token as well.
However, if you want to change that, you can look at the cls_pooling parameter in the docs. The other pooling strategies apply the respective function to the embeddings of the individual tokens.
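A minimal sketch of how that parameter is used, assuming the strategy names "cls", "mean" and "max" from the docs:

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# "cls" (default): use the final hidden state of the [CLS] token
# "mean" / "max": pool the individual token embeddings dimension-wise instead
embedding = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased",
    cls_pooling="mean",
)

sentence = Sentence('The grass is green. The roses are red.')
embedding.embed(sentence)
print(sentence.embedding.shape)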
Hi @helpmefindaname, thank you for answering my question!
I just tried out changing the pooling type and I get the same embedding for the same sentence:
embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="mean")
embedding_2 = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="cls")
# create two identical sentences
sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')
embedding.embed(sentence1)
embedding_2.embed(sentence2)
a = sentence1.embedding
b = sentence2.embedding
print(a == b)
Every element in the resulting 768-dimensional vector is True, which means the two embeddings are identical even though the pooling type is different. Is that considered normal behaviour? I also tried "max" and again got equal vectors. It would be nice if you could help me out; my goal is to create an embedding for a document consisting of around 100 sentences.
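For reference, here is the comparison rewritten with the cls_pooling spelling from the previous answer. I notice my snippet above says cls_polling, so if the misspelled keyword is silently accepted by the constructor, both objects may have fallen back to the default [CLS] pooling, which would explain the identical vectors:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

emb_mean = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="mean")
emb_cls = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="cls")

sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')
emb_mean.embed(sentence1)
emb_cls.embed(sentence2)

# torch.equal collapses the element-wise comparison to a single boolean
print(torch.equal(sentence1.embedding, sentence2.embedding))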
Maybe @alanakbik could answer?
Thank you!