
[Question]: How does .embed(Sentence) work under the hood? #3409

Open
teoML opened this issue Feb 21, 2024 · 3 comments
Labels
question (Further information is requested)

Comments

teoML commented Feb 21, 2024

Question

Hi,
Can someone explain how the following code works under the hood:

 # imports needed to run the snippet
 from flair.data import Sentence
 from flair.embeddings import TransformerDocumentEmbeddings

 # load a BERT model
 embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")

 # create a sentence
 sentence = Sentence('The grass is green. The roses are red.')

 # embed the sentence as a whole document
 embedding.embed(sentence)

 print(sentence.embedding)

When we call embed on a Sentence object, does it transform each token into a vector and then average those vectors to produce the final document embedding, or does it use another strategy?
For example, in my sentence (or document), would the resulting vector be:
(embedding of the token "the" + embedding of the token "grass" + ... + embedding of the token "red") / 8 (dimension-wise)?
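For concreteness, here is a rough sketch of the strategy I have in mind, assuming per-token vectors from TransformerWordEmbeddings (not necessarily what TransformerDocumentEmbeddings does internally):

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# embed each token separately, then average dimension-wise
word_embedding = TransformerWordEmbeddings("dbmdz/bert-base-german-uncased")
sentence = Sentence('The grass is green. The roses are red.')
word_embedding.embed(sentence)

token_vectors = torch.stack([token.embedding for token in sentence])
mean_vector = token_vectors.mean(dim=0)
print(mean_vector.shape)  # e.g. torch.Size([768])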

Maybe @alanakbik could answer?
Thank you!

teoML added the question (Further information is requested) label on Feb 21, 2024
helpmefindaname (Collaborator) commented

Hi @teoML
According to the BERT paper, each input sequence gets a [CLS] token whose output representation is used to represent the whole sequence for classification tasks.
The default behavior here is to use this as well.
However, if you want to change that, you can look at the cls_pooling parameter in the docs. The other pooling strategies apply the respective function to the embeddings of the individual tokens.
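For example, a minimal sketch of switching the pooling strategy (reusing the model and sentence from the question above):

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# default: the [CLS] token's hidden state is used as the document embedding
cls_embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")

# alternative: average over the individual token embeddings instead
mean_embedding = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", cls_pooling="mean"
)

sentence = Sentence('The grass is green. The roses are red.')
mean_embedding.embed(sentence)
print(sentence.embedding.shape)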

teoML (Author) commented Mar 4, 2024

Hi @helpmefindaname, thank you for answering my question!
I just tried changing the pooling type, but I get the same embedding for the same sentence:

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="mean")

embedding_2 = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="cls")

# create a sentence
sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')

embedding.embed(sentence1)
embedding_2.embed(sentence2)

a = sentence1.embedding
b = sentence2.embedding

print(a == b)

Each element of the resulting 768-dimensional comparison vector is True, which means there is no change in the embeddings of the two sentences (although the pooling type is different). Is that considered normal behaviour? I also tried max and again got equal vectors. It would be nice if you could help me out - my purpose is to create an embedding for a document which consists of around 100 sentences.

helpmefindaname (Collaborator) commented

The parameter is cls_pooling, not cls_polling, hence both embeddings are using "cls" per default.
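For illustration, a sketch of the call with the parameter name corrected; with the right keyword, the two pooling strategies should produce different vectors:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedding_mean = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="mean"
)
embedding_cls = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="cls"
)

sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')
embedding_mean.embed(sentence1)
embedding_cls.embed(sentence2)

# with the correct keyword, mean pooling and cls pooling are expected to differ
print(torch.equal(sentence1.embedding, sentence2.embedding))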
