
[Question]: How does .embed(Sentence) work under the hood? #3409

Open
teoML opened this issue Feb 21, 2024 · 3 comments
Labels
question (Further information is requested)

Comments

teoML commented Feb 21, 2024

Question

Hi,
Can someone explain how the following code works under the hood:

 # imports needed to run the snippet
 from flair.data import Sentence
 from flair.embeddings import TransformerDocumentEmbeddings

 # load a BERT model
 embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")

 # create a sentence
 sentence = Sentence('The grass is green. The roses are red.')

 # embed the sentence as a whole document
 embedding.embed(sentence)

 print(sentence.embedding)

When we call embed on a Sentence object, does it transform each token into a vector and then average those vectors to produce the final document embedding, or does it use another strategy?
For example, in my sentence (or document), would the resulting vector be:
(embedding of the token "the" + embedding of the token "grass" + ... + embedding of the token "red") / 8 (dimension-wise)?
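For concreteness, here is a rough sketch of the strategy I have in mind, assuming per-token vectors from TransformerWordEmbeddings (not necessarily what TransformerDocumentEmbeddings does internally):

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# embed each token separately, then average dimension-wise
word_embedding = TransformerWordEmbeddings("dbmdz/bert-base-german-uncased")
sentence = Sentence('The grass is green. The roses are red.')
word_embedding.embed(sentence)

token_vectors = torch.stack([token.embedding for token in sentence])
mean_vector = token_vectors.mean(dim=0)
print(mean_vector.shape)  # e.g. torch.Size([768])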

Maybe @alanakbik could answer?
Thank you!

teoML added the question (Further information is requested) label on Feb 21, 2024
helpmefindaname (Collaborator) commented

Hi @teoML
According to the BERT paper, each input sequence gets a [CLS] token whose output representation is used to represent the whole sequence for classification tasks.
The default behavior here is to use this as well.
However, if you want to change that, you can look at the cls_pooling parameter in the docs. The other pooling strategies apply the respective function to the embeddings of the individual tokens.
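For example, a minimal sketch of switching the pooling strategy (reusing the model and sentence from the question above):

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# default: the [CLS] token's hidden state is used as the document embedding
cls_embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased")

# alternative: average over the individual token embeddings instead
mean_embedding = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", cls_pooling="mean"
)

sentence = Sentence('The grass is green. The roses are red.')
mean_embedding.embed(sentence)
print(sentence.embedding.shape)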

teoML (Author) commented Mar 4, 2024

Hi @helpmefindaname, thank you for answering my question!
I just tried changing the pooling type, but I get the same embedding for the same sentence:

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedding = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="mean")

embedding_2 = TransformerDocumentEmbeddings("dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_polling="cls")

# create a sentence
sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')

embedding.embed(sentence1)
embedding_2.embed(sentence2)

a = sentence1.embedding
b = sentence2.embedding

print(a == b)

Each element of the resulting 768-dimensional comparison vector is True, which means there is no change in the embeddings of the two sentences (although the pooling type is different). Is that considered normal behaviour? I also tried max and again got equal vectors. It would be nice if you could help me out - my purpose is to create an embedding for a document which consists of around 100 sentences.

helpmefindaname (Collaborator) commented

The parameter is cls_pooling, not cls_polling, hence both embeddings are using "cls" per default.
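For illustration, a sketch of the call with the parameter name corrected; with the right keyword, the two pooling strategies should produce different vectors:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedding_mean = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="mean"
)
embedding_cls = TransformerDocumentEmbeddings(
    "dbmdz/bert-base-german-uncased", allow_long_sentences=True, cls_pooling="cls"
)

sentence1 = Sentence('The grass is green. The roses are red.')
sentence2 = Sentence('The grass is green. The roses are red.')
embedding_mean.embed(sentence1)
embedding_cls.embed(sentence2)

# with the correct keyword, mean pooling and cls pooling are expected to differ
print(torch.equal(sentence1.embedding, sentence2.embedding))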
