[Bug]: Inconsistent embeddings when passing in a list of inputs #3576
Hi @ntravis22 I have slightly adapted your example to give more visibility of the outputs.

Using flair:

```python
import itertools

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embedding = TransformerWordEmbeddings('roberta-base')

sentence1 = Sentence('The grass is green .')
sentence1_copy = Sentence('The grass is green .')
sentence2 = Sentence('The grass is blue .')

# Embed two sentences in one batch, then embed the copy of sentence1 alone.
embedding.embed([sentence1, sentence2])
embedding.embed(sentence1_copy)

# Pairwise squared distances between the first-token embeddings.
for s1, s2 in itertools.product([sentence1, sentence1_copy, sentence2], repeat=2):
    e1 = s1[0].embedding
    e2 = s2[0].embedding
    print(((e1 - e2) ** 2).sum())

print(sentence1[0].embedding[:5])
print(sentence1_copy[0].embedding[:5])
print(sentence2[0].embedding[:5])
```

I also recreated the script using the native transformers library.

Using transformers:

```python
import itertools

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

single_batch = tokenizer(
    ["The", "grass", "is", "green", "."],
    return_tensors="pt",
    is_split_into_words=True,
)
multi_batch = tokenizer(
    [["The", "grass", "is", "green", "."], ["The", "grass", "is", "blue", "."]],
    return_tensors="pt",
    is_split_into_words=True,
)

with torch.no_grad():
    # Take the first word-piece after the BOS token for each sentence.
    emb_sentence_1 = model(**single_batch).last_hidden_state[0, 1]
    r_m = model(**multi_batch).last_hidden_state
    emb_sentence_1_copy = r_m[0, 1]
    emb_sentence_2 = r_m[1, 1]

for e1, e2 in itertools.product([emb_sentence_1, emb_sentence_1_copy, emb_sentence_2], repeat=2):
    print(((e1 - e2) ** 2).sum())

print(emb_sentence_1[:5])
print(emb_sentence_1_copy[:5])
print(emb_sentence_2[:5])
```

The output for flair is:
The output for transformers is:
I don't know exactly why the difference in flair is slightly lower than the one in transformers, but I don't think that matters: your underlying issue still exists within the transformers library (and is likely just standard PyTorch behaviour). If you want to go down this rabbit hole, I would recommend contacting the transformers or PyTorch teams, but I personally think such slight inaccuracies are something one just needs to be aware of and handle practically.
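The batch-dependence ultimately comes down to floating-point arithmetic not being associative: a different batch size can cause the underlying kernels to group and accumulate sums in a different order. A minimal pure-Python sketch of the effect (nothing flair- or PyTorch-specific, just an illustration of the root cause):

```python
# Floating-point addition is not associative: the same numbers summed in a
# different grouping can differ in the last bits.
a = (0.1 + 0.2) + 0.3   # summed left-to-right
b = 0.1 + (0.2 + 0.3)   # same numbers, different grouping

print(a)        # 0.6000000000000001
print(b)        # 0.6
print(a == b)   # False

# Batched kernels may split reductions differently than single-input kernels,
# which is why the same sentence embeds slightly differently per batch size.
```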
Fair enough, thank you. Yes, I think we could, for example, add some noise to the embeddings during training if needed to make them more robust, or use some other method.
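For illustration, the noise idea above could look like the stdlib-only sketch below. The `add_noise` helper and the `noise_std` value are hypothetical, not flair API; the assumption is that the noise scale is chosen well above the batch-dependent float error (roughly 1e-6) but small relative to the embedding values:

```python
import random

def add_noise(embedding, noise_std=1e-3):
    """Return a copy of an embedding vector with small Gaussian noise added.

    `noise_std` is a hypothetical hyperparameter: larger than the
    batch-dependent floating-point jitter, but small enough not to move
    the embedding meaningfully.
    """
    return [x + random.gauss(0.0, noise_std) for x in embedding]

# Toy embedding vector for demonstration.
embedding = [0.25, -0.13, 0.78, 0.01]
noisy = add_noise(embedding)

# Each component moves only slightly, so a model trained on noisy inputs
# learns not to rely on exact float values.
print(noisy)
```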
Describe the bug
Calling `embeddings.embed([sentence1, sentence2])` gives different results than calling `embeddings.embed(sentence1); embeddings.embed(sentence2)`. The floating-point numbers in the embeddings are close but not identical.

The reason this came up for us is that we have a recommendations application. For retrieval, we fetch nearest neighbors using trained embeddings. Then, for ranking, we take the nearest neighbors and run predictions. For speed, we would like to pass the already computed embeddings as an input (since computing the embeddings takes the majority of the time), and this is easy to do with a small change to the model code (in our case, `TextPairClassifier`).

However, the slight differences in the embeddings (due to the bug described above) make our predictions differ when we do this.
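Since exact equality cannot be expected across batch sizes, one practical workaround is to compare cached and freshly computed embeddings with an explicit tolerance rather than `==`. A minimal stdlib-only sketch; the `embeddings_close` helper and its tolerance values are illustrative assumptions, not anything from flair:

```python
import math

def embeddings_close(e1, e2, rel_tol=1e-5, abs_tol=1e-6):
    """True if two embedding vectors agree within floating-point tolerance."""
    return len(e1) == len(e2) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(e1, e2)
    )

# Simulated "same sentence, different batch size" embeddings that differ
# only in the last digits, as in this bug report.
single = [0.123456, -0.654321, 0.999999]
batched = [0.1234561, -0.6543211, 0.9999991]

print(embeddings_close(single, batched))   # True
print(single == batched)                   # False
```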
To Reproduce
Expected behavior
This should print a Tensor full of `True` values, and indeed it does if we change the above code so that the first call to `embedding.embed` does not take a list (i.e. we change that line to `embedding.embed(sentence1)`).

Logs and Stack traces
No response
Screenshots
No response
Additional Context
No response
Environment
Versions:
- Flair: 0.14.0
- PyTorch: 2.4.0
- Transformers: 4.44.2
- GPU: False