
[Bug]: Inconsistent embeddings when passing in a list of inputs #3576

Closed
ntravis22 opened this issue Dec 4, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

@ntravis22

Describe the bug

Calling embeddings.embed([sentence1, sentence2]) gives different results than calling embeddings.embed(sentence1); embeddings.embed(sentence2).

The floating point numbers in the embeddings are close but not the same. The reason this came up for us is that we have a recommendations application. For retrieval, we fetch nearest neighbors using trained embeddings; then for ranking, we take the nearest neighbors and do predictions. For speed, we would like to pass the already computed embeddings as an input (since computing the embeddings takes the majority of the time), and this is easy to do with a small change to the model code (in our case, TextPairClassifier).

However, the slight differences in the embeddings (due to the bug described above) make our predictions differ when we do this.

To Reproduce

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('roberta-base')

# create a sentence
sentence1 = Sentence('The grass is green .')
sentence1_copy = Sentence('The grass is green .')
sentence2 = Sentence('The grass is blue .')

# embed words together
embedding.embed([sentence1, sentence2])
embedding.embed(sentence1_copy)
print(sentence1[0].embedding == sentence1_copy[0].embedding)

Expected behavior

This should print a tensor full of True values, and indeed it does if we change the above code so that the first call to embedding.embed does not use a list (i.e. we change that line to embedding.embed(sentence1)).

Logs and Stack traces

No response

Screenshots

No response

Additional Context

No response

Environment

Versions:
Flair: 0.14.0
Pytorch: 2.4.0
Transformers: 4.44.2
GPU: False

@helpmefindaname
Collaborator

Hi @ntravis22

I have slightly adapted your example to give more visibility into the outputs:

Using flair
import itertools

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embedding = TransformerWordEmbeddings('roberta-base')

sentence1 = Sentence('The grass is green .')
sentence1_copy = Sentence('The grass is green .')
sentence2 = Sentence('The grass is blue .')

embedding.embed([sentence1, sentence2])
embedding.embed(sentence1_copy)

for s1, s2 in itertools.product([sentence1, sentence1_copy, sentence2], repeat=2):
    e1 = s1[0].embedding
    e2 = s2[0].embedding
    print(((e1-e2)**2).sum())

print(sentence1[0].embedding[:5])
print(sentence1_copy[0].embedding[:5])
print(sentence2[0].embedding[:5])

I also recreated the script using the transformers library directly:

Using transformers
import itertools

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"


model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

single_batch = tokenizer(["The", "grass", "is", "green", "."], return_tensors="pt", is_split_into_words=True)
multi_batch = tokenizer([["The", "grass", "is", "green", "."], ["The", "grass", "is", "blue", "."]], return_tensors="pt", is_split_into_words=True)
with torch.no_grad():
    emb_sentence_1 = model(**single_batch).last_hidden_state[0, 1]
    r_m = model(**multi_batch).last_hidden_state
emb_sentence_1_copy = r_m[0, 1]
emb_sentence_2 = r_m[1, 1]

for e1, e2 in itertools.product([emb_sentence_1, emb_sentence_1_copy, emb_sentence_2], repeat=2):
    print(((e1-e2)**2).sum())

print(emb_sentence_1[:5])
print(emb_sentence_1_copy[:5])
print(emb_sentence_2[:5])

The output for flair is:

tensor(0., device='cuda:0')
tensor(5.8900e-11, device='cuda:0')
tensor(1.6990, device='cuda:0')
tensor(5.8900e-11, device='cuda:0')
tensor(0., device='cuda:0')
tensor(1.6990, device='cuda:0')
tensor(1.6990, device='cuda:0')
tensor(1.6990, device='cuda:0')
tensor(0., device='cuda:0')
tensor([-0.0082, -0.0274,  0.0205,  0.2279,  1.0304], device='cuda:0')
tensor([-0.0082, -0.0274,  0.0205,  0.2279,  1.0304], device='cuda:0')
tensor([ 0.0206, -0.0464,  0.0643,  0.2451,  0.9354], device='cuda:0')

The output for transformers is:

tensor(0.)
tensor(6.7273e-11)
tensor(1.6990)
tensor(6.7273e-11)
tensor(0.)
tensor(1.6990)
tensor(1.6990)
tensor(1.6990)
tensor(0.)
tensor([-0.0082, -0.0274,  0.0205,  0.2279,  1.0304])
tensor([-0.0082, -0.0274,  0.0205,  0.2279,  1.0304])
tensor([ 0.0206, -0.0464,  0.0643,  0.2451,  0.9354])

I don't know exactly why the difference in flair is slightly smaller than the one in transformers, but I don't think that matters: your underlying issue still exists within the transformers library itself (and is likely just standard pytorch behaviour).

If you want to go down this rabbit hole, I would recommend contacting the transformers or pytorch teams, but I personally think such slight inaccuracies are something one just needs to be aware of and handle pragmatically.
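
For example, one practical option (a minimal sketch, not something flair provides; the atol value is an arbitrary placeholder) is to compare embeddings with a small tolerance instead of exact equality:

import torch

def embeddings_match(e1: torch.Tensor, e2: torch.Tensor, atol: float = 1e-5) -> bool:
    # e1 and e2 are two embeddings of the same token, computed once in a batch
    # and once individually (e.g. sentence1[0].embedding and
    # sentence1_copy[0].embedding from the script above).
    # Exact equality fails because of tiny batching-related float differences,
    # so an approximate comparison with a small absolute tolerance is used instead.
    return torch.allclose(e1, e2, atol=atol)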

@ntravis22
Author

Fair enough, thank you. Yes, for example we could add some noise to the embeddings during training if needed to make the model more robust to this, or use some other method.
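
For illustration, a rough sketch of that noise idea (the noise scale and where it is applied are just placeholders, nothing we have validated):

import torch

def add_embedding_noise(embeddings: torch.Tensor, std: float = 1e-3) -> torch.Tensor:
    # Add small Gaussian noise to precomputed embeddings during training so the
    # downstream classifier becomes less sensitive to tiny numeric differences.
    # The std here is an arbitrary placeholder, not a tuned value.
    return embeddings + torch.randn_like(embeddings) * std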
