Discrepancy between Hugging Face and fashion-clip #14
Comments
I am surprised because both methods seem to use the same transformation, but I'll take a look! Thanks!!
This looks like a similar issue:
The probs generated here and in the Hugging Face hosted inference UI seem to be different: https://huggingface.co/patrickjohncyh/fashion-clip. I believe both should ideally output the same probability for the same input image? Are they both using the latest v2 models? Both of the above methods wrongly classify the image as 'drawstring waist', but it is correctly identified by the HF hosted inference API.
Hi @anilsathyan7! I am not sure how the UI computes the score; in the meantime, I have run your example on both the original HF API and our internal wrapper, and the results are more or less the same. Take a look:
import requests
from io import BytesIO
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
# load the model and processor from the Hub (standard HF CLIP setup)
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
img_url = "https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg"
image = requests.get(img_url).content
image = Image.open(BytesIO(image))
inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
print(probs)
import numpy as np
import torch
# fclip is assumed to be an already-initialized FashionCLIP instance from this repo
test_captions = ['paperbag waist', 'waist band', 'drawstring waist']
test_img_path = 'paperbag_waist.jpg'
images = [test_img_path]
texts = test_captions
# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)
# we normalize the embeddings to unit norm (so that we can use the dot product instead of cosine similarity for comparisons)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
# note that we need to include logit scaling to get the same output the default Hugging Face model gives us
logit_scaling = fclip.model.logit_scale.exp().item()
print(torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1))
These are reasonably similar scores.
@vinid OK, that's strange. The hosted API clearly shows the image as 'paperbag waist' with a probability of 0.943. That's a large difference, and the 'Hosted inference API' output is actually the correct one. What could be the reason for this?
It's an effect of prompting: by default, the pipeline component (used by the UI) wraps each label in the template "This is a photo of {}." See here.
test_img_path = 'paperbag_waist.jpg'
test_captions = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']
images = [test_img_path]
texts = test_captions
# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)
# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
logit_scaling = fclip.model.logit_scale.exp().item()
print(torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1))
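The hosted widget presumably goes through the transformers zero-shot-image-classification pipeline, which applies this template for you. A minimal sketch of that route (standard transformers pipeline API only, not this repo's wrapper):
from transformers import pipeline
classifier = pipeline("zero-shot-image-classification", model="patrickjohncyh/fashion-clip")
# each candidate label is wrapped in the hypothesis template before being scored against the image
results = classifier('paperbag_waist.jpg',
                     candidate_labels=['paperbag waist', 'waist band', 'drawstring waist'],
                     hypothesis_template="This is a photo of {}.")
print(results)  # a list of {'score': ..., 'label': ...} dicts, highest score first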
(You have some typos in your screenshot; you should remove the stray ' characters.)
@vinid Thanks a lot ...
Great find! I was just thinking the same thing and was pleasantly surprised to stumble onto this insightful thread.
Hello there,
I was looking into the difference in performance between the Hugging Face implementation of FashionCLIP and this repo, which wraps around the former.
I noticed there's a discrepancy between the image embeddings produced by the two approaches. Having dug into it, it looks like the cause is that in this repo the images are put into a Hugging Face Dataset here before being passed to the model.
The below code illustrates the discrepancy:
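Roughly, the comparison looks like this (a sketch assuming the standard transformers and datasets APIs; the model/processor loading and the Dataset construction here are my assumptions, and the FashionCLIP initialization follows the repo README):
import numpy as np
from PIL import Image as PILImage
from datasets import Dataset, Image
from transformers import CLIPModel, CLIPProcessor
from fashion_clip.fashion_clip import FashionCLIP

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
fclip = FashionCLIP('fashion-clip')
img_path = 'paperbag_waist.jpg'

# 1) Hugging Face without a Dataset: open the image directly with PIL
pil_img = PILImage.open(img_path)
inputs = processor(images=pil_img, return_tensors="pt")
hf_wo_embeddings = model.get_image_features(**inputs).detach().numpy()

# 2) Hugging Face with a Dataset: let datasets decode the image first
ds = Dataset.from_dict({"image": [img_path]}).cast_column("image", Image())
inputs = processor(images=ds[0]["image"], return_tensors="pt")
hf_ds_embeddings = model.get_image_features(**inputs).detach().numpy()

# 3) this repo's wrapper
fc_embeddings = fclip.encode_images([img_path], batch_size=32)

print(np.abs(hf_ds_embeddings - fc_embeddings).max())  # effectively zero: the Dataset route matches fashion-clip
print(np.abs(hf_wo_embeddings - fc_embeddings).max())  # small but non-zero difference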
In the above code, the embeddings produced by passing the images through a Dataset, hf_ds_embeddings, are the same as those produced by this repo, fc_embeddings. The embeddings produced without using a Dataset, hf_wo_embeddings, are slightly different. I imagine that putting the images into the Dataset implicitly applies some transformation or pre-processing.
Just wanted to flag this, thanks!