Hi. Thanks for your work. This is a question related to the paper, not the code. It may be a stupid question but I would love to hear an explanation from you.
Your main reason for using a momentum encoder is consistency: if, instead of updating the key encoder as a slow moving average of the query encoder, you simply used a copy of the query encoder, the features of previous mini-batches stored in the queue would become inconsistent, because the query encoder changes rapidly between updates. The momentum encoder keeps the inconsistency among these different mini-batch features small.
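For concreteness, the slow-moving-average update you describe can be sketched as follows (a minimal illustration, not the paper's code; the encoders are represented here as flat lists of parameter values, and 0.999 is the momentum coefficient reported in the paper):

```python
M = 0.999  # momentum coefficient; values close to 1 mean a slowly drifting key encoder

def momentum_update(key_params, query_params, m=M):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Toy parameters: the key encoder moves only a tiny step toward the query encoder.
theta_q = [1.0, 2.0]
theta_k = [0.0, 0.0]
theta_k = momentum_update(theta_k, theta_q)  # -> [0.001, 0.002]
```

Because each step moves the key encoder only a fraction (1 - m) toward the query encoder, embeddings computed at nearby training steps stay nearly consistent with each other.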
My question is: why can't you simply store the input image views (before running them through the key encoder) as keys, rather than the features of those views, and then on every update (when you need the negative samples) run the stored views through a copy of the current query encoder? All of the output features you get would be consistent as well.
I am not part of the Facebook AI team, but maybe I can answer your question.
So you suggest using a single network and running it on a large batch of images? I think you are describing SimCLR, but correct me if I'm wrong. The main reason for using the FIFO queue is that storing 65k embeddings is very cheap. To get a new batch of negatives, you only need to run a frozen network without computing gradients; you run it once per sample, and the results take very little memory (2048 × 4 bytes per sample). Storing images, on the other hand, requires a lot of memory. Even if you generate them on the fly, as SimCLR does, you need a very large batch size, and therefore a lot of VRAM and computation, to get an adequate number of negative samples. So to sum up, an embedding queue is computationally efficient.