First of all, thank you for the wonderful work! I really enjoyed reading it.
I am currently trying to reimplement your work and have some questions.
Is there any reference for the orthogonality regularization? Also, it seems like it is regularized to induce better expressibility for the row vectors rather than the column vectors, which is unorthodox. Is this because the goal of the regularization is actually the latent code rather than its output, whose value eventually becomes some kind of "prototype vector" for each class?
Is the KL divergence implementation correct? Since our prior and posterior are both Gaussian, there is an analytic formula to calculate it, but the current implementation uses a sample-based approach, and it doesn't seem to represent the KL divergence: it should be $\sum_z q(z)\,(\log q(z) - \log p(z))$, but it is currently the mean of $\log q(z) - \log p(z)$ over samples. Am I missing something?
Currently, $z_n$ is sampled K times from q, resulting in K $w_n$ vectors, and the mean is taken over the logits. Is this to stabilize training against the noise introduced by the reparameterization trick, or is there more to it than that?
Sorry for taking a while to get back to you: your question somehow slipped through my inbox.
I don't have a reference, but our motivation was relatively simple. The decoder maps the produced latent codes for each image to the logits for each class. As we would like our (whole) network to be able to distinguish even similar images, we'd like to keep the logits for different classes "far away" in some sense. By encouraging the decoder weights to be orthogonal, we hope to prevent the logits for different classes from being similar. As we backpropagate through the whole network, this may also encourage the latents to be different from each other.
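For illustration, here is a minimal NumPy sketch of one common way to encourage the rows of a weight matrix to be mutually orthogonal: penalize the off-diagonal entries of the row-wise cosine-similarity matrix. The function name and the exact normalization here are my own choices for the example, not necessarily what the regularizer in model.py does.

```python
import numpy as np

def row_orthogonality_penalty(w, eps=1e-8):
    """Penalty on correlation between rows of w.

    w: array of shape [num_rows, num_cols]; rows play the role of
    per-class decoder directions. The penalty is zero when all rows
    are mutually orthogonal.
    """
    # Cosine similarity between every pair of rows.
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    w_normalized = w / (norms + eps)
    corr = w_normalized @ w_normalized.T          # [num_rows, num_rows]
    # Drop the diagonal: the self-similarity of each row is always 1.
    off_diag = corr - np.eye(w.shape[0])
    # Mean squared off-diagonal similarity.
    return np.mean(off_diag ** 2)

# Orthogonal rows give zero penalty; parallel rows do not.
w_orth = np.array([[1.0, 0.0], [0.0, 1.0]])
w_par = np.array([[1.0, 0.0], [2.0, 0.0]])
print(row_orthogonality_penalty(w_orth))  # 0.0
print(row_orthogonality_penalty(w_par))   # 0.5
```

Adding such a term to the loss pushes the per-class rows apart, which is the "keep logits for different classes far away" intuition described above.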
Yes. Indeed, instead of using the closed-form formula for the KL between two Gaussians, we use a simple Monte Carlo estimator (cf. notes). Note that q is implicit (similar to the lack of f_X(x) in the notes): we sample from q, and different outcomes come with different probabilities. Similarly, the sum you are referring to is over the different outcomes of the random variable. In our case, even a single-sample estimate, $\log(q(\text{sample}) / \text{prior}(\text{sample}))$ with the sample drawn from q, would be unbiased; we average multiple of those to decrease the variance of the estimator.
If I remember correctly, when I was writing this I thought that a closed-form solution would be harder to backpropagate through back to the network weights, but now that I think about it, it should also work.
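To make the estimator discussion concrete, here is a small NumPy comparison (an illustration of the math above, not the code in model.py) between the Monte Carlo estimate, i.e. the average of $\log q(z) - \log p(z)$ over samples $z \sim q$, and the closed-form KL between two diagonal Gaussians.

```python
import numpy as np

def gaussian_log_prob(z, mean, std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    var = std ** 2
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (z - mean) ** 2 / (2 * var), axis=-1)

def kl_monte_carlo(q_mean, q_std, p_mean, p_std, num_samples=10000, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) over z ~ q."""
    rng = np.random.default_rng(seed)
    # Reparameterized samples from q: mean + std * eps.
    eps = rng.standard_normal((num_samples, q_mean.shape[-1]))
    z = q_mean + q_std * eps
    return np.mean(gaussian_log_prob(z, q_mean, q_std) - gaussian_log_prob(z, p_mean, p_std))

def kl_closed_form(q_mean, q_std, p_mean, p_std):
    """Analytic KL(q || p) between two diagonal Gaussians."""
    return np.sum(np.log(p_std / q_std)
                  + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2 * p_std ** 2)
                  - 0.5)

q_mean, q_std = np.array([0.5, -0.3]), np.array([0.8, 1.2])
p_mean, p_std = np.zeros(2), np.ones(2)
print(kl_monte_carlo(q_mean, q_std, p_mean, p_std))  # noisy, converges to the value below
print(kl_closed_form(q_mean, q_std, p_mean, p_std))  # exact, ~0.25
```

The sample-based version is an unbiased estimator of the same quantity; averaging more samples only reduces its variance.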
The presence of the relation network makes it a little less clear than it could be, but the high-level intuition is that the encoder maps each of the NK images to a code, then we average those to get N $z_n$'s (one per class). We need to get back NK weights for the softmax classifier (one per image, as we will be multiplying those by the $x_n$ ("image embeddings")), so we sample K times from each class's Gaussian.
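Here is a shape-level NumPy sketch of that sampling and averaging. The dimension names and sizes are made up for the example, and the encoder / relation network / decoder are replaced by random arrays; it is only meant to show the bookkeeping (K weight samples per class via the reparameterization trick, logits averaged over samples), not the repository code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N classes, K shots per class, embedding dimension D.
N, K, D = 5, 3, 16

# Stand-ins for per-image embeddings x_n and the per-class Gaussians
# (in the real model these come from the encoder / relation network).
x = rng.standard_normal((N * K, D))           # N*K image embeddings
z_mean = rng.standard_normal((N, D))          # per-class mean (averaged codes)
z_std = np.abs(rng.standard_normal((N, D)))   # per-class std

# K samples of the N per-class classifier weight vectors (N*K in total),
# drawn with the reparameterization trick: mean + std * eps.
eps = rng.standard_normal((K, N, D))
w_samples = z_mean[None] + z_std[None] * eps  # [K, N, D]

# Logits of every image against every class, computed once per weight sample,
# then averaged over the K samples to reduce the sampling noise.
logits_per_sample = np.einsum('id,knd->kin', x, w_samples)  # [K, N*K, N]
logits = logits_per_sample.mean(axis=0)                     # [N*K, N]
print(logits.shape)  # (15, 5)
```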
Hope that makes it clear. Feel free to follow up with any further questions.
https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L40-L41
https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L269-L274
https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L250-L254
Thank you!