
Questions on implementation #13

Open
hiwonjoon opened this issue Apr 14, 2020 · 1 comment
@hiwonjoon

First of all, thank you for the wonderful work! I really enjoyed reading it.

I am currently trying to reimplement your work and got some questions.

  1. Is there a reference for the orthogonality regularization? Also, it seems the regularization is applied to induce better expressibility for the row vectors rather than the column vectors, which is unorthodox. Is this because the goal of the regularization is actually the latent code rather than its output, whose values eventually become a kind of "prototype vector" for each class?

https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L40-L41
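For concreteness, here is a minimal numpy sketch of the kind of penalty being discussed (not the repository's actual code; the function name and normalization are my own). It penalizes the off-diagonal entries of the cosine-similarity matrix of either the rows or the columns of a weight matrix, which is where the row-versus-column distinction in the question comes from:

```python
import numpy as np

def orthogonality_penalty(w, penalize_rows=True):
    """Illustrative penalty (not the exact leo code): pushes the chosen
    set of vectors of `w` toward mutual orthogonality."""
    vectors = w if penalize_rows else w.T
    # Cosine-similarity matrix between the chosen vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8
    cosine = (vectors / norms) @ (vectors / norms).T
    # Penalize deviation from the identity, i.e. off-diagonal similarity.
    k = cosine.shape[0]
    return np.sum((cosine - np.eye(k)) ** 2) / (k * k)

# Example: a decoder-like weight matrix mapping latent dim 4 -> output dim 16.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))
print(orthogonality_penalty(w), orthogonality_penalty(w, penalize_rows=False))
```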

  2. Is the KL divergence implementation correct? Since the prior and posterior are both Gaussian, there is an analytic formula for it, but the current implementation uses a sample-based approach, and it does not look like the KL divergence to me: it should be $\sum_z q(z)\,(\log q(z) - \log p(z))$, but it is currently the mean of $\log q(z) - \log p(z)$ over samples. Am I missing something?

https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L269-L274
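For reference, a small numpy/scipy sketch (illustrative only, with made-up parameters, not the repository's code) comparing the closed-form KL between two diagonal Gaussians with the sample-based estimate mean(log q(z) - log p(z)) over z drawn from q; the two agree in expectation:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Diagonal Gaussians q (posterior) and p (prior), illustrative parameters.
mu_q, sigma_q = np.array([0.5, -0.2]), np.array([0.8, 1.2])
mu_p, sigma_p = np.zeros(2), np.ones(2)

# Closed-form KL(q || p) for diagonal Gaussians, summed over dimensions.
kl_analytic = np.sum(
    np.log(sigma_p / sigma_q)
    + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2 * sigma_p**2)
    - 0.5
)

# Monte-Carlo estimate: draw z ~ q and average log q(z) - log p(z).
z = rng.normal(mu_q, sigma_q, size=(100_000, 2))
log_q = norm.logpdf(z, mu_q, sigma_q).sum(axis=1)
log_p = norm.logpdf(z, mu_p, sigma_p).sum(axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_analytic, kl_mc)  # the two values should be close
```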

  3. Currently, $z_n$ is sampled K times from q, resulting in K $w_n$ vectors, and the mean is then taken over the logits. Is this to stabilize training against the noise introduced by the reparameterization trick, or is there more to it than that?

https://github.com/deepmind/leo/blob/de9a0c2a77dd7a42c1986b1eef18d184a86e294a/model.py#L250-L254
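Purely for illustration (shapes, names, and the linear decoder are my own stand-ins, not the repository's code), this is my reading of the sampling-and-averaging step being asked about: sample K latent codes per class, decode each into a classifier weight vector, compute logits per sample, and average the logits:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, latent_dim, embed_dim = 5, 3, 64, 640   # 5-way, 3 samples per class (illustrative)

# Per-class posterior parameters (as if produced by the encoder / relation net).
mu_z = rng.normal(size=(N, latent_dim))
sigma_z = np.abs(rng.normal(size=(N, latent_dim)))

# Hypothetical linear decoder from latent code to classifier weights.
decoder = rng.normal(size=(latent_dim, embed_dim))

x = rng.normal(size=(N * K, embed_dim))       # image embeddings

# Sample K codes per class and decode each into a weight vector w_n.
z = mu_z[:, None, :] + sigma_z[:, None, :] * rng.normal(size=(N, K, latent_dim))
w = z @ decoder                               # shape (N, K, embed_dim)

# Logits of every image against every sampled weight vector,
# then averaged over the K samples (not over the weights themselves).
logits_per_sample = np.einsum('ie,nke->ink', x, w)   # (N*K, N, K)
logits = logits_per_sample.mean(axis=-1)             # (N*K, N)
print(logits.shape)
```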

Thank you!

sygi commented Aug 28, 2020

Sorry for taking a while to get back to you: your question somehow slipped through my inbox.

  1. I don't have a reference, but our motivation was relatively simple. The decoder maps the produced latent codes for each image to the logits for each class. Since we would like our (whole) network to be able to distinguish even similar images, we would like to keep the logits for different classes "far away" in some sense. By encouraging the decoder weights to be orthogonal, we hope to prevent the logits for different classes from being similar. As we backpropagate through the whole network, this may also encourage the latents to be different from each other.
  2. Yes. Indeed, instead of using the closed-form formula for the KL of two Gaussians, we use a simple Monte-Carlo estimator (cf. notes). Note that q enters implicitly (similarly to the lack of f_X(x) in the notes): we sample from q, so different outcomes come with different probabilities. Likewise, the sum you are referring to is over the different outcomes of the random variable. In our case, even a single-sample estimate, log(q(sample) / prior(sample)), would be unbiased; we average multiple of those to decrease the variance of the estimator.

If I remember correctly, when I was writing this, I thought that if I used a closed-form solution, it would be harder to backpropagate through it back to the network weights, but now that I think about it, it should also work.

  3. The presence of the relation network makes it a little less clear than it could be, but the high-level intuition is that the encoder maps each of the NK images to a code, then we average those to get N z_n's (one per class). We need to get back NK weights for the softmax classifier (one per image, as we will be multiplying them by the x_n "image embeddings"), so we sample K times from each class's Gaussian.
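A hedged numpy sketch of the shape bookkeeping described above, with stand-in linear maps for the encoder and decoder (the actual model uses a relation network and learned variances, which are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, embed_dim, latent_dim = 5, 3, 640, 64   # illustrative sizes

x = rng.normal(size=(N, K, embed_dim))        # N*K image embeddings, grouped by class

# Stand-in encoder: embedding -> latent code, then average over the K shots
# of each class to obtain one z_n per class.
encoder = rng.normal(size=(embed_dim, latent_dim))
codes = x @ encoder                           # (N, K, latent_dim)
z = codes.mean(axis=1)                        # (N, latent_dim), one per class

# Sample K times from each class's Gaussian (unit variance here for brevity)
# and decode back, recovering N*K weight vectors for the softmax classifier.
decoder = rng.normal(size=(latent_dim, embed_dim))
z_samples = z[:, None, :] + rng.normal(size=(N, K, latent_dim))
w = z_samples @ decoder                       # (N, K, embed_dim)
print(z.shape, w.shape)
```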

Hope that makes it clear. Feel free to follow up with any further questions.
