Thanks for your wonderful work. I've been working with this repository recently and have learned a lot so far.

I have two questions about the code.

1. The LLaMA-Adapter V2 code is different from the paper: the forward function does not perform early fusion.

https://github.com/OpenGVLab/LLaMA-Adapter/blob/a50befee3fdde8a08ca346b2ec70407e59ff6536/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L152C8-L172C46

I think the code below has to change as follows:

https://github.com/OpenGVLab/LLaMA-Adapter/blob/a50befee3fdde8a08ca346b2ec70407e59ff6536/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L163C9-L164C45

```python
# Before change
# for layer in self.llama.layers[:-1 * self.query_layer]:
#     h = layer(h, 0, freqs_cis, mask)

# After change
for layer in self.llama.layers[:-1 * self.query_layer]:
    h = layer(h, 0, freqs_cis, mask, visual_query)
```
Could you please explain the reason for this approach?
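To make the question concrete, here is a toy, runnable sketch of the two fusion schedules as I understand them. `ToyLayer`, the dimensions, and the way the visual query is mixed in are placeholders of mine, not the repository's actual classes:

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for one LLaMA transformer block; the visual query is optional."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h, visual_query=None):
        h = self.proj(h)
        if visual_query is not None:
            # Crude placeholder for the adapter attention over the visual query.
            h = h + visual_query.mean(dim=1, keepdim=True)
        return h

layers = nn.ModuleList(ToyLayer(16) for _ in range(8))
query_layer = 4                        # number of final layers given the visual query
h = torch.randn(2, 5, 16)              # (batch, seq_len, dim)
visual_query = torch.randn(2, 10, 16)  # (batch, query_len, dim)

# Late fusion, as I read the current code: the early layers never see the
# visual query; only the last `query_layer` layers receive it.
for layer in layers[:-1 * query_layer]:
    h = layer(h)
for layer in layers[-1 * query_layer:]:
    h = layer(h, visual_query)

# Early fusion, as I understood the paper: every layer would receive it.
# for layer in layers:
#     h = layer(h, visual_query)
```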
2. In the process of extracting the visual prompt, I assumed that the visual encoder produces it directly. However, I observed that the visual embedding is first attended through a self-attention-like module, and then only the first 10 elements of the attended embedding are used as the visual prompt.

https://github.com/OpenGVLab/LLaMA-Adapter/blob/a50befee3fdde8a08ca346b2ec70407e59ff6536/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L135C1-L149C28

Could you please explain the reason for this approach?
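For reference, here is how I currently read that extraction path, as a rough runnable sketch; the block type, sizes, and variable names are my own guesses for illustration, not the exact modules in llama_adapter.py:

```python
import torch
import torch.nn as nn

batch, num_patches, dim, query_len = 2, 257, 768, 10

# Learnable query tokens that will become the visual prompt.
query_tokens = nn.Parameter(torch.randn(query_len, dim))

# A self-attention-style block over [queries; image features].
block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

image_feats = torch.randn(batch, num_patches, dim)  # e.g. projected CLIP features

# Prepend the queries and let them attend to the image tokens...
x = torch.cat([query_tokens.unsqueeze(0).expand(batch, -1, -1), image_feats], dim=1)
x = block(x)

# ...then keep only the first query_len positions: after attention these
# slots have pooled image information, so they serve as the visual prompt.
visual_prompt = x[:, :query_len, :]
print(visual_prompt.shape)  # torch.Size([2, 10, 768])
```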