Intuition on Visual Projection #817
adrielkuek started this conversation in General
Dear @haotian-liu and team, thanks for the wonderful work. I'm always excited to see new updates to the LLaVA family, including the latest LLaVA-Plus model. I would like to understand the team's perspective on the choice of the current visual projection strategy. As we all know, one of the critical components of effective visual understanding with LLMs lies in the multimodal interaction space. In the LLaVA papers, the authors argue that a simple MLP projection on top of a visual encoder is easy to train and effective at bridging the modality gap.

Has the team done any ablation experiments comparing visual projection architectures, e.g. popular frameworks such as Flamingo-style interleaved cross-attention, the BLIP-2 Q-Former, soft-prompt prefixing, or the recently introduced Fuyu-8B, which does away with the visual encoder entirely and inserts visual patch tokens directly into the transformer to scale resolution? In light of all these emerging strategies, it would be insightful to understand the team's considerations and arguments for the chosen strategy, and how you see VLMs evolving in the near future.
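For concreteness, here is a minimal sketch (my own, not the LLaVA code) of the kind of MLP projection I am referring to, assuming a CLIP-style patch encoder feeding a 4096-dim LLM embedding space; the class name, dimensions, and activation choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    # Hypothetical two-layer MLP projector, roughly in the spirit of what the
    # LLaVA papers describe; names and dimensions are illustrative, not copied
    # from the actual repository.
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. ViT patch outputs.
        # Returns (batch, num_patches, llm_dim) "visual tokens" that are
        # concatenated with the text token embeddings before the LLM forward pass.
        return self.proj(patch_features)

projector = MLPProjector()
dummy_patches = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from a 336px image
visual_tokens = projector(dummy_patches)
print(visual_tokens.shape)                 # torch.Size([1, 576, 4096])
```

The appeal, as I understand it, is that this keeps all the cross-modal interaction inside the LLM's own attention layers, whereas Q-Former or Flamingo-style designs move part of that interaction into a separate resampling module.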
Thanks very much for the food for thought.