- Supported higher-resolution input using `google/siglip-so400m-patch14-384` as the vision encoder for more detailed visual understanding (see the loading sketch after this list).
- Changed `capacity_factor` to 1.5 to support a stronger MoE-LLaVA (illustrated in the second sketch below).
- Added MME benchmark results and the evaluation pipeline.
- Improved docs.
- Fixed typos.
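
For reference, a minimal sketch of loading the new vision encoder with Hugging Face `transformers`; this is an illustrative snippet, not the repo's exact integration code:

```python
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

MODEL_ID = "google/siglip-so400m-patch14-384"

# Load the image processor (resizes inputs to 384x384) and the vision tower.
processor = SiglipImageProcessor.from_pretrained(MODEL_ID)
vision_tower = SiglipVisionModel.from_pretrained(MODEL_ID)

image = Image.new("RGB", (640, 480))  # placeholder; use a real image in practice
inputs = processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)

# Patch features fed to the multimodal projector: shape (1, 729, 1152)
# for this checkpoint (27x27 patches at 384px input, hidden size 1152).
features = outputs.last_hidden_state
```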
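For context on the second item: in a routed MoE layer, `capacity_factor` caps how many tokens each expert may process per batch, and overflow tokens beyond that cap are dropped by the router. Raising it from 1.0 to 1.5 lets each expert absorb more routed tokens. A hypothetical illustration following the common top-k MoE convention (the function name and formula are illustrative, not necessarily this repo's exact code):

```python
def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.5) -> int:
    """Tokens each expert may process before overflow tokens are dropped."""
    return int(capacity_factor * num_tokens / num_experts)

# With 4 experts and a 1024-token batch, each expert accepts up to
# 384 tokens at capacity_factor=1.5 versus 256 at 1.0, so fewer
# routed tokens are dropped at the cost of extra compute.
print(expert_capacity(1024, 4, 1.5))  # 384
print(expert_capacity(1024, 4, 1.0))  # 256
```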
We hope to draw the community's attention to the fact that large vision-language models can also be sparsified, and can even perform better after sparsification.