This repository contains the code for the paper: "Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models".
Please follow the instructions in [Grounded Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) to set up the environment.
The pipeline consists of two steps:
- Building VPrompt: marking the key objects referenced by the question on the image as visual prompts (a hedged sketch follows this list).
- Using TPrompt to prompt Multimodal Large Language Models with the marked image for generating answers (see the second sketch below).
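
As a rough illustration of the VPrompt step, the sketch below draws numbered, labeled boxes on an image with Pillow. It assumes detections (boxes and phrases) have already been produced by a detector such as Grounded Segment Anything; the `build_vprompt` helper, the box format, and the example values are hypothetical, not this repository's actual interface.

```python
# Hypothetical sketch of the VPrompt step: overlay detector output on the image
# so the MLLM can refer to the marked objects. Box format and values are made up.
from PIL import Image, ImageDraw


def build_vprompt(image_path, detections, out_path):
    """Draw a numbered box for each detection: (x0, y0, x1, y1, phrase)."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for idx, (x0, y0, x1, y1, phrase) in enumerate(detections, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        # Tag each box with an index + phrase so the text prompt can reference it.
        draw.text((x0 + 4, y0 + 4), f"{idx}: {phrase}", fill="red")
    image.save(out_path)
    return out_path


if __name__ == "__main__":
    # Example detections, e.g. from Grounding DINO (coordinates are illustrative).
    dets = [(40, 60, 220, 300, "person"), (250, 120, 400, 280, "dog")]
    build_vprompt("example.jpg", dets, "example_vprompt.jpg")
```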
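
And a minimal sketch of the TPrompt step, assuming the marked image from above is sent to an OpenAI-style MLLM endpoint. The model name, prompt wording, and function name are assumptions for illustration, not the paper's actual TPrompt or configuration; the call requires an `OPENAI_API_KEY` in the environment.

```python
# Hypothetical sketch of the TPrompt step: send the marked image together with
# a text prompt that refers to the visual marks. Prompt wording is illustrative.
import base64

from openai import OpenAI


def ask_with_tprompt(image_path, question, model="gpt-4o"):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    # Assumed prompt template; the paper's actual TPrompt differs.
    tprompt = (
        "Key objects in the image are marked with numbered red boxes. "
        f"Using those marks, answer: {question}"
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": tprompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(ask_with_tprompt("example_vprompt.jpg", "What is the dog doing?"))
```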
We are currently organizing detailed evaluation code and usage tutorials. Please stay tuned for updates!