Runsen Xu
Zhiwei Huang
Tai Wang
Yilun Chen
Jiangmiao Pang*
Dahua Lin
The Chinese University of Hong Kong, Shanghai AI Laboratory, Zhejiang University
- [2024-10-17] We release the VLM-Grounder paper and all the code! We are looking for self-motivated students to conduct research on VLM (agent) for 3D perception. Please send an email to [email protected] with your CV if you are interested! 🔥
- [2024-09-04] VLM-Grounder has been accepted by CoRL 2024! 🎉
First, clone the repository with submodules:
git clone --recurse-submodules https://github.com/OpenRobotLab/VLM-Grounder.git
cd VLM-Grounder
To ensure compatibility, please use the following specific versions of the submodules:
- Grounding-DINO-1.5-API @ 414e737
- pats @ 98d2e03
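One way to confirm that the submodules are at the pinned commits is to compare the output of git submodule status against the hashes listed above. The sketch below is illustrative and not part of the repository: the function names are hypothetical, and the submodule paths assume the 3rdparty/ layout used elsewhere in this README.

```python
import subprocess

# Expected short commit hashes, taken from the list above.
# The submodule paths are an assumption based on the 3rdparty/ layout.
EXPECTED_COMMITS = {
    "3rdparty/Grounding-DINO-1.5-API": "414e737",
    "3rdparty/pats": "98d2e03",
}


def parse_submodule_status(status_text):
    """Map submodule path -> commit hash from `git submodule status` output."""
    commits = {}
    for line in status_text.splitlines():
        if not line.strip():
            continue
        # The first character of each line is a status flag (' ', '-', '+', or 'U').
        fields = line[1:].split()
        if len(fields) >= 2:
            commits[fields[1]] = fields[0]
    return commits


def find_mismatches(status_text, expected=EXPECTED_COMMITS):
    """Return {path: actual_hash_or_None} for submodules not at the expected commit."""
    actual = parse_submodule_status(status_text)
    return {
        path: actual.get(path)
        for path, short_hash in expected.items()
        if not (actual.get(path) or "").startswith(short_hash)
    }


def current_submodule_status():
    """Run `git submodule status` in the current directory and return its output."""
    result = subprocess.run(
        ["git", "submodule", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling find_mismatches(current_submodule_status()) from the repository root returns an empty dict when both submodules match the pinned commits.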
This project is tested on Python 3.10.11:
conda create -n "vlm-grounder" python=3.10.11
conda activate vlm-grounder
Install PyTorch 2.0.1. For detailed instructions, refer to PyTorch's official page:
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
Then, install the required Python packages and PyTorch3D:
pip install -r requirements.txt
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
SAM-Huge: Download the SAM-Huge weight file from here and place it in the checkpoints/SAM folder.
PATS: For image matching, we use PATS. Download the required weights and place them in the 3rdparty/pats/weights folder.
Install the tensor-resize module:
cd 3rdparty/pats/setup
python setup.py install
cd ../../..
The PATS weights folder structure should look like this:
pats
├── data
└── weights
    ├── indoor_coarse.pt
    ├── indoor_fine.pt
    ├── indoor_third.pt
    ├── outdoor_coarse.pt
    ├── outdoor_fine.pt
    └── outdoor_third.pt
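Before running the matching step, it can help to check that all six weight files are in place. The following is a minimal sketch, not part of the repository; the function name is hypothetical, and the default path assumes you run it from the repository root.

```python
from pathlib import Path

# File names taken from the folder structure shown above.
PATS_WEIGHT_FILES = [
    "indoor_coarse.pt",
    "indoor_fine.pt",
    "indoor_third.pt",
    "outdoor_coarse.pt",
    "outdoor_fine.pt",
    "outdoor_third.pt",
]


def missing_pats_weights(weights_dir="3rdparty/pats/weights"):
    """Return the names of expected PATS weight files that are not present."""
    weights_dir = Path(weights_dir)
    return [name for name in PATS_WEIGHT_FILES if not (weights_dir / name).is_file()]
```

An empty list means every expected file was found.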
Set your OpenAI API key in vlm_grounder/utils/my_openai.py:
api_key = "your_openai_api_key" # sk-******
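If you prefer not to store the key in a source file, one possible alternative is to read it from an environment variable. This is only a sketch of that idea and is not how the repository is set up by default: load_api_key is a hypothetical helper, and OPENAI_API_KEY is the variable name conventionally used by the OpenAI Python client.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY", placeholder="your_openai_api_key"):
    """Read an API key from the environment, rejecting empty or placeholder values."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(f"Set the {env_var} environment variable to a valid API key.")
    return key
```

With this helper, the assignment in my_openai.py could become api_key = load_api_key().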
If you choose to use Grounding-DINO-1.5, please install Grounding-DINO-1.5-API and set the API key.
Install the Grounding-DINO-1.5-API:
cd 3rdparty/Grounding-DINO-1.5-API
pip install -v -e .
cd ../..
Set the Grounding-DINO-1.5-API key in vlm_grounder/utils/my_gdino.py (you can request it from DeepDataSpace):
api_key = "your_gdino_api_key"
Navigate to the dataset folder:
cd data/scannet/
Download the ScanNet dataset and organize the data folder structure as follows:
data/
└── scannet
    ├── grounding
    ├── meta_data
    ├── scans  # Place ScanNet data here
    │   ├── scene0000_00
    │   ├── scene0000_01
    │   ...
    │
    └── tools
We extract one frame out of every 20, which takes approximately 850 seconds and 27 GB of disk space:
python tools/extract_posed_images.py --frame_skip 20 --nproc 8 # using 8 processes
This will generate the data/scannet/posed_images folder.
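To illustrate what a frame skip of 20 means, the sketch below lists which frame indices would be kept from a sequence. It assumes that sampling starts at frame 0 and that the last partial window is kept; the actual behavior of extract_posed_images.py may differ, and the function name is hypothetical.

```python
def select_frame_indices(num_frames, frame_skip=20):
    """Return the indices kept when taking one frame out of every `frame_skip`."""
    if frame_skip < 1:
        raise ValueError("frame_skip must be a positive integer")
    return list(range(0, num_frames, frame_skip))
```

For example, a 100-frame sequence with the default skip keeps frames 0, 20, 40, 60, and 80.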
Run the script to batch load ScanNet data:
python tools/batch_load_scannet_data.py
This will export the ScanNet data to the data/scannet/scannet_instance_data folder.
Update the info file with posed images information:
python tools/update_info_file_with_images.py
First, add the repository root to the PYTHONPATH environment variable:
cd path/to/VLM-Grounder
export PYTHONPATH=$PYTHONPATH:path/to/VLM-Grounder
We release the test data used in our paper in the outputs/query_analysis folder (scanrefer_250.csv and nr3d_250.csv).
We also provide cached data for our test data so that you do not need to pay the cost of running the entire pipeline. It contains:
- Exhaustive matching data (covering all ScanRefer validation scenes and the scenes in nr3d_250).
- GDINO detection results (GDINO 1.5 Pro detection results for scanrefer_250 and nr3d_250).
- Global cache folder (containing category_judger, new detections, and query_analysis results for scanrefer_250 and nr3d_250).
Cached data folder structure:
data
└── scannet
    └── scannet_match_data
        └── exhaustive_matching.pkl # Exhaustive matching data
outputs
├── global_cache # Global cache folder
│   ├── category_judger
│   ├── gdino_cache
│   └── query_analysis_v2
└── image_instance_detector # GDINO detect results
    ├── Grounding-DINO-1_nr3d_test_top250_pred_target_classes
    └── Grounding-DINO-1_scanrefer_test_top250_pred_target_classes
If you want to use new data, follow steps 2 through 6 below to prepare it. If you want to use our test data, jump directly to step 7.
Convert ScanRefer to the Referit3D format:
python data/scannet/tools/convert_scanrefer_to_referit3d.py --input_json_path data/scannet/grounding/scanrefer/ScanRefer_filtered_val.json --output_csv_path data/scannet/grounding/referit3d/*.csv
Subsample the CSV file for quick experiments:
python vlm_grounder/utils/csv_utils.py --csv_file data/scannet/grounding/referit3d/*.csv --sample_num 250
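For intuition, subsampling a grounding CSV amounts to keeping a random subset of rows while preserving the header. The sketch below shows one way to do this with the standard library; it is not the implementation in csv_utils.py, and the function name, the output_path argument, and the fixed seed are assumptions made for illustration.

```python
import csv
import random


def subsample_csv(csv_path, output_path, sample_num=250, seed=0):
    """Write a random subset of at most `sample_num` rows to `output_path`.

    Rows keep their original relative order. Returns the number of rows written.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    if len(rows) > sample_num:
        rng = random.Random(seed)  # fixed seed so the subset is reproducible
        keep = sorted(rng.sample(range(len(rows)), sample_num))
        rows = [rows[i] for i in keep]

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Fixing the seed means repeated runs select the same rows, which keeps quick experiments comparable.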
Calculate the fine-grained categories (e.g., Unique, Easy, VD):
python data/scannet/tools/pre_compute_category.py --vg_file data/scannet/grounding/referit3d/*.csv
Use PATS to obtain exhaustive matching data:
python vlm_grounder/tools/exhaustive_matching.py --vg_file data/scannet/grounding/referit3d/*.csv
This will generate data/scannet/scannet_match_data/exhaustive_matching.pkl, containing exhaustive matching data for each scene. Please note that this process can take a long time (~20 minutes per scene).
Note: Using cached data provided in step 1 may save some time.
Run the QueryAnalysis module to analyze each query and get the predicted target class and conditions:
python vlm_grounder/tools/query_analysis.py --vg_file data/scannet/grounding/referit3d/*_relations.csv
The output will be in the outputs/query_analysis folder. Predicted target class accuracy typically exceeds 98%.
Run the ImageInstanceDetector module to detect target class objects for each image. You can use Yolov8-world or Grounding-DINO-1.5-Pro for object detection. If using YOLO, checkpoints/yolov8_world/yolov8x-worldv2.pt will be downloaded automatically:
python vlm_grounder/tools/image_instance_detector.py --vg_file outputs/query_analysis/*.csv --chunk_size -1 --detector [yolo|gdino]
Output results will be in the outputs/image_instance_detector folder.
Note: If using gdino, ensure your quota is sufficient as this operation is quota-intensive. Using cached data provided in step 1 may save some time and quota.
Run the ViewPreSelection module to locate all images containing the predicted target class. This process takes about 0.7 seconds per sample:
python vlm_grounder/tools/view_pre_selection.py --vg_file outputs/query_analysis/*.csv --det_file outputs/image_instance_detector/*/chunk*/detection.pkl
A new CSV file will be produced in the QueryAnalysis output directory, with the suffix _with_images_selected_diffconf_and_pkl appended.
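The idea behind this step, keeping only the views in which the detector found the predicted target class, can be illustrated with a minimal sketch. Everything about the data layout here is an assumption for illustration: detections are taken to be a dict mapping image IDs to lists of (label, confidence) pairs, and the function name and confidence threshold are hypothetical. The actual layout of detection.pkl and the logic in view_pre_selection.py may differ.

```python
def preselect_views(detections, target_class, min_conf=0.0):
    """Return sorted image IDs with at least one detection of `target_class`.

    `detections` is assumed to map image_id -> list of (label, confidence) pairs.
    """
    selected = []
    for image_id, objects in detections.items():
        if any(label == target_class and conf >= min_conf for label, conf in objects):
            selected.append(image_id)
    return sorted(selected)
```

Raising min_conf trades recall for precision: fewer views are passed on, but those that remain are more likely to show the target class.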
Run the VisualGrounder module. Intermediate results with visualization will be saved in outputs/visual_grounding.
A sample run.sh script is provided for convenience; modify it to change parameters.
Please change the VG_FILE, DET_INFO, MATCH_INFO, DATE, and EXP_NAME variables accordingly.
Note: The sampled data tested in the paper is at outputs/query_analysis/nr3d_250.csv and outputs/query_analysis/scanrefer_250.csv.
#!/usr/bin/zsh
source ~/.zshrc
# Initial visual grounding
VG_FILE=outputs/query_analysis/*_relations_with_images_selected_diffconf_and_pkl.csv
DET_INFO=outputs/image_instance_detector/*/chunk*/detection.pkl
MATCH_INFO=data/scannet/scannet_match_data/*.pkl
DATE=2024-06-21
EXP_NAME=test
GPT_TYPE=gpt-4o-2024-05-13
PROMPT_VERSION=3
python ./vlm_grounder/grounder/visual_grouder.py \
--from_scratch \
--post_process_component \
--post_process_erosion \
--use_sam_huge \
--use_bbox_prompt \
--vg_file_path ${VG_FILE} \
--exp_name ${DATE}_${EXP_NAME} \
--prompt_version ${PROMPT_VERSION} \
--openaigpt_type ${GPT_TYPE} \
--skip_bbox_selection_when1 \
--det_info_path ${DET_INFO} \
--matching_info_path ${MATCH_INFO} \
--use_new_detections \
--dynamic_stitching \
--online_detector [yolo|gdino]
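The wildcards in VG_FILE, DET_INFO, and MATCH_INFO stand for names that depend on your own run, so each should be replaced with one concrete path. The helper below is a hypothetical sketch, not part of the repository, that resolves such a pattern and fails loudly when it matches zero or several files.

```python
import glob


def resolve_single_path(pattern):
    """Expand a glob pattern and return the match, requiring exactly one."""
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one match for {pattern!r}, found {len(matches)}: {matches}"
        )
    return matches[0]
```

For example, resolve_single_path("data/scannet/scannet_match_data/*.pkl") returns the path to paste into MATCH_INFO when only one matching file exists.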
For Nr3D, the predicted bbox must be matched with a GT bbox before evaluation. We provide three evaluation methods: 2D IoU, GT Bbox IoU, and GT Bbox Distance.
- GT Bbox Distance: Choose the GT bbox with the smallest distance to the predicted bbox. This is the method used for Nr3D in the paper.
- GT Bbox IoU: Choose the GT bbox with the highest IoU with the predicted bbox.
- 2D IoU: Use 2D masks to compute IoU for evaluation.
python vlm_grounder/eval/accuracy_evaluator.py --method [2d_iou|gtbbox_iou|gtbbox_dist] --exp_dir outputs/visual_grouding/*
Note that to use 2D IoU matching, you need to unzip the {scene_id}_2d-instance-filt.zip files from the ScanNet dataset before running the evaluation.
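The GT Bbox Distance and GT Bbox IoU matching strategies described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in accuracy_evaluator.py: boxes are assumed to be axis-aligned and given as (cx, cy, cz, dx, dy, dz), distance is assumed to mean Euclidean distance between box centers, and all function names are hypothetical.

```python
import math


def box_bounds(box):
    """Return (min_corner, max_corner) of an axis-aligned box (cx, cy, cz, dx, dy, dz)."""
    cx, cy, cz, dx, dy, dz = box
    return (
        (cx - dx / 2, cy - dy / 2, cz - dz / 2),
        (cx + dx / 2, cy + dy / 2, cz + dz / 2),
    )


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    min_a, max_a = box_bounds(box_a)
    min_b, max_b = box_bounds(box_b)
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # no overlap along this axis
        intersection *= overlap
    volume_a = box_a[3] * box_a[4] * box_a[5]
    volume_b = box_b[3] * box_b[4] * box_b[5]
    return intersection / (volume_a + volume_b - intersection)


def center_distance(box_a, box_b):
    """Euclidean distance between box centers (an assumed distance definition)."""
    return math.dist(box_a[:3], box_b[:3])


def match_gt_by_distance(pred_box, gt_boxes):
    """Index of the GT box whose center is closest to the predicted box."""
    return min(range(len(gt_boxes)), key=lambda i: center_distance(pred_box, gt_boxes[i]))


def match_gt_by_iou(pred_box, gt_boxes):
    """Index of the GT box with the highest IoU with the predicted box."""
    return max(range(len(gt_boxes)), key=lambda i: iou_3d(pred_box, gt_boxes[i]))
```

The two strategies can disagree: a small GT box may be nearest by center distance while a larger neighboring box overlaps the prediction more, which is why the choice of matching method matters when reporting Nr3D results.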
Some unused features have been temporarily left in the code. They were relevant during development but do not affect the final results, so you can ignore them. If you encounter any problems, feel free to open an issue at any time.
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
@inproceedings{xu2024vlmgrounder,
title={VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding},
author={Xu, Runsen and Huang, Zhiwei and Wang, Tai and Chen, Yilun and Pang, Jiangmiao and Lin, Dahua},
booktitle={CoRL},
year={2024}
}
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The authors would like to thank Tianhe Ren and Lei Zhang from The International Digital Economy Academy (IDEA) for providing access to the excellent Grounding DINO-1.5 model and Junjie Ni from Zhejiang University for the help with PATS.