Transcrib3D: 3D Referring Expression Resolution through Large Language Models (IROS 2024)

Jiading Fang*, Xiangshan Tan*, Shengjie Lin*, Igor Vasiljevic, Vitor Guizilini, Hongyuan Mei, Rares Ambrus, Gregory Shakhnarovich, Matthew Walter


Transcrib3D reasons about and acts according to complex 3D referring expressions with real robots.

Environment Settings

Evaluation requires only a few packages: numpy, openai, and tenacity.

pip install numpy openai tenacity

Additional packages are needed for data preprocessing:

pip install plyfile scikit-learn scipy pandas
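Before going further, it can help to confirm that the packages are importable. The snippet below is a minimal sketch rather than part of this repository; the helper name missing_packages is hypothetical. Note that scikit-learn is imported as sklearn.

```python
import importlib.util

# Import names for the packages listed above; scikit-learn is imported as "sklearn".
EVAL_PACKAGES = ["numpy", "openai", "tenacity"]
PREPROCESSING_PACKAGES = ["plyfile", "sklearn", "scipy", "pandas"]


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    # Hypothetical helper for illustration; it only looks packages up, it does not import them.
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    groups = [("evaluation", EVAL_PACKAGES), ("preprocessing", PREPROCESSING_PACKAGES)]
    for label, names in groups:
        missing = missing_packages(names)
        status = "all found" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```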

Set up your OpenAI API key as an environment variable OPENAI_API_KEY:

export OPENAI_API_KEY=xxx
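A missing key usually surfaces only when the first request fails, so a quick check up front can save time. This is a small sketch under the assumption that the key is read from the OPENAI_API_KEY environment variable as described above; the function name get_openai_api_key is hypothetical and not part of this repository.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the API key from the environment, or raise a readable error."""
    # Hypothetical helper: `env` is any mapping, which makes the check easy to test.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running the evaluation.")
    return key
```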

Data Preparation

Since both the ReferIt3D dataset (which includes Sr3D and Nr3D) and the ScanRefer dataset depend on ScanNet, we first preprocess the ScanNet data.

Quick Start

To make things easier, we provide the bounding boxes for each scene in data/scannet_object_info. Currently, this includes only ground-truth bounding boxes (the setting used for Nr3D and Sr3D in the ReferIt3D benchmark). Detected bounding boxes will be provided later. You do not need the original ScanNet scene data just to run the tests, although the original scene data are still useful for debugging and visualization.

You can jump to Evaluation for a quick start.

If you want to generate the bounding boxes from the original ScanNet data, follow the steps below.

Download ScanNet Data

Follow the official instructions to download the data. This involves filling out a form and emailing the ScanNet authors. Then, you will receive a response email with detailed instructions and a Python script download-scannet.py for downloading the data. Run the script to download certain types of data:

python download-scannet.py -o [directory in which to download] --type [file suffix]

The original ScanNet data is 1.3TB and contains many file types, some of which are not needed for this project (e.g., the RGBD stream .sens type). Use the optional argument --type to download only the necessary types:

_vh_clean_2.ply _vh_clean_2.labels.ply _vh_clean_2.0.010000.segs.json _vh_clean.segs.json .aggregation.json _vh_clean.aggregation.json .txt

Run the following shell script or CMD commands to download them. To avoid having to press a key during the download, comment out the code key = input('') at lines 147 and 225 of download-scannet.py:

# bash
download_dir="your_scannet_download_directory"
suffixes=(
    "_vh_clean_2.ply"
    "_vh_clean_2.labels.ply"
    "_vh_clean_2.0.010000.segs.json"
    "_vh_clean.segs.json"
    ".aggregation.json"
    "_vh_clean.aggregation.json"
    ".txt"
)
for suffix in "${suffixes[@]}"; do
    python download-scannet.py -o "$download_dir" --type "$suffix"
done

CMD
set download_dir="your_scannet_download_directory"
set suffixes=_vh_clean_2.ply;_vh_clean_2.labels.ply;_vh_clean_2.0.010000.segs.json;_vh_clean.segs.json;.aggregation.json;_vh_clean.aggregation.json;.txt

REM Interactive prompt syntax; inside a .bat file, write %%s instead of %s.
for %s in (%suffixes%) do (
  python download-scannet.py -o %download_dir% --type %s
)

After downloading, your directory structure should look like:

your_scannet_download_directory/
|-- scans/
|   |-- scene0000_00/
|   |   |-- scene0000_00_vh_clean_2.ply
|   |   |-- scene0000_00_vh_clean_2.labels.ply
|   |   |-- scene0000_00_vh_clean_2.0.010000.segs.json
|   |   |-- scene0000_00_vh_clean.segs.json
|   |   |-- scene0000_00.aggregation.json
|   |   |-- scene0000_00_vh_clean.aggregation.json
|   |   |-- scene0000_00.txt
|   |-- scenexxxx_xx/
|   |   |-- ...
|-- scans_test/
|   |-- scene0707_00/
|   |-- ...
|-- scannetv2-labels.combined.tsv
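To confirm that a download matches the layout above, a check along the following lines can be used. It is a sketch only, not a script shipped with this repository: the function name find_missing_files is hypothetical, and it assumes that every scene folder under scans/ should hold all seven file types. Scenes under scans_test/ may ship fewer files, so the default only inspects scans/.

```python
from pathlib import Path

# File suffixes listed in the download step above.
REQUIRED_SUFFIXES = [
    "_vh_clean_2.ply",
    "_vh_clean_2.labels.ply",
    "_vh_clean_2.0.010000.segs.json",
    "_vh_clean.segs.json",
    ".aggregation.json",
    "_vh_clean.aggregation.json",
    ".txt",
]


def find_missing_files(download_dir, split="scans"):
    """Map each scene name to the list of expected files that are absent."""
    missing = {}
    split_dir = Path(download_dir) / split
    for scene_dir in sorted(p for p in split_dir.glob("scene*") if p.is_dir()):
        # Each file is named <scene name><suffix>, e.g. scene0000_00_vh_clean_2.ply.
        absent = [
            scene_dir.name + suffix
            for suffix in REQUIRED_SUFFIXES
            if not (scene_dir / (scene_dir.name + suffix)).is_file()
        ]
        if absent:
            missing[scene_dir.name] = absent
    return missing


if __name__ == "__main__":
    for scene, files in find_missing_files("your_scannet_download_directory").items():
        print(scene, "is missing", ", ".join(files))
```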

Axis-Align

Next, use the axis-alignment matrices (recorded in scenexxxx_xx.txt) to transform the vertex coordinates:

python preprocessing/align_scannet_mesh.py --scannet_download_path [your_scannet_download_directory]
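For readers who want to see what this transform amounts to, here is a dependency-free sketch. It assumes the scene .txt file contains a line of the form axisAlignment = followed by 16 row-major values of a 4x4 matrix, and that scenes without such a line are left unchanged. The function names are hypothetical, and preprocessing/align_scannet_mesh.py may be implemented differently.

```python
def read_axis_alignment(meta_lines):
    """Parse the 4x4 matrix from a line such as 'axisAlignment = m00 m01 ... m33'."""
    for line in meta_lines:
        if line.startswith("axisAlignment"):
            values = [float(v) for v in line.split("=", 1)[1].split()]
            if len(values) != 16:
                raise ValueError("expected 16 values in axisAlignment")
            # Values are taken as row-major: four consecutive numbers per row.
            return [values[r * 4:(r + 1) * 4] for r in range(4)]
    # Assumption: a scene without the entry keeps its coordinates (identity matrix).
    return [[float(r == c) for c in range(4)] for r in range(4)]


def align_vertices(vertices, matrix):
    """Apply the homogeneous transform to a list of (x, y, z) vertices."""
    aligned = []
    for x, y, z in vertices:
        point = (x, y, z, 1.0)
        # Only the first three rows are needed for the transformed x, y, z.
        aligned.append(tuple(sum(matrix[r][c] * point[c] for c in range(4)) for r in range(3)))
    return aligned
```

For example, a matrix whose last column is (1, 2, 3, 1) and whose rotation part is the identity simply shifts every vertex by (1, 2, 3).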

Download the ReferIt3D and ScanRefer Data

Follow the ReferIt3D official guide to download nr3d.csv, sr3d.csv, sr3d_train.csv, sr3d_test.csv and save them in the data/referit3d folder.

Follow the ScanRefer official guide to download the dataset and place them within the data/scanrefer folder.

Generate Object Information

In this step, we process the ScanNet data to extract the quantitative and semantic information of the objects in each scene.

For object instance segmentation, we use either ground-truth (ScanNet official) data or an off-the-shelf segmentation tool (Mask3D).

To use ground-truth segmentation data, run:

python preprocessing/gen_obj_list.py --scannet_download_path [your_scannet_download_directory] --bbox_type gt

You can find the results in scannet_download_path/scans/objects_info/ and scannet_download_path/scans_test/objects_info/.

To use Mask3D segmentation data, first follow the Mask3D official guide to produce the instance segmentation results, then run:

python preprocessing/gen_obj_list.py --scannet_download_path [your_scannet_download_directory] \
    --bbox_type mask3d \
    --mask3d_result_path [your_mask3d_result_directory]
# Note: mask3d_result_path should look like xxx/Mask3D/eval_output/instance_evaluation_mask3d_export_scannet200_0/val/

You can find the results in scannet_download_path/scans/objects_info_mask3d_200c/.
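As an illustration of the kind of quantitative information involved, the sketch below computes an axis-aligned bounding box (center and size) from the vertices of one segmented object. This is an assumption about one plausible step, not the logic of preprocessing/gen_obj_list.py, and the function name axis_aligned_bbox is hypothetical.

```python
def axis_aligned_bbox(points):
    """Return (center, size) of the axis-aligned box enclosing (x, y, z) points."""
    if not points:
        raise ValueError("an object needs at least one vertex")
    # Per-axis extremes over all vertices of the object.
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return center, size
```

Because the box follows the coordinate axes, it is only as tight as the axis alignment of the scene allows, which is one reason the alignment step comes first.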

Evaluation

Quick Start

Run the first 50 data records of nr3d_test_sampled1000.csv with config index 1:

python main.py --workspace_path /path/to/Transcribe3D/project/folder --scannet_data_root /path/to/ScanNet/Data/ --mode eval --dataset_type nr3d --conf_idx 1 --range 2 52

Remember to replace the paths.

Note that scannet_data_root can be set to /path/to/Transcribe3D/project/folder/data/scannet_object_info, as we already provide the ground-truth ScanNet bounding boxes. If you preprocess the data yourself, set it to scannet_download_path/scans/objects_info/.

Modifying the Configuration

To run our model on a different referring dataset, set --dataset_type to one of sr3d, nr3d, or scanrefer.
To select the evaluation range of the dataset, modify the --range setting. For Sr3D and Nr3D, which use .csv files, the minimum number is 2. For ScanRefer, which uses .json files, the minimum number is 0.
For convenience, more configurations are placed in config/config.py. It defines three dictionaries: confs_nr3d, confs_sr3d and confs_scanrefer. Each contains several configurations for that dataset, and the meaning of each configuration can be understood from its variable names. Use --conf_idx to select a configuration. You can also add your own configurations.
More information can be found by running python main.py -h.
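One way to read the --range numbers for the .csv datasets is as line numbers in the file, with line 1 being the header and the end of the range excluded. Under that reading, --range 2 52 selects the first 50 records, which matches the quick-start example. The sketch below illustrates this interpretation only; it is an assumption rather than the code in main.py, and select_csv_records is a hypothetical name.

```python
import csv
import io


def select_csv_records(csv_text, start, end):
    """Return the records on CSV lines start..end-1, where line 1 is the header."""
    if start < 2:
        raise ValueError("line 1 is the header, so the smallest start is 2")
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Record k (0-based) sits on line k + 2 of the file.
    return rows[start - 2:end - 2]
```

With this convention, a range of 2 to 4 returns the first two records of the file.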

Result Storage

After you run the evaluation with a specific configuration, a folder is created under the results folder. Its name starts with eval_results_ and contains the configuration information. Inside it, each experiment gets a subfolder named after its start time.

Analyzing Results

You may run one or more experiments with the same evaluation configuration, each producing a subfolder named after its formatted start time. These formatted times are used to select the results to analyze. An example looks like 2023-10-26-15-48-12.

Specify the formatted time(s) after the --ft setting:

python main.py --workspace_path /path/to/Transcribe3D/project/folder/ --scannet_data_root /path/to/ScanNet/Data/ --mode result --dataset_type nr3d --conf_idx 1 --ft time1 time2
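To collect the formatted times available for one configuration, the subfolder names can be listed and filtered by the timestamp pattern. This sketch assumes the names follow the year-month-day-hour-minute-second layout of the example 2023-10-26-15-48-12; the function name list_formatted_times is hypothetical and not part of this repository.

```python
from datetime import datetime
from pathlib import Path

TIME_FORMAT = "%Y-%m-%d-%H-%M-%S"  # matches names such as 2023-10-26-15-48-12


def list_formatted_times(config_result_dir):
    """Return the names of subfolders that parse as formatted times, oldest first."""
    times = []
    for entry in Path(config_result_dir).iterdir():
        if not entry.is_dir():
            continue
        try:
            times.append((datetime.strptime(entry.name, TIME_FORMAT), entry.name))
        except ValueError:
            continue  # not a timestamped experiment folder
    return [name for _, name in sorted(times)]
```

The returned names can be passed directly after --ft, separated by spaces.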

Check ScanRefer

Check how many cases have at least one detected box with an IoU of 0.5 or higher with the ground-truth box. This count indicates the upper bound of performance on ScanRefer.

python main.py --workspace_path /path/to/Transcribe3D/project/folder/ --scannet_data_root /path/to/ScanNet/Data/ --mode check_scanrefer --dataset_type scanrefer --conf_idx 1
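The idea behind this check can be sketched as follows. The snippet assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax) and cases given as pairs of a ground-truth box and a list of detected boxes. Both the box representation and the function names are assumptions made for illustration, not the repository's implementation.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        overlap = min(box_a[axis + 3], box_b[axis + 3]) - max(box_a[axis], box_b[axis])
        if overlap <= 0:
            return 0.0  # the boxes do not overlap along this axis
        inter *= overlap
    return inter / (box_volume(box_a) + box_volume(box_b) - inter)


def upper_bound_accuracy(cases, threshold=0.5):
    """Fraction of (gt_box, detected_boxes) cases with a detection at or above threshold."""
    if not cases:
        return 0.0
    hits = sum(
        1 for gt_box, detected in cases
        if any(box_iou_3d(gt_box, det) >= threshold for det in detected)
    )
    return hits / len(cases)
```

A case with no sufficiently overlapping detection cannot be answered correctly regardless of the language model, which is why this fraction bounds the achievable accuracy.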

Finetuning

We provide scripts for finetuning open-source LLMs (e.g., codeLlama, Llama2) in the finetune directory.

Environment

The scripts use the Hugging Face trl library (https://github.com/huggingface/trl) to run the finetuning jobs. The main dependencies are the Hugging Face packages accelerate, transformers, datasets, peft and trl.

Data

We provide processed finetuning data following the OpenAI finetuning file format in the finetune/finetune_files directory. It covers the different settings described in our paper. The original processing script is finetune/prepare_finetuning_data.py, which processes results from the main script.
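For orientation, OpenAI chat finetuning files are JSON Lines: each line is one JSON object with a messages list of role and content pairs. The sketch below checks a line against a simplified version of that layout. It assumes the provided files use the chat layout, which is not confirmed here, and validate_finetune_line is a hypothetical helper rather than part of this repository.

```python
import json

# Simplified role set; the full OpenAI format allows additional roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_finetune_line(line):
    """Parse one JSONL line and check that it has the chat 'messages' layout."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs a string 'content'")
    # A training example needs a reply for the model to learn from.
    if not any(m["role"] == "assistant" for m in messages):
        raise ValueError("record needs at least one assistant message")
    return record
```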

Scripts

We provide two example shell scripts to run the finetuning jobs: one for the codellama model (finetune/trl_finetune_codellama_instruct.sh) and one for the llama2_chat model (finetune/trl_finetune_llama2_chat.sh). You can also customize the finetuning job using finetune/trl_finetune.py.

Notes

The finetuned open-source models (e.g., codeLlama, Llama2) still underperform the finetuned closed-source model (gpt-3.5-turbo) as of September 2023. We expect this may change substantially in the near future as open-source models improve quickly.
Finetuning requires roughly 24GB+ of GPU memory for 7B models and 36GB+ for 13B models.

Bibtex

If you find our paper useful and use it in a publication, we would appreciate it if you cite it as:

@misc{fang2024transcrib3d3dreferringexpression,
      title={Transcrib3D: 3D Referring Expression Resolution through Large Language Models}, 
      author={Jiading Fang and Xiangshan Tan and Shengjie Lin and Igor Vasiljevic and Vitor Guizilini and Hongyuan Mei and Rares Ambrus and Gregory Shakhnarovich and Matthew R Walter},
      year={2024},
      eprint={2404.19221},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2404.19221}, 
}
