WACV 2025
Anh-Quan Cao¹, Maximilian Jaritz², Matthieu Guillaumin², Raoul de Charette¹, Loris Bazzani²
If you find this work or code useful, please cite our paper and give this repo a star:
@InProceedings{cao2024latteclip,
  title     = {LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts},
  author    = {Anh-Quan Cao and Maximilian Jaritz and Matthieu Guillaumin and Raoul de Charette and Loris Bazzani},
  year      = {2024},
  booktitle = {arXiv}
}
- 17/12/2024: Code is released.
- 14/10/2024: Code will be available soon.
Follow these steps to install the necessary dependencies:
Create and activate a new conda environment:
conda create -n latteclip python=3.10
conda activate latteclip
Navigate to the latteclip directory and run the following commands:
make install
make install-training
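Optionally, sanity-check the environment before moving on. This is a minimal sketch; it assumes PyTorch is among the dependencies installed by make install:
python -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"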
Install LLaVA by following the official instructions:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
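To confirm the install succeeded, you can try importing the package. A minimal check, assuming LLaVA installs under the package name llava as in the official repository:
python -c "import llava; print('LLaVA imported successfully')"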
Create a folder to store the data and set its path in the bash variable $LATTECLIP_DATA_DIR:
mkdir -p /path/to/data
export LATTECLIP_DATA_DIR=/path/to/data
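Since the variable must be set in every new shell, you may want to persist it. A minimal sketch, assuming a bash login shell:
echo 'export LATTECLIP_DATA_DIR=/path/to/data' >> ~/.bashrc
source ~/.bashrc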
Download the data from this link and extract all files into $LATTECLIP_DATA_DIR.
Navigate to the latteclip directory and run the preprocess script to create the webdataset tar files and extract the CLIP features:
cd latteclip
bash scripts/preprocess/preprocess.sh
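To verify that preprocessing produced the expected shards, you can list the tar files and peek inside one. A minimal sketch; the exact directory layout under $LATTECLIP_DATA_DIR depends on the preprocess script, so adjust the paths as needed:
# List a few generated webdataset shards
find "$LATTECLIP_DATA_DIR" -name "*.tar" | head
# Show the first entries of the first shard
tar -tf "$(find "$LATTECLIP_DATA_DIR" -name '*.tar' | head -n 1)" | head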
To generate image descriptions, run the following command:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh $MACHINE_ID $NUM_MACHINE classname_dtd dtd $NUM_PROCESSES_PER_GPU $NUM_GPUS
Here $MACHINE_ID is the 0-based index of the current machine, $NUM_MACHINE is the total number of machines, $NUM_PROCESSES_PER_GPU is the number of generation processes per GPU, and $NUM_GPUS is the number of GPUs per machine.
For example, assume you have 2 machines, 1 GPU per machine, and 5 generation processes per Tesla V100 32GB GPU:
Machine 0:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 2 classname_dtd dtd 5 1
Machine 1:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 1 2 classname_dtd dtd 5 1
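If the machines share the repository path and environment, you can launch both from a single terminal. A hypothetical sketch: node0 and node1 are placeholder hostnames, /path/to/latteclip is the repository checkout, and the latteclip environment is assumed to be configured in each login shell:
for i in 0 1; do
  ssh node$i "cd /path/to/latteclip && bash scripts/unsupervised/extract_captions_llava_multiprocess.sh $i 2 classname_dtd dtd 5 1" &
done
wait  # block until both machines finish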
To run on a single machine instead, use the following commands (one per dataset); a loop that runs them all sequentially is sketched after the list:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_dtd dtd 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_eurosat eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_scene sun397 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_flower flower102 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_food101 food101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_pets oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_car stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_ufc ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_caltech caltech101 5 1
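As referenced above, a minimal wrapper loop over the same prompt/dataset pairs (machine 0 of 1, 5 processes per GPU, 1 GPU, as in the commands above):
for pair in \
  "classname_dtd dtd" "classname_eurosat eurosat" "classname_scene sun397" \
  "classname_flower flower102" "classname_food101 food101" "classname_pets oxford_pets" \
  "classname_car stanford_cars" "classname_ufc ucf101" "classname_caltech caltech101"; do
  # $pair is deliberately unquoted so it expands into the two positional arguments
  bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 $pair 5 1
done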
The process is similar to generating image descriptions. Use the following commands:
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 dtd_describe_common_v3 dtd 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 eurosat_describe_common_v3 eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 sun397_describe_common_v3 sun397 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 flower102_describe_common_v3 flower102 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 food101_describe_common_v3 food101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 pets_describe_common_v3 oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 car_describe_common_v3 stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 ufc_describe_common_v3 ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 caltech_describe_common_v3 caltech101 5 1
To train the model on dtd, run:
bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh $lr $class_per_image $device $port $seed $exp_name
- $lr: Learning rate
- $class_per_image: Number of classes per image (always set to 1)
- $device: Device ID
- $port: Port for the job (not used)
- $seed: Random seed
- $exp_name: Experiment name
To train with learning rate 1e-7 on device 0, with port 25680, random seed 1, and experiment name exp_dtd:
bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 1 exp_dtd
bash scripts/unsupervised/eurosat/eurosat_fine_tune_multiclass.sh 1e-7 1 0 25666 1 exp_eurosat
bash scripts/unsupervised/caltech101/caltech101_fine_tune_multiclass.sh 1e-7 1 0 25665 1 exp_caltech101
bash scripts/unsupervised/fgvc_aircraft/fgvc_aircraft_fine_tune_multiclass.sh 1e-7 1 0 25667 1 exp_fgvc_aircraft
bash scripts/unsupervised/flower102/flower102_fine_tune_multiclass.sh 1e-7 1 0 25668 1 exp_flower102
bash scripts/unsupervised/food101/food101_fine_tune_multiclass.sh 1e-7 1 0 25669 1 exp_food101
bash scripts/unsupervised/oxford_pets/oxford_pets_fine_tune_multiclass.sh 1e-7 1 0 25670 1 exp_oxford_pets
bash scripts/unsupervised/stanford_cars/stanford_cars_fine_tune_multiclass.sh 1e-7 1 0 25671 1 exp_stanford_cars
bash scripts/unsupervised/sun397/sun397_fine_tune_multiclass.sh 1e-7 1 0 25672 1 exp_sun397
bash scripts/unsupervised/ucf101/ucf101_fine_tune_multiclass.sh 1e-7 1 0 25673 1 exp_ucf101
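To gauge variance across random seeds, you can sweep the seed argument. A minimal sketch for dtd, reusing the hyperparameters from the example above; the port and experiment names are arbitrary:
for seed in 1 2 3; do
  bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 $seed exp_dtd_seed${seed}
done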
Note: Logs will be stored in the logs folder.