How are region-level descriptions obtained? #16

UcanSee · 2024-04-22T07:50:07Z

Thanks for your great work!
In your paper, the label for the semantic classification branch is the CLIP embedding of the mask-cropped region. How, then, is the ground truth (GT) for the caption branch generated from SA-1B?

PhyscalX · 2024-04-22T08:14:18Z

Hi, @UcanSee

The caption branch (i.e., TextDecoder) is randomly initialized and is NOT trained during SA-1B pre-training, so SA-1B needs no caption ground truth.
The caption branch is then trained on VG data only, with the ImageEncoder and ImageDecoder frozen.
Further end-to-end (e2e) fine-tuning of the caption branch on a mixed SA/VG dataset (with the caption loss set to zero for SA data)
can improve VG CIDEr from 154.7 to 164.7.
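
To make the "caption loss set to zero for SA data" step concrete, here is a minimal pure-Python sketch of how a mixed SA/VG batch might be combined. It is not the repository's training code: the `Sample` class, the `source` tags, the single `other_loss` term (standing in for whatever non-caption losses apply to a sample), and the per-batch averaging are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical per-sample loss record for a mixed SA/VG batch."""
    source: str          # "SA" or "VG" (illustrative tags, not from the repo)
    other_loss: float    # stand-in for all non-caption losses on this sample
    caption_loss: float  # TextDecoder loss; meaningless for SA (no captions)


def caption_weight(source: str) -> float:
    # SA data has no caption ground truth, so its caption loss is zeroed.
    return 1.0 if source == "VG" else 0.0


def batch_loss(samples: List[Sample]) -> float:
    # Sum the per-sample losses, masking the caption term for SA samples,
    # then average over the batch (averaging is an assumption here).
    total = 0.0
    for s in samples:
        total += s.other_loss + caption_weight(s.source) * s.caption_loss
    return total / len(samples)


batch = [
    Sample(source="SA", other_loss=1.0, caption_loss=5.0),  # caption term dropped
    Sample(source="VG", other_loss=2.0, caption_loss=3.0),  # caption term kept
]
print(batch_loss(batch))
```

With this masking, SA samples still contribute through their other losses during end-to-end fine-tuning, while only VG samples supervise the caption branch.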
