Add support for PaliGemma (& PaliGemma2) (#1074)
* Bump versions

* Add support for PaliGemma (& PaliGemma2)

* Add unit tests

* Remove debug line

* Revert version bump (move to new PR)
xenova authored Dec 6, 2024
1 parent e4dac8a commit 6f27a10
Showing 8 changed files with 213 additions and 5 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -376,6 +376,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -91,6 +91,7 @@
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
37 changes: 32 additions & 5 deletions src/models.js
@@ -558,7 +558,9 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) {
new_model_inputs.use_cache_branch = boolTensor(!!past_key_values);
}
if (session.inputNames.includes('position_ids') && new_model_inputs.attention_mask && !new_model_inputs.position_ids) {
- new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values);
+ // NOTE: Handle a special case for paligemma models, where positions are 1-indexed
+ const start_index = self.config.model_type === 'paligemma' ? 1 : 0;
+ new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values, start_index);
}

// Unpack the `past_key_values` object into model inputs
@@ -694,14 +696,14 @@ async function imageTextToTextForward(self, {
* @param {Tensor} attention_mask
* @param {number} [start_index=0] Value at which the cumulative position count starts (1 for PaliGemma's 1-indexed positions).
* @returns {{data: BigInt64Array, dims: number[]}}
*/
- function cumsum_masked_fill(attention_mask) {
+ function cumsum_masked_fill(attention_mask, start_index = 0) {
const [bz, seq_len] = attention_mask.dims;
const attn_mask_data = attention_mask.data;

const data = new BigInt64Array(attn_mask_data.length);
for (let i = 0; i < bz; ++i) {
const start = i * seq_len;
- let sum = BigInt(0);
+ let sum = BigInt(start_index);
for (let j = 0; j < seq_len; ++j) {
const index = start + j;
if (attn_mask_data[index] === 0n) {
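
The visible chunk cuts off inside the masking branch. Per the Python reference quoted in the docstring of the next hunk (cumulative sum of the mask minus one, with masked slots filled with 1), the new `start_index` simply shifts where the position count begins — 1 for PaliGemma, 0 for everything else. A self-contained, plain-number sketch of those semantics, assuming the elided branch fills masked slots with 1 as the docstring indicates (the real code uses a BigInt64Array):

// Editorial sketch of cumsum_masked_fill semantics on a plain array.
function positionIdsFromMask(mask, startIndex = 0) {
  let sum = startIndex;
  // Masked (0) slots are filled with 1 and do not advance the counter.
  return mask.map((m) => (m === 0 ? 1 : sum++));
}

console.log(positionIdsFromMask([0, 0, 1, 1, 1, 1]));    // [1, 1, 0, 1, 2, 3]
console.log(positionIdsFromMask([0, 0, 1, 1, 1, 1], 1)); // [1, 1, 1, 2, 3, 4] (PaliGemma)
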
@@ -728,10 +730,10 @@ function cumsum_masked_fill(attention_mask) {
* position_ids = position_ids[:, -input_ids.shape[1] :]
* ```
*/
- function createPositionIds(model_inputs, past_key_values = null) {
+ function createPositionIds(model_inputs, past_key_values = null, start_index = 0) {
const { input_ids, inputs_embeds, attention_mask } = model_inputs;

- const { data, dims } = cumsum_masked_fill(attention_mask);
+ const { data, dims } = cumsum_masked_fill(attention_mask, start_index);
let position_ids = new Tensor('int64', data, dims);
if (past_key_values) {
const offset = -(input_ids ?? inputs_embeds).dims.at(1);
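
The negative `offset` above reproduces `position_ids[:, -input_ids.shape[1]:]` from the docstring: during cached decoding, only the positions for the freshly fed tokens are kept. A minimal plain-array illustration:

// With a primed KV cache, only the trailing positions are passed each step.
const all_position_ids = [1, 2, 3, 4, 5, 6];
const num_new_tokens = 1; // typically one token per decode step
console.log(all_position_ids.slice(-num_new_tokens)); // [6]
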
@@ -3548,6 +3550,30 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel
}
}

export class PaliGemmaPreTrainedModel extends PreTrainedModel {
forward_params = [
'input_ids',
// 'inputs_embeds',
'attention_mask',
'pixel_values',
'position_ids',
'past_key_values',
];
}

export class PaliGemmaForConditionalGeneration extends PaliGemmaPreTrainedModel {
_merge_input_ids_with_image_features(kwargs) {
const vision_hidden_size = kwargs.image_features.dims.at(-1);
const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);

return default_merge_input_ids_with_image_features({
// @ts-ignore
image_token_id: this.config.image_token_index,
...kwargs,
image_features: reshaped_image_hidden_states,
})
}
}
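
The `default_merge_input_ids_with_image_features` helper is not part of this diff; conceptually, it overwrites each `<image>` placeholder position in the text embedding sequence with one row of the flattened vision features. A rough, hypothetical sketch of that idea — names and shapes here are illustrative, not the library's actual implementation:

// Hypothetical sketch: splice vision features into image-placeholder rows.
function mergeImageFeatures(inputIds, textEmbeds, imageFeatures, imageTokenId) {
  let next = 0; // index of the next unused image-feature row
  return textEmbeds.map((row, i) =>
    inputIds[i] === imageTokenId ? imageFeatures[next++] : row,
  );
}

// Two image slots followed by two text tokens (embedding dim 3):
console.log(mergeImageFeatures(
  [99, 99, 5, 6], // 99 stands in for config.image_token_index
  [[0, 0, 0], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
  [[9, 9, 9], [8, 8, 8]],
  99,
)); // [[9,9,9], [8,8,8], [1,1,1], [2,2,2]]
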

//////////////////////////////////////////////////
// Idefics3 Models
@@ -7015,6 +7041,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
]);
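
Registering `paligemma` in this mapping makes the model loadable through the standard entry points. A hedged end-to-end sketch — the checkpoint id and image URL are placeholders; any ONNX-converted PaliGemma repository should work:

import { AutoProcessor, PaliGemmaForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Placeholder checkpoint id — substitute a real ONNX-converted PaliGemma repo.
const model_id = 'onnx-community/paligemma2-3b-ft-docci-448';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id);

const image = await RawImage.fromURL('https://example.com/flower.jpg');
const inputs = await processor(image, '<image>What is on the flower?');

// Generate, then decode only the newly produced tokens.
const output = await model.generate({ ...inputs, max_new_tokens: 20 });
const new_tokens = output.slice(null, [inputs.input_ids.dims.at(-1), null]);
console.log(processor.tokenizer.batch_decode(new_tokens, { skip_special_tokens: true }));
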

const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
82 changes: 82 additions & 0 deletions src/models/paligemma/processing_paligemma.js
@@ -0,0 +1,82 @@
import { Processor } from "../../base/processing_utils.js";
import { AutoImageProcessor } from "../auto/image_processing_auto.js";
import { AutoTokenizer } from "../../tokenizers.js";

const IMAGE_TOKEN = "<image>";

function build_string_from_input(
prompt,
bos_token,
image_seq_len,
image_token,
num_images,
) {
return `${image_token.repeat(image_seq_len * num_images)}${bos_token}${prompt}\n`
}

export class PaliGemmaProcessor extends Processor {
static tokenizer_class = AutoTokenizer
static image_processor_class = AutoImageProcessor
static uses_processor_config = false;

/**
* @typedef {import('../../utils/image.js').RawImage} RawImage
*/

// `images` is required, `text` is optional
async _call(/** @type {RawImage|RawImage[]} */ images, text = null, kwargs = {}) {
if (!text) {
console.warn(
"You are using PaliGemma without a text prefix. It will perform as a picture-captioning model."
)
text = ""
}

if (!Array.isArray(images)) {
images = [images]
}

if (!Array.isArray(text)) {
text = [text]
}

const bos_token = this.tokenizer.bos_token;
const image_seq_length = this.image_processor.config.image_seq_length;
let input_strings;
if (text.some((t) => t.includes(IMAGE_TOKEN))) {
input_strings = text.map(
sample => {
const expanded_sample = sample.replaceAll(IMAGE_TOKEN, IMAGE_TOKEN.repeat(image_seq_length));
const bos_rfind_index = expanded_sample.lastIndexOf(IMAGE_TOKEN);
const bos_index = bos_rfind_index === -1 ? 0 : bos_rfind_index + IMAGE_TOKEN.length;
return expanded_sample.slice(0, bos_index) + bos_token + expanded_sample.slice(bos_index) + "\n";
}
)
} else {
console.warn(
"You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special " +
"image tokens in the text, as many tokens as there are images per each text. It is recommended to " +
"add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images " +
"each text has and add special tokens."
)

input_strings = text.map(
sample => build_string_from_input(
sample,
bos_token,
image_seq_length,
IMAGE_TOKEN,
images.length,
)
)
}

const text_inputs = this.tokenizer(input_strings, kwargs);
const image_inputs = await this.image_processor(images, kwargs);

return {
...image_inputs,
...text_inputs,
}
}
}
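
To make the prompt construction concrete: `build_string_from_input` prepends `image_seq_len × num_images` placeholder tokens, then the BOS token, then the prompt and a trailing newline. A runnable demo of the same logic with a small sequence length (real checkpoints use larger values — e.g. 256 image tokens for 224×224 inputs, which is why the image-only processor test below expects input_ids of length 258 = 256 + BOS + newline):

// Same construction as build_string_from_input above, shown standalone.
const build = (prompt, bos, seqLen, imgTok, numImages) =>
  `${imgTok.repeat(seqLen * numImages)}${bos}${prompt}\n`;

console.log(build('describe the image', '<bos>', 3, '<image>', 1));
// "<image><image><image><bos>describe the image\n"
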
1 change: 1 addition & 0 deletions src/models/processors.js
@@ -4,6 +4,7 @@ export * from './idefics3/processing_idefics3.js';
export * from './janus/processing_janus.js';
export * from './jina_clip/processing_jina_clip.js';
export * from './owlvit/processing_owlvit.js';
export * from './paligemma/processing_paligemma.js';
export * from './pyannote/processing_pyannote.js';
export * from './qwen2_vl/processing_qwen2_vl.js';
export * from './sam/processing_sam.js';
6 changes: 6 additions & 0 deletions src/tokenizers.js
@@ -2605,6 +2605,12 @@ export class PreTrainedTokenizer extends Callable {
this.unk_token = this.getToken('unk_token');
this.unk_token_id = this.model.tokens_to_ids.get(this.unk_token);

this.bos_token = this.getToken('bos_token');
this.bos_token_id = this.model.tokens_to_ids.get(this.bos_token);

this.eos_token = this.getToken('eos_token');
this.eos_token_id = this.model.tokens_to_ids.get(this.eos_token);

this.model_max_length = tokenizerConfig.model_max_length;

/** @type {boolean} Whether or not to strip the text when tokenizing (removing excess spaces before and after the string). */
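
These new accessors mirror the existing `unk_token` pattern; `PaliGemmaProcessor` relies on them to read `this.tokenizer.bos_token` directly. A small usage sketch (checkpoint id is a placeholder; exact token strings and ids vary per model):

import { AutoTokenizer } from '@huggingface/transformers';

// Placeholder checkpoint — any tokenizer with bos/eos entries works.
const tokenizer = await AutoTokenizer.from_pretrained('onnx-community/paligemma2-3b-ft-docci-448');
console.log(tokenizer.bos_token, tokenizer.bos_token_id); // e.g. '<bos>' 2
console.log(tokenizer.eos_token, tokenizer.eos_token_id); // e.g. '<eos>' 1
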
36 changes: 36 additions & 0 deletions tests/processors.test.js
@@ -48,6 +48,7 @@ const MODELS = {
florence2: "Xenova/tiny-random-Florence2ForConditionalGeneration",
qwen2_vl: "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration",
idefics3: "hf-internal-testing/tiny-random-Idefics3ForConditionalGeneration",
paligemma: "hf-internal-testing/tiny-random-PaliGemmaForConditionalGeneration",
};

const BASE_URL = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/";
@@ -1196,5 +1197,40 @@ describe("Processors", () => {
},
MAX_TEST_TIME,
);

describe(
"PaliGemmaProcessor",
() => {
/** @type {import('../src/transformers.js').PaliGemmaProcessor} */
let processor;
let images = {};

beforeAll(async () => {
processor = await AutoProcessor.from_pretrained(MODELS.paligemma);
images = {
white_image: await load_image(TEST_IMAGES.white_image),
};
});

it("Image-only (default text)", async () => {
const { input_ids, pixel_values } = await processor(images.white_image);
compare(input_ids.dims, [1, 258]);
compare(pixel_values.dims, [1, 3, 224, 224]);
});

it("Single image & text", async () => {
const { input_ids, pixel_values } = await processor(images.white_image, "<image>What is on the flower?");
compare(input_ids.dims, [1, 264]);
compare(pixel_values.dims, [1, 3, 224, 224]);
});

it("Multiple images & text", async () => {
const { input_ids, pixel_values } = await processor([images.white_image, images.white_image], "<image><image>Describe the images.");
compare(input_ids.dims, [1, 518]);
compare(pixel_values.dims, [2, 3, 224, 224]);
});
},
MAX_TEST_TIME,
);
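
As a sanity check on these expected shapes, each input_ids length decomposes as image tokens + BOS + text tokens; the image_seq_length is 256 for this checkpoint's 224×224 inputs, and the per-prompt text token counts below are inferred from the totals:

// 258 = 256 + 1 (BOS) + 1 ("\n" appended by the processor)
// 264 = 256 + 1 + 7 ("What is on the flower?\n")
// 518 = 2 * 256 + 1 + 5 ("Describe the images.\n")
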
});
});
54 changes: 54 additions & 0 deletions tests/tiny_random.test.js
@@ -20,6 +20,7 @@ import {
Processor,
Florence2Processor,
Idefics3Processor,
PaliGemmaProcessor,

// Models
LlamaForCausalLM,
@@ -54,6 +55,7 @@
VisionEncoderDecoderModel,
Florence2ForConditionalGeneration,
Qwen2VLForConditionalGeneration,
PaliGemmaForConditionalGeneration,
MarianMTModel,
PatchTSTModel,
PatchTSTForPrediction,
@@ -1072,6 +1074,58 @@ describe("Tiny random models", () => {
});
});

describe("paligemma", () => {
const text = "<image>What is on the flower?";

// Empty white image
const dims = [224, 224, 3];
const image = new RawImage(new Uint8ClampedArray(dims[0] * dims[1] * dims[2]).fill(255), ...dims);

describe("PaliGemmaForConditionalGeneration", () => {
const model_id = "hf-internal-testing/tiny-random-PaliGemmaForConditionalGeneration";

/** @type {PaliGemmaForConditionalGeneration} */
let model;
/** @type {PaliGemmaProcessor} */
let processor;
beforeAll(async () => {
model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id, {
// TODO move to config
...DEFAULT_MODEL_OPTIONS,
});
processor = await AutoProcessor.from_pretrained(model_id);
}, MAX_MODEL_LOAD_TIME);

it(
"forward",
async () => {
const inputs = await processor(image, text);

const { logits } = await model(inputs);
expect(logits.dims).toEqual([1, 264, 257216]);
expect(logits.mean().item()).toBeCloseTo(-0.0023024685215204954, 6);
},
MAX_TEST_EXECUTION_TIME,
);

it(
"batch_size=1",
async () => {
const inputs = await processor(image, text);
const generate_ids = await model.generate({ ...inputs, max_new_tokens: 10 });

const new_tokens = generate_ids.slice(null, [inputs.input_ids.dims.at(-1), null]);
expect(new_tokens.tolist()).toEqual([[91711n, 24904n, 144054n, 124983n, 83862n, 124983n, 124983n, 124983n, 141236n, 124983n]]);
},
MAX_TEST_EXECUTION_TIME,
);

afterAll(async () => {
await model?.dispose();
}, MAX_MODEL_DISPOSE_TIME);
});
});

describe("vision-encoder-decoder", () => {
describe("VisionEncoderDecoderModel", () => {
const model_id = "hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2";
