Add support for ChineseCLIP models #455

Merged · 5 commits · Dec 13, 2023
1 change: 1 addition & 0 deletions README.md
@@ -274,6 +274,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -10,6 +10,7 @@
1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
10 changes: 10 additions & 0 deletions scripts/supported_models.py
@@ -184,6 +184,16 @@
# 'Xenova/tiny-random-ClapModel',
}
},
+    'chinese_clip': {
+        # Zero-shot image classification
+        # TODO: Add `--split_modalities` option
+        'zero-shot-image-classification': [
+            'OFA-Sys/chinese-clip-vit-base-patch16',
+            'OFA-Sys/chinese-clip-vit-large-patch14',
+            'OFA-Sys/chinese-clip-vit-large-patch14-336px',
+            # 'OFA-Sys/chinese-clip-vit-huge-patch14', # TODO add
+        ],
+    },
'clip': {
# Zero-shot image classification (and feature extraction)
# (with and without `--split_modalities`)
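For context, once one of the listed checkpoints has been converted to ONNX, it should be usable through the `zero-shot-image-classification` pipeline added here. A minimal sketch, assuming a hypothetical converted model id (`Xenova/chinese-clip-vit-base-patch16`), a placeholder image URL, and a Chinese hypothesis template — none of which are part of this PR:

```js
import { pipeline } from '@xenova/transformers';

// Hypothetical converted checkpoint; the PR itself only lists the OFA-Sys source models.
const classifier = await pipeline(
    'zero-shot-image-classification',
    'Xenova/chinese-clip-vit-base-patch16',
);

// Placeholder image URL; any http(s) URL or local path should work.
const url = 'https://example.com/football-match.jpg';
const output = await classifier(url, ['足球', '机场', '动物'], {
    hypothesis_template: '一张{}的照片', // Chinese template instead of the English default
});
console.log(output);
// e.g. [{ score: 0.97, label: '足球' }, { score: 0.02, label: '动物' }, { score: 0.01, label: '机场' }]
```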
9 changes: 9 additions & 0 deletions src/models.js
@@ -3084,8 +3084,16 @@ export class CLIPVisionModelWithProjection extends CLIPPreTrainedModel {
return super.from_pretrained(pretrained_model_name_or_path, options);
}
}
//////////////////////////////////////////////////


+//////////////////////////////////////////////////
+// ChineseCLIP models
+export class ChineseCLIPPreTrainedModel extends PreTrainedModel { }
+
+export class ChineseCLIPModel extends ChineseCLIPPreTrainedModel { }
+//////////////////////////////////////////////////
+
+
//////////////////////////////////////////////////
// GPT2 models
@@ -4677,6 +4685,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
['xlm-roberta', ['XLMRobertaModel', XLMRobertaModel]],
['clap', ['ClapModel', ClapModel]],
['clip', ['CLIPModel', CLIPModel]],
+['chinese_clip', ['ChineseCLIPModel', ChineseCLIPModel]],
['mobilebert', ['MobileBertModel', MobileBertModel]],
['squeezebert', ['SqueezeBertModel', SqueezeBertModel]],
['wav2vec2', ['Wav2Vec2Model', Wav2Vec2Model]],
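Because `ChineseCLIPModel` is now registered under `chinese_clip` in `MODEL_MAPPING_NAMES_ENCODER_ONLY`, it can also be loaded directly through `AutoModel` and driven the same way `ZeroShotImageClassificationPipeline` drives it (see src/pipelines.js below). A rough sketch, assuming a hypothetical converted checkpoint id and a placeholder image path:

```js
import { AutoTokenizer, AutoProcessor, AutoModel, RawImage } from '@xenova/transformers';

const model_id = 'Xenova/chinese-clip-vit-base-patch16'; // assumed converted checkpoint
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id); // resolves to ChineseCLIPModel

// Prepare text and image inputs, mirroring the zero-shot image classification pipeline
const text_inputs = tokenizer(['一只猫', '一只狗'], { padding: true, truncation: true });
const image = await RawImage.read('path/to/image.jpg'); // placeholder path
const { pixel_values } = await processor(image);

// A single forward pass with both modalities
const output = await model({ ...text_inputs, pixel_values });
console.log(output.logits_per_image.dims); // expected [num_images, num_texts], e.g. [1, 2]
```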
26 changes: 13 additions & 13 deletions src/pipelines.js
@@ -1762,38 +1762,38 @@ export class ZeroShotImageClassificationPipeline extends Pipeline {
    async _call(images, candidate_labels, {
        hypothesis_template = "This is a photo of {}"
    } = {}) {
-        let isBatched = Array.isArray(images);
+        const isBatched = Array.isArray(images);
        images = await prepareImages(images);

        // Insert label into hypothesis template
-        let texts = candidate_labels.map(
+        const texts = candidate_labels.map(
            x => hypothesis_template.replace('{}', x)
        );

        // Run tokenization
-        let text_inputs = this.tokenizer(texts, {
+        const text_inputs = this.tokenizer(texts, {
            padding: true,
            truncation: true
        });

        // Run processor
-        let { pixel_values } = await this.processor(images);
+        const { pixel_values } = await this.processor(images);

        // Run model with both text and pixel inputs
-        let output = await this.model({ ...text_inputs, pixel_values });
+        const output = await this.model({ ...text_inputs, pixel_values });

        // Compare each image with each candidate label
-        let toReturn = [];
-        for (let batch of output.logits_per_image) {
+        const toReturn = [];
+        for (const batch of output.logits_per_image) {
            // Compute softmax per image
-            let probs = softmax(batch.data);
+            const probs = softmax(batch.data);

-            toReturn.push([...probs].map((x, i) => {
-                return {
-                    score: x,
-                    label: candidate_labels[i]
-                }
+            const result = [...probs].map((x, i) => ({
+                score: x,
+                label: candidate_labels[i]
            }));
+            result.sort((a, b) => b.score - a.score); // sort by score in descending order
+            toReturn.push(result);
        }

        return isBatched ? toReturn : toReturn[0];
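Beyond the `let` → `const` cleanup, the behavioural change in this hunk is that each image's label scores are now returned in descending-score order rather than in the order of `candidate_labels`. A standalone illustration of the comparator used:

```js
const result = [
    { score: 0.01, label: 'airport' },
    { score: 0.97, label: 'football' },
    { score: 0.02, label: 'animals' },
];
result.sort((a, b) => b.score - a.score); // highest score first
console.log(result.map(r => r.label)); // [ 'football', 'animals', 'airport' ]
```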
2 changes: 2 additions & 0 deletions src/processors.js
@@ -613,6 +613,7 @@ export class BitImageProcessor extends ImageFeatureExtractor { }
export class DPTFeatureExtractor extends ImageFeatureExtractor { }
export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
export class CLIPFeatureExtractor extends ImageFeatureExtractor { }
+export class ChineseCLIPFeatureExtractor extends ImageFeatureExtractor { }
export class ConvNextFeatureExtractor extends ImageFeatureExtractor { }
export class ConvNextImageProcessor extends ConvNextFeatureExtractor { } // NOTE extends ConvNextFeatureExtractor
export class ViTFeatureExtractor extends ImageFeatureExtractor { }
@@ -1695,6 +1696,7 @@ export class AutoProcessor {
MobileViTFeatureExtractor,
OwlViTFeatureExtractor,
CLIPFeatureExtractor,
+ChineseCLIPFeatureExtractor,
ConvNextFeatureExtractor,
ConvNextImageProcessor,
BitImageProcessor,
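`ChineseCLIPFeatureExtractor` adds no behaviour of its own here; registering it with `AutoProcessor` simply lets checkpoints whose preprocessor config names that class reuse the `ImageFeatureExtractor` preprocessing, as CLIP does. A small sketch, again assuming a hypothetical converted checkpoint and placeholder image path:

```js
import { AutoProcessor, RawImage } from '@xenova/transformers';

// Assumed checkpoint whose preprocessor_config.json specifies ChineseCLIPFeatureExtractor
const processor = await AutoProcessor.from_pretrained('Xenova/chinese-clip-vit-base-patch16');

const image = await RawImage.read('path/to/image.jpg'); // placeholder path
const { pixel_values } = await processor(image); // resized + normalized tensor, e.g. dims [1, 3, 224, 224]
```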
24 changes: 12 additions & 12 deletions tests/pipelines.test.js
@@ -1179,9 +1179,9 @@ describe('Pipelines', () => {
            let output = await classifier(url, classes);

            let expected = [
-                { "score": 0.992206871509552, "label": "football" },
-                { "score": 0.0013248942559584975, "label": "airport" },
-                { "score": 0.006468251813203096, "label": "animals" }
+                { score: 0.9719080924987793, label: 'football' },
+                { score: 0.022564826533198357, label: 'animals' },
+                { score: 0.005527070723474026, label: 'airport' }
            ]
            compare(output, expected, 0.1);

@@ -1194,17 +1194,17 @@

            let expected = [
                [
-                    { "score": 0.9919875860214233, "label": "football" },
-                    { "score": 0.0012227334082126617, "label": "airport" },
-                    { "score": 0.006789708975702524, "label": "animals" }
+                    { score: 0.9712504148483276, label: 'football' },
+                    { score: 0.022469401359558105, label: 'animals' },
+                    { score: 0.006280169822275639, label: 'airport' }
                ], [
-                    { "score": 0.0003043194592464715, "label": "football" },
-                    { "score": 0.998708188533783, "label": "airport" },
-                    { "score": 0.0009874969255179167, "label": "animals" }
+                    { score: 0.997433602809906, label: 'airport' },
+                    { score: 0.0016500800848007202, label: 'animals' },
+                    { score: 0.0009163151844404638, label: 'football' }
                ], [
-                    { "score": 0.015163016505539417, "label": "football" },
-                    { "score": 0.016037866473197937, "label": "airport" },
-                    { "score": 0.9687991142272949, "label": "animals" }
+                    { score: 0.9851226806640625, label: 'animals' },
+                    { score: 0.007516484707593918, label: 'football' },
+                    { score: 0.007360846735537052, label: 'airport' }
                ]
            ];
            compare(output, expected, 0.1);
1 change: 1 addition & 0 deletions tests/processors.test.js
@@ -345,6 +345,7 @@ describe('Processors', () => {
// VitMatteImageProcessor
// - tests custom overrides
// - tests multiple inputs
+// - tests `size_divisibility` and no size (size_divisibility=32)
it(MODELS.vitmatte, async () => {
const processor = await AutoProcessor.from_pretrained(m(MODELS.vitmatte))
