Merge branch 'main' into add-depth-estimation
xenova authored Nov 20, 2023
2 parents 8903ede + 5ddc472 commit 6e2505d
Showing 22 changed files with 693 additions and 74 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -247,6 +247,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. ||
| [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. ||
| [Zero-Shot Image Classification](https://huggingface.co/tasks/zero-shot-image-classification) | `zero-shot-image-classification` | Classifying images into classes that are unseen during training. |[(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotImageClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&library=transformers.js) |
| [Zero-Shot Object Detection](https://huggingface.co/tasks/zero-shot-object-detection) | `zero-shot-object-detection` | Identify objects of classes that are unseen during training. |[(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?other=zero-shot-object-detection&library=transformers.js) |

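For reference, a minimal sketch of how the newly listed `zero-shot-object-detection` task is typically invoked through the pipeline API. The checkpoint ID, image URL, candidate labels, and threshold below are illustrative assumptions, not part of this commit:

```js
import { pipeline } from '@xenova/transformers';

// Load a zero-shot object detection pipeline backed by OWL-ViT.
// NOTE: the checkpoint ID is an assumed example of a compatible conversion.
const detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');

// Detect arbitrary, user-specified classes in an image (URL is a placeholder).
const image = 'https://example.com/astronaut.png';
const candidate_labels = ['human face', 'rocket', 'helmet', 'american flag'];
const output = await detector(image, candidate_labels, { threshold: 0.1 });
// output (roughly): [{ score, label, box: { xmin, ymin, xmax, ymax } }, ...]
```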

#### Reinforcement Learning
@@ -301,7 +302,9 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaiML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1 change: 1 addition & 0 deletions docs/snippets/5_supported-tasks.snippet
@@ -59,6 +59,7 @@
| [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. | ❌ |
| [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. | ❌ |
| [Zero-Shot Image Classification](https://huggingface.co/tasks/zero-shot-image-classification) | `zero-shot-image-classification` | Classifying images into classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotImageClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&library=transformers.js) |
| [Zero-Shot Object Detection](https://huggingface.co/tasks/zero-shot-object-detection) | `zero-shot-object-detection` | Identify objects of classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?other=zero-shot-object-detection&library=transformers.js) |


#### Reinforcement Learning
2 changes: 2 additions & 0 deletions docs/snippets/6_supported-models.snippet
@@ -43,7 +43,9 @@
1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaiML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
2 changes: 1 addition & 1 deletion package.json
@@ -10,7 +10,7 @@
"dev": "webpack serve --no-client-overlay",
"build": "webpack && npm run typegen",
"generate-tests": "python -m tests.generate_tests",
"test": "node --experimental-vm-modules node_modules/jest/bin/jest.js --verbose",
"test": "node --experimental-vm-modules node_modules/jest/bin/jest.js --verbose --maxConcurrency 1",
"readme": "python ./docs/scripts/build_readme.py",
"docs-api": "node ./docs/scripts/generate.js",
"docs-preview": "doc-builder preview transformers.js ./docs/source/ --not_python_module",
7 changes: 6 additions & 1 deletion scripts/convert.py
@@ -84,7 +84,7 @@
'vision-encoder-decoder': {
'per_channel': False,
'reduce_range': False,
}
},
}

MODELS_WITHOUT_TOKENIZERS = [
@@ -326,6 +326,11 @@ def main():
with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
json.dump(tokenizer_json, fp, indent=4)

elif config.model_type == 'owlvit':
# Override default batch size to 1, needed because non-maximum suppression is performed for exporting.
# For more information, see https://github.com/huggingface/optimum/blob/e3b7efb1257c011db907ef40ab340e795cc5684c/optimum/exporters/onnx/model_configs.py#L1028-L1032
export_kwargs['batch_size'] = 1

else:
pass # TODO

12 changes: 12 additions & 0 deletions scripts/supported_models.py
@@ -358,13 +358,25 @@
'google/mt5-small',
'google/mt5-base',
],
'nougat': [
# Image-to-text
'facebook/nougat-small',
'facebook/nougat-base',
],
'opt': [
# Text generation
'facebook/opt-125m',
'facebook/opt-350m',
# (TODO conversational)
'PygmalionAI/pygmalion-350m',
],
'owlvit': [
# Object detection (Zero-shot object detection)
# NOTE: Exported with --batch_size 1
'google/owlvit-base-patch32',
'google/owlvit-base-patch16',
'google/owlvit-large-patch14',
],
'resnet': [
# Image classification
'microsoft/resnet-18',
2 changes: 1 addition & 1 deletion src/backends/onnx.js
@@ -21,7 +21,7 @@
import * as ONNX_NODE from 'onnxruntime-node';
import * as ONNX_WEB from 'onnxruntime-web';

/** @type {module} The ONNX runtime module. */
/** @type {import('onnxruntime-web')} The ONNX runtime module. */
export let ONNX;

export const executionProviders = [
71 changes: 55 additions & 16 deletions src/models.js
@@ -42,6 +42,10 @@ import {
AutoConfig,
} from './configs.js';

import {
add_token_types,
} from './tokenizers.js';

import {
Callable,
isIntegralNumber,
@@ -64,6 +68,7 @@ import {
WhisperTimeStampLogitsProcessor,
NoRepeatNGramLogitsProcessor,
RepetitionPenaltyLogitsProcessor,
NoBadWordsLogitsProcessor,
MinLengthLogitsProcessor,
MinNewTokensLengthLogitsProcessor,

@@ -82,7 +87,9 @@

import { executionProviders, ONNX } from './backends/onnx.js';
import { medianFilter } from './transformers.js';
const { InferenceSession, Tensor: ONNXTensor } = ONNX;
const { InferenceSession, Tensor: ONNXTensor, env } = ONNX;

/** @typedef {import('onnxruntime-web').InferenceSession} InferenceSession */

//////////////////////////////////////////////////
// Model types: used internally
@@ -142,21 +149,31 @@ async function constructSession(pretrained_model_name_or_path, fileName, options
/**
* Validate model inputs
* @param {InferenceSession} session The InferenceSession object that will be run.
* @param {Object} inputs The inputs to check.
* @returns {Promise<Object>} A Promise that resolves to the checked inputs.
* @param {Record<string, Tensor>} inputs The inputs to check.
* @returns {Record<string, Tensor>} The checked inputs.
* @throws {Error} If any inputs are missing.
* @private
*/
async function validateInputs(session, inputs) {
// NOTE: Only create a shallow copy
const checkedInputs = {};
function validateInputs(session, inputs) {
/**
* NOTE: Create either a shallow or deep copy based on `onnx.wasm.proxy`
* @type {Record<string, Tensor>}
*/
const checkedInputs = Object.create(null);
const missingInputs = [];
for (let inputName of session.inputNames) {
if (inputs[inputName] === undefined) {
for (const inputName of session.inputNames) {
const tensor = inputs[inputName];
// Rare case where one of the model's input names corresponds to a built-in
// object name (e.g., toString), which would cause a simple (!tensor) check to fail,
// because it's not undefined but a function.
if (!(tensor instanceof Tensor)) {
missingInputs.push(inputName);
} else {
checkedInputs[inputName] = inputs[inputName];
continue;
}
// NOTE: When `env.wasm.proxy is true` the tensor is moved across the Worker
// boundary, transferring ownership to the worker and invalidating the tensor.
// So, in this case, we simply sacrifice a clone for it.
checkedInputs[inputName] = env.wasm.proxy ? tensor.clone() : tensor;
}
if (missingInputs.length > 0) {
throw new Error(
@@ -187,7 +204,7 @@ async function validateInputs(session, inputs) {
* @private
*/
async function sessionRun(session, inputs) {
const checkedInputs = await validateInputs(session, inputs);
const checkedInputs = validateInputs(session, inputs);
try {
let output = await session.run(checkedInputs);
output = replaceTensors(output);
@@ -488,10 +505,15 @@ function seq2seqUpdatebeam(beam, newTokenId) {
* @private
*/
async function encoderForward(self, model_inputs) {
let encoderFeeds = {};
for (let key of self.session.inputNames) {
const encoderFeeds = Object.create(null);
for (const key of self.session.inputNames) {
encoderFeeds[key] = model_inputs[key];
}
if (self.session.inputNames.includes('token_type_ids') && !encoderFeeds.token_type_ids) {
// Assign default `token_type_ids` to the `encoderFeeds` if the model expects it,
// but they weren't created by the tokenizer.
add_token_types(encoderFeeds);
}
return await sessionRun(self.session, encoderFeeds);
}

@@ -836,9 +858,9 @@ export class PreTrainedModel extends Callable {
// }
// }

// if (generation_config.bad_words_ids !== null) {
// processors.push(new NoBadWordsLogitsProcessor(generation_config.bad_words_ids, generation_config.eos_token_id));
// }
if (generation_config.bad_words_ids !== null) {
processors.push(new NoBadWordsLogitsProcessor(generation_config.bad_words_ids, generation_config.eos_token_id));
}

if (generation_config.min_length !== null && generation_config.eos_token_id !== null && generation_config.min_length > 0) {
processors.push(new MinLengthLogitsProcessor(generation_config.min_length, generation_config.eos_token_id));
@@ -3179,6 +3201,12 @@ export class MobileViTForImageClassification extends MobileViTPreTrainedModel {

//////////////////////////////////////////////////

//////////////////////////////////////////////////
export class OwlViTPreTrainedModel extends PreTrainedModel { }
export class OwlViTModel extends OwlViTPreTrainedModel { }
export class OwlViTForObjectDetection extends OwlViTPreTrainedModel { }
//////////////////////////////////////////////////

//////////////////////////////////////////////////
// Beit Models
export class BeitPreTrainedModel extends PreTrainedModel { }
@@ -4083,6 +4111,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
['detr', ['DetrModel', DetrModel]],
['vit', ['ViTModel', ViTModel]],
['mobilevit', ['MobileViTModel', MobileViTModel]],
['owlvit', ['OwlViTModel', OwlViTModel]],
['beit', ['BeitModel', BeitModel]],
['deit', ['DeiTModel', DeiTModel]],
['resnet', ['ResNetModel', ResNetModel]],
@@ -4246,6 +4275,10 @@ const MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = new Map([
['yolos', ['YolosForObjectDetection', YolosForObjectDetection]],
]);

const MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = new Map([
['owlvit', ['OwlViTForObjectDetection', OwlViTForObjectDetection]],
]);

const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([
['detr', ['DetrForSegmentation', DetrForSegmentation]],
]);
@@ -4291,6 +4324,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_MASK_GENERATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_CTC_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
@@ -4461,6 +4495,11 @@ export class AutoModelForObjectDetection extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES];
}

export class AutoModelForZeroShotObjectDetection extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES];
}


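A corresponding minimal sketch of loading the zero-shot object detection head through the new Auto class; the checkpoint ID is an assumed example of an OWL-ViT conversion:

```js
import { AutoModelForZeroShotObjectDetection } from '@xenova/transformers';

// Resolves to OwlViTForObjectDetection via MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES.
// NOTE: the checkpoint ID is an assumed example, not part of this commit.
const model = await AutoModelForZeroShotObjectDetection.from_pretrained('Xenova/owlvit-base-patch32');
```

In practice this model is paired with its matching processor and tokenizer, which is what the `zero-shot-object-detection` pipeline does internally.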
/**
* Helper class which is used to instantiate pretrained object detection models with the `from_pretrained` function.
* The chosen model class is determined by the type specified in the model config.