Add image-to-image task w/ Swin2SR (for super-resolution) (#381)
* Add `Swin2SRImageProcessor`

* Add `RawImage.fromTensor` helper function

* Add clamp tensor function

* Add support for `.to` data type conversion

* Add `round` tensor function

* Add support for `mul` tensor function

* Fix image padding

* Only perform padding if it will affect size

* Create basic processors unit test suite

* Add SamProcessor test case

* Move `CONTENT_TYPE_MAP` outside `RawImage` class

* Perform reflective padding for swin2sr models

* Add swin2sr models for image super-resolution

* Add listed support for Swin2SR models

* Add image-to-image pipeline

* Add listed support for image-to-image task

* Add image-to-image unit tests

* Add `add` tensor functions

* Generalize `pad_image` helper function

* Add more unit tests for image processors

* Fix typo
xenova authored Nov 9, 2023
1 parent 3e8d227 commit 73a99ba
Showing 12 changed files with 679 additions and 39 deletions.
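
The new tensor utilities (`clamp`, `mul`, `round`, and `.to` data-type conversion) and the `RawImage.fromTensor` helper are designed to compose into the post-processing chain used by the Swin2SR examples further down. A minimal, illustrative sketch of how they fit together, assuming `Tensor` is exported from the package entry point:

```javascript
import { Tensor, RawImage } from '@xenova/transformers';

// Toy 3×2×2 channels-first tensor with values outside [0, 1]
const data = Float32Array.from([
    -0.1, 0.2, 0.5, 1.3,   // R
     0.0, 0.4, 0.9, 1.0,   // G
     0.7, 0.1, 0.3, 0.6,   // B
]);
const tensor = new Tensor('float32', data, [3, 2, 2]);

// clamp to [0, 1] → scale to [0, 255] → round → cast to uint8
const pixels = tensor.clamp_(0, 1).mul_(255).round_().to('uint8');

// Convert the CHW tensor back to an image
const image = RawImage.fromTensor(pixels);
// image.width === 2, image.height === 2, image.channels === 3
```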
README.md: 3 changes (2 additions & 1 deletion)
@@ -210,7 +210,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. ||
| [Image Classification](https://huggingface.co/tasks/image-classification) | `image-classification` | Assigning a label or class to an entire image. |[(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers.js) |
| [Image Segmentation](https://huggingface.co/tasks/image-segmentation) | `image-segmentation` | Divides an image into segments where each pixel is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation. |[(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageSegmentationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-segmentation&library=transformers.js) |
| [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | |
| [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToImagePipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-to-image&library=transformers.js) |
| [Mask Generation](https://huggingface.co/tasks/mask-generation) | `mask-generation` | Generate masks for the objects in an image. ||
| [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. |[(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) |
| [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. ||
@@ -302,6 +302,7 @@
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
docs/snippets/5_supported-tasks.snippet: 2 changes (1 addition & 1 deletion)
@@ -25,7 +25,7 @@
| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. | ❌ |
| [Image Classification](https://huggingface.co/tasks/image-classification) | `image-classification` | Assigning a label or class to an entire image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers.js) |
| [Image Segmentation](https://huggingface.co/tasks/image-segmentation) | `image-segmentation` | Divides an image into segments where each pixel is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageSegmentationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-segmentation&library=transformers.js) |
| [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | |
| [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToImagePipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-to-image&library=transformers.js) |
| [Mask Generation](https://huggingface.co/tasks/mask-generation) | `mask-generation` | Generate masks for the objects in an image. | ❌ |
| [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) |
| [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ |
docs/snippets/6_supported-models.snippet: 1 change (1 addition & 0 deletions)
@@ -47,6 +47,7 @@
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
scripts/supported_models.py: 11 changes (11 additions & 0 deletions)
@@ -406,6 +406,17 @@
'microsoft/swin-large-patch4-window7-224-in22k',
'microsoft/swin-large-patch4-window12-384',
],
'swin2sr': [
# Image-to-image (Super-resolution)
'caidas/swin2SR-classical-sr-x2-64',
'caidas/swin2SR-realworld-sr-x4-64-bsrgan-psnr',
'caidas/swin2SR-classical-sr-x4-64',
'caidas/swin2SR-compressed-sr-x4-48',
'caidas/swin2SR-lightweight-x2-64',

# Feature extraction
'hf-tiny-model-private/tiny-random-Swin2SRModel',
],
't5': [
# Text-to-text (Translation/Summarization)
't5-small',
src/models.js: 53 changes (53 additions & 0 deletions)
@@ -3300,6 +3300,49 @@ export class SwinForImageClassification extends SwinPreTrainedModel {
}
//////////////////////////////////////////////////

//////////////////////////////////////////////////
export class Swin2SRPreTrainedModel extends PreTrainedModel { }

/**
* The bare Swin2SR Model transformer outputting raw hidden-states without any specific head on top.
*/
export class Swin2SRModel extends Swin2SRPreTrainedModel { }

/**
* Swin2SR Model transformer with an upsampler head on top for image super resolution and restoration.
*
* **Example:** Super-resolution w/ `Xenova/swin2SR-classical-sr-x2-64`.
*
* ```javascript
* import { AutoProcessor, Swin2SRForImageSuperResolution, RawImage } from '@xenova/transformers';
*
* // Load processor and model
* const model_id = 'Xenova/swin2SR-classical-sr-x2-64';
* const processor = await AutoProcessor.from_pretrained(model_id);
* const model = await Swin2SRForImageSuperResolution.from_pretrained(model_id);
*
* // Prepare model inputs
* const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/butterfly.jpg';
* const image = await RawImage.fromURL(url);
* const inputs = await processor(image);
*
* // Run model
* const outputs = await model(inputs);
*
* // Convert Tensor to RawImage
* const output = outputs.reconstruction.squeeze().clamp_(0, 1).mul_(255).round_().to('uint8');
* const outputImage = RawImage.fromTensor(output);
* // RawImage {
* // data: Uint8Array(786432) [ 41, 31, 24, ... ],
* // width: 512,
* // height: 512,
* // channels: 3
* // }
* ```
*/
export class Swin2SRForImageSuperResolution extends Swin2SRPreTrainedModel { }
//////////////////////////////////////////////////

//////////////////////////////////////////////////
export class DonutSwinPreTrainedModel extends PreTrainedModel { }

@@ -3950,6 +3993,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
['deit', ['DeiTModel', DeiTModel]],
['resnet', ['ResNetModel', ResNetModel]],
['swin', ['SwinModel', SwinModel]],
['swin2sr', ['Swin2SRModel', Swin2SRModel]],
['donut-swin', ['DonutSwinModel', DonutSwinModel]],
['yolos', ['YolosModel', YolosModel]],

@@ -4124,6 +4168,10 @@ const MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = new Map([
['wavlm', ['WavLMForSequenceClassification', WavLMForSequenceClassification]],
]);

const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([
['swin2sr', ['Swin2SRForImageSuperResolution', Swin2SRForImageSuperResolution]],
])


const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES.EncoderOnly],
@@ -4139,6 +4187,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, MODEL_TYPES.Vision2Seq],
[MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_MASK_GENERATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_CTC_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
@@ -4333,6 +4382,10 @@ export class AutoModelForDocumentQuestionAnswering extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES];
}

export class AutoModelForImageToImage extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES];
}

//////////////////////////////////////////////////

//////////////////////////////////////////////////
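
The new `AutoModelForImageToImage` class can also be used directly with `AutoProcessor`, following the same post-processing chain as the `Swin2SRForImageSuperResolution` example above. A hedged sketch, assuming the class is re-exported from the package entry point like the other `AutoModelFor*` classes, and reusing the `Xenova/swin2SR-classical-sr-x2-64` checkpoint:

```javascript
import { AutoProcessor, AutoModelForImageToImage, RawImage } from '@xenova/transformers';

const model_id = 'Xenova/swin2SR-classical-sr-x2-64';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageToImage.from_pretrained(model_id);

// Load and preprocess the input image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/butterfly.jpg';
const image = await RawImage.fromURL(url);
const inputs = await processor(image);

// Run the model and convert the reconstruction tensor back to an image
const outputs = await model(inputs);
const upscaled = RawImage.fromTensor(
    outputs.reconstruction.squeeze().clamp_(0, 1).mul_(255).round_().to('uint8')
);
```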
src/pipelines.js: 51 changes (51 additions & 0 deletions)
Expand Up @@ -34,6 +34,7 @@ import {
AutoModelForImageSegmentation,
AutoModelForObjectDetection,
AutoModelForDocumentQuestionAnswering,
AutoModelForImageToImage,
// AutoModelForTextToWaveform,
PreTrainedModel,
} from './models.js';
@@ -1947,6 +1948,44 @@ export class TextToAudioPipeline extends Pipeline {
}
}

/**
* Image to Image pipeline using any `AutoModelForImageToImage`. This pipeline generates an image based on a previous image input.
*
* **Example:** Super-resolution w/ `Xenova/swin2SR-classical-sr-x2-64`
* ```javascript
* let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/butterfly.jpg';
* let upscaler = await pipeline('image-to-image', 'Xenova/swin2SR-classical-sr-x2-64');
* let output = await upscaler(url);
* // RawImage {
* // data: Uint8Array(786432) [ 41, 31, 24, 43, ... ],
* // width: 512,
* // height: 512,
* // channels: 3
* // }
* ```
*/
export class ImageToImagePipeline extends Pipeline {
/**
* Transform the image(s) passed as inputs.
* @param {any} images The images to transform.
* @returns {Promise<any>} An image or a list of images containing result(s).
*/
async _call(images) {
images = await prepareImages(images);

let inputs = await this.processor(images);
let outputs = await this.model(inputs);

let toReturn = [];
for (let batch of outputs.reconstruction) {
const output = batch.squeeze().clamp_(0, 1).mul_(255).round_().to('uint8');
toReturn.push(RawImage.fromTensor(output));
}

return toReturn.length > 1 ? toReturn : toReturn[0];
}
}

const SUPPORTED_TASKS = {
"text-classification": {
"tokenizer": AutoTokenizer,
@@ -2160,6 +2199,18 @@ const SUPPORTED_TASKS = {
},
"type": "multimodal",
},
"image-to-image": {
// no tokenizer
"pipeline": ImageToImagePipeline,
"model": AutoModelForImageToImage,
"processor": AutoProcessor,
"default": {
// TODO: replace with original
// "model": "caidas/swin2SR-classical-sr-x2-64",
"model": "Xenova/swin2SR-classical-sr-x2-64",
},
"type": "image",
},

// This task serves as a useful interface for dealing with sentence-transformers (https://huggingface.co/sentence-transformers).
"feature-extraction": {
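
With the task registered in `SUPPORTED_TASKS`, the pipeline can also be created without an explicit model id, falling back to the default `Xenova/swin2SR-classical-sr-x2-64` registered above; and since `_call` iterates over the batch, an array input should return an array of results. A hedged sketch under those assumptions:

```javascript
import { pipeline } from '@xenova/transformers';

// No model id given: falls back to the default model registered for the task
const upscaler = await pipeline('image-to-image');

// Passing an array should return an array of RawImage results (one per input)
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/butterfly.jpg';
const outputs = await upscaler([url, url]);
// [ RawImage { width: 512, height: 512, channels: 3 }, RawImage { ... } ]
```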