diff --git a/README.md b/README.md index df5a688b4..c07e9a094 100644 --- a/README.md +++ b/README.md @@ -210,7 +210,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te | Task | ID | Description | Supported? | |--------------------------|----|-------------|------------| -| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. | ❌ | +| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DepthEstimationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=depth-estimation&library=transformers.js) | | [Image Classification](https://huggingface.co/tasks/image-classification) | `image-classification` | Assigning a label or class to an entire image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers.js) | | [Image Segmentation](https://huggingface.co/tasks/image-segmentation) | `image-segmentation` | Divides an image into segments where each pixel is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageSegmentationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-segmentation&library=transformers.js) | | [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToImagePipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-to-image&library=transformers.js) | @@ -277,8 +277,10 @@ You can refine your search by selecting the task you're interested in (e.g., [te 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPT](https://huggingface.co/docs/transformers/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. 
**[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. 
**[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. diff --git a/docs/snippets/5_supported-tasks.snippet b/docs/snippets/5_supported-tasks.snippet index 5699f8f28..e4598cf1f 100644 --- a/docs/snippets/5_supported-tasks.snippet +++ b/docs/snippets/5_supported-tasks.snippet @@ -22,7 +22,7 @@ | Task | ID | Description | Supported? | |--------------------------|----|-------------|------------| -| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. | ❌ | +| [Depth Estimation](https://huggingface.co/tasks/depth-estimation) | `depth-estimation` | Predicting the depth of objects present in an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DepthEstimationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=depth-estimation&library=transformers.js) | | [Image Classification](https://huggingface.co/tasks/image-classification) | `image-classification` | Assigning a label or class to an entire image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers.js) | | [Image Segmentation](https://huggingface.co/tasks/image-segmentation) | `image-segmentation` | Divides an image into segments where each pixel is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageSegmentationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-segmentation&library=transformers.js) | | [Image-to-Image](https://huggingface.co/tasks/image-to-image) | `image-to-image` | Transforming a source image to match the characteristics of a target image or a target image domain. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToImagePipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-to-image&library=transformers.js) | diff --git a/docs/snippets/6_supported-models.snippet index f6f678ada..e3c947387 100644 --- a/docs/snippets/6_supported-models.snippet +++ b/docs/snippets/6_supported-models.snippet @@ -18,8 +18,10 @@ 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPT](https://huggingface.co/docs/transformers/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. 
**[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. 
**[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. diff --git a/scripts/supported_models.py b/scripts/supported_models.py index 6dfc30c75..723984d8d 100644 --- a/scripts/supported_models.py +++ b/scripts/supported_models.py @@ -208,11 +208,21 @@ # Document Question Answering 'naver-clova-ix/donut-base-finetuned-docvqa', ], + 'dpt': [ + # Depth estimation + 'Intel/dpt-hybrid-midas', + 'Intel/dpt-large', + ], 'falcon': [ # Text generation 'Rocketknight1/tiny-random-falcon-7b', 'fxmarty/really-tiny-falcon-testing', ], + 'glpn': [ + # Depth estimation + 'vinvino02/glpn-kitti', + 'vinvino02/glpn-nyu', + ], 'gpt_neo': [ # Text generation 'EleutherAI/gpt-neo-125M', diff --git a/src/models.js b/src/models.js index b190f0802..70567f971 100644 --- a/src/models.js +++ b/src/models.js @@ -3371,6 +3371,100 @@ export class Swin2SRModel extends Swin2SRPreTrainedModel { } export class Swin2SRForImageSuperResolution extends Swin2SRPreTrainedModel { } ////////////////////////////////////////////////// +////////////////////////////////////////////////// +export class DPTPreTrainedModel extends PreTrainedModel { } + +/** + * The bare DPT Model transformer outputting raw hidden-states without any specific head on top. 
+ */ +export class DPTModel extends DPTPreTrainedModel { } + +/** + * DPT Model with a depth estimation head on top (consisting of 3 convolutional layers) e.g. for KITTI, NYUv2. + * + * **Example:** Depth estimation w/ `Xenova/dpt-hybrid-midas`. + * ```javascript + * import { DPTForDepthEstimation, AutoProcessor, RawImage, interpolate, max } from '@xenova/transformers'; + * + * // Load model and processor + * const model_id = 'Xenova/dpt-hybrid-midas'; + * const model = await DPTForDepthEstimation.from_pretrained(model_id); + * const processor = await AutoProcessor.from_pretrained(model_id); + * + * // Load image from URL + * const url = 'http://images.cocodataset.org/val2017/000000039769.jpg'; + * const image = await RawImage.fromURL(url); + * + * // Prepare image for the model + * const inputs = await processor(image); + * + * // Run model + * const { predicted_depth } = await model(inputs); + * + * // Interpolate to original size + * const prediction = interpolate(predicted_depth, image.size.reverse(), 'bilinear', false); + * + * // Visualize the prediction + * const formatted = prediction.mul_(255 / max(prediction.data)[0]).to('uint8'); + * const depth = RawImage.fromTensor(formatted); + * // RawImage { + * // data: Uint8Array(307200) [ 85, 85, 84, ... ], + * // width: 640, + * // height: 480, + * // channels: 1 + * // } + * ``` + */ +export class DPTForDepthEstimation extends DPTPreTrainedModel { } +////////////////////////////////////////////////// + +////////////////////////////////////////////////// +export class GLPNPreTrainedModel extends PreTrainedModel { } + +/** + * The bare GLPN encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top. + */ +export class GLPNModel extends GLPNPreTrainedModel { } + +/** + * GLPN Model transformer with a lightweight depth estimation head on top e.g. for KITTI, NYUv2. + * + * **Example:** Depth estimation w/ `Xenova/glpn-kitti`. 
+ * ```javascript + * import { GLPNForDepthEstimation, AutoProcessor, RawImage, interpolate, max } from '@xenova/transformers'; + * + * // Load model and processor + * const model_id = 'Xenova/glpn-kitti'; + * const model = await GLPNForDepthEstimation.from_pretrained(model_id); + * const processor = await AutoProcessor.from_pretrained(model_id); + * + * // Load image from URL + * const url = 'http://images.cocodataset.org/val2017/000000039769.jpg'; + * const image = await RawImage.fromURL(url); + * + * // Prepare image for the model + * const inputs = await processor(image); + * + * // Run model + * const { predicted_depth } = await model(inputs); + * + * // Interpolate to original size + * const prediction = interpolate(predicted_depth, image.size.reverse(), 'bilinear', false); + * + * // Visualize the prediction + * const formatted = prediction.mul_(255 / max(prediction.data)[0]).to('uint8'); + * const depth = RawImage.fromTensor(formatted); + * // RawImage { + * // data: Uint8Array(307200) [ 207, 169, 154, ... 
], + * // width: 640, + * // height: 480, + * // channels: 1 + * // } + * ``` + */ +export class GLPNForDepthEstimation extends GLPNPreTrainedModel { } +////////////////////////////////////////////////// + ////////////////////////////////////////////////// export class DonutSwinPreTrainedModel extends PreTrainedModel { } @@ -4025,6 +4119,8 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([ ['swin2sr', ['Swin2SRModel', Swin2SRModel]], ['donut-swin', ['DonutSwinModel', DonutSwinModel]], ['yolos', ['YolosModel', YolosModel]], + ['dpt', ['DPTModel', DPTModel]], + ['glpn', ['GLPNModel', GLPNModel]], ['hifigan', ['SpeechT5HifiGan', SpeechT5HifiGan]], @@ -4205,6 +4301,11 @@ const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([ ['swin2sr', ['Swin2SRForImageSuperResolution', Swin2SRForImageSuperResolution]], ]) +const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([ + ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]], + ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]], +]) + const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES.EncoderOnly], @@ -4221,6 +4322,7 @@ const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], + [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_MASK_GENERATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], @@ -4425,6 +4527,10 @@ export class AutoModelForImageToImage extends PretrainedMixin { static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES]; } +export class AutoModelForDepthEstimation extends PretrainedMixin { + static MODEL_CLASS_MAPPINGS = [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES]; +} + 
////////////////////////////////////////////////// ////////////////////////////////////////////////// diff --git a/src/pipelines.js b/src/pipelines.js index 7bb125e5d..db1ca2c0c 100644 --- a/src/pipelines.js +++ b/src/pipelines.js @@ -36,6 +36,7 @@ import { AutoModelForZeroShotObjectDetection, AutoModelForDocumentQuestionAnswering, AutoModelForImageToImage, + AutoModelForDepthEstimation, // AutoModelForTextToWaveform, PreTrainedModel, } from './models.js'; @@ -65,6 +66,7 @@ import { import { Tensor, mean_pooling, + interpolate, } from './utils/tensor.js'; import { RawImage } from './utils/image.js'; @@ -2108,6 +2110,56 @@ export class ImageToImagePipeline extends Pipeline { } } +/** + * Depth estimation pipeline using any `AutoModelForDepthEstimation`. This pipeline predicts the depth of an image. + * + * **Example:** Depth estimation w/ `Xenova/dpt-hybrid-midas` + * ```javascript + * let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg'; + * let depth_estimator = await pipeline('depth-estimation', 'Xenova/dpt-hybrid-midas'); + * let out = await depth_estimator(url); + * // { + * // predicted_depth: Tensor { + * // dims: [ 384, 384 ], + * // type: 'float32', + * // data: Float32Array(147456) [ 542.859130859375, 545.2833862304688, 546.1649169921875, ... ], + * // size: 147456 + * // }, + * // depth: RawImage { + * // data: Uint8Array(307200) [ 86, 86, 86, ... ], + * // width: 640, + * // height: 480, + * // channels: 1 + * // } + * // } + * ``` + */ +export class DepthEstimationPipeline extends Pipeline { + /** + * Predicts the depth for the image(s) passed as inputs. + * @param {any} images The images to compute depth for. + * @returns {Promise<any>} An object (or a list of objects) containing the raw `predicted_depth` tensor and the visualized `depth` as a `RawImage`. 
+ */ + async _call(images) { + images = await prepareImages(images); + + const inputs = await this.processor(images); + const { predicted_depth } = await this.model(inputs); + + const toReturn = []; + for (let i = 0; i < images.length; ++i) { + const prediction = interpolate(predicted_depth[i], images[i].size.reverse(), 'bilinear', false); + const formatted = prediction.mul_(255 / max(prediction.data)[0]).to('uint8'); + toReturn.push({ + predicted_depth: predicted_depth[i], + depth: RawImage.fromTensor(formatted), + }); + } + + return toReturn.length > 1 ? toReturn : toReturn[0]; + } +} + const SUPPORTED_TASKS = { "text-classification": { "tokenizer": AutoTokenizer, @@ -2345,6 +2397,18 @@ const SUPPORTED_TASKS = { }, "type": "image", }, + "depth-estimation": { + // no tokenizer + "pipeline": DepthEstimationPipeline, + "model": AutoModelForDepthEstimation, + "processor": AutoProcessor, + "default": { + // TODO: replace with original + // "model": "Intel/dpt-large", + "model": "Xenova/dpt-large", + }, + "type": "image", + }, // This task serves as a useful interface for dealing with sentence-transformers (https://huggingface.co/sentence-transformers). "feature-extraction": { @@ -2378,6 +2442,7 @@ const TASK_ALIASES = { * @param {string} task The task defining which pipeline will be returned. Currently accepted tasks are: * - `"audio-classification"`: will return a `AudioClassificationPipeline`. * - `"automatic-speech-recognition"`: will return a `AutomaticSpeechRecognitionPipeline`. + * - `"depth-estimation"`: will return a `DepthEstimationPipeline`. * - `"document-question-answering"`: will return a `DocumentQuestionAnsweringPipeline`. * - `"feature-extraction"`: will return a `FeatureExtractionPipeline`. * - `"fill-mask"`: will return a `FillMaskPipeline`. 
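The visualization step in `DepthEstimationPipeline._call` above scales the raw depth tensor into the 0–255 range and casts to `uint8` (`prediction.mul_(255 / max(prediction.data)[0]).to('uint8')`). A plain-JavaScript sketch of that arithmetic, using a hypothetical standalone helper rather than the library's tensor API:

```javascript
// Hypothetical helper mirroring the normalization in DepthEstimationPipeline._call:
// scale raw depth values so the maximum maps to 255, then truncate to uint8
// (typed-array conversion truncates toward zero, like the tensor cast).
function depthToUint8(depthValues) {
  let maxVal = -Infinity;
  for (const v of depthValues) if (v > maxVal) maxVal = v;
  const scale = 255 / maxVal;
  return Uint8Array.from(depthValues, (v) => Math.floor(v * scale));
}

// Example: four raw depth readings spread over the full 0–255 range
const formatted = depthToUint8(Float32Array.from([0, 1, 2, 4]));
// → Uint8Array [0, 63, 127, 255]
```

The resulting single-channel byte array is what `RawImage.fromTensor` wraps into the `depth` image returned by the pipeline.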
diff --git a/src/processors.js b/src/processors.js index 217f0262a..3d3349028 100644 --- a/src/processors.js +++ b/src/processors.js @@ -204,6 +204,7 @@ export class ImageFeatureExtractor extends FeatureExtractor { this.do_resize = this.config.do_resize; this.do_thumbnail = this.config.do_thumbnail; this.size = this.config.size; + this.size_divisor = this.config.size_divisor; this.do_center_crop = this.config.do_center_crop; this.crop_size = this.config.crop_size; @@ -427,7 +428,7 @@ export class ImageFeatureExtractor extends FeatureExtractor { shortest_edge = this.size; longest_edge = this.config.max_size ?? shortest_edge; - } else { + } else if (this.size !== undefined) { // Extract known properties from `this.size` shortest_edge = this.size.shortest_edge; longest_edge = this.size.longest_edge; @@ -460,11 +461,20 @@ export class ImageFeatureExtractor extends FeatureExtractor { resample: this.resample, }); - } else if (this.size.width !== undefined && this.size.height !== undefined) { + } else if (this.size !== undefined && this.size.width !== undefined && this.size.height !== undefined) { // If `width` and `height` are set, resize to those dimensions image = await image.resize(this.size.width, this.size.height, { resample: this.resample, }); + + } else if (this.size_divisor !== undefined) { + // Rounds the height and width down to the closest multiple of size_divisor + const newWidth = Math.floor(srcWidth / this.size_divisor) * this.size_divisor; + const newHeight = Math.floor(srcHeight / this.size_divisor) * this.size_divisor; + image = await image.resize(newWidth, newHeight, { + resample: this.resample, + }); + } else { throw new Error(`Could not resize image due to unsupported \`this.size\` option in config: ${JSON.stringify(this.size)}`); } @@ -578,6 +588,8 @@ export class ImageFeatureExtractor extends FeatureExtractor { } +export class DPTFeatureExtractor extends ImageFeatureExtractor { } +export class GLPNFeatureExtractor extends ImageFeatureExtractor { } 
export class CLIPFeatureExtractor extends ImageFeatureExtractor { } export class ConvNextFeatureExtractor extends ImageFeatureExtractor { } export class ViTFeatureExtractor extends ImageFeatureExtractor { } @@ -1633,6 +1645,8 @@ export class AutoProcessor { OwlViTFeatureExtractor, CLIPFeatureExtractor, ConvNextFeatureExtractor, + DPTFeatureExtractor, + GLPNFeatureExtractor, BeitFeatureExtractor, DeiTFeatureExtractor, DetrFeatureExtractor, diff --git a/src/utils/image.js b/src/utils/image.js index cd40a5a6e..96c1d8227 100644 --- a/src/utils/image.js +++ b/src/utils/image.js @@ -91,6 +91,10 @@ export class RawImage { this.channels = channels; } + get size() { + return [this.width, this.height]; + } + /** * Helper method for reading an image from a variety of input types. * @param {RawImage|string|URL} input diff --git a/tests/pipelines.test.js b/tests/pipelines.test.js index ed449c15b..50d3aecd7 100644 --- a/tests/pipelines.test.js +++ b/tests/pipelines.test.js @@ -1463,6 +1463,47 @@ describe('Pipelines', () => { }, MAX_TEST_EXECUTION_TIME); }); + + describe('Depth estimation', () => { + + // List all models which will be tested + const models = [ + 'Intel/dpt-hybrid-midas', + ]; + + it(models[0], async () => { + let depth_estimator = await pipeline('depth-estimation', m(models[0])); + + let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg'; + + // single + { + let { predicted_depth, depth } = await depth_estimator(url); + compare(predicted_depth.dims, [384, 384]); + expect(depth.width).toEqual(640); + expect(depth.height).toEqual(480); + expect(depth.channels).toEqual(1); + expect(depth.data).toHaveLength(307200); + } + + // batched + { + let outputs = await depth_estimator([url, url]); + expect(outputs).toHaveLength(2); + for (let output of outputs) { + let { predicted_depth, depth } = output; + compare(predicted_depth.dims, [384, 384]); + expect(depth.width).toEqual(640); + expect(depth.height).toEqual(480); + 
expect(depth.channels).toEqual(1); + expect(depth.data).toHaveLength(307200); + } + } + + await depth_estimator.dispose(); + }, MAX_TEST_EXECUTION_TIME); + }); + describe('Document question answering', () => { // List all models which will be tested diff --git a/tests/processors.test.js b/tests/processors.test.js index 7efd12487..ee21d265a 100644 --- a/tests/processors.test.js +++ b/tests/processors.test.js @@ -38,6 +38,8 @@ describe('Processors', () => { beit: 'microsoft/beit-base-patch16-224-pt22k-ft22k', detr: 'facebook/detr-resnet-50', yolos: 'hustvl/yolos-small-300', + dpt: 'Intel/dpt-hybrid-midas', + glpn: 'vinvino02/glpn-kitti', nougat: 'facebook/nougat-small', owlvit: 'google/owlvit-base-patch32', clip: 'openai/clip-vit-base-patch16', @@ -243,6 +245,49 @@ describe('Processors', () => { } }, MAX_TEST_EXECUTION_TIME); + // DPTFeatureExtractor + it(MODELS.dpt, async () => { + const processor = await AutoProcessor.from_pretrained(m(MODELS.dpt)) + + { // Tests square resizing to 384x384 + const image = await load_image(TEST_IMAGES.cats); + const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image); + + compare(pixel_values.dims, [1, 3, 384, 384]); + compare(avg(pixel_values.data), 0.0372855559389454); + + compare(original_sizes, [[480, 640]]); + compare(reshaped_input_sizes, [[384, 384]]); + } + }, MAX_TEST_EXECUTION_TIME); + + // GLPNFeatureExtractor + // - tests `size_divisor` and no size (size_divisor=32) + it(MODELS.glpn, async () => { + const processor = await AutoProcessor.from_pretrained(m(MODELS.glpn)) + + { + const image = await load_image(TEST_IMAGES.cats); + const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image); + compare(pixel_values.dims, [1, 3, 480, 640]); + compare(avg(pixel_values.data), 0.5186172404123327); + + compare(original_sizes, [[480, 640]]); + compare(reshaped_input_sizes, [[480, 640]]); + } + + { // Tests input which is not a multiple of 32 ([408, 612] -> [384, 608]) + const image = 
await load_image(TEST_IMAGES.tiger); + const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image); + + compare(pixel_values.dims, [1, 3, 384, 608]); + compare(avg(pixel_values.data), 0.38628831535989555); + + compare(original_sizes, [[408, 612]]); + compare(reshaped_input_sizes, [[384, 608]]); + }, MAX_TEST_EXECUTION_TIME); + // NougatImageProcessor // - tests padding after normalization (image_mean != 0.5, image_std != 0.5) it(MODELS.nougat, async () => { @@ -263,7 +308,6 @@ describe('Processors', () => { // OwlViTFeatureExtractor it(MODELS.owlvit, async () => { const processor = await AutoProcessor.from_pretrained(m(MODELS.owlvit)) - { const image = await load_image(TEST_IMAGES.cats); const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image);
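The GLPN processor tests above exercise the new `size_divisor` branch added to `ImageFeatureExtractor`: when the config has no `size`, each dimension is rounded *down* to the nearest multiple of `size_divisor` before resizing. A self-contained sketch of that rounding (a hypothetical helper; the real code passes these values to `image.resize(...)`):

```javascript
// Hypothetical helper mirroring the `size_divisor` branch in ImageFeatureExtractor:
// rounds width and height down to the closest multiple of sizeDivisor.
function roundDownToMultiple(srcWidth, srcHeight, sizeDivisor) {
  return [
    Math.floor(srcWidth / sizeDivisor) * sizeDivisor,
    Math.floor(srcHeight / sizeDivisor) * sizeDivisor,
  ];
}

// The tiger test case above: a 612x408 image becomes 608x384 with size_divisor=32
const [newWidth, newHeight] = roundDownToMultiple(612, 408, 32);
// → [608, 384]
```

Note that an image whose dimensions are already multiples of 32 (e.g. 640x480, the first GLPN test case) passes through unchanged, which is why `reshaped_input_sizes` equals `original_sizes` there.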