
Refactoring of ImageProcessorFast #35069

Open · wants to merge 36 commits into main from improve-fast-image-processor-base

Conversation

yonigozlan (Member) commented Dec 4, 2024

What does this PR do?

This PR introduces a significant refactoring of how fast image processors can be implemented in Transformers.

Motivation

The primary goal of this refactoring is to simplify the process of creating fast image processors when a slow image processor already exists.

Unlike the current BaseImageProcessor, which provides only a minimal skeleton for image processor classes, the newly introduced BaseImageProcessorFast includes all the basic functionalities needed by a typical image processor. This design allows contributors to focus primarily on modifying default parameters, such as the image mean, std, or default resizing dimensions, rather than rewriting foundational code.

Key Advantages:

  • Ease of Contribution: Contributors no longer need to rely on copy-paste from an arbitrary slow image processor (which I feel is what is happening currently for some slow image processors). Instead, the BaseImageProcessorFast provides a natural starting point with predefined functionalities.
  • Consistency: Contributors are encouraged to use a common structure. Whether they only modify default parameters, leverage mixins, or add custom code, they are likely to follow a consistent syntax and logic.
  • Automatic Optimizations: Improvements made to BaseImageProcessorFast are automatically propagated to all derived fast image processors.
  • Reduced Diffs: The new approach minimizes added diffs compared to the existing "# Copied from" philosophy in slow image processors. While the "repeat yourself" philosophy is an important part of modeling in Transformers, I feel that it might not be as necessary for image processing, as the model's uniqueness is rarely found in the image processing logic.

Implementation

Functional or class transforms

Following the torchvision approach to defining image transforms, there are two main ways to write the processing logic for image processors: using functional transforms or class-based transforms.

This PR showcases both approaches.

The choice is entirely open for debate at this point. I see advantages to both approaches, but I’m sure I haven’t considered everything, so please share your thoughts if you have a preference one way or the other.

To me, the advantages/drawbacks of functionals are the following (a short side-by-side sketch of both styles follows after these two lists):

  • 🟢 Less abstraction so potentially easier to read
  • 🟢 Easier for contributors to write or adapt from existing slow image processors (which currently use functionals).
  • 🟢 Allows more flexibility in processing logic, as transforms do not need to be sequential. For more complex processors, using piped class-based transforms would likely require mixing functionals and class-based transforms, or adding logic outside the transforms pipeline instead of a simple one-liner.
  • 🔴 The logic can be more verbose than for class transforms

For class transforms:

  • 🟢 Aligns with practices in other libraries like Albumentations.
  • 🟢 Generally cleaner and more structured, easier to add/remove simple transforms.
  • 🔴 As mentioned before, the logic is restricted by the sequential nature of the pipeline. For complex processors (e.g., involving patching), mixing functionals and class-based transforms or adding logic around the pipeline seems unavoidable (as seen in LlavaOnevisionImageProcessorFast).
  • 🔴 There appear to be compilation issues: LlavaOnevisionImageProcessorFast fails to compile, as it seemingly gets stuck in an infinite compilation loop, while its functional equivalent, LlavaNextImageProcessorFast, compiles without problems. However, this needs more thorough investigation.
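
For concreteness, here is a rough, illustrative sketch of the same resize → center-crop → rescale → normalize steps written in both styles with plain torchvision (the sizes, means, and stds are arbitrary placeholders, not values from this PR):

import torch
from torchvision.transforms import v2
from torchvision.transforms.v2 import functional as F

# Functional style: each step is an explicit call, so arbitrary logic can run in between steps.
def preprocess_functional(image: torch.Tensor) -> torch.Tensor:
    image = F.resize(image, [384, 384])
    image = F.center_crop(image, [224, 224])
    image = image.float() / 255.0  # rescale to [0, 1]
    return F.normalize(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])

# Class-transform style: the same steps expressed as a sequential pipeline of transform modules.
pipeline = v2.Compose(
    [
        v2.Resize([384, 384]),
        v2.CenterCrop([224, 224]),
        v2.ToDtype(torch.float32, scale=True),
        v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ]
)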

New add-fast-image-processor Command in transformers-cli

This PR introduces a new CLI command that automates the creation of fast image processors. Given a model folder name, it generates all necessary imports, documentation, dummy objects, and a fast image processor file where default parameters are parsed from the slow image processor file.
The new fast image processor class is also added to the image processor test file by the CLI. However, some tests may still need manual adjustments to properly include and validate the new fast image processor class.

Example Usage:

transformers-cli add-fast-image-processor --model-name blip

Example Output:

In src/transformers/models/blip/image_processing_blip_fast.py:

# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Image processor class for BLIP."""

from ...image_processing_utils_fast import BaseImageProcessorFast
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling


class BlipImageProcessorFast(BaseImageProcessorFast):
    r"""
    Constructs a fast BLIP image processor.

    Args:
        do_resize (`bool`, *optional*):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
            `do_resize` parameter in the `preprocess` method.
        size (`dict`, *optional*):
            Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
            method.
        resample (`PILImageResampling`, *optional*):
            Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
            overridden by the `resample` parameter in the `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
            `preprocess` method.
        crop_size (`Dict[str, int]`, *optional*, defaults to 224):
            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
            method.
        do_rescale (`bool`, *optional*):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
            `do_rescale` parameter in the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
            overridden by the `rescale_factor` parameter in the `preprocess` method.
        do_normalize (`bool`, *optional*):
            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
            method.
        image_mean (`float` or `List[float]`, *optional*):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_convert_rgb (`bool`, *optional*):
            Whether to convert the image to RGB.
    """


    # This generated class can be used as a starting point for the fast image processor.
    # if the image processor is only used for simple augmentations, such as resizing, center cropping, rescaling, or normalizing,
    # only the default values should be set in the class.
    # If the image processor requires more complex augmentations, methods from BaseImageProcessorFast can be overridden.
    # For an example of a fast image processor requiring more complex augmentations, see `LlavaOnevisionImageProcessorFast`.

    # Default values should be checked against the slow image processor
    # None values left after checking can be removed
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"height": 384, "width": 384}
    default_to_square = None
    crop_size = None
    do_resize = True
    do_center_crop = None
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

In this case, this is enough to get a fully working fast BLIP image processor!
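
As a quick usage sketch (the checkpoint name and image URL are just the usual documentation examples, not specific to this PR), the fast processor can then be loaded through the auto class with use_fast=True:

import requests
from PIL import Image

from transformers import AutoImageProcessor

# use_fast=True selects the fast (torchvision-based) image processor when one is available
processor = AutoImageProcessor.from_pretrained("Salesforce/blip-image-captioning-base", use_fast=True)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(images=image, return_tensors="pt")  # BatchFeature containing "pixel_values"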

New Mixins for Common Logic

To handle shared preprocessing and post-processing logic, this PR introduces reusable mixins (only LlavaPatchingMixin is present as an example in this PR). Additional mixins are planned for other common patterns, such as:

  • Video processing
  • DETR-like processing
  • Segmentation post-processing
  • Depth estimation post-processing

Edit: Removing the Mixins for patching in favor of # Copied from or modular in the future, as such preprocessing techniques don't usually stick around for long, and adding Mixins every time a technique is used twice or more wouldn't scale well.

Summary: Three Types of Fast Image Processors

  1. Basic Processors:
    • These support standard operations like resizing, rescaling, normalizing, and cropping.
    • They require minimal customization, mostly overriding default parameters.
    • Examples in this PR: blip, clip, deit, siglip, and vit.

  2. Mixin-Based Processors:
    • These rely heavily on predefined mixins to implement shared logic.
    • Examples in this PR: llava_next and llava_onevision.

Edit: Removing Mixins in favor of # Copied From or modular for now, see earlier comment.

  3. Exotic Processors:
    • These have unique processing logic that differs significantly from the base class or existing mixins.
    • Contributors need to override functions like __init__ and preprocess while reusing the syntax and structure of BaseImageProcessorFast (a rough sketch of this pattern follows after this list).
    • Example in this PR: convnext
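
As a purely illustrative sketch of that "exotic" pattern (the class name, the crop_pct parameter, and the exact base-class hooks are assumptions for illustration, not this PR's actual API):

from transformers.image_processing_utils_fast import BaseImageProcessorFast


class ExoticImageProcessorFast(BaseImageProcessorFast):
    def __init__(self, crop_pct: float = 0.875, **kwargs):
        super().__init__(**kwargs)
        self.crop_pct = crop_pct  # extra, model-specific parameter the base class does not handle

    def preprocess(self, images, crop_pct: float = None, **kwargs):
        crop_pct = crop_pct if crop_pct is not None else self.crop_pct
        # model-specific logic (e.g. crop_pct-aware resizing) would go here, while the
        # standard resize/rescale/normalize steps still reuse the base-class structure
        return super().preprocess(images, **kwargs)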

Miscellaneous Issues and Questions

  • The CLI currently needs an existing slow image processor to work. If the library aims to eventually fully deprecate slow image processors, or at least not require them for new models, this will need to change, and the CLI should maybe be integrated into the add-new-model-like command.
  • There is a significant design difference between slow and fast image processors. For slow processors, contributors need to rewrite most of the logic, while for fast processors the goal is to rewrite as little code as possible. This might get confusing for contributors, especially since, as mentioned in the previous point, users will most likely start by writing a slow image processor.
  • Adding lots of mixins to image_processing_utils_fast might make the file large and difficult to read. This also raises the question of when a mixin should be created for repeated patterns across image processors. Should it be created when a pattern is shared by two or more models? Or when it presents an idea likely to be reused in the future?
  • Padding functionality is currently not included in BaseImageProcessorFast because the image processors that use padding implement it in slightly different ways. This makes it challenging to standardize the padding logic. A more consistent approach should maybe be added in the future.
  • The center_crop functions in Transformers' image_transforms and in torchvision have different implementations: the cropping boundaries in Transformers are defined as top = (orig_height - crop_height) // 2, left = (orig_width - crop_width) // 2, whereas torchvision uses top = int(round((orig_height - crop_height) / 2.0)), left = int(round((orig_width - crop_width) / 2.0)). This can shift the crop window by one pixel when the difference between the original and crop size is odd (illustrated below). I don't think this should be much of a problem for users, but when comparing outputs of slow and fast image processors in tests, it results in large differences.
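
A minimal sketch of that boundary difference, using an odd height difference so the two formulas disagree:

# Both libraries crop a crop_height x crop_width window but place its top-left corner differently.
orig_height, crop_height = 5, 2  # difference of 3 is odd, so the two formulas disagree

top_transformers = (orig_height - crop_height) // 2              # 3 // 2 == 1
top_torchvision = int(round((orig_height - crop_height) / 2.0))  # round(1.5) == 2

print(top_transformers, top_torchvision)  # 1 2 -> the crop window is shifted by one pixel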

Who can review?

@qubvel @molbap @zucchini-nlp


yonigozlan force-pushed the improve-fast-image-processor-base branch from 9e31e43 to c393647 on December 7, 2024
Comment on lines +267 to +317
dict_slow_0 = image_processor_slow_0.to_dict()
dict_slow_1 = image_processor_slow_1.to_dict()
difference = {
    key: dict_slow_0.get(key) if key in dict_slow_0 else dict_slow_1.get(key)
    for key in set(dict_slow_0) ^ set(dict_slow_1)
}
dict_slow_0 = {key: dict_slow_0[key] for key in set(dict_slow_0) & set(dict_slow_1)}
dict_slow_1 = {key: dict_slow_1[key] for key in set(dict_slow_0) & set(dict_slow_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_slow_0, dict_slow_1)

dict_fast_0 = image_processor_fast_0.to_dict()
dict_fast_1 = image_processor_fast_1.to_dict()
difference = {
    key: dict_fast_0.get(key) if key in dict_fast_0 else dict_fast_1.get(key)
    for key in set(dict_fast_0) ^ set(dict_fast_1)
}
dict_fast_0 = {key: dict_fast_0[key] for key in set(dict_fast_0) & set(dict_fast_1)}
dict_fast_1 = {key: dict_fast_1[key] for key in set(dict_fast_0) & set(dict_fast_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_fast_0, dict_fast_1)
yonigozlan (Member Author):

This modification is needed because the BaseImageProcessorFast that the fast processor is most probably based on might have more capabilities (e.g. being able to do center cropping) than the slow image processor.

Comment on lines +313 to +363
dict_slow_0 = image_processor_slow_0.to_dict()
dict_slow_1 = image_processor_slow_1.to_dict()
difference = {
    key: dict_slow_0.get(key) if key in dict_slow_0 else dict_slow_1.get(key)
    for key in set(dict_slow_0) ^ set(dict_slow_1)
}
dict_slow_0 = {key: dict_slow_0[key] for key in set(dict_slow_0) & set(dict_slow_1)}
dict_slow_1 = {key: dict_slow_1[key] for key in set(dict_slow_0) & set(dict_slow_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_slow_0, dict_slow_1)

dict_fast_0 = image_processor_fast_0.to_dict()
dict_fast_1 = image_processor_fast_1.to_dict()
difference = {
    key: dict_fast_0.get(key) if key in dict_fast_0 else dict_fast_1.get(key)
    for key in set(dict_fast_0) ^ set(dict_fast_1)
}
dict_fast_0 = {key: dict_fast_0[key] for key in set(dict_fast_0) & set(dict_fast_1)}
dict_fast_1 = {key: dict_fast_1[key] for key in set(dict_fast_0) & set(dict_fast_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_fast_0, dict_fast_1)
yonigozlan (Member Author):

Same here.

yonigozlan (Member Author):

@ArthurZucker This PR is still in a rough state but pinging you for visibility as there are lots of refactoring choices that are up for discussion!

qubvel (Member) left a comment:

I made an initial quick pass. My main concern here is how common these methods are. I mean, for preprocess, does it scale to most models, or will we need to override it somewhere?

I think (but might be wrong) that the better design choice would be to follow standards in image processing, where we define each transformation as a separate class and then can pipe them. This approach is followed by torchvision, albumentations, kornia, and imgaug. So I suppose this is a pretty robust approach that should work well in our case too.

We can reuse the original torch transforms if they are compatible with torch compile; otherwise, we can rewrite them ourselves. The most common transforms will live in the base file, while custom ones will live in the model's image_processing_* file and can be moved to common for reuse.

Let me know what you think regarding this. Do you see any difficulties with this approach?

yonigozlan (Member Author) commented Dec 11, 2024

I think (but might be wrong) that the better design choice would be to follow standards in image processing, where we define each transformation as a separate class and then can pipe them. This approach is followed by torchvision, albumentations, kornia, and imgaug. So I suppose this is a pretty robust approach that should work well in our case too.

We can reuse the original torch transforms if they are compatible with torch compile; otherwise, we can rewrite them ourselves. The most common transforms will live in the base file, while custom ones will live in the model's image_processing_* file and can be moved to common for reuse.

@qubvel The problem I see with this is that some image processors need additional logic in between transforms that depends on the result of the previous transforms (like the padding operation in DETR-like image processors), so unless I'm mistaken, I don't think piping class transforms would be possible here.
It feels to me that having some processors use piped class transforms and others functional transforms would be confusing and break consistency. I also think that using functional transforms is closer to what is done in slow processors, so it may be easier to adapt for contributors when they need to implement "exotic" fast image processing operations.

That said, I might be misunderstanding or missing something if this is a standard approach in other image preprocessing libraries, so I'm happy to discuss this further.

yonigozlan force-pushed the improve-fast-image-processor-base branch from 867b1f5 to 3f8674f on December 11, 2024
zucchini-nlp (Member) left a comment:

Very cool PR, happy to see fast image processors getting propagated!

I left a few questions here and there, mostly nits. Overall I like the "less code for users" idea and moving redundant parts to the base class. But it seems like we are hiding the core processing logic in the base class, so I agree with @qubvel that having a pipe of transforms looks nicer and easier to consume.

And for weird models, we probably don't need a separate patching mixin, as it doesn't scale well because not many of those patching methods will stick for a long time and new methods will continue to be invented. I am also wondering whether all this repeated code can be squeezed out when we add modular to those models? For example, llava-next-style patching is used in three models, so we would only need to define it once and then let modular copy the main code body.

yonigozlan force-pushed the improve-fast-image-processor-base branch from f12c5d7 to da6de2e on December 16, 2024
yonigozlan (Member Author):

Hello @qubvel @zucchini-nlp! Thanks again for your reviews. I added examples of how I would write processors using class-based transforms, and a section in the PR with the advantages and drawbacks I see for functional vs class transforms. If you have some time, I would be glad to hear your opinions on this :)

qubvel (Member) left a comment:

Great work! Thanks for providing both options for review 👍 I agree, the functional approach looks more flexible right now.

I would differentiate image processors for the following cases:

  • Basic image transforms: operate only on raw images, and usually, in the original code, we might see the torchvision defined preprocessing pipeline.
  • Patching transforms: operate on patches but still work with images only. The preprocessing usually looks as follows: split into patches -> apply basic image transforms to patches (normalize, rescale, etc.).
  • Annotation-aware image transforms: operate on images and annotations (object detection, image segmentation). These are the most complicated, I suppose, because for each transform, we have to handle annotation transformation like bounding boxes/masks/keypoints.

It might be oversimplified; am I missing something?

I think we can have different styles of image processors for these cases. And the thing we have to take into consideration is that it's common to have transforms written in torchvision, so it would be nice to have a clear way to port the original transforms to be transformers-like.
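
To make the patching case above concrete, here is a naive, illustrative sketch of the "split into patches → apply basic transforms to patches" pattern (the helper name and patching scheme are assumptions, not code from this PR):

from typing import List

import torch


def split_into_patches(image: torch.Tensor, patch_size: int) -> List[torch.Tensor]:
    """Naive non-overlapping patching of a CHW image; real processors also handle padding and border patches."""
    _, height, width = image.shape
    patches = []
    for top in range(0, height - height % patch_size, patch_size):
        for left in range(0, width - width % patch_size, patch_size):
            patches.append(image[:, top : top + patch_size, left : left + patch_size])
    return patches


# The basic transforms (rescale, normalize, ...) are then applied to every patch
# exactly as they would be applied to full images, e.g.:
# patches = split_into_patches(image, patch_size=336)
# patches = [(patch.float() / 255.0 - 0.5) / 0.5 for patch in patches]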

Comment on lines 37 to 44
try:
    if not is_torchvision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_convnext_fast"] = ["ConvNextImageProcessorFast"]

Member:

Reminder: __init__ files no longer have this.

yonigozlan (Member Author):

Yes, I was just thinking that making the modifications in this PR would add a lot of noise. I can open separate PRs to modify them.

Comment on lines 244 to 287
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self.valid_extra_kwargs)

do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
default_to_square = kwargs.pop(
    "default_to_square", self.default_to_square if self.default_to_square is not None else True
)
size = get_size_dict(size=size, default_to_square=default_to_square) if size is not None else None
crop_pct = crop_pct if crop_pct is not None else self.crop_pct
resample = resample if resample is not None else self.resample
do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
crop_size = crop_size if crop_size is not None else self.crop_size
crop_size = get_size_dict(crop_size, param_name="crop_size") if crop_size is not None else None
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
return_tensors = "pt" if return_tensors is None else return_tensors
device = kwargs.pop("device", None)

images = self._prepare_input_images(
    images=images,
    do_convert_rgb=do_convert_rgb,
    device=device,
    input_data_format=input_data_format,
)

image_mean, image_std, size, crop_size, interpolation = self._prepare_process_arguments(
    device=images[0].device,
    do_resize=do_resize,
    size=size,
    resample=resample,
    crop_size=crop_size,
    do_rescale=do_rescale,
    rescale_factor=rescale_factor,
    do_normalize=do_normalize,
    image_mean=image_mean,
    image_std=image_std,
    return_tensors=return_tensors,
    data_format=data_format,
    **kwargs,
)
Member:

This one definitely makes it harder to catch the processing logic with a quick look. While it might be fine for us, because we see it every day, the first-time contributor is required to read and understand it. Any ideas on how it can be simplified/split?

yonigozlan (Member Author):

There's really no processing happening here. Ideally this should be common to all processors, and can be overridden to make some tiny changes if needed. This replaces the very verbose "pre"-preprocessing that sits at the beginning of all preprocess functions of image processors and is not fully consistent across processors. Having these two wrapper functions would be a way to force consistency in this "pre"-preprocessing step across fast processors.

Comment on lines +341 to +368
transforms.append(GroupByShape())
if do_resize:
    transforms.append(Resize(size, interpolation=interpolation))
    # Since the size was changed, we need to group the images by shape again
    transforms.append(ReorderImages())
    transforms.append(GroupByShape())
if do_center_crop:
    transforms.append(CenterCrop(crop_size))
    # Since the size was changed, we need to group the images by shape again
    transforms.append(ReorderImages())
    transforms.append(GroupByShape())
# We can combine rescale and normalize into a single operation for speed
if do_rescale and do_normalize:
    # image_mean and image_std have already been adjusted for rescaling
    transforms.append(Normalize(image_mean, image_std))
elif do_rescale:
    transforms.append(Rescale(rescale_factor=rescale_factor))
elif do_normalize:
    transforms.append(Normalize(image_mean, image_std))

if isinstance(transforms[-1], GroupByShape):
    # No added transforms, so we can remove the last GroupByShape
    transforms.pop()
else:
    # We necessarily have grouped images, so we need to reorder them back to the original order
    transforms.append(ReorderImages())

return Sequential(*transforms)
Member:

Ok, this one looks a bit overcomplicated compared to the functional one, but I like that we have a separate method, which is easier to read. If we avoid the GroupByShape transform, it will look better. Instead, we can add StackIfPossible, which will stack images only if they all have the same size.

yonigozlan (Member Author):

It would be a shame to miss out on some nice speedups though. For example, with processors using patching, this group by shape method can lead to a 10 to 20x speedup. This would not work with something like StackIfPossible, as the border patches are usually of a different size than the middle patches.
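
For readers unfamiliar with the trick, here is a rough sketch of the group-by-shape idea (an illustrative reimplementation, not necessarily the helper used in this PR): images or patches with the same shape are stacked into a single batch so each transform runs on a few large tensors instead of many small ones.

from collections import defaultdict
from typing import Dict, List, Tuple

import torch


def group_images_by_shape_sketch(images: List[torch.Tensor]):
    """Stack same-shaped images into batches and remember where each image came from."""
    grouped: Dict[Tuple[int, ...], List[torch.Tensor]] = defaultdict(list)
    index: Dict[int, Tuple[Tuple[int, ...], int]] = {}
    for i, image in enumerate(images):
        shape = tuple(image.shape)
        index[i] = (shape, len(grouped[shape]))
        grouped[shape].append(image)
    batches = {shape: torch.stack(imgs) for shape, imgs in grouped.items()}
    return batches, index


def reorder_images_sketch(batches, index) -> List[torch.Tensor]:
    """Recover the original ordering after the batched transforms have been applied."""
    return [batches[shape][pos] for shape, pos in (index[i] for i in sorted(index))]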

Comment on lines +909 to +924
if is_torch_available():

    class GroupByShape(torch.nn.Module):
        """
        Groups images by shape.
        Returns a dictionary with the shape as key and a list of images with that shape as value,
        and a dictionary with the index of the image in the original list as key and the shape and index in the grouped list as value.
        """

        def __init__(self):
            super().__init__()

        def forward(self, images: List["torch.Tensor"]):
            grouped_images, grouped_images_index = group_images_by_shape(images)
            return {"grouped_images": grouped_images, "grouped_images_index": grouped_images_index}

Member:

It's fine for prototyping, but if we go further, we will need to redesign this. Hiding classes under is_torch_available looks weird.

yonigozlan (Member Author):

Agreed, but otherwise CI gets mad 😅

yonigozlan (Member Author):

Great work! Thanks for providing both options for review 👍 I agree, the functional approach looks more flexible right now.

I believe so too for the functional approach. Using class transforms might look nicer in some cases, but they are limiting and add a layer of complexity for contributors that I don't think is necessary.
I can leave the two class-transform-based fast image processors (ViTImageProcessorFast and LlavaOnevisionImageProcessorFast) as they are for now, in case future reviewers have a stronger opinion on this, but I'll focus on improving the functional-based ones.

I would differentiate image processors for the following cases:

Basic image transforms: operate only on raw images, and usually, in the original code, we might see the torchvision defined preprocessing pipeline.
Patching transforms: operate on patches but still work with images only. The preprocessing usually looks as follows: split into patches -> apply basic image transforms to patches (normalize, rescale, etc.).
Annotation-aware image transforms: operate on images and annotations (object detection, image segmentation). These are the most complicated, I suppose, because for each transform, we have to handle annotation transformation like bounding boxes/masks/keypoints.
It might be oversimplified; am I missing something?

I think we can have different styles of image processors for these cases. And the thing we have to take into consideration is that it's common to have transforms written in torchvision, so it would be nice to have a clear way to port the original transforms to be transformers-like.

Agreed on the three different categories, but I'm not sure how to handle this differentiation in the code.
I think having the BaseImageProcessorFast as the template for the first category (as it is currently) is a nice way to reduce diffs and force consistency. For the two others, as we discussed before, having a mixin for each doesn't seem to be the way to go, as there are so many slight variations between the models belonging to each category that it would be impossible to have a mixin that generalizes to all models in the category and stands the test of time.

What I was thinking was to add some guidance in the fast image processor file generated by the transformers-cli, explaining which existing processor to use as a baseline depending on what sort of image processing the model needs.

yonigozlan changed the title [WIP] Refactoring of ImageProcessorFast → Refactoring of ImageProcessorFast on Jan 6, 2025