Refactoring of ImageProcessorFast #35069
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 9e31e43 to c393647.
dict_slow_0 = image_processor_slow_0.to_dict()
dict_slow_1 = image_processor_slow_1.to_dict()
difference = {
    key: dict_slow_0.get(key) if key in dict_slow_0 else dict_slow_1.get(key)
    for key in set(dict_slow_0) ^ set(dict_slow_1)
}
dict_slow_0 = {key: dict_slow_0[key] for key in set(dict_slow_0) & set(dict_slow_1)}
dict_slow_1 = {key: dict_slow_1[key] for key in set(dict_slow_0) & set(dict_slow_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_slow_0, dict_slow_1)

dict_fast_0 = image_processor_fast_0.to_dict()
dict_fast_1 = image_processor_fast_1.to_dict()
difference = {
    key: dict_fast_0.get(key) if key in dict_fast_0 else dict_fast_1.get(key)
    for key in set(dict_fast_0) ^ set(dict_fast_1)
}
dict_fast_0 = {key: dict_fast_0[key] for key in set(dict_fast_0) & set(dict_fast_1)}
dict_fast_1 = {key: dict_fast_1[key] for key in set(dict_fast_0) & set(dict_fast_1)}
# check that all additional keys are None
self.assertTrue(all(value is None for value in difference.values()))
# check that the remaining keys are the same
self.assertEqual(dict_fast_0, dict_fast_1)
This modification is needed because the BaseImageProcessorFast that the fast processor is most likely based on might have more capabilities (for example, center cropping) than the slow image processor.
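To make the intent concrete, here is a toy illustration (values invented, not from the test suite) of what the symmetric-difference check above allows: an extra key is tolerated only when its value is None.

dict_slow_0 = {"do_resize": True, "size": 224}
dict_slow_1 = {"do_resize": True, "size": 224, "crop_pct": None}  # extra key, but its value is None

difference = {
    key: dict_slow_0.get(key) if key in dict_slow_0 else dict_slow_1.get(key)
    for key in set(dict_slow_0) ^ set(dict_slow_1)
}
assert difference == {"crop_pct": None}
# passes: the only extra key maps to None; a non-None extra value would make the test fail
assert all(value is None for value in difference.values())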
Same here.
@ArthurZucker This PR is still in a rough state but pinging you for visibility as there are lots of refactoring choices that are up for discussion!
I made an initial quick pass. My main concern here is how common these methods are. I mean, for preprocess, does it scale to most models, or will we need to override it somewhere?
I think (but might be wrong) that the better design choice would be to follow standards in image processing, where we define each transformation as a separate class and then can pipe them. This approach is followed by torchvision, albumentations, kornia, and imgaug. So I suppose this is a pretty robust approach that should work well in our case too.
We can reuse the original torch transforms if they are compatible with torch compile; otherwise, we can rewrite them ourselves. The most common transforms will live in the base file, while custom ones will live in the model's image_processing_* file and can be moved to common for reuse.
Let me know what you think about this. Do you see any difficulties with this approach?
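To make the suggestion concrete, here is a minimal sketch of the piped class-transform style using plain torchvision transforms (illustration only; the specific transforms and values are assumptions, not code from this PR):

import torch
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize

# Each step is a class instance, and the steps are piped together.
pipeline = Compose(
    [
        Resize((256, 256)),
        CenterCrop(224),
        Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

image = torch.rand(3, 300, 400)  # a float (C, H, W) image tensor, already rescaled to [0, 1]
pixel_values = pipeline(image)

The appeal of this style is that the whole preprocessing recipe is declared in one place and each step can be reused or swapped independently.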
@qubvel The problem I see with this is that some image processors need additional logic in between transforms that depends on the result of the previous transforms (like the padding operation in DETR-like image processors), so unless I'm mistaken, I don't think piping class transforms would be possible here. That said, I might be misunderstanding or missing something if this is a standard approach in other image preprocessing libraries, so I'm happy to discuss this further.
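A simplified sketch of the kind of data-dependent step being referred to (not the actual DETR image processor code; the resize policy and sizes are made up): the pad size can only be computed after every image has been resized, so it cannot be baked into a fixed pipe of transforms up front.

import torch
import torch.nn.functional as F

def resize_then_pad(images, max_edge=640):
    # Step 1: resize each (C, H, W) image; the resulting shapes may still differ.
    resized = []
    for img in images:
        scale = max_edge / max(img.shape[1], img.shape[2])
        resized.append(F.interpolate(img[None], scale_factor=scale, mode="bilinear")[0])
    # Step 2: the padding target depends on the *results* of step 1,
    # which is the in-between logic that a static pipeline cannot express.
    max_h = max(img.shape[1] for img in resized)
    max_w = max(img.shape[2] for img in resized)
    padded = [F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1])) for img in resized]
    return torch.stack(padded)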
Force-pushed from 867b1f5 to 3f8674f.
Very cool PR, happy to see fast image processors getting propagated!
I left a few questions here and there, mostly nits. Overall I like the "less code for users" idea and moving redundant parts to the base class. But it seems like we are hiding the core processing logic in the base class, so I agree with @qubvel that having a pipe of transforms looks nicer and easier to consume.
And for the unusual models, we probably don't need a separate patching mixin, as it doesn't scale well: not many of those patch methods will stick around for long, and new methods will keep being invented. I am also wondering whether all this repeated code can be squeezed out when we add modular to those models. For example, llava-next-style patching is used in three models, so we would need to define it once and then let modular copy the main code body.
Force-pushed from f12c5d7 to da6de2e.
Hello @qubvel @zucchini-nlp! Thanks again for your reviews. I added examples of how I would write processors using class-based transforms, and a section in the PR with the advantages and drawbacks I see for functional vs class transforms. If you have some time I would be glad to have your opinions on this :)
Great work! Thanks for providing both options for review 👍 I agree, the functional approach looks more flexible right now.
I would differentiate image processors for the following cases:
- Basic image transforms: operate only on raw images; usually, in the original code, we might see a torchvision-defined preprocessing pipeline.
- Patching transforms: operate on patches but still work with images only. The preprocessing usually looks as follows: split into patches -> apply basic image transforms to the patches (normalize, rescale, etc.).
- Annotation-aware image transforms: operate on images and annotations (object detection, image segmentation). These are the most complicated, I suppose, because for each transform we also have to handle the annotation transformation (bounding boxes/masks/keypoints).
It might be oversimplified; am I missing something?
I think we can have different styles of image processors for these cases. And the thing we have to take into consideration is that it's common to have transforms written in torchvision, so it would be nice to have a clear way to port the original transforms to be transformers-like.
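As a purely illustrative sketch of the "patching transforms" category described above (the helper, grid size, and transforms are assumptions, not code from this PR): split into patches, then apply the basic image transforms to each patch.

import torch
from torchvision.transforms import Compose, Normalize, Resize

def split_into_patches(image, grid=2):
    # Split a (C, H, W) image into grid x grid patches (assumes H and W are divisible by grid).
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    return [
        image[:, i * ph : (i + 1) * ph, j * pw : (j + 1) * pw]
        for i in range(grid)
        for j in range(grid)
    ]

basic_transforms = Compose([Resize((224, 224)), Normalize(mean=[0.5] * 3, std=[0.5] * 3)])

image = torch.rand(3, 448, 448)
patches = [basic_transforms(patch) for patch in split_into_patches(image)]
pixel_values = torch.stack(patches)  # (num_patches, C, 224, 224)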
try:
    if not is_torchvision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_convnext_fast"] = ["ConvNextImageProcessorFast"]
Reminder: __init__ files no longer have this.
Yes, I was just thinking that making the modifications in this PR would add a lot of noise. I can open separate PRs to modify them.
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self.valid_extra_kwargs)

do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
default_to_square = kwargs.pop(
    "default_to_square", self.default_to_square if self.default_to_square is not None else True
)
size = get_size_dict(size=size, default_to_square=default_to_square) if size is not None else None
crop_pct = crop_pct if crop_pct is not None else self.crop_pct
resample = resample if resample is not None else self.resample
do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
crop_size = crop_size if crop_size is not None else self.crop_size
crop_size = get_size_dict(crop_size, param_name="crop_size") if crop_size is not None else None
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
return_tensors = "pt" if return_tensors is None else return_tensors
device = kwargs.pop("device", None)

images = self._prepare_input_images(
    images=images,
    do_convert_rgb=do_convert_rgb,
    device=device,
    input_data_format=input_data_format,
)

image_mean, image_std, size, crop_size, interpolation = self._prepare_process_arguments(
    device=images[0].device,
    do_resize=do_resize,
    size=size,
    resample=resample,
    crop_size=crop_size,
    do_rescale=do_rescale,
    rescale_factor=rescale_factor,
    do_normalize=do_normalize,
    image_mean=image_mean,
    image_std=image_std,
    return_tensors=return_tensors,
    data_format=data_format,
    **kwargs,
)
This one definitely makes it harder to catch the processing logic with a quick look. While it might be fine for us, because we see it every day, the first-time contributor is required to read and understand it. Any ideas on how it can be simplified/split?
There's really no processing happening here. Ideally this should be common to all processors, and it can be overridden to make some tiny changes if needed. This replaces the very verbose "pre"-preprocessing that sits at the beginning of all preprocess functions of image processors and is not fully consistent across processors. Having these two wrapper functions would be a way to force consistency in this "pre"-preprocessing step across fast processors.
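For illustration of the intended factoring, a hypothetical subclass (only _prepare_input_images and _prepare_process_arguments are taken from the diff above; the class name, the kwargs handling, and the final step are assumptions):

class MyModelImageProcessorFast(BaseImageProcessorFast):
    def preprocess(self, images, **kwargs):
        # Shared "pre"-preprocessing, step 1: validate the inputs, convert to RGB if requested,
        # and move the images to the requested device.
        images = self._prepare_input_images(
            images=images,
            do_convert_rgb=kwargs.pop("do_convert_rgb", self.do_convert_rgb),
            device=kwargs.pop("device", None),
            input_data_format=kwargs.pop("input_data_format", None),
        )
        # Shared "pre"-preprocessing, step 2: resolve every argument against the instance
        # defaults and prepare mean/std/size on the right device.
        image_mean, image_std, size, crop_size, interpolation = self._prepare_process_arguments(
            device=images[0].device, **kwargs
        )
        # Only the model-specific transform logic would remain to be written here.
        ...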
transforms.append(GroupByShape())
if do_resize:
    transforms.append(Resize(size, interpolation=interpolation))
    # Since the size was changed, we need to group the images by shape again
    transforms.append(ReorderImages())
    transforms.append(GroupByShape())
if do_center_crop:
    transforms.append(CenterCrop(crop_size))
    # Since the size was changed, we need to group the images by shape again
    transforms.append(ReorderImages())
    transforms.append(GroupByShape())
# We can combine rescale and normalize into a single operation for speed
if do_rescale and do_normalize:
    # image_mean and image_std have already been adjusted for rescaling
    transforms.append(Normalize(image_mean, image_std))
elif do_rescale:
    transforms.append(Rescale(rescale_factor=rescale_factor))
elif do_normalize:
    transforms.append(Normalize(image_mean, image_std))

if isinstance(transforms[-1], GroupByShape):
    # No added transforms, so we can remove the last GroupByShape
    transforms.pop()
else:
    # We necessarily have grouped images, so we need to reorder them back to the original order
    transforms.append(ReorderImages())

return Sequential(*transforms)
Ok, this one looks a bit overcomplicated compared to the functional one, but I like that we have a separate method, which is easier to read. If we avoid the GroupByShape transform, it will look better. Instead, we can add StackIfPossible, which will stack images only if they all have the same size.
It would be a shame to miss out on some nice speedups though. For example, with processors using patching, this group by shape method can lead to a 10 to 20x speedup. This would not work with something like StackIfPossible, as the border patches are usually of a different size than the middle patches.
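To spell out where that speedup comes from, a rough sketch of the grouping idea (the real group_images_by_shape helper in this PR may differ in its details): identically shaped images are stacked and processed as a single batched tensor, then restored to their original order.

from collections import defaultdict

import torch

def group_images_by_shape(images):
    # Group images with identical (C, H, W) shape so each group can be processed as one batch.
    grouped = defaultdict(list)
    index = []  # remembers, for each original image, its shape group and position in that group
    for image in images:
        shape = tuple(image.shape)
        index.append((shape, len(grouped[shape])))
        grouped[shape].append(image)
    return {shape: torch.stack(imgs) for shape, imgs in grouped.items()}, index

def reorder_images(processed, index):
    # Restore the original ordering once the per-group batched processing is done.
    return [processed[shape][pos] for shape, pos in index]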
if is_torch_available():

    class GroupByShape(torch.nn.Module):
        """
        Groups images by shape.
        Returns a dictionary with the shape as key and a list of images with that shape as value,
        and a dictionary with the index of the image in the original list as key and the shape and index in the grouped list as value.
        """

        def __init__(self):
            super().__init__()

        def forward(self, images: List["torch.Tensor"]):
            grouped_images, grouped_images_index = group_images_by_shape(images)
            return {"grouped_images": grouped_images, "grouped_images_index": grouped_images_index}
It's fine for prototyping, but if we go further, we will need to redesign this. Hiding classes under is_torch_available looks weird.
Agreed, but otherwise CI gets mad 😅
I believe so too for the functional approach. Using class transforms might look nicer in some cases, but they are limiting and add a layer of complexity for contributors that I don't think is necessary.
Agreed on the three different categories, but I'm not sure how to handle this differentiation in the code. What I was thinking was to add some guidance in the fast image processor file generated by the transformers-cli, explaining which existing processor to use as a baseline depending on what sort of image processing the model needs.
What does this PR do?
This PR introduces a significant refactoring of how fast image processors can be implemented in Transformers.
Motivation
The primary goal of this refactoring is to simplify the process of creating fast image processors when a slow image processor already exists.
Unlike the current BaseImageProcessor, which provides only a minimal skeleton for image processor classes, the newly introduced BaseImageProcessorFast includes all the basic functionality needed by a typical image processor. This design allows contributors to focus primarily on modifying default parameters, such as the image mean, std, or default resizing dimensions, rather than rewriting foundational code.
Key Advantages:
- BaseImageProcessorFast provides a natural starting point with predefined functionalities.
- Fixes and improvements to BaseImageProcessorFast are automatically propagated to all derived fast image processors.
Implementation
Functional or class transforms
Following the torchvision approach to defining image transforms, there are two main ways to write the processing logic for image processors: using functional transforms or class-based transforms.
This PR showcases both approaches:
- BaseImageProcessorFast: implemented using functional transforms.
- ViTImageProcessorFast: almost equivalent, but uses piped class-based transforms.
- LlavaNextImageProcessorFast: implemented using functionals.
- LlavaOnevisionImageProcessorFast: implemented using a mix of class-based transforms and functionals; equivalent to LlavaNextImageProcessorFast.
The choice is entirely open for debate at this point. I see advantages to both approaches, but I’m sure I haven’t considered everything, so please share your thoughts if you have a preference one way or the other.
To me, the advantages/drawbacks of functionals are the following:
For class transforms:
- LlavaOnevisionImageProcessorFast fails to compile, as it seemingly gets stuck in an infinite compilation loop, while its functional equivalent, LlavaNextImageProcessorFast, compiles without problems. However, this needs more thorough investigation.
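For context on the compile point, a tiny illustration (not the Llava processors themselves) of why the functional style composes naturally with torch.compile: the whole pipeline is just a function of tensors.

import torch

def normalize(images, mean, std):
    # A purely functional transform: a plain function over tensors.
    return (images - mean) / std

compiled_normalize = torch.compile(normalize)

images = torch.rand(4, 3, 224, 224)
mean = torch.tensor([0.5, 0.5, 0.5]).view(1, 3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(1, 3, 1, 1)
out = compiled_normalize(images, mean, std)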
New add-fast-image-processor Command in transformers-cli
This PR introduces a new CLI command that automates the creation of fast image processors. Given a model folder name, it generates all necessary imports, documentation, dummy objects, and a fast image processor file where default parameters are parsed from the slow image processor file.
The new fast image processor class is also added to the image processor test file by the CLI. However, some tests may still need manual adjustments to properly include and validate the new fast image processor class.
Example Usage:
Example Output:
In transformers/src/models/blip/image_processing_blip_fast.py:
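As an illustration of the kind of file the command generates, here is a rough sketch (the default values and imported constants below are assumptions inferred from the slow BLIP processor, not the PR's literal output):

from ...image_processing_utils_fast import BaseImageProcessorFast
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling

class BlipImageProcessorFast(BaseImageProcessorFast):
    # Only the defaults change; all of the processing logic is inherited from the base class.
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"height": 384, "width": 384}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True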
In this case, this is enough to get a fully working fast Blip image processor!

New Mixins for Common Logic
To handle shared preprocessing and post-processing logic, this PR introduces reusable mixins (only LlavaPatchingMixin is present as an example in this PR). Additional mixins are planned for other common patterns.
Edit: Removing the mixins for patching in favor of # Copied from or modular in the future, as such preprocessing techniques don't usually stick around for a long time, and adding mixins every time a technique is used twice or more wouldn't scale well.

Summary: Three Types of Fast Image Processors
1. Processors that only override the default parameters of BaseImageProcessorFast. Examples in this PR: blip, clip, deit, siglip, and vit.
2. Mixin-Based Processors: these rely heavily on predefined mixins to implement shared logic. Examples in this PR: llava_next and llava_onevision. Edit: Removing mixins in favor of # Copied From or modular for now, see the earlier comment.
3. Processors that define their own __init__ and preprocess while reusing the syntax and structure of BaseImageProcessorFast. Example in this PR: convnext.
Miscellaneous Issues and Questions
- Adding shared logic to image_processing_utils_fast might make the file large and difficult to read. This also raises the question of when a mixin should be created for repeated patterns across image processors. Should it be created when a pattern is shared by two or more models? Or when it presents an idea likely to be reused in the future?
- Padding is not handled in BaseImageProcessorFast because the image processors that use padding implement it in slightly different ways. This makes it challenging to standardize the padding logic. A more consistent approach should maybe be added in the future.
- The center_crop functions in Transformers' image_transforms and in torchvision have different implementations: the cropping boundaries in Transformers are defined as top = (orig_height - crop_height) // 2 and left = (orig_width - crop_width) // 2, whereas torchvision uses top = int(round((orig_height - crop_height) / 2.0)) and left = int(round((orig_width - crop_width) / 2.0)). This can result in shifted cropping when the size before cropping is odd (for example, with orig_height = 11 and crop_height = 4, the first formula gives top = 3 while the second gives top = 4). I don't think this should be much of a problem for users, but when comparing outputs of slow and fast image processors in tests, this results in huge differences.

Who can review?
@qubvel @molbap @zucchini-nlp