Image + text + audio uniform processors #30511

Open
wants to merge 97 commits into base: main

Conversation

molbap (Contributor) commented Apr 26, 2024

What does this PR do?

This PR is a stab at uniformizing the processors across all transformers models. If we are happy with the design, I'll expand it to all existing models. For now it only touches a few text + audio and text + image models as experiment subjects. Linked with #28711, which has a larger scope, and with several previous discussions with team members.

Usage

As before, kwargs that are passed to processors at __call__ time take priority. However, per-modality processors can be instantiated with their own kwargs, and if these are not overridden at call time, they serve as defaults.

Type hinting of kwargs is preserved if they are passed as structured dictionary entries:
[screenshot]
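A minimal sketch of this nested style, standing in for the screenshot; the checkpoint and the exact kwargs are illustrative assumptions, not code from this PR:

    from PIL import Image
    from transformers import AutoProcessor

    # Illustrative checkpoint; any processor covered by this PR would be called the same way.
    processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("cat.png")

    # Per-modality kwargs passed as structured dictionaries keep their TypedDict hints.
    inputs = processor(
        text=["a photo of a cat"],
        images=image,
        text_kwargs={"padding": "max_length", "max_length": 64},
        images_kwargs={"crop_size": (224, 224)},
    )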

It also works with kwargs passed without nesting:
[screenshot]
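The same call in flat form (same illustrative setup as above); each kwarg is routed internally to the relevant modality processor:

    inputs = processor(
        text=["a photo of a cat"],
        images=image,
        padding="max_length",
        max_length=64,
        crop_size=(224, 224),
    )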

Merging of kwargs and handling their priority order is done in processing_utils through a dedicated method; a sketch of this merge follows the list below.
The order of operations is as follows:

  1. kwargs passed at call time as before have the highest priority, to preserve BC.

     high_priority_kwargs = {"crop_size": (224, 224), "padding": "max_length"}
     processor(..., **high_priority_kwargs)

  2. kwargs passed as modality-specific kwargs have second priority. This is the recommended API.

     processor(..., text_kwargs={"padding": "max_length"}, images_kwargs={"crop_size": (224, 224)})

  3. kwargs passed during instantiation of a modality processor have third priority.

     tokenizer = tokenizer_class(..., padding="max_length")
     image_processor = image_processor_class(...)
     processor = processor_class(tokenizer, image_processor)  # "max_length" padding applies unless overridden at call time

  4. default kwargs specified at the processor level have the lowest priority.

     class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwargs, total=False):
         _defaults = {
             "text_kwargs": {
                 "padding": "max_length",
                 "max_length": 64,
             },
         }
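A minimal sketch of the merge described above; the function name and exact structure are assumptions for illustration, not necessarily the dedicated method in processing_utils:

    # Lowest priority first: each later update overrides the earlier one.
    def merge_modality_kwargs(processor_defaults, init_kwargs, modality_kwargs, call_kwargs):
        merged = dict(processor_defaults)   # 4. _defaults declared on the ProcessingKwargs class
        merged.update(init_kwargs)          # 3. kwargs the modality processor was instantiated with
        merged.update(modality_kwargs)      # 2. nested kwargs, e.g. text_kwargs={...}
        merged.update(call_kwargs)          # 1. flat kwargs passed at call time (BC, highest priority)
        return merged

    # Example, for the text modality:
    # merge_modality_kwargs({"padding": "max_length", "max_length": 64}, {}, {"padding": "longest"}, {})
    # -> {"padding": "longest", "max_length": 64}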

What changes:

  • ~~There is now an attribute in the constructor that stores the processing kwargs needed to be passed to various encoders down the line.~~ Not anymore, see the next point.
  • There is now a base ProcessingKwargs TypedDict of kwargs that inherits from ImagesKwargs, TextKwargs, and so on (see the sketch after this list).
  • These nested attributes (one dictionary for text, one for images, one for audio, one for video) are typed with a TypedDict that does not need to be total. We can expand this TypedDict (in processing_utils) as processors get uniformized.
  • The processors are called with the same kwargs signature: text, images, audio, videos. kwargs set by a user always override default processing kwargs.
  • Slicing of positional args is removed and replaced by named kwargs corresponding to each modality. To preserve BC, the order of these kwargs is therefore not constant across models.
  • Inputs (text, images, audio, videos) are now always typed in the call.
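A rough, self-contained sketch of the TypedDict layout described in the points above; the field names are illustrative and the exact class relationships may differ from the PR, where the real definitions live in processing_utils:

    from typing import Optional, Union
    from typing_extensions import TypedDict  # typing.TypedDict also works on Python >= 3.8

    class TextKwargs(TypedDict, total=False):
        padding: Union[bool, str]
        max_length: Optional[int]

    class ImagesKwargs(TypedDict, total=False):
        crop_size: Optional[tuple]
        do_resize: Optional[bool]

    # Non-total nested dict of per-modality kwargs, expanded as processors get uniformized.
    class ProcessingKwargs(TypedDict, total=False):
        text_kwargs: TextKwargs
        images_kwargs: ImagesKwargs
        # audio_kwargs and videos_kwargs follow the same pattern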

What that allows:

  • We know each model has its own processing logic and its own way of mixing inputs. Typing both the modalities sent to the processors AND the processing arguments allows faster design of future processors, hence faster reviews, faster merges, and better usage.
  • The reason is that the actual mixing of modalities and their specific processing is pushed out of the function signature and handled explicitly through the kwargs passed (see the sketch after this list).
  • With TypedDict, type hints are preserved even in the nesting. I hesitated between pydantic/dataclasses and TypedDict, and opted for TypedDict because it is less flexible, and we want to enforce a standard.
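A minimal, self-contained sketch of how that can look in a processor's __call__; the class and attribute names are assumptions, not the PR's exact code:

    # The signature is always (text, images, audio, videos, **kwargs); the model-specific
    # mixing of modalities happens inside, driven by the per-modality kwargs.
    class MyProcessor:
        def __init__(self, image_processor=None, tokenizer=None):
            self.image_processor = image_processor
            self.tokenizer = tokenizer

        def __call__(self, text=None, images=None, audio=None, videos=None, **kwargs):
            outputs = {}
            if text is not None:
                outputs.update(self.tokenizer(text, **kwargs.get("text_kwargs", {})))
            if images is not None:
                outputs.update(self.image_processor(images, **kwargs.get("images_kwargs", {})))
            return outputs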

Limitations:

  • This still relies on kwargs for the processing.
  • Only a few models are covered in this PR; more will follow in other PRs.

Tests pass (I think). What's missing:

Who can review?

Models:

molbap mentioned this pull request on Jun 11, 2024
huggingface deleted a comment from the github-actions bot on Jul 8, 2024
molbap mentioned this pull request on Jul 17, 2024
huggingface deleted a comment from the github-actions bot on Aug 2, 2024
Review comment on the changed __init__ code:

    feature_extractor = None
    if "feature_extractor" in kwargs:
    def __init__(self, image_processor=None, tokenizer=None, feature_extractor=None):
        if "feature_extractor":

gafrom commented Aug 7, 2024:
Always true? (same in other places)