Image + text + audio uniform processors #30511

Open
wants to merge 97 commits into base: main

Conversation

molbap (Contributor) commented Apr 26, 2024

What does this PR do?

This PR is a stab at uniformizing the processors across all transformers models. If we are happy with the design, I'll expand it to all existing models. For now it only touches a few text + audio and text + image models as experiment subjects. Linked with #28711, which has a larger scope, and with several previous discussions with team members.

Usage

As before, kwargs that are passed to processors at __call__ time take priority. However, per-modality processors can be instantiated with their own kwargs, and if these are not overridden at call time, they serve as defaults.

Type hinting of kwargs is preserved if they are passed as structured dictionary entries:
[screenshot]
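A minimal sketch of this nested style, standing in for the screenshot; the checkpoint and the exact kwargs are illustrative assumptions, not code from this PR:

    from PIL import Image
    from transformers import AutoProcessor

    # Illustrative checkpoint; any processor covered by this PR would be called the same way.
    processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("cat.png")

    # Per-modality kwargs passed as structured dictionaries keep their TypedDict hints.
    inputs = processor(
        text=["a photo of a cat"],
        images=image,
        text_kwargs={"padding": "max_length", "max_length": 64},
        images_kwargs={"crop_size": (224, 224)},
    )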

It also works with kwargs passed without nesting:
[screenshot]
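The same call in flat form (same illustrative setup as above); each kwarg is routed internally to the relevant modality processor:

    inputs = processor(
        text=["a photo of a cat"],
        images=image,
        padding="max_length",
        max_length=64,
        crop_size=(224, 224),
    )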

Merging of kwargs and handling their priority order is done in processing_utils through a dedicated method; a sketch of this merge follows the list below.
The order of operations is as follows:

  1. kwargs passed at call time as before have the highest priority, to preserve BC.

     high_priority_kwargs = {"crop_size": (224, 224), "padding": "max_length"}
     processor(..., **high_priority_kwargs)

  2. kwargs passed as modality-specific kwargs have second priority. This is the recommended API.

     processor(..., text_kwargs={"padding": "max_length"}, images_kwargs={"crop_size": (224, 224)})

  3. kwargs passed during instantiation of a modality processor have third priority.

     tokenizer = tokenizer_class(..., padding="max_length")
     image_processor = image_processor_class(...)
     processor = processor_class(tokenizer, image_processor)  # "max_length" padding applies unless overridden at call time

  4. default kwargs specified at the processor level have the lowest priority.

     class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwargs, total=False):
         _defaults = {
             "text_kwargs": {
                 "padding": "max_length",
                 "max_length": 64,
             },
         }
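A minimal sketch of the merge described above; the function name and exact structure are assumptions for illustration, not necessarily the dedicated method in processing_utils:

    # Lowest priority first: each later update overrides the earlier one.
    def merge_modality_kwargs(processor_defaults, init_kwargs, modality_kwargs, call_kwargs):
        merged = dict(processor_defaults)   # 4. _defaults declared on the ProcessingKwargs class
        merged.update(init_kwargs)          # 3. kwargs the modality processor was instantiated with
        merged.update(modality_kwargs)      # 2. nested kwargs, e.g. text_kwargs={...}
        merged.update(call_kwargs)          # 1. flat kwargs passed at call time (BC, highest priority)
        return merged

    # Example, for the text modality:
    # merge_modality_kwargs({"padding": "max_length", "max_length": 64}, {}, {"padding": "longest"}, {})
    # -> {"padding": "longest", "max_length": 64}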

What changes:

  • ~~There is now an attribute in the constructor that stores the processing kwargs needed to be passed to various encoders down the line.~~ Not anymore, see the next point.
  • There is now a base ProcessingKwargs TypedDict of kwargs that inherits from ImagesKwargs, TextKwargs, and so on (see the sketch after this list).
  • These nested attributes (one dictionary for text, one for images, one for audio, one for video) are typed with a TypedDict that does not need to be total. We can expand this TypedDict (in processing_utils) as processors get uniformized.
  • The processors are called with the same kwargs signature: text, images, audio, videos. kwargs set by a user always override default processing kwargs.
  • Slicing of positional args is removed and replaced by named kwargs corresponding to each modality. To preserve BC, the order of these kwargs is therefore not constant across models.
  • Inputs (text, images, audio, videos) are now always typed in the call.
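A rough, self-contained sketch of the TypedDict layout described in the points above; the field names are illustrative and the exact class relationships may differ from the PR, where the real definitions live in processing_utils:

    from typing import Optional, Union
    from typing_extensions import TypedDict  # typing.TypedDict also works on Python >= 3.8

    class TextKwargs(TypedDict, total=False):
        padding: Union[bool, str]
        max_length: Optional[int]

    class ImagesKwargs(TypedDict, total=False):
        crop_size: Optional[tuple]
        do_resize: Optional[bool]

    # Non-total nested dict of per-modality kwargs, expanded as processors get uniformized.
    class ProcessingKwargs(TypedDict, total=False):
        text_kwargs: TextKwargs
        images_kwargs: ImagesKwargs
        # audio_kwargs and videos_kwargs follow the same pattern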

What that allows:

  • We know each model has its own processing logic and its own way of mixing inputs. Typing both the modalities sent to the processors AND the processing arguments allows faster design of future processors, hence faster reviews, faster merges, and better usage.
  • The reason is that the actual mixing of modalities and their specific processing is pushed out of the function signature and handled explicitly through the kwargs passed (see the sketch after this list).
  • With TypedDict, type hints are preserved even in the nesting. I hesitated between pydantic/dataclasses and TypedDict, and opted for TypedDict because it is less flexible, and we want to enforce a standard.
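A minimal, self-contained sketch of how that can look in a processor's __call__; the class and attribute names are assumptions, not the PR's exact code:

    # The signature is always (text, images, audio, videos, **kwargs); the model-specific
    # mixing of modalities happens inside, driven by the per-modality kwargs.
    class MyProcessor:
        def __init__(self, image_processor=None, tokenizer=None):
            self.image_processor = image_processor
            self.tokenizer = tokenizer

        def __call__(self, text=None, images=None, audio=None, videos=None, **kwargs):
            outputs = {}
            if text is not None:
                outputs.update(self.tokenizer(text, **kwargs.get("text_kwargs", {})))
            if images is not None:
                outputs.update(self.image_processor(images, **kwargs.get("images_kwargs", {})))
            return outputs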

Limitations:

  • This still relies on kwargs for the processing.
  • Only a few models are covered in this PR; more will follow in other PRs.

Tests pass (I think). What's missing:

Who can review?

Models:

molbap mentioned this pull request on Jun 11, 2024
huggingface deleted a comment from the github-actions bot on Jul 8, 2024
molbap mentioned this pull request on Jul 17, 2024
huggingface deleted a comment from the github-actions bot on Aug 2, 2024
Review comment on the changed __init__ code:

    feature_extractor = None
    if "feature_extractor" in kwargs:
    def __init__(self, image_processor=None, tokenizer=None, feature_extractor=None):
        if "feature_extractor":

gafrom commented Aug 7, 2024:
Always true? (same in other places)