feat: adding mplugdocowl #31059

Draft
wants to merge 55 commits into
base: main
b311e5e
feat: adding mplugdocowl
danaaubakirova May 27, 2024
aa0ec04
feat: added separate file for the mPLUGDocOwl language model
danaaubakirova May 27, 2024
cc7e9b3
feat: added vision encoder for mplugdocowl
danaaubakirova May 27, 2024
204daba
fix: changed the attention mechanism in clip vision, renamed to MPLUG…
danaaubakirova May 28, 2024
6e144e5
feat: added hreducer and new things in config, changed vision embeddi…
danaaubakirova May 28, 2024
9f94d2c
fix: converted hreducer module related tensors to contiguous
danaaubakirova May 29, 2024
19ffc83
feat: added shape adaptive module
danaaubakirova May 31, 2024
85dce8d
feat: added new image_processing script
danaaubakirova Jun 3, 2024
0f5fb87
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jun 4, 2024
53aca6d
fix: small fix
danaaubakirova Jun 4, 2024
cb25b05
Merge branch 'mplugdocowl' of github.com:danaaubakirova/transformers …
danaaubakirova Jun 4, 2024
1debae3
feat: added the additional keys to the output of the data
danaaubakirova Jun 4, 2024
66b849d
feat: made major modifications to image_processing script. added the …
danaaubakirova Jun 6, 2024
1716668
feat: refactored shape_adaptive_cropping function and resolved the is…
danaaubakirova Jun 10, 2024
452ebf5
feat: testing forward
danaaubakirova Jun 11, 2024
1e7f386
feat: corrected image tag
danaaubakirova Jun 12, 2024
8577f35
fix: attention mask handling is fixed, .forward works
danaaubakirova Jun 13, 2024
f546fbc
feat: updates in vision architecture
danaaubakirova Jun 18, 2024
edc358d
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jun 19, 2024
9003d59
fix: renaming the model
danaaubakirova Jun 19, 2024
9f688d9
grand fix: fixed hreducer, the firstgenerated token is correct. forw…
danaaubakirova Jun 21, 2024
30c8a2b
fix: need to fix prepare_inputs_for_generation()
danaaubakirova Jun 24, 2024
5483f82
fix: fixed prepare_inputs_for_generation()
danaaubakirova Jun 24, 2024
413ddad
Merge branch 'main' into mplugdocowl
danaaubakirova Jun 25, 2024
7546063
testing phase
danaaubakirova Jun 25, 2024
e3cc222
removed copied from ..
danaaubakirova Jun 25, 2024
4f4f219
small fixes
danaaubakirova Jun 25, 2024
661bd75
removed some things from the config
danaaubakirova Jun 26, 2024
8aded38
small fixes
danaaubakirova Jun 27, 2024
19e0a35
update
danaaubakirova Jun 27, 2024
8300463
small fix
danaaubakirova Jun 27, 2024
f0c87d8
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
b75b2b9
Update src/transformers/models/mplugdocowl/modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
2aae5ca
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
105b5e1
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
7a2f434
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
205e345
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
0f5ba22
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
c0e241a
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
1555e04
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
219d866
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jun 27, 2024
4600f75
Update src/transformers/models/mplugdocowl/convert_mplugdocowl_weight…
danaaubakirova Jun 27, 2024
cb55d49
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jun 27, 2024
c4c711c
model card is updated. tips to be added
danaaubakirova Jun 28, 2024
3007178
fix
danaaubakirova Jun 28, 2024
cdcf2f6
added documentation,updated rotary embedding function, added ModelTest
danaaubakirova Jun 28, 2024
cc7681f
updated
danaaubakirova Jul 1, 2024
c8c8b14
fixes
danaaubakirova Jul 2, 2024
6897da5
update
danaaubakirova Jul 2, 2024
0f0e517
deleted test.py
danaaubakirova Jul 2, 2024
046e2bd
filled in the types and docstrings
danaaubakirova Jul 2, 2024
1c498fc
nit
danaaubakirova Jul 2, 2024
6b5af5e
fixes
danaaubakirova Jul 2, 2024
e8cebb5
update
danaaubakirova Jul 2, 2024
dd0f8ce
new
danaaubakirova Jul 3, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -796,6 +796,8 @@
title: MatCha
- local: model_doc/mgp-str
title: MGP-STR
- local: model_doc/mplugdocowl
title: mPLUGDocOwl
- local: model_doc/nougat
title: Nougat
- local: model_doc/oneformer
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -209,6 +209,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
| [mPLUGDocOwl](model_doc/mplugdocowl) | ✅ | ❌ | ❌ |
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
| [MRA](model_doc/mra) | ✅ | ❌ | ❌ |
58 changes: 58 additions & 0 deletions docs/source/en/model_doc/mplugdocowl.md
@@ -0,0 +1,58 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# mPLUG-DocOwl1.5

## Overview

The mPLUG-DocOwl1.5 model was proposed in [mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/pdf/2403.12895) by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan,
Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou.

mPLUG-DocOwl1.5 is a multimodal model designed for text-rich images. Its vision-to-text module, H-Reducer, merges horizontally adjacent visual features, which shortens the visual sequence for high-resolution document images while preserving their spatial layout.
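
The horizontal merging idea can be illustrated with a small sketch. This is not the model's implementation: the function name `merge_horizontal`, the group size of 4, and the use of mean pooling in place of the learned convolution are all assumptions made for illustration. It only shows how merging neighbours along each row shortens the sequence while keeping the row structure intact.

```python
from typing import List

Vector = List[float]


def merge_horizontal(grid: List[List[Vector]], group: int = 4) -> List[List[Vector]]:
    """Merge every `group` horizontally adjacent patch features in each row.

    Mean pooling is a stand-in (assumption) for the learned convolution.
    """
    merged = []
    for row in grid:
        if len(row) % group != 0:
            raise ValueError("row width must be divisible by group")
        new_row = []
        for start in range(0, len(row), group):
            chunk = row[start:start + group]
            dim = len(chunk[0])
            # Average the features of the patches in this horizontal chunk.
            new_row.append([sum(vec[d] for vec in chunk) / group for d in range(dim)])
        merged.append(new_row)
    return merged


# A 2x8 grid of 1-dimensional patch features; each value encodes its column index.
grid = [[[float(col)] for col in range(8)] for _ in range(2)]
reduced = merge_horizontal(grid, group=4)
# The number of rows is unchanged (2); each row shrinks from 8 features to 2.
```

Merging only along the horizontal axis keeps vertically stacked lines of text in separate rows, which is the layout-preserving property described above.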

The model employs Unified Structure Learning with structure-aware parsing tasks and multi-grained text localization tasks, teaching it to parse text using line feeds, spaces, and extended Markdown syntax, which enhances the model's ability to correlate text with specific positions in the image.

DocOwl 1.5 undergoes a two-stage training process: Unified Structure Learning followed by Multi-task Tuning among Downstream Tasks. The high-quality DocReason25K dataset boosts reasoning abilities, allowing DocOwl 1.5-Chat to balance concise answers and detailed explanations.

The abstract from the paper is the following:

*Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks.*

Tips:

<INSERT TIPS ABOUT MODEL HERE>

This model was contributed by [danaaubakirova](https://huggingface.co/danaaubakirova).
The original code can be found [here](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5).


## MPLUGDocOwlConfig

[[autodoc]] MPLUGDocOwlConfig

## MPLUGDocOwlImageProcessor

[[autodoc]] MPLUGDocOwlImageProcessor

## MPLUGDocOwlProcessor

[[autodoc]] MPLUGDocOwlProcessor

## MPLUGDocOwlHReducer

[[autodoc]] MPLUGDocOwlHReducer

## MPLUGDocOwlForConditionalGeneration

[[autodoc]] MPLUGDocOwlForConditionalGeneration
    - forward
Binary file added examples_multi_col_60204.png
19 changes: 18 additions & 1 deletion src/transformers/__init__.py
@@ -538,6 +538,10 @@
"models.mobilenet_v2": ["MobileNetV2Config"],
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.mplugdocowl": [
"MPLUGDocOwlConfig",
"MPLUGDocOwlProcessor",
],
"models.mpnet": [
"MPNetConfig",
"MPNetTokenizer",
@@ -1144,6 +1148,7 @@
_import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
_import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.mplugdocowl"].extend(["MPLUGDocOwlImageProcessor"])
_import_structure["models.nougat"].append("NougatImageProcessor")
_import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
_import_structure["models.owlv2"].append("Owlv2ImageProcessor")
@@ -2509,6 +2514,9 @@
"MobileViTV2PreTrainedModel",
]
)
_import_structure["models.mplugdocowl"].extend(
["MPLUGDocOwlForConditionalGeneration", "MPLUGDocOwlHReducer", "MPLUGDocOwlPreTrainedModel"]
)
_import_structure["models.mpnet"].extend(
[
"MPNetForMaskedLM",
@@ -5125,6 +5133,10 @@
from .models.mobilevitv2 import (
MobileViTV2Config,
)
from .models.mplugdocowl import (
MPLUGDocOwlConfig,
MPLUGDocOwlProcessor,
)
from .models.mpnet import (
MPNetConfig,
MPNetTokenizer,
@@ -5767,6 +5779,7 @@
MobileNetV2ImageProcessor,
)
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.mplugdocowl import MPLUGDocOwlImageProcessor
from .models.nougat import NougatImageProcessor
from .models.oneformer import OneFormerImageProcessor
from .models.owlv2 import Owlv2ImageProcessor
@@ -5794,7 +5807,6 @@
from .models.vitmatte import VitMatteImageProcessor
from .models.vivit import VivitImageProcessor
from .models.yolos import YolosFeatureExtractor, YolosImageProcessor

# Modeling
try:
if not is_torch_available():
@@ -6889,6 +6901,11 @@
MobileViTV2Model,
MobileViTV2PreTrainedModel,
)
from .models.mplugdocowl import (
MPLUGDocOwlForConditionalGeneration,
MPLUGDocOwlHReducer,
MPLUGDocOwlPreTrainedModel,
)
from .models.mpnet import (
MPNetForMaskedLM,
MPNetForMultipleChoice,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -154,6 +154,7 @@
mobilenet_v2,
mobilevit,
mobilevitv2,
mplugdocowl,
mpnet,
mpt,
mra,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -164,6 +164,7 @@
("mobilenet_v2", "MobileNetV2Config"),
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("mplugdocowl", "MPLUGDocOwlConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
("mra", "MraConfig"),
@@ -447,6 +448,7 @@
("mobilenet_v2", "MobileNetV2"),
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("mplugdocowl", "mPLUGDocOwl"),
("mpnet", "MPNet"),
("mpt", "MPT"),
("mra", "MRA"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -89,6 +89,7 @@
("mobilevit", "MobileViTImageProcessor"),
("mobilevit", "MobileViTImageProcessor"),
("mobilevitv2", "MobileViTImageProcessor"),
("mplugdocowl", "MPLUGDocOwlImageProcessor"),
("nat", "ViTImageProcessor"),
("nougat", "NougatImageProcessor"),
("oneformer", "OneFormerImageProcessor"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -306,6 +306,7 @@
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForPreTraining"),
("mobilebert", "MobileBertForPreTraining"),
("mplugdocowl", "MPLUGDocOwlForConditionalGeneration"),
("mpnet", "MPNetForMaskedLM"),
("mpt", "MptForCausalLM"),
("mra", "MraForMaskedLM"),
@@ -699,6 +700,7 @@
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("mplugdocowl", "MPLUGDocOwlForConditionalGeneration"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("video_llava", "VideoLlavaForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -72,6 +72,7 @@
("markuplm", "MarkupLMProcessor"),
("mctct", "MCTCTProcessor"),
("mgp-str", "MgpstrProcessor"),
("mplugdocowl", "MPLUGDocOwlProcessor"),
("oneformer", "OneFormerProcessor"),
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
4 changes: 4 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -288,6 +288,10 @@
),
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
(
"mplugdocowl",
("MPLUGDocOwlTokenizer", "MPLUGDocOwlTokenizerFast" if is_tokenizers_available() else None),
),
("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("mra", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
75 changes: 75 additions & 0 deletions src/transformers/models/mplugdocowl/__init__.py
@@ -0,0 +1,75 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


# Only torch-free entries are listed here; the modeling classes (including
# MPLUGDocOwlHReducer) are registered under the torch guard below.
_import_structure = {
    "configuration_mplugdocowl": ["MPLUGDocOwlConfig"],
    "processing_mplugdocowl": ["MPLUGDocOwlProcessor"],
}

try:
    if not is_vision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_mplugdocowl"] = ["MPLUGDocOwlImageProcessor"]

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_mplugdocowl"] = [
        "MPLUGDocOwlForConditionalGeneration",
        "MPLUGDocOwlPreTrainedModel",
        "MPLUGDocOwlHReducer",
    ]


if TYPE_CHECKING:
    from .configuration_mplugdocowl import MPLUGDocOwlConfig
    from .processing_mplugdocowl import MPLUGDocOwlProcessor

    try:
        if not is_vision_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .image_processing_mplugdocowl import MPLUGDocOwlImageProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_mplugdocowl import (
            MPLUGDocOwlForConditionalGeneration,
            MPLUGDocOwlHReducer,
            MPLUGDocOwlPreTrainedModel,
        )


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
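
For readers unfamiliar with the pattern, the `_import_structure` mapping above lets the package defer submodule imports until a name is first accessed. The sketch below is a simplified illustration of that idea, not the `_LazyModule` implementation from `transformers`: the class name `LazyModule` is hypothetical, and standard-library modules stand in for the model's submodules.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Resolve exported names on first attribute access (hypothetical, simplified)."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported name to the module that defines it.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. before the name is cached.
        if attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Standard-library modules stand in for the model's submodules (assumption).
lazy = LazyModule("demo", {"json": ["dumps"], "math": ["sqrt"]})
encoded = lazy.dumps({"a": 1})
root = lazy.sqrt(9.0)
```

Because names are resolved on demand, a name registered for a missing optional dependency only fails when it is accessed, which is why entries that need torch belong under the torch guard rather than in the unconditional dictionary.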