Support Kosmos-2.5 #31711

Open: wants to merge 283 commits into base: main
Showing changes from 250 of 283 commits
477fd34
add torch required
Jun 29, 2024
71d3275
format
Jun 29, 2024
d9b23c4
format
Jun 29, 2024
3cbca06
.
Jun 29, 2024
8bde09f
format
Jun 29, 2024
2e6cad8
add procesor
Jun 30, 2024
6d797c6
init weight
Jun 30, 2024
532b1e0
.
Jun 30, 2024
234149a
.
Jun 30, 2024
05c9943
import sort
Jun 30, 2024
7d8783b
.
Jun 30, 2024
2de836d
format
Jun 30, 2024
9eece30
format
Jun 30, 2024
3a0cfaa
reformat
Jun 30, 2024
b72fe0a
reformat
Jun 30, 2024
589e9ef
reformat
Jun 30, 2024
fe51247
Merge remote-tracking branch 'upstream/main' into main
Jun 30, 2024
241b0bf
fixup
Jun 30, 2024
ba8b3dd
init test
Jun 30, 2024
9c74c61
init weight
Jul 1, 2024
363180b
modeling_test in progress
Jul 1, 2024
29d7cff
model test
Jul 1, 2024
42dd2ea
better initilization
Jul 2, 2024
9046ec5
model test
Jul 2, 2024
b64e300
restore ks2_test; update ks25 test
Jul 2, 2024
916781a
load from the config
Jul 2, 2024
578acce
processor test
Jul 2, 2024
c306325
run slow-prepare some test
Jul 2, 2024
b7d5ec9
skip sdpa test
Jul 2, 2024
f05e361
test finish
Jul 2, 2024
f19b06c
duplicate import
Jul 2, 2024
73dddc5
add mean
Jul 2, 2024
cd8ac6e
std
Jul 3, 2024
35ef655
fixup
Jul 3, 2024
9379458
remove tmp img
Jul 3, 2024
2e398f7
hi
Jul 3, 2024
40b4e98
init test
Jul 3, 2024
303e918
fix format
Jul 3, 2024
d5ad957
initialization test passed
Jul 3, 2024
e81b7fe
update readme
Jul 3, 2024
eb2b93c
Merge remote-tracking branch 'upstream/main' into main
Jul 4, 2024
6fa6221
[run-slow] kosmos2_5
ydshieh Jul 10, 2024
7710f9a
[run-slow] kosmos2_5 on A10
ydshieh Jul 10, 2024
63877c3
[run-slow] kosmos2_5
ydshieh Jul 10, 2024
630a40d
fix copyright
Jul 17, 2024
ca820d0
Update create_circleci_config.py
tic-top Jul 21, 2024
f518e50
Update create_circleci_config.py
tic-top Jul 21, 2024
5c5dd54
Revert "fix format"
Jul 21, 2024
d14ac7d
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 21, 2024
8998e48
Revert "Revert "fix format""
Jul 21, 2024
c5c4864
Revert "format"
Jul 21, 2024
c7c52a7
Fix copyright and add arvix link
Jul 21, 2024
607f65e
Update create_circleci_config.py
tic-top Jul 21, 2024
a5c48d5
fix copyright
Jul 21, 2024
9b24a63
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 21, 2024
625fc05
test for ks25 processor
Jul 22, 2024
2fba9ab
sdpa, eager, fa2 modeling test
Jul 22, 2024
5d1d095
fix format
Jul 23, 2024
7e810e2
upload doc images
ydshieh Jul 25, 2024
2fe1f94
[ydshieh] update eager/sdpa ocr expected outputs
ydshieh Jul 25, 2024
ec82032
[ydshieh] update FA2 ocr expected outputs
ydshieh Jul 25, 2024
8066ee7
[ydshieh] require_flash_attn
ydshieh Jul 25, 2024
9c1539a
[ydshieh] no need eval()
ydshieh Jul 25, 2024
4eca23c
[ydshieh] cuda_compute_capability_major_version
ydshieh Jul 25, 2024
b574b09
[ydshieh] fix FA2 deco
ydshieh Jul 25, 2024
d2c57cc
[ydshieh] [ydshieh] update eager ocr expected outputs
ydshieh Jul 25, 2024
93b291f
[ydshieh] update FA2 md expected outputs
ydshieh Jul 25, 2024
b7be077
[ydshieh] fix
ydshieh Jul 25, 2024
d577c90
remove add_special_tokens
Jul 26, 2024
2537140
without grad when generating
Jul 26, 2024
24961cd
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 26, 2024
6eb0683
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 26, 2024
ca57f47
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 26, 2024
c23a8dd
Update src/transformers/models/kosmos2_5/configuration_kosmos2_5.py
tic-top Jul 27, 2024
452b23d
add batch test
Jul 28, 2024
4308a40
fix document in ks25 config
Jul 28, 2024
2db6b88
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 28, 2024
1776f31
fix foc in ks25 processor
Jul 28, 2024
188adbf
add comment to ks25 image processor
Jul 28, 2024
3cebe13
update copyright
Jul 28, 2024
c54f9a8
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 28, 2024
54a632e
Update src/transformers/models/kosmos2_5/convert_kosmos2_5.py
tic-top Jul 28, 2024
5b3a6f7
fix doc in ks25 cfg
Jul 28, 2024
e9e56d0
simplify ks25 image procrssor
Jul 28, 2024
8b27f80
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 28, 2024
5ba6d84
simplify ks25 image processor
Jul 28, 2024
25e3260
[ydshieh] update repo name in doc
ydshieh Jul 29, 2024
fbbf151
[ydshieh] images, width, height, rows, cols = ...
ydshieh Jul 29, 2024
28b58ff
remove unnecessary comment
Jul 29, 2024
06c52ae
copied from comment added
Jul 29, 2024
99f0d99
add meaningful comment
Jul 29, 2024
2a782f0
Merge branch 'main' of https://github.com/tic-top/transformers into main
Jul 29, 2024
da45edd
ks25 image processor test added
Jul 30, 2024
0ddfe76
add more ks25 processor test
Jul 30, 2024
9dcacfc
fix style
Jul 30, 2024
0d166de
[ydshieh] 2024
ydshieh Jul 30, 2024
32df418
[ydshieh] better skip
ydshieh Jul 30, 2024
9fca9ca
[ydshieh] num_image_tokens
ydshieh Jul 30, 2024
87ccbc7
Merge remote-tracking branch 'upstream/main' into main
Jul 30, 2024
ed50bbd
refractor FA2
Jul 30, 2024
c027a98
fix error
Jul 30, 2024
64f915e
fix ans
Jul 30, 2024
26fb969
[ydshieh] test_sdpa
ydshieh Jul 30, 2024
6b82ce0
[ydshieh] better skip
ydshieh Jul 30, 2024
482e5e1
[ydshieh] better skip
ydshieh Jul 30, 2024
bd76555
fix format
Jul 30, 2024
09d8b29
make style
Jul 30, 2024
cfaa28f
test_model_input_names need torch
Jul 30, 2024
ab546cc
[ydshieh] remove
ydshieh Jul 30, 2024
6cae0b6
[ydshieh] add copied
ydshieh Jul 30, 2024
9e0c277
[ydshieh] style
ydshieh Jul 30, 2024
cc17791
[ydshieh] Kosmos2_5ForConditionalGeneration
ydshieh Jul 30, 2024
865fc2f
[ydshieh] docstring
ydshieh Jul 30, 2024
162f569
[ydshieh] copied
ydshieh Jul 30, 2024
889d9da
[ydshieh] copied
ydshieh Jul 30, 2024
40dc555
[ydshieh] copied
ydshieh Jul 30, 2024
7e5a91c
[ydshieh] copied
ydshieh Jul 30, 2024
7dfd145
[ydshieh] copied
ydshieh Jul 30, 2024
d0e4fb7
[ydshieh] copied
ydshieh Jul 30, 2024
60240f2
[ydshieh] copied
ydshieh Jul 30, 2024
2b2fe1c
[ydshieh] copied
ydshieh Aug 2, 2024
267e1d6
[ydshieh] copied
ydshieh Aug 2, 2024
2ea4d4f
[ydshieh] fix
ydshieh Aug 2, 2024
18fa43b
[ydshieh] fix
ydshieh Aug 2, 2024
2157f31
[ydshieh] fix
ydshieh Aug 2, 2024
ac1968b
fix bug
Aug 3, 2024
29d272b
[kirp] make style
Aug 3, 2024
70d85cd
[ydshieh] copied
ydshieh Aug 5, 2024
1424e07
[ydshieh] copied
ydshieh Aug 5, 2024
6f8b2e6
[ydshieh] _init_weights
ydshieh Aug 5, 2024
2cdb62a
[ydshieh] _init_weights
ydshieh Aug 5, 2024
f2b61c2
[ydshieh] _init_weights
ydshieh Aug 5, 2024
3681119
[yilinjia] fix doc in config
Aug 7, 2024
7df3000
[ydshieh] update vision model class inheritance
ydshieh Aug 12, 2024
de6d842
[ydshieh] copied statement for vision model
ydshieh Aug 12, 2024
e09217e
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
210ccb1
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
4e709e5
[ydshieh] update _init_weights
ydshieh Aug 13, 2024
e62993c
[ydshieh] copied statement for Kosmos2_5TextModel
ydshieh Aug 13, 2024
e6fe2ae
[ydshieh] Kosmos2TextForCausalLM
ydshieh Aug 13, 2024
703ccfd
[ydshieh] tiny tweak
ydshieh Aug 13, 2024
e41b875
[ydshieh] tests
ydshieh Aug 13, 2024
9822d00
[ydshieh] tests
ydshieh Aug 13, 2024
1e175ba
[ydshieh] tests
ydshieh Aug 13, 2024
e583cd4
[ydshieh] tests
ydshieh Aug 13, 2024
bb4c247
[ydshieh] stye
ydshieh Aug 13, 2024
139e834
[ydshieh] revert
ydshieh Aug 13, 2024
66af73d
remove old url
Aug 14, 2024
6659897
[ydshieh] fix
ydshieh Aug 14, 2024
720a8ab
[ydshieh] fix
ydshieh Aug 14, 2024
8ee2aa9
[ydshieh] fix
ydshieh Aug 14, 2024
9d7363f
[ydshieh] update value
ydshieh Aug 21, 2024
1bd02b2
[ydshieh] add to toctree
ydshieh Aug 21, 2024
06cbb5d
[kirp] update the example part in readme
Aug 27, 2024
f4c73b3
[kirp] remove zero bias
Sep 2, 2024
0ae49e0
[kirp] iterate over the images only once
Sep 2, 2024
ef6754c
[kirp] remove cross attention
Sep 2, 2024
9a01f8f
[kirp] reformat
Sep 2, 2024
eb116ab
[kirp] use string
Sep 2, 2024
e1ab413
[kirp] remove creating mask in the layer
Sep 2, 2024
fe418d0
[kirp] remove cache
Sep 2, 2024
cc7d28f
Revert "[kirp] remove creating mask in the layer"
Sep 2, 2024
e5ffaee
[kirp] fix typo in processor
Sep 3, 2024
b5ebf09
[kirp] remove head mask
Sep 3, 2024
dd12798
[kirp] remove test file
Sep 3, 2024
15feaea
[kirp] cache for eager
Sep 30, 2024
ab687f5
[kirp] sdpa cache
Sep 30, 2024
87ab935
[kirp] move attention_mask maker to vision encoder
Sep 30, 2024
54b1984
[kirp] cache sdpa and format
Sep 30, 2024
5e5a9e9
[kirp] fix format
Sep 30, 2024
0ed8541
[kirp] fix format
Sep 30, 2024
df9d3ad
[kirp] use update_causal_mask
Sep 30, 2024
55cb12d
[kirp] check copies
Sep 30, 2024
d99934d
[kirp] regroup the init
Sep 30, 2024
c705049
[kirp] make style
Sep 30, 2024
806ca1b
[run-slow] kosmos2_5
Sep 30, 2024
9e620b6
[run-slow] fix checkpoint bug
Sep 30, 2024
65490b4
[run-slow] fix checkpoint bug
Sep 30, 2024
d0bf57e
Merge remote-tracking branch 'upstream/main' into main
Oct 2, 2024
f5d4439
[run-slow] kosmos2_5
Oct 2, 2024
40ff015
[run-slow] kosmos2_5
Oct 2, 2024
63603d6
[kirp] remove cross_attn in textblock
tic-top Oct 10, 2024
f8497ce
[run-slow] kosmos2_5
tic-top Oct 10, 2024
eab8e69
[run-slow] kosmos2_5
tic-top Oct 10, 2024
a6154db
[run-slow] kosmos2_5
tic-top Oct 11, 2024
94cc6d2
[ydshieh] update loop
ydshieh Oct 22, 2024
968b033
[ydshieh] remove duplication in init file
ydshieh Oct 25, 2024
142604d
[ydshieh] tokenizer class
ydshieh Oct 25, 2024
4b7bc95
[ydshieh] remove copied from
ydshieh Oct 25, 2024
6f2bd73
[ydshieh] skip
ydshieh Oct 25, 2024
08e1cb0
[ydshieh] move
ydshieh Oct 29, 2024
f2dae0d
Merge branch 'main' into kosmos25
ydshieh Oct 29, 2024
fcc095f
[ydshieh] fix copie
ydshieh Oct 29, 2024
f66c6ee
[ydshieh] remove
ydshieh Oct 29, 2024
9a8479d
[ydshieh] Add to MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
ydshieh Oct 29, 2024
830671b
[ydshieh] new init
ydshieh Oct 29, 2024
1c58c8f
[ydshieh] fix
ydshieh Oct 29, 2024
0153a08
[ydshieh] remove
ydshieh Oct 31, 2024
ac94b57
[ydshieh] add ProcessorTesterMixin
ydshieh Oct 31, 2024
52788cc
[ydshieh] add GenerationTesterMixin
ydshieh Oct 31, 2024
0b9e5ad
Merge branch 'main' into kosmos25
ydshieh Dec 6, 2024
925e14a
Merge branch 'main' into main
ydshieh Dec 6, 2024
6ed504d
fix
ydshieh Dec 6, 2024
9a841ad
fix
ydshieh Dec 6, 2024
dcced48
fix
ydshieh Dec 13, 2024
91fa383
fix
ydshieh Dec 13, 2024
e3802f4
fix
ydshieh Dec 13, 2024
85da449
fix
ydshieh Dec 13, 2024
b1db4f2
fix
ydshieh Dec 13, 2024
f8c98d6
it's Friday night, let cross finger
ydshieh Dec 13, 2024
fbb3e59
it's Friday night, let cross finger
ydshieh Dec 13, 2024
ce3a6b0
it's Friday night, let cross finger
ydshieh Dec 13, 2024
90c4fcc
it's Friday night, let cross finger
ydshieh Dec 13, 2024
00e324d
it's Friday night, let cross finger
ydshieh Dec 13, 2024
9c8aff7
it's Friday night, let cross finger
ydshieh Dec 13, 2024
2c47915
it's Friday night, let cross finger
ydshieh Dec 13, 2024
395a636
it's Monday let's go
ydshieh Dec 16, 2024
8a058d9
it's Monday let's go
ydshieh Dec 16, 2024
c639eeb
it's Monday let's go
ydshieh Dec 16, 2024
b688c4f
Merge branch 'ca03842c' into kosmos25
ydshieh Dec 16, 2024
d1c52f4
temp
ydshieh Dec 17, 2024
3a58742
temp
ydshieh Dec 17, 2024
d5b8349
temp
ydshieh Dec 17, 2024
9ddc86b
temp
ydshieh Dec 17, 2024
39dc6ef
temp
ydshieh Dec 17, 2024
b2c3db2
temp
ydshieh Dec 17, 2024
c356a36
temp
ydshieh Dec 17, 2024
55944fc
temp
ydshieh Dec 17, 2024
83d600e
temp
ydshieh Dec 17, 2024
2d4cbba
temp
ydshieh Dec 17, 2024
6b2f7d7
temp
ydshieh Dec 17, 2024
5f731a9
temp
ydshieh Dec 17, 2024
0ec499a
temp
ydshieh Dec 17, 2024
7f0d26c
temp
ydshieh Dec 17, 2024
db865db
temp
ydshieh Dec 17, 2024
bf14c4b
temp
ydshieh Dec 17, 2024
9b29aac
temp
ydshieh Dec 17, 2024
ce222a6
temp
ydshieh Dec 17, 2024
876cb6b
temp
ydshieh Dec 17, 2024
a3638ea
temp
ydshieh Dec 17, 2024
30f927a
temp
ydshieh Dec 17, 2024
a65a9b1
temp
ydshieh Dec 17, 2024
7c99fd0
temp
ydshieh Dec 17, 2024
ec9ea0c
fix
ydshieh Dec 17, 2024
8fc9699
fix
ydshieh Dec 17, 2024
22cb70d
fix
ydshieh Dec 18, 2024
001fd70
fix
ydshieh Dec 18, 2024
d1116f5
fix
ydshieh Dec 18, 2024
6f09a51
fix
ydshieh Dec 18, 2024
7d0b827
Merge branch 'main' into main
ydshieh Dec 18, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -860,6 +860,8 @@
title: InstructBlipVideo
- local: model_doc/kosmos-2
title: KOSMOS-2
- local: model_doc/kosmos-2.5
title: KOSMOS-2.5
- local: model_doc/layoutlm
title: LayoutLM
- local: model_doc/layoutlmv2
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -184,6 +184,7 @@ Flax), PyTorch, and/or TensorFlow.
| [JetMoe](model_doc/jetmoe) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
| [KOSMOS-2](model_doc/kosmos-2) | ✅ | ❌ | ❌ |
| [KOSMOS-2.5](model_doc/kosmos-2.5) | ✅ | ❌ | ❌ |
| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ |
| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ |
63 changes: 63 additions & 0 deletions docs/source/en/model_doc/kosmos-2.5.md
@@ -0,0 +1,63 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# KOSMOS-2.5

## Overview

Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images, introduced in the [original paper](https://arxiv.org/abs/2309.11419). A single decoder-only auto-regressive Transformer handles two transcription tasks, chosen through task-specific prompts: (1) generating spatially-aware text blocks, where each block of text is paired with its coordinates within the image, and (2) producing structured text output that captures styles and structures in markdown format.

The abstract from the paper is the following:

*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png"
alt="drawing" width="600"/>

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png"
alt="drawing" width="600"/>

<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small>

## Example
**Markdown Task:** For usage instructions, please refer to [md.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/md.py).

**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/ocr.py).
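
For the OCR task, the generated text interleaves bounding-box tokens with the transcribed text, so a small post-processing step is usually needed. The sketch below is a minimal illustration and not the code from `ocr.py`: it assumes the output follows the pattern `<bbox><x_A><y_B><x_C><y_D></bbox>text`, and the names `parse_ocr_output`, `TextBlock`, `scale_width` and `scale_height` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class TextBlock(NamedTuple):
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in original-image pixels
    text: str


# Assumed output format: "<bbox><x_A><y_B><x_C><y_D></bbox>text", one per block.
BBOX_PATTERN = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_ocr_output(
    generated: str, scale_width: float = 1.0, scale_height: float = 1.0
) -> List[TextBlock]:
    """Split generated OCR text into (box, text) pairs, rescaling the boxes."""
    blocks = []
    matches = list(BBOX_PATTERN.finditer(generated))
    for i, match in enumerate(matches):
        # The text of a block runs until the next bbox tag (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        text = generated[match.end():end].strip()
        x0, y0, x1, y1 = (int(group) for group in match.groups())
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes
        box = (
            int(x0 * scale_width),
            int(y0 * scale_height),
            int(x1 * scale_width),
            int(y1 * scale_height),
        )
        blocks.append(TextBlock(box, text))
    return blocks


sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>Total: 12.50\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>Thank you"
)
blocks = parse_ocr_output(sample, scale_width=2.0, scale_height=2.0)
for block in blocks:
    print(block.box, block.text)
```

The scale factors stand for the ratio between the original image size and the size the processor resized it to; how those sizes are obtained depends on the processor output and is left out of this sketch.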



## Kosmos2_5Config

[[autodoc]] Kosmos2_5Config

## Kosmos2_5ImageProcessor

[[autodoc]] Kosmos2_5ImageProcessor

## Kosmos2_5Processor

[[autodoc]] Kosmos2_5Processor
- __call__

## Kosmos2_5Model

[[autodoc]] Kosmos2_5Model
- forward

## Kosmos2_5ForConditionalGeneration

[[autodoc]] Kosmos2_5ForConditionalGeneration
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -61,6 +61,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos-2.5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
@@ -251,6 +252,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos-2.5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
22 changes: 22 additions & 0 deletions src/transformers/__init__.py
@@ -512,6 +512,10 @@
"Kosmos2Config",
"Kosmos2Processor",
],
"models.kosmos2_5": [
"Kosmos2_5Config",
"Kosmos2_5Processor",
],
"models.layoutlm": [
"LayoutLMConfig",
"LayoutLMTokenizer",
@@ -1216,6 +1220,7 @@
_import_structure["models.idefics3"].extend(["Idefics3ImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.kosmos2_5"].extend(["Kosmos2_5ImageProcessor"])
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
_import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
_import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
@@ -2557,6 +2562,13 @@
"Kosmos2PreTrainedModel",
]
)
_import_structure["models.kosmos2_5"].extend(
[
"Kosmos2_5ForConditionalGeneration",
"Kosmos2_5Model",
"Kosmos2_5PreTrainedModel",
]
)
_import_structure["models.layoutlm"].extend(
[
"LayoutLMForMaskedLM",
@@ -5438,6 +5450,10 @@
Kosmos2Config,
Kosmos2Processor,
)
from .models.kosmos2_5 import (
Kosmos2_5Config,
Kosmos2_5Processor,
)
from .models.layoutlm import (
LayoutLMConfig,
LayoutLMTokenizer,
@@ -6177,6 +6193,7 @@
from .models.idefics3 import Idefics3ImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.kosmos2_5 import Kosmos2_5ImageProcessor
from .models.layoutlmv2 import (
LayoutLMv2FeatureExtractor,
LayoutLMv2ImageProcessor,
@@ -7301,6 +7318,11 @@
Kosmos2Model,
Kosmos2PreTrainedModel,
)
from .models.kosmos2_5 import (
Kosmos2_5ForConditionalGeneration,
Kosmos2_5Model,
Kosmos2_5PreTrainedModel,
)
from .models.layoutlm import (
LayoutLMForMaskedLM,
LayoutLMForQuestionAnswering,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -127,6 +127,7 @@
jamba,
jetmoe,
kosmos2,
kosmos2_5,
layoutlm,
layoutlmv2,
layoutlmv3,
3 changes: 3 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -148,6 +148,7 @@
("jetmoe", "JetMoeConfig"),
("jukebox", "JukeboxConfig"),
("kosmos-2", "Kosmos2Config"),
("kosmos-2.5", "Kosmos2_5Config"),
("layoutlm", "LayoutLMConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("layoutlmv3", "LayoutLMv3Config"),
@@ -459,6 +460,7 @@
("jetmoe", "JetMoe"),
("jukebox", "Jukebox"),
("kosmos-2", "KOSMOS-2"),
("kosmos-2.5", "KOSMOS-2.5"),
("layoutlm", "LayoutLM"),
("layoutlmv2", "LayoutLMv2"),
("layoutlmv3", "LayoutLMv3"),
@@ -692,6 +694,7 @@
("data2vec-vision", "data2vec"),
("donut-swin", "donut"),
("kosmos-2", "kosmos2"),
("kosmos-2.5", "kosmos2_5"),
("maskformer-swin", "maskformer"),
("xclip", "x_clip"),
("clip_vision_model", "clip"),
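
The third hunk above adds `("kosmos-2.5", "kosmos2_5")` to a table of model types whose module name cannot be derived by a simple rule. As a rough sketch of why that entry is needed, assuming the default rule is "replace hyphens with underscores" (the table name and function below are illustrative, trimmed-down stand-ins, not a copy of the library code):

```python
# Illustrative subset of the special-case table touched by the diff.
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "kosmos-2": "kosmos2",
    "kosmos-2.5": "kosmos2_5",
    "xclip": "x_clip",
}


def model_type_to_module_name(model_type: str) -> str:
    """Map a model type string to the name of its Python package."""
    if model_type in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[model_type]
    # Assumed default rule: hyphens become underscores.
    return model_type.replace("-", "_")


# Without the special case, the default rule would yield "kosmos_2.5",
# which is not a valid Python identifier and so cannot be a package name.
naive_name = "kosmos-2.5".replace("-", "_")
print(naive_name, naive_name.isidentifier())
print(model_type_to_module_name("kosmos-2.5"))
```

The same reasoning explains why the package directory is `kosmos2_5` while the user-facing model type stays `kosmos-2.5`.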
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -98,6 +98,7 @@
("instructblip", ("BlipImageProcessor",)),
("instructblipvideo", ("InstructBlipVideoImageProcessor",)),
("kosmos-2", ("CLIPImageProcessor",)),
("kosmos-2.5", ("Kosmos2_5ImageProcessor",)),
("layoutlmv2", ("LayoutLMv2ImageProcessor",)),
("layoutlmv3", ("LayoutLMv3ImageProcessor",)),
("levit", ("LevitImageProcessor",)),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -143,6 +143,7 @@
("jetmoe", "JetMoeModel"),
("jukebox", "JukeboxModel"),
("kosmos-2", "Kosmos2Model"),
("kosmos-2.5", "Kosmos2_5Model"),
("layoutlm", "LayoutLMModel"),
("layoutlmv2", "LayoutLMv2Model"),
("layoutlmv3", "LayoutLMv3Model"),
@@ -761,6 +762,7 @@
("instructblip", "InstructBlipForConditionalGeneration"),
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
@@ -788,6 +790,7 @@
("idefics3", "Idefics3ForConditionalGeneration"),
("instructblip", "InstructBlipForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -70,6 +70,7 @@
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("kosmos-2", "Kosmos2Processor"),
("kosmos-2.5", "Kosmos2_5Processor"),
("layoutlmv2", "LayoutLMv2Processor"),
("layoutlmv3", "LayoutLMv3Processor"),
("llava", "LlavaProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -247,6 +247,7 @@
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
("kosmos-2.5", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
1 change: 1 addition & 0 deletions src/transformers/models/kosmos2/modeling_kosmos2.py
@@ -2073,6 +2073,7 @@ def forward(
            vision_model_output=vision_model_output,
        )

    @torch.no_grad()
    def generate(
        self,
        pixel_values: Optional[torch.Tensor] = None,
30 changes: 30 additions & 0 deletions src/transformers/models/kosmos2_5/__init__.py
@@ -0,0 +1,30 @@
# coding=utf-8
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_kosmos2_5 import *
    from .image_processing_kosmos2_5 import *
    from .modeling_kosmos2_5 import *
    from .processing_kosmos2_5 import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
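
The `else` branch replaces the package in `sys.modules` with a lazy module, so submodules are imported only when one of their names is first accessed. The following is a simplified, standard-library-only sketch of that idea; `LazyModule` and the `demo_lazy` name are hypothetical and do not reproduce `_LazyModule` or `define_import_structure`, whose internals are not shown in this diff.

```python
import importlib
import sys
import types


class LazyModule(types.ModuleType):
    """Resolve public names to their defining module only on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.__all__ = list(self._name_to_module)

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        try:
            module_name = self._name_to_module[attr]
        except KeyError:
            raise AttributeError(
                f"module {self.__name__!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Stand-in import structure using stdlib modules instead of model files.
lazy = LazyModule("demo_lazy", {"json": ["dumps"], "fractions": ["Fraction"]})
sys.modules["demo_lazy"] = lazy
```

With the module registered, `from demo_lazy import dumps` triggers the import of `json` at that point rather than when `demo_lazy` itself is created, which is the behaviour that keeps a top-level package import cheap.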