# Add GOT-OCR 2.0 to Transformers #34721

Status: Open. This PR wants to merge 35 commits into base `main`.

## Commits (35)

- `16c3388` init modular got_ocr2 (yonigozlan, Oct 23, 2024)
- `9f93654` Get correct got_ocr architecture (yonigozlan, Oct 30, 2024)
- `c0f4bfe` add processing (yonigozlan, Nov 5, 2024)
- `a3c8f67` run modular with processing (yonigozlan, Nov 5, 2024)
- `c55bfbc` add working inference (yonigozlan, Nov 10, 2024)
- `e2f9cf5` apply modular (yonigozlan, Nov 13, 2024)
- `5b628d1` Refactor and fix style (yonigozlan, Nov 14, 2024)
- `84c76a6` Refactor, cleanup, fix style (yonigozlan, Nov 14, 2024)
- `9828b29` fix init order (yonigozlan, Nov 14, 2024)
- `adc6b9a` Fix docs (yonigozlan, Nov 14, 2024)
- `c7fa74b` add base modeling tests (yonigozlan, Nov 14, 2024)
- `4bcfc04` fix style and consistency (yonigozlan, Nov 14, 2024)
- `ec4a8f9` rename doc file (yonigozlan, Nov 14, 2024)
- `b57b336` fix repo consistency (yonigozlan, Nov 14, 2024)
- `9151aea` fix inference with box (yonigozlan, Nov 18, 2024)
- `bf0ea21` add image processing and support for crop_to_multi_page (yonigozlan, Nov 20, 2024)
- `d2caee3` Fix batch inference (yonigozlan, Nov 22, 2024)
- `70d6f30` add tests (yonigozlan, Nov 22, 2024)
- `2000570` fixup (yonigozlan, Nov 22, 2024)
- `7817fe7` fix slow test (yonigozlan, Nov 22, 2024)
- `330478b` fix docstrings (yonigozlan, Nov 25, 2024)
- `4eb7b94` Add model doc (yonigozlan, Nov 25, 2024)
- `85a00c5` update to new init (yonigozlan, Nov 25, 2024)
- `bf21173` fix input autocast pixel_values dtype (yonigozlan, Nov 25, 2024)
- `568be30` update doc (yonigozlan, Nov 25, 2024)
- `9e49b2d` move doc to multimodal (yonigozlan, Nov 25, 2024)
- `d67af61` Reformat crop_image_to_patches and add docstrings (yonigozlan, Nov 27, 2024)
- `8add3a1` Fix example in forward docstring (yonigozlan, Nov 27, 2024)
- `10e5644` Address Pablo review (yonigozlan, Nov 29, 2024)
- `5c3a4eb` [run slow] got_ocr2 (yonigozlan, Nov 29, 2024)
- `0d22212` remove defaults defined twice (yonigozlan, Dec 4, 2024)
- `df94db3` apply modular (yonigozlan, Dec 5, 2024)
- `879fe3e` add torch_device to integration tests (yonigozlan, Dec 5, 2024)
- `79e5734` Merge branch 'main' into add-got-ocr2 (yonigozlan, Dec 16, 2024)
- `8fb0ed7` update modular (yonigozlan, Dec 18, 2024)
## Files changed
### docs/source/en/_toctree.yml (2 additions, 0 deletions)

@@ -844,6 +844,8 @@
   title: FLAVA
 - local: model_doc/git
   title: GIT
+- local: model_doc/got_ocr2
+  title: GOT-OCR2
 - local: model_doc/grounding-dino
   title: Grounding DINO
 - local: model_doc/groupvit
### docs/source/en/index.md (1 addition, 0 deletions)

@@ -155,6 +155,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [GIT](model_doc/git) | ✅ | ❌ | ❌ |
 | [GLM](model_doc/glm) | ✅ | ❌ | ❌ |
 | [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ |
+| [GOT-OCR2](model_doc/got_ocr2) | ✅ | ❌ | ❌ |
 | [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ |
 | [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ |
 | [GPT NeoX Japanese](model_doc/gpt_neox_japanese) | ✅ | ❌ | ❌ |
### docs/source/en/model_doc/got_ocr2.md (new file, 268 additions)

@@ -0,0 +1,268 @@
<!--Copyright 2024 StepFun and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# GOT-OCR2

## Overview

The GOT-OCR2 model was proposed in [General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model](https://arxiv.org/abs/2409.01704) by Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang.

The abstract from the paper is the following:

*Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multipage OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/got_ocr_overview.png"
alt="drawing" width="600"/>

<small> GOT-OCR2 training stages. Taken from the <a href="https://arxiv.org/abs/2409.01704">original paper</a>. </small>


Tips:

GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas, and sheet music. While this implementation only outputs plain text, the outputs can be further processed to render the desired format with packages like `pdftex`, `mathpix`, `matplotlib`, `tikz`, `verovio`, or `pyecharts` (a short rendering sketch follows below).
The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region's bounding box.
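
As a quick illustration of the post-processing idea (a hedged sketch, not part of the model's API; the `latex_output` string below is a made-up stand-in for text generated with `format=True`), a LaTeX-style formula output can be rendered with matplotlib's built-in mathtext:

```python
import matplotlib.pyplot as plt

# Hypothetical formatted output from the model (format=True); a real run
# would produce this string via processor.decode(...).
latex_output = r"$\frac{a}{b} + \sqrt{x^{2} + y^{2}}$"

# Render the formula onto an empty figure and save it as an image.
fig = plt.figure(figsize=(4, 1))
fig.text(0.5, 0.5, latex_output, ha="center", va="center", fontsize=18)
fig.savefig("rendered_formula.png", bbox_inches="tight")
```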

This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/Ucas-HaoranWei/GOT-OCR2.0).

## Usage example

### Plain text inference

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf").to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg"
>>> inputs = processor(image, return_tensors="pt").to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
"R&D QUALITY IMPROVEMENT\nSUGGESTION/SOLUTION FORM\nName/Phone Ext. : (...)"
```

### Batched plain text inference

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image1 = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/multi_box.png"
>>> image2 = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg"

>>> inputs = processor([image1, image2], return_tensors="pt")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4,
... )

>>> processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
["Reducing the number", "R&D QUALITY"]
```

### Formatted text inference

GOT-OCR2 can also generate formatted text, such as markdown or LaTeX. Here is an example of how to generate formatted text:

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf").to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/latex.png"
>>> inputs = processor(image, return_tensors="pt", format=True).to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
"\\author{\nHanwen Jiang* \\(\\quad\\) Arjun Karpur \\({ }^{\\dagger} \\quad\\) Bingyi Cao \\({ }^{\\dagger} \\quad\\) (...)"
```

### Inference on multiple pages

Although processing pages one at a time in a for-loop is reasonable in most cases, text whose formatting spans several pages must be processed in a single pass. GOT introduces a multi-page OCR feature that removes the need for such a loop: the model processes multiple pages at once, with the output being one continuous text.
Here is an example of how to process multiple pages at once:


```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf").to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image1 = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/page1.png"
>>> image2 = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/page2.png"
>>> inputs = processor([image1, image2], return_tensors="pt", format=True).to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
"\\title{\nGeneral OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model\n}\n\\author{\nHaoran Wei (...)"
```
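
In practice, multi-page inputs often start life as a PDF. Here is a minimal sketch of getting from a PDF to the list of page images used above, assuming the third-party `pdf2image` package and a hypothetical file `paper.pdf` (this step is not part of the Transformers API):

```python
from pdf2image import convert_from_path

# Convert each PDF page to a PIL image; these can be passed to the
# processor exactly like the URL-based examples above.
pages = convert_from_path("paper.pdf", dpi=200)
inputs = processor(pages, return_tensors="pt", format=True).to("cuda")
```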

### Inference on cropped patches

GOT supports a 1024×1024 input resolution, which is sufficient for most OCR tasks, such as scene OCR or processing A4-sized PDF pages. However, certain scenarios, like horizontally stitched two-page PDFs commonly found in academic papers or images with unusual aspect ratios, can lead to accuracy issues when processed as a single image. To address this, GOT can dynamically crop an image into patches, process them all at once, and merge the results for better accuracy with such inputs.
Here is an example of how to process cropped patches:

```python
>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf", torch_dtype=torch.bfloat16).to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/one_column.png"
>>> inputs = processor(image, return_tensors="pt", format=True, crop_to_patches=True, max_patches=3).to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
"on developing architectural improvements to make learnable matching methods generalize.\nMotivated by the above observations, (...)"
```

### Inference on a specific region

GOT supports interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region's bounding box. Here is an example of how to process a specific region:

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf").to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/multi_box.png"
>>> inputs = processor(image, return_tensors="pt", color="green") # or box=[x1, y1, x2, y2] for coordinates (image pixels)
>>> inputs = inputs.to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
"You should keep in mind what features from the module should be used, especially \nwhen you’re planning to sell a template."
```
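
Selecting the region by pixel coordinates instead of color works the same way; only the processor call changes. The coordinates below are illustrative, not tied to the example image:

```python
# Bounding box given as [x1, y1, x2, y2] in image pixels.
inputs = processor(image, return_tensors="pt", box=[50, 60, 400, 160]).to("cuda")
```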

### Inference on general OCR data example: sheet music

Although this implementation only outputs plain text, the outputs can be further processed to render the desired format with packages like `pdftex`, `mathpix`, `matplotlib`, `tikz`, `verovio`, or `pyecharts`.
Here is an example of how to process sheet music:

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import verovio

>>> model = AutoModelForImageTextToText.from_pretrained("yonigozlan/GOT-OCR-2.0-hf").to("cuda")
>>> processor = AutoProcessor.from_pretrained("yonigozlan/GOT-OCR-2.0-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/sheet_music.png"
>>> inputs = processor(image, return_tensors="pt", format=True).to("cuda")

>>> generate_ids = model.generate(
...     **inputs,
...     do_sample=False,
...     tokenizer=processor.tokenizer,
...     stop_strings="<|im_end|>",
...     max_new_tokens=4096,
... )

>>> outputs = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
>>> tk = verovio.toolkit()
>>> tk.loadData(outputs)
>>> tk.setOptions(
...     {
...         "pageWidth": 2100,
...         "pageHeight": 800,
...         "footer": "none",
...         "barLineWidth": 0.5,
...         "beamMaxSlope": 15,
...         "staffLineWidth": 0.2,
...         "spacingStaff": 6,
...     }
... )
>>> tk.getPageCount()
>>> svg = tk.renderToSVG()
>>> svg = svg.replace('overflow="inherit"', 'overflow="visible"')
>>> with open("output.svg", "w") as f:
...     f.write(svg)
```
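
If a raster image is more convenient than the SVG, one option is converting it with the third-party `cairosvg` package (a hedged sketch; any SVG-to-PNG converter would do):

```python
import cairosvg

# Rasterize the SVG written in the previous step.
cairosvg.svg2png(url="output.svg", write_to="output.png")
```
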
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/sheet_music.svg"
alt="drawing" width="600"/>

## GotOcr2Config

[[autodoc]] GotOcr2Config

## GotOcr2VisionConfig

[[autodoc]] GotOcr2VisionConfig

## GotOcr2ImageProcessor

[[autodoc]] GotOcr2ImageProcessor

## GotOcr2Processor

[[autodoc]] GotOcr2Processor

## GotOcr2Model

[[autodoc]] GotOcr2Model
- forward

## GotOcr2ForConditionalGeneration

[[autodoc]] GotOcr2ForConditionalGeneration
- forward

### docs/source/en/model_doc/qwen2_vl.md (9 additions, 9 deletions; the changed lines differ only in trailing whitespace)

@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

## Overview

The [Qwen2-VL](https://qwenlm.github.io/blog/qwen2-vl/) model is a major update to [Qwen-VL](https://arxiv.org/pdf/2308.12966) from the Qwen team at Alibaba Research.

The abstract from the blog is the following:

@@ -231,7 +231,7 @@ In case of limited GPU RAM, one can reduce the resolution as follows:

```python
min_pixels = 256*28*28
max_pixels = 1024*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
This ensures each image gets encoded using between 256 and 1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28).
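
The arithmetic is easy to sanity-check (a quick illustrative snippet, not part of the diff): each 28x28 pixel block corresponds to one visual token, so the pixel budgets above translate directly into token counts.

```python
min_pixels = 256 * 28 * 28
max_pixels = 1024 * 28 * 28
print(min_pixels // (28 * 28), max_pixels // (28 * 28))  # 256 1024
```
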
@@ -245,7 +245,7 @@ conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Hello, how are you?"}
        ]
    },
@@ -256,10 +256,10 @@
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."}
        ]
    },
@@ -300,8 +300,8 @@ To load and run a model using Flash Attention-2, simply add `attn_implementation`

```python
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
### docs/source/en/perf_infer_gpu_one.md (2 additions, 0 deletions)

@@ -49,6 +49,7 @@ FlashAttention-2 is currently supported for the following architectures:
 * [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
 * [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
 * [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
+* [GotOcr2](https://huggingface.co/docs/transformers/model_doc/got_ocr2#transformers.GotOcr2Model)
 * [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
 * [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel)
 * [GPTNeo](https://huggingface.co/docs/transformers/model_doc/gpt_neo#transformers.GPTNeoModel)

@@ -239,6 +240,7 @@ For now, Transformers supports SDPA inference and training for the following arc
 * [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
 * [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
 * [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
+* [GotOcr2](https://huggingface.co/docs/transformers/model_doc/got_ocr2#transformers.GotOcr2Model)
 * [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
 * [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
 * [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel)