Support Kosmos-2.5 #31711

tic-top · 2024-06-29T15:48:17Z

What does this PR do?

#30877 Implementation of Kosmos-2.5 in transformers.
https://huggingface.co/kirp/kosmos2_5/blob/main/README.md

Usage

from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, AutoConfig
import re

repo = "kirp/kosmos2_5"
device = "cuda:0"
config = AutoConfig.from_pretrained(repo)

NAME = {
    "f" : "flash_attention_2",
    "s" : "sdpa",
    "e" : "eager",
}

# all sdpa fp16
dtype = torch.float16
config._attn_implementation = NAME["s"]
config.vision_config._attn_implementation = NAME["s"]
config.text_config._attn_implementation = NAME["s"]

# # all sdpa fp16
# dtype = torch.float16
# config._attn_implementation = NAME["s"]
# config.text_config._attn_implementation = NAME["s"]
# config.vision_config._attn_implementation = NAME["s"]

# # all eager bf16
# dtype = torch.bfloat16
# config._attn_implementation = NAME["e"]
# config.text_config._attn_implementation = NAME["e"]
# config.vision_config._attn_implementation = NAME["e"]


model = AutoModelForVision2Seq.from_pretrained(repo, device_map=device, torch_dtype=dtype, config=config)
processor = AutoProcessor.from_pretrained(repo)

url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>"  # use "<md>" instead for markdown output

inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

def postprocess(y, scale_height, scale_width):
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        return y
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        if not (x0 >= x1 or y0 >= y1):
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info

output_text = postprocess(generated_text[0], scale_height, scale_width)
print(output_text)
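
As a quick offline check of the post-processing step above, the sketch below runs the same bbox-parsing logic on a hand-written string, with no model needed. The function name `parse_ocr_output`, the sample text, the coordinates, and the scale factors are all made up for illustration; they are not real model output.

```python
import re

BBOX_PATTERN = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"


def parse_ocr_output(text, scale_height=1.0, scale_width=1.0):
    """Return (x0, y0, x1, y1, line) tuples scaled to original-image pixels."""
    raw_boxes = re.findall(BBOX_PATTERN, text)
    # re.split puts the text before the first box at index 0; drop it.
    lines = re.split(BBOX_PATTERN, text)[1:]
    results = []
    for raw_box, line in zip(raw_boxes, lines):
        x0, y0, x1, y1 = (int(n) for n in re.findall(r"\d+", raw_box))
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes, as postprocess does
        results.append(
            (
                int(x0 * scale_width),
                int(y0 * scale_height),
                int(x1 * scale_width),
                int(y1 * scale_height),
                line,
            )
        )
    return results


# Hypothetical sample string, not real model output.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>TOTAL 12.50\n"
    "<bbox><x_50><y_60><x_50><y_80></bbox>degenerate\n"
)
parsed = parse_ocr_output(sample, scale_height=2.0, scale_width=1.5)
print(parsed)
```

Returning tuples instead of one concatenated string is a choice made for this sketch only; it keeps each box and its text line together so they are easy to inspect.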

amyeroberts · 2024-07-01T10:10:04Z

cc @ydshieh

zucchini-nlp

Thanks a lot! ❤️ Just a few more things that can be removed now, but it looks nice overall to me. Looking forward to the changes to accommodate generate() from the mixin, and then it should be okay.

src/transformers/models/kosmos2_5/configuration_kosmos2_5.py

zucchini-nlp · 2024-12-16T16:48:40Z

src/transformers/models/kosmos2_5/modeling_kosmos2_5.py

+        return_legacy_cache = False
+        if (
+            use_cache and not isinstance(past_key_values, Cache) and not self.training
+        ):  # kept for BC (non `Cache` `past_key_values` inputs)
+            return_legacy_cache = True
+            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+            logger.warning_once(
+                "We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. "
+                "Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)"
+            )


Hmm, I've been removing deprecations for new models. OK, if a core maintainer agrees we can leave it, but maybe change the deprecation version to around v4.50 or something?
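
For readers unfamiliar with the pattern in the diff above, here is a framework-free sketch of the backward-compatibility idea: accept a legacy tuple, convert it to a cache object, warn, and hand back the same kind of object the caller passed in. `ToyCache` and `forward` are hypothetical stand-ins, not the transformers `Cache` API, and the conversion back to a tuple on return is an assumption about how the `return_legacy_cache` flag is used, since the excerpt does not show that part.

```python
import logging

logger = logging.getLogger(__name__)


class ToyCache:
    """Hypothetical stand-in for a cache object (not the transformers API)."""

    def __init__(self, layers=None):
        self.layers = list(layers or [])

    @classmethod
    def from_legacy(cls, legacy):
        # A legacy cache is a tuple of per-layer (key, value) pairs, or None.
        return cls(legacy or ())

    def to_legacy(self):
        return tuple(self.layers)


def forward(past_key_values=None, use_cache=True, training=False):
    """Accept a ToyCache or a legacy tuple and return the same kind."""
    return_legacy = False
    if use_cache and not isinstance(past_key_values, ToyCache) and not training:
        return_legacy = True
        past_key_values = ToyCache.from_legacy(past_key_values)
        logger.warning("Passing `past_key_values` as a tuple is deprecated.")
    # ... a real model would read from and append to the cache here ...
    if return_legacy:
        # Assumption: convert back so legacy callers get a tuple again.
        return past_key_values.to_legacy()
    return past_key_values


legacy_result = forward(past_key_values=((1, 2),))
cache_result = forward(past_key_values=ToyCache([(1, 2)]))
```

Dropping the branch entirely, as suggested for new models, would mean `forward` only ever sees cache objects and needs neither the flag nor the warning.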

zucchini-nlp · 2024-12-17T14:13:51Z

src/transformers/models/kosmos2_5/processing_kosmos2_5.py

+        self.boi = tokenizer.convert_tokens_to_ids("<image>")
+        self.eoi = tokenizer.convert_tokens_to_ids("</image>")
+        self.pad = tokenizer.convert_tokens_to_ids("<pad>")
+        self.bos = tokenizer.convert_tokens_to_ids("<s>")
+        self.eos = tokenizer.convert_tokens_to_ids("</s>")
+


nit: these can be saved as special tokens in tokenizer so you can do tokenizer.boi_token_id instead of hardcoding the string of each token. Not super necessary, simply FYI :)

For that, when converting you need to add:

# bos/eos/pad tokens are already in the tokenizer, so no need to add them
extra_special_tokens = {"boi_token": "<image>", "eoi_token": "</image>"}
tokenizer = AutoTokenizer.from_pretrained(model_id, extra_special_tokens=extra_special_tokens)
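
To make the difference concrete, the toy sketch below contrasts looking up a token id by its hardcoded string with looking it up through a named attribute such as `boi_token_id`. `ToyTokenizer` and its vocabulary ids are invented for illustration; it only mimics the idea of the `<name>_token_id` lookup and is not how transformers implements it.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics named special-token lookups."""

    def __init__(self, vocab, extra_special_tokens=None):
        self.vocab = dict(vocab)
        self.extra_special_tokens = dict(extra_special_tokens or {})

    def convert_tokens_to_ids(self, token):
        return self.vocab[token]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # Resolve e.g. `boi_token_id` -> "boi_token" -> "<image>" -> id.
        if name.endswith("_token_id"):
            key = name[: -len("_id")]
            extra = self.__dict__.get("extra_special_tokens", {})
            vocab = self.__dict__.get("vocab", {})
            if key in extra and extra[key] in vocab:
                return vocab[extra[key]]
        raise AttributeError(name)


# Made-up vocabulary ids, for illustration only.
tok = ToyTokenizer(
    vocab={"<image>": 5, "</image>": 6},
    extra_special_tokens={"boi_token": "<image>", "eoi_token": "</image>"},
)

hardcoded = tok.convert_tokens_to_ids("<image>")  # string repeated at each call site
named = tok.boi_token_id  # string declared once, at conversion time
```

With the named form, the literal token string lives in one place, so the processor code does not need to know it.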

ydshieh · 2024-12-17T17:35:12Z

Hi @zucchini-nlp Great reviews! Could you check the changes shown below? I think all comments are addressed.

(and also the removed generate)

https://github.com/tic-top/transformers/compare/0ec499a841b35ce9c77e072362a09a20a287a2f5..8fc9699655c5b80b0c5cbb1fadfc4837daf1ad90

zucchini-nlp

Thanks a lot for iterating and removing the overwritten generate()!

ydshieh · 2024-12-18T12:57:42Z

run slow

github-actions · 2024-12-18T12:58:35Z

This comment contains run-slow, running the specified jobs: ['models/kosmos2_5'] ...

[email protected] added 21 commits June 29, 2024 16:21

add torch required

477fd34

format

71d3275

format

d9b23c4

.

3cbca06

format

8bde09f

add procesor

2e6cad8

init weight

6d797c6

.

532b1e0

.

234149a

import sort

05c9943

.

7d8783b

format

2de836d

format

9eece30

reformat

3a0cfaa

reformat

b72fe0a

reformat

589e9ef

Merge remote-tracking branch 'upstream/main' into main

fe51247

fixup

241b0bf

init test

ba8b3dd

init weight

9c74c61

modeling_test in progress

363180b

ydshieh self-assigned this Jul 1, 2024

ydshieh added the run-slow label Jul 1, 2024

[email protected] added 6 commits July 1, 2024 17:39

model test

29d7cff

better initilization

42dd2ea

model test

9046ec5

restore ks2_test; update ks25 test

b64e300

load from the config

916781a

processor test

578acce

ydshieh added 2 commits December 17, 2024 14:51

temp

b2c3db2

temp

c356a36

zucchini-nlp reviewed Dec 17, 2024

ydshieh added 17 commits December 17, 2024 15:19

temp

55944fc

temp

83d600e

temp

2d4cbba

temp

6b2f7d7

temp

5f731a9

temp

0ec499a

temp

7f0d26c

temp

db865db

temp

bf14c4b

temp

9b29aac

temp

ce222a6

temp

876cb6b

temp

a3638ea

temp

30f927a

temp

a65a9b1

temp

7c99fd0

fix

ec9ea0c

ydshieh added 5 commits December 17, 2024 18:37

fix

8fc9699

fix

22cb70d

fix

001fd70

fix

d1116f5

fix

6f09a51

zucchini-nlp approved these changes Dec 18, 2024

Merge branch 'main' into main

7d0b827
