
BugFix: Fixed input to llama_vision processor #431

Merged: 1 commit merged into EvolvingLMMs-Lab:main on Nov 29, 2024

Conversation

@Danielohayon (Contributor) commented Nov 28, 2024

BugFix: the llama_vision processor expects a "text" key, not a "content" key.

Problem Description

In the current implementation, the text is appended to the messages variable under a "content" field:
messages[-1]["content"].append({"type": "text", "content": contexts})
But this is not the correct format for Llama Vision. For example, given this messages variable:

(Pdb) messages
[{'role': 'user', 'content': [{'type': 'image'}, {'type': 'text', 'content': "<image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\nA. $6\nB. $7\nC. $8\nD. $9\n\nAnswer with the option's letter from the given choices directly."}]}]

The resulting prompt after running prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True) does not contain the contexts:

(Pdb) prompt
'<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
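
For reference, the dropped text can be reproduced outside lmms-eval with a minimal sketch (this assumes access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint and the chat-template behavior at the time of this PR):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# The question text is stored under a "content" key instead of "text",
# so the chat template silently drops it from the rendered prompt.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "content": "Describe this image."},
]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)  # only the <|image|> token and role headers, no question text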

Solution

The fix is simple: following the Llama Vision example from https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct, the input to the apply_chat_template function should use a "text" key instead of a "content" key.
From the model documentation:

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
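
Applied to the line quoted in the problem description, this amounts to a one-line change (a sketch; the surrounding lmms-eval code is assumed to match the snippet above):

# Before: the chat template ignores text stored under "content"
messages[-1]["content"].append({"type": "text", "content": contexts})

# After: the chat template reads the text from the "text" key
messages[-1]["content"].append({"type": "text", "text": contexts})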

After this fix, the resulting prompt for the same messages as before is:

(Pdb) prompt
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|><image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\nA. $6\nB. $7\nC. $8\nD. $9\n\nAnswer with the option's letter from the given choices directly.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

BugFix: llama_vision processor accepts "text" key and not "content" key.
@Luodian Luodian merged commit dd2839e into EvolvingLMMs-Lab:main Nov 29, 2024
1 check passed
ZhaoCinyu pushed a commit to ZhaoCinyu/lmms-eval that referenced this pull request Dec 9, 2024
BugFix: llama_vision processor accepts "text" key and not "content" key.