
llama : first attempt to implement vision API (WIP) #9687

Draft · wants to merge 9 commits into master

Conversation

ngxson
Collaborator

@ngxson ngxson commented Sep 29, 2024

(Hopefully) fix #8010

Important

This is still WIP; only the simple example is working.
Collaborators are encouraged to discuss and give feedback on it.

Motivation

Currently, the vision capability is provided by the llava example, which is a CLIP implementation in ggml. While it's a good start, the API needs some refactoring to be cleaner and more future-proof.

Inspired by the current rework of the sampling API, I propose moving the CLIP implementation into the main libllama, providing users with a stable, easy-to-use API like what we did for llama_encode.

The goals of this refactoring are:

  • Provide a good API and code architecture for more models to come in the future
  • Single gguf file for both vision+language (so, no more model surgery)
  • Only llava for now (because it is simple for me to understand)
  • Have llama-cli accept image input

The no-goals:

  • No changes to llama-server; that will be another PR
  • No minicpm, llama-3.2-vision, phi-3-vision, etc.; again, that will be another PR

Plan

  • define the plan:
    • gguf metadata and tensor naming scheme
    • define API to be exposed from libllama
  • upgrade convert_hf_to_gguf.py to support llava --> not an ideal implementation, but kinda works
  • extend llama_model and llama_context to hold vision-related data
  • add llama-vision.{cpp|h}
  • add image capability to llama-cli

Implementation

Naming scheme

For metadata, we will add a vision.* namespace.

  • vision.type: the type of vision encoder. We only support "clip" for now (not sure if there are any other implementations out there)
  • vision.*: other params for vision encoding, for example patch size, image size, etc.
  • vision.clip.*: CLIP-related params

Example:

vision.type = 'clip'
vision.image_size = 336
vision.patch_size = 14
vision.clip.architecture = 'llava'
vision.clip.block_count = 24
vision.clip.embedding_length = 1024
vision.clip.feed_forward_length = 4096
vision.clip.attention.head_count = 16

For the tensor naming scheme, we will prefix all vision-related tensors with v.* (encoder tensors use the v.enc.* prefix). For example:

v.mmproj_a.bias
v.mmproj_a.weight
v.enc.embd.cls
v.enc.embd.patch.weight
v.enc.embd.pos.weight
v.enc.blk.0.input_norm.bias
v.enc.blk.0.input_norm.weight

API

libllama will be responsible for:

  • Accepting a bitmap image (RGB format) and splitting it into patches
  • It will NOT process specific formats like PNG, JPG, etc. The user must convert these formats into a bitmap (for example, using STB) before passing it to llama
  • The API returns embeddings that can be added to a language batch
// represent an RGB image
// size of data must be equal to 3*nx*ny
struct llama_img {
    uint32_t nx;
    uint32_t ny;
    unsigned char * data;
};

typedef struct llama_img_batch {
    int32_t     n_imgs;
    llama_img * imgs;
    // add other things in future?
} llama_img_batch;

// encode image into embeddings
int32_t llama_vision_encode(struct llama_context * ctx, llama_img_batch * batch);

// get output embeddings, to be put into language batch
float * llama_vision_get_embeddings(struct llama_context * ctx, int32_t idx);
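
For illustration only, here is a minimal usage sketch of the proposed API. The API itself is still WIP; the load_and_encode helper, the use of stb_image, and the assumption that llama_vision_encode returns 0 on success are all assumptions, not part of this PR.

// sketch: decode an image file into an RGB bitmap with stb_image, then run the vision encoder
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#include "llama.h"

static float * load_and_encode(llama_context * ctx, const char * path) {
    int nx, ny, nc;
    unsigned char * rgb = stbi_load(path, &nx, &ny, &nc, 3); // force 3 channels (RGB)
    if (rgb == nullptr) {
        return nullptr;
    }

    llama_img img = { (uint32_t) nx, (uint32_t) ny, rgb };
    llama_img_batch ibatch = { /*n_imgs =*/ 1, &img };

    // run the vision encoder; the resulting embeddings are stored in the context
    // (assuming a return value of 0 means success)
    int32_t res = llama_vision_encode(ctx, &ibatch);
    stbi_image_free(rgb);
    if (res != 0) {
        return nullptr;
    }

    // embeddings of image 0, to be placed into a language batch (llama_batch.embd)
    return llama_vision_get_embeddings(ctx, 0);
}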

@ngxson ngxson changed the title llama : refactor vision API (WIP) llama : first attempt to refactor vision API (WIP) Sep 29, 2024
@github-actions github-actions bot added the python python script changes label Sep 29, 2024
@ngxson ngxson changed the title llama : first attempt to refactor vision API (WIP) llama : first attempt to implement vision API (WIP) Sep 29, 2024
@@ -178,6 +178,28 @@ class Adapter:
TYPE = "adapter.type"
LORA_ALPHA = "adapter.lora.alpha"

class Vision:
# only support vision.type = "clip" for now


Probably better to be very specific here that you are supporting ViT. HuggingFace has also moved away from calling everything by generic terms. Also, I don't think the purpose is to support actual CLIP inference.

Collaborator Author

@ngxson ngxson Oct 2, 2024

Yeah, agree. I think for now it's safer to call this clip-vit to reflect that the base implementation is openai/clip-vit-*.

At the moment it's quite complicated for me to drop the clip_ prefix in all functions. But at least the file name is now llama-vision.{cpp|h} instead of llama-clip, which should reflect that we can support ViT and also other things to come in the future.

Comment on lines +137 to +156
n_cur = 0;
llama_batch_clear(batch);
llama_batch_add(batch, 1, 0, { 0 }, false);
llama_decode(ctx, batch);

n_cur = t1.size();
llama_batch_clear(batch);
llama_batch batch0 = {int32_t(576), nullptr, _test_get_img_embd(ctx), nullptr, nullptr, nullptr, nullptr, n_cur, 1, 0, };
llama_decode(ctx, batch0);

n_cur = 0;
llama_batch_clear(batch);
for (auto t : t1) { llama_batch_add(batch, t, n_cur, { 0 }, false); n_cur++; }
llama_decode(ctx, batch);

n_cur = t1.size() + 576;
llama_batch_clear(batch);
printf("pos %d\n", n_cur);
for (auto t : t2) { llama_batch_add(batch, t, n_cur, { 0 }, false); n_cur++; }
batch.logits[batch.n_tokens - 1] = true;
Collaborator Author

@ggerganov I observed something weird here that I can't understand, so I would like to ask if you know why.

What I'm trying to do here is to test whether I can call llama_decode on different parts of the sentence out of order. That means instead of evaluating text,image,text, I try evaluating image,text,text. In theory, this should work because the positions are set correctly for all input tokens/embeddings; only the placement in the KV cache is different.

But in reality, llama_decode on a batch containing embd doesn't work correctly if the KV cache is empty. In this example, I need to evaluate at least one token first for it to work.

I'd be happy to provide more info to help pinpoint the problem. Thank you.

Owner

But in reality, llama_decode on a batch containing embd doesn't work correctly if the KV cache is empty.

What error do you get?

Collaborator Author

@ngxson ngxson Oct 3, 2024

There is no error, but the generated text is quite irrelevant.

  • Expected output: The image features a beautiful view of the Eiffel Tower
  • Without the first single-token llama_decode: The image features a cityscape with a tall building prominently visible in the center. The building appears to be lit up, possibly at night or during --> clearly wrong because I provided a daytime image

Please note that if the code follows the order text,image,text, then it works correctly (generates the expected output).

I have a feeling that the KV cache entries for the embeddings are truncated somehow, but I can't find a way to verify this. I tried playing with llama_kv_cache_view, but the output indicates all tokens are in their correct places.

        llama_kv_cache_view v = llama_kv_cache_view_init(ctx, 1);
        llama_kv_cache_view_update(ctx, &v);
        printf("n_cells %d\n", v.n_cells);
        // with n_seq_max = 1, the sequence of cell i is cells_sequences[i]
        for (int i = 0; i < 16; i++) {
            printf("cell %d --> pos %d (seq %d)\n", i, v.cells[i].pos, v.cells_sequences[i]);
        }
        for (int i = 570; i < 570+16; i++) {
            printf("cell %d --> pos %d (seq %d)\n", i, v.cells[i].pos, v.cells_sequences[i]);
        }
        for (int i = n_cur-8; i < n_cur+8; i++) {
            printf("cell %d --> pos %d (seq %d)\n", i, v.cells[i].pos, v.cells_sequences[i]);
        }

(I'm uploading the model so that you can test it. Will drop a link here when it's available)

Collaborator Author

@ngxson ngxson Oct 3, 2024

Link to gguf: https://huggingface.co/ngxson/test-llava-will-be-deleted-soon/blob/main/model.gguf

Command: ./llama-simple -m ../models/llava-1.5-7b-hf/model.gguf -c 1024 -v

Edit: you also need this image placed under ../models/eiffel-tower-3349075_1280.jpg:
https://cdn.sdnews.com/wp-content/uploads/20230705142505/eiffel-tower-3349075_1280.jpg

Owner

@ggerganov ggerganov Oct 3, 2024

I don't think that text,image,text is equivalent to image,text,text, regardless of whether the positions are correct. But before we discuss that, it seems to me that the positions are off by one due to the BOS token:

        n_cur = 0;
        llama_batch_clear(batch);
        llama_batch_add(batch, 1, 0, { 0 }, false);
        llama_decode(ctx, batch);

        n_cur = t1.size();
        llama_batch_clear(batch);
        llama_batch batch0 = {int32_t(576), nullptr, _test_get_img_embd(ctx), nullptr, nullptr, nullptr, nullptr, 1 + n_cur, 1, 0, };
        llama_decode(ctx, batch0);

        n_cur = 0;
        llama_batch_clear(batch);
        for (auto t : t1) { llama_batch_add(batch, t, 1 + n_cur, { 0 }, false); n_cur++; }
        llama_decode(ctx, batch);

        n_cur = t1.size() + 576;
        llama_batch_clear(batch);
        printf("pos %d\n", n_cur);
        for (auto t : t2) { llama_batch_add(batch, t, 1 + n_cur, { 0 }, false); n_cur++; }
        batch.logits[batch.n_tokens - 1] = true;

Collaborator Author

@ngxson ngxson Oct 3, 2024

But before we discuss that, it seems to me that the positions are off by one due to the BOS token

Just to clarify a bit, the single-token llama_decode is there just for testing. Indeed, adding it effectively makes the sequence have 2 BOS tokens at the same position 0. But for some reason, the output is the same as with a single BOS: bos,text,image,text (which is the expected output).

So, it's just to highlight the difference between llama_decode(img_embd) on an empty KV cache vs. a KV cache containing a single token.

Owner

@ggerganov ggerganov Oct 4, 2024

I see, so this is OK in that case.

I don't think that text,image,text is equivalent to image,text,text

For clarity, let's assume the 2 cases of submitting 3 batches in different orders:

  • text0, image, text1
  • image, text0, text1

In the first case, when the image batch is evaluated it will "see" the text0 tokens, because they are stored in the KV cache (due to computing the text0 batch before that) and so the image tokens will attend to the text0 tokens when computing the attention.

If instead we first submit the "image" batch on an empty cache, then the tokens will not attend to text0. Also, when submitting the second text0 batch, it will also not attend to the image tokens because of the causal mask - this is correct.

In both cases the text1 batch is evaluated correctly because it will "see" all the data that it is supposed to. But I think the problem is that in the second case, the image batch is computed incorrectly.
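
A schematic summary of the above, under a causal mask and per-batch submission (the positions are identical in both cases; only the KV cache contents at compute time differ):

text0, image, text1 : image attends to text0 and itself; text1 attends to everything --> all batches computed as intended
image, text0, text1 : image attends only to itself (empty cache); text0 does not attend to image (causal mask, which is correct); text1 attends to everything --> the image batch is computed incorrectly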

One thing to be careful about is what mask we should use when computing the image batch. There was some relevant discussion before in #7553. But in short, it's possible that the image tokens require a non-causal mask between themselves (i.e. each image token "sees" all other image tokens). I'm not sure if this is the case, but it is something to consider.

Edit: just now I read #7553 (comment) by @abetlen and it appears that for the PaliGemma models, the entire context (both image and text) has to be evaluated with non-causal attention. This makes it necessary to submit everything inside a single ubatch, otherwise tokens from a previous ubatch cannot attend to tokens in a future ubatch.

Collaborator Author

@ngxson ngxson Oct 4, 2024

Thanks for your response. FYI, I tried printing inp_KQ_mask and it turns out that's the problem. In the second case, image,text0,text1, somehow text1 can't attend to the tokens in text0.

What's weird is that if I instead have 4 batches, bos,image,text0,text1, then it works correctly (bos is a batch containing a single BOS token).

I also set n_batch and n_ubatch to 1024 to make sure the image fits into one batch, but that doesn't solve my issue.

One thing to be careful about is what mask we should use when computing the image batch.

In the case of the llava model, let's imagine that the image batch is just a batch containing 576 tokens, so let's call it textI instead of image. So my assumption is that text0,textI,text1 should be equivalent to textI,text0,text1 (?). When generating the next token, we simply add all tokens we have already evaluated to the mask, so it should be an auto-regressive mask.

Perhaps PaliGemma is different because it requires non-causal attention.


I also tried setting llama_set_causal_attn to false before evaluating the image batch, but that doesn't help either.
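
For reference, that attempt looks roughly like this (a sketch only; img_batch and txt_batch stand for whatever batches the example already builds):

llama_set_causal_attn(ctx, false); // non-causal mask while decoding the image embeddings
llama_decode(ctx, img_batch);
llama_set_causal_attn(ctx, true);  // restore causal attention for the text batches
llama_decode(ctx, txt_batch);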

Collaborator Author

I don't think that text,image,text is equivalent to image,text,text, regardless of whether the positions are correct.

Yeah right, I've just realized that my mental model is wrong. Just maintaining the correct position embeddings is not enough.

Because attention (softmax) is applied at each layer, from the second layer of the model onward each cell in the KV cache already contains information about the tokens before it. Therefore, decoding the image first is not possible, since it doesn't have information about text0. They must at least be in the same batch for it to work.

Thank you again for the explanation! (And sorry for the noise)

Collaborator Author

@ngxson ngxson Oct 4, 2024

For now, I think a hacky way would be to internally populate the image tokens into the batch in llama_decode. I'm also thinking about the possibility of extending llm_build_inp_embd to accept both token and embedding input at the same time. What do you think? (also cc @abetlen, since that is what your PR #7553 aims to resolve)
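
A very rough sketch of that second idea, with purely hypothetical types and names (mixed_input and build_mixed_inp_embd are not the actual libllama internals): the decode path would build the input embedding matrix by taking each row either from the token embedding table or from the precomputed vision-encoder output.

#include <cstdint>
#include <cstring>
#include <vector>

// hypothetical mixed input row: a text token OR a precomputed image embedding
struct mixed_input {
    int32_t       token; // used when embd == nullptr
    const float * embd;  // points to n_embd floats when this row is an image embedding
};

// hypothetical counterpart of llm_build_inp_embd: assemble an [n_rows, n_embd] matrix
// by mixing token-table lookups and image embeddings
static std::vector<float> build_mixed_inp_embd(
        const std::vector<mixed_input> & rows,
        const float * tok_embd, // [n_vocab, n_embd] token embedding table
        int32_t       n_embd) {
    std::vector<float> out(rows.size() * n_embd);
    for (size_t i = 0; i < rows.size(); i++) {
        const float * src = rows[i].embd != nullptr
            ? rows[i].embd                                 // image embedding, already in model space
            : tok_embd + (size_t) rows[i].token * n_embd;  // text token lookup
        std::memcpy(out.data() + i * n_embd, src, n_embd * sizeof(float));
    }
    return out;
}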

image_output = std::move(padded_image);
}

static void normalize_image_u8_to_f32(const clip_image_u8 src, clip_image_f32 dst, const std::array<float, 3> & mean, const std::array<float, 3> & std) {
Contributor

This might be a root cause for some bugs ;)

Collaborator Author

Yes, probably missing clip_image_f32 &; thanks for pointing this out.

@Nekotekina
Contributor

Forgive me for the (probably) off-topic question, but would it be possible to use a CLIP model with llama.cpp to compute text embeddings for the "classic" purpose of matching text with images?

@ngxson
Collaborator Author

ngxson commented Oct 11, 2024

Forgive me for the (probably) off-topic question, but would it be possible to use a CLIP model with llama.cpp to compute text embeddings for the "classic" purpose of matching text with images?

I'm not 100% sure, but in theory, everything is possible with the correct model weights and the correct compute graph.

@danbev
Collaborator

danbev commented Nov 12, 2024

I took a shot at rebasing this branch, as there have been quite a few changes upstream that affect this PR and I wanted to try this API out. I wanted to share the rebased repo in case it would be useful/save time.

I moved the example to a new one named simple-vision, as the simple example upstream has been updated to remove the dependencies on common. There are probably some improvements to be made to the changes made during rebasing, but it might be easier/quicker to make those changes on the rebased branch.

@sragrawal

@ngxson are you still planning on working on this? Having this as a first step towards llama-3.2-vision would be very useful.

@ngxson
Collaborator Author

ngxson commented Dec 3, 2024

I'm not actively working on it, but will continue soon. Ref discussion: #10381

Labels: examples, python (python script changes)

Successfully merging this pull request may close these issues:

  • server: Bring back multimodal support