Add PaliGemma Support #7553

abetlen · 2024-05-27T02:49:36Z

Very much still a work in progress however I've been able to convert the weights and update the clip model to support the differences in the PaliGemma implementation of SigLIP (second projector mlp layer is missing).

The next missing piece is that PaliGemma evaluates the prefix part of the prompt (which includes image embeddings) using a fully-connected attention mask as opposed to causal attention (See HF Implementation). ~~I haven't played around too much with llama_set_causal_attn function but I believe this may be sufficient~~, otherwise it will be necessary to update the API to specify a custom attention mask.

I've created a custom script to generate the f16 ggufs here I've opted to do this in a custom script as the current convert-hf-to-gguf.py is not suited for converting vlms at the moment.

ggerganov · 2024-05-27T06:54:47Z

The next missing piece is that PaliGemma evaluates the prefix part of the prompt (which includes image embeddings) using a fully-connected attention mask as opposed to causal attention

Hm yes, this is not currently supported. Need to figure out what changes would be necessary. Explicitly setting the mask through the API would be possible, but I think it would be too difficult to use. There are other options such as:

for each token in the batch, along with it's pos, we provide one more integer of how many "future" token it attends to (defaults to 0)
be able to set attention mask rules, such as "positions [p0, p1] are non-causal" and we apply those rules on top of the mask that we normally construct

Not sure yet what would be best

abetlen · 2024-05-27T13:26:34Z

Explicitly setting the mask through the API would be possible, but I think it would be too difficult to use.

I'm partial to this if it's the most straightforward to implement just because it offers the most general API but I understand the concern.

Otherwise, I think option 1 is the better approach as it's centered around the batch api and would require minimal work to modify existing code.

abetlen · 2024-05-30T21:04:03Z

@ggerganov I'll see what I can come up with along those lines.

We could probably limit complexity by only allowing future attention to work within a batch, otherwise we would need multiple graph compute calls to update previously computed kv positions. I might be mistaken about that though.

ggerganov · 2024-06-03T09:16:59Z

We could probably limit complexity by only allowing future attention to work within a batch

Yes you are correct, future attention can only be done within the batch.

Not sure if option 1 would require less changes though. We have to remember for each token what is the set of tokens that it attends to. So even if we can provide this information for the current batch through llama_batch, then for the next batch we would still need it in order to construct the correct mask, so it would likely require to update llama_kv_cache to store it for example.

On the other hand option 2 might need to just maintain a container with ranges in the llama_context and blindly apply it on top of whatever mask has been constructed normally. It would be also much easier to remove in the future if we think of a better way to support this

ggerganov · 2024-06-04T18:19:50Z

cc @iamlemec, @compilade - bringing your attention to this discussion in case you get some additional ideas how to support this functionality

iamlemec · 2024-06-06T19:22:08Z

I think that the choice between 1 and 2 might come down to whether the causal/non-causal block positions are generally fixed. It seems like with PaliGemma the image block is of fixed length determined by the image processing model. With something like GritLM the non-causal segments can be of varying size. Also, a 3rd slightly more general implementation would be to add a per-token [p0,p1] to the batch that specifies which positions this token attends to. Keeping in mind that the referenced positions must either be in the current batch or in the kv_cache. Somehow thinking in terms of positions seems more natural to me than future offsets.

That said, there might be a way to do it with the current codebase. Can we just evaluate the image tokens in a first batch with causal_attn=False, then evaluate the prompt in a subsequent batch with causal_attn=True? The first batch will populate the kv_cache appropriately.

ggerganov · 2024-06-07T11:55:29Z

Can we just evaluate the image tokens in a first batch with causal_attn=False, then evaluate the prompt in a subsequent batch with causal_attn=True? The first batch will populate the kv_cache appropriately.

The problem I think is that during the second batch, the attention mask will be causal for the entire context, which would lead to incorrect masking of the tokens from the first batch, no?

iamlemec · 2024-06-07T12:11:22Z

The problem I think is that during the second batch, the attention mask will be causal for the entire context, which would lead to incorrect masking of the tokens from the first batch, no?

But those KV's from the first block are already computed, cached, and wont be recomputed. So any given token in the second batch will be attending to every single token in the first batch, and the KV's they pick up will have been computed non-causally in the first batch execution.

ggerganov · 2024-06-08T12:07:20Z

The KV data is cached yes, but the softmax computation in the attention during the new batch still needs the correct mask for the entire context. With a causal mask, the tokens from the new batch would correctly attend to all previous ones, but the previous tokens would not attend correctly to themselves - they still attend to each other via the softmax, even though they are from the previous batch

iamlemec · 2024-06-08T21:54:29Z

Thanks for the reply @ggerganov. I'm still not 100% convinced, but it seems possible that I'm confused about the premise here. Partially for my own edification, and partially to clear up any ambiguity, I decided to write up a minimal working example in Python with transformers.

Here's a link to the Gist: https://gist.github.com/iamlemec/3febf59b41b7f32a450fcfcb4be0713c. I used RoBERTa because it's still relatively small and allows one to specify full 2D attention matrices, rather than just 1D attention masks like in base BERT. Anyway, those asserts pass! So that's potential validation?

ggerganov · 2024-06-09T15:13:34Z

@iamlemec I stand corrected. Thanks for this example and helping figuring this out!

So this is actually very good news - we should be able to support PaliGemma with the existing API, correct?

abetlen · 2024-06-10T16:30:59Z

@iamlemec @ggerganov sounds good, so if I understand correctly the approach would be to update the path for causal_attn == false in decode_internal to also populate the kv cache and then ensure the mask is handled appropriately when causal attention is re-enabled?

This may negatively impact performance of pure embedding models as they don't store anything in the cache afaik but maybe there's a way to distinguish the two scenarios.

ggerganov · 2024-06-11T15:17:11Z

@abetlen Yes, I think this should work

so if I understand correctly the approach would be to update the path for causal_attn == false in decode_internal to also populate the kv cache and then ensure the mask is handled appropriately when causal attention is re-enabled?

Hopefully no change would be needed. Note that we have two parameters:

hparams.causal_attn
cparams.causal_attn

The latter can be changed via llama_set_causal_attn, while the former is determined during loading the model. In general, hparams.causal_attn is false only for embedding models, so for PaliGemma it should be true and thus the KV cache in llama_decode_internal would be correctly updated

iamlemec · 2024-06-11T16:59:34Z

@abetlen @ggerganov Do we need to actually modify llama_decode_internal in llama.cpp for this? I would think that this would be best handled at the examples level, either as a generalization of llava or separately. Like, if this was llava, you should just be able to change (approx lines 186-188 in llava-cli.cpp)

eval_string(ctx_llava->ctx_llama, system_prompt.c_str(), params->n_batch, &n_past, true);
llava_eval_image_embed(ctx_llava->ctx_llama, image_embed, params->n_batch, &n_past);
eval_string(ctx_llava->ctx_llama, user_prompt.c_str(), params->n_batch, &n_past, false);

to

llama_set_causal_attn(ctx_llava->ctx_llama, true);
eval_string(ctx_llava->ctx_llama, system_prompt.c_str(), params->n_batch, &n_past, true);
llama_set_causal_attn(ctx_llava->ctx_llama, false);
llava_eval_image_embed(ctx_llava->ctx_llama, image_embed, params->n_batch, &n_past);
llama_set_causal_attn(ctx_llava->ctx_llama, true);
eval_string(ctx_llava->ctx_llama, user_prompt.c_str(), params->n_batch, &n_past, false);

I'm not quite sure how the system prompt works with PaliGemma, but basically just putting the appropriate llama_set_causal_attn before evals should be enough, right?

ggerganov · 2024-06-11T17:02:11Z

Yes, it should work without changes to llama_decode_internal, just as how you demonstrated

…add-paligemma-support

abetlen · 2024-06-12T01:20:38Z

That's the approach I was initially trying but it caused this assert to fail as the logits aren't reserved when cparams.causal_attn is false.

However I think I was just missing a one line change in llama_output_reserve I'll test it out.

arseniybelkov · 2024-07-04T09:50:02Z

Hello, I have cloned abetlen's work. I am trying to run the converting script on this model https://huggingface.co/google/paligemma-3b-pt-224/tree/main, but I keep getting the following error:

Traceback (most recent call last):
  File "/home/belkov.arseniy/paligemma/convert.py", line 305, in <module>
    special_vocab.add_to_gguf(fout)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/vocab.py", line 69, in add_to_gguf
    add_handler(value)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 530, in add_add_bos_token
    self.add_bool(Keys.Tokenizer.ADD_BOS, value)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 201, in add_bool
    self.add_key_value(key, val, GGUFValueType.BOOL)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 166, in add_key_value
    raise ValueError(f'Duplicated key name {key!r}')
ValueError: Duplicated key name 'tokenizer.ggml.add_bos_token'

Can someone please help?

The-TallGuy · 2024-08-28T22:02:31Z

I don't get it, is the feature meant to have already been implemented or is it a work in progress?

…add-paligemma-support

abetlen · 2024-10-02T15:43:46Z

@ggerganov was able to come back to this and finally get it working.

Changes:

Added a llama_token_inp_embd function to the llama.h API which translates a set of input tokens to input embeddings that can be used for llama_batch.embd. The reason this change is necessary is that for the Paligemma both the image and text are decoded with non-causal attention. Currently non-causal attention doesnt work accross ubatches so to work around this I'm getting the input embeddings for the text and concatenating with the image embeddings so I can fit it in a single ubatch.
Added a missing embedding rescaling step at the end of the image encoder where image embeddings are re-scaled by 1 / sqrt(embedding_dim), I made this value just a configurable scalar that can be set in the gguf which defaults to no rescaling if not set.
Updated the weights conversion script to add the <image> tokens.

Link to weights

The below python pseudocaode is what's currently required to evaluate the prefill correctly

cparams.n_ubatch = 512 # This needs to be large enough to fit the image embeddings __and__ text embeddings from the input prompt in a single llama_decode call
# ...
tokens = tokenize(prompt)
token_embeddingss = llama_token_inp_embd(lctx, tokens) # we need to embed the input tokens because llama_decode supports __either__ token or embedding inputs
# ...
image_embeddings = embed_from_bytes(llava_ctx, image_bytes)
# ...
memcpy(batch.embd, image_embeddings.embd, n_image_pos * embedding_dim * sizeof(float))
memcpy(batch.embd + n_image_pos * embedding_dim * sizeof(float), token_embeddings, len(tokens) * embedding_dim * sizeof(float));
# ...
llama_set_causal_attn(lctx, False)
llama_decode(batch);
llama_set_causal_attn(lctx, True)

ggerganov · 2024-10-07T12:53:58Z

@abetlen Great to see that this is working. Merging this PR might have to wait or be implemented within the scope of the new effort for adding vision capabilities to the core libllama (see #9687). The reason is that the new llama_token_inp_embd() might get very quickly obsoleted, depending on how we decide to redesign llama_batch.

abetlen · 2024-10-15T21:13:20Z

@ggerganov no problem, I'll work with @ngxson and see if I can provide support on that PR.

ngxson · 2024-10-15T21:32:47Z

@abetlen FYI, this week I'm currently quite busy on @huggingface hub stuff 😅

My current plan is to be able to add both tokens and embeddings into llama_batch. I firstly need to do this refactoring #9745 before I can start working on that.

In the meantime, would you mind to take a look on how the batch can be divided into ubatch with both the tokens and embd inside?

Ignore second mlp layer if weights are null

5833323

github-actions bot added the examples label May 27, 2024

mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label May 27, 2024

iamlemec mentioned this pull request Jun 11, 2024

Feature Request: Add Paligemma support #7875

Closed

4 tasks

abetlen added 2 commits June 11, 2024 21:10

Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …

7612e4c

…add-paligemma-support

Reserve logits when causal attention is disabled on context

9bed1ae

abetlen marked this pull request as ready for review August 10, 2024 21:30

abetlen marked this pull request as draft August 10, 2024 21:30

abetlen added 2 commits August 10, 2024 17:49

Merge branch 'master' into add-paligemma-support

614cb6a

Merge branch 'ggerganov:master' into add-paligemma-support

bdaec8f

abetlen added 2 commits September 29, 2024 15:10

Update branch

f1fb914

Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …

0875f76

…add-paligemma-support

Update GGML_ASSERT

dd34db2

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Sep 29, 2024

abetlen added 2 commits October 1, 2024 06:12

Add embeddings scale to clip_ctx to rescale final image embeddings

9aecd38

Add llama_token_in_embd function to embed input tokens

c702e55

abetlen marked this pull request as ready for review October 2, 2024 16:22

ggerganov mentioned this pull request Oct 4, 2024

llama : first attempt to implement vision API (WIP) #9687

Draft

7 tasks

ggerganov mentioned this pull request Oct 12, 2024

Feature Request: Paligemma Support #9227

Open

4 tasks

akshayballal95 mentioned this pull request Oct 20, 2024

quantized version of the model fo CPU inference illuin-tech/colpali#78

Closed

ManuelFay mentioned this pull request Oct 21, 2024

[question] usage in Rust/ONNX? illuin-tech/colpali#103

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add PaliGemma Support #7553

Add PaliGemma Support #7553

abetlen commented May 27, 2024 •

edited

Loading

ggerganov commented May 27, 2024

abetlen commented May 27, 2024

abetlen commented May 30, 2024

ggerganov commented Jun 3, 2024

ggerganov commented Jun 4, 2024

iamlemec commented Jun 6, 2024

ggerganov commented Jun 7, 2024 •

edited

Loading

iamlemec commented Jun 7, 2024

ggerganov commented Jun 8, 2024 •

edited

Loading

iamlemec commented Jun 8, 2024

ggerganov commented Jun 9, 2024

abetlen commented Jun 10, 2024

ggerganov commented Jun 11, 2024

iamlemec commented Jun 11, 2024

ggerganov commented Jun 11, 2024

abetlen commented Jun 12, 2024

arseniybelkov commented Jul 4, 2024

The-TallGuy commented Aug 28, 2024

abetlen commented Oct 2, 2024 •

edited

Loading

ggerganov commented Oct 7, 2024

abetlen commented Oct 15, 2024

ngxson commented Oct 15, 2024 •

edited

Loading

Add PaliGemma Support #7553

Are you sure you want to change the base?

Add PaliGemma Support #7553

Conversation

abetlen commented May 27, 2024 • edited Loading

ggerganov commented May 27, 2024

abetlen commented May 27, 2024

abetlen commented May 30, 2024

ggerganov commented Jun 3, 2024

ggerganov commented Jun 4, 2024

iamlemec commented Jun 6, 2024

ggerganov commented Jun 7, 2024 • edited Loading

iamlemec commented Jun 7, 2024

ggerganov commented Jun 8, 2024 • edited Loading

iamlemec commented Jun 8, 2024

ggerganov commented Jun 9, 2024

abetlen commented Jun 10, 2024

ggerganov commented Jun 11, 2024

iamlemec commented Jun 11, 2024

ggerganov commented Jun 11, 2024

abetlen commented Jun 12, 2024

arseniybelkov commented Jul 4, 2024

The-TallGuy commented Aug 28, 2024

abetlen commented Oct 2, 2024 • edited Loading

ggerganov commented Oct 7, 2024

abetlen commented Oct 15, 2024

ngxson commented Oct 15, 2024 • edited Loading

abetlen commented May 27, 2024 •

edited

Loading

ggerganov commented Jun 7, 2024 •

edited

Loading

ggerganov commented Jun 8, 2024 •

edited

Loading

abetlen commented Oct 2, 2024 •

edited

Loading

ngxson commented Oct 15, 2024 •

edited

Loading