llama : first attempt to implement vision API (WIP) #9687

Draft · wants to merge 2 commits into base: master

Conversation

@ngxson (Collaborator) commented on Sep 29, 2024

(Hopefully) fix #8010

Important

This is still a WIP; nothing is working yet.
Collaborators are encouraged to discuss and give feedback on this.

Motivation

Currently, the vision capability is provided by the llava example, which is a CLIP implementation in ggml. While it's a good start, the API needs some refactoring to be cleaner and more future-proof.

Inspired by the current rework of the sampling API, I propose moving the CLIP implementation into the main libllama, providing users with a stable, easy-to-use API, similar to what we did for llama_encode.

The goals of this refactoring are:

  • Provide a good API and code architecture for more models to come in the future
  • Single gguf file for both vision+language (so, no more model surgery)
  • Only llava for now (because it's simple for me to understand)
  • Have llama-cli accept image input

The non-goals:

  • No change to llama-server. That will be another PR
  • No minicpm, llama-3.2-vision, phi-3-vision, etc. Again, that will be another PR

Plan

  • define the plan:
    • gguf metadata and tensor naming scheme
    • define API to be exposed from libllama
  • upgrade convert_hf_to_gguf.py to support llava --> not an ideal implementation, but kinda works
  • extend llama_model and llama_context to hold vision-related data
  • add llama-vision.{cpp|h}
  • add image capability to llama-cli

Implementation

Naming scheme

For metadata, we will add a vision.* namespace.

  • vision.type: the type of vision encoder. We only support "clip" for now (not sure if there are any other implementations out there)
  • vision.*: other params for vision encoding, for example patch size, image size, etc.
  • vision.clip.*: CLIP-related params

Example:

vision.type = 'clip'
vision.image_size = 336
vision.patch_size = 14
vision.clip.architecture = 'llava'
vision.clip.block_count = 24
vision.clip.embedding_length = 1024
vision.clip.feed_forward_length = 4096
vision.clip.attention.head_count = 16

For the tensor naming scheme, we will prefix all vision-related tensors with v., with encoder tensors under v.enc.*. For example:

v.mmproj_a.bias
v.mmproj_a.weight
v.enc.embd.cls
v.enc.embd.patch.weight
v.enc.embd.pos.weight
v.enc.blk.0.input_norm.bias
v.enc.blk.0.input_norm.weight

API

libllama will be responsible for:

  • Accepting a bitmap image (RGB format) and splitting it into patches
  • It will NOT process specific formats like PNG, JPG, etc. Users must convert these formats into a bitmap (for example, using STB) before passing it to llama
  • The API returns embeddings that can be added to a language batch
// represents an RGB image
// size of data must be equal to 3*nx*ny
struct llama_img {
    uint32_t nx;
    uint32_t ny;
    unsigned char * data;
};

typedef struct llama_img_batch {
    int32_t            n_imgs;
    struct llama_img * imgs;
    // add other things in future?
} llama_img_batch;

// encode images into embeddings
int32_t llama_vision_encode(struct llama_context * ctx, llama_img_batch * batch);

// get output embeddings, to be put into the language batch
float * llama_vision_get_embeddings(struct llama_context * ctx, int32_t idx);
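
To make the intended flow concrete, here is a hypothetical caller, assuming the API above lands as proposed and using stb_image (which the llava example already vendors) to decode a file into an RGB bitmap; none of this exists in the PR yet:

#include "llama.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

// Hypothetical usage of the proposed API; error handling kept minimal,
// and a return value of 0 from llama_vision_encode is assumed to mean success.
static float * encode_one_image(struct llama_context * ctx, const char * path) {
    int nx, ny, nc;
    // force 3 channels -> tightly packed RGB, 3*nx*ny bytes, as llama_img requires
    unsigned char * rgb = stbi_load(path, &nx, &ny, &nc, 3);
    if (rgb == NULL) {
        return NULL;
    }

    struct llama_img img   = { (uint32_t) nx, (uint32_t) ny, rgb };
    llama_img_batch  batch = { /*.n_imgs =*/ 1, /*.imgs =*/ &img };

    float * embd = NULL;
    if (llama_vision_encode(ctx, &batch) == 0) {
        // embeddings for image 0, to be appended to the language batch
        embd = llama_vision_get_embeddings(ctx, 0);
    }

    stbi_image_free(rgb);
    return embd;
}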
