llama : first attempt to implement vision API (WIP) #9687

Draft · wants to merge 2 commits into base: master

Conversation

@ngxson (Collaborator) commented on Sep 29, 2024

(Hopefully) fix #8010

Important

This is still a WIP; nothing is working yet.
Collaborators are encouraged to discuss and give feedback on this.

Motivation

Currently, the vision capability is provided by the llava example, which is a CLIP implementation in ggml. While it's a good start, the API needs some refactoring to be cleaner and more future-proof.

Inspired by the current rework of the sampling API, I propose moving the CLIP implementation into the main libllama, providing users with a stable, easy-to-use API, similar to what we did for llama_encode.

The goals of this refactoring are:

  • Provide a good API and code architecture for more models to come in the future
  • Single gguf file for both vision+language (so, no more model surgery)
  • Only llava for now (because it's simple for me to understand)
  • Have llama-cli accept image input

The non-goals:

  • No change to llama-server. That will be another PR
  • No minicpm, llama-3.2-vision, phi-3-vision, etc. Again, that will be another PR

Plan

  • define the plan:
    • gguf metadata and tensor naming scheme
    • define API to be exposed from libllama
  • upgrade convert_hf_to_gguf.py to support llava --> not an ideal implementation, but kinda works
  • extend llama_model and llama_context to hold vision-related data
  • add llama-vision.{cpp|h}
  • add image capability to llama-cli

Implementation

Naming scheme

For metadata, we will add a vision.* namespace.

  • vision.type: the type of vision encoder. We only support "clip" for now (not sure if there are any other implementations out there)
  • vision.*: other params for vision encoding, for example patch size, image size, etc.
  • vision.clip.*: CLIP-related params

Example:

vision.type = 'clip'
vision.image_size = 336
vision.patch_size = 14
vision.clip.architecture = 'llava'
vision.clip.block_count = 24
vision.clip.embedding_length = 1024
vision.clip.feed_forward_length = 4096
vision.clip.attention.head_count = 16

For the tensor naming scheme, we will prefix all vision-related tensors with v., with encoder tensors under v.enc.*. For example:

v.mmproj_a.bias
v.mmproj_a.weight
v.enc.embd.cls
v.enc.embd.patch.weight
v.enc.embd.pos.weight
v.enc.blk.0.input_norm.bias
v.enc.blk.0.input_norm.weight

API

libllama will be responsible for:

  • Accepting a bitmap image (RGB format) and splitting it into patches
  • It will NOT process specific formats like PNG, JPG, etc. Users must convert these formats into a bitmap (for example, using STB) before passing it to llama
  • The API returns embeddings that can be added to a language batch
// represents an RGB image
// size of data must be equal to 3*nx*ny
struct llama_img {
    uint32_t nx;
    uint32_t ny;
    unsigned char * data;
};

typedef struct llama_img_batch {
    int32_t            n_imgs;
    struct llama_img * imgs;
    // add other things in future?
} llama_img_batch;

// encode images into embeddings
int32_t llama_vision_encode(struct llama_context * ctx, llama_img_batch * batch);

// get output embeddings, to be put into the language batch
float * llama_vision_get_embeddings(struct llama_context * ctx, int32_t idx);
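
To make the intended flow concrete, here is a hypothetical caller, assuming the API above lands as proposed and using stb_image (which the llava example already vendors) to decode a file into an RGB bitmap; none of this exists in the PR yet:

#include "llama.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

// Hypothetical usage of the proposed API; error handling kept minimal,
// and a return value of 0 from llama_vision_encode is assumed to mean success.
static float * encode_one_image(struct llama_context * ctx, const char * path) {
    int nx, ny, nc;
    // force 3 channels -> tightly packed RGB, 3*nx*ny bytes, as llama_img requires
    unsigned char * rgb = stbi_load(path, &nx, &ny, &nc, 3);
    if (rgb == NULL) {
        return NULL;
    }

    struct llama_img img   = { (uint32_t) nx, (uint32_t) ny, rgb };
    llama_img_batch  batch = { /*.n_imgs =*/ 1, /*.imgs =*/ &img };

    float * embd = NULL;
    if (llama_vision_encode(ctx, &batch) == 0) {
        // embeddings for image 0, to be appended to the language batch
        embd = llama_vision_get_embeddings(ctx, 0);
    }

    stbi_image_free(rgb);
    return embd;
}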
