llama : first attempt to implement vision API (WIP) #9687
(Hopefully) fix #8010
> [!IMPORTANT]
> This is still WIP, nothing is working yet. Collaborators are encouraged to discuss and give feedback on this.
Motivation
Currently, the vision capability is provided by the `llava` example, which is a CLIP implementation in ggml. While it's a good start, the API needs some refactoring to be cleaner and more future-proof.

Inspired by the ongoing rework of the sampling API, I propose moving the CLIP implementation into the main `libllama`, providing users a stable, easy-to-use API like what we did for `llama_encode`.
The goals of this refactoring are:
- Make `llama-cli` accept image input

The no-goals:
- Vision support in `llama-server`. It will be another PR.

Plan
- Bring the vision implementation into `libllama`:
  - Modify `convert_hf_to_gguf.py` to support llava --> not an ideal implementation, but kinda works
  - Extend `llama_model` and `llama_context` to hold vision-related data
  - Add `llama-vision.{cpp|h}`
- Add image input support to `llama-cli`
Implementation
Naming scheme
For metadata, we will add the `vision.*` namespace:
- `vision.type`: the type of vision encoder. We only support `"clip"` for now (not sure if there are any other implementations out there)
- `vision.*`: other params for vision encoding, for example patch size, image size, etc.
- `vision.clip.*`: CLIP-related params

Example:
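A hedged sketch of what such metadata could look like. Only `vision.type` and the `vision.clip.*` namespace are named in this PR; the remaining key names and all values below are illustrative assumptions, not taken from the actual implementation:

```
vision.type            = "clip"  # only "clip" is supported for now
vision.clip.image_size = 336     # illustrative value (e.g. a ViT-L/14 @ 336px encoder)
vision.clip.patch_size = 14      # illustrative value
```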
For the tensor naming scheme, we will prefix all vision-related tensors with `v.enc.*`. For example:

API
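As an illustration of the `v.enc.*` prefix described above, vision encoder tensors might be named along these lines (the specific names are guesses modeled on llama.cpp's existing `blk.N.*` conventions, not taken from the PR):

```
v.enc.embd.patch.weight      # hypothetical: patch embedding
v.enc.blk.0.attn_q.weight    # hypothetical: query projection, first encoder block
v.enc.blk.0.ffn_up.weight    # hypothetical: feed-forward up projection
```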
`libllama` will be responsible for: