server: Bring back multimodal support #8010

Open
ngxson opened this issue Jun 19, 2024 · 15 comments · May be fixed by #9687
Labels
enhancement (New feature or request), llava (LLaVa and multimodal server)

Comments

@ngxson
Collaborator

ngxson commented Jun 19, 2024

Multimodal has been removed since #5882

Pending the refactoring of llava, we will be able to bring back this support: #6027

This issue is created mostly for tracking purposes. If someone wants to take on this task, feel free to comment below.

Currently, there is no concrete plan for this task.

@ZhenyaPav

Can someone advise me on what is the latest (working) commit before multimodal was removed?

@github-actions bot removed the stale label Jul 21, 2024
@HAV0X1014

It is rather annoying that multimodal support was removed from the server and has not been re-implemented for such a long time now (4 months?). Multimodal LLMs and interleaved image-and-text models have been growing in capability recently, and not being able to run models that used to work is unfortunate. Seemingly, the only way to restore this functionality is to downgrade to a version that lacks support for most new models and improvements.

I am not trying to demand that multimodal/llava support return, but to show that this feature on the server is missed.

@micoraweb

Hello, is there still no multimodal support in llama-server? According to the README for the llama.cpp HTTP server, it should be supported.

How can it be used with the OpenAI API format?

@its-ven

its-ven commented Aug 22, 2024

Have there been any updates on this?

@HAV0X1014

So far, it appears that there haven't been any updates. This really stinks because there were updates to llava recently to support new models.

@micoraweb

So, this functionality has been unavailable for months and there is no hope of getting it running? With all the amazing new models we could work with, such as MiniCPM and even Pixtral, can someone point us to a working server that can run the newer multimodal models? We just need something like llama-server that can run these multimodal GGUFs. Perhaps one of the llama.cpp examples can run a server? It's also important to have a standard (OpenAI) API to support standard interactions. It's so frustrating to wait months and months for such an important feature with no one even bothering to reply!

@ggerganov
Owner

Not much has changed since the issue was created. We need contributions to improve the existing vision code and people to maintain it. There is interest in reintroducing full multimodal support, but there are other things with higher priority that the core maintainers of the project are currently working on.

@ngxson
Collaborator Author

ngxson commented Sep 12, 2024

Just a reminder: llama-cpp-python currently has a server implementation that supports vision models (with an OAI-compatible API). You can use it as an alternative.

Of course, it would be much better to bring vision support into llama.cpp itself (instead of it remaining just a llava example). The problem is that the current code requires a big clean-up. We will eventually do that, as vision capability is becoming more and more mainstream.

@chigkim

chigkim commented Sep 26, 2024

@ggerganov, Meta released Llama-3.2 with multimodal capabilities. Does this affect the priority for core maintainers? I hope this question doesn’t come across as entitled...

@ggerganov
Owner

@chigkim My PoV is that adding multimodal support is a great opportunity for new people with good software architecture skills to get involved in the project. The general low- to mid-level patterns and details needed for the implementation are already available in the codebase - from model conversion to data loading, backend usage, and inference. It would take some high-level understanding of the project architecture to implement support for the vision models and extend the API in the correct way.

We really need more people with this sort of skillset, so at this point I feel it is better to wait and see if somebody will show up and take the opportunity to help out with the project long-term. Otherwise, I'm afraid we won't be able to sustain the quality of the project.

@hidden1nin

Great! These are good opportunities - from a developer's perspective, everyone loves to dive into the code. I would love to help but don't know where to start. Is there a list of requirements for the implementation, or should we just make something work for now? What would the finished implementation look like?

@mattepiu

Correct me if I'm wrong, but current multimodal open-source models are essentially just a usual LLM that also accepts images as input.
If so, keeping in mind that it should be modular (i.e. derivable to accept audio), it would essentially need a base class to load an image, convert/rescale/normalize it with a library like OpenCV, and then produce an output tensor that is used as input for the inference; after that, the output would not change and would remain purely text based (no image generation, no automatic target recognition with frames around objects). A minimal sketch of such a preprocessing step is shown after this list.
I can thus see a couple of issues:

  1. dependency on external libraries (like OpenCV, not the slimmest of dependencies)
  2. each LLM has its own way and its own code, so all the conversion operations from image to tensor should be optional or even swappable (a rearrangeable/programmable pipeline, to allow compatibility with as many models as possible)
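
For illustration only, here is a minimal sketch of such an OpenCV-free preprocessing step using the single-header stb libraries (the kind of lightweight dependency mentioned later in this thread). The struct and function names are hypothetical and the normalization constants are the usual CLIP ones; this is not llama.cpp code.

```cpp
// Illustrative OpenCV-free preprocessing sketch - not llama.cpp code.
// Uses the single-header stb_image / stb_image_resize (v1 API) libraries.
#include <cstdint>
#include <vector>

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_RESIZE_IMPLEMENTATION
#include "stb_image_resize.h"

struct image_f32 {
    int nx, ny;              // width and height after resizing
    std::vector<float> data; // nx * ny * 3 floats, normalized
};

// Load an image file, resize it to the vision encoder's input resolution,
// and normalize it with CLIP-style per-channel mean/std.
static bool load_and_preprocess(const char * path, int target_size, image_f32 & out) {
    int nx, ny, nc;
    uint8_t * pix = stbi_load(path, &nx, &ny, &nc, 3); // force RGB
    if (!pix) {
        return false;
    }

    std::vector<uint8_t> resized((size_t) target_size * target_size * 3);
    stbir_resize_uint8(pix, nx, ny, 0, resized.data(), target_size, target_size, 0, 3);
    stbi_image_free(pix);

    const float mean[3] = {0.48145466f, 0.45782750f, 0.40821073f};
    const float stdd[3] = {0.26862954f, 0.26130258f, 0.27577711f};

    out.nx = out.ny = target_size;
    out.data.resize(resized.size());
    for (size_t i = 0; i < resized.size(); i++) {
        const int c = i % 3; // interleaved RGB
        out.data[i] = (resized[i] / 255.0f - mean[c]) / stdd[c];
    }
    return true;
}
```

Where such a helper should live (libllama vs. the examples) is exactly the dependency question raised in point 1.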

@ngxson
Collaborator Author

ngxson commented Sep 27, 2024

IMO the llava example and the clip.cpp implementation are already a good start. Basically, what we need now is:

  • Define the list of API calls and data structs that must be added to llama.h. For example, something like clip_image_encode could become llama_clip_image_encode (see the sketch after this list)
  • Get rid of model surgery. If llama.cpp has native vision support, it's better to have both the language and vision models in one single gguf. The convert_hf_to_gguf.py script must be modified to handle this.
  • Expand llama_model and llama_context to hold vision data (model weights, temporary tensors, etc.). This will also allow using mmap and n_gpu_layers when loading the vision model, which is currently missing from clip.cpp
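
As referenced in the first bullet, a rough sketch of what such llama.h additions might look like; these declarations are purely illustrative assumptions about the shape of the API, not an agreed design or the actual code from #9687.

```cpp
// Hypothetical llama.h additions - illustrative only, not the real API.
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

struct llama_image;   // opaque: decoded + preprocessed image
struct llama_context; // existing

// load raw encoded image bytes (PNG/JPEG) and preprocess for the vision tower
LLAMA_API struct llama_image * llama_image_load_from_bytes(
        const unsigned char * data, size_t n_bytes);

// run the vision encoder - roughly what clip_image_encode does today
LLAMA_API int32_t llama_clip_image_encode(
        struct llama_context * ctx, const struct llama_image * img);

// retrieve the projected embeddings (n_patches x n_embd floats)
LLAMA_API const float * llama_clip_get_embeddings(const struct llama_context * ctx);
LLAMA_API int32_t       llama_clip_n_patches     (const struct llama_context * ctx);

LLAMA_API void llama_image_free(struct llama_image * img);

#ifdef __cplusplus
}
#endif
```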

@ggerganov
Owner

I would love to help but don't know where to start. Is there a list of requirements for the implementation, or should we just make something work for now?

It's hard to make a list of requirements - I personally don't have the expertise and experience needed to decide on the best way to integrate multimodal support. It mainly depends on how the functionality is used - what the input and output are. The core implementation should be 99% the same as for every transformer-based model.

What would the finished implementation look like?

Likely, libllama extended in a way to support clip and multimodal. When this is ready, llama-server should be extended to accept images.

Correct me if I'm wrong, but current multimodal open-source models are essentially just a usual LLM that also accepts images as input.

That's my understanding as well. In a similar way, Whisper is an LLM that, instead of text tokens, accepts raw audio, which is encoded and passed to the decoder.

dependency on external libraries (like OpenCV, not the slimmest of dependencies)

libllama should not depend on external libraries. The examples can depend on very lightweight, STB-like libraries. In particular, OpenCV is a no-go.

IMO the llava example and the clip.cpp implementation are already a good start

Yes, I agree.

Define the list of API calls and data structs that must be added to llama.h. For example, something like clip_image_encode could become llama_clip_image_encode

Yes, something along these lines, though I don't really have a good picture. Maybe even consider reusing llama_encode instead of a new CLIP-specific encoding API. After all, AFAIK, all encoders take embeddings as input and produce cross KV + new embeddings, regardless of whether it is text, audio, images, etc.
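
For illustration, reusing llama_encode could look roughly like the sketch below, where the vision tower's output is fed through llama_batch's embd field instead of token ids. The helper name and buffers are assumptions, and whether llama_encode or llama_decode is the right entry point is exactly the open question.

```cpp
#include <cstring>
#include "llama.h"

// Speculative sketch: feed vision-tower output through llama_batch.embd
// instead of adding a CLIP-specific encode API. `img_embd` is assumed to hold
// n_patches * n_embd floats already produced by the vision encoder.
static int32_t eval_image_embd(llama_context * ctx, const float * img_embd,
                               int32_t n_patches, int32_t n_embd, llama_pos n_past) {
    llama_batch batch = llama_batch_init(n_patches, /*embd =*/ n_embd, /*n_seq_max =*/ 1);
    batch.n_tokens = n_patches;
    std::memcpy(batch.embd, img_embd, (size_t) n_patches * n_embd * sizeof(float));
    for (int32_t i = 0; i < n_patches; i++) {
        batch.pos[i]       = n_past + i; // positions right after the preceding text tokens
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = false;      // no text logits needed at image positions
    }
    const int32_t ret = llama_encode(ctx, batch); // or llama_decode - the open question above
    llama_batch_free(batch);
    return ret;
}
```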

@ngxson linked pull request #9687 on Sep 29, 2024 that will close this issue
@ngxson
Collaborator Author

ngxson commented Sep 30, 2024

Yes, something along these lines, though I don't really have a good picture. Maybe even consider reusing llama_encode instead of a new CLIP-specific encoding API. After all, AFAIK, all encoders take embeddings as input and produce cross KV + new embeddings, regardless of whether it is text, audio, images, etc.

CLIP is quite different from Whisper because it doesn't use cross attention. Instead, the vision model outputs embeddings that can be taken as input by the language model. It also depends on the chat template to know where to put the embeddings. A common pattern that I observe looks like this:

<|im_start|>user
<image><put_embeddings_here></image>
what do you see in the image?
<|im_end|>

So, llama_encode may not be a good fit here, because we expect encode to enable cross attention (please correct me if I'm wrong here).

My current solution in #9687 is to first call llama_vision_encode, then call llama_vision_get_embeddings to get the embeddings. After that, add the image embeddings to a llama_batch, then decode the batch to generate text. A rough sketch of this flow is shown below.

Not sure if this is the best way to do it, though, as I'm still new to vision models. Feedback is welcome on this subject.
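
As mentioned above, a rough sketch of that flow; the llama_vision_* names follow the description of #9687 in this comment, while eval_text and eval_image_embd are hypothetical helpers (the latter along the lines of the llama_batch.embd sketch a few comments up, but decoding the batch).

```cpp
// Rough sketch of the flow described above - not the actual code from #9687.

// 1) run the vision tower on the preprocessed image
llama_vision_encode(ctx, img);                             // assumed signature
const float * img_embd = llama_vision_get_embeddings(ctx); // assumed signature

// 2) evaluate the prompt around the image, following the chat template above
n_past = eval_text(ctx, "<|im_start|>user\n<image>", n_past);
n_past = eval_image_embd(ctx, img_embd, n_patches, n_past); // llama_batch with .embd set, then llama_decode
n_past = eval_text(ctx, "</image>\nwhat do you see in the image?<|im_end|>\n", n_past);

// 3) from here on, sample and decode text tokens as usual
```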
