Llama-3.2 11B Vision Support #9643

yukiarimo · 2024-09-25T20:00:17Z

Is it working right now in any way?

stduhpf · 2024-09-25T20:14:31Z

Most likely not. I believe its architecture is closer to Pixtral (which is unsupported) than to Llava.

mirek190 · 2024-09-25T23:02:40Z

Currently any new vision+text models are not supported.

If llama.cpp wants to stay relevant in the future, it must implement text+vision models, as they will become more and more common.
Soon probably voice as well.

yukiarimo · 2024-09-25T23:14:33Z

True. Is it possible to run it somehow on macOS not using llama.cpp?

Animaxx · 2024-09-26T02:52:49Z

Not sure if MLX can run it...?

MaratG2 · 2024-09-26T06:32:13Z

I got an error saying the architecture is unsupported, which was to be expected.

MoonRide303 · 2024-09-26T07:26:10Z

A first milestone could be to ignore the non-text capabilities and simply let people use text input/output to interact with multimodal models.

Below is the output from the current (llama.cpp b3827) convert_hf_to_gguf.py, so that people searching for this error can find this issue:

INFO:hf-to-gguf:Loading model: Llama-3.2-11B-Vision-Instruct
ERROR:hf-to-gguf:Model MllamaForConditionalGeneration is not supported
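For anyone who wants to check a checkpoint before attempting a conversion, here is a minimal sketch. It assumes the model directory follows the usual Hugging Face layout, where `config.json` lists the model class under `architectures`. The `ASSUMED_SUPPORTED` set and both helper names are hypothetical and only for illustration; this is not how convert_hf_to_gguf.py itself decides what it supports.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical allow-list, for illustration only. The real converter keeps
# its own registry of supported model classes.
ASSUMED_SUPPORTED = frozenset({"LlamaForCausalLM", "MistralForCausalLM"})


def read_architectures(model_dir):
    """Return the `architectures` list from a Hugging Face config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("architectures", [])


def unsupported_architectures(model_dir, supported=ASSUMED_SUPPORTED):
    """Return the architecture names that are not in the `supported` set."""
    return [name for name in read_architectures(model_dir) if name not in supported]


if __name__ == "__main__":
    # Demo with a throwaway config.json that mimics the vision checkpoint.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "config.json").write_text(
            json.dumps({"architectures": ["MllamaForConditionalGeneration"]}),
            encoding="utf-8",
        )
        print(unsupported_architectures(tmp))
```

Pointing `unsupported_architectures` at a real download directory instead of the temporary one shows which class name the checkpoint declares, which is the same name that appears in the error above.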

thiswillbeyourgithub · 2024-09-26T13:13:12Z

FYI the repo owner has a related statement here:

> chigkim My PoV is that adding multimodal support is a great opportunity for new people with good software architecture skills to get involved in the project. The general low to mid level patterns and details needed for the implementation are already available in the codebase - from model conversion, to data loading, backend usage and inference. It would take some high-level understanding of the project architecture in order to implement support for the vision models and extend the API in the correct way.
>
> We really need more people with this sort of skillset, so at this point I feel it is better to wait and see if somebody will show up and take the opportunity to help out with the project long-term. Otherwise, I'm afraid we won't be able to sustain the quality of the project.

So if you know anyone with strong skills, or have influence over big tech and can motivate them, please do!

HanClinto · 2024-09-26T15:53:51Z

It may be helpful to draw a distinction between multimodal / vision support in the core llama.cpp library vs. multimodal / vision support in llama.cpp's server.

Multimodal support was removed from the server in #5882, but it was not removed from the core library / command-line. I believe that ggerganov's comments about looking for new developers to support vision models in the API refer to the server, not the core library.

This is how wrappers (such as ollama) are still able to provide an API to serve multimodal models via llama.cpp as a back-end. Ollama provides the HTTP server, and llama.cpp still does the core processing (including multi-modal support).

Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

thiswillbeyourgithub · 2024-09-26T15:57:33Z

Oh, thank you immensely for that clarification. I wasn't even aware that llama.cpp had a server, as it seems so redundant with other efforts such as ollama, so I agree with you. Thanks a lot!

JohannesGaessler · 2024-09-26T17:35:32Z

> Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

I may be misremembering, but I think ollama internally forwards its calls to a llama.cpp HTTP server. In any case, my personal opinion is that I would rather have a server be part of the project instead of having to rely on a third party.

HanClinto · 2024-09-26T18:03:28Z

> > Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.
>
> I may be misremembering, but I think ollama internally forwards its calls to a llama.cpp HTTP server. In any case, my personal opinion is that I would rather have a server be part of the project instead of having to rely on a third party.

Sorry to take this issue off topic, but hopefully it's relevant enough to warrant continuing it here.

Thank you for the correction -- it looks like you are right!

Best I can tell, ollama maintains a fork of llama.cpp's server that branched off of b2356 (our last release version that supported multimodal). Since then, they have continued updating that branch of server.cpp to add new features, remove the web front-end that we include in ours, maintain multimodal support, etc. I'm going through the diff to try and parse through what's being brought over and what's not (I'm not fully clear on their update strategy / method), but it seems to be a full-on fork at this point.

I haven't yet figured out how much their server maintains the spirit of the refactoring from #5882, or if merging their version of server.cpp into ours would be too much of a regression. If we're going to continue this discussion much further, perhaps opening a new issue to discuss sync'ing our version of server.cpp with ollama's would be useful?

Thellton · 2024-09-27T03:19:41Z

> It may be helpful to draw a distinction between multimodal / vision support in the core llama.cpp library vs. multimodal / vision support in llama.cpp's server.
>
> Multimodal support was removed from the server in #5882, but it was not removed from the core library / command-line. I believe that ggerganov's comments about looking for new developers to support vision models in the API refer to the server, not the core library.
>
> This is how wrappers (such as ollama) are still able to provide an API to serve multimodal models via llama.cpp as a back-end. Ollama provides the HTTP server, and llama.cpp still does the core processing (including multi-modal support).
>
> Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

Honestly, I agree with that last point. Maybe the server should be deprecated in favour of an API that lets others wrap a server around it, so that llama.cpp can focus on its core competency whilst others handle the specifics of actually serving inference. Having an actual server included as part of the program has been good, but are the program and the community well served by clinging to it now and into the future, when it's very apparent that others are perfectly happy to address the issue of serving models, as is happening with ollama?

The same might even be said of the number of inference backends (CUDA, ROCm, various CPUs of both x86 and ARM flavours, Vulkan, and SYCL), as there is technically a great deal of duplication of effort in that regard. It'd be interesting to see how technically feasible it would be for, say, Vulkan to reach feature parity with CUDA, ROCm, and CPU. Of course, at the end of the day, I'm basically all talk on this matter, as it'd be years before I'd have the competence to contribute :/

JohannesGaessler · 2024-09-27T07:25:54Z

Open source projects aren't run like a company. There isn't a boss at the top directing people to work on specific things; people choose what to work on of their own volition. Removing the server isn't going to result in more resources going towards other aspects of the project. It's going to result in the people who currently choose to contribute to and maintain the server being frustrated that their efforts were "wasted". The only aspect where I think there is a real zero-sum game is code review, since the vast majority of that work is done by Georgi and slaren.

Thellton · 2024-09-27T10:45:07Z

> Open source projects aren't run like a company. There isn't a boss at the top directing people to work on specific things; people choose what to work on of their own volition. Removing the server isn't going to result in more resources going towards other aspects of the project. It's going to result in the people who currently choose to contribute to and maintain the server being frustrated that their efforts were "wasted". The only aspect where I think there is a real zero-sum game is code review, since the vast majority of that work is done by Georgi and slaren.

Yeah... that kind of just accentuates how frustrating it is to want to help and contribute, and yet to know it'd be years before I could do so in a way that is useful :/ Basically just a plain bummer.

jrp2014 · 2024-09-29T17:23:37Z

It works OK on a MacBook Pro using the example code on the Hugging Face page. I tried to get it to caption and keyword some pics. It doesn't seem to understand what a keyword is and instead produces a good deal of prose. I preferred mlx-vlm + Llava 1.6, which was also a bit faster to run.

This was referenced Sep 26, 2024

Add the new Multi-Modal model of mistral AI: pixtral-12b mudler/LocalAI#3535

Open

llama3.2 vision models mudler/LocalAI#3669

Open

giladgd mentioned this issue Sep 26, 2024

feat: pass an image as part of the evaluation withcatai/node-llama-cpp#88

Open
