
Bug: llama-server crashes when started with --embeddings #8076

Closed
marcingomulkiewicz opened this issue Jun 23, 2024 · 13 comments · Fixed by #8420
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow)

Comments

@marcingomulkiewicz

What happened?

System: Mac M1, latest OS (14.5), latest llama.cpp (b3204), built in the standard way (make).
Steps to reproduce: start llama-server with --embeddings (tested with llama3/8b/fp16 and mistral/7b/Q8_0), go to the web UI, type anything.
Expected result: the system completes/responds.
Actual result: llama-server segfaults: llama_get_logits_ith: invalid logits id 23, reason: no logits / zsh: segmentation fault.
Other notes: The same build/models behave normally when llama-server is started without --embeddings. Similar issue confirmed on Linux/CUDA.

Name and Version

version: 3204 (45c0e2e)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Linux, Mac

Relevant log output

[...]
llama_kv_cache_init:      Metal KV buffer size =  4096.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.03 MiB
llama_new_context_with_model:      Metal compute buffer size =  2144.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="0x1f31c4c00" timestamp=1719140243 n_slots=1
INFO [                    init] new slot | tid="0x1f31c4c00" timestamp=1719140243 id_slot=0 n_ctx_slot=32768
INFO [                    main] model loaded | tid="0x1f31c4c00" timestamp=1719140243
INFO [                    main] chat template | tid="0x1f31c4c00" timestamp=1719140243 chat_example="[INST] You are a helpful assistant\nHello [/INST]Hi there</s>[INST] How are you? [/INST]" built_in=true
INFO [                    main] HTTP server listening | tid="0x1f31c4c00" timestamp=1719140243 port="8080" n_threads_http="19" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="0x1f31c4c00" timestamp=1719140243
INFO [      log_server_request] request | tid="0x16b85b000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56208 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x16b85b000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56208 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="0x16b8e7000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56209 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="0x16b973000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56211 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="0x1f31c4c00" timestamp=1719140258 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="0x1f31c4c00" timestamp=1719140258 id_slot=0 id_task=0 p0=0
llama_get_logits_ith: invalid logits id 23, reason: no logits
zsh: segmentation fault
@marcingomulkiewicz marcingomulkiewicz added bug-unconfirmed high severity Used to report high severity bugs in llama.cpp (Malfunctioning hinder important workflow) labels Jun 23, 2024
@ggerganov
Owner

Text completion should not be used with the --embedding option, though we should handle this better than just crashing and return an error instead.

cc @iamlemec

@marcingomulkiewicz
Author

IMHO it's not a good idea for llama-server to run only as either completion or embedding (but not both). In an "enterprise" RAG setup there would be two separate models for embedding and generation, but in "home" setups it makes sense to use a single model for both, since it saves memory.

@iamlemec
Collaborator

Indeed. Can we just call llama_set_embeddings before the respective llama_decode calls?
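For reference, a minimal sketch of what that could look like around the decode call; llama_set_embeddings and llama_decode are the existing llama.h functions, while the context and batch are assumed to be prepared elsewhere by the server:

```cpp
#include "llama.h"

// Minimal sketch (not the actual server code): toggle embedding output
// per batch right before decoding. `ctx` and `batch` are assumed to be
// prepared elsewhere by the server.
static int decode_for_task(llama_context * ctx, llama_batch batch, bool is_embedding_task) {
    // When true, llama_decode fills the embedding buffers instead of logits for this batch.
    llama_set_embeddings(ctx, is_embedding_task);
    return llama_decode(ctx, batch);
}
```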

@ggerganov
Owner

We would also need to guarantee that the batch being decoded contains only embedding tokens or only completion tokens. This could be determined by the first slot that is added to the batch; after that, we add only slots with the same task type as the first one.
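A rough illustration of that selection rule; the `server_slot` struct and its fields here are hypothetical stand-ins for the real structures in server.cpp:

```cpp
#include <vector>

// Hypothetical sketch of the proposed rule: the first slot added to the
// batch decides whether this is an embedding batch or a completion batch,
// and slots of the other kind are deferred to a later batch.
struct server_slot {
    bool is_embedding; // true if this slot serves an embedding request
    bool has_work;     // true if this slot has a pending task
};

static void pick_slots_for_batch(std::vector<server_slot> & slots, std::vector<server_slot *> & picked) {
    bool type_set           = false;
    bool batch_is_embedding = false;
    for (auto & slot : slots) {
        if (!slot.has_work) {
            continue;
        }
        if (!type_set) {
            batch_is_embedding = slot.is_embedding; // first slot sets the batch type
            type_set = true;
        } else if (slot.is_embedding != batch_is_embedding) {
            continue; // different task type: leave it for the next batch
        }
        picked.push_back(&slot);
    }
}
```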

@jdwx

jdwx commented Jun 25, 2024

I don't know if it's helpful, but completion and embedding coexisted peacefully (provided you didn't mix batches) up until commit 80ea089.

I'm entirely unfamiliar with this codebase, but I took a look, and while it seemed like it should be simple to restore the previous behavior in llama.cpp without trashing the LLAMA_POOLING_TYPE_LAST stuff, a couple of hours later I'm not really any closer.

If I've understood server.cpp correctly (which is not a given), it seems like this issue occurs because the server unconditionally passes the embedding setting into the internals if the --embeddings flag is given. Would it make sense / be difficult to make that conditional on whether a given request is actually on one of the embedding endpoints?

This would be separate from the need to ensure that a single batch only contains one or the other, but it should at least restore the previous functionality. (If need be, it could be limited to the case where there's only one slot, which is certainly one way to avoid collisions.)

If that's acceptable, it's something I can try to look into, but there's definitely a learning curve here, and I'm at the bottom of it, staring upward with a dumb look on my face.

@iamlemec
Collaborator

Thanks for the info @jdwx! Yeah, I just tried running a generative model with --embeddings, and while embeddings work fine, asking for a completion crashes when it finds there are no logits. It does seem like the logic change from #7477 takes the cparams.embeddings parameter a bit more seriously, in a way that affects the server.

I think we basically have to go all the way and ensure pure separation of batches to fix this. I took a stab at it in iamlemec/llama.cpp:server-embed, but there are still some kinks to be worked out, especially as it relates to pooling.

Note that even with the old behavior, the embeddings that come out will be a bit arbitrary in the absence of a pooling specifier. In the case of a single-sequence batch, you'll get the embedding for the first token. But if it's a multi-sequence batch, it will be the n'th token for the n'th sequence. Right now, the pooling option is restricted to the embedding scope, so we might need to expand that too.
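For context, the extraction side looks roughly like this depending on whether a pooling type is configured; llama_pooling_type, llama_get_embeddings_seq, and llama_get_embeddings_ith are the existing llama.h calls, and the surrounding setup is elided:

```cpp
#include "llama.h"

// Sketch of how embeddings are read back after llama_decode. With a pooling
// type set there is one vector per sequence; without one, you get per-token
// vectors, which is what makes the "arbitrary token" behavior above possible.
static const float * read_embedding(llama_context * ctx, llama_seq_id seq_id, int32_t token_idx) {
    if (llama_pooling_type(ctx) != LLAMA_POOLING_TYPE_NONE) {
        return llama_get_embeddings_seq(ctx, seq_id); // pooled: one vector per sequence
    }
    return llama_get_embeddings_ith(ctx, token_idx);  // unpooled: per-token embedding
}
```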

@iamlemec
Collaborator

I believe I have a simple fix of only a few added lines. @ggerganov should I incorporate this into #8087 or make a new PR?

@ggerganov
Owner

A new PR seems better. Thanks!

@beginor

beginor commented Jul 5, 2024

Waiting for this issue to be fixed. I have to run two llama-server instances now, one for completion and one for embedding.

@iamlemec
Collaborator

iamlemec commented Jul 5, 2024

@beginor and other folks doing dual-use here: what sorts of models are you using? The only model I've seen that is designed for this type of thing is GritLM. For that one, you need to use causal attention for generation and non-causal attention for embeddings. Any fix that ensures batches are only either generation or embedding will have to take a stance on what to do about the attention type.

Also, are you using this for token level embeddings or sequence level embeddings?
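For reference, a minimal sketch of the GritLM-style switching described above, using the existing llama_set_causal_attn and llama_set_embeddings calls; batch construction is elided, and this assumes the two kinds of work are already kept in separate batches:

```cpp
#include "llama.h"

// Sketch of dual-use switching for a GritLM-style model: non-causal
// attention with embeddings enabled for the embedding pass, causal
// attention with logits for the generation pass. The context and the
// two batches are assumed to be built elsewhere.
static void run_mixed(llama_context * ctx, llama_batch embd_batch, llama_batch gen_batch) {
    // Embedding pass
    llama_set_causal_attn(ctx, false);
    llama_set_embeddings(ctx, true);
    llama_decode(ctx, embd_batch);

    // Generation pass
    llama_set_causal_attn(ctx, true);
    llama_set_embeddings(ctx, false);
    llama_decode(ctx, gen_batch);
}
```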

@beginor

beginor commented Jul 6, 2024

@iamlemec I have tried qwen2-7b-instruct-fp16.gguf, meta-llama-3-8b-instruct.fp16.gguf, phi-3-mini-4k-instruct-fp16.gguf, and gemma-2-9b-it-q5_k_m.gguf.

I want to use them for sequence-level embeddings, following the getting-started guide from GraphRAG and changing the API base URL to point at llama-server. I have also been doing some embedding tests with RAG.

@josharian

josharian commented Jul 9, 2024

@iamlemec we are using Gemma and Llama family models, for sequence level embeddings. We could separate out embeddings from generations, but being able to switch back and forth on a batch-by-batch basis lets us do less GPU juggling and keep our coordination code simpler. (And to answer the implicit question, we would be happy to use causal attention when generating embeddings.)

@iamlemec
Collaborator

Thanks @beginor and @josharian! Just submitted #8420 to address this.
