
Bug: llama-server crashes when started with --embeddings #8076

Closed
marcingomulkiewicz opened this issue Jun 23, 2024 · 13 comments · Fixed by #8420
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow)

Comments

@marcingomulkiewicz

What happened?

System: Mac M1, latest OS (14.5), latest llama.cpp (b3204), built in the standard way (make).
Steps to reproduce: start llama-server with --embeddings (tested with llama3/8b/fp16 and mistral/7b/Q8_0), go to the web UI, type anything.
Expected result: the system completes/responds.
Actual result: llama-server segfaults: llama_get_logits_ith: invalid logits id 23, reason: no logits / zsh: segmentation fault.
Other notes: The same build/models behave normally when llama-server is started without --embeddings. Similar issue confirmed on Linux/CUDA.

Name and Version

version: 3204 (45c0e2e)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Linux, Mac

Relevant log output

[...]
llama_kv_cache_init:      Metal KV buffer size =  4096.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.03 MiB
llama_new_context_with_model:      Metal compute buffer size =  2144.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="0x1f31c4c00" timestamp=1719140243 n_slots=1
INFO [                    init] new slot | tid="0x1f31c4c00" timestamp=1719140243 id_slot=0 n_ctx_slot=32768
INFO [                    main] model loaded | tid="0x1f31c4c00" timestamp=1719140243
INFO [                    main] chat template | tid="0x1f31c4c00" timestamp=1719140243 chat_example="[INST] You are a helpful assistant\nHello [/INST]Hi there</s>[INST] How are you? [/INST]" built_in=true
INFO [                    main] HTTP server listening | tid="0x1f31c4c00" timestamp=1719140243 port="8080" n_threads_http="19" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="0x1f31c4c00" timestamp=1719140243
INFO [      log_server_request] request | tid="0x16b85b000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56208 status=200 method="GET" path="/" params={}
INFO [      log_server_request] request | tid="0x16b85b000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56208 status=200 method="GET" path="/index.js" params={}
INFO [      log_server_request] request | tid="0x16b8e7000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56209 status=200 method="GET" path="/completion.js" params={}
INFO [      log_server_request] request | tid="0x16b973000" timestamp=1719140252 remote_addr="127.0.0.1" remote_port=56211 status=200 method="GET" path="/json-schema-to-grammar.mjs" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="0x1f31c4c00" timestamp=1719140258 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="0x1f31c4c00" timestamp=1719140258 id_slot=0 id_task=0 p0=0
llama_get_logits_ith: invalid logits id 23, reason: no logits
zsh: segmentation fault
@marcingomulkiewicz marcingomulkiewicz added bug-unconfirmed high severity Used to report high severity bugs in llama.cpp (Malfunctioning hinder important workflow) labels Jun 23, 2024
@ggerganov
Owner

Text completion should not be used with the --embedding option, though we should handle this better than just crashing and return an error instead.

cc @iamlemec

@marcingomulkiewicz
Author

IMHO it's not a good idea for llama-server to run only as either completion or embedding (but not both). In an "enterprise" RAG setup there would be two separate models for embedding and generation, but in "home" setups it makes sense to use a single model for both, since it saves memory.

@iamlemec
Collaborator

Indeed. Can we just call llama_set_embeddings before the respective llama_decode calls?
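For reference, a minimal sketch of what that could look like around the decode call; llama_set_embeddings and llama_decode are the existing llama.h functions, while the context and batch are assumed to be prepared elsewhere by the server:

```cpp
#include "llama.h"

// Minimal sketch (not the actual server code): toggle embedding output
// per batch right before decoding. `ctx` and `batch` are assumed to be
// prepared elsewhere by the server.
static int decode_for_task(llama_context * ctx, llama_batch batch, bool is_embedding_task) {
    // When true, llama_decode fills the embedding buffers instead of logits for this batch.
    llama_set_embeddings(ctx, is_embedding_task);
    return llama_decode(ctx, batch);
}
```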

@ggerganov
Owner

We would also need to guarantee that the batch being decoded contains only embedding tokens or only completion tokens. This could be determined by the first slot that is added to the batch; after that, we add only slots with the same task type as the first one.
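A rough illustration of that selection rule; the `server_slot` struct and its fields here are hypothetical stand-ins for the real structures in server.cpp:

```cpp
#include <vector>

// Hypothetical sketch of the proposed rule: the first slot added to the
// batch decides whether this is an embedding batch or a completion batch,
// and slots of the other kind are deferred to a later batch.
struct server_slot {
    bool is_embedding; // true if this slot serves an embedding request
    bool has_work;     // true if this slot has a pending task
};

static void pick_slots_for_batch(std::vector<server_slot> & slots, std::vector<server_slot *> & picked) {
    bool type_set           = false;
    bool batch_is_embedding = false;
    for (auto & slot : slots) {
        if (!slot.has_work) {
            continue;
        }
        if (!type_set) {
            batch_is_embedding = slot.is_embedding; // first slot sets the batch type
            type_set = true;
        } else if (slot.is_embedding != batch_is_embedding) {
            continue; // different task type: leave it for the next batch
        }
        picked.push_back(&slot);
    }
}
```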

@jdwx

jdwx commented Jun 25, 2024

I don't know if it's helpful, but completion and embedding coexisted peacefully (provided you didn't mix batches) up until commit 80ea089.

I'm entirely unfamiliar with this codebase, but I took a look, and while it seemed like it should be simple to restore the previous behavior in llama.cpp without trashing the LLAMA_POOLING_TYPE_LAST stuff, a couple of hours later I'm not really any closer.

If I've understood server.cpp correctly (which is not a given), it seems like this issue occurs because the server unconditionally passes the embedding setting into the internals if the --embeddings flag is given. Would it make sense / be difficult to make that conditional on whether a given request is actually on one of the embedding endpoints?

This would be separate from the need to ensure that a single batch only contains one or the other, but it should at least restore the previous functionality. (If need be, it could be limited to the case where there's only one slot, which is certainly one way to avoid collisions.)

If that's acceptable, it's something I can try to look into, but there's definitely a learning curve here, and I'm at the bottom of it, staring upward with a dumb look on my face.

@iamlemec
Collaborator

Thanks for the info @jdwx! Yeah, I just tried running a generative model with --embeddings, and while embeddings work fine, asking for a completion crashes when it finds there are no logits. It does seem like the logic change from #7477 takes the cparams.embeddings parameter a bit more seriously, in a way that affects the server.

I think we basically have to go all the way and ensure pure separation of batches to fix this. I took a stab at it in iamlemec/llama.cpp:server-embed, but there are still some kinks to be worked out, especially as it relates to pooling.

Note that even with the old behavior, the embeddings that come out will be a bit arbitrary in the absence of a pooling specifier. In the case of a single-sequence batch, you'll get the embedding for the first token. But if it's a multi-sequence batch, it will be the n'th token for the n'th sequence. Right now, the pooling option is restricted to the embedding scope, so we might need to expand that too.
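For context, the extraction side looks roughly like this depending on whether a pooling type is configured; llama_pooling_type, llama_get_embeddings_seq, and llama_get_embeddings_ith are the existing llama.h calls, and the surrounding setup is elided:

```cpp
#include "llama.h"

// Sketch of how embeddings are read back after llama_decode. With a pooling
// type set there is one vector per sequence; without one, you get per-token
// vectors, which is what makes the "arbitrary token" behavior above possible.
static const float * read_embedding(llama_context * ctx, llama_seq_id seq_id, int32_t token_idx) {
    if (llama_pooling_type(ctx) != LLAMA_POOLING_TYPE_NONE) {
        return llama_get_embeddings_seq(ctx, seq_id); // pooled: one vector per sequence
    }
    return llama_get_embeddings_ith(ctx, token_idx);  // unpooled: per-token embedding
}
```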

@iamlemec
Collaborator

I believe I have a simple fix of only a few added lines. @ggerganov should I incorporate this into #8087 or make a new PR?

@ggerganov
Owner

A new PR seems better. Thanks!

@beginor

beginor commented Jul 5, 2024

Waiting for this issue to be fixed. I have to run two llama-server instances now, one for completion and one for embedding.

@iamlemec
Collaborator

iamlemec commented Jul 5, 2024

@beginor and other folks doing dual-use here: what sorts of models are you using? The only model I've seen that is designed for this type of thing is GritLM. For that one, you need to use causal attention for generation and non-causal attention for embeddings. Any fix that ensures batches are only either generation or embedding will have to take a stance on what to do about the attention type.

Also, are you using this for token level embeddings or sequence level embeddings?
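For reference, a minimal sketch of the GritLM-style switching described above, using the existing llama_set_causal_attn and llama_set_embeddings calls; batch construction is elided, and this assumes the two kinds of work are already kept in separate batches:

```cpp
#include "llama.h"

// Sketch of dual-use switching for a GritLM-style model: non-causal
// attention with embeddings enabled for the embedding pass, causal
// attention with logits for the generation pass. The context and the
// two batches are assumed to be built elsewhere.
static void run_mixed(llama_context * ctx, llama_batch embd_batch, llama_batch gen_batch) {
    // Embedding pass
    llama_set_causal_attn(ctx, false);
    llama_set_embeddings(ctx, true);
    llama_decode(ctx, embd_batch);

    // Generation pass
    llama_set_causal_attn(ctx, true);
    llama_set_embeddings(ctx, false);
    llama_decode(ctx, gen_batch);
}
```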

@beginor

beginor commented Jul 6, 2024

@iamlemec I have tried qwen2-7b-instruct-fp16.gguf, meta-llama-3-8b-instruct.fp16.gguf, phi-3-mini-4k-instruct-fp16.gguf, and gemma-2-9b-it-q5_k_m.gguf.

I want to use them for sequence-level embeddings, following the getting-started guide from GraphRAG and changing the API base URL to point at llama-server. I have also been doing some embedding tests with RAG.

@josharian

josharian commented Jul 9, 2024

@iamlemec we are using Gemma and Llama family models, for sequence level embeddings. We could separate out embeddings from generations, but being able to switch back and forth on a batch-by-batch basis lets us do less GPU juggling and keep our coordination code simpler. (And to answer the implicit question, we would be happy to use causal attention when generating embeddings.)

@iamlemec
Collaborator

Thanks @beginor and @josharian! Just submitted #8420 to address this.
