Bug: Server Embedding Segmentation Fault #8243
Comments
I get the same error when running it. The server output is:
Just try calling:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about python exceptions"
      }
    ]
  }'
```

The result is `curl: (52) Empty reply from server`. I have tried this with other models as well.
There is already an issue about this; I think they are the same: #8076
Yes, this is related, and it was also expected. I actually made a review comment about this use case at the time. The fix will likely involve properly using

I think that what previously happened was that the server gave causal embeddings for non-BERT models instead of non-causal ones, but I might be wrong on that. It's not exactly straightforward to fix, because a single call to

Hmm. Or, to make that fix fairer: if there's any slot requesting embeddings and the model is not yet in embedding mode, switch to that and dedicate the next

The workaround is currently to run two instances of
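The two-instance workaround could look roughly like this. This is a hedged sketch only: the `llama-server` binary name, the `--embedding` flag, the model path, and the ports are assumptions based on typical llama.cpp usage and may differ by version.

```shell
# Sketch of the two-instance workaround (binary and flag names
# are assumptions; check your llama.cpp version's --help output).

# Instance 1: serves completions only.
./llama-server -m model.gguf --port 8080 &

# Instance 2: serves embeddings only.
./llama-server -m model.gguf --embedding --port 8081 &
```

Clients would then send chat/completion requests to port 8080 and embedding requests to port 8081, instead of mixing both request types on a single server process.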
Since this issue seems to be already known (and already being worked on, #8076 (comment)), I'm closing it in favor of #8076. Thank you @compilade for your nice explanation. I'm not too familiar with the concept of causal and non-causal embeddings. Do you maybe have a reference where I could read more about this?
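For readers with the same question: the distinction comes down to the attention mask. A causal (decoder-only) model lets each token attend only to earlier tokens, so its hidden states reflect only the left context, while a non-causal (bidirectional, BERT-style) embedding model lets every token attend to the whole sequence. The following toy NumPy sketch (not llama.cpp code, just an illustration) shows the two mask shapes:

```python
import numpy as np

n = 4  # toy sequence length

# Causal mask: token i may attend only to tokens j <= i
# (lower-triangular), as in decoder-only LLMs.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Non-causal mask: every token attends to every token
# (bidirectional), as in BERT-style embedding models.
non_causal_mask = np.ones((n, n), dtype=bool)

# The first token's "view" differs: under the causal mask it
# sees only itself, so its hidden state ignores later tokens.
print(causal_mask[0])      # [ True False False False]
print(non_causal_mask[0])  # [ True  True  True  True]
```

This is why embeddings produced by a model left in causal mode differ from the non-causal embeddings a BERT-style model is trained to produce, and why the server has to switch modes rather than serve both from the same state.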
What happened?
When starting the server in embedding mode, requests to the `/complete` endpoint result in a segmentation fault (other endpoints might be affected too). Tested on a MacBook Air M1 and an RTX 4090.
The embedding API recently changed, so it might be intended that completion and embedding mode can't be used simultaneously (?).
However, I remember both working simultaneously in the past and couldn't find this behavior documented anywhere.
Also, the error should probably be handled more gracefully.
Steps to reproduce:
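Based on the description above, a minimal reproduction might look like the following. The binary name, `--embedding` flag, model path, and endpoint path are assumptions and may differ by llama.cpp version.

```shell
# Start the server in embedding mode (flag name is an assumption):
./llama-server -m model.gguf --embedding --port 8080

# In another terminal, hit the completion endpoint:
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 8}'

# Per the report: the server crashes with a segmentation fault
# and curl reports "curl: (52) Empty reply from server".
```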
Name and Version
```shell
$ ./llama-cli --version
version: 3276 (cb5fad4)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
```
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output