Streamline embeddings from "non-embedding" models #8087
Conversation
llama.cpp

```diff
@@ -12618,14 +12618,15 @@ static int llama_decode_internal(
     std::vector<llama_seq_id *> seq_id_arr;
     std::vector<std::vector<llama_seq_id>> seq_id;

+    // this indicates we are doing pooled embedding, so we ignore batch.logits and output all tokens
+    bool embed_pooled = cparams.embeddings && cparams.pooling_type != LLAMA_POOLING_TYPE_NONE;
```
Naming this `embd_pooled` seems more in line with the usage of the `embd` identifier in `llama.cpp`:

```suggestion
const bool embd_pooled = cparams.embeddings && cparams.pooling_type != LLAMA_POOLING_TYPE_NONE;
```
yup, agreed.
Still looks good to me.
@iamlemec, you'll need to merge master into this branch first to resolve the conflicts; some files have moved.
@compilade cool! just rebased to master
The goal here is to get the big embedding models at the top of the MTEB leaderboard working. There are two changes:

1. `batch.logits` is fully ignored for pooled embeddings.
2. Adds `attention_type` to `llama_context_params` that allows for causal, non-causal, or unspecified (model default).

With this PR, we can get accurate results (matching HF) from at least the number 2 spot, `gte-Qwen2-7B-instruct`. For instance, with the command:

```shell
./llama-embedding -m gte-qwen2-7b-instruct-f16.gguf -p "hello world" -ngl 99 --pooling last --attention non-causal -c 512
```