Can llama.cpp support batched prompts? #10299
-
I am new to llama.cpp. I have browsed the discussions and issues to find out how to run inference on multiple requests together. For example, a single prompt gives a single input array, while several prompts submitted together stack into a matrix. In my opinion, processing several prompts together is faster than processing them separately. Llama provides batched requests; I wonder whether llama.cpp has a similar feature. By the way, n_batch and n_ubatch in llama.cpp seem to refer to the chunk size within a single prompt, not the number of prompts processed together.
Replies: 1 comment 1 reply
-
Yes, the easiest way to use it is with llama-server, where you create multiple slots.
Edit:
Long story short, you set the number of parallel requests (and parallel HTTP threads) to the batch size.
Total context size = batch_size * individual context (4096 or 8192, etc.).
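For concreteness, here is a minimal client sketch of that setup. It assumes the server was started with parallel slots enabled, e.g. something like `llama-server -m model.gguf -c 16384 -np 4` (4 slots, each getting 4096 tokens of the total context; exact flags can vary between versions), that it listens on the default `localhost:8080`, and that the native `/completion` endpoint with `prompt`/`n_predict`/`content` fields is available. The client simply fires several prompts concurrently so the server can decode them in the same batch instead of one after another.

```python
# Sketch: send several prompts concurrently to a running llama-server so
# its parallel slots can batch them. Endpoint and field names are assumed
# from a typical llama-server setup and may differ between versions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8080/completion"  # assumed default host/port

def complete(prompt: str) -> str:
    # POST one prompt and return the generated text.
    payload = json.dumps({"prompt": prompt, "n_predict": 64}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [
    "Explain what a KV cache is.",
    "Write a haiku about batching.",
    "Summarize the attention mechanism in one sentence.",
]

# Fire the requests in parallel; the server assigns each one to a free slot
# and processes them together rather than sequentially.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(complete, prompts)):
        print(f"--- {prompt}\n{answer}\n")
```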