Can llama.cpp support batched prompts? #10299
-
I am new to llama.cpp. I have browsed the discussions and issues to find out how to run inference on multiple requests together. For example, a single prompt gives a single input array, while several prompts submitted together stack into a matrix. In my opinion, processing several prompts together is faster than processing them separately. Llama provides batched requests; I wonder whether llama.cpp has a similar feature. By the way, n_batch and n_ubatch in llama.cpp seem to refer to the chunk size within a single prompt, not the number of prompts processed together.
Replies: 1 comment 1 reply
-
Yes, the easiest way to use it is with llama-server, where you create multiple slots.
Edit:
Long story short, you set the number of parallel requests (and parallel HTTP threads) to the batch size.
Total context size = batch_size * individual context (4096 or 8192, etc.).
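For concreteness, here is a minimal client sketch of that setup. It assumes the server was started with parallel slots enabled, e.g. something like `llama-server -m model.gguf -c 16384 -np 4` (4 slots, each getting 4096 tokens of the total context; exact flags can vary between versions), that it listens on the default `localhost:8080`, and that the native `/completion` endpoint with `prompt`/`n_predict`/`content` fields is available. The client simply fires several prompts concurrently so the server can decode them in the same batch instead of one after another.

```python
# Sketch: send several prompts concurrently to a running llama-server so
# its parallel slots can batch them. Endpoint and field names are assumed
# from a typical llama-server setup and may differ between versions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8080/completion"  # assumed default host/port

def complete(prompt: str) -> str:
    # POST one prompt and return the generated text.
    payload = json.dumps({"prompt": prompt, "n_predict": 64}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [
    "Explain what a KV cache is.",
    "Write a haiku about batching.",
    "Summarize the attention mechanism in one sentence.",
]

# Fire the requests in parallel; the server assigns each one to a free slot
# and processes them together rather than sequentially.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(complete, prompts)):
        print(f"--- {prompt}\n{answer}\n")
```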