How to swap LoRA during a post request to llama-server? #8849
-
I recently came across @ngxson's PR: #8332. The idea of swapping LoRAs at runtime is fascinating! Question: for a custom deployment, how does one actively swap the LoRA depending on the user's request? I could not find any parameters for this, though.

PS: One workaround could be to launch multiple instances of llama-server and let the user choose which instance to query, but that is limited by the amount of RAM available, since each instance hosts its own copy of the base model. If all LoRA weights could instead live in a single llama.cpp deployment, it would be much more memory efficient, especially on small devices and personal computers.
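To illustrate the idea, here is a minimal sketch of per-request LoRA switching against a single llama-server instance. It assumes the server was started with several adapters loaded (e.g. `llama-server -m base.gguf --lora a.gguf --lora b.gguf`) and exposes a `POST /lora-adapters` endpoint taking a list of `{"id": ..., "scale": ...}` objects; the endpoint shape and the default address are assumptions based on the hot-swap work, so verify them against your build:

```python
import json
import urllib.request

# Assumed default llama-server address; adjust for your deployment.
SERVER = "http://localhost:8080"


def build_adapter_scales(adapter_id: int, num_adapters: int) -> bytes:
    """Build a JSON body that enables one adapter and zeroes out the rest."""
    body = [
        {"id": i, "scale": 1.0 if i == adapter_id else 0.0}
        for i in range(num_adapters)
    ]
    return json.dumps(body).encode("utf-8")


def swap_lora(adapter_id: int, num_adapters: int) -> None:
    """POST the scale list so subsequent completions use the chosen LoRA.

    Hypothetical client for the assumed /lora-adapters endpoint.
    """
    req = urllib.request.Request(
        f"{SERVER}/lora-adapters",
        data=build_adapter_scales(adapter_id, num_adapters),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

A request handler in front of the server could then call `swap_lora(...)` before forwarding each user's completion request, keeping only one copy of the base model in memory.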
Replies: 2 comments 11 replies
-
I drafted a PR (#8857), could you try it out? Thanks.
-
Hi @ngxson, really appreciate your work on LoRA!