How to swap LoRA during a post request to llama-server? #8849
-
I recently came across @ngxson's PR: #8332. The idea of swapping LoRAs at runtime is fascinating! Question: for a custom deployment, how does one actively swap the LoRA depending on the user's request? I could not find any parameters for this, though.

PS: One workaround could be to launch multiple instances of llama-server and let the user choose which instance to query, but that is limited by the amount of RAM available, since each instance hosts its own copy of the base model. If all LoRA weights could instead live in a single llama.cpp deployment, it would be much more memory efficient, especially on small devices and personal computers.
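To illustrate the idea, here is a minimal sketch of per-request LoRA switching against a single llama-server instance. It assumes the server was started with several adapters loaded (e.g. `llama-server -m base.gguf --lora a.gguf --lora b.gguf`) and exposes a `POST /lora-adapters` endpoint taking a list of `{"id": ..., "scale": ...}` objects; the endpoint shape and the default address are assumptions based on the hot-swap work, so verify them against your build:

```python
import json
import urllib.request

# Assumed default llama-server address; adjust for your deployment.
SERVER = "http://localhost:8080"


def build_adapter_scales(adapter_id: int, num_adapters: int) -> bytes:
    """Build a JSON body that enables one adapter and zeroes out the rest."""
    body = [
        {"id": i, "scale": 1.0 if i == adapter_id else 0.0}
        for i in range(num_adapters)
    ]
    return json.dumps(body).encode("utf-8")


def swap_lora(adapter_id: int, num_adapters: int) -> None:
    """POST the scale list so subsequent completions use the chosen LoRA.

    Hypothetical client for the assumed /lora-adapters endpoint.
    """
    req = urllib.request.Request(
        f"{SERVER}/lora-adapters",
        data=build_adapter_scales(adapter_id, num_adapters),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

A request handler in front of the server could then call `swap_lora(...)` before forwarding each user's completion request, keeping only one copy of the base model in memory.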
Replies: 2 comments 11 replies
-
I drafted a PR (#8857), could you try it out? Thanks.
-
Hi @ngxson, really appreciate your work on LoRA!