How does llama.cpp manage the memory? #6324
Replies: 3 comments 14 replies
-
Hi, the inputs and temporary tensors used in the computation graph are allocated by …
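To illustrate the general idea behind allocating temporary tensors for a computation graph, here is a toy sketch of a graph memory planner. This is NOT llama.cpp's actual allocator; it is a hypothetical, simplified model of the common technique: walk the graph in execution order, and once a tensor's last consumer has run, return its buffer to a free list so later tensors can reuse it, keeping peak memory low.

```python
# Toy graph-memory planner (hypothetical, for illustration only).
# graph: list of (name, size_bytes, input_names) in execution order.
def plan(graph):
    sizes = {name: size for name, size, _ in graph}
    # Record the index of each tensor's last consumer.
    last_use = {}
    for i, (_, _, inputs) in enumerate(graph):
        for inp in inputs:
            last_use[inp] = i
    free = []            # freed (offset, size) blocks available for reuse
    offsets, peak = {}, 0
    for i, (name, size, inputs) in enumerate(graph):
        # First-fit: reuse a freed block if one is big enough...
        for j, (off, sz) in enumerate(free):
            if sz >= size:
                offsets[name] = off
                free.pop(j)
                break
        else:
            # ...otherwise grow the arena.
            offsets[name] = peak
            peak += size
        # Release inputs whose last consumer is this node. Note the
        # release happens AFTER this node's output is placed, so an op
        # never aliases its own inputs (no in-place ops in this sketch).
        for inp in inputs:
            if last_use.get(inp) == i:
                free.append((offsets[inp], sizes[inp]))
    return offsets, peak

# Chain a -> b -> c of 64-byte tensors: c can reuse a's buffer once a
# is dead, so peak memory is 128 bytes instead of 192.
g = [("a", 64, []), ("b", 64, ["a"]), ("c", 64, ["b"])]
offsets, peak = plan(g)
```

A real allocator must additionally handle tensor alignment, views, in-place operations, and multiple backend buffers, but the reuse-after-last-use principle is the same.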
-
Hi slaren, thank you. …
-
@slaren Hi slaren, I wonder whether executing this graph is an asynchronous process. If some nodes of the graph run on the CPU and others on the GPU, is there a possibility of parallelism? Is there an existing issue or discussion about this? |
-
Hello, I'm wondering how llama.cpp manages memory:
I hope I described my confusion properly. Thanks for your attention.