How does llama.cpp work? #4531
-
Essentially, we load data, build a computation graph and compute it. The main focus is running on the CPU with efficient SIMD instructions - this is built into the core implementation of the library (ggml).
So it is a general API that makes it easier to start running models. One of the simplest examples of using it is the examples/simple program in the repository.
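To make that "load data, build a graph, compute it" idea concrete, here is a minimal sketch using the ggml API directly. This is an illustrative example, not from the thread; it assumes a reasonably recent ggml tree, since names like `ggml_new_graph` and `ggml_graph_compute_with_ctx` have shifted between versions:

```cpp
// Minimal sketch of the ggml flow: load data, build a graph, compute it.
#include "ggml.h"
#include <cstdio>

int main() {
    // Allocate a working buffer for tensors and graph metadata
    struct ggml_init_params params = { /*mem_size =*/ 16*1024*1024, /*mem_buffer =*/ NULL, /*no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);

    // "Load data": two tiny tensors standing in for model weights/inputs
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    // "Build a computation graph": nothing runs yet, the op is only recorded
    struct ggml_tensor * c = ggml_mul(ctx, a, b);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // "Compute it": walk the graph on the CPU (the SIMD kernels live here)
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    printf("3 * 4 = %f\n", ggml_get_f32_1d(c, 0));
    ggml_free(ctx);
    return 0;
}
```

Deferring execution to an explicit graph is what lets the same model definition be scheduled on different backends later.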
-
Dear Georgi, a GGUF file consists of a header and a body. The header contains key-value pairs that provide metadata about the model, such as its name, version, source, tokenizer, computation graph, etc. The body contains the tensors that represent the model parameters, such as the weights and biases of the layers. The tensors are stored in a compressed format to reduce the file size. The GGUF file format is extensible, meaning that new features can be added without breaking compatibility with older models. Would you say the above is correct? If so, how would you describe what llama.cpp does with such a file? Are the input tokens searched for in the GGUF? Once found, what's next? Thank you for your patience helping with this.
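One way to check that header/body description against a real file is to dump it with ggml's gguf API. A minimal sketch, assuming a recent tree (in older trees these functions are declared in ggml.h rather than a separate gguf.h):

```cpp
// Dump GGUF metadata (header key-value pairs) and tensor names (body).
#include "ggml.h"   // newer ggml trees declare the gguf_* API in "gguf.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    // Header: key-value metadata (architecture, tokenizer, hyperparameters, ...)
    const int64_t n_kv = gguf_get_n_kv(ctx);
    for (int64_t i = 0; i < n_kv; i++) {
        printf("kv[%d]: %s\n", (int) i, gguf_get_key(ctx, i));
    }

    // Body: the model tensors (weights), possibly quantized
    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; i++) {
        printf("tensor[%d]: %s\n", (int) i, gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```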
-
llama.cpp performs the following steps: it initializes a llama context from the GGUF file using the llama_init_from_file function. This function reads the header and the body of the GGUF file and creates a llama context object, which contains the model information and the backend to run the model on (CPU, GPU, or Metal). Is this correct? Can it be explained better?
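For what it's worth, llama_init_from_file is the older single-call API; more recent versions split this into loading the model and then creating a context on top of it. A minimal sketch of the split API, using function names from around the time of this thread (several have since been renamed):

```cpp
// Sketch of model loading and context creation with llama.h (circa this thread).
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init(false);  // false = no NUMA optimizations (older signature)

    // 1. Read the GGUF file: header metadata + weight tensors -> llama_model
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

    // 2. Create a context: KV cache, compute buffers, backend selection
    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize, decode, sample ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Splitting the two steps lets one set of loaded weights be shared by several contexts.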
-
Could someone provide a very simple explanation of it? Just a flow diagram, some pseudocode, a little working demo.
It would be great to have a working example written at a high level, just for learning purposes, even if it were terribly slow.
Thank you.
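Not an official answer, but here is a rough sketch of the full flow: tokenize the prompt, decode it, then repeatedly pick the most likely next token (greedy sampling) and feed it back. Function names are from llama.h as of roughly this thread's date and have since been renamed in places, so treat it as a learning sketch rather than a drop-in program; the examples/simple program in the repository is the maintained version of this loop.

```cpp
// End-to-end generation loop: tokenize -> decode -> greedy sample -> detokenize.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s model.gguf prompt\n", argv[0]); return 1; }

    llama_backend_init(false);
    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // 1. Tokenize: text -> token ids (the vocabulary comes from the GGUF header)
    std::string prompt = argv[2];
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_bos =*/ true, /*special =*/ false);
    tokens.resize(n);

    // 2. Decode the prompt, then generate one token at a time
    int n_past = 0;
    for (int i = 0; i < 32; i++) {
        llama_batch batch = llama_batch_get_one(tokens.data(), (int) tokens.size(), n_past, 0);
        if (llama_decode(ctx, batch) != 0) { fprintf(stderr, "decode failed\n"); return 1; }
        n_past += (int) tokens.size();

        // 3. Greedy sampling: take the highest-logit token for the last position
        const float * logits = llama_get_logits(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < llama_n_vocab(model); t++) {
            if (logits[t] > logits[best]) best = t;
        }
        if (best == llama_token_eos(model)) break;

        // 4. Detokenize and print, then feed the new token back in
        char buf[64];
        const int len = llama_token_to_piece(model, best, buf, sizeof(buf));
        printf("%.*s", len, buf);
        tokens = { best };
    }
    printf("\n");

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```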