How does llama.cpp work? #4531
-
Essentially, we load data, build a computation graph and compute it. The main focus is running on the CPU with efficient SIMD instructions - this is built into the core implementation of the library (ggml).
So it is a general API that makes it easier to start running models. One of the simplest examples of using it is the examples/simple program in the repository.
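To make that "load data, build a graph, compute it" idea concrete, here is a minimal sketch using the ggml API directly. This is an illustrative example, not from the thread; it assumes a reasonably recent ggml tree, since names like `ggml_new_graph` and `ggml_graph_compute_with_ctx` have shifted between versions:

```cpp
// Minimal sketch of the ggml flow: load data, build a graph, compute it.
#include "ggml.h"
#include <cstdio>

int main() {
    // Allocate a working buffer for tensors and graph metadata
    struct ggml_init_params params = { /*mem_size =*/ 16*1024*1024, /*mem_buffer =*/ NULL, /*no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);

    // "Load data": two tiny tensors standing in for model weights/inputs
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    // "Build a computation graph": nothing runs yet, the op is only recorded
    struct ggml_tensor * c = ggml_mul(ctx, a, b);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // "Compute it": walk the graph on the CPU (the SIMD kernels live here)
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    printf("3 * 4 = %f\n", ggml_get_f32_1d(c, 0));
    ggml_free(ctx);
    return 0;
}
```

Deferring execution to an explicit graph is what lets the same model definition be scheduled on different backends later.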
-
Dear Georgi, a GGUF file consists of a header and a body. The header contains key-value pairs that provide metadata about the model, such as its name, version, source, tokenizer, computation graph, etc. The body contains the tensors that represent the model parameters, such as the weights and biases of the layers. The tensors are stored in a compressed format to reduce the file size. The GGUF file format is extensible, meaning that new features can be added without breaking compatibility with older models. Would you say the above is correct? If so, how would you describe what llama.cpp does with such a file? Are the input tokens searched for in the GGUF? Once found, what's next? Thank you for your patience helping with this.
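One way to check that header/body description against a real file is to dump it with ggml's gguf API. A minimal sketch, assuming a recent tree (in older trees these functions are declared in ggml.h rather than a separate gguf.h):

```cpp
// Dump GGUF metadata (header key-value pairs) and tensor names (body).
#include "ggml.h"   // newer ggml trees declare the gguf_* API in "gguf.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    // Header: key-value metadata (architecture, tokenizer, hyperparameters, ...)
    const int64_t n_kv = gguf_get_n_kv(ctx);
    for (int64_t i = 0; i < n_kv; i++) {
        printf("kv[%d]: %s\n", (int) i, gguf_get_key(ctx, i));
    }

    // Body: the model tensors (weights), possibly quantized
    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; i++) {
        printf("tensor[%d]: %s\n", (int) i, gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```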
-
llama.cpp performs the following steps: it initializes a llama context from the GGUF file using the llama_init_from_file function. This function reads the header and the body of the GGUF file and creates a llama context object, which contains the model information and the backend to run the model on (CPU, GPU, or Metal). Is this correct? Can it be explained better?
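For what it's worth, llama_init_from_file is the older single-call API; more recent versions split this into loading the model and then creating a context on top of it. A minimal sketch of the split API, using function names from around the time of this thread (several have since been renamed):

```cpp
// Sketch of model loading and context creation with llama.h (circa this thread).
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init(false);  // false = no NUMA optimizations (older signature)

    // 1. Read the GGUF file: header metadata + weight tensors -> llama_model
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

    // 2. Create a context: KV cache, compute buffers, backend selection
    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize, decode, sample ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Splitting the two steps lets one set of loaded weights be shared by several contexts.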
-
Could someone provide a very simple explanation of it? Just a flow diagram, some pseudocode, a little working demo.
It would be great to have a working example written at a high level, just for learning purposes, even if it were terribly slow.
Thank you.
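Not an official answer, but here is a rough sketch of the full flow: tokenize the prompt, decode it, then repeatedly pick the most likely next token (greedy sampling) and feed it back. Function names are from llama.h as of roughly this thread's date and have since been renamed in places, so treat it as a learning sketch rather than a drop-in program; the examples/simple program in the repository is the maintained version of this loop.

```cpp
// End-to-end generation loop: tokenize -> decode -> greedy sample -> detokenize.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s model.gguf prompt\n", argv[0]); return 1; }

    llama_backend_init(false);
    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // 1. Tokenize: text -> token ids (the vocabulary comes from the GGUF header)
    std::string prompt = argv[2];
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_bos =*/ true, /*special =*/ false);
    tokens.resize(n);

    // 2. Decode the prompt, then generate one token at a time
    int n_past = 0;
    for (int i = 0; i < 32; i++) {
        llama_batch batch = llama_batch_get_one(tokens.data(), (int) tokens.size(), n_past, 0);
        if (llama_decode(ctx, batch) != 0) { fprintf(stderr, "decode failed\n"); return 1; }
        n_past += (int) tokens.size();

        // 3. Greedy sampling: take the highest-logit token for the last position
        const float * logits = llama_get_logits(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < llama_n_vocab(model); t++) {
            if (logits[t] > logits[best]) best = t;
        }
        if (best == llama_token_eos(model)) break;

        // 4. Detokenize and print, then feed the new token back in
        char buf[64];
        const int len = llama_token_to_piece(model, best, buf, sizeof(buf));
        printf("%.*s", len, buf);
        tokens = { best };
    }
    printf("\n");

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```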