Have you ever wanted to inference a baby Llama 2 model in Go? No? Well, now you can!
This is based on llama2.c, we didn't port anything! So it's very different from these Go implementations:
llgo plays a great role as a bridge, allowing the C ecosystem to be seamlessly connected to Go.
You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: TinyStories paper). This repo is a "fullstack" train + inference solution for Llama 2 LLM, with focus on minimalism and simplicity.
As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B.
How to run this example? The simplest way is to run it without any arguments:
llgo run .
This means it uses the default model checkpoint file (stories15M.bin
), and the default prompt (Once upon a time
).
You need download the model checkpoint file first. Download this 15M parameter model trained on the TinyStories dataset (~60MB download):
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
If you want to specify a prompt (eg. Long long ago
):
llgo run . 'Long long ago'
We can also try a bit bigger 42M parameter model (ie. stories42M.bin
):
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
llgo run . -m stories42M.bin 'Long long ago'
There is also an even better 110M param model available, see models.
Quick note on sampling, the recommendation for ~best results is to sample with -t 1.0 -p 0.9
, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get "unlucky" during sampling, and we are less likely to go "off the rails" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary -t
between 0 and 1 and keep top-p off with -p 0
) or the top-p value (i.e. vary -p
between 0 and 1 and keep -t 1
), but not both. Nice explainers on LLM sampling strategies include this, this or this.
As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the Meta instructions. Once we have those checkpoints, we have to convert them into the llama2.c format.
For this we need to install the python dependencies (pip install -r requirements.txt
) and then use the export.py
file, e.g. for 7B model:
python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called llama2_7b.bin
in the current directory. It has been reported that despite efforts. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer flow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:
./run llama2_7b.bin
This ran at about 4 tokens/s compiled with OpenMP on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with make runfast
.) Example output:
The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.
base models... ¯\(ツ)/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!
You can also try Meta's Code Llama models even if support for them is incomplete. In particular, some hyperparameters changed (e.g. the constant in RoPE layer), so the inference is not exactly correct and a bit buggy right now. Looking into fixes. Make sure to build the tokenizer for the plain and instruct variants and pass it when doing inference.
python export.py codellama2_7b.bin --meta-llama /path/to/CodeLlama-7b
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b/tokenizer.model
./run codellama2_7b.bin -z /path/to/CodeLlama-7b/tokenizer.bin
Chat with Code Llama Instruct:
python export.py codellama2_7b_instruct.bin --meta-llama /path/to/CodeLlama-7b-Instruct
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.model
./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin
We can load any huggingface models that use the Llama 2 architecture. See the script export.py and the --hf
flag to export the model .bin file.
For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub tinyllamas, both in the original PyTorch .pt, and also in the llama2.c format .bin:
model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
---|---|---|---|---|---|---|---|---|
260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (GPT-2 small
), except the max context length is only 1024 instead of 2048. The only notable changes from GPT-1/2 architecture is that Llama uses RoPE relatively positional embeddings instead of absolute/learned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery.
Let's see how we can train a baby Llama 2 from scratch using the code in this repo. First let's download and pretokenize some source dataset, e.g. I like TinyStories so this is the only example currently available in this repo. But it should be very easy to add datasets, see the code.
python tinystories.py download
python tinystories.py pretokenize
Then train our model:
python train.py
brief training guide. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the Chinchilla paper to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the total batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so.
Totally understand if you want to skip model training, for simple demo just download one of the pretrained models (see models section), e.g.:
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
Once we have the model.bin file, we can inference in C. Compile the C code first:
llgo run . -m stories15M.bin