llama4micro 🦙🔬

A "large" language model running on a microcontroller.

Background

I was wondering if it's possible to fit a non-trivial language model on a microcontroller. Turns out the answer is some version of yes! (Later, things got a bit out of hand and now the prompt is based on objects detected by the camera.)

This project uses the Coral Dev Board Micro with its FreeRTOS toolchain. The board has a number of neat hardware features, but – most importantly for our purposes – it has 64 MB of RAM. That's tiny for LLMs, which typically need gigabytes, but comparatively huge for a microcontroller.

The LLM implementation is an adaptation of llama2.c and uses the tinyllamas checkpoints trained on the TinyStories dataset. The quality of the smaller model versions isn't ideal, but it is good enough to generate somewhat coherent (and occasionally weird) stories.
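To get a feel for the RAM constraint, here is a rough back-of-the-envelope check in Python. It is only a sketch under stated assumptions: weights are stored as float32, "64 MB" is read as 64 MiB, and the parameter counts are approximations taken from the tinyllamas checkpoint names. It counts weights only, so the run-time state, the firmware, and the vision model would still need room on top.

```python
# Rough memory check: do float32 weights fit in the board's RAM?
# All numbers are approximations for illustration, not measurements.
RAM_BYTES = 64 * 1024 * 1024  # assumes "64 MB" means 64 MiB
BYTES_PER_PARAM = 4           # assumes float32 weights

# Approximate parameter counts, read off the checkpoint names.
CHECKPOINTS = {
    "stories260K": 260_000,
    "stories15M": 15_000_000,
    "stories42M": 42_000_000,
    "stories110M": 110_000_000,
}


def weights_bytes(num_params: int) -> int:
    """Size of the weights alone, ignoring run-time buffers."""
    return num_params * BYTES_PER_PARAM


def fits_in_ram(num_params: int, ram_bytes: int = RAM_BYTES) -> bool:
    """True if the weights alone are smaller than the RAM budget."""
    return weights_bytes(num_params) < ram_bytes


for name, params in CHECKPOINTS.items():
    size_mib = weights_bytes(params) / (1024 * 1024)
    verdict = "fits" if fits_in_ram(params) else "does not fit"
    print(f"{name}: ~{size_mib:.1f} MiB of weights, {verdict}")
```

Under these assumptions, the smaller checkpoints are the only realistic candidates, and even then the margin is thin once everything else is loaded.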

Note

Language model inference runs on the 800 MHz Arm Cortex-M7 CPU core. Camera image classification uses the Edge TPU and a compiled YOLOv5 model. The board also has a second 400 MHz Arm Cortex-M4 CPU core, which is currently unused.

Setup

Clone this repo with its submodules karpathy/llama2.c, google-coral/coralmicro, and ultralytics/yolov5.

git clone --recurse-submodules https://github.com/maxbbraun/llama4micro.git

cd llama4micro

The pre-trained models are in the models/ directory. Refer to the instructions on how to download and convert them.

Build the image:

mkdir build
cd build

cmake ..
make -j

Flash the image:

python3 -m venv venv
. venv/bin/activate

pip install -r ../coralmicro/scripts/requirements.txt

python ../coralmicro/scripts/flashtool.py \
    --build_dir . \
    --elf_path llama4micro

Usage

1. The models load automatically when the board powers up.
   - This takes ~7 seconds.
   - The green light will turn on when ready.
2. Point the camera at an object and press the button.
   - The green light will turn off.
   - The camera will take a picture and detect an object.
3. The model now generates tokens starting with a prompt based on the object.
   - The results are streamed to the serial port.
   - This happens at a rate of ~2.5 tokens per second.
4. Generation stops after the end token or maximum steps.
   - The green light will turn on again.
   - Goto 2.
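
The generation part of this loop can be sketched in a few lines of Python. This is an illustration of the control flow only, not the firmware code: the prompt template, the end token id, the step limit, and the random stand-in for the model are all hypothetical. Only the ~2.5 tokens per second figure comes from the description above.

```python
import random

TOKENS_PER_SECOND = 2.5  # approximate rate quoted above
END_TOKEN = 0            # hypothetical end token id
MAX_STEPS = 256          # hypothetical step limit


def build_prompt(label: str) -> str:
    """Turn a detected object label into a story prompt (hypothetical template)."""
    return f"Once upon a time, there was a {label}"


def generate(next_token, max_steps=MAX_STEPS, end_token=END_TOKEN):
    """Collect tokens until the end token appears or max_steps is reached."""
    tokens = []
    for _ in range(max_steps):
        token = next_token()
        if token == end_token:
            break
        tokens.append(token)
    return tokens


def estimated_seconds(num_tokens, rate=TOKENS_PER_SECOND):
    """Rough generation time at a constant token rate."""
    return num_tokens / rate


# Stand-in for the model: random token ids, seeded for repeatability.
rng = random.Random(0)
tokens = generate(lambda: rng.randrange(0, 50))
print(build_prompt("teddy bear"))
print(f"{len(tokens)} tokens, about {estimated_seconds(len(tokens)):.0f} s")
```

At the quoted rate, a story that runs to a few hundred tokens would take on the order of a minute or two, which is why streaming the output as it is produced matters.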
