
2.2.11 Backend: AirLLM


Handle: airllm
URL: http://localhost:33981


Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. It can even run the 405B Llama 3.1 on 8GB of VRAM.

Note that the above is true, but don't expect performant inference. AirLLM loads model layers into VRAM in small groups, one after another. The main benefit is that it allows a "transformers"-like workflow for models that are far larger than your VRAM.

Note

By default, AirLLM requires a GPU with CUDA and can't be run on a CPU.
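
Before starting the service, it can help to confirm that the GPU is actually visible on the host. A minimal check, assuming the NVIDIA drivers (and Container Toolkit for Docker) are installed:

# Should list your GPU(s) along with driver and CUDA versions
nvidia-smi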

Starting

# [Optional] Pre-build the image
# Needs PyTorch and CUDA, so will be quite large
harbor build airllm

# Start the service
# Will download selected models if not present yet
harbor up airllm

# See service logs
harbor logs airllm

# Check it's running
curl $(harbor url airllm)/v1/models
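
Assuming the endpoint returns the standard OpenAI models list shape, the currently served model can be read out with jq, for example:

# List the IDs of the models the service reports
curl -s $(harbor url airllm)/v1/models | jq -r '.data[].id'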

Configuration

AirLLM only supports specific models; see the original README for details.

Note

The default context size is 128, following the official examples.

For funsies, Harbor also adds an OpenAI-compatible API on top of AirLLM, so you can enjoy a 40-minute wait per request when "chatting" with Llama 3.1 405B.
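
As a sketch, a chat request against that API could look like the following. The endpoint path and payload assume the standard OpenAI chat completions schema, and the model name should match whatever `harbor airllm model` is set to:

# Send a single chat message to the AirLLM service
curl $(harbor url airllm)/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'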

# Set the model to run
harbor airllm model meta-llama/Meta-Llama-3.1-8B-Instruct

# Set the context size to use
harbor airllm ctx 1024

# Set the compression
harbor airllm compression 4bit
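
As a usage sketch, the settings above are usually applied together and followed by a (re)start of the service so they take effect; whether a running container picks up changes without a restart is an assumption worth checking:

# Pick a model, set the context size, enable 4-bit compression...
harbor airllm model meta-llama/Meta-Llama-3.1-8B-Instruct
harbor airllm ctx 1024
harbor airllm compression 4bit

# ...then (re)start the service so the settings are picked up
harbor up airllm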