2.2.3 Backend: vLLM

Handle: vllm
URL: http://localhost:33911

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.

Models

Once you've found a model you want to run, you can configure it with Harbor:

# Quickly look up some of the compatible quants
harbor hf find awq
harbor hf find gptq

# This propagates the settings
# to the relevant configuration files
harbor vllm model google/gemma-2-2b-it

# To run a gated model, ensure that you've
# also set your Hugging Face API token
harbor hf token <your-token>

Starting

harbor up vllm

Models served by vLLM should be available in the Open WebUI by default.
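
If you want to verify the endpoint directly, vLLM's OpenAI-compatible API can be queried with curl. This is a quick sketch that assumes the default Harbor port shown at the top of this page and the model configured earlier:

# List the models currently served by vLLM
curl http://localhost:33911/v1/models

# Send a quick test chat completion
curl http://localhost:33911/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-2-2b-it", "messages": [{"role": "user", "content": "Hello!"}]}'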

Configuration

You can configure specific portions of vllm via the Harbor CLI:

# See original CLI help
harbor run vllm --help

# Get/Set the extra arguments
harbor vllm args
harbor vllm args '--dtype bfloat16 --code-revision 3.5'

# Select the attention backend
# (e.g. FLASH_ATTN, FLASHINFER, XFORMERS, ROCM_FLASH, TORCH_SDPA)
harbor vllm attention ROCM_FLASH

# Set the host port for the vLLM API endpoint
harbor config set vllm.host.port 4090

# Get/set desired vLLM version
harbor vllm version # v0.5.3
# Command accepts a docker tag
harbor vllm version latest

You can specify more options directly via the .env file.
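
For reference, the CLI calls above roughly translate to .env entries like the ones below. The exact variable names here are an assumption based on Harbor's dotted-key-to-environment-variable mapping (e.g. vllm.host.port → HARBOR_VLLM_HOST_PORT), so double-check them against your own .env:

# Assumed .env equivalents of the commands shown above
HARBOR_VLLM_HOST_PORT=4090
HARBOR_VLLM_VERSION="latest"
HARBOR_VLLM_EXTRA_ARGS="--dtype bfloat16"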

VRAM

Below are some steps to take if you are running out of VRAM (no magic, though).

Offloading

vLLM supports partial offloading to the CPU, similar to llama.cpp and some other backends. This can be configured via the --cpu-offload-gb flag.

harbor vllm args --cpu-offload-gb 4

Disable CUDA Graphs

When loading the model, VRAM usage can spike while the CUDA graphs are being captured. This can be disabled via the --enforce-eager flag.

harbor vllm args --enforce-eager

GPU Memory Utilization

Reduce the amount of VRAM allocated for the model executor. The value ranges from 0 to 1.0 and defaults to 0.9.

harbor vllm args --gpu-memory-utilization 0.8

Run on CPU

You can move inference entirely to the CPU by setting the --device cpu flag.

harbor vllm args --device cpu