LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any reused text (not necessarily a prefix) in any serving engine instance. It thus reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
Try LMCache with pre-built vLLM Docker images here.
LMCache integrates with the latest vLLM (0.6.2). To install LMCache, run:
# requires python >= 3.10 and nvcc >= 12.1
pip install lmcache lmcache_vllm
LMCache has the same interface as vLLM (both online serving and offline inference). To use the online serving, you can start an OpenAI API-compatible vLLM server with LMCache via:
lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
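Once the server is up, any OpenAI-compatible client can query it. As an illustrative sketch (the endpoint path follows the standard OpenAI completions API; port 8000 is assumed, since none is given in the command above), a request body might look like:

```python
import json

# Build an OpenAI-style completion request for the LMCache-backed vLLM server.
# Model name matches the serve command above; other values are placeholders.
payload = {
    "model": "lmsys/longchat-7b-16k",
    "prompt": "Summarize the following document: ...",
    "max_tokens": 128,
    "temperature": 0.0,
}
body = json.dumps(payload)

# POST this JSON to http://localhost:8000/v1/completions, e.g. with curl:
#   curl http://localhost:8000/v1/completions \
#     -H "Content-Type: application/json" -d "$body"
print(body)
```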
To use vLLM's offline inference with LMCache, simply prepend lmcache_vllm to the imports of vLLM components. For example:
import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM
More detailed documentation will be available soon.
Prerequisites:
- vLLM 0.6.1.post2 installed
- PyTorch 2.6.0+rocm6.2 installed
To install LMCache, use the following command:
# requires python >= 3.10 and rocm >= 6.2
git clone https://github.com/EmbeddedLLM/torchac_rocm.git
cd torchac_rocm
# you might need to run this command twice to install if it fails the first time
HCC_AMDGPU_TARGET=gfx90a python3 setup.py develop
cd ..
git clone https://github.com/EmbeddedLLM/LMCache.git
cd LMCache
python3 setup.py develop
LMCache supports sharing KV caches across different vLLM instances via the lmcache.server
module. Here is a quick guide:
# Start lmcache server
lmcache_server localhost 65432
Then, start two vLLM instances with the LMCache config file:
wget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml
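The config file is a small YAML file pointing each vLLM instance at the shared LMCache server. The fields below are an assumed sketch of its shape; consult the downloaded example.yaml for the authoritative field names and values:

```yaml
# Assumed sketch of an LMCache config -- verify against example.yaml.
chunk_size: 256                     # tokens per KV-cache chunk (assumed)
local_device: "cpu"                 # where to keep the local cache (assumed)
remote_url: "lm://localhost:65432"  # the lmcache_server started above (assumed)
```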
# start the first vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000
# start the second vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001
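The payoff comes when the two instances receive requests that reuse the same long text: the KV cache computed by one instance can be fetched from the shared LMCache server by the other instead of being prefilled again. A minimal client-side sketch (request shape per the OpenAI completions API; ports from the two commands above; the document text is a placeholder):

```python
import json

# The same long document is sent to both instances; with LMCache, the
# second instance can reuse the KV cache the first one already computed.
long_document = "... a long shared document ... " * 100  # placeholder text

def make_request(question: str) -> str:
    """Build an OpenAI-style completion request embedding the shared document."""
    return json.dumps({
        "model": "lmsys/longchat-7b-16k",
        "prompt": f"{long_document}\n\nQuestion: {question}",
        "max_tokens": 64,
    })

# POST req_a to http://localhost:8000/v1/completions (first instance)
# and req_b to http://localhost:8001/v1/completions (second instance).
req_a = make_request("What is the main topic?")
req_b = make_request("List the key findings.")
```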
We also provide multiple docker-based demos at 🔗LMCache-demos repo. The demos cover the following use cases:
- Share KV caches across multiple serving engines (🔗link)
- Loading non-prefix KV caches for RAG (🔗link)
Fill out the interest form and our team will reach out to you! https://forms.gle/mQfQDUXbKfp2St1z7
- First release of LMCache
- Support installation through pip install and integration with the latest vLLM
- Stable support for non-prefix KV caches
- User and developer documentation
Our blog posts and documentation are available online.
If you use LMCache for your research, please cite our papers:
@inproceedings{liu2024cachegen,
title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},
author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
pages={38--56},
year={2024}
}
@article{cheng2024large,
title={Do Large Language Models Need a Content Delivery Network?},
author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
journal={arXiv preprint arXiv:2409.13761},
year={2024}
}
@article{yao2024cacheblend,
title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},
author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
journal={arXiv preprint arXiv:2405.16444},
year={2024}
}