openai_trtllm provides an OpenAI-compatible API for TensorRT-LLM and NVIDIA Triton Inference Server, which lets you integrate them with LangChain.
Make sure you have built your own TensorRT LLM engine following the tensorrtllm_backend tutorial. The final model repository should look like the official example.
Note: to enable streaming, set decoupled to true in triton_model_repo/tensorrt_llm/config.pbtxt, as described in the tutorial.
Clone the repository recursively so that the submodules needed to build the project are included:
git clone --recursive https://github.com/npuichigo/openai_trtllm.git
Make sure you have Rust installed.
cargo run --release
The executable's arguments can be set from environment variables (prefixed by OPENAI_TRTLLM_) or on the command line.
Note: openai_trtllm communicates with Triton over gRPC, so --triton-endpoint must point to Triton's gRPC port, not its HTTP port.
./target/release/openai_trtllm --help
Usage: openai_trtllm [OPTIONS]

Options:
  -H, --host <HOST>
          Host to bind to [default: 0.0.0.0]
  -p, --port <PORT>
          Port to bind to [default: 3000]
  -t, --triton-endpoint <TRITON_ENDPOINT>
          Triton gRPC endpoint [default: http://localhost:8001]
  -o, --otlp-endpoint <OTLP_ENDPOINT>
          Endpoint of OpenTelemetry collector
      --history-template <HISTORY_TEMPLATE>
          Template for converting OpenAI message history to prompt
      --history-template-file <HISTORY_TEMPLATE_FILE>
          File containing the history template string
      --api-key <API_KEY>
          Api Key to access the server
  -h, --help
          Print help
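The relationship between command-line flags and environment variables can be sketched in Python. This document only confirms one pairing (--otlp-endpoint and OPENAI_TRTLLM_OTLP_ENDPOINT); the sketch assumes the other long flags follow the same pattern, so check any name it produces before relying on it. The helper names env_name and server_env are hypothetical.

```python
import os

ENV_PREFIX = "OPENAI_TRTLLM_"  # prefix stated in the README text


def env_name(flag: str) -> str:
    """Map a long flag such as '--triton-endpoint' to its assumed variable name.

    Assumption: every flag follows the pattern of the one documented pair,
    --otlp-endpoint -> OPENAI_TRTLLM_OTLP_ENDPOINT.
    """
    return ENV_PREFIX + flag.lstrip("-").replace("-", "_").upper()


def server_env(options: dict[str, str]) -> dict[str, str]:
    """Return a copy of the current environment with the given options set."""
    env = dict(os.environ)
    for flag, value in options.items():
        env[env_name(flag)] = value
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.run when launching ./target/release/openai_trtllm from a script.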
Make sure you have Docker and Docker Compose installed.
docker compose build openai_trtllm
docker compose up
openai_trtllm supports custom history templates that convert the message history into a prompt for chat models. The template
engine is liquid; follow its syntax to create your own template.
For examples of history templates, see the templates folder.
Here is an example template for Llama 3:
{% for item in items -%}
<|start_header_id|>{{ item.identity }}<|end_header_id|>
{{ item.content }}<|eot_id|>
{% endfor -%}
<|start_header_id|>assistant<|end_header_id|>
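To see what prompt a template like this produces, here is a hand-written Python approximation. It does not use the liquid engine; it mirrors the template as printed above, assuming the usual Liquid behavior where -%} trims the whitespace that follows a tag. The function name render_llama3_history is hypothetical, and the identity and content fields are taken from the template itself.

```python
def render_llama3_history(items: list[dict[str, str]]) -> str:
    """Approximate the Llama 3 history template shown above.

    Each item needs 'identity' and 'content' keys, matching the template's
    {{ item.identity }} and {{ item.content }} placeholders.
    """
    parts = []
    for item in items:
        parts.append(f"<|start_header_id|>{item['identity']}<|end_header_id|>\n")
        parts.append(f"{item['content']}<|eot_id|>\n")
    # The trailing assistant header cues the model to write the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "".join(parts)


if __name__ == "__main__":
    history = [
        {"identity": "system", "content": "Be brief."},
        {"identity": "user", "content": "Hi"},
    ]
    print(render_llama3_history(history))
```

The real output depends on the exact whitespace in the template file, so treat this as a way to reason about the prompt shape rather than a byte-exact reproduction.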
Because openai_trtllm is compatible with the OpenAI API, you can integrate it with LangChain as an alternative backend for OpenAI or ChatOpenAI.
Although you can use the recently published TensorRT-LLM integration, it does not support chat models yet, let alone user-defined templates.
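As a rough illustration of what an OpenAI-compatible client sends, the sketch below builds a chat completion request with the Python standard library only. Several details are assumptions rather than facts from this document: the /v1/chat/completions path (taken from the OpenAI API convention), the model name "ensemble", and the use of a Bearer token for the --api-key value. The host and port come from the defaults listed earlier. In LangChain, the usual route is to point ChatOpenAI's base URL at the same server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the server expects the key as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


def chat(base_url, model, messages, api_key=None):
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, messages, api_key)
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running locally, a call might look like chat("http://localhost:3000/v1", "ensemble", [{"role": "user", "content": "Hello"}]), where "ensemble" stands in for whatever model name your Triton model repository exposes.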
Tracing is available through the tracing, tracing-opentelemetry, and opentelemetry-otlp crates.
[Figure: example of tracing with Tempo on a k8s cluster]
To test tracing locally, you can use Jaeger as the backend:
docker run --rm --name jaeger \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.51
To enable tracing, set the OPENAI_TRTLLM_OTLP_ENDPOINT environment variable or the --otlp-endpoint command-line argument to the endpoint of your OpenTelemetry collector.
OPENAI_TRTLLM_OTLP_ENDPOINT=http://localhost:4317 cargo run --release
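If traces do not show up, one quick check is whether anything is listening at the collector endpoint. The sketch below is a generic TCP reachability test, not part of openai_trtllm; the function names are hypothetical, and it only confirms that the port accepts connections, not that the listener speaks OTLP.

```python
import socket
from urllib.parse import urlsplit


def endpoint_address(endpoint: str) -> tuple[str, int]:
    """Split an endpoint such as 'http://localhost:4317' into (host, port)."""
    parts = urlsplit(endpoint)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {endpoint!r}")
    return parts.hostname, parts.port


def collector_is_listening(endpoint: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    try:
        with socket.create_connection(endpoint_address(endpoint), timeout=timeout):
            return True
    except OSError:
        return False
```

For the Jaeger container above, collector_is_listening("http://localhost:4317") should return True once the container has started.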