TensorRT-LLM v0.16 Release #2611

Merged
kaiyux merged 1 commit into rel from preview/rel on Dec 24, 2024

Conversation

@kaiyux (Member) commented Dec 24, 2024

TensorRT-LLM Release 0.16.0

Key Features and Enhancements

  • Added guided decoding support with the XGrammar backend.
  • Added quantization support for RecurrentGemma. Refer to examples/recurrentgemma/README.md.
  • Added Ulysses context parallelism support. Refer to examples/llama/README.md for an example of building LLaMA 7B with 2-way tensor parallelism and 2-way context parallelism.
  • Added W4A8 quantization support to BF16 models on Ada (SM89).
  • Added PDL support for the FP8 GEMM plugins.
  • Added a runtime max_num_tokens dynamic tuning feature, which can be enabled by passing --enable_max_num_tokens_tuning to gptManagerBenchmark.
  • Added typical acceptance support for EAGLE.
  • Added support for enabling chunked context and sliding window attention together.
  • Added head size 64 support for the XQA kernel.
  • Added the following features to the LLM API:
    • Lookahead decoding.
    • DeepSeek V1 support.
    • Medusa support.
    • max_num_tokens and max_batch_size arguments to control the runtime parameters (see the sketch after this list).
    • extended_runtime_perf_knob_config to enable various performance configurations.
  • Added LogN scaling support for Qwen models.
  • Added AutoAWQ checkpoints support for Qwen. Refer to the “INT4-AWQ” section in examples/qwen/README.md.
  • Added AutoAWQ and AutoGPTQ Hugging Face checkpoints support for LLaMA. (Is it possible load quantized model from huggingface? #2458)
  • Added allottedTimeMs to the C++ Request class to support per-request timeout.
  • [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
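
As an illustration of the new LLM API runtime arguments, here is a minimal sketch. The model ID, argument values, and sampling settings are placeholders, and the keyword names are assumed to match the LlmArgs and SamplingParams shipped with this release rather than verified against it.

```python
# Minimal sketch: controlling runtime parameters through the LLM API.
# Assumptions: a single local GPU, and that max_batch_size/max_num_tokens
# are accepted as keyword arguments by LLM() as described above.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
        max_batch_size=8,      # cap on concurrent requests at runtime
        max_num_tokens=2048,   # cap on tokens scheduled per iteration
    )
    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(max_tokens=32),
    )
    for output in outputs:
        print(output.outputs[0].text)


# The LLM API requires this guard (see the API Changes section below).
if __name__ == "__main__":
    main()
```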

API Changes

  • [BREAKING CHANGE] Removed enable_xqa argument from trtllm-build.
  • [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN-based models.
  • [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the --use_embedding_sharing flag from the checkpoint conversion scripts.
  • [BREAKING CHANGE] The if __name__ == "__main__" entry point is required for both single-GPU and multi-GPU cases when using the LLM API (see the sketch after this list).
  • [BREAKING CHANGE] Cancelled requests now return empty results.
  • Added the enable_chunked_prefill flag to the LlmArgs of the LLM API.
  • Integrated BERT and RoBERTa models into the trtllm-build command.
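
A minimal sketch of the two API changes above that affect LLM API scripts: the required __main__ guard and the new enable_chunked_prefill flag. The model ID is a placeholder and the flag is used as named in the note above; treat this as illustrative rather than a verified recipe.

```python
# Sketch of an LLM API script that satisfies the new entry-point requirement
# and opts into chunked prefill. Placeholder model ID; single prompt only.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
        enable_chunked_prefill=True,  # new LlmArgs flag in this release
    )
    for output in llm.generate(["Hello, world"], SamplingParams(max_tokens=16)):
        print(output.outputs[0].text)


# Required for both single-GPU and multi-GPU runs: the LLM API may spawn
# worker processes, so the module must be importable without side effects.
if __name__ == "__main__":
    main()
```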

Model Updates

  • Added Qwen2-VL support. Refer to the “Qwen2-VL” section of examples/multimodal/README.md.
  • Added multimodal evaluation examples. Refer to examples/multimodal.
  • Added Stable Diffusion XL support. Refer to examples/sdxl/README.md. Thanks to @Zars19 for the contribution in "Support SDXL and its distributed inference" (#1514).

Fixed Issues

Infrastructure Changes

  • Updated the base Docker image for TensorRT-LLM to nvcr.io/nvidia/pytorch:24.11-py3.
  • Updated the base Docker image for TensorRT-LLM Backend to nvcr.io/nvidia/tritonserver:24.11-py3.
  • Updated to TensorRT v10.7.
  • Updated to CUDA v12.6.3.
  • Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
  • Updated to ModelOpt v0.21 for the Linux platform, while v0.17 is still used on the Windows platform.

Known Issues

  • There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting export NCCL_P2P_LEVEL=SYS.

@kaiyux kaiyux merged commit 42a7b09 into rel Dec 24, 2024
@kaiyux kaiyux deleted the preview/rel branch December 24, 2024 07:58