TensorRT-LLM v0.16 Release #2611

Merged
kaiyux merged 1 commit into rel from preview/rel on Dec 24, 2024

Conversation

@kaiyux (Member) commented Dec 24, 2024

TensorRT-LLM Release 0.16.0

Key Features and Enhancements

  • Added guided decoding support with the XGrammar backend.
  • Added quantization support for RecurrentGemma. Refer to examples/recurrentgemma/README.md.
  • Added Ulysses context parallelism support. Refer to examples/llama/README.md for an example of building LLaMA 7B with 2-way tensor parallelism and 2-way context parallelism.
  • Added W4A8 quantization support to BF16 models on Ada (SM89).
  • Added PDL support for the FP8 GEMM plugins.
  • Added a runtime max_num_tokens dynamic tuning feature, which can be enabled by passing --enable_max_num_tokens_tuning to gptManagerBenchmark.
  • Added typical acceptance support for EAGLE.
  • Added support for enabling chunked context and sliding window attention together.
  • Added head size 64 support for the XQA kernel.
  • Added the following features to the LLM API:
    • Lookahead decoding.
    • DeepSeek V1 support.
    • Medusa support.
    • max_num_tokens and max_batch_size arguments to control the runtime parameters (see the sketch after this list).
    • extended_runtime_perf_knob_config to enable various performance configurations.
  • Added LogN scaling support for Qwen models.
  • Added AutoAWQ checkpoints support for Qwen. Refer to the “INT4-AWQ” section in examples/qwen/README.md.
  • Added AutoAWQ and AutoGPTQ Hugging Face checkpoints support for LLaMA. (Is it possible load quantized model from huggingface? #2458)
  • Added allottedTimeMs to the C++ Request class to support per-request timeout.
  • [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
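
As an illustration of the new LLM API runtime arguments, here is a minimal sketch. The model ID, argument values, and sampling settings are placeholders, and the keyword names are assumed to match the LlmArgs and SamplingParams shipped with this release rather than verified against it.

```python
# Minimal sketch: controlling runtime parameters through the LLM API.
# Assumptions: a single local GPU, and that max_batch_size/max_num_tokens
# are accepted as keyword arguments by LLM() as described above.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
        max_batch_size=8,      # cap on concurrent requests at runtime
        max_num_tokens=2048,   # cap on tokens scheduled per iteration
    )
    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(max_tokens=32),
    )
    for output in outputs:
        print(output.outputs[0].text)


# The LLM API requires this guard (see the API Changes section below).
if __name__ == "__main__":
    main()
```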

API Changes

  • [BREAKING CHANGE] Removed enable_xqa argument from trtllm-build.
  • [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN-based models.
  • [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the --use_embedding_sharing flag from the checkpoint conversion scripts.
  • [BREAKING CHANGE] The if __name__ == "__main__" entry point is required for both single-GPU and multi-GPU cases when using the LLM API (see the sketch after this list).
  • [BREAKING CHANGE] Cancelled requests now return empty results.
  • Added the enable_chunked_prefill flag to the LlmArgs of the LLM API.
  • Integrated BERT and RoBERTa models into the trtllm-build command.
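
A minimal sketch of the two API changes above that affect LLM API scripts: the required __main__ guard and the new enable_chunked_prefill flag. The model ID is a placeholder and the flag is used as named in the note above; treat this as illustrative rather than a verified recipe.

```python
# Sketch of an LLM API script that satisfies the new entry-point requirement
# and opts into chunked prefill. Placeholder model ID; single prompt only.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
        enable_chunked_prefill=True,  # new LlmArgs flag in this release
    )
    for output in llm.generate(["Hello, world"], SamplingParams(max_tokens=16)):
        print(output.outputs[0].text)


# Required for both single-GPU and multi-GPU runs: the LLM API may spawn
# worker processes, so the module must be importable without side effects.
if __name__ == "__main__":
    main()
```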

Model Updates

  • Added Qwen2-VL support. Refer to the “Qwen2-VL” section of examples/multimodal/README.md.
  • Added multimodal evaluation examples. Refer to examples/multimodal.
  • Added Stable Diffusion XL support. Refer to examples/sdxl/README.md. Thanks to @Zars19 for the contribution in "Support SDXL and its distributed inference" (#1514).

Fixed Issues

Infrastructure Changes

  • Updated the base Docker image for TensorRT-LLM to nvcr.io/nvidia/pytorch:24.11-py3.
  • Updated the base Docker image for TensorRT-LLM Backend to nvcr.io/nvidia/tritonserver:24.11-py3.
  • Updated to TensorRT v10.7.
  • Updated to CUDA v12.6.3.
  • Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
  • Updated to ModelOpt v0.21 for the Linux platform, while v0.17 is still used on the Windows platform.

Known Issues

  • There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting export NCCL_P2P_LEVEL=SYS.

@kaiyux kaiyux merged commit 42a7b09 into rel Dec 24, 2024
@kaiyux kaiyux deleted the preview/rel branch December 24, 2024 07:58