Releases · OpenNMT/CTranslate2
CTranslate2 3.23.0
New features
- Support the Phi model (see the conversion sketch below)
Fixes and improvements
- Fix the conversion for Whisper models without `alignment_heads` in `generation_config.json`
- Fix the `forward_batch` method
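For illustration, a Phi checkpoint can be converted from Python with the Transformers converter; a minimal sketch, assuming the model name, output directory, and quantization choice shown here (none of them prescribed by this release):

```python
import ctranslate2.converters

# Output directory and quantization are hypothetical choices; any Phi
# checkpoint supported by the converter should work the same way.
converter = ctranslate2.converters.TransformersConverter(
    "microsoft/phi-1_5",
    trust_remote_code=True,  # Phi checkpoints may ship custom modeling code
)
converter.convert("phi-ct2", quantization="int8_float16")
```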
CTranslate2 3.22.0
New features
- Support "sliding window" and "chunking input" for Mistral
Fixes and improvements
- Take into account `generation_config.json` and fix the `lang_ids` getter in the Whisper converter
- Accept a callback in the `generate_tokens` method
- Fix iomp5 linking with the latest Intel oneAPI on Ubuntu
- Fixed "decoder_start_token_id" for T5
CTranslate2 3.21.0
New features
- Minimal support for Mistral (loader and rotary embedding extension for long sequences); sliding window attention is not implemented yet
- Support Distil-Whisper
- Support Whisper-large-v3
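A minimal conversion sketch covering both new checkpoints; the model names are examples and float16 is one possible quantization choice:

```python
import ctranslate2.converters

for name, output_dir in [
    ("openai/whisper-large-v3", "whisper-large-v3-ct2"),
    ("distil-whisper/distil-large-v2", "distil-large-v2-ct2"),
]:
    # Output directories are hypothetical.
    ctranslate2.converters.TransformersConverter(name).convert(
        output_dir, quantization="float16"
    )
```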
CTranslate2 3.20.0
New features
- Update the Transformers converter to support more model architectures:
  - MixFormerSequential (used by microsoft/phi-1_5)
- Accept batch inputs in the `generate_tokens` methods
- Add the method `Generator.async_generate_tokens` to return an asynchronous generator compatible with `asyncio`
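A minimal `asyncio` sketch of the new method; the model path and prompt tokens are placeholders:

```python
import asyncio
import ctranslate2

async def stream(generator, prompt):
    # Yields GenerationStepResult objects without blocking the event loop.
    async for step in generator.async_generate_tokens(prompt, max_length=64):
        print(step.token, end="", flush=True)

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical
asyncio.run(stream(generator, ["<s>", "▁Hello"]))
```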
Fixes and improvements
- Remove the epsilon value in the softmax CPU kernel for consistency with other implementations
- Optimize the implementation of the Dynamic Time Warping (DTW) function (used for Whisper alignment)
- Avoid an unnecessary copy of the input arguments in the method `Whisper::align`
CTranslate2 3.19.0
Changes
- Binary wheels for Python 3.7 are no longer built
New features
- Build wheels for Python 3.12
- Update the Transformers converter to support more model architectures:
  - Falcon-RW
  - DistilBERT
  - Llama with linear RoPE scaling (e.g. Vicuna v1.5)
  - Llama with a non-default RoPE base period (e.g. CodeLlama)
- Accept the token type IDs as inputs for encoder models
- Add the property `GenerationStepResult.hypothesis_id` to identify the different hypotheses when running random sampling with `num_hypotheses > 1`
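A sketch of reading `hypothesis_id` from the step results via the `generate_batch` callback; the model path and sampling settings are placeholders:

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical

def on_step(step):
    # step.hypothesis_id tells the sampled sequences apart when num_hypotheses > 1.
    print(f"hypothesis={step.hypothesis_id} token={step.token}")
    return False  # returning True would stop the decoding

generator.generate_batch(
    [["<s>"]],
    beam_size=1,        # random sampling runs with greedy search
    sampling_topk=50,
    num_hypotheses=3,
    max_length=32,
    callback=on_step,
)
```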
Fixes and improvements
- Improve performance of 8-bit models on CPU:
  - Vectorize the GEMM output dequantization
  - Fuse the GEMM output dequantization with bias and activation
- Allow inputs shorter than 30 seconds in Whisper methods
- Fix incorrect `batch_id` values passed to the callback function
- Fix a shape error in models using both MQA and relative positions
- Fix compilation error related to AVX512 when using GCC 7
- Call `.detach()` on PyTorch tensors before getting the Numpy array in converters
CTranslate2 3.18.0
Changes
Converted models now use the same floating point precision as the original models. For example, a model saved in float16 will be converted to a float16 model. Before this change, the weights were cast to float32 by default.
Similarly, selecting int8 keeps non-quantized weights in their original precision unless a more specific quantization type is selected:
- `int8_float32`
- `int8_float16`
- `int8_bfloat16`
New features
- Add the property `compute_type` to model instances
- Extend the Python class `StorageView` with additional methods and properties: `to(dtype)`, `device_index`, `device`, `dtype`, `shape`
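A short sketch of the new introspection API; the model path is a placeholder, and passing a `ctranslate2.DataType` member to `to(dtype)` is an assumption based on the property types:

```python
import numpy as np
import ctranslate2

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical
print(generator.compute_type)  # resolved compute type, e.g. "float16"

sv = ctranslate2.StorageView.from_array(np.zeros((2, 4), dtype=np.float32))
print(sv.shape, sv.dtype, sv.device)        # shape (2, 4), float32 dtype, "cpu"
half = sv.to(ctranslate2.DataType.float16)  # assumed DataType enum member
```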
Fixes and improvements
- Update the function `get_supported_compute_types` to correctly return bfloat16 when supported (example after this list)
- Update the HF Llama converter to accept extra tokens in the vocabulary
- Fix a shape error when enabling `return_alternatives` with a model using relative positions
- Fix a conversion error when using `torch<1.13`
- Fix a type error when running Whisper models with the bfloat16 type
- Update pybind11 to 2.11.1
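The example referenced above; the output depends on the hardware:

```python
import ctranslate2

# On a GPU with Compute Capability >= 8.0, the returned set should now
# include "bfloat16" and "int8_bfloat16".
print(ctranslate2.get_supported_compute_types("cpu"))
print(ctranslate2.get_supported_compute_types("cuda"))  # requires a visible GPU
```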
CTranslate2 3.17.1
Fixes and improvements
- Fix an error when running models with the new `int8_bfloat16` computation type
- Fix a vocabulary error when converting Llama 2 models with the Transformers converter
- Update the Transformers converter to correctly convert Llama models using GQA
- Stop the decoding when the generator returned by the method `generate_tokens` is closed
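A sketch of stopping early through the generator protocol; the model path and stop condition are placeholders:

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical
stream = generator.generate_tokens(["<s>"], max_length=512)

for i, step in enumerate(stream):
    print(step.token, end="", flush=True)
    if i >= 10:          # any application-level stop condition
        stream.close()   # with this fix, closing also stops the decoding
        break
```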
CTranslate2 3.17.0
New features
- Add new computation types: `bfloat16` and `int8_bfloat16` (require a GPU with Compute Capability 8.0 or above; see the example after this list)
- Support multi-query attention for encoder-decoder models
- Allow converters to register weights as PyTorch tensors instead of Numpy arrays
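The example referenced above; the model path is a placeholder:

```python
import ctranslate2

# Requires a GPU with Compute Capability 8.0 or above (e.g. A100, RTX 30xx).
translator = ctranslate2.Translator(
    "model-ct2",              # path is hypothetical
    device="cuda",
    compute_type="bfloat16",  # or "int8_bfloat16" to also quantize linear weights
)
```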
Fixes and improvements
- Pass the flag `trust_remote_code` when loading the tokenizer in the Transformers converter
- Improve performance of T5 models by reusing the same relative position bias in every layer
- Whisper: disable the first timestamp decoding rule when a prefix is used
- Install the CMake configuration in the correct library directory (e.g. some platforms use `lib64` instead of `lib`)
CTranslate2 3.16.1
Fixes and improvements
- Fix repeated outputs in version 3.16.0 when using `include_prompt_in_result=False` and a batch input with variable lengths: a typo in the code led to `min_length` being incorrectly applied
- Update the Transformers converter to accept extra tokens for Falcon models
- Release the Python GIL when loading the model
- Initialize the rotary embeddings on the GPU instead of the CPU
- Avoid a copy for the input features passed to the Whisper methods
- Vectorize copy in the Tile CUDA operator
CTranslate2 3.16.0
New features
- Update the Transformers converter to support more architectures:
  - Falcon-40B
  - XLM-RoBERTa
- Add the generation option `sampling_topp` to enable top-p (nucleus) sampling (see the example after this list)
- Save vocabulary files in the JSON format to better support tokens containing newlines or carriage returns
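The top-p example referenced above; the model path, prompt, and parameter values are placeholders:

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical
results = generator.generate_batch(
    [["<s>"]],
    beam_size=1,               # sampling applies to greedy search
    sampling_topp=0.9,         # sample from the smallest token set whose
    sampling_temperature=0.8,  # cumulative probability exceeds 0.9
    max_length=64,
)
print(results[0].sequences[0])
```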
Fixes and improvements
- Fix the application of `min_length` and `max_length` when using `include_prompt_in_result=False` and a batch input with variable lengths: the length constraint should only apply to the sequence after the prompt
- Update oneDNN to 3.1.1
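A sketch of the fixed behavior, with hypothetical model path and prompts: with `include_prompt_in_result=False`, the length bounds now constrain only the generated continuation, even when prompts in the batch have different lengths.

```python
import ctranslate2

generator = ctranslate2.Generator("model-ct2")  # path is hypothetical
prompts = [["<s>", "▁Hello"], ["<s>", "▁A", "▁longer", "▁prompt"]]

results = generator.generate_batch(
    prompts,
    include_prompt_in_result=False,  # results contain only the continuation
    min_length=8,                    # bounds apply to the generated part only
    max_length=32,
)
for result in results:
    print(result.sequences[0])
```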