🤖 Model | 📔 Jupyter Notebook | 🤗 Huggingface Space Demo | 📃 Medium Blog (Thai)
Thonburian Whisper is an Automatic Speech Recognition (ASR) model for Thai, fine-tuned from OpenAI's original Whisper model. The model was released as part of Hugging Face's Whisper fine-tuning event (December 2022). We fine-tuned Whisper models for Thai using Common Voice 13, the Gowajee corpus, Thai Elderly Speech, and Thai Dialect datasets. Our models are robust under environmental noise and adapt well to domain-specific audio such as financial and medical speech. We release the models and distilled models on the Hugging Face model hub (see below).
Use the model with Hugging Face's transformers as follows:
```python
import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-medium-combined"  # see alternative model names below
lang = "th"
device = 0 if torch.cuda.is_available() else "cpu"

# Create an ASR pipeline that chunks long audio into 30-second windows.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe, forcing Thai transcription.
text = pipe(
    "audio.mp3",
    generate_kwargs={"language": f"<|{lang}|>", "task": "transcribe"},
    batch_size=16,
)["text"]
```
Use pip to install the requirements as follows:

```sh
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!sudo apt install ffmpeg
```
We measure the word error rate (WER) of the models with the deepcut tokenizer, after normalizing special tokens (▁ to _ and — to -) and simple text post-processing (เเ to แ and ํา to ำ). See an example evaluation script here.
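A minimal sketch of this evaluation, assuming the `deepcut` and Hugging Face `evaluate` packages (the example strings are placeholders, not real evaluation data):

```python
import deepcut
from evaluate import load

wer_metric = load("wer")

def normalize(text: str) -> str:
    # Normalization described above: special tokens and common Thai mis-encodings.
    return (
        text.replace("\u2581", "_")   # ▁ to _
            .replace("\u2014", "-")   # em dash to hyphen
            .replace("เเ", "แ")        # double sara e to sara ae
            .replace("ํา", "ำ")         # decomposed sara am to composed form
    )

def to_words(text: str) -> str:
    # Segment Thai text with deepcut and join with spaces so a standard
    # word-level WER can be computed.
    return " ".join(deepcut.tokenize(normalize(text)))

predictions = [to_words(p) for p in ["สวัสดีครับ"]]  # model outputs (placeholder)
references = [to_words(r) for r in ["สวัสดีครับ"]]   # ground truth (placeholder)
print(wer_metric.compute(predictions=predictions, references=references))
```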
Model | WER (Common Voice 13) | Model URL |
---|---|---|
Thonburian Whisper (small) | 11.0 | Link |
Thonburian Whisper (medium) | 7.42 | Link |
Thonburian Whisper (large-v2) | 7.69 | Link |
Thonburian Whisper (large-v3) | 6.59 | Link |
Distilled Thonburian Whisper (small) | 11.2 | Link |
Distilled Thonburian Whisper (medium) | 7.6 | Link |
Thonburian Whisper (medium-timestamps) | 15.57 | Link |
Thonburian Whisper is fine-tuned on a combined dataset of Thai speech including Common Voice, Google Fleurs, and curated datasets. The Common Voice test split is the original split from the `datasets` library.
Inference time
We benchmarked the average inference speed on 1-minute audio for different model sizes (small, medium, and large) on an NVIDIA A100 with fp32 precision and a batch size of 1. The medium model offers a balanced trade-off between WER and computational cost. (Note that the distilled models, due to their smaller size and the batch size of 1, do not fully saturate the GPU; with a higher batch size, inference time drops substantially.) A rough timing sketch follows the table below.
Model | Memory usage (MB) | Inference time (sec / 1 min audio) | Number of parameters | Model URL |
---|---|---|---|---|
Thonburian Whisper (small) | 931.93 | 0.50 | 242M | Link |
Thonburian Whisper (medium) | 2923.05 | 0.83 | 764M | Link |
Thonburian Whisper (large) | 6025.84 | 1.89 | 1540M | Link |
Distilled Thonburian Whisper (small) | 650.27 | 4.42 | 166M | Link |
Distilled Thonburian Whisper (medium) | 1642.15 | 4.36 | 428M | Link |
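A rough sketch of how such numbers can be reproduced, assuming the `pipe` from the usage snippet above, a CUDA device, and a placeholder path to a 1-minute clip (this is an illustration, not the exact benchmark script we used):

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
_ = pipe("one_minute_clip.mp3")  # placeholder path to a 1-minute audio file
torch.cuda.synchronize()

print(f"inference time: {time.perf_counter() - start:.2f} s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```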
These models are fine-tuned versions of OpenAI's Whisper, optimized for Thai ASR:
- Small: Balanced performance with lower resource requirements.
- Medium: Best trade-off between accuracy and computational cost.
- Large-v2/v3: Highest accuracy, but more resource-intensive.
Use these for general Thai ASR tasks where timestamps are not required.
Model: biodatlab/whisper-th-medium-timestamp
This model is specifically designed for Thai ASR with timestamp generation. It is based on the Whisper medium architecture and fine-tuned on a custom long-form dataset.
Key Features:
- Generates timestamps for transcribed text
- WER: 15.57 (with the deepcut tokenizer)
- Suitable for subtitle creation or audio-text alignment tasks (see the SRT sketch below)
Usage:
```python
import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-medium-timestamp"
lang = "th"
device = 0 if torch.cuda.is_available() else "cpu"

# Create an ASR pipeline that also returns segment timestamps.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
    return_timestamps=True,
)

# Force Thai transcription so the model does not auto-detect the language.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe",
)

result = pipe("audio.mp3", return_timestamps=True)
text, timestamps = result["text"], result["chunks"]
```
Note: While this model provides timestamp information, its transcription accuracy (WER 15.57) is lower than that of the non-timestamp models above.
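Since each chunk carries start and end times, converting the output to subtitles is straightforward. A minimal sketch, assuming the `result` from the snippet above (`to_srt` is an illustrative helper, not part of this repo):

```python
def to_srt(chunks):
    # Convert pipeline chunks [{"timestamp": (start, end), "text": ...}] to SRT.
    def fmt(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    entries = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n")
    return "\n".join(entries)

with open("audio.srt", "w", encoding="utf-8") as f:
    f.write(to_srt(result["chunks"]))
```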
These models are distilled versions of the larger Thonburian Whisper models, offering improved efficiency:
- Distilled Medium:
  - 4 decoder layers (vs. 24 in the teacher model)
  - Distilled from the medium Whisper ASR model
- Distilled Small:
  - 4 decoder layers (vs. 12 in the teacher model)
  - Distilled from the small Whisper ASR model
Both distilled models were trained on a combination of Common Voice v13, Gowajee, the Thai Elderly Speech Corpus, custom scraped data, and the Thai-Central dialect from the SLSCU Thai Dialect Corpus.
Use these models for efficient Thai ASR in resource-constrained environments or for faster inference times.
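The distilled checkpoints load through the same pipeline API as the snippet at the top; only the model name changes. A minimal sketch (the model ID below is an assumption; use the exact ID from the Model URL column in the tables above):

```python
from transformers import pipeline

# Assumed distilled checkpoint ID; take the real one from the tables above.
pipe = pipeline(
    task="automatic-speech-recognition",
    model="biodatlab/distill-whisper-th-small",
    chunk_length_s=30,
)
print(pipe("audio.mp3")["text"])
```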
Thonburian Whisper can be used for long-form audio transcription by combining voice activity detection (VAD), a Thai word tokenizer, and chunking for word-level alignment. We found that this approach is more robust and produces a lower insertion error rate (IER) than using Whisper with timestamps. See the README.md in the `longform_transcription` folder for detailed usage; a rough sketch of the VAD step follows below.
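A rough illustration of the VAD-plus-chunking idea, assuming the Silero VAD model from `torch.hub` and the `pipe` created earlier (this is a sketch, not the repo's actual pipeline, and it omits the Thai word tokenizer and word-level alignment steps):

```python
import torch

# Load Silero VAD to detect speech segments (assumed VAD choice for illustration).
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("long_audio.mp3", sampling_rate=16000)
segments = get_speech_timestamps(wav, vad_model, sampling_rate=16000)

# Transcribe each detected speech segment with the ASR pipe created above.
transcript = []
for seg in segments:
    chunk = wav[seg["start"]:seg["end"]].numpy()
    out = pipe({"raw": chunk, "sampling_rate": 16000})
    transcript.append(out["text"])
print(" ".join(transcript))
```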
If you use our models, you can cite them with the following BibTeX entry:

```
@misc{thonburian_whisper_med,
  author    = {Zaw Htet Aung and Thanachot Thavornmongkol and Atirut Boribalburephan and Vittavas Tangsriworakan and Knot Pipatsrisawat and Titipat Achakulvisut},
  title     = {Thonburian Whisper: A fine-tuned Whisper model for Thai automatic speech recognition},
  year      = {2022},
  url       = {https://huggingface.co/biodatlab/whisper-th-medium-combined},
  doi       = {10.57967/hf/0226},
  publisher = {Hugging Face}
}
```