This repository contains a collection of alternative Thai ASR test sets that we use to evaluate our Thai ASR project. We think it might be useful for your Thai ASR project as well to alternatively use these test sets as out-of-the-box ones. All of the audios in this repo are in 1 channel WAV files, with 16000 samplerate.
Some of test sets are collected from Mozilla Common Voice TH and SER dataset from VISTEC-depa.
We also report word-error-rate results of each test set transcribed by our models. Output transcription of out model can be seen in output_samples
folder. To keep WER consistent across test sets, we used ThaiNLP to tokenize both reference and output transcriptions before calculating WER by jiwer.
Test Set | Description | Source | License |
---|---|---|---|
CommonVoice2000 | A collection of 2000 audios randomly selected from CommonVoiceTH | Mozilla Common Voice | CC0 1.0 |
Male_Voice | A collection of 44 audios from 22 male speakers speaking the same sentence | VISTEC-depa | CC BY-SA 4.0 |
Female_Voice | A collection of 44 audios from 22 female speakers speaking the same sentence as in Male_Voice. | VISTEC-depa | CC BY-SA 4.0 |
Piyabutr_Interview | A collection of 33 audios from one of Piyabutr’s interviews with somewhat noisy environment. | ||
Piyabutr_with_Music_BG | A collection of 7 audios from a political advertisement clip, a male speaker (Piyabutr Saengkanokkul), with upbeat music on the background. | ||
Obodroid | A collection of 9 audios with clear speech from a male speaker talking about Obodroid’s products. | ||
Reporter | A collection of 19 audios from more than 15 Thai reporters. Each audio comes from difference news programs. Length of audios varies from 8 seconds to 1 minute. The topics cover daily news, weather, sport, politic etc. |
Model / WER(%) | CommonVoice2000 | Reporter | Piyabutr_Interview | Piyabutr_with_Music_BG | MaleVoice | FemaleVoice | Obodroid |
---|---|---|---|---|---|---|---|
DeepSpeech 300 hrs | 31.78 | 21.91 | 38.10 | 43.54 | 26.87 | 22.19 | 16.02 |
DeepSpeech 330 hrs | 32.1 | 21.02 | 35.83 | 42.31 | 28.07 | 29.81 | 11.57 |
- Numbers after the model indicate the size of (private) dataset which the model was trained on.
- The models were decoding with external KenLM (trained on ThaiSum dataset).
- All test sets, except CommonVoice2000, were manually transcribed by Nakhun Chumpolsathien.