After using VAD, the start and end times of the recognized segments are incorrect #1119

zlyMaster · 2024-11-07T08:14:12Z

from faster_whisper import WhisperModel

path = r"D:\Project\Python_Project\FasterWhisper\large-v3"

model = WhisperModel(model_size_or_path=path, device="cuda", local_files_only=True)

segments, info = model.transcribe("audio.wav", beam_size=5, language="zh", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=1000))

When I transcribe audio with VAD enabled, the segment 'start' and 'end' times are incorrect.
Without VAD, the time range given by start and end closely matches when the speech actually occurs in the audio. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:46.333**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB

But with VAD enabled, a segment's end time is no longer the time when the speech ends; it equals the start time of the next segment. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:51.222**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB
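
For reference, a minimal sketch of the conversion used for the timestamps above, assuming start and end are floats in seconds. The helper name `format_timestamp` is hypothetical and not part of faster-whisper.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as hh:mm:ss.ms (hypothetical helper)."""
    # Work in whole milliseconds to avoid float artifacts in the output.
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


print(format_timestamp(165.111))  # segment1.start in the examples above
```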

The end time of segment 1 has changed to the start time of segment 2, and I think this is a bug.
To verify this, I ran Silero VAD on its own on the same audio file (with the same VAD parameters), and the start and end times it reported were close to when speech actually occurs in the audio.
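
As a possible workaround built on that comparison, here is a rough sketch that trims a segment to the speech spans reported by a standalone VAD run. It assumes the spans are already (start, end) pairs in seconds; the function name `clamp_to_speech`, the `min_overlap` tolerance, and the span values in the example are all hypothetical, and this is not how faster-whisper computes its timestamps.

```python
from typing import Iterable, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def clamp_to_speech(start: float, end: float,
                    speech_spans: Iterable[Span],
                    min_overlap: float = 0.1) -> Span:
    """Trim [start, end] to the VAD speech spans it meaningfully overlaps.

    Spans overlapping the segment by less than `min_overlap` seconds are
    ignored, so a segment that merely touches the next utterance is not
    stretched to cover it.
    """
    overlapping = [
        (s, e) for s, e in speech_spans
        if min(end, e) - max(start, s) >= min_overlap
    ]
    if not overlapping:
        # No usable VAD evidence: leave the segment unchanged.
        return start, end
    new_start = max(start, min(s for s, _ in overlapping))
    new_end = min(end, max(e for _, e in overlapping))
    return new_start, new_end


# Illustrative spans only (not measured from the audio in this issue).
spans = [(165.0, 166.333), (171.222, 179.5)]
print(clamp_to_speech(165.111, 171.222, spans))
```

With these made-up spans, the stretched end time of segment 1 is pulled back to the end of the first speech span.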


zlyMaster · 2024-11-07T08:21:13Z

I use faster-whisper to generate movie subtitles, so timing accuracy is very important. Otherwise, it affects how the subtitles are displayed.

MahmoudAshraf97 · 2024-11-07T12:21:38Z

Enable word timestamps for better timing accuracy, or use forced alignment for even better timings. This is not a VAD problem, though, because Whisper segment timing is not accurate in the first place.
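
A minimal sketch of the word-timestamp suggestion, assuming `word_timestamps=True` is passed to `model.transcribe` so that each segment carries a `words` list whose items have `start` and `end`. The `Word` and `Segment` dataclasses below are stand-ins so the example runs without a model, and the helper names `word_bounds` and `transcribe_with_word_bounds` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    # Stand-in with the same start/end/word attributes as a transcribed word.
    start: float
    end: float
    word: str


@dataclass
class Segment:
    # Stand-in for a transcribed segment; `words` is None without word timestamps.
    start: float
    end: float
    text: str
    words: Optional[List[Word]] = None


def word_bounds(segment) -> Tuple[float, float]:
    """Return (first word start, last word end), falling back to segment times."""
    words = segment.words or []
    if not words:
        return segment.start, segment.end
    return words[0].start, words[-1].end


def transcribe_with_word_bounds(path, model):
    """Transcribe with word timestamps and tighten each segment to its words.

    `model` is expected to be a faster_whisper.WhisperModel; the other
    arguments are copied from the call in the original report.
    """
    segments, _info = model.transcribe(
        path,
        beam_size=5,
        language="zh",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=1000),
        word_timestamps=True,
    )
    return [(*word_bounds(s), s.text) for s in segments]


# Illustrative values only: a segment whose end was stretched to 171.222.
demo = Segment(165.111, 171.222, "AAAAAAA",
               [Word(165.111, 165.7, "AAA"), Word(165.7, 166.333, "AAAA")])
print(word_bounds(demo))
```

Whether the word-level end time is accurate enough for subtitles would still need to be checked against the audio.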

Genesis1231 · 2024-11-10T05:46:12Z

I had a lot of trouble with VAD too. There are a few vad_parameters you can try; look at vad.py in the package to learn more about them. I remember there are settings for the beginning and ending, and you can also try the threshold parameter.

But because every audio clip is different, there is no one-size-fits-all setting. If you want absolute accuracy, you might have to build your own VAD module.
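
A rough sketch of how trying several settings might look, assuming `threshold`, `speech_pad_ms`, and `min_silence_duration_ms` are accepted keys in `vad_parameters` (check vad.py in your installed version). The candidate values and the helper names `vad_parameter_grid` and `try_settings` are hypothetical.

```python
from itertools import product


def vad_parameter_grid(thresholds=(0.3, 0.5, 0.7),
                       pads_ms=(100, 400),
                       min_silence_ms=1000):
    """Build candidate vad_parameters dicts (illustrative values only)."""
    return [
        dict(threshold=t, speech_pad_ms=p, min_silence_duration_ms=min_silence_ms)
        for t, p in product(thresholds, pads_ms)
    ]


def try_settings(path, model, grid):
    """Transcribe once per candidate and collect (params, segment timings).

    `model` is expected to be a faster_whisper.WhisperModel.
    """
    results = []
    for params in grid:
        segments, _info = model.transcribe(
            path, vad_filter=True, vad_parameters=params
        )
        # Segments are produced lazily, so consume them here.
        results.append((params, [(s.start, s.end, s.text) for s in segments]))
    return results


grid = vad_parameter_grid()
print(len(grid), grid[0])
```

The collected timings can then be compared by ear or against a reference subtitle file to pick a setting for that particular clip.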

montvid mentioned this issue Nov 15, 2024

Timestamps always maximum length when using Silero VAD jhj0517/Whisper-WebUI#287

Open
