After using VAD, the start and end times of the recognized segments are incorrect #1119

Open
zlyMaster opened this issue Nov 7, 2024 · 3 comments

Comments

@zlyMaster

from faster_whisper import WhisperModel

path = r"D:\Project\Python_Project\FasterWhisper\large-v3"

model = WhisperModel(model_size_or_path=path, device="cuda", local_files_only=True)

segments, info = model.transcribe("audio.wav", beam_size=5, language="zh", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=1000))

When I transcribe audio with VAD enabled, the segment 'start' and 'end' times are incorrect.
Without VAD, the time range given by start and end matches, or closely matches, when the speech occurs in the audio. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:46.333**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB

With VAD enabled, however, the end time is no longer the time when the speech ends; it equals the start time of the next segment. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:51.222**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB

The end time of segment 1 has changed to the start time of segment 2, and I think this is a bug.
To verify this, I ran Silero VAD on its own against the same audio file (with the same VAD parameters). Its start and end times were close to when speech actually occurs in the audio.

@zlyMaster
Author

I use faster-whisper to generate movie subtitles, so timing accuracy is very important. Otherwise, it affects how the subtitles are displayed.

@MahmoudAshraf97
Collaborator

Enable word timestamps for better timing accuracy, or use forced alignment for even better timings. This is not a VAD problem, though, because Whisper segment timing is not accurate in the first place.
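
A minimal sketch of the word-timestamp suggestion, assuming each segment exposes a `words` list whose items carry `start` and `end` in seconds (as faster-whisper segments do when `word_timestamps=True` is passed to `transcribe`). The helper names `tighten_segment` and `format_timestamp`, and the `FakeWord`/`FakeSegment` stand-ins, are hypothetical and exist only so the logic can be tried without loading a model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeWord:
    """Stand-in for a word entry (hypothetical, for illustration only)."""
    start: float
    end: float
    word: str


@dataclass
class FakeSegment:
    """Stand-in for a transcription segment (hypothetical, for illustration only)."""
    start: float
    end: float
    text: str
    words: List[FakeWord] = field(default_factory=list)


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as hh:mm:ss.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def tighten_segment(segment) -> Tuple[float, float]:
    """Return (start, end) taken from the first and last word of a segment.

    Falls back to the segment's own start/end when no word timings exist.
    """
    words = getattr(segment, "words", None)
    if not words:
        return segment.start, segment.end
    return words[0].start, words[-1].end


# With a real model, the segments would come from something like:
#   segments, info = model.transcribe("audio.wav", vad_filter=True, word_timestamps=True)
# Here a stand-in segment mimics the case reported above, where the segment
# end has drifted to the start of the next segment.
segment = FakeSegment(
    start=165.111,
    end=171.222,
    text="AAAAAAA",
    words=[FakeWord(165.2, 165.6, "AAA"), FakeWord(165.6, 166.333, "AAAA")],
)
start, end = tighten_segment(segment)
print(format_timestamp(start), format_timestamp(end), segment.text)
```

Word timings are still model estimates, so this narrows the subtitle window rather than guaranteeing exact boundaries.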

@Genesis1231

Genesis1231 commented Nov 10, 2024

I had a lot of trouble with VAD too. There are a few vad_parameters you can try; look at vad.py in the package to learn more about them. I remember there are settings for the beginning and ending of speech. You can also try the threshold parameter.

But because every audio clip is different, there is no one-size-fits-all setting. If you want absolute accuracy, you might have to build your own VAD module.
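
As a rough sketch of what tuning those settings could look like: the option names below are taken to match the `VadOptions` fields in faster-whisper's vad.py (`threshold`, `min_speech_duration_ms`, `min_silence_duration_ms`, `speech_pad_ms`), which is worth confirming against the installed version. The values are illustrative starting points, not recommendations, and `transcribe_with_vad` is a hypothetical wrapper.

```python
# Illustrative VAD settings; names assumed to match VadOptions in
# faster_whisper/vad.py, values chosen only as an example.
vad_parameters = dict(
    threshold=0.5,                 # speech probability above which a frame counts as speech
    min_speech_duration_ms=250,    # drop speech chunks shorter than this
    min_silence_duration_ms=1000,  # silence needed before a chunk is closed
    speech_pad_ms=200,             # padding added on each side of a speech chunk
)


def transcribe_with_vad(model, audio_path, vad_parameters):
    """Run transcription with the VAD filter enabled (hypothetical wrapper).

    `model` is expected to behave like faster_whisper.WhisperModel.
    """
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language="zh",
        vad_filter=True,
        vad_parameters=vad_parameters,
    )
    # The segments object is consumed lazily, so materialize it here.
    return list(segments), info
```

Changing one value at a time and comparing the resulting segment boundaries against the audio is probably the easiest way to see which setting matters for a given clip.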
