path = r"D:\Project\Python_Project\FasterWhisper\large-v3"
model = WhisperModel(model_size_or_path=path, device="cuda", local_files_only=True)
segments, info = model.transcribe("audio.wav", beam_size=5, language="zh", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=1000))
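The timestamps quoted below are read from the returned segments, roughly like this (a minimal sketch; format_ts is a hypothetical helper that renders seconds as hh:mm:ss.ms):

def format_ts(seconds):
    # Hypothetical helper: render float seconds as hh:mm:ss.mmm for readability.
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

for segment in segments:
    print(f"[{format_ts(segment.start)} -> {format_ts(segment.end)}] {segment.text}")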
When I use VAD to transcribe an audio file, the segment 'start' and 'end' times are incorrect.
Without VAD, the time range given by start and end closely or exactly matches the speech in the audio. For example (times converted to hh:mm:ss.ms):
With VAD enabled, the end time is no longer the time at which the speech ends; instead it equals the start time of the next segment. For example (times converted to hh:mm:ss.ms):
The end time of segment 1 has become the start time of segment 2, and I think this is a bug.
To verify this, I ran Silero VAD on its own against the same audio file (with the same VAD parameters), and the resulting start and end times were close to the points where speech actually occurs in the audio.
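The standalone check looked roughly like this (a sketch; the Silero VAD utilities loaded via torch.hub may differ slightly between versions, so treat the unpacking and parameter names as assumptions to verify locally):

import torch

# Load Silero VAD and its helper utilities from the snakers4/silero-vad repo.
model_vad, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
# Same silence setting as the faster-whisper call above.
speech_timestamps = get_speech_timestamps(
    wav, model_vad,
    sampling_rate=16000,
    min_silence_duration_ms=1000,
    return_seconds=True,
)
for ts in speech_timestamps:
    print(ts["start"], ts["end"])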
Enable word timestamps for better timing accuracy. This is not really a VAD problem, though: Whisper's segment timing is not accurate in the first place. For even better timings, use forced alignment.
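Roughly like this, reusing the call from the original post (a sketch; word_timestamps=True attaches per-word timings to segment.words):

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    language="zh",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000),
    word_timestamps=True,  # per-word start/end times
)
for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f} -> {word.end:.2f}  {word.word}")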
I had a lot of trouble with VAD too. There are a few vad_parameters you can try; look at vad.py in the package to learn more about them. I remember there are settings for the beginning and end of speech, and you can also try the threshold parameter.
But because every audio clip is different, there is no one-size-fits-all setting. If you want absolute accuracy, you might have to build your own VAD module.
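For illustration, those settings are passed through vad_parameters, along these lines (a sketch; check VadOptions in faster_whisper/vad.py for the exact parameter names and defaults in your version):

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    language="zh",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                 # speech probability cutoff
        min_silence_duration_ms=1000,  # silence required before splitting segments
        speech_pad_ms=400,             # padding kept around detected speech
    ),
)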