BatchedInferencePipeline degrades transcription quality heavily #1179
Comments
Check the settings of the VAD and set less restrictive parameters.
This is my current config for the model:
However, doesn't VAD only remove the silent fragments of the audio? In other words, it filters out everything except audio data with speech in it. How does this influence batched processing? AFAIK the …
VAD detects speech, not silence. If it were designed to detect silence and remove it, it would be called Silence Activity Detection.
Yes, we're saying the same thing. The VAD filter just removes all non-voice audio data. My point is that the incorrect language switching and prompt injection seem unrelated to the VAD filter.
Check less restrictive settings, for example:
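For context, "less restrictive" here means making the VAD keep more audio: a lower speech-probability threshold, a longer required silence before cutting, and more padding around detected speech. A minimal sketch of such a configuration with faster-whisper is below; the specific values are illustrative assumptions, not recommended defaults.

```python
# Hedged sketch: looser VAD parameters for faster-whisper.
# The exact values are illustrative; tune them per dataset.
vad_parameters = dict(
    threshold=0.3,                 # lower speech-probability threshold (default is ~0.5)
    min_silence_duration_ms=1000,  # require a longer pause before splitting segments
    speech_pad_ms=800,             # pad detected speech regions more generously
)

# This dict would then be passed to the transcribe call, e.g.:
# from faster_whisper import WhisperModel, BatchedInferencePipeline
# model = WhisperModel("large-v3")
# batched = BatchedInferencePipeline(model=model)
# segments, info = batched.transcribe(
#     "audio.wav", vad_filter=True, vad_parameters=vad_parameters
# )
```

If speech near segment boundaries is being clipped, raising `speech_pad_ms` is usually the first knob to try.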
The code-switching abilities are not officially part of Whisper, so your results may vary; this has nothing to do with …
I've also had many occurrences of this.
It's happening for me too. For example, an Urdu audio file of 60 seconds is missing a whole 10 seconds of transcription.
Yes, I've also encountered this. When using the BatchedInferencePipeline mode, certain segments often go missing, and adjusting the VAD parameters doesn't help either. There's no such problem when using the non-batch mode.
There is the same problem with missing transcriptions in non-batch mode too, only it happens less often.
That's expected.
I suspect the current WER benchmarks do not reflect this performance degradation, since the test audio tends to be very clean and much less sensitive to the VAD. A better test set would be a good first step towards improving batched performance.
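For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. Dropped segments show up directly as deletions. A minimal self-contained sketch (not part of faster-whisper or any benchmark suite) is:

```python
# Minimal WER sketch: word-level Levenshtein distance / reference length.
# Illustrative helper only; real benchmarks normalize text first.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A whole dropped segment inflates WER sharply: `wer("the quick brown fox jumps", "the quick")` counts three deletions over five reference words, i.e. 0.6.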
At first the new `BatchedInferencePipeline` seems great. It produces around a 2X speed improvement compared to the normal pipeline. But after some more testing I discovered that for some audio files the transcription quality is heavily degraded: whole segments are missing compared to the normal pipeline, and some segments switch language midway for long periods.

Example: segment A has 30 seconds of audio, fully in Dutch, though it does contain a few English words. Halfway through the transcribed segment the text becomes English, translating the Dutch audio. And at the end of the segment the `initial_prompt` is displayed. This happens in multiple places.

So this makes the `BatchedInferencePipeline` not suited for a production application.