BatchedInferencePipeline degrades transcription quality heavily #1179

Open

Appfinity-development opened this issue Nov 28, 2024 · 11 comments

Appfinity-development commented Nov 28, 2024

At first the new BatchedInferencePipeline seems great: it produces around a 2X speed improvement over the normal pipeline. But after some more testing I discovered that for some audio files the transcription quality is heavily degraded. Whole segments are missing compared to the normal pipeline, and some segments switch language midway for long stretches.

Example:

Segment A has 30 seconds of audio, fully in Dutch, though it does contain a few English words. Halfway through the segment the transcription switches to English, translating the Dutch audio, and at the end of the segment the initial_prompt text appears in the output. This happens in multiple places.

So this makes the BatchedInferencePipeline unsuitable for a production application.
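For anyone trying to reproduce this, a minimal sketch of the comparison I'm describing (the model size, device, and file name are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Run the same file through both pipelines
segments_seq, _ = model.transcribe("dutch_audio.wav", language="nl")
segments_bat, _ = batched.transcribe("dutch_audio.wav", language="nl", batch_size=16)

# Segments are generators; materialize them to compare counts
print("sequential segments:", sum(1 for _ in segments_seq))
print("batched segments:   ", sum(1 for _ in segments_bat))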


Venzon commented Nov 28, 2024

Check the VAD settings and use less restrictive parameters.

Appfinity-development (Author) commented Nov 28, 2024

This is my current config for the model:

options = dict(
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000),
    initial_prompt=prompt,
    word_timestamps=True,
    language=translate_language if translate else language,
    log_progress=True,
    multilingual=is_multilingual,
    task="translate" if translate else "transcribe",
)

However, doesn't VAD only remove the silent fragments of the audio? In other words, it keeps only the audio data that contains speech. How does that influence batched processing? AFAIK the BatchedInferencePipeline transcribes multiple segments in parallel instead of sequentially, which delivers the performance gains but apparently also introduces errors?


Venzon commented Nov 28, 2024

VAD detects speech, not silence. If it were designed to detect silence and remove it, it would be called Silence Activity Detection.
So if speech is not detected, that segment is lost in the transcription.
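You can sanity-check which regions the VAD keeps by running it on its own; anything outside these windows never reaches the model. A rough sketch, assuming a 16 kHz mono file and the helpers in faster_whisper.vad:

from faster_whisper import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

audio = decode_audio("dutch_audio.wav", sampling_rate=16000)
chunks = get_speech_timestamps(audio, VadOptions(min_silence_duration_ms=1000))

for chunk in chunks:
    # start/end are sample indices at 16 kHz
    print(f"{chunk['start'] / 16000:.2f}s -> {chunk['end'] / 16000:.2f}s")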

Appfinity-development (Author) commented Nov 28, 2024

Yes, we're saying the same thing: the VAD filter removes all non-voice audio data. My point is that the incorrect language switching and the prompt injection don't seem related to the VAD filter.


Venzon commented Nov 28, 2024

Try less restrictive settings, for example:

"vad_parameters": VadOptions(
    max_speech_duration_s=model.feature_extractor.chunk_length,
    min_silence_duration_ms=150,
    min_speech_duration_ms=120,
    speech_pad_ms=200,
    onset=0.25,
    offset=0.2
)
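For completeness, a sketch of how these options are passed through (the model size and file name are placeholders; the onset/offset field names match the VadOptions of the faster-whisper release current at the time, and newer versions may name them differently):

from faster_whisper import WhisperModel, BatchedInferencePipeline
from faster_whisper.vad import VadOptions

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=VadOptions(
        max_speech_duration_s=model.feature_extractor.chunk_length,  # ~30 s windows
        min_silence_duration_ms=150,
        min_speech_duration_ms=120,
        speech_pad_ms=200,
        onset=0.25,
        offset=0.2,
    ),
)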

MahmoudAshraf97 (Collaborator) commented

The code-switching abilities are not officially part of Whisper, so your results may vary; this has nothing to do with BatchedInferencePipeline.


Saccarab commented Dec 5, 2024

I've also had many occurrences of this. It happens regardless of VAD settings: the transcript for the same audio file has some speech chunks missing when using the batched pipeline. I also debugged the VAD segments produced and verified they are identical in both modes, and confirmed that the segments missing from the transcript fall within the boundaries of the VAD segments.
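A sketch of the kind of check described above (model size and file name are placeholders): transcribe the same file both ways and flag sequential segments whose midpoint is not covered by any batched segment.

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

# transcribe() returns (segments, info); index [0] takes the segment generator
seq = [(s.start, s.end, s.text) for s in model.transcribe("audio.wav")[0]]
bat = [(s.start, s.end, s.text) for s in batched.transcribe("audio.wav")[0]]

def covered(t, spans, tol=1.0):
    # True if time t falls inside any span, with a small tolerance in seconds
    return any(start - tol <= t <= end + tol for start, end, _ in spans)

for start, end, text in seq:
    if not covered((start + end) / 2, bat):
        print(f"missing from batched output: {start:.1f}-{end:.1f}s {text!r}")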

iamkhalidbashir commented

It's happening for me too. For example, a 60-second Urdu audio file has a whole 10 seconds missing.

wildwind0 commented

> I've also had many occurrences of this. This does happen regardless of VAD settings, the transcript for the same audio file will have some speech chunks missing when using batched. […]

Yes, I've also encountered this. When using the BatchedInferencePipeline, the problem of certain segments going missing occurs often, and adjusting the VAD parameters doesn't help. However, there's no such problem in non-batch mode.

Purfview (Contributor) commented Dec 10, 2024

> However, there's no such problem when using the non-batch mode.

The same problem of missing transcriptions exists in non-batch mode too; it just happens less often. Btw, it happens with all Whisper implementations.

> BatchedInferencePipeline degrades transcription quality

That's expected.


stri8ed commented Dec 10, 2024

I suspect the current WER benchmarks do not reflect this performance degradation, since the test audio tends to be very clean and much less sensitive to the VAD. A better test set would be a good first step toward improving the batched performance.
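As a sketch of that kind of measurement, assuming the jiwer package for WER and placeholder file names:

import jiwer
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

# Join the batched segments into one hypothesis string
hypothesis = " ".join(s.text.strip() for s in batched.transcribe("noisy_audio.wav")[0])
reference = open("noisy_audio_reference.txt").read()

print("batched WER:", jiwer.wer(reference, hypothesis))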
