BatchedInferencePipeline degrades transcription quality heavily #1179

Open

Appfinity-development opened this issue Nov 28, 2024 · 11 comments

Appfinity-development commented Nov 28, 2024

At first the new BatchedInferencePipeline seems great: it produces around a 2X speed improvement over the normal pipeline. But after some more testing I discovered that for some audio files the transcription quality is heavily degraded. Whole segments are missing compared to the normal pipeline, and some segments switch language midway for long stretches.

Example:

Segment A has 30 seconds of audio, fully in Dutch, though it does contain a few English words. Halfway through the segment the transcription switches to English, translating the Dutch audio, and at the end of the segment the initial_prompt text appears in the output. This happens in multiple places.

So this makes the BatchedInferencePipeline unsuitable for a production application.
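For anyone trying to reproduce this, a minimal sketch of the comparison I'm describing (the model size, device, and file name are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Run the same file through both pipelines
segments_seq, _ = model.transcribe("dutch_audio.wav", language="nl")
segments_bat, _ = batched.transcribe("dutch_audio.wav", language="nl", batch_size=16)

# Segments are generators; materialize them to compare counts
print("sequential segments:", sum(1 for _ in segments_seq))
print("batched segments:   ", sum(1 for _ in segments_bat))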


Venzon commented Nov 28, 2024

Check the VAD settings and use less restrictive parameters.

Appfinity-development (Author) commented Nov 28, 2024

This is my current config for the model:

options = dict(
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000),
    initial_prompt=prompt,
    word_timestamps=True,
    language=translate_language if translate else language,
    log_progress=True,
    multilingual=is_multilingual,
    task="translate" if translate else "transcribe",
)

However, doesn't VAD only remove the silent fragments of the audio? In other words, it keeps only the audio data that contains speech. How does that influence batched processing? AFAIK the BatchedInferencePipeline transcribes multiple segments in parallel instead of sequentially, which delivers the performance gains but apparently also introduces errors?


Venzon commented Nov 28, 2024

VAD detects speech, not silence. If it were designed to detect silence and remove it, it would be called Silence Activity Detection.
So if speech is not detected, that segment is lost in the transcription.
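You can sanity-check which regions the VAD keeps by running it on its own; anything outside these windows never reaches the model. A rough sketch, assuming a 16 kHz mono file and the helpers in faster_whisper.vad:

from faster_whisper import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

audio = decode_audio("dutch_audio.wav", sampling_rate=16000)
chunks = get_speech_timestamps(audio, VadOptions(min_silence_duration_ms=1000))

for chunk in chunks:
    # start/end are sample indices at 16 kHz
    print(f"{chunk['start'] / 16000:.2f}s -> {chunk['end'] / 16000:.2f}s")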

Appfinity-development (Author) commented Nov 28, 2024

Yes, we're saying the same thing: the VAD filter removes all non-voice audio data. My point is that the incorrect language switching and the prompt injection don't seem related to the VAD filter.


Venzon commented Nov 28, 2024

Try less restrictive settings, for example:

"vad_parameters": VadOptions(
    max_speech_duration_s=model.feature_extractor.chunk_length,
    min_silence_duration_ms=150,
    min_speech_duration_ms=120,
    speech_pad_ms=200,
    onset=0.25,
    offset=0.2
)
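For completeness, a sketch of how these options are passed through (the model size and file name are placeholders; the onset/offset field names match the VadOptions of the faster-whisper release current at the time, and newer versions may name them differently):

from faster_whisper import WhisperModel, BatchedInferencePipeline
from faster_whisper.vad import VadOptions

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=VadOptions(
        max_speech_duration_s=model.feature_extractor.chunk_length,  # ~30 s windows
        min_silence_duration_ms=150,
        min_speech_duration_ms=120,
        speech_pad_ms=200,
        onset=0.25,
        offset=0.2,
    ),
)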

MahmoudAshraf97 (Collaborator) commented

The code-switching abilities are not officially part of Whisper, so your results may vary; this has nothing to do with BatchedInferencePipeline.


Saccarab commented Dec 5, 2024

I've also had many occurrences of this. It happens regardless of VAD settings: the transcript for the same audio file has some speech chunks missing when using the batched pipeline. I also debugged the VAD segments produced and verified they are identical in both modes, and confirmed that the segments missing from the transcript fall within the boundaries of the VAD segments.
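A sketch of the kind of check described above (model size and file name are placeholders): transcribe the same file both ways and flag sequential segments whose midpoint is not covered by any batched segment.

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

# transcribe() returns (segments, info); index [0] takes the segment generator
seq = [(s.start, s.end, s.text) for s in model.transcribe("audio.wav")[0]]
bat = [(s.start, s.end, s.text) for s in batched.transcribe("audio.wav")[0]]

def covered(t, spans, tol=1.0):
    # True if time t falls inside any span, with a small tolerance in seconds
    return any(start - tol <= t <= end + tol for start, end, _ in spans)

for start, end, text in seq:
    if not covered((start + end) / 2, bat):
        print(f"missing from batched output: {start:.1f}-{end:.1f}s {text!r}")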

iamkhalidbashir commented

It's happening for me too. For example, a 60-second Urdu audio file has a whole 10 seconds missing.

wildwind0 commented

> I've also had many occurrences of this. This does happen regardless of VAD settings, the transcript for the same audio file will have some speech chunks missing when using batched. […]

Yes, I've also encountered this. When using the BatchedInferencePipeline, the problem of certain segments going missing occurs often, and adjusting the VAD parameters doesn't help. However, there's no such problem in non-batch mode.

Purfview (Contributor) commented Dec 10, 2024

> However, there's no such problem when using the non-batch mode.

The same problem of missing transcriptions exists in non-batch mode too; it just happens less often. Btw, it happens with all Whisper implementations.

> BatchedInferencePipeline degrades transcription quality

That's expected.


stri8ed commented Dec 10, 2024

I suspect the current WER benchmarks do not reflect this performance degradation, since the test audio tends to be very clean and much less sensitive to the VAD. A better test set would be a good first step toward improving the batched performance.
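As a sketch of that kind of measurement, assuming the jiwer package for WER and placeholder file names:

import jiwer
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3")
batched = BatchedInferencePipeline(model=model)

# Join the batched segments into one hypothesis string
hypothesis = " ".join(s.text.strip() for s in batched.transcribe("noisy_audio.wav")[0])
reference = open("noisy_audio_reference.txt").read()

print("batched WER:", jiwer.wer(reference, hypothesis))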
