Inconsistent Transcription Results with faster-whisper in Asynchronous Functions #1207

Open
savank7 opened this issue Dec 18, 2024 · 12 comments

@savank7

savank7 commented Dec 18, 2024

Issue Description:

I am using the faster-whisper library for speech-to-text transcription and encountering inconsistent results when running the transcription function asynchronously. Here's the code snippet:

from faster_whisper import WhisperModel
import asyncio

model_size = 'medium'
model = WhisperModel(model_size, device='cuda')

segments, _ = model.transcribe('audio_path', language="en")
transcription = " ".join(segment.text for segment in segments)
print(f"Transcription without async: {transcription}")


async def tts_func(num):
    asy_segments, _ = model.transcribe('audio_path', language="en")
    asy_transcription = " ".join(asy_segment.text for asy_segment in asy_segments)
    print(f"Transcription {num}: {asy_transcription}")


async def call_func():
    for i in range(5):
        await tts_func(i)


if __name__ == "__main__":
    asyncio.run(call_func())

output:

Transcription without async:  good morning hello hello hello
Transcription 0:  good morning hello hello hello  good morning hello hello hello
Transcription 1:  Good morning. Hello. Hello. Hello.
Transcription 2:  good morning hello hello hello  Hello
Transcription 3:  Good morning. Hello. Hello. Hello.
Transcription 4:  Good morning. Hello. Hello. Hello.  So.  So.  So.  So.  So.

Problem:
The synchronous transcription produces consistent and accurate results:
good morning hello hello hello

However, the asynchronous transcriptions are inconsistent and sometimes produce repeated or erroneous outputs, such as:
good morning hello hello hello good morning hello hello hello
Good morning. Hello. Hello. Hello. So. So. So. So. So.

This is critical because my project requires asynchronous processing for optimal flow, but the inconsistencies make it unusable.

Expected Behavior:
The transcription results from the asynchronous calls should be identical to the synchronous call's results.

Questions:

  1. Why is the asynchronous transcription producing inconsistent results?
  2. Is the faster-whisper model thread-safe, or does it require specific handling in an asynchronous context?
  3. How can I resolve this issue to get consistent results when using asynchronous functions?

Additional Details:

  • Model size: medium
  • Device: CUDA (Nvidia GPU)
  • Language: English (language="en")
  • Library: faster-whisper

Any guidance or recommendations to ensure consistent results in the asynchronous workflow would be greatly appreciated.
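
A note on the async part of the question, offered as a sketch and not as an answer from the maintainers. model.transcribe is an ordinary blocking call, and the loop above awaits each tts_func in turn, so the five calls run one after another and never overlap. The variation in the output therefore most likely comes from decoding, not from asyncio itself. If the goal is to keep the event loop responsive, one common pattern is to push the blocking call to a worker thread and serialize access with a lock. The helper names below (transcribe_blocking, transcribe_async, transcribe_many) are hypothetical, and the lock is a conservative assumption, not a documented requirement of faster-whisper.

```python
import asyncio


def transcribe_blocking(model, audio, **options):
    """Run one transcription to completion and return the joined text."""
    segments, _info = model.transcribe(audio, **options)
    # In faster-whisper, segments is a lazy generator: the actual decoding
    # happens while it is consumed, so consume it inside the worker thread.
    return " ".join(segment.text.strip() for segment in segments)


async def transcribe_async(model, audio, lock, **options):
    """Run the blocking call in a worker thread, one call at a time."""
    async with lock:
        # asyncio.to_thread needs Python 3.9 or newer.
        return await asyncio.to_thread(transcribe_blocking, model, audio, **options)


async def transcribe_many(model, paths, **options):
    """Transcribe several inputs without blocking the event loop."""
    lock = asyncio.Lock()
    tasks = [transcribe_async(model, path, lock, **options) for path in paths]
    return await asyncio.gather(*tasks)


# Usage (needs faster-whisper, a CUDA device and a real audio file):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium", device="cuda")
#   texts = asyncio.run(transcribe_many(model, ["audio_path"] * 5, language="en"))
```

The lock means transcriptions still run one at a time; what changes is that other coroutines (for example a WebSocket handler) can keep running while the model works.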

@Purfview
Contributor

The Whisper model is non-deterministic when temperature is not 0. Try setting temperature=0, but results may degrade.
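
For reference, a minimal sketch of passing temperature=0 and checking whether repeated runs agree. In faster-whisper the temperature argument accepts a single value or a sequence. As far as I know the default is a sequence of increasing fallback temperatures, which would explain why the first snippet can produce sampled output even though no temperature was passed. The helper names (transcribe_text, is_repeatable) are hypothetical and the audio path is a placeholder.

```python
def transcribe_text(model, audio, temperature=0.0, language="en"):
    """Transcribe once and return the joined segment text."""
    segments, _info = model.transcribe(
        audio, language=language, temperature=temperature
    )
    return " ".join(segment.text.strip() for segment in segments)


def is_repeatable(model, audio, runs=3, **kwargs):
    """Return True if `runs` transcriptions of the same audio all match."""
    outputs = {transcribe_text(model, audio, **kwargs) for _ in range(runs)}
    return len(outputs) == 1


# Usage (needs faster-whisper, a CUDA device and a real audio file):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium", device="cuda")
#   print(is_repeatable(model, "audio_path", temperature=0.0))
```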

@savank7
Author

savank7 commented Dec 18, 2024

Hello @Purfview,

I tried the solution you mentioned, but I still get the same result.

from faster_whisper import WhisperModel
import asyncio

model_size = 'medium'
model = WhisperModel(model_size, device='cuda')
temperature = 1  # the runs shown below used 0, 0.5 and 1

segments, _ = model.transcribe('audio_path', language="en", temperature=temperature)
transcription = " ".join(segment.text for segment in segments)
print(f"Transcription without async and temperature - {temperature}: {transcription}")


async def tts_func(num):
    asy_segments, _ = model.transcribe('audio_path', language="en", temperature=temperature)
    asy_transcription = " ".join(asy_segment.text for asy_segment in asy_segments)
    print(f"Transcription {num} and temperature - {temperature} : {asy_transcription}")


async def call_func():
    for i in range(5):
        await tts_func(i)


if __name__ == "__main__":
    asyncio.run(call_func())

output:

Transcription without async and temperature - 0:  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 0 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 1 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 2 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 3 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 4 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.


Transcription without async and temperature - 0.5:  Good morning. Hello. Hello. Hello.  Good morning.
Transcription 0 and temperature - 0.5 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello.
Transcription 1 and temperature - 0.5 :  Good morning. Hello. Hello. Hello.  Hello.
Transcription 2 and temperature - 0.5 :  Good morning. Hello. Hello. Hello.
Transcription 3 and temperature - 0.5 :  Good morning. Hello. Hello. Hello.  Hello. Hello. Hello.
Transcription 4 and temperature - 0.5 :  Good morning. Hello. Hello. Hello.


Transcription without async and temperature - 1:  Good morning. Hello. Hello. Hello.  Hello-hello! Hello!  Hello! Hello!  Hello! Hello!  Hello! Hello!  Hello! Hello!
Transcription 0 and temperature - 1 :  Good morning Hello  Hello Hello  Hello Hello  Hello Hello  Good morning Hello  Hello Hello  Hello Hello  Hello Hello  Hello Hello  Hello Hello  Hello Hello
Transcription 1 and temperature - 1 :  Good morning. Hello. Hello. Hello.
Transcription 2 and temperature - 1 :  Good morning. Hello. Hello. Hello.  Hello. Hello. Hello. Hello. Hello.  Hello. Hello. Hello. Hello. Hello. Hello. Hello. Hello.  Hello.  Hello.
Transcription 3 and temperature - 1 :  good morning hello hello hello  so
Transcription 4 and temperature - 1 :  Good morning. Hello. Hello. Hello.  Hello. Hello. Hello.  Hello. Hello. Hello.  Hello. Hello. Hello.  Hello. Hello.  Hello. Hello.

I have also attached a screenshot for reference.

Screenshot from 2024-12-18 19-31-36

@Purfview
Contributor

> Expected Behavior:
> The transcription results from the asynchronous calls should be identical to the synchronous call's results.

> I tried the solution you mentioned, but I still get the same result.

But it's not the same:

Without async and temperature   - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 0 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 1 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 2 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 3 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.
Transcription 4 and temperature - 0 :  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.  Good morning. Hello. Hello. Hello.

vs your previous result:

Without async:    good morning hello hello hello
Transcription 0:  good morning hello hello hello  good morning hello hello hello
Transcription 1:  Good morning. Hello. Hello. Hello.
Transcription 2:  good morning hello hello hello  Hello
Transcription 3:  Good morning. Hello. Hello. Hello.
Transcription 4:  Good morning. Hello. Hello. Hello.  So.  So.  So.  So.  So.

@Purfview
Contributor

Read it again:

> The Whisper model is non-deterministic when temperature is not 0. Try setting temperature=0, but results may degrade.

@savank7
Author

savank7 commented Dec 18, 2024

Yes @Purfview, it is producing new results every time.

@Purfview
Contributor

Purfview commented Dec 18, 2024

> Yes @Purfview, it is producing new results every time.

It doesn't when you set temperature to 0.

@savank7
Author

savank7 commented Dec 18, 2024

> > Yes @Purfview, it is producing new results every time.
>
> It doesn't when you set temperature to 0.

Yes, @Purfview, but the generated text is incorrect.

@Purfview
Contributor

> Yes, @Purfview, but the generated text is incorrect.

Whisper doesn't guarantee a correct result; you can try a bigger model like large-v2.

And:

> ...but results may degrade.

@Purfview
Contributor

BTW, your result looks like it contains hallucinations; try hallucination_silence_threshold=2.
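
A sketch of where that argument goes. hallucination_silence_threshold is a keyword argument of transcribe(). In the faster-whisper versions I am aware of it only takes effect together with word_timestamps=True, so both are set below; treat that pairing as something to verify against your installed version. The helper names are hypothetical.

```python
def anti_hallucination_options(silence_threshold=2.0):
    """Keyword arguments for transcribe() aimed at reducing hallucinated text."""
    return {
        "language": "en",
        "temperature": 0.0,
        # Assumption: the threshold is only applied when word timestamps are on.
        "word_timestamps": True,
        "hallucination_silence_threshold": silence_threshold,
    }


def transcribe_with_options(model, audio, options):
    """Transcribe once with the given options and return the joined text."""
    segments, _info = model.transcribe(audio, **options)
    return " ".join(segment.text.strip() for segment in segments)


# Usage (needs faster-whisper, a CUDA device and a real audio file):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium", device="cuda")
#   print(transcribe_with_options(model, "audio_path", anti_hallucination_options()))
```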

@savank7
Author

savank7 commented Dec 18, 2024

@Purfview, can you please show me where to use hallucination_silence_threshold=2 in the given code?

@savank7
Author

savank7 commented Dec 18, 2024

@Purfview, let me briefly describe my use case. I have implemented a WebSocket server in Python that uses the Whisper model for audio transcription. The server receives audio data as bytes from a Vosk-Asterisk integration. I process this audio data by converting the bytes into 3-second WAV files. These WAV files are then transcribed using the Whisper model, and the transcribed text is sent back to the client. However, I am currently facing issues in this process, and this is a real-time streaming project.
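
To make the pipeline described here concrete, a small sketch of the buffering step: incoming byte chunks are accumulated and emitted as fixed-length windows. The 16 kHz, 16-bit mono format and the class name PcmWindowBuffer are assumptions (the thread does not state the audio format); the 3-second window comes from the description above.

```python
class PcmWindowBuffer:
    """Accumulate raw PCM bytes and emit fixed-length windows."""

    def __init__(self, sample_rate=16000, sample_width=2, window_seconds=3.0):
        # Number of bytes in one window of audio.
        self.window_bytes = int(sample_rate * window_seconds) * sample_width
        self._buffer = bytearray()

    def feed(self, chunk):
        """Add a chunk; return a list of complete windows (possibly empty)."""
        self._buffer.extend(chunk)
        windows = []
        while len(self._buffer) >= self.window_bytes:
            windows.append(bytes(self._buffer[: self.window_bytes]))
            del self._buffer[: self.window_bytes]
        return windows

    def flush(self):
        """Return whatever is left, which may be shorter than one window."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

Each window returned by feed() can then be handed to the transcription step, and flush() covers the tail of the stream when the client disconnects.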

@Purfview
Contributor

Purfview commented Dec 18, 2024

> Can you please show me where to use hallucination_silence_threshold=2 in the given code?

model.transcribe('audio_path', language="en", hallucination_silence_threshold=1) doesn't work?

> I process this audio data by converting the bytes into 3-second WAV files... ...a real-time streaming

I don't know what the proper approach for streaming is, but Whisper models are trained on 30 s chunks, so 3 s chunks are unusual input for them. Try without_timestamps=True.

Btw, you can skip the WAV step and pass a NumPy audio array directly to transcribe().
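
A sketch of that last suggestion. transcribe() accepts a NumPy array as well as a file path; my understanding is that the array should be mono float32 audio at 16 kHz with values in the range [-1, 1]. The 16-bit little-endian input format and the function names are assumptions, so adjust them to whatever the Asterisk side actually sends.

```python
def pcm16_bytes_to_float32(pcm_bytes):
    """Convert 16-bit little-endian mono PCM bytes to float32 in [-1.0, 1.0)."""
    import numpy as np  # imported lazily so the module loads without NumPy

    samples = np.frombuffer(pcm_bytes, dtype="<i2")
    return samples.astype(np.float32) / 32768.0


def transcribe_pcm(model, pcm_bytes):
    """Transcribe raw PCM bytes without writing a WAV file first."""
    # Assumption: the bytes are already sampled at 16 kHz; resample first if not.
    audio = pcm16_bytes_to_float32(pcm_bytes)
    segments, _info = model.transcribe(
        audio,
        language="en",
        temperature=0.0,
        without_timestamps=True,
    )
    return " ".join(segment.text.strip() for segment in segments)
```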


2 participants