Speaker Diarization Using Whisper Transcription and NVIDIA NeMo #898
-
Cool, that's a great contribution! FYI, there are approaches other than WhisperX to get word timestamps that do not require an additional wav2vec model.
-
In the latest version of whisper-timestamped, there is no VRAM/RAM problem. I also benchmarked word-timestamp accuracy on a French dataset for which I had accurate timestamp annotations, and the performance of WhisperX and whisper-timestamped is very close. I will soon implement an approach that uses VAD to be independent of Whisper's timestamp prediction.
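  For comparison, here is a minimal sketch of getting word timestamps with whisper-timestamped, following its README; the file path and model size are placeholders, and the `vad=True` option (which pre-segments speech with a VAD, in the spirit of the approach mentioned above) may differ across versions:

  ```python
  import json
  import whisper_timestamped as whisper

  # "audio.wav" is a placeholder path; "small" is an arbitrary model size.
  audio = whisper.load_audio("audio.wav")
  model = whisper.load_model("small", device="cpu")

  # Transcribe with word-level timestamps. vad=True pre-segments speech so
  # the timestamps do not rely solely on Whisper's own predictions.
  result = whisper.transcribe(model, audio, language="fr", vad=True)

  # Each segment carries a "words" list with per-word start/end times.
  for segment in result["segments"]:
      for word in segment["words"]:
          print(f'{word["start"]:.2f}-{word["end"]:.2f}\t{word["text"]}')
  ```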
-
Hey, this worked pretty well! The diarization isn't the best, but it's pretty good and should be good enough for many use cases. Thanks for sharing!
-
www.lexicaps.com seamlessly adds diarization to Whisper's transcription. No third-party packages.
-
Hi, we are also building a public ASR tool that uses Whisper for transcription and NVIDIA NeMo for diarization: https://github.com/Wordcab/wordcab-transcribe Audio file transcription will no longer be a struggle, and we provide a top-class API on top of the full process.
-
Hi.
-
I am exploring the possibility of using a local model for transcription with your diarization repository. I have fine-tuned a Hugging Face Whisper model using PEFT LoRA adapters and would like to integrate it into your Whisper Transcription + NeMo Diarization notebook. My goal is to replace the current transcription setup, which uses faster_whisper, with my locally trained model.

  Could you guide me on where I need to make changes in the notebook to load and use my Hugging Face model for transcription? Additionally, I want to ensure that the diarization workflow remains intact after integrating my model: are there specific adjustments required to pass the transcriptions from my custom model into the NeMo diarization pipeline seamlessly? Lastly, could you highlight which sections of the notebook need modification for this integration?

  Looking forward to your response.
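  As a starting point for the model-loading half of this question, here is a minimal sketch (not taken from the repository) of attaching a PEFT LoRA adapter to a base Whisper checkpoint with transformers and peft; `BASE_MODEL`, `ADAPTER_DIR`, and `audio.wav` are placeholder names you would swap for your own:

  ```python
  import torch
  import librosa
  from transformers import WhisperForConditionalGeneration, WhisperProcessor
  from peft import PeftModel

  BASE_MODEL = "openai/whisper-large-v2"  # placeholder: your base checkpoint
  ADAPTER_DIR = "path/to/lora-adapter"    # placeholder: your PEFT adapter dir

  device = "cuda" if torch.cuda.is_available() else "cpu"
  processor = WhisperProcessor.from_pretrained(BASE_MODEL)
  model = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL)

  # Attach the LoRA weights; merge_and_unload() folds them into the base
  # model so inference runs without the PEFT wrapper.
  model = PeftModel.from_pretrained(model, ADAPTER_DIR).merge_and_unload()
  model.to(device).eval()

  # Transcribe one file; librosa resamples to the 16 kHz Whisper expects.
  audio, _ = librosa.load("audio.wav", sr=16000)
  features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
  with torch.no_grad():
      ids = model.generate(features.to(device))
  print(processor.batch_decode(ids, skip_special_tokens=True)[0])
  ```

  Note that a plain `generate()` call like this only returns text, whereas faster_whisper also supplies word-level timestamps; depending on which version of the notebook you use, make sure whatever step derives word timestamps (e.g. a forced-alignment stage) still receives what it expects.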
-
Hello, I've built a pipeline Here to enable speaker diarization using Whisper's transcriptions. It includes preprocessing that separates the vocals from other sounds, and post-processing that realigns the transcriptions according to punctuation (thanks to @mu4farooqi). It also uses WhisperX (by @m-bain) for timestamp correction.
From my trials, the results are better than the PyAnnote approach mentioned in #264.
The code was originally written to handle 1hr+ podcasts, so there is no need to split the audio in advance.
Feel free to try it and give me your feedback and suggestions.
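  For readers curious how a pipeline like this can merge the two outputs, here is a minimal, self-contained sketch (not the repository's actual code) of mapping word timestamps onto diarization turns: each word is assigned to the speaker whose turn covers its midpoint, with a nearest-turn fallback for words no turn overlaps.

  ```python
  from dataclasses import dataclass

  @dataclass
  class Word:
      text: str
      start: float  # seconds
      end: float

  @dataclass
  class Turn:
      speaker: str
      start: float
      end: float

  def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
      """Label each word with the diarization turn covering its midpoint,
      falling back to the nearest turn when no turn overlaps the word."""
      labeled = []
      for w in words:
          mid = (w.start + w.end) / 2
          covering = [t for t in turns if t.start <= mid <= t.end]
          if covering:
              turn = covering[0]
          else:
              turn = min(turns, key=lambda t: min(abs(t.start - mid), abs(t.end - mid)))
          labeled.append((turn.speaker, w.text))
      return labeled
  ```

  The punctuation-based realignment mentioned above can then shift speaker changes so they fall only at sentence boundaries, which avoids mid-sentence speaker flips.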