Speaker Diarization Using Whisper Transcription and NVIDIA NeMo #898
-
Cool, that's a great contribution! FYI, there are approaches other than WhisperX to get word timestamps that do not require an additional wav2vec model.
-
In the latest version of whisper-timestamped, there is no VRAM/RAM problem. I also benchmarked word-timestamp accuracy on a French dataset for which I had accurate timestamp annotations, and the performance of WhisperX and whisper-timestamped is very close. I will soon implement an approach that uses VAD to be independent of Whisper's timestamp prediction.
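  For comparison, here is a minimal sketch of getting word timestamps with whisper-timestamped, following its README; the file path and model size are placeholders, and the `vad=True` option (which pre-segments speech with a VAD, in the spirit of the approach mentioned above) may differ across versions:

  ```python
  import json
  import whisper_timestamped as whisper

  # "audio.wav" is a placeholder path; "small" is an arbitrary model size.
  audio = whisper.load_audio("audio.wav")
  model = whisper.load_model("small", device="cpu")

  # Transcribe with word-level timestamps. vad=True pre-segments speech so
  # the timestamps do not rely solely on Whisper's own predictions.
  result = whisper.transcribe(model, audio, language="fr", vad=True)

  # Each segment carries a "words" list with per-word start/end times.
  for segment in result["segments"]:
      for word in segment["words"]:
          print(f'{word["start"]:.2f}-{word["end"]:.2f}\t{word["text"]}')
  ```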
-
Hey, this worked pretty well! The diarization isn't the best, but it's pretty good and should be good enough for many use cases. Thanks for sharing!
-
www.lexicaps.com seamlessly adds diarization to Whisper's transcription. No third-party packages.
-
Hi, we are also building a public ASR tool that uses Whisper for transcription and NVIDIA NeMo for diarization: https://github.com/Wordcab/wordcab-transcribe Audio file transcription will no longer be a struggle, and we provide a top-class API on top of the full process.
-
Hi.
-
I am exploring the possibility of using a local model for transcription with your diarization repository. I have fine-tuned a Hugging Face Whisper model using PEFT LoRA adapters and would like to integrate it into your Whisper Transcription + NeMo Diarization notebook. My goal is to replace the current transcription setup, which uses faster_whisper, with my locally trained model.

  Could you guide me on where I need to make changes in the notebook to load and use my Hugging Face model for transcription? Additionally, I want to ensure that the diarization workflow remains intact after integrating my model: are there specific adjustments required to pass the transcriptions from my custom model into the NeMo diarization pipeline seamlessly? Lastly, could you highlight which sections of the notebook need modification for this integration?

  Looking forward to your response.
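  As a starting point for the model-loading half of this question, here is a minimal sketch (not taken from the repository) of attaching a PEFT LoRA adapter to a base Whisper checkpoint with transformers and peft; `BASE_MODEL`, `ADAPTER_DIR`, and `audio.wav` are placeholder names you would swap for your own:

  ```python
  import torch
  import librosa
  from transformers import WhisperForConditionalGeneration, WhisperProcessor
  from peft import PeftModel

  BASE_MODEL = "openai/whisper-large-v2"  # placeholder: your base checkpoint
  ADAPTER_DIR = "path/to/lora-adapter"    # placeholder: your PEFT adapter dir

  device = "cuda" if torch.cuda.is_available() else "cpu"
  processor = WhisperProcessor.from_pretrained(BASE_MODEL)
  model = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL)

  # Attach the LoRA weights; merge_and_unload() folds them into the base
  # model so inference runs without the PEFT wrapper.
  model = PeftModel.from_pretrained(model, ADAPTER_DIR).merge_and_unload()
  model.to(device).eval()

  # Transcribe one file; librosa resamples to the 16 kHz Whisper expects.
  audio, _ = librosa.load("audio.wav", sr=16000)
  features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
  with torch.no_grad():
      ids = model.generate(features.to(device))
  print(processor.batch_decode(ids, skip_special_tokens=True)[0])
  ```

  Note that a plain `generate()` call like this only returns text, whereas faster_whisper also supplies word-level timestamps; depending on which version of the notebook you use, make sure whatever step derives word timestamps (e.g. a forced-alignment stage) still receives what it expects.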
-
Hello, I've built a pipeline Here to enable speaker diarization using Whisper's transcriptions. It includes preprocessing that separates the vocals from other sounds, and post-processing that realigns the transcriptions according to punctuation (thanks to @mu4farooqi). It also uses WhisperX (by @m-bain) for timestamp correction.
From my trials, the results are better than the PyAnnote approach mentioned in #264.
The code was originally written to handle 1hr+ podcasts, so there is no need to split the audio in advance.
Feel free to try it and give me your feedback and suggestions.
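  For readers curious how a pipeline like this can merge the two outputs, here is a minimal, self-contained sketch (not the repository's actual code) of mapping word timestamps onto diarization turns: each word is assigned to the speaker whose turn covers its midpoint, with a nearest-turn fallback for words no turn overlaps.

  ```python
  from dataclasses import dataclass

  @dataclass
  class Word:
      text: str
      start: float  # seconds
      end: float

  @dataclass
  class Turn:
      speaker: str
      start: float
      end: float

  def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
      """Label each word with the diarization turn covering its midpoint,
      falling back to the nearest turn when no turn overlaps the word."""
      labeled = []
      for w in words:
          mid = (w.start + w.end) / 2
          covering = [t for t in turns if t.start <= mid <= t.end]
          if covering:
              turn = covering[0]
          else:
              turn = min(turns, key=lambda t: min(abs(t.start - mid), abs(t.end - mid)))
          labeled.append((turn.speaker, w.text))
      return labeled
  ```

  The punctuation-based realignment mentioned above can then shift speaker changes so they fall only at sentence boundaries, which avoids mid-sentence speaker flips.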