Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent definition of clip_timestamps parameter between WhisperModel and BatchedInferencePipeline #1205

Open
zh-plus opened this issue Dec 17, 2024 · 5 comments

Comments

@zh-plus
Copy link
Contributor

zh-plus commented Dec 17, 2024

The clip_timestamps parameter is defined differently in WhisperModel and BatchedInferencePipeline, which creates confusion when switching between these classes:

In WhisperModel.transcribe():

clip_timestamps: Union[str, List[float]] = "0",

In BatchedInferencePipeline.transcribe():

clip_timestamps: Optional[List[dict]] = None,

@Purfview
Copy link
Contributor

Purfview commented Dec 18, 2024

Yes, I was thinking to rename it to vad_timestamps for batched, but it's "cleaner" as is I think.
Maybe just add the notes at the descriptions that it needs different input in batched vs sequential modes?

@zh-plus
Copy link
Contributor Author

zh-plus commented Dec 19, 2024

I think unifying the interfaces would be better than documenting differences. We could change both to Optional[List[dict]] where each dict has start/end times. This would make it possible to use WhisperModel and BatchedInferencePipeline with the same clip_timestamps for self-defined VAD.

@Purfview
Copy link
Contributor

Purfview commented Dec 19, 2024

We could change both to Optional[List[dict]]

clip_timestamps in sequential mode is for a quick user input, not for "self-defined VAD", btw, you can't have this functionality in batched mode as audio there is already chunked by VAD.

If you want a "self-defined VAD" in sequential mode then make a new option, and rename batched "clip_timestamps" to that.

@MahmoudAshraf97
Copy link
Collaborator

I had that in mind when naming the parameter, it's not ideal I know, but if I'm going to change anything, then the sequential one is the candidate, because the batched one is much clearer IMO with clear starts and ends

@Purfview
Copy link
Contributor

If you change the sequential one then it will lose original functionality which is to replicate the vanilla Whisper's clip_timestamps .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants