Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to carry initial_prompt with the sliding window #2343

Merged
merged 5 commits into from
Oct 26, 2024

Conversation

kittsil
Copy link
Contributor

@kittsil kittsil commented Sep 18, 2024

Background
Whisper's transcribe() struggles with contextual proper nouns if they appear after the initial prompt has been consumed; see some experimental results here. This solves that issue by allowing the initial "context" prompt to be carried as the sliding window moves through the audio.

Changes
Add an option carry_initial_prompt = False to whisper.transcribe().

When carry_initial_prompt is set to True, initial_prompt is prepended to each internal decode() call's prompt. If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.

@kittsil
Copy link
Contributor Author

kittsil commented Sep 18, 2024

There are outstanding issues with this PR:

  1. I have not found the definition of the 224 context token length.
  2. It prepends the initial_prompt to itself before enough tokens have been generated, resulting in a predilection toward looping.
  3. I have not written tests.

Closing this PR since I can't find a way to move it to draft.

@kittsil kittsil closed this Sep 18, 2024
@ryanheise
Copy link
Contributor

@ryanheise
Copy link
Contributor

Also a relevant discussion here: #1040 (comment)

I have not found the definition of the 224 context token length.

It's part of the model dimensions itself, actually 448 tokens total, and half that for the prompt. The logic is in decoding.py if you look for self.n_ctx: int = model.dims.n_text_ctx and look for the references to it.

Kittsil and others added 3 commits September 18, 2024 22:58
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.
When set to `True`, `initial_prompt` is prepended to each internal `decode()` call's `prompt`.
If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
@kittsil kittsil reopened this Sep 19, 2024
@kittsil
Copy link
Contributor Author

kittsil commented Sep 19, 2024

@ryanheise Thank you for your input; it was helpful. Do you mind providing any additional feedback?


Aside: I did find the left-slice in the code, and it turns out that the docs are wrong, as actually the maximum prompt length is 223!

Confirming with the medium.en model...

>>> medium = torch.load('/home/kittsil/.cache/whisper/medium.en.pt')
>>> medium['dims']
{'n_mels': 80, 'n_vocab': 51864, 'n_audio_ctx': 1500, 'n_audio_state': 1024, 'n_audio_head': 16, 'n_audio_layer': 24, 'n_text_ctx': 448, 'n_text_state': 1024, 'n_text_head': 16, 'n_text_layer': 24}
>>> medium['dims']['n_text_ctx'] // 2 - 1
223

@FurkanGozukara
Copy link

FurkanGozukara commented Oct 5, 2024

hello if i locally merge this what do i add command to prevent whisper losing punctuation during transcription?

can you also update here so i can directly install it : https://github.com/kittsil/whisper/tree/patch-1

@kittsil

@FurkanGozukara
Copy link

why this very important feature is still not merged @jongwook ?

@FurkanGozukara
Copy link

@kittsil i use CLI so adding --carry_initial_prompt will work right?

@FurkanGozukara
Copy link

I am transcribing a 3 hours video working awesome so far

how errors like this can be fixed?

image

@kittsil
Copy link
Contributor Author

kittsil commented Oct 23, 2024

how errors like this can be fixed?

image

@FurkanGozukara, that's an issue with whisper, not with your prompt. You can try setting compression_ratio_threshold lower; I have found some success with 1.7 (as opposed to the default 2.4).

In general, though, I wouldn't comment on a PR for debugging help; it's best to keep PRs focused on the request / review process.

@FurkanGozukara
Copy link

@kittsil thank you so much your PR saved me so much

I transcribed this 3 hours video and without your PR I would be devastated because YouTube auto timing also failed :D

https://youtu.be/FvpWy1x5etM

@jongwook jongwook merged commit 5979f03 into openai:main Oct 26, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants