Releases: argmaxinc/WhisperKit
v0.10.1
Small patch for building on older macOS versions. Also includes a fix for early stopping callback logic that had regressed from 0.9.4.
What's Changed
- Patch for <macOS 15 build systems by @ZachNagengast in #283
Full Changelog: v0.10.0...v0.10.1
v0.10.0
Highlights
This release provides support for protocol-defined model inputs and output types, supporting full MLX or MLTensor pipelines without the need to convert to MLMultiArrays between encoder/decoder stages. For example, instead of
func encodeFeatures(_ features: MLMultiArray) async throws -> MLMultiArray?
you can now define the types by protocol:
func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)?
where the types are defined as so:
public protocol FeatureExtractorOutputType {}
extension MLMultiArray: FeatureExtractorOutputType {}
public protocol AudioEncoderOutputType {}
extension MLMultiArray: AudioEncoderOutputType {}
or for a type that is a struct:
public struct TextDecoderMLMultiArrayOutputType: TextDecoderOutputType {
public var logits: MLMultiArray?
public var cache: DecodingCache?
}
This way, the entire structure can be handled by any model that conforms to the protocol. This adds flexibility for passing different data types between models and reduces the number of conversion steps compared to the previous design, which assumed everything was an MLMultiArray.
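As a rough illustration of the protocol-typed stages described above, here is a minimal sketch that does not depend on CoreML. The protocol names mirror the release notes, but `FloatFeatures`, `ScalingEncoder`, and `EncoderError` are hypothetical stand-ins, and the method is synchronous here for brevity, whereas the real one is `async`.

```swift
// Stand-in protocols that mirror the names used in the release notes.
protocol FeatureExtractorOutputType {}
protocol AudioEncoderOutputType {}

// Hypothetical backing type. A real pipeline would conform MLMultiArray,
// MLTensor, or an MLX array to these protocols instead.
struct FloatFeatures: FeatureExtractorOutputType, AudioEncoderOutputType {
    var values: [Float]
}

// Hypothetical error for inputs this particular stage cannot handle.
enum EncoderError: Error {
    case unsupportedInput
}

// Hypothetical encoder stage that accepts any conforming input type.
struct ScalingEncoder {
    let scale: Float

    func encodeFeatures(_ features: any FeatureExtractorOutputType) throws -> (any AudioEncoderOutputType)? {
        // Each implementation downcasts to the concrete type it knows how to process.
        guard let input = features as? FloatFeatures else {
            throw EncoderError.unsupportedInput
        }
        return FloatFeatures(values: input.values.map { $0 * scale })
    }
}

let encoder = ScalingEncoder(scale: 2)
let encoded = try encoder.encodeFeatures(FloatFeatures(values: [0.5, 1.0, 1.5]))
if let output = encoded as? FloatFeatures {
    print(output.values)
}
```

The downcast inside the stage is the trade-off of this design: the signature stays generic, while each implementation decides which concrete types it supports.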
We've made a start on using different inference types by using the new MLTensor for token sampling on devices that have the latest OS support, which resulted in a 2x speedup for that operation. Future work will shift the entire pipeline to using these.
There are also some important fixes included:
- Timestamp rules are now enabled when the withoutTimestamps decoding option is set to false, increasing parity with OpenAI's Python implementation. This will significantly increase the number of timestamps returned during decoding and shorten the average length of individual segments overall.
  - Previous:
    <|0.00|> So in college, I was a government major,<|4.92|><|4.94|> which means I had to write a lot of papers.<|7.38|>
  - Now:
    <|0.00|> So in college,<|2.00|><|3.36|> I was a government major,<|4.88|><|4.90|> which means I had to write a lot of papers.<|7.36|>
- Early stopping via callback (a way to stop the decoding loop early if repetition is detected) has been converted to use an actor to fix some concurrency issues noted by the community.
- CI script now uploads failure results to GitHub for better visibility.
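To make the effect of the timestamp change in the first fix concrete, the following sketch splits a timestamp-annotated string like the ones shown above into timed segments. `TimedSegment` and `parseTimestampedText` are hypothetical helpers written for this illustration and are not part of WhisperKit.

```swift
import Foundation

// Hypothetical value type for one timestamped span of text.
struct TimedSegment: Equatable {
    let start: Double
    let end: Double
    let text: String
}

// Splits text of the form "<|0.00|> words<|2.00|>" into segments.
// Text between two timestamp tokens becomes one segment; back-to-back
// tokens with no text between them produce nothing.
func parseTimestampedText(_ raw: String) -> [TimedSegment] {
    var segments: [TimedSegment] = []
    var pendingStart: Double? = nil
    var pendingText = ""
    for piece in raw.components(separatedBy: "<|") {
        // Each useful piece looks like "0.00|> So in college,".
        guard let close = piece.range(of: "|>") else { continue }
        guard let time = Double(String(piece[piece.startIndex..<close.lowerBound])) else { continue }
        let text = String(piece[close.upperBound...]).trimmingCharacters(in: .whitespaces)
        if let start = pendingStart, !pendingText.isEmpty {
            segments.append(TimedSegment(start: start, end: time, text: pendingText))
        }
        pendingStart = time
        pendingText = text
    }
    return segments
}

let previous = parseTimestampedText("<|0.00|> So in college, I was a government major,<|4.92|><|4.94|> which means I had to write a lot of papers.<|7.38|>")
let now = parseTimestampedText("<|0.00|> So in college,<|2.00|><|3.36|> I was a government major,<|4.88|><|4.90|> which means I had to write a lot of papers.<|7.36|>")
print(previous.count, now.count)
```

Run on the two examples above, the same audio yields more, shorter segments under the new rules.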
⚠️ Breaking changes
- Changing the protocol may result in some unexpected behavior if you are using a custom implementation; please raise an issue if you notice anything.
- WhisperKit.sampleRate has been moved to Constants.defaultWindowSamples
Finally, there were some great open-source contributions listed below, with a broad range of improvements to the library. Huge thanks to all the contributors!
What's Changed
- Fix audio processing edge case by @ZachNagengast in #237
- Add public callbacks to help expose internal state a little more by @iandundas in #240
- Freeze loglevel enum by @ZachNagengast in #255
- Update WhisperAX app icon for macOS to align with Apple HIG standards by @Stv-X in #257
- Add ability to prevent config.json being written to ~/Documents/huggingface/... by @iandundas in #262
- Typo in Model Descriptions by @rk-helper in #269
- Audio: Fix taking a suffix of negative length from a collection by @mattisssa in #278
New Contributors
- @Stv-X made their first contribution in #257
- @rk-helper made their first contribution in #269
- @mattisssa made their first contribution in #278
Full Changelog: v0.9.4...v0.10.0
v0.9.4
Minor patch to open up access to the logging callback and freeze the enum for LogLevel
Usage:
Logging.shared.loggingCallback = { message in
print("WhisperKit logs: ", message)
}
What's Changed
- Freeze loglevel enum by @ZachNagengast in #255
Full Changelog: v0.9.3...v0.9.4
v0.9.3
This release adds a number of useful callbacks that you can receive updates from while the transcription is processing:
/// A callback that provides transcription segments as they are discovered.
/// - Parameters:
/// - segments: An array of `TranscriptionSegment` objects representing the transcribed segments
public typealias SegmentDiscoveryCallback = (_ segments: [TranscriptionSegment]) -> Void
/// A callback that reports changes in the model's state.
/// - Parameters:
/// - oldState: The previous state of the model, if any
/// - newState: The current state of the model
public typealias ModelStateCallback = (_ oldState: ModelState?, _ newState: ModelState) -> Void
/// A callback that reports changes in the transcription process.
/// - Parameter state: The current `TranscriptionState` of the transcription process
public typealias TranscriptionStateCallback = (_ state: TranscriptionState) -> Void
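The sketch below shows one way callbacks of these shapes might be wired up. It is self-contained and uses stand-in types: `PipelineStub`, its property names, and the simplified `TranscriptionSegment` and `ModelState` definitions are assumptions for illustration, not WhisperKit's actual declarations.

```swift
// Stand-in types for illustration only. The real TranscriptionSegment and
// ModelState are defined by WhisperKit and carry more information.
struct TranscriptionSegment {
    let text: String
}

enum ModelState {
    case unloaded, loading, loaded
}

// Callback shapes copied from the release notes above.
typealias SegmentDiscoveryCallback = (_ segments: [TranscriptionSegment]) -> Void
typealias ModelStateCallback = (_ oldState: ModelState?, _ newState: ModelState) -> Void

// Hypothetical pipeline stub that fires the callbacks the way a real pipeline might.
final class PipelineStub {
    var segmentDiscoveryCallback: SegmentDiscoveryCallback?
    var modelStateCallback: ModelStateCallback?
    private var state: ModelState?

    func transition(to newState: ModelState) {
        modelStateCallback?(state, newState)
        state = newState
    }

    func discover(_ segments: [TranscriptionSegment]) {
        segmentDiscoveryCallback?(segments)
    }
}

// Wires up both callbacks and returns a log of everything they observed.
func runCallbackDemo() -> [String] {
    var log: [String] = []
    let pipeline = PipelineStub()
    pipeline.modelStateCallback = { _, newState in
        log.append("state changed to \(newState)")
    }
    pipeline.segmentDiscoveryCallback = { segments in
        log.append(contentsOf: segments.map { "segment: \($0.text)" })
    }
    pipeline.transition(to: .loading)
    pipeline.transition(to: .loaded)
    pipeline.discover([TranscriptionSegment(text: "So in college,")])
    return log
}

let callbackLog = runCallbackDemo()
callbackLog.forEach { print($0) }
```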
Thank you @iandundas for the excellent contribution! ✨
What's Changed
- Add public callbacks to help expose internal state a little more by @iandundas in #240
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Highlights
With this release we are launching a comprehensive suite of benchmarks that you can run yourself on your own devices - or view the results that we've run on a wide variety of devices via our WhisperKit Benchmarks HuggingFace space! This was a huge effort kicked off by @Abhinay1997 so we're very excited to bring it to main. Read more in the discussion here and let us know what you think!
Along with this, there are also several bug fixes and improvements included in this release based on recently reported issues; see below for the relevant PRs.
What's Changed
- Fix expo release script by @ZachNagengast in #220
- Fix progress for vad by @ZachNagengast in #223
- Regression Test Pipeline by @Abhinay1997 in #120
- Update xcconfig tracking and provisioning by @ZachNagengast in #234
- Fix audio processing edge case by @ZachNagengast in #237
Full Changelog: v0.9.0...v0.9.2
v0.9.0
Highlights
Package Updates
With #216, the default check for whether a model is supported on the device uses the model repo's config.json as the source of truth. The need for this came about with the release of the new large-v3 turbo model, listed in the model repo as openai_whisper-large-v3-v20240930, which was being recommended for devices that would crash when attempting to load it. This situation can now be mitigated by updating config.json without the need for a new release, and the recommendation can be fetched directly with the new static method recommendedRemoteModels:
let recommendedModel = await WhisperKit.recommendedRemoteModels().default
let pipe = try? await WhisperKit(model: recommendedModel)
The existing interface for WhisperKit.recommendedModels() remains the same, but now returns a ModelSupport object with a list of supported models for the current device.
public struct ModelSupport: Codable, Equatable {
public let `default`: String
public let supported: [String]
public var disabled: [String] = []
}
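As a sketch of how a structure like this could drive model selection, the example below decodes a `ModelSupport` value from JSON and falls back to the default when a requested model is disabled or unsupported. The JSON payload, the shorter model names in it, and the `chooseModel` helper are assumptions for illustration; the actual layout of the repo's config.json may differ.

```swift
import Foundation

// Same shape as the struct shown above, redeclared so the sketch is self-contained.
struct ModelSupport: Codable, Equatable {
    let `default`: String
    let supported: [String]
    var disabled: [String] = []
}

// Hypothetical payload. The real config.json in the model repo may be laid out differently.
let json = """
{
  "default": "openai_whisper-base",
  "supported": ["openai_whisper-tiny", "openai_whisper-base"],
  "disabled": ["openai_whisper-large-v3-v20240930"]
}
"""

// Hypothetical helper: use the requested model only if the device supports it.
func chooseModel(requested: String, support: ModelSupport) -> String {
    if support.disabled.contains(requested) || !support.supported.contains(requested) {
        return support.default
    }
    return requested
}

let support = try JSONDecoder().decode(ModelSupport.self, from: Data(json.utf8))
print(chooseModel(requested: "openai_whisper-large-v3-v20240930", support: support))
print(chooseModel(requested: "openai_whisper-tiny", support: support))
```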
Also, in an ongoing effort to improve modularity, extensibility, and code structure, there is a new way to initialize WhisperKit: using the new WhisperKitConfig class. The parameters are exactly the same and the previous init method is still in place, but this can assist in defining WhisperKit settings and protocol objects ahead of time and initializing WhisperKit more cleanly:
Previous:
let pipe = try? await WhisperKit(model: "your-custom-model", modelRepo: "username/your-model-repo")
New:
let config = WhisperKitConfig(model: "your-custom-model", modelRepo: "username/your-model-repo") // Initialize config
config.model = "your-custom-model" // Alternatively set parameters directly
let pipe = try? await WhisperKit(config) // Pass into WhisperKit initializer
WhisperAX example app and CLI
Thanks to some memory and audio processing optimizations in #195, #216, and #217 (shout out to @keleftheriou for finding a big improvement there), we've updated the example implementations to use VAD by default with a concurrentWorkerCount of 4. This will significantly improve default inference speed on long files for devices that support async prediction, as well as real-time streaming for device/model combinations with a real-time factor greater than 1.
⚠️ Deprecations and changed interfaces
- The extension on Process.processor is now ProcessInfo.processor and includes a new property ProcessInfo.hwModel, which will return a similar string to uname(&utsname) for non-Macs.
- public func modelSupport(for deviceName: String) -> (default: String, disabled: [String]) is now a disfavored overload in preference of public func modelSupport(for deviceName: String, from config: ModelSupportConfig? = nil) -> ModelSupport
What's Changed
- Make additional initializers, functions, members public for extensibility by @bpkeene in #192
- Fix start time logic for file loading by @ZachNagengast in #195
- Change static var stored properties to static let by @fumoboy007 in #190
- Add VoiceActivityDetector base class by @a2they in #199
- Set default concurrentWorkerCount by @atiorh in #205
- Improving modularity and code structure by @a2they in #212
- Add model support config fetching from model repo by @ZachNagengast in #216
- Example app VAD default + memory reduction by @ZachNagengast in #217
New Contributors
- @bpkeene made their first contribution in #192
- @fumoboy007 made their first contribution in #190
- @a2they made their first contribution in #199
- @atiorh made their first contribution in #205
- @1amageek made their first contribution in #216
- @keleftheriou made their first contribution in #217
Full Changelog: v0.8.0...v0.9.0
v0.8.0
With this release, we had a huge focus on reliability in terms of memory usage (especially for large files), common crashes, and various correctness errors that the community has reported in issues.
Highlights
- Memory-efficient Handling of Large Files: WhisperKit is much more memory-efficient for large files with some improvements to #158 by @finnvoor. This change speeds up the audio resampling significantly and removes a few other unnecessary data copies. It also fixes a buffer misalignment issue that caused #183. For more aggressive memory savings, the default audio file chunking size can be configured through maxReadFrameSize. Here is the memory chart for a ~200 MB compressed audio file from #174, showing up to 3x faster resampling with 50% less memory. Note that WhisperKit requires uncompressed Float values for the MLModel input, so the compressed file becomes roughly 1 GB at minimum after it is read and resampled to 16 kHz, 1 channel.
  [Figure: memory usage charts, Before | After]
- Progress Bar: @finnvoor also contributed a fix to the progress when in VAD chunking mode. WhisperAX now shows an indicator while the file is being resampled and the overall progress of the decoding. Note that this is not an exactly linear progress bar because it is based on how many windows have completed decoding, so it will speed up toward the end of the process as more windows complete.
- Various other improvements: We also did a pass on our current issues and resolved many of them. If you have one pending, please test out this version to verify it is fixed. Thanks again to everyone who contributes to these issues; it helps immensely to make WhisperKit better for everyone.
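The roughly 1 GB figure in the first highlight follows from simple arithmetic on uncompressed 32-bit float samples at 16 kHz mono. The sketch below works that out and also shows how a frame limit such as maxReadFrameSize could split a file into read chunks. Both helper functions are hypothetical and only illustrate the idea; they are not WhisperKit's implementation.

```swift
// Bytes needed to hold audio of the given duration as uncompressed Float32 samples.
func decodedByteCount(durationSeconds: Double, sampleRate: Double = 16_000, channels: Int = 1) -> Int {
    let samples = durationSeconds * sampleRate * Double(channels)
    return Int(samples) * MemoryLayout<Float>.size
}

// Hypothetical helper: frame ranges to read when no single read may exceed maxReadFrameSize.
func chunkRanges(totalFrames: Int, maxReadFrameSize: Int) -> [Range<Int>] {
    precondition(maxReadFrameSize > 0, "chunk size must be positive")
    return stride(from: 0, to: totalFrames, by: maxReadFrameSize).map { start in
        start..<min(start + maxReadFrameSize, totalFrames)
    }
}

// One hour of 16 kHz mono audio: 3600 * 16000 samples * 4 bytes each.
let bytesPerHour = decodedByteCount(durationSeconds: 3600)
// Hours of audio that fit in 1 GB of decoded samples.
let hoursPerGigabyte = 1_000_000_000.0 / Double(bytesPerHour)
print(bytesPerHour, hoursPerGigabyte)

let ranges = chunkRanges(totalFrames: 10, maxReadFrameSize: 4)
print(ranges)
```

Under these assumptions an hour of audio occupies about 230 MB once decoded, so 1 GB corresponds to a little over four hours of audio.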
What's Changed
- Remove purported OGG support from CLI by @iandundas in #153
- Resample audio files in 10mb chunks by @finnvoor in #158
- feat: add version output by @chenrui333 in #148
- Fix TEST_HOST name mismatch by @CongLeSolutionX in #177
- feat: copy text with eager decoding, add keyboard shortcut by @iGerman00 in #178
- Fix progress when using VAD chunking by @finnvoor in #179
- Fix indeterminate tests by @ZachNagengast in #180
- Fix resampling large files by @ZachNagengast in #183
New Contributors
- @iandundas made their first contribution in #153
- @chenrui333 made their first contribution in #148
- @CongLeSolutionX made their first contribution in #177
- @iGerman00 made their first contribution in #178
Full Changelog: v0.7.2...v0.8.0
v0.7.2
Early stopping now keeps track of the chunked window internally when running async transcription via the VAD chunking method. This will give further control for stopping specific windows based on your custom criteria in the TranscriptionCallback.
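A custom stopping criterion might look like the sketch below. It assumes the callback receives a progress value that identifies its window and returns an optional Bool, where `false` asks the decoder to stop that window and `nil` lets it continue. `ProgressStub`, `CallbackStub`, and `makeRepetitionGuard` are hypothetical stand-ins rather than WhisperKit types, so check the library's own definitions before relying on this shape.

```swift
// Hypothetical stand-in for the progress value passed to the callback.
struct ProgressStub {
    let windowId: Int   // which chunked window this update belongs to
    let text: String    // text decoded so far for that window
}

// Assumed callback shape: return false to stop the window, nil to keep decoding.
typealias CallbackStub = (ProgressStub) -> Bool?

// Builds a callback that stops a window once its last `maxRepeats` words are identical.
func makeRepetitionGuard(maxRepeats: Int) -> CallbackStub {
    return { progress in
        let words = progress.text.split(separator: " ")
        guard words.count >= maxRepeats, let last = words.last else { return nil }
        let tail = words.suffix(maxRepeats)
        return tail.allSatisfy { $0 == last } ? false : nil
    }
}

let repetitionGuard = makeRepetitionGuard(maxRepeats: 4)
let stuck = repetitionGuard(ProgressStub(windowId: 2, text: "and then the the the the"))
let healthy = repetitionGuard(ProgressStub(windowId: 3, text: "which means I had to write"))
print(String(describing: stuck), String(describing: healthy))
```

Because each update carries its window identifier, a criterion like this can stop one repetitive window without affecting the others being decoded in parallel.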
What's Changed
- Fix early stopping for VAD by @ZachNagengast in #155
Full Changelog: v0.7.1...v0.7.2
v0.7.1
Hotfix for shouldEarlyStop logic
What's Changed
- Ensures early stopping flag on TextDecoder is always reset at the beginning of a new loop
Full Changelog: v0.7.0...v0.7.1
v0.7.0
This is a very exciting release because we're seeing yet another massive speedup in offline throughput thanks to VAD-based chunking!
Highlights
- Energy VAD based chunking 🗣️ @jkrukowski
  - There is a new decoding option called chunkingStrategy which can significantly speed up your single file transcriptions with minimal WER downsides.
  - It works by finding a clip point in the middle of the longest silence (lowest audio energy) in the last 15s of a 30s window and uses that to split up all the audio ahead of time so it can be asynchronously decoded in parallel.
  - Here's a video of it in action, comparing the .none chunking strategy with .vad:
    vad.chunking.mp4
- Detect language helper:
  - You can now call detectLanguage with just an audio path as input from the main whisperKit object. This will return a simple language code and probability back as a tuple, and has minimal logging/timing.
  - Example:
let whisperKit = try await WhisperKit()
let (language, probs) = try await whisperKit.detectLanguage(audioPath: "your/audio/path/spanish.wav")
print(language) // "es"
- WhisperKit via Expo @seb-sep
- For anyone who's been wanting to use WhisperKit in React Native, @seb-sep is maintaining a repo that makes it easy, and has also set up an automation that will update it with each new WhisperKit release. Check it out here: https://github.com/seb-sep/whisper-kit-expo
- Bug fixes and enhancements:
- @jiangdi0924 and @fengcunhan contributed some nice fixes in this release with #136 and #138 (see below)
- Also moved the decoding progress callback to be fully async so that it doesn't block the decoder thread
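The energy-based split described under the VAD chunking highlight can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: `silenceSplitPoint`, the fixed frame length, and the energy threshold are all assumptions, and the search range is simply the second half of the window (the last 15s of a 30s window).

```swift
// Returns the sample index at the middle of the longest silent run in the
// second half of `window`, or nil if no frame there falls below the threshold.
func silenceSplitPoint(in window: [Float], frameLength: Int, energyThreshold: Float) -> Int? {
    precondition(frameLength > 0, "frame length must be positive")
    let searchStart = window.count / 2
    var bestRun: Range<Int>? = nil   // longest silent run found so far, in samples
    var runStart: Int? = nil         // start of the silent run currently being extended
    var frameStart = searchStart
    while frameStart + frameLength <= window.count {
        let frame = window[frameStart..<(frameStart + frameLength)]
        // Mean squared amplitude of the frame, used as its energy.
        let energy = frame.reduce(Float(0)) { $0 + $1 * $1 } / Float(frameLength)
        if energy < energyThreshold {
            let start = runStart ?? frameStart
            runStart = start
            let run = start..<(frameStart + frameLength)
            if run.count > (bestRun?.count ?? 0) {
                bestRun = run
            }
        } else {
            runStart = nil
        }
        frameStart += frameLength
    }
    guard let run = bestRun else { return nil }
    return run.lowerBound + run.count / 2
}

// Toy window: loud for 24 samples, silent for 8, loud for 8.
let toyWindow = [Float](repeating: 1, count: 24) + [Float](repeating: 0, count: 8) + [Float](repeating: 1, count: 8)
let splitIndex = silenceSplitPoint(in: toyWindow, frameLength: 4, energyThreshold: 0.01)
print(String(describing: splitIndex))
```

Splitting in the middle of the quietest stretch keeps the cut away from speech on either side, which is what allows the resulting chunks to be decoded independently.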
What's Changed
- Fix language detection by @jkrukowski in #133
- Fix the reset operation exception in transcribeFile in the Demo. by @jiangdi0924 in #136
- gh action for making pr to whisper-kit-expo on whisperkit release by @seb-sep in #137
- add reStartRecordingLive function by @fengcunhan in #138
- Added @_disfavoredOverload for deprecated methods by @jkrukowski in #143
- VAD audio chunking by @jkrukowski in #135
- Async Progress Callback by @ZachNagengast in #145
- Detect language helper by @ZachNagengast in #146
New Contributors
- @jiangdi0924 made their first contribution in #136
- @seb-sep made their first contribution in #137
- @fengcunhan made their first contribution in #138
Full Changelog: v0.6.1...v0.7.0