
v0.10.0

@ZachNagengast ZachNagengast released this 20 Dec 00:17

Highlights

This release adds support for protocol-defined model input and output types, enabling full MLX or MLTensor pipelines without converting to MLMultiArray between encoder/decoder stages. For example, instead of

func encodeFeatures(_ features: MLMultiArray) async throws -> MLMultiArray?

you can now define the types by protocol:

func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)?

where the types are defined like so:

public protocol FeatureExtractorOutputType {}
extension MLMultiArray: FeatureExtractorOutputType {}
public protocol AudioEncoderOutputType {}
extension MLMultiArray: AudioEncoderOutputType {}

or, for a struct type:

public struct TextDecoderMLMultiArrayOutputType: TextDecoderOutputType {
    public var logits: MLMultiArray?
    public var cache: DecodingCache?
}

This lets the entire structure be handled by any model that conforms to the protocol, adding flexibility for passing different data types between models and reducing the number of conversion steps compared to previous versions, where everything was assumed to be an MLMultiArray.
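As a hedged, self-contained sketch of the pattern above: a custom tensor type can conform to the output protocols so it flows between stages without conversion. `SimpleTensor` and the identity "encoding" are illustrative only and not part of WhisperKit; the protocols are redeclared here just so the example compiles on its own.

```swift
// Illustrative protocols, matching the shape of those in the release notes.
public protocol FeatureExtractorOutputType {}
public protocol AudioEncoderOutputType {}

// Hypothetical custom tensor type (not a WhisperKit type) that can act
// as both a feature-extractor output and an audio-encoder output.
public struct SimpleTensor: FeatureExtractorOutputType, AudioEncoderOutputType {
    public var shape: [Int]
    public var scalars: [Float]
}

// A stage accepts the protocol type and downcasts to the concrete type
// it knows how to handle, returning nil for types it does not support.
func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)? {
    guard let tensor = features as? SimpleTensor else { return nil }
    // Identity "encoding" for illustration; a real encoder would run a model here.
    return tensor
}
```

The downcast-at-the-boundary design is what lets one pipeline mix concrete types: each stage only needs to recognize the types it can consume.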

We've started adopting these inference types by using the new MLTensor for token sampling on devices with the latest OS support, which yielded a 2x speedup for that operation. Future work will shift the entire pipeline to these types.
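For context, the operation being accelerated is greedy token sampling: picking the index of the largest logit. The plain-Swift version below is just a sketch of what that op computes, not WhisperKit's implementation and not the MLTensor code path itself.

```swift
// Greedy sampling sketch: return the index of the maximum logit.
// This is the computation MLTensor now performs on-device in one
// fused op on supported OS versions.
func greedySample(logits: [Float]) -> Int {
    var bestIndex = 0
    var bestValue = -Float.infinity
    for (index, value) in logits.enumerated() where value > bestValue {
        bestIndex = index
        bestValue = value
    }
    return bestIndex
}
```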

There are also some important fixes included:

  • Timestamp rules are now enforced when the withoutTimestamps decoding option is set to false, improving parity with OpenAI's Python implementation. This will significantly increase the number of timestamps returned during decoding and shorten the average length of individual segments.
    • Previous: <|0.00|> So in college, I was a government major,<|4.92|><|4.94|> which means I had to write a lot of papers.<|7.38|>
    • Now: <|0.00|> So in college,<|2.00|><|3.36|> I was a government major,<|4.88|><|4.90|> which means I had to write a lot of papers.<|7.36|>
  • Early stopping via callback (a way to stop the decoding loop early if repetition is detected) has been converted to use an actor to fix some concurrency issues noted by the community.
  • CI script now uploads failure results to GitHub for better visibility.
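The actor conversion for early stopping can be sketched as follows. An actor serializes access to the stop flag, so the decoding loop and the user's callback can touch it from different tasks without data races. The names here are illustrative assumptions, not WhisperKit's actual API.

```swift
// Hypothetical actor guarding an early-stop flag. Actor isolation
// guarantees that reads and writes of `shouldStop` are serialized,
// which is the concurrency fix described in the release notes.
actor EarlyStopper {
    private var shouldStop = false

    // Called from a callback, e.g. when repetition is detected.
    func requestStop() {
        shouldStop = true
    }

    // Polled by the decoding loop between iterations.
    func isStopRequested() -> Bool {
        shouldStop
    }
}
```

Because every access goes through `await`, the compiler enforces that no caller can race on the flag, unlike a plain boolean shared across threads.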

⚠️ Breaking changes

  • Changing the protocol may result in some unexpected behavior if you are using a custom implementation; please raise an issue if you notice anything.
  • WhisperKit.sampleRate has been moved to Constants.defaultWindowSamples

Finally, there were some great open-source contributions listed below, with a broad range of improvements to the library. Huge thanks to all the contributors 🙏

What's Changed

  • Fix audio processing edge case by @ZachNagengast in #237
  • Add public callbacks to help expose internal state a little more by @iandundas in #240
  • Freeze loglevel enum by @ZachNagengast in #255
  • Update WhisperAX app icon for macOS to align with Apple HIG standards by @Stv-X in #257
  • Add ability to prevent config.json being written to ~/Documents/huggingface/... by @iandundas in #262
  • Typo in Model Descriptions by @rk-helper in #269
  • Audio: Fix taking a suffix of negative length from a collection by @mattisssa in #278

New Contributors

Full Changelog: v0.9.4...v0.10.0