Important

Created by Preternatural AI, an exhaustive client-side AI infrastructure for Swift.
This project and the frameworks it uses are presently in the alpha stage of development.

NarratorBot: Transform Image into Audio

A bot that narrates what it sees in front of it, in the style of a BBC nature documentary voiced by Sir David Attenborough. It uses GPT-4o for image understanding and ElevenLabs for audio generation.

MIT License

Table of Contents

  • Usage
  • Key Concepts
  • Preternatural Frameworks
  • Technical Specifications
  • Conclusion
  • Acknowledgements
  • License

Usage

Supported Platforms

  • macOS

  1. Download Xcode from the App Store.
  2. Open NarratorBot.xcodeproj and wait for it to resolve packages.
  3. Add your OpenAI API Key in the LLMManager file:
// LLMManager
private static let llm: any LLMRequestHandling = OpenAI.Client(apiKey: "YOUR_API_KEY")

You can get the OpenAI API key on the OpenAI developer website. Note that you have to set up billing and add a small amount of money for the API calls to work (this will cost you less than a dollar). If you prefer not to hardcode the key, see the sketch after these setup steps.

  4. Add your ElevenLabs API Key in TTSManager:
// TTSManager
private static let tts: ElevenLabs.Client = .init(apiKey: "YOUR_API_KEY")

ElevenLabs is a text-to-speech (TTS) service that the NarratorBot app uses to generate the audio for the image description. You can get your ElevenLabs API key on the ElevenLabs website; the key is located in your user profile:

[Screenshot: the ElevenLabs API key shown in the user profile]
  5. Click Run. You will be prompted to trust macros from CorePersistence and SwiftUIZ. Click Trust.
  6. Click Run again.
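
If you prefer not to hardcode API keys, one alternative (not part of this project) is to pass them in through the Xcode scheme's environment variables and read them with ProcessInfo; the OPENAI_API_KEY variable name below is just an example. The same approach works for the ElevenLabs key in TTSManager.

// LLMManager (illustrative alternative to hardcoding the key; not part of the project)
// Add OPENAI_API_KEY to the app scheme's environment variables in Xcode.
private static let llm: any LLMRequestHandling = OpenAI.Client(
    apiKey: ProcessInfo.processInfo.environment["OPENAI_API_KEY"] ?? "YOUR_API_KEY"
)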

Upon successful installation, enjoy the app!

[Demo video]

Key Concepts

The NarratorBot app was developed to demonstrate the following key concepts:

  • Working with OpenAI's Vision API
  • Using ElevenLabs for Audio Generation

Preternatural Frameworks

The following Preternatural Frameworks were used in this project:

  • AI: The definitive, open-source Swift framework for interfacing with generative AI.
  • Media: Media makes it stupid simple to work with media capture & playback in Swift.

Technical Specifications

Large Language Models (LLMs) are rapidly evolving and expanding into multimodal capabilities: the ability to process inputs in multiple modes, such as text, images, and audio. Multimodal LLMs are now starting to be referred to as Large Multimodal Models, or LMMs. As Apple developers, we are in a perfect position to take advantage of these models, since we build applications for devices with built-in cameras and microphones. With the Vision capabilities of LLMs, it is easier than ever to process images that the user captures in novel ways, as this NarratorBot example shows.

Image-to-Text (Vision) Implementation

When a user opens NarratorBot, they are prompted via a camera view to capture a photo of themselves. The first step is to send that photo to an LLM and instruct it to describe what is in it. This is done in a few simple steps using Preternatural's AI framework (a consolidated sketch follows the individual steps):

  1. Import the AI Framework:
// LLMManager
import AI
  2. Specify the LLM client. Currently, only OpenAI's GPT-4+ models can process images (Anthropic support will be added in the near future).
private static let llm: any LLMRequestHandling = OpenAI.Client(apiKey: "YOUR_API_KEY")
  3. Specify the LLM model. Currently, only OpenAI's GPT-4+ models support Vision requests:
static let llmModel: OpenAI.Model = .gpt_4o // .gpt_4
  4. The image will be a prompt input for the LLM. Converting the image (both UIImage and NSImage are supported) to a PromptLiteral is simple:
let imagePrompt = try PromptLiteral(image: image)
  5. Specify the System Prompt to give the LLM model general instructions.
// TTSManager.narrator.prompt
let prompt: PromptLiteral =
  """
  You are Sir David Attenborough. Follow these instructions:
  
  1. Narrate the picture of the human as if it is a nature documentary.
  2. Make it snarky and funny.
  3. Don't repeat yourself.
  4. Make it short.
  5. If I do anything remotely interesting, make a big deal about it!
  """
  6. Include the System Prompt, User Prompt, and ImagePrompt as Messages for the LLM:
// LLMManager
let messages: [AbstractLLM.ChatMessage] = [
    .system(TTSManager.narrator.prompt),
    .user {
        .concatenate(separator: nil) {
            PromptLiteral("Describe this image") // the user prompt
            imagePrompt
        }
    }
]
  7. Add any Parameters, such as a Token Limit:
let tokenLimit = 1000
let parameters = AbstractLLM.ChatCompletionParameters(
                tokenLimit: .fixed(tokenLimit))
  8. Make the LLM completion request to get the image description in Sir David Attenborough's nature documentary style:
let imageDescription: String = try await llm.complete(
        messages,
        parameters: parameters,
        model: llmModel,
        as: .string)

return imageDescription
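
Putting the steps above together, here is a minimal consolidated sketch. The describeImage function name, the enum wrapper, and the NSImage parameter type are illustrative assumptions; see LLMManager for the project's actual code.

// LLMManager (consolidated sketch; names are illustrative)
import AI
import AppKit

enum LLMManager {
    private static let llm: any LLMRequestHandling = OpenAI.Client(apiKey: "YOUR_API_KEY")
    static let llmModel: OpenAI.Model = .gpt_4o

    // Sends the captured photo to GPT-4o and returns the narration text.
    static func describeImage(_ image: NSImage) async throws -> String {
        // Convert the image into a prompt literal the LLM can consume.
        let imagePrompt = try PromptLiteral(image: image)

        // System prompt (Attenborough persona), user prompt, and image.
        let messages: [AbstractLLM.ChatMessage] = [
            .system(TTSManager.narrator.prompt),
            .user {
                .concatenate(separator: nil) {
                    PromptLiteral("Describe this image")
                    imagePrompt
                }
            }
        ]

        // Cap the response length.
        let parameters = AbstractLLM.ChatCompletionParameters(tokenLimit: .fixed(1000))

        // Request the completion as a plain string.
        let imageDescription: String = try await llm.complete(
            messages,
            parameters: parameters,
            model: llmModel,
            as: .string
        )
        return imageDescription
    }
}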

You can view the full implementation in LLMManager.

Text-to-Speech (TTS) Implementation

Now that we have the text of what Sir David Attenborough would say about the photo of the human, we can generate the audio in an Attenborough-style voice provided by ElevenLabs' VoiceLab. To do this, we first have to specify the voice ID:

// TTSManager
var elevenLabsVoice: String {
    switch self {
    // future voice implementation
    case .ericCartman:
        return "ZvOw3uFB0hlmUg3wjXi6"
    // ElevenLabs has many voices in the style of David Attenborough.
    // Voices can also be easily cloned (if you do clone a voice and make a
    // commercial product, make sure you have the consent of the voice actor).
    case .davidAttenborough:
        return "17jPwOCwyfZmp68jZqhx"
    }
}

To make a speech generation call to the ElevenLabs client using the AI framework, we need to import the AI and ElevenLabs modules:

// TTSManager
import AI
import ElevenLabs

Next, we specify the client:

private static let tts: ElevenLabs.Client = .init(apiKey: "YOUR_API_KEY")

Now, simply provide the text to be converted to audio (the image description generated by OpenAI's Vision API), and the audio data will be returned:

static func createTextNarration(_ text: String) async throws -> Data {
    let data = try await tts.speech(
        for: text,
        voiceID: narrator.elevenLabsVoice,
        voiceSettings: ElevenLabs.VoiceSettings(),
        model: .TurboV2
    )
    return data
}

You can view the full implementation in TTSManager.
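
To sketch how the two pieces fit together, the following hedged example describes a captured photo with the describeImage sketch above, narrates it with createTextNarration, and plays the result using plain AVFoundation. This is not the app's actual playback code (the app uses the Media framework); it only illustrates the flow.

// Illustrative end-to-end flow (assumes the sketches above; not the app's exact code)
import AppKit
import AVFoundation

// Keep a reference so the player isn't deallocated mid-narration.
var narrationPlayer: AVAudioPlayer?

func narrate(_ image: NSImage) async throws {
    // 1. Vision: describe the photo in Attenborough style (sketch above).
    let description = try await LLMManager.describeImage(image)

    // 2. TTS: turn the description into audio data via ElevenLabs.
    let audioData = try await TTSManager.createTextNarration(description)

    // 3. Playback: play the generated narration.
    narrationPlayer = try AVAudioPlayer(data: audioData)
    narrationPlayer?.play()
}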

Conclusion

While LLMs initially gained popularity in chat mode, they are evolving to offer much more, including the ability to analyze images (and even videos). Combined with services like ElevenLabs' voice generation API, this creates a powerful and versatile toolset for us as developers. The NarratorBot example demonstrates the potential of this technology by combining OpenAI's Image-to-Text (Vision) capabilities with ElevenLabs' Text-to-Speech (voice generation) API to create a dynamic, entertaining narration based on user-captured photos.

Acknowledgements

This app is the product of a fun night of hacking with my friend Siddarth!

License

This package is licensed under the MIT License.