GitHub - tigrisdata-community/multi-modal-starter-kit: Multi-modal starter kit for AI video understanding and narration. Works with Ollama (Llava, bakllava), GPT-4v

Multi Modal Starter Kit 🤖📽️

A multi-modal starter kit that has AI narrate a video or scene of your choice. Includes examples of video processing, frame extraction, and sending frames to AI models efficiently. Costs $0 to run.

Works with the following models 👇🦙

LLaVa (powered by Ollama)
LLaVa-vicuna (powered by Ollama)
BakLLaVA (powered by Ollama)
Moondream (powered by Fal.ai)
...and many others on https://ollama.com/library
GPT-4v

Have questions? Join AI Stack devs #multi-modal-starter-kit

🎉 Demo (Sound ON 🔊)

MM-demo.mp4

Stack

💻 Video and Image hosting: Tigris
🦙 Inference: Ollama, Fal with options to use OpenAI
🔌 GPU: Fly
💾 Caching: Upstash
🤔 AI response pub/sub: Upstash
📢 Video narration: ElevenLabs
🗺️ Workflow orchestration: Inngest
🖼️ App logic: Next.js
🖌️ UI: Vercel v0

Overview

🚀 Quickstart
💻 Useful Commands

Quickstart

Step 0: Fork this repo and clone it

git clone git@github.com:[YOUR_GITHUB_ACCOUNT_NAME]/multi-modal-starter-kit.git

Install dependencies

If you are using Homebrew on your machine, run brew bundle to install all the needed dependencies. If you need to install them manually, install these from your package manager of choice:

ffmpeg (ideally with a wide range of codecs supported; if you don't know what this means, the default package is probably fine)
Node.js 20.x or higher
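
To illustrate why ffmpeg is needed, here is a minimal sketch of how frames could be pulled out of a video at a fixed rate. The helper name buildFrameExtractionArgs and the output file pattern are hypothetical and are not taken from this repo; the only ffmpeg features used are the -i input flag and the fps video filter.

```typescript
// Hypothetical helper (not from this repo): builds the argument list for an
// ffmpeg run that writes `fps` frames per second of video as numbered JPEGs.
function buildFrameExtractionArgs(
  inputPath: string,
  outputDir: string,
  fps: number,
): string[] {
  if (!(fps > 0)) {
    throw new Error(`fps must be positive, got ${fps}`);
  }
  return [
    "-i", inputPath,               // input video
    "-vf", `fps=${fps}`,           // resample to the requested frame rate
    `${outputDir}/frame-%04d.jpg`, // numbered output files (assumed naming)
  ];
}
```

The resulting array can be handed to spawn("ffmpeg", args) from node:child_process. A lower fps value means fewer frames sent to the model, which is usually the main lever for keeping inference cost and latency down.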

Step 1: Set up Tigris

Create an .env file

cd multi-modal-starter-kit
cp .env.example .env

Set up Tigris

Make sure you have a fly.io account and have fly CLI installed on your computer
cd multi-modal-starter-kit
Pick a name for your version of your app. App names on fly are global, so it has to be unique. For example multi-modal-awesomeness
Create the app on fly with fly app create <your app name> so for example fly app create multi-modal-awesomeness
Create the storage with fly storage create
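You should get a list of credentials (AWS_ACCESS_KEY_ID, AWS_ENDPOINT_URL_S3, AWS_REGION, AWS_SECRET_ACCESS_KEY, and BUCKET_NAME), each with a value.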
If you get a list of keys without values, destroy the bucket with fly storage destroy and try again.
Copy paste these values to your .env under "Tigris"
Note that the variable name for the storage bucket is NEXT_PUBLIC_BUCKET_NAME. If you copy/paste, add the NEXT_PUBLIC_ prefix to BUCKET_NAME.
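
A quick way to catch a missing or misnamed Tigris value before starting the app is a small check like the sketch below. The helper missingEnvVars is hypothetical, and the list of names assumes that .env uses the same keys that fly storage create prints, with the bucket renamed as described above.

```typescript
// Assumed variable names: the keys printed by `fly storage create`,
// with BUCKET_NAME renamed to NEXT_PUBLIC_BUCKET_NAME.
const TIGRIS_ENV_VARS: string[] = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "AWS_ENDPOINT_URL_S3",
  "AWS_REGION",
  "NEXT_PUBLIC_BUCKET_NAME",
];

// Hypothetical helper: returns the names that are unset or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[] = TIGRIS_ENV_VARS,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node.js script this could be called as missingEnvVars(process.env) and the result printed before npm run dev.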

Set Tigris bucket cors policy and bucket access policy

fly storage update YOUR_BUCKET_NAME --public
Make sure you have aws CLI installed and run aws configure. Enter the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY printed above. Note that these are not actual Amazon Web Services credentials, but Tigris credentials. If you have the aws CLI already configured for Amazon, it will overwrite those values.

Run the following command to update CORS policy on the bucket

aws s3api put-bucket-cors --bucket BUCKET_NAME --cors-configuration file://cors.json --endpoint-url https://fly.storage.tigris.dev/
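
For reference, the file passed to --cors-configuration follows the S3 CORS shape, a CORSRules array. The sketch below generates such a document; the helper buildCorsConfig and the specific origins, methods, and max age are assumptions for illustration, and the cors.json shipped in this repo may allow different values.

```typescript
// Shape accepted by `aws s3api put-bucket-cors --cors-configuration`.
interface CorsRule {
  AllowedOrigins: string[];
  AllowedMethods: string[];
  AllowedHeaders: string[];
  MaxAgeSeconds?: number;
}

// Hypothetical helper: read-only CORS rules for the given origins.
// The repo's own cors.json may be more or less permissive than this.
function buildCorsConfig(origins: string[]): { CORSRules: CorsRule[] } {
  return {
    CORSRules: [
      {
        AllowedOrigins: origins,
        AllowedMethods: ["GET", "HEAD"],
        AllowedHeaders: ["*"],
        MaxAgeSeconds: 3600,
      },
    ],
  };
}
```

Writing JSON.stringify(buildCorsConfig(["http://localhost:3000"]), null, 2) to a file gives something that can be passed with file:// in the command above.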

Step 2: Create a test video

We have a sample video in the assets directory that you can use to test the app. You can run the following command if you want to test the app with this video

aws s3 cp ./assets/pasta-making.mp4 s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev

Alternatively, you can upload your own videos.

Step 3: Set up Ollama / Llava

By default, the app uses Ollama / LLaVA for vision. If you want to use OpenAI GPT-4v instead, you can set INFERENCE_PLATFORM="OpenAI" and fill in OPENAI_API_KEY in .env

There are two ways to get Ollama up and running. You can either use Fly GPU, which provides very fast inference, or use your laptop.

Option 1: Fly GPU

Make sure you have a Fly account and flyctl installed
Fork ollama-demo, edit fly.toml to rename the app, and run fly launch
Under the ollama-demo directory, run fly ssh console. Once you have ssh'd into the instance, run ollama pull llava. By default, this pulls the llava 7b model, but you could also pull other vision models to use with your app, such as:

ollama pull llava:34b
ollama pull llava:7b-v1.6-vicuna-q4_0
...

You should get a hostname once fly launch succeeds. Copy and paste this value to OLLAMA_HOST in .env. Your app will now use this Fly GPU for inference.

Option 2: Your laptop

Install Ollama
Run ollama pull llava in your terminal. As mentioned under Option 1, you can also pull other models to compare the results.
(optional) Watch requests coming into Ollama by running this in a new terminal tab tail -f ~/.ollama/logs/server.log
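
Whichever option you pick, the app talks to Ollama over HTTP. As a rough sketch of what a vision request looks like, the helper below builds a call to Ollama's /api/generate endpoint with one base64-encoded frame. The helper name is hypothetical, and it assumes OLLAMA_HOST holds a full base URL including the scheme (for a local install, http://localhost:11434); the repo's own request code may differ.

```typescript
interface OllamaGenerateRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds a non-streaming /api/generate request that
// sends one base64-encoded image along with a text prompt.
function buildOllamaGenerateRequest(
  host: string,        // assumed to be a full URL, e.g. http://localhost:11434
  model: string,       // e.g. "llava"
  prompt: string,
  imageBase64: string, // frame contents, base64 without a data: prefix
): OllamaGenerateRequest {
  const base = host.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, images: [imageBase64], stream: false }),
    },
  };
}
```

The result can be passed to fetch(request.url, request.init); with stream set to false, the generated text comes back in the response field of the JSON reply.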

Step 4: Set up ElevenLabs

Go to https://elevenlabs.io/, log in, and click on your profile picture on lower left. Select "Profile + API key". Copy the API key and save it as XI_API_KEY in the .env file
Select an ElevenLabs voice by clicking on "Voices" on the left side nav bar and navigating to "VoiceLab". Copy the voice ID and save it as XI_VOICE_ID in .env
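
To show how the two values are used together, here is a minimal sketch of a text-to-speech request against the ElevenLabs REST API. The helper name is hypothetical and the body carries only the text; the repo's narration code may set additional options such as a model or voice settings.

```typescript
interface NarrationRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds a text-to-speech request for one snippet.
// apiKey corresponds to XI_API_KEY and voiceId to XI_VOICE_ID in .env.
function buildNarrationRequest(
  apiKey: string,
  voiceId: string,
  text: string,
): NarrationRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    init: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text }),
    },
  };
}
```

The response body is audio data rather than JSON, so it would be read with arrayBuffer() or streamed to the player instead of being parsed.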

Step 5: Set up Upstash

When narrating a very long video, Upstash Redis is used for pub/sub to notify the client as new snippets of the reply come back. Upstash is also used for caching video/images so that subsequent requests don't take long.

Go to https://console.upstash.com/, select "Create Database" with the following settings
Once created, under 'Node' - 'io-redis' tab, copy the whole string starting with "rediss://" and set UPSTASH_REDIS_URL value as this string in .env
On the same page, scroll down to the "Rest API" section and copy paste everything under ".env" tab to your .env file
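
As an illustration of the caching side, the sketch below builds a request that stores a model reply in Upstash through its REST API, which accepts a Redis command as a JSON array in a POST body. The helper name, the key naming, and the TTL are assumptions; the app itself may use a Redis client library rather than raw REST calls.

```typescript
interface UpstashRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: SET key value EX ttl via the Upstash REST API.
// restUrl and token correspond to UPSTASH_REDIS_REST_URL and
// UPSTASH_REDIS_REST_TOKEN in .env.
function buildUpstashSetRequest(
  restUrl: string,
  token: string,
  key: string,
  value: string,
  ttlSeconds: number,
): UpstashRequest {
  return {
    url: restUrl,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${token}` },
      body: JSON.stringify(["SET", key, value, "EX", ttlSeconds]),
    },
  };
}
```

A matching read would post ["GET", key] to the same URL. Keying the cache on the video name plus frame timestamp is one reasonable choice, since the same frame then never needs to be described twice.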

Step 6: Run App

npm install
npm run dev

Step 7: Deploying on fly

By now you should have a functional app. Let's deploy it to the fly.io account that you set up in Step 1.

First, let's see what secrets are already available in our app using fly secrets list:

$ ➔  fly secrets list
NAME                            DIGEST         CREATED AT
AWS_ACCESS_KEY_ID               xxxxxxx        Feb 23 2024 20:33
AWS_ENDPOINT_URL_S3             xxxxxxx        Feb 23 2024 20:33
AWS_REGION                      xxxxxxx        Feb 23 2024 20:33
AWS_SECRET_ACCESS_KEY           xxxxxxx        Feb 23 2024 20:33
BUCKET_NAME                     xxxxxxx        Feb 23 2024 20:33

The secret names need to match those in the .env.example file. Rename the BUCKET_NAME secret to NEXT_PUBLIC_BUCKET_NAME:

$ ➔ fly secrets set NEXT_PUBLIC_BUCKET_NAME=<YOUR BUCKET NAME>
$ ➔ fly secrets unset BUCKET_NAME

Now, all other environment vars:

$ ➔ fly secrets set OPENAI_API_KEY=<YOUR KEY HERE>
$ ➔ fly secrets set UPSTASH_REDIS_URL=<UPSTASH REDIS URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_URL=<UPSTASH REDIS REST URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_TOKEN=<UPSTASH REDIS REST TOKEN HERE>
$ ➔ fly secrets set XI_API_KEY=<XI API KEY>
$ ➔ fly secrets set XI_VOICE_ID=<XI VOICE ID>
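
Since fly secrets set accepts several NAME=VALUE pairs in one call, the commands above can also be collapsed into a single invocation. The sketch below builds that argument list from an object of values; the helper name is hypothetical and it assumes the values have already been read from .env.

```typescript
// Hypothetical helper: builds the argument list for one `fly secrets set`
// call covering every name in `names`, failing early on missing values.
function buildFlySecretsArgs(
  env: Record<string, string | undefined>,
  names: string[],
): string[] {
  const pairs: string[] = [];
  for (const name of names) {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`Missing value for ${name}`);
    }
    pairs.push(`${name}=${value}`);
  }
  return ["secrets", "set"].concat(pairs);
}
```

Passing the array to spawn("fly", args) avoids shell quoting problems with values that contain special characters, such as the rediss:// URL.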

Once environment is all set, we can make the app fly:

$ ➔ fly launch
$ ➔ fly deploy

fly.io instructions for NextJS

[Optional] Step 8: Production-ready workflow orchestration

There is an example in the repo that leverages Inngest for workflow orchestration. Inngest is especially helpful when you have a long-running workflow, because it retries failed steps automatically. Example code is in src/inngest/functions.ts.

In this example, Inngest waits for new images to upload to Tigris, then sends the image to Ollama/OpenAI for processing. The "describe-image" step is auto-retried when there is a failure or returned JSON is malformed.

export const inngestTick = inngest.createFunction(
  { id: "tick" },
  { cron: "* * * * *" },
  async ({ step }) => {
    await step.run("fetch-latest-snapshot", async () => {
      return await fetchLatestFromTigris();
    });

    const result = await step.waitForEvent("Tigris.complete", {
      event: "Tigris.complete",
      timeout: "1m",
    });

    const url = result?.data.url;
    console.log("url", url);
    if (!!url) {
      await step.run("describe-image", async () => {
        return await describeImage(url);
      });
    }
  }
);
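
The retry on malformed JSON works because Inngest re-runs a step whose function throws. A minimal sketch of that idea is below: a parser that throws when the model output is not usable, which a step like "describe-image" could call on the raw reply. The function name and the description field are hypothetical; the actual response shape used in the repo is not shown here.

```typescript
// Hypothetical response shape; the repo's real shape may differ.
interface ImageDescription {
  description: string;
}

// Throws on malformed or incomplete output so that the surrounding
// Inngest step fails and is retried.
function parseImageDescription(raw: string): ImageDescription {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Model returned malformed JSON");
  }
  const candidate = parsed as { description?: unknown } | null;
  if (typeof candidate !== "object" || candidate === null ||
      typeof candidate.description !== "string") {
    throw new Error("Model JSON is missing a string 'description' field");
  }
  return { description: candidate.description };
}
```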

[Optional] Step 9: Change Inference Platforms

fal

fal.ai is an inference platform that specializes in fast media model inference. To use fal with the multi-modal starter kit demo, set the INFERENCE_PLATFORM environment variable to "fal" and add a new FAL_KEY environment variable from the fal.ai website. First, create an account with fal.ai, navigate to the keys page, and follow the steps to create a key. Copy the result into the .env file and save it as FAL_KEY.

INFERENCE_PLATFORM=fal
FAL_KEY=***

Currently, only the moondream model is available with fal. Stay tuned for llava7B and llava34B.
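
The platform switch can be pictured as a small resolver over the INFERENCE_PLATFORM value. The sketch below is hypothetical: it assumes Ollama is the fallback when the variable is unset or unrecognized, and it compares case-insensitively, which the real app may not do.

```typescript
type InferencePlatform = "ollama" | "openai" | "fal";

// Hypothetical resolver: maps the INFERENCE_PLATFORM value to a platform,
// falling back to Ollama (the documented default) for anything else.
function resolveInferencePlatform(value: string | undefined): InferencePlatform {
  switch ((value ?? "").trim().toLowerCase()) {
    case "openai":
      return "openai";
    case "fal":
      return "fal";
    default:
      return "ollama";
  }
}
```

In the app this would be called with process.env.INFERENCE_PLATFORM, and the result would decide which key (OPENAI_API_KEY or FAL_KEY) has to be present.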

Useful Commands

Tigris is 100% aws cli compatible. Here are some frequently used commands during active development:

Pause voice

Press 'v' to toggle the voice. This pauses the voice so it will resume at the point it was paused.

Check Tigris Dashboard

fly storage dashboard BUCKET_NAME

Periodic cleanup

Currently temporary files for the snapshots that get passed to the model and the elevenlabs voice files are stored in the bucket and are not cleaned up. To clean these up, you can run the following from the CLI:

aws s3 rm s3://BUCKET_NAME/ --endpoint-url https://fly.storage.tigris.dev --recursive --exclude "*.mp4"
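
Because this command deletes recursively, it helps to be clear about what the --exclude "*.mp4" filter keeps. The sketch below mirrors that filter over a list of object keys; the helper name and the sample keys are hypothetical.

```typescript
// Hypothetical helper: returns the keys the cleanup command would delete,
// i.e. everything whose name does not end in ".mp4" (case-sensitive).
function keysToClean(keys: string[]): string[] {
  return keys.filter((key) => !/\.mp4$/.test(key));
}
```

For example, given pasta-making.mp4, a snapshot image, and a voice file, only the video is kept. Note that a video uploaded with a different extension, such as .mov or .MP4, would not be protected by this filter.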

Upload videos

aws s3 cp PATH_TO_YOUR_VIDEO s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev
