Commit

first commit
dtiberio committed Dec 13, 2024
1 parent aed7aae commit 6e411e2
Showing 8 changed files with 972 additions and 2 deletions.
4 changes: 4 additions & 0 deletions .env copy
@@ -0,0 +1,4 @@
# update this with your private key
# you can get one from the Google AI Studio

GEMINI_API_KEY = 'YOUR_PRIVATE_GEMINI_API_KEY'
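
A minimal sketch of how a script could load this key at runtime. This assumes the `python-dotenv` package; the actual scripts in this repo may read the environment differently.

``` python
import os

from dotenv import load_dotenv   # pip install python-dotenv
from google import genai

load_dotenv()                    # copies GEMINI_API_KEY from the .env file into the environment
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
```
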
2 changes: 2 additions & 0 deletions .gitignore
@@ -158,3 +158,5 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

logs/
34 changes: 32 additions & 2 deletions README.md
@@ -1,2 +1,32 @@
# Gemini_2.0_Live_API_Tutorials

These two Python files provide a demo of the newly released Google Gemini 2.0 Live API.

## live_api_starter_cv.py
The Live API Starter is a Python application that implements real-time audio and video interaction with Google's Gemini AI model. It creates a bidirectional communication channel where users can send text, audio, and video input while receiving audio and text responses from the model in real time.
The code shares your webcam video with the Gemini model while you hold a voice chat with it.

## live_api_starter_desk.py
This application is a desktop assistant that combines audio input/output with screen capture to interact with Google's Gemini API. It creates an interactive experience where users can communicate with the Gemini model through both voice and text while sharing their screen.
The code shares your desktop with the Gemini model while you hold a voice chat with it; a minimal capture sketch follows below.
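
This commit view does not show the capture code itself, so the snippet below is only a sketch of the screen-sharing step, assuming the `mss` package and Pillow; the actual script may use a different capture library.

``` python
import io

import mss              # pip install mss
from PIL import Image   # pip install pillow

def grab_screen_jpeg() -> bytes:
    """Capture the primary monitor and return it as JPEG bytes ready to send."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])   # monitors[1] is the primary display
    img = Image.frombytes("RGB", shot.size, shot.rgb)
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()
```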

# References:
- https://github.com/google-gemini/cookbook/blob/main/gemini-2/README.md
- https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_starter.py

## The new Gemini 2.0 Live API requires:
- https://pypi.org/project/google-genai/
- https://github.com/googleapis/python-genai

To learn more, see the Python SDK reference:
https://googleapis.github.io/python-genai/

# Possible bugs
During testing I noticed that the Gemini model sometimes fails to "see" the webcam or the desktop when you request it via a text prompt; however, the same request usually works well via a voice prompt.
This might be due to the experimental nature of the release at the time of testing.

# Please note
The code is provided as-is for learning purposes; please don't expect any updates in the future.
I've made some changes to the code from the Google cookbook to add logging and troubleshooting details.
This code was tested with Python 3.12 running on Windows 11.
Unfortunately, I can't provide any support.
112 changes: 112 additions & 0 deletions live_api_starter_cv.md
@@ -0,0 +1,112 @@
# Live API Starter Documentation

References:
- https://github.com/google-gemini/cookbook/blob/main/gemini-2/README.md
- https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_starter.py

The new Live API requires:
- https://pypi.org/project/google-genai/
- https://github.com/googleapis/python-genai

This PyPI package doesn't support the Live API:
- https://pypi.org/project/google-generativeai/
- https://github.com/google-gemini/generative-ai-python

Notes as of 2024-12-13:
- `google-generativeai` - the old Python SDK, for the Gemini API in Google AI Studio only
- `google-vertexai` - more complex, for Gemini models in Google Vertex AI only
- `google-genai` - the new SDK, for both Vertex AI and the Gemini API; supports the Live API

## The NEW GenAI API:

The new Google Gen AI SDK provides a unified interface to Gemini 2.0 through both the Gemini Developer API and the Gemini Enterprise API (Vertex AI).
With a few exceptions, code that runs on one platform will run on both. The Gen AI SDK also supports the Gemini 1.5 models.
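
The practical consequence is that the same `Client` class targets either backend; a brief sketch (the project and location values are placeholders):

``` python
from google import genai

# Gemini Developer API (Google AI Studio key)
client = genai.Client(api_key="YOUR_PRIVATE_GEMINI_API_KEY")

# Gemini Enterprise API (Vertex AI) -- same class, different backend
vertex_client = genai.Client(
    vertexai=True, project="your-gcp-project", location="us-central1"
)
```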

### Python
The Google Gen AI SDK for Python is available on PyPI and GitHub:
- `google-genai` on PyPI --> `pip install google-genai`
- `python-genai` on GitHub: https://github.com/googleapis/python-genai

To learn more, see the Python SDK reference:
https://googleapis.github.io/python-genai/

### Quickstart
1. Import libraries
``` python
from google import genai
from google.genai import types
```
2. Create a client
``` python
client = genai.Client(api_key='YOUR_API_KEY')
```
3. Generate content
``` python
response = client.models.generate_content(
model='gemini-2.0-flash-exp', contents='What is your name?'
)
print(response.text)
```
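
Since these tutorials target the Live API specifically, here is a minimal sketch of opening a live session with the new SDK. It follows the general pattern of the cookbook's live_api_starter.py, but method names and the required API version were still changing in this experimental release, so treat it as an assumption rather than a reference:

``` python
import asyncio

from google import genai

# The experimental Live API was exposed under the v1alpha API version at the time (assumption).
client = genai.Client(api_key="YOUR_API_KEY", http_options={"api_version": "v1alpha"})

MODEL = "models/gemini-2.0-flash-exp"
CONFIG = {"generation_config": {"response_modalities": ["TEXT"]}}

async def main() -> None:
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send("Hello, Gemini!", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```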

## Overview
The Live API Starter is a Python application that implements real-time audio and video interaction with Google's Gemini AI model. It creates a bidirectional communication channel where users can send text, audio, and video input while receiving audio and text responses from the model in real-time.

## Key Features
- Real-time audio input/output processing
- Video capture and streaming
- Text-based interaction
- Asynchronous operation
- Bidirectional communication with Gemini AI model

## Class Documentation

### AudioLoop
Main class that manages the audio/video streaming pipeline and communication with the Gemini AI model.

## Method Documentation

### `AudioLoop.__init__`
Initializes queues for audio/video processing and sets up session management.

### AudioLoop.send_text
Handles text input from the user and sends it to the Gemini session.

### AudioLoop._get_frame
Captures and processes a single video frame, converting it to JPEG format with size constraints.
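
A hedged sketch of that step, assuming OpenCV and Pillow as listed under Technical Details (the function name and the 1024-pixel limit are illustrative; the real method may also base64-encode the result):

``` python
import io

import cv2          # pip install opencv-python
import PIL.Image    # pip install pillow

def get_frame_jpeg(cap: cv2.VideoCapture) -> bytes | None:
    """Read one webcam frame and return it as size-limited JPEG bytes, or None on failure."""
    ret, frame = cap.read()
    if not ret:
        return None
    # OpenCV returns BGR; convert to RGB before handing the array to Pillow
    img = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    img.thumbnail((1024, 1024))        # enforce the size constraint
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

# Hypothetical usage: cap = cv2.VideoCapture(0); jpeg_bytes = get_frame_jpeg(cap)
```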

### AudioLoop.get_frames
Continuously captures video frames from the default camera and adds them to the video queue.

### AudioLoop.send_frames
Sends captured video frames to the Gemini session.

### AudioLoop.listen_audio
Sets up and manages audio input stream from the microphone.
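
A blocking sketch of that microphone setup, using the values from the Global Constants section below (the actual class drives this from async tasks):

``` python
import pyaudio   # pip install pyaudio

FORMAT = pyaudio.paInt16
CHANNELS = 1
SEND_SAMPLE_RATE = 16000
CHUNK_SIZE = 512

pya = pyaudio.PyAudio()
stream = pya.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SEND_SAMPLE_RATE,
    input=True,
    frames_per_buffer=CHUNK_SIZE,
)
data = stream.read(CHUNK_SIZE)   # one chunk of raw 16-bit PCM from the microphone
stream.stop_stream()
stream.close()
pya.terminate()
```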

### AudioLoop.send_audio
Sends audio chunks from the output queue to the Gemini session.

### AudioLoop.receive_audio
Processes responses from the Gemini model, handling both text and audio data.

### AudioLoop.play_audio
Manages audio playback of responses received from the model.

### AudioLoop.run
Main execution method that coordinates all the async tasks and manages the session lifecycle.
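
A simplified, runnable stand-in for that coordination pattern, assuming `asyncio.TaskGroup` (consistent with the Python 3.11+ note under Technical Details); the real method also opens the Live session and wires up the queues:

``` python
import asyncio

async def worker(name: str, seconds: float) -> None:
    # Stand-in for one AudioLoop coroutine (get_frames, listen_audio, receive_audio, ...)
    await asyncio.sleep(seconds)
    print(f"{name} done")

async def run() -> None:
    # One TaskGroup owns every task, so when the group exits
    # (normally or via an exception) all tasks are awaited or cancelled together.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(worker("get_frames", 0.1))
        tg.create_task(worker("listen_audio", 0.2))
        tg.create_task(worker("receive_audio", 0.3))

asyncio.run(run())
```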

## Global Constants

- FORMAT: Set to pyaudio.paInt16 for audio format
- CHANNELS: Set to 1 for mono audio
- SEND_SAMPLE_RATE: 16000Hz for input audio
- RECEIVE_SAMPLE_RATE: 24000Hz for output audio
- CHUNK_SIZE: 512 bytes for audio processing
- MODEL: Uses "models/gemini-2.0-flash-exp" for AI interactions

## Technical Details
- Uses asyncio for concurrent operations
- Implements PyAudio for audio handling
- Uses OpenCV (cv2) for video capture
- Integrates with Google's Genai client
- Supports Python 3.11+ with fallback for earlier versions