
Vinci - An Online Egocentric Video-Language Assistant

Open in Spaces

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
arXiv, 2024

💬 TL;DR

  • Overview: A real-time, embodied smart assistant based on an egocentric vision-language model.
  • Portable Device Compatibility: Designed for smartphones and wearable cameras, operating in an "always on" mode.
  • Hands-Free Interaction: Users engage in natural conversations to ask questions and get responses delivered via audio.
  • Real-Time Video Processing: Processes long video streams to answer queries about current and historical observations.
  • Task Planning and Guidance: Provides task planning based on past interactions and generates visual task demonstrations.

📣 Demo video

Demo videos: output.mp4, small_with_subtitle.mp4

🔨 Installation

git clone https://github.com/OpenGVLab/vinci.git
conda env create -f environment.yml
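
After the environment is created, activate it before running any of the commands below. The environment name is defined in environment.yml; the name vinci used here is an assumption, so substitute the actual name if it differs:

conda activate vinci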

Requirements:

  • Python 3.8 or above
  • PyTorch 2.0 or above is recommended
  • CUDA 11.4 or above is recommended
  • Docker is required when deploying the streaming demo
  • Gradio is required when using the local web-based demo

Downloading Checkpoints

bash download.sh

Running download.sh requires more than 100 GB of free disk space.

🎓 Getting Started

We offer two ways to run the Vinci model.

🎬 Online Streaming Demo

  1. Start the frontend, backend, and model services:
sudo ./boot.sh {start|stop|restart} [--cuda <CUDA_VISIBLE_DEVICES>] [--language chn/eng] [--version v0/v1]
  • --cuda <CUDA_VISIBLE_DEVICES>: Specify the GPU device(s) used to run the model.

  • --language <chn|eng>: Choose the language for the demo (default: chn).

    • chn: Chinese
    • eng: English
  • --version <v0|v1>: Select the model version (default: v1).

    • v0: Optimized for first-person perspective videos.
    • v1: Generalized model for both first-person and third-person perspective videos.
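
For example, a typical invocation that starts all services in English with the v1 model on GPU 0 would be:

sudo ./boot.sh start --cuda 0 --language eng --version v1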

Then use a browser to access the frontend page: http://YOUR_IP_ADDRESS:19333 (e.g., http://102.2.52.16:19333).

  2. Push a live stream. With a smartphone app or a GoPro/DJI camera, push the stream to: rtmp://YOUR_IP_ADDRESS/vinci/livestream

With a webcam, use the following command:

ffmpeg -f video4linux2 -framerate 30 -video_size 1280x720 -i /dev/video1 -f alsa -i default -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -video_size 1280x720 -c:a aac -threads 0 -f flv rtmp://YOUR_IP_ADDRESS:1935/vinci/livestream
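
To verify that the stream is actually reaching the server, you can optionally preview it with ffplay (a standard FFmpeg tool; this is just a sanity check, not part of the official setup):

ffplay rtmp://YOUR_IP_ADDRESS:1935/vinci/livestream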

Interact with Online Video Streaming Demo

  1. Activate the model service: To wake up the model and begin using it, simply say the wake-up phrase "你好望舒" (Ni hao Wangshu). Currently, only the Chinese wake-up phrase is supported.
  2. Chat with Vinci: Once activated, you can chat with Vinci by speaking. The model responds in both text and speech. Tip: for the best experience, speak clearly and at a moderate pace.
  3. Generate Predictive Visualizations: If you want to generate a predictive visualization of actions, include the keyword "可视化 (Ke shi hua)" in your command.

🎬 Gradio Demo for uploaded videos

python demovl.py [--language chn/eng] [--version v0/v1]
  • --cuda <CUDA_VISIBLE_DEVICES>: Specify the GPU device(s) used to run the model.

  • --language <chn|eng>: Choose the language for the demo (default: chn).

    • chn: Chinese
    • eng: English
  • --version <v0|v1>: Select the model version (default: v1).

    • v0: Optimized for first-person perspective videos.
    • v1: Generalized model for both first-person and third-person perspective videos.
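
For example, to launch the Gradio demo in English with the v1 model (if demovl.py does not accept the --cuda flag directly, setting CUDA_VISIBLE_DEVICES in the environment is a standard alternative):

CUDA_VISIBLE_DEVICES=0 python demovl.py --language eng --version v1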

Interact with Gradio Demo

  1. Upload a local video file.
  2. Click the Upload & Start Chat button to initiate the chat session.
  3. Click the play button to start playing the video.
  4. Adjust the stride of memory. This allows you to control the granularity of the model's memory.
  5. Real-time interaction: Type your questions in the chat box. The model will respond based on the current frame and the historical context. Example queries include:
  • Describe current action
  • Retrieve object from the history
  • Summarize previous actions
  • Scene understanding
  • Temporal grounding
  • Predict future actions

  6. Generate future videos: Based on the current frame and the historical context, the model can generate a short video of predicted future actions.

♥️ Origin

The name "Vinci" embodies several layers of meaning:

  • It is inspired by the famous Renaissance master Leonardo da Vinci, symbolizing a wealth of knowledge and insight, and suggesting that this assistant can provide equally exceptional service to users.
  • The word "Vinci" is derived from the Latin "vincere," meaning "to conquer" or "to overcome," implying that this assistant helps users overcome various difficulties and challenges.
  • Phonetically, it resembles "Vision," highlighting the assistant's core function of analyzing and responding based on visual information.
  • It represents a fusion of elegance, wisdom, and innovation, complementing the high-tech nature of first-person camera devices.

望舒 (Wangshu) comes from the Chu Ci ("Li Sao"): "前望舒使先驱兮,后飞廉使奔属" ("I sent Wangshu ahead as my herald, with Feilian following behind"). In Chinese mythology, Wangshu is the deity who drives the carriage of the moon, a figure symbolizing guidance and direction.

✒️ Citation

If this work is helpful for your research, please consider citing us.

@article{vinci,
  title={Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model},
  author={Huang, Yifei and Xu, Jilan and Pei, Baoqi and He, Yuping and Chen, Guo and Yang, Lijin and Chen, Xinyuan and Wang, Yaohui and Nie, Zheng and Liu, Jinyao and Fan, Guoshun and Lin, Dechen and Fang, Fang and Li, Kunpeng and Yuan, Chang and Wang, Yali and Qiao, Yu and Wang, Limin},
  journal={arXiv preprint arXiv:2412.21080},
  year={2024}
}
@article{pei2024egovideo,
  title={EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation},
  author={Pei, Baoqi and Chen, Guo and Xu, Jilan and He, Yuping and Liu, Yicheng and Pan, Kanghua and Huang, Yifei and Wang, Yali and Lu, Tong and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2406.18070},
  year={2024}
}
@inproceedings{xu2024retrieval,
  title={Retrieval-augmented egocentric video captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13525--13536},
  year={2024}
}
@inproceedings{huang2024egoexolearn,
  title={EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World},
  author={Huang, Yifei and Chen, Guo and Xu, Jilan and Zhang, Mingfang and Yang, Lijin and Pei, Baoqi and Zhang, Hongjie and Lu, Dong and Wang, Yali and Wang, Limin and Qiao, Yu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
@inproceedings{chen2023seine,
  title={Seine: Short-to-long video diffusion model for generative transition and prediction},
  author={Chen, Xinyuan and Wang, Yaohui and Zhang, Lingjun and Zhuang, Shaobin and Ma, Xin and Yu, Jiashuo and Wang, Yali and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2023}
}
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}
