We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models and does not require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
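To make the bi-directional temporal modeling more concrete, below is a minimal PyTorch-style sketch of cross-frame attention, where each frame's queries attend to keys and values gathered from all frames. The function, tensor names, and shapes are illustrative assumptions for exposition, not the repository's actual implementation.

import torch

def cross_frame_attention(q, k, v):
    # q, k, v: (frames, tokens, dim) projections taken from a frozen image
    # diffusion model's self-attention layer (shapes are assumptions).
    f, t, d = k.shape
    # Gather keys/values from every frame so each frame attends both forward
    # and backward in time, with no training required.
    k_all = k.reshape(1, f * t, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * t, d).expand(f, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ v_all  # (frames, tokens, dim)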
- Video editing with off-the-shelf image diffusion models.
- No training on any video.
- Promising results in editing attributes, subjects, places, etc., in real-world videos.
- [2023.4.12] Online Gradio Demo is available here.
- [2023.4.11] Add Gradio Demo (runs locally).
- [2023.4.9] Code released!
pip install -r requirements.txt
Installing xformers is highly recommended for improved efficiency and speed on GPUs.
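If your GPU and PyTorch build are compatible, xformers can typically be installed via pip (the exact version pinning depends on your CUDA/PyTorch setup):

pip install xformers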
[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from 🤗 Hugging Face (e.g., Stable Diffusion v1-4, v2-1). We use Stable Diffusion v1-4 by default.
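For reference, a minimal way to pull the v1-4 checkpoint with diffusers looks like the following; whether this project loads it exactly this way is an assumption:

from diffusers import StableDiffusionPipeline

# Downloads the public v1-4 checkpoint from Hugging Face on first use.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")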
Simply run:
accelerate launch test_vid2vid_zero.py --config path/to/config
For example:
accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
Launch the local demo built with gradio:
python app.py
Or you can use our online Gradio demo here.
Note that we disable Null-text Inversion and enable fp16 for faster demo response.
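As a reference, fp16 inference with diffusers is commonly enabled as shown below; this is an illustrative sketch, not necessarily how the demo is implemented:

import torch
from diffusers import StableDiffusionPipeline

# Half-precision weights roughly halve memory use and speed up sampling.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")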
@article{vid2vid-zero,
title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
journal={arXiv preprint arXiv:2303.17599},
year={2023}
}
This project builds upon Tune-A-Video, diffusers, and prompt-to-prompt.
We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns.
If you are interested in working with us on foundation models, visual perception, and multimodal learning, please contact Xinlong Wang (wangxinlong@baai.ac.cn) and Yue Cao (caoyue@baai.ac.cn).