Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
in SIGGRAPH Asia 2023 Conference Proceedings
Project Page | Paper | Supplementary Video | Input Data and Video Results
Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.
Features:
- Temporal consistency: cross-frame constraints for low-level temporal consistency.
- Zero-shot: no training or fine-tuning required.
- Flexibility: compatible with off-the-shelf models (e.g., ControlNet, LoRA) for customized translation.
overview.mp4
- [09/2023] Code is released.
- [09/2023] Accepted to SIGGRAPH Asia 2023 Conference Proceedings!
- [06/2023] Integrated to 🤗 Hugging Face. Enjoy the web demo!
- [05/2023] This website is created.
- Integrate into Diffusers.
- Add Inference instructions in README.md.
- Add Examples to webUI.
- Add optional poisson fusion to the pipeline.
- Add Installation instructions for Windows
Please make sure your installation path only contains English letters or _
- Clone the repository. (Don't forget --recursive. Otherwise, please run git submodule update --init --recursive)
git clone git@github.com:williamyang1991/Rerender_A_Video.git --recursive
cd Rerender_A_Video
- If you have installed PyTorch with CUDA support, you can simply set up the environment with pip.
pip install -r requirements.txt
You can also create a new conda environment from scratch.
conda env create -f environment.yml
conda activate rerender
- Run the installation script. The required models will be downloaded to ./models.
python install.py
- You can run the demo with rerender.py:
python rerender.py --cfg config/real2sculpture.json
Installation on Windows
Before running steps 1-4 above, you need to prepare:
Installation Fails?
- In case building ebsynth fails, we provide our compiled ebsynth
- FileNotFoundError: [Errno 2] No such file or directory: 'xxxx.bin' or 'xxxx.jpg': make sure your path only contains English letters or _ (williamyang1991#18 (comment))
- KeyError: 'dataset': upgrade Gradio to the latest version (williamyang1991#14 (comment))
- Error when processing videos: manually install ffmpeg (williamyang1991#19 (comment), williamyang1991#29 (comment))
- ERR_ADDRESS_INVALID, cannot open the webUI in browser: replace 0.0.0.0 with 127.0.0.1 in webUI.py (williamyang1991#19 (comment))
- CUDA out of memory: (williamyang1991#23 (comment))
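Two of the failures listed above (non-English characters in the path, and a missing ffmpeg) can be checked before launching anything. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the ASCII test is only a proxy for the "English letters or _" rule.

```python
import shutil
from pathlib import Path


def non_ascii_parts(path: str) -> list[str]:
    """Return the path components that contain non-ASCII characters."""
    return [part for part in Path(path).parts if not part.isascii()]


def ffmpeg_available() -> bool:
    """True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    # Assumption: this is run from inside the cloned Rerender_A_Video folder.
    bad = non_ascii_parts(str(Path.cwd()))
    if bad:
        print("Non-ASCII path components:", bad)
    if not ffmpeg_available():
        print("ffmpeg not found on PATH; install it manually")
```

If both checks print nothing, the path and ffmpeg are unlikely to be the cause of the error.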
python webUI.py
The Gradio app also allows you to flexibly change the inference options. Just try it for more details. (For WebUI, you need to download revAnimated_v11 and realisticVisionV20_v20 to ./models/ after Installation)
Upload your video, input the prompt, select the seed, and hit:
- Run 1st Key Frame: only translate the first frame, so you can adjust the prompts/models/parameters to find your ideal output appearance before running the whole video.
- Run Key Frames: translate all the key frames based on the settings of the first frame, so you can adjust the temporal-related parameters for better temporal consistency before running the whole video.
- Run Propagation: propagate the key frames to other frames for full video translation
- Run All: Run 1st Key Frame, Run Key Frames and Run Propagation
We provide abundant advanced options to play with
Using customized models
- Using LoRA/Dreambooth/Finetuned/Mixed SD models
  - Modify sd_model_cfg.py to add paths to the saved SD models
- Using other controls from ControlNet (e.g., Depth, Pose)
  - Add more options like control_type = gr.Dropdown(['HED', 'canny', 'depth'] here https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L690
  - Add model loading options like elif control_type == 'depth': following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L88
  - Add model detectors like elif control_type == 'depth': following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L122
  - One example is given here
Advanced options for the 1st frame translation
- Resolution related (Frame resolution, left/top/right/bottom crop length): crop the frame and resize its short side to 512.
- ControlNet related:
  - ControlNet strength: how well the output matches the input control edges
  - Control type: HED edge or Canny edge
  - Canny low/high threshold: low values for more edge details
- SDEdit related:
  - Denoising strength: repaint degree (low value to make the output look more like the original video)
  - Preserve color: preserve the color of the original video
- SD related:
  - Steps: number of denoising steps
  - CFG scale: how well the output matches the prompt
  - Base model: base Stable Diffusion model (SD 1.5)
    - Stable Diffusion 1.5: official model
    - revAnimated_v11: a semi-realistic (2.5D) model
    - realisticVisionV20_v20: a photo-realistic model
  - Added prompt/Negative prompt: supplementary prompts
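To see what the resolution options do to a given video, the crop-then-resize arithmetic can be sketched as follows. This is an illustration only: the function name is hypothetical, and the actual pipeline may apply additional rounding to the final size that is not modeled here.

```python
def cropped_resized_size(width, height, left=0, top=0, right=0, bottom=0, short_side=512):
    """Frame size after cropping the borders and scaling the short side to `short_side`."""
    crop_w = width - left - right
    crop_h = height - top - bottom
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("crop lengths remove the whole frame")
    shorter = min(crop_w, crop_h)
    # Scale both sides by the same factor so the aspect ratio is kept.
    return round(crop_w * short_side / shorter), round(crop_h * short_side / shorter)
```

For a 540x960 portrait video without cropping, cropped_resized_size(540, 960) gives (512, 910).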
Advanced options for the key frame translation
- Key frame related
  - Key frame frequency (K): Uniformly sample the key frame every K frames. Use a small value for large or fast motions.
  - Number of key frames (M): The final output video will have K*M+1 frames with M+1 key frames.
- Temporal consistency related
  - Cross-frame attention:
    - Cross-frame attention start/end: When to apply cross-frame attention for global style consistency
    - Cross-frame attention update frequency (N): Update the reference style frame every N key frames. Should be large for long videos to avoid error accumulation.
  - Shape-aware fusion: Check to use this feature
    - Shape-aware fusion start/end: When to apply shape-aware fusion for local shape consistency
  - Pixel-aware fusion: Check to use this feature
    - Pixel-aware fusion start/end: When to apply pixel-aware fusion for pixel-level temporal consistency
    - Pixel-aware fusion strength: The strength to preserve the non-inpainting region. Small to avoid error accumulation. Large to avoid blurry textures.
    - Pixel-aware fusion detail level: The strength to sharpen the inpainting region. Small to avoid error accumulation. Large to avoid blurry textures.
    - Smooth fusion boundary: Check to smooth the inpainting boundary (avoid error accumulation).
  - Color-aware AdaIN: Check to use this feature
    - Color-aware AdaIN start/end: When to apply AdaIN to make the video color consistent with the first frame
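The relation between K, M and the output length (K*M+1 frames, M+1 key frames) can be checked with a short sketch. The function names are hypothetical, and the 0-based indexing is an assumption made only for illustration.

```python
def key_frame_plan(k: int, m: int) -> tuple[int, list[int]]:
    """Return (total frame count, 0-based key frame indices) for frequency k and count m."""
    if k < 1 or m < 1:
        raise ValueError("k and m must be positive")
    total = k * m + 1
    # One key frame every k frames, including the first and the last frame.
    keys = list(range(0, total, k))
    return total, keys


def max_key_frames(video_frames: int, k: int) -> int:
    """Largest M such that K*M+1 frames fit in a video with `video_frames` frames."""
    return (video_frames - 1) // k
```

With K=10 and M=10 this gives 101 frames and 11 key frames; a 250-frame video with K=10 allows at most M=24, covering 241 frames.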
Advanced options for the full video translation
- Gradient blending: apply Poisson Blending to reduce ghosting artifacts. May slow the process and increase flickering.
- Number of parallel processes: multiprocessing to speed up the process. Large value (8) is recommended.
We also provide a flexible script rerender.py to run our method.
Set the options via command line. For example,
python rerender.py --input videos/pexels-antoni-shkraba-8048492-540x960-25fps.mp4 --output result/man/man.mp4 --prompt "a handsome man in van gogh painting"
The script will run the full pipeline. A work directory will be created at result/man and the result video will be saved as result/man/man.mp4
Set the options via a config file. For example,
python rerender.py --cfg config/van_gogh_man.json
The script will run the full pipeline.
We provide some examples of the config in the config directory.
Most options in the config are the same as those in WebUI.
Please check the explanations in the WebUI section.
Specify customized models by setting sd_model in the config. For example:
{
"sd_model": "models/realisticVisionV20_v20.safetensors"
}
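To try another base model without editing an existing config by hand, the config can be copied with only sd_model changed. This is a minimal sketch: the helper name is hypothetical, and it assumes the config file is a single JSON object with sd_model as a top-level key, as in the fragment above.

```python
import json
from pathlib import Path


def write_config_with_model(src: str, dst: str, sd_model: str) -> dict:
    """Copy a JSON config to `dst`, overriding only its "sd_model" entry."""
    cfg = json.loads(Path(src).read_text(encoding="utf-8"))
    cfg["sd_model"] = sd_model  # every other option is left untouched
    Path(dst).write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg
```

For example, write_config_with_model("config/van_gogh_man.json", "config/my_variant.json", "models/realisticVisionV20_v20.safetensors") would write a new config (my_variant.json is a made-up name) that can then be passed to --cfg.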
Similar to WebUI, we provide a three-step workflow: rerender the first key frame, then rerender the full key frames, and finally rerender the full video with propagation. To run only a single step, specify options -one, -nb and -nr:
- Rerender the first key frame
python rerender.py --cfg config/van_gogh_man.json -one -nb
- Rerender the full key frames
python rerender.py --cfg config/van_gogh_man.json -nb
- Rerender the full video with propagation
python rerender.py --cfg config/van_gogh_man.json -nr
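The three commands above can also be driven from Python. The sketch below only wraps the documented flag combinations in subprocess calls; the function and dictionary names are hypothetical, and it assumes it is run from the repository root so that rerender.py is found.

```python
import subprocess
import sys

# Flag combinations for each stage, copied from the commands above.
STEP_FLAGS = {
    "first_key_frame": ["-one", "-nb"],
    "key_frames": ["-nb"],
    "propagation": ["-nr"],
}


def step_command(cfg_path: str, step: str) -> list[str]:
    """Build the argument list for one stage of rerender.py."""
    return [sys.executable, "rerender.py", "--cfg", cfg_path, *STEP_FLAGS[step]]


def run_steps(cfg_path: str, steps=("first_key_frame", "key_frames", "propagation")) -> None:
    """Run the stages in order, stopping at the first one that fails."""
    for step in steps:
        subprocess.run(step_command(cfg_path, step), check=True)
```

In practice one would likely call run_steps with a single stage at a time, for example run_steps("config/van_gogh_man.json", steps=("first_key_frame",)), and inspect the output before moving on.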
We provide a separate Ebsynth python script video_blend.py with the temporal blending algorithm introduced in Stylizing Video by Example for interpolating style between key frames.
It can work on your own stylized key frames independently of our Rerender algorithm.
Usage:
video_blend.py [-h] [--output OUTPUT] [--fps FPS] [--beg BEG] [--end END] [--itv ITV] [--key KEY]
[--n_proc N_PROC] [-ps] [-ne] [-tmp]
name
positional arguments:
name Path to input video
optional arguments:
-h, --help show this help message and exit
--output OUTPUT Path to output video
--fps FPS The FPS of output video
--beg BEG The index of the first frame to be stylized
--end END The index of the last frame to be stylized
--itv ITV The interval of key frame
--key KEY The subfolder name of stylized key frames
--n_proc N_PROC The max process count
-ps Use poisson gradient blending
-ne Do not run ebsynth (use previous ebsynth output)
-tmp Keep temporary output
For example, to run Ebsynth on video man.mp4,
- Put the stylized key frames to videos/man/keys for every 10 frames (named as 0001.png, 0011.png, ...)
- Put the original video frames in videos/man/video (named as 0001.png, 0002.png, ...).
- Run Ebsynth on the first 101 frames of the video with poisson gradient blending and save the result to videos/man/blend.mp4 under FPS 25 with the following command:
python video_blend.py videos/man \
--beg 1 \
--end 101 \
--itv 10 \
--key keys \
--output videos/man/blend.mp4 \
--fps 25.0 \
-ps
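Before running the command, it can help to confirm that every expected key frame is in place. The following sketch lists the file names implied by --beg, --end and --itv. The helper names are hypothetical; the 4-digit zero-padded naming follows the example above, and treating the last frame as a key frame is an assumption.

```python
from pathlib import Path


def expected_key_names(beg: int, end: int, itv: int) -> list[str]:
    """Key frame file names for frames beg, beg+itv, ... up to end (inclusive)."""
    return [f"{i:04d}.png" for i in range(beg, end + 1, itv)]


def missing_keys(video_dir: str, key: str, beg: int, end: int, itv: int) -> list[str]:
    """Expected key frame names that are absent from video_dir/key."""
    key_dir = Path(video_dir) / key
    return [n for n in expected_key_names(beg, end, itv) if not (key_dir / n).is_file()]
```

For the example above, missing_keys("videos/man", "keys", 1, 101, 10) should return an empty list once all 11 key frames (0001.png through 0101.png) are in videos/man/keys.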
white ancient Greek sculpture, Venus de Milo, light pink and blue background | a handsome Greek man | a traditional mountain in chinese ink wash painting | a cartoon tiger |
a swan in chinese ink wash painting, monochrome | a beautiful woman in CG style | a clean simple white jade sculpture | a fluorescent jellyfish in the deep dark blue sea |
Text-guided virtual character generation.
more_result_1.mp4
more_result_2.mp4
Video stylization and video editing.
more_result_3.mp4
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yang2023rerender,
title = {Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation},
author = {Yang, Shuai and Zhou, Yifan and Liu, Ziwei and Loy, Chen Change},
booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
year = {2023},
}
The code is mainly developed based on ControlNet, Stable Diffusion, GMFlow and Ebsynth.