This project implements the Video Frame Interpolation Transformer (VFIT), a state-of-the-art model for video frame interpolation (increasing a video's FPS), and compares its performance with earlier approaches such as Deep Voxel Flow and Super SloMo.
Here I implemented the VFIT and Super SloMo models as published in vift and slomo, respectively.
- Deep Voxel Flow
  - Combines optical flow with a CNN-based approach.
  - Struggles to handle complex motions.
- Super SloMo
  - Replaces explicit optical flow estimation with a U-Net-like flow computation and interpolation architecture.
  - Computationally expensive.
- Video Frame Interpolation Transformer
  - Uses Swin Transformer (Shifted Window Transformer) blocks to reduce attention time complexity from quadratic to linear in the number of tokens.
  - Much smaller than Super SloMo while still achieving better performance.
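Super SloMo's flow-interpolation idea can be sketched with the intermediate-flow approximation from its paper: the flows from an intermediate time `t` back to frames 0 and 1 are linear combinations of the bidirectional flows between the two input frames. The function name and array shapes below are illustrative assumptions, not code from this repository:

```python
import numpy as np

def intermediate_flows(f01, f10, t):
    """Approximate flows from intermediate time t to frames 0 and 1.

    f01, f10: bidirectional optical flows of shape (H, W, 2).
    Linear-motion approximation from the Super SloMo paper:
      F_t->0 = -(1 - t) * t   * F_0->1 + t**2        * F_1->0
      F_t->1 =  (1 - t)**2    * F_0->1 - t * (1 - t) * F_1->0
    """
    f_t0 = -(1 - t) * t * f01 + t ** 2 * f10
    f_t1 = (1 - t) ** 2 * f01 - t * (1 - t) * f10
    return f_t0, f_t1
```

The two input frames are then warped toward time `t` with these flows and blended, which is why the model needs no per-frame optical flow supervision.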
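The quadratic-to-linear claim for shifted-window attention can be illustrated with a simple interaction count: global self-attention lets every token attend to every other token (N² pairs), while windowed attention restricts attention to fixed-size windows, giving N × window pairs, i.e. linear in N for a fixed window size. This is a back-of-the-envelope sketch, not profiling code:

```python
def attention_cost(num_tokens, window=None):
    """Count pairwise token interactions in self-attention.

    Global attention: every token attends to every token -> N**2.
    Windowed (Swin-style): each of the N / window windows costs
    window**2, so the total is N * window, linear in N.
    """
    if window is None:
        return num_tokens ** 2
    assert num_tokens % window == 0, "tokens must tile into whole windows"
    return (num_tokens // window) * window ** 2
```

For a 64x64 feature map (4096 tokens), global attention needs ~16.8M interactions, while 8x8 windows (64 tokens each) need only ~262K.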
Model | Parameters (M) | PSNR (peak signal-to-noise ratio) | SSIM (structural similarity index) |
---|---|---|---|
Deep Voxel Flow | - | 27.6 | 0.92 |
Super SloMo | 38 | 31.4 | 0.94 |
VFIT | 7 | 35.1 | 0.96 |
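PSNR, the first metric in the table above, is defined as 10·log10(MAX² / MSE) between the interpolated frame and the ground-truth frame. A minimal NumPy sketch (assuming 8-bit images, so MAX = 255):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; SSIM instead compares local luminance, contrast, and structure, and is bounded by 1 for identical images.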
- Video Frame Interpolation Transformer
- Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
- Video Frame Synthesis using Deep Voxel Flow
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
This project is licensed under the MIT License. See the LICENSE.md file for details.