This project implements the Video Frame Interpolation Transformer (VFIT), a state-of-the-art model for video frame interpolation (increasing a video's FPS), and compares its performance with earlier approaches such as Deep Voxel Flow and Super SloMo.
Here I implemented the VFIT and Super SloMo models as published in vift and slomo, respectively.
- Deep Voxel Flow
  - Combines optical flow with a CNN-based approach.
  - Struggles to handle complex motions.
- Super SloMo
  - Replaces explicit optical flow estimation with a U-Net-like flow computation and interpolation architecture.
  - Computationally expensive.
- Video Frame Interpolation Transformer
  - Uses Swin Transformer (Shifted Window Transformer) blocks to reduce attention time complexity from quadratic to linear in the number of tokens.
  - Much smaller than Super SloMo while still achieving better performance.
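Super SloMo's flow-interpolation idea can be sketched with the intermediate-flow approximation from its paper: the flows from an intermediate time `t` back to frames 0 and 1 are linear combinations of the bidirectional flows between the two input frames. The function name and array shapes below are illustrative assumptions, not code from this repository:

```python
import numpy as np

def intermediate_flows(f01, f10, t):
    """Approximate flows from intermediate time t to frames 0 and 1.

    f01, f10: bidirectional optical flows of shape (H, W, 2).
    Linear-motion approximation from the Super SloMo paper:
      F_t->0 = -(1 - t) * t   * F_0->1 + t**2        * F_1->0
      F_t->1 =  (1 - t)**2    * F_0->1 - t * (1 - t) * F_1->0
    """
    f_t0 = -(1 - t) * t * f01 + t ** 2 * f10
    f_t1 = (1 - t) ** 2 * f01 - t * (1 - t) * f10
    return f_t0, f_t1
```

The two input frames are then warped toward time `t` with these flows and blended, which is why the model needs no per-frame optical flow supervision.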
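The quadratic-to-linear claim for shifted-window attention can be illustrated with a simple interaction count: global self-attention lets every token attend to every other token (N² pairs), while windowed attention restricts attention to fixed-size windows, giving N × window pairs, i.e. linear in N for a fixed window size. This is a back-of-the-envelope sketch, not profiling code:

```python
def attention_cost(num_tokens, window=None):
    """Count pairwise token interactions in self-attention.

    Global attention: every token attends to every token -> N**2.
    Windowed (Swin-style): each of the N / window windows costs
    window**2, so the total is N * window, linear in N.
    """
    if window is None:
        return num_tokens ** 2
    assert num_tokens % window == 0, "tokens must tile into whole windows"
    return (num_tokens // window) * window ** 2
```

For a 64x64 feature map (4096 tokens), global attention needs ~16.8M interactions, while 8x8 windows (64 tokens each) need only ~262K.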
Model | Parameters (M) | PSNR (peak signal-to-noise ratio) | SSIM (structural similarity index) |
---|---|---|---|
Deep Voxel Flow | - | 27.6 | 0.92 |
Super SloMo | 38 | 31.4 | 0.94 |
VFIT | 7 | 35.1 | 0.96 |
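PSNR, the first metric in the table above, is defined as 10·log10(MAX² / MSE) between the interpolated frame and the ground-truth frame. A minimal NumPy sketch (assuming 8-bit images, so MAX = 255):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; SSIM instead compares local luminance, contrast, and structure, and is bounded by 1 for identical images.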
- Video Frame Interpolation Transformer
- Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
- Video Frame Synthesis using Deep Voxel Flow
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
This project is licensed under the MIT License. See the LICENSE.md file for details.