Spatial-Temporal Transformer

I. Vector-Quantized Autoencoders

  • Vector-quantized autoencoders are used to represent images in a compressed latent space (a minimal quantization sketch follows this list).
  • For high-resolution outputs, VQGANs (a VQ-VAE trained with a discriminator) are a much better choice.
  • The generative transformer model works on this compressed latent space of the images.
  • Working in the compressed latent space rather than the high-dimensional original input space reduces the computational load and memory requirements.
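A minimal sketch of the quantization step, assuming a PyTorch implementation with a learned codebook (the module name, codebook size, and tensor layout are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (illustrative sketch)."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):
        # z_e: (B, H, W, D) continuous encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])                # (B*H*W, D)
        dists = torch.cdist(flat, self.codebook.weight)      # distance to every code
        idx = dists.argmin(dim=1)                            # discrete token per position
        z_q = self.codebook(idx).view_as(z_e)                # quantized latents
        z_q = z_e + (z_q - z_e).detach()                     # straight-through gradient
        return z_q, idx.view(z_e.shape[:-1])
```

The transformer then models the grid of discrete indices (or the quantized latents) instead of raw pixels.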

II. Transformer

1. Spatial Attention

  • Applies attention across all spatial elements (pixels) within each temporal unit.
  • Captures spatial relationships within each temporal snapshot, enhancing the model's understanding of spatial dependencies (see the sketch below).
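One way to express this in PyTorch is to fold the temporal axis into the batch so that attention only mixes spatial tokens within a single temporal unit (the (B, T, N, D) layout and the sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

def spatial_attention(x, mha):
    # x: (B, T, N, D) -- B videos, T temporal units, N spatial tokens per unit, D channels
    B, T, N, D = x.shape
    x = x.reshape(B * T, N, D)           # each temporal unit attends over its own tokens
    out, _ = mha(x, x, x)                # self-attention within one frame
    return out.reshape(B, T, N, D)

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
y = spatial_attention(torch.randn(2, 15, 16 * 16, 64), mha)   # (2, 15, 256, 64)
```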

2. Spatial-Temporal Attention

  • Executes attention across temporal units for each pixel independently.
  • Captures temporal relationships between spatial elements (pixels), enabling the model to learn dynamics over time (see the sketch below).
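The complementary step can be sketched the same way, this time folding the spatial axis into the batch so each pixel position attends over the T temporal units (reusing the (B, T, N, D) layout and the `mha` module assumed in the previous sketch):

```python
def temporal_attention(x, mha):
    # x: (B, T, N, D) -- attention runs over T separately for every spatial position
    B, T, N, D = x.shape
    x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # one length-T sequence per pixel
    out, _ = mha(x, x, x)                            # a causal mask over T could be added for autoregressive use
    return out.reshape(B, N, T, D).permute(0, 2, 1, 3)
```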

3. Gaussian Parameterization

  • A convolutional layer generates the parameters (mean and variance) of a normal distribution from the outputs of the attention layers.
  • Samples are drawn from the generated distribution.
  • The sampled output is then passed to the subsequent transformer layer.
  • This layer introduces stochasticity into the predictions, allowing the model to explore diverse outputs during the autoregressive process (see the sketch below).
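A reparameterization-style sketch of this layer, assuming a PyTorch convolution that emits mean and log-variance (names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts a per-position normal distribution and draws a sample from it."""
    def __init__(self, channels):
        super().__init__()
        self.to_params = nn.Conv2d(channels, 2 * channels, kernel_size=1)

    def forward(self, h):
        # h: (B, C, H, W) output of the attention layers
        mean, log_var = self.to_params(h).chunk(2, dim=1)
        std = torch.exp(0.5 * log_var)
        sample = mean + torch.randn_like(std) * std   # stochastic output for the next layer
        return sample
```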

Video Generative Models - Application

  • Video generation models are one application of the Spatial-Temporal Transformer: every frame of the video serves as a temporal unit, and each pixel in the frame (or image) serves as a spatial unit.

  • The demo notebook and trained weights are available here and here.



  • The left video is the input prompt video, and the right video is the input prompt video plus the generated frames. In this example, the input prompt video has 15 frames and the model predicts the next 5 frames, so the right video has 20 frames, of which the last 5 are generated (see the rollout sketch after this list).

  • Note: Each frame in the above video is repeated 10 times to prolong the playback. The actual frame counts are 15 and 20 for the left and right videos, respectively.
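A hypothetical rollout loop for the example above, assuming the model maps a sequence of per-frame latent codes to the codes of the next frame (the interface shown is an assumption, not the repository's API):

```python
import torch

@torch.no_grad()
def generate(model, prompt_codes, n_future=5):
    # prompt_codes: (B, T_prompt, N) discrete VQ indices for the 15 prompt frames
    codes = prompt_codes
    for _ in range(n_future):
        next_frame = model(codes)[:, -1:]             # codes predicted for the next frame
        codes = torch.cat([codes, next_frame], dim=1)
    return codes   # (B, T_prompt + n_future, N); decode with the VQ decoder to get frames
```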

Future plans (more may be added)

  • Application of the Spatial-Temporal Transformer - Video Generation
  • Use VQ autoencoders trained with discriminators (VQGAN) for high-resolution outputs.