This repository contains a non-exhaustive collection of vision transformer models that I have implemented in TensorFlow. Not to be confused with the models from the original Vision Transformer (ViT) paper [1]: the architectures here are collectively referred to as vision transformers because they each use Transformers in some way for the vision modality. A minimal sketch of this shared pattern follows the list below.
- ViT
- DeiT
- Swin
- CaiT
- MobileViT
- CCT
- ViT MAE (with Aritra Roy Gosthipaty)
- ViT MSN
- ViT data2vec
- SegFormer
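
To make the shared pattern concrete, here is a minimal, illustrative TensorFlow/Keras sketch of the patchify-and-attend idea behind ViT-style models: split an image into patches, project each patch into a token, add learned position embeddings, and run the token sequence through a Transformer encoder block. All hyperparameters and names below (`PATCH_SIZE`, `PatchEncoder`, `transformer_block`, etc.) are placeholders for illustration and do not correspond to any specific implementation in this repository.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters only; each model in this repository
# defines its own configuration.
IMAGE_SIZE = 224
PATCH_SIZE = 16
PROJECTION_DIM = 64
NUM_HEADS = 4
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 14 x 14 = 196 patch tokens


class PatchEncoder(layers.Layer):
    """Adds a learned position embedding to each patch token."""

    def __init__(self, num_patches, projection_dim, **kwargs):
        super().__init__(**kwargs)
        self.positions = tf.range(start=0, limit=num_patches, delta=1)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, tokens):
        return tokens + self.position_embedding(self.positions)


def transformer_block(tokens):
    """A single pre-norm Transformer encoder block over patch tokens."""
    # Multi-head self-attention with a residual connection.
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)
    x = layers.MultiHeadAttention(
        num_heads=NUM_HEADS, key_dim=PROJECTION_DIM // NUM_HEADS
    )(x, x)
    x = layers.Add()([tokens, x])

    # Position-wise MLP with a residual connection.
    y = layers.LayerNormalization(epsilon=1e-6)(x)
    y = layers.Dense(PROJECTION_DIM * 4, activation="gelu")(y)
    y = layers.Dense(PROJECTION_DIM)(y)
    return layers.Add()([x, y])


# Patchify the image with a strided convolution, flatten to a token sequence,
# add position embeddings, and apply one encoder block.
images = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
patches = layers.Conv2D(
    PROJECTION_DIM, kernel_size=PATCH_SIZE, strides=PATCH_SIZE
)(images)
tokens = layers.Reshape((NUM_PATCHES, PROJECTION_DIM))(patches)
tokens = PatchEncoder(NUM_PATCHES, PROJECTION_DIM)(tokens)
outputs = transformer_block(tokens)

model = keras.Model(images, outputs)
model.summary()
```

The models listed above differ in how they build on this skeleton, for example distillation tokens in DeiT, shifted window attention in Swin, convolutions in MobileViT and CCT, masked pretraining objectives in ViT MAE / MSN / data2vec, and a hierarchical encoder for segmentation in SegFormer.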
[1] Dosovitskiy, Alexey, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv, 3 June 2021. arXiv.org, https://doi.org/10.48550/arXiv.2010.11929.