Our semantic segmentation code is developed on top of MMSegmentation v0.12.0.
For more details, please refer to our paper [CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention](https://openreview.net/forum?id=_PHymLIxuI).
- Libraries (Python 3.6-based)

```bash
pip3 install mmcv-full==1.2.7 mmsegmentation==0.12.0
```
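
If the installation succeeded, both packages should report exactly these versions; a minimal sanity check (not part of the original setup steps):

```python
import mmcv
import mmseg

# The code in this repo targets these exact versions; other combinations may
# fail with import or registry errors.
print(mmcv.__version__)   # expected: 1.2.7
print(mmseg.__version__)  # expected: 0.12.0
```
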
- Prepare the ADE20K dataset according to the guidelines in MMSegmentation v0.12.0.
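
MMSegmentation expects ADE20K in its standard layout (`ADEChallengeData2016` with `images/` and `annotations/` split into training and validation). A quick check, assuming that layout under a `data/ade` root (adjust the path to your setup):

```python
import os

# Assumed standard MMSegmentation ADE20K layout; data_root is a placeholder.
data_root = "data/ade/ADEChallengeData2016"
for split in ("images/training", "images/validation",
              "annotations/training", "annotations/validation"):
    path = os.path.join(data_root, split)
    print(path, "OK" if os.path.isdir(path) else "MISSING")
```
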
- Prepare the pretrained CrossFormer models

```python
import torch

ckpt = torch.load("crossformer-s.pth")  # load the classification checkpoint
torch.save(ckpt["model"], "backbone-crossformer-s.pth")  # only the model weights are needed
```
- Modify `data_root` in `configs/_base_/datasets/ade20k.py` and `configs/_base_/datasets/ade20k_swin.py` to your path to the ADE20K dataset.
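
For reference, the line to edit in those dataset configs looks roughly like this (the path below is a placeholder, not the repository default):

```python
# configs/_base_/datasets/ade20k.py (excerpt, illustrative)
dataset_type = 'ADE20KDataset'
data_root = '/your/path/to/ADEChallengeData2016'  # point this at your ADE20K copy
```
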
- Training

```bash
## Use a config from the results tables below as <CONFIG_FILE>
./dist_train.sh <CONFIG_FILE> <GPUS> <PRETRAIN_MODEL>

## e.g. train the fpn_crossformer_b model with 8 GPUs
./dist_train.sh configs/fpn_crossformer_b_ade20k_40k.py 8 path/to/backbone-crossformer-s.pth
```
- Inference

```bash
./dist_test.sh <CONFIG_FILE> <GPUS> <SEG_CHECKPOINT_FILE>

## e.g. evaluate a semantic segmentation model by mIoU
./dist_test.sh configs/fpn_crossformer_b_ade20k_40k.py 8 path/to/ckpt
```
Notes: We use single-scale testing by default. You can enable multi-scale testing or flip testing manually by following the instructions in `configs/_base_/datasets/ade20k[_swin].py`.
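
In MMSegmentation v0.12.0, multi-scale and flip testing are controlled by the `MultiScaleFlipAug` block of the test pipeline. A sketch of the relevant fragment (the scales shown are illustrative; follow the comments in the config itself):

```python
# Excerpt of a test pipeline with multi-scale and flip testing enabled (illustrative).
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        img_ratios=[0.75, 1.0, 1.25],  # several ratios -> multi-scale testing
        flip=True,                     # enable flip testing
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375],
                 to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```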

Semantic FPN (80K iterations):

Backbone | Iterations | Params | FLOPs | IoU | config | Models |
---|---|---|---|---|---|---|
PVT-M | 80K | 48.0M | 219.0G | 41.6 | - | - |
CrossFormer-S | 80K | 34.3M | 209.8G | 46.4 | config | Google Drive/BaiduCloud, key: sn5h |
PVT-L | 80K | 65.1M | 283.0G | 42.1 | - | - |
Swin-S | 80K | 53.2M | 274.0G | 45.2 | - | - |
CrossFormer-B | 80K | 55.6M | 320.1G | 48.0 | config | Google Drive/BaiduCloud, key: joi5 |
CrossFormer-L | 80K | 95.4M | 482.7G | 49.1 | config | Google Drive/BaiduCloud, key: 6v5d |

UperNet (160K iterations):

Backbone | Iterations | Params | FLOPs | IoU | MS IoU | config | Models |
---|---|---|---|---|---|---|---|
ResNet-101 | 160K | 86.0M | 1029.0G | 44.9 | - | - | - |
Swin-T | 160K | 60.0M | 945.0G | 44.5 | 45.8 | - | - |
CrossFormer-S | 160K | 62.3M | 979.5G | 47.6 | 48.4 | config | Google Drive/BaiduCloud, key: wesb |
Swin-S | 160K | 81.0M | 1038.0G | 47.6 | 49.5 | - | - |
CrossFormer-B | 160K | 83.6M | 1089.7G | 49.7 | 50.6 | config | Google Drive/BaiduCloud, key: j061 |
Swin-B | 160K | 121.0M | 1088.0G | 48.1 | 49.7 | - | - |
CrossFormer-L | 160K | 125.5M | 1257.8G | 50.4 | 51.4 | config | Google Drive/BaiduCloud, key: 17ks |
Notes:
- MS IoU means IoU with multi-scale testing.
- Models are trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.
- For Semantic FPN, models are trained for 80K iterations with batch size 16. For UperNet, models are trained for 160K iterations.
- More detailed training settings can be found in corresponding configs.
- More results can be seen in our paper.
Use `get_flops.py` to calculate the FLOPs and #parameters of a specified model.

```bash
python get_flops.py <CONFIG_FILE> --shape <height> <width>

## e.g. get FLOPs and #params of fpn_crossformer_b with input image size [1024, 1024]
python get_flops.py configs/fpn_crossformer_b_ade20k_40k.py --shape 1024 1024
```
Notes: The default input image size is [1024, 1024]. To calculate with a different input image size, change `<height> <width>` in the above command and change `img_size` in `crossformer_factory.py` accordingly at the same time.
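
As a rough cross-check of the reported parameter counts, you can also sum the tensor sizes in the converted backbone weights; note this counts the backbone only, so it will be smaller than the Params column, which includes the segmentation head:

```python
import torch

# Count parameters in the converted backbone checkpoint (backbone only).
state_dict = torch.load("backbone-crossformer-s.pth", map_location="cpu")
num_params = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f"{num_params / 1e6:.1f}M backbone parameters")
```
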
```bibtex
@inproceedings{wang2021crossformer,
  title = {CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention},
  author = {Wang, Wenxiao and Yao, Lu and Chen, Long and Lin, Binbin and Cai, Deng and He, Xiaofei and Liu, Wei},
  booktitle = {International Conference on Learning Representations, {ICLR}},
  url = {https://openreview.net/forum?id=_PHymLIxuI},
  year = {2022}
}
```