Baiyu Chen
- A label-efficient VOS network based on autoencoding.
- Achieves results comparable to fully-supervised VOS approaches using only the first-frame annotations of videos.
The above figure depicts the structure of STMAE, which consists of a key encoder and a mask autoencoder (mask encoder & decoder). The key encoder captures the spatiotemporal correspondences between the reference frames and the query frame and, according to these correspondences, aggregates a coarse mask for the query frame. The mask autoencoder is then responsible for reconstructing a clean prediction mask from the coarse one.
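To make the data flow concrete, here is a minimal sketch of this pipeline in PyTorch. The module and tensor names (`key_encoder`, `mask_encoder`, `mask_decoder`, the shapes, and the exact affinity computation) are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the STMAE forward pass (hypothetical module and tensor
# names; not the repository's actual API).
import torch
import torch.nn.functional as F

def stmae_forward(key_encoder, mask_encoder, mask_decoder,
                  ref_frames,   # (T, C, H, W) reference RGB frames
                  ref_masks,    # (T, 1, H, W) reference object masks
                  query_frame): # (C, H, W)    frame to segment
    # 1) Key encoder: embed references and query into a shared key space.
    ref_keys = torch.stack([key_encoder(f) for f in ref_frames])  # (T, Ck, h, w)
    query_key = key_encoder(query_frame)                          # (Ck, h, w)

    # 2) Affinity between every reference location and every query location.
    T, Ck, h, w = ref_keys.shape
    affinity = torch.einsum('tcn,cm->tnm',
                            ref_keys.flatten(2),                  # (T, Ck, h*w)
                            query_key.flatten(1))                 # (Ck, h*w)
    affinity = F.softmax(affinity.reshape(T * h * w, h * w), dim=0)

    # 3) Aggregate the reference masks into a coarse query mask.
    small_masks = F.interpolate(ref_masks, size=(h, w),
                                mode='bilinear', align_corners=False)
    coarse = (small_masks.reshape(1, T * h * w) @ affinity).reshape(1, 1, h, w)

    # 4) Mask autoencoder reconstructs a clean mask from the coarse one.
    return mask_decoder(mask_encoder(coarse, query_frame))
```

The key point is that the coarse query mask is obtained purely by correspondence-weighted aggregation of the reference masks, and the mask autoencoder only has to learn to turn that coarse estimate into a clean segmentation.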
The above figure illustrates the idea behind the One-shot Training strategy. A forward reconstruction pass is first performed to obtain predictions for the subsequent frames under the stop-gradient setting; then, with gradients enabled, a backward reconstruction pass rebuilds the first-frame mask from the predictions of the subsequent frames.
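As a rough illustration, a training step might look like the sketch below. The helper names are hypothetical (`propagate` stands in for the forward pass sketched above), and the actual loss, frame sampling, and reference management in the repository may differ.

```python
# Sketch of one one-shot training step (hypothetical helpers, illustrative only).
import torch

def one_shot_training_step(model, frames, first_frame_mask, criterion, optimizer):
    # Forward reconstruction: predict masks for the subsequent frames with
    # gradients stopped, using only the annotated first frame as reference.
    with torch.no_grad():
        pseudo_masks = []
        refs, ref_masks = [frames[0]], [first_frame_mask]
        for t in range(1, len(frames)):
            pred = model.propagate(refs, ref_masks, frames[t])
            pseudo_masks.append(pred)
            refs.append(frames[t])
            ref_masks.append(pred)

    # Backward reconstruction: with gradients enabled, rebuild the first-frame
    # mask from the (detached) predictions of the subsequent frames.
    recon_first = model.propagate(frames[1:], pseudo_masks, frames[0])
    loss = criterion(recon_first, first_frame_mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the first frame carries a ground-truth label, the supervision signal comes entirely from how well the forward-then-backward reconstruction cycle recovers that single annotated mask.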
Dataset | J&F | J | F | Label % | Train on |
---|---|---|---|---|---|
DAVIS 2016 val. | 87.3 | 87.2 | 87.5 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
DAVIS 2017 val. | 79.6 | 76.7 | 82.5 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
Dataset | Overall Score | J-Seen | F-Seen | J-Unseen | F-Unseen | Label % | Train on |
---|---|---|---|---|---|---|---|
YouTube-VOS 2018 val. | 71.8 | 75.7 | 79.6 | 62.2 | 69.7 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
To reproduce our results, you can either train the model following Training or evaluate our pretrained model following the instructions in Inference. Before you start, the experiment environment can be configured using `conda` and `pip`:
git clone https://github.com/Supgb/STMAE.git && cd STMAE
conda create -n stmae python=3.8
conda activate stmae
pip install -r requirements.txt
The datasets can then be downloaded with the provided script:
python -m scripts.download_datasets
If the datasets are already on your machine, you can use symbolic links (`ln -s`) to organize them into the following structure (an equivalent `pathlib`-based sketch is given after the tree):
├── STMAE
├── DAVIS
│   ├── 2016
│   │   ├── Annotations
│   │   └── ...
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── static
│   ├── BIG_small
│   └── ...
├── YouTube
│   ├── all_frames
│   │   └── valid_all_frames
│   ├── train
│   ├── train_480p
│   └── valid
└── YouTube2018
    ├── all_frames
    │   └── valid_all_frames
    └── valid
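For example, assuming your dataset copies live under hypothetical paths such as `/data/DAVIS`, the layout above can be created with `ln -s` or, equivalently, with a short `pathlib` snippet like this:

```python
# Link existing dataset folders into the layout above (source paths are
# placeholders for wherever your copies actually live).
from pathlib import Path

links = {
    "DAVIS": "/data/DAVIS",
    "static": "/data/static",
    "YouTube": "/data/YouTubeVOS2019",
    "YouTube2018": "/data/YouTubeVOS2018",
}

root = Path(".")  # the directory that contains the STMAE checkout
for name, src in links.items():
    target = root / name
    if not target.exists():
        target.symlink_to(Path(src).resolve(), target_is_directory=True)
```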
We provide code to visualize the training and evaluation on the Neptune platform; please refer to the official instructions for details. After you have your Neptune account, you will need an environment file `.env.train` under the root of STMAE for the training environment. It should contain the `NEPTUNE_PROJ_NAME` and `NEPTUNE_TOKEN` variables, whose values can be obtained from your Neptune account.

Create `.env.train` and put these lines in it:
NEPTUNE_PROJ_NAME=[your_account_name/your_project_name]
NEPTUNE_TOKEN=[your_project_token]
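For reference, the standalone snippet below shows how these two variables map onto the Neptune client, using `python-dotenv` to read the file; the repository's own logging code may load them differently.

```python
# Sanity-check the Neptune credentials stored in .env.train
# (uses python-dotenv and the neptune client; illustrative only).
import os
from dotenv import load_dotenv
import neptune

load_dotenv(".env.train")                      # reads NEPTUNE_PROJ_NAME / NEPTUNE_TOKEN
run = neptune.init_run(
    project=os.environ["NEPTUNE_PROJ_NAME"],   # e.g. your_account_name/your_project_name
    api_token=os.environ["NEPTUNE_TOKEN"],
)
run["sanity/check"] = "logging works"
run.stop()
```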
Our experiments are conducted with a batch size of 32 on 4 NVIDIA Tesla V100 GPUs. However, we have tested that a smaller batch size (with linear learning rate scaling, e.g. a learning rate of 1e-5 for a batch size of 16) or fewer GPUs can deliver similar performance. The following command can be used to train our model from scratch:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=[address:port] train.py --stage 3 --s3_batch_size 32 --s3_lr 2e-5 --s3_num_frames 8 --s3_num_ref_frames 3 --exp_id [identifier_for_exp] --val_epoch 5 --total_epoch 350
If you prefer to fine-tune a pretrained model, please use the flag `--load_network` followed by the path to the pretrained model.
To evaluate the trained model on DAVIS 2016/2017 or YouTube-VOS 2018/2019, the model should first run inference on the dataset. The quantitative results can then be obtained by following the corresponding evaluation instructions for each dataset, i.e., vos-benchmark for the DAVIS datasets or the evaluation servers for YouTube-VOS (2018 CodaLab & 2019 CodaLab).
Given a model, the inference command is as follows:
python eval.py --model [path-to-model] --output outputs/[d16/d17/y18/y19] --dataset [D16/D17/Y18/Y19]
TBD
We thank the PyTorch contributors and Ho Kei Cheng for releasing their implementations of XMem and STCN.