Baiyu Chen
- A label-efficient VOS network based on autoencoding.
- Achieves results comparable to fully-supervised VOS approaches using only the first-frame annotations of videos.
The above figure depicts the structure of STMAE, which consists of a key encoder and a mask autoencoder (mask encoder & decoder). The key encoder captures the spatiotemporal correspondences between the reference frames and the query frame and, according to these correspondences, aggregates a coarse mask for the query frame. The mask autoencoder is then responsible for reconstructing a clean prediction mask from the coarse one.
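To make the data flow concrete, here is a minimal sketch of this pipeline in PyTorch. The module and tensor names (`key_encoder`, `mask_encoder`, `mask_decoder`, the shapes, and the exact affinity computation) are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the STMAE forward pass (hypothetical module and tensor
# names; not the repository's actual API).
import torch
import torch.nn.functional as F

def stmae_forward(key_encoder, mask_encoder, mask_decoder,
                  ref_frames,   # (T, C, H, W) reference RGB frames
                  ref_masks,    # (T, 1, H, W) reference object masks
                  query_frame): # (C, H, W)    frame to segment
    # 1) Key encoder: embed references and query into a shared key space.
    ref_keys = torch.stack([key_encoder(f) for f in ref_frames])  # (T, Ck, h, w)
    query_key = key_encoder(query_frame)                          # (Ck, h, w)

    # 2) Affinity between every reference location and every query location.
    T, Ck, h, w = ref_keys.shape
    affinity = torch.einsum('tcn,cm->tnm',
                            ref_keys.flatten(2),                  # (T, Ck, h*w)
                            query_key.flatten(1))                 # (Ck, h*w)
    affinity = F.softmax(affinity.reshape(T * h * w, h * w), dim=0)

    # 3) Aggregate the reference masks into a coarse query mask.
    small_masks = F.interpolate(ref_masks, size=(h, w),
                                mode='bilinear', align_corners=False)
    coarse = (small_masks.reshape(1, T * h * w) @ affinity).reshape(1, 1, h, w)

    # 4) Mask autoencoder reconstructs a clean mask from the coarse one.
    return mask_decoder(mask_encoder(coarse, query_frame))
```

The key point is that the coarse query mask is obtained purely by correspondence-weighted aggregation of the reference masks, and the mask autoencoder only has to learn to turn that coarse estimate into a clean segmentation.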
The above figure illustrates the idea behind the One-shot Training strategy. A forward reconstruction pass is first performed to obtain predictions for the subsequent frames under the stop-gradient setting; then, with gradients enabled, a backward reconstruction pass rebuilds the first-frame mask from the predictions of the subsequent frames.
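As a rough illustration, a training step might look like the sketch below. The helper names are hypothetical (`propagate` stands in for the forward pass sketched above), and the actual loss, frame sampling, and reference management in the repository may differ.

```python
# Sketch of one one-shot training step (hypothetical helpers, illustrative only).
import torch

def one_shot_training_step(model, frames, first_frame_mask, criterion, optimizer):
    # Forward reconstruction: predict masks for the subsequent frames with
    # gradients stopped, using only the annotated first frame as reference.
    with torch.no_grad():
        pseudo_masks = []
        refs, ref_masks = [frames[0]], [first_frame_mask]
        for t in range(1, len(frames)):
            pred = model.propagate(refs, ref_masks, frames[t])
            pseudo_masks.append(pred)
            refs.append(frames[t])
            ref_masks.append(pred)

    # Backward reconstruction: with gradients enabled, rebuild the first-frame
    # mask from the (detached) predictions of the subsequent frames.
    recon_first = model.propagate(frames[1:], pseudo_masks, frames[0])
    loss = criterion(recon_first, first_frame_mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the first frame carries a ground-truth label, the supervision signal comes entirely from how well the forward-then-backward reconstruction cycle recovers that single annotated mask.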
Dataset | J&F | J | F | Label % | Train on |
---|---|---|---|---|---|
DAVIS 2016 val. | 87.3 | 87.2 | 87.5 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
DAVIS 2017 val. | 79.6 | 76.7 | 82.5 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
Dataset | Overall Score | J-Seen | F-Seen | J-Unseen | F-Unseen | Label % | Train on |
---|---|---|---|---|---|---|---|
YouTube-VOS 2018 val. | 71.8 | 75.7 | 79.6 | 62.2 | 69.7 | 3.5 | DAVIS 2017 + YouTube-VOS 2018 |
To reproduce our results, you can either train the model following Training or evaluate our pretrained model following the instructions in Inference. Before you start, the experiment environment can be configured using `conda` and `pip`:
git clone https://github.com/Supgb/STMAE.git && cd STMAE
conda create -n stmae python=3.8
conda activate stmae
pip install -r requirements.txt
The datasets can then be downloaded with the provided script:
python -m scripts.download_datasets
If the datasets are already on your machine, you can use symbolic links (`ln -s`) to organize them into the following structure (an equivalent `pathlib`-based sketch is given after the tree):
├── STMAE
├── DAVIS
│   ├── 2016
│   │   ├── Annotations
│   │   └── ...
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── static
│   ├── BIG_small
│   └── ...
├── YouTube
│   ├── all_frames
│   │   └── valid_all_frames
│   ├── train
│   ├── train_480p
│   └── valid
└── YouTube2018
    ├── all_frames
    │   └── valid_all_frames
    └── valid
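For example, assuming your dataset copies live under hypothetical paths such as `/data/DAVIS`, the layout above can be created with `ln -s` or, equivalently, with a short `pathlib` snippet like this:

```python
# Link existing dataset folders into the layout above (source paths are
# placeholders for wherever your copies actually live).
from pathlib import Path

links = {
    "DAVIS": "/data/DAVIS",
    "static": "/data/static",
    "YouTube": "/data/YouTubeVOS2019",
    "YouTube2018": "/data/YouTubeVOS2018",
}

root = Path(".")  # the directory that contains the STMAE checkout
for name, src in links.items():
    target = root / name
    if not target.exists():
        target.symlink_to(Path(src).resolve(), target_is_directory=True)
```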
We provide code to visualize the training and evaluation on the Neptune platform; please refer to the official instructions for details. After you have your Neptune account, you will need an environment file `.env.train` under the root of STMAE for the training environment. It should contain the `NEPTUNE_PROJ_NAME` and `NEPTUNE_TOKEN` variables, whose values can be obtained from your Neptune account.

Create `.env.train` and put these lines in it:
NEPTUNE_PROJ_NAME=[your_account_name/your_project_name]
NEPTUNE_TOKEN=[your_project_token]
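For reference, the standalone snippet below shows how these two variables map onto the Neptune client, using `python-dotenv` to read the file; the repository's own logging code may load them differently.

```python
# Sanity-check the Neptune credentials stored in .env.train
# (uses python-dotenv and the neptune client; illustrative only).
import os
from dotenv import load_dotenv
import neptune

load_dotenv(".env.train")                      # reads NEPTUNE_PROJ_NAME / NEPTUNE_TOKEN
run = neptune.init_run(
    project=os.environ["NEPTUNE_PROJ_NAME"],   # e.g. your_account_name/your_project_name
    api_token=os.environ["NEPTUNE_TOKEN"],
)
run["sanity/check"] = "logging works"
run.stop()
```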
Our experiments are conducted with a batch size of 32 on 4 NVIDIA Tesla V100 GPUs. However, we have tested that a smaller batch size (with linear learning rate scaling, e.g. a learning rate of 1e-5 for a batch size of 16) or fewer GPUs can deliver similar performance. The following command can be used to train our model from scratch:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=[address:port] train.py --stage 3 --s3_batch_size 32 --s3_lr 2e-5 --s3_num_frames 8 --s3_num_ref_frames 3 --exp_id [identifier_for_exp] --val_epoch 5 --total_epoch 350
If you prefer to fine-tune a pretrained model, please use the flag `--load_network` followed by the path to the pretrained model.
To evaluate the trained model on DAVIS 2016/2017 or YouTube-VOS 2018/2019, the model should first run inference on the dataset. The quantitative results can then be obtained by following the corresponding evaluation instructions for each dataset, i.e., vos-benchmark for the DAVIS datasets or the evaluation servers for YouTube-VOS (2018 CodaLab & 2019 CodaLab).
Given a model, the inference command is as follows:
python eval.py --model [path-to-model] --output outputs/[d16/d17/y18/y19] --dataset [D16/D17/Y18/Y19]
TBD
We thank the PyTorch contributors and Ho Kei Cheng for releasing their implementations of XMem and STCN.