Mask-YOLO: A Multi-task Learning Architecture for Object Detection and Instance Segmentation

1. Architecture and Results

This work combines the one-stage detection pipeline of YOLOv2 with the two-branch architecture idea from Mask R-CNN. Due to hardware limitations, I only implemented it on a small CNN backbone (MobileNet) with depthwise separable blocks, though it could also be implemented with a deeper network, e.g. ResNet-50 or ResNet-101 with FPN (Feature Pyramid Network).
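To give a rough sense of why a MobileNet-style backbone suits limited hardware, the sketch below compares the weight count of a standard convolution with that of a depthwise separable one. The layer sizes (3x3 kernel, 256 input and output channels) are illustrative assumptions, not values taken from this repository, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, depthwise separable: {sep}")
```

For these sizes the separable layer needs 67,840 weights against 589,824 for the standard one, which is the kind of saving that makes a small backbone practical on modest hardware.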
The overall architecture can be visualized like this:

Training results on Shapes dataset:

Training results on Rice and Generic Food:

2. How to use it

myolo - the main implementation of Mask-YOLO. model.py contains the model instantiation.

example - three training examples with inference. The Shapes dataset is randomly generated by dataset_shapes.py. Rice and Food are small datasets I hand-annotated with VGG Image Annotator (VIA); they can be downloaded from https://drive.google.com/file/d/1druK4Kgx5AhfchClU2aq5kf7UVoDtkvu/view.
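For readers who want to inspect VIA annotations before training, here is a minimal sketch of pulling polygon outlines out of a VIA-style JSON export. It assumes the regions were drawn as polygons and stored under "shape_attributes" with "all_points_x" and "all_points_y", which is how VIA commonly exports them; the exact layout of the Rice and Food annotation files is not described here, so treat the field names as assumptions. The helper name via_polygons and the sample file name are hypothetical.

```python
def via_polygons(via_export):
    """Return {filename: [polygon, ...]}, each polygon a list of (x, y) vertices."""
    result = {}
    for entry in via_export.values():
        regions = entry.get("regions", [])
        # Older VIA exports store regions as a dict keyed by index,
        # newer ones as a list; normalise to a list.
        if isinstance(regions, dict):
            regions = list(regions.values())
        polygons = []
        for region in regions:
            shape = region.get("shape_attributes", {})
            if shape.get("name") != "polygon":
                continue  # skip rectangles, circles, etc. in this sketch
            polygons.append(list(zip(shape["all_points_x"], shape["all_points_y"])))
        result[entry["filename"]] = polygons
    return result


# Hypothetical in-memory sample standing in for a loaded annotation file.
sample = {
    "rice_001.jpg12345": {
        "filename": "rice_001.jpg",
        "regions": [
            {"shape_attributes": {"name": "polygon",
                                  "all_points_x": [10, 30, 30],
                                  "all_points_y": [20, 20, 40]}},
            {"shape_attributes": {"name": "rect",
                                  "x": 0, "y": 0, "width": 5, "height": 5}},
        ],
    }
}
print(via_polygons(sample))
```

With a real file, the dictionary would come from json.load on the exported annotation file instead of the inline sample.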

3. Reference

Mask R-CNN paper: https://arxiv.org/pdf/1703.06870.pdf
YOLOv2 paper: https://arxiv.org/pdf/1612.08242.pdf
Keras and TensorFlow implementation of Mask R-CNN: https://github.com/matterport/Mask_RCNN