Official repository of the paper "Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection".
Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan
- (Oct 12, 2022)
- Interactive colab demo released.
- (Sep 15, 2022)
- Paper accepted at NeurIPS 2022.
- (July 7, 2022)
- Training and evaluation code, along with pretrained models, is released.
Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include a pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an absolute 11.9 gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall.
- Region-based Knowledge Distillation (RKD) adapts image-centric language representations to be object-centric.
- Pseudo Image-level Supervision (PIS) uses weak image-level supervision from pretrained multi-modal ViTs (MAVL) to improve generalization of the detector to novel classes.
- Weight Transfer function efficiently combines the above two proposed components.
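The weight transfer idea can be illustrated with a minimal sketch: instead of learning the two branches' classifier weights independently, one branch's weights are derived from the other through a learned mapping, so both stay object-centrically aligned. The dimensions, variable names, and the simple linear form of the transfer function below are illustrative assumptions, not the exact formulation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim = 512   # CLIP text-embedding dimension (assumed)
num_classes = 65  # e.g. the COCO OVD vocabulary size (illustrative)

# RKD-aligned classifier weights: one row per class embedding.
w_rkd = rng.standard_normal((num_classes, embed_dim))

# A simple learnable linear map standing in for the transfer function;
# it conditions the PIS-branch weights on the RKD-aligned weights so the
# two components share a common object-centric alignment.
transfer = rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim)

w_pis = w_rkd @ transfer  # PIS-branch weights derived from RKD weights

print(w_pis.shape)  # (65, 512)
```

During training, only `transfer` (and the usual detector heads) would be updated, which is what lets the two supervision signals reinforce rather than interfere with each other.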
The code is tested with PyTorch 1.10.0 and CUDA 11.3. After cloning the repository, follow the steps in INSTALL.md. All of our models are trained using 8 A100 GPUs.
Check out our demo using our interactive Colab notebook, and create your own custom detector with your own class names.
We present the performance of our object-centric open-vocabulary detector, which demonstrates state-of-the-art results on the open-vocabulary COCO and LVIS benchmarks. For COCO, base and novel categories are shown in purple and green, respectively.
Effect of individual components in our method. Our weight transfer method provides complementary gains from RKD and PIS, achieving superior results compared to naively adding both components.
Name | APnovel | APbase | AP | Train-time | Download |
---|---|---|---|---|---|
Base-OVD-RCNN-C4 | 1.7 | 53.2 | 39.6 | 8h | model |
COCO_OVD_Base_RKD | 21.2 | 54.7 | 45.9 | 8h | model |
COCO_OVD_Base_PIS | 30.4 | 52.6 | 46.8 | 8.5h | model |
COCO_OVD_RKD_PIS | 31.5 | 52.8 | 47.2 | 8.5h | model |
COCO_OVD_RKD_PIS_WeightTransfer | 36.6 | 54.0 | 49.4 | 8.5h | model |
COCO_OVD_RKD_PIS_WeightTransfer_8x | 36.9 | 56.6 | 51.5 | 2.5 days | model |
Our Mask R-CNN based LVIS baseline (mask_rcnn_R50FPN_CLIP_sigmoid) achieves 12.2 rare-class AP and 20.9 overall AP, and trains in only 4.5 hours on 8 A100 GPUs. We believe this could be a good baseline for future research in the LVIS OVD setting.
Name | APr | APc | APf | AP | Epochs |
---|---|---|---|---|---|
PromptDet Baseline | 7.4 | 17.2 | 26.1 | 19.0 | 12 |
ViLD-text | 10.1 | 23.9 | 32.5 | 24.9 | 384 |
Ours Baseline | 12.2 | 19.4 | 26.4 | 20.9 | 12 |
Name | APr | APc | APf | AP | Train-time | Download |
---|---|---|---|---|---|---|
mask_rcnn_R50FPN_CLIP_sigmoid | 12.2 | 19.4 | 26.4 | 20.9 | 4.5h | model |
LVIS_OVD_Base_RKD | 15.2 | 20.2 | 27.3 | 22.1 | 4.5h | model |
LVIS_OVD_Base_PIS | 17.0 | 21.2 | 26.1 | 22.4 | 5h | model |
LVIS_OVD_RKD_PIS | 17.3 | 20.9 | 25.5 | 22.1 | 5h | model |
LVIS_OVD_RKD_PIS_WeightTransfer | 17.1 | 21.4 | 26.7 | 22.8 | 5h | model |
LVIS_OVD_RKD_PIS_WeightTransfer_8x | 21.1 | 25.0 | 29.1 | 25.9 | 1.5 days | model |
To train or evaluate, first prepare the required datasets.
To train a model, run the below command with the corresponding config file.

```
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml
```
Note: Some trainings are initialized from Supervised-base or RKD models. Download the corresponding pretrained models and place them under `$object-centric-ovd/saved_models/`.
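The expected layout can be sketched as below; the weight file names are illustrative examples, not the exact names of the released checkpoints.

```shell
# Create the directory the configs expect (relative to the repo root).
mkdir -p saved_models

# After downloading, the directory might look like:
#   saved_models/coco_supervised_base.pth
#   saved_models/coco_ovd_rkd.pth
ls saved_models
```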
To evaluate a pretrained model, run

```
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
If you use our work, please consider citing:
@inproceedings{Hanoona2022Bridging,
title={Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection},
author={Rasheed, Hanoona and Maaz, Muhammad and Khattak, Muhammad Uzair and Khan, Salman and Khan, Fahad Shahbaz},
booktitle={36th Conference on Neural Information Processing Systems (NeurIPS)},
year={2022}
}
@inproceedings{Maaz2022Multimodal,
title={Class-agnostic Object Detection with Multi-modal Transformer},
author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
booktitle={17th European Conference on Computer Vision (ECCV)},
year={2022},
organization={Springer}
}
If you have any questions, please create an issue on this repository or contact us at hanoona.bangalath@mbzuai.ac.ae or muhammad.maaz@mbzuai.ac.ae.
Our RKD and PIS methods utilize the Multiscale Attention ViT with Late fusion (MAVL) model proposed in the work Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022). Our code is based on the Detic repository. We thank the authors for releasing their code.