Object Centric Open Vocabulary Detection (NeurIPS 2022)

Official repository of the paper "Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection".

Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan



🚀 News

  • (Oct 12, 2022)
    • Interactive colab demo released.
  • (Sep 15, 2022)
    • Paper accepted at NeurIPS 2022.
  • (July 7, 2022)
    • Training and evaluation code with pretrained models are released.


Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that both of these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while the image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an absolute 11.9 gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall.

Main Contributions

  1. Region-based Knowledge Distillation (RKD) adapts image-centric language representations to be object-centric.
  2. Pseudo Image-level Supervision (PIS) uses weak image-level supervision from pretrained multi-modal ViTs (MAVL) to improve generalization of the detector to novel classes.
  3. The weight transfer function efficiently combines the above two proposed components.
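
As a rough illustration of the first contribution, the sketch below computes a distillation loss that pulls a detector's region embeddings toward teacher embeddings of the same regions. The L1 distance, the function name `rkd_l1_loss`, and the toy vectors are assumptions made for illustration; this README does not specify the loss used in the repository.

```python
from typing import Sequence


def rkd_l1_loss(region_embs: Sequence[Sequence[float]],
                teacher_embs: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between student and teacher embeddings.

    Assumption: an L1 distance averaged over all elements. The real
    objective may differ; this only illustrates the distillation idea.
    """
    if len(region_embs) != len(teacher_embs):
        raise ValueError("need exactly one teacher embedding per region")
    total, count = 0.0, 0
    for student, teacher in zip(region_embs, teacher_embs):
        if len(student) != len(teacher):
            raise ValueError("embedding dimensions differ")
        total += sum(abs(s - t) for s, t in zip(student, teacher))
        count += len(student)
    return total / count if count else 0.0


# Toy 2-D embeddings for two regions (hypothetical values).
student = [[0.0, 1.0], [2.0, 2.0]]
teacher = [[0.0, 0.0], [2.0, 4.0]]
loss = rkd_l1_loss(student, teacher)
print(loss)
```

Minimizing such a loss moves each region embedding toward its teacher embedding, which is the sense in which the representation becomes object-centric.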

Installation

The code is tested with PyTorch 1.10.0 and CUDA 11.3. After cloning the repository, follow the steps in INSTALL.md. All of our models are trained using 8 A100 GPUs.
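
Before installing, it can help to check whether the local environment matches the tested versions. The following is a minimal sketch; the helper names `same_release` and `report` are hypothetical and not part of this repository, and other versions may also work.

```python
from typing import Optional

# Versions stated in this README as tested.
TESTED_TORCH = "1.10.0"
TESTED_CUDA = "11.3"


def same_release(found: Optional[str], tested: str, parts: int) -> bool:
    """Compare the first `parts` dot-separated fields.

    Local build suffixes such as '+cu113' are ignored.
    """
    if found is None:  # e.g. a CPU-only PyTorch build reports no CUDA version
        return False
    found_fields = found.split("+")[0].split(".")[:parts]
    return found_fields == tested.split(".")[:parts]


def report() -> str:
    """Describe the installed PyTorch/CUDA versions, if PyTorch is present."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    torch_ok = same_release(torch.__version__, TESTED_TORCH, 3)
    cuda_ok = same_release(torch.version.cuda, TESTED_CUDA, 2)
    return (f"torch {torch.__version__} (match: {torch_ok}), "
            f"CUDA {torch.version.cuda} (match: {cuda_ok})")


if __name__ == "__main__":
    print(report())
```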


Demo: Create your own custom detector

Check out our interactive Colab notebook to create a custom detector with your own class names.
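
CLIP-based open-vocabulary detectors commonly label a detected region by comparing its embedding with text embeddings of the class names, which is why the vocabulary can be changed at inference. The sketch below shows that idea with hand-written toy vectors. The function names and embeddings are hypothetical; in practice the vectors would come from a text encoder, and this is not the repository's code.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_region(region_emb: Sequence[float],
                    class_embs: Dict[str, Sequence[float]]) -> str:
    """Return the class name whose embedding is most similar to the region."""
    return max(class_embs, key=lambda name: cosine(region_emb, class_embs[name]))


# Hypothetical custom vocabulary with toy 3-D "text embeddings".
custom_vocab = {
    "umbrella": [1.0, 0.0, 0.0],
    "skateboard": [0.0, 1.0, 0.0],
    "kite": [0.0, 0.0, 1.0],
}
region = [0.1, 0.9, 0.2]  # toy region embedding
label = classify_region(region, custom_vocab)
print(label)
```

Changing the detector's vocabulary then amounts to replacing the entries of `custom_vocab`, with no retraining in this simplified picture.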

Results

Our object-centric open-vocabulary detector achieves state-of-the-art results on the open-vocabulary COCO and LVIS benchmarks. In the t-SNE plots, COCO base and novel categories are shown in purple and green, respectively.

Open-vocabulary COCO

Effect of individual components in our method. Our weight transfer method provides complementary gains from RKD and PIS, achieving superior results compared to naively adding both components.

| Name | APnovel | APbase | AP | Train-time | Download |
|---|---|---|---|---|---|
| Base-OVD-RCNN-C4 | 1.7 | 53.2 | 39.6 | 8h | model |
| COCO_OVD_Base_RKD | 21.2 | 54.7 | 45.9 | 8h | model |
| COCO_OVD_Base_PIS | 30.4 | 52.6 | 46.8 | 8.5h | model |
| COCO_OVD_RKD_PIS | 31.5 | 52.8 | 47.2 | 8.5h | model |
| COCO_OVD_RKD_PIS_WeightTransfer | 36.6 | 54.0 | 49.4 | 8.5h | model |
| COCO_OVD_RKD_PIS_WeightTransfer_8x | 36.9 | 56.6 | 51.5 | 2.5 days | model |
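
To make the ablation easier to read, the short script below computes the novel-class gains implied by the table. The numbers are copied from the APnovel column above; only the arithmetic and the variable names are added here.

```python
# APnovel values copied from the open-vocabulary COCO table.
coco_ap_novel = {
    "Base-OVD-RCNN-C4": 1.7,
    "COCO_OVD_Base_RKD": 21.2,
    "COCO_OVD_Base_PIS": 30.4,
    "COCO_OVD_RKD_PIS": 31.5,
    "COCO_OVD_RKD_PIS_WeightTransfer": 36.6,
}

baseline = coco_ap_novel["Base-OVD-RCNN-C4"]

# Gain of every configuration over the supervised base model.
gains = {name: round(ap - baseline, 1) for name, ap in coco_ap_novel.items()}

# Gain of weight transfer over naively combining RKD and PIS.
weight_transfer_gain = round(
    coco_ap_novel["COCO_OVD_RKD_PIS_WeightTransfer"]
    - coco_ap_novel["COCO_OVD_RKD_PIS"],
    1,
)

for name, gain in gains.items():
    print(f"{name}: +{gain} APnovel over base")
print(f"Weight transfer vs. naive RKD+PIS: +{weight_transfer_gain} APnovel")
```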

New LVIS Baseline

Our Mask R-CNN based LVIS baseline (mask_rcnn_R50FPN_CLIP_sigmoid) achieves 12.2 AP on rare classes and 20.9 AP overall, and trains in only 4.5 hours on 8 A100 GPUs. We believe it is a good baseline for future research in the LVIS OVD setting.

| Name | APr | APc | APf | AP | Epochs |
|---|---|---|---|---|---|
| PromptDet Baseline | 7.4 | 17.2 | 26.1 | 19.0 | 12 |
| ViLD-text | 10.1 | 23.9 | 32.5 | 24.9 | 384 |
| Ours Baseline | 12.2 | 19.4 | 26.4 | 20.9 | 12 |

Open-vocabulary LVIS

| Name | APr | APc | APf | AP | Train-time | Download |
|---|---|---|---|---|---|---|
| mask_rcnn_R50FPN_CLIP_sigmoid | 12.2 | 19.4 | 26.4 | 20.9 | 4.5h | model |
| LVIS_OVD_Base_RKD | 15.2 | 20.2 | 27.3 | 22.1 | 4.5h | model |
| LVIS_OVD_Base_PIS | 17.0 | 21.2 | 26.1 | 22.4 | 5h | model |
| LVIS_OVD_RKD_PIS | 17.3 | 20.9 | 25.5 | 22.1 | 5h | model |
| LVIS_OVD_RKD_PIS_WeightTransfer | 17.1 | 21.4 | 26.7 | 22.8 | 5h | model |
| LVIS_OVD_RKD_PIS_WeightTransfer_8x | 21.1 | 25.0 | 29.1 | 25.9 | 1.5 days | model |
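
The Train-time column mixes hours and days, so the cost of the `_8x` model is easier to compare after converting to hours. The sketch below does that for the last two rows. It assumes one day means 24 hours of wall-clock time on the same hardware; the helper `train_hours` is hypothetical.

```python
def train_hours(cell: str) -> float:
    """Convert a Train-time cell such as '4.5h' or '1.5 days' to hours."""
    text = cell.strip().lower()
    if text.endswith("days") or text.endswith("day"):
        return float(text.split()[0]) * 24.0  # assumption: 1 day = 24 h
    if text.endswith("h"):
        return float(text[:-1])
    raise ValueError(f"unrecognized train-time format: {cell!r}")


# (APr, Train-time) copied from the open-vocabulary LVIS table.
standard_apr, standard_time = 17.1, train_hours("5h")
long_apr, long_time = 21.1, train_hours("1.5 days")

apr_gain = round(long_apr - standard_apr, 1)
time_ratio = long_time / standard_time

print(f"+{apr_gain} APr for {time_ratio:.1f}x the training time")
```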

t-SNE plots

[Figure: t-SNE plots]


Training and Evaluation

To train or evaluate, first prepare the required datasets.

To train a model, run the command below with the corresponding config file.

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

Note: Some training runs are initialized from the supervised-base or RKD models. Download the corresponding pretrained models and place them under $object-centric-ovd/saved_models/.
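
A quick way to confirm that the pretrained models are in place is to list the checkpoint files in that directory. This is a minimal sketch: the helper `list_checkpoints` is hypothetical, and the `.pth` extension is assumed from the example weight path used in the evaluation command.

```python
from pathlib import Path
from typing import List


def list_checkpoints(repo_root: str) -> List[str]:
    """Return sorted names of .pth files under <repo_root>/saved_models."""
    model_dir = Path(repo_root) / "saved_models"
    if not model_dir.is_dir():
        return []  # directory missing: nothing has been downloaded yet
    return sorted(p.name for p in model_dir.glob("*.pth"))


if __name__ == "__main__":
    # Run from the repository root, or pass its path instead of ".".
    found = list_checkpoints(".")
    print(found if found else "no checkpoints found under ./saved_models")
```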

To evaluate a pretrained model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
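
When launching several runs from a script, the two commands above can be assembled programmatically. The sketch below builds the argument lists using only the flags shown in this README; the helper `build_command` is hypothetical, and the paths are the same placeholders as above.

```python
import shlex
from typing import List, Optional


def build_command(config_file: str,
                  num_gpus: int = 8,
                  weights: Optional[str] = None) -> List[str]:
    """Build the train_net.py argument list.

    If `weights` is given, the command evaluates that checkpoint;
    otherwise it trains with the given config.
    """
    cmd = ["python", "train_net.py",
           "--num-gpus", str(num_gpus),
           "--config-file", config_file]
    if weights is not None:
        cmd += ["--eval-only", "MODEL.WEIGHTS", weights]
    return cmd


train_cmd = build_command("/path/to/config/name.yaml")
eval_cmd = build_command("/path/to/config/name.yaml",
                         weights="/path/to/weight.pth")
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

Either list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.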

Citation

If you use our work, please consider citing:

@inproceedings{Hanoona2022Bridging,
    title={Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection},
    author={Rasheed, Hanoona and Maaz, Muhammad and Khattak, Muhammad Uzair and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={36th Conference on Neural Information Processing Systems (NeurIPS)},
    year={2022}
}
    
@inproceedings{Maaz2022Multimodal,
      title={Class-agnostic Object Detection with Multi-modal Transformer},
      author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
      booktitle={17th European Conference on Computer Vision (ECCV)},
      year={2022},
      organization={Springer}
}

Contact

If you have any questions, please create an issue on this repository or contact hanoona.bangalath@mbzuai.ac.ae or muhammad.maaz@mbzuai.ac.ae.

References

Our RKD and PIS methods use the multi-modal ViT (MViT) model Multiscale Attention ViT with Late fusion (MAVL), proposed in Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022). Our code is based on the Detic repository. We thank the authors for releasing their code.