BLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

How to use it?

Use the model

from mmpretrain import inference_model

result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
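The same inference_model API dispatches to the other checkpoints listed below. As a minimal sketch, visual question answering with the VQA checkpoint looks like the following (the question string is illustrative and the exact output keys may differ):

from mmpretrain import inference_model

# Ask a free-form question about the demo image; the VQA inferencer takes an
# image path and a question string.
result = inference_model('blip-base_3rdparty_vqa', 'demo/cat-dog.png', 'What animals are in the picture?')
print(result)
# Expected to contain the predicted answer, e.g. {'pred_answer': ...}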

Test Command

Prepare your dataset according to the docs.

Test:

python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
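If you have multiple GPUs, the same evaluation can be launched through the standard distributed test script shipped with the repo (a sketch assuming 8 GPUs; adjust the last argument to your setup):

# Distributed evaluation of the caption model on 8 GPUs.
bash tools/dist_test.sh configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth 8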

Models and results

Image Caption on COCO

Model Params (M) BLEU-4 CIDEr Config Download
blip-base_3rdparty_caption* 223.97 40.12 132.82 config model

Image Caption on NoCaps

Model Params (M) SPICE CIDEr Config Download
blip-base_3rdparty_caption* 223.97 14.69 109.12 config model

Image Caption on Flickr30k

Model Params (M) SPICE CIDEr Config Download
blip-base_3rdparty_caption* 223.97 15.58 68.89 config model

Visual Grounding on RefCOCO

Model Params (M) Accuracy (testA) Accuracy (testB) Config Download
blip-base_8xb16_refcoco 498.49 86.14 77.33 config model | log

Visual Question Answering on VQAv2

Model Params (M) Accuracy Config Download
blip-base_3rdparty_vqa* 361.48 78.20 config model

Visual Question Answering on OK-VQA

Model Params (M) Accuracy Config Download
blip-base_3rdparty_vqa* 361.48 40.59# config model

Visual Question Answering on OCR-VQA

Model Params (M) Accuracy Config Download
blip-base_3rdparty_vqa* 361.48 28.30# config model

Image-To-Text Retrieval on COCO

Model Params (M) Recall@1 Recall@5 Config Download
blip-base_3rdparty_retrieval* 447.49 82.52 95.34 config model

Text-To-Image Retrieval on COCO

Model Params (M) Recall@1 Recall@5 Config Download
blip-base_3rdparty_retrieval* 447.49 64.82 86.28 config model

Image-To-Text Retrieval on Flickr30k

Model Params (M) Recall@1 Recall@5 Config Download
blip-base_3rdparty_retrieval* 447.49 95.10# 99.60# config model

Text-To-Image Retrieval on Flickr30k

Model Params (M) Recall@1 Recall@5 Config Download
blip-base_3rdparty_retrieval* 447.49 85.26# 96.58# config model

NLVR on NLVR2

Model Params (M) Top-1 (%) Config Download
blip-base_3rdparty_nlvr* 259.37 82.33 config model

Models with * are converted from the official repo. The config files of these models are only for inference, since we haven't reproduced the training results.

Results with # denote zero-shot evaluation; the corresponding model has not been fine-tuned on that dataset.

Citation

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}