# CLIPCAM

Official implementation of CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization (ICASSP 2022).
- Environment Setup
- Quick Demo
- Supported models for CLIPCAM
- CAM Variations
- Dataset Preparation
- Evaluation
  - Grid-view Zero-shot Object Localization
    - OpenImage
  - Grid-view Zero-shot Action Localization
    - HICO-DET
  - Single-image Zero-shot Object Localization
    - OpenImage
    - ILSVRC
    - COSMOS
  - Custom Images
    - Grid-view Zero-shot Object Localization
- Other features
## Environment Setup

- Create a conda environment with Python 3.7:

```
conda create -n clipcam python=3.7
conda activate clipcam
```

- Install PyTorch 1.9.0 and torchvision 0.10.0 with a compatible CUDA version (or any compatible torch version):

```
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```

- Install the required packages:

```
pip install -r requirements.txt
```
## Quick Demo

Please go to this link for a quick demo.

*P.S. First-time users: please follow the instructions at the top of the demo website to allow your browser to connect to our server.*

Run CLIPCAM on your own input:

```
python clipcam.py \
    --image_path "{single image path or grid image directory (4 images)}" \
    --sentence "{input sentence}" \
    --gpu_id 0 \
    --clip_model_name "ViT-B/16" \
    --cam_model_name "GradCAM"
```
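For example (the image path and sentence below are illustrative placeholders, not files shipped with the repo):

```
python clipcam.py \
    --image_path examples/dog.jpg \
    --sentence "a dog lying on the grass" \
    --gpu_id 0 \
    --clip_model_name "ViT-B/16" \
    --cam_model_name "GradCAM"
```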
## Supported models for CLIPCAM

CLIP models (from OpenAI), e.g. `--clip_model_name ViT-B/16`:

- ViT-B/16
- ViT-B/32
- RN50
- RN101
- RN50x4
- RN50x16

Pretrained (non-CLIP) counterparts, used together with the `_original` CAM variants below: append `-pretrained` to the model name, e.g. `--clip_model_name ViT-B/16-pretrained`.
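All of the CLIP backbones above are loaded through OpenAI's `clip` package. A minimal sketch of what happens under the hood (the image path and prompt are placeholders, not part of the repo):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # any backbone listed above

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # visual embedding
    text_features = model.encode_text(text)     # text embedding
    # CLIPCAM's heatmaps are driven by this image-text similarity score
    similarity = torch.cosine_similarity(image_features, text_features)
print(similarity.item())
```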
## CAM Variations

CAMs for CLIP (CLIPCAMs), adapted from pytorch-grad-cam, e.g. `--cam_model_name GradCAM`:

- GradCAM
- GradCAMPlusPlus
- XGradCAM
- ScoreCAM
- EigenCAM
- EigenGradCAM
- GuidedBackpropReLUModel
- LayerCAM

CAMs for other models (from pytorch-grad-cam), e.g. `--cam_model_name GradCAM_original`:

- GradCAM_original
- GradCAMPlusPlus_original
- XGradCAM_original
- ScoreCAM_original
- EigenGradCAM_original
- EigenCAM_original
- GuidedBackpropReLUModel_original
- LayerCAM_original
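For reference, here is a minimal sketch of how the upstream pytorch-grad-cam package computes a heatmap for a standard torchvision classifier, which is what the `_original` variants correspond to; the CLIP variants wrap the same machinery around CLIP's image encoder with the image-text similarity as the target. The exact API can vary across pytorch-grad-cam versions:

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(pretrained=True).eval()
target_layers = [model.layer4[-1]]          # last conv block of the backbone

cam = GradCAM(model=model, target_layers=target_layers)
input_tensor = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image batch
# heatmap for ImageNet class 281 ("tabby cat"); shape (batch, H, W), values in [0, 1]
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(281)])
```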
## Dataset Preparation

- OpenImage V6: download the OpenImage V6 validation set with `data_prep/openimage.py`.
- HICO-DET: download HICO-DET from this link.
- ILSVRC (optional): download the ILSVRC validation set.
- COSMOS (optional): download the COSMOS validation set.
## Evaluation

### Grid-view Zero-shot Object Localization

#### OpenImage

- Dataset structure (OpenImage):

```
|--OpenImage
    |--validation
        |--data
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--labels
            |--detections.csv
        |--metadata
            |--classes.csv
```

- Run `evaluate_grid_openimage.py` with any model selection:

```
python evaluate_grid_openimage.py \
    --data_dir Dataset/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/openimage/vitb32-grad' \
    --mask_threshold 0.2 \
    --sentence_prefix 'a photo of ' \
    --attack 'None' \
    --save_result 1
```
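In the grid-view setting, four images are tiled into one composite image and CLIPCAM must localize the query inside it. A minimal sketch of building such a grid with PIL (the tile size and 2x2 layout here are assumptions, not the repo's exact preprocessing):

```python
from PIL import Image

def make_grid(paths, size=224):
    """Tile four images into a single 2x2 grid image (illustrative only)."""
    tiles = [Image.open(p).convert("RGB").resize((size, size)) for p in paths]
    grid = Image.new("RGB", (2 * size, 2 * size))
    for i, tile in enumerate(tiles):
        grid.paste(tile, ((i % 2) * size, (i // 2) * size))  # column, row offsets
    return grid
```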
### Grid-view Zero-shot Action Localization

#### HICO-DET

- Dataset structure (HICO-DET):

```
|--HICO-DET
    |--images
        |--test
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--train
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
    |--anno.mat
    |--anno_bbox.mat
```

- Run `verb_grid.py` for the pre-trained model. Train the model with half of the classes in HICO-DET, or download the fine-tuned checkpoints from this OneDrive. `--train_mode` ('full', 'few', or 'half') specifies which subset of HICO-DET classes is loaded; a sketch of the 'half' split follows the command.

```
python verb_grid.py \
    --data_dir datasets/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32-pretrained' \
    --cam_model_name 'GradCAM_original' \
    --save_dir 'eval_result/grid/hicodet/vitb32-pretrained-grad' \
    --mask_threshold 0.2 \
    --train_mode 'half' \
    --model_name checkpoints/models/vitb32-pretrained-half-1e-6.pth \
    --save_result 1
```
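As referenced above, a hypothetical sketch of the 'half' setting (the repo's actual split logic may differ):

```python
def half_split(verbs):
    """Fine-tune on the first half of the HICO-DET verb classes and
    evaluate zero-shot on the held-out half (illustrative only)."""
    mid = len(verbs) // 2
    return verbs[:mid], verbs[mid:]
```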
- Run `verb_grid.py` for CLIPCAM:

```
python verb_grid.py \
    --data_dir dataset/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/hicodet/vitb32-grad' \
    --mask_threshold 0.2 \
    --save_result 1
```
### Single-image Zero-shot Object Localization

#### OpenImage

a. Run `evaluate_openimage.py`:

```
python evaluate_openimage.py \
    --data_dir datasets/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/openimage/vitb32-grad' \
    --save_result 1 \
    --sentence_prefix 'a photo of ' \
    --distill_num 0 \
    --attack 'None'
```
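For localization metrics, the heatmap has to be turned into a region; `--mask_threshold` controls the cutoff. A simplified sketch of one plausible thresholding scheme (the repo may use a different procedure, e.g. connected components):

```python
import numpy as np

def cam_to_bbox(cam, threshold=0.2):
    """Convert a CAM heatmap into one bounding box around all pixels
    whose activation exceeds threshold * max (illustrative only)."""
    mask = cam >= threshold * cam.max()
    ys, xs = np.where(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # x1, y1, x2, y2
```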
#### ILSVRC

a. Dataset structure (ILSVRC):

```
|--ImageNet
    |--validation
        |--{label_1}
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--{label_2}
            |-- ...
    |--bbox
        |--val
            |--{image_path_1}.xml
            |--{image_path_2}.xml
            |-- ...
```

b. Run `evaluate_imagenet.py`:

```
python evaluate_imagenet.py \
    --data_dir dataset/ImageNet/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/imagenet/vitb32-grad' \
    --batch 128 \
    --save_result 1 \
    --sentence_prefix 'sentence' \
    --attack 'None'
```
#### COSMOS, OpenImage and custom images

a. Run `evaluate.py` with `--dataset cosmos` or `--dataset openimage`:

```
python evaluate.py \
    --data_dir datasets/COSMOS/val \
    --gpu_id 0 \
    --dataset cosmos \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/cosmos/vitb32-grad' \
    --distill_num 0 \
    --attack 'None'
```
b. Test on images with custom guiding text: put the images in a folder, then run `evaluate.py` without specifying `--dataset`:

```
python evaluate.py \
    --data_dir {path to folder} \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/custom-input-vitb32-grad' \
    --distill_num 0
```
## Other features

We propose an iterative refinement method that masks out areas of high neural importance so that attention can expand to, or strengthen, weak response regions. Set `--distill_num {n}` to mask out and recompute the CAM {n} times.
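A minimal sketch of the idea (the `compute_cam` callable, the thresholding rule, and the aggregation are assumptions, not the repository's exact implementation):

```python
import numpy as np

def iterative_refinement(image, compute_cam, n, threshold=0.2):
    """Recompute the CAM n extra times, each time masking out the
    currently most important pixels so weaker regions can respond.

    image:       H x W x 3 array
    compute_cam: hypothetical callable, image -> normalized H x W heatmap
    """
    cams = []
    current = image.copy()
    for _ in range(n + 1):
        cam = compute_cam(current)
        cams.append(cam)
        high = cam >= threshold * cam.max()  # current high-importance region
        current = current.copy()
        current[high] = 0                    # mask it out before the next pass
    return np.maximum.reduce(cams)           # aggregate (one plausible choice)
```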
We also evaluated the ability of CLIPCAM to handle attacked images. Set `--attack fog` or `--attack snow` to apply a fog or snow attack.
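A sketch of one way to generate such inputs, assuming ImageNet-C-style corruptions from the `imagecorruptions` package (whether the repo uses this exact package is an assumption; "example.jpg" is a placeholder):

```python
import numpy as np
from PIL import Image
from imagecorruptions import corrupt  # pip install imagecorruptions

image = np.asarray(Image.open("example.jpg").convert("RGB"))
foggy = corrupt(image, corruption_name="fog", severity=3)   # ImageNet-C fog
snowy = corrupt(image, corruption_name="snow", severity=3)  # ImageNet-C snow
Image.fromarray(foggy).save("example_fog.jpg")
Image.fromarray(snowy).save("example_snow.jpg")
```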
If you find the paper or the code useful for your study, please consider citing the CLIPCAM paper:

```
@inproceedings{clipcam_hsia_icassp2022,
  author    = {Hsia, Hsuan-An and Lin, Che-Hsien and Kung, Bo-Han and Chen, Jhao-Ting and Tan, Daniel Stanley and Chen, Jun-Cheng and Hua, Kai-Lung},
  title     = {{CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization}},
  booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2022}
}
```
If you have questions regarding the paper or code, please open an issue or email us: Jhao-Ting Chen or Che-Hsien Lin. We will get back to you as soon as possible.