This repository is the official implementation of "Reducing Hallucinations in Vision-Language Models via Latent Space Steering".
Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from LLMs. We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.
Overview of the proposed algorithm, visual and textual test-time intervention (VTI). Given an example set {(vᵢ, xᵢ, x̅ᵢ)}, where vᵢ is the vision input and (xᵢ, x̅ᵢ) are paired captions with and without hallucination, VTI first runs the model on each example (vᵢ, xᵢ, x̅ᵢ) and records all hidden states. It then computes the shifting vectors dₗ,ₜᵛⁱˢⁱᵒⁿ and dₗ,ₜᵗᵉˣᵗ for every layer l and token t, following the method section of the paper. During inference, these vectors are added to every layer of the vision encoder and text decoder, respectively, when processing a new query. Note that the vectors are task- and dataset-agnostic: they are pre-computed from a few samples of one specific task and dataset, and kept fixed throughout all experiments in our paper.

To set up the environment:
conda create -yn vti python=3.9
conda activate vti
cd VTI
pip install -r requirements.txt
The following evaluation requires the MSCOCO 2014 dataset (both for computing the VTI directions and for evaluation). Please download it here and extract it to your data path.
VTI has two core functions: computing the VTI directions and adding the directions to the LVLM.
- Compute the VTI visual and textual directions for an LVLM
# data used to compute the VTI visual and textual directions
input_images, input_ids = get_demos(args, image_processor, model, tokenizer)
# compute the VTI visual direction
vti_vision, _ = obtain_visual_vti(model, input_images, rank=1)
visual_direction = vti_vision[1:]
# compute the VTI textual direction
vti_text, _ = obtain_textual_vti(model, input_ids, input_images, rank=1)
textual_direction = vti_text[1:]
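For intuition, the snippet below sketches one way such a per-layer direction can be obtained: record hidden states from paired runs (e.g., captions with and without hallucination, or clean and perturbed images), form their differences, and take the leading singular direction, which matches the rank=1 argument above. The function name, tensor shapes, and the use of SVD here are illustrative assumptions, not the repository's actual implementation of obtain_visual_vti / obtain_textual_vti.

import torch

def principal_shift_direction(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of a rank-1 steering direction per layer.
    # h_pos, h_neg: (num_examples, num_layers, hidden_dim) hidden states recorded
    # from runs on non-hallucinated vs. hallucinated inputs.
    diffs = h_pos - h_neg                                  # (N, L, D) per-example shifts
    directions = []
    for layer in range(diffs.shape[1]):
        d = diffs[:, layer, :]                             # (N, D) difference matrix
        # leading right singular vector = dominant shared shift direction (rank-1 component)
        _, _, vh = torch.linalg.svd(d, full_matrices=False)
        directions.append(vh[0])                           # unit-norm vector of size D
    return torch.stack(directions)                         # (num_layers, hidden_dim)

The [1:] slice on vti_vision and vti_text above presumably drops the first (embedding-level) entry so that only the transformer layers are steered.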
- Add the directions to the LVLM
# add the textual direction to the LVLM (text decoder layers)
add_vti_layers(model, torch.stack([textual_direction], dim=1).cuda(), alpha=[args.alpha_text])
- Note that you need to pass the model's vision encoder when adding the visual direction
# add the visual direction to the vision encoder layers
add_vti_layers(model.model.vision_tower.vision_tower.vision_model, torch.stack([visual_direction], dim=1).cuda(), alpha=[args.alpha_image])
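Conceptually, add_vti_layers makes each layer of the target module shift its hidden states by alpha times the corresponding pre-computed direction at inference time. Below is a minimal sketch of that idea using PyTorch forward hooks; it assumes transformer blocks that return their hidden states as the first element of their output, and it is not the repository's actual implementation.

import torch

def add_steering_hooks(blocks, directions: torch.Tensor, alpha: float):
    # Hypothetical sketch: register a forward hook on each block that adds
    # alpha * direction to the block's output hidden states.
    # blocks:     iterable of transformer layers (e.g. the decoder's layer list)
    # directions: (num_layers, hidden_dim) steering vectors, one per block
    handles = []
    for block, direction in zip(blocks, directions):
        def hook(module, inputs, output, d=direction):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * d.to(device=hidden.device, dtype=hidden.dtype)
            if isinstance(output, tuple):
                return (hidden,) + output[1:]
            return hidden
        handles.append(block.register_forward_hook(hook))
    return handles  # call handle.remove() on each handle to undo the intervention

Note that the paper computes dₗ,ₜ per layer and token; for simplicity this sketch adds a single vector per layer to all token positions, with alpha controlling the intervention strength (cf. --alpha_image and --alpha_text below).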
MMHal-Bench
- Download the MMHal-Bench data
- Run experiments
python ./experiments/eval/run_mmhal_vti.py \
--alpha_image 0.9 \
--alpha_text 0.9 \
--seed 42 \
--image-folder dir/to/COCO/val2014/ \
--data-file dir/to/COCO/ \
--answers-file ./results/MMHal_answer.jsonl \
--num_demos 70 \
--mask_ratio 0.99 \
--num_trials 50
- To evaluate
python experiments/eval/eval_mmhal.py \
--response ./results/MMHal_answer.jsonl \
--api-key YOUR_OPENAI_API_KEY
To be updated
If you find our code or models useful in your work, please cite our paper:
@article{liu2024reducing,
  title={Reducing Hallucinations in Vision-Language Models via Latent Space Steering},
  author={Liu, Sheng and Ye, Haotian and Zou, James},
  journal={arXiv preprint arXiv:2410.15778},
  year={2024}
}