Project Page | Paper | Video
Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, Joyce Chai
University of Michigan, ARL
IROS 2024
The code is adapted from Video-ChatGPT. We recommend setting up a conda environment for the project:
conda create --name=drivlme python=3.10
conda activate drivlme
git clone git@github.com:sled-group/driVLMe.git
cd driVLMe
pip install -r requirements.txt
pip install -e .
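After installing, a quick sanity check helps confirm that PyTorch can see your GPUs before launching training. The snippet below is a minimal sketch (a hypothetical verify_env.py, not part of the repo):

# verify_env.py -- hypothetical helper: confirm PyTorch is installed and the GPUs are visible.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")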
Please follow the instructions below to obtain the LLaVA weights.
- Get the original LLaMA weights in the Hugging Face format by following the instructions here.
- Use the following script to obtain the LLaVA weights by applying the delta.
python scripts/apply_delta.py \
--base-model-path <path to LLaMA 7B weights> \
--target-model-path LLaVA-Lightning-7B-v1-1 \
--delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1
The above command will download the LLaVA-Lightning-7B-v1-1 delta from Hugging Face, apply it to the provided LLaMA weights, and save the LLaVA-Lightning-7B-v1-1 weights in the current directory.
Alternatively, you can download the ready-to-use LLaVA-Lightning-7B-v1-1 weights from mmaaz60/LLaVA-Lightening-7B-v1-1.
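Whichever route you take, you can sanity-check the resulting weights directory before training. The sketch below is only a rough check and assumes a standard Hugging Face checkpoint layout (config.json, tokenizer files, sharded weight files); the directory name is a placeholder, and the file name is hypothetical.

# check_weights.py -- hypothetical helper: verify the LLaVA weights directory looks complete.
from pathlib import Path
from transformers import AutoTokenizer

weights_dir = Path("LLaVA-Lightning-7B-v1-1")  # adjust to your local path

print("config.json present:", (weights_dir / "config.json").exists())
# Weight shards may be *.bin or *.safetensors depending on how they were saved.
print("weight shards:", sorted(p.name for p in weights_dir.glob("*.bin")))

# The tokenizer is a plain LLaMA tokenizer and should load with vanilla transformers.
tokenizer = AutoTokenizer.from_pretrained(str(weights_dir))
print("vocab size:", tokenizer.vocab_size)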
You can get the data at this download link and untar all the files into the folder "videos".
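If you prefer to extract the archives programmatically, a minimal sketch is shown below. It assumes the downloaded .tar archives sit in a local downloads/ directory (the directory and script names are placeholders) and extracts everything into videos/.

# extract_videos.py -- hypothetical helper: untar every downloaded archive into ./videos.
import tarfile
from pathlib import Path

download_dir = Path("downloads")   # wherever you saved the archives
out_dir = Path("videos")
out_dir.mkdir(exist_ok=True)

for archive in sorted(download_dir.glob("*.tar*")):
    print(f"extracting {archive.name} ...")
    with tarfile.open(archive) as tar:   # compression is auto-detected
        tar.extractall(out_dir)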
Pretrain on 4 A40 GPUs using the following command:
torchrun --nproc_per_node=4 --master_port 29001 drivlme/train/train_xformers.py \
--model_name_or_path <Path to Llava> \
--version v1 \
--data_path datasets/bddx_pretrain.json \
--video_folder videos/bdd100k_feats \
--tune_mm_mlp_adapter True \
--mm_use_vid_start_end \
--bf16 True \
--output_dir ./DriVLMe_model_weights/bddx_pretrain_ckpt \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 100 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
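The fine-tuning stage below consumes the multimodal projector saved by this pretraining run (./DriVLMe_model_weights/bddx_pretrain_ckpt/mm_projector.bin). A quick way to confirm the checkpoint was written is a sketch like the following, assuming the file is a plain PyTorch state dict (the usual convention for these projector checkpoints); the script name is hypothetical.

# inspect_projector.py -- hypothetical helper: peek at the saved mm_projector weights.
import torch

state = torch.load(
    "./DriVLMe_model_weights/bddx_pretrain_ckpt/mm_projector.bin",
    map_location="cpu",
)
for name, tensor in state.items():
    print(f"{name}: {tuple(tensor.shape)}")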
Fine-tune on 4 A40 GPUs with DeepSpeed ZeRO-2 using the following command:
deepspeed --master_port=29001 drivlme/train/train_xformers.py \
--deepspeed ./scripts/zero2.json \
--model_name_or_path <Path to Llava> \
--pretrain_mm_mlp_adapter ./DriVLMe_model_weights/bddx_pretrain_ckpt/mm_projector.bin \
--version v1 \
--lora_enable True \
--lora_r 128 \
--lora_alpha 256 \
--data_path datasets/DriVLMe_sft_data.json \
--video_folder videos/SDN_train_feats \
--mm_use_vid_start_end \
--bf16 True \
--output_dir ./model_path/DriVLMe \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 3 \
--learning_rate 5e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 100 \
--tf32 True \
--model_max_length 2048 \
--lazy_preprocess True
You can also download the pretrained checkpoints from this link.
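Before running evaluation, it can help to confirm that the DriVLMe checkpoint directory (whether trained locally or downloaded) contains the LoRA adapter files. The sketch below is only a sanity check: it assumes the standard PEFT layout (an adapter_config.json next to the adapter weights), and the paths and script name are placeholders.

# check_lora_ckpt.py -- hypothetical helper: sanity-check the fine-tuned checkpoint directory.
from pathlib import Path

# Use ./model_path/DriVLMe if you trained the model yourself.
ckpt_dir = Path("./DriVLMe_model_weights/DriVLMe")

# Assumed PEFT layout; exact file names may differ.
for name in ["adapter_config.json"]:
    print(f"{name}: {'found' if (ckpt_dir / name).exists() else 'MISSING'}")
print("all files:", sorted(p.name for p in ckpt_dir.iterdir()))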
To run the open-loop evaluation, use the following commands, replacing <path to LLaVA-7B-Lightening-v1-1 weights> with your local path to the LLaVA weights from the setup step.

For the NfD task:
python drivlme/single_video_inference_SDN.py \
    --model-name <path to LLaVA-7B-Lightening-v1-1 weights> \
    --projection_path ./DriVLMe_model_weights/bddx_pretrain_ckpt/mm_projector.bin \
    --lora_path ./DriVLMe_model_weights/DriVLMe/ \
    --json_path datasets/SDN_test_actions.json \
    --video_root videos/SDN_test_videos/ \
    --out_path SDN_test_actions.json
python evaluation/physical_action_acc.py

For the RfN task:
python drivlme/single_video_inference_SDN.py \
    --model-name <path to LLaVA-7B-Lightening-v1-1 weights> \
    --projection_path ./DriVLMe_model_weights/bddx_pretrain_ckpt/mm_projector.bin \
    --lora_path ./DriVLMe_model_weights/DriVLMe/ \
    --json_path datasets/SDN_test_conversations.json \
    --video_root videos/SDN_test_videos/ \
    --out_path SDN_test_conversations.json
python evaluation/diag_action_acc.py
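After inference, the file given by --out_path holds the predictions that the accuracy scripts consume. A quick, hedged sanity check (the exact JSON schema is not documented here, so the sketch below only reports the number of entries; the script name is hypothetical) might look like:

# count_predictions.py -- hypothetical helper: report how many predictions were written.
import json

with open("SDN_test_actions.json") as f:
    preds = json.load(f)

# The file may be a list of records or a dict keyed by video id; len() covers both.
print(f"SDN_test_actions.json contains {len(preds)} entries")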
@misc{huang2024drivlmeenhancingllmbasedautonomous,
title={DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences},
author={Yidong Huang and Jacob Sansom and Ziqiao Ma and Felix Gervits and Joyce Chai},
year={2024},
eprint={2406.03008},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.03008},
}
We thank the authors of the awesome research works Video-ChatGPT, DriveGPT4, and DriveMLM.