GitHub - TencentARC/ST-LLM: [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"

ST-LLM: Large Language Models Are Effective Temporal Learners

News 📢

[2024/3/28] All code and weights are now available! Watch this repository for the latest updates.

Introduction 💡

ST-LLM is a temporal-sensitive video large language model. Our model incorporates three key architectural designs:
- (1) Joint spatial-temporal modeling within large language models for effective video understanding.
- (2) Dynamic masking strategy and masked video modeling for efficiency and robustness.
- (3) Global-local input module for long video understanding.
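
As a rough illustration of design (2), the sketch below shows one way a dynamic masking step could look: a mask ratio is drawn at random on each call, and that fraction of video tokens is dropped before the rest are passed on. This is a minimal sketch, not the repository's implementation. The function name `dynamic_mask`, the ratio range, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
import random


def dynamic_mask(tokens, min_ratio=0.3, max_ratio=0.7, rng=None):
    """Drop a randomly sized subset of video tokens.

    The ratio bounds are hypothetical defaults, not values from the paper.
    Returns the kept tokens (original order preserved) and their indices.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Draw a fresh mask ratio on every call, so the amount of masking varies.
    ratio = rng.uniform(min_ratio, max_ratio)
    # Always keep at least one token so the sequence is never empty.
    num_keep = max(1, round(len(tokens) * (1.0 - ratio)))
    # Sample the positions to keep, then restore temporal order.
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx


if __name__ == "__main__":
    frame_tokens = [f"tok{i}" for i in range(16)]
    kept, idx = dynamic_mask(frame_tokens, rng=random.Random(0))
    print(len(kept), idx)
```

Because the ratio changes from call to call, a model trained this way would see sequences of varying length, which is the intuition behind the efficiency and robustness claim in (2).
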
ST-LLM has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:

Method	MVBench	VcgBench Avg	VcgBench Correct	VcgBench Detail	VcgBench Context	VcgBench Temporal	VcgBench Consist	VideoQABench MSVD	VideoQABench MSRVTT	VideoQABench ANet
VideoLLaMA	34.1	1.98	1.96	2.18	2.16	1.82	1.79	51.6	29.6	12.4
LLaMA-Adapter	31.7	2.16	2.03	2.32	2.30	1.98	2.15	54.9	43.8	34.2
VideoChat	35.5	2.29	2.23	2.50	2.53	1.94	2.24	56.3	45.0	26.5
VideoChatGPT	32.7	2.38	2.40	2.52	2.62	1.98	2.37	64.9	49.3	35.7
MovieChat	-	2.67	2.76	2.93	3.01	2.24	2.42	74.2	52.7	45.7
Vista-LLaMA	-	2.57	2.44	2.64	3.18	2.26	2.31	65.3	60.5	48.3
LLaMA-VID	-	2.89	2.96	3.00	3.53	2.46	2.51	69.7	57.7	47.4
Chat-UniVi	-	2.99	2.89	2.91	3.46	2.89	2.81	65.0	54.6	45.8
VideoChat2	51.1	2.98	3.02	2.88	3.51	2.66	2.81	70.0	54.1	49.1
ST-LLM	54.9	3.15	3.23	3.05	3.74	2.93	2.81	74.6	63.2	50.9
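
A quick sanity check on the table: for the ST-LLM and VideoChat2 rows, the VcgBench Avg value equals the mean of the five sub-scores, and the MVBench margin between the two models follows by subtraction. The sketch below reproduces that arithmetic with values copied from those two rows. The names `ROWS`, `vcg_average`, and `mvbench_margin` are illustrative only.

```python
from statistics import mean

# Values copied from the ST-LLM and VideoChat2 rows of the table above.
# Sub-score order: Correct, Detail, Context, Temporal, Consist.
ROWS = {
    "VideoChat2": {"mvbench": 51.1, "avg": 2.98,
                   "subs": [3.02, 2.88, 3.51, 2.66, 2.81]},
    "ST-LLM": {"mvbench": 54.9, "avg": 3.15,
               "subs": [3.23, 3.05, 3.74, 2.93, 2.81]},
}


def vcg_average(subs):
    """Mean of the five VcgBench sub-scores, rounded to two decimals."""
    return round(mean(subs), 2)


def mvbench_margin(model, baseline):
    """MVBench difference between two rows, rounded to one decimal."""
    return round(ROWS[model]["mvbench"] - ROWS[baseline]["mvbench"], 1)


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, vcg_average(row["subs"]), "reported:", row["avg"])
    print("MVBench margin:", mvbench_margin("ST-LLM", "VideoChat2"))
```

Both recomputed averages agree with the reported Avg values, and the MVBench margin of ST-LLM over VideoChat2 comes out to 3.8 points.
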

Demo 🤗

Please download the conversation weights from here and follow the instructions in the Installation section first. Then run the Gradio demo:

CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight

We have also prepared a local script that is easy to modify: demo.py
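
For scripted launches, the Gradio command can also be assembled from Python. The sketch below only mirrors the shell line in this section (the `CUDA_VISIBLE_DEVICES` variable, `demo_gradio.py`, and the `--ckpt-path` flag). The helper names `build_demo_command` and `launch_demo` and their parameters are hypothetical, no other flags of the demo script are assumed, and the current interpreter is used in place of `python3`.

```python
import os
import subprocess
import sys


def build_demo_command(ckpt_path, gpu_id=0, script="demo_gradio.py"):
    """Return (argv, env) equivalent to the shell command for the demo."""
    # Copy the current environment and pin the visible GPU.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # sys.executable is the interpreter running this script (an assumption;
    # the README itself calls `python3`).
    argv = [sys.executable, script, "--ckpt-path", str(ckpt_path)]
    return argv, env


def launch_demo(ckpt_path, gpu_id=0):
    """Start the demo and raise CalledProcessError if it exits with an error."""
    argv, env = build_demo_command(ckpt_path, gpu_id)
    return subprocess.run(argv, env=env, check=True)
```

Calling `launch_demo("/path/to/STLLM_conversation_weight")` from the repository root would then correspond to the command shown earlier in this section.
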

Examples 👀

Video Description: for difficult videos with complex scene changes, ST-LLM can accurately describe all of the content.

Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.

Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.

Installation 🛠️

Clone our repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
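
After activating the environment, a quick check that the interpreter matches the Python 3.10 version requested in the `conda create` line can catch a wrong environment early. This is an optional sketch. The helper name `is_expected_python` is hypothetical and not part of the repository.

```python
import sys

# Taken from `conda create --name stllm python=3.10` above.
EXPECTED = (3, 10)


def is_expected_python(version_info=None, expected=EXPECTED):
    """Return True if the major.minor version matches the expected one."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    found = ".".join(map(str, sys.version_info[:2]))
    status = "OK" if is_expected_python() else "mismatch"
    print(f"Python {found}: {status} (expected 3.10)")
```
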

Training & Validation 📊

Instructions for data preparation, training, and evaluation can be found in trainval.md.

Acknowledgement 👍

Video-ChatGPT and MVBench: great job contributing video LLM benchmarks.
InstructBLIP and MiniGPT4: the codebase and the basic image LLM we built upon.

Citation ✏️

If you find the code and paper useful for your research, please consider starring this repo and citing our papers:

@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}

@article{liu2024stllm,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}
