A dataset of 1.9M annotations is available in JSON format. Because the full data is very large, we provide only the JSON files here. To obtain the corresponding images and videos, please follow the instructions below.
For image datasets, we used M3IT and cleaned up lower-quality data by:
- Correcting typos: Most sentences with incorrect punctuation were fixed.
- Rephrasing incorrect answers: Some responses generated by ChatGPT, such as "Sorry, ...", were incorrect. These were rephrased using GPT-4.
The image datasets we used can be downloaded from M3IT.
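As a rough illustration of the answer clean-up described above, the sketch below flags responses that open with an apology so they can be sent for rephrasing. It is not the procedure actually used: the regular expression, the `answer` field name, and the helper names `needs_rephrasing` and `split_by_quality` are all assumptions made for this example.

```python
import re

# Hypothetical pattern: answers that open with an apology or refusal.
REFUSAL_PATTERN = re.compile(
    r"^\s*(sorry|i'm sorry|i am sorry|i apologize)\b", re.IGNORECASE
)


def needs_rephrasing(answer: str) -> bool:
    """Return True if the answer looks like a refusal rather than a real response."""
    return bool(REFUSAL_PATTERN.match(answer))


def split_by_quality(samples):
    """Split samples into (clean, flagged) lists based on the assumed 'answer' field."""
    clean, flagged = [], []
    for sample in samples:
        if needs_rephrasing(sample["answer"]):
            flagged.append(sample)
        else:
            clean.append(sample)
    return clean, flagged
```

Flagged samples would then be passed to whatever rephrasing step is in use (GPT-4 in our case); that step is not shown here.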
We treated video datasets differently. Please download the original videos from the provided links:
- VideoChat: Based on InternVid, we created additional instruction data and used GPT-4 to condense the existing data.
- VideoChatGPT: The original caption data was converted into conversation data based on the same VideoIDs.
- Kinetics-710 & SthSthV2: Option candidates were generated from UMT top-20 predictions.
- NExTQA: Typos in the original sentences were corrected.
- CLEVRER: For single-option multiple-choice QAs, we used only those concerning color/material/shape. For multi-option multiple-choice QAs, we utilized all the data.
- WebVid: Non-overlapping data was selected for captioning and QA.
- YouCook2: Original videos were truncated based on the official dense captions.
- TextVR: All data was used without modifications.
- TGIF: Only TGIF$_{frame}$ and TGIF$_{Transition}$ subsets were considered.
- EgoQA: Some egocentric QAs were generated from Ego4D data.
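For Kinetics-710 and SthSthV2, the option candidates come from a classifier's top-20 predictions. The sketch below shows one plausible way to turn such a ranked list into a multiple-choice question. How the distractors were actually chosen from the top-20 list is not specified here, so the random sampling, the number of options, and the function name `build_options` are assumptions.

```python
import random


def build_options(top_k_labels, ground_truth, num_options=4, seed=0):
    """Build a shuffled multiple-choice option list from ranked predictions.

    top_k_labels: labels ranked by model confidence (e.g. top-20 predictions).
    ground_truth: the correct label, always included in the options.
    Returns (options, index_of_correct_option).
    """
    rng = random.Random(seed)
    # Distractors: ranked labels other than the correct one, deduplicated in order.
    distractors = [
        label for label in dict.fromkeys(top_k_labels) if label != ground_truth
    ]
    if len(distractors) < num_options - 1:
        raise ValueError("not enough distinct distractors")
    # Assumption: distractors are sampled uniformly from the top-k list.
    chosen = rng.sample(distractors, num_options - 1)
    options = chosen + [ground_truth]
    rng.shuffle(options)
    return options, options.index(ground_truth)
```

Drawing distractors from high-confidence predictions tends to yield options that are visually or semantically close to the correct label, which makes the resulting question harder than one built from random class names.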
For all datasets, task instructions were automatically generated using GPT-4.
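As one illustration of the per-dataset filtering listed above, the CLEVRER rule (keep every multi-option QA, but keep single-option QAs only when they concern color, material, or shape) can be sketched as follows. The field names `qa_type` and `question`, the values `single_option` and `multi_option`, and the keyword-matching heuristic are hypothetical; the actual annotation schema may differ.

```python
# Attributes that single-option questions must concern in order to be kept.
KEPT_ATTRIBUTES = ("color", "material", "shape")


def keep_clevrer_sample(sample: dict) -> bool:
    """Return True if a CLEVRER QA sample passes the filtering rule.

    Assumes hypothetical fields:
      sample["qa_type"]  -> "single_option" or "multi_option"
      sample["question"] -> the question text
    """
    if sample["qa_type"] == "multi_option":
        # All multi-option QAs are kept.
        return True
    # Assumption: a simple keyword match identifies the question's topic.
    question = sample["question"].lower()
    return any(attr in question for attr in KEPT_ATTRIBUTES)


def filter_clevrer(samples):
    """Keep only the samples that satisfy keep_clevrer_sample."""
    return [s for s in samples if keep_clevrer_sample(s)]
```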