LLaVA-Video-178K
Source: ModelScope community (魔搭社区). Updated 2026-05-17; indexed 2024-10-12.
Download link: https://modelscope.cn/datasets/lmms-lab/LLaVA-Video-178K
# Dataset Card for LLaVA-Video-178K
## Dataset Description
- **Curated by:** Yuanhan Zhang, Jinming Wu, Wei Li
- **Language(s) (NLP):** English, Chinese
- **License:** Apache License 2.0
## Uses
This dataset is used to train the LLaVA-Video model. We allow its use only for academic research and educational purposes. For data generated with OpenAI GPT-4, we recommend that users check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/).
### Data Sources
For the training of LLaVA-Video, we utilized video-language data from five primary sources:
- **LLaVA-Video-178K**: This dataset includes **178,510** caption entries, 960,792 open-ended QA (question and answer) items, and 196,198 multiple-choice QA items. These data were newly annotated for this project.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_academic_v0_1 and LLaVA-Video-178K/XXX_youtube_v0_1.
- **NeXT-QA**: Comprises 17,090 open-ended QA items and 17,024 multiple-choice QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_nextqa.
- **ActivityNetQA**: Includes 23,530 open-ended QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_activitynetqa.
- **PerceptionTest**: Includes 1,803 open-ended QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_perceptiontest.
- **LLaVA-Hound**: Contains 240,000 open-ended QA items and 15,000 caption entries.
  - The video data and annotations are available at the following locations:
    - Video data: [train_300k](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k)
    - Annotation data: LLaVA-Video-178K/llava_hound
    - The loading function is specified here: [function](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/7125e3654d88063cb467ed242db76f1e2b184d4c/llava/train/train.py#L1162)
The **LLaVA-Video-178K** dataset is the only contribution from this repository; we provide additional datasets for reproducing LLaVA-Video.
- **Project Page:** [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/).
- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2410.02713)
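The per-source figures listed above can be tallied to see the overall composition of the training mix. The sketch below is only arithmetic over the numbers stated in this card. The dictionary layout and all names (`SOURCES`, `totals_by_type`, the item-type keys) are illustrative and not part of the dataset, and the totals assume that no items are shared between sources.

```python
# Item counts per source, copied from the Data Sources list in this card.
# A missing key means the card reports no items of that type for the source.
SOURCES = {
    "LLaVA-Video-178K": {
        "caption": 178_510,
        "open_ended_qa": 960_792,
        "multiple_choice_qa": 196_198,
    },
    "NeXT-QA": {"open_ended_qa": 17_090, "multiple_choice_qa": 17_024},
    "ActivityNetQA": {"open_ended_qa": 23_530},
    "PerceptionTest": {"open_ended_qa": 1_803},
    "LLaVA-Hound": {"open_ended_qa": 240_000, "caption": 15_000},
}


def totals_by_type(sources):
    """Sum the counts of each item type across all sources."""
    totals = {}
    for counts in sources.values():
        for item_type, n in counts.items():
            totals[item_type] = totals.get(item_type, 0) + n
    return totals


TOTALS = totals_by_type(SOURCES)
GRAND_TOTAL = sum(TOTALS.values())

if __name__ == "__main__":
    for item_type, n in TOTALS.items():
        print(f"{item_type}: {n:,}")
    print(f"all items: {GRAND_TOTAL:,}")
```

Under these assumptions the mix comes to 193,510 caption entries, 1,243,215 open-ended QA items, and 213,222 multiple-choice QA items, of which the newly annotated LLaVA-Video-178K data makes up about 81%.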
### Annotation Pipeline
The following directories are provided for generating captions and QA data:
- **Captions**: `LLaVA-Video-178K/gpt4o_caption_prompt`
- **QA**: `LLaVA-Video-178K/gpt4o_qa_prompt`
### The subset used in LLaVA-OneVision
The video data used in LLaVA-OneVision consists of the captions and open-ended QA items in the [0_30_s_academic_v0_1 split](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/tree/main/0_30_s_academic_v0_1), together with the 240,000 open-ended QA items and 15,000 caption entries from LLaVA-Hound.
- [**0_30_s_academic_v0_1 caption**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- [**0_30_s_academic_v0_1 open-ended QA**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- **LLaVA-Hound**: Same as above.
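The links above are Hugging Face `blob` URLs, which open a web page for the file rather than the file itself. Below is a minimal sketch of turning such a link into a direct-download URL, assuming the usual Hugging Face convention that replacing `/blob/` with `/resolve/` in the path yields the raw file. The helper name `blob_to_resolve_url` is hypothetical and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_resolve_url(blob_url: str) -> str:
    """Rewrite a Hugging Face 'blob' page URL into a 'resolve' download URL.

    Assumption: the Hub serves the raw file at the same path with
    '/blob/' replaced by '/resolve/'.
    """
    parts = urlsplit(blob_url)
    if "/blob/" not in parts.path:
        raise ValueError(f"not a blob URL: {blob_url}")
    # Replace only the first occurrence so file names containing 'blob' survive.
    new_path = parts.path.replace("/blob/", "/resolve/", 1)
    return urlunsplit(parts._replace(path=new_path))


# Caption file link taken verbatim from this card.
CAPTION_BLOB_URL = (
    "https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/"
    "0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json"
)
CAPTION_RAW_URL = blob_to_resolve_url(CAPTION_BLOB_URL)

if __name__ == "__main__":
    print(CAPTION_RAW_URL)
```

The resulting URL can be passed to any HTTP client. The `hf_hub_download` function from the `huggingface_hub` package, called with `repo_type="dataset"`, is an alternative that also caches the file locally.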
## Citation
```bibtex
@misc{zhang2024videoinstructiontuningsynthetic,
title={Video Instruction Tuning With Synthetic Data},
author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2410.02713},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02713},
}
```
## Dataset Card Contact
[Yuanhan Zhang](https://zhangyuanhan-ai.github.io/)
[Jinming Wu](https://scholar.google.com/citations?user=eh-XJIoAAAAJ&hl=zh-CN)
[Wei Li](https://scholar.google.com/citations?user=q8ZrKVIAAAAJ&hl=zh-CN)
Provider: maas
Created: 2024-10-07
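Background overview
LLaVA-Video-178K is a large-scale multimodal video-language dataset containing 178,510 captions, 960,792 open-ended QA items, and 196,198 multiple-choice QA items, built for training the LLaVA-Video model. The data is primarily in English and Chinese and is released under the Apache License 2.0 for academic research and educational use only. The repository also bundles several other video data sources to support reproducing the model.
This overview was collected and summarized by 遇见数据集.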



