LLaVA-Video-178K

Name: LLaVA-Video-178K
Creator: maas
Published: 2026-05-17 01:09:09
License: No description available

ModelScope Community. Updated: 2026-05-17. Indexed: 2024-10-12.

Download link:

https://modelscope.cn/datasets/lmms-lab/LLaVA-Video-178K

Resource description:

# Dataset Card for LLaVA-Video-178K

## Dataset Description

- **Curated by:** Yuanhan Zhang, Jinming Wu, Wei Li
- **Language(s) (NLP):** English, Chinese
- **License:** Apache License 2.0

## Uses

This dataset is used for training the LLaVA-Video model. We only allow the use of this dataset for academic research and educational purposes. For OpenAI GPT-4 generated data, we recommend that users check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/).

### Data Sources

For the training of LLaVA-Video, we utilized video-language data from five primary sources:

- **LLaVA-Video-178K**: This dataset includes **178,510** caption entries, 960,792 open-ended QA (question and answer) items, and 196,198 multiple-choice QA items. These data were newly annotated for this project.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_academic_v0_1 and LLaVA-Video-178K/XXX_youtube_v0_1.
- **NeXT-QA**: Comprises 17,090 open-ended QA items and 17,024 multiple-choice QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_nextqa.
- **ActivityNetQA**: Includes 23,530 open-ended QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_activitynetqa.
- **PerceptionTest**: Includes 1,803 open-ended QA items.
  - We include this dataset in this repository: LLaVA-Video-178K/XXX_perceptiontest.
- **LLaVA-Hound**: Contains 240,000 open-ended QA items and 15,000 caption entries.
  - The video data and annotations are available at the following locations:
    - Video data: [train_300k](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k)
    - Annotation data: LLaVA-Video-178K/llava_hound
    - The loading function is specified here: [function](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/7125e3654d88063cb467ed242db76f1e2b184d4c/llava/train/train.py#L1162)

The **LLaVA-Video-178K** dataset is the only contribution from this repository; we provide the additional datasets for reproducing LLaVA-Video.
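
As a quick sanity check on the figures listed under Data Sources, the per-source counts can be tallied by annotation type. The sketch below is illustrative only: the dictionary keys and the helper names `totals_by_type` and `total_for_source` are hypothetical labels chosen here, not identifiers from the repository; the numbers themselves are copied from the list above.

```python
from collections import Counter

# Item counts copied from the Data Sources list.
# The key names ("caption", "open_ended_qa", ...) are illustrative labels only.
SOURCES = {
    "LLaVA-Video-178K": {
        "caption": 178_510,
        "open_ended_qa": 960_792,
        "multiple_choice_qa": 196_198,
    },
    "NeXT-QA": {"open_ended_qa": 17_090, "multiple_choice_qa": 17_024},
    "ActivityNetQA": {"open_ended_qa": 23_530},
    "PerceptionTest": {"open_ended_qa": 1_803},
    "LLaVA-Hound": {"open_ended_qa": 240_000, "caption": 15_000},
}


def totals_by_type(sources):
    """Sum the item counts of every source, grouped by annotation type."""
    totals = Counter()
    for counts in sources.values():
        totals.update(counts)  # Counter.update adds counts key by key
    return dict(totals)


def total_for_source(sources, name):
    """Sum all annotation items listed for a single source."""
    return sum(sources[name].values())
```

Under these counts, `total_for_source(SOURCES, "LLaVA-Video-178K")` gives 1,335,500 newly annotated items, and `totals_by_type(SOURCES)` gives 193,510 captions, 1,243,215 open-ended QA items, and 213,222 multiple-choice QA items across all five sources.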

- **Project Page:** [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/)
- **Paper:** For more details, please check our [paper](https://arxiv.org/abs/2410.02713).

### Annotation Pipeline

The following directories are provided for generating captions and QA data:

- **Captions**: `LLaVA-Video-178K/gpt4o_caption_prompt`
- **QA**: `LLaVA-Video-178K/gpt4o_qa_prompt`

### The subset used in LLaVA-OneVision

As part of the video data for LLaVA-OneVision, we have included the captions and open-ended questions in the [0_30_s_academic_v0_1 split](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/tree/main/0_30_s_academic_v0_1), along with the 240,000 open-ended QA items and 15,000 caption entries in LLaVA-Hound.

- [**0_30_s_academic_v0_1 caption**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- [**0_30_s_academic_v0_1 open-ended QA**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- **LLaVA-Hound**: Same as above.

## Citation

```bibtex
@misc{zhang2024videoinstructiontuningsynthetic,
    title={Video Instruction Tuning With Synthetic Data},
    author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
    year={2024},
    eprint={2410.02713},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.02713},
}
```

## Dataset Card Contact

- [Yuanhan Zhang](https://zhangyuanhan-ai.github.io/)
- [Jinming Wu](https://scholar.google.com/citations?user=eh-XJIoAAAAJ&hl=zh-CN)
- [Wei Li](https://scholar.google.com/citations?user=q8ZrKVIAAAAJ&hl=zh-CN)
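
The annotation file links given for the LLaVA-OneVision subset are browser URLs. One possible way to fetch such a file programmatically is to split the URL into a repository id, a revision, and a path, and pass those to `hf_hub_download` from the `huggingface_hub` package. This is a minimal sketch, not the authors' loading code: the helper names `parse_dataset_blob_url` and `download_annotation` are hypothetical, and the parser assumes the URL layout `datasets/<owner>/<name>/blob/<revision>/<path>` with a revision that contains no slashes.

```python
from urllib.parse import urlparse

# Caption annotation link copied from the dataset card.
CAPTION_URL = (
    "https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/"
    "0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json"
)


def parse_dataset_blob_url(url):
    """Split a dataset file URL into (repo_id, revision, filename).

    Assumed layout: datasets/<owner>/<name>/blob/<revision>/<path inside repo>.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 6 or parts[0] != "datasets" or parts[3] not in ("blob", "resolve"):
        raise ValueError(f"not a dataset file URL: {url}")
    repo_id = "/".join(parts[1:3])
    revision = parts[4]
    filename = "/".join(parts[5:])
    return repo_id, revision, filename


def download_annotation(url):
    """Download one annotation file and return its local path.

    Requires the third-party `huggingface_hub` package and network access.
    """
    # Imported lazily so the URL parsing above works without the package.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = parse_dataset_blob_url(url)
    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        revision=revision,
        repo_type="dataset",
    )
```

Calling `download_annotation(CAPTION_URL)` would then return a local cache path to the JSON file. The structure of the records inside that file is not described in this card, so inspect a few entries before relying on any field names.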

Provider:

maas

Created:

2024-10-07

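Collected summary

Dataset introduction

Background and challenges

Background overview

LLaVA-Video-178K is a large-scale multimodal video-language dataset comprising 178,510 captions, 960,792 open-ended QA items, and 196,198 multiple-choice QA items, built for training the LLaVA-Video model. The data is primarily in English and Chinese and is released under the Apache License 2.0 for academic research and educational use only. The repository also bundles several other video data sources to support reproduction of the model.

The above content was collected and summarized by 遇见数据集.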
