
LLaVA-Video-178K

ModelScope community, updated 2026-05-17, indexed 2024-10-12
Download link:
https://modelscope.cn/datasets/lmms-lab/LLaVA-Video-178K
Resource description:
# Dataset Card for LLaVA-Video-178K

## Dataset Description

- **Curated by:** Yuanhan Zhang, Jinming Wu, Wei Li
- **Language(s) (NLP):** English, Chinese
- **License:** Apache License 2.0

## Uses

This dataset is used to train the LLaVA-Video model. It may be used only for academic research and educational purposes. For data generated with OpenAI GPT-4, we recommend that users check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/).

### Data Sources

To train LLaVA-Video, we used video-language data from five primary sources:

- **LLaVA-Video-178K**: Includes **178,510** caption entries, 960,792 open-ended QA (question and answer) items, and 196,198 multiple-choice QA items. These data were newly annotated for this project.
  - Included in this repository: `LLaVA-Video-178K/XXX_academic_v0_1` and `LLaVA-Video-178K/XXX_youtube_v0_1`.
- **NeXT-QA**: Comprises 17,090 open-ended QA items and 17,024 multiple-choice QA items.
  - Included in this repository: `LLaVA-Video-178K/XXX_nextqa`.
- **ActivityNetQA**: Includes 23,530 open-ended QA items.
  - Included in this repository: `LLaVA-Video-178K/XXX_activitynetqa`.
- **PerceptionTest**: Includes 1,803 open-ended QA items.
  - Included in this repository: `LLaVA-Video-178K/XXX_perceptiontest`.
- **LLaVA-Hound**: Contains 240,000 open-ended QA items and 15,000 caption entries.
  - The video data and annotations are available at the following locations:
    - Video data: [train_300k](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k)
    - Annotation data: `LLaVA-Video-178K/llava_hound`
    - The loading function is specified here: [function](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/7125e3654d88063cb467ed242db76f1e2b184d4c/llava/train/train.py#L1162)

The **LLaVA-Video-178K** dataset is the only contribution from this repository; the additional datasets are provided for reproducing LLaVA-Video.

- **Project Page:** [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/)
- **Paper:** For more details, please see our [paper](https://arxiv.org/abs/2410.02713).

### Annotation Pipeline

The following directories are provided for generating captions and QA data:

- **Captions**: `LLaVA-Video-178K/gpt4o_caption_prompt`
- **QA**: `LLaVA-Video-178K/gpt4o_qa_prompt`

### The Subset Used in LLaVA-OneVision

The video data for LLaVA-OneVision includes the captions and open-ended questions in the [0_30_s_academic_v0_1 split](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/tree/main/0_30_s_academic_v0_1), along with the 240,000 open-ended QA items and 15,000 caption entries from LLaVA-Hound.

- [**0_30_s_academic_v0_1 caption**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- [**0_30_s_academic_v0_1 open-ended QA**](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/blob/main/0_30_s_academic_v0_1/0_30_s_academic_v0_1_cap_processed.json)
- **LLaVA-Hound**: Same as above.

## Citation

```bibtex
@misc{zhang2024videoinstructiontuningsynthetic,
  title={Video Instruction Tuning With Synthetic Data},
  author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
  year={2024},
  eprint={2410.02713},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.02713},
}
```

## Dataset Card Contact

[Yuanhan Zhang](https://zhangyuanhan-ai.github.io/)

[Jinming Wu](https://scholar.google.com/citations?user=eh-XJIoAAAAJ&hl=zh-CN)

[Wei Li](https://scholar.google.com/citations?user=q8ZrKVIAAAAJ&hl=zh-CN)
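
The annotation file links in the card follow a single URL pattern on the Hugging Face Hub. As a minimal sketch, the helper below rebuilds that pattern. The name `annotation_url` and its parameters are hypothetical and introduced only for illustration; the only subset and file name confirmed by the card are `0_30_s_academic_v0_1` and its `_cap_processed.json` file. The `resolve` variant reflects the Hub's general convention for raw file URLs and is an assumption here, not something the card states.

```python
# Dataset repository named in the card's Hugging Face links.
REPO_ID = "lmms-lab/LLaVA-Video-178K"


def annotation_url(subset: str, filename: str, raw: bool = False) -> str:
    """Build a Hub URL for a file inside one subset directory.

    Hypothetical helper: `blob` matches the links in the card,
    `resolve` is the Hub's usual raw-file path (assumed, not from the card).
    """
    view = "resolve" if raw else "blob"
    return f"https://huggingface.co/datasets/{REPO_ID}/{view}/main/{subset}/{filename}"


# The only subset/file pair that the card itself links to.
subset = "0_30_s_academic_v0_1"
caption_url = annotation_url(subset, f"{subset}_cap_processed.json")
print(caption_url)
```

Other subset directories are written in the card with an `XXX` placeholder, so their prefixes would need to be read from the repository listing rather than guessed.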

Provider:
maas
Created:
2024-10-07
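Background overview
LLaVA-Video-178K is a large-scale multimodal video-language dataset containing 178,510 captions, 960,792 open-ended QA items, and 196,198 multiple-choice QA items, built for training the LLaVA-Video model. The data is in English and Chinese and is released under the Apache License 2.0 for academic research and educational use only. The repository also bundles several other video data sources to support reproducing the model.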
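
The counts in the overview cover only the newly annotated data. As a rough sketch, the snippet below tallies the per-type totals across all five training sources, using the figures listed in the dataset card's Data Sources section. The dictionary keys and the function name `totals_by_type` are illustrative labels, not field names from the dataset.

```python
from collections import Counter

# Item counts per source, copied from the Data Sources list in the card.
# Key names are illustrative labels, not fields from the annotation files.
SOURCES = {
    "LLaVA-Video-178K": {"caption": 178_510, "open_ended_qa": 960_792, "multiple_choice_qa": 196_198},
    "NeXT-QA": {"open_ended_qa": 17_090, "multiple_choice_qa": 17_024},
    "ActivityNetQA": {"open_ended_qa": 23_530},
    "PerceptionTest": {"open_ended_qa": 1_803},
    "LLaVA-Hound": {"open_ended_qa": 240_000, "caption": 15_000},
}


def totals_by_type(sources: dict) -> dict:
    """Sum the counts of each annotation type over all sources."""
    totals = Counter()
    for counts in sources.values():
        totals.update(counts)  # adds each type's count to the running total
    return dict(totals)


totals = totals_by_type(SOURCES)
grand_total = sum(totals.values())
print(totals)
print(grand_total)
```

With the listed figures this gives 193,510 captions, 1,243,215 open-ended QA items, and 213,222 multiple-choice QA items, for 1,649,947 items in total.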