LLaVA-OneVision-Data

Name: LLaVA-OneVision-Data
Creator: maas
Published: 2026-05-07 21:23:06
License: 暂无描述

魔搭社区2026-05-07 更新2025-05-31 收录

下载链接：

https://modelscope.cn/datasets/lmms-lab/LLaVA-OneVision-Data

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for LLaVA-OneVision **[2024-09-01]: Uploaded VisualWebInstruct(filtered), it's used in OneVision Stage** > almost all subsets are uploaded with HF's required format and you can use the recommended interface to download them and follow our code below to convert them. > the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders. > You may directly download them from the following url. > https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg In this dataset, we include the data splits used in the both final image stage and one-vision stage. For more details, please check our [paper](arxiv.org/abs/2408.03326) and our [training doc](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data). ## Dataset Description - **Curated by:** Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li, Dong Guo - **Language(s) (NLP):** English, Chinese - **License:** Apache License 2.0 ## Dataset Sources  - **Dataset Collection:** We include a few subsets from existing dataset collection [Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [UReader](https://arxiv.org/abs/2310.05126). Since we only used a few subsets from these datasets, and applied the cleaning and re-annotation process, we uploaded our processed version of these datasets into our own repository and thank the authors for providing the original datasets. - **Other Datasets:** For rest single source dataset, such as AI2D, OKVQA, we cite and link the original sources in our paper. ## Uses This dataset is used for the training of the LLaVA-OneVision model. We only allow the use of this dataset for academic research and education purpose. For OpenAI GPT-4 generated data, we recommend the users to check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/). ## Dataset Structure We expalin the data composition for mid-stage and final-stage at our repo in [**training doc**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data). ### Statistics We provide the statistics of the dataset in the following figures, and refer the audience to check our paper. ![](https://i.postimg.cc/2y989XZJ/WX20240802-145215-2x.png) ![](https://i.postimg.cc/MZ9TGXFD/WX20240802-145226-2x.png) ### Code Guidance To help audience to better understand our dataest, we upload them into Hugging Face Dataset compatible format. During LLaVA-OneVision training, we use the `json` and `image/video` folder to store the data. > the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders. You may directly download them from the following url. > https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg Here we provide the code guidance to convert the dataset into the format of LLaVA-OneVision, and conduct the training of the LLaVA-OneVision model with converted dataset. ```python import os from datasets import load_dataset from tqdm import tqdm import json data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train") image_folder = "<your_image_folder>" converted_data = [] for da in tqdm(data): json_data = {} json_data["id"] = da["id"] if da["image"] is not None: json_data["image"] = f"{da['id']}.jpg" da["image"].save(os.path.join(image_folder, json_data["image"])) json_data["conversations"] = da["conversations"] converted_data.append(json_data) with open("<your_json_file>.json", "w") as f: json.dump(converted_data, f, indent=4, ensure_ascii=False) ``` ## Citation **BibTeX:** [More Information Needed] ## Glossary The dataset collection process is conducted by all of the authors, we thank the Feng Li and Renrui Zhang for providing [LLaVA-M4-Instruct Data](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data) and Yuanhan for providing the [Video datasets](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). After the dataset collection, the cleaning and re-annotation process, including final mixture of the dataset, is conducted by Bo Li and with the great help of Kaichen Zhang. ## Dataset Card Authors The dataset is curated by the following authors: Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li ## Dataset Card Contact [Bo Li](https://brianboli.com/): drluodian@gmail.com [Kaichen Zhang](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)

# LLaVA-OneVision 数据集卡片 **[2024-09-01]：上传了经过过滤的VisualWebInstruct数据集，该数据集将用于OneVision阶段** > 绝大多数子集已按照Hugging Face（HF）要求的格式上传，您可通过推荐的下载接口获取数据集，并按照下文提供的代码进行格式转换。 > > `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载：https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg 本数据集包含用于最终图像阶段与单视觉（OneVision）阶段的数据划分。如需了解更多细节，请查阅我们的[论文](arxiv.org/abs/2408.03326)与[训练文档](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)。 ## 数据集描述 - **整理方：** 李博、张凯宸、张浩、张元瀚、张仁睿、李丰、郭栋 - **自然语言处理所用语言：** 英语、中文 - **授权协议：** Apache License 2.0 ## 数据集来源  - **数据集采集：** 本数据集采集自现有数据集集合[Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)、[Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)与[UReader](https://arxiv.org/abs/2310.05126)的部分子集。由于仅使用了上述数据集的少量子集，并对其进行了清洗与重新标注处理，我们将处理后的数据集版本上传至自有仓库，在此感谢原数据集作者的开源贡献。 - **其他数据集：** 其余单源数据集（如AI2D、OKVQA）的原始来源与引用信息已在我们的论文中给出。 ## 数据集用途本数据集用于训练LLaVA-OneVision模型，仅允许用于学术研究与教育用途。若涉及由OpenAI GPT-4生成的数据，请使用者务必查阅[OpenAI使用政策](https://openai.com/policies/usage-policies/)。 ## 数据集结构我们在[**训练文档**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)中详细说明了数据集在中间阶段与最终阶段的构成。 ### 数据集统计我们通过以下图表展示了本数据集的统计信息，详细内容请参阅我们的论文。 ![](https://i.postimg.cc/2y989XZJ/WX20240802-145215-2x.png) ![](https://i.postimg.cc/MZ9TGXFD/WX20240802-145226-2x.png) ### 代码指南为便于使用者更好地理解本数据集，我们将数据集上传为兼容Hugging Face Dataset的格式。在LLaVA-OneVision模型训练过程中，我们使用JSON文件与图像/视频文件夹存储数据。 > `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载：https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg 下文将提供将数据集转换为LLaVA-OneVision所需格式，并基于转换后的数据集训练LLaVA-OneVision模型的代码指南。 python import os from datasets import load_dataset from tqdm import tqdm import json data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train") image_folder = "<your_image_folder>" converted_data = [] for da in tqdm(data): json_data = {} json_data["id"] = da["id"] if da["image"] is not None: json_data["image"] = f"{da['id']}.jpg" da["image"].save(os.path.join(image_folder, json_data["image"])) json_data["conversations"] = da["conversations"] converted_data.append(json_data) with open("<your_json_file>.json", "w") as f: json.dump(converted_data, f, indent=4, ensure_ascii=False) ## 引用 **BibTeX格式引用：** [待补充更多信息] ## 致谢本数据集的采集工作由全体作者共同完成。在此感谢李丰与张仁睿提供的[LLaVA-M4-Instruct数据集](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data)，以及张元瀚提供的[视频数据集](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)。数据集采集完成后，李博主导了数据清洗与重新标注流程（包括最终的数据集混合工作），并得到了张凯宸的大力协助。 ## 数据集卡片撰写者本数据集由以下作者整理：李博、张凯宸、张浩、张元瀚、张仁睿、李丰 ## 数据集卡片联系方式 [李博](https://brianboli.com/)：drluodian@gmail.com [张凯宸](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)

提供机构：

maas

创建时间：

2024-10-06

搜集汇总

数据集介绍