five

LLaVA-OneVision-Data

收藏
魔搭社区2026-05-07 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/lmms-lab/LLaVA-OneVision-Data
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for LLaVA-OneVision **[2024-09-01]: Uploaded VisualWebInstruct(filtered), it's used in OneVision Stage** > almost all subsets are uploaded with HF's required format and you can use the recommended interface to download them and follow our code below to convert them. > the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders. > You may directly download them from the following url. > https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg In this dataset, we include the data splits used in the both final image stage and one-vision stage. For more details, please check our [paper](arxiv.org/abs/2408.03326) and our [training doc](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data). ## Dataset Description - **Curated by:** Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li, Dong Guo - **Language(s) (NLP):** English, Chinese - **License:** Apache License 2.0 ## Dataset Sources <!-- Provide the basic links for the dataset. --> - **Dataset Collection:** We include a few subsets from existing dataset collection [Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [UReader](https://arxiv.org/abs/2310.05126). Since we only used a few subsets from these datasets, and applied the cleaning and re-annotation process, we uploaded our processed version of these datasets into our own repository and thank the authors for providing the original datasets. - **Other Datasets:** For rest single source dataset, such as AI2D, OKVQA, we cite and link the original sources in our paper. ## Uses This dataset is used for the training of the LLaVA-OneVision model. We only allow the use of this dataset for academic research and education purpose. For OpenAI GPT-4 generated data, we recommend the users to check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/). ## Dataset Structure We expalin the data composition for mid-stage and final-stage at our repo in [**training doc**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data). ### Statistics We provide the statistics of the dataset in the following figures, and refer the audience to check our paper. ![](https://i.postimg.cc/2y989XZJ/WX20240802-145215-2x.png) ![](https://i.postimg.cc/MZ9TGXFD/WX20240802-145226-2x.png) ### Code Guidance To help audience to better understand our dataest, we upload them into Hugging Face Dataset compatible format. During LLaVA-OneVision training, we use the `json` and `image/video` folder to store the data. > the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders. You may directly download them from the following url. > https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg Here we provide the code guidance to convert the dataset into the format of LLaVA-OneVision, and conduct the training of the LLaVA-OneVision model with converted dataset. ```python import os from datasets import load_dataset from tqdm import tqdm import json data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train") image_folder = "<your_image_folder>" converted_data = [] for da in tqdm(data): json_data = {} json_data["id"] = da["id"] if da["image"] is not None: json_data["image"] = f"{da['id']}.jpg" da["image"].save(os.path.join(image_folder, json_data["image"])) json_data["conversations"] = da["conversations"] converted_data.append(json_data) with open("<your_json_file>.json", "w") as f: json.dump(converted_data, f, indent=4, ensure_ascii=False) ``` ## Citation **BibTeX:** [More Information Needed] ## Glossary The dataset collection process is conducted by all of the authors, we thank the Feng Li and Renrui Zhang for providing [LLaVA-M4-Instruct Data](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data) and Yuanhan for providing the [Video datasets](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). After the dataset collection, the cleaning and re-annotation process, including final mixture of the dataset, is conducted by Bo Li and with the great help of Kaichen Zhang. ## Dataset Card Authors The dataset is curated by the following authors: Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li ## Dataset Card Contact [Bo Li](https://brianboli.com/): drluodian@gmail.com [Kaichen Zhang](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)

# LLaVA-OneVision 数据集卡片 **[2024-09-01]:上传了经过过滤的VisualWebInstruct数据集,该数据集将用于OneVision阶段** > 绝大多数子集已按照Hugging Face(HF)要求的格式上传,您可通过推荐的下载接口获取数据集,并按照下文提供的代码进行格式转换。 > > `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载:https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg 本数据集包含用于最终图像阶段与单视觉(OneVision)阶段的数据划分。如需了解更多细节,请查阅我们的[论文](arxiv.org/abs/2408.03326)与[训练文档](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)。 ## 数据集描述 - **整理方:** 李博、张凯宸、张浩、张元瀚、张仁睿、李丰、郭栋 - **自然语言处理所用语言:** 英语、中文 - **授权协议:** Apache License 2.0 ## 数据集来源 <!-- 提供数据集的基础链接 --> - **数据集采集:** 本数据集采集自现有数据集集合[Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)、[Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)与[UReader](https://arxiv.org/abs/2310.05126)的部分子集。由于仅使用了上述数据集的少量子集,并对其进行了清洗与重新标注处理,我们将处理后的数据集版本上传至自有仓库,在此感谢原数据集作者的开源贡献。 - **其他数据集:** 其余单源数据集(如AI2D、OKVQA)的原始来源与引用信息已在我们的论文中给出。 ## 数据集用途 本数据集用于训练LLaVA-OneVision模型,仅允许用于学术研究与教育用途。若涉及由OpenAI GPT-4生成的数据,请使用者务必查阅[OpenAI使用政策](https://openai.com/policies/usage-policies/)。 ## 数据集结构 我们在[**训练文档**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)中详细说明了数据集在中间阶段与最终阶段的构成。 ### 数据集统计 我们通过以下图表展示了本数据集的统计信息,详细内容请参阅我们的论文。 ![](https://i.postimg.cc/2y989XZJ/WX20240802-145215-2x.png) ![](https://i.postimg.cc/MZ9TGXFD/WX20240802-145226-2x.png) ### 代码指南 为便于使用者更好地理解本数据集,我们将数据集上传为兼容Hugging Face Dataset的格式。在LLaVA-OneVision模型训练过程中,我们使用JSON文件与图像/视频文件夹存储数据。 > `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载:https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg 下文将提供将数据集转换为LLaVA-OneVision所需格式,并基于转换后的数据集训练LLaVA-OneVision模型的代码指南。 python import os from datasets import load_dataset from tqdm import tqdm import json data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train") image_folder = "<your_image_folder>" converted_data = [] for da in tqdm(data): json_data = {} json_data["id"] = da["id"] if da["image"] is not None: json_data["image"] = f"{da['id']}.jpg" da["image"].save(os.path.join(image_folder, json_data["image"])) json_data["conversations"] = da["conversations"] converted_data.append(json_data) with open("<your_json_file>.json", "w") as f: json.dump(converted_data, f, indent=4, ensure_ascii=False) ## 引用 **BibTeX格式引用:** [待补充更多信息] ## 致谢 本数据集的采集工作由全体作者共同完成。在此感谢李丰与张仁睿提供的[LLaVA-M4-Instruct数据集](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data),以及张元瀚提供的[视频数据集](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)。数据集采集完成后,李博主导了数据清洗与重新标注流程(包括最终的数据集混合工作),并得到了张凯宸的大力协助。 ## 数据集卡片撰写者 本数据集由以下作者整理:李博、张凯宸、张浩、张元瀚、张仁睿、李丰 ## 数据集卡片联系方式 [李博](https://brianboli.com/):drluodian@gmail.com [张凯宸](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)
提供机构:
maas
创建时间:
2024-10-06
搜集汇总
数据集介绍
main_image_url
背景与挑战
背景概述
LLaVA-OneVision-Data是一个多模态数据集,用于训练LLaVA-OneVision模型,包含经过清洗和重新标注的图像和视频数据,支持中英文,适用于学术研究和教育目的。数据集提供了详细的代码指导和结构说明,方便用户转换和使用。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作