LLaVA-OneVision-Data
收藏魔搭社区2026-05-07 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/lmms-lab/LLaVA-OneVision-Data
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for LLaVA-OneVision
**[2024-09-01]: Uploaded VisualWebInstruct(filtered), it's used in OneVision Stage**
> almost all subsets are uploaded with HF's required format and you can use the recommended interface to download them and follow our code below to convert them.
> the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders.
> You may directly download them from the following url.
> https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg
In this dataset, we include the data splits used in the both final image stage and one-vision stage. For more details, please check our [paper](arxiv.org/abs/2408.03326) and our [training doc](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data).
## Dataset Description
- **Curated by:** Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li, Dong Guo
- **Language(s) (NLP):** English, Chinese
- **License:** Apache License 2.0
## Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Dataset Collection:** We include a few subsets from existing dataset collection [Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [UReader](https://arxiv.org/abs/2310.05126). Since we only used a few subsets from these datasets, and applied the cleaning and re-annotation process, we uploaded our processed version of these datasets into our own repository and thank the authors for providing the original datasets.
- **Other Datasets:** For rest single source dataset, such as AI2D, OKVQA, we cite and link the original sources in our paper.
## Uses
This dataset is used for the training of the LLaVA-OneVision model. We only allow the use of this dataset for academic research and education purpose. For OpenAI GPT-4 generated data, we recommend the users to check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/).
## Dataset Structure
We expalin the data composition for mid-stage and final-stage at our repo in [**training doc**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data).
### Statistics
We provide the statistics of the dataset in the following figures, and refer the audience to check our paper.


### Code Guidance
To help audience to better understand our dataest, we upload them into Hugging Face Dataset compatible format. During LLaVA-OneVision training, we use the `json` and `image/video` folder to store the data.
> the subset of `ureader_kg` and `ureader_qa` are uploaded with the processed jsons and tar.gz of image folders. You may directly download them from the following url.
> https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg
Here we provide the code guidance to convert the dataset into the format of LLaVA-OneVision, and conduct the training of the LLaVA-OneVision model with converted dataset.
```python
import os
from datasets import load_dataset
from tqdm import tqdm
import json
data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train")
image_folder = "<your_image_folder>"
converted_data = []
for da in tqdm(data):
json_data = {}
json_data["id"] = da["id"]
if da["image"] is not None:
json_data["image"] = f"{da['id']}.jpg"
da["image"].save(os.path.join(image_folder, json_data["image"]))
json_data["conversations"] = da["conversations"]
converted_data.append(json_data)
with open("<your_json_file>.json", "w") as f:
json.dump(converted_data, f, indent=4, ensure_ascii=False)
```
## Citation
**BibTeX:**
[More Information Needed]
## Glossary
The dataset collection process is conducted by all of the authors, we thank the Feng Li and Renrui Zhang for providing [LLaVA-M4-Instruct Data](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data) and Yuanhan for providing the [Video datasets](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K).
After the dataset collection, the cleaning and re-annotation process, including final mixture of the dataset, is conducted by Bo Li and with the great help of Kaichen Zhang.
## Dataset Card Authors
The dataset is curated by the following authors:
Bo Li, Kaichen Zhang, Hao Zhang, Yuanhan Zhang, Renrui Zhang, Feng Li
## Dataset Card Contact
[Bo Li](https://brianboli.com/): drluodian@gmail.com
[Kaichen Zhang](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)
# LLaVA-OneVision 数据集卡片
**[2024-09-01]:上传了经过过滤的VisualWebInstruct数据集,该数据集将用于OneVision阶段**
> 绝大多数子集已按照Hugging Face(HF)要求的格式上传,您可通过推荐的下载接口获取数据集,并按照下文提供的代码进行格式转换。
>
> `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载:https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg
本数据集包含用于最终图像阶段与单视觉(OneVision)阶段的数据划分。如需了解更多细节,请查阅我们的[论文](arxiv.org/abs/2408.03326)与[训练文档](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)。
## 数据集描述
- **整理方:** 李博、张凯宸、张浩、张元瀚、张仁睿、李丰、郭栋
- **自然语言处理所用语言:** 英语、中文
- **授权协议:** Apache License 2.0
## 数据集来源
<!-- 提供数据集的基础链接 -->
- **数据集采集:** 本数据集采集自现有数据集集合[Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)、[Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)与[UReader](https://arxiv.org/abs/2310.05126)的部分子集。由于仅使用了上述数据集的少量子集,并对其进行了清洗与重新标注处理,我们将处理后的数据集版本上传至自有仓库,在此感谢原数据集作者的开源贡献。
- **其他数据集:** 其余单源数据集(如AI2D、OKVQA)的原始来源与引用信息已在我们的论文中给出。
## 数据集用途
本数据集用于训练LLaVA-OneVision模型,仅允许用于学术研究与教育用途。若涉及由OpenAI GPT-4生成的数据,请使用者务必查阅[OpenAI使用政策](https://openai.com/policies/usage-policies/)。
## 数据集结构
我们在[**训练文档**](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train#about-the-llava-onevision-data)中详细说明了数据集在中间阶段与最终阶段的构成。
### 数据集统计
我们通过以下图表展示了本数据集的统计信息,详细内容请参阅我们的论文。


### 代码指南
为便于使用者更好地理解本数据集,我们将数据集上传为兼容Hugging Face Dataset的格式。在LLaVA-OneVision模型训练过程中,我们使用JSON文件与图像/视频文件夹存储数据。
> `ureader_kg`与`ureader_qa`子集已上传处理完成的JSON文件及图像文件夹的tar.gz压缩包。您可通过以下链接直接下载:https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/tree/main/ureader_kg
下文将提供将数据集转换为LLaVA-OneVision所需格式,并基于转换后的数据集训练LLaVA-OneVision模型的代码指南。
python
import os
from datasets import load_dataset
from tqdm import tqdm
import json
data = load_dataset("lmms-lab/LLaVA-OneVision-Data", split="train")
image_folder = "<your_image_folder>"
converted_data = []
for da in tqdm(data):
json_data = {}
json_data["id"] = da["id"]
if da["image"] is not None:
json_data["image"] = f"{da['id']}.jpg"
da["image"].save(os.path.join(image_folder, json_data["image"]))
json_data["conversations"] = da["conversations"]
converted_data.append(json_data)
with open("<your_json_file>.json", "w") as f:
json.dump(converted_data, f, indent=4, ensure_ascii=False)
## 引用
**BibTeX格式引用:** [待补充更多信息]
## 致谢
本数据集的采集工作由全体作者共同完成。在此感谢李丰与张仁睿提供的[LLaVA-M4-Instruct数据集](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data),以及张元瀚提供的[视频数据集](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)。数据集采集完成后,李博主导了数据清洗与重新标注流程(包括最终的数据集混合工作),并得到了张凯宸的大力协助。
## 数据集卡片撰写者
本数据集由以下作者整理:李博、张凯宸、张浩、张元瀚、张仁睿、李丰
## 数据集卡片联系方式
[李博](https://brianboli.com/):drluodian@gmail.com
[张凯宸](https://www.linkedin.com/in/kaichen-zhang-014b17219/?originalSubdomain=sg)
提供机构:
maas
创建时间:
2024-10-06
搜集汇总
数据集介绍

背景与挑战
背景概述
LLaVA-OneVision-Data是一个多模态数据集,用于训练LLaVA-OneVision模型,包含经过清洗和重新标注的图像和视频数据,支持中英文,适用于学术研究和教育目的。数据集提供了详细的代码指导和结构说明,方便用户转换和使用。
以上内容由遇见数据集搜集并总结生成



