LLaVA-NeXT-Data
Hosted on ModelScope (魔搭社区). Updated 2025-12-11; indexed 2024-10-12.
Download link:
https://modelscope.cn/datasets/lmms-lab/LLaVA-NeXT-Data
# Dataset Card for LLaVA-NeXT
This card provides the full details of the LLaVA-NeXT dataset. It includes the data used in the instruction-tuning stage of [LLaVA-NeXT](https://llava-vl.github.io/blog/2024-01-30-llava-next/) and [LLaVA-NeXT(stronger)](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/).
Aug 30, 2024: We updated the dataset with a [raw format](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data/tree/main/llava_next_raw_format) version (decompress it to get the JSON file and the images in a structured folder). You can download it directly if you are familiar with the LLaVA data format.
## Dataset Description
- **Curated by:** Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee
- **Language(s) (NLP):** English, Chinese
- **License:** Apache License 2.0
## Dataset Sources
Compared to the instruction data mixture for LLaVA-1.5, the following changes were made:
- **High-quality User Instruct Data.** Our definition of high-quality visual instruction-following data hinges on two principal criteria. First, the task instructions must be diverse, so that they adequately represent the broad spectrum of user intents likely to be encountered in real-world scenarios, particularly during the model’s deployment phase. Second, the responses must be of superior quality, with the objective of soliciting favorable user feedback. To achieve this, we consider two data sources: (1) Existing GPT-V data: LAION-GPT-V and ShareGPT-4V. (2) To further facilitate better visual conversation in more scenarios, we collect a small 15K visual instruction tuning dataset covering different applications. The instructions and images come from the LLaVA demo and are real-world user requests. We carefully filter out samples that may raise privacy concerns or are potentially harmful, and generate the responses with GPT-4V.
- **Multimodal Document/Chart Data.** (1) We remove TextCaps from our training data because we realized that TextCaps uses the same set of training images as TextVQA. Removing it lets us better understand the zero-shot OCR capability of our model when evaluating on TextVQA during development. To maintain and further improve our model’s OCR capability, we replace TextCaps with DocVQA and SynDog-EN. (2) Motivated by Qwen-VL-7B-Chat, we further add ChartQA, DVQA, and AI2D for better chart and diagram understanding.
Due to license issues and policy concerns, the **15k instruction data from user data** were not released, and the total data mixture in this repo contains around 779k rows.
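The overlap concern in (1) amounts to checking whether any training image also appears in an evaluation benchmark. A minimal sketch of such a check follows. The function name and the image identifiers are hypothetical placeholders, and this is not presented as the procedure the authors used.

```python
def overlapping_images(train_image_ids, eval_image_ids):
    """Return the sorted image IDs that appear in both collections."""
    return sorted(set(train_image_ids) & set(eval_image_ids))


# Hypothetical IDs, for illustration only.
train_ids = ["img_001", "img_002", "img_003"]
eval_ids = ["img_003", "img_004"]
shared = overlapping_images(train_ids, eval_ids)
print(shared)  # ['img_003']
```

A non-empty result means the evaluation is not strictly zero-shot with respect to those images.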
## Uses
This dataset is used for training the LLaVA-NeXT model. We only allow its use for academic research and educational purposes. For data generated by OpenAI GPT-4, we recommend that users check the [OpenAI Usage Policy](https://openai.com/policies/usage-policies/).
### Code Guidance
To help readers better understand the dataset, we uploaded it in a Hugging Face Datasets-compatible format. During LLaVA-NeXT training, we store the data as a `json` file and an `image` folder.
The code below converts the dataset into the LLaVA-NeXT format, so that the LLaVA-NeXT model can be trained on the converted dataset.
```python
import json
import os

from datasets import load_dataset
from tqdm import tqdm

data = load_dataset("lmms-lab/LLaVA-NeXT-Data", split="train")

image_folder = "<your_image_folder>"
os.makedirs(image_folder, exist_ok=True)  # save() fails if the folder is missing

converted_data = []
for da in tqdm(data):
    json_data = {"id": da["id"]}
    # Text-only samples carry no image, so the "image" key is omitted for them.
    if da["image"] is not None:
        json_data["image"] = f"{da['id']}.jpg"
        image_path = os.path.join(image_folder, json_data["image"])
        # JPEG cannot store alpha or palette modes, so normalize to RGB first.
        da["image"].convert("RGB").save(image_path)
    json_data["conversations"] = da["conversations"]
    converted_data.append(json_data)

with open("<your_json_file>.json", "w", encoding="utf-8") as f:
    json.dump(converted_data, f, indent=4, ensure_ascii=False)
```
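After conversion, it can help to sanity-check the output before training. The sketch below is not part of the LLaVA-NeXT codebase: the function name `find_invalid_records` is hypothetical, and it assumes the record layout produced by the conversion script (`id`, an optional `image` file name, and `conversations` as a list of turns with `from` and `value` keys, as in the LLaVA data format).

```python
import os


def find_invalid_records(records, image_folder):
    """Return (record id, reason) pairs for records that look malformed."""
    problems = []
    for record in records:
        record_id = record.get("id")
        turns = record.get("conversations")
        if not turns:
            problems.append((record_id, "no conversations"))
            continue
        # Assumed turn layout: {"from": ..., "value": ...}
        if any("from" not in turn or "value" not in turn for turn in turns):
            problems.append((record_id, "turn missing 'from' or 'value'"))
        image_name = record.get("image")
        # Text-only records have no "image" key, so only check when present.
        if image_name is not None and not os.path.isfile(
            os.path.join(image_folder, image_name)
        ):
            problems.append((record_id, "image file not found"))
    return problems
```

Calling it on the list loaded from the converted JSON file, together with the image folder path, returns an empty list when every record passes these checks.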
## Citation
**BibTeX:**
```
@misc{liu2024llavanext,
title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
month={January},
year={2024}
}
```
## Dataset Card Authors
The dataset is curated by the following authors:
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee
Provider: maas
Created: 2024-10-07



