LLaVA-OneVision-1.5-Insturct-Data
ModelScope community · Updated 2026-01-04 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
Resource description:
# LLaVA-OneVision-1.5 Instruction Data
[Paper](https://huggingface.co/papers/2509.23661) | [Code](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)
## 📌 Introduction
This dataset, **LLaVA-OneVision-1.5-Instruct**, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and fully open framework for building high-quality vision-language models entirely from scratch.
It has significantly enhanced the performance of Vision-Language Models (VLMs) in structured information processing and knowledge-based question answering tasks.
As part of the LLaVA-OneVision-1.5 open-source initiative, we are releasing this dataset to the community in the hope of advancing VLM research and driving further progress in the field.
## ⚙️ Usage Notes
Although the dataset itself is of high quality, we recommend deduplicating it and combining it with the FineVision dataset for better training results.
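The card does not say how deduplication should be done, so the following is only a minimal sketch of one possible approach: exact-match deduplication while merging two sources. The helper names (`record_key`, `merge_and_dedup`) and the toy record fields (`image`, `question`, `answer`) are hypothetical and do not reflect the actual schema of either dataset.

```python
import hashlib
import json


def record_key(record):
    """Hash a record's canonical JSON form so identical samples share a key."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_and_dedup(*sources):
    """Yield records from all sources in order, skipping exact duplicates."""
    seen = set()
    for source in sources:
        for record in source:
            key = record_key(record)
            if key not in seen:
                seen.add(key)
                yield record


# Hypothetical toy records; the real datasets' schemas may differ.
ov_samples = [
    {"image": "a.jpg", "question": "What is shown?", "answer": "A cat."},
    {"image": "b.jpg", "question": "How many dogs?", "answer": "Two."},
]
fv_samples = [
    {"image": "a.jpg", "question": "What is shown?", "answer": "A cat."},  # duplicate
    {"image": "c.jpg", "question": "Read the sign.", "answer": "STOP"},
]

merged = list(merge_and_dedup(ov_samples, fv_samples))
print(len(merged))
```

Exact hashing only catches byte-identical samples. Near-duplicates (for example, the same image with a rephrased question) would need a fuzzier criterion, which this sketch does not attempt.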
## 🚀 Sample Usage
Below is a quick start guide demonstrating how to use the LLaVA-OneVision-1.5 models with Hugging Face `transformers` for inference. This snippet is directly from the project's GitHub repository.
```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-One-Vision-1.5-8B-Instruct"

# default: Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Assumes a CUDA-capable GPU is available
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
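The trimming step in the snippet above may be easier to follow in isolation. For decoder-only models, `generate` typically returns each prompt followed by its continuation, so the prompt prefix is sliced off before decoding. The sketch below reproduces that slicing with plain Python lists; the function name `trim_prompt_tokens` and the token IDs are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs (hypothetical): each generated row is prompt + continuation.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 55, 0]]

trimmed = trim_prompt_tokens(prompt_batch, generated_batch)
print(trimmed)  # [[42, 43], [55, 0]]
```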
## 📊 Data Analysis
### Distribution of Data Categories
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/2xoRKPrNZsgK2YqbwVLCs.png"
width="512" height="512" alt="sft_dataset_pie_chart">
</p>
### Comparison and Scaling with FineVision
Three datasets (Merge46M, FineVision, and LLaVA-OneVision-1.5-Inst-Data) were compared across 16 benchmarks during the supervised fine-tuning (SFT) phase; Merge46M performed best on most benchmarks.

## 🙏 Acknowledgement
We would like to acknowledge the contributions of **[FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)**, whose open dataset served as an important foundation and benchmark for building this SFT dataset.
## 📜 Cite
If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
```bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025},
  url={https://arxiv.org/abs/2509.23661},
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
```
Provider:
maas
Created:
2025-09-17



