LLaVA-One-Vision-1.5-Insturct-26M
Source: ModelScope community (updated 2026-05-18, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/lmms-lab/LLaVA-One-Vision-1.5-Insturct-26M
# LLaVA-OneVision-1.5 Instruction Data
[Paper](https://huggingface.co/papers/2509.23661) | [Code](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)
## 📌 Introduction
This dataset, **LLaVA-OneVision-1.5-Instruct**, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and fully open framework for building high-quality vision-language models entirely from scratch.
It has significantly enhanced the performance of Vision-Language Models (VLMs) in structured information processing and knowledge-based question answering tasks.
As part of the LLaVA-OneVision-1.5 open-source initiative, we are releasing this dataset to the community in the hope of advancing VLM research and driving further progress in the field.
## ⚙️ Usage Notes
Although the dataset is of high quality on its own, we recommend deduplicating it against the FineVision dataset and training on the combined data to achieve better results.
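The deduplicate-then-combine step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout (a `conversations` list whose turns carry a `value` string) and the helper names `sample_key` and `merge_deduplicated` are assumptions made for the example, and matching on normalized text alone ignores the images, which a real pipeline would likely also need to compare.

```python
import hashlib
import re
from itertools import chain


def sample_key(sample: dict) -> str:
    """Hash the whitespace- and case-normalized conversation text of a sample.

    Assumes a hypothetical schema: sample["conversations"] is a list of
    turns, each with a "value" string.
    """
    text = " ".join(turn["value"] for turn in sample["conversations"])
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_deduplicated(primary, secondary):
    """Concatenate two sample iterables, keeping the first copy of each key."""
    seen = set()
    merged = []
    for sample in chain(primary, secondary):
        key = sample_key(sample)
        if key not in seen:
            seen.add(key)
            merged.append(sample)
    return merged


# Toy records for illustration only; they are not taken from either dataset.
ours = [
    {"conversations": [{"from": "human", "value": "What is shown?"},
                       {"from": "gpt", "value": "A cat."}]},
]
finevision = [
    {"conversations": [{"from": "human", "value": "what is  shown?"},
                       {"from": "gpt", "value": "A cat."}]},
    {"conversations": [{"from": "human", "value": "Read the sign."},
                       {"from": "gpt", "value": "STOP"}]},
]

merged = merge_deduplicated(ours, finevision)
print(len(merged))  # 2: the near-identical FineVision record is dropped
```

Keeping the first occurrence means samples from the dataset passed first win over duplicates from the second, so the argument order decides which copy survives.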
## 🚀 Sample Usage
Below is a quick start guide demonstrating how to use the LLaVA-OneVision-1.5 models with Hugging Face `transformers` for inference. The snippet follows the one in the project's GitHub repository.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-One-Vision-1.5-8B-Instruct"

# Default: load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# A single user turn containing one image and one text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference: render the chat template and load the visual inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")  # assumes a CUDA GPU is available

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
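The `generated_ids_trimmed` step in the snippet above exists because `generate` returns each prompt followed by its continuation, so the prompt prefix has to be sliced off before decoding. The sketch below isolates that slicing on plain Python lists; the helper name `trim_prompt` and the token IDs are made up for illustration and are not part of the project's code.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Illustrative token IDs: each generated row starts with its own prompt.
prompt_batch = [[101, 7, 8], [101, 9]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

print(trim_prompt(prompt_batch, generated_batch))  # [[42, 43], [55]]
```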
## 📊 Data Analysis
### Distribution of Data Categories
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/2xoRKPrNZsgK2YqbwVLCs.png"
width="512" height="512" alt="sft_dataset_pie_chart">
</p>
### Comparison and Scaling with FineVision
Performance comparison of three datasets (Merge46M, FineVision, and LLaVA-OneVision-1.5-Inst-Data) across 16 benchmarks during the SFT phase, demonstrating the superiority of Merge46M on most benchmarks.

## 🙏 Acknowledgement
We would like to acknowledge the contributions of **[FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)**, whose open dataset served as an important foundation and benchmark for building this SFT dataset.
## 📜 Cite
If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
```bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025},
  url={https://arxiv.org/abs/2509.23661},
}
@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}
@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
```
Provider: maas
Created: 2025-09-11