LLaVA-OneVision-1.5-Instruct-Data

Name: LLaVA-OneVision-1.5-Instruct-Data
Creator: maas
Published: 2026-04-21 17:48:55
License: No description provided

ModelScope community · Updated 2026-04-21 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/lmms-lab/LLaVA-OneVision-1.5-Instruct-Data

Resource description:

# LLaVA-OneVision-1.5 Instruction Data

[Paper](https://huggingface.co/papers/2509.23661) | [Code](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)

## 📌 Introduction

This dataset, **LLaVA-OneVision-1.5-Instruct**, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs.

This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and fully open framework for building high-quality vision-language models entirely from scratch. It has significantly enhanced the performance of Vision-Language Models (VLMs) in structured information processing and knowledge-based question answering tasks.

As part of the LLaVA-OneVision-1.5 open-source initiative, we are releasing this dataset to the community in the hope of advancing VLM research and driving further progress in the field.

## ⚙️ Usage Notes

Although the dataset itself is of high quality, we recommend deduplicating it against the FineVision dataset and combining the two to achieve better training results.

## 🚀 Sample Usage

Below is a quick start guide demonstrating how to use the LLaVA-OneVision-1.5 models with Hugging Face `transformers` for inference. This snippet is taken from the project's GitHub repository.

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-One-Vision-1.5-8B-Instruct"

# default: Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the GPU (this line assumes a CUDA device is available)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## 📊 Data Analysis

### Distribution of Data Categories

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/2xoRKPrNZsgK2YqbwVLCs.png" width="512" height="512" alt="sft_dataset_pie_chart">
</p>

### Comparison and Scaling with FineVision

Performance comparison of three datasets (Merge46M, FineVision, and LLaVA-OneVision-1.5-Instruct-Data) across 16 benchmarks during the SFT phase, showing that Merge46M performs best on most benchmarks.

![ablation_instruct](https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/E2Px27arJ3J-LXZLgAUDN.jpeg)

## 🙏 Acknowledgement

We would like to acknowledge the contributions of **[FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)**, whose open dataset served as an important foundation and benchmark for building this SFT dataset.

## 📜 Cite

If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:

```bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Huajie Tan and Li, Chunyuan and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
  booktitle={arxiv},
  year={2025},
  url={https://arxiv.org/abs/2509.23661},
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
```

Provider:

maas

Created:

2025-10-10

Collected summary

Dataset introduction

Background and challenges

Background overview

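LLaVA-OneVision-1.5-Instruct-Data is a high-quality dataset of 22M instruction samples, designed for training Large Multimodal Models (LMMs) and intended to improve the performance of Vision-Language Models (VLMs). The dataset is part of the LLaVA-OneVision-1.5 open-source initiative, and combining it with the FineVision dataset is recommended for better results.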
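The dataset card recommends deduplicating against FineVision before combining the two, but it does not say how. The sketch below shows one simple possibility, exact-match deduplication on a normalized hash, and is not the authors' procedure. The field names `image_id` and `instruction` and the toy records are hypothetical; the real schemas of both datasets may differ.

```python
import hashlib
from typing import Dict, Iterable, Iterator

# Hypothetical record layout: the real column names may differ.
Sample = Dict[str, str]


def sample_key(sample: Sample) -> str:
    """Hash the image id together with the case- and whitespace-normalized instruction."""
    image = sample.get("image_id", "").strip()
    text = " ".join(sample.get("instruction", "").lower().split())
    return hashlib.sha256(f"{image}\x1f{text}".encode("utf-8")).hexdigest()


def merge_deduplicated(*sources: Iterable[Sample]) -> Iterator[Sample]:
    """Yield samples from every source in order, skipping keys already seen."""
    seen = set()
    for source in sources:
        for sample in source:
            key = sample_key(sample)
            if key not in seen:
                seen.add(key)
                yield sample


# Toy stand-ins for the two datasets (illustrative values only).
onevision_samples = [
    {"image_id": "img_001", "instruction": "Describe this image."},
    {"image_id": "img_002", "instruction": "What is in the chart?"},
]
finevision_samples = [
    {"image_id": "img_001", "instruction": "describe  this image."},  # duplicate after normalization
    {"image_id": "img_003", "instruction": "Read the text in the picture."},
]

merged = list(merge_deduplicated(onevision_samples, finevision_samples))
print([s["image_id"] for s in merged])
```

Exact hashing only catches identical pairs after normalization; near-duplicates such as re-encoded images or paraphrased prompts would need a fuzzier method, for example perceptual hashing or embedding similarity.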

The content above was collected and summarized by the hosting site (遇见数据集).
