R1-Onevision

Name: R1-Onevision
Creator: maas
Published: 2025-12-05 16:25:11
License: not specified

ModelScope community; updated 2025-12-05; indexed 2025-03-01

Download link:

https://modelscope.cn/datasets/Fancy-MLLM/R1-Onevision

Resource description:

## R1-Onevision

[\[📂 GitHub\]](https://github.com/Fancy-MLLM/R1-Onevision) [\[📝 Paper\]](https://arxiv.org/pdf/2503.10615) [\[🤗 Reasoning Benchmark\]](https://huggingface.co/datasets/Fancy-MLLM/R1-OneVision-Bench) [\[🤗 HF Demo\]](https://huggingface.co/spaces/Fancy-MLLM/R1-OneVision)

# R1-Onevision Dataset

## Dataset Overview

The **R1-Onevision** dataset is a meticulously crafted resource designed to empower models with advanced multimodal reasoning capabilities. Aimed at bridging the gap between visual and textual understanding, this dataset provides rich, context-aware reasoning tasks across diverse domains, including natural scenes, science, mathematical problems, OCR-based content, and complex charts.

It combines high-quality data from LLaVA-OneVision with domain-specific datasets, each carefully selected and filtered to provide a solid foundation for complex visual reasoning tasks. With a focus on enabling deep reasoning and accurate model predictions, **R1-Onevision** equips models to handle a variety of visual and textual inputs, tackling intricate reasoning challenges with precision.

# Data Generation Pipeline

## The Genesis: Data Preparation & Filtering

R1-Onevision began with the aggregation of a diverse collection of datasets spanning natural images, OCR, charts, science, mathematical problems, and more. These datasets were filtered and curated with the core objective of supporting reasoning tasks across different domains. We ensured that only the most relevant data and question types were selected, setting the stage for a dataset that offers both quality and depth.

In addition to LLaVA-OneVision, the dataset incorporates a carefully chosen set of domain-specific datasets, enhancing the robustness of the dataset and making it an ideal tool for multimodal reasoning research.

## Image Captioning: Turning Pixels into Formal Language

Captions serve as the backbone of this dataset, capturing fine-grained visual details to enable deep reasoning.
The process involved the combined power of GPT-4o, Grounding DINO, and EasyOCR, which allowed us to generate detailed captions and spatial information for a variety of images.

- **Charts & Diagrams**: GPT-4o was employed to translate visual elements into structured formats, including SPICE for circuit schematics, PlantUML for flowcharts, and HTML for UI layouts. These captions also provided explanations in a formal language, aiding in their use for reasoning tasks.
- **Natural Scenes**: GPT-4o generated rich, descriptive captions for images, while Grounding DINO pinpointed the key elements within the scene to enhance model understanding.
- **Text-Only Images**: EasyOCR extracted text from images containing printed or handwritten text, and GPT-4o restored the original format to ensure that context and layout were preserved.
- **Images with Textual Content**: For images containing both visual and textual elements, OCR data, bounding boxes, and GPT-4o-generated captions combined to recreate the original layout and structure.
- **Mathematical Images**: For images involving mathematical content, GPT-4o summarized the image content into structured captions, reasoning steps, and results, ensuring that the context provided could directly support complex reasoning tasks.

## Reasoning Process: Chain-of-Thought Generation

Once captions were generated, the next step involved reasoning over the images. The **Chain-of-Thought (CoT)** approach was used to guide the model through a structured reasoning process, drawing on both textual and visual information.

To enhance the model's reasoning ability, we implemented a **Role-Playing** approach. This allowed the model to "see" the image and iteratively revisit key visual information to refine its understanding and reasoning. This process encouraged the model to think more critically about the visual elements, generating more accurate and contextually rich answers.

## Final Filter: Quality Assurance

The dataset also includes a final layer of quality assurance. GPT-4 was used to filter out any inaccurate or irrelevant reasoning steps, ensuring that only valid, coherent, and contextually accurate conclusions remained. This layer of validation strengthens the reliability and trustworthiness of the reasoning process.

# Data Format

The data is stored in `parquet` files with the following structure:

```json
{
  "id": "<unique identifier>",
  "image": "<base64>",
  "conversations": [
    {"from": "human", "value": "<question>"},
    {"from": "assistant", "value": "<cot>"}
  ]
}
```

# Data distribution

<img src="https://cdn-uploads.huggingface.co/production/uploads/65af78bb3e82498d4c65ed2a/W4IG0lu2BrXqwXRIXwdzL.png"/>

# Institution

- Zhejiang University

# Dataset Contact

- panhongkun@zju.edu.cn
- yang-yi@zju.edu.cn
- xiaoxuanhe@zju.edu.cn
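
The record layout given under Data Format can be unpacked with a few lines of Python. The following is a minimal sketch, not code from the dataset authors: it assumes each record is already available as a dictionary with exactly the `id`, `image`, and `conversations` fields shown there, and that `image` holds a base64 string. The helper name `parse_record` and the sample record are hypothetical and exist only for illustration.

```python
import base64


def parse_record(record: dict) -> dict:
    """Split one record into raw image bytes, the question, and the CoT answer.

    Assumes the field names from the Data Format section; `parse_record`
    itself is an illustrative helper, not part of any released tooling.
    """
    # The "image" field is described as base64, so decode it to raw bytes.
    image_bytes = base64.b64decode(record["image"])

    question = None
    cot = None
    # Take the first human turn as the question and the first assistant
    # turn as the chain-of-thought answer.
    for turn in record["conversations"]:
        if turn["from"] == "human" and question is None:
            question = turn["value"]
        elif turn["from"] == "assistant" and cot is None:
            cot = turn["value"]

    return {
        "id": record["id"],
        "image_bytes": image_bytes,
        "question": question,
        "cot": cot,
    }


# Synthetic record for illustration only; it is not taken from the dataset.
example = {
    "id": "demo-0001",
    "image": base64.b64encode(b"not a real image").decode("ascii"),
    "conversations": [
        {"from": "human", "value": "What is shown in the image?"},
        {"from": "assistant", "value": "First I inspect the image. Then I answer."},
    ],
}

parsed = parse_record(example)
print(parsed["id"], len(parsed["image_bytes"]), parsed["question"])
```

If pandas is installed, `pandas.read_parquet(path).to_dict("records")` should yield dictionaries that can be passed to such a helper, provided the parquet columns match the field names above. The actual column types in the published files may differ, so inspect one row before relying on this.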


Provider:

maas

Created:

2025-03-01
