FineReason
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-22.
Download link:
https://modelscope.cn/datasets/OpenDataArena/FineReason
Description:
# FineReason: A Comprehensive Multimodal Dataset for Visual Reasoning
FineReason is a multimodal reasoning dataset designed to strengthen the visual reasoning of large multimodal models (LMMs). It covers **STEM (Science, Technology, Engineering, and Mathematics), visual puzzles, games, and complex diagram reasoning**.
Each example includes a reasoning-style answer distilled from **Qwen3-VL-235B-a22B-thinking**, promoting long-chain, interpretable multimodal reasoning.
---
## 🧠 Motivation
Reasoning over structured or non-natural images requires more than visual perception and OCR capabilities. It demands **logical inference, symbolic understanding, and step-by-step analytical thinking**.
However:
1. **Data imbalance**: In existing composite open-source multimodal datasets (e.g., FineVision, LLaVA-OneVision-1.5-data), reasoning samples are limited and underrepresented due to the intrinsic difficulty of acquiring high-quality data.
2. **Constraints on reasoning quality**: Existing open-source multimodal datasets are generally small and scattered, and they lack a consistent reasoning style with long-form, interpretable reasoning chains. This hinders research on data-centric approaches to multimodal reasoning.
FineReason aims to address this gap by curating and distilling high-quality reasoning datasets with a consistent reasoning style, thereby providing a robust foundation for **data-centric** multimodal training and evaluation.
---
## 📊 Dataset Composition (Continuously Expanding...)
| Sub-dataset | Count |
| -------------------------------------- | ------- |
| BMMR | 85,275 |
| Euclid30K | 27,111 |
| ai2d_merged | 2,446 |
| geo170k (qa) | 12,101 |
| geometry3k (mathv360k) | 9,724 |
| scienceqa | 6,146 |
| tqa | 12,565 |
| visualwebinstruct (filtered) | 261,436 |
| MMR1 | 1,610,242 |
| VisualSphinx | 3,781 |
| mmopenr1-8k | 7,428 |
| WeMath2-Standard | 5,774 |
| WeMath2-Pro | 4,531 |
| WeMath2-SFT | 826 |
| WaltonColdStart | 51,263 |
| MMK12 | 15,549 |
| ViRL39K | 36,263 |
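
The counts in the table above can be summed to see how the sub-datasets are weighted. The snippet below is only a sketch: it hard-codes the numbers from the table, which may be stale because the dataset is still expanding, and the variable names are illustrative rather than part of any FineReason tooling.

```python
# Sample counts copied from the composition table above.
# The dataset is described as continuously expanding, so treat
# these numbers as a snapshot rather than authoritative values.
counts = {
    "BMMR": 85_275,
    "Euclid30K": 27_111,
    "ai2d_merged": 2_446,
    "geo170k (qa)": 12_101,
    "geometry3k (mathv360k)": 9_724,
    "scienceqa": 6_146,
    "tqa": 12_565,
    "visualwebinstruct (filtered)": 261_436,
    "MMR1": 1_610_242,
    "VisualSphinx": 3_781,
    "mmopenr1-8k": 7_428,
    "WeMath2-Standard": 5_774,
    "WeMath2-Pro": 4_531,
    "WeMath2-SFT": 826,
    "WaltonColdStart": 51_263,
    "MMK12": 15_549,
    "ViRL39K": 36_263,
}

# Total number of samples across all listed sub-datasets.
total = sum(counts.values())

# Fraction of the total contributed by each sub-dataset.
shares = {name: n / total for name, n in counts.items()}

# Name of the sub-dataset with the most samples.
largest = max(counts, key=counts.get)

print(f"total samples: {total:,}")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} {share:6.1%}")
```

With the table as listed, the total is roughly 2.15 million samples and MMR1 alone accounts for about three quarters of them, which may be worth keeping in mind when sampling or re-weighting sub-datasets during training.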
---
## 🧩 Data Structure
Each entry contains:
```json
{
"id": "unique_identifier",
"question": "textual question",
"image": "PIL Image",
"qwen3vl_235b_thinking_response": "reasoning-style answer distilled from Qwen3-VL-235B-a22B-thinking"
}
```
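
As a minimal sketch of how the schema above might be checked before training, the helper below reports missing or malformed fields in a single entry. The function name `find_problems` is hypothetical, and the assumption that `id`, `question`, and the response are plain strings is inferred from the example rather than stated by the dataset authors.

```python
from typing import Any, Dict, List

# Field names taken from the entry schema shown above.
REQUIRED_FIELDS = (
    "id",
    "question",
    "image",
    "qwen3vl_235b_thinking_response",
)

# Assumption: these fields are plain strings, as the example suggests.
TEXT_FIELDS = ("id", "question", "qwen3vl_235b_thinking_response")


def find_problems(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in one entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            problems.append(f"missing field: {field}")
    for field in TEXT_FIELDS:
        if field in entry and not isinstance(entry[field], str):
            problems.append(f"field is not a string: {field}")
    # The image is described as a PIL Image; here we only check it is present
    # and not None, so the sketch does not depend on Pillow being installed.
    if "image" in entry and entry["image"] is None:
        problems.append("image is empty")
    return problems


# Usage with a stand-in object in place of a real PIL Image.
example = {
    "id": "unique_identifier",
    "question": "textual question",
    "image": object(),
    "qwen3vl_235b_thinking_response": "<think>...</think><answer>...</answer>",
}
print(find_problems(example))
```

An empty list means the entry has every expected field; anything else can be logged or used to filter the entry out.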
---
## ⚙️ Data Generation Process
We unify all sub-datasets under a **common reasoning style** by **distilling long-chain answers** from ***Qwen3-VL-235B-a22B-thinking***.
The model is prompted to produce structured, interpretable, and step-by-step reasoning grounded in the provided images and questions.
### Example Reasoning Pattern
```text
<think>
[Detailed reasoning process]
- Analyze the problem and extract key information
- Identify relevant formulas/principles
- Work through step-by-step calculations
- Consider multiple approaches if needed
- Resolve any contradictions
- Converge toward the solution
- Verification
</think>
<answer>
[Final answer here]
</answer>
```
This ensures:
* Consistent reasoning traces across datasets
* Visually grounded logical steps
* Improved interpretability and compositional reasoning
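
The `<think>` / `<answer>` pattern shown above can be split into its two parts with a small parser. This is a sketch under the assumption that each response contains at most one block of each kind and that the tags appear literally as in the example; `parse_response` and `ParsedResponse` are illustrative names, not part of the dataset or its tooling.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]  # text inside <think>...</think>, if present
    answer: Optional[str]     # text inside <answer>...</answer>, if present


# DOTALL lets "." match newlines, since reasoning traces span many lines.
# The non-greedy "(.*?)" stops at the first closing tag.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split a distilled response into its reasoning trace and final answer."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


sample = "<think>\nArea = 3 * 4 = 12\n</think>\n<answer>\n12\n</answer>"
parsed = parse_response(sample)
print(parsed.reasoning)
print(parsed.answer)
```

Returning `None` for a missing block, rather than raising, makes it straightforward to count or filter responses that do not follow the expected pattern.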
---
## 📈 Future Work
We are continuously:
* Expanding coverage across math, science, logical, and spatial reasoning
* Re-distilling reasoning traces with improved thinking models
* Filtering and improving response quality
* Performing domain-specific reasoning data augmentation
---
# 🌐 About OpenDataArena
[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
**Key Features:**
* 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
* 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
* 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.
If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.
# 📚 Citation
```bibtex
@dataset{opendataarena_finereason_2025,
author = {OpenDataArena},
title = {OpenDataArena-finereason},
year = {2025},
url = {https://huggingface.co/datasets/OpenDataArena/FineReason}
}
```
Provider: maas
Created: 2025-11-13



