MMFineReason-1.8M-Qwen3-VL-235B-Thinking

ModelScope community (魔搭社区) · Updated 2026-04-29 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/OpenDataArena/MMFineReason-1.8M-Qwen3-VL-235B-Thinking
<div align="center">
  <h1>MMFineReason</h1>
  <p><strong>Closing the Multimodal Reasoning Gap via Open Data-Centric Methods</strong></p>
</div>

<div align="center">

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2601.21821)
[![Homepage](https://img.shields.io/badge/Homepage-MMFineReason-blue)](https://mmfinereason.github.io/)
[![Collections](https://img.shields.io/badge/🤗-Collections-yellow)](https://huggingface.co/collections/OpenDataArena/mmfinereason)

</div>

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/model_compare.png" width="100%" alt="Model Performance Comparison">
  <figcaption><em>Average score across mathematical reasoning and multimodal understanding benchmarks.</em></figcaption>
</figure>

---

## 📖 Overview

**MMFineReason** is a large-scale, high-quality multimodal reasoning dataset comprising **1.8M samples** and **5.1B solution tokens**, featuring detailed reasoning annotations distilled from **Qwen3-VL-235B-A22B-Thinking**.

### 🎯 Key Highlights

- **1.8M High-Quality Samples** with **5.1B Solution Tokens**
- **Long-Form CoT**: Average reasoning length of **2,910 tokens** (2.7× HoneyBee, 4.3× OpenMMReasoner)
- **100% Caption Coverage**: Dense visual descriptions averaging 609 tokens
- **Multi-Domain**: Mathematics (79.4%), Science (13.8%), Puzzle/Game (4.6%), General/OCR (2.2%)
- **State-of-the-Art**: Models trained on this dataset achieve SOTA performance in their size class

---

## 🏗️ Data Construction Pipeline

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/pipeline_detailed.png" width="100%" alt="Data Construction Pipeline">
  <figcaption><em>MMFineReason data pipeline and the two-stage training.</em></figcaption>
</figure>

### Stage 1: Data Collection & Standardization

- Aggregate diverse multimodal datasets from the open-source community
- Translate non-English questions; remove noise and extraneous artifacts
- Rewrite shallow prompts into reasoning-encouraging instructions
- Filter out non-reasoning tasks; clean corrupted or oversized images

### Stage 2: Reasoning Distillation

- **Teacher Model**: Qwen3-VL-235B-A22B-Thinking
- **Four-Phase Framework**: Information Extraction → Problem Setup → Solution Execution → Validation
- **Output**: Reasoning in `<think>...</think>`, final answer in `<answer>...</answer>`
- **Caption Generation**: 100% coverage via Qwen3-VL-235B-A22B-Thinking

### Stage 3: Data Selection

- **Quality Filtering**: Template/length validation, n-gram deduplication, correctness verification (~20% removed)
- **Difficulty Filtering**: Use the Qwen3-VL-4B-Thinking pass rate as a difficulty proxy
  - **MMFineReason-123K**: Pass rate = 0 (hardest 7%)
  - **MMFineReason-586K**: Pass rate ≠ 1 (challenging 33%)

---

## 🔧 Data Schema

| Field | Description |
|-------|-------------|
| `source` | Origin dataset name (e.g., "Geometry3K", "MMR1", "BMMR") |
| `id` | Unique sample identifier within the source dataset |
| `original_question` | Raw question text as obtained from the source |
| `original_answer` | Raw answer as obtained from the source |
| `image` | Visual input (PIL Image) |
| `question` | Cleaned, standardized question in English |
| `answer` | Verified answer, extracted and standardized |
| `qwen3vl_235b_instruct_caption` | Dense visual description generated by Qwen3-VL-235B-A22B-Instruct |
| `qwen3vl_235b_thinking_response` | Long-form Chain-of-Thought reasoning generated by Qwen3-VL-235B-A22B-Thinking |
| `qwen3vl_4b_pass_rate` | Difficulty proxy based on Qwen3-VL-4B-Thinking's performance (0.0 = hardest, 1.0 = easiest) |
| `is_consistent` | Boolean indicating whether the generated reasoning matches the ground truth |
| `consistency_analysis` | Detailed analysis of the consistency verification |

---

## 🗂️ Dataset Composition

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/composition_sunburst.png" width="100%" alt="Dataset Composition">
  <figcaption><em>Dataset composition of MMFineReason-1.8M.</em></figcaption>
</figure>

**Mathematics (79.4%)** forms the backbone, primarily sourced from MMR1 (1.27M) and enriched with WaltonColdStart, ViRL39K, Euclid30K, MMK12, Geo170K, Geo3K, mm-openr1, and the WeMath family.

**Science (13.8%)** is anchored by VisualWebInstruct (157.3K) and BMMR (54.6K), complemented by TQA, AI2D, Zebra-CoT, and ScienceQA.

**Puzzle/Game (4.6%)** targets strategic planning and abstract reasoning, dominated by GameQA-140K (71.7K) and enriched by Raven, VisualSphinx, and PuzzleQA.

**General/OCR (2.2%)** includes 38.7K samples from LLaVA-CoT, serving as regularization to preserve broad visual and OCR capabilities.

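As a minimal sketch of how the `<think>...</think>` / `<answer>...</answer>` output format from Stage 2 might be consumed, the snippet below splits a `qwen3vl_235b_thinking_response` string into its reasoning and final answer. The helper name `split_response`, the regular expressions, and the sample record are illustrative assumptions and not part of any official dataset tooling; only the field names and tag names come from the schema above.

```python
import re
from typing import Optional, Tuple

# Tag names follow the Stage 2 output description; the regexes are an assumption.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_response(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, final_answer); an element is None if its tag pair is missing."""
    think = THINK_RE.search(response)
    answer = ANSWER_RE.search(response)
    reasoning = think.group(1).strip() if think else None
    final = answer.group(1).strip() if answer else None
    return reasoning, final


# Hypothetical record that uses the documented field names; the values are made up.
sample = {
    "source": "Geometry3K",
    "question": "What is the area of the right triangle?",
    "answer": "6",
    "qwen3vl_235b_thinking_response": (
        "<think>The legs are 3 and 4, so the area is 3 * 4 / 2 = 6.</think>"
        "<answer>6</answer>"
    ),
    "is_consistent": True,
}

reasoning, final = split_response(sample["qwen3vl_235b_thinking_response"])
print(final)
```

Returning `None` for a missing tag pair, rather than raising, keeps the helper usable as a template-validation filter in the spirit of the quality-filtering step.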
---

## 📊 Dataset Statistics

### Token Length Comparison with Other Datasets

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_token_length.png" width="100%" alt="Token Length Statistics Comparison">
  <figcaption><em>Comparison of token length statistics across datasets.</em></figcaption>
</figure>

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/token_length_distribution.png" width="100%" alt="Token Length Distribution">
  <figcaption><em>Token length analysis. (Left) Internal domain distribution; (Mid) External CoT comparison; (Right) Caption richness comparison.</em></figcaption>
</figure>

MMFineReason achieves an average CoT length of **2,910 tokens**, approximately **2.7×** that of HoneyBee and **4.3×** that of OpenMMReasoner. The extended tail (max: 16,316 tokens) demonstrates capacity for highly complex, multi-stage reasoning tasks. For captions, MMFineReason averages 609 tokens with **100% coverage**, compared to HoneyBee's 299 tokens at ~58% coverage.

---

### 🖼️ Image Category Distribution

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_image_category.png" width="100%" alt="Image Category Statistics">
  <figcaption><em>Image category statistics by group (STEM vs. Natural).</em></figcaption>
</figure>

The corpus is predominantly STEM and diagrammatic content (98.3%), with geometric diagrams, mathematical plots, and logic puzzles accounting for 75.3%. Natural images (1.7%) provide diversity across urban scenes, indoor scenes, and human activities for generalization assessment.

---

### 📈 Difficulty Distribution

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/pass_rate_distribution.png" width="100%" alt="Pass Rate Distribution">
  <figcaption><em>Pass rate distribution across sub-datasets, sorted by descending mean pass rate.</em></figcaption>
</figure>

Science-oriented datasets (ScienceQA, AI2D, TQA) exhibit high pass rates due to their clean diagrams and multiple-choice format. Puzzle/game datasets (GameQA-140K, Raven, VisualSphinx) show the lowest pass rates, as they require multi-step abstract reasoning. The binary distribution pattern reflects that reasoning tasks often have all-or-nothing outcomes.

---

## 📊 Benchmark Results

### Main Results

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_main_results.png" width="100%" alt="Main Benchmark Results">
  <figcaption><em>Comparison of MMFineReason models with state-of-the-art models.</em></figcaption>
</figure>

MMFineReason-4B surpasses Qwen3-VL-8B-Thinking (73.9 vs 72.5), while MMFineReason-8B (MFR-8B) outperforms the larger Qwen3-VL-30B-A3B-Thinking (75.7 vs 74.5) and exceeds Gemini-2.5-Flash. On mathematical benchmarks, MFR-8B achieves 83.4% on DynaMath (vs Qwen3-VL-32B-Thinking's 82.0%) and 67.1% on MathVision, outperforming HoneyBee-8B and OMR-7B by 23 to 30 points. Despite minimal chart training data, MFR-8B generalizes well to CharXiv (90.8%) and RealWorldQA (75.6%).

### SFT vs RL Training Analysis

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_sft_rl_results.png" width="100%" alt="SFT vs RL Results">
  <figcaption><em>Results comparing MFR-SFT and MFR-Thinking models against base Qwen3-VL variants.</em></figcaption>
</figure>

SFT drives major gains in mathematical reasoning (e.g., MathVision: 53.9% → 67.6% for 8B).
RL enhances generalization on understanding benchmarks (e.g., AI2D: 78.5% → 82.5% for 2B) while showing variance on math benchmarks.

---

## 🔬 Ablation Studies

### Data Efficiency ("Less is More")

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/ablation_data_efficiency.png" width="100%" alt="Data Efficiency Analysis">
  <figcaption><em>Performance comparison across different data scales and model sizes.</em></figcaption>
</figure>

Removing the easy samples (pass rate = 1), which make up 67% of the data, improves performance by 0.6 points (75.0 → 75.6). Training on only the hardest 7% (123K samples) achieves 73.3, surpassing Qwen3-VL-8B-Thinking (72.5) with 14× less data. This suggests that challenging samples provide most of the training signal, and that rigorous filtering eliminates redundancy in large-scale datasets.

### Sub-Dataset Performance

<figure align="center">
  <img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/subdataset_performance.png" width="100%" alt="Sub-Dataset Performance Analysis">
  <figcaption><em>Performance landscape of distilled sub-datasets (x-axis: sample count, log scale).</em></figcaption>
</figure>

ViRL39K (39K samples) retains 98.9% of MMR1's (1.5M) performance with only 2.4% of the data volume. WeMath2.0-SFT achieves 70.98% with just 814 samples, matching datasets 1000× larger. BMMR (80K, 300+ disciplines) outperforms the larger GameQA-140K (140K), suggesting that disciplinary diversity matters more than scale.

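The difficulty-based subsets used in the data-efficiency ablation can be sketched as a simple filter on `qwen3vl_4b_pass_rate`. The selection rules (pass rate = 0 for the hardest subset, pass rate ≠ 1 for the challenging subset) come from the Stage 3 description; the function name `split_by_difficulty` and the toy records are hypothetical, and this is not the authors' actual selection code.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_difficulty(samples: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (hardest, challenging) subsets based on the 4B model's pass rate.

    hardest:     pass rate == 0 (rule described for MMFineReason-123K)
    challenging: pass rate != 1 (rule described for MMFineReason-586K)
    Exact float comparison is safe only at the endpoints 0.0 and 1.0,
    which a ratio of passes to attempts represents exactly.
    """
    hardest = [s for s in samples if s["qwen3vl_4b_pass_rate"] == 0.0]
    challenging = [s for s in samples if s["qwen3vl_4b_pass_rate"] != 1.0]
    return hardest, challenging


# Toy records with made-up ids and pass rates, for illustration only.
records = [
    {"id": "a", "qwen3vl_4b_pass_rate": 0.0},
    {"id": "b", "qwen3vl_4b_pass_rate": 0.25},
    {"id": "c", "qwen3vl_4b_pass_rate": 1.0},
    {"id": "d", "qwen3vl_4b_pass_rate": 0.0},
    {"id": "e", "qwen3vl_4b_pass_rate": 0.75},
]

hardest, challenging = split_by_difficulty(records)
print([s["id"] for s in hardest])
print([s["id"] for s in challenging])
```

By construction the hardest subset is contained in the challenging one, which mirrors the relationship between the 7% and 33% subsets described above.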
---

## 🏆 Trained Models

| Model | Parameters | Avg Score | HuggingFace |
|-------|------------|-----------|-------------|
| MMFineReason-2B | 2B | 65.3 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-2B) |
| MMFineReason-4B | 4B | 73.9 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-4B) |
| MMFineReason-8B | 8B | 75.7 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-8B) |

---

## 📚 Citation

```bibtex
@misc{lin2026mmfinereasonclosingmultimodalreasoning,
  title={MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods},
  author={Honglin Lin and Zheng Liu and Yun Zhu and Chonghan Qin and Juekai Lin and Xiaoran Shang and Conghui He and Wentao Zhang and Lijun Wu},
  year={2026},
  eprint={2601.21821},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.21821},
}
```

---

## 📄 License

This dataset is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0). Individual source datasets may have their own licenses.

---

## 🤝 Acknowledgments

We thank the creators of FineVision, MMR1, BMMR, Euclid30K, GameQA-140K, LLaVA-CoT, WeMath, ViRL39K, and others. We also thank the Qwen team for the powerful Qwen3-VL series models.

Provider: maas
Created: 2026-02-01