MMFineReason-1.8M-Qwen3-VL-235B-Thinking

Name: MMFineReason-1.8M-Qwen3-VL-235B-Thinking
Creator: maas
Published: 2026-04-29 15:50:04
License: no description provided

ModelScope community; updated 2026-04-29, indexed 2026-05-03

Download link:

https://modelscope.cn/datasets/OpenDataArena/MMFineReason-1.8M-Qwen3-VL-235B-Thinking


Description:

<div align="center">
<h1>MMFineReason</h1>
<p><strong>Closing the Multimodal Reasoning Gap via Open Data-Centric Methods</strong></p>
</div>

<div align="center">

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2601.21821)
[![Homepage](https://img.shields.io/badge/Homepage-MMFineReason-blue)](https://mmfinereason.github.io/)
[![Collections](https://img.shields.io/badge/🤗-Collections-yellow)](https://huggingface.co/collections/OpenDataArena/mmfinereason)

</div>

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/model_compare.png" width="100%" alt="Model Performance Comparison">
<figcaption><em>Average score across mathematical reasoning and multimodal understanding benchmarks.</em></figcaption>
</figure>

---

## 📖 Overview

**MMFineReason** is a large-scale, high-quality multimodal reasoning dataset comprising **1.8M samples** and **5.1B solution tokens**, featuring detailed reasoning annotations distilled from **Qwen3-VL-235B-A22B-Thinking**.

### 🎯 Key Highlights

- **1.8M High-Quality Samples** with **5.1B Solution Tokens**
- **Long-Form CoT**: Average reasoning length of **2,910 tokens** (2.7× HoneyBee, 4.3× OpenMMReasoner)
- **100% Caption Coverage**: Dense visual descriptions averaging 609 tokens
- **Multi-Domain**: Mathematics (79.4%), Science (13.8%), Puzzle/Game (4.6%), General/OCR (2.2%)
- **State-of-the-Art**: Models trained on this dataset achieve SOTA performance in their size class

---

## 🏗️ Data Construction Pipeline

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/pipeline_detailed.png" width="100%" alt="Data Construction Pipeline">
<figcaption><em>MMFineReason data pipeline and the two-stage training.</em></figcaption>
</figure>

### Stage 1: Data Collection & Standardization

- Aggregate diverse multimodal datasets from the open-source community
- Translate non-English questions; remove noise and extraneous artifacts
- Rewrite shallow prompts into reasoning-encouraging instructions
- Filter out non-reasoning tasks; clean corrupted or oversized images

### Stage 2: Reasoning Distillation

- **Teacher Model**: Qwen3-VL-235B-A22B-Thinking
- **Four-Phase Framework**: Information Extraction → Problem Setup → Solution Execution → Validation
- **Output**: Reasoning in `<think>...</think>`, final answer in `<answer>...</answer>`
- **Caption Generation**: 100% coverage via Qwen3-VL-235B-A22B-Thinking

### Stage 3: Data Selection

- **Quality Filtering**: Template/length validation, n-gram deduplication, correctness verification (~20% removed)
- **Difficulty Filtering**: Use the Qwen3-VL-4B-Thinking pass rate as a difficulty proxy
  - **MMFineReason-123K**: Pass rate = 0 (hardest 7%)
  - **MMFineReason-586K**: Pass rate ≠ 1 (challenging 33%)

---

## 🔧 Data Schema

| Field | Description |
|-------|-------------|
| `source` | Origin dataset name (e.g., "Geometry3K", "MMR1", "BMMR") |
| `id` | Unique sample identifier within the source dataset |
| `original_question` | Raw question text as obtained from the source |
| `original_answer` | Raw answer as obtained from the source |
| `image` | Visual input (PIL Image) |
| `question` | Cleaned, standardized question in English |
| `answer` | Verified answer, extracted and standardized |
| `qwen3vl_235b_instruct_caption` | Dense visual description generated by Qwen3-VL-235B-A22B-Instruct |
| `qwen3vl_235b_thinking_response` | Long-form Chain-of-Thought reasoning generated by Qwen3-VL-235B-A22B-Thinking |
| `qwen3vl_4b_pass_rate` | Difficulty proxy based on Qwen3-VL-4B-Thinking's performance (0.0 = hardest, 1.0 = easiest) |
| `is_consistent` | Boolean indicating whether the generated reasoning matches the ground truth |
| `consistency_analysis` | Detailed analysis of the consistency verification |

---

## 🗂️ Dataset Composition

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/composition_sunburst.png" width="100%" alt="Dataset Composition">
<figcaption><em>Dataset composition of MMFineReason-1.8M.</em></figcaption>
</figure>

**Mathematics (79.4%)** forms the backbone, primarily sourced from MMR1 (1.27M) and enriched with WaltonColdStart, ViRL39K, Euclid30K, MMK12, Geo170K, Geo3K, mm-openr1, and the WeMath family.

**Science (13.8%)** is anchored by VisualWebInstruct (157.3K) and BMMR (54.6K), complemented by TQA, AI2D, Zebra-CoT, and ScienceQA.

**Puzzle/Game (4.6%)** targets strategic planning and abstract reasoning, dominated by GameQA-140K (71.7K) and enriched by Raven, VisualSphinx, and PuzzleQA.

**General/OCR (2.2%)** includes 38.7K samples from LLaVA-CoT, serving as regularization to preserve broad visual and OCR capabilities.
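
The output format from Stage 2 and the difficulty subsets from Stage 3 can be illustrated with a minimal sketch over records shaped like the schema above. The function names (`split_response`, `select_by_pass_rate`) and the toy records are hypothetical, and the sketch assumes pass rates are stored as plain floats; it is not the authors' pipeline code.

```python
import re
from typing import Dict, List, Optional, Tuple

# Tag patterns for the documented output format:
# reasoning in <think>...</think>, final answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_response(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, final_answer); None where a tag pair is missing."""
    think = THINK_RE.search(response)
    answer = ANSWER_RE.search(response)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


def select_by_pass_rate(records: List[Dict], subset: str) -> List[Dict]:
    """Keep records matching one of the two documented difficulty rules.

    'hardest'     -> pass rate == 0 (the rule given for MMFineReason-123K)
    'challenging' -> pass rate != 1 (the rule given for MMFineReason-586K)
    """
    if subset == "hardest":
        return [r for r in records if r["qwen3vl_4b_pass_rate"] == 0.0]
    if subset == "challenging":
        return [r for r in records if r["qwen3vl_4b_pass_rate"] != 1.0]
    raise ValueError(f"unknown subset: {subset!r}")


# Hypothetical toy records using field names from the schema table.
toy_records = [
    {
        "id": "a",
        "qwen3vl_4b_pass_rate": 0.0,
        "qwen3vl_235b_thinking_response":
            "<think>Angles sum to 180.</think><answer>60</answer>",
    },
    {
        "id": "b",
        "qwen3vl_4b_pass_rate": 0.5,
        "qwen3vl_235b_thinking_response":
            "<think>Count the edges.</think><answer>12</answer>",
    },
    {
        "id": "c",
        "qwen3vl_4b_pass_rate": 1.0,
        "qwen3vl_235b_thinking_response":
            "<think>Read the label.</think><answer>B</answer>",
    },
]

hardest_ids = [r["id"] for r in select_by_pass_rate(toy_records, "hardest")]
challenging_ids = [r["id"] for r in select_by_pass_rate(toy_records, "challenging")]
reasoning, final_answer = split_response(
    toy_records[0]["qwen3vl_235b_thinking_response"]
)
```

Exact float comparison is used here only because the rules are stated as "= 0" and "≠ 1"; if the stored values carry rounding noise, a small tolerance may be more appropriate.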

---

## 📊 Dataset Statistics

### Token Length Comparison with Other Datasets

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_token_length.png" width="100%" alt="Token Length Statistics Comparison">
<figcaption><em>Comparison of token length statistics across datasets.</em></figcaption>
</figure>

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/token_length_distribution.png" width="100%" alt="Token Length Distribution">
<figcaption><em>Token length analysis. (Left) Internal domain distribution; (Mid) External CoT comparison; (Right) Caption richness comparison.</em></figcaption>
</figure>

MMFineReason achieves an average CoT length of **2,910 tokens**, approximately **2.7× longer** than HoneyBee and **4.3× longer** than OpenMMReasoner. The extended tail (Max: 16,316) demonstrates capacity for highly complex, multi-stage reasoning tasks. For captions, MMFineReason averages 609 tokens with **100% coverage**, compared to HoneyBee's 299 tokens at ~58% coverage.

---

### 🖼️ Image Category Distribution

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_image_category.png" width="100%" alt="Image Category Statistics">
<figcaption><em>Image category statistics by group (STEM vs. Natural).</em></figcaption>
</figure>

The corpus is predominantly STEM and diagrammatic content (98.3%), with geometric diagrams, mathematical plots, and logic puzzles accounting for 75.3%. Natural images (1.7%) provide diversity across urban scenes, indoor scenes, and human activities for generalization assessment.
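
The caption figures in the token length comparison above combine two quantities: coverage (the share of samples that have a caption at all) and the average caption length. A rough sketch of how both could be measured on records shaped like the schema follows. The helper name `caption_stats` and the toy records are hypothetical, and whitespace splitting is only a stand-in for whichever tokenizer the authors used, so absolute numbers would differ from the reported ones.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def caption_stats(
    records: List[Dict],
    field: str = "qwen3vl_235b_instruct_caption",
    tokenize: Callable[[str], Sequence[str]] = str.split,
) -> Tuple[float, float]:
    """Return (coverage, mean_length) for the caption field.

    coverage    : fraction of records with a non-empty caption
    mean_length : average token count over captioned records only
    """
    lengths = [
        len(tokenize(r[field]))
        for r in records
        if r.get(field) and r[field].strip()
    ]
    coverage = len(lengths) / len(records) if records else 0.0
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return coverage, mean_length


# Hypothetical toy records: two captioned, one empty, one missing.
toy_captions = [
    {"qwen3vl_235b_instruct_caption": "a triangle with sides"},
    {"qwen3vl_235b_instruct_caption": ""},
    {"qwen3vl_235b_instruct_caption": None},
    {"qwen3vl_235b_instruct_caption": "two circles"},
]

coverage, mean_length = caption_stats(toy_captions)
```

Averaging over captioned records only keeps the two quantities independent, so a dataset with sparse but long captions is not confused with one that has dense but short captions.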

---

### 📈 Difficulty Distribution

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/pass_rate_distribution.png" width="100%" alt="Pass Rate Distribution">
<figcaption><em>Pass rate distribution across sub-datasets, sorted by descending mean pass rate.</em></figcaption>
</figure>

Science-oriented datasets (ScienceQA, AI2D, TQA) exhibit high pass rates due to clean diagrams and the multiple-choice format. Puzzle/game datasets (GameQA-140K, Raven, VisualSphinx) show the lowest pass rates, as they require multi-step abstract reasoning. The binary distribution pattern reflects that reasoning tasks often have all-or-nothing outcomes.

---

## 📊 Benchmark Results

### Main Results

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_main_results.png" width="100%" alt="Main Benchmark Results">
<figcaption><em>Comparison of MMFineReason models with state-of-the-art models.</em></figcaption>
</figure>

MMFineReason-4B surpasses Qwen3-VL-8B-Thinking (73.9 vs 72.5), while MMFineReason-8B outperforms the larger Qwen3-VL-30B-A3B-Thinking (75.7 vs 74.5) and exceeds Gemini-2.5-Flash. On mathematical benchmarks, MFR-8B (MMFineReason-8B) achieves 83.4% on DynaMath (vs Qwen3-VL-32B-Thinking's 82.0%) and 67.1% on MathVision, outperforming HoneyBee-8B and OMR-7B by 23-30 points. Despite minimal chart training data, MFR-8B generalizes well to CharXiv (90.8%) and RealWorldQA (75.6%).

### SFT vs RL Training Analysis

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_sft_rl_results.png" width="100%" alt="SFT vs RL Results">
<figcaption><em>Results comparing MFR-SFT and MFR-Thinking models against base Qwen3-VL variants.</em></figcaption>
</figure>

SFT drives major gains in mathematical reasoning (e.g., MathVision: 53.9% → 67.6% for 8B). RL enhances generalization on understanding benchmarks (e.g., AI2D: 78.5% → 82.5% for 2B) while showing variance on math benchmarks.

---

## 🔬 Ablation Studies

### Data Efficiency ("Less is More")

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/ablation_data_efficiency.png" width="100%" alt="Data Efficiency Analysis">
<figcaption><em>Performance comparison across different data scales and model sizes.</em></figcaption>
</figure>

Removing the 67% of samples that are easy (Pass Rate = 1) improves performance by 0.6 points (75.0 → 75.6). Training on only the hardest 7% (123K samples) achieves 73.3, surpassing Qwen3-VL-8B-Thinking (72.5) with 14× less data. This indicates that challenging samples provide most of the training signal, and that rigorous filtering eliminates redundancy in large-scale datasets.

### Sub-Dataset Performance

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/subdataset_performance.png" width="100%" alt="Sub-Dataset Performance Analysis">
<figcaption><em>Performance landscape of distilled sub-datasets (x-axis: sample count, log scale).</em></figcaption>
</figure>

ViRL39K (39K samples) retains 98.9% of MMR1's (1.5M) performance with only 2.4% data volume. WeMath2.0-SFT achieves 70.98% with just 814 samples, matching datasets 1000× larger. BMMR (80K, 300+ disciplines) outperforms the larger GameQA-140K (140K), showing that disciplinary diversity matters more than scale.
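
The percentages quoted in the data-efficiency ablation can be sanity-checked from the nominal subset sizes. The short calculation below uses the rounded sizes that appear in the subset names (1.8M, 586K, 123K), so results are approximate. Reading "14× less data" as a comparison against the full 1.8M set is an assumption made here, not something the text states explicitly.

```python
# Nominal sizes taken from the dataset and subset names (rounded values).
FULL_SIZE = 1_800_000         # MMFineReason-1.8M
CHALLENGING_SIZE = 586_000    # MMFineReason-586K (pass rate != 1)
HARDEST_SIZE = 123_000        # MMFineReason-123K (pass rate == 0)


def share_percent(subset_size: int, total_size: int) -> float:
    """Size of a subset as a percentage of the full dataset."""
    return 100.0 * subset_size / total_size


hardest_share = share_percent(HARDEST_SIZE, FULL_SIZE)          # about 6.8
challenging_share = share_percent(CHALLENGING_SIZE, FULL_SIZE)  # about 32.6
easy_share = 100.0 - challenging_share                          # about 67.4
reduction_factor = FULL_SIZE / HARDEST_SIZE                     # about 14.6

print(f"hardest subset:     {hardest_share:.1f}% of the full set")
print(f"challenging subset: {challenging_share:.1f}% of the full set")
print(f"easy samples:       {easy_share:.1f}% of the full set")
print(f"full / hardest:     {reduction_factor:.1f}x")
```

Under these rounded sizes the shares come out close to the reported 7%, 33%, and 67%, and the full set is between 14 and 15 times larger than the hardest subset.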

---

## 🏆 Trained Models

| Model | Parameters | Avg Score | HuggingFace |
|-------|------------|-----------|-------------|
| MMFineReason-2B | 2B | 65.3 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-2B) |
| MMFineReason-4B | 4B | 73.9 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-4B) |
| MMFineReason-8B | 8B | 75.7 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-8B) |

---

## 📚 Citation

```bibtex
@misc{lin2026mmfinereasonclosingmultimodalreasoning,
  title={MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods},
  author={Honglin Lin and Zheng Liu and Yun Zhu and Chonghan Qin and Juekai Lin and Xiaoran Shang and Conghui He and Wentao Zhang and Lijun Wu},
  year={2026},
  eprint={2601.21821},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.21821},
}
```

---

## 📄 License

This dataset is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0). Individual source datasets may have their own licenses.

---

## 🤝 Acknowledgments

We thank the creators of FineVision, MMR1, BMMR, Euclid30K, GameQA-140K, LLaVA-CoT, WeMath, ViRL39K, and others. We also thank the Qwen team for the powerful Qwen3-VL series models.


Provider: maas
Created: 2026-02-01
