ReasonMed
Source: ModelScope community (updated 2026-01-06, indexed 2025-06-21)
Download link: https://modelscope.cn/datasets/AI-ModelScope/ReasonMed
# ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
<p align="center">
<a href="https://arxiv.org/pdf/2506.09513">📄 Paper</a> |
<a href="https://github.com/YuSun-Work/ReasonMed">💻 Code</a> |
<a href="https://huggingface.co/datasets/lingshu-medical-mllm/ReasonMed">📊 Dataset</a>
</p>
**ReasonMed** is the largest open-source medical reasoning dataset to date, containing **370K** high-quality question–answer examples with multi-step chain-of-thought (CoT) rationales and concise summaries. We distilled these from **1.75M** initial reasoning paths generated by three competitive large language models (Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, and HuatuoGPT-o1-70B), using a rigorous multi-agent verification and refinement pipeline.
---
## 📚 Dataset Composition
We sourced **194,925** unique multiple-choice medical questions from six established benchmarks, then generated and validated CoT paths:
| **Source**                                  | **# Questions** |
|---------------------------------------------|-----------------|
| **MedQA** (train / dev)                     | 10,178 / 1,272  |
| **MedMCQA** (train)                         | 182,822         |
| **PubMedQA** (train / val)                  | 450 / 50        |
| **MMLU – Anatomy** (dev / val)              | 5 / 14          |
| **MMLU – Clinical Knowledge** (dev / val)   | 5 / 29          |
| **MMLU – College Biology** (dev / val)      | 5 / 16          |
| **MMLU – College Medicine** (dev / val)     | 5 / 22          |
| **MMLU – Medical Genetics** (dev / val)     | 5 / 11          |
| **MMLU – Professional Medicine** (dev / val)| 5 / 31          |
| **Total**                                   | **194,925**     |
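
As a quick sanity check, the per-source counts in the table can be summed to confirm the stated total. The snippet below is only an illustrative sketch: the dictionary keys and the variable names are made up for this example and are not field names from the dataset.

```python
# Per-source question counts copied from the table above.
# Each tuple lists the split sizes shown for that source.
# Keys are illustrative labels, not dataset field names.
SOURCE_COUNTS = {
    "MedQA": (10_178, 1_272),
    "MedMCQA": (182_822,),
    "PubMedQA": (450, 50),
    "MMLU-Anatomy": (5, 14),
    "MMLU-Clinical-Knowledge": (5, 29),
    "MMLU-College-Biology": (5, 16),
    "MMLU-College-Medicine": (5, 22),
    "MMLU-Medical-Genetics": (5, 11),
    "MMLU-Professional-Medicine": (5, 31),
}

# Sum every split of every source.
total_questions = sum(sum(splits) for splits in SOURCE_COUNTS.values())
print(f"{total_questions:,}")
```

The sum matches the 194,925 questions reported in the table's last row.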
---
## 🔍 Data Generation & Curation Pipeline
1. **Multi-Agent CoT Generation**
   - Three LLMs each generate 3 CoT trajectories per question at temperatures {0.7, 0.9, 1.0}, i.e. 9 paths per question and about 1.75M raw paths in total.
2. **Verification (Qwen-2.5-72B)**
   - Judge each CoT for correctness, logical coherence, and medical factuality.
   - Label each as “Correct” or “Error”, with the reasons for any error.
3. **Difficulty Tiers & Refinement**
   - **Easy (0–4 errors):** select the top 2 CoTs via the Quality Ranker.
   - **Medium (5–7 errors):** refine the top 2 CoTs via the Error Refiner (GPT-4o-mini).
   - **Difficult (8–9 errors):** regenerate the full CoT via GPT-o1 with a 6-step template.
4. **Summarization (GPT-4o-mini)**
   - Condense each CoT into a concise answer rationale.
5. **Final Dataset**
   - Three variants are released, each with 370K examples (about 1.1M in total):
     - `ReasonMed` (`<think>{CoT}</think>{response}`)
     - `CoTMed` (`{CoT}`)
     - `ResponseMed` (`{response}`)
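
The bookkeeping in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline code: all function names are hypothetical, and it assumes that the "errors" in the tier definitions count how many of a question's 9 CoT paths the verifier labeled "Error", which the card implies but does not state outright.

```python
import re

N_QUESTIONS = 194_925   # from the composition table
N_MODELS = 3            # Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, HuatuoGPT-o1-70B
PATHS_PER_MODEL = 3     # CoT trajectories per model per question


def raw_path_count(n_questions: int = N_QUESTIONS) -> int:
    """Number of raw CoT paths before verification."""
    return n_questions * N_MODELS * PATHS_PER_MODEL


def difficulty_tier(n_error_paths: int) -> str:
    """Map the number of erroneous paths (assumed to be out of 9) to a tier."""
    if not 0 <= n_error_paths <= N_MODELS * PATHS_PER_MODEL:
        raise ValueError("expected between 0 and 9 erroneous paths")
    if n_error_paths <= 4:
        return "easy"        # pick top 2 CoTs with the Quality Ranker
    if n_error_paths <= 7:
        return "medium"      # refine top 2 CoTs with the Error Refiner
    return "difficult"       # regenerate the CoT from scratch


def build_variants(cot: str, response: str) -> dict[str, str]:
    """Render one example in the three released formats."""
    return {
        "ReasonMed": f"<think>{cot}</think>{response}",
        "CoTMed": cot,
        "ResponseMed": response,
    }


_THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def split_reasonmed(text: str) -> tuple[str, str]:
    """Recover (cot, response) from a ReasonMed-formatted string."""
    match = _THINK_RE.fullmatch(text)
    if match is None:
        raise ValueError("text is not in <think>...</think>response form")
    return match.group(1), match.group(2)


print(f"{raw_path_count():,}")   # 194,925 x 3 x 3
print(difficulty_tier(6))
```

`raw_path_count()` gives 1,754,325, which is the "1.75M raw paths" figure in step 1. `split_reasonmed` is the inverse of the `ReasonMed` entry produced by `build_variants`, which can be handy when the reasoning and the final answer need to be evaluated separately.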
---
## 📊 Data Quality Evaluation
### Medium Pipeline Validity Verification
To evaluate the Medium Pipeline, we sampled 1,000 question–CoT pairs and used our Score Evaluator to score them before and after the GPT-4o-mini corrections. The average score improved by **0.8** points.
| **Dataset**                   | **Samples** | **Avg. Score** |
|-------------------------------|-------------|----------------|
| Medium Pipeline (pre-opt)     | 1,000       | 7.37           |
| Medium Pipeline (post-opt)    | 1,000       | 8.17           |
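
The reported gain follows directly from the two averages in the table. The short check below uses variable names chosen for this example only.

```python
# Average Score Evaluator scores from the table above.
pre_opt_score = 7.37    # Medium Pipeline before GPT-4o-mini refinement
post_opt_score = 8.17   # Medium Pipeline after refinement

# Round to two decimals to avoid floating-point noise in the difference.
score_gain = round(post_opt_score - pre_opt_score, 2)
print(score_gain)
```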
### Comparison with Other Medical Reasoning Corpora
We compared ReasonMed against two open datasets, sampling 1,000 instances from each, and additionally scored a larger sample of 3,000 ReasonMed instances:
| **Dataset**                     | **Samples** | **Avg. Score** |
|---------------------------------|-------------|----------------|
| medical-o1-reasoning-SFT        | 1,000       | 8.03           |
| Medical-R1-Distill-Data         | 1,000       | 8.18           |
| **ReasonMed**                   | 1,000       | **8.45**       |
| **ReasonMed**                   | 3,000       | **8.50**       |
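
To make the size of the gap concrete, the margins between ReasonMed and the two baselines at equal sample size (1,000) can be computed from the table. This is a small illustrative calculation; the dictionary and variable names are assumptions of this example.

```python
# Average scores on 1,000 sampled instances, copied from the table above.
avg_scores_1k = {
    "medical-o1-reasoning-SFT": 8.03,
    "Medical-R1-Distill-Data": 8.18,
    "ReasonMed": 8.45,
}

# Margin of ReasonMed over each baseline, rounded to two decimals.
margins = {
    name: round(avg_scores_1k["ReasonMed"] - score, 2)
    for name, score in avg_scores_1k.items()
    if name != "ReasonMed"
}

for name, margin in margins.items():
    print(f"ReasonMed vs {name}: +{margin:.2f}")
```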
---
## 🎯 Multiscale Supervised Fine-Tuning Results
We fine-tuned Qwen2.5-7B under three regimes (CoT, Response, and hybrid Reason), each for one epoch and for three epochs. Rows not marked "(1 epoch)" are the three-epoch models. Evaluation on MedQA, MedMCQA, PubMedQA, and six MMLU subdomains yields:
| Model | MedQA | MedMCQA (val) | PubMedQA | Anatomy | CK | C-Bio | C-Med | Med-Gene | P-Med | **Total Acc** | Avg. Tokens |
|------------------------------|-------------|---------------|-------------|----------------|----------------|-----------------|-----------------|----------------|----------------|---------------|-------------|
| **BioMistral-7B** | 45.6 ± 1.4 | 41.5 ± 0.8 | 71.0 ± 2.0 | 76.3 ± 3.7 | 63.0 ± 3.0 | 62.5 ± 4.1 | 53.8 ± 3.8 | 67.0 ± 4.7 | 53.3 ± 3.0 | 48.9 | 60.1 |
| **Llama3-OpenBioLLM-8B** | 57.9 ± 1.4 | 57.7 ± 0.8 | 76.0 ± 6.1 | 68.9 ± 4.0 | 77.7 ± 2.6 | 83.3 ± 3.1 | 69.4 ± 3.5 | 83.0 ± 3.8 | 79.0 ± 2.5 | 62.9 | 75.1 |
| **Llama-3-8B-UltraMedical** | 63.2 ± 1.4 | 57.7 ± 0.8 | 78.0 ± 5.9 | 67.4 ± 4.1 | 74.3 ± 2.7 | 75.7 ± 3.6 | 61.9 ± 3.7 | 73.0 ± 4.5 | 78.7 ± 2.5 | 63.5 | 5177.7 |
| **Mistral-7B-Instruct-v0.3** | 52.2 ± 1.4 | 48.2 ± 0.8 | 82.0 ± 5.5 | 59.3 ± 4.2 | 69.4 ± 2.8 | 72.9 ± 3.7 | 56.7 ± 3.8 | 70.0 ± 4.6 | 66.5 ± 2.9 | 55.9 | 111.8 |
| **Yi-1.5-9B-Chatbot** | 49.8 ± 1.4 | 47.0 ± 0.8 | 69.0 ± 2.1 | 67.5 ± 3.8 | 63.9 ± 2.8 | 70.3 ± 3.8 | 51.2 ± 4.0 | 68.8 ± 4.5 | 66.7 ± 3.1 | 52.9 | 162.2 |
| **HuatuoGPT-o1-7B** | **68.4 ± 1.3** | 57.5 ± 0.8 | 74.0 ± 2.0 | 71.9 ± 3.9 | 78.5 ± 2.5 | **88.2 ± 2.7** | 67.6 ± 3.6 | 80.0 ± 4.0 | 77.6 ± 2.5 | 64.4 | 446.0 |
| **HuatuoGPT-o1-8B** | 65.4 ± 1.3 | 61.0 ± 0.8 | 74.6 ± 2.0 | 69.6 ± 4.0 | 77.7 ± 2.6 | 81.3 ± 3.3 | 69.9 ± 3.5 | 78.0 ± 4.2 | 71.0 ± 2.8 | 65.5 | 468.9 |
| **ResponseMed-7B (1 epoch)** | 62.2 ± 1.4 | 57.6 ± 0.8 | 84.0 ± 5.2 | 75.6 ± 3.7 | 77.7 ± 2.6 | 81.3 ± 3.3 | 69.9 ± 3.5 | 87.0 ± 3.4 | 76.8 ± 2.6 | 64.8 | – |
| **CoTMed-7B (1 epoch)** | 64.3 ± 1.3 | 62.4 ± 0.8 | 82.0 ± 5.5 | **77.0 ± 3.6** | **80.8 ± 2.4** | 81.3 ± 3.3 | 72.8 ± 3.4 | **90.0 ± 3.0** | 79.4 ± 2.5 | 67.8 | – |
| **ReasonMed-7B (1 epoch)** | 65.3 ± 1.3 | 62.3 ± 0.8 | 82.0 ± 5.5 | 74.8 ± 3.7 | 80.0 ± 2.5 | 81.3 ± 3.3 | **74.0 ± 3.4** | 86.0 ± 3.5 | 79.0 ± 2.5 | 67.7 | – |
| **ResponseMed-7B** | 67.5 ± 1.3 | 60.9 ± 0.8 | 80.0 ± 5.7 | 74.8 ± 3.7 | 77.4 ± 2.6 | **84.0 ± 3.1** | 71.1 ± 3.5 | 88.0 ± 3.3 | 76.5 ± 2.6 | 67.0 | 225.2 |
| **CoTMed-7B** | 66.3 ± 1.3 | 64.7 ± 0.7 | 80.0 ± 5.7 | 75.6 ± 3.7 | 79.6 ± 2.5 | 82.1 ± 3.2 | 71.7 ± 3.4 | 86.0 ± 3.5 | 79.9 ± 2.6 | 69.1 | 555.4 |
| **ReasonMed-7B** | 66.9 ± 1.3 | **65.1 ± 0.7** | **82.0 ± 5.5** | 75.6 ± 3.7 | 79.3 ± 2.5 | 79.2 ± 3.4 | 73.4 ± 3.4 | 85.0 ± 3.6 | **80.9 ± 2.4** | **69.6** | 626.0 |
> **Note**:
> - **CK** = Clinical Knowledge
> - **C-Bio** = College Biology
> - **C-Med** = College Medicine
> - **Med-Gene** = Medical Genetics
> - **P-Med** = Professional Medicine
- **One-epoch vs Three-epoch**: All three-epoch models outperform their one-epoch variants on total accuracy (ResponseMed-7B: 64.8% → 67.0%; CoTMed-7B: 67.8% → 69.1%; ReasonMed-7B: 67.7% → 69.6%).
- **Token Length**: CoTMed-7B and ReasonMed-7B produce much longer outputs (555.4 and 626.0 tokens on average) than ResponseMed-7B (225.2 tokens).
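
The two observations above can be reproduced from the table values. The sketch below computes the total-accuracy gain from one to three epochs and the output-length ratio relative to ResponseMed-7B; the dictionary and variable names are illustrative only.

```python
# Total accuracy (%) from the results table: (1 epoch, 3 epochs).
total_acc = {
    "ResponseMed-7B": (64.8, 67.0),
    "CoTMed-7B": (67.8, 69.1),
    "ReasonMed-7B": (67.7, 69.6),
}

# Average output tokens of the three-epoch models.
avg_tokens = {
    "ResponseMed-7B": 225.2,
    "CoTMed-7B": 555.4,
    "ReasonMed-7B": 626.0,
}

# Accuracy gain in percentage points when training for three epochs.
epoch_gain = {
    name: round(three - one, 1) for name, (one, three) in total_acc.items()
}

# Output length relative to the ResponseMed-7B baseline.
length_ratio = {
    name: round(tokens / avg_tokens["ResponseMed-7B"], 2)
    for name, tokens in avg_tokens.items()
}

for name in total_acc:
    print(f"{name}: +{epoch_gain[name]} pts, {length_ratio[name]}x tokens")
```

On these numbers, ReasonMed-7B reaches the highest total accuracy while producing roughly 2.8 times as many tokens as ResponseMed-7B, so the accuracy gain comes with a higher inference cost.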
---
## Citation
```
@misc{sun2025reasonmed370kmultiagentgenerated,
title={ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning},
author={Yu Sun and Xingyu Qian and Weiwen Xu and Hao Zhang and Chenghao Xiao and Long Li and Yu Rong and Wenbing Huang and Qifeng Bai and Tingyang Xu},
year={2025},
eprint={2506.09513},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09513},
}
@misc{lasateam2025lingshugeneralistfoundationmodel,
title={Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning},
author={LASA Team and Weiwen Xu and Hou Pong Chan and Long Li and Mahani Aljunied and Ruifeng Yuan and Jianyu Wang and Chenghao Xiao and Guizhen Chen and Chaoqun Liu and Zhaodonghui Li and Yu Sun and Junao Shen and Chaojun Wang and Jie Tan and Deli Zhao and Tingyang Xu and Hao Zhang and Yu Rong},
year={2025},
eprint={2506.07044},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.07044},
}
```
Provider: maas
Created: 2025-06-15