muhammadocama/ReasonMed
Source: Hugging Face · Updated: 2026-02-10 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/muhammadocama/ReasonMed
Description:
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-generation
pretty_name: ReasonMed
tags:
- biology
- medical
---
# ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
<p align="center">
<a href="https://arxiv.org/pdf/2506.09513">📄 Paper</a> |
<a href="https://github.com/YuSun-Work/ReasonMed">💻 Code</a> |
<a href="https://huggingface.co/datasets/lingshu-medical-mllm/ReasonMed">📊 Dataset</a>
</p>
**ReasonMed** is the largest open-source medical reasoning dataset to date, containing **370K** high-quality question–answer examples with multi-step chain-of-thought (CoT) rationales and concise summaries. We distilled these from **1.75M** initial reasoning paths generated by three competitive large language models (Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, and HuatuoGPT-o1-70B), using a rigorous multi-agent verification and refinement pipeline.
---
## 📚 Dataset Composition
We sourced **194,925** unique multiple-choice medical questions from six established benchmarks, then generated and validated CoT paths:
| **Source** | **# Questions** |
|--------------------------------|-----------------|
| **MedQA** (train / dev) | 10,178 / 1,272 |
| **MedMCQA** (train) | 182,822 |
| **PubMedQA** (train / val) | 450 / 50 |
| **MMLU – Anatomy** (dev / val) | 5 / 14 |
| **MMLU – Clinical Knowledge** (dev / val) | 5 / 29 |
| **MMLU – College Biology** (dev / val) | 5 / 16 |
| **MMLU – College Medicine** (dev / val) | 5 / 22 |
| **MMLU – Medical Genetics** (dev / val) | 5 / 11 |
| **MMLU – Professional Medicine** (dev / val) | 5 / 31 |
| **Total** | **194,925** |
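
As a quick sanity check, the per-source counts in the table can be summed to confirm the stated total. The sketch below simply transcribes the table; the dictionary layout and function names are illustrative and not part of the ReasonMed codebase.

```python
# Question counts transcribed from the table above.
# Each tuple holds the splits listed for that source (e.g. train/dev or dev/val).
SOURCE_COUNTS = {
    "MedQA": (10_178, 1_272),
    "MedMCQA": (182_822,),
    "PubMedQA": (450, 50),
    "MMLU-Anatomy": (5, 14),
    "MMLU-Clinical Knowledge": (5, 29),
    "MMLU-College Biology": (5, 16),
    "MMLU-College Medicine": (5, 22),
    "MMLU-Medical Genetics": (5, 11),
    "MMLU-Professional Medicine": (5, 31),
}


def total_questions(counts):
    """Sum every split of every source."""
    return sum(sum(splits) for splits in counts.values())


def mmlu_questions(counts):
    """Sum only the MMLU subsets (hypothetical helper for illustration)."""
    return sum(
        sum(splits) for name, splits in counts.items() if name.startswith("MMLU")
    )


if __name__ == "__main__":
    print("total:", total_questions(SOURCE_COUNTS))
    print("MMLU subsets:", mmlu_questions(SOURCE_COUNTS))
```

The sum matches the **194,925** reported in the table; the six MMLU subsets together account for only 153 of those questions, so MedMCQA dominates the question pool.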
---
## 🔍 Data Generation & Curation Pipeline
1. **Multi-Agent CoT Generation**
   - Three LLMs each generate 3 CoT trajectories per question, one at each temperature in {0.7, 0.9, 1.0}, yielding 9 paths per question and about 1.75M raw paths in total.
2. **Verification (Qwen-2.5-72B)**
   - Judge each CoT for correctness, logical coherence, and medical factuality.
   - Label each CoT as “Correct” or “Error”, recording the reason for any error.
3. **Difficulty Tiers & Refinement**
   - **Easy (0–4 errors):** select the top 2 CoTs via the Quality Ranker.
   - **Medium (5–7 errors):** refine the top 2 CoTs via the Error Refiner (GPT-4o-mini).
   - **Difficult (8–9 errors):** regenerate the full CoT via GPT-o1 with a 6-step template.
4. **Summarization (GPT-4o-mini)**
   - Condense each CoT into a concise answer rationale.
5. **Final Dataset**
   - Three variants of 370K examples each, about 1.1M examples in total:
     - ReasonMed (`<think>{CoT}</think>{response}`)
     - CoTMed (`{CoT}`)
     - ResponseMed (`{response}`)
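
The arithmetic and routing rules in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the "errors" count is the number of a question's 9 paths that the verifier labeled "Error", and the function names (`difficulty_tier`, `build_variants`, `split_reasonmed`) are hypothetical.

```python
import re

N_QUESTIONS = 194_925             # unique questions, from the composition table
N_MODELS = 3                      # three generator LLMs
TEMPERATURES = (0.7, 0.9, 1.0)    # one trajectory per temperature (assumed)

PATHS_PER_QUESTION = N_MODELS * len(TEMPERATURES)
RAW_PATHS = N_QUESTIONS * PATHS_PER_QUESTION   # about 1.75M


def difficulty_tier(n_errors: int) -> str:
    """Map a question's erroneous-path count to the tier named in step 3."""
    if not 0 <= n_errors <= PATHS_PER_QUESTION:
        raise ValueError(f"n_errors must be in 0..{PATHS_PER_QUESTION}")
    if n_errors <= 4:
        return "easy"        # keep the top 2 CoTs
    if n_errors <= 7:
        return "medium"      # refine the top 2 CoTs
    return "difficult"       # regenerate the CoT from scratch


def build_variants(cot: str, response: str) -> dict:
    """Render one example in the three released formats."""
    return {
        "ReasonMed": f"<think>{cot}</think>{response}",
        "CoTMed": cot,
        "ResponseMed": response,
    }


_THINK = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def split_reasonmed(text: str) -> tuple:
    """Recover (CoT, response) from a ReasonMed-formatted string."""
    match = _THINK.fullmatch(text)
    if match is None:
        raise ValueError("text is not in <think>...</think>response form")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(PATHS_PER_QUESTION, RAW_PATHS)
    print([difficulty_tier(n) for n in (0, 4, 5, 7, 8, 9)])
    example = build_variants("step 1 ... step n", "Answer: B")
    print(split_reasonmed(example["ReasonMed"]))
```

Nine paths per question times 194,925 questions gives 1,754,325 raw paths, which is consistent with the 1.75M figure quoted in step 1.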
---
## 📊 Data Quality Evaluation
### Medium Pipeline Validity Verification
To evaluate the Medium Pipeline, we sampled 1,000 question–CoT pairs and scored each with our Score Evaluator before and after the GPT-4o-mini refinement. The average score improved by **0.8** points (from 7.37 to 8.17).
| **Dataset** | **Samples** | **Avg. Score** |
|-------------------------------|-------------|----------------|
| Medium Pipeline (pre-opt) | 1,000 | 7.37 |
| Medium Pipeline (post-opt) | 1,000 | 8.17 |
### Comparison with Other Medical Reasoning Corpora
We compared ReasonMed against two open datasets, sampling 1,000 instances from each, and additionally scored a larger sample of 3,000 ReasonMed instances:
| **Dataset** | **Samples** | **Avg. Score** |
|---------------------------------|-------------|----------------|
| medical-o1-reasoning-SFT | 1,000 | 8.03 |
| Medical-R1-Distill-Data | 1,000 | 8.18 |
| **ReasonMed** | 1,000 | **8.45** |
| **ReasonMed** | 3,000 | **8.50** |
---
## 🎯 Multiscale Supervised Fine-Tuning Results
We fine-tuned Qwen2.5-7B under three regimes (CoT, Response, and the hybrid Reason format), training each for one epoch and for three epochs. Evaluation on MedQA, MedMCQA, PubMedQA, and six MMLU subdomains yields:
| Model | MedQA | MedMCQA (val) | PubMedQA | Anatomy | CK | C-Bio | C-Med | Med-Gene | P-Med | **Total Acc** | Avg. Tokens |
|------------------------------|-------------|---------------|-------------|----------------|----------------|-----------------|-----------------|----------------|----------------|---------------|-------------|
| **BioMistral-7B** | 45.6 ± 1.4 | 41.5 ± 0.8 | 71.0 ± 2.0 | 76.3 ± 3.7 | 63.0 ± 3.0 | 62.5 ± 4.1 | 53.8 ± 3.8 | 67.0 ± 4.7 | 53.3 ± 3.0 | 48.9 | 60.1 |
| **Llama3-OpenBioLLM-8B** | 57.9 ± 1.4 | 57.7 ± 0.8 | 76.0 ± 6.1 | 68.9 ± 4.0 | 77.7 ± 2.6 | 83.3 ± 3.1 | 69.4 ± 3.5 | 83.0 ± 3.8 | 79.0 ± 2.5 | 62.9 | 75.1 |
| **Llama-3-8B-UltraMedical** | 63.2 ± 1.4 | 57.7 ± 0.8 | 78.0 ± 5.9 | 67.4 ± 4.1 | 74.3 ± 2.7 | 75.7 ± 3.6 | 61.9 ± 3.7 | 73.0 ± 4.5 | 78.7 ± 2.5 | 63.5 | 5177.7 |
| **Mistral-7B-Instruct-v0.3** | 52.2 ± 1.4 | 48.2 ± 0.8 | 82.0 ± 5.5 | 59.3 ± 4.2 | 69.4 ± 2.8 | 72.9 ± 3.7 | 56.7 ± 3.8 | 70.0 ± 4.6 | 66.5 ± 2.9 | 55.9 | 111.8 |
| **Yi-1.5-9B-Chatbot** | 49.8 ± 1.4 | 47.0 ± 0.8 | 69.0 ± 2.1 | 67.5 ± 3.8 | 63.9 ± 2.8 | 70.3 ± 3.8 | 51.2 ± 4.0 | 68.8 ± 4.5 | 66.7 ± 3.1 | 52.9 | 162.2 |
| **HuatuoGPT-o1-7B** | **68.4 ± 1.3** | 57.5 ± 0.8 | 74.0 ± 2.0 | 71.9 ± 3.9 | 78.5 ± 2.5 | **88.2 ± 2.7** | 67.6 ± 3.6 | 80.0 ± 4.0 | 77.6 ± 2.5 | 64.4 | 446.0 |
| **HuatuoGPT-o1-8B** | 65.4 ± 1.3 | 61.0 ± 0.8 | 74.6 ± 2.0 | 69.6 ± 4.0 | 77.7 ± 2.6 | 81.3 ± 3.3 | 69.9 ± 3.5 | 78.0 ± 4.2 | 71.0 ± 2.8 | 65.5 | 468.9 |
| **ResponseMed-7B (1 epoch)** | 62.2 ± 1.4 | 57.6 ± 0.8 | 84.0 ± 5.2 | 75.6 ± 3.7 | 77.7 ± 2.6 | 81.3 ± 3.3 | 69.9 ± 3.5 | 87.0 ± 3.4 | 76.8 ± 2.6 | 64.8 | – |
| **CoTMed-7B (1 epoch)** | 64.3 ± 1.3 | 62.4 ± 0.8 | 82.0 ± 5.5 | **77.0 ± 3.6** | **80.8 ± 2.4** | 81.3 ± 3.3 | 72.8 ± 3.4 | **90.0 ± 3.0** | 79.4 ± 2.5 | 67.8 | – |
| **ReasonMed-7B (1 epoch)** | 65.3 ± 1.3 | 62.3 ± 0.8 | 82.0 ± 5.5 | 74.8 ± 3.7 | 80.0 ± 2.5 | 81.3 ± 3.3 | **74.0 ± 3.4** | 86.0 ± 3.5 | 79.0 ± 2.5 | 67.7 | – |
| **ResponseMed-7B** | 67.5 ± 1.3 | 60.9 ± 0.8 | 80.0 ± 5.7 | 74.8 ± 3.7 | 77.4 ± 2.6 | **84.0 ± 3.1** | 71.1 ± 3.5 | 88.0 ± 3.3 | 76.5 ± 2.6 | 67.0 | 225.2 |
| **CoTMed-7B** | 66.3 ± 1.3 | 64.7 ± 0.7 | 80.0 ± 5.7 | 75.6 ± 3.7 | 79.6 ± 2.5 | 82.1 ± 3.2 | 71.7 ± 3.4 | 86.0 ± 3.5 | 79.9 ± 2.6 | 69.1 | 555.4 |
| **ReasonMed-7B** | 66.9 ± 1.3 | **65.1 ± 0.7** | **82.0 ± 5.5** | 75.6 ± 3.7 | 79.3 ± 2.5 | 79.2 ± 3.4 | 73.4 ± 3.4 | 85.0 ± 3.6 | **80.9 ± 2.4** | **69.6** | 626.0 |
> **Note**:
> - **CK** = Clinical Knowledge
> - **C-Bio** = College Biology
> - **C-Med** = College Medicine
> - **Med-Gene** = Medical Genetics
> - **P-Med** = Professional Medicine
- **One-epoch vs Three-epoch**: Three-epoch models (the rows without an epoch label) outperform their one-epoch variants in total accuracy (e.g., ReasonMed-7B improves from 67.7% to 69.6%).
- **Token Length**: CoTMed and ReasonMed generate longer reasoning (≈555–626 tokens on average) than ResponseMed (≈225 tokens).
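
To make the two observations above concrete, the following sketch recomputes the accuracy gain from one to three epochs and the output-length ratio relative to ResponseMed, using only the "Total Acc" and "Avg. Tokens" values transcribed from the results table. The helper names are illustrative.

```python
# (one-epoch, three-epoch) total accuracy, transcribed from the results table.
TOTAL_ACC = {
    "ResponseMed-7B": (64.8, 67.0),
    "CoTMed-7B": (67.8, 69.1),
    "ReasonMed-7B": (67.7, 69.6),
}

# Average generated tokens for the three-epoch models, from the same table.
AVG_TOKENS = {
    "ResponseMed-7B": 225.2,
    "CoTMed-7B": 555.4,
    "ReasonMed-7B": 626.0,
}


def epoch_gain(model: str) -> float:
    """Accuracy gain in percentage points from one to three epochs."""
    one_epoch, three_epoch = TOTAL_ACC[model]
    return round(three_epoch - one_epoch, 1)


def token_ratio(model: str, baseline: str = "ResponseMed-7B") -> float:
    """Average output length relative to the baseline model."""
    return round(AVG_TOKENS[model] / AVG_TOKENS[baseline], 2)


if __name__ == "__main__":
    for name in TOTAL_ACC:
        print(f"{name}: +{epoch_gain(name)} pts, {token_ratio(name)}x tokens")
```

On these numbers every regime gains between 1.3 and 2.2 points from the extra epochs, while the CoT and Reason variants produce roughly 2.5 to 2.8 times as many tokens as ResponseMed.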
---
## Citation
```
@misc{sun2025reasonmed370kmultiagentgenerated,
title={ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning},
author={Yu Sun and Xingyu Qian and Weiwen Xu and Hao Zhang and Chenghao Xiao and Long Li and Yu Rong and Wenbing Huang and Qifeng Bai and Tingyang Xu},
year={2025},
eprint={2506.09513},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09513},
}
@misc{lasateam2025lingshugeneralistfoundationmodel,
title={Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning},
author={LASA Team and Weiwen Xu and Hou Pong Chan and Long Li and Mahani Aljunied and Ruifeng Yuan and Jianyu Wang and Chenghao Xiao and Guizhen Chen and Chaoqun Liu and Zhaodonghui Li and Yu Sun and Junao Shen and Chaojun Wang and Jie Tan and Deli Zhao and Tingyang Xu and Hao Zhang and Yu Rong},
year={2025},
eprint={2506.07044},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.07044},
}
```
Provider: muhammadocama



