II-Medical-Reasoning-SFT
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-05.
Download link:
https://modelscope.cn/datasets/Intelligent-Internet/II-Medical-Reasoning-SFT
# II-Medical-Reasoning-SFT
II-Medical SFT is a curated dataset designed to support the supervised fine-tuning of large language models (LLMs) for medical reasoning tasks. It comprises multi-turn dialogues, clinical case scenarios, and question-answer pairs that reflect the complex reasoning processes encountered in real-world clinical practice.
The dataset is intended to help models develop key competencies such as differential diagnosis, evidence-based decision-making, patient communication, and guideline-informed treatment planning. II-Medical SFT is built using a combination of our custom synthetic data generation pipeline and publicly available medical reasoning datasets, ensuring both diversity and clinical relevance.
We hope this dataset will be a valuable resource for the community and contribute to the advancement of medical reasoning capabilities in AI systems.
## Dataset Creation
The training dataset comprises 2,197,741 samples from the following sources:
### 1. Public Medical Reasoning Datasets
- [General Medical Reasoning](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
- [Medical-R1-Distill-Data](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data)
- [Medical-R1-Distill-Data-Chinese](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data-Chinese)
- [UCSC-VLAA/m23k-tokenized](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized)
### 2. Synthetic Medical QA Data with Qwen3-235B-A22B (873,497 samples)
Generated from established medical datasets:
- [MedMcQA](https://huggingface.co/datasets/openlifescienceai/medmcqa)
- [MedQA](https://huggingface.co/datasets/bigbio/med_qa)
- [PubmedQA](https://huggingface.co/datasets/qiaojin/PubMedQA/viewer/pqa_unlabeled)
- [MedReason](https://huggingface.co/datasets/UCSC-VLAA/MedReason)
For each prompt, we sampled 6-10 responses and kept only the correct ones, which yields the 873,497 samples in this subset.
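The sample-then-filter step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `generate` and `extract_answer` are hypothetical callables standing in for the Qwen3-235B-A22B sampler and an answer parser, and exact string matching against the gold label is an assumed correctness check.

```python
import random
from typing import Callable, List


def rejection_sample(
    prompt: str,
    gold_answer: str,
    generate: Callable[[str], str],        # hypothetical: returns one sampled response
    extract_answer: Callable[[str], str],  # hypothetical: parses the final answer
    n_samples: int = 8,                    # the card reports 6-10 samples per prompt
) -> List[str]:
    """Sample n responses and keep those whose final answer matches the gold label."""
    target = gold_answer.strip().lower()
    responses = [generate(prompt) for _ in range(n_samples)]
    return [r for r in responses if extract_answer(r).strip().lower() == target]


# Toy stand-ins so the sketch runs without a model.
_rng = random.Random(0)


def fake_generate(prompt: str) -> str:
    return f"Some reasoning. Final answer: {_rng.choice(['A', 'B'])}"


def fake_extract(response: str) -> str:
    return response.rsplit("Final answer:", 1)[-1]


kept = rejection_sample("Toy question with options A and B", "A", fake_generate, fake_extract)
```

Every response in `kept` ends in the gold answer; prompts for which no sample is correct simply contribute nothing.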
### 3. Curated Medical R1 Traces
First, we gathered all public R1 traces from:
- [PrimeIntellect/SYNTHETIC-1](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37)
- [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
- [a-m-team/AM-DeepSeek-R1-Distilled-1.4M](https://arxiv.org/abs/2503.19633v1)
- [open-thoughts/OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): Science subset only
- Other resources: [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT),...
All R1 reasoning traces were processed through a domain-specific pipeline as follows:
1. Embedding Generation: Prompts are embedded using sentence-transformers/all-MiniLM-L6-v2.
2. Clustering: Perform K-means clustering with 50,000 clusters.
3. Domain Classification:
   - For each cluster, select the 10 prompts nearest to the cluster center.
   - Classify the domain of each selected prompt using Qwen2.5-32b-Instruct.
   - Assign the cluster's domain based on majority voting among the classified prompts.
4. Domain Filtering: Keep only clusters labeled as Medical or Biology for the final dataset.
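Steps 3 and 4 above can be sketched in plain Python. The sketch assumes embeddings, cluster assignments, and centroids have already been computed (for example with all-MiniLM-L6-v2 and K-means, as in steps 1 and 2); `classify` is a hypothetical callable standing in for the Qwen2.5-32b-Instruct domain classifier, and Euclidean distance to the centroid is an assumed notion of "nearest".

```python
from collections import Counter
from math import dist
from typing import Callable, Dict, List, Sequence


def label_clusters(
    prompts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    assignments: Sequence[int],          # cluster id for each prompt
    centers: Sequence[Sequence[float]],  # one centroid per cluster
    classify: Callable[[str], str],      # hypothetical LLM domain classifier
    top_k: int = 10,
) -> Dict[int, str]:
    """Label each cluster by majority vote over its top_k prompts nearest the centroid."""
    members: Dict[int, List[int]] = {}
    for idx, cid in enumerate(assignments):
        members.setdefault(cid, []).append(idx)

    labels: Dict[int, str] = {}
    for cid, idxs in members.items():
        nearest = sorted(idxs, key=lambda i: dist(embeddings[i], centers[cid]))[:top_k]
        votes = Counter(classify(prompts[i]) for i in nearest)
        # On a tie, most_common returns the label that was counted first.
        labels[cid] = votes.most_common(1)[0][0]
    return labels


def filter_domains(prompts, assignments, labels, keep=("Medical", "Biology")):
    """Keep prompts whose cluster label is in the allowed domains."""
    return [p for p, cid in zip(prompts, assignments) if labels[cid] in keep]


# Toy data: two well-separated clusters in 2-D.
prompts = ["dose of aspirin", "symptoms of flu", "solve x^2 = 4", "integrate sin x"]
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
assignments = [0, 0, 1, 1]
centers = [[0.05, 0.05], [5.05, 5.05]]


def toy_classify(prompt: str) -> str:
    return "Medical" if ("aspirin" in prompt or "flu" in prompt) else "Math"


labels = label_clusters(prompts, embeddings, assignments, centers, toy_classify)
medical_prompts = filter_domains(prompts, assignments, labels)
```

Classifying only a handful of prompts per cluster keeps the number of LLM calls proportional to the number of clusters rather than the number of traces.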
### General Medical & Instruction Following Dataset (1,025,903 samples)
We generated general medical instruction-following data and evaluated it with GPT-4o as an automatic judge. Only responses that scored at least 8/10 when compared against the ground-truth answers were retained.
- 229,433 prompts from [Text-Book-QA-subset](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus)
- 276,079 prompts from [Text-Patient-QA-subset](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus)
- 142,927 prompts from [Text-Guideline-QA-subset](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus)
- 215,647 prompts from [Chat-Doctor-QA](https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k)
- 74,190 prompts from our Evol-Instruct medical dataset.
We also used 87,627 prompts from the instruction-following subset of [a-m-team/AM-Qwen3-Distilled](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
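As a rough illustration of the judge-based filter, the sketch below keeps records whose score is at least 8. The record layout and the `judge_score` field name are assumptions made for this example; the card does not describe the actual schema. The second half checks that the six prompt counts listed above add up to the section total.

```python
from typing import Dict, List

JUDGE_THRESHOLD = 8  # keep responses scored >= 8 out of 10


def keep_high_scoring(records: List[Dict], threshold: int = JUDGE_THRESHOLD) -> List[Dict]:
    """Return only the records whose judge score meets the threshold."""
    return [r for r in records if r["judge_score"] >= threshold]


# Hypothetical judged records.
records = [
    {"id": "a", "judge_score": 9},
    {"id": "b", "judge_score": 7},
    {"id": "c", "judge_score": 8},
]
kept = keep_high_scoring(records)

# Prompt counts as listed in this section.
prompt_counts = {
    "Text-Book-QA-subset": 229_433,
    "Text-Patient-QA-subset": 276_079,
    "Text-Guideline-QA-subset": 142_927,
    "Chat-Doctor-QA": 215_647,
    "Evol-Instruct medical": 74_190,
    "AM-Qwen3-Distilled (instruction following)": 87_627,
}
total = sum(prompt_counts.values())
```

The counts sum to 1,025,903, which matches the figure in the section heading.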
### Deduplication
Response deduplication:
- N-gram size: 4
- Jaccard threshold: 0.7
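A minimal sketch of near-duplicate removal with these two settings is shown below. It assumes whitespace tokenization, word-level 4-grams, and a greedy pass that drops a response when its Jaccard similarity to any kept response reaches 0.7. None of these details are stated in the card, and the pairwise comparison here is quadratic; a large-scale pipeline would more likely use an approximate method such as MinHash.

```python
from typing import List, Set, Tuple

Shingles = Set[Tuple[str, ...]]


def word_ngrams(text: str, n: int = 4) -> Shingles:
    """Set of word-level n-grams; texts shorter than n words form a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Shingles, b: Shingles) -> float:
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_responses(responses: List[str], n: int = 4, threshold: float = 0.7) -> List[str]:
    """Greedy dedup: keep a response only if it is not too similar to any kept one."""
    kept: List[str] = []
    kept_shingles: List[Shingles] = []
    for text in responses:
        shingles = word_ngrams(text, n)
        if all(jaccard(shingles, s) < threshold for s in kept_shingles):
            kept.append(text)
            kept_shingles.append(shingles)
    return kept


responses = [
    "the patient should start metformin and recheck hba1c in three months",
    "the patient should start metformin and recheck hba1c in three months time",
    "order a chest x ray and start empiric antibiotics today",
]
unique = dedup_responses(responses)
```

In the toy data the second response shares 8 of 9 distinct 4-grams with the first (Jaccard of about 0.89), so it is dropped, while the third is kept.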
### Data Decontamination
We use a two-step decontamination process:
1. Following the [open-r1](https://github.com/huggingface/open-r1) project, we decontaminate the dataset against the evaluation datasets using 8-gram overlap.
2. We then apply the fuzzy decontamination method from [`s1k`](https://arxiv.org/abs/2501.19393) with a threshold of 80%.
**The dataset is carefully decontaminated against the evaluation datasets.**
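The two steps can be approximated as in the sketch below. It is not the open-r1 or s1k code: the exact 8-gram check uses whitespace tokens, and `difflib.SequenceMatcher` from the standard library is swapped in as a stand-in fuzzy matcher, with the 80% threshold applied to its similarity ratio. Both choices are assumptions made to keep the example self-contained.

```python
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of a lowercased text (empty if the text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    train: List[str],
    eval_questions: List[str],
    n: int = 8,
    fuzzy_threshold: float = 0.8,
) -> List[str]:
    """Drop training samples that overlap with any evaluation question."""
    eval_ngrams = set().union(*(ngrams(q, n) for q in eval_questions))
    clean = []
    for sample in train:
        # Step 1: exact overlap on any 8-gram.
        if ngrams(sample, n) & eval_ngrams:
            continue
        # Step 2: fuzzy match (stand-in similarity measure, see the note above).
        if any(
            SequenceMatcher(None, sample.lower(), q.lower()).ratio() >= fuzzy_threshold
            for q in eval_questions
        ):
            continue
        clean.append(sample)
    return clean


eval_questions = [
    "a 45 year old man presents with crushing chest pain radiating to the left arm"
]
train = [
    # Shares an exact 8-gram with the evaluation question.
    "Question: a 45 year old man presents with crushing chest pain after exercise",
    # No shared 8-gram, but a close paraphrase.
    "45 year old male presents with crushing chest pains radiating to his left arm",
    # Unrelated.
    "which vitamin deficiency causes scurvy in adults",
]
clean = decontaminate(train, eval_questions)
```

The second pass exists because light rewording (for example "man" to "male") breaks every shared 8-gram while leaving the question essentially unchanged.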
## Limitations and Considerations
- Dataset may contain inherent biases from source materials
- Medical knowledge requires regular updates
## Citation
```bib
@misc{2025II-Medical-Reasoning,
title={II-Medical-Reasoning: Medical Reasoning Dataset},
author={Intelligent Internet},
year={2025}
}
```
Provider: maas
Created: 2025-07-04



