
Dragnoz/Medical-Reasoning-SFT-Mega

Source: Hugging Face. Updated 2026-02-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Dragnoz/Medical-Reasoning-SFT-Mega
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- reasoning
- healthcare
- clinical
- chain-of-thought
- thinking
- sft
- mega
- combined
size_categories:
- 1M<n<10M
---

# Medical-Reasoning-SFT-Mega

The ultimate medical reasoning dataset - combining 7 state-of-the-art AI models with fair distribution deduplication. 1.79 million unique samples with 3.78 billion tokens of medical chain-of-thought reasoning.

## Dataset Overview

| Metric | Value |
|--------|-------|
| **Total Samples** | 1,789,998 (after deduplication) |
| **Total Tokens** | ~3.78 Billion |
| **Content Tokens** | ~2.22 Billion |
| **Reasoning Tokens** | ~1.56 Billion |
| **Samples with Reasoning** | 1,789,764 (100.0%) |
| **Unique Questions** | 1,237,711 |
| **Shared Questions** | 552,287 |
| **Source Models** | 7 |
| **Language** | English |

## Source Models

This dataset combines reasoning from 7 leading AI models with fair representation:

| Model | Original | Unique | From Shared | Final | % |
|-------|----------|--------|-------------|-------|---|
| MiniMax-M2.1 | 204,773 | 76,892 | 42,126 | 119,018 | 6.6% |
| Baichuan-M3-235B | 124,520 | 30,589 | 43,070 | 73,659 | 4.1% |
| GPT-OSS-120B | 506,150 | 403,208 | 42,113 | 445,321 | 24.9% |
| Qwen3-Next-80B | 604,249 | 295,828 | 118,539 | 414,367 | 23.1% |
| GLM_4.5_Air | 225,179 | 151,278 | 42,117 | 193,395 | 10.8% |
| Nemotron-Nano-30B | 444,544 | 86,978 | 118,546 | 205,524 | 11.5% |
| Trinity-Mini | 810,284 | 192,938 | 145,776 | 338,714 | 18.9% |

**Total before deduplication:** 2,919,699 samples

**Total after deduplication:** 1,789,998 samples

## Deduplication Strategy: Fair Distribution

When the same medical question appears in multiple source datasets, we use **fair distribution** to ensure balanced representation:

1. **Unique questions** (1,237,711): Questions appearing in only one model are included directly
2. **Shared questions** (552,287): Questions appearing in multiple models are distributed fairly across all contributing models

For shared questions, we assign each question to the model that has been assigned the fewest shared questions so far. This ensures no single model dominates the duplicate pool, resulting in diverse reasoning styles across the dataset.

This differs from priority-based deduplication where one model would "win" all duplicates. With fair distribution, each model gets roughly equal share of the shared question pool.

## Schema

Each sample follows the conversational messages format with reasoning content:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical expert...",
      "reasoning_content": null
    },
    {
      "role": "user",
      "content": "What are the symptoms of diabetes?",
      "reasoning_content": null
    },
    {
      "role": "assistant",
      "content": "The main symptoms of diabetes include...",
      "reasoning_content": "Let me think through this systematically..."
    }
  ]
}
```

### Fields

| Field | Type | Description |
|-------|------|-------------|
| `messages` | list | Array of message objects in the conversation |
| `messages[].role` | string | Either "system", "user", or "assistant" |
| `messages[].content` | string | The main message content |
| `messages[].reasoning_content` | string or null | Chain-of-thought reasoning (assistant messages only) |

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

dataset = load_dataset("OpenMed/Medical-Reasoning-SFT-Mega")
```

### Accessing Samples

```python
# Get a sample
sample = dataset['train'][0]

# Access messages
for msg in sample['messages']:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content'][:100]}...")
    if msg['reasoning_content']:
        print(f"Reasoning: {msg['reasoning_content'][:100]}...")
```

### Filtering by Reasoning

```python
# Get samples with reasoning content
samples_with_reasoning = dataset['train'].filter(
    lambda x: x['messages'][-1]['reasoning_content'] is not None
)
```

## Intended Use

This dataset is designed for:

- **Fine-tuning medical reasoning models**: Train LLMs with diverse reasoning styles from multiple state-of-the-art models
- **Chain-of-thought training**: Develop models that show detailed thinking processes
- **Medical QA systems**: Build robust question-answering systems for healthcare applications
- **Research**: Study and compare reasoning patterns across different AI architectures
- **Distillation**: Transfer capabilities from multiple large models to smaller ones

## Why Mega?

1. **Diversity**: 7 different model architectures provide varied reasoning approaches
2. **Fair Representation**: Fair distribution ensures balanced contributions from all models
3. **Scale**: 1.79M unique samples and 3.78B tokens for comprehensive training
4. **Coverage**: Spans clinical, diagnostic, pharmacological, and general medical knowledge

## Limitations and Considerations

- This dataset is generated by AI models and should not be used as a substitute for professional medical advice
- Responses may contain inaccuracies and should be validated by medical professionals
- Not intended for clinical decision-making without expert review
- The reasoning traces reflect model approaches, not necessarily optimal clinical reasoning

## Related Datasets

Individual model datasets are also available:

- [Medical-Reasoning-SFT-MiniMax-M2.1](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-MiniMax-M2.1)
- [Medical-Reasoning-SFT-Baichuan-M3-235B](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Baichuan-M3-235B)
- [Medical-Reasoning-SFT-GPT-OSS-120B-V2](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B-V2)
- [Medical-Reasoning-SFT-Qwen3-Next-80B](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Qwen3-Next-80B)
- [Medical-Reasoning-SFT-GLM_4.5_Air](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-GLM_4.5_Air)
- [Medical-Reasoning-SFT-Nemotron-Nano-30B](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B)
- [Medical-Reasoning-SFT-Trinity-Mini](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Trinity-Mini)

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{medical_reasoning_sft_mega,
  title={Medical-Reasoning-SFT-Mega},
  author={OpenMed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Mega}
}
```

## License

Apache 2.0
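
## Illustrative Sketch: Fair Distribution

The fair distribution rule described under "Deduplication Strategy" (assign each shared question to the contributing model that has received the fewest shared questions so far) can be sketched in a few lines of Python. This is a minimal sketch and not the dataset authors' actual pipeline: the function name `fair_distribute`, the input shape (a mapping from question text to the list of models that answered it), the processing order, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import defaultdict


def fair_distribute(question_to_models):
    """Pick one source model per question (hypothetical helper).

    question_to_models maps a question to the list of models that answered it.
    Unique questions go straight to their only model; shared questions go to
    the contributing model with the fewest shared assignments so far.
    """
    shared_counts = defaultdict(int)  # shared questions assigned to each model
    assignment = {}
    for question, models in question_to_models.items():
        if len(models) == 1:
            # Unique question: included directly, does not affect the counts.
            assignment[question] = models[0]
            continue
        # Assumption: ties are broken alphabetically so the result is deterministic.
        winner = min(models, key=lambda m: (shared_counts[m], m))
        assignment[question] = winner
        shared_counts[winner] += 1
    return assignment


# Toy input with made-up model names "A", "B", "C".
example = {
    "q1": ["A"],
    "q2": ["A", "B"],
    "q3": ["A", "B"],
    "q4": ["B", "C"],
    "q5": ["A", "B", "C"],
}
assignment = fair_distribute(example)
```

Because a model can only receive shared questions it actually answered, the shares balance among contributors to each question rather than across all models globally, which is consistent with the differing "From Shared" counts in the Source Models table.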
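
## Illustrative Sketch: Schema Check

The Fields table under "Schema" can be turned into a small sanity check to run on samples before training. The sketch below is an illustration only: `validate_sample` and `VALID_ROLES` are hypothetical names, and treating reasoning on a non-assistant message as a problem is a strict reading of "assistant messages only" that the dataset itself may not enforce.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_sample(sample):
    """Return a list of schema problems; an empty list means none were found."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        reasoning = msg.get("reasoning_content")
        if reasoning is not None and not isinstance(reasoning, str):
            problems.append(f"message {i}: 'reasoning_content' must be a string or None")
        # Assumption: reasoning is expected on assistant messages only.
        if reasoning is not None and role != "assistant":
            problems.append(f"message {i}: reasoning on a non-assistant message")
    return problems


# The sample from the Schema section, written as a Python dict.
schema_example = {
    "messages": [
        {"role": "system", "content": "You are a medical expert...", "reasoning_content": None},
        {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
        {
            "role": "assistant",
            "content": "The main symptoms of diabetes include...",
            "reasoning_content": "Let me think through this systematically...",
        },
    ]
}
schema_problems = validate_sample(schema_example)
```

With the `datasets` library, a check like this could be applied through `dataset['train'].filter(lambda x: not validate_sample(x))`, assuming the loaded rows have the same shape as the example above.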
Provider:
Dragnoz