YDXX/G-Health-sft-data

Name: YDXX/G-Health-sft-data
Creator: YDXX
Published: 2026-03-08 04:21:10
License: 暂无描述

Hugging Face2026-03-08 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/YDXX/G-Health-sft-data

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - question-answering tags: - medical --- # G-Health SFT Data This directory contains the Supervised Fine-Tuning (SFT) dialogue data used in **Stage 1** of the G-Health model training pipeline. ## Overview As the stage-1 training data, we aggregated multi-source Chinese medical dialogue and question-answering datasets, totaling 2,964,200 dialogue samples before cleaning. ## Data Sources The two largest components are **Chinese-medical-dialogue** (799,743 samples, ~27.0%) and **huatuo_knowledge_graph_qa** (798,444 samples, ~26.9%), together accounting for ~53.9%. The next major sources include **DISC-Med-SFT** (464,882 samples, ~15.7%), **UltraMedical** (409,593 samples, ~13.8%), and **huatuo_encyclopedia_qa** (364,420 samples, ~12.3%). Collectively, these five datasets comprise ~95.7% of the SFT corpus and form its primary backbone. We further incorporated smaller but more targeted datasets to strengthen specific capabilities: **medical-o1-reasoning-SFT** (90,120 samples, ~3.0%) and **Medical-R1-Distill-Data-Chinese** (17,000 samples, ~0.57%) emphasize long-chain reasoning and learning verifiable reasoning traces, while **XunYiWenYao** (19,998 samples, ~0.67%) provides additional QA examples that more closely reflect real patient-style queries ## Data Scale (After Cleaning) After data cleaning and deduplication, we obtained **2,817,556** dialogue samples in total: | File | Samples | Description | |------|---------|-------------| | **medical_history.json** | 342,562 | Multi-turn samples with dialogue history | | **medical_without_history.json** | 2,474,994 | Single-turn samples without dialogue history | ## Data Format ### Multi-turn (with history) Each sample contains `instruction`, `input`, `output`, and `history`: | Field | Description | |-------|-------------| | `instruction` | Current user query | | `input` | Additional context (often empty) | | `output` | Model response for the current turn | | `history` | List of `[user_query, assistant_response]` pairs for previous turns | ```json { "instruction": "足部骨折。你好大夫...请问骨折对位可以吗？/no_think", "input": "", "output": "您好，我很高兴能为您提供帮助...", "history": [ ["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."], ["手术已经完成几天了。/no_think", "了解了，手术已经完成几天了..."] ] } ``` ### Single-turn (without history) Each sample contains `instruction`, `input`, and `output`: | Field | Description | |-------|-------------| | `instruction` | User query | | `input` | Additional context (e.g., patient description) | | `output` | Model response | ```json { "instruction": "小儿肥胖超重该如何治疗/no_think", "input": "女宝宝，刚7岁，这一年，察觉到，我家孩子身上肉很多...", "output": "孩子出现肥胖症的情况。家长要通过孩子运功和健康的饮食来缓解..." } ``` ## Files - **medical_history.json** — Multi-turn dialogue samples (342,562) - **medical_without_history.json** — Single-turn dialogue samples (2,474,994)

许可证：Apache-2.0 任务类别： - 问答标签： - 医疗 # G-Health 监督微调（Supervised Fine-Tuning，SFT）数据集本目录包含用于G-Health模型训练流水线**第一阶段**的监督微调对话数据。 ## 概览作为第一阶段训练数据，我们整合了多源中文医疗对话与问答数据集，清洗前总计包含2,964,200条对话样本。 ## 数据集来源两大核心组件分别为**Chinese-medical-dialogue**（799,743条样本，占比约27.0%）与**huatuo_knowledge_graph_qa**（798,444条样本，占比约26.9%），二者合计占比约53.9%。其余主要来源包括**DISC-Med-SFT**（464,882条样本，占比约15.7%）、**UltraMedical**（409,593条样本，占比约13.8%）以及**huatuo_encyclopedia_qa**（364,420条样本，占比约12.3%）。上述五大数据集合计占比约95.7%，构成了本监督微调语料的主体骨架。我们进一步引入了体量更小但针对性更强的数据集以强化特定能力：**medical-o1-reasoning-SFT**（90,120条样本，占比约3.0%）与**Medical-R1-Distill-Data-Chinese**（17,000条样本，占比约0.57%）侧重长链推理与可验证推理轨迹学习；**XunYiWenYao**（19,998条样本，占比约0.67%）则提供了更贴近真实患者问询风格的额外问答示例。 ## 清洗后数据规模经过数据清洗与去重处理后，我们最终获得总计2,817,556条对话样本： | 文件 | 样本量 | 描述 | |------|---------|-------------| | **medical_history.json** | 342,562 | 包含对话历史的多轮对话样本 | | **medical_without_history.json** | 2,474,994 | 不含对话历史的单轮对话样本 | ## 数据格式 ### 多轮对话（含历史）每条样本包含`instruction`、`input`、`output`与`history`字段： | 字段 | 说明 | |-------|-------------| | `instruction` | 当前用户问询 | | `input` | 额外上下文（通常为空） | | `output` | 当前轮次的模型回复 | | `history` | 过往轮次的`[用户问询, 助手回复]`对列表 | json { "instruction": "足部骨折。你好大夫...请问骨折对位可以吗？/no_think", "input": "", "output": "您好，我很高兴能为您提供帮助...", "history": [ ["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."], ["手术已经完成几天了。/no_think", "了解了，手术已经完成几天了..."] ] } ### 单轮对话（不含历史）每条样本包含`instruction`、`input`与`output`字段： | 字段 | 说明 | |-------|-------------| | `instruction` | 用户问询 | | `input` | 额外上下文（如患者病情描述） | | `output` | 模型回复 | json { "instruction": "小儿肥胖超重该如何治疗/no_think", "input": "女宝宝，刚7岁，这一年，察觉到，我家孩子身上肉很多...", "output": "孩子出现肥胖症的情况。家长要通过孩子运动和健康的饮食来缓解..." } ## 数据集文件 - **medical_history.json** — 多轮对话样本（342,562条） - **medical_without_history.json** — 单轮对话样本（2,474,994条）

提供机构：

YDXX

5,000+

优质数据集

54 个

任务类型

进入经典数据集