YDXX/G-Health-sft-data
收藏Hugging Face2026-03-08 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/YDXX/G-Health-sft-data
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- question-answering
tags:
- medical
---
# G-Health SFT Data
This directory contains the Supervised Fine-Tuning (SFT) dialogue data used in **Stage 1** of the G-Health model training pipeline.
## Overview
As the stage-1 training data, we aggregated multi-source Chinese medical dialogue and question-answering datasets, totaling 2,964,200 dialogue samples before cleaning.
## Data Sources
The two largest components are **Chinese-medical-dialogue** (799,743 samples, ~27.0%) and **huatuo_knowledge_graph_qa** (798,444 samples, ~26.9%), together accounting for ~53.9%. The next major sources include **DISC-Med-SFT** (464,882 samples, ~15.7%), **UltraMedical** (409,593 samples, ~13.8%), and **huatuo_encyclopedia_qa** (364,420 samples, ~12.3%). Collectively, these five datasets comprise ~95.7% of the SFT corpus and form its primary backbone.
We further incorporated smaller but more targeted datasets to strengthen specific capabilities: **medical-o1-reasoning-SFT** (90,120 samples, ~3.0%) and **Medical-R1-Distill-Data-Chinese** (17,000 samples, ~0.57%) emphasize long-chain reasoning and learning verifiable reasoning traces, while **XunYiWenYao** (19,998 samples, ~0.67%) provides additional QA examples that more closely reflect real patient-style queries
## Data Scale (After Cleaning)
After data cleaning and deduplication, we obtained **2,817,556** dialogue samples in total:
| File | Samples | Description |
|------|---------|-------------|
| **medical_history.json** | 342,562 | Multi-turn samples with dialogue history |
| **medical_without_history.json** | 2,474,994 | Single-turn samples without dialogue history |
## Data Format
### Multi-turn (with history)
Each sample contains `instruction`, `input`, `output`, and `history`:
| Field | Description |
|-------|-------------|
| `instruction` | Current user query |
| `input` | Additional context (often empty) |
| `output` | Model response for the current turn |
| `history` | List of `[user_query, assistant_response]` pairs for previous turns |
```json
{
"instruction": "足部骨折。你好大夫...请问骨折对位可以吗?/no_think",
"input": "",
"output": "您好,我很高兴能为您提供帮助...",
"history": [
["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."],
["手术已经完成几天了。/no_think", "了解了,手术已经完成几天了..."]
]
}
```
### Single-turn (without history)
Each sample contains `instruction`, `input`, and `output`:
| Field | Description |
|-------|-------------|
| `instruction` | User query |
| `input` | Additional context (e.g., patient description) |
| `output` | Model response |
```json
{
"instruction": "小儿肥胖超重该如何治疗/no_think",
"input": "女宝宝,刚7岁,这一年,察觉到,我家孩子身上肉很多...",
"output": "孩子出现肥胖症的情况。家长要通过孩子运功和健康的饮食来缓解..."
}
```
## Files
- **medical_history.json** — Multi-turn dialogue samples (342,562)
- **medical_without_history.json** — Single-turn dialogue samples (2,474,994)
许可证:Apache-2.0
任务类别:
- 问答
标签:
- 医疗
# G-Health 监督微调(Supervised Fine-Tuning,SFT)数据集
本目录包含用于G-Health模型训练流水线**第一阶段**的监督微调对话数据。
## 概览
作为第一阶段训练数据,我们整合了多源中文医疗对话与问答数据集,清洗前总计包含2,964,200条对话样本。
## 数据集来源
两大核心组件分别为**Chinese-medical-dialogue**(799,743条样本,占比约27.0%)与**huatuo_knowledge_graph_qa**(798,444条样本,占比约26.9%),二者合计占比约53.9%。其余主要来源包括**DISC-Med-SFT**(464,882条样本,占比约15.7%)、**UltraMedical**(409,593条样本,占比约13.8%)以及**huatuo_encyclopedia_qa**(364,420条样本,占比约12.3%)。上述五大数据集合计占比约95.7%,构成了本监督微调语料的主体骨架。
我们进一步引入了体量更小但针对性更强的数据集以强化特定能力:**medical-o1-reasoning-SFT**(90,120条样本,占比约3.0%)与**Medical-R1-Distill-Data-Chinese**(17,000条样本,占比约0.57%)侧重长链推理与可验证推理轨迹学习;**XunYiWenYao**(19,998条样本,占比约0.67%)则提供了更贴近真实患者问询风格的额外问答示例。
## 清洗后数据规模
经过数据清洗与去重处理后,我们最终获得总计2,817,556条对话样本:
| 文件 | 样本量 | 描述 |
|------|---------|-------------|
| **medical_history.json** | 342,562 | 包含对话历史的多轮对话样本 |
| **medical_without_history.json** | 2,474,994 | 不含对话历史的单轮对话样本 |
## 数据格式
### 多轮对话(含历史)
每条样本包含`instruction`、`input`、`output`与`history`字段:
| 字段 | 说明 |
|-------|-------------|
| `instruction` | 当前用户问询 |
| `input` | 额外上下文(通常为空) |
| `output` | 当前轮次的模型回复 |
| `history` | 过往轮次的`[用户问询, 助手回复]`对列表 |
json
{
"instruction": "足部骨折。你好大夫...请问骨折对位可以吗?/no_think",
"input": "",
"output": "您好,我很高兴能为您提供帮助...",
"history": [
["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."],
["手术已经完成几天了。/no_think", "了解了,手术已经完成几天了..."]
]
}
### 单轮对话(不含历史)
每条样本包含`instruction`、`input`与`output`字段:
| 字段 | 说明 |
|-------|-------------|
| `instruction` | 用户问询 |
| `input` | 额外上下文(如患者病情描述) |
| `output` | 模型回复 |
json
{
"instruction": "小儿肥胖超重该如何治疗/no_think",
"input": "女宝宝,刚7岁,这一年,察觉到,我家孩子身上肉很多...",
"output": "孩子出现肥胖症的情况。家长要通过孩子运动和健康的饮食来缓解..."
}
## 数据集文件
- **medical_history.json** — 多轮对话样本(342,562条)
- **medical_without_history.json** — 单轮对话样本(2,474,994条)
提供机构:
YDXX



