five

YDXX/G-Health-sft-data

收藏
Hugging Face2026-03-08 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/YDXX/G-Health-sft-data
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - question-answering tags: - medical --- # G-Health SFT Data This directory contains the Supervised Fine-Tuning (SFT) dialogue data used in **Stage 1** of the G-Health model training pipeline. ## Overview As the stage-1 training data, we aggregated multi-source Chinese medical dialogue and question-answering datasets, totaling 2,964,200 dialogue samples before cleaning. ## Data Sources The two largest components are **Chinese-medical-dialogue** (799,743 samples, ~27.0%) and **huatuo_knowledge_graph_qa** (798,444 samples, ~26.9%), together accounting for ~53.9%. The next major sources include **DISC-Med-SFT** (464,882 samples, ~15.7%), **UltraMedical** (409,593 samples, ~13.8%), and **huatuo_encyclopedia_qa** (364,420 samples, ~12.3%). Collectively, these five datasets comprise ~95.7% of the SFT corpus and form its primary backbone. We further incorporated smaller but more targeted datasets to strengthen specific capabilities: **medical-o1-reasoning-SFT** (90,120 samples, ~3.0%) and **Medical-R1-Distill-Data-Chinese** (17,000 samples, ~0.57%) emphasize long-chain reasoning and learning verifiable reasoning traces, while **XunYiWenYao** (19,998 samples, ~0.67%) provides additional QA examples that more closely reflect real patient-style queries ## Data Scale (After Cleaning) After data cleaning and deduplication, we obtained **2,817,556** dialogue samples in total: | File | Samples | Description | |------|---------|-------------| | **medical_history.json** | 342,562 | Multi-turn samples with dialogue history | | **medical_without_history.json** | 2,474,994 | Single-turn samples without dialogue history | ## Data Format ### Multi-turn (with history) Each sample contains `instruction`, `input`, `output`, and `history`: | Field | Description | |-------|-------------| | `instruction` | Current user query | | `input` | Additional context (often empty) | | `output` | Model response for the current turn | | `history` | List of `[user_query, assistant_response]` pairs for previous turns | ```json { "instruction": "足部骨折。你好大夫...请问骨折对位可以吗?/no_think", "input": "", "output": "您好,我很高兴能为您提供帮助...", "history": [ ["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."], ["手术已经完成几天了。/no_think", "了解了,手术已经完成几天了..."] ] } ``` ### Single-turn (without history) Each sample contains `instruction`, `input`, and `output`: | Field | Description | |-------|-------------| | `instruction` | User query | | `input` | Additional context (e.g., patient description) | | `output` | Model response | ```json { "instruction": "小儿肥胖超重该如何治疗/no_think", "input": "女宝宝,刚7岁,这一年,察觉到,我家孩子身上肉很多...", "output": "孩子出现肥胖症的情况。家长要通过孩子运功和健康的饮食来缓解..." } ``` ## Files - **medical_history.json** — Multi-turn dialogue samples (342,562) - **medical_without_history.json** — Single-turn dialogue samples (2,474,994)

许可证:Apache-2.0 任务类别: - 问答 标签: - 医疗 # G-Health 监督微调(Supervised Fine-Tuning,SFT)数据集 本目录包含用于G-Health模型训练流水线**第一阶段**的监督微调对话数据。 ## 概览 作为第一阶段训练数据,我们整合了多源中文医疗对话与问答数据集,清洗前总计包含2,964,200条对话样本。 ## 数据集来源 两大核心组件分别为**Chinese-medical-dialogue**(799,743条样本,占比约27.0%)与**huatuo_knowledge_graph_qa**(798,444条样本,占比约26.9%),二者合计占比约53.9%。其余主要来源包括**DISC-Med-SFT**(464,882条样本,占比约15.7%)、**UltraMedical**(409,593条样本,占比约13.8%)以及**huatuo_encyclopedia_qa**(364,420条样本,占比约12.3%)。上述五大数据集合计占比约95.7%,构成了本监督微调语料的主体骨架。 我们进一步引入了体量更小但针对性更强的数据集以强化特定能力:**medical-o1-reasoning-SFT**(90,120条样本,占比约3.0%)与**Medical-R1-Distill-Data-Chinese**(17,000条样本,占比约0.57%)侧重长链推理与可验证推理轨迹学习;**XunYiWenYao**(19,998条样本,占比约0.67%)则提供了更贴近真实患者问询风格的额外问答示例。 ## 清洗后数据规模 经过数据清洗与去重处理后,我们最终获得总计2,817,556条对话样本: | 文件 | 样本量 | 描述 | |------|---------|-------------| | **medical_history.json** | 342,562 | 包含对话历史的多轮对话样本 | | **medical_without_history.json** | 2,474,994 | 不含对话历史的单轮对话样本 | ## 数据格式 ### 多轮对话(含历史) 每条样本包含`instruction`、`input`、`output`与`history`字段: | 字段 | 说明 | |-------|-------------| | `instruction` | 当前用户问询 | | `input` | 额外上下文(通常为空) | | `output` | 当前轮次的模型回复 | | `history` | 过往轮次的`[用户问询, 助手回复]`对列表 | json { "instruction": "足部骨折。你好大夫...请问骨折对位可以吗?/no_think", "input": "", "output": "您好,我很高兴能为您提供帮助...", "history": [ ["这是图片。您好毛大夫.../no_think", "非常感谢您提供的图片..."], ["手术已经完成几天了。/no_think", "了解了,手术已经完成几天了..."] ] } ### 单轮对话(不含历史) 每条样本包含`instruction`、`input`与`output`字段: | 字段 | 说明 | |-------|-------------| | `instruction` | 用户问询 | | `input` | 额外上下文(如患者病情描述) | | `output` | 模型回复 | json { "instruction": "小儿肥胖超重该如何治疗/no_think", "input": "女宝宝,刚7岁,这一年,察觉到,我家孩子身上肉很多...", "output": "孩子出现肥胖症的情况。家长要通过孩子运动和健康的饮食来缓解..." } ## 数据集文件 - **medical_history.json** — 多轮对话样本(342,562条) - **medical_without_history.json** — 单轮对话样本(2,474,994条)
提供机构:
YDXX
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作