OpenDCAI/dataflow-knowledge-med-40k

Name: OpenDCAI/dataflow-knowledge-med-40k
Creator: OpenDCAI
Published: 2025-12-29 10:10:43
License: 暂无描述

Hugging Face2025-12-29 更新2026-01-03 收录

下载链接：

https://hf-mirror.com/datasets/OpenDCAI/dataflow-knowledge-med-40k

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集包含从权威医学指南文档中提取的多项选择题-答案对，使用DataFlow知识提取流程生成。每个数据实例包括一个问题和一个对应的答案，答案格式包含明确的推理过程和最终选择的选项。数据集旨在支持医学问答、临床推理、指令调整和多跳推理能力评估的研究与开发。数据来源于MedQA Books、StatPearls和Clinical Guidelines三大公开可用的医学语料库，通过Qwen3-30B-A3B-Instruct-2507模型合成约40K高质量监督微调样本。数据集结构为JSON/JSONL格式，每个示例包含question和answer字段，其中answer包含推理过程（<thinking>）和最终选项（<answer>）。数据集经过严格的质量控制，包括源文档验证、答案格式一致性检查和模糊或不支持问题的过滤。

This dataset contains multiple-choice question–answer (QA) pairs derived from authoritative medical guideline documents using DataFlow knowledge extraction pipeline. Each data instance consists of a question and a corresponding answer, where the answer is formatted with explicit reasoning and a final selected option. The dataset is designed to support research and development in medical question answering, clinical reasoning, instruction tuning, and evaluation of multi-hop reasoning capabilities. The data sources include MedQA Books, StatPearls, and Clinical Guidelines, and approximately 40K high-quality SFT samples are synthesized using the Qwen3-30B-A3B-Instruct-2507 model. The dataset structure is in JSON/JSONL format, with each example containing question and answer fields, where answer includes reasoning process (<thinking>) and final selected option (<answer>). Quality assurance includes validation against source documents, consistency checks on answer format, and filtering of ambiguous or unsupported questions.

提供机构：

OpenDCAI

5,000+

优质数据集

54 个

任务类型

进入经典数据集