OpenDCAI/dataflow-knowledge-med-40k
收藏Hugging Face2025-12-29 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/OpenDCAI/dataflow-knowledge-med-40k
下载链接
链接失效反馈官方服务:
资源简介:
该数据集包含从权威医学指南文档中提取的多项选择题-答案对,使用DataFlow知识提取流程生成。每个数据实例包括一个问题和一个对应的答案,答案格式包含明确的推理过程和最终选择的选项。数据集旨在支持医学问答、临床推理、指令调整和多跳推理能力评估的研究与开发。数据来源于MedQA Books、StatPearls和Clinical Guidelines三大公开可用的医学语料库,通过Qwen3-30B-A3B-Instruct-2507模型合成约40K高质量监督微调样本。数据集结构为JSON/JSONL格式,每个示例包含question和answer字段,其中answer包含推理过程(<thinking>)和最终选项(<answer>)。数据集经过严格的质量控制,包括源文档验证、答案格式一致性检查和模糊或不支持问题的过滤。
This dataset contains multiple-choice question–answer (QA) pairs derived from authoritative medical guideline documents using DataFlow knowledge extraction pipeline. Each data instance consists of a question and a corresponding answer, where the answer is formatted with explicit reasoning and a final selected option. The dataset is designed to support research and development in medical question answering, clinical reasoning, instruction tuning, and evaluation of multi-hop reasoning capabilities. The data sources include MedQA Books, StatPearls, and Clinical Guidelines, and approximately 40K high-quality SFT samples are synthesized using the Qwen3-30B-A3B-Instruct-2507 model. The dataset structure is in JSON/JSONL format, with each example containing question and answer fields, where answer includes reasoning process (<thinking>) and final selected option (<answer>). Quality assurance includes validation against source documents, consistency checks on answer format, and filtering of ambiguous or unsupported questions.
提供机构:
OpenDCAI



