MedCD: A Medical Clinical Dataset

Name: MedCD: A Medical Clinical Dataset
Creator: IEEE DataPort
Published: 2025-02-10 05:02:32
License: No description available

Source: DataCite Commons. Updated 2025-02-10; indexed 2025-04-16.

Download link:

https://ieee-dataport.org/documents/medcd-medical-clinical-dataset

Description:

We curated and released MedCD, a real-world medical clinical dataset, to support building generative artificial intelligence (AI) applications in clinical settings. The dataset is one of the outcomes of our longitudinal applied AI research and deployment in a tertiary care hospital in China. First, the dataset is real and comprehensive: it was sourced from real-world electronic health records (EHRs), clinical notes, lab examination reports and more. Second, the dataset is large: it contains 1.7 million EHR examples involving more than 250K patients, collected from 30 clinical departments over the first quarter of 2024, a scale comparable to that of MIMIC-IV. The data was de-identified and organized into a format similar to MIMIC-IV free-text clinical notes. Moreover, the objective of this dataset is to accelerate generative AI research and development in healthcare. Beyond its large volume of patient data, MedCD features supervised data for a variety of fundamental real-world clinical tasks, backed by months of annotation work by clinicians. Following the general paradigm of generative AI application development, MedCD consists of: (1) unsupervised pretraining data, where each patient's data is organized as a medical document; (2) supervised fine-tuning data for a wide spectrum of clinical applications, including named entity recognition (NER), retrieval and summarization; and (3) benchmark data for evaluating fundamental clinical tasks such as patient triage and clinical note generation. We also describe a range of deployed clinical applications that make use of this data, as reference implementations and baselines. We believe that MedCD is, to date, the largest and most comprehensive clinical dataset in Chinese, and the first designed for generative AI research and development in healthcare.

Provider:

IEEE DataPort

Created:

2025-02-10
