CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools
Source: DataCite Commons. Updated: 2024-04-24. Indexed: 2024-07-13.
Download link:
https://physionet.org/content/carmen-i/1.0.1/
Description:
The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge
letters, referrals, and radiology reports from Hospital Clinic of Barcelona
between March 2020 and March 2022. These reports, primarily in Spanish with
some Catalan sections, cover COVID-19 patients with diverse comorbidities like
kidney failure, cardiovascular diseases, malignancies, and immunosuppression.
The corpus underwent thorough anonymization, validation, and expert
annotation, replacing sensitive data with synthetic equivalents. A subset of
the corpus features annotations of medical concepts by specialists,
encompassing symptoms, diseases, procedures, medications, species, and humans
(including family members). CARMEN-I serves as a valuable resource for
training and assessing clinical NLP techniques and language models, aiding
tasks like de-identification, concept detection, linguistic modifier
extraction, document classification, and more. It also supports the training
of researchers in clinical NLP. The corpus is a collaborative effort involving
the NLP4BIA team at Barcelona Supercomputing Center, Hospital Clinic, and the
CLiC group at Universitat de Barcelona.
Provider:
PhysioNet
Created:
2024-04-10
Dataset overview

Background
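The CARMEN-I dataset contains 2,000 anonymized clinical records intended mainly for training and testing NLP tools, with a particular focus on COVID-19 patients and their comorbidities. The dataset has undergone rigorous anonymization and includes expert-annotated medical concepts, making it suitable for a variety of NLP tasks.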



