Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes)
Source: OpenDataLab, updated 2026-06-07, indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/Rare_Diseases_Mentions_in_etc
Resource description:
full_set_RD_ann_MIMIC_III_disch.csv contains the full set of 1,073 rare disease mention annotations, drawn from 312 MIMIC-III discharge summaries. It is split by row order:
* the first 400 rows form the validation set, validation_set_RD_ann_MIMIC_III_disch.csv;
* the last 673 rows form the test set, test_set_RD_ann_MIMIC_III_disch.csv.

test_set_RD_ann_MIMIC_III_rad.csv contains 198 rare disease mention annotations drawn from 145 MIMIC-III radiology reports. The radiology reports are used for testing only, not for validation.

Note: a row counts as a true patient phenotype only if its Gold Mention ORDO Label is 1.
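The row-order split described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' tooling: the helper names `load_rows` and `split_full_set` are hypothetical, and the sketch assumes the CSV has a header row and has been downloaded locally.

```python
import csv

N_VALIDATION = 400  # first 400 rows, per the description
N_TEST = 673        # last 673 rows, per the description


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def split_full_set(rows):
    """Split the full discharge-summary set by row position.

    Returns (validation_rows, test_rows). Raises ValueError if the row
    count does not match the documented 400 + 673 = 1,073 rows.
    """
    expected = N_VALIDATION + N_TEST
    if len(rows) != expected:
        raise ValueError(f"expected {expected} rows, got {len(rows)}")
    return rows[:N_VALIDATION], rows[N_VALIDATION:]


# Usage, assuming the file sits in the working directory:
# full = load_rows("full_set_RD_ann_MIMIC_III_disch.csv")
# validation, test = split_full_set(full)
```

Checking the row count before slicing guards against silently splitting a truncated or re-exported file at the wrong position.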
Data sampling and annotation procedure:

(i) 500 discharge summaries and 1,000 radiology reports were randomly sampled from MIMIC-III.

(ii) SemEHR identified at least one positive UMLS mention linked to ORDO in 312 of the 500 discharge summaries and in 145 of the 1,000 radiology reports, giving 1,073 UMLS/ORDO mentions in the discharge summaries and 198 in the radiology reports.

(iii) Three medical informatics researchers (staff or PhD students) annotated the 1,073 discharge-summary mentions, and two medical informatics researchers annotated the 198 radiology-report mentions, judging whether each was a correct patient phenotype matched to UMLS and ORDO. Disagreements between annotators were then resolved by another researcher with a biomedical background.
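The adjudication step in (iii) amounts to finding mentions on which the annotators disagree and passing those to a further reviewer. The sketch below is purely illustrative: the released files contain only the final gold labels, so the per-annotator input format, the mention ids, and the function names are all assumptions, as is the idea that each mention received more than one label.

```python
def needs_adjudication(labels):
    """Return True when the annotators did not all assign the same label."""
    return len(set(labels)) > 1


def flag_disagreements(annotations):
    """Return the ids of mentions whose annotator labels disagree.

    `annotations` maps a (hypothetical) mention id to the sequence of
    labels given by the individual annotators, e.g. (1, 0, 1).
    """
    return [
        mention_id
        for mention_id, labels in annotations.items()
        if needs_adjudication(labels)
    ]


# Hypothetical per-annotator labels for three mentions.
example = {
    "mention-a": (1, 1, 1),  # unanimous, no adjudication needed
    "mention-b": (1, 0, 1),  # disagreement, goes to the adjudicator
    "mention-c": (0, 0, 0),  # unanimous
}
to_review = flag_disagreements(example)
```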
Data Dictionary
| Column Name | Description |
|---|---|
| ROW_ID | Unique identifier for each row, refer to https://mimic.physionet.org/mimictables/noteevents/ |
| SUBJECT_ID | Unique identifier for the patient, refer to https://mimic.physionet.org/mimictables/noteevents/ |
| HADM_ID | Unique identifier for the patient's hospitalization, refer to https://mimic.physionet.org/mimictables/noteevents/ |
| Document Structure Name | Name of the document structure where the mention is located. Identified by SemEHR (only for discharge summaries). |
| Document Structure Offset in Full Document | Start and end offsets of the document structure text (or template) in the entire discharge summary. The document structure is parsed by SemEHR using regular expressions (only for discharge summaries). |
| Mention | Mention identified by SemEHR. |
| Mention Offset in Document Structure | Start and end offsets of the mention within the document structure (only for discharge summaries). |
| Mention Offset in Full Document | Start and end offsets of the mention in the entire discharge summary. These can be calculated using the Document Structure Offset in Full Document and the Mention Offset in Document Structure. |
| UMLS with desc | UMLS identified by SemEHR, corresponding to the mention. |
| ORDO with desc | ORDO matched to the UMLS, using links within the ORDO ontology, see https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_3325 as an example. |
| Gold Mention to UMLS Label | Whether the mention-UMLS pair indicates the correct patient phenotype (i.e., a positive mention correctly matched to the UMLS concept), with 1 for correct and 0 for incorrect. |
| Gold UMLS-to-ORDO Label | Whether the matching from UMLS concept to ORDO concept is correct, with 1 for correct and 0 for incorrect. |
| Gold Mention ORDO Label | Whether the mention-UMLS-ORDO triple indicates the correct patient phenotype, with 1 for correct and 0 for incorrect. This column is 1 only if both the mention-to-UMLS label and the UMLS-to-ORDO label are 1; otherwise it is 0. |
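
Two columns in the dictionary above are derived from others: the mention offset in the full document and the combined gold label. The sketch below shows one way to reproduce both. It assumes offsets are zero-based `(start, end)` pairs with an exclusive end and that the mention offset is relative to the start of its document structure; neither convention is stated in the description, and the function names are hypothetical.

```python
def mention_offset_in_full_document(structure_offset, mention_offset):
    """Shift a structure-relative (start, end) pair by the structure's start."""
    structure_start, _structure_end = structure_offset
    mention_start, mention_end = mention_offset
    return structure_start + mention_start, structure_start + mention_end


def gold_mention_ordo_label(mention_to_umls, umls_to_ordo):
    """Combined label: 1 only when both component labels are 1, else 0."""
    return 1 if mention_to_umls == 1 and umls_to_ordo == 1 else 0


# Worked example on a made-up note (not taken from MIMIC-III).
document = "History: none. Diagnosis: normal pressure hydrocephalus."
section = "Diagnosis: normal pressure hydrocephalus."
mention = "normal pressure hydrocephalus"

section_start = document.index(section)
section_span = (section_start, section_start + len(section))
mention_start = section.index(mention)
mention_span = (mention_start, mention_start + len(mention))

full_start, full_end = mention_offset_in_full_document(section_span, mention_span)
recovered = document[full_start:full_end]  # slices the mention back out
```

Slicing the full document with the computed span and comparing against the Mention column is a cheap consistency check on whichever offset convention the files actually use.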

Notes:
* These manual annotations are by no means perfect. Some mentions are hypothetical, and annotators found it difficult to make a definitive decision on them. In addition, the annotations are based on the output of SemEHR, which does not achieve 100% recall, so they may not cover all rare disease mentions in the sampled discharge summaries.
* In row 323 of the full set (and of the validation set), the mention "nph" is not within the document structure because of a mention extraction error, so its Gold Mention to UMLS Label is -1.
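
Because one row carries a mention-to-UMLS label of -1, code that treats the labels as strictly binary may want to set such rows aside. The following sketch shows one possible way to do that; excluding the -1 row is a design choice rather than something the description prescribes, and the default column names are assumptions that should be checked against the actual CSV header.

```python
def label_value(row, column):
    """Parse a label cell as an int (accepts '1', '0', '-1', '1.0', ...)."""
    return int(float(row[column]))


def partition_rows(
    rows,
    mention_umls_col="gold mention-to-UMLS label",  # assumed header name
    mention_ordo_col="gold mention-to-ORDO label",  # assumed header name
):
    """Split rows into (true_phenotypes, negatives, excluded).

    Rows whose mention-to-UMLS label is -1 (extraction error) are excluded;
    rows whose mention-to-ORDO label is 1 are true phenotypes; the rest
    are negatives.
    """
    true_phenotypes, negatives, excluded = [], [], []
    for row in rows:
        if label_value(row, mention_umls_col) == -1:
            excluded.append(row)
        elif label_value(row, mention_ordo_col) == 1:
            true_phenotypes.append(row)
        else:
            negatives.append(row)
    return true_phenotypes, negatives, excluded
```

Keeping the excluded rows in their own list, instead of dropping them, makes it easy to report how many rows were set aside.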
Provider:
OpenDataLab
Created:
2022-05-25
Collected summary
Dataset introduction

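Background and challenges
Background overview
Built on MIMIC-III clinical notes, this dataset provides rare disease mention annotations: 1,073 from discharge summaries and 198 from radiology reports, with data quality checked through manual annotation and ontology matching. It is divided into a validation set and a test set and is intended to support research on identifying rare diseases in clinical text.
The above content was collected and summarized by 遇见数据集.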



