Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Name: Synthetic Mention Corpora for Disease Entity Recognition and Normalization
Creator: PhysioNet
Published: 2025-02-03 22:13:06
License: No description available

DataCite Commons; updated 2025-02-03; indexed 2025-04-16

Download link:

https://physionet.org/content/synthetic-mention-corpora/

Description:

Named Entity Recognition (NER) and Entity Normalization (EN) are fundamental tasks in information extraction, particularly in the biomedical and clinical domains. NER identifies textual mentions of entities, while EN maps these mentions to unique identifiers within a structured vocabulary. The biomedical domain presents particular challenges for both tasks. Biomedical concepts have diverse and inconsistent lexical representations, including non-standard terminology, abbreviations, complex phrases, and the frequent misspellings found in clinical texts. In addition, rare entities are often underrepresented in training datasets and may lack detailed descriptions or synonyms in knowledge graphs, which limits the quality of training data for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN). To address these gaps, we present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, a dataset comprising 128,000 synthetic disease mentions generated using a fine-tuned LLaMa-2-13B-Chat model. These mentions are derived from the Unified Medical Language System (UMLS) disorder group. The corpus is intended to support the development of more robust systems for disease entity identification and linking in biomedical and clinical text by easing current limitations in training data availability.
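
To make the distinction between the two tasks concrete, the following is a minimal Python sketch of a dictionary-based pipeline: a recognition step that finds mention spans, followed by a normalization step that maps each mention to an identifier. It is an illustration only and is not how the corpus was built or how it is meant to be consumed. The vocabulary, the `TOY:` identifiers, and the function names `recognize`, `normalize`, and `extract` are all hypothetical; the identifiers are placeholders, not real UMLS concept identifiers.

```python
import re

# Hypothetical toy vocabulary: surface form -> placeholder concept ID.
# These IDs are illustrative only and are NOT real UMLS identifiers.
TOY_VOCAB = {
    "myocardial infarction": "TOY:0001",
    "heart attack": "TOY:0001",
    "mi": "TOY:0001",
    "type 2 diabetes": "TOY:0002",
    "t2dm": "TOY:0002",
}

# Try longer surface forms first so a phrase wins over a shorter form
# that could match at the same position.
_FORMS = sorted(TOY_VOCAB, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in _FORMS) + r")\b",
    re.IGNORECASE,
)


def recognize(text):
    """NER step: return (start, end, mention) for each vocabulary match."""
    return [(m.start(), m.end(), m.group(0)) for m in _PATTERN.finditer(text)]


def normalize(mention):
    """EN step: map a mention to its concept ID, or None if unknown."""
    return TOY_VOCAB.get(mention.lower())


def extract(text):
    """Run recognition, then normalization, on a piece of text."""
    return [(mention, normalize(mention)) for _, _, mention in recognize(text)]


if __name__ == "__main__":
    print(extract("Patient with T2DM admitted after a heart attack."))
```

An exact-match lookup like this fails on precisely the cases the description lists (misspellings, unseen abbreviations, rare entities with few synonyms), which is the gap that additional synthetic mentions per concept are meant to narrow.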

Provider:

PhysioNet

Created:

2025-02-03
