Synthetic Mention Corpora for Disease Entity Recognition and Normalization
Source: DataCite Commons. Updated: 2025-02-03. Indexed: 2025-04-16.
Download link:
https://physionet.org/content/synthetic-mention-corpora/
Description:
Named Entity Recognition (NER) and Entity Normalization (EN) are fundamental
tasks in information extraction, particularly in the biomedical and clinical
domains. NER identifies textual mentions of entities, while EN maps these
mentions to unique identifiers within a structured vocabulary. However, the
biomedical domain presents unique challenges for NER, including the diverse
and inconsistent lexical representations of biomedical concepts, such as
non-standard terminology, abbreviations, complex phrases, and frequent
misspellings in clinical texts. Additionally, rare entities are often
underrepresented in training datasets and may lack detailed descriptions or
synonyms in knowledge graphs, limiting the quality of training data for
Disease Entity Recognition (DER) and Disease Entity Normalization (DEN). To
address this, we present the Synthetic Mention Corpora for Disease Entity
Recognition and Normalization, a dataset comprising 128,000 synthetic disease
mentions generated using a fine-tuned LLaMa-2-13B-Chat model. These mentions
are derived from the Unified Medical Language System (UMLS) disorder group.
This corpus aims to enhance the development of more robust systems for disease
entity identification and linking in biomedical and clinical text, addressing
current limitations in training data availability.
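
To make the two tasks described above concrete, the following is a minimal sketch of a dictionary-based recognizer (NER) followed by a lookup-based normalizer (EN). It is not the method used to build or evaluate this corpus. The lexicon, the concept identifiers (placeholders, not real UMLS CUIs), and the function names `recognize`, `normalize`, and `extract` are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon: surface form -> placeholder concept ID.
# These IDs are illustrative only and are NOT real UMLS CUIs.
LEXICON = {
    "type 2 diabetes": "CONCEPT_001",
    "t2dm": "CONCEPT_001",
    "high blood pressure": "CONCEPT_002",
    "hypertension": "CONCEPT_002",
}


def recognize(text):
    """NER step: return (start, end, mention) spans found in the text."""
    spans = []
    for surface in LEXICON:
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0)))
    return sorted(spans)


def normalize(mention):
    """EN step: map a mention string to a concept ID, or None if unknown."""
    return LEXICON.get(mention.lower())


def extract(text):
    """Run recognition, then normalization, on each recognized mention."""
    return [(mention, normalize(mention)) for _, _, mention in recognize(text)]
```

An exact-match lexicon like this one misses misspellings and unseen paraphrases, which is the gap the description says the synthetic mentions are intended to help close.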
Provider:
PhysioNet
Created:
2025-02-03



