EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain
Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-28.
Download link:
https://physionet.org/content/entity-bert/
Description:
Transformer-based neural language models have led to breakthroughs in a variety of natural language processing (NLP) tasks. However, most models are pretrained on general-domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus (MIMIC-III), combined with a novel entity-centric masking strategy that infuses domain knowledge into the learning process. We curated the MIMIC-III corpus by annotating events (including diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures) and time expressions (e.g., "yesterday", "this weekend", "02/31/2028" (an example date)) with special markers. Marked events and time expressions are randomly chosen, together with other words, in a certain ratio to be masked for training the entity-centric masked language model. As a result, the models are infused with clinical entity information and are well suited to entity-related clinical NLP tasks.
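The masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `entity_centric_mask`, the two ratio parameters and their default values, the whole-span masking behavior, and the `[MASK]` placeholder are all assumptions made for illustration. The description only states that marked entities and other words are chosen "in a certain ratio".

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed for illustration)


def entity_centric_mask(tokens, entity_spans, entity_ratio=0.4,
                        word_ratio=0.12, seed=None):
    """Mask entity spans and ordinary words at separate rates.

    tokens:       list of str
    entity_spans: list of (start, end) half-open index pairs marking
                  events or time expressions
    entity_ratio: probability of masking each entity span (hypothetical value)
    word_ratio:   probability of masking each non-entity token (hypothetical value)
    Returns (masked_tokens, sorted_masked_positions).
    """
    rng = random.Random(seed)
    positions = set()
    in_entity = set()

    # Each marked span is selected as a unit, so an entity is either
    # fully masked or left fully visible.
    for start, end in entity_spans:
        in_entity.update(range(start, end))
        if rng.random() < entity_ratio:
            positions.update(range(start, end))

    # Tokens outside any entity are sampled independently.
    for i in range(len(tokens)):
        if i not in in_entity and rng.random() < word_ratio:
            positions.add(i)

    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


if __name__ == "__main__":
    tokens = "patient reported chest pain yesterday and was given aspirin".split()
    # Hypothetical annotations: "chest pain" (event), "yesterday" (time), "aspirin" (event)
    spans = [(2, 4), (4, 5), (8, 9)]
    print(entity_centric_mask(tokens, spans, seed=0))
```

Setting `entity_ratio` higher than `word_ratio` biases the masked positions toward clinical entities, which is the intuition behind the strategy. A real pretraining pipeline would operate on subword tokens rather than whitespace-split words.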
Created:
2024-01-31



