EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain
Source: DataCite Commons | Updated: 2022-03-17 | Indexed: 2025-04-16
Download link:
https://physionet.org/content/entity-bert/
Resource description:
Transformer-based neural language models have led to breakthroughs in a
variety of natural language processing (NLP) tasks. However, most models are
pretrained on general-domain data. We propose a methodology to produce a model
focused on the clinical domain: continued pretraining of a model with a broad
representation of biomedical terminology (PubMedBERT) on a clinical corpus
(MIMIC-III), combined with a novel entity-centric masking strategy that infuses
domain knowledge into the learning process.
We curated the MIMIC-III corpus by annotating events (including
diseases/disorders, signs/symptoms, medications, anatomical sites, and
procedures) and time expressions (e.g. "yesterday", "this weekend", or
"02/31/2028", an example date) with special markers. To train the
entity-centric masked language model, the marked events and time expressions
are randomly selected for masking, together with other words, in a specified
ratio. As a result, the models are infused with clinical entity information
and are well suited to entity-related clinical NLP tasks.
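
The masking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `entity_centric_mask`, the span format, the choice to mask each selected entity as a whole span, and the default ratios (`entity_ratio=0.4`, `other_ratio=0.12`) are all assumptions made for illustration, since the description only states that marked entities and other words are chosen "in a certain ratio".

```python
import math
import random

MASK = "[MASK]"  # BERT-style mask token


def entity_centric_mask(tokens, entity_spans, entity_ratio=0.4,
                        other_ratio=0.12, seed=None):
    """Mask a share of marked entity spans plus a share of other words.

    tokens       -- list of word tokens
    entity_spans -- list of (start, end) index pairs, end exclusive, covering
                    the marked events and time expressions (hypothetical format)
    Both ratios are illustrative assumptions, not values from the source.
    Returns (masked_tokens, labels); labels holds the original token at each
    masked position and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)

    # Positions that belong to any marked entity.
    entity_positions = {i for start, end in entity_spans
                        for i in range(start, end)}
    other_positions = [i for i in range(len(tokens))
                       if i not in entity_positions]

    # Assumption: a chosen entity is masked as a whole span.
    n_spans = min(len(entity_spans),
                  math.ceil(entity_ratio * len(entity_spans)))
    chosen_spans = rng.sample(entity_spans, n_spans)

    # Non-entity words are sampled individually.
    n_other = min(len(other_positions),
                  math.ceil(other_ratio * len(other_positions)))
    chosen_other = rng.sample(other_positions, n_other)

    to_mask = {i for start, end in chosen_spans for i in range(start, end)}
    to_mask.update(chosen_other)

    for i in to_mask:
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


if __name__ == "__main__":
    words = "patient started aspirin yesterday for chest pain".split()
    # aspirin (medication), yesterday (time expression), chest pain (symptom)
    spans = [(2, 3), (3, 4), (5, 7)]
    print(entity_centric_mask(words, spans, seed=13))
```

Sampling entity spans and ordinary words from separate pools is what lets the entity share of the masked positions be controlled independently of how rare entities are in the text; a real pretraining pipeline would apply the same idea at the subword level after tokenization.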
Provider:
PhysioNet
Created:
2021-08-26



