
EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain

Source: DataCite Commons | Updated: 2022-03-17 | Indexed: 2025-04-16
Download link:
https://physionet.org/content/entity-bert/
Description:
Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general-domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus (MIMIC-III), along with a novel entity-centric masking strategy to infuse domain knowledge into the learning process. We curated the MIMIC-III corpus by annotating events (including diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures) and time expressions (e.g. "yesterday", "this weekend", "02/31/2028" (an example date)) with special markers. Marked events and time expressions are randomly chosen, together with other words, in a certain ratio to be masked for training the entity-centric masked language model. As a result, the models are infused with clinical entity information and are well suited to entity-related clinical NLP tasks.
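The masking idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `entity_centric_mask`, the span-based input format (the description mentions special markers in the text instead), and both ratio values are assumptions made for illustration, since the description only says entities and other words are masked "in a certain ratio". It also works on whole words and skips subword tokenization.

```python
import random

MASK = "[MASK]"


def entity_centric_mask(tokens, entity_spans, entity_ratio=0.4,
                        other_ratio=0.12, seed=None):
    """Mask whole entity spans plus some non-entity words.

    tokens       -- list of word strings
    entity_spans -- (start, end) half-open index pairs marking events
                    and time expressions (hypothetical input format)
    entity_ratio -- chance of masking each entity span (assumed value)
    other_ratio  -- chance of masking each other word (assumed value)
    Returns (masked_tokens, sorted_masked_positions).
    """
    rng = random.Random(seed)
    positions = set()

    # Entities are masked as whole spans, never partially.
    for start, end in entity_spans:
        if rng.random() < entity_ratio:
            positions.update(range(start, end))

    # Remaining words are masked independently at a lower rate.
    entity_positions = {i for s, e in entity_spans for i in range(s, e)}
    for i in range(len(tokens)):
        if i not in entity_positions and rng.random() < other_ratio:
            positions.add(i)

    masked = [MASK if i in positions else tok
              for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


if __name__ == "__main__":
    words = "patient reported chest pain yesterday and received aspirin".split()
    spans = [(2, 4), (4, 5), (7, 8)]  # "chest pain", "yesterday", "aspirin"
    print(entity_centric_mask(words, spans, seed=0))
```

Setting `entity_ratio=1.0` and `other_ratio=0.0` masks every entity and nothing else, which is a convenient way to check that the spans are aligned with the text.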
Provider:
PhysioNet
Created:
2021-08-26