MaterialBERT for natural language processing of materials science texts
Source: DataCite Commons. Updated: 2022-12-12. Indexed: 2024-07-29.
Download link:
https://tandf.figshare.com/articles/dataset/MaterialBERT_for_Natural_Language_Processing_of_Materials_Science_Texts/21130151/2
Description:
A BERT (Bidirectional Encoder Representations from Transformers) model, which we named “MaterialBERT”, has been generated using scientific papers from a wide range of materials science fields as a corpus. A new vocabulary list for the tokenizer was generated from this materials science corpus. Two BERT models were then generated with different tokenizer vocabularies: one with the original vocabulary made by Google and the other with the vocabulary newly made by the authors. The word vectors embedded during pre-training by both MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives. This holds not only for inorganic materials but also for organic materials and organometallic compounds. Fine-tuning on CoLA (The Corpus of Linguistic Acceptability) with the pre-trained MaterialBERT yielded a higher score than the original BERT. The two MaterialBERT models could also be used as a starting point for transfer learning toward a BERT model for a narrower domain.
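To illustrate why the choice of tokenizer vocabulary matters, the following is a minimal sketch of the greedy longest-match-first WordPiece splitting that BERT tokenizers apply to each word. It is not the authors' code. The function name `wordpiece_tokenize` and the two toy vocabularies are hypothetical and far smaller than any real vocabulary; they only show how a domain-specific vocabulary can keep a material name as a single token where a general vocabulary splits it into fragments.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into WordPiece tokens (greedy longest match first)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
GENERAL_VOCAB = {"per", "##ov", "##sk", "##ite", "gra", "##ph", "##ene", "oxide"}
MATERIALS_VOCAB = GENERAL_VOCAB | {"perovskite", "graphene"}

if __name__ == "__main__":
    for name in ("perovskite", "graphene"):
        print(name, wordpiece_tokenize(name, GENERAL_VOCAB),
              wordpiece_tokenize(name, MATERIALS_VOCAB))
```

With the toy general vocabulary, "perovskite" is split into four pieces, while the toy materials vocabulary keeps it whole. Whether the real vocabularies behave this way for a given term depends on their actual contents.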
Provider:
Taylor & Francis
Created:
2022-09-23



