Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Exploring_BERT_for_Reaction_Yield_Prediction_Evaluating_the_Impact_of_Tokenization_Molecular_Representation_and_Pretraining_Data_Augmentation/28915632
Description:
Predicting reaction yields in synthetic chemistry remains a significant
challenge. This study systematically evaluates the impact of tokenization,
molecular representation, pretraining data, and adversarial training
on a BERT-based model for yield prediction of Buchwald-Hartwig and
Suzuki-Miyaura coupling reactions using publicly available HTE data
sets. We demonstrate that molecular representation choice (SMILES,
DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names)
has minimal impact on model performance, while BPE and SentencePiece
tokenization typically outperform other methods. WordPiece is strongly discouraged
for SELFIES and fingerprint-based notation. Furthermore, pretraining
with relatively small data sets (<100 K reactions) achieves comparable
performance to larger data sets containing millions of examples. The
use of artificially generated domain-specific pretraining data is
proposed. The artificially generated sets prove to be a good surrogate
for the reaction schemes extracted from reaction data sets such as
Pistachio or Reaxys. The best performance was observed for hybrid
pretraining sets combining the real and the domain-specific, artificial
data. Finally, we show that a novel adversarial training approach,
perturbing input embeddings dynamically, improves model robustness
and generalizability for yield and reaction success prediction. These
findings provide valuable insights for developing robust and practical
machine learning models for yield prediction in synthetic chemistry.
GSK’s BERT training code base is made available to the community
with this work.
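
To make the tokenization comparison above more concrete, the following is a minimal, illustrative sketch of how byte-pair encoding (BPE) builds subword tokens from SMILES strings. It is a toy written for this description and is not taken from the released GSK code base. The function names (`learn_bpe`, `apply_merge`, `tokenize`) and the three example SMILES are assumptions chosen for illustration, and real pipelines would normally rely on an established tokenizer library rather than this hand-rolled version.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken lexically so runs are deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def tokenize(s, merges):
    """Tokenize a string by replaying the learned merges in order."""
    seq = list(s)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy corpus (aryl bromide / aniline style SMILES), for illustration only.
toy_corpus = ["c1ccccc1Br", "c1ccccc1N", "Brc1ccccc1"]
toy_merges = learn_bpe(toy_corpus, num_merges=5)
for smiles in toy_corpus:
    print(smiles, "->", tokenize(smiles, toy_merges))
```

Because merges are learned from pair frequencies, the resulting vocabulary depends on the string notation being tokenized, which is one reason a tokenizer can behave differently on SMILES, SELFIES, or fingerprint-based notation.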
Created:
2025-05-01



