
Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation

Source: Figshare | Updated: 2025-05-01 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/Exploring_BERT_for_Reaction_Yield_Prediction_Evaluating_the_Impact_of_Tokenization_Molecular_Representation_and_Pretraining_Data_Augmentation/28915632
Description:
Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pretraining data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions using publicly available high-throughput experimentation (HTE) data sets. We demonstrate that the choice of molecular representation (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, while BPE and SentencePiece tokenization typically outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pretraining with relatively small data sets (<100K reactions) achieves performance comparable to pretraining with larger data sets containing millions of examples. We also propose the use of artificially generated, domain-specific pretraining data; these artificial sets prove to be a good surrogate for the reaction schemes extracted from reaction data sets such as Pistachio or Reaxys. The best performance was observed for hybrid pretraining sets combining real and domain-specific artificial data. Finally, we show that a novel adversarial training approach, which dynamically perturbs the input embeddings, improves model robustness and generalizability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK’s BERT training code base is made available to the community with this work.
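For readers unfamiliar with the tokenization methods compared above, the following is a minimal, illustrative sketch of how byte-pair encoding (BPE) learns merges over character-level SMILES strings. It is not taken from the GSK code base and is not the study's tokenizer; the function names (`learn_bpe`, `apply_merge`, `tokenize`) and the toy SMILES corpus are assumptions made purely for illustration. Library implementations add details such as pre-tokenization and special tokens that are omitted here.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of strings (hypothetical helper)."""
    # Start from character-level tokens for every string.
    sequences = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


if __name__ == "__main__":
    # Toy SMILES strings chosen for illustration only (not from the HTE data sets).
    corpus = ["c1ccccc1Br", "c1ccccc1N", "c1ccc(cc1)B(O)O"]
    merges = learn_bpe(corpus, num_merges=5)
    print(merges)
    print(tokenize("c1ccccc1Br", merges))
```

Because merges are learned from pair frequencies alone, the resulting vocabulary depends on the pretraining corpus and on the chosen molecular notation, which is one reason the tokenizer and the representation are evaluated together.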
Created: 2025-05-01