Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity
Source: NIAID Data Ecosystem (indexed 2026-03-07)
Download link:
https://figshare.com/articles/dataset/Development_of_a_Novel_Fingerprint_for_Chemical_Reactions_and_Its_Application_to_Large_Scale_Reaction_Classification_and_Similarity/2213242
Description:
Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 of the 50 reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation, as well as all Python scripts required to reproduce the analysis, are provided in the Supporting Information.
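The core idea in the description, a reaction fingerprint formed as the difference between product and reactant atom-pair fingerprints, can be illustrated with a minimal sketch. This is not the authors' code (their scripts are in the Supporting Information). It assumes a deliberately simplified atom-pair feature, (element, element, topological distance) over heavy atoms only, whereas real atom-pair descriptors encode more about each atom. The agent contribution via physicochemical properties is omitted. The molecule encoding (an element list plus a bond list) and all function names are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def shortest_path_lengths(n_atoms, bonds):
    """Topological distances between all atoms, via BFS from each atom."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    distances = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen[neighbor] = seen[current] + 1
                    queue.append(neighbor)
        distances[start] = seen
    return distances


def atom_pair_counts(atoms, bonds):
    """Count simplified atom-pair features: (element, element, distance)."""
    distances = shortest_path_lengths(len(atoms), bonds)
    counts = Counter()
    for i, j in combinations(range(len(atoms)), 2):
        if j in distances[i]:  # skip pairs in disconnected fragments
            first, second = sorted((atoms[i], atoms[j]))
            counts[(first, second, distances[i][j])] += 1
    return counts


def difference_fingerprint(reactants, products):
    """Product feature counts minus reactant feature counts, zeros dropped."""
    diff = Counter()
    for atoms, bonds in products:
        diff.update(atom_pair_counts(atoms, bonds))
    for atoms, bonds in reactants:
        diff.subtract(atom_pair_counts(atoms, bonds))
    return {feature: n for feature, n in diff.items() if n != 0}


# Toy example (heavy atoms only): acetic acid + methanol -> methyl acetate + water
acetic_acid = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
methanol = (["C", "O"], [(0, 1)])
methyl_acetate = (["C", "C", "O", "O", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)])
water = (["O"], [])

esterification = difference_fingerprint(
    reactants=[acetic_acid, methanol],
    products=[methyl_acetate, water],
)
print(esterification)
```

Features shared by both sides cancel, so only the pairs created or destroyed by the transformation remain. In this toy case, those are the new carbon-carbon and carbon-oxygen pairs spanning the freshly formed ester C-O bond. Swapping reactants and products negates every entry, which is why a signed count vector is used in the sketch rather than a bit vector.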
Created:
2016-02-15



