Molecular similarity perception based on machine-learning models

Indexed by NIAID Data Ecosystem on 2026-03-13

Download link:

https://zenodo.org/record/6472292

Description:

Molecular similarity is a particularly important notion for chemical legislation, specifically in the evaluation process for orphan drugs (i.e., drugs for rare diseases). A new molecule needs to be dissimilar from any other existing drug for a given disease to be assigned the financially advantageous status of orphan drug. Currently, there are many ways to define whether two molecules are similar or dissimilar. Thus far, the European Medicines Agency has relied on majority voting among experts, each making a discretionary judgment of similarity, when assessing new drugs for rare diseases. An individual expert's decision on whether two compounds are similar is inherently subjective, depending on factors such as gender, age, state of mind, and previous experience. It is therefore desirable, in this context, to benefit from an objective measure of similarity. To answer this need, we report a new dataset of molecular similarity assessments that includes complex and difficult similarity scenarios. As a result, we propose new and improved models for similarity-prediction procedures, including 3D properties.

These models are publicly available: https://chematlas.chimie.unistra.fr/ReadySim/. Software, 3D structures and pictures are available in the Git repository related to this deposit: https://github.com/enricogandini/paper_similarity_prediction.git

The deposit contains two files.

original_training_set.csv: one of the datasets initially published in [doi: 10.1186/1758-2946-6-5].
new_dataset.csv: results of a new survey organized in 2020.

The columns are the following:

id_pair: unique identifier of the compound pair
curated_smiles_molecule_a: first compound of the pair
curated_smiles_molecule_b: second compound of the pair
tanimoto_cdk_Extended: ECFP similarity measure
TanimotoCombo: ComboScore similarity measure
pchembl_distance: difference of activity of the compound pair
target_name: protein to which the compound pair is binding
simil_2D: similar based on ECFP (0 or 1)
simil_3D: similar based on ComboScore (0 or 1)
dissimil_2D: dissimilar based on ECFP (0 or 1)
dissimil_3D: dissimilar based on ComboScore (0 or 1)
pair_type: pairs are classified based on ECFP and ComboScore as similar or dissimilar in 2D and 3D - Sim2DSim3D, Sim2DDis3D, Dis2D,Sim3D, Dis2DSim3D
n_answers: number of answers from experts
n_similar: number of answers labeling the pair as similar compounds
frac_similar: n_similar/n_answers
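As an illustration of how the vote columns relate to each other, the following minimal Python sketch recomputes frac_similar from n_similar and n_answers and derives a majority-vote label. The rows in SAMPLE_CSV are hypothetical and are not taken from the deposit; the function name summarize_votes and the strict-majority rule (ties count as not similar) are assumptions made for this example, not part of the published method.

```python
import csv
import io

# Hypothetical rows for illustration only; they are NOT from new_dataset.csv.
SAMPLE_CSV = """id_pair,n_answers,n_similar
pair_001,10,7
pair_002,8,4
pair_003,12,3
"""


def summarize_votes(csv_text):
    """Return {id_pair: (frac_similar, majority_similar)} for each row."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        n_answers = int(row["n_answers"])
        n_similar = int(row["n_similar"])
        # frac_similar as defined in the column list: n_similar / n_answers
        frac_similar = n_similar / n_answers
        # Assumption: a strict majority is required, so a tie is "not similar".
        majority_similar = 2 * n_similar > n_answers
        summary[row["id_pair"]] = (frac_similar, majority_similar)
    return summary


votes = summarize_votes(SAMPLE_CSV)
for id_pair, (frac, majority) in votes.items():
    print(f"{id_pair}: frac_similar={frac:.2f}, majority_similar={majority}")
```

To apply the same logic to the real file, the text of new_dataset.csv could be read from disk and passed to summarize_votes in place of SAMPLE_CSV, assuming the file is comma-separated with the column names listed above.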

Created:

2022-05-06
