Molecular similarity perception based on machine-learning models
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6472292
Description:
Molecular similarity is a particularly important notion in chemical legislation, specifically in the evaluation process for orphan drugs (i.e., drugs for rare diseases). To be assigned the financially advantageous status of orphan drug, a new molecule needs to be dissimilar from every existing drug for the given disease. Currently, there are many ways to define whether two molecules are similar or dissimilar. Thus far, the European Medicines Agency has relied on majority voting among experts, each making a discretionary judgment of similarity, when assessing new drugs for rare diseases. An individual expert's decision on whether two compounds are similar is inherently subjective, depending on factors such as gender, age, state of mind, and previous experience. It is therefore desirable, in this context, to have an objective measure of similarity. To answer this need, we report a new dataset of molecular similarity assessments that includes complex and difficult similarity scenarios. Based on this dataset, we propose new and improved models for similarity prediction, including 3D properties. These models are publicly available: https://chematlas.chimie.unistra.fr/ReadySim/.
Software, 3D structures, and pictures are available in the Git repository associated with this deposit: https://github.com/enricogandini/paper_similarity_prediction.git
The deposit contains two files.
original_training_set.csv: one of the datasets originally published in [doi: 10.1186/1758-2946-6-5].
new_dataset.csv: results of a new survey conducted in 2020.
The columns are the following:
id_pair: unique identifier of the compound pair
curated_smiles_molecule_a: first compound of the pair
curated_smiles_molecule_b: second compound of the pair
tanimoto_cdk_Extended: ECFP similarity measure
TanimotoCombo: ComboScore similarity measure
pchembl_distance: difference in activity between the two compounds of the pair
target_name: protein to which the compounds of the pair bind
simil_2D: similar based on ECFP (0 or 1)
simil_3D: similar based on ComboScore (0 or 1)
dissimil_2D: dissimilar based on ECFP (0 or 1)
dissimil_3D: dissimilar based on ComboScore (0 or 1)
pair_type: pairs are classified based on ECFP and ComboScore as similar or dissimilar in 2D and 3D - Sim2DSim3D, Sim2DDis3D, Dis2D,Sim3D, Dis2DSim3D
n_answers: number of answers from experts
n_similar: number of answers labeling the pair as similar compounds
frac_similar: n_similar/n_answers
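
The relationship between n_answers, n_similar and frac_similar can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the sample rows are made up, the function name summarize_pairs and the 0.5 majority threshold are assumptions, and the sketch assumes the CSV files are comma-separated with a header row matching the column names listed above.

```python
import csv
import io

# Hypothetical rows for illustration only; these values are NOT taken from the deposit.
SAMPLE_CSV = """id_pair,n_answers,n_similar,frac_similar
1,10,7,0.7
2,8,2,0.25
3,4,2,0.5
"""


def summarize_pairs(handle, threshold=0.5):
    """Recompute frac_similar and derive a majority-vote label for each pair.

    `threshold` is an assumed cut-off: a pair is labeled similar only when
    strictly more than this fraction of experts answered "similar".
    """
    results = []
    for row in csv.DictReader(handle):
        n_answers = int(row["n_answers"])
        n_similar = int(row["n_similar"])
        # frac_similar is defined in the column list as n_similar / n_answers.
        frac = n_similar / n_answers if n_answers else float("nan")
        results.append({
            "id_pair": row["id_pair"],
            "frac_similar": frac,
            "majority_similar": frac > threshold,
        })
    return results


summary = summarize_pairs(io.StringIO(SAMPLE_CSV))
for item in summary:
    print(item)
```

To try the same check on the deposit, one could pass an open file instead of the in-memory sample, for example with open("new_dataset.csv", newline="") as f: summarize_pairs(f), provided the file layout matches the assumptions stated above.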
Created:
2022-05-06



