
Data Sheet 1_Beyond Tanimoto: a learned bioactivity similarity index enhances ligand discovery.zip

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Data_Sheet_1_Beyond_Tanimoto_a_learned_bioactivity_similarity_index_enhances_ligand_discovery_zip/30737660
Resource description:
Structural similarity metrics such as the Tanimoto coefficient (TC) miss many functionally related compounds—indeed, 60% of similarly bioactive ligand pairs in the ChEMBL database show TC < 0.30, revealing a major blind spot that constrains ligand-based discovery. Our motivation is to overcome this blind spot and enable the recovery of structurally different yet functionally equivalent chemotypes that structure-based similarity fails to detect. Here, we introduce the bioactivity similarity index (BSI), a machine learning model that estimates the probability that two molecules bind the same or related protein receptors. Trained under leave-one-protein-out (LOPO) across Pfam-defined protein groups on dissimilar pairs, BSI not only outperforms TC but also surpasses modern molecular embedding baselines (ChemBERTa and contrastive language-molecule pre-training (CLAMP), using cosine similarity) across protein families. We further develop a cross-family model (BSI-Large) that, while slightly below group-specific models, generalizes better and can be fine-tuned with less data, consistently improving over models trained from scratch. In retrospective validation on new ChEMBL v35 data, BSI achieves strong early-retrieval performance (top 2% enrichment factor, EF2%), with group-specific models delivering the best enrichment, and BSI-Large remaining competitive. In a realistic virtual screening-like scenario against the target gene ADRA2B, the mean rank of the next active, given a known active, improves from 45.2 (TC) to 3.9 (BSI), with 54.9 for ChemBERTa and 28.6 for CLAMP. Altogether, BSI complements, rather than replaces, structure-based similarity and embedding-based comparisons, extending hit finding to remote chemotypes that are structurally dissimilar yet functionally equivalent. The code is available at https://github.com/gschottlender/bioactivity-similarity-index.
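The two baseline quantities named in the description, the Tanimoto coefficient and the enrichment factor at the top 2% (EF2%), can be illustrated with a minimal sketch. This is not the authors' code: the function names `tanimoto` and `enrichment_factor`, the set-of-bit-indices fingerprint representation, and the toy data are all assumptions made for illustration, and the EF definition used here (hit rate in the top-ranked slice divided by the overall hit rate) is the common one rather than one confirmed by the source.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def enrichment_factor(scores, labels, fraction=0.02):
    """EF at a top fraction: hit rate in the top-ranked slice / overall hit rate.

    scores: similarity scores (higher = more similar to the query).
    labels: 1 for active, 0 for inactive, aligned with scores.
    """
    n = len(scores)
    total_hits = sum(labels)
    if n == 0 or total_hits == 0:
        return 0.0
    n_top = max(1, round(fraction * n))  # size of the top slice, at least one item
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_hits / n)


# Toy fingerprints (hypothetical): 2 shared bits out of 8 distinct bits.
fp_a = {1, 5, 9, 12}
fp_b = {5, 12, 20, 33, 41, 47}
tc = tanimoto(fp_a, fp_b)

# Toy ranking (hypothetical): 100 compounds, 4 actives, 2 of them ranked first.
scores = [i / 100 for i in range(100)]
labels = [0] * 100
for idx in (99, 98, 50, 10):
    labels[idx] = 1
ef2 = enrichment_factor(scores, labels, fraction=0.02)

print(f"TC = {tc:.2f}, EF2% = {ef2:.1f}")
```

In this toy case the pair falls below the TC < 0.30 threshold mentioned above, which is the regime where a learned score such as BSI is meant to help; swapping a different scoring function into `scores` leaves the EF calculation unchanged.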
Created:
2025-11-28