Leash-Biosciences/papyrus-decoy-eval

Source: Hugging Face. Updated 2026-02-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Leash-Biosciences/papyrus-decoy-eval
Description:
---
license: mit
---

# Papyrus Decoy Evaluation Set

A protein-ligand binding benchmark dataset derived from the [Papyrus](https://chemrxiv.org/engage/chemrxiv/article-details/617aa2467a002162403d71f0) bioactivity database, augmented with property-matched decoy molecules for evaluating virtual screening and binding prediction models.

## What is this dataset?

This dataset pairs known protein-ligand binders (actives) with synthetically selected **decoy** molecules: compounds that are physicochemically similar to the actives but are assumed to be non-binders. This setup enables evaluation of models on a realistic discrimination task: can the model distinguish true binders from plausible-looking non-binders?

## Source Data

- **Actives**: Extracted from the Papyrus 05.5++ combined dataset (human proteins only, pChEMBL > 7 threshold for binding)
- **Decoys**: Selected from the [GuacaMol](https://github.com/BenevolentAI/guacamol) training set (~1M drug-like molecules)

## Decoy Generation

Decoys are generated using a **KNN-first, similarity-filtered** strategy:

1. **Property matching**: For each active molecule, K nearest neighbors are retrieved from the GuacaMol library based on five normalized molecular properties (molecular weight, Bertz complexity, LogP, TPSA, ring count).
2. **Structural filtering**: Candidates with Tanimoto similarity > 0.30 (Morgan fingerprints, radius=2, 2048 bits) to any active for that target are discarded, ensuring decoys are physicochemically similar but structurally distinct.
3. **Sampling**: From the filtered pool, decoys are randomly sampled at the target ratio.

## Dataset Splits

| Split | Description |
|-------|-------------|
| `papyrus_decoy_v4_val.parquet` | Validation set (~60% of targets) |
| `papyrus_decoy_v4_test.parquet` | Test set (~40% of targets) |

The decoy-to-binder ratio is **100:1** (100 decoys per active).
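
The three decoy-generation steps described above (property matching, structural filtering, sampling) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline: the function names, the parameters `k`, `max_sim`, `n_decoys`, and `seed`, the z-score normalization, and the toy molecules are all assumptions made for illustration. Fingerprints are represented as plain sets of on-bits; in practice the five properties and the Morgan fingerprints would come from a cheminformatics toolkit.

```python
import math
import random


def fit_scaler(rows):
    """Per-column mean and standard deviation (z-score normalization is an assumption)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def scale(row, means, stds):
    return tuple((x - m) / s for x, m, s in zip(row, means, stds))


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_decoys(active_props, target_active_fps, library,
                  k=3, max_sim=0.30, n_decoys=2, seed=0):
    means, stds = fit_scaler([m["props"] for m in library])
    query = scale(active_props, means, stds)
    # Step 1: K nearest neighbours in normalized property space.
    ranked = sorted(library,
                    key=lambda m: math.dist(query, scale(m["props"], means, stds)))
    candidates = ranked[:k]
    # Step 2: drop candidates whose similarity to ANY active of the target exceeds max_sim.
    kept = [m for m in candidates
            if all(tanimoto(m["fp"], fp) <= max_sim for fp in target_active_fps)]
    # Step 3: random sampling from the filtered pool.
    rng = random.Random(seed)
    return rng.sample(kept, min(n_decoys, len(kept)))


# Toy inputs (hypothetical): props = (MW, Bertz complexity, LogP, TPSA, ring count).
library = [
    {"id": "near_similar",   "props": (505, 810, 3.1, 92, 4),   "fp": {1, 2, 3, 5}},
    {"id": "near_distinct",  "props": (495, 790, 2.9, 88, 4),   "fp": {10, 11, 12}},
    {"id": "near_distinct2", "props": (510, 805, 3.2, 95, 4),   "fp": {1, 20, 21, 22, 23, 24}},
    {"id": "far",            "props": (850, 1500, 7.0, 200, 8), "fp": {30, 31}},
]
active_props = (500, 800, 3.0, 90, 4)
target_active_fps = [{1, 2, 3, 4}]

decoys = select_decoys(active_props, target_active_fps, library)
print(sorted(m["id"] for m in decoys))
```

In this toy run, `far` is excluded by the nearest-neighbour step and `near_similar` is excluded by the similarity filter, so only the two property-matched but structurally distinct molecules remain available for sampling.
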

## Filtering Criteria

All molecules (actives and decoys) satisfy:

- Molecular weight: 400–900 Da
- LogP: 0–8
- Rotatable bonds: ≤ 12
- Largest ring size: ≤ 11

Proteins are restricted to Homo sapiens sequences with length ≤ 1024 amino acids. Per-target actives are diversity-subsampled to a maximum of 250 using fingerprint-based average-distance sampling.

## Key Columns

| Column | Description |
|--------|-------------|
| `smiles` | SMILES string of the molecule |
| `amino_acid_sequence` | Protein amino acid sequence |
| `protein_name` | HGNC gene symbol |
| `binds` | Binary label (1 = active binder, 0 = decoy) |
| `molecule_lib` | Source library (`papyrus` or `guacamol`) |
| `pchembl_value_Mean` | Mean pChEMBL value (actives only) |
| `UniProtID` | UniProt accession |
| `moldesc_*` | Precomputed molecular descriptors (MW, LogP, TPSA, ECFP4, etc.) |

## Intended Use

Benchmarking protein-ligand binding prediction models, virtual screening methods, and molecular representation learning. The high decoy ratio (100:1) provides a challenging, realistic class imbalance setting.

## Citation

If you use this dataset, please cite the underlying Papyrus database:

> Béquignon OJM, Bongers BJ, Jespers W, et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. *J Cheminform*. 2023;15(1):3.
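
The dataset card does not prescribe an evaluation metric. As one possible sketch of the benchmarking use described above, the snippet below computes per-target average precision from the `binds` label and `protein_name` column. The `score` field (a model's predicted binding score), the function names, and the toy rows are hypothetical; with the real data, the rows could instead come from something like `pandas.read_parquet("papyrus_decoy_v4_test.parquet")` joined with model predictions.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """Mean of precision values at the rank of each active (binary labels)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


def per_target_ap(rows):
    """Group rows by `protein_name` and score each target separately."""
    grouped = defaultdict(lambda: ([], []))
    for r in rows:
        labels, scores = grouped[r["protein_name"]]
        labels.append(r["binds"])
        scores.append(r["score"])
    return {t: average_precision(l, s) for t, (l, s) in grouped.items()}


# Toy rows (hypothetical targets and scores, for illustration only).
rows = [
    {"protein_name": "TARGET_A", "binds": 1, "score": 0.9},
    {"protein_name": "TARGET_A", "binds": 0, "score": 0.8},
    {"protein_name": "TARGET_A", "binds": 1, "score": 0.7},
    {"protein_name": "TARGET_A", "binds": 0, "score": 0.1},
    {"protein_name": "TARGET_B", "binds": 0, "score": 0.9},
    {"protein_name": "TARGET_B", "binds": 1, "score": 0.2},
]

ap_by_target = per_target_ap(rows)
print(ap_by_target)
```

Scoring each target separately keeps targets with many actives from dominating the result. At a 100:1 decoy ratio, a random ranking would be expected to give an average precision close to the active prevalence (about 0.01), which is a useful baseline when reading these numbers.
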
Provider:
Leash-Biosciences