Leash-Biosciences/papyrus-decoy-eval
Hugging Face · Updated 2026-02-11 · Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/Leash-Biosciences/papyrus-decoy-eval
---
license: mit
---
# Papyrus Decoy Evaluation Set
A protein-ligand binding benchmark dataset derived from the [Papyrus](https://chemrxiv.org/engage/chemrxiv/article-details/617aa2467a002162403d71f0) bioactivity database, augmented with property-matched decoy molecules for evaluating virtual screening and binding prediction models.
## What is this dataset?
This dataset pairs known protein-ligand binders (actives) with computationally selected **decoy** molecules, i.e. compounds that are physicochemically similar to the actives but are assumed to be non-binders. The setup enables evaluation on a realistic discrimination task: can the model distinguish true binders from plausible-looking non-binders?
## Source Data
- **Actives**: Extracted from the Papyrus 05.5++ combined dataset (human proteins only; a molecule counts as a binder when its pChEMBL value is > 7, i.e. a potency better than 100 nM)
- **Decoys**: Selected from the [GuacaMol](https://github.com/BenevolentAI/guacamol) training set (~1M drug-like molecules)
## Decoy Generation
Decoys are generated using a **KNN-first, similarity-filtered** strategy:
1. **Property matching**: For each active molecule, K nearest neighbors are retrieved from the GuacaMol library based on five normalized molecular properties (molecular weight, Bertz complexity, LogP, TPSA, ring count).
2. **Structural filtering**: Candidates with Tanimoto similarity > 0.30 (Morgan fingerprints, radius=2, 2048 bits) to any active for that target are discarded, ensuring decoys are physicochemically similar but structurally distinct.
3. **Sampling**: From the filtered pool, decoys are randomly sampled at the target decoy-to-active ratio (100:1 in the released splits).
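
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the function names, the dict layout (`id`, `props`, `fp`), the Euclidean distance in property space, and the toy values are all hypothetical. In practice the normalized properties and Morgan fingerprints would come from a cheminformatics toolkit; here fingerprints are represented as sets of on-bit indices.

```python
import math
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_decoys(actives, library, k=5, max_sim=0.30, ratio=2, seed=0):
    """Return {active id: [decoy ids]} for one target.

    actives / library: lists of dicts with 'id', 'props' (tuple of
    normalized properties) and 'fp' (set of on-bit indices).
    """
    rng = random.Random(seed)
    result = {}
    for active in actives:
        # 1. Property matching: K nearest library molecules in property space.
        neighbours = sorted(
            library, key=lambda m: math.dist(m["props"], active["props"])
        )[:k]
        # 2. Structural filtering: discard candidates whose similarity to
        #    ANY active of this target exceeds the threshold.
        pool = [
            m for m in neighbours
            if all(tanimoto(m["fp"], a["fp"]) <= max_sim for a in actives)
        ]
        # 3. Sampling: draw up to `ratio` decoys at random from the pool.
        chosen = rng.sample(pool, min(ratio, len(pool)))
        result[active["id"]] = [m["id"] for m in chosen]
    return result


# Toy data (hypothetical): one active and three library molecules.
actives = [{"id": "a1", "props": (0.0, 0.0), "fp": {1, 2, 3, 4}}]
library = [
    {"id": "d1", "props": (0.1, 0.0), "fp": {1, 2, 3, 5}},   # close, but too similar
    {"id": "d2", "props": (0.0, 0.2), "fp": {10, 11, 12}},   # close and distinct
    {"id": "d3", "props": (5.0, 5.0), "fp": {20, 21}},       # far in property space
]
decoys = select_decoys(actives, library, k=2, ratio=1)
```

In this sketch K must be large enough that the pool still holds the desired number of decoys after filtering, and a library molecule may be picked for more than one active. Whether the actual pipeline deduplicates decoys is not stated in this card.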
## Dataset Splits
| Split | Description |
|-------|-------------|
| `papyrus_decoy_v4_val.parquet` | Validation set (~60% of targets) |
| `papyrus_decoy_v4_test.parquet` | Test set (~40% of targets) |

The decoy-to-binder ratio is **100:1** (100 decoys per active).
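
A quick sanity check of the ratio after loading a split might look like the sketch below. It assumes the rows are available as dictionaries carrying the `UniProtID` and `binds` columns (for example, exported from a dataframe); the helper name and the toy rows are hypothetical. The card does not state whether the 100:1 ratio holds exactly for every target or only overall, so treat deviations as something to inspect rather than as errors.

```python
from collections import Counter


def decoy_ratio_per_target(rows, target_key="UniProtID"):
    """Return {target: decoys per active} from rows with a 0/1 'binds' label."""
    actives, decoys = Counter(), Counter()
    for row in rows:
        counter = actives if row["binds"] == 1 else decoys
        counter[row[target_key]] += 1
    # Targets without any active are skipped to avoid division by zero.
    return {t: decoys[t] / actives[t] for t in actives}


# Toy rows (hypothetical): one target with 2 actives and 200 decoys.
rows = (
    [{"UniProtID": "T1", "binds": 1} for _ in range(2)]
    + [{"UniProtID": "T1", "binds": 0} for _ in range(200)]
)
ratios = decoy_ratio_per_target(rows)
```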
## Filtering Criteria
All molecules (actives and decoys) satisfy:
- Molecular weight: 400–900 Da
- LogP: 0–8
- Rotatable bonds: ≤ 12
- Largest ring size: ≤ 11

Proteins are restricted to *Homo sapiens* sequences with length ≤ 1024 amino acids. Per-target actives are diversity-subsampled to a maximum of 250 using fingerprint-based average-distance sampling.
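
The molecule and protein criteria can be written as simple predicates, as in the sketch below. It assumes the descriptor values are already computed and that the range bounds are inclusive, which the card does not state explicitly; the function and parameter names are hypothetical. The diversity subsampling step is not sketched because its exact procedure is not described here.

```python
def passes_molecule_filters(mol_weight, logp, rotatable_bonds, largest_ring):
    """True if a molecule satisfies the four listed criteria (bounds assumed inclusive)."""
    return (
        400 <= mol_weight <= 900      # molecular weight in Da
        and 0 <= logp <= 8
        and rotatable_bonds <= 12
        and largest_ring <= 11
    )


def protein_allowed(sequence, max_len=1024):
    """True if the amino acid sequence is within the length limit."""
    return len(sequence) <= max_len
```

For example, a molecule with a weight of 350 Da fails the first criterion regardless of its other properties.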
## Key Columns
| Column | Description |
|--------|-------------|
| `smiles` | SMILES string of the molecule |
| `amino_acid_sequence` | Protein amino acid sequence |
| `protein_name` | HGNC gene symbol |
| `binds` | Binary label (1 = active binder, 0 = decoy) |
| `molecule_lib` | Source library (`papyrus` or `guacamol`) |
| `pchembl_value_Mean` | Mean pChEMBL value (actives only) |
| `UniProtID` | UniProt accession |
| `moldesc_*` | Precomputed molecular descriptors (MW, LogP, TPSA, ECFP4, etc.) |
## Intended Use
Benchmarking protein-ligand binding prediction models, virtual screening methods, and molecular representation learning. The high decoy ratio (100:1) provides a challenging, realistic class imbalance setting.
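
The card does not prescribe an evaluation metric. As one possible choice under heavy class imbalance, the sketch below computes average precision for a single target from the `binds` labels and a model's scores. The function name and the toy values are hypothetical. At a 100:1 ratio the fraction of actives is 1/101, which is roughly the average precision a random ranking would achieve, so scores should be read against that baseline rather than against 0.5.

```python
def average_precision(labels, scores):
    """Average precision for 0/1 labels, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this active's rank
    return precision_sum / hits if hits else 0.0


# Toy example (hypothetical): actives ranked 1st and 4th out of 5.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
ap = average_precision(labels, scores)

# Fraction of actives at 100 decoys per active.
random_baseline = 1 / 101
```

Computing the metric per target and then averaging keeps targets with many actives from dominating the result.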
## Citation
If you use this dataset, please cite the underlying Papyrus database:
> Béquignon OJM, Bongers BJ, Jespers W, et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. *J Cheminform*. 2023;15(1):3.

Provider: Leash-Biosciences



