Leash-Biosciences/mf-pcba-bind
Source: Hugging Face (updated 2026-02-11, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/Leash-Biosciences/mf-pcba-bind
Description:
---
license: mit
---
# MF-PCBA-Bind
Code for generating these datasets can be found at https://github.com/Leash-Labs/mf-pcba-bind
Protein-ligand binding prediction datasets derived from the [MF-PCBA](https://github.com/davidbuterez/mf-pcba) benchmark.
This repository extends the original MF-PCBA dataset by:
- Filtering to binding assays only (excluding phenotypic assays)
- Removing PAINS (pan-assay interference compounds) that show non-specific activity
- Providing pre-built validation and test splits for protein-ligand binding prediction
- Aggregating the data across assays into combined, easy-to-access files
- Providing "compact" versions of each val/test set, which are smaller and therefore cheaper to evaluate with computationally expensive models
## Attribution
This work is based on **MF-PCBA** by David Buterez et al. The original dataset, retrieval scripts, and methodology are preserved in the `original-mf-pcba/` directory.
If you use this dataset, please cite the original MF-PCBA paper:
```bibtex
@article{doi:10.1021/acs.jcim.2c01569,
author = {Buterez, David and Janet, Jon Paul and Kiddle, Steven J. and Liò, Pietro},
title = {MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning},
journal = {Journal of Chemical Information and Modeling},
year = {2023},
doi = {10.1021/acs.jcim.2c01569},
URL = {https://doi.org/10.1021/acs.jcim.2c01569}
}
```
## Dataset Overview
### Data Files
| File | Description |
|------|-------------|
| `data/mf_pcba_bind_val_full.parquet` | Full validation set with all binders and non-binders |
| `data/mf_pcba_bind_val_compact.parquet` | Compact validation set (binders + 4x sampled non-binders) |
| `data/mf_pcba_bind_test_full.parquet` | Full test set with all binders and non-binders |
| `data/mf_pcba_bind_test_compact.parquet` | Compact test set (binders + 4x sampled non-binders) |
| `data/mf_pcba_bind_val+test_full.parquet` | Combined val+test full set |
| `data/mf_pcba_bind_val+test_compact.parquet` | Combined val+test compact set |
| `data/MF-PCBA-Assay-Metadata.csv` | Curated metadata for all assays including protein sequences |
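
As a minimal sketch of how the parquet files listed above might be loaded, the helper below builds the in-repo path from a split and a variant and reads it with pandas. The names `parquet_path` and `load_split` are hypothetical and not part of the repository. Reading an `hf://` URL assumes `huggingface_hub` is installed alongside `pandas` and `pyarrow`.

```python
REPO_ID = "Leash-Biosciences/mf-pcba-bind"
SPLITS = ("val", "test", "val+test")
VARIANTS = ("full", "compact")


def parquet_path(split: str, variant: str) -> str:
    """Return the in-repo path of one parquet file (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}, expected one of {SPLITS}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    return f"data/mf_pcba_bind_{split}_{variant}.parquet"


def load_split(split: str, variant: str = "compact"):
    """Read one file from the Hub into a DataFrame (needs network access)."""
    # Imported lazily so the path helper works without pandas installed.
    import pandas as pd

    url = f"hf://datasets/{REPO_ID}/{parquet_path(split, variant)}"
    return pd.read_parquet(url)
```

For example, `load_split("val")` would fetch the compact validation set, and `load_split("val+test", "full")` the combined full set.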
### Columns
Each parquet file contains:
- `CID`: PubChem Compound ID
- `smiles`: Molecular SMILES string
- `binds`: Binary label (1 = binder, 0 = non-binder)
- `protein_name`: Target protein name
- `protein_category`: A rough categorization of the protein
- `protein_accession`: Protein accession number
- `amino_acid_sequence`: Full protein sequence
- `AID`: PubChem Assay ID
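
Because every row carries both the assay (`AID`) and the label (`binds`), per-assay class balance can be read off with a single group-by. The sketch below is illustrative only: `summarize_by_assay` is a hypothetical helper, and the toy rows are made up to show the shape of the result rather than taken from the dataset.

```python
import pandas as pd


def summarize_by_assay(df: pd.DataFrame) -> pd.DataFrame:
    """Count compounds and binders per assay using the documented columns."""
    return df.groupby(["AID", "protein_name"], as_index=False).agg(
        n_compounds=("CID", "nunique"),
        n_binders=("binds", "sum"),
        binder_fraction=("binds", "mean"),
    )


# Toy rows (not real data) that mimic a subset of the documented columns.
toy = pd.DataFrame(
    {
        "CID": [1, 2, 3, 4, 5],
        "smiles": ["CCO", "CCN", "CCC", "CCCl", "c1ccccc1"],
        "binds": [1, 0, 0, 0, 1],
        "protein_name": ["Target A"] * 4 + ["Target B"],
        "AID": [100, 100, 100, 100, 200],
    }
)
summary = summarize_by_assay(toy)
print(summary)
```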
### Label Definitions
- **Binders (binds=1)**: Compounds marked "Active" in dose-response (DR) confirmatory screening
- **Non-binders (binds=0)**: Compounds marked "Inactive" in single-dose (SD) primary screening
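
The two definitions above can be read as a small decision rule. The sketch below is one possible reading, not the repository's implementation: the function name and the outcome strings passed in are assumptions. It also assumes that compounds matching neither definition (for example, active in the primary screen but not confirmed in dose-response) receive no label and are left out.

```python
from typing import Optional


def assign_binds(sd_outcome: Optional[str], dr_outcome: Optional[str]) -> Optional[int]:
    """Map screening outcomes to a binds label, or None if unlabeled.

    sd_outcome: activity outcome in the single-dose (SD) primary screen.
    dr_outcome: activity outcome in the dose-response (DR) confirmatory
                screen, or None if the compound was not tested there.
    """
    if dr_outcome == "Active":
        return 1  # binder: confirmed active in dose-response
    if sd_outcome == "Inactive":
        return 0  # non-binder: inactive in the primary screen
    return None  # assumption: ambiguous compounds are excluded


print(assign_binds("Active", "Active"))
print(assign_binds("Inactive", None))
print(assign_binds("Active", "Inactive"))
```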
### PAINS Filtering
PAINS (Pan-Assay Interference Compounds) are filtered out using RDKit's FilterCatalog. These are compounds that tend to show activity across many assays due to non-specific mechanisms (e.g., aggregation, redox cycling, fluorescence interference) rather than genuine target binding.
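
A minimal sketch of a PAINS check with RDKit's `FilterCatalog` is shown below. It uses the combined `PAINS` catalog (subsets A, B, and C). Whether the build script uses exactly this catalog, and how it treats unparseable SMILES, is an assumption here. `is_pains` is a hypothetical helper name.

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams


def build_pains_catalog() -> FilterCatalog:
    """Create a catalog holding the PAINS substructure filters."""
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
    return FilterCatalog(params)


# Build once and reuse; constructing the catalog per molecule is wasteful.
PAINS_CATALOG = build_pains_catalog()


def is_pains(smiles: str) -> Optional[bool]:
    """Return True if any PAINS filter matches, None if SMILES is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # assumption: the caller decides how to handle these
    return PAINS_CATALOG.HasMatch(mol)


print(is_pains("CCO"))  # ethanol, a simple molecule with no PAINS motif
```

Applied to a DataFrame, rows could then be kept where `df["smiles"].map(is_pains)` is `False`.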
## Scripts
### `scripts/build_val_test_sets.py`
Builds the validation and test parquet files from retrieved MF-PCBA data:
1. Reads manually reviewed and curated assay metadata from `data/MF-PCBA-Assay-Metadata.csv`
2. Filters to binding assays only (excludes phenotypic assays)
3. Extracts binders (DR active) and non-binders (SD inactive)
4. Removes PAINS compounds using RDKit's FilterCatalog
5. Adds protein sequence information to each compound
6. Creates full and compact versions of validation, test, and combined sets
**Requirements**: `pandas`, `numpy`, `pyarrow`, `rdkit`
**Usage**:
```bash
# First, retrieve the raw MF-PCBA data using scripts in original-mf-pcba/
# Then run:
python scripts/build_val_test_sets.py
```
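
Step 6 of the build script produces the compact sets, described above as binders plus 4x sampled non-binders. One way such a subset could be built is sketched below. It assumes that "4x" means up to four non-binders per binder and that sampling is done separately within each assay. Neither detail is confirmed by this README, and `make_compact` is a hypothetical name.

```python
import pandas as pd


def make_compact(df: pd.DataFrame, ratio: int = 4, seed: int = 0) -> pd.DataFrame:
    """Keep all binders and at most `ratio` non-binders per binder, per assay."""
    parts = []
    for _, group in df.groupby("AID", sort=True):
        binders = group[group["binds"] == 1]
        non_binders = group[group["binds"] == 0]
        # Never request more non-binders than the assay actually has.
        n_keep = min(len(non_binders), ratio * len(binders))
        parts.append(binders)
        parts.append(non_binders.sample(n=n_keep, random_state=seed))
    if not parts:
        return df.iloc[0:0]  # empty input yields an empty frame
    return pd.concat(parts, ignore_index=True)


# Toy data (not real): assay 100 has 2 binders and 20 non-binders,
# assay 200 has 1 binder and 2 non-binders.
toy_full = pd.DataFrame(
    {
        "CID": range(25),
        "binds": [1, 1] + [0] * 20 + [1, 0, 0],
        "AID": [100] * 22 + [200] * 3,
    }
)
toy_compact = make_compact(toy_full)
print(toy_compact.groupby("AID")["binds"].value_counts())
```

With these toy rows, assay 100 keeps 2 binders and 8 non-binders, while assay 200 keeps all 3 of its rows because it has fewer than four non-binders per binder.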
## Original MF-PCBA
The `original-mf-pcba/` directory contains the original MF-PCBA retrieval code and scripts. See `original-mf-pcba/README-original.md` for details on downloading and processing the raw PubChem data.
## License
MIT License - see [LICENSE](LICENSE)
Original MF-PCBA code copyright (c) 2022 David Buterez.
Provided by: Leash-Biosciences



