
Leash-Biosciences/mf-pcba-bind

Source: Hugging Face. Updated 2026-02-11. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Leash-Biosciences/mf-pcba-bind
Resource description:
---
license: mit
---

# MF-PCBA-Bind

Code for generating these datasets can be found at https://github.com/Leash-Labs/mf-pcba-bind

Protein-ligand binding prediction datasets derived from the [MF-PCBA](https://github.com/davidbuterez/mf-pcba) benchmark.

This repository extends the original MF-PCBA dataset by:

- Filtering to binding assays only (excluding phenotypic assays)
- Removing PAINS (pan-assay interference compounds) that show non-specific activity
- Providing pre-built validation and test splits for protein-ligand binding prediction
- Aggregating the data across assays into combined, easy-to-access files
- Providing "compact" versions of each validation and test set, so that computationally expensive models can be evaluated on fewer rows

## Attribution

This work is based on **MF-PCBA** by David Buterez et al. The original dataset, retrieval scripts, and methodology are preserved in the `original-mf-pcba/` directory.

If you use this dataset, please cite the original MF-PCBA paper:

```bibtex
@article{doi:10.1021/acs.jcim.2c01569,
  author  = {Buterez, David and Janet, Jon Paul and Kiddle, Steven J. and Liò, Pietro},
  title   = {MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning},
  journal = {Journal of Chemical Information and Modeling},
  year    = {2023},
  doi     = {10.1021/acs.jcim.2c01569},
  URL     = {https://doi.org/10.1021/acs.jcim.2c01569}
}
```

## Dataset Overview

### Data Files

| File | Description |
|------|-------------|
| `data/mf_pcba_bind_val_full.parquet` | Full validation set with all binders and non-binders |
| `data/mf_pcba_bind_val_compact.parquet` | Compact validation set (binders + 4x sampled non-binders) |
| `data/mf_pcba_bind_test_full.parquet` | Full test set with all binders and non-binders |
| `data/mf_pcba_bind_test_compact.parquet` | Compact test set (binders + 4x sampled non-binders) |
| `data/mf_pcba_bind_val+test_full.parquet` | Combined val+test full set |
| `data/mf_pcba_bind_val+test_compact.parquet` | Combined val+test compact set |
| `data/MF-PCBA-Assay-Metadata.csv` | Curated metadata for all assays including protein sequences |

### Columns

Each parquet file contains:

- `CID`: PubChem Compound ID
- `smiles`: Molecular SMILES string
- `binds`: Binary label (1 = binder, 0 = non-binder)
- `protein_name`: Target protein name
- `protein_category`: A rough categorization of the protein
- `protein_accession`: Protein accession number
- `amino_acid_sequence`: Full protein sequence
- `AID`: PubChem Assay ID

### Label Definitions

- **Binders (binds=1)**: Compounds marked "Active" in dose-response (DR) confirmatory screening
- **Non-binders (binds=0)**: Compounds marked "Inactive" in single-dose (SD) primary screening

### PAINS Filtering

PAINS (Pan-Assay Interference Compounds) are filtered out using RDKit's FilterCatalog. These compounds tend to show activity across many assays because of non-specific mechanisms (e.g., aggregation, redox cycling, fluorescence interference) rather than genuine target binding.
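The label definitions and the compact-set description above can be illustrated with a small, standard-library-only sketch. It is not the repository's code: the function names `assign_label` and `build_compact`, the `dr_outcome`/`sd_outcome` arguments, the handling of compounds that match neither definition (dropped here), and the reading of "4x sampled non-binders" as "up to four non-binders per binder, sampled at random" are all assumptions made for illustration.

```python
import random


def assign_label(dr_outcome, sd_outcome):
    """Hypothetical labeling rule following the README's definitions.

    Returns 1 for a DR "Active" compound, 0 for an SD "Inactive" compound,
    and None for anything else (assumed to be excluded from the dataset).
    """
    if dr_outcome == "Active":
        return 1
    if sd_outcome == "Inactive":
        return 0
    return None


def build_compact(rows, ratio=4, seed=0):
    """Keep every binder and sample up to `ratio` non-binders per binder.

    `rows` is a list of dicts with a `binds` key. The 4:1 ratio mirrors the
    "binders + 4x sampled non-binders" description; the sampling scheme
    itself (uniform, seeded, over the whole set) is an assumption.
    """
    binders = [r for r in rows if r["binds"] == 1]
    non_binders = [r for r in rows if r["binds"] == 0]
    k = min(len(non_binders), ratio * len(binders))
    sampled = random.Random(seed).sample(non_binders, k)
    return binders + sampled


# Toy data: 2 binders and 10 non-binders (CIDs are made up).
full = [{"CID": i, "binds": 1} for i in range(2)]
full += [{"CID": 100 + i, "binds": 0} for i in range(10)]

compact = build_compact(full)
print(len(full), len(compact))
```

With two binders in the toy set, the compact version keeps both and draws eight of the ten non-binders, so evaluation cost scales with the number of binders rather than with the full screening library.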
## Scripts

### `scripts/build_val_test_sets.py`

Builds the validation and test parquet files from retrieved MF-PCBA data:

1. Reads manually reviewed and curated assay metadata from `data/MF-PCBA-Assay-Metadata.csv`
2. Filters to binding assays only (excludes phenotypic assays)
3. Extracts binders (DR active) and non-binders (SD inactive)
4. Removes PAINS compounds using RDKit's FilterCatalog
5. Adds protein sequence information to each compound
6. Creates full and compact versions of validation, test, and combined sets

**Requirements**: `pandas`, `numpy`, `pyarrow`, `rdkit`

**Usage**:

```bash
# First, retrieve the raw MF-PCBA data using scripts in original-mf-pcba/
# Then run:
python scripts/build_val_test_sets.py
```

## Original MF-PCBA

The `original-mf-pcba/` directory contains the original MF-PCBA retrieval code and scripts. See `original-mf-pcba/README-original.md` for details on downloading and processing the raw PubChem data.

## License

MIT License - see [LICENSE](LICENSE)

Original MF-PCBA code copyright (c) 2022 David Buterez.
Provider:
Leash-Biosciences