wjiaqi/evobind
Source: Hugging Face | Updated: 2026-03-11 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/wjiaqi/evobind
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
tags:
- biology
- protein
- protein-protein-interaction
- PPI
- bioinformatics
pretty_name: Human Protein-Protein Interaction Pairs (Cong Lab humanPPI)
size_categories:
- 100K<n<1M
---
# Human Protein-Protein Interaction Dataset
## Description
A curated dataset of human protein-protein interactions (PPIs) derived from the
[Cong Lab humanPPI](https://conglab.swmed.edu/humanPPI/humanPPI_download.html) resource
at UT Southwestern Medical Center. Each row represents a database-reported
interacting protein pair, enriched with metadata from UniProt.
The source data (`PPI_database_pairs`) contains candidate PPIs gathered from
UniProt, BioGRID, and STRING physical interactions among ~19,528 human proteins.
## Features
| Column | Type | Description |
|--------|------|-------------|
| `pair_id` | string | Unique pair identifier (P0000000 format) |
| `protein_a` | string | UniProt accession of protein A (canonical order) |
| `protein_b` | string | UniProt accession of protein B (canonical order) |
| `protein_a_accession` | string | Same as protein_a (explicit accession column) |
| `protein_b_accession` | string | Same as protein_b (explicit accession column) |
| `entry_name_a` | string | UniProt entry name for protein A |
| `entry_name_b` | string | UniProt entry name for protein B |
| `gene_name_a` | string | Primary gene name for protein A |
| `gene_name_b` | string | Primary gene name for protein B |
| `species_a` | string | Organism scientific name for protein A |
| `species_b` | string | Organism scientific name for protein B |
| `taxon_id_a` | int | NCBI taxonomy ID for protein A |
| `taxon_id_b` | int | NCBI taxonomy ID for protein B |
| `sequence_a` | string | Amino acid sequence for protein A |
| `sequence_b` | string | Amino acid sequence for protein B |
| `length_a` | int | Sequence length for protein A |
| `length_b` | int | Sequence length for protein B |
| `reviewed_a` | bool | Whether protein A is Swiss-Prot reviewed |
| `reviewed_b` | bool | Whether protein B is Swiss-Prot reviewed |
| `benchmark_label` | string | Benchmark label (positive/negative) if available |
| `interface_size` | string | Interface size category if available |
| `predicted_precision` | float | Prediction precision level (80 or 90) if in final predictions |
| `rf2ppi_prob` | float | RF2-PPI interaction probability if available |
| `af2_prob` | float | AlphaFold2 interaction probability if available |
| `source_pipeline` | string | Prediction pipeline source if available |
| `pdb_template` | string | PDB template availability |
| `interaction_label` | int | 1 = positive interaction pair |
| `source_dataset` | string | Dataset provenance: `conglab_humanppi` |
| `evidence_type` | string | Evidence type: `ppi_database_pairs` |
| `split_random` | string | Random 90/5/5 train/valid/test split |
| `split_protein_disjoint` | string | Protein-disjoint split (no protein leakage) |
| `split_strict` | string | Hub-aware strict protein-disjoint split (hub proteins forced into train) |
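
The column descriptions above imply a few per-row invariants that can be checked before training. The following is a minimal sketch, not part of the dataset tooling: the helper name `check_row` is hypothetical, the `pair_id` pattern (`P` followed by seven digits) is inferred from the "P0000000 format" note, and the example row is invented for illustration.

```python
import re

# Assumed from the "P0000000 format" note: the letter P followed by seven digits.
PAIR_ID_RE = re.compile(r"^P\d{7}$")

def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    if not PAIR_ID_RE.match(row["pair_id"]):
        problems.append("pair_id does not match P + 7 digits")
    for side in ("a", "b"):
        # protein_x and protein_x_accession are documented as identical.
        if row[f"protein_{side}"] != row[f"protein_{side}_accession"]:
            problems.append(f"protein_{side} differs from protein_{side}_accession")
        # length_x is documented as the length of sequence_x.
        if row[f"length_{side}"] != len(row[f"sequence_{side}"]):
            problems.append(f"length_{side} does not equal len(sequence_{side})")
    if row["interaction_label"] != 1:
        problems.append("interaction_label is not 1")
    return problems

# Illustrative row only: values are invented and the sequences are toy strings.
example = {
    "pair_id": "P0000001",
    "protein_a": "P04637", "protein_a_accession": "P04637",
    "protein_b": "Q00987", "protein_b_accession": "Q00987",
    "sequence_a": "MEEPQ", "length_a": 5,
    "sequence_b": "MCNTN", "length_b": 5,
    "interaction_label": 1,
}
print(check_row(example))
```

Returning a list of messages rather than raising on the first failure makes it easier to summarize how many rows violate each invariant.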
## Split Strategies
Three split annotations are provided to support different evaluation needs:
| Split | Train | Valid | Test |
|-------|-------|-------|------|
| `split_random` | 853,485 | 47,554 | 47,366 |
| `split_protein_disjoint` | 939,114 | 7,226 | 2,065 |
| `split_strict` | 947,308 | 572 | 525 |
1. **`split_random`**: Standard random 90/5/5 split at the pair level. Fastest for prototyping, but the same protein can appear in more than one split (protein leakage).
2. **`split_protein_disjoint`**: Proteins are first randomly assigned to splits, and each pair then inherits the split of its proteins. This prevents individual protein leakage. Pairs whose two proteins fall in different splits (cross-split pairs) default to train.
3. **`split_strict`**: Hub-aware protein-disjoint split. High-degree "hub" proteins (top 20% by interaction count) are forced into train, so only pairs between non-hub proteins appear in valid/test. The PPI network forms one giant connected component (~17.7K proteins), which makes component-isolation splits impractical; this approach instead yields a small but challenging evaluation set of interactions among less-connected proteins.
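
The protein-level strategies described above can be sketched as follows. This is one possible reading of the description, not the code used to build the dataset: the function names, the protein-level fractions (`valid_frac`, `test_frac`), the seed, and the toy protein names are all assumptions; only the top-20% hub rule and the "cross-split pairs default to train" rule come from the text.

```python
import random
from collections import Counter

def hub_proteins(pairs, top_frac=0.20):
    """Return the top `top_frac` of proteins ranked by interaction count."""
    degree = Counter(p for pair in pairs for p in pair)
    ranked = sorted(degree, key=lambda p: (-degree[p], p))  # ties broken by name
    return set(ranked[: int(len(ranked) * top_frac)])

def assign_proteins(pairs, forced_train=frozenset(),
                    valid_frac=0.1, test_frac=0.1, seed=0):
    """Randomly assign each protein to a split; `forced_train` always gets train.

    The fractions are illustrative assumptions, not values from the dataset card.
    """
    rng = random.Random(seed)
    assignment = {}
    for protein in sorted({p for pair in pairs for p in pair}):
        draw = rng.random()
        if protein in forced_train:
            assignment[protein] = "train"
        elif draw < test_frac:
            assignment[protein] = "test"
        elif draw < test_frac + valid_frac:
            assignment[protein] = "valid"
        else:
            assignment[protein] = "train"
    return assignment

def split_pairs(pairs, assignment):
    """A pair inherits a split only if both proteins share it; otherwise train."""
    return [
        assignment[a] if assignment[a] == assignment[b] else "train"
        for a, b in pairs
    ]

# Toy network with hypothetical protein names: H is a hub, the rest are sparse.
pairs = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B"), ("C", "D")]
disjoint = split_pairs(pairs, assign_proteins(pairs))
strict = split_pairs(pairs, assign_proteins(pairs, forced_train=hub_proteins(pairs)))
```

In this sketch every valid/test pair contains only proteins assigned to that split, while a held-out protein can still occur in a train pair through the cross-split rule. Forcing hubs into train sends every pair that touches a hub to train, which is consistent with the very small valid/test counts reported for `split_strict`.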
## Dataset Size
- **948,405** interaction pairs (after deduplication and UniProt resolution)
- **17,669** unique proteins
- Storage: ~1 GB Parquet
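
As a quick consistency check, the per-split counts reported under Split Strategies can be summed and compared with the total pair count. The sketch below hard-codes the numbers from this card; `count_splits` is a hypothetical helper showing how the same counts could be recomputed from rows represented as dictionaries.

```python
from collections import Counter

TOTAL_PAIRS = 948_405  # total stated in this card

# Counts copied from the Split Strategies table.
REPORTED = {
    "split_random": {"train": 853_485, "valid": 47_554, "test": 47_366},
    "split_protein_disjoint": {"train": 939_114, "valid": 7_226, "test": 2_065},
    "split_strict": {"train": 947_308, "valid": 572, "test": 525},
}

def count_splits(rows, column):
    """Count rows per split value for one split column."""
    return Counter(row[column] for row in rows)

def fractions(counts):
    """Convert absolute counts into fractions of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Every split column should account for each pair exactly once.
for column, counts in REPORTED.items():
    assert sum(counts.values()) == TOTAL_PAIRS, column

random_fracs = fractions(REPORTED["split_random"])
print({name: f"{frac:.2%}" for name, frac in random_fracs.items()})
```

All three columns sum to 948,405, and the `split_random` fractions come out close to the stated 90/5/5.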
## Source
- **Website**: https://conglab.swmed.edu/humanPPI/humanPPI_download.html
- **Reference**: Cong Lab, UT Southwestern Medical Center
- **Protein metadata**: [UniProt](https://www.uniprot.org/)
## License
The humanPPI dataset is governed by the
[Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://conglab.swmed.edu/humanPPI/LICENSE.txt).
## Limitations
- This is a **positive-only** interaction dataset. Negative pairs are not included (except benchmark labels where available).
- PPI evidence includes direct binding, complex co-membership, and indirect interactions — not all pairs represent physical binding interfaces.
- Species is expected to be uniformly *Homo sapiens* for this dataset. Non-human entries, if any, likely indicate mapping issues.
- Sequences are retrieved from the current UniProt release and may differ slightly from the sequences used in the original Cong Lab study.
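
Two of the limitations above lend themselves to simple row filters: flagging non-human entries and isolating the benchmark negatives. The sketch below assumes rows are dictionaries, that a missing `benchmark_label` is stored as `None` (the card does not say how missing values are encoded), and that the example rows are invented. The taxonomy ID 9606 is the NCBI identifier for *Homo sapiens*.

```python
HUMAN_TAXON_ID = 9606  # NCBI taxonomy ID for Homo sapiens

def non_human_rows(rows):
    """Return rows in which either protein is not annotated as human."""
    return [
        row for row in rows
        if row["taxon_id_a"] != HUMAN_TAXON_ID or row["taxon_id_b"] != HUMAN_TAXON_ID
    ]

def labeled_negatives(rows):
    """Return rows whose benchmark_label is 'negative'; missing labels are skipped."""
    return [row for row in rows if row.get("benchmark_label") == "negative"]

# Hypothetical rows, invented for illustration only.
rows = [
    {"pair_id": "P0000001", "taxon_id_a": 9606, "taxon_id_b": 9606,
     "benchmark_label": "positive"},
    {"pair_id": "P0000002", "taxon_id_a": 9606, "taxon_id_b": 10090,
     "benchmark_label": None},
    {"pair_id": "P0000003", "taxon_id_a": 9606, "taxon_id_b": 9606,
     "benchmark_label": "negative"},
]
print([row["pair_id"] for row in non_human_rows(rows)])
print([row["pair_id"] for row in labeled_negatives(rows)])
```

Rows returned by `non_human_rows` are candidates for the mapping issues mentioned above and are worth inspecting before use.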
## Citation
If you use this dataset, please cite the original Cong Lab humanPPI resource and the underlying databases (UniProt, BioGRID, STRING).
Provider:
wjiaqi



