wjiaqi/evobind

Name: wjiaqi/evobind
Creator: wjiaqi
Published: 2026-03-11 13:54:28
License: No description available

Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/wjiaqi/evobind


Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
tags:
- biology
- protein
- protein-protein-interaction
- PPI
- bioinformatics
pretty_name: Human Protein-Protein Interaction Pairs (Cong Lab humanPPI)
size_categories:
- 100K<n<1M
---

# Human Protein-Protein Interaction Dataset

## Description

A curated dataset of human protein-protein interactions (PPIs) derived from the [Cong Lab humanPPI](https://conglab.swmed.edu/humanPPI/humanPPI_download.html) resource at UT Southwestern Medical Center. Each row represents an experimentally supported interacting protein pair enriched with metadata from UniProt.

The source data (`PPI_database_pairs`) contains candidate PPIs gathered from UniProt, BioGRID, and STRING physical interactions among ~19,528 human proteins.

## Features

| Column | Type | Description |
|--------|------|-------------|
| `pair_id` | string | Unique pair identifier (P0000000 format) |
| `protein_a` | string | UniProt accession of protein A (canonical order) |
| `protein_b` | string | UniProt accession of protein B (canonical order) |
| `protein_a_accession` | string | Same as protein_a (explicit accession column) |
| `protein_b_accession` | string | Same as protein_b (explicit accession column) |
| `entry_name_a` | string | UniProt entry name for protein A |
| `entry_name_b` | string | UniProt entry name for protein B |
| `gene_name_a` | string | Primary gene name for protein A |
| `gene_name_b` | string | Primary gene name for protein B |
| `species_a` | string | Organism scientific name for protein A |
| `species_b` | string | Organism scientific name for protein B |
| `taxon_id_a` | int | NCBI taxonomy ID for protein A |
| `taxon_id_b` | int | NCBI taxonomy ID for protein B |
| `sequence_a` | string | Amino acid sequence for protein A |
| `sequence_b` | string | Amino acid sequence for protein B |
| `length_a` | int | Sequence length for protein A |
| `length_b` | int | Sequence length for protein B |
| `reviewed_a` | bool | Whether protein A is Swiss-Prot reviewed |
| `reviewed_b` | bool | Whether protein B is Swiss-Prot reviewed |
| `benchmark_label` | string | Benchmark label (positive/negative) if available |
| `interface_size` | string | Interface size category if available |
| `predicted_precision` | float | Prediction precision level (80 or 90) if in final predictions |
| `rf2ppi_prob` | float | RF2-PPI interaction probability if available |
| `af2_prob` | float | AlphaFold2 interaction probability if available |
| `source_pipeline` | string | Prediction pipeline source if available |
| `pdb_template` | string | PDB template availability |
| `interaction_label` | int | 1 = positive interaction pair |
| `source_dataset` | string | Dataset provenance: `conglab_humanppi` |
| `evidence_type` | string | Evidence type: `ppi_database_pairs` |
| `split_random` | string | Random 90/5/5 train/valid/test split |
| `split_protein_disjoint` | string | Protein-disjoint split (no protein leakage) |
| `split_strict` | string | Connected-component-based strict split |

## Split Strategies

Three split annotations are provided to support different evaluation needs:

| Split | Train | Valid | Test |
|-------|-------|-------|------|
| `split_random` | 853,485 | 47,554 | 47,366 |
| `split_protein_disjoint` | 939,114 | 7,226 | 2,065 |
| `split_strict` | 947,308 | 572 | 525 |

1. **`split_random`**: Standard random 90/5/5 split at the pair level. Fastest for prototyping but allows protein leakage between splits.
2. **`split_protein_disjoint`**: Proteins are randomly assigned to splits first, then pairs inherit the split. Prevents individual protein leakage. Cross-split pairs default to train.
3. **`split_strict`**: Hub-aware protein-disjoint split. High-degree "hub" proteins (top 20% by interaction count) are forced into train. Only pairs between non-hub proteins appear in test/valid. The PPI network is one giant connected component (~17.7K proteins), making component-isolation splits impractical. This approach instead produces a small but challenging evaluation set of interactions among less-connected proteins.

## Dataset Size

- **948,405** interaction pairs (after deduplication and UniProt resolution)
- **17,669** unique proteins
- Storage: ~1 GB Parquet

## Source

- **Website**: https://conglab.swmed.edu/humanPPI/humanPPI_download.html
- **Reference**: Cong Lab, UT Southwestern Medical Center
- **Protein metadata**: [UniProt](https://www.uniprot.org/)

## License

The humanPPI dataset is governed by the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://conglab.swmed.edu/humanPPI/LICENSE.txt).

## Limitations

- This is a **positive-only** interaction dataset. Negative pairs are not included (except benchmark labels where available).
- PPI evidence includes direct binding, complex co-membership, and indirect interactions, so not all pairs represent physical binding interfaces.
- Species is expected to be uniformly *Homo sapiens* for this dataset. Non-human entries, if any, likely indicate mapping issues.
- Sequences are retrieved from the current UniProt release and may differ slightly from the sequences used in the original Cong Lab study.

## Citation

If you use this dataset, please cite the original Cong Lab humanPPI resource and the underlying databases (UniProt, BioGRID, STRING).
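
## Example: sketching the protein-level splits

The rules described under Split Strategies for `split_protein_disjoint` and `split_strict` can be illustrated with a small, self-contained sketch. This is not the dataset author's code. The function names (`assign_proteins`, `split_pairs`), the `hub_fraction` parameter, the tie-breaking by accession, the toy accessions, and the choice to send cross-split pairs to train in the hub-aware case are all assumptions made for illustration.

```python
import random
from collections import Counter


def assign_proteins(proteins, fractions, rng):
    """Randomly assign each protein to train/valid/test (hypothetical helper)."""
    ordered = sorted(proteins)  # sort first so the shuffle is reproducible
    rng.shuffle(ordered)
    n_train = round(len(ordered) * fractions[0])
    n_valid = round(len(ordered) * fractions[1])
    assignment = {}
    for i, protein in enumerate(ordered):
        if i < n_train:
            assignment[protein] = "train"
        elif i < n_train + n_valid:
            assignment[protein] = "valid"
        else:
            assignment[protein] = "test"
    return assignment


def split_pairs(pairs, hub_fraction=0.0, fractions=(0.9, 0.05, 0.05), seed=0):
    """Label each pair with a split derived from its two proteins.

    hub_fraction=0.0 mimics the protein-disjoint rule; hub_fraction=0.2
    mimics the hub-aware rule (top 20% by interaction count forced to train).
    """
    rng = random.Random(seed)
    degree = Counter(p for pair in pairs for p in pair)
    # Rank by descending degree; ties broken by accession (an assumption).
    ranked = sorted(degree, key=lambda p: (-degree[p], p))
    hubs = set(ranked[: int(len(ranked) * hub_fraction)])

    assignment = assign_proteins(set(ranked) - hubs, fractions, rng)
    assignment.update({hub: "train" for hub in hubs})

    labels = []
    for a, b in pairs:
        # A pair inherits a split only when both proteins agree;
        # cross-split pairs default to train.
        labels.append(assignment[a] if assignment[a] == assignment[b] else "train")
    return labels, assignment


# Toy network with made-up accessions; P1 has the highest degree.
toy_pairs = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5"),
    ("P5", "P6"), ("P6", "P7"), ("P7", "P8"), ("P9", "P10"),
]
labels, assignment = split_pairs(
    toy_pairs, hub_fraction=0.2, fractions=(0.6, 0.2, 0.2), seed=0
)
for pair, label in zip(toy_pairs, labels):
    print(pair, label)
```

Because most pairs touch at least one train protein or straddle two splits, the valid and test sets come out much smaller than the protein-level fractions suggest, which is consistent with the small valid/test counts reported for `split_protein_disjoint` and `split_strict`.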

Provided by: wjiaqi
