Synthyra/ecoli_holdout_ppi_large
Source: Hugging Face. Updated 2026-03-18; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/Synthyra/ecoli_holdout_ppi_large
Description:
---
tags:
- protein-protein-interactions
- biology
- bioinformatics
---
# Clustered PPI datasets (BIOGRID + STRING) with sequence-disjoint splits
This dataset repo contains multiple **dataset variants** of protein–protein interactions (PPIs),
built by clustering proteins by sequence similarity and then constructing **train/valid/test** splits that are
intended to be **disjoint at the protein level** (and thus hard to memorize via near-identical sequences).
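The protein-level disjointness described above can be checked directly on the split dataframes. The following is a minimal sketch, assuming the `IdA`/`IdB` identifier columns listed in the split schema. The function names `proteins_in` and `max_protein_overlap` are hypothetical and not part of this repo, and the toy dataframes stand in for the real splits.

```python
from itertools import combinations

import pandas as pd


def proteins_in(df: pd.DataFrame) -> set:
    """Collect every protein identifier that appears on either side of a pair."""
    return set(df["IdA"]) | set(df["IdB"])


def max_protein_overlap(splits: dict) -> int:
    """Largest number of proteins shared by any two splits (0 means disjoint)."""
    ids = {name: proteins_in(df) for name, df in splits.items()}
    return max(
        (len(ids[a] & ids[b]) for a, b in combinations(ids, 2)),
        default=0,
    )


# Toy data standing in for the real train/valid/test dataframes.
toy_splits = {
    "train": pd.DataFrame({"IdA": ["P1", "P2"], "IdB": ["P3", "P1"]}),
    "valid": pd.DataFrame({"IdA": ["P4"], "IdB": ["P5"]}),
    "test": pd.DataFrame({"IdA": ["P6"], "IdB": ["P7"]}),
}
print(max_protein_overlap(toy_splits))
```

This only compares identifiers. It does not measure sequence similarity between proteins in different splits, which is what the clustering step is meant to control.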
Artifacts are stored as compressed pickles (`*.pkl.gz`). A helper downloader exists in this repo:
- `data_processing/download_ppi_data.py::download_clustered_ppi_data`
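If the helper is not available, a `*.pkl.gz` artifact can in principle be read with pandas alone, since `pandas.read_pickle` infers gzip compression from the `.gz` extension. The sketch below round-trips a toy dataframe through a temporary file. The file name `toy_split.pkl.gz` is hypothetical, because the actual artifact names in this repo are not listed here.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for one split.
toy = pd.DataFrame({"IdA": ["P1"], "IdB": ["P2"], "labels": [1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_split.pkl.gz")  # hypothetical file name
    toy.to_pickle(path)            # gzip compression inferred from ".gz"
    loaded = pd.read_pickle(path)  # decompression inferred the same way

print(loaded.equals(toy))
```

Unpickling executes arbitrary code embedded in the file, so only load pickles from sources you trust.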
## What’s in each split dataframe?
Each split is a `pandas.DataFrame` with (at minimum):
- **IdA / IdB**: protein identifiers
- **OrgA / OrgB**: organism identifiers (STRING taxon id for STRING datasets; BIOGRID org id for BIOGRID datasets)
- **labels**: `>0` indicates a positive interaction, `0` indicates a sampled negative
Some variants also include additional columns (e.g. `cluster_a`, `cluster_b`, `confidences`, `org_a`, `org_b`).
When negatives are concatenated, some of these columns may be `NaN` for negative rows.
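As an illustration of the label convention and the `NaN` behaviour for negatives, here is a small sketch on toy data. The rows and values are invented for the example, and the `is_positive` column is a hypothetical helper rather than part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows: two positives (labels > 0) and two sampled negatives (labels == 0).
toy = pd.DataFrame(
    {
        "IdA": ["P1", "P2", "P3", "P4"],
        "IdB": ["P5", "P6", "P7", "P8"],
        "labels": [1, 3, 0, 0],
        "confidences": [0.9, 0.7, np.nan, np.nan],  # missing for negatives
    }
)

# Collapse the raw labels to a binary target.
toy["is_positive"] = (toy["labels"] > 0).astype(int)
pos_rate = toy["is_positive"].mean()

# Fraction of missing `confidences` values within each class.
nan_by_class = toy.groupby("is_positive")["confidences"].apply(
    lambda s: s.isna().mean()
)

print(pos_rate)
print(nan_by_class)
```

Checking missingness per class before training helps avoid feeding a model a column whose `NaN` pattern trivially reveals the label.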
## Dataset variants (index)
A machine-readable index is available at:
- `tables/dataset_index.csv`
| variant | source | threshold | train rows | valid rows | test rows | train pos rate | protein overlap (max) |
|---|---:|---:|---:|---:|---:|---:|---:|
| `ecoli_holdout_st030` | `ecoli_holdout` | `st030` | 185132088 | 201460 | 976732 | 0.500 | 0 |
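
To put the row counts in the table into perspective, the short sketch below derives each split's share of the total and an approximate count of positive training rows. The positive count is only approximate because the table reports the rate rounded to three decimals.

```python
# Row counts copied from the `ecoli_holdout_st030` row of the index table.
rows = {"train": 185_132_088, "valid": 201_460, "test": 976_732}
train_pos_rate = 0.500  # rounded to three decimals in the table

total = sum(rows.values())
fractions = {name: n / total for name, n in rows.items()}
approx_train_pos = round(rows["train"] * train_pos_rate)

for name, frac in fractions.items():
    print(f"{name}: {rows[name]:,} rows ({frac:.2%})")
print(f"approx. train positives: {approx_train_pos:,}")
```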
## Per-variant deep dive (plots + stats)
Each variant has:
- `plots/<variant>/...png` (paths listed below)
- `tables/<variant>/summary.csv` and `tables/<variant>/schema.csv`
### `ecoli_holdout_st030`
<details>
<summary>Open report</summary>
**Summary tables**
- `tables/ecoli_holdout_st030/summary.csv`
- `tables/ecoli_holdout_st030/schema.csv`
**Label balance**
- train: `plots/ecoli_holdout_st030/train_label_counts.png`
- valid: `plots/ecoli_holdout_st030/valid_label_counts.png`
- test: `plots/ecoli_holdout_st030/test_label_counts.png`
**Organism distributions (positives vs negatives)**
- train data: `plots/ecoli_holdout_st030/train_organism_distribution.csv`
- train stats: `plots/ecoli_holdout_st030/train_organism_distribution_stats.csv`
- valid data: `plots/ecoli_holdout_st030/valid_organism_distribution.csv`
- valid stats: `plots/ecoli_holdout_st030/valid_organism_distribution_stats.csv`
- test data: `plots/ecoli_holdout_st030/test_organism_distribution.csv`
- test stats: `plots/ecoli_holdout_st030/test_organism_distribution_stats.csv`
**Cross-split organism shift tests**
- positives: `plots/ecoli_holdout_st030/cross_split_pos_stats.csv`
- negatives: `plots/ecoli_holdout_st030/cross_split_neg_stats.csv`
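
The card does not say which statistical test produced these files. As a rough illustration of what an organism shift between splits looks like, the sketch below computes the total variation distance between organism frequency distributions. This is a swapped-in measure, not necessarily the one used for the report. It assumes the `OrgA`/`OrgB` columns from the split schema; the function names and toy taxon ids are chosen for the example.

```python
import pandas as pd


def organism_distribution(df: pd.DataFrame) -> pd.Series:
    """Relative frequency of organisms over both sides of each pair."""
    orgs = pd.concat([df["OrgA"], df["OrgB"]], ignore_index=True)
    return orgs.value_counts(normalize=True)


def total_variation_distance(p: pd.Series, q: pd.Series) -> float:
    """0.0 for identical distributions, 1.0 for non-overlapping ones."""
    p, q = p.align(q, fill_value=0.0)  # outer join on organism id
    return 0.5 * float((p - q).abs().sum())


# Toy splits: train has human (9606) and mouse (10090), test has only E. coli.
train = pd.DataFrame({"OrgA": [9606, 9606, 10090], "OrgB": [9606, 10090, 10090]})
test = pd.DataFrame({"OrgA": [511145, 511145], "OrgB": [511145, 511145]})

tvd = total_variation_distance(
    organism_distribution(train), organism_distribution(test)
)
print(tvd)
```

In a holdout design, where the test organism is absent from training, a distance at or near 1.0 is expected rather than a sign of a problem.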
**Sequence length distributions (unique proteins)**
- train stats: `plots/ecoli_holdout_st030/train_seq_length_stats.csv`
- valid stats: `plots/ecoli_holdout_st030/valid_seq_length_stats.csv`
- test stats: `plots/ecoli_holdout_st030/test_seq_length_stats.csv`
**Top organism pairs**
- train positives: `plots/ecoli_holdout_st030/train_top_org_pairs_pos.png`
- train negatives: `plots/ecoli_holdout_st030/train_top_org_pairs_neg.png`
- valid positives: `plots/ecoli_holdout_st030/valid_top_org_pairs_pos.png`
- valid negatives: `plots/ecoli_holdout_st030/valid_top_org_pairs_neg.png`
- test positives: `plots/ecoli_holdout_st030/test_top_org_pairs_pos.png`
- test negatives: `plots/ecoli_holdout_st030/test_top_org_pairs_neg.png`
</details>
## How to download and load
Use the helper in this codebase:
```python
from data_processing.download_ppi_data import download_clustered_ppi_data

# BIOGRID example
train_df, valid_df, test_df, interaction_set, seq_dict = download_clustered_ppi_data(
    data_type='biogrid',
    cluster_percentage=0.5,
    hf_repo='Synthyra/ecoli_holdout_ppi_large',
)

# STRING example (descriptor must match the variant prefix: e.g. 'human' or 'model_orgs')
train_df, valid_df, test_df, interaction_set, seq_dict = download_clustered_ppi_data(
    data_type='string',
    descriptor='human',
    cluster_percentage=0.5,
    hf_repo='Synthyra/ecoli_holdout_ppi_large',
)
```
Provided by:
Synthyra



