Synthyra/clustered_ppi_string
Source: Hugging Face. Updated 2026-02-04. Indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/Synthyra/clustered_ppi_string
Description:
This dataset repository contains multiple variants of a protein–protein interaction (PPI) dataset. Each variant is built by clustering proteins by sequence similarity and then constructing train/valid/test splits that are intended to be disjoint at the protein level, which makes it hard for a model to score well by memorizing near-identical sequences. Artifacts are stored as compressed pickles (*.pkl.gz), and the repo includes a helper downloader. Each split is a pandas.DataFrame with, at minimum, protein identifiers (IdA/IdB), organism identifiers (OrgA/OrgB), and labels (a value >0 indicates a positive interaction; 0 indicates a sampled negative). Some variants include additional columns (e.g., cluster_a, cluster_b).

A machine-readable index describes each variant. For example, string_human_st040 is sourced from string_human with threshold st040 and has 12000396 train rows, 10554 valid rows, 20110 test rows, a train positive rate of 0.500, and a maximum protein overlap of 0. Each variant also comes with detailed plots and statistics: label balance, organism distributions, cross-split organism shift tests, sequence length distributions, and top organism pairs.
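The split format described above can be checked with a few lines of pandas. The following is a minimal sketch, not the repo's helper downloader: the function names are hypothetical, the label column name `labels` is an assumption (the description does not state it), and the file path is whatever local *.pkl.gz file you have already downloaded. Only the IdA/IdB columns and the ">0 means positive" rule come from the description. Note that unpickling can execute arbitrary code, so load pickles only from sources you trust.

```python
import pandas as pd


def load_split(path):
    """Load one split from a gzip-compressed pickle (*.pkl.gz).

    pandas infers gzip compression from the .gz suffix.
    """
    return pd.read_pickle(path)


def positive_rate(df, label_col="labels"):
    """Fraction of rows whose label is > 0 (positive interactions).

    label_col is an assumed column name; adjust it to the actual column.
    """
    return float((df[label_col] > 0).mean())


def protein_ids(df):
    """All protein identifiers appearing on either side of a pair."""
    return set(df["IdA"]) | set(df["IdB"])


def protein_overlap(split_a, split_b):
    """Number of proteins shared by two splits (0 means protein-disjoint)."""
    return len(protein_ids(split_a) & protein_ids(split_b))
```

With two loaded splits, `protein_overlap(train, test)` gives a quick check of the protein-level disjointness the description claims, and `positive_rate(train)` can be compared against the train positive rate listed in the index.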
Provider:
Synthyra



