PPI prediction from sequence, gold standard dataset

Name: PPI prediction from sequence, gold standard dataset
Creator: figshare
Published: 2025-06-01 05:01:53
License: No description available

DataCite Commons, updated 2025-06-01, indexed 2024-09-03

Download link:

https://figshare.com/articles/dataset/PPI_prediction_from_sequence_gold_standard_dataset/21591618/2

Description:

Gold standard dataset for sequence-based protein-protein interaction (PPI) prediction.

Big dataset: 163,192 training points (Intra-1), 59,260 validation points (Intra-0), and 52,048 test points (Intra-2).

No direct data leakage: proteins in the training set do not appear in the validation or test sets, proteins in the validation set do not appear in the training or test sets, and proteins in the test set do not appear in the validation or training sets.

Minimized sequence similarity between training, validation, and test sets: the whole human proteome was split with KaHIP such that sequence similarities between the parts are minimized with respect to length-normalized bitscores.

Redundancy reduction with CD-HIT: within each dataset, no two proteins have more than 40% pairwise sequence similarity.
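
The "no direct data leakage" property can be checked by confirming that the sets of proteins in the three splits are pairwise disjoint. The following is a minimal sketch of such a check, not the dataset authors' code. It assumes each split has already been loaded as a list of (protein_a, protein_b) ID pairs; the split names, the helper functions, and the toy protein IDs are all hypothetical, and the actual file layout on figshare is not described here.

```python
from itertools import combinations


def proteins_in(pairs):
    """Return the set of protein IDs that appear in a list of (a, b) pairs."""
    return {protein for pair in pairs for protein in pair}


def find_leakage(splits):
    """Map each pair of split names to the protein IDs the two splits share."""
    protein_sets = {name: proteins_in(pairs) for name, pairs in splits.items()}
    return {
        (first, second): protein_sets[first] & protein_sets[second]
        for first, second in combinations(protein_sets, 2)
    }


# Toy data with made-up protein IDs (assumption: not taken from the dataset).
toy_splits = {
    "train": [("P1", "P2"), ("P2", "P3")],
    "validation": [("P4", "P5")],
    "test": [("P6", "P7"), ("P7", "P3")],  # P3 deliberately leaks from train
}

leaks = find_leakage(toy_splits)
for split_pair, shared in leaks.items():
    print(split_pair, "shared proteins:", sorted(shared))
```

On a leakage-free split, every shared set is empty. Comparing protein sets rather than interaction pairs matters here, because two splits can share no pair and still share a protein.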

Provider:

figshare

Created:

2023-06-16

Collection summary

Dataset introduction