Comprehensive Dataset of Protein Sequences

Name: Comprehensive Dataset of Protein Sequences
Creator: AlphaFold Database
License: No description available

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/jw-chae/pLDDT_Predictor

Description:

This dataset contains 1.5 million diverse protein sequences, ranging in length from 50 to 2048 amino acids, and is used to train and evaluate the pLDDT-Predictor model. The data are split into a training set (80%), a validation set (10%), and a test set (10%), and pLDDT scores are standardized during training. The target task is predicting pLDDT scores, which are used to assess the quality of predicted protein structures.
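The length filter, 80/10/10 split, and score standardization described above can be sketched roughly as follows. This is a minimal illustration, not the pLDDT-Predictor code: the function names, the toy records, the random shuffle, and the choice of z-score standardization fitted on the training split only are all assumptions, since the listing states only that the scores were standardized.

```python
import random
from statistics import mean, pstdev

MIN_LEN, MAX_LEN = 50, 2048  # length bounds stated in the description


def in_length_range(sequence):
    """Return True if the sequence length is within the stated bounds."""
    return MIN_LEN <= len(sequence) <= MAX_LEN


def split_dataset(records, train_pct=80, val_pct=10, seed=0):
    """Shuffle records and split them into train/validation/test lists.

    Integer percentages avoid floating-point rounding in the split sizes;
    whatever remains after train and validation goes to the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_standardizer(scores):
    """Compute mean and population std (assumption: z-score standardization)."""
    mu = mean(scores)
    sigma = pstdev(scores) or 1.0  # guard against zero variance
    return mu, sigma


def standardize(score, mu, sigma):
    """Map a raw score to zero mean and unit variance under (mu, sigma)."""
    return (score - mu) / sigma


# Toy data (made up for illustration): 100 (sequence, pLDDT score) records.
toy = [("M" * (50 + i), 50.0 + (i % 50)) for i in range(100)]
records = [r for r in toy if in_length_range(r[0])]

train, val, test = split_dataset(records)
# Assumption: statistics are fitted on the training split only.
mu, sigma = fit_standardizer([score for _, score in train])
train_z = [standardize(score, mu, sigma) for _, score in train]

print(len(train), len(val), len(test))  # 80 10 10
```

Fitting the mean and standard deviation on the training split alone keeps validation and test statistics out of the model inputs; the same `mu` and `sigma` would then be reused to transform the other two splits.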

Provider:

AlphaFold Database
