Comprehensive Dataset of Protein Sequences
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/jw-chae/pLDDT_Predictor
Description:
This dataset contains 1.5 million diverse protein sequences, ranging in length from 50 to 2048 amino acids, and is used to train and evaluate the pLDDT-Predictor model. For training, the data is split into a training set (80%), a validation set (10%), and a test set (10%), and the pLDDT scores are standardized during training. The task the dataset supports is predicting pLDDT scores, which serve as a quality assessment of predicted protein structures.
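The 80/10/10 split and the score standardization described above could look roughly like the following. This is a minimal sketch, not the pLDDT-Predictor code: the function names, the fixed random seed, and the choice of z-score standardization fitted on the training split only are all assumptions, since the description does not say how the scores were standardized or how the split was drawn.

```python
import random
import statistics


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts.

    The test set takes whatever remains after train and val,
    so the three parts always cover every index exactly once.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for reproducibility
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def fit_standardizer(scores):
    """Return the mean and population standard deviation of the scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)


def standardize(scores, mean, std):
    """Apply a z-score transform (assumed form of 'standardization')."""
    if std == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(s - mean) / std for s in scores]


if __name__ == "__main__":
    # Toy pLDDT-like scores in [0, 100]; real values would come from the dataset.
    rng = random.Random(1)
    plddt = [rng.uniform(30.0, 95.0) for _ in range(1000)]

    train_idx, val_idx, test_idx = split_indices(len(plddt))
    mean, std = fit_standardizer([plddt[i] for i in train_idx])

    # Reuse the training statistics for validation and test scores.
    val_z = standardize([plddt[i] for i in val_idx], mean, std)
    print(len(train_idx), len(val_idx), len(test_idx))
```

Fitting the mean and standard deviation on the training split alone keeps validation and test statistics out of the training targets; whether the original pipeline does this is not stated in the description.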
Provider:
AlphaFold Database



