Protein Sequences with Gene Ontology Terms
Source: arXiv (indexed 2025-09-30)
Download links:
https://www.uniprot.org/ and https://alphafold.ebi.ac.uk/
Description:
This dataset contains protein sequences extracted from the UniProtKB database and classified according to six selected Gene Ontology (GO) terms covering molecular function, biological process, and cellular component. Each term comes with its own training set and evaluation set, split so that no data leaks between the two. Generated sequences are assessed for functionality and structural stability with several metrics, including CLS score, TM-score, RMSD, and pLDDT. For each term, the training set contains 10,000 protein sequences and the evaluation set contains 100,000 protein sequences. The intended task is protein sequence generation and evaluation.
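The description says the training and evaluation sets are kept free of leakage but does not say how this is enforced. As a minimal sketch, assuming the simplest possible criterion (no identical sequence in both sets), a user could sanity-check a downloaded split as follows. The function name `find_leaked_sequences` and the toy sequences are hypothetical and not part of the dataset; the dataset authors may apply a stricter criterion, such as a sequence-identity threshold.

```python
def find_leaked_sequences(train_seqs, eval_seqs):
    """Return sequences present in both sets (exact match, case-insensitive)."""
    def normalize(seq):
        # Strip surrounding whitespace and ignore letter case
        return seq.strip().upper()

    train = {normalize(s) for s in train_seqs}
    evaluation = {normalize(s) for s in eval_seqs}
    # Any sequence in the intersection is shared by both sets
    return sorted(train & evaluation)


# Toy sequences invented for illustration only, not taken from the dataset
train = ["MKTAYIAKQR", "MSDNELLKKA"]
evaluation = ["MGSSHHHHHH", "mktayiakqr"]

leaked = find_leaked_sequences(train, evaluation)
print(leaked)
```

An empty result means no exact duplicates were found. It does not rule out near-duplicate sequences, which would need an alignment-based or clustering-based check.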
Provider:
UniProtKB and AlphaFold protein structure database



