
Protein Sequences with Gene Ontology Terms

Indexed from arXiv on 2025-09-30
Download links:
https://www.uniprot.org/ and https://alphafold.ebi.ac.uk/
Description:
This dataset contains protein sequences extracted from the UniProtKB database and grouped by six selected Gene Ontology (GO) terms covering molecular function, biological process, and cellular component. Each GO term (property) has its own training set and evaluation set, kept separate so that no data leaks between them. The functionality and structural stability of generated sequences are assessed with several metrics, including CLS score, TM-score, RMSD, and pLDDT. For each property, the training set contains 10,000 protein sequences and the evaluation set contains 100,000. The dataset is intended for protein sequence generation and evaluation.
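The description says the training and evaluation sets are kept free of leakage but does not state how the split was made. As a minimal sketch, assuming leakage means an identical sequence appearing in both sets, a check could look like the following. The function names and the toy sequences are hypothetical and are not part of the dataset; a stricter check would typically cluster sequences by identity rather than compare exact strings.

```python
from typing import Iterable, Set


def normalize(seq: str) -> str:
    """Remove whitespace and upper-case a sequence so formatting differences do not hide duplicates."""
    return "".join(seq.split()).upper()


def find_leaked_sequences(train: Iterable[str], evaluation: Iterable[str]) -> Set[str]:
    """Return the sequences present in both sets after normalization (exact-match assumption)."""
    train_set = {normalize(s) for s in train}
    eval_set = {normalize(s) for s in evaluation}
    return train_set & eval_set


# Toy sequences, made up for illustration only; they are not taken from the dataset.
train_seqs = ["MKTAYIAKQR", "MSDNELLKKA", "mkvl aaGIVG"]
eval_seqs = ["MKVLAAGIVG", "MTTQAPTFTQ"]

leaked = find_leaked_sequences(train_seqs, eval_seqs)
print(sorted(leaked))  # an empty list would mean no exact duplicates across the split
```

An empty result only rules out exact duplicates; near-identical homologs would still pass this check.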
Provided by:
UniProtKB and AlphaFold protein structure database