UniProtCC - ProtBFN Training Data
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14678317
Description:
The UniProtCC dataset is a (C)leaned and (C)lustered subset of the January 2024 release of UniProt.
It is constructed by taking all UniProtKB entries with length < 512 and then filtering on the "protein existence" property, keeping only entries whose protein existence level is one of:
Experimental evidence at protein level
Experimental evidence at transcript level
Protein inferred from homology
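The selection rule above can be sketched as a simple predicate. This is only an illustration of the stated criteria, not the code used to build the dataset; the names `keep_entry`, `ALLOWED_EXISTENCE`, and `MAX_LENGTH_EXCLUSIVE` are hypothetical.

```python
# Protein existence levels kept in UniProtCC, as listed above.
ALLOWED_EXISTENCE = frozenset({
    "Experimental evidence at protein level",
    "Experimental evidence at transcript level",
    "Protein inferred from homology",
})

# Sequences must be strictly shorter than this, i.e. at most 511 residues.
MAX_LENGTH_EXCLUSIVE = 512


def keep_entry(sequence: str, protein_existence: str) -> bool:
    """Return True if an entry passes both the length and existence filters."""
    return (
        len(sequence) < MAX_LENGTH_EXCLUSIVE
        and protein_existence in ALLOWED_EXISTENCE
    )
```

Because the length bound is strict, the longest sequence that passes has 511 residues, which matches the `511` in the file names.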
UniProtCC therefore contains no UniProtKB entries where the protein is either hypothetical or of uncertain origin. UniProtCC also comes with a clustering: the UniRef50 cluster assignments, also from the January 2024 release of UniProt. In the original work describing this dataset, "Protein Sequence Modelling with Bayesian Flow Networks", these cluster assignments were used to reweight the training of a generative model and debias it away from the most studied proteins within UniProtKB.
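One common way to use cluster assignments for reweighting is to weight each sequence by the inverse of its cluster size, so that every cluster carries the same total weight. The sketch below shows that scheme only as an assumption about what such reweighting could look like; it is not taken from the original work, and `inverse_cluster_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_cluster_weights(cluster_ids):
    """Give each sequence a sampling weight proportional to 1 / cluster size.

    The weights are normalised to sum to 1, so each cluster receives the
    same total probability mass regardless of how many sequences it holds.
    """
    counts = Counter(cluster_ids)
    raw = [1.0 / counts[cluster_id] for cluster_id in cluster_ids]
    total = sum(raw)
    return [weight / total for weight in raw]


# Toy example with made-up cluster labels: three sequences in cluster "A",
# one sequence in cluster "B".
weights = inverse_cluster_weights(["A", "A", "A", "B"])
```

In the toy example, the three sequences in cluster "A" share the same total weight as the single sequence in cluster "B", so a heavily populated cluster no longer dominates sampling.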
This dataset contains two CSV files:
uniprot_cc_511.csv. This file contains the UniProtCC dataset; it is approximately 22.7 GB and has 72,369,921 rows. Each row consists of:
uniprot_id (string). The unique UniProt ID of the entry.
amino_acids (string). The amino acid sequence of the entry.
uniref50_cluster_id (string). The unique identifier of the UniRef50 cluster with which the entry is associated.
uniprot_cc_511_cluster_sizes.csv. This file contains metadata about the UniRef50 cluster sizes; it is approximately 139 MB and has 6,460,005 rows. Each row consists of:
uniref50_cluster_id (string). The unique identifier of a UniRef50 cluster.
seq_count (integer). The number of entries within uniprot_cc_511.csv that are associated with this cluster. Summing seq_count across all rows in this CSV gives a total of 72,369,921, i.e. the total number of entries in uniprot_cc_511.csv.
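The relationship between the two files can be checked by streaming the entries file row by row and comparing per-cluster counts against seq_count. The sketch below assumes each CSV starts with a header row containing the column names listed above, which the description does not confirm; the function names are hypothetical, and the in-memory sample uses invented placeholder IDs rather than real records.

```python
import csv
import io
from collections import Counter


def count_entries_per_cluster(entries_file):
    """Stream an entries CSV and count rows per uniref50_cluster_id."""
    counts = Counter()
    for row in csv.DictReader(entries_file):
        counts[row["uniref50_cluster_id"]] += 1
    return counts


def read_cluster_sizes(sizes_file):
    """Read a cluster-sizes CSV into a {cluster_id: seq_count} dict."""
    return {
        row["uniref50_cluster_id"]: int(row["seq_count"])
        for row in csv.DictReader(sizes_file)
    }


# Tiny stand-ins for the two files (placeholder values, assumed header rows).
entries_csv = io.StringIO(
    "uniprot_id,amino_acids,uniref50_cluster_id\n"
    "ENTRY_1,MKV,CLUSTER_A\n"
    "ENTRY_2,MKL,CLUSTER_A\n"
    "ENTRY_3,MAT,CLUSTER_B\n"
)
sizes_csv = io.StringIO(
    "uniref50_cluster_id,seq_count\n"
    "CLUSTER_A,2\n"
    "CLUSTER_B,1\n"
)

observed = count_entries_per_cluster(entries_csv)
expected = read_cluster_sizes(sizes_csv)
total = sum(expected.values())
```

For the real files, the `io.StringIO` objects would be replaced with `open("uniprot_cc_511.csv", newline="")` and `open("uniprot_cc_511_cluster_sizes.csv", newline="")`. Reading one row at a time keeps memory use bounded by the number of clusters rather than by the 22.7 GB file size.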
Created:
2025-02-17



