Synthyra/omg_prot50_packed
Source: Hugging Face. Updated: 2025-07-03. Indexed: 2025-04-08.
Download link:
https://hf-mirror.com/datasets/Synthyra/omg_prot50_packed
Description:
This is a pre-tokenized version of the OMGprot50 dataset, with each sequence encoded as ESM2 token IDs stored as uint8. It includes train, validation, and test sets suitable for protein sequence analysis tasks. The dataset is clustered at 50% identity, so the training set and the evaluation sets are non-redundant by default. The evaluation sets consist of 10,000 randomly selected samples. The test set also includes, after deduplication, all new UniProt entries with transcript-level evidence added since the creation of OMG.
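As a rough illustration of what "ESM2 tokens in uint8 format" can look like, the following sketch packs a protein sequence into one byte per token and unpacks it again. The vocabulary order is assumed to match the standard 33-token ESM2 alphabet and should be checked against the tokenizer actually used. The helper names `encode` and `decode` are hypothetical, and whether the stored records include the `<cls>` and `<eos>` markers is an assumption, not something the description states.

```python
# Assumed ESM2 vocabulary order (33 tokens); verify against your tokenizer.
ESM2_VOCAB = [
    "<cls>", "<pad>", "<eos>", "<unk>",
    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K",
    "Q", "N", "F", "Y", "M", "H", "W", "C",
    "X", "B", "U", "Z", "O", ".", "-", "<null_1>", "<mask>",
]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(ESM2_VOCAB)}
UNK_ID = TOKEN_TO_ID["<unk>"]
SPECIAL = {"<cls>", "<pad>", "<eos>", "<unk>", "<null_1>", "<mask>"}


def encode(seq: str) -> bytes:
    """Pack a sequence into one uint8 per token (hypothetical helper).

    Assumes <cls> and <eos> wrap the sequence; the real dataset may differ.
    """
    ids = [TOKEN_TO_ID["<cls>"]]
    ids += [TOKEN_TO_ID.get(ch, UNK_ID) for ch in seq]
    ids.append(TOKEN_TO_ID["<eos>"])
    return bytes(ids)  # every ID is below 256, so one byte per token suffices


def decode(packed: bytes) -> str:
    """Recover the residue string, dropping special tokens."""
    return "".join(ESM2_VOCAB[i] for i in packed if ESM2_VOCAB[i] not in SPECIAL)


packed = encode("MKT")
print(list(packed), decode(packed))
```

Because the vocabulary has only 33 entries, every token ID fits in a single byte, which is why uint8 storage is lossless here and takes a quarter of the space of int32 IDs.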
Provider:
Synthyra



