GleghornLab/taxonomy_order

Source: Hugging Face. Updated: 2025-07-25. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/GleghornLab/taxonomy_order
Description:
This is a classification dataset built from UniProt's Swiss-Prot reviewed entries, containing protein sequences and their corresponding taxonomic labels. The dataset has been filtered to retain entries with sequence lengths between 20 and 2048, and taxonomic information has been extracted from the taxonomic_lineage_ids column. Sequences were then clustered with the CD-HIT algorithm at an 80% similarity threshold, keeping only the representative sequence of each cluster. Finally, the dataset is split into training, validation, and test sets of approximately 5,000 samples each, to ensure sample diversity.
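The length filter and three-way split described above can be sketched as follows. This is a minimal illustration on toy data, not the dataset authors' actual pipeline: the function names, the inclusive treatment of the 20 and 2048 bounds, the random shuffling, and the toy records are all assumptions, and the CD-HIT clustering step is omitted because it relies on an external tool.

```python
import random

MIN_LEN = 20    # lower bound from the description (assumed inclusive)
MAX_LEN = 2048  # upper bound from the description (assumed inclusive)


def filter_by_length(records, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep (sequence, label) pairs whose sequence length is within bounds."""
    return [(seq, label) for seq, label in records
            if min_len <= len(seq) <= max_len]


def split_three_ways(records, split_size, seed=0):
    """Shuffle, then cut into train/valid/test of at most split_size items each."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:split_size]
    valid = shuffled[split_size:2 * split_size]
    test = shuffled[2 * split_size:3 * split_size]
    return train, valid, test


# Hypothetical toy records (sequence, order label); not real dataset rows.
toy_records = [
    ("M" * 10, "Primates"),            # too short, dropped
    ("M" * 20, "Primates"),            # kept: exactly the lower bound
    ("M" * 50, "Poales"),
    ("M" * 120, "Rodentia"),
    ("M" * 300, "Rodentia"),
    ("M" * 900, "Enterobacterales"),
    ("M" * 2048, "Enterobacterales"),  # kept: exactly the upper bound
    ("M" * 2049, "Poales"),            # too long, dropped
]

kept = filter_by_length(toy_records)
# The real splits hold roughly 5,000 samples each; 2 is used for the toy data.
train, valid, test = split_three_ways(kept, split_size=2)
```

In the real dataset the split size would be on the order of 5,000 rather than 2, and the clustering step would run between the filter and the split so that only cluster representatives are distributed across the three sets.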
Provided by:
GleghornLab