GleghornLab/taxonomy_order
Source: Hugging Face. Updated: 2025-07-25. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/GleghornLab/taxonomy_order
Description:
This is a classification dataset built from the reviewed Swiss-Prot entries of UniProt, pairing each sequence with its taxonomic label. Only entries with sequence lengths between 20 and 2048 were retained, and the taxonomic information was extracted from the taxonomic_lineage_ids column. Sequences were then clustered with the CD-HIT algorithm at an 80% similarity threshold, and only the representative sequences were kept. Finally, the dataset was split into training, validation, and test sets of approximately 5,000 samples each, with the aim of keeping the samples diverse.
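The length filter described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the maintainers' actual preprocessing code: the field names "sequence" and "label" and the helper filter_by_length are hypothetical, and treating both bounds as inclusive is a guess, since the description does not say.

```python
from typing import Dict, Iterable, List

MIN_LEN = 20    # lower bound taken from the description
MAX_LEN = 2048  # upper bound taken from the description


def filter_by_length(records: Iterable[Dict[str, str]],
                     min_len: int = MIN_LEN,
                     max_len: int = MAX_LEN) -> List[Dict[str, str]]:
    """Keep records whose sequence length lies in [min_len, max_len].

    Inclusive bounds are an assumption; the description does not specify.
    """
    return [r for r in records if min_len <= len(r["sequence"]) <= max_len]


# Hypothetical toy records; the field names are illustrative only.
toy = [
    {"sequence": "M" * 10, "label": "too_short"},
    {"sequence": "M" * 20, "label": "kept_min"},
    {"sequence": "M" * 2048, "label": "kept_max"},
    {"sequence": "M" * 2049, "label": "too_long"},
]
kept = filter_by_length(toy)
print([r["label"] for r in kept])
```

The clustering step is not reproduced here, because it relies on the external CD-HIT tool rather than on simple per-record logic.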
Provided by:
GleghornLab



