GleghornLab/taxonomy_phylum
Source: Hugging Face. Updated 2025-07-25. Indexed 2025-08-09.
Download link:
https://hf-mirror.com/datasets/GleghornLab/taxonomy_phylum
Description:
The dataset consists of Swiss-Prot reviewed entries from a UniProt search, each with its taxonomic lineage (IDs), sequence, and length. Entries were filtered to sequence lengths between 20 and 2048, and the taxonomic ranks (domain, kingdom, phylum, class, order, family, genus, and species) were extracted from the lineage. Only the entry identifier, phylum, and sequence were kept, and rows with null values were removed. Sequences were clustered with CD-HIT at an 80% similarity threshold, keeping one representative sequence per cluster. Labels were created from the phylum, and phyla with fewer than 100 examples were removed. The data was then split with stratification: the test set (5,000 examples) was drawn first, then the validation set (5,000 examples), and the remainder forms the training set.
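The filtering, labelling, and splitting steps described above can be sketched in plain Python. This is a minimal illustration, not the maintainers' actual pipeline: the field names `entry`, `phylum`, and `seq`, the function names `stratified_take` and `build_splits`, the inclusive length bounds, and the proportional (largest-remainder) allocation used for stratification are all assumptions. The CD-HIT clustering step is an external tool and is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def stratified_take(rows, n, rng):
    """Split rows into (taken, rest), drawing n rows in proportion to phylum size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["phylum"]].append(row)
    total = len(rows)
    # Floor each phylum's proportional quota, then hand the leftover
    # slots to the phyla with the largest fractional remainders.
    exact = {p: n * len(g) / total for p, g in groups.items()}
    quota = {p: int(x) for p, x in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - quota[p], reverse=True)
    for p in by_remainder[:leftover]:
        quota[p] += 1
    taken, rest = [], []
    for p in sorted(groups):
        group = groups[p]
        rng.shuffle(group)
        taken.extend(group[:quota[p]])
        rest.extend(group[quota[p]:])
    return taken, rest


def build_splits(records, min_len=20, max_len=2048, min_per_phylum=100,
                 test_size=5000, val_size=5000, seed=0):
    """Filter records, label them by phylum, and split test -> validation -> train."""
    # Drop rows with missing fields or out-of-range length (bounds assumed inclusive).
    rows = [r for r in records
            if r.get("phylum") and r.get("seq")
            and min_len <= len(r["seq"]) <= max_len]
    # Remove phyla with too few examples.
    counts = Counter(r["phylum"] for r in rows)
    rows = [r for r in rows if counts[r["phylum"]] >= min_per_phylum]
    # Integer labels assigned in alphabetical order of phylum name (an assumption).
    label_of = {p: i for i, p in enumerate(sorted({r["phylum"] for r in rows}))}
    rows = [dict(r, label=label_of[r["phylum"]]) for r in rows]
    rng = random.Random(seed)
    # Test set is drawn first, then validation; the remainder is the training set.
    test, rest = stratified_take(rows, test_size, rng)
    val, train = stratified_take(rest, val_size, rng)
    return train, val, test, label_of
```

Calling `build_splits(records)` on a list of dictionaries returns the three splits plus the phylum-to-label mapping. Drawing the test set before the validation set mirrors the order stated in the description, so the test set does not change if the validation size is later adjusted.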
Provided by:
GleghornLab



