GleghornLab/taxonomy_family
Source: Hugging Face. Updated: 2025-07-25. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/GleghornLab/taxonomy_family
Description:
This dataset contains biological sequences paired with taxonomic family labels. Records are filtered to sequence lengths between 20 and 2048. Taxonomic IDs are extracted at each rank: domain, kingdom, phylum, class, order, family, genus, and species. The sequences are clustered with CD-HIT and only the representative sequences are kept. Labels are then created from the family rank, and families with fewer than 100 samples are removed. The dataset is split into training, validation, and test sets of 242331, 5000, and 5000 samples respectively.
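The length filter and the minimum-family-size filter described above can be sketched in a few lines of Python. This is an illustration only, not the authors' pipeline: the record keys `sequence` and `family`, the function name `filter_and_label`, the inclusive length bounds, and the order of the two filters are all assumptions, and the CD-HIT clustering step is left out.

```python
from collections import Counter

MIN_LEN, MAX_LEN = 20, 2048  # length bounds from the description (inclusive bounds are assumed)
MIN_FAMILY_SIZE = 100        # families with fewer samples are dropped


def filter_and_label(records, min_len=MIN_LEN, max_len=MAX_LEN,
                     min_family_size=MIN_FAMILY_SIZE):
    """Filter records and attach integer family labels.

    `records` is an iterable of dicts with the hypothetical keys
    'sequence' (str) and 'family' (str); the real column names may differ.
    """
    # Step 1: keep only sequences whose length is within the bounds.
    kept = [r for r in records if min_len <= len(r["sequence"]) <= max_len]

    # Step 2: count samples per family and drop under-represented families.
    counts = Counter(r["family"] for r in kept)
    kept = [r for r in kept if counts[r["family"]] >= min_family_size]

    # Step 3: map each remaining family name to a stable integer label.
    families = sorted({r["family"] for r in kept})
    label_of = {family: index for index, family in enumerate(families)}

    labeled = [{**r, "label": label_of[r["family"]]} for r in kept]
    return labeled, label_of
```

Calling `filter_and_label(records)` returns the surviving records with an added `label` field, plus the family-to-label mapping. Sorting the family names before numbering keeps the label assignment reproducible across runs.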
Provided by:
GleghornLab



