BGI-HangzhouAI/Benchmark_Dataset-Primate_mammal_species_classification
Source: Hugging Face. Updated 2025-12-18; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/BGI-HangzhouAI/Benchmark_Dataset-Primate_mammal_species_classification
Description:
This dataset provides a benchmark for evaluating a model's ability to identify species-specific genomic signatures across varying evolutionary distances. It uses whole-genome assembly data from seven mammals: four primates (human, chimpanzee [Pan troglodytes], orangutan [Pongo abelii], and marmoset [Callithrix]) and three non-primates (mouse, cattle, and sheep). The task is multi-class species classification. To rigorously assess generalization and prevent data leakage from homologous regions, distinct sets of chromosomes are assigned to the training, validation, and test splits. Non-overlapping sequences are sampled at four fixed lengths (1k, 8k, 32k, and 128k) to ensure uniform genomic coverage. The task challenges a model to learn global phylogenetic features rather than memorize specific genomic locations.
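The chromosome-level split and non-overlapping sampling described above can be illustrated with a minimal Python sketch. It is not the dataset authors' pipeline: the chromosome assignment in SPLIT_CHROMS, the helper names tile_windows and build_examples, and the choice to tile windows back to back (dropping any partial tail) are all assumptions made for illustration, since the description does not state which chromosomes go to which split or how windows are drawn.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical chromosome assignment (an assumption for illustration);
# the description does not say which chromosomes belong to which split.
SPLIT_CHROMS: Dict[str, List[str]] = {
    "train": ["chr1", "chr2", "chr3"],
    "validation": ["chr4"],
    "test": ["chr5"],
}


def tile_windows(seq_len: int, window: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for back-to-back, non-overlapping windows.

    A trailing fragment shorter than `window` is dropped.
    """
    for start in range(0, seq_len - window + 1, window):
        yield start, start + window


def build_examples(
    genomes: Dict[str, Dict[str, str]], window: int
) -> Dict[str, List[Tuple[str, str]]]:
    """Build (sequence, species_label) pairs per split.

    `genomes` maps species -> chromosome name -> sequence. Because each
    chromosome name appears in exactly one split, no window from a given
    chromosome can land in two splits.
    """
    examples: Dict[str, List[Tuple[str, str]]] = {s: [] for s in SPLIT_CHROMS}
    for species, chroms in genomes.items():
        for split, names in SPLIT_CHROMS.items():
            for name in names:
                seq = chroms.get(name, "")
                for start, end in tile_windows(len(seq), window):
                    examples[split].append((seq[start:end], species))
    return examples


# Toy data standing in for real assemblies (sequences are made up).
toy_genomes = {
    "human": {"chr1": "ACGT" * 3, "chr4": "A" * 8},
    "mouse": {"chr1": "T" * 8, "chr5": "G" * 5},
}
toy_examples = build_examples(toy_genomes, window=4)
```

Splitting by chromosome rather than by random window means a model is always evaluated on chromosomes it never saw in training, which is the leakage safeguard the description refers to.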
Provider:
BGI-HangzhouAI



