BGI-HangzhouAI/Benchmark_Dataset-variant_hotspot
Source: Hugging Face | Updated: 2025-10-16 | Indexed: 2025-10-18
Download link:
https://hf-mirror.com/datasets/BGI-HangzhouAI/Benchmark_Dataset-variant_hotspot
Description:
This dataset provides a benchmark for evaluating how well genomic models scale to ever-longer DNA inputs, using a mutation hotspot classification task. Hotspots are genomic regions whose mutation density is significantly higher than the local chromosomal background; they are identified from whole-genome variant data from the Chinese Pangenome Consortium (CPC). Sequences of 8 Kbp, 32 Kbp, and 128 Kbp are extracted to create three parallel tasks, so that models can be compared across input lengths. Each sequence is labeled as either hotspot (1) or non-hotspot (0), forming a balanced binary classification dataset designed for evaluating large-context genomic foundation models.
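The description does not say which statistical test was used to call hotspots. As a rough illustration only, the sketch below labels fixed-size windows by comparing each window's variant count against the mean count of its neighbors with a one-sided Poisson test. The Poisson model, the `flank` and `alpha` parameters, and the function names are all assumptions made for this example, not the dataset authors' procedure.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def label_windows(counts, flank=2, alpha=0.01):
    """Label each window 1 (hotspot) or 0 (non-hotspot).

    counts: variant counts per equal-sized window along one chromosome.
    flank:  number of neighboring windows on each side used as background
            (hypothetical parameter).
    alpha:  significance threshold (hypothetical parameter).
    """
    labels = []
    for i, c in enumerate(counts):
        lo, hi = max(0, i - flank), min(len(counts), i + flank + 1)
        neighbors = counts[lo:i] + counts[i + 1:hi]
        if not neighbors:
            labels.append(0)  # no background available to compare against
            continue
        background = sum(neighbors) / len(neighbors)
        if background > 0:
            p = poisson_sf(c, background)
        else:
            p = 0.0 if c > 0 else 1.0
        labels.append(1 if p < alpha else 0)
    return labels

# Toy counts: only the fourth window stands out from its neighbors.
counts = [10, 12, 9, 40, 11, 10, 8]
labels = label_windows(counts)
print(labels)  # [0, 0, 0, 1, 0, 0, 0]
```

A balanced dataset like the one described would additionally need non-hotspot windows sampled to match the number of hotspots; that step is not shown here.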
Provider:
BGI-HangzhouAI



