Benchmark_Dataset-Human_population_classification

Name: Benchmark_Dataset-Human_population_classification
Creator: maas
Published: 2025-12-05 16:54:58
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/BGI-HangzhouAI/Benchmark_Dataset-Human_population_classification

Resource description:

## Summary

This dataset provides a benchmark for evaluating a model's ability to leverage the richer genetic information in longer sequences to achieve more accurate inference. Using data from the Human Pangenome Reference Consortium (BioProject ID: PRJNA730823), we designed a population classification task focusing on African, East Asian, and European population groups. From each sample's VCF file and the reference genome sequence, we generated sample pseudo-sequences. Based on the variant site information recorded in the VCF files, we extracted variant-dense regions from chromosome 9. We used three sequence lengths: **8,192 bp (8K)**, **32,768 bp (32K)**, and **131,072 bp (128K)**. An XGBoost classifier was used to classify individual sequences.

## Usage

```python
from datasets import load_dataset

# Download the whole dataset
dataset = load_dataset("BGI-HangzhouAI/Benchmark_Dataset-Human_population_classification")

# Download a specific task
task_name = "Human_population_classification_8192"
dataset = load_dataset(
    "BGI-HangzhouAI/Benchmark_Dataset-Human_population_classification",
    data_files={
        "train": f"{task_name}/train.jsonl",
        "test": f"{task_name}/test.jsonl",
        "eval": f"{task_name}/eval.jsonl",
    },
)
```

## Benchmark tasks

| Task | `task_name` | Input fields | # Train Seqs | # Validation Seqs | # Test Seqs |
|------|-------------|--------------|--------------|-------------------|-------------|
| Human_population_classification 8k | `Human_population_classification_8192` | {seq, label} | 23,172 | 2,906 | 2,916 |
| Human_population_classification 32k | `Human_population_classification_32768` | {seq, label} | 23,207 | 2,913 | 2,925 |
| Human_population_classification 128k | `Human_population_classification_131072` | {seq, label} | 23,623 | 2,830 | 2,957 |

| Population | Sample count | Label |
|------------|--------------|-------|
| EUR-European | 30 | 0 |
| AFR-African | 69 | 1 |
| EAS-East Asian | 50 | 2 |

## Data processing

### 1. Pseudo-sequence Generation

For each sample, pseudo-sequences (hap1 and hap2) were generated from its VCF file and the reference genome sequence using `bcftools`.

### 2. VCF Variant Region Statistics

Using the sample VCF files, sliding windows of three different lengths (8K, 32K, 128K) were applied from the start of chromosome 9 of the reference genome. Consecutive windows overlapped by half the window length. The number of variants within each window was counted, and windows were then ranked in descending order of variant count to identify variant-dense genomic coordinates. Chromosome 9 was selected at random; other autosomes could also be used.

### 3. Centromere Removal

Centromeric regions, which are repetitive and non-coding and thus unsuitable for variant analysis or classification tasks, were filtered out according to a BED file. This resulted in a final mapping of genomic windows to their corresponding variant counts.

### 4. Data Selection

The samples for each label were split into training, validation, and test sets in an 8:1:1 ratio. Based on the window-to-variant-count mapping obtained above, the hap1 pseudo-sequence for chromosome 9 of each sample was segmented. Regions were selected starting from the highest variant count downwards, while keeping the number of sequences roughly balanced across labels.

### 5. Final format

Datasets are saved in JSONL format. Each record contains:

- `"seq"`: the DNA sequence string (A/C/G/T, uppercase)
- `"label"`: ternary class indicator (0 = EUR, 1 = AFR, 2 = EAS)

### 6. Additional information

The XGBoost model here used only the training and test sets. The reserved validation set is available for algorithms that require one, such as a multilayer perceptron (MLP).
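The window statistics and centromere filtering described in steps 2 and 3 can be illustrated with a minimal sketch. This is not the authors' pipeline. The function name `rank_windows`, the 0-based half-open coordinates, the choice to drop any window that overlaps an excluded interval at all, and the tie-breaking rule are all assumptions made for illustration. The filter is also folded into the counting loop for brevity, whereas the card describes it as a separate step.

```python
from bisect import bisect_left


def rank_windows(variant_positions, chrom_length, window, excluded=()):
    """Count variants in half-overlapping windows and rank them, densest first.

    variant_positions: 0-based variant coordinates on one chromosome.
    excluded: (start, end) half-open intervals, e.g. centromere rows of a BED file.
    Returns a list of (start, end, variant_count) tuples.
    """
    positions = sorted(variant_positions)
    step = window // 2  # consecutive windows overlap by half the window length
    ranked = []
    for start in range(0, chrom_length - window + 1, step):
        end = start + window
        # Assumption: a window touching an excluded interval is dropped entirely.
        if any(start < ex_end and ex_start < end for ex_start, ex_end in excluded):
            continue
        # Number of variants with start <= position < end.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        ranked.append((start, end, count))
    # Descending by variant count; ties broken by genomic start (an assumption).
    ranked.sort(key=lambda w: (-w[2], w[0]))
    return ranked


# Toy example: a 32 bp "chromosome", 8 bp windows, one excluded interval.
toy = rank_windows(
    [1, 2, 3, 9, 10, 11, 12, 30],
    chrom_length=32,
    window=8,
    excluded=[(16, 24)],
)
print(toy)
```

For the real task, `window` would be 8192, 32768, or 131072, and the variant positions would come from the sample VCF files for chromosome 9.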

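The per-label 8:1:1 split from step 4 can be sketched in the same spirit. The card does not say how samples were shuffled or how fractional counts were rounded, so the seeded shuffle, the floor rounding, and the rule that leftovers go to the test set are assumptions. The function name `split_samples_by_label` and the sample IDs are hypothetical.

```python
import random
from collections import defaultdict


def split_samples_by_label(sample_labels, ratios=(8, 1, 1), seed=0):
    """Split sample IDs into train/eval/test separately within each label."""
    by_label = defaultdict(list)
    for sample_id, label in sample_labels.items():
        by_label[label].append(sample_id)

    splits = {"train": [], "eval": [], "test": []}
    total = sum(ratios)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_train = len(ids) * ratios[0] // total
        n_eval = len(ids) * ratios[1] // total
        splits["train"] += ids[:n_train]
        splits["eval"] += ids[n_train:n_train + n_eval]
        # Assumption: samples left over after floor rounding go to the test set.
        splits["test"] += ids[n_train + n_eval:]
    return splits


# Toy input built from the per-population sample counts (EUR=30, AFR=69, EAS=50).
counts = {0: 30, 1: 69, 2: 50}
samples = {f"sample_{label}_{i}": label for label, n in counts.items() for i in range(n)}
splits = split_samples_by_label(samples)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by sample rather than by sequence keeps every window cut from one individual inside a single split.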

Provider: maas
Created: 2025-10-18
