electricsheepafrica/africa-synth-structural-variation-catalog-all
Hugging Face · updated 2026-04-14 · indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-synth-structural-variation-catalog-all
---
language:
- en
tags:
- genomics
- structural-variation
- synthetic-data
- copy-number-variation
- indels
- sub-saharan-africa
- synthetic
license: cc-by-nc-4.0
pretty_name: SSA Multi-ancestry Structural Variation Catalog (Germline)
task_categories:
- other
size_categories:
- 10M<n<100M
data_type: synthetic
---
> ⚠️ **Synthetic dataset** — Parameterized from published SSA literature, not real observations. Not suitable for empirical analysis or policy inference.
# SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)
## Dataset summary
This dataset provides a **germline structural variation (SV) catalog** for a **multi-ancestry cohort of 20,000 synthetic individuals** with a strong focus on **sub-Saharan African (SSA)** ancestry. It complements the genome-wide SNP array synthetic dataset by adding **copy number variants (CNVs)** and **small indels** with explicit **population-specific structural variants**.
The cohort includes:
- Four SSA regional groups (West, East, Central, Southern).
- An African American women (AAW) group as an admixed African diaspora reference.
- European (EUR) and East Asian (EAS) reference panels.
SVs are simulated on a synthetic genome scaffold (chromosomes 1–22, each 100 Mb) and are **not aligned to a real reference genome**. The dataset is therefore suitable for **methods development and benchmarking** (e.g., ancestry-aware SV detection, population genetics, burden analysis), **not** for clinical or individual-level inference.
All data are **fully synthetic** and were generated under the **GENOMICS Synthetic Data Playbook** used across the Electric Sheep Africa dataset family.
## Cohort design
### Sample size and populations
- **Total N**: 20,000 synthetic individuals.
- **Populations and sample sizes**:
  - `SSA_West`: 3,000
  - `SSA_East`: 3,000
  - `SSA_Central`: 2,000
  - `SSA_Southern`: 2,000
  - `AAW` (African American women, admixed): 3,000
  - `EUR` (European reference): 4,000
  - `EAS` (East Asian reference): 3,000
- **Sex distribution**:
  - `Male`: 50%
  - `Female`: 50%
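
The cohort layout above can be written down as a small sample table. This is a minimal sketch, not the released generator: the population sizes are the ones listed above, while the `build_samples` helper, the sample ID format, and the seed are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Population sizes as listed in the cohort design.
POPULATION_SIZES = {
    "SSA_West": 3000,
    "SSA_East": 3000,
    "SSA_Central": 2000,
    "SSA_Southern": 2000,
    "AAW": 3000,
    "EUR": 4000,
    "EAS": 3000,
}

def build_samples(sizes, seed=0):
    """Build a toy sample table (hypothetical helper, not the released generator)."""
    rng = np.random.default_rng(seed)
    populations = np.repeat(list(sizes), list(sizes.values()))
    n = len(populations)
    samples = pd.DataFrame({
        # The ID format is an assumption made for this illustration.
        "sample_id": [f"S{i:05d}" for i in range(n)],
        "population": populations,
        # Each sample is Male or Female with probability 0.5.
        "sex": rng.choice(["Male", "Female"], size=n),
    })
    samples["is_SSA"] = samples["population"].str.startswith("SSA_")
    return samples

samples = build_samples(POPULATION_SIZES)
print(samples["population"].value_counts())
```

The seven groups sum to 20,000, of which 10,000 carry an `SSA_` label.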
The SSA subgroups are intended to be **compatible with other SSA-focused synthetic datasets** from Electric Sheep Africa (e.g., SNP array, colorectal genomic, ovarian somatic), enabling **cross-dataset method development**.
## Structural variation model
### SV classes
The catalog includes two broad classes of germline structural variants:
- **Copy number variants (CNVs)**
  - `CNV_del` – deletions.
  - `CNV_dup` – duplications.
- **Small indels** (1–50 bp)
  - `indel_del` – small deletions.
  - `indel_ins` – small insertions.
Each variant is represented as a **region on a synthetic chromosome** with:
- `chrom` – synthetic chromosome ("1"–"22").
- `start`, `end` – 0-based coordinates within the 100 Mb chromosome.
- `length_bp` – event length in base pairs.
### CNV and indel burden per individual
Per-sample SV burdens were tuned using literature-informed expectations from:
- Redon et al., *Nature* 2006 (first global CNV map).
- Sudmant et al., *Nature* 2015 (1000 Genomes integrated SV map).
- Collins et al., *Nature* 2020 (gnomAD-SV reference).
Target mean counts per individual (approximated in the generator):
- **CNVs**
  - `CNV_del`: mean ~80 deletions per individual (std ~25).
  - `CNV_dup`: mean ~60 duplications per individual (std ~20).
- **Small indels** (1–50 bp)
  - `indel_del`: mean ~200 deletions per individual (std ~50).
  - `indel_ins`: mean ~200 insertions per individual (std ~50).
This yields roughly **140 CNVs** and **400 small indels** per genome on average, producing a diverse but computationally manageable SV catalog.
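
As a rough illustration of how these targets combine, the sketch below draws per-sample counts for each SV type and checks the resulting totals. The card does not state which count distribution the generator uses; the rounded, zero-clipped normal here is an assumption, and `draw_burdens` is a hypothetical helper.

```python
import numpy as np

# Target (mean, std) per SV type, taken from the list above.
BURDEN_TARGETS = {
    "CNV_del": (80, 25),
    "CNV_dup": (60, 20),
    "indel_del": (200, 50),
    "indel_ins": (200, 50),
}

def draw_burdens(n_samples, targets, seed=0):
    """Draw per-sample counts; the rounded, zero-clipped normal is an assumption."""
    rng = np.random.default_rng(seed)
    burdens = {}
    for sv_type, (mean, std) in targets.items():
        counts = rng.normal(mean, std, size=n_samples)
        # Round to whole events and forbid negative counts.
        burdens[sv_type] = np.clip(np.rint(counts), 0, None).astype(int)
    return burdens

burdens = draw_burdens(20_000, BURDEN_TARGETS)
mean_cnvs = (burdens["CNV_del"] + burdens["CNV_dup"]).mean()
mean_indels = (burdens["indel_del"] + burdens["indel_ins"]).mean()
print(f"mean CNVs per sample: {mean_cnvs:.1f}, mean indels: {mean_indels:.1f}")
```

At roughly 540 events per sample across 20,000 samples, an event table with one row per carrier would hold on the order of 10 million rows, which matches the `10M<n<100M` size category in the card metadata.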
### Length distributions
SV lengths follow type-specific distributions:
- **CNVs (CNV_del, CNV_dup)**
  - Log10-normal length distribution.
  - Approximate median length ~100 kb.
  - Length range: **1 kb – 5 Mb**.
- **Indels (indel_del, indel_ins)**
  - Uniform integer length.
  - Length range: **1 – 50 bp**.
These parameters are anchored qualitatively to the size spectra reported in large-scale SV resources, particularly **1000 Genomes SV** and **gnomAD-SV**.
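
The two length models can be sketched as follows. The median and ranges come from the list above; the spread `sigma_log10=0.7` and the choice to clip out-of-range CNV lengths (rather than redraw them) are assumptions, since the card does not specify either.

```python
import numpy as np

def draw_cnv_lengths(n, rng, median_bp=100_000, sigma_log10=0.7,
                     min_bp=1_000, max_bp=5_000_000):
    """Log10-normal CNV lengths clipped to the range; sigma_log10 is assumed."""
    # A normal draw on the log10 scale centred on log10(median) gives a
    # log10-normal length whose median is median_bp.
    log10_len = rng.normal(np.log10(median_bp), sigma_log10, size=n)
    lengths = np.rint(10.0 ** log10_len).astype(np.int64)
    return np.clip(lengths, min_bp, max_bp)

def draw_indel_lengths(n, rng, min_bp=1, max_bp=50):
    """Uniform integer indel lengths in [min_bp, max_bp]."""
    # rng.integers excludes the upper bound, hence max_bp + 1.
    return rng.integers(min_bp, max_bp + 1, size=n)

rng = np.random.default_rng(0)
cnv_lengths = draw_cnv_lengths(50_000, rng)
indel_lengths = draw_indel_lengths(50_000, rng)
print(int(np.median(cnv_lengths)), cnv_lengths.min(), cnv_lengths.max())
```

Clipping leaves the median untouched but piles a little probability mass at the 1 kb and 5 Mb bounds; redrawing out-of-range values would avoid that at the cost of a slightly shifted distribution.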
## Population-specific structural variants
A key design feature is the inclusion of **population-enriched structural variants**, motivated by:
- Redon et al. 2006 – CNVs with marked population differentiation.
- Collins et al. 2020 – numerous African- and non-African-enriched SVs in gnomAD-SV.
In the synthetic model:
- A fixed fraction of events are designated **population-specific**:
  - `CNV_del`: 5% of deletions.
  - `CNV_dup`: 5% of duplications.
  - `indel_del`: 2% of small deletions.
  - `indel_ins`: 2% of small insertions.
- For each population-specific SV:
  - One **target population** is chosen (e.g., SSA_West, EUR, EAS, AAW).
  - In the **target population**, carrier frequencies are drawn to be **moderately common** (roughly 5–25%).
  - In **non-target populations**, carrier frequencies are constrained to be **very low** (≤0.5%).
This structure yields many SVs where **target/non-target frequency ratios exceed 5x**, giving a clear population-specific signal for benchmarking ancestry-aware SV methods and population genetics pipelines.
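
The frequency bounds above imply a floor on the enrichment ratio, which the following sketch makes concrete. The bounds are the documented ones; drawing uniformly within them is an assumption (the generator is described elsewhere as using Beta distributions), and `draw_population_specific_freqs` is a hypothetical helper.

```python
import numpy as np

POPULATIONS = ["SSA_West", "SSA_East", "SSA_Central", "SSA_Southern",
               "AAW", "EUR", "EAS"]

def draw_population_specific_freqs(target, rng, populations=POPULATIONS):
    """Per-population carrier frequencies for one population-specific SV.

    Uniform draws within the documented bounds; the uniform shape is an
    assumption made for this illustration.
    """
    freqs = {}
    for pop in populations:
        if pop == target:
            freqs[pop] = rng.uniform(0.05, 0.25)   # moderately common
        else:
            freqs[pop] = rng.uniform(0.0, 0.005)   # very low
    return freqs

rng = np.random.default_rng(0)
freqs = draw_population_specific_freqs("SSA_West", rng)
max_non_target = max(f for pop, f in freqs.items() if pop != "SSA_West")
ratio = freqs["SSA_West"] / max_non_target
print(f"target / highest non-target frequency: {ratio:.1f}x")
```

With a target frequency of at least 5% and non-target frequencies of at most 0.5%, the expected-frequency ratio is at least 0.05 / 0.005 = 10. Observed ratios in finite samples can fall below that because carrier counts are themselves random.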
## Files and schema
### 1. `sv_samples.parquet`
One row per synthetic individual.
Core columns:
- `sample_id` – unique synthetic sample identifier.
- `population` – one of `SSA_West`, `SSA_East`, `SSA_Central`, `SSA_Southern`, `AAW`, `EUR`, `EAS`.
- `region` – SSA subregion (for SSA populations) or `Non_SSA` for reference panels.
- `is_SSA` – boolean flag for SSA populations.
- `is_reference_panel` – boolean flag for AAW/EUR/EAS reference groups.
- `sex` – `Male` or `Female`.
Burden summary columns:
- `n_CNV_del` – count of CNV deletions in this sample.
- `n_CNV_dup` – count of CNV duplications in this sample.
- `n_indel_del` – count of small deletions in this sample.
- `n_indel_ins` – count of small insertions in this sample.
- `n_cnvs` – total CNV count (`n_CNV_del + n_CNV_dup`).
- `n_indels` – total indel count (`n_indel_del + n_indel_ins`).
- `n_sv_total` – total SV count per sample.
These columns allow simple **burden analyses by ancestry, region, and sex** without loading the full event table.
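
A burden comparison of this kind might look like the sketch below. It uses the column names documented above on a few toy rows; the `summarize_burden` helper is hypothetical, and treating `n_sv_total` as `n_cnvs + n_indels` is an assumption based on the two SV classes in the catalog.

```python
import pandas as pd

def summarize_burden(samples, by=("population", "sex")):
    """Mean and std of SV counts per group, using the documented column names."""
    cols = ["n_cnvs", "n_indels", "n_sv_total"]
    return samples.groupby(list(by))[cols].agg(["mean", "std"])

# Toy rows standing in for the real sample table.
toy = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "population": ["SSA_West", "SSA_West", "EUR", "EUR"],
    "sex": ["Male", "Female", "Male", "Female"],
    "n_cnvs": [150, 130, 140, 120],
    "n_indels": [410, 390, 400, 380],
})
# Assumption: the total is the sum of the two SV classes.
toy["n_sv_total"] = toy["n_cnvs"] + toy["n_indels"]

summary = summarize_burden(toy, by=("population",))
print(summary)
```

On the released file, the same function would be applied to the result of `pd.read_parquet("sv_samples.parquet")`.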
### 2. `sv_events.parquet`
One row per **SV carrier** (i.e., per event per sample).
Core columns:
- `sv_id` – structural variant identifier (shared across carriers of the same event).
- `sample_id` – ID of the carrier.
- `sv_type` – `CNV_del`, `CNV_dup`, `indel_del`, or `indel_ins`.
- `population` – population label of the carrier sample.
- `chrom` – synthetic chromosome ("1"–"22").
- `start` – 0-based start coordinate (inclusive).
- `end` – end coordinate (exclusive).
- `length_bp` – event length in base pairs.
- `is_population_specific` – boolean flag; `True` for population-enriched events.
- `target_population` – population in which the event is enriched (if `is_population_specific=True`).
This table is the main **event-level catalog** for SV-based analyses.
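
Because `start` is inclusive and `end` is exclusive, region queries against this table follow the usual half-open interval rule. The sketch below shows that rule on toy rows; `events_overlapping` is a hypothetical helper, and the toy coordinates are invented for illustration.

```python
import pandas as pd

def events_overlapping(events, chrom, start, end):
    """Return events overlapping the half-open query interval [start, end)."""
    mask = (
        (events["chrom"] == chrom)      # chromosomes are stored as strings
        & (events["start"] < end)       # event begins before the query ends
        & (events["end"] > start)       # event ends after the query begins
    )
    return events[mask]

# Toy rows using the documented column names.
toy_events = pd.DataFrame({
    "sv_id": ["SV1", "SV2", "SV3"],
    "sample_id": ["S1", "S1", "S2"],
    "sv_type": ["CNV_del", "indel_ins", "CNV_dup"],
    "chrom": ["1", "1", "2"],
    "start": [1_000, 5_000, 1_500],
    "end": [3_000, 5_010, 4_000],
})

hits = events_overlapping(toy_events, "1", 2_999, 5_000)
print(hits[["sv_id", "chrom", "start", "end"]])
```

Here only `SV1` is returned: `SV2` starts exactly at the exclusive end of the query, so the two intervals touch but do not overlap, and `SV3` lies on a different chromosome.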
### 3. `sv_frequencies.parquet`
One row per **SV–population** combination, summarizing carrier frequencies.
Core columns:
- `sv_id` – structural variant identifier.
- `sv_type` – SV type.
- `population` – population label.
- `carrier_count` – number of carriers in that population.
- `carrier_frequency` – carrier_count / N_population.
- `is_population_specific` – matches the flag in `sv_events.parquet`.
- `target_population` – target population for enriched SVs.
This table is designed for **population genetics** use cases (e.g., allele frequency spectra, Fst-like metrics, enrichment analyses) without needing to aggregate the full event table.
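
The relationship between the event table and the frequency table can be sketched as a group-by. This is an illustration under the documented definition `carrier_count / N_population`, not the generator's code; `carrier_frequencies` is a hypothetical helper and the toy population sizes are invented.

```python
import pandas as pd

def carrier_frequencies(events, population_sizes):
    """Recompute per-population carrier counts and frequencies from events."""
    carriers = (
        events.groupby(["sv_id", "population"])["sample_id"]
        .nunique()                      # count each carrier once per SV
        .rename("carrier_count")
        .reset_index()
    )
    carriers["carrier_frequency"] = (
        carriers["carrier_count"] / carriers["population"].map(population_sizes)
    )
    return carriers

# Toy event rows and toy population sizes.
toy_events = pd.DataFrame({
    "sv_id": ["SV1", "SV1", "SV1", "SV2"],
    "sample_id": ["S1", "S2", "S5", "S1"],
    "population": ["SSA_West", "SSA_West", "EUR", "SSA_West"],
})
toy_sizes = {"SSA_West": 4, "EUR": 5}

freqs = carrier_frequencies(toy_events, toy_sizes)
print(freqs)
```

A recomputation like this only produces rows for SV–population pairs with at least one carrier; whether the released table also stores zero-carrier rows is not stated in the card.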
## Generation and validation
### Generation
The dataset was generated using the Python script:
- `structural_variation/scripts/generate_structural_variation.py`
Key steps:
1. **Sample generation**
   - Creates 20,000 individuals partitioned across the seven populations with the configured sex distribution.
2. **SV event definition**
   - For each SV type, defines a set of synthetic events with positions and lengths on the 22 synthetic chromosomes.
   - Distinguishes a subset of **population-specific events** with a target population.
3. **Frequency and carrier assignment**
   - For each event and population, draws carrier frequencies from Beta distributions (with different behavior for common vs low-frequency variants), modified for population-specific events.
   - Samples carrier individuals accordingly, generating the event-level and frequency tables.
4. **Burden summarization**
   - Aggregates per-sample SV counts by type and totals.

The configuration driving this process is stored in:
- `structural_variation/configs/structural_variation_config.yaml`

Literature links are documented in:
- `structural_variation/docs/LITERATURE_INVENTORY.csv`
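
Step 3 above (frequency draw followed by carrier assignment) can be sketched for a single event in a single population. The Beta shape parameters and the independent per-sample draws are assumptions; the card names Beta distributions but gives neither their parameters nor the carrier-sampling scheme, and `assign_carriers` is a hypothetical helper.

```python
import numpy as np

def assign_carriers(sample_ids, frequency, rng):
    """Pick carriers by independent per-sample draws at the given frequency."""
    sample_ids = np.asarray(sample_ids)
    # Each sample carries the event with probability `frequency`.
    is_carrier = rng.random(sample_ids.size) < frequency
    return sample_ids[is_carrier]

rng = np.random.default_rng(0)

# Beta(0.5, 20) is an assumed shape for a low-frequency variant.
frequency = rng.beta(0.5, 20.0)

# Integer IDs standing in for the 3,000 samples of one population.
sample_ids = np.arange(3000)
carriers = assign_carriers(sample_ids, frequency, rng)
print(f"frequency={frequency:.4f}, carriers={carriers.size}")
```

Repeating this over every event and population, and emitting one row per carrier, would yield a table shaped like `sv_events.parquet`; counting carriers per population would yield one shaped like `sv_frequencies.parquet`.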
### Validation
Validation follows the GENOMICS Synthetic Data Playbook and was performed using:
- `structural_variation/scripts/validate_structural_variation.py`
The validator reads the three Parquet tables and computes multiple checks, including:
- **C01 – Sample size matches config**
  - Confirms N = 20,000.
- **C02 – Population sample sizes vs config**
  - Per-population counts within an acceptable relative deviation (10%).
- **C03 – Required columns present**
  - Ensures essential schema columns in samples, events, and frequencies.
- **C04 – SV burden per sample vs config**
  - Compares observed mean counts by SV type to configured targets.
- **C05 – SV length spectrum by type**
  - Checks that min/median/max lengths are consistent with configured ranges.
- **C06 – Population-specific enrichment**
  - Quantifies target vs non-target carrier frequency ratios for population-specific SVs and confirms strong enrichment.
- **C07 – Missingness in key variables**
  - Ensures negligible missingness in key columns.
The validation outputs a Markdown report:
- `structural_variation/output/validation_report.md`
For the released version of this dataset, all defined checks completed with an **overall status of `PASS`**.
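
To give a sense of what such checks involve, the sketch below reimplements simplified versions of C01, C02, C03, and C07 for the sample table only. It is not the released validator: `run_basic_checks` is a hypothetical helper, the key-column list is an assumption, and "negligible missingness" is tightened here to zero missing values.

```python
import pandas as pd

def run_basic_checks(samples, expected_sizes, rel_tol=0.10):
    """Simplified versions of checks C01, C02, C03, and C07 on the sample table."""
    results = {}

    # C01: total sample size matches the configured total.
    results["C01"] = len(samples) == sum(expected_sizes.values())

    # C02: each population count within rel_tol of its configured size.
    observed = samples["population"].value_counts()
    results["C02"] = all(
        abs(observed.get(pop, 0) - n) <= rel_tol * n
        for pop, n in expected_sizes.items()
    )

    # C03: required columns present (assumed subset of the schema).
    key_cols = ["sample_id", "population", "sex"]
    results["C03"] = all(col in samples.columns for col in key_cols)

    # C07: no missing values in key columns (stricter than "negligible").
    results["C07"] = results["C03"] and not samples[key_cols].isna().any().any()
    return results

# Toy table with two made-up populations.
toy = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4", "S5"],
    "population": ["A", "A", "B", "B", "B"],
    "sex": ["Male", "Female", "Male", "Female", "Male"],
})
report = run_basic_checks(toy, {"A": 2, "B": 3})
print(report)
```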
## Intended use
This dataset is intended for:
- **Methods development** for SV detection, genotyping, and frequency estimation in multi-ancestry cohorts.
- **Population genetics and ancestry-aware modeling** of CNVs and indels, including SSA-focused questions.
- **Benchmarking** of burden tests and association pipelines that incorporate structural variation.
- **Teaching and demonstration** of SV analysis workflows without access to sensitive human data.
It is **not suitable** for:
- Clinical decision-making.
- Individual-level risk prediction.
- Inference about real individuals or specific real-world populations.
All samples and variants are fully synthetic and do not correspond to real persons.
## Ethical and privacy considerations
- The dataset is entirely synthetic and contains **no real patient data**.
- Cohort labels (e.g., SSA regions, AAW, EUR, EAS) are intended for **methodological realism** only.
- Users should avoid framing analyses as statements about real-world groups and should instead treat this resource as a **simulation tool**.
## License
- License: **CC BY-NC 4.0**.
- Non-commercial use is encouraged for research, teaching, and methods development.
## Citation
If you use this dataset in your work, please cite:
> Electric Sheep Africa. "SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)." Hugging Face Datasets.

and, where appropriate, cite the SV resources that inspired the design:
- Redon R, et al. Global variation in copy number in the human genome. *Nature*. 2006.
- Sudmant PH, et al. An integrated map of structural variation in 2,504 human genomes. *Nature*. 2015.
- Collins RL, et al. A structural variation reference for medical and population genetics. *Nature*. 2020.
Provider: electricsheepafrica



