electricsheepafrica/ssa-structural-variation-catalog
Hugging Face · Updated 2025-11-23 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/ssa-structural-variation-catalog
Resource summary:
---
language:
- en
tags:
- genomics
- structural-variation
- synthetic-data
- copy-number-variation
- indels
- sub-saharan-africa
license: cc-by-nc-4.0
pretty_name: SSA Multi-ancestry Structural Variation Catalog (Germline)
task_categories:
- other
size_categories:
- 10M<n<100M
---
# SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)
## Dataset summary
This dataset provides a **germline structural variation (SV) catalog** for a **multi-ancestry cohort of 20,000 synthetic individuals** with a strong focus on **sub-Saharan African (SSA)** ancestry. It complements the genome-wide SNP array synthetic dataset by adding **copy number variants (CNVs)** and **small indels** with explicit **population-specific structural variants**.
The cohort includes:
- Four SSA regional groups (West, East, Central, Southern).
- An African American women (AAW) group as an admixed African diaspora reference.
- European (EUR) and East Asian (EAS) reference panels.
SVs are simulated on a synthetic genome scaffold (chromosomes 1–22, each 100 Mb) and are **not aligned to a real reference genome**. The dataset is therefore suitable for **methods development and benchmarking** (e.g., ancestry-aware SV detection, population genetics, burden analysis), **not** for clinical or individual-level inference.
All data are **fully synthetic** and were generated under the **GENOMICS Synthetic Data Playbook** used across the Electric Sheep Africa dataset family.
## Cohort design
### Sample size and populations
- **Total N**: 20,000 synthetic individuals.
- **Populations and sample sizes**:
  - `SSA_West`: 3,000
  - `SSA_East`: 3,000
  - `SSA_Central`: 2,000
  - `SSA_Southern`: 2,000
  - `AAW` (African American women, admixed): 3,000
  - `EUR` (European reference): 4,000
  - `EAS` (East Asian reference): 3,000
- **Sex distribution**:
  - `Male`: 50%
  - `Female`: 50%
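
The cohort layout above can be written down as a small mapping and sanity-checked. This is only an illustrative sketch: the names `POPULATION_SIZES` and `SSA_POPULATIONS` are hypothetical and are not taken from the generator's configuration file; the sizes are the ones listed above.

```python
# Illustrative cohort layout; variable names are hypothetical,
# the sample sizes are copied from the list above.
POPULATION_SIZES = {
    "SSA_West": 3000,
    "SSA_East": 3000,
    "SSA_Central": 2000,
    "SSA_Southern": 2000,
    "AAW": 3000,
    "EUR": 4000,
    "EAS": 3000,
}

SSA_POPULATIONS = {"SSA_West", "SSA_East", "SSA_Central", "SSA_Southern"}

total_n = sum(POPULATION_SIZES.values())
ssa_n = sum(n for pop, n in POPULATION_SIZES.items() if pop in SSA_POPULATIONS)

print(total_n)          # 20000
print(ssa_n / total_n)  # 0.5
```

Under these numbers, the four SSA regional groups make up half of the cohort.
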
The SSA subgroups are intended to be **compatible with other SSA-focused synthetic datasets** from Electric Sheep Africa (e.g., SNP array, colorectal genomic, ovarian somatic), enabling **cross-dataset method development**.
## Structural variation model
### SV classes
The catalog includes two broad classes of germline structural variants:
- **Copy number variants (CNVs)**
  - `CNV_del` – deletions.
  - `CNV_dup` – duplications.
- **Small indels** (1–50 bp)
  - `indel_del` – small deletions.
  - `indel_ins` – small insertions.
Each variant is represented as a **region on a synthetic chromosome** with:
- `chrom` – synthetic chromosome ("1"–"22").
- `start`, `end` – 0-based coordinates within the 100 Mb chromosome.
- `length_bp` – event length in base pairs.
### CNV and indel burden per individual
Per-sample SV burdens were tuned using literature-informed expectations from:
- Redon et al., *Nature* 2006 (first global CNV map).
- Sudmant et al., *Nature* 2015 (1000 Genomes integrated SV map).
- Collins et al., *Nature* 2020 (gnomAD-SV reference).
Target mean counts per individual (approximated in the generator):
- **CNVs**
  - `CNV_del`: mean ~80 deletions per individual (std ~25).
  - `CNV_dup`: mean ~60 duplications per individual (std ~20).
- **Small indels** (1–50 bp)
  - `indel_del`: mean ~200 deletions per individual (std ~50).
  - `indel_ins`: mean ~200 insertions per individual (std ~50).
This yields roughly **140 CNVs** and **400 small indels** per genome on average, producing a diverse but computationally manageable SV catalog.
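
One way to picture these targets is to draw per-individual counts from the stated means and standard deviations. The sketch below assumes a rounded normal distribution clipped at zero; the dataset card does not say which count distribution the generator actually uses, and `BURDEN_TARGETS` and `draw_burden` are hypothetical names.

```python
import random

# Target (mean, std) per SV type, as listed above.
BURDEN_TARGETS = {
    "CNV_del": (80, 25),
    "CNV_dup": (60, 20),
    "indel_del": (200, 50),
    "indel_ins": (200, 50),
}

def draw_burden(rng):
    """Draw one individual's SV counts (assumption: rounded normal, clipped at 0)."""
    return {
        sv_type: max(0, round(rng.gauss(mean, std)))
        for sv_type, (mean, std) in BURDEN_TARGETS.items()
    }

rng = random.Random(42)
cohort = [draw_burden(rng) for _ in range(5000)]

mean_cnvs = sum(c["CNV_del"] + c["CNV_dup"] for c in cohort) / len(cohort)
mean_indels = sum(c["indel_del"] + c["indel_ins"] for c in cohort) / len(cohort)
print(round(mean_cnvs, 1), round(mean_indels, 1))  # expected to land near 140 and 400
```
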
### Length distributions
SV lengths follow type-specific distributions:
- **CNVs (CNV_del, CNV_dup)**
  - Log10-normal length distribution.
  - Approximate median length ~100 kb.
  - Length range: **1 kb – 5 Mb**.
- **Indels (indel_del, indel_ins)**
  - Uniform integer length.
  - Length range: **1 – 50 bp**.
These parameters are anchored qualitatively to the size spectra reported in large-scale SV resources, particularly **1000 Genomes SV** and **gnomAD-SV**.
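
The length model can be sketched as follows. The log10 standard deviation (`CNV_LOG10_SIGMA = 0.6`) and the use of rejection sampling to enforce the 1 kb – 5 Mb range are assumptions made for illustration; the card states only the distribution family, the approximate median, and the range.

```python
import math
import random
import statistics

CNV_MIN_BP, CNV_MAX_BP = 1_000, 5_000_000
CNV_MEDIAN_BP = 100_000
CNV_LOG10_SIGMA = 0.6  # assumed spread; not stated in the dataset card

def draw_cnv_length(rng):
    """Log10-normal length, redrawn until it falls inside the allowed range."""
    while True:
        log10_len = rng.gauss(math.log10(CNV_MEDIAN_BP), CNV_LOG10_SIGMA)
        length = round(10 ** log10_len)
        if CNV_MIN_BP <= length <= CNV_MAX_BP:
            return length

def draw_indel_length(rng):
    """Uniform integer length between 1 and 50 bp, inclusive."""
    return rng.randint(1, 50)

rng = random.Random(0)
cnv_lengths = [draw_cnv_length(rng) for _ in range(10_000)]
indel_lengths = [draw_indel_length(rng) for _ in range(10_000)]

print(min(cnv_lengths), statistics.median(cnv_lengths), max(cnv_lengths))
print(min(indel_lengths), max(indel_lengths))
```
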
## Population-specific structural variants
A key design feature is the inclusion of **population-enriched structural variants**, motivated by:
- Redon et al. 2006 – CNVs with marked population differentiation.
- Collins et al. 2020 – numerous African- and non-African-enriched SVs in gnomAD-SV.
In the synthetic model:
- A fixed fraction of events are designated **population-specific**:
  - `CNV_del`: 5% of deletions.
  - `CNV_dup`: 5% of duplications.
  - `indel_del`: 2% of small deletions.
  - `indel_ins`: 2% of small insertions.
- For each population-specific SV:
  - One **target population** is chosen (e.g., SSA_West, EUR, EAS, AAW).
  - In the **target population**, carrier frequencies are drawn to be **moderately common** (roughly 5–25%).
  - In **non-target populations**, carrier frequencies are constrained to be **very low** (≤0.5%).
This structure yields many SVs whose **target/non-target frequency ratios exceed 5x** (the configured ranges alone imply roughly 10x or more, before sampling noise), giving a clear population-specific signal for benchmarking ancestry-aware SV methods and population genetics pipelines.
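
The frequency rule for a population-specific SV can be illustrated with a short sketch. It assumes uniform draws inside the stated ranges purely for simplicity; the generator is described as using Beta distributions, and the function name below is hypothetical.

```python
import random

POPULATIONS = ["SSA_West", "SSA_East", "SSA_Central", "SSA_Southern", "AAW", "EUR", "EAS"]

def draw_population_specific_frequencies(target, rng):
    """Assumed uniform draws: 5-25% in the target population, at most 0.5% elsewhere."""
    freqs = {}
    for pop in POPULATIONS:
        if pop == target:
            freqs[pop] = rng.uniform(0.05, 0.25)
        else:
            freqs[pop] = rng.uniform(0.0, 0.005)
    return freqs

rng = random.Random(1)
freqs = draw_population_specific_frequencies("SSA_West", rng)

# Compare the target against the most common non-target population.
max_non_target = max(f for pop, f in freqs.items() if pop != "SSA_West")
ratio = freqs["SSA_West"] / max_non_target
print(f"{ratio:.1f}x")  # at least 10x under these assumed ranges
```
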
## Files and schema
### 1. `sv_samples.parquet`
One row per synthetic individual.
Core columns:
- `sample_id` – unique synthetic sample identifier.
- `population` – one of `SSA_West`, `SSA_East`, `SSA_Central`, `SSA_Southern`, `AAW`, `EUR`, `EAS`.
- `region` – SSA subregion (for SSA populations) or `Non_SSA` for reference panels.
- `is_SSA` – boolean flag for SSA populations.
- `is_reference_panel` – boolean flag for AAW/EUR/EAS reference groups.
- `sex` – `Male` or `Female`.
Burden summary columns:
- `n_CNV_del` – count of CNV deletions in this sample.
- `n_CNV_dup` – count of CNV duplications in this sample.
- `n_indel_del` – count of small deletions in this sample.
- `n_indel_ins` – count of small insertions in this sample.
- `n_cnvs` – total CNV count (`n_CNV_del + n_CNV_dup`).
- `n_indels` – total indel count (`n_indel_del + n_indel_ins`).
- `n_sv_total` – total SV count per sample.
These columns allow simple **burden analyses by ancestry, region, and sex** without loading the full event table.
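
A minimal burden summary might look like the sketch below. It uses toy rows with made-up values rather than the real file (which could be loaded with, for example, `pandas.read_parquet`), and it assumes that `n_sv_total` equals `n_cnvs + n_indels`, which the schema suggests but does not state outright.

```python
from collections import defaultdict

# Toy rows using the documented column names; the values are invented.
samples = [
    {"sample_id": "S1", "population": "SSA_West",
     "n_CNV_del": 82, "n_CNV_dup": 58, "n_indel_del": 190, "n_indel_ins": 210},
    {"sample_id": "S2", "population": "SSA_West",
     "n_CNV_del": 78, "n_CNV_dup": 62, "n_indel_del": 205, "n_indel_ins": 195},
    {"sample_id": "S3", "population": "EUR",
     "n_CNV_del": 75, "n_CNV_dup": 55, "n_indel_del": 200, "n_indel_ins": 200},
]

def add_totals(row):
    """Derive the summary columns (n_sv_total = n_cnvs + n_indels is an assumption)."""
    out = dict(row)
    out["n_cnvs"] = row["n_CNV_del"] + row["n_CNV_dup"]
    out["n_indels"] = row["n_indel_del"] + row["n_indel_ins"]
    out["n_sv_total"] = out["n_cnvs"] + out["n_indels"]
    return out

def mean_burden_by_population(rows, column="n_sv_total"):
    """Average one burden column within each population label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["population"]].append(row[column])
    return {pop: sum(vals) / len(vals) for pop, vals in groups.items()}

by_pop = mean_burden_by_population([add_totals(r) for r in samples])
print(by_pop)  # {'SSA_West': 540.0, 'EUR': 530.0}
```
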
### 2. `sv_events.parquet`
One row per **SV carrier** (i.e., per event per sample).
Core columns:
- `sv_id` – structural variant identifier (shared across carriers of the same event).
- `sample_id` – ID of the carrier.
- `sv_type` – `CNV_del`, `CNV_dup`, `indel_del`, or `indel_ins`.
- `population` – population label of the carrier sample.
- `chrom` – synthetic chromosome ("1"–"22").
- `start` – 0-based start coordinate (inclusive).
- `end` – end coordinate (exclusive).
- `length_bp` – event length in base pairs.
- `is_population_specific` – boolean flag; `True` for population-enriched events.
- `target_population` – population in which the event is enriched (if `is_population_specific=True`).
This table is the main **event-level catalog** for SV-based analyses.
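
Because `start` is inclusive and `end` is exclusive, coordinates behave like half-open intervals. The sketch below shows the arithmetic this implies; it assumes `length_bp` equals `end - start`, which the convention suggests but the card does not state explicitly, and the helper names are hypothetical.

```python
def interval_length(start, end):
    """Length of a 0-based, half-open interval [start, end)."""
    return end - start

def overlaps(a, b):
    """True if two events on the same chromosome share at least one base."""
    return a["chrom"] == b["chrom"] and a["start"] < b["end"] and b["start"] < a["end"]

# Toy events; coordinates are invented for illustration.
cnv = {"chrom": "1", "start": 1_000_000, "end": 1_100_000}
indel = {"chrom": "1", "start": 1_099_990, "end": 1_100_000}
adjacent = {"chrom": "1", "start": 1_100_000, "end": 1_100_025}

print(interval_length(cnv["start"], cnv["end"]))  # 100000
print(overlaps(cnv, indel))     # True
print(overlaps(cnv, adjacent))  # False
```

With half-open coordinates, an event that starts exactly where another ends does not overlap it, which is why the last check is `False`.
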
### 3. `sv_frequencies.parquet`
One row per **SV–population** combination, summarizing carrier frequencies.
Core columns:
- `sv_id` – structural variant identifier.
- `sv_type` – SV type.
- `population` – population label.
- `carrier_count` – number of carriers in that population.
- `carrier_frequency` – carrier_count / N_population.
- `is_population_specific` – matches the flag in `sv_events.parquet`.
- `target_population` – target population for enriched SVs.
This table is designed for **population genetics** use cases (e.g., carrier-frequency spectra, Fst-like metrics, enrichment analyses) without needing to aggregate the full event table.
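
For reference, the relationship between the event table and the frequency table can be sketched as a simple aggregation. The rows and population sizes below are toy values, and `carrier_frequencies` is a hypothetical helper, not the generator's own code.

```python
from collections import Counter

def carrier_frequencies(events, population_sizes):
    """Count distinct carriers per (sv_id, population), then divide by N_population."""
    carriers = {(e["sv_id"], e["population"], e["sample_id"]) for e in events}
    counts = Counter((sv_id, pop) for sv_id, pop, _ in carriers)
    return {key: n / population_sizes[key[1]] for key, n in counts.items()}

# Toy event rows with the documented column names.
events = [
    {"sv_id": "SV1", "sample_id": "A1", "population": "SSA_West"},
    {"sv_id": "SV1", "sample_id": "A2", "population": "SSA_West"},
    {"sv_id": "SV1", "sample_id": "E1", "population": "EUR"},
]
toy_sizes = {"SSA_West": 10, "EUR": 20}  # toy sizes, not the real cohort

freqs = carrier_frequencies(events, toy_sizes)
print(freqs[("SV1", "SSA_West")])  # 0.2
print(freqs[("SV1", "EUR")])       # 0.05
```

Note that this sketch only returns SV–population pairs with at least one carrier; pairs with zero carriers would need to be filled in separately.
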
## Generation and validation
### Generation
The dataset was generated using the Python script:
- `structural_variation/scripts/generate_structural_variation.py`
Key steps:
1. **Sample generation**
   - Creates 20,000 individuals partitioned across the seven populations with the configured sex distribution.
2. **SV event definition**
   - For each SV type, defines a set of synthetic events with positions and lengths on the 22 synthetic chromosomes.
   - Distinguishes a subset of **population-specific events** with a target population.
3. **Frequency and carrier assignment**
   - For each event and population, draws carrier frequencies from Beta distributions (with different behavior for common vs low-frequency variants), modified for population-specific events.
   - Samples carrier individuals accordingly, generating the event-level and frequency tables.
4. **Burden summarization**
   - Aggregates per-sample SV counts by type and totals.
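
Step 3 can be pictured roughly as follows. The Beta parameters, the sample-ID format, and the per-sample coin flip are assumptions chosen for illustration; the actual shapes and sampling scheme live in the generator script and its config, which are not reproduced here.

```python
import random

def assign_carriers(sample_ids, alpha, beta, rng):
    """Draw a carrier frequency from Beta(alpha, beta), then flip one coin per sample."""
    frequency = rng.betavariate(alpha, beta)
    carriers = [sid for sid in sample_ids if rng.random() < frequency]
    return frequency, carriers

rng = random.Random(7)
sample_ids = [f"EUR_{i:04d}" for i in range(4000)]  # hypothetical ID format

# Beta(1, 50) has mean 1/51 (about 2%): a hypothetical low-frequency setting.
frequency, carriers = assign_carriers(sample_ids, 1.0, 50.0, rng)
observed = len(carriers) / len(sample_ids)
print(f"drawn={frequency:.4f} observed={observed:.4f}")
```

The observed carrier frequency differs slightly from the drawn one because carriers are sampled, which is also why realized enrichment ratios vary around their configured values.
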
The configuration driving this process is stored in:
- `structural_variation/configs/structural_variation_config.yaml`
Literature links are documented in:
- `structural_variation/docs/LITERATURE_INVENTORY.csv`
### Validation
Validation follows the GENOMICS Synthetic Data Playbook and was performed using:
- `structural_variation/scripts/validate_structural_variation.py`
The validator reads the three Parquet tables and computes multiple checks, including:
- **C01 – Sample size matches config**
  - Confirms N = 20,000.
- **C02 – Population sample sizes vs config**
  - Per-population counts within an acceptable relative deviation (10%).
- **C03 – Required columns present**
  - Ensures essential schema columns in samples, events, and frequencies.
- **C04 – SV burden per sample vs config**
  - Compares observed mean counts by SV type to configured targets.
- **C05 – SV length spectrum by type**
  - Checks that min/median/max lengths are consistent with configured ranges.
- **C06 – Population-specific enrichment**
  - Quantifies target vs non-target carrier frequency ratios for population-specific SVs and confirms strong enrichment.
- **C07 – Missingness in key variables**
  - Ensures negligible missingness in key columns.
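
As an illustration of what a check like C02 involves, the sketch below compares observed population counts with their targets under a 10% relative tolerance. It is a hypothetical re-implementation for readers, not the code in `validate_structural_variation.py`.

```python
def check_population_sizes(observed, expected, tolerance=0.10):
    """C02-style check: each observed count within a relative tolerance of its target."""
    failures = {}
    for pop, target in expected.items():
        deviation = abs(observed.get(pop, 0) - target) / target
        if deviation > tolerance:
            failures[pop] = deviation
    status = "PASS" if not failures else "FAIL"
    return status, failures

expected = {"SSA_West": 3000, "EUR": 4000}

print(check_population_sizes({"SSA_West": 2950, "EUR": 4100}, expected))
# ('PASS', {})
print(check_population_sizes({"SSA_West": 2500, "EUR": 4000}, expected)[0])
# FAIL
```
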
The validation outputs a Markdown report:
- `structural_variation/output/validation_report.md`
For the released version of this dataset, all defined checks completed with an **overall status of `PASS`**.
## Intended use
This dataset is intended for:
- **Methods development** for SV detection, genotyping, and frequency estimation in multi-ancestry cohorts.
- **Population genetics and ancestry-aware modeling** of CNVs and indels, including SSA-focused questions.
- **Benchmarking** of burden tests and association pipelines that incorporate structural variation.
- **Teaching and demonstration** of SV analysis workflows without access to sensitive human data.
It is **not suitable** for:
- Clinical decision-making.
- Individual-level risk prediction.
- Inference about real individuals or specific real-world populations.
All samples and variants are fully synthetic and do not correspond to real persons.
## Ethical and privacy considerations
- The dataset is entirely synthetic and contains **no real patient data**.
- Cohort labels (e.g., SSA regions, AAW, EUR, EAS) are intended for **methodological realism** only.
- Users should avoid framing analyses as statements about real-world groups and should instead treat this resource as a **simulation tool**.
## License
- License: **CC BY-NC 4.0**.
- Non-commercial use is encouraged for research, teaching, and methods development.
## Citation
If you use this dataset in your work, please cite:
> Electric Sheep Africa. "SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)." Hugging Face Datasets.
and, where appropriate, cite the SV resources that inspired the design:
- Redon R, et al. Global variation in copy number in the human genome. *Nature*. 2006.
- Sudmant PH, et al. An integrated map of structural variation in 2,504 human genomes. *Nature*. 2015.
- Collins RL, et al. A structural variation reference for medical and population genetics. *Nature*. 2020.
Provided by:
electricsheepafrica



