electricsheepafrica/ssa-structural-variation-catalog

Name: electricsheepafrica/ssa-structural-variation-catalog
Creator: electricsheepafrica
Published: 2025-11-23 23:44:51
License: CC BY-NC 4.0

Source: Hugging Face. Updated 2025-11-23; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/electricsheepafrica/ssa-structural-variation-catalog

Dataset description:

---
language:
- en
tags:
- genomics
- structural-variation
- synthetic-data
- copy-number-variation
- indels
- sub-saharan-africa
license: cc-by-nc-4.0
pretty_name: SSA Multi-ancestry Structural Variation Catalog (Germline)
task_categories:
- other
size_categories:
- 10M<n<100M
---

# SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)

## Dataset summary

This dataset provides a **germline structural variation (SV) catalog** for a **multi-ancestry cohort of 20,000 synthetic individuals** with a strong focus on **sub-Saharan African (SSA)** ancestry. It complements the genome-wide SNP array synthetic dataset by adding **copy number variants (CNVs)** and **small indels** with explicit **population-specific structural variants**.

The cohort includes:

- Four SSA regional groups (West, East, Central, Southern).
- An African American women (AAW) group as an admixed African diaspora reference.
- European (EUR) and East Asian (EAS) reference panels.

SVs are simulated on a synthetic genome scaffold (chromosomes 1–22, each 100 Mb) and are **not aligned to a real reference genome**. The dataset is therefore suitable for **methods development and benchmarking** (e.g., ancestry-aware SV detection, population genetics, burden analysis), **not** for clinical or individual-level inference.

All data are **fully synthetic** and were generated under the **GENOMICS Synthetic Data Playbook** used across the Electric Sheep Africa dataset family.

## Cohort design

### Sample size and populations

- **Total N**: 20,000 synthetic individuals.
- **Populations and sample sizes**:
  - `SSA_West`: 3,000
  - `SSA_East`: 3,000
  - `SSA_Central`: 2,000
  - `SSA_Southern`: 2,000
  - `AAW` (African American women, admixed): 3,000
  - `EUR` (European reference): 4,000
  - `EAS` (East Asian reference): 3,000
- **Sex distribution**:
  - `Male`: 50%
  - `Female`: 50%

The SSA subgroups are intended to be **compatible with other SSA-focused synthetic datasets** from Electric Sheep Africa (e.g., SNP array, colorectal genomic, ovarian somatic), enabling **cross-dataset method development**.

## Structural variation model

### SV classes

The catalog includes two broad classes of germline structural variants:

- **Copy number variants (CNVs)**
  - `CNV_del` – deletions.
  - `CNV_dup` – duplications.
- **Small indels** (1–50 bp)
  - `indel_del` – small deletions.
  - `indel_ins` – small insertions.

Each variant is represented as a **region on a synthetic chromosome** with:

- `chrom` – synthetic chromosome ("1"–"22").
- `start`, `end` – 0-based coordinates within the 100 Mb chromosome.
- `length_bp` – event length in base pairs.

### CNV and indel burden per individual

Per-sample SV burdens were tuned using literature-informed expectations from:

- Redon et al., *Nature* 2006 (first global CNV map).
- Sudmant et al., *Nature* 2015 (1000 Genomes integrated SV map).
- Collins et al., *Nature* 2020 (gnomAD-SV reference).

Target mean counts per individual (approximated in the generator):

- **CNVs**
  - `CNV_del`: mean ~80 deletions per individual (std ~25).
  - `CNV_dup`: mean ~60 duplications per individual (std ~20).
- **Small indels** (1–50 bp)
  - `indel_del`: mean ~200 deletions per individual (std ~50).
  - `indel_ins`: mean ~200 insertions per individual (std ~50).

This yields roughly **140 CNVs** and **400 small indels** per genome on average, producing a diverse but computationally manageable SV catalog.

### Length distributions

SV lengths follow type-specific distributions:

- **CNVs (CNV_del, CNV_dup)**
  - Log10-normal length distribution.
  - Approximate median length ~100 kb.
  - Length range: **1 kb – 5 Mb**.
- **Indels (indel_del, indel_ins)**
  - Uniform integer length.
  - Length range: **1 – 50 bp**.

These parameters are anchored qualitatively to the size spectra reported in large-scale SV resources, particularly **1000 Genomes SV** and **gnomAD-SV**.

## Population-specific structural variants

A key design feature is the inclusion of **population-enriched structural variants**, motivated by:

- Redon et al. 2006 – CNVs with marked population differentiation.
- Collins et al. 2020 – numerous African- and non-African-enriched SVs in gnomAD-SV.

In the synthetic model:

- A fixed fraction of events are designated **population-specific**:
  - `CNV_del`: 5% of deletions.
  - `CNV_dup`: 5% of duplications.
  - `indel_del`: 2% of small deletions.
  - `indel_ins`: 2% of small insertions.
- For each population-specific SV:
  - One **target population** is chosen (e.g., SSA_West, EUR, EAS, AAW).
  - In the **target population**, carrier frequencies are drawn to be **moderately common** (roughly 5–25%).
  - In **non-target populations**, carrier frequencies are constrained to be **very low** (≤0.5%).

This structure yields many SVs where **target/non-target frequency ratios exceed 5x**, giving a clear population-specific signal for benchmarking ancestry-aware SV methods and population genetics pipelines.

## Files and schema

### 1. `sv_samples.parquet`

One row per synthetic individual.

Core columns:

- `sample_id` – unique synthetic sample identifier.
- `population` – one of `SSA_West`, `SSA_East`, `SSA_Central`, `SSA_Southern`, `AAW`, `EUR`, `EAS`.
- `region` – SSA subregion (for SSA populations) or `Non_SSA` for reference panels.
- `is_SSA` – boolean flag for SSA populations.
- `is_reference_panel` – boolean flag for AAW/EUR/EAS reference groups.
- `sex` – `Male` or `Female`.

Burden summary columns:

- `n_CNV_del` – count of CNV deletions in this sample.
- `n_CNV_dup` – count of CNV duplications in this sample.
- `n_indel_del` – count of small deletions in this sample.
- `n_indel_ins` – count of small insertions in this sample.
- `n_cnvs` – total CNV count (`n_CNV_del + n_CNV_dup`).
- `n_indels` – total indel count (`n_indel_del + n_indel_ins`).
- `n_sv_total` – total SV count per sample.

These columns allow simple **burden analyses by ancestry, region, and sex** without loading the full event table.

### 2. `sv_events.parquet`

One row per **SV carrier** (i.e., per event per sample).

Core columns:

- `sv_id` – structural variant identifier (shared across carriers of the same event).
- `sample_id` – ID of the carrier.
- `sv_type` – `CNV_del`, `CNV_dup`, `indel_del`, or `indel_ins`.
- `population` – population label of the carrier sample.
- `chrom` – synthetic chromosome ("1"–"22").
- `start` – 0-based start coordinate (inclusive).
- `end` – end coordinate (exclusive).
- `length_bp` – event length in base pairs.
- `is_population_specific` – boolean flag; `True` for population-enriched events.
- `target_population` – population in which the event is enriched (if `is_population_specific=True`).

This table is the main **event-level catalog** for SV-based analyses.

### 3. `sv_frequencies.parquet`

One row per **SV–population** combination, summarizing carrier frequencies.

Core columns:

- `sv_id` – structural variant identifier.
- `sv_type` – SV type.
- `population` – population label.
- `carrier_count` – number of carriers in that population.
- `carrier_frequency` – carrier_count / N_population.
- `is_population_specific` – matches the flag in `sv_events.parquet`.
- `target_population` – target population for enriched SVs.

This table is designed for **population genetics** use cases (e.g., allele frequency spectra, Fst-like metrics, enrichment analyses) without needing to aggregate the full event table.

## Generation and validation

### Generation

The dataset was generated using the Python script:

- `structural_variation/scripts/generate_structural_variation.py`

Key steps:

1. **Sample generation**
   - Creates 20,000 individuals partitioned across the seven populations with the configured sex distribution.
2. **SV event definition**
   - For each SV type, defines a set of synthetic events with positions and lengths on the 22 synthetic chromosomes.
   - Distinguishes a subset of **population-specific events** with a target population.
3. **Frequency and carrier assignment**
   - For each event and population, draws carrier frequencies from Beta distributions (with different behavior for common vs low-frequency variants), modified for population-specific events.
   - Samples carrier individuals accordingly, generating the event-level and frequency tables.
4. **Burden summarization**
   - Aggregates per-sample SV counts by type and totals.

The configuration driving this process is stored in:

- `structural_variation/configs/structural_variation_config.yaml`

Literature links are documented in:

- `structural_variation/docs/LITERATURE_INVENTORY.csv`

### Validation

Validation follows the GENOMICS Synthetic Data Playbook and was performed using:

- `structural_variation/scripts/validate_structural_variation.py`

The validator reads the three Parquet tables and computes multiple checks, including:

- **C01 – Sample size matches config**
  - Confirms N = 20,000.
- **C02 – Population sample sizes vs config**
  - Per-population counts within an acceptable relative deviation (10%).
- **C03 – Required columns present**
  - Ensures essential schema columns in samples, events, and frequencies.
- **C04 – SV burden per sample vs config**
  - Compares observed mean counts by SV type to configured targets.
- **C05 – SV length spectrum by type**
  - Checks that min/median/max lengths are consistent with configured ranges.
- **C06 – Population-specific enrichment**
  - Quantifies target vs non-target carrier frequency ratios for population-specific SVs and confirms strong enrichment.
- **C07 – Missingness in key variables**
  - Ensures negligible missingness in key columns.
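To make checks such as C02 and C06 concrete, here is a minimal sketch of what they might compute. It is not the code in `validate_structural_variation.py`; the function names, the `floor` parameter, and the choice to divide by the mean non-target frequency are all assumptions made for illustration. The population sizes and the 10% tolerance come from the descriptions above, and the toy frequencies are invented values in the stated 5–25% and ≤0.5% ranges.

```python
from statistics import mean

# Configured population sizes, as listed in the cohort design section.
CONFIG_SIZES = {
    "SSA_West": 3000, "SSA_East": 3000, "SSA_Central": 2000,
    "SSA_Southern": 2000, "AAW": 3000, "EUR": 4000, "EAS": 3000,
}


def check_population_sizes(observed, expected=CONFIG_SIZES, tolerance=0.10):
    """C02-style check (hypothetical helper): each observed population count
    must lie within `tolerance` relative deviation of its configured size."""
    deviations = {
        pop: abs(observed.get(pop, 0) - n) / n for pop, n in expected.items()
    }
    passed = all(dev <= tolerance for dev in deviations.values())
    return passed, deviations


def enrichment_ratio(freq_by_pop, target_population, floor=1e-4):
    """C06-style metric (hypothetical helper): target carrier frequency divided
    by the mean non-target carrier frequency. `floor` is an assumed guard that
    avoids division by zero when no non-target carriers exist."""
    non_target = [
        freq for pop, freq in freq_by_pop.items() if pop != target_population
    ]
    return freq_by_pop[target_population] / max(mean(non_target), floor)


# Toy carrier frequencies for one population-specific SV, keyed by population,
# mimicking the `population` and `carrier_frequency` columns of
# sv_frequencies.parquet. Values are invented for illustration.
toy_frequencies = {
    "SSA_West": 0.15, "SSA_East": 0.005, "SSA_Central": 0.004,
    "SSA_Southern": 0.003, "AAW": 0.002, "EUR": 0.001, "EAS": 0.0,
}

sizes_ok, size_deviations = check_population_sizes(dict(CONFIG_SIZES))
ratio = enrichment_ratio(toy_frequencies, "SSA_West")
strongly_enriched = ratio > 5  # the 5x threshold mentioned in the text above
print(sizes_ok, round(ratio, 1), strongly_enriched)
```

On real data, the same functions could be fed counts from `sv_samples.parquet` and per-SV rows from `sv_frequencies.parquet`. Using the maximum non-target frequency instead of the mean would give a stricter enrichment criterion.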
The validation outputs a Markdown report:

- `structural_variation/output/validation_report.md`

For the released version of this dataset, all defined checks completed with an **overall status of `PASS`**.

## Intended use

This dataset is intended for:

- **Methods development** for SV detection, genotyping, and frequency estimation in multi-ancestry cohorts.
- **Population genetics and ancestry-aware modeling** of CNVs and indels, including SSA-focused questions.
- **Benchmarking** of burden tests and association pipelines that incorporate structural variation.
- **Teaching and demonstration** of SV analysis workflows without access to sensitive human data.

It is **not suitable** for:

- Clinical decision-making.
- Individual-level risk prediction.
- Inference about real individuals or specific real-world populations.

All samples and variants are fully synthetic and do not correspond to real persons.

## Ethical and privacy considerations

- The dataset is entirely synthetic and contains **no real patient data**.
- Cohort labels (e.g., SSA regions, AAW, EUR, EAS) are intended for **methodological realism** only.
- Users should avoid framing analyses as statements about real-world groups and should instead treat this resource as a **simulation tool**.

## License

- License: **CC BY-NC 4.0**.
- Non-commercial use is encouraged for research, teaching, and methods development.

## Citation

If you use this dataset in your work, please cite:

> Electric Sheep Africa. "SSA Multi-ancestry Structural Variation Catalog (Germline, Synthetic)." Hugging Face Datasets.

and, where appropriate, cite the SV resources that inspired the design:

- Redon R, et al. Global variation in copy number in the human genome. *Nature*. 2006.
- Sudmant PH, et al. An integrated map of structural variation in 2,504 human genomes. *Nature*. 2015.
- Collins RL, et al. A structural variation reference for medical and population genetics. *Nature*. 2020.
