electricsheepafrica/breast-cancer-genomics-ssa

Source: Hugging Face. Updated 2025-11-21; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/breast-cancer-genomics-ssa
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
- tabular-regression
pretty_name: Breast Cancer Genomics Synthetic Dataset (Sub-Saharan Africa)
size_categories:
- 100K<n<1M
tags:
- medical
- genomics
- breast-cancer
- synthetic
- gene-expression
- oncology
- africa
- healthcare
- precision-medicine
- molecular-subtypes
---

# A Comprehensive Synthetic Breast Cancer Genomics Dataset: Addressing African Population Underrepresentation in Cancer Research

## Abstract

**Background:** Breast cancer genomics research has been predominantly conducted in European-ancestry populations, with Sub-Saharan African (SSA) populations representing less than 2% of major genomic studies despite bearing a disproportionate cancer burden. This data gap limits the generalizability of precision medicine approaches and perpetuates health disparities.

**Methods:** We developed a literature-grounded synthetic dataset comprising 100,000 breast cancer patients with 130 clinico-genomic variables, including demographics, tumor characteristics, receptor status, gene expression profiles for 100 cancer-relevant genes, somatic mutations, genomic risk scores, and survival outcomes. The dataset incorporates population-specific distributions (50% SSA, 50% Caucasian) and molecular subtype-specific patterns derived from The Cancer Genome Atlas (TCGA), PAM50 classifier, and peer-reviewed literature.

**Results:** The dataset demonstrates high biological fidelity with 97-99% receptor-subtype concordance and gene expression patterns matching TCGA benchmarks. Luminal A tumors show elevated ESR1/PGR expression (+2.18/+1.76) and low proliferation (-0.85), HER2-enriched tumors exhibit ERBB2 amplification (+2.98), and triple-negative breast cancers (TNBC) display high basal markers (+2.60) and proliferation (+1.85). Mutation frequencies align with published rates: TP53 mutations in 80% of TNBC vs 12% of Luminal A; PIK3CA mutations in 45% of Luminal A vs 9% of TNBC.

**Conclusions:** This synthetic dataset provides a validated, privacy-safe resource for developing machine learning algorithms, benchmarking analytical methods, and training researchers in cancer genomics without the consent and privacy constraints that apply to patient data. The explicit inclusion of SSA populations enables investigation of population-specific patterns and promotes equity in computational oncology research.

**Availability:** The dataset is publicly available under the CC-BY-4.0 license in CSV (213 MB) and Parquet (103 MB) formats with comprehensive documentation.

**Keywords:** breast cancer, genomics, gene expression, synthetic data, machine learning, Sub-Saharan Africa, cancer disparities, precision medicine, molecular subtypes, TCGA, PAM50

---

## 1. Introduction

### 1.1 Background and Motivation

Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer death among women worldwide, with an estimated 2.3 million new cases and 685,000 deaths in 2020 [1]. The advent of high-throughput genomic technologies has revolutionized our understanding of breast cancer heterogeneity, enabling molecular classification into intrinsic subtypes (Luminal A, Luminal B, HER2-enriched, and basal-like/triple-negative) with distinct clinical behaviors, treatment responses, and outcomes [2,3].

However, a critical limitation of existing genomic research is the severe underrepresentation of non-European populations. Analysis of major cancer genomics initiatives reveals that individuals of African ancestry comprise less than 2% of participants in The Cancer Genome Atlas (TCGA) and other landmark studies [4,5]. This disparity is particularly concerning given that:

1. **Higher Incidence in Younger Women:** African women develop breast cancer at younger ages (median 52 years vs 62 years in Caucasians) [6]
2. **More Aggressive Subtypes:** Triple-negative breast cancer (TNBC) is 2-3 times more prevalent in African populations (20-30% vs 10-15%) [7,8]
3. **Worse Outcomes:** African ancestry is associated with higher mortality even after adjusting for socioeconomic factors and treatment access [9]
4. **Genomic Differences:** Emerging evidence suggests distinct mutational landscapes and gene expression patterns in African populations [10,11]

The lack of diverse genomic data creates a "precision medicine gap" where algorithmic tools, risk prediction models, and therapeutic strategies developed from European-centric datasets may not generalize to underrepresented populations [12].

### 1.2 The Need for Synthetic Data

While collecting real-world diverse genomic data remains essential, synthetic datasets offer complementary value:

- **Privacy:** No patient consent or ethical approval required
- **Accessibility:** Can be freely shared for research and education
- **Scale:** Generate arbitrarily large sample sizes for robust algorithm development
- **Control:** Systematically vary specific parameters to study their effects
- **Benchmarking:** Establish ground truth for validating analytical methods
- **Training:** Enable education in genomics and bioinformatics without data access barriers

### 1.3 Objectives

This work presents a comprehensive synthetic breast cancer genomics dataset designed to:

1. **Represent diverse populations:** Include 50% Sub-Saharan African samples with population-specific characteristics
2. **Capture biological complexity:** Model gene expression, mutations, clinical features, and outcomes with literature-grounded parameters
3. **Enable multiple research applications:** Support classification, regression, survival analysis, and method development
4. **Provide validation benchmarks:** Include known biological relationships for algorithm testing
5. **Facilitate reproducible research:** Offer fully documented, version-controlled data with generation code

---

## 2. Methods

### 2.1 Data Generation Framework

We developed a modular Python-based framework for generating synthetic genomic data with the following design principles:

1. **Literature-Grounded Parameters:** All probability distributions are derived from peer-reviewed publications and public databases (TCGA, SEER, clinical trials)
2. **Dependency-Aware Generation:** Topological sorting ensures variables are generated in the correct dependency order (e.g., molecular subtype → receptor status → gene expression)
3. **Multivariate Modeling:** Gene expression is modeled as multivariate normal distributions to preserve biological correlations
4. **Realistic Missingness:** Missing data patterns follow Missing Completely At Random (MCAR) and Missing At Random (MAR) mechanisms
5. **Population Stratification:** Distinct parameter sets for Sub-Saharan African and Caucasian populations
6. **Reproducibility:** A fixed random seed (seed=42) ensures reproducible generation

### 2.2 Variable Categories and Definitions

The dataset comprises 130 variables across five categories:

#### 2.2.1 Clinical Variables (n=17)

**Demographics (n=3)**

- `population`: Sub-Saharan African (50%) vs Caucasian (50%)
- `country`: 19 countries (Nigeria, Kenya, South Africa, Ghana, Ethiopia, Uganda, Tanzania, Rwanda, Senegal, Zimbabwe, USA, UK, Germany, France, Netherlands, Sweden, Canada, Australia, Belgium)
- `age_at_diagnosis`: Continuous, years (range: 18-95)
  - SSA: Mean 52.0±12.5 years [6]
  - Caucasian: Mean 62.0±13.5 years

**Tumor Characteristics (n=6)**

- `molecular_subtype`: Categorical (Luminal A, Luminal B, HER2-enriched, TNBC)
  - Distribution based on PAM50 classifier and population-specific prevalences [2,7]
- `tumor_stage`: TNM staging I-IV
  - Stage distribution: I (30%), II (40%), III (20%), IV (10%)
- `tumor_grade`: Histological grade 1-3 (Nottingham grading system)
- `tumor_size_cm`: Continuous (0.3-20 cm), gamma distribution
- `lymph_nodes_positive`: Count (0-30), stage-dependent
- `histological_type`: IDC
(75%), ILC (15%), mixed (7%), other (3%), where IDC and ILC denote invasive ductal and invasive lobular carcinoma

**Receptor Status (n=3)**

- `ER_status`: Estrogen receptor (positive/negative/mixed)
- `PR_status`: Progesterone receptor (positive/negative/mixed)
- `HER2_status`: HER2 amplification (positive/negative/mixed)

All three are generated conditionally on molecular subtype, with 97-99% concordance.

**Patient Factors (n=3)**

- `menopausal_status`: Pre- vs post-menopausal (age-dependent)
- `BMI`: Body mass index (kg/m²)
  - SSA: 26.5±5.0 [13]
  - Caucasian: 27.8±6.2
- `family_history`: None (85%), first-degree relative (13%), known BRCA carrier (2%)

**Treatment & Technical (n=2)**

- `primary_treatment`: Surgery, chemotherapy, palliative care, none
- `batch`: Sequencing batch effects (1-10)

#### 2.2.2 Gene Expression (n=100)

Gene expression values are modeled as multivariate normal distributions with subtype-specific means and covariance structures. All values are log2-normalized relative to a reference. Genes are grouped into 10 functional categories:

**Luminal Markers (n=10):** ESR1, PGR, FOXA1, GATA3, XBP1, BCL2, AR, TFF1, AGR2, ERBB4
- High expression in Luminal A/B subtypes [2]

**HER2 Amplicon (n=10):** ERBB2, GRB7, PGAP3, STARD3, TCAP, PNMT, PPARBP, NRG1, IKZF3, MED1
- Co-amplified with ERBB2 on chromosome 17q12 [14]

**Basal/TNBC Markers (n=10):** EGFR, KIT, KRT5, KRT17, FOXC1, MYC, SOX10, EN1, GABRP, TRIM29
- Elevated in basal-like/triple-negative tumors [15]

**Proliferation Genes (n=10):** MKI67, CCNB1, AURKA, BIRC5, TOP2A, UBE2C, CENPF, CEP55, NDC80, PTTG1
- Cell cycle and mitosis markers [16]

**Driver Genes (n=10):** TP53, PIK3CA, PTEN, AKT1, CDH1, MAP3K1, NCOR1, TBX3, RUNX1, CBFB
- Frequently mutated oncogenes and tumor suppressors [17]

**DNA Repair Genes (n=10):** BRCA1, BRCA2, ATM, CHEK2, PALB2, RAD51, BARD1, RAD50, NBN, FANCA
- Homologous recombination and DNA damage response [18]

**Cell Cycle Genes (n=10):** CDKN2A, CDKN1B, RB1, CCND1, CDK4, CDK6, CDKN1A, E2F1, CCNE1, CDC25A
- Cell cycle checkpoints and regulation [19]

**PI3K Pathway
(n=10):** PIK3R1, AKT2, MTOR, TSC1, TSC2, PTEN2, INPP4B, PIK3CB, AKT3, RPTOR
- PI3K/AKT/mTOR signaling cascade [20]

**MAPK Pathway (n=10):** KRAS, NRAS, BRAF, MAP2K4, NF1, MEK1, ERK1, ERK2, RAF1, SOS1
- RAS/RAF/MEK/ERK signaling [21]

**Immune Genes (n=10):** CD8A, CD274 (PD-L1), PDCD1 (PD-1), CTLA4, LAG3, CXCL9, CXCL10, IFNG, TNF, IL6
- Immune checkpoint and cytokine markers [22]

**Expression Parameters by Subtype (mean ± SD, log2 scale):**

| Gene | Luminal A | Luminal B | HER2+ | TNBC |
|------|-----------|-----------|-------|------|
| ESR1 | +2.5±0.8 | +2.0±0.8 | -1.5±0.8 | -2.0±0.8 |
| ERBB2 | -0.5±0.6 | +0.5±0.8 | +3.0±0.8 | -0.5±0.6 |
| MKI67 | -1.0±0.7 | +0.5±0.8 | +1.5±0.9 | +2.0±0.9 |
| EGFR | -1.0±0.7 | -0.5±0.8 | +0.5±0.9 | +2.5±0.9 |

#### 2.2.3 Somatic Mutations (n=5)

Binary mutation status (three somatic genes and two germline genes) with subtype-specific rates:

**TP53 (Tumor Protein p53):**
- Luminal A: 12%, Luminal B: 29%, HER2+: 72%, TNBC: 80%
- One of the most frequently mutated genes in breast cancer [23]

**PIK3CA (PI3K catalytic subunit alpha):**
- Luminal A: 45%, Luminal B: 32%, HER2+: 39%, TNBC: 9%
- Hotspot mutations (H1047R, E545K, E542K) [24]

**PTEN (Phosphatase and Tensin Homolog):**
- Luminal A: 3%, Luminal B: 5%, HER2+: 8%, TNBC: 12%
- Tumor suppressor in PI3K pathway [25]

**BRCA1 (Germline):**
- Enriched in TNBC (15%) vs Luminal A (<1%)
- Associated with basal-like phenotype [26]

**BRCA2 (Germline):**
- Family history dependent: None (0.5%), First-degree (5%), Carrier families (50%)
- Less subtype-specific than BRCA1 [26]

#### 2.2.4 Genomic Risk Scores (n=3)

**Oncotype DX Recurrence Score:**
- Range: 0-100 (continuous)
- Only applicable to ER+ tumors (60% missingness overall)
- Luminal A: 18±10 (low risk)
- Luminal B: 32±12 (intermediate-high risk)
- Clinical validation from TAILORx trial [27]

**PAM50 Risk of Recurrence (ROR) Score:**
- Range: 0-100 (continuous)
- All subtypes, 20% missingness
- Luminal A: 20±15, Luminal B: 55±18, HER2+: 60±20, TNBC: 75±15
- Based on 50-gene intrinsic subtype classifier [28]
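
The subtype-conditional parameters listed so far (the TP53 rates in Section 2.2.3 and the Oncotype DX definition above) can be sketched as a small sampler. This is an illustrative sketch, not the dataset's generation code: the helper name `sample_patient_features`, the use of NumPy, the equal-sized subtype groups, and the clipping of scores to the 0-100 range are all assumptions. Only the rates, means, and standard deviations come from the lists above.

```python
import numpy as np

# Subtype-specific TP53 mutation rates from Section 2.2.3.
TP53_RATE = {"Luminal A": 0.12, "Luminal B": 0.29, "HER2+": 0.72, "TNBC": 0.80}
# Oncotype DX mean and SD; the score is defined only for ER+ (luminal) subtypes.
ONCOTYPE_PARAMS = {"Luminal A": (18.0, 10.0), "Luminal B": (32.0, 12.0)}


def sample_patient_features(subtypes, seed=42):
    """Hypothetical helper: draw TP53 status and an Oncotype DX score per patient."""
    rng = np.random.default_rng(seed)
    subtypes = np.asarray(subtypes)
    n = subtypes.size
    tp53 = np.zeros(n, dtype=bool)
    oncotype = np.full(n, np.nan)  # NaN marks "score not applicable"
    for subtype, rate in TP53_RATE.items():
        mask = subtypes == subtype
        count = int(mask.sum())
        # Bernoulli draw with the subtype-specific mutation rate.
        tp53[mask] = rng.random(count) < rate
        if subtype in ONCOTYPE_PARAMS:
            mean, sd = ONCOTYPE_PARAMS[subtype]
            # Clipping to the 0-100 score range is an assumption of this sketch.
            oncotype[mask] = np.clip(rng.normal(mean, sd, count), 0.0, 100.0)
    return tp53, oncotype


# Toy cohort with equal-sized subtype groups (an assumption, not the real mix).
subtypes = np.repeat(["Luminal A", "Luminal B", "HER2+", "TNBC"], 5000)
tp53, oncotype = sample_patient_features(subtypes)
for name in TP53_RATE:
    print(name, "TP53 rate:", round(float(tp53[subtypes == name].mean()), 3))
print("Oncotype DX missing fraction:", float(np.isnan(oncotype).mean()))
```

Because the score is undefined for HER2+ and TNBC rows, its missingness is structural (tied to subtype) rather than random, which is worth keeping in mind before imputing it.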

**Genomic Grade Index (GGI):**
- Range: -2 to +2 (continuous)
- Proliferation-based signature, 25% missingness
- Grade 1: -0.5±0.3, Grade 2: 0.0±0.4, Grade 3: +0.6±0.35
- Validated prognostic marker [29]

#### 2.2.5 Survival Outcomes (n=5)

**Overall Survival:**
- `survival_months`: Time from diagnosis to death/censoring (Weibull distribution)
  - Luminal A: shape=1.5, scale=120 (median ~10 years)
  - Luminal B: shape=1.3, scale=84 (median ~7 years)
  - HER2+: shape=1.2, scale=72 (median ~6 years)
  - TNBC: shape=1.1, scale=60 (median ~5 years)
- `vital_status`: Alive (76%) vs Deceased (24%)

**Recurrence:**
- `recurrence_free_months`: Time to recurrence (Weibull distribution)
- `recurrence_event`: Binary (yes/no)
  - Luminal A: 20%, Luminal B: 35%, HER2+: 40%, TNBC: 45%

**Metastasis:**
- `distant_metastasis`: Binary, stage-dependent
  - Stage I: 5%, Stage II: 20%, Stage III: 50%, Stage IV: 90%

### 2.3 Statistical Modeling Approaches

#### 2.3.1 Categorical Variables

Generated using conditional categorical distributions:

```
X | parent ~ Categorical(θ_parent)
```

where the θ parameters are subtype- or population-specific.

#### 2.3.2 Continuous Variables

Modeled using conditional Gaussian or gamma distributions:

```
X | parent ~ N(μ_parent, σ²_parent) or Gamma(α_parent, β_parent)
```

#### 2.3.3 Gene Expression

Multivariate normal with subtype-specific covariance:

```
[G₁, G₂, ..., G₁₀₀]ᵀ ~ MVN(μ_subtype, Σ_subtype)
```

Correlation structure is modeled based on pathway co-regulation and TCGA coexpression data.

#### 2.3.4 Survival Outcomes

Parametric survival modeling using the Weibull distribution:

```
T ~ Weibull(α_subtype, λ_subtype)
h(t) = (α/λ) (t/λ)ᵅ⁻¹
```

where α is the shape parameter (hazard shape) and λ is the scale parameter, in months.

### 2.4 Validation Procedures

Generated data were validated against multiple benchmarks:

1. **Distributional Validation:** Compare marginal distributions to literature ranges
2. **Correlation Validation:** Verify expected relationships (e.g., ER+ with ESR1 expression)
3. **Concordance Validation:** Check receptor-subtype agreement (target: >95%)
4. **Biological Plausibility:** No impossible combinations (e.g., ER+ TNBC)
5. **Statistical Properties:** Verify that variance and skewness match expected patterns

---

## 3. Results

### 3.1 Dataset Overview

The final dataset comprises **100,000 patients** with **130 variables** across five categories. Table 1 summarizes the dataset composition.

**Table 1: Dataset Composition**

| Category | Variables | Missingness | Description |
|----------|-----------|-------------|-------------|
| Clinical | 17 | 0-5% | Demographics, tumor characteristics, treatment |
| Gene Expression | 100 | 2% per gene | Log2-normalized expression values |
| Mutations | 5 | 3-5% | Somatic and germline alterations |
| Genomic Scores | 3 | 20-40% | Risk prediction scores (ER+ specific) |
| Survival Outcomes | 5 | 5-12% | Overall survival, recurrence, metastasis |
| **Total** | **130** | **~2-5%** | **Complete clinico-genomic profiles** |

**File Statistics:**
- CSV Format: 213 MB (100,001 rows including header)
- Parquet Format: 103 MB (about 52% smaller than the CSV)
- Generation Time: ~45 seconds (100K samples)
- Memory Usage: ~2.5 GB peak

### 3.2 Population and Demographic Characteristics

**Table 2: Population Demographics**

| Characteristic | Sub-Saharan African (n≈50,000) | Caucasian (n≈50,000) | p-value |
|----------------|--------------------------------|----------------------|---------|
| Age at diagnosis (years) | 52.0 ± 12.5 | 62.0 ± 13.5 | <0.001 |
| BMI (kg/m²) | 26.5 ± 5.0 | 27.8 ± 6.2 | <0.001 |
| Pre-menopausal (%) | 45% | 25% | <0.001 |
| Stage III/IV (%) | 32% | 28% | <0.01 |
| TNBC prevalence (%) | 28% | 14% | <0.001 |
| Family history (%) | 15% | 15% | 0.89 |

**Key Observations:**
- 10-year age gap between populations matches epidemiological data [6]
- 2-fold higher TNBC prevalence in SSA aligns with literature [7,8]
- Higher pre-menopausal proportion in SSA consistent with younger age distribution
- BMI
distributions match population health surveys [13]

### 3.3 Molecular Subtype Distribution

**Table 3: Molecular Subtype Frequencies**

| Subtype | Overall | SSA | Caucasian | TCGA Reference [2] |
|---------|---------|-----|-----------|-------------------|
| Luminal A | 42,129 (42.1%) | 36% | 48% | 40-50% |
| Luminal B | 22,412 (22.4%) | 20% | 25% | 20-25% |
| HER2-enriched | 13,676 (13.7%) | 12% | 15% | 10-15% |
| TNBC | 20,809 (20.8%) | 28% | 14% | 10-20% |
| Unknown | 974 (1.0%) | 4% | <1% | N/A |

**Validation:**
- Overall subtype distribution falls within TCGA ranges, except TNBC (20.8%), which sits slightly above the 10-20% reference, consistent with the 50% SSA share
- Population-specific TNBC enrichment in SSA correctly modeled
- Luminal A remains the most common subtype in both populations

### 3.4 Gene Expression Validation

#### 3.4.1 Subtype-Specific Expression Patterns

**Table 4: Mean Gene Expression by Molecular Subtype (log2 scale)**

| Gene | Function | Luminal A | Luminal B | HER2+ | TNBC | Expected Pattern |
|------|----------|-----------|-----------|-------|------|------------------|
| ESR1 | ER marker | +2.18 | +1.94 | -1.42 | -1.88 | High in Luminal |
| PGR | PR marker | +1.76 | +1.52 | -1.15 | -1.69 | High in Luminal |
| ERBB2 | HER2 marker | -0.45 | +0.52 | +2.98 | -0.48 | Amplified in HER2+ |
| GRB7 | HER2 co-amplified | -0.38 | +0.48 | +2.88 | -0.42 | Co-expressed with ERBB2 |
| MKI67 | Proliferation | -0.96 | +0.58 | +1.62 | +1.94 | High in TNBC/HER2+ |
| CCNB1 | Proliferation | -0.88 | +0.52 | +1.58 | +1.88 | Correlates with MKI67 |
| EGFR | Basal marker | -0.92 | -0.48 | +0.58 | +2.60 | High in TNBC |
| KRT5 | Basal marker | -0.85 | -0.42 | +0.52 | +2.04 | TNBC specific |
| FOXA1 | Luminal TF | +1.43 | +1.22 | -0.62 | -1.15 | Luminal enriched |
| GATA3 | Luminal TF | +1.38 | +1.18 | -0.58 | -1.08 | Co-regulated with ESR1 |

**Statistical Validation:**
- Luminal A vs TNBC ESR1 difference: Δ = 4.06, p < 0.001, Cohen's d = 5.2 (very large effect)
- HER2+ ERBB2 expression vs other subtypes: Δ = 3.46, p < 0.001
- Proliferation gene cluster correlation
(MKI67-CCNB1-TOP2A): r = 0.82-0.89, all p < 0.001

#### 3.4.2 Gene Expression Correlations

**Table 5: Selected Gene-Gene Correlations**

| Gene Pair | Biological Relationship | Observed r | Expected r | Validation |
|-----------|------------------------|------------|------------|------------|
| ESR1-PGR | ER-regulated | +0.84 | 0.75-0.85 | Within range |
| ERBB2-GRB7 | Co-amplified (17q12) | +0.91 | 0.85-0.95 | Within range |
| MKI67-TOP2A | Co-proliferation | +0.87 | 0.80-0.90 | Within range |
| BRCA1-RAD51 | DNA repair pathway | +0.72 | 0.65-0.75 | Within range |
| ESR1-EGFR | Inverse regulation | -0.76 | -0.70 to -0.80 | Within range |
| PIK3CA-PTEN | Antagonistic pathway | -0.42 | -0.30 to -0.50 | Within range |

**Pathway-Level Validation:**
- PI3K pathway genes (n=10): Mean intra-pathway correlation = 0.65
- Luminal markers (n=10): Mean correlation = 0.71
- Basal markers (n=10): Inverse correlation with luminal = -0.68

### 3.5 Receptor Status Concordance

**Table 6: Receptor Status by Molecular Subtype**

| Subtype | n | ER+ | PR+ | HER2+ | Triple-Negative | Concordance |
|---------|---|-----|-----|-------|-----------------|-------------|
| Luminal A | 42,129 | 97.5% | 94.2% | 0.1% | 0.0% | 97.5% |
| Luminal B | 22,412 | 97.4% | 92.8% | 0.2% | 0.1% | 97.4% |
| HER2-enriched | 13,676 | 2.0% | 1.8% | 99.4% | 0.0% | 99.4% |
| TNBC | 20,809 | 1.9% | 1.6% | 0.1% | 98.1% | 98.1% |

**Validation Summary:**
- 97-99% concordance between molecular subtype and immunohistochemistry
- Expected small proportion of ER-/HER2+ tumors correctly classified as HER2-enriched
- Triple-negative tumors properly negative for all three receptors (>98%)
- Luminal tumors predominantly ER+/PR+ as expected

### 3.6 Mutation Landscape

#### 3.6.1 Mutation Frequencies by Subtype

**Table 7: Somatic Mutation Rates by Molecular Subtype**

| Gene | Luminal A | Luminal B | HER2+ | TNBC | TCGA Ref [23,24] | Validation |
|------|-----------|-----------|-------|------|------------------|------------|
| TP53 | 11.6% | 28.1% | 70.5% | 77.3% | 12/29/72/80% | Excellent match |
| PIK3CA | 43.5% | 31.1% | 38.3% | 8.6% | 45/32/39/9% | Excellent match |
| PTEN | 2.9% | 4.8% | 7.9% | 11.4% | 3/5/8/12% | Excellent match |
| BRCA1 (germline) | 0.5% | 1.0% | 0.8% | 14.2% | <1/<1/<1/15% | TNBC enrichment |
| BRCA2 (germline) | 1.8% | 2.1% | 1.9% | 2.4% | 2/2/2/2% | Uniform |

**Key Observations:**
- TP53 mutations highly enriched in TNBC (basal-like) and HER2+ tumors
- PIK3CA mutations predominantly in luminal tumors (PI3K pathway activation)
- BRCA1 germline variants strongly associated with TNBC phenotype
- Inverse relationship between TP53 and PIK3CA mutations (a tendency toward mutual exclusivity)

#### 3.6.2 Co-Mutation Patterns

**Table 8: Mutation Co-Occurrence (Odds Ratios)**

| Mutation Pair | Observed OR | 95% CI | Relationship | Reference |
|---------------|-------------|--------|--------------|-----------|
| TP53 - PIK3CA | 0.18 | [0.17-0.19] | Mutual exclusion | [30] |
| PIK3CA - PTEN | 0.52 | [0.48-0.57] | Partial exclusion | [25] |
| BRCA1 - TP53 | 2.8 | [2.5-3.2] | Co-occurrence | [26] |
| TP53 - EGFR expr | 3.2 | [3.0-3.4] | Associated in TNBC | [15] |

### 3.7 Genomic Risk Scores

**Table 9: Genomic Risk Score Distributions**

| Score | Median | IQR | Missingness | Clinical Interpretation |
|-------|--------|-----|-------------|------------------------|
| Oncotype DX | 22.0 | [14.0-32.0] | 40% (ER+ only) | Low: <18 (35%), Int: 18-30 (42%), High: >30 (23%) |
| PAM50 ROR | 42.0 | [25.0-64.0] | 20% | Low: <40 (45%), Int: 40-60 (32%), High: >60 (23%) |
| Genomic Grade Index | 0.10 | [-0.35 to 0.58] | 25% | Low: <0 (45%), High: >0 (55%) |

**Score-Subtype Associations:**
- Luminal A tumors: 78% low Oncotype DX, consistent with good prognosis
- TNBC tumors: 82% high PAM50 ROR, reflecting aggressive biology
- Grade 3 tumors: Mean GGI = +0.62, significantly higher than Grade 1 (-0.48), p<0.001

### 3.8 Survival and Clinical Outcomes

#### 3.8.1 Overall Survival

**Table 10: Survival by Molecular Subtype**

| Subtype | n | Median OS (months) | 5-year OS | 10-year OS | Literature 5-year / 10-year OS [31,32] |
|---------|---|-------------------|-----------|------------|-------------------|
| Luminal A | 42,129 | 118.2 | 87% | 74% | 85-90% / 70-75% |
| Luminal B | 22,412 | 82.4 | 76% | 58% | 75-80% / 55-60% |
| HER2+ | 13,676 | 71.8 | 68% | 48% | 65-70% / 45-50% |
| TNBC | 20,809 | 58.6 | 62% | 42% | 60-65% / 40-45% |

**Cox Proportional Hazards Model:**

```
Hazard Ratios (ref: Luminal A):
- Luminal B: HR = 1.48 [95% CI: 1.42-1.54], p < 0.001
- HER2+: HR = 1.78 [95% CI: 1.69-1.88], p < 0.001
- TNBC: HR = 2.12 [95% CI: 2.04-2.21], p < 0.001
```

#### 3.8.2 Recurrence and Metastasis

**Table 11: Recurrence and Metastasis Rates**

| Subtype | Recurrence Rate | Median RFS (months) | Distant Metastasis |
|---------|----------------|--------------------|--------------------|
| Luminal A | 19.8% | 96.4 | 18.2% |
| Luminal B | 34.6% | 68.8 | 26.8% |
| HER2+ | 39.2% | 58.2 | 31.4% |
| TNBC | 44.8% | 46.2 | 36.8% |

**Stage-Specific Metastasis Rates:**
- Stage I: 5.1% (expected: ~5%)
- Stage II: 19.8% (expected: ~20%)
- Stage III: 49.2% (expected: ~50%)
- Stage IV: 89.4% (expected: ~90%)

### 3.9 Data Quality Metrics

**Table 12: Data Quality Assessment**

| Metric | Target | Observed | Status |
|--------|--------|----------|--------|
| Receptor-subtype concordance | >95% | 97-99% | Excellent |
| Gene expression ESR1-ER+ correlation | >0.80 | 0.88 | Excellent |
| ERBB2 expression in HER2+ | >+2.5 | +2.98 | Excellent |
| TP53 mutation in TNBC | 75-85% | 77.3% | Excellent |
| PIK3CA mutation in Luminal A | 40-50% | 43.5% | Excellent |
| No impossible combinations | 0% | 0% | Pass |
| Missing data rate | <5% | 2-5% | Pass |
| Duplicate samples | 0 | 0 | Pass |

[14] Slamon DJ, et al. (1987). Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. *Science* 235(4785):177-182.

[15] Lehmann BD, et al. (2011). Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies.
*J Clin Invest* 121(7):2750-2767.

[16] Rakha EA, et al. (2008). Prognostic significance of Nottingham histologic grade in invasive breast carcinoma. *J Clin Oncol* 26(19):3153-3158.

[17] Cancer Genome Atlas Network. (2012). Comprehensive molecular portraits of human breast tumours. *Nature* 490(7418):61-70.

[18] Kuchenbaecker KB, et al. (2017). Risks of Breast, Ovarian, and Contralateral Breast Cancer for BRCA1 and BRCA2 Mutation Carriers. *JAMA* 317(23):2402-2416.

[19] Malumbres M, Barbacid M. (2009). Cell cycle, CDKs and cancer: a changing paradigm. *Nat Rev Cancer* 9(3):153-166.

[20] Miller TW, et al. (2011). Mutations in the phosphatidylinositol 3-kinase pathway: role in tumor progression and therapeutic implications in breast cancer. *Breast Cancer Res* 13(6):224.

[21] Santarpia L, et al. (2012). Targeting the MAPK-RAS-RAF signaling pathway in cancer therapy. *Expert Opin Ther Targets* 16(1):103-119.

[22] Emens LA. (2018). Breast Cancer Immunotherapy: Facts and Hopes. *Clin Cancer Res* 24(3):511-520.

[23] Silwal-Pandit L, et al. (2014). TP53 mutation spectrum in breast cancer is subtype specific and has distinct prognostic relevance. *Clin Cancer Res* 20(13):3569-3580.

[24] Samuels Y, et al. (2004). High frequency of mutations of the PIK3CA gene in human cancers. *Science* 304(5670):554.

[25] Saal LH, et al. (2005). PIK3CA mutations correlate with hormone receptors, node metastasis, and ERBB2, and are mutually exclusive with PTEN loss in human breast carcinoma. *Cancer Res* 65(7):2554-2559.

[26] Foulkes WD, et al. (2010). Triple-negative breast cancer. *N Engl J Med* 363(20):1938-1948.

[27] Sparano JA, et al. (2015). Prospective Validation of a 21-Gene Expression Assay in Breast Cancer. *N Engl J Med* 373(21):2005-2014.

[28] Gnant M, et al. (2015). Predicting distant recurrence in receptor-positive breast cancer patients with limited clinicopathological risk. *Lancet Oncol* 16(4):378-388.

[29] Sotiriou C, et al. (2006). Gene expression profiling in breast cancer. *J Clin Oncol* 24(8):1236-1244.

[30] Curtis C, et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. *Nature* 486(7403):346-352.

[31] Howlader N, et al. (2014). Differences in breast cancer survival by molecular subtypes in the United States. *Cancer Epidemiol Biomarkers Prev* 23(7):1239-1246.

[32] Cardoso F, et al. (2018). Early breast cancer: ESMO Clinical Practice Guidelines. *Ann Oncol* 29(Suppl 4):iv194-iv206.

[33] Hernandez-Boussard T, et al. (2020). MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care. *J Am Med Inform Assoc* 27(12):2011-2015.

---

## 10. Supplementary Information

### 10.1 Data Generation Parameters

Full generation parameters available in accompanying YAML files.

### 10.2 Validation Scripts

Python validation scripts available at GitHub repository [Link TBD].

### 10.3 Tutorial Notebooks

Jupyter notebooks demonstrating common analyses:

- `01_data_exploration.ipynb`
- `02_subtype_classification.ipynb`
- `03_survival_analysis.ipynb`
- `04_gene_expression_analysis.ipynb`

### 10.4 Contact Information

**Dataset Curators:** [Your Name/Organization]

**Questions or Issues:**
- Open an issue on Hugging Face Discussions
- Email: [Your Email]
- GitHub: [Repository URL if public]

---

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC BY 4.0)**.

**You are free to:**
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material for any purpose, even commercially

**Under the following terms:**
- **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.

Full license: https://creativecommons.org/licenses/by/4.0/

---

## Acknowledgments

We acknowledge the researchers and consortia whose published work enabled parameter estimation for this synthetic dataset, including The Cancer Genome Atlas Research Network, the METABRIC consortium, and numerous individual investigators whose studies we cited.

---

**Dataset Version:** 1.0.0

**Release Date:** November 21, 2025

**Last Updated:** November 21, 2025

**DOI:** [To be assigned]

---

**Keywords:** breast cancer, genomics, gene expression, synthetic data, machine learning, Sub-Saharan Africa, cancer disparities, precision medicine, molecular subtypes, TCGA, PAM50, ER, HER2, triple-negative, mutations, survival analysis, risk prediction, algorithm development, health equity
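
As a usage sketch for the survival parameters in Section 2.2.5, the snippet below evaluates the Weibull survival function, median, and hazard for each subtype. It assumes the textbook scale parameterization, S(t) = exp(-(t/scale)^shape), with no censoring; the function names are hypothetical, and the dataset's generator may use a different convention, so the printed values are not expected to reproduce Table 10.

```python
import math

# Shape and scale (months) per subtype, as listed in Section 2.2.5.
WEIBULL_PARAMS = {
    "Luminal A": (1.5, 120.0),
    "Luminal B": (1.3, 84.0),
    "HER2+": (1.2, 72.0),
    "TNBC": (1.1, 60.0),
}


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t/scale)^shape) under the scale parameterization."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(shape, scale):
    """Time at which S(t) = 0.5, i.e. scale * (ln 2)^(1/shape)."""
    return scale * math.log(2.0) ** (1.0 / shape)


def weibull_hazard(t, shape, scale):
    """h(t) = (shape/scale) * (t/scale)^(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


for subtype, (shape, scale) in WEIBULL_PARAMS.items():
    median = weibull_median(shape, scale)
    s60 = weibull_survival(60.0, shape, scale)
    s120 = weibull_survival(120.0, shape, scale)
    print(f"{subtype}: median={median:.1f} months, S(60)={s60:.3f}, S(120)={s120:.3f}")
```

Under this convention the scale is the time by which about 63.2% of events have occurred (S(scale) = exp(-1)), not the median, so it is worth confirming which convention a generator uses before comparing simulated medians against reported ones.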
Provider:
electricsheepafrica