Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer
Source: DataCite Commons (updated 2025-05-12; indexed 2025-05-17)
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/BALJNT
Description:
Borges, Julian (2025) "Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer - Hidden Biases in AI-Powered Genomic Subtyping of Breast Cancer" Capstone Project - Harvard Medical School (HMS) - Post Graduate Medical Education (PGME) - Global Clinical Scholars Research Training (GCSRT).
* Date: April 2025
* Purpose: To investigate whether machine learning models trained on gene expression data for breast cancer subtyping:
  - Rely on shortcut features (e.g., ER/PR/HER2, age, tumor size)
  - Misclassify high-risk patients (especially hormone-sensitive tumors)
* Aims: To expose the risks of black-box AI in clinical genomics, and to audit how AI models use features to classify molecular subtypes so that shortcut learning can be detected.
================================================================
1-Research Title (Working):
"Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer"
"Hidden Biases in AI-Powered Genomic Subtyping of Breast Cancer"
---------------------------------------------------------------------------------------------------
Research Framework:
Design: Retrospective modeling study
Population: Breast cancer patients from TCGA-BRCA dataset
Outcome: Subtype misclassification (Luminal A/B, HER2, Basal-like)
Predictors: Gene expression + clinical features (e.g., age, tumor size, batch ID)
---------------------------------------------------------------------------------------------------
Hybrid Workflow (STATA + Python)
Tool             Task
STATA            Data cleaning, survival analysis, epidemiologic stratification, regression modeling
Python (Colab)   XGBoost training, prediction, SHAP feature attribution, misclassification flag
---------------------------------------------------------------------------------------------------
Project Objectives:
- Train and validate machine learning models to classify breast cancer molecular subtypes using gene expression data.
- Apply feature attribution techniques (e.g., SHAP) to audit the model's reliance on non-genomic features.
- Analyze misclassification patterns across clinical and genomic subgroups.
- Propose a framework to detect shortcut learning and flag high-risk predictions before they impact clinical care.
---------------------------------------------------------------------------------------------------
2-Data Acquisition
Primary Dataset:
TCGA-BRCA (The Cancer Genome Atlas – Breast Cancer Cohort)
Hosted on the UCSC Xena Browser:
https://xenabrowser.net/datapages/?cohort=TCGA%20Breast%20Cancer%20(BRCA)
Also available via the Genomic Data Commons (GDC) Portal:
https://portal.gdc.cancer.gov/projects/TCGA-BRCA
Data includes:
Normalized RNA-Seq gene expression
Molecular subtypes (PAM50)
Clinical variables: ER/PR/HER2 status, age, tumor stage, tumor size
Batch metadata (for shortcut learning detection)
---------------------------------------------------------------------------------------------------
3-Data Cleaning and Integration
Merge expression data with subtype labels.
Merge clinical metadata (age, stage, ER/PR/HER2 status, tumor size).
Exclude samples with missing subtype or critical metadata.
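The three cleaning steps above can be sketched with pandas. This is a minimal illustration on toy data, not the project's actual cleaning script: the sample IDs, the column names (`subtype`, `age`, `er_status`), and the helper `merge_cohort` are assumptions made for the example.

```python
import pandas as pd

def merge_cohort(expr, subtypes, clinical, required=("subtype", "age", "er_status")):
    """Inner-join expression, subtype labels, and clinical metadata on sample ID."""
    merged = expr.join(subtypes, how="inner").join(clinical, how="inner")
    # Exclude samples missing the subtype label or any critical metadata field.
    return merged.dropna(subset=list(required))

# Toy frames indexed by a hypothetical sample ID (not real TCGA barcodes).
ids = ["S1", "S2", "S3"]
expr = pd.DataFrame({"ESR1": [5.1, 0.2, 4.8], "ERBB2": [1.0, 6.3, 0.9]}, index=ids)
subtypes = pd.DataFrame({"subtype": ["LumA", "HER2", None]}, index=ids)
clinical = pd.DataFrame(
    {"age": [54, 61, 47], "er_status": ["Positive", "Negative", "Positive"]},
    index=ids,
)

cohort = merge_cohort(expr, subtypes, clinical)
print(cohort)
```

Sample S3 is dropped because its subtype label is missing; which fields count as "critical" is a choice the analyst would make through the `required` argument.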
---------------------------------------------------------------------------------------------------
4-Model Development
Goal:
Build an AI model to predict subtype → extract predictions → analyze errors.
Modeling Plan:
Input: Top 50–100 genes by variance + clinical features
Model types:
Logistic Regression (baseline)
Random Forest / XGBoost (advanced)
Outcome: 4-class classifier (PAM50 subtype)
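A minimal sketch of the baseline branch of this plan, assuming scikit-learn is available: select the highest-variance genes, append a clinical feature, and fit a multiclass logistic regression. The data here are synthetic, and the gene names, the `age` column, and the helper `top_variance_genes` are illustrative assumptions rather than the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def top_variance_genes(expr, n_genes=50):
    """Return the names of the n_genes columns with the largest variance."""
    return expr.var(axis=0).nlargest(n_genes).index.tolist()

# Synthetic stand-in for the cohort: 200 samples, 20 genes, 4 subtype classes.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 4, size=n)
expr = pd.DataFrame(rng.normal(size=(n, 20)), columns=[f"gene_{i}" for i in range(20)])
for k in range(4):
    # Make gene_k informative by raising its expression in class k.
    expr[f"gene_{k}"] += 3.0 * (labels == k)
clinical = pd.DataFrame({"age": rng.normal(55, 10, size=n)})

genes = top_variance_genes(expr, n_genes=5)
X = pd.concat([expr[genes], clinical], axis=1)

# Baseline model: standardized features followed by multiclass logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, labels)
pred = model.predict(X)  # training-set predictions only; use held-out data in practice
```

The same feature matrix could be passed to a Random Forest or XGBoost classifier for the advanced models; that step is omitted here.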
---------------------------------------------------------------------------------------------------
5-Misclassification & Shortcut Detection
Analyze Misclassification:
Create a confusion matrix and derive per-subtype TP, FP, FN, and TN counts (one-vs-rest)
Identify false negatives & positives for each subtype.
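For a four-class problem, TP/FP/FN/TN are defined per subtype by treating that subtype as "positive" and all others as "negative". The sketch below shows one way to compute them with NumPy; the label encoding and the toy predictions are assumptions for illustration.

```python
import numpy as np

SUBTYPES = ["LumA", "LumB", "HER2", "Basal"]  # assumed integer encoding 0..3

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def one_vs_rest_counts(cm):
    """Per-class TP, FP, FN, TN derived from a multiclass confusion matrix."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class but truly another class
    fn = cm.sum(axis=1) - tp  # truly the class but predicted as another class
    tn = cm.sum() - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Toy labels and predictions (made up for the example).
y_true = [0, 0, 1, 1, 2, 3, 3, 1]
y_pred = [0, 1, 1, 0, 2, 3, 2, 1]

cm = confusion_matrix(y_true, y_pred, len(SUBTYPES))
counts = one_vs_rest_counts(cm)
for i, name in enumerate(SUBTYPES):
    print(name, {key: int(values[i]) for key, values in counts.items()})
```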
Audit Feature Importance:
SHAP values or permutation importance
Are non-genomic variables (e.g., age) ranked higher than gene expression?
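One way to pose the question above is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below is a hand-rolled version that works with any prediction function. The two-column toy data and the deliberately "shortcut" stand-in model that looks only at age are assumptions for illustration, not results from the study.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - np.mean(predict(X_perm) == y)
    return drops / n_repeats

# Toy data: column 0 is a gene expression value, column 1 is age (hypothetical).
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=300), rng.normal(55, 10, size=300)])
y = (X[:, 1] > 55).astype(int)

def shortcut_model(data):
    """Stand-in classifier that ignores expression and thresholds age."""
    return (data[:, 1] > 55).astype(int)

feature_names = ["gene_expr", "age"]
importance = permutation_importance(shortcut_model, X, y)
ranked = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(dict(zip(feature_names, importance.round(3))), ranked)
```

A non-genomic variable ranking above every gene, as `age` does in this contrived case, is the kind of pattern the audit is meant to surface. SHAP values would give a per-sample view of the same question.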
---------------------------------------------------------------------------------------------------
6-Stratified Analysis
Stratify error rates by:
ER/PR status
Luminal A vs. others
Age group, ancestry (if available), tumor size
Statistical tests:
Chi-square for categorical features
T-test or ANOVA for continuous variables
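As an illustration of the categorical case, the sketch below runs a Pearson chi-square test on a 2x2 table of misclassification by ER status. It covers only the 2x2 situation (one degree of freedom, no continuity correction); the counts are invented, and the helper `chi_square_2x2` is an assumption for the example. A statistics package would normally be used for larger tables and for the t-test or ANOVA.

```python
import math
import numpy as np

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    obs = np.asarray(table, dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    stat = float(((obs - expected) ** 2 / expected).sum())
    # With 1 degree of freedom, P(chi2 > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Rows: ER-positive, ER-negative. Columns: misclassified, correctly classified.
# These counts are made up for the example.
table = [[30, 170], [10, 190]]
stat, p = chi_square_2x2(table)
print(f"chi-square = {stat:.3f}, p = {p:.5f}")
```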
---------------------------------------------------------------------------------------------------
7-Internal Validation
Bootstrap resampling or k-fold cross-validation (k=10)
Evaluate:
Overall accuracy
Subgroup error rates
Calibration (Brier score, reliability curves)
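The calibration metrics can be sketched as follows. The multiclass Brier score used here is the mean, over samples, of the squared distance between the predicted probability vector and the one-hot true label (range 0 to 2). The reliability table bins predictions by top-class confidence and compares mean confidence with observed accuracy. The probabilities are toy values and the function names are assumptions for the example.

```python
import numpy as np

def multiclass_brier(probs, y_true):
    """Mean squared distance between probability vectors and one-hot labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(y_true)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def reliability_bins(confidence, correct, n_bins=5):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(confidence[mask].mean()),
                         float(correct[mask].mean()),
                         int(mask.sum())))
    return rows

# Toy predicted probabilities over 4 subtypes for 3 samples (made up).
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.0, 1.0, 0.0, 0.0]])
y_true = np.array([0, 2, 1])

brier = multiclass_brier(probs, y_true)
bins = reliability_bins(probs.max(axis=1), probs.argmax(axis=1) == y_true)
print(round(brier, 4), bins)
```

Plotting accuracy against mean confidence per bin gives the reliability curve; a well-calibrated model lies close to the diagonal.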
---------------------------------------------------------------------------------------------------
8-Results & Visualization
SHAP summary plots
Confusion matrix heatmaps
Error rate bar charts across subgroups
Risk flags for shortcut-prone predictions
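One possible form for the risk flags, offered as an assumption rather than the project's actual rule: for each prediction, compute the share of total absolute feature attribution (for example, SHAP values) that comes from non-genomic features, and flag the prediction when that share exceeds a threshold. The 0.5 threshold, the attribution values, and the function name `shortcut_flags` are all hypothetical.

```python
import numpy as np

def shortcut_flags(attributions, is_clinical, threshold=0.5):
    """Flag samples whose attribution mass is dominated by clinical features.

    attributions: (n_samples, n_features) array of per-sample attributions.
    is_clinical:  boolean mask marking the non-genomic feature columns.
    """
    abs_attr = np.abs(np.asarray(attributions, dtype=float))
    total = abs_attr.sum(axis=1)
    clinical = abs_attr[:, np.asarray(is_clinical, dtype=bool)].sum(axis=1)
    # Guard against division by zero for samples with no attribution at all.
    share = np.divide(clinical, total, out=np.zeros_like(total), where=total > 0)
    return share, share > threshold

# Toy attributions for 3 samples over [gene_a, gene_b, age] (made up).
attr = np.array([[0.8, -0.1, 0.1],
                 [0.1, 0.2, -0.7],
                 [0.0, 0.0, 0.0]])
is_clinical = [False, False, True]

share, flagged = shortcut_flags(attr, is_clinical)
print(share.round(2), flagged)
```

Here only the second sample is flagged, because age accounts for most of its attribution. A suitable threshold would need to be chosen and validated against the observed misclassification patterns.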
---------------------------------------------------------------------------------------------------
9-Documentation & Reproducibility
Store:
Cleaned data (no redistribution)
Code notebooks (.do files or Python/R scripts)
README + data dictionary
Visuals and tables for final report
-------------------------------------------
breast-cancer-ai-misclassification/
│
├── README.md
├── LICENSE
├── .gitignore
│
├── data/
│ ├── raw/ # Original data from TCGA/Xena (no changes)
│ ├── processed/ # Cleaned & merged data ready for modeling
│ ├── metadata/ # Variable dictionaries, subtype labels, cohort lists
│
├── notebooks/
│ ├── 01_data_cleaning.ipynb # Data cleaning and merging script
│ ├── 02_eda.ipynb # Exploratory data analysis
│ ├── 03_model_training.ipynb # Model fitting and evaluation
│ ├── 04_shap_analysis.ipynb # SHAP or feature attribution workflow
│ ├── 05_subgroup_analysis.ipynb # Stratified misclassification analysis
│
├── scripts/
│ ├── run_model.py # CLI script to run model pipeline
│ ├── run_validation.py # Script for cross-validation
│ ├── utils.py # Helper functions (e.g., metrics, plotting)
│
├── results/
│ ├── figures/ # Confusion matrices, SHAP plots, histograms
│ ├── tables/ # Model performance, FN/FP breakdowns
│ ├── outputs/ # Exported predictions or risk scores
│
├── reports/
│ ├── capstone_summary.pdf # Final written report
│ ├── slides_deck.pptx # Presentation slides
│ ├── GCSRT_FINAL.docx # Formatted protocol
│
└── environment.yml # Conda environment file (reproducibility)
Provider:
Harvard Dataverse
Created:
2025-04-09



