
Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer

DataCite Commons. Updated 2025-05-12; indexed 2025-05-17
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/BALJNT
Resource description:
Borges, Julian (2025) "Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer - Hidden Biases in AI-Powered Genomic Subtyping of Breast Cancer" Capstone Project - Harvard Medical School (HMS) - Post Graduate Medical Education (PGME) - Global Clinical Scholars Research Training (GCSRT).

* Date: April 2025
* Purpose: To investigate whether machine learning models trained on gene expression data for breast cancer subtyping:
    - Rely on shortcut features (e.g., ER/PR/HER2, age, tumor size)
    - Misclassify high-risk patients (especially hormone-sensitive tumors)
  Expose risks of black-box AI in clinical genomics.
  Audit how AI models use features to classify molecular subtypes and detect shortcut learning.

================================================================
1-Research Title (Working):
“Auditing Shortcut Learning and Misclassification in AI-Based Genomic Subtyping of Breast Cancer”
"Hidden Biases in AI-Powered Genomic Subtyping of Breast Cancer"
----------------------------------------------------------------
Research Framework:
- Design: Retrospective modeling study
- Population: Breast cancer patients from TCGA-BRCA dataset
- Outcome: Subtype misclassification (Luminal A/B, HER2, Basal-like)
- Predictors: Gene expression + clinical features (e.g., age, tumor size, batch ID)
----------------------------------------------------------------
Hybrid Workflow (STATA + Python)

Tool              Task
STATA             Data cleaning, survival analysis, epidemiologic stratification, regression modeling
Python (Colab)    XGBoost training, prediction, SHAP feature attribution, misclassification flag
----------------------------------------------------------------
Project Objectives:
- Train and validate machine learning models to classify breast cancer molecular subtypes using gene expression data.
- Apply feature attribution techniques (e.g., SHAP) to audit the model's reliance on non-genomic features.
- Analyze misclassification patterns across clinical and genomic subgroups.
- Propose a framework to detect shortcut learning and flag high-risk predictions before they impact clinical care.
----------------------------------------------------------------
2-Data Acquisition
Primary Dataset: TCGA-BRCA (The Cancer Genome Atlas – Breast Cancer Cohort)
- Hosted on the UCSC Xena Browser: https://xenabrowser.net/datapages/?cohort=TCGA%20Breast%20Cancer%20(BRCA)
- Also available via the Genomic Data Commons (GDC) Portal: https://portal.gdc.cancer.gov/projects/TCGA-BRCA
Data includes:
- Normalized RNA-Seq gene expression
- Molecular subtypes (PAM50)
- Clinical variables: ER/PR/HER2 status, age, tumor stage, tumor size
- Batch metadata (for shortcut learning detection)
----------------------------------------------------------------
3-Data Cleaning and Integration
- Merge expression data with subtype labels.
- Merge clinical metadata (age, stage, ER/PR/HER2 status, tumor size).
- Exclude samples with missing subtype or critical metadata.
----------------------------------------------------------------
4-Model Development
Goal: Build an AI model to predict subtype → extract predictions → analyze errors.
Modeling Plan:
- Input: Top 50–100 genes by variance + clinical features
- Model types:
    - Logistic Regression (baseline)
    - Random Forest / XGBoost (advanced)
- Outcome: 4-class classifier (PAM50 subtype)
----------------------------------------------------------------
5-Misclassification & Shortcut Detection
Analyze Misclassification:
- Create prediction matrix: TP, FP, FN, TN
- Identify false negatives & positives for each subtype.
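The prediction matrix described above could be built along the following lines. This is only a minimal sketch using the standard library: the function name `one_vs_rest_counts`, the label spellings in `SUBTYPES`, and the toy labels are assumptions made for illustration and are not taken from the project's notebooks. Each subtype is treated one-vs-rest, so a Luminal A tumor predicted as Luminal B counts as a false negative for Luminal A and a false positive for Luminal B.

```python
from collections import Counter

# Four outcome classes from the protocol; the exact label strings are an assumption.
SUBTYPES = ["LumA", "LumB", "HER2", "Basal"]


def one_vs_rest_counts(y_true, y_pred, labels=SUBTYPES):
    """Return {label: {"TP", "FP", "FN", "TN"}} for multiclass predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    total = len(y_true)
    counts = {}
    for label in labels:
        tp = pairs[(label, label)]
        # Truly `label`, predicted as something else.
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        # Predicted `label`, truly something else.
        fp = sum(n for (t, p), n in pairs.items() if t != label and p == label)
        counts[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": total - tp - fp - fn}
    return counts


# Toy labels for illustration only (not TCGA-BRCA results).
y_true = ["LumA", "LumA", "LumB", "HER2", "Basal", "LumB"]
y_pred = ["LumA", "LumB", "LumB", "HER2", "Basal", "LumA"]
counts = one_vs_rest_counts(y_true, y_pred)

for label in SUBTYPES:
    print(label, counts[label])
```

Per-subtype false negatives and false positives can then be read directly from `counts[label]["FN"]` and `counts[label]["FP"]`.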
Audit Feature Importance:
- SHAP values or permutation importance
- Are non-genomic variables (e.g., age) ranked higher than gene expression?
----------------------------------------------------------------
6-Stratified Analysis
Stratify error rates by:
- ER/PR status
- Luminal A vs. others
- Age group, ancestry (if available), tumor size
Statistical tests:
- Chi-square for categorical features
- T-test or ANOVA for continuous variables
----------------------------------------------------------------
7-Internal Validation
Bootstrap or cross-validation (k=10)
Evaluate:
- Overall accuracy
- Subgroup error rates
- Calibration (Brier score, reliability curves)
----------------------------------------------------------------
8-Results & Visualization
- SHAP summary plots
- Confusion matrix heatmaps
- Error rate bar charts across subgroups
- Risk flags for shortcut-prone predictions
----------------------------------------------------------------
9-Documentation & Reproducibility
Store:
- Cleaned data (no redistribution)
- Code notebooks (.do files or Python/R scripts)
- README + data dictionary
- Visuals and tables for final report
----------------------------------------------------------------
breast-cancer-ai-misclassification/
│
├── README.md
├── LICENSE
├── .gitignore
│
├── data/
│   ├── raw/                          # Original data from TCGA/Xena (no changes)
│   ├── processed/                    # Cleaned & merged data ready for modeling
│   ├── metadata/                     # Variable dictionaries, subtype labels, cohort lists
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb        # Data cleaning and merging script
│   ├── 02_eda.ipynb                  # Exploratory data analysis
│   ├── 03_model_training.ipynb       # Model fitting and evaluation
│   ├── 04_shap_analysis.ipynb        # SHAP or feature attribution workflow
│   ├── 05_subgroup_analysis.ipynb    # Stratified misclassification analysis
│
├── scripts/
│   ├── run_model.py                  # CLI script to run model pipeline
│   ├── run_validation.py             # Script for cross-validation
│   ├── utils.py                      # Helper functions (e.g., metrics, plotting)
│
├── results/
│   ├── figures/                      # Confusion matrices, SHAP plots, histograms
│   ├── tables/                       # Model performance, FN/FP breakdowns
│   ├── outputs/                      # Exported predictions or risk scores
│
├── reports/
│   ├── capstone_summary.pdf          # Final written report
│   ├── slides_deck.pptx              # Presentation slides
│   ├── GCSRT_FINAL.docx              # Formatted protocol
│
└── environment.yml                   # Conda environment file (reproducibility)
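
As a rough illustration of the stratified analysis in step 6, the sketch below tabulates misclassification rates per subgroup and computes a Pearson chi-square statistic for the subgroup-by-outcome table. It is a standard-library sketch under stated assumptions, not the project's code: the function names `error_rates_by_group` and `chi_square_statistic`, the `ER+`/`ER-` group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(groups, y_true, y_pred):
    """Return {group: (n_errors, n_total, error_rate)}."""
    tally = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for g, t, p in zip(groups, y_true, y_pred):
        tally[g][0] += int(t != p)
        tally[g][1] += 1
    return {g: (e, n, e / n) for g, (e, n) in tally.items()}


def chi_square_statistic(rates):
    """Pearson chi-square statistic for the groups x (error, correct) table."""
    total_err = sum(e for e, _, _ in rates.values())
    total_n = sum(n for _, n, _ in rates.values())
    stat = 0.0
    for e, n, _ in rates.values():
        # Compare observed counts with counts expected under independence.
        for observed, col_total in ((e, total_err), (n - e, total_n - total_err)):
            expected = n * col_total / total_n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Toy data for illustration only (not TCGA-BRCA results).
groups = ["ER+"] * 4 + ["ER-"] * 4
y_true = ["LumA", "LumA", "LumB", "LumA", "Basal", "Basal", "HER2", "Basal"]
y_pred = ["LumA", "LumA", "LumB", "LumB", "LumB", "HER2", "HER2", "LumA"]

rates = error_rates_by_group(groups, y_true, y_pred)
stat = chi_square_statistic(rates)
print(rates, stat)
```

For this table the degrees of freedom equal the number of groups minus one. In practice a p-value would more likely come from a statistics package (for example `scipy.stats.chi2_contingency`, or the equivalent STATA command), and small expected cell counts would call for an exact test instead.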
Provider:
Harvard Dataverse
Created:
2025-04-09