electricsheepafrica/chest-ctscan-african-ehr

Name: electricsheepafrica/chest-ctscan-african-ehr
Creator: electricsheepafrica
Published: 2026-04-08 23:41:43
License: no description provided

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/electricsheepafrica/chest-ctscan-african-ehr

Resource description:

---
license: odbl
task_categories:
- image-classification
- tabular-classification
- image-to-text
tags:
- healthcare
- africa
- lung-cancer
- oncology
- multimodal
- medical-imaging
- ct-scan
- electronic-health-records
- synthetic-ehr
- radiology
- chest
- tuberculosis
- hiv
pretty_name: "Chest CT Scans + Synthetic African EHR (Lung Cancer)"
size_categories:
- 1K<n<10K
---

# Chest CT Scans + Synthetic African EHR (Lung Cancer)

A **multimodal lung cancer dataset** pairing 1,000 de-identified chest CT scan images with **class-conditional synthetic Electronic Health Records (EHRs) calibrated to sub-Saharan and North African populations**. Every CT image is linked to a unique synthetic patient with ~60 clinical, demographic, exposure, and laboratory fields designed to match the epidemiology of lung cancer in Africa.

> **Part of the [Electric Sheep Africa Healthcare Collection](https://huggingface.co/electricsheepafrica).**

> ⚠️ **The EHR fields are SYNTHETIC.** See [`SYNTHETIC_NOTICE.md`](./SYNTHETIC_NOTICE.md) for the full ethics and limitations statement. No real patient data were used to construct the EHR.

---

## Why this dataset exists

Most publicly available lung cancer imaging datasets are image-only, and virtually all multimodal imaging+EHR datasets are derived from North American or European cohorts. This leaves a gap:

- **African lung cancer epidemiology differs materially** from Western reference series (younger age at diagnosis, more never-smoker cancer, major role of biomass-fuel exposure and TB sequelae, non-trivial HIV burden, later stage at presentation).
- Models trained on Western data **do not automatically generalize** to African patients.
- Researchers working on African clinical AI lack paired multimodal benchmarks for prototyping, teaching, and stress-testing.

This dataset provides a **reproducible synthetic sandbox** for that gap.
The images are real and de-identified; the EHR is clearly synthetic, deterministically generated, and statistically grounded in published African epidemiology.

---

## Contents

| File | Rows | Description |
|---|---|---|
| `patients.csv` | 1000 | One row per patient; contains all EHR fields plus the image path |
| `ehr_only.csv` | 1000 | Same as above without image-path columns, for pure tabular workflows |
| `data_dictionary.csv` | ~60 | Schema, type, source (label / ground-truth / synthetic), description for every field |
| `images/{split}/{class}/<patient_id>.png` | 1000 | CT images restructured with stable IDs |
| `SYNTHETIC_NOTICE.md` | - | Ethics, intended use, and limitations |

### Splits (from the original Kaggle release)

| split | adenocarcinoma | squamous_cell_carcinoma | large_cell_carcinoma | normal | total |
|---|---:|---:|---:|---:|---:|
| train | 195 | 155 | 115 | 148 | 613 |
| valid | 23 | 15 | 21 | 13 | 72 |
| test | 120 | 90 | 51 | 54 | 315 |
| **total** | **338** | **260** | **187** | **215** | **1000** |

---

## Source provenance

| Layer | Source | License |
|---|---|---|
| CT images | [`mohamedhanyyy/chest-ctscan-images`](https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images) on Kaggle | ODbL-1.0 |
| EHR fields | **Synthetic**: generated by the included `generate_ehr.py` with seed `20260409`, using class-conditional distributions calibrated to African lung cancer epidemiology | - |
| TNM staging (train/valid) | Ground-truth labels embedded in the original Kaggle folder names (e.g. `adenocarcinoma_left.lower.lobe_T2_N0_M0_Ib`) | - |
| TNM staging (test) | Sampled from African late-presentation stage distributions | Synthetic |

---

## The Africa-specific priors

Priors differ from Western reference series in the following load-bearing ways. Sources are listed below the table.

| Aspect | Western reference | **This dataset (Africa-calibrated)** |
|---|---|---|
| Mean age at cancer diagnosis | ~68 yr | **~58–62 yr** |
| Never-smokers with adenocarcinoma | ~25% | **~40%** |
| Biomass-fuel cooking exposure | Rarely recorded | **~45–55%** of cancer patients, higher in women and rural settings |
| Past pulmonary TB | Rare | **Class-dependent prior, scaled by WHO regional TB prevalence** |
| HIV co-infection | Rare | **Southern Africa: ~18% base; other sub-regions per UNAIDS 2023** |
| Stage at presentation | Stage I–II common | **~66% present at stage IIIA/IIIB/IV** (late presentation) |
| Screening referrals | Common (LDCT programs) | **<10%** (limited screening infrastructure) |
| Baseline hemoglobin | ~13.5 g/dL | **~11–12 g/dL** (African anemia burden) |

### Country / regional distribution

Synthetic patients are assigned an African country with weights reflecting where hospital-based African lung cancer case series have historically been published:

| Country | n | Country | n |
|---|---:|---|---:|
| South Africa | 200 | Ghana | 57 |
| Nigeria | 140 | Tanzania | 56 |
| Egypt | 105 | Uganda | 45 |
| Kenya | 99 | Algeria | 36 |
| Ethiopia | 74 | Others (Cameroon, Senegal, Tunisia, Morocco, Zimbabwe, Côte d'Ivoire) | ~188 |

The sub-region field (`region`) collapses these into **North / West / East / Central / Southern Africa**, which drives the HIV and TB prevalence priors.

### Literature used to calibrate priors

- **Adeloye D et al.** An estimate of the prevalence of lung cancer in Africa. *J Glob Health* 2016;6(2):020409.
- **Koegelenberg CFN et al.** The current burden and epidemiology of lung cancer in Africa. *J Thorac Dis* 2019.
- **Mbulaiteye SM et al.** HIV/AIDS-related cancers in Africa. *Infect Agent Cancer* 2011.
- **Gordon SB et al.** Respiratory risks from household air pollution in LMICs. *Lancet Respir Med* 2014;2(10):823–60.
- **GLOBOCAN 2022 (IARC).** Africa incidence and mortality estimates.
- **Parkin DM et al.** Cancer in sub-Saharan Africa. *IARC Scientific Publications* No. 167.
- **WHO Global Tuberculosis Report 2023.**
- **UNAIDS 2023.** Adult HIV prevalence by sub-region.

---

## Schema highlights

The full schema is in `data_dictionary.csv` (60+ fields). Here are the highlights:

### Labels

- `diagnosis_class`: `adenocarcinoma` | `squamous_cell_carcinoma` | `large_cell_carcinoma` | `normal`
- `diagnosis_label`: `Lung Cancer` | `No Malignancy`
- `stage_group`, `t_stage`, `n_stage`, `m_stage`: TNM (ground-truth where encoded in source folder, synthetic otherwise)
- `tumor_location`: anatomical location

### Geography & setting

- `country`, `region`, `setting` (Urban / Peri-urban / Rural)

### Demographics & exposures

- `age`, `sex`, `bmi`
- `smoking_status`, `pack_years`, `years_since_quit`
- `biomass_fuel_exposure`, `biomass_exposure_years`
- `occupational_dust_exposure`, `occupation_high_risk`
- `past_pulmonary_tb`, `tb_treatment_completed`
- `hiv_status`, `on_antiretroviral_therapy`, `cd4_count_cells_uL`

### Comorbidities

- `copd`, `emphysema`, `hypertension`, `diabetes_type2`, `coronary_artery_disease`
- `family_hx_lung_cancer`, `prior_cancer_any`

### Symptoms

- `cough`, `hemoptysis`, `chest_pain`, `dyspnea`, `weight_loss`, `fatigue`, `symptom_duration_weeks`

### Vitals

- `sbp_mmHg`, `dbp_mmHg`, `heart_rate_bpm`, `resp_rate_bpm`, `temp_C`, `spo2_pct`

### Laboratory

- CBC: `hemoglobin_g_dL`, `wbc_10e9_L`, `platelets_10e9_L`
- Chemistry: `sodium_mmol_L`, `potassium_mmol_L`, `creatinine_mg_dL`, `calcium_mg_dL`, `albumin_g_dL`
- Inflammation: `ldh_U_L`, `crp_mg_L`, `esr_mm_hr`
- Tumor markers: `cea_ng_mL`, `cyfra21_1_ng_mL`, `nse_ng_mL`

### Pulmonary function

- `fev1_pct_predicted`, `fvc_pct_predicted`, `fev1_fvc_ratio`

### Functional status & workflow

- `ecog_performance_status` (0–4)
- `referral_reason`

### Imaging metadata

- `ct_scanner`, `slice_thickness_mm`, `contrast_used`, `kvp`

---

## Statistical sanity checks

Some class-conditional
statistics produced by the generator (directly readable from `patients.csv`):

**Mean age by class (years)**

| class | mean | std |
|---|---:|---:|
| adenocarcinoma | 57.7 | 11.0 |
| squamous_cell_carcinoma | 60.8 | 9.8 |
| large_cell_carcinoma | 58.5 | 10.0 |
| normal | 47.0 | 12.4 |

**Smoking status by class**

| class | Never | Former | Current |
|---|---:|---:|---:|
| adenocarcinoma | 0.35 | 0.43 | 0.22 |
| squamous_cell_carcinoma | 0.08 | 0.40 | 0.52 |
| large_cell_carcinoma | 0.10 | 0.36 | 0.54 |
| normal | 0.71 | 0.19 | 0.10 |

**HIV-positive fraction by region**

| region | HIV+ |
|---|---:|
| Southern Africa | 0.276 |
| Central Africa | 0.167 |
| East Africa | 0.088 |
| West Africa | 0.042 |
| North Africa | 0.000 |

**Stage at presentation (cancer cases)**

Stage III/IV combined: **~63%**, consistent with published African case series reporting that the majority of patients present at locally advanced or metastatic stage.

**Mean FEV₁ %-predicted by class**

| class | mean |
|---|---:|
| adenocarcinoma | 72.1 |
| squamous_cell_carcinoma | 57.3 |
| large_cell_carcinoma | 59.1 |
| normal | 88.7 |

---

## Usage

### Load the EHR table

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download patients.csv from the dataset repo (cached locally after the first call).
csv_path = hf_hub_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    filename="patients.csv",
    repo_type="dataset",
)
df = pd.read_csv(csv_path)
print(df.shape, df["diagnosis_class"].value_counts())
```

### Load an individual CT image

```python
from huggingface_hub import hf_hub_download
from PIL import Image

# `df` is the table loaded in the previous snippet.
row = df.iloc[0]
img_path = hf_hub_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    filename=row["image_path"],
    repo_type="dataset",
)
img = Image.open(img_path)
img.show()
```

### Snapshot the whole dataset locally

```python
from huggingface_hub import snapshot_download

# Returns the local directory that holds the downloaded files.
local = snapshot_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    repo_type="dataset",
)
```

### Minimal tabular baseline

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("patients.csv")
train = df[df["split"] == "train"]
test = df[df["split"] == "test"]

feat_cols = [
    "age", "sex", "bmi", "smoking_status", "pack_years",
    "biomass_fuel_exposure", "biomass_exposure_years",
    "past_pulmonary_tb", "hiv_status", "copd", "emphysema",
    "cough", "hemoptysis", "weight_loss",
    "hemoglobin_g_dL", "albumin_g_dL", "ldh_U_L", "crp_mg_L",
    "cea_ng_mL", "cyfra21_1_ng_mL", "nse_ng_mL",
    "fev1_pct_predicted", "ecog_performance_status",
]

# One-hot encode categoricals; the binary target is cancer (1) vs. normal (0).
X_tr = pd.get_dummies(train[feat_cols])
y_tr = (train["diagnosis_class"] != "normal").astype(int)

# Align the test columns with the training columns.
X_te = pd.get_dummies(test[feat_cols]).reindex(columns=X_tr.columns, fill_value=0)
y_te = (test["diagnosis_class"] != "normal").astype(int)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

---

## Suggested research tasks

1. **Multimodal fusion**: Compare late-fusion (image CNN + tabular MLP) vs. early-fusion vs. cross-attention architectures on the four-class task.
2. **Tabular-only baselines**: Establish how far structured features alone go (CEA/CYFRA + FEV₁ + smoking + weight loss are highly informative).
3. **Subgroup analysis**: Evaluate model calibration across regions, sexes, HIV status, smoking status.
4. **Missingness & noise studies**: Inject realistic missingness patterns to simulate lower-resource settings and measure degradation.
5. **Staging regression**: Predict AJCC stage group from image + EHR combined.
6. **Domain-shift stress tests**: Train on Western-prior synthetic data and evaluate on this Africa-prior synthetic data, or vice versa, to quantify transportability.

---

## Limitations

- **EHR is synthetic.** Models trained on these fields will learn the generator's structure, not real biology. Do not use for clinical decisions.
- **Image provenance is not African.** The CT images come from a publicly available Kaggle dataset whose original source hospitals are not Africa-specific. The Africa focus lives entirely in the synthetic EHR layer.
- **Class imbalance** across splits, especially the validation split (n=72).
- **Simplifying assumptions**: Covariances between synthetic fields are modeled through a modest set of modifiers (smoking → COPD, HIV → CD4, etc.) but do not reach the full correlation structure of real EHRs.
- **TNM for the test split is sampled**, not ground-truth; the original test folders do not encode staging.
- **One image = one patient**: Some of the original Kaggle class folders contain near-duplicate images; each is treated as a distinct synthetic patient.

---

## Reproducibility

To regenerate the EHR from scratch:

```bash
python3 generate_ehr.py
```

The seed is `20260409`. The script is deterministic and will reproduce this exact release byte-for-byte (given the same input images). The priors and the literature used to calibrate them are documented in the header comment of the generator script; they are intentionally transparent so future users can audit or modify them.
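
The byte-for-byte claim above can be checked by comparing file digests. The following is a minimal sketch, not part of the release: the helper names `sha256_of` and `same_bytes` are hypothetical, and it assumes you have both the released `patients.csv` and a regenerated copy on disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(released, regenerated):
    """True when the two files have identical contents."""
    return sha256_of(released) == sha256_of(regenerated)
```

For example, `same_bytes("release/patients.csv", "patients.csv")` should return `True` if the regenerated table matches the release; both paths here are placeholders for wherever the two copies live.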
## Citation

If you use this dataset, please cite both the original image source and this multimodal augmentation:

```
@misc{electricsheepafrica_chest_ct_african_ehr,
  title        = {Chest CT Scans + Synthetic African EHR (Lung Cancer)},
  author       = {Electric Sheep Africa},
  year         = {2026},
  howpublished = {Hugging Face Datasets},
  url          = {https://huggingface.co/datasets/electricsheepafrica/chest-ctscan-african-ehr}
}

@misc{mohamedhanyyy_chest_ctscan_images,
  title        = {Chest CT-Scan Images Dataset},
  author       = {Mohamed Hany},
  howpublished = {Kaggle},
  url          = {https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images}
}
```

---

## License

- **Images**: ODbL-1.0 (inherited from the upstream Kaggle dataset)
- **Synthetic EHR**: released under the same ODbL-1.0 terms
- **Generator script**: MIT-licensed for reuse

---

## Collection

Part of the **Electric Sheep Africa Healthcare Collection**, a curated set of clinical, imaging, and epidemiological datasets focused on African health contexts.

👉 [huggingface.co/electricsheepafrica](https://huggingface.co/electricsheepafrica)
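
The class-conditional figures in the "Statistical sanity checks" section are described as directly readable from `patients.csv`. The sketch below shows one way to recompute a per-class mean using only the standard library. The helper names `load_rows` and `mean_by_group` are hypothetical, and it assumes the `diagnosis_class` and `age` columns are present as listed in the schema highlights.

```python
import csv
from collections import defaultdict
from statistics import mean


def load_rows(csv_path):
    """Read a CSV file into a list of dicts, one per patient."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def mean_by_group(rows, group_key, value_key):
    """Average a numeric column within each group, skipping blank values."""
    buckets = defaultdict(list)
    for row in rows:
        value = row.get(value_key, "")
        if value not in ("", None):
            buckets[row[group_key]].append(float(value))
    return {group: round(mean(values), 1) for group, values in buckets.items()}


# Usage, assuming patients.csv is in the working directory:
#   rows = load_rows("patients.csv")
#   print(mean_by_group(rows, "diagnosis_class", "age"))
```

The result can be compared against the "Mean age by class" table; the same helper works for other numeric fields such as `fev1_pct_predicted`.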

Provided by:

electricsheepafrica
