electricsheepafrica/chest-ctscan-african-ehr
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/chest-ctscan-african-ehr
Resource description:
---
license: odbl
task_categories:
- image-classification
- tabular-classification
- image-to-text
tags:
- healthcare
- africa
- lung-cancer
- oncology
- multimodal
- medical-imaging
- ct-scan
- electronic-health-records
- synthetic-ehr
- radiology
- chest
- tuberculosis
- hiv
pretty_name: "Chest CT Scans + Synthetic African EHR (Lung Cancer)"
size_categories:
- 1K<n<10K
---
# Chest CT Scans + Synthetic African EHR (Lung Cancer)
A **multimodal lung cancer dataset** pairing 1,000 de-identified chest CT scan images with **class-conditional synthetic Electronic Health Records (EHRs) calibrated to sub-Saharan and North African populations**. Every CT image is linked to a unique synthetic patient with ~60 clinical, demographic, exposure, and laboratory fields designed to match the epidemiology of lung cancer in Africa.
> **Part of the [Electric Sheep Africa — Healthcare Collection](https://huggingface.co/electricsheepafrica).**
> ⚠️ **The EHR fields are SYNTHETIC.** See [`SYNTHETIC_NOTICE.md`](./SYNTHETIC_NOTICE.md) for the full ethics and limitations statement. No real patient data were used to construct the EHR.
---
## Why this dataset exists
Most publicly available lung cancer imaging datasets are image-only, and virtually all multimodal imaging+EHR datasets are derived from North American or European cohorts. This leaves a gap:
- **African lung cancer epidemiology differs materially** from Western reference series (younger age at diagnosis, a higher share of cancers in never-smokers, a major role for biomass-fuel exposure and TB sequelae, a non-trivial HIV burden, and later stage at presentation).
- Models trained on Western data **do not automatically generalize** to African patients.
- Researchers working on African clinical AI lack paired multimodal benchmarks for prototyping, teaching, and stress-testing.
This dataset provides a **reproducible synthetic sandbox** for that gap. The images are real and de-identified; the EHR is clearly synthetic, deterministically generated, and statistically grounded in published African epidemiology.
---
## Contents
| File | Rows | Description |
|---|---|---|
| `patients.csv` | 1000 | One row per patient; contains all EHR fields plus the image path |
| `ehr_only.csv` | 1000 | Same as above without image-path columns, for pure tabular workflows |
| `data_dictionary.csv` | ~60 | Schema, type, source (label / ground-truth / synthetic), description for every field |
| `images/{split}/{class}/<patient_id>.png` | 1000 | CT images restructured with stable IDs |
| `SYNTHETIC_NOTICE.md` | — | Ethics, intended use, and limitations |
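
The image path pattern in the table above encodes the split and class of every image. As a minimal sketch, assuming the class folder names match the `diagnosis_class` values and that the file stem is the patient ID, a path can be decomposed as follows. The helper name `parse_image_path` and the ID `P0001` are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath

SPLITS = {"train", "valid", "test"}
CLASSES = {
    "adenocarcinoma",
    "squamous_cell_carcinoma",
    "large_cell_carcinoma",
    "normal",
}


def parse_image_path(path: str) -> dict:
    """Split 'images/{split}/{class}/<patient_id>.png' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "images":
        raise ValueError(f"unexpected layout: {path!r}")
    _, split, cls, filename = parts
    if split not in SPLITS or cls not in CLASSES:
        raise ValueError(f"unknown split or class in: {path!r}")
    # The file stem is assumed to be the stable patient ID.
    return {
        "split": split,
        "diagnosis_class": cls,
        "patient_id": PurePosixPath(filename).stem,
    }


# "P0001" is a made-up ID; real IDs may follow a different format.
example = parse_image_path("images/train/normal/P0001.png")
print(example)
```

A check like this can catch rows whose `split` or `diagnosis_class` column disagrees with the folder the image sits in.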
### Splits (from the original Kaggle release)
| split | adenocarcinoma | squamous_cell_carcinoma | large_cell_carcinoma | normal | total |
|---|---:|---:|---:|---:|---:|
| train | 195 | 155 | 115 | 148 | 613 |
| valid | 23 | 15 | 21 | 13 | 72 |
| test | 120 | 90 | 51 | 54 | 315 |
| **total** | **338** | **260** | **187** | **215** | **1000** |
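
The counts above can be cross-checked and turned into class shares with a few lines of Python. The sketch below transcribes the table by hand rather than reading `patients.csv`; the helper `count_by_split_and_class` is a hypothetical name and assumes rows shaped like those produced by `csv.DictReader`, with `split` and `diagnosis_class` columns.

```python
from collections import Counter

# Counts transcribed from the split table in this card.
SPLIT_COUNTS = {
    "train": {"adenocarcinoma": 195, "squamous_cell_carcinoma": 155,
              "large_cell_carcinoma": 115, "normal": 148},
    "valid": {"adenocarcinoma": 23, "squamous_cell_carcinoma": 15,
              "large_cell_carcinoma": 21, "normal": 13},
    "test": {"adenocarcinoma": 120, "squamous_cell_carcinoma": 90,
             "large_cell_carcinoma": 51, "normal": 54},
}

# Row totals (per split) and column totals (per class).
row_totals = {split: sum(c.values()) for split, c in SPLIT_COUNTS.items()}
class_totals = Counter()
for counts in SPLIT_COUNTS.values():
    class_totals.update(counts)
grand_total = sum(row_totals.values())

# Share of each class in the whole dataset, useful for class weights.
class_share = {cls: n / grand_total for cls, n in class_totals.items()}


def count_by_split_and_class(rows):
    """Recount (split, class) pairs from rows read out of patients.csv."""
    return Counter((r["split"], r["diagnosis_class"]) for r in rows)


print(row_totals, dict(class_totals), grand_total)
```

The test split holds roughly a third of the data, so test metrics rest on far more images than validation metrics do.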
---
## Source provenance
| Layer | Source | License |
|---|---|---|
| CT images | [`mohamedhanyyy/chest-ctscan-images`](https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images) on Kaggle | ODbL-1.0 |
| EHR fields | **Synthetic**: generated by the included `generate_ehr.py` with seed `20260409`, using class-conditional distributions calibrated to African lung cancer epidemiology | ODbL-1.0 (same terms as the images) |
| TNM staging (train/valid) | Ground-truth labels embedded in the original Kaggle folder names (e.g. `adenocarcinoma_left.lower.lobe_T2_N0_M0_Ib`) | — |
| TNM staging (test) | **Synthetic**: sampled from African late-presentation stage distributions | ODbL-1.0 (part of the synthetic EHR) |
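
To illustrate how staging can be recovered from a source folder name such as the one quoted above, here is a minimal sketch. It is not the parsing code from `generate_ehr.py`, which is not reproduced in this card; the function name `parse_staging_folder` and the returned key `histology` are hypothetical. It assumes the last four underscore-separated tokens are T, N, M, and stage group, and that the location uses dots between words.

```python
def parse_staging_folder(name: str):
    """Parse '<histology>_<location>_T?_N?_M?_<stage>'; return None if no staging."""
    parts = name.rsplit("_", 4)
    if len(parts) != 5:
        return None  # e.g. a plain 'normal' folder with no staging suffix
    prefix, t_stage, n_stage, m_stage, stage_group = parts
    if not (t_stage.startswith("T") and n_stage.startswith("N")
            and m_stage.startswith("M")):
        return None
    # The location is assumed to be the last underscore-separated token of the prefix.
    histology, _, location = prefix.rpartition("_")
    if not histology:
        histology, location = prefix, ""
    return {
        "histology": histology,
        "tumor_location": location.replace(".", " "),
        "t_stage": t_stage,
        "n_stage": n_stage,
        "m_stage": m_stage,
        "stage_group": stage_group,  # kept exactly as written in the folder name
    }


parsed = parse_staging_folder("adenocarcinoma_left.lower.lobe_T2_N0_M0_Ib")
print(parsed)
```

Values are returned as they appear in the folder name; any normalization applied in `patients.csv` (for example the casing of the stage group) may differ.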
---
## The Africa-specific priors
Priors differ from Western reference series in the following load-bearing ways. Sources are listed under *Literature used to calibrate priors*, after the country distribution.
| Aspect | Western reference | **This dataset (Africa-calibrated)** |
|---|---|---|
| Mean age at cancer diagnosis | ~68 yr | **~58–62 yr** |
| Never-smokers with adenocarcinoma | ~25% | **~40%** |
| Biomass-fuel cooking exposure | Rarely recorded | **~45–55%** of cancer patients, higher in women and rural settings |
| Past pulmonary TB | Rare | **Class-dependent prior, scaled by WHO regional TB prevalence** |
| HIV co-infection | Rare | **Southern Africa: ~18% base; other sub-regions per UNAIDS 2023** |
| Stage at presentation | Stage I–II common | **~66% present at stage IIIA/IIIB/IV** (late presentation) |
| Screening referrals | Common (LDCT programs) | **<10%** (limited screening infrastructure) |
| Baseline hemoglobin | ~13.5 g/dL | **~11–12 g/dL** (African anemia burden) |
### Country / regional distribution
Synthetic patients are assigned an African country with weights reflecting where hospital-based African lung cancer case series have historically been published:
| Country | n | Country | n |
|---|---:|---|---:|
| South Africa | 200 | Ghana | 57 |
| Nigeria | 140 | Tanzania | 56 |
| Egypt | 105 | Uganda | 45 |
| Kenya | 99 | Algeria | 36 |
| Ethiopia | 74 | Others (Cameroon, Senegal, Tunisia, Morocco, Zimbabwe, Côte d'Ivoire) | ~188 |
The sub-region field (`region`) collapses these countries into **North / West / East / Central / Southern Africa**; the region in turn drives the HIV and TB prevalence priors.
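
The weighted country assignment described above can be sketched with a seeded random generator. This is an illustration only, not the logic of `generate_ehr.py`: it uses the named-country counts from the table as weights, leaves out the "Others" bucket, and will not reproduce the released assignments. The country-to-region mapping is an assumption based on common geographic groupings, and the function name `sample_countries` is hypothetical.

```python
import random

SEED = 20260409  # seed quoted in this card; how the generator consumes it is not shown here

# Weights taken from the named countries in the table (the "Others" bucket is omitted).
COUNTRY_WEIGHTS = {
    "South Africa": 200, "Nigeria": 140, "Egypt": 105, "Kenya": 99,
    "Ethiopia": 74, "Ghana": 57, "Tanzania": 56, "Uganda": 45, "Algeria": 36,
}

# Assumed mapping to sub-regions; the dataset's own mapping may differ.
REGION = {
    "South Africa": "Southern Africa",
    "Nigeria": "West Africa", "Ghana": "West Africa",
    "Egypt": "North Africa", "Algeria": "North Africa",
    "Kenya": "East Africa", "Ethiopia": "East Africa",
    "Tanzania": "East Africa", "Uganda": "East Africa",
}


def sample_countries(n: int, seed: int = SEED) -> list:
    """Draw n countries with probability proportional to their weight."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    names = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


countries = sample_countries(10)
regions = [REGION[c] for c in countries]
print(list(zip(countries, regions)))
```

Because the generator is seeded locally, calling `sample_countries` twice with the same arguments returns the same list, which is the property the card relies on for reproducibility.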
### Literature used to calibrate priors
- **Adeloye D et al.** An estimate of the prevalence of lung cancer in Africa. *J Glob Health* 2016;6(2):020409.
- **Koegelenberg CFN et al.** The current burden and epidemiology of lung cancer in Africa. *J Thorac Dis* 2019.
- **Mbulaiteye SM et al.** HIV/AIDS-related cancers in Africa. *Infect Agent Cancer* 2011.
- **Gordon SB et al.** Respiratory risks from household air pollution in LMICs. *Lancet Respir Med* 2014;2(10):823–60.
- **GLOBOCAN 2022 (IARC)** — Africa incidence & mortality estimates.
- **Parkin DM et al.** Cancer in sub-Saharan Africa. *IARC Scientific Publications* No. 167.
- **WHO Global Tuberculosis Report 2023.**
- **UNAIDS 2023** — adult HIV prevalence by sub-region.
---
## Schema highlights
The full schema is in `data_dictionary.csv` (60+ fields). Here are the highlights:
### Labels
- `diagnosis_class` — `adenocarcinoma` | `squamous_cell_carcinoma` | `large_cell_carcinoma` | `normal`
- `diagnosis_label` — `Lung Cancer` | `No Malignancy`
- `stage_group`, `t_stage`, `n_stage`, `m_stage` — TNM (ground-truth where encoded in source folder, synthetic otherwise)
- `tumor_location` — anatomical location
### Geography & setting
- `country`, `region`, `setting` (Urban / Peri-urban / Rural)
### Demographics & exposures
- `age`, `sex`, `bmi`
- `smoking_status`, `pack_years`, `years_since_quit`
- `biomass_fuel_exposure`, `biomass_exposure_years`
- `occupational_dust_exposure`, `occupation_high_risk`
- `past_pulmonary_tb`, `tb_treatment_completed`
- `hiv_status`, `on_antiretroviral_therapy`, `cd4_count_cells_uL`
### Comorbidities
- `copd`, `emphysema`, `hypertension`, `diabetes_type2`, `coronary_artery_disease`
- `family_hx_lung_cancer`, `prior_cancer_any`
### Symptoms
- `cough`, `hemoptysis`, `chest_pain`, `dyspnea`, `weight_loss`, `fatigue`, `symptom_duration_weeks`
### Vitals
- `sbp_mmHg`, `dbp_mmHg`, `heart_rate_bpm`, `resp_rate_bpm`, `temp_C`, `spo2_pct`
### Laboratory
- CBC: `hemoglobin_g_dL`, `wbc_10e9_L`, `platelets_10e9_L`
- Chemistry: `sodium_mmol_L`, `potassium_mmol_L`, `creatinine_mg_dL`, `calcium_mg_dL`, `albumin_g_dL`
- Inflammation: `ldh_U_L`, `crp_mg_L`, `esr_mm_hr`
- Tumor markers: `cea_ng_mL`, `cyfra21_1_ng_mL`, `nse_ng_mL`
### Pulmonary function
- `fev1_pct_predicted`, `fvc_pct_predicted`, `fev1_fvc_ratio`
### Functional status & workflow
- `ecog_performance_status` (0–4)
- `referral_reason`
### Imaging metadata
- `ct_scanner`, `slice_thickness_mm`, `contrast_used`, `kvp`
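
A few of the fields above have constraints that are easy to check after loading. The sketch below assumes that the three cancer classes map to `Lung Cancer` and `normal` maps to `No Malignancy`, and that `ecog_performance_status` holds integers from 0 to 4, as listed above. The function names are hypothetical, and rows are assumed to be dictionaries such as those from `csv.DictReader`.

```python
# Assumed relationship between the two label fields.
CLASS_TO_LABEL = {
    "adenocarcinoma": "Lung Cancer",
    "squamous_cell_carcinoma": "Lung Cancer",
    "large_cell_carcinoma": "Lung Cancer",
    "normal": "No Malignancy",
}


def find_label_mismatches(rows):
    """Return (index, class, label) for rows whose two label fields disagree."""
    bad = []
    for i, row in enumerate(rows):
        cls = row.get("diagnosis_class")
        label = row.get("diagnosis_label")
        if CLASS_TO_LABEL.get(cls) != label:
            bad.append((i, cls, label))
    return bad


def ecog_out_of_range(rows):
    """Return indices of rows whose ECOG status is not one of 0, 1, 2, 3, 4."""
    bad = []
    for i, row in enumerate(rows):
        try:
            value = float(row.get("ecog_performance_status"))
        except (TypeError, ValueError):
            bad.append(i)  # missing or non-numeric value
            continue
        if value not in (0, 1, 2, 3, 4):
            bad.append(i)
    return bad


toy_rows = [
    {"diagnosis_class": "normal", "diagnosis_label": "No Malignancy",
     "ecog_performance_status": "0"},
    {"diagnosis_class": "adenocarcinoma", "diagnosis_label": "No Malignancy",
     "ecog_performance_status": "5"},
]
print(find_label_mismatches(toy_rows), ecog_out_of_range(toy_rows))
```

On the released files both checks would be expected to return empty lists; a non-empty result points at a loading or preprocessing problem.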
---
## Statistical sanity checks
Selected class-conditional statistics produced by the generator; each can be recomputed directly from `patients.csv`:
**Mean age by class (years)**
| class | mean | std |
|---|---:|---:|
| adenocarcinoma | 57.7 | 11.0 |
| squamous_cell_carcinoma | 60.8 | 9.8 |
| large_cell_carcinoma | 58.5 | 10.0 |
| normal | 47.0 | 12.4 |
**Smoking status by class**
| class | Never | Former | Current |
|---|---:|---:|---:|
| adenocarcinoma | 0.35 | 0.43 | 0.22 |
| squamous_cell_carcinoma | 0.08 | 0.40 | 0.52 |
| large_cell_carcinoma | 0.10 | 0.36 | 0.54 |
| normal | 0.71 | 0.19 | 0.10 |
**HIV-positive fraction by region**
| region | HIV+ |
|---|---:|
| Southern Africa | 0.276 |
| Central Africa | 0.167 |
| East Africa | 0.088 |
| West Africa | 0.042 |
| North Africa | 0.000 |
**Stage at presentation (cancer cases)**
Stage III/IV combined: **~63%**, close to the ~66% late-presentation prior listed above and consistent with published African case series reporting that most patients present at a locally advanced or metastatic stage.
**Mean FEV₁ %-predicted by class**
| class | mean |
|---|---:|
| adenocarcinoma | 72.1 |
| squamous_cell_carcinoma | 57.3 |
| large_cell_carcinoma | 59.1 |
| normal | 88.7 |
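
The tables above can be recomputed from `patients.csv`. The following sketch uses only the standard library and works on rows read with `csv.DictReader`; the function names are hypothetical. It reports the sample standard deviation, which is also the pandas default, but the card does not state which convention the tables use.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def mean_age_by_class(rows):
    """Return {class: (mean age, sample std)} from rows with 'age' as text."""
    ages = defaultdict(list)
    for row in rows:
        ages[row["diagnosis_class"]].append(float(row["age"]))
    return {
        cls: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for cls, values in ages.items()
    }


def smoking_share_by_class(rows):
    """Return {class: {smoking_status: fraction of that class}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["diagnosis_class"]][row["smoking_status"]] += 1
    return {
        cls: {status: n / sum(c.values()) for status, n in c.items()}
        for cls, c in counts.items()
    }


# Toy rows for illustration; these are not taken from the dataset.
toy_rows = [
    {"diagnosis_class": "normal", "age": "50", "smoking_status": "Never"},
    {"diagnosis_class": "normal", "age": "60", "smoking_status": "Former"},
    {"diagnosis_class": "adenocarcinoma", "age": "70", "smoking_status": "Current"},
]
print(mean_age_by_class(toy_rows))
print(smoking_share_by_class(toy_rows))
```

With pandas, `df.groupby("diagnosis_class")["age"].agg(["mean", "std"])` and `pd.crosstab(df["diagnosis_class"], df["smoking_status"], normalize="index")` give the same kind of summary.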
---
## Usage
### Load the EHR table
```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download (and cache) the patient table from the dataset repo.
csv_path = hf_hub_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    filename="patients.csv",
    repo_type="dataset",
)
df = pd.read_csv(csv_path)
print(df.shape)
print(df["diagnosis_class"].value_counts())
```
### Load an individual CT image
```python
from huggingface_hub import hf_hub_download
from PIL import Image

# `df` is the patient table loaded in the previous snippet.
row = df.iloc[0]

# `image_path` is used as the file's path inside the dataset repo.
img_path = hf_hub_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    filename=row["image_path"],
    repo_type="dataset",
)
img = Image.open(img_path)
img.show()
```
### Snapshot the whole dataset locally
```python
from huggingface_hub import snapshot_download

# Download every file in the dataset repo; `local` is the snapshot directory.
local = snapshot_download(
    repo_id="electricsheepafrica/chest-ctscan-african-ehr",
    repo_type="dataset",
)
```
### Minimal tabular baseline
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("patients.csv")
train = df[df["split"] == "train"]
test = df[df["split"] == "test"]

feat_cols = [
    "age", "sex", "bmi", "smoking_status", "pack_years",
    "biomass_fuel_exposure", "biomass_exposure_years",
    "past_pulmonary_tb", "hiv_status",
    "copd", "emphysema", "cough", "hemoptysis", "weight_loss",
    "hemoglobin_g_dL", "albumin_g_dL", "ldh_U_L", "crp_mg_L",
    "cea_ng_mL", "cyfra21_1_ng_mL", "nse_ng_mL",
    "fev1_pct_predicted", "ecog_performance_status",
]

# One-hot encode categorical features and align the test columns to the training columns.
# If any feature contains missing values, impute them first or use
# HistGradientBoostingClassifier, which handles NaN natively.
X_tr = pd.get_dummies(train[feat_cols])
y_tr = (train["diagnosis_class"] != "normal").astype(int)  # binary target: cancer vs. normal
X_te = pd.get_dummies(test[feat_cols]).reindex(columns=X_tr.columns, fill_value=0)
y_te = (test["diagnosis_class"] != "normal").astype(int)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```
---
## Suggested research tasks
1. **Multimodal fusion**: Compare late-fusion (image CNN + tabular MLP) vs. early-fusion vs. cross-attention architectures on the four-class task.
2. **Tabular-only baselines**: Establish how far structured features alone go (CEA/CYFRA + FEV₁ + smoking + weight loss are highly informative).
3. **Subgroup analysis**: Evaluate model calibration across regions, sexes, HIV status, smoking status.
4. **Missingness & noise studies**: Inject realistic missingness patterns to simulate lower-resource settings and measure degradation.
5. **Stage prediction**: Predict AJCC stage group from image + EHR combined, for example as ordinal classification or regression over the ordered stage groups.
6. **Domain-shift stress tests**: Train on Western-prior synthetic data and evaluate on this Africa-prior synthetic data, or vice versa, to quantify transportability.
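
For the missingness study in task 4, one simple starting point is to blank fields at random with a seeded generator, optionally scaling the rate by `setting` so that, for example, rural records lose more values. This is a sketch under stated assumptions: the function name, the `setting_multiplier` parameter, and the rates shown are hypothetical, and field-independent random blanking is only a first approximation of real-world missingness.

```python
import random


def inject_missingness(rows, rates, setting_multiplier=None, seed=0):
    """Return copies of rows where each listed field is blanked with its probability.

    rates: {field name: base probability of being set to None}
    setting_multiplier: optional {setting value: factor applied to every rate}
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    multiplier = setting_multiplier or {}
    out = []
    for row in rows:
        scale = multiplier.get(row.get("setting"), 1.0)
        new_row = dict(row)  # copy, so the input rows are left untouched
        for field, rate in rates.items():
            if field in new_row and rng.random() < min(1.0, rate * scale):
                new_row[field] = None
        out.append(new_row)
    return out


# Toy rows for illustration; these are not taken from the dataset.
toy_rows = [
    {"setting": "Rural", "cea_ng_mL": "3.1", "fev1_pct_predicted": "61"},
    {"setting": "Urban", "cea_ng_mL": "2.0", "fev1_pct_predicted": "85"},
]
# Illustrative rates only: tumor markers missing half the time, doubled in rural rows.
degraded = inject_missingness(
    toy_rows, {"cea_ng_mL": 0.5}, setting_multiplier={"Rural": 2.0}, seed=1
)
print(degraded)
```

Sweeping the base rate from 0 upward and re-evaluating a fixed model gives a degradation curve for each field or field group.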
---
## Limitations
- **EHR is synthetic.** Models trained on these fields will learn the generator's structure, not real biology. Do not use for clinical decisions.
- **Image provenance is not African.** The CT images come from a publicly available Kaggle dataset whose original source hospitals are not Africa-specific. The Africa focus lives entirely in the synthetic EHR layer.
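- **Class imbalance and small validation split**: Class proportions differ across splits, and the validation split is small (n=72, with 13 to 23 images per class), so per-class validation metrics are noisy.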
- **Simplifying assumptions**: Covariances between synthetic fields are modeled through a modest set of modifiers (smoking → COPD, HIV → CD4, etc.) but do not reach the full correlation structure of real EHRs.
- **TNM for the test split is sampled**, not ground-truth; the original test folders do not encode staging.
- **One image = one patient**: Some of the original Kaggle class folders contain near-duplicate images; each is treated as a distinct synthetic patient.
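
Because duplicated images can leak between splits, it may be worth flagging them before training. The sketch below groups files by SHA-256 digest, which finds byte-identical copies only; near-duplicates (resized or re-encoded images) would need a perceptual hash or similar technique, which is not shown. The function names are hypothetical, and the split is read from the second path component of the `images/{split}/{class}/...` layout.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """files: {path: bytes}. Return sorted path lists that share a SHA-256 digest."""
    by_digest = defaultdict(list)
    for path, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


def crosses_splits(paths):
    """True if the group contains images from more than one split folder."""
    return len({p.split("/")[1] for p in paths}) > 1


# Toy in-memory files for illustration; file names and contents are made up.
toy_files = {
    "images/train/normal/a.png": b"same-bytes",
    "images/test/normal/b.png": b"same-bytes",
    "images/train/normal/c.png": b"other-bytes",
}
groups = group_exact_duplicates(toy_files)
print(groups, [crosses_splits(g) for g in groups])
```

Groups that cross splits are the ones that can inflate test metrics; one option is to keep a single member of each such group in the training split.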
---
## Reproducibility
To regenerate the EHR from scratch:
```bash
python3 generate_ehr.py
```
The seed is `20260409`. The script is deterministic and will reproduce this exact release byte-for-byte (given the same input images).
The priors and the literature used to calibrate them are documented in the header comment of the generator script; they are intentionally transparent so that future users can audit or modify them.
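
One way to confirm that a regenerated file matches the released one byte-for-byte is to compare checksums. This is a minimal sketch; the function names are hypothetical, and the file paths in the usage comment are placeholders for wherever the released and regenerated copies are stored.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Example usage (placeholder paths):
# same_bytes("released/patients.csv", "regenerated/patients.csv")
```

A mismatch does not say where the files differ; comparing the two tables column by column is the natural next step in that case.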
---
## Citation
If you use this dataset, please cite both the original image source and this multimodal augmentation:
```bibtex
@misc{electricsheepafrica_chest_ct_african_ehr,
  title = {Chest CT Scans + Synthetic African EHR (Lung Cancer)},
  author = {Electric Sheep Africa},
  year = {2026},
  howpublished = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/electricsheepafrica/chest-ctscan-african-ehr}
}
@misc{mohamedhanyyy_chest_ctscan_images,
  title = {Chest CT-Scan Images Dataset},
  author = {Mohamed Hany},
  howpublished = {Kaggle},
  url = {https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images}
}
```
---
## License
- **Images**: ODbL-1.0 (inherited from the upstream Kaggle dataset)
- **Synthetic EHR**: released under the same ODbL-1.0 terms
- **Generator script**: MIT-licensed for reuse
---
## Collection
Part of the **Electric Sheep Africa Healthcare Collection** — a curated set of clinical, imaging, and epidemiological datasets focused on African health contexts.
👉 [huggingface.co/electricsheepafrica](https://huggingface.co/electricsheepafrica)
Provider:
electricsheepafrica



