exalsius/NIH-Chest-XRay-Federated
Source: Hugging Face. Updated 2025-11-19; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/exalsius/NIH-Chest-XRay-Federated
Dataset card:
---
license: mit
task_categories:
- image-classification
language:
- en
tags:
- health
- fl
- federated-learning
size_categories:
- 100K<n<1M
dataset_info:
- config_name: hospital_a
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 16765970640
    num_examples: 42093
  - name: eval
    num_bytes: 2203235963
    num_examples: 5490
  download_size: 18970003008
  dataset_size: 18969206603
- config_name: hospital_b
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8903408771
    num_examples: 21753
  - name: eval
    num_bytes: 1172415511
    num_examples: 2860
  download_size: 10076311749
  dataset_size: 10075824282
- config_name: hospital_c
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8278158765
    num_examples: 20594
  - name: eval
    num_bytes: 1093044736
    num_examples: 2730
  download_size: 9371596059
  dataset_size: 9371203501
- config_name: test
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: test_A
    num_bytes: 2259295233
    num_examples: 5671
  - name: test_B
    num_bytes: 1130571999
    num_examples: 2757
  - name: test_C
    num_bytes: 1051016858
    num_examples: 2617
  - name: test_D
    num_bytes: 2198481213
    num_examples: 5539
  download_size: 6639649842
  dataset_size: 6639365303
configs:
- config_name: hospital_a
  data_files:
  - split: train
    path: hospital_a/train-*
  - split: eval
    path: hospital_a/eval-*
- config_name: hospital_b
  data_files:
  - split: train
    path: hospital_b/train-*
  - split: eval
    path: hospital_b/eval-*
- config_name: hospital_c
  data_files:
  - split: train
    path: hospital_c/train-*
  - split: eval
    path: hospital_c/eval-*
- config_name: test
  data_files:
  - split: test_A
    path: test/test_A-*
  - split: test_B
    path: test/test_B-*
  - split: test_C
    path: test/test_C-*
  - split: test_D
    path: test/test_D-*
---
# NIH Chest X-ray Federated Learning Dataset
Federated learning splits designed for the [\[Cold Start:\] Distributed AI Hack Berlin 2025](https://github.com/exalsius/hackathon-coldstart2025).
The dataset is based on the [NIH Chest X-ray14 dataset](https://huggingface.co/datasets/BahaaEldin0/NIH-Chest-Xray-14), which contains ~112,000 X-ray images from 30,805 unique patients, and models a federated learning scenario with non-IID characteristics across three hospitals, plus an out-of-distribution test set.
## Dataset Description
The data was partitioned using a scoring algorithm that creates non-IID distributions:
1. **Patient-level splitting**: Each patient appears in only one hospital/split
2. **Demographic biasing**: Age and sex distributions vary across hospitals
3. **Equipment simulation**: AP/PA view ratios differ by hospital type
4. **Pathology concentration**: Each hospital has characteristic disease patterns
5. **Train/eval/test split**: 80/10/10 split within each hospital (patient-disjoint)
See the [preparation script](https://github.com/exalsius/coldstart/blob/main/data/prepare_datasets.py) for implementation details.
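The partitioning rules above can be illustrated with a toy sketch. This is not the preparation script's algorithm: the `Patient` record, the score weights, and the function names are all hypothetical, chosen only to show how per-patient scores can bias hospital assignment while keeping every patient in exactly one hospital and one split.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: int
    age: int
    gender: str  # "M" or "F"
    view: str    # dominant view position, "AP" or "PA"


def hospital_scores(p: Patient) -> dict[str, float]:
    """Hypothetical affinity scores; the weights are illustrative only."""
    return {
        # Hospital A favors elderly, male, AP-view patients.
        "hospital_a": 1.0 + 2.0 * (p.age >= 60) + 1.0 * (p.gender == "M") + 2.0 * (p.view == "AP"),
        # Hospital B favors younger, female, PA-view patients.
        "hospital_b": 1.0 + 2.0 * (20 <= p.age <= 65) + 1.0 * (p.gender == "F") + 2.0 * (p.view == "PA"),
        # Hospital C is mixed, with a mild PA preference.
        "hospital_c": 2.0 + 1.0 * (p.view == "PA"),
    }


def assign_hospital(p: Patient, rng: random.Random) -> str:
    """Sample one hospital per patient with probability proportional to its score."""
    scores = hospital_scores(p)
    names = list(scores)
    return rng.choices(names, weights=[scores[n] for n in names], k=1)[0]


def assign_split(rng: random.Random) -> str:
    """Send a whole patient to train/eval/test with 80/10/10 odds."""
    return rng.choices(["train", "eval", "test"], weights=[80, 10, 10], k=1)[0]


def partition(patients, seed=0):
    """Map each patient ID to a single (hospital, split) pair."""
    rng = random.Random(seed)
    return {p.patient_id: (assign_hospital(p, rng), assign_split(rng)) for p in patients}
```

Because the assignment is made per patient rather than per image, all images of a patient follow that patient into the same hospital and split, which is what makes the result patient-disjoint.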
### Data Distribution
We partitioned the chest X-rays into hospital silos that reflect real-world data heterogeneity:
- **Hospital A (Portable Inpatient)**: 42,093 train, 5,490 eval
  - Demographics: Elderly males (age 60+)
  - Equipment: AP (anterior-posterior) view dominant
  - Common findings: Fluid-related conditions (Effusion, Edema, Atelectasis)
- **Hospital B (Outpatient Clinic)**: 21,753 train, 2,860 eval
  - Demographics: Younger females (age 20-65)
  - Equipment: PA (posterior-anterior) view dominant
  - Common findings: Nodules, masses, pneumothorax
- **Hospital C (Mixed with Rare Conditions)**: 20,594 train, 2,730 eval
  - Demographics: Mixed age and sex
  - Equipment: PA view preferred
  - Common findings: Rare conditions (Hernia, Fibrosis, Emphysema)
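As a quick sanity check on these sizes, the sketch below computes each hospital's image-level train/eval/test fractions and its share of all training images. The train and eval counts are the ones listed above; the test counts are the `test_A`, `test_B`, and `test_C` sizes from the card metadata. The function names are illustrative.

```python
# Image counts taken from this card (test = in-distribution test split).
COUNTS = {
    "hospital_a": {"train": 42093, "eval": 5490, "test": 5671},
    "hospital_b": {"train": 21753, "eval": 2860, "test": 2757},
    "hospital_c": {"train": 20594, "eval": 2730, "test": 2617},
}


def split_fractions(counts):
    """Fraction of one hospital's images that falls into each split."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def train_shares(all_counts):
    """Each hospital's share of the combined training images."""
    total_train = sum(c["train"] for c in all_counts.values())
    return {h: c["train"] / total_train for h, c in all_counts.items()}


if __name__ == "__main__":
    for hospital, counts in COUNTS.items():
        fractions = split_fractions(counts)
        print(hospital, {k: round(v, 3) for k, v in fractions.items()})
    print({h: round(s, 3) for h, s in train_shares(COUNTS).items()})
```

The image-level fractions come out at roughly 79/10/11 rather than exactly 80/10/10, presumably because the split is made per patient and patients contribute different numbers of images. Hospital A holds about half of all training images, which is worth keeping in mind when weighting client updates by sample count.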
### Test Sets
The `test` config provides four test splits:
- **test_A**: In-distribution test for Hospital A
- **test_B**: In-distribution test for Hospital B
- **test_C**: In-distribution test for Hospital C
- **test_D**: **Out-of-distribution** ICU/Critical Care data (age extremes, multi-morbidity)
All splits are **patient-disjoint** to prevent data leakage.
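The patient-disjoint property can be checked from the `Patient ID` column alone. The helper below is a minimal sketch (the function name and the toy IDs are illustrative); it takes a mapping from split name to an iterable of patient IDs and reports every pair of splits that shares a patient.

```python
from itertools import combinations


def overlapping_patients(split_ids):
    """Return {(split_x, split_y): shared_ids} for each pair of splits sharing a patient."""
    id_sets = {name: set(ids) for name, ids in split_ids.items()}
    overlaps = {}
    for (name_x, ids_x), (name_y, ids_y) in combinations(id_sets.items(), 2):
        shared = ids_x & ids_y
        if shared:
            overlaps[(name_x, name_y)] = shared
    return overlaps


if __name__ == "__main__":
    # Toy IDs for illustration; patient 3 leaks between two splits.
    toy = {"a/train": [1, 2, 3], "a/eval": [4, 5], "test_A": [3, 6]}
    print(overlapping_patients(toy))
```

With the real data, the mapping would be built from each split's `Patient ID` column (for example `hospital_a["train"]["Patient ID"]` after loading the config), and an empty result would confirm that no patient appears in more than one split.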
## Usage
```python
from datasets import load_dataset
# Load Hospital A data
hospital_a = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_a")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})
# Load Hospital B
hospital_b = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_b")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})
# Load Hospital C
hospital_c = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_c")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})
# Load test sets
test_data = load_dataset("exalsius/NIH-Chest-XRay-Federated", "test")
# Returns: DatasetDict({'test_A': Dataset, 'test_B': Dataset, 'test_C': Dataset, 'test_D': Dataset})
```
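Since `label` is a list of strings, multi-label training usually needs a fixed-length target vector. The sketch below shows one way to build it. It assumes the label strings match the 14 finding names of the original NIH ChestX-ray14 release and that anything else (such as `No Finding`) maps to an all-zero vector; the exact strings in this dataset should be checked before relying on it.

```python
# Assumed class names, following the original NIH ChestX-ray14 release.
# Verify against the actual values of the `label` column before use.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def to_multi_hot(labels, classes=CLASSES):
    """Convert a list of finding names into a 0/1 vector over `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    vector = [0] * len(classes)
    for name in labels:
        if name in index:  # unknown names (e.g. "No Finding") are skipped
            vector[index[name]] = 1
    return vector


if __name__ == "__main__":
    print(to_multi_hot(["Effusion", "Edema"]))
    print(to_multi_hot(["No Finding"]))
```

The function works on plain Python lists, so it can be applied per example, for instance inside a `Dataset.map` callback that reads `example["label"]`.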
## Original NIH Dataset
```bibtex
@article{wang2017chestxray,
  title={ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on
         Weakly-Supervised Classification and Localization of Common Thorax Diseases},
  author={Wang, Xiaosong and Peng, Yifan and Lu, Le and Lu, Zhiyong and
          Bagheri, Mohammadhadi and Summers, Ronald M},
  journal={CVPR},
  year={2017}
}
```
Provider: exalsius



