exalsius/NIH-Chest-XRay-Federated

Name: exalsius/NIH-Chest-XRay-Federated
Creator: exalsius
Published: 2025-11-19 17:03:23
License: MIT

Source: Hugging Face. Updated 2025-11-19; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/exalsius/NIH-Chest-XRay-Federated

Resource description:

---
license: mit
task_categories:
- image-classification
language:
- en
tags:
- health
- fl
- federated-learning
size_categories:
- 100K<n<1M
dataset_info:
- config_name: hospital_a
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 16765970640
    num_examples: 42093
  - name: eval
    num_bytes: 2203235963
    num_examples: 5490
  download_size: 18970003008
  dataset_size: 18969206603
- config_name: hospital_b
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8903408771
    num_examples: 21753
  - name: eval
    num_bytes: 1172415511
    num_examples: 2860
  download_size: 10076311749
  dataset_size: 10075824282
- config_name: hospital_c
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8278158765
    num_examples: 20594
  - name: eval
    num_bytes: 1093044736
    num_examples: 2730
  download_size: 9371596059
  dataset_size: 9371203501
- config_name: test
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: test_A
    num_bytes: 2259295233
    num_examples: 5671
  - name: test_B
    num_bytes: 1130571999
    num_examples: 2757
  - name: test_C
    num_bytes: 1051016858
    num_examples: 2617
  - name: test_D
    num_bytes: 2198481213
    num_examples: 5539
  download_size: 6639649842
  dataset_size: 6639365303
configs:
- config_name: hospital_a
  data_files:
  - split: train
    path: hospital_a/train-*
  - split: eval
    path: hospital_a/eval-*
- config_name: hospital_b
  data_files:
  - split: train
    path: hospital_b/train-*
  - split: eval
    path: hospital_b/eval-*
- config_name: hospital_c
  data_files:
  - split: train
    path: hospital_c/train-*
  - split: eval
    path: hospital_c/eval-*
- config_name: test
  data_files:
  - split: test_A
    path: test/test_A-*
  - split: test_B
    path: test/test_B-*
  - split: test_C
    path: test/test_C-*
  - split: test_D
    path: test/test_D-*
---

# NIH Chest X-ray Federated Learning Dataset

Federated learning splits designed for the [\[Cold Start:\] Distributed AI Hack Berlin 2025](https://github.com/exalsius/hackathon-coldstart2025).

The dataset is based on the [NIH Chest X-ray14 dataset](https://huggingface.co/datasets/BahaaEldin0/NIH-Chest-Xray-14), which contains ~112,000 X-ray images from 30,805 unique patients, and models a federated learning scenario with non-IID characteristics across three hospitals, plus an out-of-distribution test set.

## Dataset Description

The data was partitioned using a scoring algorithm that creates non-IID distributions:

1. **Patient-level splitting**: Each patient appears in only one hospital/split
2. **Demographic biasing**: Age and sex distributions vary across hospitals
3. **Equipment simulation**: AP/PA view ratios differ by hospital type
4. **Pathology concentration**: Each hospital has characteristic disease patterns
5. **Train/eval/test split**: 80/10/10 split within each hospital (patient-disjoint)

See the [preparation script](https://github.com/exalsius/coldstart/blob/main/data/prepare_datasets.py) for implementation details.

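The patient-disjoint 80/10/10 rule can be illustrated with a minimal sketch. This is not the preparation script's implementation: the function name `split_patients`, the seed, and the toy records are assumptions made for illustration, and the sketch leaves out the scoring-based hospital assignment entirely. It only shows why splitting over patient IDs, rather than over images, keeps every patient inside a single split.

```python
import random
from collections import defaultdict


def split_patients(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole patients to train/eval/test so no patient spans two splits.

    `records` is an iterable of dicts with a "Patient ID" key (a hypothetical
    toy format that mirrors the dataset's column name).
    """
    # Group all images belonging to the same patient.
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["Patient ID"]].append(rec)

    # Shuffle patient IDs (not images) deterministically.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_train = int(len(patient_ids) * fractions[0])
    n_eval = int(len(patient_ids) * fractions[1])
    groups = {
        "train": patient_ids[:n_train],
        "eval": patient_ids[n_train:n_train + n_eval],
        "test": patient_ids[n_train + n_eval:],
    }
    # Expand each group of patients back into its image records.
    return {
        name: [rec for pid in ids for rec in by_patient[pid]]
        for name, ids in groups.items()
    }


# Toy data: 20 patients with 1 to 3 images each.
toy = [
    {"Patient ID": pid, "image_idx": i}
    for pid in range(20)
    for i in range(pid % 3 + 1)
]
splits = split_patients(toy)
ids = {name: {r["Patient ID"] for r in recs} for name, recs in splits.items()}
```

Because the ratio is applied to patients and patients contribute different numbers of images, the image-level proportions only approximate 80/10/10.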
### Data Distribution

We partitioned the chest X-rays into hospital silos that reflect real-world data heterogeneity:

- **Hospital A (Portable Inpatient)**: 42,093 train, 5,490 eval
  - Demographics: Elderly males (age 60+)
  - Equipment: AP (anterior-posterior) view dominant
  - Common findings: Fluid-related conditions (Effusion, Edema, Atelectasis)
- **Hospital B (Outpatient Clinic)**: 21,753 train, 2,860 eval
  - Demographics: Younger females (age 20-65)
  - Equipment: PA (posterior-anterior) view dominant
  - Common findings: Nodules, masses, pneumothorax
- **Hospital C (Mixed with Rare Conditions)**: 20,594 train, 2,730 eval
  - Demographics: Mixed age and sex
  - Equipment: PA view preferred
  - Common findings: Rare conditions (Hernia, Fibrosis, Emphysema)

### Test Sets

The dataset includes 4 test sets:

- **test_A**: In-distribution test for Hospital A
- **test_B**: In-distribution test for Hospital B
- **test_C**: In-distribution test for Hospital C
- **test_D**: **Out-of-distribution** ICU/Critical Care data (age extremes, multi-morbidity)

All splits are **patient-disjoint** to prevent data leakage.

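As a quick sanity check on the sizes listed above, the following sketch tallies the example counts from the dataset card metadata. The variable names are illustrative assumptions, and no data is downloaded. It shows the per-hospital train/eval/test shares and each hospital's share of the pooled training data, which could serve as sample-count weights when aggregating client updates.

```python
# Example counts copied from the dataset card metadata.
counts = {
    "hospital_a": {"train": 42093, "eval": 5490, "test": 5671},
    "hospital_b": {"train": 21753, "eval": 2860, "test": 2757},
    "hospital_c": {"train": 20594, "eval": 2730, "test": 2617},
}
ood_test = 5539  # test_D, the out-of-distribution split

# Total images per hospital and across the whole dataset.
site_totals = {site: sum(parts.values()) for site, parts in counts.items()}
grand_total = sum(site_totals.values()) + ood_test

# Share of the pooled training data held by each hospital.
train_total = sum(parts["train"] for parts in counts.values())
train_share = {site: parts["train"] / train_total for site, parts in counts.items()}

for site, parts in counts.items():
    shares = {name: round(100 * n / site_totals[site], 1) for name, n in parts.items()}
    print(site, site_totals[site], shares, f"train share: {train_share[site]:.3f}")
print("all images:", grand_total)
```

The splits sum to 112,104 images, consistent with the ~112,000 figure quoted for the source dataset, and Hospital A alone holds roughly half of the training examples.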
## Usage

```python
from datasets import load_dataset

# Load Hospital A data
hospital_a = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_a")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load Hospital B
hospital_b = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_b")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load Hospital C
hospital_c = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_c")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load test sets (split names follow the metadata: test_A ... test_D)
test_data = load_dataset("exalsius/NIH-Chest-XRay-Federated", "test")
# Returns: DatasetDict({'test_A': Dataset, 'test_B': Dataset, 'test_C': Dataset, 'test_D': Dataset})
```

## Original NIH Dataset

```bibtex
@article{wang2017chestxray,
  title={ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
  author={Wang, Xiaosong and Peng, Yifan and Lu, Le and Lu, Zhiyong and Bagheri, Mohammadhadi and Summers, Ronald M},
  journal={CVPR},
  year={2017}
}
```

