
exalsius/NIH-Chest-XRay-Federated

Source: Hugging Face. Updated 2025-11-19; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/exalsius/NIH-Chest-XRay-Federated
Resource description:
---
license: mit
task_categories:
- image-classification
language:
- en
tags:
- health
- fl
- federated-learning
size_categories:
- 100K<n<1M
dataset_info:
- config_name: hospital_a
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 16765970640
    num_examples: 42093
  - name: eval
    num_bytes: 2203235963
    num_examples: 5490
  download_size: 18970003008
  dataset_size: 18969206603
- config_name: hospital_b
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8903408771
    num_examples: 21753
  - name: eval
    num_bytes: 1172415511
    num_examples: 2860
  download_size: 10076311749
  dataset_size: 10075824282
- config_name: hospital_c
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: train
    num_bytes: 8278158765
    num_examples: 20594
  - name: eval
    num_bytes: 1093044736
    num_examples: 2730
  download_size: 9371596059
  dataset_size: 9371203501
- config_name: test
  features:
  - name: image
    dtype: image
  - name: label
    list: string
  - name: Patient Age
    dtype: int32
  - name: Patient Gender
    dtype: string
  - name: View Position
    dtype: string
  - name: Patient ID
    dtype: int32
  splits:
  - name: test_A
    num_bytes: 2259295233
    num_examples: 5671
  - name: test_B
    num_bytes: 1130571999
    num_examples: 2757
  - name: test_C
    num_bytes: 1051016858
    num_examples: 2617
  - name: test_D
    num_bytes: 2198481213
    num_examples: 5539
  download_size: 6639649842
  dataset_size: 6639365303
configs:
- config_name: hospital_a
  data_files:
  - split: train
    path: hospital_a/train-*
  - split: eval
    path: hospital_a/eval-*
- config_name: hospital_b
  data_files:
  - split: train
    path: hospital_b/train-*
  - split: eval
    path: hospital_b/eval-*
- config_name: hospital_c
  data_files:
  - split: train
    path: hospital_c/train-*
  - split: eval
    path: hospital_c/eval-*
- config_name: test
  data_files:
  - split: test_A
    path: test/test_A-*
  - split: test_B
    path: test/test_B-*
  - split: test_C
    path: test/test_C-*
  - split: test_D
    path: test/test_D-*
---

# NIH Chest X-ray Federated Learning Dataset

Federated learning splits designed for the [\[Cold Start:\] Distributed AI Hack Berlin 2025](https://github.com/exalsius/hackathon-coldstart2025).

The dataset is based on the [NIH Chest X-ray14 dataset](https://huggingface.co/datasets/BahaaEldin0/NIH-Chest-Xray-14), which contains ~112,000 X-ray images from 30,805 unique patients, and models a federated learning scenario with non-IID characteristics across three hospitals, plus an out-of-distribution test set.

## Dataset Description

The data was partitioned using a scoring algorithm that creates non-IID distributions:

1. **Patient-level splitting**: Each patient appears in only one hospital/split
2. **Demographic biasing**: Age and sex distributions vary across hospitals
3. **Equipment simulation**: AP/PA view ratios differ by hospital type
4. **Pathology concentration**: Each hospital has characteristic disease patterns
5. **Train/eval/test split**: 80/10/10 split within each hospital (patient-disjoint)

See the [preparation script](https://github.com/exalsius/coldstart/blob/main/data/prepare_datasets.py) for implementation details.
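The patient-level and 80/10/10 rules above can be illustrated with a small sketch. This is not the authors' scoring algorithm (the linked preparation script is the reference for that); it only shows, assuming each record carries a `Patient ID` field as in the schema, how assigning whole patients to splits keeps the splits patient-disjoint. The function name `split_by_patient`, the seed, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole patients to train/eval/test so no patient spans two splits."""
    # Group every record under its patient.
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["Patient ID"]].append(rec)

    # Shuffle patients (not images) deterministically.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Cut the shuffled patient list at the requested fractions.
    n = len(patient_ids)
    n_train = int(fractions[0] * n)
    n_eval = int(fractions[1] * n)
    groups = {
        "train": patient_ids[:n_train],
        "eval": patient_ids[n_train:n_train + n_eval],
        "test": patient_ids[n_train + n_eval:],
    }
    return {
        name: [rec for pid in ids for rec in by_patient[pid]]
        for name, ids in groups.items()
    }


# Toy data: 20 hypothetical patients with 1 to 3 images each.
records = [
    {"Patient ID": pid, "image_index": i}
    for pid in range(20)
    for i in range(pid % 3 + 1)
]
splits = split_by_patient(records)
```

Because the cut is made over patients rather than images, the image-level proportions only approximate 80/10/10 when patients contribute different numbers of images.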
### Data Distribution

We partitioned the chest X-rays into hospital silos that reflect real-world data heterogeneity:

- **Hospital A (Portable Inpatient)**: 42,093 train, 5,490 eval
  - Demographics: Elderly males (age 60+)
  - Equipment: AP (anterior-posterior) view dominant
  - Common findings: Fluid-related conditions (Effusion, Edema, Atelectasis)
- **Hospital B (Outpatient Clinic)**: 21,753 train, 2,860 eval
  - Demographics: Younger females (age 20-65)
  - Equipment: PA (posterior-anterior) view dominant
  - Common findings: Nodules, masses, pneumothorax
- **Hospital C (Mixed with Rare Conditions)**: 20,594 train, 2,730 eval
  - Demographics: Mixed age and sex
  - Equipment: PA view preferred
  - Common findings: Rare conditions (Hernia, Fibrosis, Emphysema)

### Test Sets

The dataset includes 4 test sets:

- **test_A**: In-distribution test for Hospital A
- **test_B**: In-distribution test for Hospital B
- **test_C**: In-distribution test for Hospital C
- **test_D**: **Out-of-distribution** ICU/Critical Care data (age extremes, multi-morbidity)

All splits are **patient-disjoint** to prevent data leakage.
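To see how the hospital profiles above show up in practice, one option is to tally label and view-position shares per split. The sketch below is a minimal illustration that works on any iterable of rows with the `label` and `View Position` fields from the schema. The helper name `summarize_split` and the toy rows (including the exact label and view strings) are assumptions, not values taken from the dataset.

```python
from collections import Counter


def summarize_split(rows):
    """Return the share of rows carrying each label and each view position."""
    rows = list(rows)
    total = max(len(rows), 1)  # avoid division by zero on an empty split
    # A row can carry several labels, so label shares may sum to more than 1.
    label_counts = Counter(label for row in rows for label in row["label"])
    view_counts = Counter(row["View Position"] for row in rows)
    return {
        "num_rows": len(rows),
        "label_share": {k: v / total for k, v in label_counts.items()},
        "view_share": {k: v / total for k, v in view_counts.items()},
    }


# Toy rows standing in for a small slice of an inpatient-style split.
toy_rows = [
    {"label": ["Effusion", "Edema"], "View Position": "AP"},
    {"label": ["Effusion"], "View Position": "AP"},
    {"label": ["No Finding"], "View Position": "PA"},
    {"label": ["Atelectasis"], "View Position": "AP"},
]
summary = summarize_split(toy_rows)
```

On the real data, passing only the metadata columns (for example via `Dataset.select_columns(["label", "View Position"])`) should avoid decoding the images while iterating.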
## Usage

```python
from datasets import load_dataset

# Load Hospital A data
hospital_a = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_a")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load Hospital B
hospital_b = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_b")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load Hospital C
hospital_c = load_dataset("exalsius/NIH-Chest-XRay-Federated", "hospital_c")
# Returns: DatasetDict({'train': Dataset, 'eval': Dataset})

# Load test sets (split names match the card metadata: test_A ... test_D)
test_data = load_dataset("exalsius/NIH-Chest-XRay-Federated", "test")
# Returns: DatasetDict({'test_A': Dataset, 'test_B': Dataset, 'test_C': Dataset, 'test_D': Dataset})
```

## Original NIH Dataset

```bibtex
@article{wang2017chestxray,
  title={ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
  author={Wang, Xiaosong and Peng, Yifan and Lu, Le and Lu, Zhiyong and Bagheri, Mohammadhadi and Summers, Ronald M},
  journal={CVPR},
  year={2017}
}
```
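The original ChestX-ray14 release defines 14 finding classes, and the `label` feature here is a list of strings, so multi-label training typically needs a multi-hot target. The sketch below assumes the label strings in this dataset use the original NIH class names exactly as spelled in `CLASSES`; that spelling, the helper name `to_multi_hot`, and the handling of unknown strings are assumptions to verify against the actual data.

```python
# The 14 ChestX-ray14 finding classes; spelling in this dataset is an assumption.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]
CLASS_INDEX = {name: i for i, name in enumerate(CLASSES)}


def to_multi_hot(labels):
    """Convert a list of label strings into a 14-element multi-hot vector."""
    target = [0.0] * len(CLASSES)
    for name in labels:
        idx = CLASS_INDEX.get(name)
        if idx is not None:  # strings outside CLASSES (e.g. a "no finding" tag) are skipped
            target[idx] = 1.0
    return target


example = to_multi_hot(["Effusion", "Edema"])
```

With a loaded split, something like `hospital_a["train"].map(lambda row: {"target": to_multi_hot(row["label"])})` would attach the vector as a new column.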

Provider:
exalsius