alkzar90/NIH-Chest-X-ray-dataset
Hugging Face | Updated 2022-11-22 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/alkzar90/NIH-Chest-X-ray-dataset
Resource description:
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: NIH-CXR14
paperswithcode_id: chestx-ray14
size_categories:
- 100K<n<1M
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
---
# Dataset Card for NIH Chest X-ray dataset
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [License and attribution](#license-and-attribution)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345)
- **Repository:**
- **Paper:** [ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases](https://arxiv.org/abs/1705.02315)
- **Leaderboard:**
- **Point of Contact:** rms@nih.gov
### Dataset Summary
_ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with the text-mined fourteen disease image labels (where each image can have multi-labels), mined from the associated radiological reports using natural language processing. Fourteen common thoracic pathologies include Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, which is an extension of the 8 common disease patterns listed in our CVPR2017 paper. Note that original radiology reports (associated with these chest x-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. Please find more details and benchmark performance of trained models based on 14 disease labels in our arXiv paper: [1705.02315](https://arxiv.org/abs/1705.02315)_

## Dataset Structure
### Data Instances
A sample from the training set is provided below:
```
{'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/95db46f21d556880cf0ecb11d45d5ba0b58fcb113c9a0fff2234eba8f74fe22a/images/00000798_022.png',
'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1024x1024 at 0x7F2151B144D0>,
'labels': [9, 3]}
```
### Data Fields
The data instances have the following fields:
- `image_file_path`: a `str` with the image path.
- `image`: A `PIL.Image.Image` object containing the image. Note that when accessing the image column: `dataset[0]["image"]` the image file is automatically decoded. Decoding of a large number of image files might take a significant amount of time. Thus it is important to first query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `labels`: a list of `int` class IDs, one per finding on the image (the task is multi-label, so an image can carry several).
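The access-order advice for the `image` field can be illustrated with a toy model. The sketch below is not the `datasets` library's implementation: `ToyImageDataset`, its decode counter, and the file paths are hypothetical stand-ins, assuming only that an image is decoded each time it is read.

```python
class ToyImageDataset:
    """Hypothetical stand-in for a dataset that decodes images on access."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0  # counts how many images were decoded

    def _decode(self, path):
        # A real dataset would open and decode the PNG here.
        self.decode_calls += 1
        return f"<decoded {path}>"

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row-first access: decode only the requested row.
            path = self._paths[key]
            return {"image_file_path": path, "image": self._decode(path)}
        if key == "image":
            # Column-first access: decode every row of the column.
            return [self._decode(p) for p in self._paths]
        if key == "image_file_path":
            return list(self._paths)
        raise KeyError(key)


# Hypothetical file names, used only to populate the toy dataset.
dataset = ToyImageDataset(f"images/{i:08d}_000.png" for i in range(1000))

_ = dataset[0]["image"]
row_first_calls = dataset.decode_calls

dataset.decode_calls = 0
_ = dataset["image"][0]
column_first_calls = dataset.decode_calls

print(row_first_calls, column_first_calls)
```

Under this assumption, row-first access decodes a single image while column-first access decodes the whole column before the index is applied.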
<details>
<summary>Class Label Mappings</summary>
```json
{
"No Finding": 0,
"Atelectasis": 1,
"Cardiomegaly": 2,
"Effusion": 3,
"Infiltration": 4,
"Mass": 5,
"Nodule": 6,
"Pneumonia": 7,
"Pneumothorax": 8,
"Consolidation": 9,
"Edema": 10,
"Emphysema": 11,
"Fibrosis": 12,
"Pleural_Thickening": 13,
"Hernia": 14
}
```
</details>
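As a minimal sketch of how the mapping can be used, the snippet below decodes the `labels` list from the sample instance into names and builds a multi-hot vector, a common target format for multi-label training. The helper names `decode_labels` and `to_multi_hot` are illustrative and not part of the dataset.

```python
# Mapping copied from the "Class Label Mappings" block of this card.
LABEL2ID = {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14,
}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def decode_labels(label_ids):
    """Turn a list of integer class IDs into label names."""
    return [ID2LABEL[i] for i in label_ids]


def to_multi_hot(label_ids, num_classes=len(LABEL2ID)):
    """Encode class IDs as a 0/1 vector with one slot per class."""
    vector = [0] * num_classes
    for i in label_ids:
        vector[i] = 1
    return vector


sample_labels = [9, 3]  # `labels` value from the sample instance in this card
names = decode_labels(sample_labels)
multi_hot = to_multi_hot(sample_labels)
print(names)
print(multi_hot)
```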
**Label distribution on the dataset:**
| labels | obs | freq |
|:-------------------|------:|-----------:|
| No Finding | 60361 | 0.426468 |
| Infiltration | 19894 | 0.140557 |
| Effusion | 13317 | 0.0940885 |
| Atelectasis | 11559 | 0.0816677 |
| Nodule | 6331 | 0.0447304 |
| Mass | 5782 | 0.0408515 |
| Pneumothorax | 5302 | 0.0374602 |
| Consolidation | 4667 | 0.0329737 |
| Pleural_Thickening | 3385 | 0.023916 |
| Cardiomegaly | 2776 | 0.0196132 |
| Emphysema | 2516 | 0.0177763 |
| Edema | 2303 | 0.0162714 |
| Fibrosis | 1686 | 0.0119121 |
| Pneumonia | 1431 | 0.0101104 |
| Hernia | 227 | 0.00160382 |
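The `freq` column appears to be each label's share of all label occurrences rather than its share of images. The sketch below recomputes it from the `obs` column under that reading, and contrasts it with a per-image share, assuming each `obs` value counts the images that carry the label and using the 112,120 image total from the dataset summary.

```python
# Observation counts copied from the label distribution table.
OBS = {
    "No Finding": 60361,
    "Infiltration": 19894,
    "Effusion": 13317,
    "Atelectasis": 11559,
    "Nodule": 6331,
    "Mass": 5782,
    "Pneumothorax": 5302,
    "Consolidation": 4667,
    "Pleural_Thickening": 3385,
    "Cardiomegaly": 2776,
    "Emphysema": 2516,
    "Edema": 2303,
    "Fibrosis": 1686,
    "Pneumonia": 1431,
    "Hernia": 227,
}
TOTAL_IMAGES = 112_120  # image count stated in the dataset summary

# Images can carry several labels, so this total exceeds the image count.
total_labels = sum(OBS.values())

# Share of label occurrences: the reading that reproduces the `freq` column.
freq = {name: n / total_labels for name, n in OBS.items()}

# Share of images carrying the label (assumption: obs counts images).
share_of_images = {name: n / TOTAL_IMAGES for name, n in OBS.items()}

print(total_labels)
print(round(freq["No Finding"], 6))
print(round(share_of_images["No Finding"], 4))
```

The two denominators answer different questions: `freq` sums to 1 across labels, while the per-image shares do not, because one image can contribute to several rows.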
### Data Splits
| |train| test|
|-------------|----:|----:|
|# of examples|86524|25596|
**Label distribution by dataset split:**
| labels | Train obs | Train freq | Test obs | Test freq |
|:-------------------|-------------------:|--------------------:|------------------:|-------------------:|
| No Finding | 50500 | 0.483392 | 9861 | 0.266032 |
| Infiltration | 13782 | 0.131923 | 6112 | 0.164891 |
| Effusion | 8659 | 0.082885 | 4658 | 0.125664 |
| Atelectasis | 8280 | 0.0792572 | 3279 | 0.0884614 |
| Nodule | 4708 | 0.0450656 | 1623 | 0.0437856 |
| Mass | 4034 | 0.038614 | 1748 | 0.0471578 |
| Consolidation | 2852 | 0.0272997 | 1815 | 0.0489654 |
| Pneumothorax | 2637 | 0.0252417 | 2665 | 0.0718968 |
| Pleural_Thickening | 2242 | 0.0214607 | 1143 | 0.0308361 |
| Cardiomegaly | 1707 | 0.0163396 | 1069 | 0.0288397 |
| Emphysema | 1423 | 0.0136211 | 1093 | 0.0294871 |
| Edema | 1378 | 0.0131904 | 925 | 0.0249548 |
| Fibrosis | 1251 | 0.0119747 | 435 | 0.0117355 |
| Pneumonia | 876 | 0.00838518 | 555 | 0.0149729 |
| Hernia | 141 | 0.00134967 | 86 | 0.00232012 |
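The two splits differ noticeably in label mix. As a rough sketch, the snippet below recomputes the per-split frequencies from the `obs` columns and ranks labels by the ratio of test frequency to train frequency. The ratio is only an illustrative measure of shift, not something defined by the dataset authors.

```python
# Observation counts copied from the per-split label distribution table.
TRAIN_OBS = {
    "No Finding": 50500, "Infiltration": 13782, "Effusion": 8659,
    "Atelectasis": 8280, "Nodule": 4708, "Mass": 4034,
    "Consolidation": 2852, "Pneumothorax": 2637, "Pleural_Thickening": 2242,
    "Cardiomegaly": 1707, "Emphysema": 1423, "Edema": 1378,
    "Fibrosis": 1251, "Pneumonia": 876, "Hernia": 141,
}
TEST_OBS = {
    "No Finding": 9861, "Infiltration": 6112, "Effusion": 4658,
    "Atelectasis": 3279, "Nodule": 1623, "Mass": 1748,
    "Consolidation": 1815, "Pneumothorax": 2665, "Pleural_Thickening": 1143,
    "Cardiomegaly": 1069, "Emphysema": 1093, "Edema": 925,
    "Fibrosis": 435, "Pneumonia": 555, "Hernia": 86,
}


def frequencies(obs):
    """Return each label's share of all label occurrences in one split."""
    total = sum(obs.values())
    return {name: n / total for name, n in obs.items()}


train_freq = frequencies(TRAIN_OBS)
test_freq = frequencies(TEST_OBS)

# Ratio > 1 means the label is relatively more common in the test split.
ratio = {name: test_freq[name] / train_freq[name] for name in TRAIN_OBS}
most_shifted = max(ratio, key=ratio.get)
least_shifted = min(ratio, key=ratio.get)

print(most_shifted, round(ratio[most_shifted], 2))
print(least_shifted, round(ratio[least_shifted], 2))
```

A ratio far from 1 suggests that metrics on the test split may not mirror training-time class balance, which is worth keeping in mind when choosing thresholds or class weights.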
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### License and attribution
There are no restrictions on the use of the NIH chest x-ray images. However, the dataset has the following attribution requirements:
- Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
- Include a citation to the CVPR 2017 paper (see Citation information section)
- Acknowledge that the NIH Clinical Center is the data provider
### Citation Information
```
@inproceedings{Wang_2017,
doi = {10.1109/cvpr.2017.369},
url = {https://doi.org/10.1109%2Fcvpr.2017.369},
year = 2017,
month = {jul},
publisher = {{IEEE}},
author = {Xiaosong Wang and Yifan Peng and Le Lu and Zhiyong Lu and Mohammadhadi Bagheri and Ronald M. Summers},
title = {{ChestX}-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})}
}
```
### Contributions
Thanks to [@alcazar90](https://github.com/alcazar90) for adding this dataset.
Provider:
alkzar90
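Summary of original information
Dataset overview
Basic dataset information
- Dataset name: NIH-CXR14
- Alias: ChestX-ray14
- Language: English (en)
- License: unknown
- Multilinguality: monolingual
- Size: 100K<n<1M
- Task category: image classification
- Task ID: multi-class image classification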
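Dataset contents
Dataset summary
- Contents: 112,120 frontal-view X-ray images from 30,805 unique patients.
- Labels: 14 disease image labels extracted from the associated radiology reports using natural language processing; each image may carry multiple labels.
- Disease types: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass, and Hernia.
- Label accuracy: expected to exceed 90%.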
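Data structure
Data instances
- Example: {image_file_path: /path/to/image.png, image: <PIL.Image.Image>, labels: [label_id]}
Data fields
- image_file_path: path to the image file, a string.
- image: the image object, a PIL.Image.Image.
- labels: classification labels, a list of integer class IDs.
Data splits
- Training set: 86,524 examples
- Test set: 25,596 examples
Label distribution
- Overall distribution: observation counts and frequencies are listed for each label.
- Distribution by split: observation counts and frequencies are listed for each label in the training and test sets.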
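Dataset creation
License and attribution
- Usage restrictions: none.
- Attribution requirements:
- Provide a link to the NIH download site.
- Cite the CVPR 2017 paper.
- Acknowledge the NIH Clinical Center as the data provider.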
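Collected summary
Dataset introduction

Construction
In medical image analysis, access to large-scale annotated data has long been key to algorithmic progress. The NIH chest X-ray dataset draws on real imaging from a clinical setting to assemble more than 110,000 frontal-view X-ray images. The images come from the clinical records of more than 30,000 patients, and labels for fourteen common thoracic diseases were extracted automatically from the radiology reports using natural language processing, combining machine-generated and expert-generated annotation. The construction emphasizes clinical relevance and annotation efficiency, and it laid a foundation for later research on weakly supervised learning.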
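Usage
Researchers can load the dataset through the Hugging Face platform and use its standardized data fields for model training and evaluation. A typical workflow reads the `image` field to obtain the decoded image and pairs it with the integer-encoded multi-label `labels` field for supervised learning. The dataset is already divided into training and test splits, which supports reproducible experiments. Users must follow the stated citation requirements and credit the data source in their work. The dataset mainly serves image classification, and it is particularly suited to exploring multi-label classification and weakly supervised learning in medical image analysis.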
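Background and challenges
Background overview
Chest X-rays are a basic tool for diagnosing many thoracic diseases. The NIH-CXR14 dataset was released in 2017 by the NIH Clinical Center; the core research team includes Xiaosong Wang, Yifan Peng, and colleagues. It aims to advance automatic classification and localization of fourteen common diseases in chest X-ray images through large-scale weak supervision, covering pathologies such as atelectasis, pleural effusion, and pneumonia. With 112,120 frontal-view X-ray images from 30,805 patients, it provides an important resource for training and validating deep learning models in medical imaging and has significantly advanced computer-aided diagnosis.
Current challenges
The dataset targets multi-disease classification in chest X-ray images; the core problem is handling multi-label annotations and the complex relationships between diseases. The main challenges in its construction are as follows: the disease labels were mined from radiology reports with natural language processing, and although accuracy above 90% is claimed, automated labeling may introduce noise and bias; the data come from a single institution, which may skew the patient population and limit how well models generalize; and the original radiology reports are not public, which limits the transparency and reproducibility of the labeling process. Together these factors are the key limitations of the dataset for research and practical use.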
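Common scenarios
Classic use cases
In medical image analysis, NIH-CXR14 is a large-scale chest X-ray resource that is often used as a benchmark for multi-label classification. Researchers use its labels for fourteen common thoracic diseases to develop and validate deep learning models that automatically identify pathological features such as atelectasis, effusion, and pneumonia. This work has improved the performance of computer-aided diagnosis systems and has given weakly supervised learning a rich experimental setting, in which models aim to localize and classify disease from limited annotation.
Academic problems addressed
The dataset addresses two persistent problems in medical imaging research: scarce data and expensive annotation. By providing more than 100,000 weakly annotated images whose labels were generated by text mining, it supports the exploration of weakly supervised methods and reduces the dependence of fully supervised learning on fine-grained annotation. Its multi-label nature helps models learn to recognize complex comorbidities, provides a data foundation for automated screening and classification of thoracic diseases, and has accelerated academic progress in medical AI for radiology.
Practical applications
In real medical settings, models trained on NIH-CXR14 have gradually been integrated into clinical workflows to assist radiologists with preliminary diagnosis. These systems can screen X-rays quickly and flag suspicious regions, which improves diagnostic efficiency and reduces human oversight errors. In resource-limited institutions in particular, such tools can act as a second reader, strengthening early detection of thoracic conditions such as pneumothorax and cardiomegaly and improving patient triage and treatment decisions.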
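Recent research on the dataset
Latest research directions
In medical image analysis, NIH-CXR14 remains an important chest X-ray resource that continues to drive research on deep learning models for disease diagnosis. Current work focuses on combining weakly supervised learning with multi-label classification, using the text-mined labels to improve recognition accuracy and localization for the fourteen common thoracic diseases. As AI is applied more widely in healthcare, the dataset has also encouraged cross-modal learning methods, for example combining clinical reports with image features to make models more interpretable. These advances have accelerated the development of assisted-diagnosis systems, provided technical support for global public health challenges, and shown the key role of large-scale annotated data in advancing precision medicine.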
The content above was collected and summarized by 遇见数据集.



