NIH Chest X-rays
Source: www.kaggle.com (updated 2018-02-21, indexed 2025-01-08)
Download link: https://www.kaggle.com/nih-chest-xrays/data
Resource description:
# NIH Chest X-ray Dataset
---
### National Institutes of Health Chest X-Ray Dataset
Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, [Openi][1] was the largest publicly available source of chest X-ray images with 4,143 images available.
This NIH Chest X-ray Dataset comprises 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available, but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (*Wang et al.*)
[Link to paper][30]
[1]: https://openi.nlm.nih.gov/
<br>
### Data limitations:
1. The image labels are NLP-extracted, so some labels may be erroneous, but the NLP labeling accuracy is estimated to be >90%.
2. Very limited numbers of disease region bounding boxes (see BBox_list_2017.csv).
3. Chest X-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their "updated" image labels and/or new bounding boxes in their own later studies, possibly produced through manual annotation.
<br>
### File contents
- **Image format**: 112,120 total images with size 1024 x 1024
- **images_001.zip**: Contains 4,999 images
- **images_002.zip**: Contains 10,000 images
- **images_003.zip**: Contains 10,000 images
- **images_004.zip**: Contains 10,000 images
- **images_005.zip**: Contains 10,000 images
- **images_006.zip**: Contains 10,000 images
- **images_007.zip**: Contains 10,000 images
- **images_008.zip**: Contains 10,000 images
- **images_009.zip**: Contains 10,000 images
- **images_010.zip**: Contains 10,000 images
- **images_011.zip**: Contains 10,000 images
- **images_012.zip**: Contains 7,121 images
- **README_ChestXray.pdf**: Original README file
- **BBox_list_2017.csv**: Bounding box coordinates. *Note: Start at x,y, extend horizontally w pixels, and vertically h pixels*
    - Image Index: File name
    - Finding Label: Disease type (Class label)
    - Bbox x
    - Bbox y
    - Bbox w
    - Bbox h
- **Data_entry_2017.csv**: Class labels and patient data for the entire dataset
    - Image Index: File name
    - Finding Labels: Disease type (Class label)
    - Follow-up #
    - Patient ID
    - Patient Age
    - Patient Gender
    - View Position: X-ray orientation
    - OriginalImageWidth
    - OriginalImageHeight
    - OriginalImagePixelSpacing_x
    - OriginalImagePixelSpacing_y
<br>
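The bounding-box note for BBox_list_2017.csv (start at x,y, extend w pixels horizontally and h pixels vertically) can be sketched in code. The snippet below is a minimal illustration, not the dataset authors' tooling. It assumes the column names are exactly those listed above (the headers in the actual file may differ) and that the origin is the top-left corner of the image. The CSV rows are made up for the example.

```python
import csv
import io


def bbox_to_corners(x, y, w, h):
    """Convert an (x, y, w, h) box to (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)


# Made-up rows for illustration only; they are not taken from the dataset.
# Column names follow the list in this README and are an assumption.
SAMPLE_CSV = """Image Index,Finding Label,Bbox x,Bbox y,Bbox w,Bbox h
example_000.png,Atelectasis,100.0,200.0,50.0,80.0
example_001.png,Cardiomegaly,10.5,20.0,300.0,150.5
"""


def read_boxes(text):
    """Parse CSV text into (file name, label, corner box) tuples."""
    boxes = []
    for row in csv.DictReader(io.StringIO(text)):
        x, y, w, h = (float(row[f"Bbox {key}"]) for key in "xywh")
        boxes.append(
            (row["Image Index"], row["Finding Label"], bbox_to_corners(x, y, w, h))
        )
    return boxes


boxes = read_boxes(SAMPLE_CSV)
for name, label, corners in boxes:
    print(name, label, corners)
```

Corner coordinates are often more convenient for drawing or for computing overlap between boxes, which is why the conversion is shown separately from the parsing.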
### Class descriptions
There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:
- Atelectasis
- Consolidation
- Infiltration
- Pneumothorax
- Edema
- Emphysema
- Fibrosis
- Effusion
- Pneumonia
- Pleural_thickening
- Cardiomegaly
- Nodule
- Mass
- Hernia
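
Because an image can carry one or more disease classes, a multi-hot vector is a common way to represent its labels. The sketch below is one possible encoding, not part of the dataset itself. It assumes that multiple labels in the `Finding Labels` field are joined with a `|` separator and that the "no findings" entry is spelled `No Finding`; both should be checked against the actual CSV. Nodule and Mass are treated as separate classes, giving 14 diseases, and names are matched case-insensitively in case capitalization in the file differs from the list above.

```python
# The 14 disease classes, with Nodule and Mass as separate entries.
CLASSES = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
    "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
    "Pleural_thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia",
]

# Lowercased lookup so "Pleural_Thickening" and "Pleural_thickening" both match.
_INDEX = {name.lower(): i for i, name in enumerate(CLASSES)}


def encode_labels(finding_labels, sep="|"):
    """Return a 14-element multi-hot list for one Finding Labels string.

    The separator is an assumption about the file format. Any name that is
    not one of the 14 diseases (for example the no-findings entry) is
    ignored, so an image without findings encodes as all zeros.
    """
    vector = [0] * len(CLASSES)
    for label in finding_labels.split(sep):
        key = label.strip().lower()
        if key in _INDEX:
            vector[_INDEX[key]] = 1
    return vector


print(encode_labels("Cardiomegaly|Emphysema"))
print(encode_labels("No Finding"))
```

Encoding "no findings" as the all-zero vector keeps the output at 14 dimensions; a separate 15th column is an equally valid choice if the class should be predicted explicitly.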
<br>
### Full Dataset Content
There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.
- [Sample][9]: sample.zip
[9]: https://www.kaggle.com/nih-chest-xrays/sample
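
The counts quoted in this README are consistent with each other: the 12 archives sum to 112,120 images, and 5% of that total is 5,606. The sketch below checks this arithmetic and shows one way a 5% random subset could be drawn. It is an illustration only; the procedure actually used to build sample.zip is not described here, and the file names and the seed are hypothetical.

```python
import random

# Per-archive image counts as listed under "File contents".
ARCHIVE_COUNTS = [4999] + [10000] * 10 + [7121]
TOTAL_IMAGES = sum(ARCHIVE_COUNTS)


def draw_sample(image_names, fraction=0.05, seed=0):
    """Draw a random subset without replacement.

    The seed is arbitrary and only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    k = round(len(image_names) * fraction)
    return rng.sample(image_names, k)


# Hypothetical placeholder names; the real files are named differently.
names = [f"image_{i:06d}.png" for i in range(TOTAL_IMAGES)]
subset = draw_sample(names)

print(TOTAL_IMAGES, len(subset))
```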
<br>
### Modifications to original data
- Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
- CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
<br>
### Citations
- Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017, [ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf][30]
- NIH News release: [NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community][30]
- Original source files and documents: [https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345][31]
<br>
### Acknowledgements
This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and National Library of Medicine (www.nlm.nih.gov).
[30]: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
[31]: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345