ykumards/open-i

Name: ykumards/open-i
Creator: ykumards
Published: 2023-09-27 11:54:03
License: no description provided

Hugging Face · Updated 2023-09-27 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ykumards/open-i

Resource description:

---
dataset_info:
  features:
  - name: uid
    dtype: int64
  - name: MeSH
    dtype: string
  - name: Problems
    dtype: string
  - name: image
    dtype: string
  - name: indication
    dtype: string
  - name: comparison
    dtype: string
  - name: findings
    dtype: string
  - name: impression
    dtype: string
  - name: img_frontal
    dtype: binary
  - name: img_lateral
    dtype: binary
  splits:
  - name: train
    num_bytes: 2104109741
    num_examples: 3851
  download_size: 2095869611
  dataset_size: 2104109741
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nd-4.0
language:
- en
pretty_name: Chest X-rays (Indiana University)
size_categories:
- 1K<n<10K
---

# Chest X-rays (Indiana University)

Copy of the Kaggle dataset: https://www.kaggle.com/datasets/raddar/chest-xrays-indiana-university created by [raddar](https://www.kaggle.com/raddar)

---

Open access chest X-ray collection from Indiana University.

Original source: https://openi.nlm.nih.gov/

Original images were downloaded in raw DICOM standard. Each image was converted to PNG using some post-processing:

- top/bottom 0.5% DICOM pixel values were clipped (to eliminate very dark or very bright pixel outliers)
- DICOM pixel values were scaled linearly to fit into the 0-255 range
- images were resized to 2048 on the shorter side (to fit in Kaggle dataset limits)

Metadata was downloaded using the available API (https://openi.nlm.nih.gov/services#searchAPIUsingGET).

Each image was classified manually into frontal and lateral chest X-ray categories.

License: [Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)](https://creativecommons.org/licenses/by-nc-nd/4.0/)

---

### Usage

The lateral and frontal images of each uid are grouped together in an example. The images are stored as bytes and can be loaded into a PIL Image using the following method:

```
import io

from PIL import Image


def load_image_from_byte_array(byte_array):
    # Wrap the raw bytes in a file-like object so PIL can decode them.
    return Image.open(io.BytesIO(byte_array))
```

### Cite

Please cite the original [source of the dataset](https://openi.nlm.nih.gov/).

```
@article{demner2016preparing,
  title={Preparing a collection of radiology examinations for distribution and retrieval},
  author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
  journal={Journal of the American Medical Informatics Association},
  volume={23},
  number={2},
  pages={304--310},
  year={2016},
  publisher={Oxford University Press}
}
```

Provider:

ykumards

Original information summary

Chest X-rays (Indiana University) dataset overview

Dataset information

Features

uid: int64
MeSH: string
Problems: string
image: string
indication: string
comparison: string
findings: string
impression: string
img_frontal: binary
img_lateral: binary

Data splits

train: 3851 examples, 2104109741 bytes in total

Dataset size

Download size: 2095869611 bytes
Dataset size: 2104109741 bytes

Configuration

default: data files at data/train-*

License

cc-by-nd-4.0

Language

en

Dataset name

Chest X-rays (Indiana University)

Dataset scale

1K<n<10K

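Data processing

The original images were in DICOM format and were converted to PNG as follows:
- The top and bottom 0.5% of DICOM pixel values were clipped
- DICOM pixel values were scaled linearly to the 0-255 range
- The shorter side was resized to 2048 pixels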
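
The three steps above can be sketched as follows. This is a dependency-free illustration, not the code used to build the dataset: the function names `clip_and_scale` and `target_size` are hypothetical, the nearest-rank choice of cut points is an assumption (the source does not say how the 0.5% bounds were computed), and preserving the aspect ratio during resizing is also assumed. In practice the same logic would more likely be written with an array library.

```python
def clip_and_scale(pixels, clip_fraction=0.005):
    """Clip the darkest/brightest fraction of values, then map linearly to 0-255.

    `pixels` is a flat list of raw intensity values (hypothetical input format).
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    n = len(ordered)
    # Assumption: nearest-rank cut points for the lower and upper bounds.
    lo = ordered[int(clip_fraction * (n - 1))]
    hi = ordered[int((1 - clip_fraction) * (n - 1))]
    if hi == lo:
        # Constant image: there is no range to stretch.
        return [0] * n
    scale = 255.0 / (hi - lo)
    # Clamp each value into [lo, hi], then stretch that interval onto [0, 255].
    return [round((min(max(p, lo), hi) - lo) * scale) for p in pixels]


def target_size(width, height, short_side=2048):
    """Return (width, height) after resizing so the shorter side equals `short_side`.

    Assumption: the aspect ratio is preserved.
    """
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)
```

Clipping before scaling keeps a handful of extreme pixels from compressing the rest of the image into a narrow band of grey levels.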

Metadata

Metadata was downloaded through the API: https://openi.nlm.nih.gov/services#searchAPIUsingGET

Image classification

Each image was manually classified as a frontal or lateral chest X-ray

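Usage

Images are stored as bytes and can be loaded as PIL images with the following method (it requires `import io` and `from PIL import Image`): `def load_image_from_byte_array(byte_array): return Image.open(io.BytesIO(byte_array))`

Citation

Please cite the original source of the dataset: https://openi.nlm.nih.gov/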

@article{demner2016preparing,
  title={Preparing a collection of radiology examinations for distribution and retrieval},
  author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
  journal={Journal of the American Medical Informatics Association},
  volume={23},
  number={2},
  pages={304--310},
  year={2016},
  publisher={Oxford University Press}
}

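Collected summary

Dataset introduction

Construction

The dataset is built from Indiana University's open-access chest X-ray collection. The original DICOM images were post-processed (pixel-value clipping, linear scaling, and resizing) and converted to PNG. Each image was manually classified as a frontal or lateral chest X-ray, and the corresponding metadata was downloaded through the Open-i API, yielding a structured dataset with view labels.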

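Characteristics

The dataset has three notable characteristics. First, it contains 3,851 examples, which places it in the 1K<n<10K size category. Second, the images have undergone uniform preprocessing, which simplifies later analysis and model training. Finally, the dataset card states a CC BY-NC-ND 4.0 license, which permits sharing with attribution but prohibits commercial use and derivative works.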

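Usage

Image bytes can be loaded as PIL Image objects with the function given in the dataset card, for further image processing and visual analysis. The dataset ships as a single train split and can be loaded through the Hugging Face `datasets` library. Users should follow the data usage terms and cite the original data source in any resulting work.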
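
A minimal sketch of that loading path, assuming the Hugging Face `datasets` library is installed. The helper names `load_open_i` and `summarize_record` are hypothetical, and the handling of missing views or report sections (treated as `None` or empty) is an assumption rather than something the dataset card documents. The field names come from the feature list above.

```python
def load_open_i(streaming=True):
    """Open the train split of ykumards/open-i (needs `pip install datasets`)."""
    # Imported lazily so summarize_record works without the library installed.
    from datasets import load_dataset

    # Streaming avoids downloading the roughly 2 GB of data up front.
    return load_dataset("ykumards/open-i", split="train", streaming=streaming)


def summarize_record(record):
    """Collect the report text and note which views are present in one example."""
    # Assumption: a missing view or report section appears as None or empty.
    parts = [record.get("findings"), record.get("impression")]
    return {
        "uid": record["uid"],
        "has_frontal": bool(record.get("img_frontal")),
        "has_lateral": bool(record.get("img_lateral")),
        "report": " ".join(p.strip() for p in parts if p),
    }
```

A typical loop would be `for record in load_open_i(): print(summarize_record(record))`; the image bytes in `img_frontal` and `img_lateral` can then be decoded with the PIL helper from the dataset card.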

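Background and challenges

Background

The Chest X-rays (Indiana University) dataset is a useful resource for medical image analysis research. It originates from Indiana University and was curated and shared on Kaggle by raddar. It gives researchers open access to chest X-ray images in order to support the development and evaluation of algorithms for medical image diagnosis. The dataset contains images that have gone through specific preprocessing steps together with frontal/lateral view labels, making it a data source for research on computer-aided diagnosis and clinical decision support systems.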

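Current challenges

Building the dataset involved several challenges. First, the original images were stored in the DICOM standard and required format conversion and pixel-value adjustment to make them broadly usable and consistent. Second, the manual classification of images is subjective to some degree, which may affect label quality and consistency. In addition, the dataset must comply with its license while also protecting data security and privacy. On the research side, the open problems associated with this dataset include improving the accuracy of image recognition algorithms, reducing misdiagnosis rates, and improving how well algorithms generalize to multimodal imaging data.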

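Common scenarios

Typical use cases

In medical image analysis, the typical use of the Chest X-rays (Indiana University) dataset is to support physicians in reading chest X-rays. Its frontal and lateral images can be used to train deep learning models to recognize abnormalities such as pneumonia or tumors, with the aim of improving diagnostic accuracy and efficiency.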

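Academic problems addressed

The dataset helps with the shortage of annotated data and the uneven data quality found in medical image analysis. It provides researchers with preprocessed images and detailed metadata, which supports work on image recognition and abnormality detection and offers a data foundation for the automated interpretation of medical images.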

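Derived work

Work derived from this dataset includes, among other things, new image recognition algorithms, more sophisticated computer-aided diagnosis models, and cross-domain data fusion research, such as combining imaging data with electronic medical records to give a fuller picture of patient health.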

The content above was collected and summarized by 遇见数据集.