---
annotations_creators:
- no-annotation
language_creators:
- other
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- image-classification
task_ids:
- multi-label-image-classification
pretty_name: ColonCancerCTDataset
tags:
- colon cancer
- medical
- cancer
dataset_info:
  features:
  - name: image
    dtype: image
  - name: ImageType
    sequence: string
  - name: StudyDate
    dtype: string
  - name: SeriesDate
    dtype: string
  - name: Manufacturer
    dtype: string
  - name: StudyDescription
    dtype: string
  - name: SeriesDescription
    dtype: string
  - name: PatientSex
    dtype: string
  - name: PatientAge
    dtype: string
  - name: PregnancyStatus
    dtype: string
  - name: BodyPartExamined
    dtype: string
  splits:
  - name: train
    num_bytes: 3537157.0
    num_examples: 30
  download_size: 3538117
  dataset_size: 3537157.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for ColonCancerCTDataset
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
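- **Homepage:** [https://portal.imaging.datacommons.cancer.gov](https://portal.imaging.datacommons.cancer.gov)
- **Repository:** [https://aws.amazon.com/marketplace/pp/prodview-3bcx7vcebfi2i#resources](https://aws.amazon.com/marketplace/pp/prodview-3bcx7vcebfi2i#resources)
- **Paper:** [https://aacrjournals.org/cancerres/article/81/16/4188/670283/NCI-Imaging-Data-CommonsNCI-Imaging-Data-Commons](https://aacrjournals.org/cancerres/article/81/16/4188/670283/NCI-Imaging-Data-CommonsNCI-Imaging-Data-Commons)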
### Dataset Summary
This dataset is a curated subset of the National Cancer Institute (NCI) Imaging Data Commons (IDC) that focuses on CT Colonography images. It is drawn from the broader IDC repository hosted on the AWS Marketplace, which holds cancer imaging data sourced from clinical studies worldwide across modalities such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET). This subset contains CT Colonography images only.
Each image is accompanied by metadata covering patient demographics (sex, age, and pregnancy status) and study and series descriptions, which supports more detailed analysis and interpretation of the imaging data.
### Supported Tasks
The dataset can be utilized for several tasks:
- Developing machine learning models to differentiate between benign and malignant colonic lesions.
- Developing precise algorithms for segmenting polyps and other colonic structures.
- Conducting longitudinal studies on cancer progression.
- Assessing the diagnostic accuracy of CT Colonography compared to other imaging modalities in colorectal conditions.
### Languages
English is used for text data like labels and imaging study descriptions.
## Dataset Structure
### Data Instances
Each instance has the following structure:
```
{
  "image": <PIL.PngImagePlugin.PngImageFile>,  # A CT image
  "ImageType": ['ORIGINAL', 'PRIMARY', 'AXIAL', 'CT_SOM5 SPI'],  # A list containing the info of the image
  "StudyDate": "20000101",  # Date of the case study
  "SeriesDate": "20000101",  # Date of the series
  "Manufacturer": "SIEMENS",  # Manufacturer of the device used for imaging
  "StudyDescription": "Abdomen^24ACRIN_Colo_IRB2415-04 (Adult)",  # Description of the study
  "SeriesDescription": "Colo_prone 1.0 B30f",  # Description of the series
  "PatientSex": "F",  # Patient's sex
  "PatientAge": "059Y",  # Patient's age
  "PregnancyStatus": "None",  # Patient's pregnancy status
  "BodyPartExamined": "COLON"  # Body part examined
}
```
### Data Fields
- `image` (PIL.PngImagePlugin.PngImageFile): The CT image in PNG format
- `ImageType` (List(String)): A list containing the info of the image
- `StudyDate` (String): Date of the case study, e.g. `20000101`
- `SeriesDate` (String): Date of the series, e.g. `20000101`
- `Manufacturer` (String): Manufacturer of the device used for imaging
- `StudyDescription` (String): Description of the study
- `SeriesDescription` (String): Description of the series
- `PatientSex` (String): Patient's sex, e.g. `F`
- `PatientAge` (String): Patient's age, e.g. `059Y`
- `PregnancyStatus` (String): Patient's pregnancy status
- `BodyPartExamined` (String): The body part examined
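Because the date and age fields are stored as strings, they usually need converting before analysis. The sketch below assumes the values follow the DICOM conventions suggested by the sample instance (`YYYYMMDD` for dates, three digits plus a `D`/`W`/`M`/`Y` unit for ages); the helper names `parse_dicom_date` and `parse_dicom_age` are hypothetical and not part of the dataset or any library.

```python
import re
from datetime import date, datetime

# Approximate conversion factors from each age unit to years (an assumption
# made for illustration; days and weeks use a 365.25-day year).
_AGE_UNITS_IN_YEARS = {"D": 1 / 365.25, "W": 7 / 365.25, "M": 1 / 12, "Y": 1.0}


def parse_dicom_date(value: str) -> date:
    """Parse a YYYYMMDD string such as "20000101" into a datetime.date."""
    return datetime.strptime(value, "%Y%m%d").date()


def parse_dicom_age(value: str) -> float:
    """Convert an age string such as "059Y" into approximate years."""
    match = re.fullmatch(r"(\d{3})([DWMY])", value)
    if match is None:
        raise ValueError(f"Unrecognized age string: {value!r}")
    number, unit = match.groups()
    return int(number) * _AGE_UNITS_IN_YEARS[unit]


# Values copied from the sample instance shown in this card.
record = {"StudyDate": "20000101", "PatientAge": "059Y"}
study_date = parse_dicom_date(record["StudyDate"])
age_years = parse_dicom_age(record["PatientAge"])
print(study_date.isoformat(), age_years)  # 2000-01-01 59.0
```

Records whose fields are empty or use another format raise an error here rather than being silently coerced, so any deviations from the assumed format surface early.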
### Data Splits
The dataset provides a single `train` split.

| | train |
|--------------------|------:|
| Number of examples | 30 |
## Dataset Creation
### Curation Rationale
The dataset was created to narrow a large, heterogeneous collection of cancer imaging data down to a subset suited to focused research on colon cancer. Restricting the collection to CT Colonography makes relevant data easier to find and obtain for researchers and healthcare professionals working on colon cancer, for example when developing diagnostic models that could improve patient outcomes through early detection. The aim is to make the data more open and usable for specialists and academics in colon cancer research.
### Source Data
According to [IDC](https://portal.imaging.datacommons.cancer.gov/about/), data are submitted from NCI-funded driving projects and other specially selected projects.
### Personal and Sensitive Information
According to [IDC](https://portal.imaging.datacommons.cancer.gov/about/), submitters of data to IDC must ensure that the data have been de-identified for protected health information (PHI).
## Considerations for Using the Data
### Social Impact of Dataset
The dataset aims to support medical research on CT Colonography and may aid the early detection and treatment of colon cancer. High-quality imaging data supports the development of diagnostic AI tools, which can contribute to improved patient care and outcomes. Because timely diagnosis is crucial to treating cancer effectively, this can have a substantial social impact.
### Discussion of Biases
Given the dataset's focus on CT Colonography, biases may arise from the population demographics represented or the prevalence of certain conditions within the dataset. It is crucial to ensure that the dataset includes diverse cases to mitigate biases in model development and to ensure that AI tools developed using this data are generalizable and equitable in their application.
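One simple way to look for the demographic imbalance described above is to tally the metadata fields before training. The following is a minimal sketch under stated assumptions: records are plain dictionaries with the `PatientSex` and `PatientAge` fields from this card, ages are expressed in years (e.g. `059Y`), and the function name `summarize_demographics` and the sample records are hypothetical, not taken from the dataset.

```python
from collections import Counter


def summarize_demographics(records):
    """Count records per PatientSex value and per decade of age."""
    sex_counts = Counter(r.get("PatientSex") or "unknown" for r in records)
    decade_counts = Counter()
    for r in records:
        age = r.get("PatientAge") or ""
        # Only ages expressed in years (e.g. "059Y") are bucketed by decade;
        # anything else is counted as "unknown".
        if len(age) == 4 and age[:3].isdigit() and age.endswith("Y"):
            decade = int(age[:3]) // 10 * 10
            decade_counts[f"{decade}s"] += 1
        else:
            decade_counts["unknown"] += 1
    return sex_counts, decade_counts


# Hypothetical records for illustration only.
sample = [
    {"PatientSex": "F", "PatientAge": "059Y"},
    {"PatientSex": "M", "PatientAge": "062Y"},
    {"PatientSex": "F", "PatientAge": "051Y"},
]
sex_counts, decade_counts = summarize_demographics(sample)
print(dict(sex_counts))
print(dict(decade_counts))
```

A heavily skewed tally is a signal to report the imbalance alongside any model results or to stratify evaluation by the affected group.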
### Other Known Limitations
The dataset may have limitations in terms of variability and scope, as it focuses solely on CT Colonography. Other modalities and cancer types are not represented, which could limit the breadth of research.
## Additional Information
### Licensing Information
https://fairsharing.org/FAIRsharing.0b5a1d
### Citation Information
If you use this dataset, please cite the NCI Imaging Data Commons paper:
```
@article{fedorov2021nci,
title={NCI imaging data commons},
author={Fedorov, Andrey and Longabaugh, William JR and Pot, David
and Clunie, David A and Pieper, Steve and Aerts, Hugo JWL and
Homeyer, Andr{\'e} and Lewis, Rob and Akbarzadeh, Afshin and
Bontempi, Dennis and others},
journal={Cancer research},
volume={81},
number={16},
pages={4188--4193},
year={2021},
publisher={AACR}
}
```
[DOI](https://doi.org/10.1158/0008-5472.CAN-21-0950)