YuxuanZhang888/ColonCancerCTDataset

Name: YuxuanZhang888/ColonCancerCTDataset
Creator: YuxuanZhang888
Published: 2024-03-19 05:02:07
License: No description provided

Hugging Face | Updated 2024-03-19 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/YuxuanZhang888/ColonCancerCTDataset

Resource overview:

---
annotations_creators:
- no-annotation
language_creators:
- other
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 100B<n<1T
source_datasets:
- original
task_categories:
- image-classification
task_ids:
- multi-label-image-classification
pretty_name: ColonCancerCTDataset
tags:
- colon cancer
- medical
- cancer
dataset_info:
  features:
  - name: image
    dtype: image
  - name: ImageType
    sequence: string
  - name: StudyDate
    dtype: string
  - name: SeriesDate
    dtype: string
  - name: Manufacturer
    dtype: string
  - name: StudyDescription
    dtype: string
  - name: SeriesDescription
    dtype: string
  - name: PatientSex
    dtype: string
  - name: PatientAge
    dtype: string
  - name: PregnancyStatus
    dtype: string
  - name: BodyPartExamined
    dtype: string
  splits:
  - name: train
    num_bytes: 3537157.0
    num_examples: 30
  download_size: 3538117
  dataset_size: 3537157.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card Creation Guide

## Table of Contents

- [Dataset Card Creation Guide](#dataset-card-creation-guide)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://portal.imaging.datacommons.cancer.gov](https://portal.imaging.datacommons.cancer.gov)
- **Repository:** [https://aws.amazon.com/marketplace/pp/prodview-3bcx7vcebfi2i#resources](https://aws.amazon.com/marketplace/pp/prodview-3bcx7vcebfi2i#resources)
- **Paper:** [https://aacrjournals.org/cancerres/article/81/16/4188/670283/NCI-Imaging-Data-CommonsNCI-Imaging-Data-Commons](https://aacrjournals.org/cancerres/article/81/16/4188/670283/NCI-Imaging-Data-CommonsNCI-Imaging-Data-Commons)

### Dataset Summary

This dataset is a curated subset of the National Cancer Institute Imaging Data Commons (IDC) that focuses on CT Colonography images. It is a targeted selection from the broader IDC repository hosted on the AWS Marketplace, which holds diverse cancer imaging data. The repository's images are sourced from clinical studies worldwide and span modalities such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET). Alongside the clinical images, the dataset includes metadata covering patient demographics (age, sex, and pregnancy status) and detailed study descriptions, which supports finer-grained analysis and interpretation of the imaging data.

### Supported Tasks

The dataset can be used for several tasks:

- Developing machine learning models to differentiate between benign and malignant colonic lesions.
- Developing precise algorithms for segmenting polyps and other colonic structures.
- Conducting longitudinal studies on cancer progression.
- Assessing the diagnostic accuracy of CT Colonography compared to other imaging modalities in colorectal conditions.

### Languages

English is used for text data such as labels and imaging study descriptions.

## Dataset Structure

### Data Instances

Each record follows the structure below:

```
{
  "image": image.png,  # A CT image
  "ImageType": ['ORIGINAL', 'PRIMARY', 'AXIAL', 'CT_SOM5 SPI'],  # A list containing the info of the image
  "StudyDate": "20000101",  # Date of the case study
  "SeriesDate": "20000101",  # Date of the series
  "Manufacturer": "SIEMENS",  # Manufacturer of the device used for imaging
  "StudyDescription": "Abdomen^24ACRIN_Colo_IRB2415-04 (Adult)",  # Description of the study
  "SeriesDescription": "Colo_prone 1.0 B30f",  # Description of the series
  "PatientSex": "F",  # Patient's sex
  "PatientAge": "059Y",  # Patient's age
  "PregnancyStatus": "None",  # Patient's pregnancy status
  "BodyPartExamined": "COLON"  # Body part examined
}
```

### Data Fields

- image (PIL.PngImagePlugin.PngImageFile): The CT image in PNG format
- ImageType (List(String)): A list containing the info of the image
- StudyDate (String): Date of the case study
- SeriesDate (String): Date of the series
- Manufacturer (String): Manufacturer of the device used for imaging
- StudyDescription (String): Description of the study
- SeriesDescription (String): Description of the series
- PatientSex (String): Patient's sex
- PatientAge (String): Patient's age
- PregnancyStatus (String): Patient's pregnancy status
- BodyPartExamined (String): The body part examined

### Data Splits

|                    |   train |
|--------------------|--------:|
| Number of examples |      30 |
| Size in bytes      | 3537157 |

## Dataset Creation

### Curation Rationale

The dataset grew out of the need to narrow a vast, heterogeneous collection of cancer imaging data so that research can focus on colon cancer. By restricting the collection to CT Colonography, it addresses the data-accessibility problem faced by researchers and healthcare professionals working on colon cancer. This refinement makes it easier to obtain relevant data for developing diagnostic models, which could improve patient outcomes through early detection. The curation aims to make the data more open and usable for specialists and academics in colon cancer research.

### Source Data

According to [IDC](https://portal.imaging.datacommons.cancer.gov/about/), data are submitted by NCI-funded driving projects and other specially selected projects.

### Personal and Sensitive Information

According to [IDC](https://portal.imaging.datacommons.cancer.gov/about/), submitters of data to IDC must ensure that the data have been de-identified for protected health information (PHI).

## Considerations for Using the Data

### Social Impact of Dataset

This CT Colonography dataset aims to support medical research and may aid the early detection and treatment of colon cancer. High-quality imaging data enables the development of diagnostic AI tools, which can contribute to better patient care and outcomes. The potential social impact is substantial, because timely diagnosis is crucial to treating cancer effectively.

### Discussion of Biases

Because the dataset focuses on CT Colonography, biases may arise from the population demographics it represents or from the prevalence of certain conditions within it. The dataset needs to include diverse cases to mitigate bias in model development and to ensure that AI tools built on this data generalize and are equitable in their application.

### Other Known Limitations

The dataset may be limited in variability and scope because it focuses solely on CT Colonography. Other modalities and cancer types are not represented, which could limit the breadth of research.

### Licensing Information

https://fairsharing.org/FAIRsharing.0b5a1d

### Citation Information

Provide the [BibTeX](http://www.bibtex.org/)-formatted reference for the dataset.
For example:

```
@article{fedorov2021nci,
  title={NCI imaging data commons},
  author={Fedorov, Andrey and Longabaugh, William JR and Pot, David and Clunie, David A and Pieper, Steve and Aerts, Hugo JWL and Homeyer, Andr{\'e} and Lewis, Rob and Akbarzadeh, Afshin and Bontempi, Dennis and others},
  journal={Cancer research},
  volume={81},
  number={16},
  pages={4188--4193},
  year={2021},
  publisher={AACR}
}
```

[DOI](https://doi.org/10.1158/0008-5472.CAN-21-0950)

Provider:

YuxuanZhang888

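Original Information Summary

Dataset Overview

Dataset Description

Dataset Summary

This dataset is a curated subset of the National Cancer Institute Imaging Data Commons (IDC), with an emphasis on CT Colonography images. It is a targeted selection from the broader IDC repository, which is hosted on the AWS Marketplace and holds a wide range of cancer imaging data. The images it holds come from clinical studies worldwide and cover modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET).

In addition to the clinical images, the dataset includes key metadata covering patient demographics (sex and pregnancy status) and detailed study descriptions, which allows finer-grained analysis and interpretation of the imaging data.

Supported Tasks

The dataset can be used for the following tasks:

Developing machine learning models to differentiate between benign and malignant colonic lesions.
Developing precise algorithms for segmenting polyps and other colonic structures.
Conducting longitudinal studies of cancer progression.
Assessing the diagnostic accuracy of CT Colonography relative to other imaging modalities in colorectal conditions.

Languages

Text data, such as labels and imaging study descriptions, is in English.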

Dataset Structure

Data Instances

Each record follows the structure below:

    {
      "image": "image.png",  # CT image
      "ImageType": ["ORIGINAL", "PRIMARY", "AXIAL", "CT_SOM5 SPI"],  # List of image information
      "StudyDate": "20000101",  # Study date
      "SeriesDate": "20000101",  # Series date
      "Manufacturer": "SIEMENS",  # Manufacturer of the imaging device
      "StudyDescription": "Abdomen^24ACRIN_Colo_IRB2415-04 (Adult)",  # Study description
      "SeriesDescription": "Colo_prone 1.0 B30f",  # Series description
      "PatientSex": "F",  # Patient's sex
      "PatientAge": "059Y",  # Patient's age
      "PregnancyStatus": "None",  # Patient's pregnancy status
      "BodyPartExamined": "COLON"  # Body part examined
    }
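The date and age values in the sample record are compact strings rather than native types. The sketch below shows one way to decode them, assuming that `StudyDate`/`SeriesDate` follow the DICOM date layout (`YYYYMMDD`) and that `PatientAge` follows the DICOM age-string layout (three digits plus a unit of `D`, `W`, `M`, or `Y`), as the sample values suggest. The helper names `parse_dicom_date` and `parse_patient_age_years` are hypothetical and not part of the dataset or any library.

```python
from datetime import date, datetime
from typing import Optional


def parse_dicom_date(value: str) -> Optional[date]:
    """Parse a YYYYMMDD string such as "20000101"; return None if it is malformed."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


def parse_patient_age_years(value: str) -> Optional[float]:
    """Convert an age string such as "059Y" to (approximate) years.

    Assumes the DICOM age-string layout: three digits followed by
    D (days), W (weeks), M (months), or Y (years).
    """
    years_per_unit = {"Y": 1.0, "M": 1.0 / 12, "W": 7 / 365.25, "D": 1 / 365.25}
    if not isinstance(value, str) or len(value) != 4:
        return None
    digits, unit = value[:3], value[3].upper()
    if not digits.isdecimal() or unit not in years_per_unit:
        return None
    return int(digits) * years_per_unit[unit]


print(parse_dicom_date("20000101"))
print(parse_patient_age_years("059Y"))
```

Returning `None` for malformed input, rather than raising, keeps a single bad row from aborting a pass over the whole split; callers that prefer strictness could raise instead.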

Data Fields

image (PIL.PngImagePlugin.PngImageFile): The CT image in PNG format
ImageType (List(String)): A list containing information about the image
StudyDate (String): Study date
SeriesDate (String): Series date
Manufacturer (String): Manufacturer of the imaging device
StudyDescription (String): Study description
SeriesDescription (String): Series description
PatientSex (String): Patient's sex
PatientAge (String): Patient's age
PregnancyStatus (String): Patient's pregnancy status
BodyPartExamined (String): Body part examined
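The field list above can be turned into a lightweight sanity check before records are fed to a model. This is a minimal sketch, assuming every metadata field is present and typed exactly as listed; real rows may carry missing values, in which case the check would need to be relaxed. The function name `find_schema_problems` is hypothetical, and the `image` value is only checked for presence so that the example does not depend on an imaging library.

```python
from typing import Any, Dict, List

# Metadata fields documented as plain strings. ImageType is handled separately
# because it is a list of strings.
STRING_FIELDS = [
    "StudyDate", "SeriesDate", "Manufacturer", "StudyDescription",
    "SeriesDescription", "PatientSex", "PatientAge", "PregnancyStatus",
    "BodyPartExamined",
]


def find_schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the record matches."""
    problems = []
    if "image" not in record:
        problems.append("missing field: image")
    image_type = record.get("ImageType")
    if not isinstance(image_type, list) or not all(isinstance(v, str) for v in image_type):
        problems.append("ImageType must be a list of strings")
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name} must be a string")
    return problems


# The sample record from the dataset card, with a placeholder for the image.
sample = {
    "image": object(),  # stands in for a PIL image
    "ImageType": ["ORIGINAL", "PRIMARY", "AXIAL", "CT_SOM5 SPI"],
    "StudyDate": "20000101",
    "SeriesDate": "20000101",
    "Manufacturer": "SIEMENS",
    "StudyDescription": "Abdomen^24ACRIN_Colo_IRB2415-04 (Adult)",
    "SeriesDescription": "Colo_prone 1.0 B30f",
    "PatientSex": "F",
    "PatientAge": "059Y",
    "PregnancyStatus": "None",
    "BodyPartExamined": "COLON",
}

print(find_schema_problems(sample))
```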

Data Splits

Split	Bytes	Examples
train	3537157	30
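As a quick reading aid, the split figures imply the average stored size of one example. The arithmetic below uses only the two numbers in the table; it says nothing about image resolution, which the card does not state.

```python
# Figures taken from the split table above.
num_bytes = 3_537_157
num_examples = 30

avg_bytes = num_bytes / num_examples
print(f"{avg_bytes:,.0f} bytes per example (~{avg_bytes / 1024:.1f} KiB)")
# prints: 117,905 bytes per example (~115.1 KiB)
```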

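Dataset Creation

Curation Rationale

The dataset grew out of the need to narrow a large, heterogeneous body of cancer imaging data so that research can focus on colon cancer. By restricting the collection to CT Colonography, it addresses the data-access problem faced by researchers and healthcare professionals working on colon cancer. This refinement makes it easier to obtain relevant data for developing diagnostic models, which could improve patient outcomes through early detection. The curation aims to make the data more open and usable for specialists and academics in colon cancer research.

Source Data

According to IDC, data are submitted by NCI-funded driving projects and other specially selected projects.

Personal and Sensitive Information

According to IDC, those who submit data to IDC must ensure that the data have been de-identified with respect to protected health information (PHI).

Considerations for Using the Data

Social Impact of Dataset

This CT Colonography dataset aims to strengthen medical research and may aid the early detection and treatment of colon cancer. High-quality imaging data supports the development of diagnostic AI tools, which can improve patient care and outcomes. The potential social impact is substantial, because timely diagnosis is crucial to treating cancer effectively.

Discussion of Biases

Because the dataset focuses on CT Colonography, biases may arise from the population demographics it represents or from the prevalence of certain conditions within it. The dataset needs to include diverse cases to mitigate bias in model development and to ensure that AI tools built on this data generalize and are fair.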
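One practical way to act on this concern is to tabulate the demographic metadata before training. The sketch below counts records by `PatientSex` and by age decade; it assumes `PatientAge` uses the three-digits-plus-`Y` form seen in the sample record and files anything else under `unknown`. The function name `summarize_demographics` and the example rows are hypothetical; only the first row mirrors the sample record.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def summarize_demographics(records: Iterable[Mapping[str, str]]) -> Dict[str, Counter]:
    """Count records per PatientSex value and per age decade."""
    sex_counts: Counter = Counter()
    decade_counts: Counter = Counter()
    for record in records:
        sex_counts[record.get("PatientSex") or "unknown"] += 1
        age = record.get("PatientAge") or ""
        # Only ages expressed in years (e.g. "059Y") are bucketed.
        if len(age) == 4 and age[:3].isdecimal() and age[3] == "Y":
            decade = int(age[:3]) // 10 * 10
            decade_counts[f"{decade}-{decade + 9}"] += 1
        else:
            decade_counts["unknown"] += 1
    return {"sex": sex_counts, "age_decade": decade_counts}


# Hypothetical rows for illustration only; they are not taken from the dataset.
records = [
    {"PatientSex": "F", "PatientAge": "059Y"},
    {"PatientSex": "M", "PatientAge": "062Y"},
    {"PatientSex": "F", "PatientAge": "051Y"},
    {"PatientSex": "F", "PatientAge": ""},
]

summary = summarize_demographics(records)
print(summary["sex"])
print(summary["age_decade"])
```

With the Hugging Face `datasets` library installed, the same function could in principle be applied to the rows returned by `load_dataset("YuxuanZhang888/ColonCancerCTDataset", split="train")`. A heavily skewed count in either table would be a signal to stratify evaluation or to seek additional data.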

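Other Known Limitations

The dataset may be limited in variability and scope because it focuses solely on CT Colonography. Other imaging modalities and cancer types are not covered, which may limit the breadth of research.

Licensing Information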

https://fairsharing.org/FAIRsharing.0b5a1d

Citation Information

Provide the BibTeX-formatted reference for the dataset. For example:

    @article{fedorov2021nci,
      title={NCI imaging data commons},
      author={Fedorov, Andrey and Longabaugh, William JR and Pot, David and Clunie, David A and Pieper, Steve and Aerts, Hugo JWL and Homeyer, Andr{\'e} and Lewis, Rob and Akbarzadeh, Afshin and Bontempi, Dennis and others},
      journal={Cancer research},
      volume={81},
      number={16},
      pages={4188--4193},
      year={2021},
      publisher={AACR}
    }

DOI: https://doi.org/10.1158/0008-5472.CAN-21-0950

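Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

This is a CT imaging dataset dedicated to colon cancer. It contains CT Colonography images from clinical studies worldwide together with rich metadata. The dataset supports several machine learning tasks, such as distinguishing benign from malignant lesions and segmenting polyps, and is intended to advance research on the early diagnosis and treatment of colon cancer.

The content above was collected and summarized by 遇见数据集.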
