CSAW-CC (mammography) – a dataset for AI research to improve screening, diagnostics and prognostics of breast cancer
Source: DataCite Commons. Updated 2026-03-23. Indexed 2025-04-16.
Download link:
https://researchdata.se/catalogue/dataset/2021-204-1
Description:
The dataset contains mammograms (x-ray images) from breast cancer screening at the Karolinska University Hospital, Stockholm, Sweden, collected by principal investigator Fredrik Strand at Karolinska Institutet. The purpose of compiling the dataset was to perform AI research to improve screening, diagnostics and prognostics of breast cancer.
The dataset is based on a selection of cases with and without a breast cancer diagnosis, taken from a more comprehensive source dataset.
A total of 1,103 cases of first-time breast cancer among women in the screening age range (40-74 years) during the included time period (November 2008 to December 2015) were included. Of these, a random selection of 873 cases has been included in the published dataset.
A random selection of 10,000 healthy controls from the same time period was included. Of these, a random selection of 7,850 controls has been included in the published dataset.
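As a quick check on the numbers above, the sampling fractions and the resulting share of cancer cases in the published dataset can be worked out directly. This is only arithmetic on the counts stated in the description; the variable names are illustrative.

```python
# Counts taken from the dataset description.
cases_selected, cases_published = 1103, 873
controls_selected, controls_published = 10000, 7850

# Fraction of each group that made it into the published dataset.
case_fraction = cases_published / cases_selected           # about 0.791
control_fraction = controls_published / controls_selected  # 0.785

# Share of published individuals who are cancer cases.
published_total = cases_published + controls_published     # 8723
published_case_share = cases_published / published_total   # about 0.100

print(f"cases kept:    {case_fraction:.1%}")
print(f"controls kept: {control_fraction:.1%}")
print(f"case share:    {published_case_share:.1%}")
```

Both groups were subsampled at a similar rate (roughly 79%), so the ratio of cases to controls in the published dataset stays close to that of the full selection.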
For each individual, all screening mammograms, including repeated screenings over time, were included, along with the date of screening and the age at screening. In addition, there are pixel-level annotations of the tumors created by a breast radiologist (small lesions such as micro-calcifications have been annotated as an area). Annotations were also drawn in mammograms taken prior to diagnosis; where such an annotation consists of a single pixel, no cancer was seen in that image, and the pixel marks the estimated location of the center of the future cancer.
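The single-pixel convention described above can be sketched in code. This is a minimal illustration, assuming the annotation has already been decoded into a nested list of pixel values in which any non-zero value marks an annotated pixel; the description does not state the actual encoding of the annotation files, and the function names and labels are hypothetical.

```python
def marked_pixels(mask):
    """Return (row, column) coordinates of all non-zero pixels in the mask."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]


def classify_annotation(mask):
    """Label a mask according to the convention in the description.

    - no marked pixels:   nothing was annotated
    - exactly one pixel:  no cancer seen; the pixel is the estimated
                          center of the future cancer
    - more than one:      an annotated area
    """
    count = len(marked_pixels(mask))
    if count == 0:
        return "no_annotation"
    if count == 1:
        return "estimated_future_center"
    return "annotated_area"


# Hypothetical 3x3 masks for illustration only.
prior_exam = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
diagnostic_exam = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
print(classify_annotation(prior_exam), marked_pixels(prior_exam))
print(classify_annotation(diagnostic_exam))
```

Separating the two kinds of annotation matters in practice: a single-pixel mask is a location hint rather than a segmentation target, so it would normally be kept out of any pixel-level overlap metric.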
In addition to images, the dataset also contains cancer data created at the Karolinska University Hospital and extracted through the Regional Cancer Center Stockholm-Gotland. This data contains information about the time of diagnosis and cancer characteristics including tumor size, histology and lymph node metastasis.
The precision of non-image data was decreased, through categorisation and jittering, to ensure that no single individual can be identified.
The following types of files are available:
- CSV: The following data is included (if applicable): cancer/no cancer (meaning breast cancer during 2008 to 2015), age group at screening, days from image to diagnosis (if any), cancer histology, cancer size group, ipsilateral axillary lymph node metastasis. There is one csv file for the entire dataset, with one row per image. Any information about cancer diagnosis is repeated for all rows for an individual who was diagnosed (i.e., it is also included in rows before diagnosis). For each exam date there is the assessment by radiologist 1, radiologist 2 and the consensus decision.
- DICOM: Mammograms. For each screening, four images for the standard views were acquired: left and right, mediolateral oblique and craniocaudal. There should be four files per examination date.
- PNG: Cancer annotations, one for each DICOM image containing a visible tumor.
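Because the CSV has one row per image and each examination date should have four files, a simple consistency check is to group rows by individual and exam date and compare the views found against the four standard ones. The sketch below uses only the Python standard library. The column names (`patient_id`, `exam_date`, `laterality`, `view`) and the sample rows are assumptions made for illustration, not the actual header or contents of the published CSV file.

```python
import csv
import io
from collections import defaultdict

# The four standard screening views: left/right x craniocaudal/mediolateral oblique.
EXPECTED_VIEWS = {("L", "CC"), ("L", "MLO"), ("R", "CC"), ("R", "MLO")}


def group_views_by_exam(csv_text):
    """Map (patient_id, exam_date) to the set of (laterality, view) pairs."""
    exams = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["patient_id"], row["exam_date"])
        exams[key].add((row["laterality"], row["view"]))
    return dict(exams)


def incomplete_exams(exams):
    """Return exam keys whose views are not exactly the four standard views."""
    return sorted(key for key, views in exams.items() if views != EXPECTED_VIEWS)


# Hypothetical rows: one complete exam and one missing the right MLO view.
SAMPLE = """patient_id,exam_date,laterality,view
p001,2010-03-01,L,CC
p001,2010-03-01,L,MLO
p001,2010-03-01,R,CC
p001,2010-03-01,R,MLO
p002,2012-05-10,L,CC
p002,2012-05-10,L,MLO
p002,2012-05-10,R,CC
"""

exams = group_views_by_exam(SAMPLE)
print(incomplete_exams(exams))
```

For the real file, the same functions could be applied to the text of the downloaded CSV after mapping the assumed column names to the actual ones.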
Access:
The dataset is available upon request due to the size of the material. The image files in DICOM and PNG format comprise approximately 2.5 TB.
Access to the CSV file including parametric data is possible via download as associated documentation.
Provider:
Karolinska Institutet
Created:
2022-04-22
Collected summary
Dataset introduction

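Background and challenges
Background overview
CSAW-CC is a large-scale dataset for breast cancer AI research. It contains mammography images and clinical data from the Karolinska University Hospital in Sweden, covering 873 breast cancer cases and 7,850 healthy controls from 2008 to 2015. The dataset is multimodal (DICOM images, pixel-level annotations and CSV parametric data), focuses on improving screening, diagnostics and prognostics, and protects privacy through de-identification of the data. The total size is approximately 2.5 TB, and access is by request.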



