Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset

ModelScope Community | Updated 2025-12-05 | Indexed 2025-10-11
Download link:
https://modelscope.cn/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset
Description:
> **Note:** Our 245 TCGA cases are ones we identified as having potential for improvement.
> We plan to upload them in two phases: a first batch of 138 cases, and a second batch of 107 cases that is still in the quality review pipeline and that we plan to upload around early January 2025.

# Dataset: A Second Opinion on TCGA PRAD Prostate Dataset Labels with ROI-Level Annotations

## Overview

![Example of Annotated WSI](cover2.png)

This dataset provides enhanced Gleason grading annotations for the TCGA PRAD prostate cancer dataset, supported by Region of Interest (ROI)-level spatial annotations. It was developed in collaboration between **[Codatta](https://codatta.io)** and **[DPath.ai](https://dpath.ai)**: **[DPath.ai](https://dpath.ai)** launched a dedicated community via **[Codatta](https://codatta.io)** to assemble a network of pathologists. The dataset improves the accuracy and granularity of the original **[TCGA-PRAD](https://portal.gdc.cancer.gov/projects/TCGA-PRAD)** slide-level labels.

The collaborative effort enabled pathologists worldwide to contribute annotations, improving label reliability for AI model training and advancing pathology research. Unlike traditional labeling marketplaces, the collaborators (that is, the pathologists) retain ownership of the dataset, so their contributions remain recognized, potentially rewarded, and valuable within the community.

Please cite the dataset in any publication or work to acknowledge the collaborative efforts of **[Codatta](https://codatta.io)**, **[DPath.ai](https://dpath.ai)**, and the contributing pathologists.

## Motivation

We discovered significant opportunities for data improvement in the diagnostic labels of the TCGA PRAD dataset, where:

* Some labels could be enhanced with additional diagnostic labels (second opinions).
* Some labels lacked granular descriptions of the Gleason patterns.

This presents a challenge for AI pathology models, as reported high accuracy might reflect learning from labels that themselves have room for improvement.

To address this:

* Pathologists re-annotated slides to improve label quality.
* ROI annotations were introduced to clearly differentiate Gleason tumor grades.
* Each annotation is supported by detailed reasoning, providing transparency and justification for corrections.

## Dataset Contents

This dataset includes two primary files:

1. Slide-Level Labels ([PRAD.csv](https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset/blob/main/dataset/PRAD/PRAD.csv))
   * Contains comprehensive metadata and diagnostic details:
     * `data_profile_uri`: A URI that links to Codatta's web application, offering a detailed view of the slide's metadata and its associated data lineage.
     * `slide_id`: Unique slide identifier.
     * `slide_name`: TCGA Whole Slide Image (WSI) name.
     * `label`: Corrected Gleason grade (e.g., 4+3, 5+4).
     * `diagnosis`: Pathologist-provided reasoning for the label.
     * `num_rois`: Number of labeled Regions of Interest (ROIs) per slide.
2. ROI-Level Annotations (**.geojson**)
   * Provides spatial coordinates for tumor regions:
     * Each ROI corresponds to specific Gleason grades (e.g., Grade 3, Grade 4).
     * Compatible with tools like QuPath for interactive visualization.

## Key Statistics

| Type                     | Counts |
|--------------------------|--------|
| TCGA WSI Total           | 435    |
| Agree with TCGA's label  | 190    |
| Refined label and ROI    | 245    |

> We are currently in the process of uploading data files to meet the quantities mentioned above. This is an ongoing effort to balance community impact and data quality.

## Re-Annotation and Labeling Process

![Workflow of re-annotation and labeling](workflow_re_annotation.png)

1. **Curation of Cases for Enhancement**: An expert committee identified slides requiring review.
2. **Annotation**: Junior pathologists performed initial ROI-level annotations.
3. **Expert Review**: Senior pathologists validated and refined the annotations.
4. **Enhancements**:
   * Granular ROI labeling for tumor regions.
   * Introduction of Minor Grades: For example, Minor Grade 5 indicates <5% of Gleason Grade 5 tumor presence.
   * Pathologist Reasoning: Each label includes a detailed explanation of the annotation process.

Some labels can be further improved by adding alternative opinions.

## Improvements Over TCGA Labels

* **Accuracy**: Enhanced slide-level Gleason labels with additional opinions and improved granularity.
* **Granularity**: Clear ROI-level annotations for primary, secondary, and minor tumor grades.
* **Transparency**: Pathologist-provided reasoning ensures a chain of thought for label decisions.

**Example**:

* Original TCGA label: Gleason Grade 4+3.
* Enhanced label: Gleason Grade 4+3 + Minor Grade 5.

## Usage

### For AI Training Pipelines

Combine Whole Slide Images (WSI) from TCGA PRAD with this dataset's slide-level labels (PRAD.csv) and ROI annotations (.geojson) to generate high-quality [X, y] pairs.

### For Pathology Research

Use the ROI annotations in Whole Slide Images (WSIs) to interactively visualize labeled tumor regions. The slides can be viewed through Codatta's data profile (e.g., https://data.codatta.io/[slide_id]) or other compatible viewers like QuPath. Additionally, explore the detailed reasoning behind Gleason grade decisions to gain insights into tumor composition.

### How to Load the Dataset

1. **CSV File**
   Use pandas to explore slide-level metadata:

   ```python
   import pandas as pd

   # Load the slide-level labels and preview the first rows.
   df = pd.read_csv("PRAD.csv")
   print(df.head())
   ```

2. **GeoJSON File**
   Load ROIs using tools like QuPath or GeoPandas:

   ```python
   import geopandas as gpd

   # Read one ROI annotation file and plot its geometries.
   roi_data = gpd.read_file("annotations.geojson")
   roi_data.plot()
   ```

3. **TCGA Whole Slide Images**
   Original WSI files can be downloaded from the GDC Portal. Match WSI filenames with the `slide_name` column in `PRAD.csv` for integration.

## Example Steps

1. Download **[TCGA PRAD](https://portal.gdc.cancer.gov/projects/TCGA-PRAD)** slides from the GDC Portal.
2. Load [`PRAD.csv`](https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset/blob/main/dataset/PRAD/PRAD.csv) to access corrected labels and reasoning.
3. Visualize ROI annotations using the .geojson files.
4. Train AI models with X = WSI images, y = ROI annotations + slide-level labels.

## Licensing

This dataset is licensed under **OpenRAIL-M** for non-commercial use. Commercial use requires obtaining a separate license. Please contact [hello@codatta.io](mailto:hello@codatta.io) for licensing inquiries.

## Credits

This dataset is a collaboration between:

* [Codatta](https://codatta.io): A platform that brings together a community of anonymous pathologists to collaborate on annotation and labeling tasks.
* [DPath.ai](https://dpath.ai): The initiator of the project, responsible for defining the task scope, data requirements, quality assurance workflows, and recruiting the initial cohort of expert pathologists.

## Contact

For questions, suggestions, or collaborations (such as launching custom data sourcing and labeling tasks), please reach out via:

* **Email**: hello@codatta.io
* **Website**: https://codatta.io
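
To illustrate the example steps, here is a minimal sketch of how slide-level labels and ROI files might be combined into training targets, using only the Python standard library. Several details are assumptions made for illustration and should be checked against the actual `PRAD.csv` and `.geojson` files: the label pattern (`4+3`, optionally followed by `+ Minor Grade 5`), the QuPath-style `properties.classification.name` field, the helper names, and the synthetic sample rows.

```python
import csv
import io
import json
import re
from collections import Counter

# Assumed label pattern: "4+3", optionally followed by "+ Minor Grade 5".
LABEL_RE = re.compile(
    r"(\d)\s*\+\s*(\d)(?:\s*\+\s*Minor Grade\s*(\d))?", re.IGNORECASE
)


def parse_label(label):
    """Split a label such as '4+3 + Minor Grade 5' into grade components."""
    match = LABEL_RE.search(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    primary, secondary, minor = match.groups()
    return {
        "primary": int(primary),
        "secondary": int(secondary),
        "minor": int(minor) if minor else None,
        # The Gleason score is the sum of the primary and secondary patterns.
        "score": int(primary) + int(secondary),
    }


def roi_class_counts(geojson_text):
    """Count ROIs per class name in a GeoJSON FeatureCollection.

    Assumes QuPath-style properties: {"classification": {"name": ...}}.
    """
    collection = json.loads(geojson_text)
    counts = Counter()
    for feature in collection.get("features", []):
        properties = feature.get("properties") or {}
        classification = properties.get("classification") or {}
        counts[classification.get("name", "unclassified")] += 1
    return counts


def build_targets(csv_text, rois_by_slide):
    """Join slide-level rows with ROI counts, keyed by slide_name."""
    targets = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["slide_name"]
        if name not in rois_by_slide:
            continue  # skip slides for which no ROI file was supplied
        counts = roi_class_counts(rois_by_slide[name])
        targets[name] = {
            **parse_label(row["label"]),
            "roi_counts": dict(counts),
            # Flag slides whose ROI file disagrees with the num_rois column.
            "roi_count_ok": sum(counts.values()) == int(row["num_rois"]),
        }
    return targets


# Synthetic stand-ins for PRAD.csv and one .geojson file (not real dataset rows).
SAMPLE_CSV = (
    "slide_id,slide_name,label,num_rois\n"
    "1,slide_a,4+3 + Minor Grade 5,2\n"
)
SAMPLE_GEOJSON = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Polygon",
                         "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 0]]]},
            "properties": {"classification": {"name": "Grade 4"}},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Polygon",
                         "coordinates": [[[20, 20], [30, 20], [30, 30], [20, 20]]]},
            "properties": {"classification": {"name": "Grade 3"}},
        },
    ],
})

targets = build_targets(SAMPLE_CSV, {"slide_a": SAMPLE_GEOJSON})
print(targets)
```

Each entry in `targets` could then be paired with the matching TCGA WSI through `slide_name` to form a `[X, y]` pair. The `roi_count_ok` flag is only a sanity check that the number of parsed ROIs matches the `num_rois` column.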

Provider: maas
Created: 2025-08-18