Supplementary Material for: A novel validated real-world dataset for the diagnosis of multi-class serous effusion cytology according to TIS and ground-truth validation data.

Name: Supplementary Material for: A novel validated real-world dataset for the diagnosis of multi-class serous effusion cytology according to TIS and ground-truth validation data.
Creator: Karger Publishers
Published: 2025-05-01 06:42:13
License: No description available

Source: DataCite Commons (updated 2025-05-01; indexed 2024-08-19)

Download link:

https://karger.figshare.com/articles/dataset/Supplementary_Material_for_A_novel_validated_real-world_dataset_for_the_diagnosis_of_multi-class_serous_effusion_cytology_according_to_TIS_and_ground-truth_validation_data_/25467583/1

Description:

Introduction: The application of AI algorithms in serous fluid cytology is limited by the lack of standardized, publicly available datasets. Here, we develop a novel public serous effusion cytology dataset and apply AI algorithms to it to test its diagnostic utility and safety in clinical practice.

Methods: The work is divided into three phases. Phase 1 entails building the dataset based on the multi-tiered, evidence-based classification system proposed by The International System (TIS) for serous fluid cytology, along with ground-truth tissue diagnosis for malignancy. To ensure reliable results in future AI research on this dataset, we carefully consider every step of preparation and staining from a real-world cytopathology perspective. In Phase 2, we pay special attention to the image acquisition pipeline to ensure image integrity. We then use transfer learning, taking the convolutional layers of the VGG16 deep learning model for feature extraction. Finally, in Phase 3, we apply a random forest classifier to the constructed dataset.

Results: The dataset comprises 3731 images distributed among the four TIS diagnostic categories. The model achieves 74% accuracy on this multiclass classification problem. Using a one-versus-all classifier, the fall-out rate for images misclassified as negative for malignancy despite carrying a higher-risk diagnosis is 0.13. Most of these misclassified images (77%) belong to the atypia of undetermined significance category, in concordance with real-life statistics.

Conclusion: This is the first and largest publicly available serous fluid cytology dataset based on a standardized diagnostic system. It is also the first dataset to include various types of effusions, the first to include pericardial fluid specimens, and the first to include the diagnostically challenging atypical categories. AI algorithms applied to this novel dataset show reliable results that can be incorporated into actual clinical practice with minimal risk of missing a diagnosis of malignancy. This work provides a foundation for researchers to develop and test further AI algorithms for the diagnosis of serous effusions.
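
The Results quote two kinds of figures: overall multiclass accuracy and a one-versus-all fall-out rate for the negative-for-malignancy class. The sketch below shows one plausible way such figures can be derived from a confusion matrix. It is not the authors' code. The confusion matrix is made up for illustration and does not reproduce the reported 74%, 0.13, or 77%. The label abbreviations (in particular `SFM` and `MAL`) and all function names are assumptions.

```python
from typing import Dict, List

# Assumed label order; only "negative for malignancy" (NFM) and "atypia of
# undetermined significance" (AUS) are named in the abstract. SFM and MAL
# are placeholders for the two remaining categories.
LABELS = ["NFM", "AUS", "SFM", "MAL"]

# Hypothetical confusion matrix: rows are true classes, columns are
# predicted classes. These counts are invented for the example.
CM: List[List[int]] = [
    [80, 10, 5, 5],    # true NFM
    [20, 60, 10, 10],  # true AUS
    [4, 10, 70, 16],   # true SFM
    [2, 5, 13, 80],    # true MAL
]


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of all images whose predicted class equals the true class."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def fall_out(cm: List[List[int]], k: int) -> float:
    """One-versus-all fall-out (false positive rate) for class k.

    False positives are images of any other class predicted as k; the
    denominator is every image whose true class is not k.
    """
    others = [i for i in range(len(cm)) if i != k]
    false_pos = sum(cm[i][k] for i in others)
    negatives = sum(sum(cm[i]) for i in others)
    return false_pos / negatives


def false_positive_shares(cm: List[List[int]], k: int) -> Dict[str, float]:
    """Share of class-k false positives contributed by each other true class."""
    others = [i for i in range(len(cm)) if i != k]
    false_pos = sum(cm[i][k] for i in others)
    return {LABELS[i]: cm[i][k] / false_pos for i in others}


if __name__ == "__main__":
    nfm = LABELS.index("NFM")
    print("accuracy:", accuracy(CM))
    print("NFM fall-out:", fall_out(CM, nfm))
    print("false-positive shares:", false_positive_shares(CM, nfm))
```

Under this reading, the fall-out for the negative class measures how often a higher-risk image is wrongly called negative, which is the clinically costly error, and the per-class shares show which true category accounts for most of those errors.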

Provider:

Karger Publishers

Created:

2024-03-24