
patho-ssl-data-curation

Source: ModelScope community. Updated 2025-09-03; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/swiss-ai/patho-ssl-data-curation
Resource description:
# Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology

**Abstract**

Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further find that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.

## Data

We provide the following data:

1. **Clustering result:**
   - `clustering_results/clustering_{t1,t2}.csv`: Hierarchical cluster labels for each of the 350M tiles for modes t1 and t2, as CSV files. Each CSV contains the slide_id (the WSI the tile originates from), the (x,y)-coordinate of the tile at level 0 (the highest-resolution pyramid level of the WSI), and the cluster label at each level (columns "level_1" to "level_4").

     The slides can be downloaded from the [TCGA](https://portal.gdc.cancer.gov/) and [GTEx](https://www.gtexportal.org/home/histologyPage) websites. For tile extraction from the WSIs we recommend [openslide](https://openslide.org/api/python/).

     Structure of `clustering_results/clustering_{t1,t2}.csv`:

     | slide\_id | tile\_x | tile\_y | level\_1 | level\_2 | level\_3 | level\_4 |
     | ------------------------------------------------- | ------- | ------- | -------- | -------- | -------- | -------- |
     | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 32406 | 10621 | 1301309 | 17404 | 2 | 24 |
     | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 32850 | 10621 | 3481104 | 17557 | 343 | 8 |
     | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 30630 | 11064 | 2269415 | 34147 | 2 | 24 |
     | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 31074 | 11064 | 3352403 | 3486 | 2 | 24 |
     | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 31519 | 11064 | 3352388 | 11187 | 2 | 24 |

     - **slide\_id**: Unique identifier for the slide image.
     - **tile\_x, tile\_y**: Coordinates of the tile within the slide at level 0, the highest-resolution pyramid level. The tiles are of size `224px × 224px` at 20x magnification (= `112µm × 112µm`).
     - **level\_1 to level\_4**: Hierarchical cluster labels the tile is associated with.
   - `clustering_results/kmeans_centroids`: K-means cluster centroids at each level. These allow any 224×224px tile encoded with UNI to be associated with the nearest k-means cluster at each level.

3. **Visualization tool**: We provide (1) UMAP coordinates for 2 million curated tiles in [metadata_N=2M.csv](https://huggingface.co/datasets/swiss-ai/patho-ssl-data-curation/blob/main/visualization_tool/metadata_N%3D2M.csv) for the t1 setting (62 level_4 clusters), (2) tile images for a representative subset of 500k tiles, packaged in 10 tar files (`tiles_0000.tar` to `tiles_0009.tar`, each containing 224×224px PNGs), and (3) the file [metadata.csv](https://huggingface.co/datasets/swiss-ai/patho-ssl-data-curation/blob/main/visualization_tool/metadata.csv), which contains the metadata for this 500k-tile subset. We only provide the 500k tiles and not the full 2M due to file size limitations. Please download all files for use with our visualization tool code at [Visualization Tool Github](https://github.com/lely475/patho-ssl-data-curation/tree/main/visualization_tool).

## Citation

Please cite our publication if you use the provided data.

```
@misc{chen2025revisitingautomaticdatacuration,
  title={Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology},
  author={Boqi Chen and Cédric Vincent-Cuaz and Lydia A. Schoenpflug and Manuel Madeira and Lisa Fournier and Vaishnavi Subramanian and Sonali Andani and Samuel Ruiperez-Campillo and Julia E. Vogt and Raphaëlle Luisier and Dorina Thanou and Viktor H. Koelzer and Pascal Frossard and Gabriele Campanella and Gunnar Rätsch},
  year={2025},
  eprint={2503.18709},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.18709},
}
```
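As a rough illustration of how the hierarchical labels in the clustering CSV could be used to draw a balanced sample, the sketch below descends the label hierarchy from `level_4` to `level_1`, choosing uniformly among the child clusters at each step, and then picks one tile from the reached leaf. This is a simplified sketch and not necessarily the sampling procedure used in the paper. It assumes the CSV header matches the column names shown in the table and that the levels are nested; the helper names `build_tree` and `sample_tile` are hypothetical.

```python
import csv
import io
import random
from collections import Counter

# Coarse-to-fine order of the label columns (assumed to be nested).
LEVELS = ["level_4", "level_3", "level_2", "level_1"]

# The five example rows from the table; slide_id is truncated in the source.
SAMPLE_CSV = """slide_id,tile_x,tile_y,level_1,level_2,level_3,level_4
TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39...,32406,10621,1301309,17404,2,24
TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39...,32850,10621,3481104,17557,343,8
TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39...,30630,11064,2269415,34147,2,24
TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39...,31074,11064,3352403,3486,2,24
TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39...,31519,11064,3352388,11187,2,24
"""


def build_tree(rows):
    """Nest rows as level_4 -> level_3 -> level_2 -> level_1 -> [rows]."""
    tree = {}
    for row in rows:
        node = tree
        for level in LEVELS[:-1]:
            node = node.setdefault(row[level], {})
        node.setdefault(row[LEVELS[-1]], []).append(row)
    return tree


def sample_tile(tree, rng):
    """Pick a child cluster uniformly at each level, then a tile uniformly."""
    node = tree
    while isinstance(node, dict):
        node = node[rng.choice(sorted(node))]
    return rng.choice(node)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
tree = build_tree(rows)
rng = random.Random(0)
draws = [sample_tile(tree, rng) for _ in range(2000)]
counts = Counter(d["level_4"] for d in draws)
print(counts)
```

In the five example rows, four tiles fall into `level_4` cluster 24 and one into cluster 8. Sampling tiles uniformly would pick cluster 8 about 20% of the time, whereas the hierarchical draw picks each top-level cluster about half of the time. For the full 350M-row files a streaming or chunked reader would be needed instead of loading all rows into memory.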

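The nearest-cluster association enabled by `clustering_results/kmeans_centroids` can be sketched as a nearest-centroid lookup per level. The example below is only a minimal sketch: the on-disk format of the centroid files is not described here, so the centroids are passed in as plain lists, the 2-D vectors are toy values rather than real UNI embeddings, Euclidean distance is an assumption, and the function names are hypothetical.

```python
import math


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(embedding, centroids[k]))


def assign_levels(embedding, centroids_per_level):
    """Map each level name to the index of its nearest centroid."""
    return {level: nearest_centroid(embedding, centroids)
            for level, centroids in centroids_per_level.items()}


# Toy 2-D centroids standing in for the real per-level centroid arrays.
toy_centroids = {
    "level_1": [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]],
    "level_2": [[0.0, 0.0], [8.0, 8.0]],
}
labels = assign_levels([9.0, 1.0], toy_centroids)
print(labels)
```

For real embeddings and large centroid sets, a vectorized distance computation (for example with NumPy) would be the more practical choice; the pure-Python loop is kept here for readability.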
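The stated tile size (224 px covering 112 µm) corresponds to 0.5 µm per pixel, while the tile coordinates are given at level 0, whose resolution depends on the scanner. The sketch below shows one way the level-0 extent of a tile could be derived from a slide's microns-per-pixel value and then read with OpenSlide. It rests on assumptions: that (`tile_x`, `tile_y`) is the top-left corner of the tile, that the slide exposes the `openslide.mpp-x` property, and that reading at level 0 and resizing is acceptable. The authors' extraction pipeline may differ, and both function names are hypothetical.

```python
def level0_extent_px(mpp_level0: float, tile_um: float = 112.0) -> int:
    """Number of level-0 pixels spanned by a tile of `tile_um` microns."""
    return round(tile_um / mpp_level0)


def read_tile(slide_path, x, y, tile_px=224, tile_um=112.0):
    """Read one tile at level-0 position (x, y) and resize it to tile_px.

    Assumes (x, y) is the top-left corner of the tile.
    """
    import openslide  # third-party; imported lazily so the helper above has no dependencies

    slide = openslide.OpenSlide(slide_path)
    try:
        mpp = float(slide.properties[openslide.PROPERTY_NAME_MPP_X])
        extent = level0_extent_px(mpp, tile_um)
        # read_region takes level-0 coordinates and returns an RGBA PIL image.
        region = slide.read_region((x, y), 0, (extent, extent)).convert("RGB")
        return region.resize((tile_px, tile_px))
    finally:
        slide.close()


# A slide scanned at 0.5 um/px needs no rescaling; at about 0.252 um/px
# the same 112 um tile spans about 444 level-0 pixels.
print(level0_extent_px(0.5), level0_extent_px(0.252))
```

In the example rows of the clustering table, horizontally adjacent tiles are 444 to 445 level-0 pixels apart, which would be consistent with a slide scanned at roughly 0.25 µm per pixel. That reading is an inference from the sample rows, not a documented property of the dataset.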
Provider:
maas
Created:
2025-09-03