Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models
Source: DataCite Commons | Updated: 2026-03-13 | Indexed: 2026-04-25
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.6djh9w174
Description:
Data scarcity presents a significant obstacle in the field of biomedicine,
where acquiring diverse and sufficient datasets can be costly and
challenging. Synthetic data generation offers a potential solution to this
problem by expanding dataset sizes, thereby enabling the training of more
robust and generalizable machine learning models. Although previous
studies have explored synthetic data generation for cancer diagnosis, they
have predominantly focused on single-modality settings, such as
whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a
novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing
RNA-to-image synthesis in a multi-cancer context, drawing inspiration from
successful text-to-image synthesis models for natural images. In our
approach, we employ a variational auto-encoder to reduce the
dimensionality of a patient’s gene expression profile, effectively
distinguishing between different types of cancer. Subsequently, we employ
a cascaded diffusion model to synthesize realistic whole-slide image tiles
using the latent representation derived from the patient’s RNA-Seq data.
Our results demonstrate that the generated tiles accurately preserve the
distribution of cell types observed in real-world data, with
state-of-the-art cell identification models successfully detecting
important cell types in the synthetic samples. Furthermore, we illustrate
that the synthetic tiles maintain the cell fractions observed in bulk
RNA-Seq data and that modifications in gene expression affect the
composition of cell types in the synthetic tiles. Next, we utilize the
synthetic data generated by RNA-CDM to pretrain machine learning models
and observe improved performance compared to training from scratch. Our
study emphasizes the potential usefulness of synthetic data in developing
machine learning models in scarce-data settings, while also highlighting
the possibility of imputing missing data modalities by leveraging the
available information. In conclusion, our proposed RNA-CDM approach for
synthetic data generation in biomedicine, particularly in the context of
cancer diagnosis, offers a novel and promising solution to address data
scarcity. By generating synthetic data that align with real-world
distributions and using them to pretrain machine learning models, we
contribute to the development of robust clinical decision support systems
and potential advancements in precision medicine.
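
The two stages named in the description (a variational auto-encoder that compresses gene expression, then a diffusion model conditioned on the resulting latent) can be illustrated with a toy sketch. This is not the RNA-CDM code: the linear encoder, the 8-gene input, the 2-dimensional latent, the 16-pixel tile, and the function names encode, reparameterize, and forward_noise are all hypothetical. It shows only the standard VAE reparameterization step and the standard closed-form forward noising step of a diffusion model. The learned denoiser, which would take the latent as conditioning input, and the super-resolution stage of the cascade are omitted.

```python
import math
import random


def encode(expression, weights_mu, weights_logvar):
    """Toy linear VAE encoder (assumption): map expression to latent mean and log-variance."""
    mu = [sum(w * x for w, x in zip(row, expression)) for row in weights_mu]
    logvar = [sum(w * x for w, x in zip(row, expression)) for row in weights_logvar]
    return mu, logvar


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def forward_noise(pixels, alpha_bar, rng):
    """Closed-form forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
n_genes, latent_dim = 8, 2  # illustrative sizes only

# Random stand-ins for a gene expression profile and for encoder weights.
expression = [rng.random() for _ in range(n_genes)]
weights_mu = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(latent_dim)]
weights_logvar = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(latent_dim)]

# Stage 1: compress the expression profile into a latent vector.
mu, logvar = encode(expression, weights_mu, weights_logvar)
z = reparameterize(mu, logvar, rng)

# Stage 2 (training side only): noise a flattened 4x4 tile at one timestep.
# A denoiser conditioned on z would be trained to predict the added noise.
tile = [0.5] * 16
noisy_tile = forward_noise(tile, alpha_bar=0.9, rng=rng)
```

In a cascaded setup, a base model trained this way would produce a low-resolution tile, and a second diffusion model would upsample it; both could receive z as conditioning.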
Provider:
Dryad
Created:
2023-11-03



