Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.zw3r228jc
Resource description:
Evidence before this study
We conducted an extensive literature search using Google Scholar without language restrictions, employing search terms such as “(Predicting OR Classifying OR Annotating) and (cancer hallmarks) AND (Deep OR Machine Learning) OR (Artificial Intelligence OR AI).” Despite notable advances in molecular oncology and computational methodologies, a critical gap remains: no existing machine learning or deep learning framework comprehensively predicts cancer hallmarks from tumor biopsy samples. Current research primarily targets specific molecular pathways associated with individual hallmarks, leaving clinicians without an integrated model to interpret hallmark activity at the level of an individual tumor. Moreover, the absence of wet-lab techniques capable of annotating all cancer hallmarks in biopsy samples has further impeded progress, limiting the clinical utility of hallmark-related insights for precision oncology.
Added value of this study
This study introduces OncoMark, a novel neural multi-task learning (N-MTL) framework designed to predict cancer hallmark activity from transcriptomic data obtained from biopsy samples. OncoMark addresses the lack of hallmark-specific data by generating synthetic biopsy datasets annotated with hallmark activity, modeled to reflect real-world tumor biology while maintaining clinical relevance. The framework employs a multi-task learning approach to capture interdependencies among hallmarks, advancing beyond isolated predictions to offer a holistic view of tumor biology. Validation on six independent datasets comprising 159 patient samples demonstrated its generalizability and reproducibility. Further external validation using eight datasets, encompassing over 11,679 cancer and 8,348 normal patient samples, reinforced its robustness. To promote clinical integration, a user-friendly web-based tool was developed, enabling seamless access for oncologists and researchers.
Implications of all the available evidence
The OncoMark framework represents a transformative advancement in cancer diagnostics and treatment planning. By enabling accurate and reproducible prediction of hallmark activity from biopsy samples, this model paves the way for precision oncology at scale. Its ability to systematically capture hallmark interdependencies provides deeper insights into tumor behavior, guiding the development of individualized, targeted therapies. The incorporation of a web-based interface makes this innovation accessible to clinicians worldwide, bridging the gap between computational oncology and clinical practice. Following further validation and integration into healthcare workflows, OncoMark has the potential to improve cancer outcomes by delivering timely, cost-effective, and precise tumor analyses that support informed therapeutic decision-making.
Methods
Dataset Collection and Processing
We utilized a large-scale dataset comprising 2.7 million single-cell transcriptomes derived from 14 tumor types, collected from 922 patients across 51 independent studies conducted globally. This dataset was sourced from the Weizmann Institute's 3CA repository.
Quality Control
Before generating synthetic datasets for model training, the raw single-cell transcriptomic data underwent a rigorous quality control (QC) process. To ensure data reliability, cells were excluded if they had more than 15% mitochondrial transcript content, fewer than 200 expressed mRNA transcripts, or more than 6,000 expressed mRNA transcripts.
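The three QC cutoffs can be expressed as a single per-cell filter. The following is a minimal sketch only, not the authors' pipeline: the `CellQC` record, its field names, and the toy cells are hypothetical, and it assumes the mitochondrial percentage is computed as mitochondrial counts divided by total counts in the cell.

```python
from dataclasses import dataclass

# Cutoffs taken from the text above.
MAX_MITO_FRACTION = 0.15   # exclude cells with over 15% mitochondrial content
MIN_EXPRESSED = 200        # exclude cells with fewer than 200 expressed transcripts
MAX_EXPRESSED = 6000       # exclude cells with more than 6,000 expressed transcripts


@dataclass
class CellQC:
    """Hypothetical per-cell QC summary (field names are illustrative)."""
    cell_id: str
    total_counts: int   # all transcript counts in the cell
    mito_counts: int    # counts assigned to mitochondrial genes
    n_expressed: int    # number of expressed mRNA transcripts


def passes_qc(cell: CellQC) -> bool:
    """Return True if the cell survives all three cutoffs."""
    if cell.total_counts == 0:
        return False  # an empty cell cannot be assessed, so drop it
    mito_fraction = cell.mito_counts / cell.total_counts
    return (
        mito_fraction <= MAX_MITO_FRACTION
        and MIN_EXPRESSED <= cell.n_expressed <= MAX_EXPRESSED
    )


# Toy cells, made up for illustration.
cells = [
    CellQC("c1", 10000, 500, 2500),    # 5% mito, 2,500 expressed: kept
    CellQC("c2", 10000, 2000, 2500),   # 20% mito: excluded
    CellQC("c3", 1000, 50, 150),       # only 150 expressed: excluded
    CellQC("c4", 50000, 1000, 7200),   # 7,200 expressed: excluded
]
kept = [c.cell_id for c in cells if passes_qc(c)]
```

Cells sitting exactly on a cutoff (15%, 200, or 6,000) are kept here, following the "over", "fewer than", and "more than" wording of the description.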
Gene Set Curation
Gene sets representing cancer hallmarks were compiled from multiple databases, retaining only genes identified in at least two independent sources. This selection was refined through manual literature reviews to exclude genes without direct or indirect roles in hallmark-related pathways.
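The "at least two independent sources" rule amounts to counting how many sources list each gene. The sketch below illustrates only that automated step; the source names and gene identifiers are placeholders, not the databases or genes used in the study, and the manual literature review is not modeled.

```python
from collections import Counter


def genes_in_at_least(sources, min_sources=2):
    """Keep genes that appear in at least `min_sources` of the given gene sets.

    `sources` maps a source name to the set of genes it lists for one hallmark.
    """
    counts = Counter(gene for genes in sources.values() for gene in genes)
    return {gene for gene, n in counts.items() if n >= min_sources}


# Placeholder sources and gene names, for illustration only.
sources = {
    "database_1": {"GENE_A", "GENE_B", "GENE_C"},
    "database_2": {"GENE_A", "GENE_C", "GENE_D"},
    "database_3": {"GENE_A", "GENE_E"},
}
curated = genes_in_at_least(sources)
```

Because each source is a set, a gene listed twice within one database still counts once, so the tally reflects independent sources, not repeated entries.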
Digital Scoring
Using the curated gene sets, Digital Scores were calculated for each of the 10 cancer hallmarks across all cells using the Mann-Whitney U test. To ensure robust binary classification, hallmark presence or absence was determined through Otsu’s thresholding method. Tissue-specific digital score thresholds were calculated to account for variations in hallmark expression across different tumor tissues.
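The scoring and thresholding steps can be sketched as follows. The description does not state how the Mann-Whitney U test is applied, so this example makes assumptions: within each cell, hallmark genes are compared against all other genes, and the score is the U statistic divided by the number of gene pairs. The function names, the 64-bin histogram, and the toy scores are illustrative and not taken from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def digital_score(hallmark_expr, background_expr):
    """Mann-Whitney U of hallmark genes vs. other genes, scaled to [0, 1]."""
    n1, n2 = len(hallmark_expr), len(background_expr)
    ranks = average_ranks(list(hallmark_expr) + list(background_expr))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    return u1 / (n1 * n2)


def otsu_threshold(scores, bins=64):
    """Cut point maximizing between-class variance of a score histogram."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for s in scores:
        hist[min(int((s - lo) / width), bins - 1)] += 1
    centers = [lo + (b + 0.5) * width for b in range(bins)]
    total = len(scores)
    total_sum = sum(h * c for h, c in zip(hist, centers))
    best_var, best_cut = -1.0, lo
    w0, sum0 = 0, 0.0
    for b in range(bins - 1):
        w0 += hist[b]
        sum0 += hist[b] * centers[b]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        between = w0 * w1 * (sum0 / w0 - (total_sum - sum0) / w1) ** 2
        if between > best_var:
            best_var, best_cut = between, lo + (b + 1) * width
    return best_cut


# One threshold per tissue, as described above (toy scores, made up).
scores_by_tissue = {
    "tissue_a": [0.10, 0.11, 0.12, 0.14, 0.80, 0.82, 0.85, 0.90],
    "tissue_b": [0.30, 0.32, 0.35, 0.60, 0.62, 0.65],
}
thresholds = {t: otsu_threshold(s) for t, s in scores_by_tissue.items()}
calls = {t: [s > thresholds[t] for s in scores_by_tissue[t]]
         for t in scores_by_tissue}
```

Computing the threshold separately for each tissue means the same score can be called positive in one tissue and negative in another, which is the behavior the tissue-specific thresholds are meant to allow.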
Synthetic Data Generation
To simulate clinical biopsy conditions while preserving biological fidelity, synthetic biopsy datasets were created by aggregating 200 hallmark-specific cells from each patient sample. Cells were grouped by hallmark status (positive or negative) to generate distinct hallmark-specific synthetic samples, ensuring no overlap across samples and minimizing cross-sample contamination. Synthetic datasets with positive and negative ground truths were created separately for each hallmark, facilitating robust model training and mimicking the heterogeneous composition of real-world clinical samples.
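The aggregation step can be pictured as splitting one patient's cells by hallmark status and collapsing disjoint groups of 200 cells into pseudo-bulk profiles. This is a hedged sketch and not the published procedure: it assumes aggregation means summing counts per gene, that leftover cells short of a full group are discarded, and that cells are shuffled before grouping. All function names and the toy counts are hypothetical.

```python
import random


def make_pseudobulk(cells, cells_per_sample=200, seed=0):
    """Sum disjoint chunks of per-cell count vectors into synthetic samples.

    `cells` holds count vectors (one list per cell, same gene order) from a
    single patient and a single hallmark status. Each cell is used at most once.
    """
    shuffled = list(cells)
    random.Random(seed).shuffle(shuffled)
    samples = []
    for start in range(0, len(shuffled) - cells_per_sample + 1, cells_per_sample):
        chunk = shuffled[start:start + cells_per_sample]
        samples.append([sum(gene_counts) for gene_counts in zip(*chunk)])
    return samples


def pseudobulk_by_status(cells, is_positive, cells_per_sample=200, seed=0):
    """Build separate synthetic samples for hallmark-positive and -negative cells."""
    groups = {True: [], False: []}
    for vector, flag in zip(cells, is_positive):
        groups[bool(flag)].append(vector)
    return {
        status: make_pseudobulk(vectors, cells_per_sample, seed)
        for status, vectors in groups.items()
    }


# Toy patient: 400 positive and 250 negative cells over 3 genes (made up).
toy_cells = [[2, 0, 1]] * 400 + [[0, 3, 1]] * 250
toy_labels = [True] * 400 + [False] * 250
synthetic = pseudobulk_by_status(toy_cells, toy_labels)
```

In this toy case the positive group yields two synthetic samples and the negative group one, with the 50 leftover negative cells unused. Chunking a shuffled list, instead of sampling with replacement, is what keeps the samples free of shared cells.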
Validation
For validation purposes, six external studies were processed using the same synthetic data creation methods applied to the training data. In these datasets, all hallmark-positive cells for each patient were pooled to generate synthetic datasets resembling bulk RNA sequencing data. This approach ensured consistency in data processing while allowing the model to generalize effectively to clinically relevant bulk transcriptomic datasets.
Created: 2025-10-22



