
Synthetic bulk RNA-Seq transcriptomic profiles representing 10 Cancer hallmarks

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.zw3r228jc
Resource description:
Evidence before this study

We conducted an extensive literature search using Google Scholar without language restrictions, employing search terms such as “(Predicting OR Classifying OR Annotating) and (cancer hallmarks) AND (Deep OR Machine Learning) OR (Artificial Intelligence OR AI).” Despite notable advances in molecular oncology and computational methodologies, a critical gap remains: no existing machine learning or deep learning framework comprehensively predicts cancer hallmarks from tumor biopsy samples. Current research primarily targets specific molecular pathways associated with individual hallmarks, leaving clinicians without an integrated model to interpret hallmark activity at the level of an individual tumor. Moreover, the absence of wet-lab techniques capable of annotating all cancer hallmarks in biopsy samples has further impeded progress, limiting the clinical utility of hallmark-related insights for precision oncology.

Added value of this study

This study introduces OncoMark, a novel neural multi-task learning (N-MTL) framework designed to predict cancer hallmark activity from transcriptomic data obtained from biopsy samples. OncoMark addresses the lack of hallmark-specific data by generating synthetic biopsy datasets annotated with hallmark activity, modeled to reflect real-world tumor biology while maintaining clinical relevance. The framework employs a multi-task learning approach to capture interdependencies among hallmarks, advancing beyond isolated predictions to offer a holistic view of tumor biology. Validation on six independent datasets comprising 159 patient samples demonstrated its generalizability and reproducibility. Further external validation using eight datasets, encompassing over 11,679 cancer and 8,348 normal patient samples, reinforced its robustness. To promote clinical integration, a user-friendly web-based tool was developed, giving oncologists and researchers direct access to the model.

Implications of all the available evidence

The OncoMark framework represents a transformative advancement in cancer diagnostics and treatment planning. By enabling accurate and reproducible prediction of hallmark activity from biopsy samples, this model paves the way for precision oncology at scale. Its ability to systematically capture hallmark interdependencies provides deeper insights into tumor behavior, guiding the development of individualized, targeted therapies. The web-based interface makes the model accessible to clinicians worldwide, bridging the gap between computational oncology and clinical practice. Following further validation and integration into healthcare workflows, OncoMark has the potential to improve cancer outcomes by delivering timely, cost-effective, and precise tumor analyses that support informed therapeutic decision-making.

Methods

Dataset Collection and Processing

We utilized a large-scale dataset comprising 2.7 million single-cell transcriptomes derived from 14 tumor types, collected from 922 patients across 51 independent studies conducted globally. This dataset was sourced from the Weizmann Institute's 3CA repository.

Quality Control

Before generating synthetic datasets for model training, the raw single-cell transcriptomic data underwent a rigorous quality control (QC) process. Cells with more than 15% mitochondrial transcript content, fewer than 200 expressed mRNA transcripts, or more than 6,000 expressed mRNA transcripts were excluded to ensure data reliability.

Gene Set Curation

Gene sets representing cancer hallmarks were compiled from multiple databases, retaining only genes identified in at least two independent sources. This selection was refined through manual literature reviews to exclude genes without direct or indirect roles in hallmark-related pathways.

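
The Quality Control and Gene Set Curation steps above can be illustrated with a minimal sketch. It is not the authors' code: the dictionary keys `mito_fraction` and `n_transcripts`, the function names, and the placeholder database and gene names (`db_a`, `GENE_A`, and so on) are all hypothetical. Only the three QC cutoffs and the "at least two sources" rule come from the description; treating the cutoffs as inclusive at the boundaries is an assumption, and the manual literature review step is not modeled.

```python
from collections import Counter

# Cutoffs stated in the Quality Control description.
MAX_MITO_FRACTION = 0.15   # cells above 15% mitochondrial content are dropped
MIN_TRANSCRIPTS = 200      # cells below 200 expressed transcripts are dropped
MAX_TRANSCRIPTS = 6000     # cells above 6,000 expressed transcripts are dropped


def passes_qc(cell):
    """Return True if a cell survives the QC filter.

    `cell` is a hypothetical dict with `mito_fraction` (0..1) and
    `n_transcripts` (number of expressed mRNA transcripts).
    """
    if cell["mito_fraction"] > MAX_MITO_FRACTION:
        return False
    return MIN_TRANSCRIPTS <= cell["n_transcripts"] <= MAX_TRANSCRIPTS


def curate_gene_set(sources, min_sources=2):
    """Keep genes listed by at least `min_sources` databases.

    `sources` maps a database name to the genes it lists for one hallmark.
    """
    counts = Counter(gene for genes in sources.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_sources}


# Toy input; every value here is made up for illustration.
cells = [
    {"id": "c1", "mito_fraction": 0.05, "n_transcripts": 1500},
    {"id": "c2", "mito_fraction": 0.22, "n_transcripts": 1500},  # high mito
    {"id": "c3", "mito_fraction": 0.05, "n_transcripts": 120},   # too few
    {"id": "c4", "mito_fraction": 0.05, "n_transcripts": 7200},  # too many
]
kept = [c["id"] for c in cells if passes_qc(c)]

curated = curate_gene_set({
    "db_a": {"GENE_A", "GENE_B"},
    "db_b": {"GENE_B", "GENE_C"},
    "db_c": {"GENE_A", "GENE_B"},
})
```

In this toy run only `c1` passes the filter, and `GENE_C` is dropped from the curated set because a single database lists it.
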
Digital Scoring

Using the curated gene sets, Digital Scores were calculated for each of the 10 cancer hallmarks across all cells using the Mann-Whitney U test. To ensure robust binary classification, hallmark presence or absence was determined through Otsu’s thresholding method. Tissue-specific digital score thresholds were calculated to account for variations in hallmark expression across different tumor tissues.

Synthetic Data Generation

To simulate clinical biopsy conditions while preserving biological fidelity, synthetic biopsy datasets were created by aggregating 200 hallmark-specific cells from each patient sample. Cells were grouped by hallmark status (positive or negative) to generate distinct hallmark-specific synthetic samples, ensuring no overlap across samples and minimizing cross-sample contamination. Synthetic datasets with positive and negative ground truths were created separately for each hallmark, facilitating robust model training and mimicking the heterogeneous composition of real-world clinical samples.

Validation

For validation purposes, six external studies were processed using the same synthetic data creation methods applied to the training data. In these datasets, all hallmark-positive cells for each patient were pooled to generate synthetic datasets resembling bulk RNA sequencing data. This approach ensured consistency in data processing while allowing the model to generalize effectively to clinically relevant bulk transcriptomic datasets.
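
The Digital Scoring and Synthetic Data Generation steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published pipeline. The description says only that scores come from the Mann-Whitney U test, so comparing hallmark genes against all other genes within a cell and scaling U by n1 * n2 is an assumption. The Otsu step is a histogram-free variant that returns the midpoint between the two classes. Summing counts and discarding leftover cells during aggregation are also assumptions, and all function names are hypothetical.

```python
def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def digital_score(expression, hallmark_genes):
    """Mann-Whitney U of hallmark genes vs. other genes, scaled to [0, 1].

    ASSUMPTION: the scaling by n1 * n2 is not stated in the source.
    `expression` maps gene name to expression value for one cell.
    """
    genes = list(expression)
    ranks = rank_with_ties([expression[g] for g in genes])
    in_set = [r for g, r in zip(genes, ranks) if g in hallmark_genes]
    n1, n2 = len(in_set), len(genes) - len(in_set)
    if n1 == 0 or n2 == 0:
        raise ValueError("need genes both inside and outside the gene set")
    u = sum(in_set) - n1 * (n1 + 1) / 2
    return u / (n1 * n2)


def otsu_threshold(scores):
    """Return the cut that maximises between-class variance."""
    values = sorted(scores)
    n, total = len(values), sum(values)
    best_cut, best_var, left_sum = values[0], -1.0, 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # no cut fits between two equal values
        mu0, mu1 = left_sum / i, (total - left_sum) / (n - i)
        var = (i / n) * ((n - i) / n) * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_cut = var, (values[i - 1] + values[i]) / 2
    return best_cut


def pseudo_bulk(cells, size=200):
    """Sum gene counts over disjoint groups of `size` cells.

    ASSUMPTION: leftover cells that do not fill a group are discarded.
    """
    samples = []
    for start in range(0, len(cells) - size + 1, size):
        profile = {}
        for cell in cells[start:start + size]:
            for gene, count in cell.items():
                profile[gene] = profile.get(gene, 0) + count
        samples.append(profile)
    return samples
```

In use, `otsu_threshold` would be applied to the scores of one hallmark within one tissue, mirroring the tissue-specific thresholds in the description, and `pseudo_bulk` would be called separately on the positive and the negative cells of each patient so that no cell contributes to more than one synthetic sample.
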
Created: 2025-10-22