Additional Files for Transcript-Centric Gene Fusion Validation Pipeline
Source: DataCite Commons. Updated: 2025-11-13. Indexed: 2026-04-25.
Download link:
https://figshare.com/articles/dataset/Additional_Files_for_Transcript-Centric_Gene_Fusion_Validation_Pipeline/30609155
Resource description:
<b>Supplementary Data for: A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data</b><br>
<b>Description</b><br>
This repository contains the supplementary materials, datasets, and performance metrics associated with the research article: <b>"A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data."</b><br>
<b>Study Overview:</b> Accurate validation of gene fusions from long-read RNA sequencing (RNA-Seq) data is a critical challenge in cancer genomics, often hampered by high false-positive rates and a reliance on matched whole-genome sequencing (WGS) data. This study introduces a novel, WGS-independent validation pipeline that utilizes transcript-centric evidence (full-length spanning reads, supplementary alignments, and realigned soft-clips) and a Random Forest machine learning classifier to automate fusion validation.<br>
<b>Contents of this Repository:</b> The following files are provided to support the reproducibility and benchmarking of the reported results:<br>
<b>Additional file 1: Evaluation of Random Forest Hyperparameters using 10-Fold Cross-Validation.</b> A visualization of the hyperparameter tuning process, showing performance metrics (Accuracy, AUC-PR, etc.) across various model configurations (number of trees, tree depth, etc.) used to select the optimal Random Forest model.<br>
<b>Additional file 2: Validated gene fusions in cancer cell lines.</b> A comprehensive dataset listing all gene fusions validated by the pipeline across the five cancer cell lines analyzed (MCF7, A549, K562, HCT116, HepG2). This includes both known fusions (e.g., <i>BCAS4-BCAS3</i>) and novel discoveries (e.g., <i>MOV10-RHOC</i>).<br>
<b>Additional file 3: Merged Reference Database of Known Gene Fusions.</b> A consolidated reference dataset combining gene fusion entries from the COSMIC and FusionGDB databases. This file served as the external knowledge base for cross-referencing fusion candidates and establishing the ground truth labels for the machine learning model.<br>
<b>Additional file 4: Example summary of random forest output.</b> A representative output table generated by the pipeline for the MCF7 cell line, showing the probability scores, read support metrics, and final classifications for fusion candidates.<br>
<b>Methodology:</b> The data was generated using a custom R-based pipeline that integrates <i>Samtools</i> and <i>Minimap2</i> for alignment processing and <i>tidymodels</i> for machine learning. The initial candidate list was generated using LongGF on long-read RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx).<br>
<b>Code Availability:</b> The source code for the validation pipeline and the Singularity recipe for reproducibility are available on GitHub: https://github.com/iisomineaamos/fusion-validation-pipeline
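The cross-referencing role described for Additional file 3 (merging known fusions from two databases, then labelling candidates as known or novel) can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the published pipeline is R-based, and it is not the authors' implementation. The function names (`fusion_key`, `merge_references`, `label_candidates`), the record layout, the placeholder pair `GENE_A`/`GENE_B`, and the assignment of entries to each source are all assumptions made for illustration; the actual file format may differ.

```python
# Illustrative sketch only: helper names, record layout, and the toy entries
# below are hypothetical and are not taken from the published R pipeline.

def fusion_key(gene5, gene3):
    """Normalise a (5' gene, 3' gene) pair into an order-preserving tuple key."""
    return (gene5.strip().upper(), gene3.strip().upper())


def merge_references(sources):
    """Merge {source_name: [(gene5, gene3), ...]} into one lookup table.

    Returns a dict mapping each normalised fusion key to the set of
    source names that report it, so duplicates across sources collapse.
    """
    merged = {}
    for source_name, pairs in sources.items():
        for gene5, gene3 in pairs:
            merged.setdefault(fusion_key(gene5, gene3), set()).add(source_name)
    return merged


def label_candidates(candidates, merged):
    """Flag each candidate fusion as known (present in the merged table) or not."""
    labelled = []
    for gene5, gene3 in candidates:
        key = fusion_key(gene5, gene3)
        labelled.append({
            "fusion": "-".join(key),
            "known": key in merged,
            "sources": sorted(merged.get(key, set())),
        })
    return labelled


# Toy reference data: which source lists which fusion is invented here.
toy_sources = {
    "COSMIC": [("BCAS4", "BCAS3"), ("GENE_A", "GENE_B")],
    "FusionGDB": [("bcas4", "bcas3 ")],  # differing case/whitespace on purpose
}

merged = merge_references(toy_sources)
labels = label_candidates([("BCAS4", "BCAS3"), ("MOV10", "RHOC")], merged)
for row in labels:
    print(row)
```

Keying on a normalised tuple rather than a hyphen-joined string avoids ambiguity for gene symbols that themselves contain hyphens. Note that the key preserves 5'/3' order, so a reciprocal fusion would be treated as a distinct entry under this assumption.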
Provider:
figshare
Created:
2025-11-13



