Additional Files for Transcript-Centric Gene Fusion Validation Pipeline

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Additional_Files_for_Transcript-Centric_Gene_Fusion_Validation_Pipeline/30609155

Description:

Supplementary Data for: A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data

This repository contains the supplementary materials, datasets, and performance metrics associated with the research article: "A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data."

Study Overview: Accurate validation of gene fusions from long-read RNA sequencing (RNA-Seq) data is a critical challenge in cancer genomics, often hampered by high false-positive rates and a reliance on matched whole-genome sequencing (WGS) data. This study introduces a novel, WGS-independent validation pipeline that utilizes transcript-centric evidence (full-length spanning reads, supplementary alignments, and realigned soft-clips) and a Random Forest machine learning classifier to automate fusion validation.

Contents of this Repository: The following files are provided to support the reproducibility and benchmarking of the reported results:

Additional file 1: Evaluation of Random Forest Hyperparameters using 10-Fold Cross-Validation. A visualization of the hyperparameter tuning process, showing performance metrics (Accuracy, AUC-PR, etc.) across various model configurations (number of trees, tree depth, etc.) used to select the optimal Random Forest model.

Additional file 2: Validated gene fusions in cancer cell lines. A comprehensive dataset listing all gene fusions validated by the pipeline across the five cancer cell lines analyzed (MCF7, A549, K562, HCT116, HepG2). This includes both known fusions (e.g., BCAS4-BCAS3) and novel discoveries (e.g., MOV10-RHOC).

Additional file 3: Merged Reference Database of Known Gene Fusions. A consolidated reference dataset combining gene fusion entries from the COSMIC and FusionGDB databases. This file served as the external knowledge base for cross-referencing fusion candidates and establishing the ground truth labels for the machine learning model.

Additional file 4: Example summary of random forest output. A representative output table generated by the pipeline for the MCF7 cell line, showing the probability scores, read support metrics, and final classifications for fusion candidates.

Methodology: The data was generated using a custom R-based pipeline that integrates Samtools and Minimap2 for alignment processing and tidymodels for machine learning. The initial candidate list was generated using LongGF on long-read RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx).

Code Availability: The source code for the validation pipeline and the Singularity recipe for reproducibility are available on GitHub: https://github.com/iisomineaamos/fusion-validation-pipeline
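
The cross-referencing step described for Additional file 3 (merging known-fusion entries from two databases, then labeling candidates as known or novel) could look roughly like the sketch below. This is an illustration only and not the authors' code: the actual pipeline is R-based, and the function names, the toy reference lists, and the choice to ignore gene order and letter case are all assumptions made here for the example. The toy entries do not reflect the real contents of COSMIC or FusionGDB.

```python
# Illustrative sketch only; not taken from the published pipeline.
# All names below are hypothetical, and the reference lists are toy data.

def normalize_pair(gene_a, gene_b):
    """Build a lookup key for a gene pair.

    Assumption: partner order and letter case are ignored, so
    BCAS4-BCAS3 and bcas3-BCAS4 map to the same key.
    """
    return frozenset((gene_a.strip().upper(), gene_b.strip().upper()))


def merge_references(sources):
    """Merge several {source_name: [(gene_a, gene_b), ...]} lists.

    Returns a dict mapping each normalized pair to the set of
    source names that list it.
    """
    merged = {}
    for name, pairs in sources.items():
        for gene_a, gene_b in pairs:
            merged.setdefault(normalize_pair(gene_a, gene_b), set()).add(name)
    return merged


def label_candidates(candidates, merged):
    """Mark each candidate pair as known (found in the merged set) or not."""
    labeled = []
    for gene_a, gene_b in candidates:
        found_in = merged.get(normalize_pair(gene_a, gene_b), set())
        labeled.append({
            "fusion": f"{gene_a}-{gene_b}",
            "known": bool(found_in),
            "sources": sorted(found_in),
        })
    return labeled


# Toy reference lists standing in for the two databases (made-up content).
toy_sources = {
    "reference_a": [("BCAS4", "BCAS3")],
    "reference_b": [("bcas3", "BCAS4"), ("GENE1", "GENE2")],
}
merged_reference = merge_references(toy_sources)

# Two candidates: one present in the toy references, one absent.
labels = label_candidates([("BCAS4", "BCAS3"), ("MOV10", "RHOC")], merged_reference)
for row in labels:
    print(row)
```

In a real run the candidate list would come from the fusion caller's output and the reference lists from the merged database file. Whether partner order should be ignored is a design decision that depends on how the reference entries are recorded.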

Created:

2025-11-13
