Additional Files for Transcript-Centric Gene Fusion Validation Pipeline

Source: DataCite Commons. Updated: 2025-11-13. Indexed: 2026-04-25.
Download link:
https://figshare.com/articles/dataset/Additional_Files_for_Transcript-Centric_Gene_Fusion_Validation_Pipeline/30609155
Description:
<b>Supplementary Data for: A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data</b><br>
<b>Description</b><br>
This repository contains the supplementary materials, datasets, and performance metrics associated with the research article: <b>"A WGS-Independent Framework for Gene Fusion Validation in Long-Read RNA-Seq Data."</b><br>
<b>Study Overview:</b> Accurate validation of gene fusions from long-read RNA sequencing (RNA-Seq) data is a critical challenge in cancer genomics, often hampered by high false-positive rates and a reliance on matched whole-genome sequencing (WGS) data. This study introduces a novel, WGS-independent validation pipeline that utilizes transcript-centric evidence (full-length spanning reads, supplementary alignments, and realigned soft-clips) and a Random Forest machine learning classifier to automate fusion validation.<br>
<b>Contents of this Repository:</b> The following files are provided to support the reproducibility and benchmarking of the reported results:<br>
<b>Additional file 1: Evaluation of Random Forest Hyperparameters using 10-Fold Cross-Validation.</b> A visualization of the hyperparameter tuning process, showing performance metrics (Accuracy, AUC-PR, etc.) across various model configurations (number of trees, tree depth, etc.) used to select the optimal Random Forest model.<br>
<b>Additional file 2: Validated gene fusions in cancer cell lines.</b> A comprehensive dataset listing all gene fusions validated by the pipeline across the five cancer cell lines analyzed (MCF7, A549, K562, HCT116, HepG2). This includes both known fusions (e.g., <i>BCAS4-BCAS3</i>) and novel discoveries (e.g., <i>MOV10-RHOC</i>).<br>
<b>Additional file 3: Merged Reference Database of Known Gene Fusions.</b> A consolidated reference dataset combining gene fusion entries from the COSMIC and FusionGDB databases. This file served as the external knowledge base for cross-referencing fusion candidates and establishing the ground truth labels for the machine learning model.<br>
<b>Additional file 4: Example summary of random forest output.</b> A representative output table generated by the pipeline for the MCF7 cell line, showing the probability scores, read support metrics, and final classifications for fusion candidates.<br>
<b>Methodology:</b> The data was generated using a custom R-based pipeline that integrates <i>Samtools</i> and <i>Minimap2</i> for alignment processing and <i>tidymodels</i> for machine learning. The initial candidate list was generated using LongGF on long-read RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx).<br>
<b>Code Availability:</b> The source code for the validation pipeline and the Singularity recipe for reproducibility are available on GitHub: https://github.com/iisomineaamos/fusion-validation-pipeline
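As an illustration of how a merged reference of known fusions (as in Additional file 3) could be used to label candidates, here is a minimal Python sketch. It is not the authors' R implementation: the function names, the tuple-based key, the case-insensitive matching rule, and the toy reference entries are all assumptions made for this example, and the real file layout may differ.

```python
# Hypothetical sketch: names, data layout, and matching rule are assumptions,
# not taken from the published pipeline or from Additional file 3.

def fusion_key(gene5, gene3):
    """Normalise a (5' gene, 3' gene) pair to an upper-case tuple key."""
    return (gene5.strip().upper(), gene3.strip().upper())


def merge_references(*sources):
    """Union several iterables of (5' gene, 3' gene) pairs into one set of keys."""
    merged = set()
    for source in sources:
        for gene5, gene3 in source:
            merged.add(fusion_key(gene5, gene3))
    return merged


def label_candidates(candidates, known):
    """Map each candidate key to True if it appears in the reference set."""
    labels = {}
    for gene5, gene3 in candidates:
        key = fusion_key(gene5, gene3)
        labels[key] = key in known
    return labels


# Toy entries standing in for exports from two reference databases.
reference_a = [("BCAS4", "BCAS3")]
reference_b = [("bcas4", "BCAS3"), ("BCR", "ABL1")]
known = merge_references(reference_a, reference_b)

# Candidate fusions, e.g. as reported by a fusion caller.
candidates = [("BCAS4", "BCAS3"), ("MOV10", "RHOC")]
labels = label_candidates(candidates, known)
print(labels)
```

Using a tuple key rather than a hyphen-joined string avoids ambiguity for gene symbols that themselves contain hyphens. In this sketch matching is direction-sensitive, so a reciprocal fusion would count as a different entry; whether the pipeline treats it that way is not stated in the description.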
Provider:
figshare
Created:
2025-11-13