Supplementary Data for PhD Thesis: "Unravelling the Complexity of RNA Splicing and Gene Expression Regulation in Health and Diseases"

Name: Supplementary Data for PhD Thesis: "Unravelling the Complexity of RNA Splicing and Gene Expression Regulation in Health and Diseases"
Creator: UNSW Sydney
Published: 2025-11-27 02:43:38
License: No description available

DataCite Commons. Updated 2025-11-27. Indexed 2026-05-07.

Download link:

http://hdl.handle.net/1959.4/106421

Description:

This dataset consists of Excel spreadsheets containing tabular research data supporting three studies that investigate gene expression, biological attributes, and cross-species gene relationships.

The Chapter 2 folder contains 7 files with exon usage analysis of neuromuscular disease genes. These files include exon-level and gene-level quantitative data from RNA-seq samples across developmental stages (fetal/neonatal, pediatric, adult) and muscle types (skeletal, cardiac). The primary variable is PSI (Percent Spliced In), with values ranging from 0 to 1 that indicate exon inclusion/exclusion rates. Secondary variables include delta PSI (differences between conditions), standard deviations, and variant densities (pathogenic/likely-pathogenic variants from ClinVar, population variants from gnomAD). Data are organized by gene identifiers, exon coordinates, and variant statistics, and are grouped by age group and muscle subtype.

The Chapter 4 folder contains 5 files with GeneRAIN model predictions and performance metrics. These include evaluation metrics (Adjusted Rand Index, Normalized Mutual Information, Fowlkes-Mallows Index) assessing model learning across 21 biological databases; classification results from 5,050 machine learning classifiers with AUC scores (Area Under Curve, 0 to 1 scale); and 62.5 million biological attribute predictions for genes, including disease associations, transcription factor targets, Gene Ontology terms, and pathways. The data structure consists of gene identifiers with prediction probabilities and confidence metrics.

The Chapter 5 folder contains 5 files with human-mouse gene comparison analysis. These include cross-species homolog analysis for gene pairs with RNA embedding similarity scores and DNA sequence alignment metrics. Variables include cosine similarity values (0 to 1), normalized DNA alignment scores, GC content percentages, and DNA sequence conservation statistics. The files also contain phenotype and disease association data integrating RNA versus DNA similarity measures, and gene set enrichment analysis results with functional annotation clusters.

All files are organized in tabular format with columns containing gene identifiers (Ensembl ID, gene symbol), numerical metrics, statistical values, and categorical annotations. Consistent formatting enables integrated analysis across chapters.
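
As a rough illustration of two of the variables named above, the sketch below computes a delta PSI between two conditions and a cosine similarity between two embedding vectors. It uses only the textbook definitions; the function names, the choice of "group B minus group A" for the sign, and all numeric values are hypothetical and are not taken from the spreadsheets or from the thesis methods.

```python
import math
from statistics import mean


def delta_psi(psi_a, psi_b):
    """Difference in mean PSI between two conditions (B minus A).

    Assumption: delta PSI is taken as a difference of group means;
    the thesis may define the comparison differently.
    """
    return mean(psi_b) - mean(psi_a)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical PSI values for one exon in two age groups.
fetal_psi = [0.20, 0.30, 0.25]
adult_psi = [0.80, 0.90, 0.85]
dpsi = delta_psi(fetal_psi, adult_psi)

# Hypothetical embedding vectors for a human gene and its mouse homolog.
human_vec = [0.5, 0.1, 0.4]
mouse_vec = [0.4, 0.2, 0.4]
similarity = cosine_similarity(human_vec, mouse_vec)

print(f"delta PSI: {dpsi:.2f}")
print(f"cosine similarity: {similarity:.3f}")
```

A positive delta PSI under this sign convention means the exon is included more often in the second condition than in the first.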

Provider:

UNSW Sydney

Created:

2025-11-26