
Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes

NIAID Data Ecosystem, indexed 2026-03-11
Download link:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124814
Description:
Medulloblastoma (MB) is a brain cancer predominantly arising in children. Roughly 70% of patients are cured today, but survivors often suffer from severe sequelae. MB has been extensively studied by molecular profiling, but often in small and scattered cohorts. To improve cure rates and reduce treatment side effects, accurate integration of such data to increase analytical power will be important, if not essential. We have integrated 23 transcription datasets, spanning 1350 MB and 291 normal brain samples. To remove batch effects, we combined the Removal of Unwanted Variation (RUV) method with a novel pipeline for determining empirical negative control genes and a panel of metrics to evaluate normalization performance. The documented approach enabled the removal of a majority of batch effects, producing a large-scale, integrative dataset of MB and cerebellar expression data. The proposed strategy will be broadly applicable for accurate integration of data and incorporation of normal reference samples for studies of various diseases. We hope that the integrated dataset will improve current research in the field of MB by allowing more large-scale gene expression analyses.

For all selected samples, raw CEL files were downloaded from GEO or ArrayExpress (AE). Subsequently, all raw CEL files from the same platform were processed together using the R/Bioconductor package oligo in conjunction with the RMA algorithm. The Human Gene 1.0 ST and Human Gene 1.1 ST arrays were analysed at the core level, while the Human Exon 1.0 ST arrays were processed at the extended level. Subsequently, we mapped the identifiers of the HG-U133 Plus 2 and Human Exon 1.0 ST to Human Gene 1.0/1.1 ST identifiers using 'Best Match' information from Affymetrix (https://www.affymetrix.com/support/technical/byproduct.affx?product=hugene-1_0-st-v1).
In addition, to increase the overlap between the Human Exon 1.0 ST and Human Gene 1.0/1.1 ST data, we also inspected and added probe mappings from the 'Good Match' and 'Complex Match' files, including probes for the genes MYCN, PTCH1, NPR3, UNC5D, DKK2, and GABRA5. After mapping of probe identifiers within each platform, multiple rows mapping to the same identifier were collapsed using the mean value. Subsequently, all platform datasets were merged on probe identifiers, and gene symbols were assigned using the hugene11sttranscriptcluster.db package. Multiple rows mapping to the same gene or multiple columns mapping to the same patient were collapsed using the mean value. Finally, the resulting gene expression matrix was quantile normalized using the respective function in the preprocessCore package.

We downloaded a total of 1796 CEL files from previously published GEO or ArrayExpress records: GSE85217 (n=763), GSE25219 (n=154), GSE60862 (n=130), GSE12992 (n=40), GSE67850 (n=22), GSE10327 (n=62), GSE30074 (n=30), E-MTAB-292 (n=19), GSE74195 (n=30), GSE37418 (n=76), GSE4036 (n=14), GSE62803 (n=52), GSE21140 (n=103), GSE37382 (n=50), GSE22569 (n=24), GSE35974 (n=50), GSE73038 (n=46), GSE50161 (n=24), GSE3526 (n=9), GSE50765 (n=12), GSE49243 (n=58), GSE41842 (n=19), GSE44971 (n=9). After preprocessing of all CEL files, we averaged the expression profiles of samples that mapped to the same patient in a single dataset, producing a final expression array comprising 1641 samples, of which 1350 samples represent primary medulloblastomas and 291 samples represent normal brain (cerebellum/upper rhombic lip).
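
The two generic steps named in the description, collapsing duplicate rows by their mean and quantile normalizing the resulting matrix, can be illustrated with a small sketch. This is not the authors' code: the record states that the original processing was done in R with oligo and preprocessCore, whereas the sketch below uses only the Python standard library. The function names `collapse_rows_by_mean` and `quantile_normalize`, the toy values, and the tie-handling rule are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_rows_by_mean(ids, rows):
    """Average all rows that share an identifier (first-seen order is kept)."""
    grouped = defaultdict(list)
    for identifier, row in zip(ids, rows):
        grouped[identifier].append(row)
    out_ids = list(grouped)
    out_rows = [[mean(col) for col in zip(*grouped[i])] for i in out_ids]
    return out_ids, out_rows


def quantile_normalize(rows):
    """Give every column (sample) the same empirical distribution.

    Each value is replaced by the mean, across samples, of the values that
    hold the same rank. Ties are broken by row order, which is a
    simplification and not necessarily what preprocessCore does.
    """
    n_rows = len(rows)
    columns = [list(col) for col in zip(*rows)]
    # For each column, the row indices ordered by increasing value.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Reference distribution: mean of the k-th smallest value of each column.
    reference = [
        mean(col[order[k]] for col, order in zip(columns, orders))
        for k in range(n_rows)
    ]
    normalized = [[0.0] * len(columns) for _ in range(n_rows)]
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Toy data (made up): rows are probes, columns are samples.
probe_to_gene = ["MYCN", "MYCN", "PTCH1", "GABRA5"]
expression = [
    [5.0, 4.0],
    [3.0, 2.0],
    [2.0, 8.0],
    [9.0, 6.0],
]
genes, collapsed = collapse_rows_by_mean(probe_to_gene, expression)
normalized = quantile_normalize(collapsed)
```

After normalization every sample column contains the same set of values, only in a different order, which is the property quantile normalization is meant to enforce.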
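
The description names RUV with empirical negative control genes but does not state which RUV variant, how many unwanted factors, or which control genes were used. The sketch below therefore only illustrates the general idea behind control-gene-based RUV: estimate an unwanted factor from genes assumed to carry no biological signal, then regress it out of every gene. It assumes a single factor (k = 1) and finds it by power iteration so that no linear algebra library is needed. All function names and data are hypothetical, and a real analysis would use an established Bioconductor implementation.

```python
from statistics import mean


def center(profile):
    """Subtract the across-sample mean from one gene's profile."""
    m = mean(profile)
    return [x - m for x in profile]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unwanted_factor(control_profiles, n_iter=100):
    """Estimate one unwanted factor (k = 1) from negative control genes.

    Power iteration approximates the leading sample-space singular vector
    of the centered control-gene matrix (rows = genes, columns = samples).
    """
    centered = [center(p) for p in control_profiles]
    # Start from the centered control profile with the largest norm.
    w = max(centered, key=lambda p: dot(p, p))
    for _ in range(n_iter):
        w_new = [0.0] * len(w)
        for p in centered:
            weight = dot(p, w)
            w_new = [a + weight * b for a, b in zip(w_new, p)]
        norm = dot(w_new, w_new) ** 0.5
        if norm == 0.0:
            break  # controls show no variation, nothing to estimate
        w = [x / norm for x in w_new]
    return w


def remove_factor(profiles, w):
    """Regress the factor w out of every gene: y - w * (w.y / w.w)."""
    ww = dot(w, w)
    if ww == 0.0:
        return [list(p) for p in profiles]
    return [
        [y - wi * dot(w, p) / ww for y, wi in zip(p, w)]
        for p in profiles
    ]


# Toy data (made up): samples 0-1 come from batch A, samples 2-3 from
# batch B, and batch B adds one unit to every gene.
controls = [[5.0, 5.0, 6.0, 6.0], [7.0, 7.0, 8.0, 8.0]]
# Gene of interest: high in samples 0 and 2, low in samples 1 and 3.
genes = [[10.0, 4.0, 11.0, 5.0]]

w = unwanted_factor(controls)
corrected_genes = remove_factor(genes, w)
corrected_controls = remove_factor(controls, w)
```

On this toy input the one-unit batch shift disappears from both the control genes and the gene of interest, while the difference between the high and low samples is retained. That outcome depends on the batch being balanced with respect to the biological groups, which real cohorts do not guarantee.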
Created:
2019-03-07