
LBQANA Python code + Merged Gene Expression Dataset from GSE10810, GSE17907, GSE20711, GSE42568, GSE45827, and GSE61304 for Breast Cancer Biomarker Discovery

Source: DataCite Commons. Updated: 2025-10-29. Indexed: 2026-04-25.
Download link:
https://figshare.com/articles/dataset/Merged_Gene_Expression_Dataset_from_GSE10810_GSE17907_GSE20711_GSE42568_GSE45827_and_GSE61304_for_Breast_Cancer_Biomarker_Discovery/26946364/3
Description:
The merged dataset integrates six gene expression datasets (GSE10810, GSE17907, GSE20711, GSE42568, GSE45827, and GSE61304) from the NCBI GEO database, together comprising 476 breast cancer (BC) samples and 65 normal samples. All six were generated on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) and were selected for their sample sizes, each containing more than 50 samples. Gene counts vary by dataset: GSE10810 (11,332 genes, 58 samples), GSE17907 (11,392 genes, 55 samples), GSE20711 (11,702 genes, 90 samples), GSE42568 (11,106 genes, 121 samples), GSE45827 (11,731 genes, 155 samples), and GSE61304 (11,273 genes, 62 samples). After merging and preprocessing, the final dataset contains 10,240 genes across 541 samples.

The datasets were read with the ReadAffy function from the affy package and normalized with the Robust Multi-array Average (RMA) method. Probes were annotated using platform-specific annotation files, and for genes represented by multiple probes, the mean expression level was computed. To correct batch effects introduced by merging, the empirical Bayes method implemented in the ComBat function of the sva package was applied. After correction, Principal Component Analysis (PCA) showed a uniform distribution of samples across the six datasets, indicating consistency for further analysis. The merged dataset is intended to support a comprehensive investigation of breast cancer biomarkers and to improve the robustness of feature selection and machine learning analyses.
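The probe-averaging and merging steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which the description places in R (affy and sva), and it does not reproduce RMA normalization or ComBat batch correction. The function names `collapse_probes` and `merge_on_shared_genes`, the toy probe IDs, and the assumption that merging keeps only genes present in every dataset are all hypothetical, introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_values, probe_to_gene):
    """Average probe-level rows into one row per gene (hypothetical helper)."""
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene:  # probes without an annotation are skipped
            grouped[gene].append(values)
    # zip(*rows) walks the samples column by column
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


def merge_on_shared_genes(datasets):
    """Keep genes found in every dataset and concatenate their samples.

    Assumption: the merge is an intersection of gene sets. Batch labels are
    returned because a batch-correction step such as ComBat needs them.
    """
    shared = set.intersection(*(set(d) for d in datasets.values()))
    merged = {gene: [] for gene in sorted(shared)}
    batches = []
    for name, data in datasets.items():
        n_samples = len(next(iter(data.values())))
        batches.extend([name] * n_samples)
        for gene in merged:
            merged[gene].extend(data[gene])
    return merged, batches


# Toy data: probe IDs and expression values are made up for illustration.
annotation = {"p1": "BRCA1", "p2": "BRCA1", "p3": "TP53", "p4": "ESR1"}
study_a = collapse_probes(
    {"p1": [2.0, 4.0], "p2": [4.0, 6.0], "p3": [1.0, 1.0]}, annotation
)
study_b = collapse_probes({"p1": [5.0], "p3": [2.0], "p4": [7.0]}, annotation)
merged, batches = merge_on_shared_genes({"study_a": study_a, "study_b": study_b})

# Sample counts as reported in the description; they sum to the stated 541.
reported_samples = {
    "GSE10810": 58, "GSE17907": 55, "GSE20711": 90,
    "GSE42568": 121, "GSE45827": 155, "GSE61304": 62,
}
total_samples = sum(reported_samples.values())
```

In this toy run, ESR1 is dropped because it appears in only one study. That is one plausible reason the merged dataset has fewer genes (10,240) than any single input dataset, although the description does not state the merge rule explicitly.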
Provider:
figshare
Created:
2025-10-29