LBQANA python code + Merged Gene Expression Dataset from GSE10810, GSE17907, GSE20711, GSE42568, GSE45827, and GSE61304 for Breast Cancer Biomarker Discovery

Name: LBQANA python code + Merged Gene Expression Dataset from GSE10810, GSE17907, GSE20711, GSE42568, GSE45827, and GSE61304 for Breast Cancer Biomarker Discovery
Creator: figshare
Published: 2025-10-29 21:09:43
License: No description available

DataCite Commons | Updated 2025-10-29 | Indexed 2024-11-05

Download link:

https://figshare.com/articles/dataset/Merged_Gene_Expression_Dataset_from_GSE10810_GSE17907_GSE20711_GSE42568_GSE45827_and_GSE61304_for_Breast_Cancer_Biomarker_Discovery/26946364

Resource description:

The merged dataset integrates six gene expression datasets (GSE10810, GSE17907, GSE20711, GSE42568, GSE45827, and GSE61304) from the NCBI GEO database, together comprising 476 breast cancer (BC) samples and 65 normal samples. All six datasets were generated on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) and were selected for their sample sizes, each containing more than 50 samples. Gene and sample counts per dataset are: GSE10810 (11,332 genes, 58 samples), GSE17907 (11,392 genes, 55 samples), GSE20711 (11,702 genes, 90 samples), GSE42568 (11,106 genes, 121 samples), GSE45827 (11,731 genes, 155 samples), and GSE61304 (11,273 genes, 62 samples). After merging and preprocessing, the final dataset contains 10,240 genes across 541 samples.

The raw data were read with the ReadAffy function from the affy package and normalized with the Robust Multi-array Average (RMA) method. Probes were mapped to genes using the platform-specific annotation files; for genes represented by multiple probes, the mean expression level was used. To correct batch effects introduced by merging, the empirical Bayes method implemented in the ComBat function of the sva package was applied. After correction, Principal Component Analysis (PCA) confirmed that samples were uniformly distributed across the datasets, supporting consistency for downstream analysis. The merged dataset supports comprehensive investigation of breast cancer biomarkers and improves the robustness of feature selection and machine learning analyses.
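The probe-averaging and merging steps described above can be illustrated with a small Python sketch. It is not the authors' pipeline (the description names the R packages affy and sva for that work), and it does not reproduce RMA or ComBat. The function names `collapse_probes` and `merge_on_common_genes`, the toy probe IDs, and the choice to keep only genes shared by every dataset are assumptions made for illustration; the description does not state how genes were matched across datasets.

```python
from statistics import fmean

# Per-dataset sample counts as listed in the description.
SAMPLE_COUNTS = {
    "GSE10810": 58, "GSE17907": 55, "GSE20711": 90,
    "GSE42568": 121, "GSE45827": 155, "GSE61304": 62,
}
# The counts add up to the reported 541 samples (476 BC + 65 normal).
assert sum(SAMPLE_COUNTS.values()) == 476 + 65 == 541


def collapse_probes(probe_values, probe_to_gene):
    """Average probe-level rows into one row per gene symbol."""
    grouped = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:  # assumption: unannotated probes are dropped
            continue
        grouped.setdefault(gene, []).append(values)
    # zip(*rows) walks the samples column by column
    return {
        gene: [fmean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


def merge_on_common_genes(datasets):
    """Concatenate samples over genes present in every dataset.

    Assumption: genes are matched by intersection. Returns the merged
    matrix and a per-sample batch label, which is the kind of input a
    batch-correction step such as ComBat would need.
    """
    common = sorted(set.intersection(*(set(d) for d in datasets.values())))
    merged = {gene: [] for gene in common}
    batch = []
    for name, data in datasets.items():
        n_samples = len(next(iter(data.values())))
        batch.extend([name] * n_samples)
        for gene in common:
            merged[gene].extend(data[gene])
    return merged, batch


# Toy data with hypothetical probe IDs: two probes map to TP53.
toy_a = collapse_probes(
    {"p1": [2.0, 4.0], "p2": [4.0, 6.0], "p3": [1.0, 1.0]},
    {"p1": "TP53", "p2": "TP53", "p3": "BRCA1"},
)
toy_b = {"TP53": [5.0], "ESR1": [7.0]}
merged, batch = merge_on_common_genes({"A": toy_a, "B": toy_b})
```

In the toy example only TP53 appears in both inputs, so it is the only gene kept, and `batch` records which source each of the three merged samples came from.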

Provider:

figshare

Created:

2024-09-20
