U133A_combat.h5

NIAID Data Ecosystem, indexed 2026-03-09

Download link:

https://figshare.com/articles/dataset/U133A_combat_h5/3119248

Description:

We compiled a large cohort of breast cancer samples from NCBI's Gene Expression Omnibus (GEO) (see Table 1), as suggested by Györffy and Schäfer (2009). We considered only samples from the U133A platform and removed duplicate samples, that is, samples that occur in several studies under the same GEO ID. Array quality checks were run on all samples belonging to the same study using the R package arrayQualityMetrics. Because of the high memory demands of this package, studies containing more than 400 samples had to be divided into two parts. Samples classified as outliers in the RLE or NUSE analysis were discarded. Finally, all samples across all studies were normalized together using R's justRMA function, yielding a log(intensity) value for each sample and each probe. This normalization also included a quantile normalization step. Subsequently, probe intensities were mean centered, yielding for each sample and each probe $p$ a $\log\left(\mathrm{intensity} / \mu(\mathrm{intensity}_p)\right)$ value. We found batch effects within single studies, where samples had been collected at different locations, as well as batch effects between studies. Specifically for breast cancer, samples also form batches according to the five subtypes of breast cancer: luminal A, luminal B, Her2 enriched, normal like, and basal like. To account for these effects we employed R's ComBat, with the cancer subtype modeled as an additional covariate to preserve the variance associated with the subtypes. Doing so required stratifying the patients by subtype. Since this variable is not always available in the patient annotation, we predicted the subtype using the PAM50 marker genes as documented in R's genefu package. Principal component analysis of the batch-corrected data revealed pairs of samples with a very high correlation (>0.9). Those pairs were regarded as replicate samples. For each pair of replicate samples, one sample was removed at random.
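As an illustration only, the mean-centering and replicate-removal steps described above could be sketched in Python as follows. This is not the original R pipeline. The function names, the dictionary-based data layout, the use of base-2 logarithms, the choice of Pearson correlation between sample profiles, and the fixed random seed are all assumptions made for this sketch.

```python
import math
import random
from itertools import combinations


def center_probe(log2_values):
    """Return log2(intensity / mean intensity) for one probe across samples.

    Assumes the inputs are log2 intensities (hypothetical layout).
    """
    intensities = [2.0 ** v for v in log2_values]
    mean_intensity = sum(intensities) / len(intensities)
    return [v - math.log2(mean_intensity) for v in log2_values]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A constant profile has no defined correlation; treat it as 0 here.
    return cov / denom if denom > 0 else 0.0


def drop_replicates(profiles, threshold=0.9, seed=0):
    """Remove one randomly chosen sample from each highly correlated pair.

    profiles maps sample ID -> list of expression values in the same
    probe order (assumed layout). Pairs with correlation above
    `threshold` are treated as replicates.
    """
    rng = random.Random(seed)
    removed = set()
    for a, b in combinations(sorted(profiles), 2):
        if a in removed or b in removed:
            continue
        if pearson(profiles[a], profiles[b]) > threshold:
            removed.add(rng.choice([a, b]))
    return {s: v for s, v in profiles.items() if s not in removed}
```

Comparing all pairs is quadratic in the number of samples, so a cohort of this size would more realistically be handled with a vectorized correlation matrix. The loop form is used here only to keep each step visible.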
Affymetrix probe IDs were mapped to Entrez Gene IDs via the mapping files provided by Affymetrix. Only probes that mapped to exactly one Gene ID were taken into account, and probes starting with AFFX were discarded. If an Entrez Gene ID mapped to several Affymetrix probe IDs, probes were considered in the following order according to their suffix (Gohlmann and Talloen, 2010): “_at”, “s_at”, “x_at”, “i_at”, and “a_at”. When several probes were still valid for one Gene ID, the Affymetrix probe with the higher variance of expression values was chosen. The patients' class labels corresponding to recurrence-free or distant-metastasis-free survival were calculated with respect to a 5-year threshold. The final cohort is shown in Table 1. We derived two data sets: one labeled according to recurrence-free survival (RFS) and one labeled according to distant-metastasis-free survival (DMFS). Note that the DMFS data set is a subset of the RFS data set.
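The probe-selection rule and the 5-year labeling described above can be sketched in Python as follows. This is a hedged illustration, not the authors' code. The function names and dictionary inputs are hypothetical. Ranking suffixes not listed in the text (for example “g_at”) last is an assumption. The description does not say how patients censored before 5 years were handled, so returning `None` for them is only one common convention.

```python
import re
from statistics import pvariance

# Priority of lettered suffixes; a plain "_at" probe ranks first (0).
LETTER_RANK = {"s": 1, "x": 2, "i": 3, "a": 4}
UNLISTED_RANK = 5  # assumption: suffixes not named in the text go last


def suffix_rank(probe_id):
    """Rank a probe ID by its suffix; lower is preferred."""
    match = re.search(r"_([a-z])_at$", probe_id)
    if match:
        return LETTER_RANK.get(match.group(1), UNLISTED_RANK)
    return 0 if probe_id.endswith("_at") else UNLISTED_RANK


def select_probes(probe_to_genes, expression):
    """Pick one probe per Entrez Gene ID.

    probe_to_genes maps probe ID -> list of Entrez Gene IDs;
    expression maps probe ID -> list of expression values.
    """
    best = {}
    for probe, genes in probe_to_genes.items():
        # Skip control probes and probes not mapping to exactly one gene.
        if probe.startswith("AFFX") or len(genes) != 1:
            continue
        gene = genes[0]
        # Prefer the better suffix, then the higher variance.
        key = (suffix_rank(probe), -pvariance(expression[probe]))
        if gene not in best or key < best[gene][0]:
            best[gene] = (key, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


def five_year_label(time_years, event, threshold=5.0):
    """Return 1 for an event before the threshold, 0 if event-free at
    the threshold, and None if censored earlier (assumed handling)."""
    if time_years >= threshold:
        return 0
    return 1 if event else None
```

For example, if probes `100_at` and `100_s_at` both map only to gene 1, `select_probes` keeps `100_at` regardless of variance, because the suffix order is applied before the variance tie-break.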

Created:

2016-03-18
