Adjusting for batch effects in DNA methylation microarray data, a lesson learned

NIAID Data Ecosystem, indexed 2026-03-10

Download link:

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA427719


Resource description:

It is well known, but frequently overlooked, that low- and high-throughput molecular data may contain batch effects, i.e., systematic technical variation. Confounding of experimental batches with the variable(s) of interest is especially concerning, as a batch effect may then be interpreted as a biologically significant finding. An integral step towards reducing false discovery in molecular data analysis is inspecting for batch effects and applying computational tools to reduce this signal if present. In a 30-sample pilot Illumina Infinium HumanMethylation450 (450k array) experiment, we identified two sources of batch effects: array row and chip. Here, we demonstrate two approaches taken to process the 450k data in which an R function, ComBat, was applied to adjust for this non-biological signal. In the “initial analysis”, the application of ComBat to an unbalanced study design resulted in 9,683 and 19,192 significant (FDR < 0.05) DNA methylation differences, despite none being present prior to correction. Because this dramatic change was suspicious, we carried out a “revised processing” that included changes to our analysis as well as a greater number of samples, and that successfully reduced batch effects without introducing false signal. Our work supports conclusions made by an article previously published in this journal: though the ultimate antidote to batch effects is thoughtful study design, every DNA methylation microarray analysis should inspect, assess and, if necessary, adjust for batch effects. The analysis experience presented here can serve as a reminder to the broader community to establish research questions a priori, ensure that they match the study design, and encourage communication between technicians and analysts.

Overall design: Full details of this analysis can be found in PMID:XXXXX. 450k data were generated from 30 human placentas randomly distributed within a larger batch of 84 samples run across seven chips. This design maximized cost-effectiveness by allowing several subsets of the 84 samples to be analyzed to address separate research questions. For initial processing, we extracted the data relating only to the 30-sample pilot study (these samples are labelled as initial_processing in the characteristics: processing group column). A standard step in our quality control assessment, principal component analysis (PCA), suggested the presence of batch effects (i.e., technical, as opposed to biological, sources of data variation (Leek et al., 2010)) relating to how the samples were dispersed across 450k chips (see details below). We employed ComBat (Leek et al.), an empirical Bayes approach implemented in the R software environment (R Core Team, 2014), to regress out batch effects in the pilot data. After this correction was applied, however, we found 9,683 and 19,192 differentially methylated sites (FDR < 0.05) in association with our biological variable of interest, while none had been found prior to correction.

We were suspicious of this dramatic change and conducted a new analysis utilizing more samples. A further 29 placental samples from within the 84-sample batch were included in the revised analysis (samples labelled as revised_processing in the characteristics: processing group column), increasing the pre-processing sample size from 30 to 59, with a better distribution of samples across chips and rows. This larger group also allowed for the inclusion of a technical replicate (PM72r) to better monitor data processing. The Matrix processed values included are those of the 59 samples in the revised analysis. The 30 pilot samples (initial_processing in the characteristics: processing group column) were then selected out of the larger group of 59 samples to address biological questions. A description of the biological analysis can be found in PMID:XXXX.
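
The description attributes the false signal to applying ComBat on an unbalanced design, so one check worth running before any batch adjustment is a cross-tabulation of batch against the variable of interest. The following is a minimal sketch in Python (standard library only), not the authors' code: the function names `crosstab` and `single_group_batches` and the 12-sample layout are hypothetical and are not taken from the study's sample sheet.

```python
from collections import Counter


def crosstab(batches, groups):
    """Count samples for every (batch, group) combination."""
    counts = Counter(zip(batches, groups))
    batch_levels = sorted(set(batches))
    group_levels = sorted(set(groups))
    return {
        b: {g: counts.get((b, g), 0) for g in group_levels}
        for b in batch_levels
    }


def single_group_batches(table):
    """Return batches in which only one group is represented.

    In such batches the batch effect and the biological effect
    cannot be separated from the data alone.
    """
    return [
        b for b, row in table.items()
        if sum(1 for n in row.values() if n > 0) < 2
    ]


# Hypothetical layout for illustration only (NOT the study's sample sheet).
chips = ["chip1"] * 4 + ["chip2"] * 4 + ["chip3"] * 4
groups = (
    ["case"] * 4
    + ["case", "control", "control", "control"]
    + ["control"] * 4
)

table = crosstab(chips, groups)
for chip, row in table.items():
    print(chip, row)
print("Batches containing a single group:", single_group_batches(table))
```

In this made-up layout, two of the three chips hold samples from only one group, which is the kind of imbalance that would argue for redistributing or adding samples before adjustment rather than relying on the correction alone.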

Created:

2017-12-27
