
Data from: Batch effects in a multi-year sequencing study: false biological trends due to changes in read lengths

DataONE · Updated 2018-03-02 · Indexed 2024-06-25
Download link:
https://search.dataone.org/view/null
Description:
High-throughput sequencing is a powerful tool, but it suffers from biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects: technical errors present only in subsets of the data due to procedural changes within a study. If batch effects are overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches are correlated with biological variables. Batch effects can be minimized through randomisation of sample groups across batches. However, in long-term or multi-year studies where data are added incrementally, full randomisation is impossible and batch effects may be a common feature. Here we present a case study where false signals of selection were detected due to a batch effect in a multi-year study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in non-random distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce their impact in multi-year high-throughput sequencing studies.

Created:
2018-03-02