Identification of factors associated with duplicate rate in ChIP-seq data

Figshare, updated 2019-04-03, indexed 2026-04-29

Download link:

https://figshare.com/articles/dataset/Identification_of_factors_associated_with_duplicate_rate_in_ChIP-seq_data/7947671

Description:

Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contain redundant reads termed duplicates, that is, reads mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates, which represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In standard analysis, duplicates are removed before peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Removing all duplicates will therefore underestimate the signal level in peaks and affect the identification of signal changes across samples, so an in-depth evaluation of the impact of duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we examined the distribution of duplicates in the genome, the extent to which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq datasets had about 40% duplicates, and 97% of them were within peaks. For the other datasets, generated with PCR amplification of ChIP DNA, the narrow-peak marks have, as expected, a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, and that this is more conspicuous in high-confidence peaks. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides a basis for properly allocating duplicates between noise and signal. Our analysis supports the feasibility of retaining the signal portion of duplicates in downstream analysis, thus alleviating the limitation of complete deduplication.
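
The two quantities the description relies on, the overall duplicate rate and the share of duplicates that fall within peaks, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes reads are already reduced to hypothetical `(chrom, five_prime_pos, strand)` tuples, that peaks are half-open `(chrom, start, end)` intervals, and that every read beyond the first at a given position and strand counts as a duplicate; the dataset's own counting convention may differ.

```python
from collections import Counter


def duplicate_rate(reads):
    """Fraction of reads that are duplicates.

    reads: iterable of (chrom, five_prime_pos, strand) tuples (assumed format).
    Assumption: the first read at each position/strand is unique and every
    additional read there is a duplicate.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)
    return duplicates / total


def fraction_of_duplicates_in_peaks(reads, peaks):
    """Share of duplicate reads whose position lies inside any peak.

    peaks: list of (chrom, start, end) half-open intervals (assumed format).
    Uses a linear scan over peaks, which is fine for a small illustration.
    """
    counts = Counter(reads)
    dup_total = 0
    dup_in_peaks = 0
    for (chrom, pos, _strand), n in counts.items():
        extra = n - 1  # reads beyond the first at this position/strand
        if extra == 0:
            continue
        dup_total += extra
        if any(c == chrom and s <= pos < e for c, s, e in peaks):
            dup_in_peaks += extra
    return dup_in_peaks / dup_total if dup_total else 0.0


# Toy input: 7 reads at 4 distinct position/strand combinations.
reads = (
    [("chr1", 100, "+")] * 3
    + [("chr1", 100, "-")]
    + [("chr1", 500, "+")] * 2
    + [("chr2", 50, "+")]
)
peaks = [("chr1", 90, 110)]

rate = duplicate_rate(reads)
in_peaks = fraction_of_duplicates_in_peaks(reads, peaks)
```

In this toy input, 3 of the 7 reads are duplicates and 2 of those 3 lie inside the single peak. For real data the tuples would come from an alignment file, and an interval index would replace the linear scan over peaks.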

Created:

2019-04-03
