ParaMask: a new method to identify multicopy genomic regions, corrects major biases in whole genome data of populations with unknown inbreeding
Source: NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://www.ncbi.nlm.nih.gov/sra/ERP158564
Description:
Multicopy genomic regions consist of sequences that are repeated multiple times and that, after sequencing, can align to each other (collapse), creating errors and biases in genomic analyses. Although this is a long-standing problem in empirical population and evolutionary genomics, an established framework for identifying these regions in whole genome data sets is still missing, in particular for species that depart from random mating. Here, we develop ParaMask, an easy-to-use approach to identify multicopy regions in population-level whole genome data. An expectation-maximisation framework allows us to fit unknown levels of inbreeding to the data and to avoid assuming random mating. This is crucial, for instance, in species with a selfing mating system, population structure, or age structure. The method gains power by combining different signatures of collapsed multicopy regions, namely excess heterozygosity, excess sequencing coverage, deviations in allelic ratios, and clustering of collapsed SNPs. We benchmark the method on simulations, showing that more than 99% of SNPs are correctly classified as single-copy or multicopy, both with random mating and with inbreeding. We apply ParaMask to a novel whole genome data set of the plant Arabis alpina. We find that multicopy regions include transposable elements, structural variants such as tandem or segmental duplications, gene families of paralogs, and other repeats, in agreement with structural variant calling from long reads. Finally, we show that multicopy regions create a mosaic pattern of biases in genomic summary statistics that can confound the inference of evolutionary histories and selection.
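
To illustrate why the level of inbreeding matters for the excess-heterozygosity signature mentioned above, the following is a minimal sketch and not ParaMask's implementation. It uses the standard population genetics expectation that, at a biallelic site with allele frequency p and inbreeding coefficient F, the heterozygote frequency is 2p(1 - p)(1 - F). The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_heterozygosity(p: float, f: float) -> float:
    """Expected heterozygote frequency at a biallelic site.

    p: frequency of one allele (0..1).
    f: inbreeding coefficient F (f = 0 is random mating, f = 1 is full inbreeding).
    Standard expectation: 2 * p * (1 - p) * (1 - F).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must lie in [0, 1]")
    if f > 1.0:
        raise ValueError("inbreeding coefficient f must not exceed 1")
    return 2.0 * p * (1.0 - p) * (1.0 - f)


def heterozygote_excess(n_het: int, n_total: int, p: float, f: float) -> float:
    """Observed minus expected heterozygote frequency (positive means excess)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_het / n_total - expected_heterozygosity(p, f)


if __name__ == "__main__":
    # Hypothetical site: 90 of 100 individuals genotyped as heterozygous, p = 0.45.
    for f in (0.0, 0.8):
        excess = heterozygote_excess(90, 100, p=0.45, f=f)
        print(f"F = {f}: excess heterozygosity = {excess:.3f}")
```

In this hypothetical example the same observed genotypes look far more anomalous once inbreeding is accounted for, because the expected heterozygote frequency shrinks with F. An approach that assumes random mating would therefore understate the excess in a selfing population.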
Created:
2026-01-20



