Calling structural variants with confidence from short-read data in wild bird populations

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.6q573n647

Description:

Comprehensive characterisation of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation (SV), reproducible and high-confidence SV callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrow (Passer domesticus) individuals. To produce a consensus callset across all samples using short-read data, we compared heuristic-based quality filtering with visual curation approaches (Samplot/PlotCritic and Samplot-ML). We demonstrate that curation of SVs is important for reducing putative false positives, and that the time invested in this step outweighs the potential costs of analysing short-read-discovered SV datasets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, when a lenient manual curation strategy was applied by a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the population structure expected from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence SV callsets.

Methods:

The raw Illumina reads and assembled reference genome from this article are also published and available at NCBI, BioProject number PRJNA255814 (Passer domesticus reference accession number SAMN02929199).
Trimmed reads were aligned with BWA-MEM (bwa v. 0.7.17) to the short-read reference genome assembly for Passer domesticus (Elgvin et al. 2017; NCBI: GCA_001700915.1_Passer_domesticus-1.0), and then sorted and indexed with Samtools (samtools v. 1.9). All unplaced scaffolds were removed, so only mapped chromosomal regions were included in downstream analyses. We called larger (>20 bp) structural variants (deletions, duplications, and inversions) from the aligned .bam files using LUMPY (Layer et al. 2014) and genotyped the resulting calls with SVTyper (Chiang et al. 2015), via the smoove pipeline (Pedersen et al. 2020). The resulting VCF file of raw structural variant calls analysed in the study is included in the following file: sparrow_all.smoove.square.anno.vcf.gz

Repetitive elements were identified using the Earl Grey TE annotation pipeline (version 1.2) (Baril et al. 2021, 2022), configured with RepBase (version 23.08) and Dfam (version 3.4) repeat libraries (Hubley et al. 2016; Jurka et al. 2005). Briefly, Earl Grey first annotated known repeats using the Aves repeat library. Following this, Earl Grey identified and refined novel TEs using an automated and iterative implementation of the “BLAST, Extract, Extend” process (Platt et al. 2016). Following the final TE annotation (passerDomesticusAnnotatedRepeats.gff), overlapping and fragmented annotations were resolved by Earl Grey before the final TE quantification.
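
As a first look at the raw callset, the following minimal Python sketch tallies deletions, duplications, and inversions longer than 20 bp in a VCF such as sparrow_all.smoove.square.anno.vcf.gz. It is not the authors' filtering or curation procedure. It assumes each record carries SVTYPE and either SVLEN or END in its INFO column, as LUMPY-style output typically does, and the function names (parse_info, sv_length, count_svs, count_svs_in_file) are hypothetical helpers introduced only for illustration.

```python
import gzip
from collections import Counter

SV_TYPES = {"DEL", "DUP", "INV"}  # the three classes named in the description
MIN_LEN = 20                      # the description states ">20bp"


def parse_info(info):
    """Turn 'SVTYPE=DEL;SVLEN=-120;IMPRECISE' into a dict (flags map to True)."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def sv_length(pos, info):
    """Use |SVLEN| when present, otherwise END - POS; None if neither exists."""
    svlen = info.get("SVLEN")
    if isinstance(svlen, str):
        return abs(int(svlen.split(",")[0]))
    end = info.get("END")
    if isinstance(end, str):
        return int(end) - pos
    return None


def count_svs(lines):
    """Count DEL/DUP/INV records longer than MIN_LEN in an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])  # INFO is the 8th VCF column
        length = sv_length(int(fields[1]), info)
        svtype = info.get("SVTYPE")
        if svtype in SV_TYPES and length is not None and length > MIN_LEN:
            counts[svtype] += 1
    return counts


def count_svs_in_file(path):
    """Same tally, reading a gzip- or bgzip-compressed VCF from disk."""
    with gzip.open(path, "rt") as handle:
        return count_svs(handle)
```

Calling count_svs_in_file("sparrow_all.smoove.square.anno.vcf.gz") on the downloaded file would return a Counter keyed by SV type. Breakend (BND) records and any record without a usable length are skipped rather than guessed at.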

Created:

2024-03-08
