Genotype likelihoods for low-coverage whole-genome sequencing data of yellow warblers

Indexed from NIAID Data Ecosystem, 2026-05-01

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.h9w0vt4pj

Resource description:

The following datasets include the input files used to empirically test population assignment in WGSassign on Yellow Warbler data. The file "yewa.known.ind105.ds_2x.beagle.gz" contains the filtered variants of 105 Yellow Warbler individuals, output as genotype likelihoods and stored in a Beagle-formatted file. The ID file, "yewa.known.ind105.reference.IDs.txt", is a tab-delimited file with 2 columns: the first is the sample ID and the second is the known reference population. The sample order in the ID file should match that of the input Beagle file. To measure the assignment accuracy of WGSassign, we ran leave-one-out cross-validation with the input Beagle file and our ID file.

Methods

We used WGSassign on data from yellow warblers to test its accuracy when applied to individuals from a species exhibiting isolation by distance (Bay et al. 2021; Gibbs et al. 2000). Previous work on yellow warblers has found weak differentiation between populations, with pairwise FST values on the order of 0.01 or less (Gibbs et al. 2000). Blood samples from 105 individuals were collected via brachial venipuncture in 2020 and 2021. These served as reference samples from 3 populations (North, Central, and South) previously described in Bay et al. (2021) and Gibbs et al. (2000). We extracted DNA from blood using the manufacturer’s protocol for Qiagen DNeasy Blood and Tissue Kits. Whole-genome sequencing libraries were prepared following modifications of Illumina’s Nextera Library Preparation protocol (Schweizer & DeSaix 2023) and sequenced on a HiSeq 4000 at Novogene Corporation Inc., with a target sequencing depth of 2X per individual. Sequences were trimmed with TrimGalore version 0.6.5 (https://github.com/FelixKrueger/TrimGalore) and mapped to the NCBI yellow warbler reference genome (Sayers et al. 2022) (accession number JANCRA010000000) using the Burrows-Wheeler Aligner software version 0.7.17 (Li & Durbin 2009).
After mapping, the resulting SAM files were sorted, converted to BAM files, and indexed using Samtools version 1.9 (Li et al. 2009). We used MarkDuplicates from GATK version 4.1.4.0 (McKenna et al. 2010) to mark duplicate reads and clipped overlapping reads with the clipOverlap function from bamUtil (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap). To reduce variation in sequencing depth, we used the DownsampleSam function from GATK to down-sample BAM files with greater than 2X coverage to 2X coverage. To identify genetic markers from low-coverage WGS data, we used stringent filtering options in ANGSD version 0.9.40 (Korneliussen et al. 2014). We retained reads with a mapping quality of at least 30 and a base quality of at least 33. We retained SNPs that had read data in at least 50% of individuals and a minor allele frequency greater than 0.05. The filtered variants were output as genotype likelihoods and stored in a Beagle-formatted file.
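
Because the ID file must line up row for row with the individuals in the Beagle file, a quick consistency check before running an assignment can be useful. The sketch below is not part of the dataset or of WGSassign; the function names are hypothetical. It assumes the Beagle file has a single header line with three leading columns (marker and two alleles) followed by three likelihood columns per individual, and that the ID file has no header row. If the header uses placeholder names rather than sample IDs, only the number of individuals can be compared, not their order.

```python
import gzip


def beagle_sample_count(header_line):
    """Return the number of individuals implied by a Beagle header line.

    Assumes three leading columns (marker, allele1, allele2) followed by
    three genotype-likelihood columns per individual.
    """
    fields = header_line.rstrip("\n").split("\t")
    n_gl_columns = len(fields) - 3
    if n_gl_columns <= 0 or n_gl_columns % 3 != 0:
        raise ValueError(
            "expected 3 leading columns plus 3 likelihood columns per individual"
        )
    return n_gl_columns // 3


def read_reference_ids(lines):
    """Parse (sample_id, population) pairs from tab-delimited lines.

    Assumes no header row; blank lines are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sample_id, population = line.split("\t")
        pairs.append((sample_id, population))
    return pairs


def check_inputs(beagle_path, ids_path):
    """Raise ValueError if the two files disagree on the number of individuals."""
    with gzip.open(beagle_path, "rt") as handle:
        n_beagle = beagle_sample_count(handle.readline())
    with open(ids_path) as handle:
        ids = read_reference_ids(handle)
    if n_beagle != len(ids):
        raise ValueError(
            f"Beagle file has {n_beagle} individuals but ID file has {len(ids)} rows"
        )
    return ids
```

For the files described here, `check_inputs("yewa.known.ind105.ds_2x.beagle.gz", "yewa.known.ind105.reference.IDs.txt")` would be expected to return 105 pairs if the assumptions about the file layout hold.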

Created:

2024-01-03
