Data from: Comparing methods for SNP calling from Genotyping-By-Sequencing (GBS) data for a large-genome conifer without a published genome sequence
Source: DataCite Commons. Updated 2026-03-09; indexed 2026-04-25.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.6fv8fb4
Description:
Reduced-representation restriction-enzyme-based sequencing methods have
been demonstrated to be robust and cost-effective genotyping methods to
identify Single Nucleotide Polymorphisms (SNPs). While alignment of the
short-read fragments to a genome sequence of the same species results in
better SNP calling than de novo approaches, only a few tree species - and
few conifers in particular - have an annotated sequence. Many conifer
genomes are huge (>19 Gbp) and include a large proportion of repeat
sequences, making assembly difficult. While the sequence of a related
species could be used, choosing the proper pipeline for SNP calling is
still challenging. Here we compare the performance of four bioinformatics
pipelines: two that require a reference genome (TASSEL-GBS V2 and
reference-based Stacks) and two de novo pipelines (UNEAK and de novo
Stacks). We used Illumina GBS data from 94 ponderosa pines. Using the
loblolly pine genome as the reference greatly increased the number of SNPs
called (2.1-2.7 million, versus 62-196 thousand for the de novo
pipelines). UNEAK was fastest and identified more SNPs than Stacks de
novo. Reference-based Stacks produced the highest number of SNPs with the
lowest proportion of paralogs, whereas TASSEL-GBS V2 exhibited the highest
proportion of paralogs. The Stacks reference-based approach
produced the best results overall, while UNEAK is the better de novo
method. However, all four pipelines had distinct benefits and limitations.
Differences in observed and expected heterozygosity between the SNP sets
generated by the pipelines could lead to different conclusions when they
are used for population genetics analyses.
Provider:
Dryad
Created:
2020-02-05



