Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species

Name: Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species
Creator: Dryad
Published: 2026-03-12 16:05:47
License: No description available

Source: DataCite Commons. Updated: 2026-03-12. Indexed: 2026-04-25.

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.k0p2ngf7w

Resource description:

Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31 Gbp) that create hurdles to producing high-coverage SNP datasets that capture diversity from across the entirety of the genome. Due in part to multiple ancient whole-genome duplication events, gene family expansion, and subsequent evolution within Pinaceae, false diversity from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals show a strong positive correlation with those estimated by sequencing the same population as individuals (r > 0.948), on par with such comparisons made in model organisms. Further, we highlight the utility of haploid megagametophyte tissue for identifying sites that are likely due to misaligned paralogs. Together with additional minor filtering, we show that it is possible to remove many of the loci with large frequency estimate discrepancies between individual and pooled sequencing approaches, improving the correlation further (r > 0.973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralog sites, and suggests that the combination of pool-seq and exome capture is robust for further evolutionary hypothesis testing in these systems.
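The comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, input layouts, and the `max_het_fraction` threshold are hypothetical and chosen only for illustration. The sketch assumes per-site alternate-allele counts for diploid individuals, per-site read counts for a pool, and a per-site list of flags marking whether each haploid megagametophyte sample was called heterozygous (which a truly haploid sample should not be, hinting at collapsed paralogs).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def individual_allele_freq(alt_counts):
    """Alternate-allele frequency from diploid genotypes.

    alt_counts holds 0, 1, or 2 alternate alleles per individual,
    with None for a missing genotype call.
    """
    called = [g for g in alt_counts if g is not None]
    if not called:
        raise ValueError("no called genotypes at this site")
    return sum(called) / (2 * len(called))


def pooled_allele_freq(alt_reads, total_reads):
    """Alternate-allele frequency estimated from pooled read counts."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def flag_putative_paralog_sites(haploid_het_calls, max_het_fraction=0.0):
    """Return site IDs where haploid samples look heterozygous too often.

    haploid_het_calls maps a site ID to a list of booleans, one per
    haploid sample, True when that sample was called heterozygous.
    max_het_fraction is a hypothetical tolerance, not a published value.
    """
    flagged = []
    for site, calls in haploid_het_calls.items():
        if calls and sum(calls) / len(calls) > max_het_fraction:
            flagged.append(site)
    return flagged
```

In use, one would compute the two frequency estimates per site, drop the flagged sites, and compare `pearson_r` before and after filtering. Any real analysis would also need depth and quality filters that this sketch leaves out.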

Provider:

Dryad

Created:

2021-07-18