An in silico comparison of protocols for dated phylogenomics

Source: NIAID Data Ecosystem (indexed 2026-03-10)

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.8hs71

Description:

In the age of genome-scale DNA sequencing, choice of molecular marker arguably remains an important decision in planning a phylogenetic study. Using published genomes from 23 primate species, we make a standardized comparison of four of the most frequently used protocols in phylogenomics, viz., targeted sequence-enrichment using ultraconserved element (UCE) and exon-capture probes, and restriction-site-associated DNA sequencing (RADseq and ddRADseq). Here we present a procedure to perform in silico extractions from genomes and create directly comparable datasets for each class of marker. We then compare these datasets in terms of both phylogenetic resolution and ability to consistently and precisely estimate clade ages using fossil-calibrated molecular-clock models. Furthermore, we were able to directly compare these results to previously published datasets from Sanger-sequenced nuclear exons and mitochondrial genomes under the same analytical conditions. Our results show that, with the exception of the mitochondrial genome dataset and the smallest ddRADseq dataset, all data classes performed equally well for uncontroversial nodes, i.e., they recovered the same well-supported topology. However, for one difficult-to-resolve node comprising a rapid diversification, we report well-supported but conflicting topologies among the marker classes, consistent with the mismodelling of gene tree heterogeneity as demonstrated by species tree analyses of single nucleotide polymorphisms. Likewise, clade age estimates showed consistent discrepancies between datasets under strict and relaxed clock models; for recent nodes, clade ages estimated by nuclear exon datasets were younger than those of the UCE, RADseq and mitochondrial data, but vice versa for the deepest nodes in the primate phylogeny. This observation is explained by temporal differences in phylogenetic informativeness (PI), with the datasets with strong PI peaks toward the present underestimating the deepest node ages. Finally, we conclude by emphasizing that while huge numbers of loci are probably not required for uncontroversial phylogenetic questions (for which practical considerations such as cost and ease of data generation/sharing/aggregating therefore become increasingly important), accurately modelling heterogeneous data remains as relevant as ever for the more recalcitrant problems.
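
To make the idea of an in silico extraction concrete, the following is a minimal sketch of how RADseq-like and ddRADseq-like loci could be pulled from a genome sequence held as a plain string. It is not the procedure used in the study. The function names, the enzyme choices (SbfI, recognition site CCTGCAGG, and EcoRI, recognition site GAATTC), the flank length, and the size-selection window are all assumptions made for illustration, and the sketch simplifies by treating the start of each recognition site as the cut position and by scanning one strand only.

```python
import re


def find_cut_sites(seq, motif):
    """Return 0-based start positions of every occurrence of motif.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    pattern = f"(?={re.escape(motif.upper())})"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


def in_silico_rad(seq, motif, flank=100):
    """RADseq-like extraction (hypothetical helper).

    For each recognition site, keep up to `flank` bases on either side.
    Returns a list of (site_position, left_flank, right_flank) tuples.
    """
    loci = []
    for pos in find_cut_sites(seq, motif):
        left = seq[max(0, pos - flank):pos]
        right = seq[pos + len(motif):pos + len(motif) + flank]
        loci.append((pos, left, right))
    return loci


def in_silico_ddrad(seq, motif_a, motif_b, min_len=50, max_len=500):
    """ddRADseq-like extraction (hypothetical helper).

    Keep fragments that lie between two adjacent cut sites belonging to
    different enzymes and whose length falls inside the size window.
    Simplification: the cut is placed at the start of each recognition site.
    Returns a list of (start, end, fragment) tuples.
    """
    sites = sorted(
        [(p, "A") for p in find_cut_sites(seq, motif_a)]
        + [(p, "B") for p in find_cut_sites(seq, motif_b)]
    )
    fragments = []
    for (p1, e1), (p2, e2) in zip(sites, sites[1:]):
        length = p2 - p1
        if e1 != e2 and min_len <= length <= max_len:
            fragments.append((p1, p2, seq[p1:p2]))
    return fragments


if __name__ == "__main__":
    # Toy sequence with one SbfI site and one EcoRI site (assumed enzymes).
    toy = "AAAA" + "CCTGCAGG" + "TTTTTTTTTT" + "GAATTC" + "GGGG"
    print(in_silico_rad(toy, "CCTGCAGG", flank=3))
    print(in_silico_ddrad(toy, "CCTGCAGG", "GAATTC", min_len=10, max_len=100))
```

Running the same extraction functions over every genome in a taxon set yields loci sampled under identical rules, which is the sense in which the resulting datasets are directly comparable. A real pipeline would additionally need to handle both strands, enzyme-specific cleavage offsets, and orthology assessment across species.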

Created:

2017-10-25
