Quantifying the error of secondary vs. distant primary calibrations in a simulated environment

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.1zcrjdfp5

Description:

Using calibrations to obtain absolute divergence times is standard practice in molecular clock studies. While the use of primary (e.g., fossil) calibrations is preferred, this approach can be limiting because of their rarity in fast-growing datasets. Thus, alternatives need to be explored, such as the use of secondary (molecularly derived) calibrations that can anchor a timetree at a larger number of nodes. However, the use of secondary calibrations has been discouraged in the past because of concerns about the error rates of the node estimates they produce with an apparently high precision. Here, we quantify the amount of error in estimates produced by the use of secondary calibrations relative to true times and to primary calibrations placed on distant nodes. We find that, overall, the inaccuracies in estimates based on secondary calibrations are predictable and mirror errors associated with primary calibrations and their confidence intervals. Additionally, we find comparable error rates in estimated times from secondary calibrations and distant primary calibrations, although the precision of estimates derived from distant primary calibrations is roughly twice as good as that of estimates derived from secondary calibrations. This suggests that increasing dataset size to include primary calibrations may produce divergence times that are about as accurate as those from secondary calibrations, albeit with a higher precision. Overall, our results suggest that secondary calibrations may be useful to explore the parameter space of plausible evolutionary scenarios when compared to time estimates obtained with distant primary calibrations.

Methods

We started from a main tree of 248 species represented in a tree of life. This main tree was split into two subtrees, tree A (173 species) and tree B (71 species), that represent two clades and maximize the size of the dataset in each tree. We then added to these clades two shared lineages, which were arbitrarily chosen, and an outgroup. This setup created two nested phylogenies that were used to test hypotheses on the performance of calibrations.

To simulate multiple genes, we used a set of 446 empirical parameters (e.g., length, GC content, initial evolutionary rate) and altered the main timetree according to an autocorrelated model (ν = 1) that resulted in estimated rates deviating by up to ±25% from the mean rate. This effectively created 446 phylogenies with different branch lengths but the same topology. These parameters were given to SeqGen to simulate genes under a Hasegawa-Kishino-Yano (HKY) model.

Ten random sets of individual genes were then concatenated to reach a length of at least 30,000 sites (30,029 to 30,725). In addition, we created one concatenation with all genes (approximately 604,000 sites) and two concatenations of half the number of genes (223 genes per concatenation) with lengths of 273,812 and 330,187 sites. Each of these concatenations was used independently in downstream analyses. Patterns among the 30k, half, and full concatenations were similar. Therefore, we discuss results from the 30k concatenations because they allow us to evaluate the variance of estimates among datasets.

For primary calibrations, three nodes from tree A were chosen: a relatively shallow node at 63.9 million years ago (mya), and two that were deeper in the tree but in two different clades (209.4 mya and 220.2 mya). The overlapping node between tree A and tree B has an intermediate depth (167 mya) within tree A and is centrally placed within the topology of tree B. These primary and secondary calibrations were chosen to minimize the effect that biased location (e.g., all within one clade) and divergence times (e.g., all young nodes) may have on the accuracy of estimates.
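The construction of the ten 30k concatenations can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene names and lengths are randomly generated placeholders, the function name `draw_concatenation` is hypothetical, and drawing genes without replacement within a set (while allowing a gene to recur across sets) is an assumption, since the description does not state how the random sets were drawn.

```python
import random


def draw_concatenation(gene_lengths, min_sites=30_000, rng=None):
    """Draw genes at random (without replacement) until the summed
    length reaches at least `min_sites`.

    gene_lengths: dict mapping gene name -> alignment length in sites.
    Returns (list of chosen gene names, total number of sites).
    """
    rng = rng or random.Random()
    order = list(gene_lengths)
    rng.shuffle(order)  # random order of candidate genes

    chosen, total = [], 0
    for gene in order:
        if total >= min_sites:
            break  # threshold reached, stop adding genes
        chosen.append(gene)
        total += gene_lengths[gene]

    if total < min_sites:
        raise ValueError("gene pool is too small to reach min_sites")
    return chosen, total


# Placeholder data: 446 genes with made-up lengths (NOT the empirical values).
rng = random.Random(42)
gene_lengths = {f"gene_{i:03d}": rng.randint(300, 2500) for i in range(446)}

# Ten independent random sets, each a little over 30,000 sites.
sets = [draw_concatenation(gene_lengths, rng=rng) for _ in range(10)]
for genes, total in sets:
    print(len(genes), "genes,", total, "sites")
```

Because genes are added whole, each set overshoots the 30,000-site threshold by less than the length of its last gene, which is consistent with the narrow reported range of 30,029 to 30,725 sites.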

Created:

2020-02-17
