Data from: Fast dating using least-squares criteria and algorithms

DataONE | Updated 2015-09-25 | Indexed 2024-06-27

Download link:

https://search.dataone.org/view/null


Description:

Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than a thousand taxa are becoming increasingly common, notably with viruses (e.g. HIV). Dating ancestral events is one of the first, essential goals with such data. However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here we present very fast dating algorithms, based on a Gaussian model closely related to the Langley-Fitch molecular-clock model. We show that this model is robust to uncorrelated violations of the molecular clock. Our algorithms apply to serial data, where the tips of the tree have been sampled through time. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods (e.g. mid-point). Our algorithms exploit the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra. We distinguish between an unconstrained setting and the case where the temporal precedence constraint (i.e. an ancestral node must be older than its daughter nodes) is accounted for. With rooted trees, the former is solved using linear algebra in linear computing time (i.e. proportional to the number of taxa), while the resolution of the latter, constrained setting, is based on an active-set method that runs in nearly linear time. With unrooted trees the computing time becomes (nearly) quadratic (i.e. proportional to the square of the number of taxa). In all cases very large input trees (>10,000 taxa) can easily be processed and transformed into time-scaled trees. We compare these algorithms to standard methods (root-to-tip, r8s version of Langley-Fitch method, and BEAST). Using simulated data, we show that their estimation accuracy is similar to that of the most sophisticated methods, while they run much faster. We apply these algorithms to a large data set comprising 1,195 strains of influenza virus from the pdm09 H1N1 human pandemic. Again the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. These algorithms are implemented in the LSD software (Least-Squares Dating), which can be downloaded from http://www.atgc-montpellier.fr/LSD/, along with all our data sets and detailed results. An Online Appendix, providing additional algorithm descriptions, tables and figures, can be found in the Dryad data repository.
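To make the unconstrained, rooted-tree setting more concrete, here is a minimal Python sketch of least-squares dating on a toy tree. It is not the LSD implementation and makes several simplifying assumptions: all branches get equal weight (the abstract's Gaussian model is not specified in enough detail here to reproduce its variances), the temporal precedence constraint is ignored, and a dense NumPy solver is used rather than the linear-time recursion the abstract describes. The function name `least_squares_dates`, its arguments, and the toy tree are hypothetical. The idea is that each branch length b_i is modelled as rate × (t_i − t_parent); substituting u_j = rate × t_j for the unknown internal dates makes the problem linear in (rate, u).

```python
import numpy as np

def least_squares_dates(parent, branch_length, tip_date):
    """Unweighted, unconstrained least-squares dating on a rooted tree.

    parent:        dict node -> parent node (the root has no entry)
    branch_length: dict node -> length of the branch above that node
    tip_date:      dict tip  -> known sampling date
    Returns (rate, {internal node: estimated date}).
    """
    internal = sorted(set(parent.values()))
    col = {n: k + 1 for k, n in enumerate(internal)}  # column 0 holds the rate
    t0 = min(tip_date.values())  # shift dates to keep the system well scaled

    rows, rhs = [], []
    for node, par in parent.items():
        row = np.zeros(len(internal) + 1)
        if node in tip_date:
            row[0] = tip_date[node] - t0  # rate * t_i, with t_i known
        else:
            row[col[node]] = 1.0          # u_i = rate * t_i, with t_i unknown
        row[col[par]] -= 1.0              # minus u_parent
        rows.append(row)
        rhs.append(branch_length[node])

    x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    rate = x[0]
    dates = {n: t0 + x[col[n]] / rate for n in internal}
    return rate, dates

# Toy tree (hypothetical): root R has children A and tip3; A has tip1 and tip2.
# Branch lengths were generated with rate 0.01 and dates R=2000, A=2005.
parent = {"A": "R", "tip1": "A", "tip2": "A", "tip3": "R"}
branch_length = {"A": 0.05, "tip1": 0.05, "tip2": 0.07, "tip3": 0.11}
tip_date = {"tip1": 2010.0, "tip2": 2012.0, "tip3": 2011.0}

rate, dates = least_squares_dates(parent, branch_length, tip_date)
print(rate, dates)
```

Because the toy branch lengths are noise-free, the fit should recover the generating rate and node dates up to floating-point error. With real data, nothing in this sketch prevents an estimated ancestral date from falling after a descendant's date, which is the situation the constrained setting addresses.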

Created:

2015-09-25
