Data from: Misconceptions on missing data in RAD-seq phylogenetics with a deep-scale example from flowering plants

DataONE | Updated 2016-10-11 | Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null

Description:

Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci recovered between more distantly related samples. This observation has led to a related expectation: that RAD-seq data are insufficiently informative for resolving deeper scale phylogenetic relationships. Here we investigate the relationship between missing information among samples at the tips of a tree and information at edges within it. We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations to determine expected patterns of missing information. We also present new empirical results for the angiosperm clade Viburnum (Adoxaceae, with a crown age >50 Ma) for which we examine phylogenetic information at different depths in the tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogenetic informativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverage accounted for similar proportions of missing data as dropout from mutation-disruption. Simulations reveal that mutation-disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patterns of missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the number of parsimony informative sites, and increased by >10X the number of loci with data shared across >40 taxa. Our analysis leads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies.
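The summary statistics mentioned in the description (loci shared between samples, the fraction of data missing per sample, and the number of loci with data for at least a given number of taxa) can be illustrated with a small sketch. This is not the authors' pipeline; it assumes a hypothetical input in which each sample name maps to the set of locus identifiers recovered for that sample, and the function names are invented for illustration.

```python
from itertools import combinations


def shared_loci(presence):
    """Count loci recovered in both samples, for every pair of samples.

    `presence` is a hypothetical mapping: sample name -> set of locus IDs.
    """
    return {
        (a, b): len(presence[a] & presence[b])
        for a, b in combinations(sorted(presence), 2)
    }


def missing_fraction(presence):
    """Fraction of all observed loci that are absent from each sample."""
    all_loci = set().union(*presence.values())
    total = len(all_loci)
    return {s: (total - len(loci)) / total for s, loci in presence.items()}


def loci_with_min_taxa(presence, k):
    """Number of loci that have data for at least `k` samples."""
    counts = {}
    for loci in presence.values():
        for locus in loci:
            counts[locus] = counts.get(locus, 0) + 1
    return sum(1 for n in counts.values() if n >= k)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    example = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {1, 5}}
    print(shared_loci(example))
    print(missing_fraction(example))
    print(loci_with_min_taxa(example, 2))
```

Under mutation-disruption, pairwise sharing would be expected to decline with phylogenetic distance between samples, whereas low coverage would be expected to reduce sharing more haphazardly; comparing a pairwise matrix like this against a distance matrix is one simple way to look for that contrast.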

Created:

2016-10-11
