Data from: Resolving ambiguity of concatenation in multi-locus sequence data for the construction of phylogenetic supermatrices
Source: DataONE. Updated 2013-02-19; indexed 2024-06-27.
Download link:
https://search.dataone.org/view/null
Description:
The construction of supermatrices from mining of DNA metadata is problematic due to incomplete species identification and incongruence of gene trees, both of which hamper sequence concatenation based on Linnaean binomials. We applied methods from graph theory to minimize ambiguity of concatenation globally over a large data set. An initial step establishes sequence clusters for each locus that broadly correspond to Linnaean species. These clusters frequently are not consistent with binomials and specimen identifiers, which greatly complicates the concatenation of clusters across multiple loci. A multipartite heuristic algorithm is used to match clusters across loci and to generate a global set of concatenates that minimizes conflict of taxonomic names. The procedure was applied to all available data on GenBank for the Coleoptera (beetles), including >10500 taxon labels for >23500 sequences of four loci. The BlastClust algorithm was used in the initial clustering step, resulting in 11241 clusters or divergent singletons. Clusters were first used for name assignment of unidentified sequences, resulting in 510 new identifications (13.9% of total unidentified sequences), of which nearly half were by clustering of a specimen at a secondary locus. Concatenation was straightforward only for the 12.8% of all binomials represented by a singleton sequence at each locus with an available entry, while the majority of binomials were associated with multi-sequence clusters in at least one locus. Concatenation of clusters is particularly problematic where limits of DNA-based clusters are inconsistent with the Linnaean binomials, either containing more than one binomial or splitting a binomial among multiple clusters. The current data set contained 1518 such clusters (13.5% of total). By applying a scoring scheme for full and partial name matches in pairs of clusters, the maximum weight set of concatenates produced a matrix of at least 7366 terminals. Varying the match weights for partial matches had little effect on the number of terminals, although if partial matches were disallowed, the number of terminals increased greatly. The resulting supermatrices generally produced tree topologies in good agreement with the Linnaean taxonomy, with fewer terminals compared to trees generated according to standard species labels. The study illustrates a strategy for assembling the Tree-of-Life from an ever more complex primary database.
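The scoring and matching step described above can be illustrated with a small sketch. It is not the authors' implementation: it handles only two loci (the study uses a multipartite heuristic over four), it finds the maximum-weight pairing by exhaustive search, and the weights `FULL_MATCH` and `PARTIAL_MATCH`, the rule that a partial match means a shared genus, and all function names and example clusters are assumptions made for illustration.

```python
from itertools import permutations

FULL_MATCH = 1.0     # hypothetical weight: the two clusters share a binomial
PARTIAL_MATCH = 0.5  # hypothetical weight: the two clusters share only a genus


def pair_score(names_a, names_b, partial=PARTIAL_MATCH):
    """Score two clusters (sets of binomial strings) from different loci."""
    if names_a & names_b:
        return FULL_MATCH
    genera_a = {name.split()[0] for name in names_a}
    genera_b = {name.split()[0] for name in names_b}
    return partial if genera_a & genera_b else 0.0


def best_concatenates(locus_a, locus_b, partial=PARTIAL_MATCH):
    """Return (total_weight, pairs) for the maximum-weight one-to-one pairing.

    locus_a, locus_b: dicts mapping a cluster id to its set of binomials.
    Pairs scoring zero are treated as unmatched. The search is exhaustive,
    so it is only practical for a handful of clusters.
    """
    ids_a, ids_b = list(locus_a), list(locus_b)
    # Pad the shorter side with None so surplus clusters can stay unmatched.
    size = max(len(ids_a), len(ids_b))
    ids_a += [None] * (size - len(ids_a))
    ids_b += [None] * (size - len(ids_b))
    best_total, best_pairs = -1.0, []
    for perm in permutations(ids_b):
        total, pairs = 0.0, []
        for a, b in zip(ids_a, perm):
            if a is None or b is None:
                continue
            score = pair_score(locus_a[a], locus_b[b], partial)
            if score > 0:
                total += score
                pairs.append((a, b))
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def count_terminals(locus_a, locus_b, pairs):
    """Each matched pair becomes one terminal; unmatched clusters stay separate."""
    return len(locus_a) + len(locus_b) - len(pairs)


# Made-up example clusters; cluster A2 and B1 each contain two binomials.
locus_a = {
    "A1": {"Carabus auratus"},
    "A2": {"Carabus nemoralis", "Carabus hortensis"},
    "A3": {"Tribolium castaneum"},
}
locus_b = {
    "B1": {"Carabus auratus", "Carabus nemoralis"},
    "B2": {"Carabus hortensis"},
    "B3": {"Tribolium confusum"},
}

weight, pairs = best_concatenates(locus_a, locus_b)
strict_weight, strict_pairs = best_concatenates(locus_a, locus_b, partial=0.0)
print(weight, pairs, count_terminals(locus_a, locus_b, pairs))
print(strict_weight, strict_pairs, count_terminals(locus_a, locus_b, strict_pairs))
```

In this toy input, allowing partial matches joins A3 with B3 and yields three terminals, while setting `partial=0.0` leaves them separate and yields four, which mirrors the direction of the effect reported in the abstract. Exhaustive search would not scale to the 11241 clusters in the data set, which is presumably why a heuristic is used there.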
Created:
2013-02-19



