Data from: Specimens at the center: an informatics workflow and toolkit for specimen-level analysis of public DNA database data

DataONE · Updated 2016-09-06 · Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null

Description:

Major public DNA databases — NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL) — are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the specimen identity. Thus studies that use DNA supermatrices to construct phylogenetic trees with species at the tips typically do not take advantage of the fact that for many individuals in the public DNA databases, several DNA regions have been sampled; and for many species, two or more individuals have been sampled. Thus these studies typically do not make full use of the multigene datasets in public DNA databases to test species coherence and select optimal sequences to represent a species. In this study, we introduce a set of tools developed in the R programming language to construct individual-based trees from NCBI GenBank data and present a set of trees for the genus Carex (Cyperaceae) constructed using these methods. For the more than 770 species for which we found sequence data, our approach recovered an average of 1.85 gene regions per specimen, up to seven for some specimens, and more than 450 species represented by two or more specimens. Depending on the subset of genes analyzed, we found up to 42% of species monophyletic. We introduce a simple tree statistic—the Taxonomic Disparity Index (TDI)—to assist in curating specimen-level datasets and provide code for selecting maximally informative (or, conversely, minimally misleading) sequences as species exemplars. While tailored to the Carex dataset, the approach and code presented in this paper can readily be generalized to constructing individual-level trees from large amounts of data for any species group.
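The specimen-level bookkeeping described above can be illustrated with a minimal sketch. It is not the authors' R toolkit. It assumes each sequence record carries a species name, a voucher string that identifies the specimen, and a gene region label. The record values, field layout, and the function name `summarize_by_specimen` are all hypothetical. The sketch groups sequences by specimen, then reports the mean and maximum number of gene regions per specimen and the species represented by two or more specimens.

```python
from collections import defaultdict

# Hypothetical records: (accession, species, specimen voucher, gene region).
# All values are invented for illustration; they are not real GenBank entries.
records = [
    ("ACC001", "Carex alpha", "Smith 101", "ITS"),
    ("ACC002", "Carex alpha", "Smith 101", "ETS"),
    ("ACC003", "Carex alpha", "Jones 7", "ITS"),
    ("ACC004", "Carex beta", "Lee 55", "matK"),
    ("ACC005", "Carex beta", "Lee 55", "ITS"),
    ("ACC006", "Carex beta", "Lee 55", "ETS"),
    ("ACC007", "Carex gamma", "Diaz 3", "ITS"),
]


def summarize_by_specimen(records):
    """Group sequences by specimen and summarize gene-region coverage.

    Assumption: a (species, voucher) pair uniquely identifies a specimen.
    """
    if not records:
        raise ValueError("no records to summarize")

    # Distinct gene regions sampled for each specimen.
    genes_by_specimen = defaultdict(set)
    for _accession, species, voucher, gene in records:
        genes_by_specimen[(species, voucher)].add(gene)

    # Distinct specimens sampled for each species.
    specimens_by_species = defaultdict(set)
    for species, voucher in genes_by_specimen:
        specimens_by_species[species].add(voucher)

    counts = [len(genes) for genes in genes_by_specimen.values()]
    return {
        "n_specimens": len(counts),
        "mean_regions_per_specimen": sum(counts) / len(counts),
        "max_regions_per_specimen": max(counts),
        "species_with_multiple_specimens": sorted(
            species
            for species, vouchers in specimens_by_species.items()
            if len(vouchers) >= 2
        ),
    }


summary = summarize_by_specimen(records)
print(summary)
```

Real records often lack a usable voucher or spell it inconsistently, so a practical version would need a normalization step before grouping. That step is omitted here.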

Created:

2016-09-06