Dataset for "To denoise or to cluster? That is not the question. Optimizing pipelines for COI metabarcoding and metaphylogeography"
Source: Mendeley Data. Updated: 2021-03-22. Indexed: 2026-04-09.
Download link:
https://data.mendeley.com/datasets/84zypvmn2b
Description:
This dataset contains the relevant files for a study optimizing and combining denoising and clustering algorithms for COI metabarcoding. The abstract is:
Background. The recent blooming of metabarcoding applications to biodiversity studies comes with some relevant methodological debates. One such issue concerns the treatment of reads by denoising or by clustering methods, which have been wrongly presented as alternatives. It has also been suggested that denoised sequence variants should replace clusters as the basic unit of metabarcoding analyses, missing the fact that sequence clusters are a proxy for species-level entities, the basic unit in biodiversity studies. We argue here that methods developed and tested for ribosomal markers have been uncritically applied to highly variable markers such as cytochrome oxidase I (COI) without conceptual or operational (e.g., parameter setting) adjustment. COI has a naturally high intraspecies variability that should be assessed and reported, as it is a source of highly valuable information. We contend that denoising and clustering are not alternatives. Rather, they are complementary and both should be used together in COI metabarcoding pipelines.
Results. Using a COI dataset from benthic marine communities, we compared two denoising procedures (based on the UNOISE3 and the DADA2 algorithms), set suitable parameters for denoising and clustering, and applied these steps in different orders. Our results indicated that the UNOISE3 algorithm preserved a higher intra-cluster variability. We introduce the program DnoisE to implement the UNOISE3 algorithm taking into account the natural variability (measured as entropy) of each codon position in protein-coding genes. DnoisE retained 88% more sequences than UNOISE3. The order of the steps (denoising and clustering) had little influence on the final outcome.
Conclusions. We highlight the need for combining denoising and clustering, with adequate choice of stringency parameters, in COI metabarcoding. We present a program that uses the coding properties of this marker to improve the denoising step. We recommend researchers to report their results in terms of both denoised sequences (a proxy for haplotypes) and clusters formed (a proxy for species), and to avoid collapsing the sequences of the latter into a single representative. This will allow studies at the cluster (ideally equating species-level diversity) and at the intra-cluster level, and will ease additivity and comparability between studies.
Keywords: metabarcoding, metaphylogeography, COI, denoising, clustering, Operational Taxonomic Units.
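The abstract describes variability "measured as entropy" at each codon position. As a rough illustration of that idea only, the sketch below computes the mean Shannon entropy of the sites falling at codon positions 1, 2 and 3 in a set of aligned, in-frame sequences. The function name, the input format and the averaging scheme are assumptions made for this example; this is not the DnoisE implementation and may differ from how the study computes its entropy values.

```python
import math
from collections import Counter


def codon_position_entropy(sequences):
    """Mean Shannon entropy (bits) of sites at codon positions 1, 2 and 3.

    Assumption for this sketch: sequences are aligned, of equal length,
    and start at the first base of a codon.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")

    totals = [0.0, 0.0, 0.0]  # summed entropy per codon position
    sites = [0, 0, 0]         # number of sites per codon position
    for i in range(length):
        column = Counter(s[i] for s in sequences)
        n = sum(column.values())
        # Shannon entropy of the base frequencies in this alignment column
        h = -sum((c / n) * math.log2(c / n) for c in column.values())
        totals[i % 3] += h
        sites[i % 3] += 1
    return [t / k if k else 0.0 for t, k in zip(totals, sites)]


if __name__ == "__main__":
    # Toy alignment (hypothetical data): only the last site varies.
    toy = ["ATGGCA", "ATGGCC", "ATGGCG", "ATGGCT"]
    print(codon_position_entropy(toy))
```

In the toy alignment only a third codon position varies, so that position gets a higher entropy than the other two. A denoiser that weights differences by such values would, under this assumption, treat a mismatch at a more variable position as more likely to be genuine variation.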
Created:
2021-03-22



