Simple approaches for evaluation of OTUs quality based on dissimilarities arrays
Source: DataCite Commons | Updated: 2023-11-27 | Indexed: 2024-07-29
Download link:
https://figshare.com/articles/dataset/Simple_approaches_for_evaluation_of_OTUs_quality_based_on_dissimilarities_arrays/20764690/1
Description:
Community ecology and macroecology aim at a better understanding of the diversity of life and of its organisation patterns across taxonomic levels, space and time. Such studies depend on reliable inventories of diversity. The concept of species, although continuously debated, has therefore emerged as a cornerstone. After a long history, it is currently addressed in the framework of evolutionary biology, especially the modern synthesis, and beyond. This has led to molecular systematics, which integrates statistical modelling of sequence evolution and inference of phylogenetic trees between lineages. When no such tree is available, two molecular-based methods lead to the clustering or classification of unknown sequences of markers of taxonomic interest: building so-called OTUs with unsupervised clustering, and barcoding with supervised classification. OTU stands for "Operational Taxonomic Unit". An OTU is a set of sequences which are ideally at a distance from one another smaller than a given level called the barcoding gap. The exponential development of Next Generation Sequencing and High Throughput Sequencing has facilitated an industrial production of barcodes from environmental samples with metabarcoding: barcodes are produced in bulk, without knowing from which organism they come, especially in microbial communities. An environmental sample in metabarcoding is a set of reads which are representative of the diversity of the community which has been sampled, and a sound basis for diversity studies. Because an OTU is a set of sequences close to each other, OTUs are expected to represent a category relevant for biodiversity inventories on a molecular basis, where assemblages of OTUs mimic the organisation of communities as assemblages of species. They are expected to represent building blocks of molecular diversity in communities, playing the same role as morphologically or phylogenetically based species.
This can be validated by mapping, when possible, some sequences in the OTU onto taxonomically annotated reference databases. A well-documented difficulty arises in doing so: not all species are available for learning in reference databases, because not all species are known, or because some are poorly represented in molecular reference databases even if morphologically well described. Note that this raises the question of qualifying and quantifying a correspondence (or not) between OTUs and the notion of species, which has been the subject of a long debate. Following \cite{Blaxter2005}, we adopt here the view that we are "agnostic as to whether the taxa we can define using these barcode sequences [...] are species or not". In our work, an OTU is defined as a set of sequences which are mutually close, and there is no attempt to make sense of an OTU, for example by naming it. OTUs are building blocks of molecular-based inventories, and there are various protocols for building them from sets of sequences in an environmental sample. Artefacts in the production of OTUs can occur at different stages, and some tools already exist to clean OTUs. For instance, there could be more OTUs than expected from expert knowledge of the diversity of the system studied. We propose to complement these tools for post-treating the OTUs of a sample, using only the array of pairwise distances between sequences in each OTU. To characterise the quality of an OTU, we refer to an ideal OTU (where all distances within an OTU are smaller than the barcoding gap), and we identify possible deviations from the theoretical pattern of the corresponding distance array. Deviations, when they exist, are not random. We study two deviations, leading to composed OTUs and OTUs with noise. Composed OTUs are artificial mergings of several OTUs, as opposed to single OTUs. We propose a new way to identify composed OTUs.
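The ideal-OTU criterion above (all within-OTU distances smaller than the barcoding gap) can be illustrated with a minimal sketch. This is not the method proposed in the work, whose details are not given in this description. It assumes the OTU is available as a symmetric matrix of pairwise distances (a list of lists) and that a numeric barcoding gap is supplied. The function names, and the use of connected components of the thresholded distance graph as a naive proxy for a composed OTU, are illustrative assumptions.

```python
from itertools import combinations


def fraction_at_or_above_gap(dist, gap):
    """Share of within-OTU pairs whose distance is not smaller than the gap."""
    pairs = list(combinations(range(len(dist)), 2))
    if not pairs:
        return 0.0
    return sum(dist[i][j] >= gap for i, j in pairs) / len(pairs)


def is_ideal_otu(dist, gap):
    """An ideal OTU has every pairwise distance strictly below the gap."""
    return fraction_at_or_above_gap(dist, gap) == 0.0


def gap_components(dist, gap):
    """Connected components of the graph linking sequences closer than the gap.

    Assumption for illustration only: more than one component is read as a
    hint that the OTU may be composed of several single OTUs.
    """
    n = len(dist)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not seen[j] and dist[i][j] < gap:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(component))
    return components


# Hypothetical distance array: sequences 0-2 and 3-4 form two tight groups.
example = [
    [0, 1, 2, 10, 11],
    [1, 0, 1, 10, 10],
    [2, 1, 0, 12, 10],
    [10, 10, 12, 0, 1],
    [11, 10, 10, 1, 0],
]
candidate_groups = gap_components(example, gap=3)  # [[0, 1, 2], [3, 4]]
```

In this toy array, 6 of the 10 pairs lie at or above the gap, so the OTU is not ideal, and the two components suggest an artificial merging of two single OTUs. Real distance arrays are noisier, so a component count alone should be treated as a first diagnostic only.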
Then, once composed OTUs have been split into single OTUs, we consider a second post-treatment to identify the presence of noise. We say that an OTU contains noise if it contains some sequences that are loosely associated with the core sequences and that do not form a compact subgroup of sequences. To the best of our knowledge, the identification and quantification of noise in OTUs have seldom been addressed. Our approach is a classification method based on simple statistics derived from the distance array and on learning methods such as a linear Support Vector Machine and a Stochastic Block Model. We apply the approach to a data set of diatoms from Arcachon Bay.
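The notion of noise described above can be made concrete with a small sketch of per-sequence statistics computed from the distance array. The statistics actually used in the work are not listed in this description, so the choices here are assumptions: the statistic is the fraction of other sequences lying within the barcoding gap, and a fixed cut-off `min_fraction` stands in for the learned classifiers (linear Support Vector Machine, Stochastic Block Model). All names are hypothetical.

```python
def neighbour_fraction(dist, gap):
    """For each sequence, the fraction of other sequences closer than the gap."""
    n = len(dist)
    if n < 2:
        return [1.0] * n
    return [
        sum(dist[i][j] < gap for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def flag_loose_sequences(dist, gap, min_fraction=0.5):
    """Indices of sequences only loosely tied to the rest of the OTU.

    min_fraction is an arbitrary illustrative cut-off, not a value from the
    source; a trained classifier would replace this rule.
    """
    fractions = neighbour_fraction(dist, gap)
    return [i for i, f in enumerate(fractions) if f < min_fraction]


def loose_are_scattered(dist, gap, loose):
    """True if no two flagged sequences are closer than the gap, i.e. the
    flagged sequences do not form a compact subgroup of their own."""
    return all(dist[i][j] >= gap for i in loose for j in loose if i < j)


# Hypothetical OTU: sequences 0-3 form the core; 4 and 5 each touch one core
# sequence but are far from everything else, including each other.
example = [
    [0, 1, 1, 1, 2, 5],
    [1, 0, 1, 1, 5, 5],
    [1, 1, 0, 1, 5, 5],
    [1, 1, 1, 0, 5, 2],
    [2, 5, 5, 5, 0, 8],
    [5, 5, 5, 2, 8, 0],
]
loose = flag_loose_sequences(example, gap=3)  # [4, 5]
```

Here sequences 4 and 5 are each within the gap of a single core sequence and are far from each other, which matches the description of noise: loosely associated with the core and not forming a compact subgroup.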
Provider:
figshare
Created:
2022-09-06



