Processed ASV data from the Insect Biome Atlas Project

Name: Processed ASV data from the Insect Biome Atlas Project
Creator: SciLifeLab
Published: 2024-10-30 00:00:00
License: No description available

Source: figshare.scilifelab.se (updated 2024-10-30; indexed 2025-01-22)

Download link:

https://figshare.scilifelab.se/articles/dataset/Processed_ASV_data_from_the_Insect_Biome_Atlas_Project/27202368/3


Description:

The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).

Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Lokeshwaran M, Granqvist E, Goodsell R, Andersson AF, Lukasik P, Roslin T, Tack A, Ronquist F. 2024. Processed ASV data from the Insect Biome Atlas Project, version 3. doi:10.17044/scilifelab.27202368.v3 or https://doi.org/10.17044/scilifelab.27202368.v3

This dataset contains the results from bioinformatic processing of version 1 of the amplicon sequence variant (ASV) data from the Insect Biome Atlas project (Miraldo et al. 2024), that is, the cytochrome oxidase subunit 1 (CO1) metabarcoding data from Malaise trap samples processed using the FAVIS mild lysis protocol (Iwaszkiewicz-Eggebrecht et al. 2023). The bioinformatic processing involved: (1) taxonomic assignment of ASVs; (2) chimera removal; (3) clustering into OTUs; (4) noise filtering; and (5) cleaning. The clustering step involved resolution of the taxonomic annotation of the cluster and identification of a representative ASV. The noise filtering step involved removal of ASV clusters identified as potentially originating from nuclear mitochondrial DNA (NUMTs) or representing other types of error or noise. The cleaning step involved removal of ASV clusters present in >5% of negative control samples.

ASV taxonomic assignments, ASV cluster designations, consensus taxonomies and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files. Sequences of cluster representatives are provided in compressed FASTA format files. The bioinformatic processing pipeline is further described in Sundh et al. (2024).

NB! All result files include ASVs and clusters that represent biological and synthetic spike-ins.

Methods

Taxonomic assignment

ASVs were taxonomically assigned using k-mer-based methods implemented in a Snakemake workflow available here. Specifically, ASVs were assigned a taxonomy using the SINTAX algorithm in vsearch (v2.21.2) with a CO1 database constructed from the Barcode Of Life Data System (Sundh 2022). ASVs assigned to Class 'Insecta' or 'Collembola' but unassigned at lower taxonomic ranks were then placed into a reference phylogeny of 49,325 insect species (represented by 49,338 sequences) using the phylogenetic placement tool EPA-NG, with subsequent taxonomic assignments using GAPPA. Assignments from this second pass were used to update the first k-mer-based assignments, but only at the order level, leaving child ranks with the ‘unclassified’ prefix.

Chimera removal

The workflow first identifies chimeric ASVs in the input data using the ‘uchime_denovo’ method implemented in vsearch. This was done with a so-called ‘strict samplewise’ strategy, in which each sample was analysed separately (hence ‘samplewise’), comparing only ASVs present in the same sample. Further, ASVs had to be identified as chimeric in all samples where they were present (hence ‘strict’) in order to be removed as chimeric.

ASV clustering

Non-chimeric sequences were then split by family-level taxonomic assignments, and ASVs within each family were clustered in parallel using swarm (v3.1.0) with differences=15. A representative ASV was selected for each generated cluster by taking the ASV with the highest relative abundance across all samples in the cluster. Counts were generated at the cluster level by summing over all ASVs in each cluster.

Consensus taxonomy

A consensus taxonomy was created for each cluster by taking into account the taxonomic assignments of all ASVs in a cluster as well as the total abundance of ASVs.
For each cluster, starting at the most resolved taxonomic level, each unique taxonomic assignment was weighted by the sum of read counts of ASVs with that assignment. If a single weighted assignment made up 80% or more of all weighted assignments at that rank, that taxonomy was propagated to the ASV cluster, including parent rank assignments. If no taxonomic assignment reached the 80% threshold, the algorithm continued to the parent rank in the taxonomy. Taxonomic assignments at any available child ranks were set to the consensus assignment prefixed with ‘unresolved’.

Noise filtering and cleaning

The clustered data were further cleaned of NUMTs and other types of noise using the NEEAT algorithm, which takes taxonomic annotation, correlations in occurrence across samples (‘echo signal’) and evolutionary signatures into account, as well as cluster abundance (Sundh et al., 2024). We used default settings for all parameters in the evolutionary and distributional filtering steps, and removed clusters unassigned at the order level and with fewer than 3 reads summed across each dataset. As a last clean-up step in the noise filtering, clusters containing at least one ASV present in more than 5% of blanks were removed.

Further, we removed ASVs assigned to a reference sequence in the BOLD database annotated as Zoarces gillii (BOLD:AEB5125), a fish found between Japan and eastern Korea. Closer inspection revealed that this was a mis-annotated bacterial sequence, so ASVs assigned to this reference most likely represent bacterial sequences in our dataset. This record was deleted from BOLD after our custom reference database was constructed.

The chimera filtering and ASV clustering methods have been implemented in a Snakemake workflow available here.
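The consensus taxonomy rule described under Methods can be illustrated with a minimal Python sketch. It is not the project's implementation. It assumes that a "unique taxonomic assignment" at a rank means the full lineage down to that rank, that the prefix is written as `unresolved.` followed by the consensus label, and that a cluster with no consensus at any rank is labeled `unresolved` throughout. The function name `consensus_taxonomy`, the helper `make_taxonomy` and all taxon labels are hypothetical.

```python
from collections import defaultdict

# Rank columns as listed for the taxonomy files, least to most resolved.
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "BOLD_bin"]


def consensus_taxonomy(asvs, ranks=RANKS, threshold=0.8):
    """Return a consensus taxonomy for one cluster.

    asvs: list of (taxonomy, reads) pairs, where taxonomy maps rank -> label
    and reads is the summed read count of that ASV.
    """
    total = sum(reads for _, reads in asvs)
    if total <= 0:
        raise ValueError("cluster has no reads")
    # Start at the most resolved rank and walk towards the root.
    for depth in range(len(ranks), 0, -1):
        weights = defaultdict(int)
        for taxonomy, reads in asvs:
            # Assumption: an assignment is the lineage down to this rank.
            lineage = tuple(taxonomy[rank] for rank in ranks[:depth])
            weights[lineage] += reads
        best, weight = max(weights.items(), key=lambda item: item[1])
        if weight / total >= threshold:
            consensus = dict(zip(ranks, best))  # parent ranks are propagated
            for child in ranks[depth:]:
                # Assumed label format for unresolved child ranks.
                consensus[child] = f"unresolved.{best[-1]}"
            return consensus
    # Assumption: no rank reached the threshold, so nothing is resolved.
    return {rank: "unresolved" for rank in ranks}


def make_taxonomy(species, bold_bin):
    """Build a made-up lineage; these are not real taxa or BOLD BINs."""
    return {
        "Kingdom": "Kingdom_a", "Phylum": "Phylum_a", "Class": "Class_a",
        "Order": "Order_a", "Family": "Family_a", "Genus": "Genus_a",
        "Species": species, "BOLD_bin": bold_bin,
    }


example_cluster = [
    (make_taxonomy("Species_a1", "BIN_1"), 70),
    (make_taxonomy("Species_a1", "BIN_2"), 20),
    (make_taxonomy("Species_a2", "BIN_3"), 10),
]
result = consensus_taxonomy(example_cluster)
print(result["Species"], result["BOLD_bin"])
```

In this example no BOLD_bin reaches 80% of the reads (the largest share is 70%), so the sketch falls back to the species rank, where Species_a1 holds 90% of the reads, and marks BOLD_bin as unresolved.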
The chimera filtering and ASV clustering workflow takes as input:

1) The ASV sequences in FASTA format
2) A tab-delimited file of counts of ASVs (rows) in samples (columns)

Data for 1) and 2) are available at https://doi.org/10.17044/scilifelab.25480681.v5

Cleaning of ASV clusters in controls and identification of spike-ins was done with a custom R script available here.

Available data

Processed ASV data files

ASV taxonomic assignments, non-chimeric ASV cluster designations, consensus taxonomies, sequences of cluster representatives and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files. Files are organized by country (Sweden and Madagascar), marked by the suffixes SE and MG, respectively.

Taxonomic assignments

The files asv_taxonomy_[SE|MG].tsv.gz are tab-separated files with taxonomic assignments using SINTAX+EPA-NG for all ASVs.

Columns:
  ASV: the id of the ASV
  Kingdom, Phylum, Class, Order, Family, Genus, Species, BOLD_bin: taxonomic assignment for each rank. If an ASV was unclassified at a particular rank, the taxonomic label is prefixed with ‘unclassified.’ followed by the taxonomic assignment of the most resolved parent rank.

The files asv_taxonomy_sintax_[SE|MG].tsv.gz, asv_taxonomy_epang_[SE|MG].tsv.gz and asv_taxonomy_vsearch_[SE|MG].tsv.gz have the same structure, but contain results from assignments with SINTAX, EPA-NG and VSEARCH, respectively.

Cluster assignments

The files cluster_taxonomy_[SE|MG].tsv are tab-separated files containing all non-chimeric ASVs (that is, the ASVs passing the chimera-filtering step) with their corresponding taxonomic and cluster assignments.
Columns:
  ASV: ASV id
  cluster: name of designated cluster
  median: the median of normalized reads across all samples for each ASV
  Kingdom, Phylum, Class, Order, Family, Genus, Species, BOLD_bin: taxonomic assignment of each ASV
  representative: contains 1 if the ASV is the representative of its cluster, otherwise 0

Cluster counts

The files cluster_counts_[SE|MG].tsv are tab-separated files with read counts of ASV clusters (rows) in samples (columns). Counts have been summed for all ASVs belonging to each cluster. Note that these files contain counts for biological spike-ins and, for Sweden, also synthetic spike-ins.

Sequences of cluster representatives

The files cluster_reps_[SE|MG].fasta are text files in FASTA format with representative sequences for each cluster. The FASTA headers have the format “>ASV_ID CLUSTER_NAME”.

Consensus taxonomy

The files cluster_consensus_taxonomy_[SE|MG].tsv are tab-separated files with the consensus taxonomy of each generated ASV cluster. Columns are the same as in asv_taxonomy_[SE|MG].tsv.

Noise-filtered data

The files prefixed with 'noise_filtered' contain data that have been cleaned of NUMTs and other types of noise using the NEEAT algorithm. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering step.

Cleaned noise-filtered data

The files prefixed with 'cleaned_noise_filtered' contain data that have been cleaned of NUMTs and other types of noise using the NEEAT algorithm, and further cleaned of clusters present in >5% of blanks. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering and cleaning steps.

Additional files

The files removed_control_tax_[SE|MG].tsv.gz contain the ASV clusters removed from each dataset as part of cleaning. The files spikeins_tax_[SE|MG].tsv.gz contain the taxonomic assignments of the biological spike-ins identified.
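The file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not code shipped with the dataset. It relies only on the documented column names (ASV, cluster, representative) and the documented FASTA header format. The function names and the inline sample records are hypothetical, and the sample table omits most taxonomy columns for brevity.

```python
import csv
import io


def read_representatives(taxonomy_tsv):
    """Map cluster name -> representative ASV id from a cluster_taxonomy table.

    taxonomy_tsv: an open text handle over tab-separated data with a header row.
    """
    reader = csv.DictReader(taxonomy_tsv, delimiter="\t")
    return {row["cluster"]: row["ASV"] for row in reader if row["representative"] == "1"}


def read_fasta_headers(fasta):
    """Map cluster name -> ASV id from headers of the form '>ASV_ID CLUSTER_NAME'."""
    reps = {}
    for line in fasta:
        if line.startswith(">"):
            asv_id, cluster = line[1:].split(maxsplit=1)
            reps[cluster.strip()] = asv_id
    return reps


# Made-up records that only mimic the documented layout.
SAMPLE_TSV = (
    "ASV\tcluster\tmedian\tOrder\trepresentative\n"
    "asv_001\tcluster_A\t0.12\tOrder_a\t1\n"
    "asv_002\tcluster_A\t0.03\tOrder_a\t0\n"
    "asv_003\tcluster_B\t0.40\tOrder_b\t1\n"
)
SAMPLE_FASTA = (
    ">asv_001 cluster_A\n"
    "ACGTACGT\n"
    ">asv_003 cluster_B\n"
    "TTGACCGA\n"
)

tsv_reps = read_representatives(io.StringIO(SAMPLE_TSV))
fasta_reps = read_fasta_headers(io.StringIO(SAMPLE_FASTA))
print(tsv_reps == fasta_reps)
```

For the gzip-compressed files, the same functions should work on a handle opened with `gzip.open(path, "rt")` in place of the `io.StringIO` objects used here. Comparing the two mappings is a quick consistency check between a cluster_taxonomy table and the matching cluster_reps FASTA file.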
References:

Iwaszkiewicz-Eggebrecht, E., Łukasik, P., Buczek, M., Deng, J., Hartop, E. A., Havnås, H., ... & Miraldo, A. (2023). FAVIS: Fast and versatile protocol for non-destructive metabarcoding of bulk insect samples. PLoS ONE, 18(7), e0286272.

Miraldo, A., Iwaszkiewicz-Eggebrecht, E., Sundh, J., Manoharan, L., Granqvist, E., Andersson, A., Łukasik, P., Roslin, T., Tack, A. J. M., & Ronquist, F. (2024). Amplicon sequence variants from the Insect Biome Atlas project (Version 5). SciLifeLab. https://doi.org/10.17044/scilifelab.25480681.v5

Sundh, J. (2022). COI reference sequences from BOLD DB (Version 4). SciLifeLab. https://doi.org/10.17044/scilifelab.20514192.v4


Provider:

SciLifeLab
