Amplicon sequence variants from the Insect Biome Atlas project

Name: Amplicon sequence variants from the Insect Biome Atlas project
Creator: SciLifeLab
Published: 2024-11-25 00:00:00
License: No description available

Source: figshare.scilifelab.se (updated 2024-11-25; indexed 2025-01-21)

Download link:

https://figshare.scilifelab.se/articles/dataset/Amplicon_sequence_variants_from_the_Insect_Biome_Atlas_project/25480681/1


Description:

General information

The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).

Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Lokeshwaran M, Granqvist E, Andersson AF, Lukasik P, Roslin T, Tack A, Ronquist F. 2024. Dataset of amplicon sequence variants (ASVs) from the Insect Biome Atlas Project, version 1. https://doi.org/10.17044/scilifelab.25480681

Dataset description

This dataset (version 1) contains amplicon sequence variants (ASVs) generated from high-throughput sequencing of the cytochrome c oxidase subunit I (CO1) gene from Malaise trap samples processed with mild lysis, with the exception of 15 samples for which we also provide sequencing data from homogenates and preservative ethanol. It includes the ASV sequences and abundance information (number of reads), together with the metadata files needed to interpret and analyse the data further. Future versions of the dataset will include additional data.

Methods

Samples were sequenced using Illumina technology. Raw data are available at the European Nucleotide Archive (ENA) under project PRJEB61109. The raw sequence data was preprocessed using a Snakemake workflow available at https://github.com/biodiversitydata-se/amplicon-multi-cutadapt. Preprocessed reads were then used as input to the AmpliSeq Nextflow (v.2.1.0) pipeline to generate Amplicon Sequence Variants (ASVs).

Available data

In this dataset we provide two types of files: ASV files and metadata files. Files marked with 'SE' contain data from Sweden while those marked with 'MG' contain data from Madagascar.

The file shasum.txt contains checksums for each of the files. After downloading you can check file integrity by running:

    shasum -c shasum.txt

ASV files

This dataset contains ASV sequences in fasta format (CO1_asv_seqs_SE.fasta.gz and CO1_asv_seqs_MG.fasta.gz) and counts of ASVs in each sample (CO1_asv_counts_SE.tsv.gz and CO1_asv_counts.MG.tsv.gz). The Swedish dataset contains 636,297 ASVs in 4,873 samples (including negative and positive control samples). The Madagascar dataset contains 559,023 ASVs in 2,081 samples (including negative and positive control samples).

Metadata files

There are three types of metadata files included in this dataset:

    sequencing_metadata files, with information about samples that were processed in the lab and sequenced
    samples_metadata files, with information about samples that were collected in the field
    sites_metadata files, with information about sites where samples were collected

Sequencing metadata files

Two sequencing metadata files are included in this dataset (CO1_sequencing_metadata_SE.tsv and CO1_sequencing_metadata_MG.tsv) with information about samples that were sequenced. Columns in these files are as follows:

    sampleID_NGI: Sample id given by the sequencing facility (matching the columns in the counts file)
    sampleID_HISTORICAL: Custom user id
    sampleID_FIELD: Sample id from field sampling
    sampleID_LAB: Sample id from handling in the lab
    dataset: Dataset designation for each sample
    lab_sample_type: Type of sample, e.g. 'sample', 'buffer_blank', 'pcr_neg' etc.
    country: Country of origin for sample
    biological_spikes: True if sample has biological spike ins added
    artificial_spikes: True if sample has artificial spike ins added at the time of DNA purification
    sample_metadata_file: Corresponding metadata file for sample
    lysate_rack_ID: Identification of 96-well plate where lysate aliquot is stored in the lab (internal use only)
    lysate_well_ID: Identification of well position where lysate aliquot is stored in the lab (internal use only)
    dna_plate_ID: Identification of 96-well plate where purified DNA is stored in the lab (internal use only)
    dna_plate_well_ID: Identification of well position where lysate is stored in the lab (internal use only)
    sequencing_batch: Custom user id for sequencing batch number
    sequencing_batch_NGI: Sequencing batch number given by the sequencing facility
    notes_lab: Additional information about sample processing in the lab (only for SE file)
    sequencing_status: Additional information about sample sequencing status. If a sample has a value of “sequencing failed” in this column, then this sample will be missing from the ASV counts file
    study_accession_ENA: Study identification at the European Nucleotide Archive
    sample_accession_ENA: Sample identification at the European Nucleotide Archive
    experiment_accession_ENA: Experiment identification at the European Nucleotide Archive
    run_accession_ENA: Run identification at the European Nucleotide Archive

Samples metadata files

Two samples_metadata files are included in this dataset (samples_metadata_malaise_SE.tsv and samples_metadata_malaise_MG.tsv) with information about each sample that was collected in the field. Columns in these files are as follows:

    sampleID_FIELD: Sample id from field sampling
    trapID: Malaise trap id from field sampling
    biomass_grams: Wet weight of each bulk sample
    placing_time: Time when sampling started
    placing_date: Date when sampling started
    collecting_time: Time when sampling ended
    collecting_date: Date when sampling ended
    duration_min: Total number of minutes the sample was collecting
    trap_condition_collection: Condition of the Malaise trap at the time of collecting the sample from the trap (good; acceptable; poor)
    sample_ethanol_conc: Concentration of preservative ethanol at the time of DNA extraction (only for SE file)
    processing_group: Processing batch id (for internal use only)
    sample_accession_ENA: Sample identification at the European Nucleotide Archive
    sample_status: Additional information about sample processing status in the lab

Sites metadata files

There are two files that contain information about sampling sites, one for each country: sites_metadata_SE.tsv and sites_metadata_MG.tsv. Columns in these files are as follows:

    siteID: Sampling site id number. Note that for some sites there can be several Malaise traps assembled (malaise_trap_type=Multitrap)
    trapID: Malaise trap id from field sampling
    latitude_WGS84: Latitude in WGS84 coordinate system. This info specifies the Malaise trap location at the sampling site
    longitude_WGS84: Longitude in WGS84 coordinate system. This info specifies the Malaise trap location at the site
    trap_habitat: Habitat where the Malaise trap was located
    malaise_trap_type: Identifies if there are multiple traps assembled at the sampling site (Multitrap) or only one (Single_trap)
    parkID: Name of national park (for MG only)
    provinceID: Name of province (for MG only)
    NILS_mhabitat: Habitat for nearest plot of the National Inventory of Landscapes in Sweden (NILS) from the Malaise trap location (only for SE file). For more information about NILS sampling design, check: https://www.slu.se/centrumbildningar-och-projekt/nils_old/Datainsamling/bakgrund-och-mal/
    NILS_square: Identification of nearest NILS square for sampling site (only for SE file)
    NILS_plot: Identification of nearest NILS plot to the Malaise trap location (only for SE file)
    trap_orientation_degrees_S: Orientation in degrees of the collection head of the Malaise trap
    notes: Notes associated with the Malaise trap (only for SE file)
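
Where the shasum command-line tool is not available, the integrity check described above could be approximated in Python. The following is a minimal sketch, not part of the dataset: the function names are hypothetical, and it assumes that each line of shasum.txt has the usual "<hex digest>  <filename>" layout and that the hash algorithm can be inferred from the digest length (40 hex characters for SHA-1, 64 for SHA-256, and so on).

```python
import hashlib
from pathlib import Path

# Assumed mapping from hex digest length to hash algorithm.
ALGO_BY_HEX_LEN = {40: "sha1", 56: "sha224", 64: "sha256", 96: "sha384", 128: "sha512"}


def file_digest(path, algo):
    """Hash a file in 1 MiB chunks so large .gz files are not loaded into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksums(checksum_file):
    """Return {filename: bool} for every entry of a shasum-style checksum file.

    File names are resolved relative to the directory of the checksum file.
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # a leading '*' marks binary mode in shasum output
        algo = ALGO_BY_HEX_LEN.get(len(expected))
        if algo is None:
            raise ValueError(f"Unrecognised digest length for {name!r}")
        target = checksum_file.parent / name
        results[name] = target.exists() and file_digest(target, algo) == expected.lower()
    return results


# Example (run in the download directory):
#   for name, ok in verify_checksums("shasum.txt").items():
#       print(name, "OK" if ok else "FAILED")
```

A missing file is reported as failed rather than raising an error, so a partial download is still summarised in full.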
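
Because sampleID_NGI in the sequencing metadata matches the column names of the counts files, the two can be linked to separate field samples from controls. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the counts file has one row per ASV with the ASV identifier in the first column, that cells hold integer read counts, and that 'sample' is the lab_sample_type value used for non-control samples (the description lists it only as an example value).

```python
import csv
import gzip


def sample_ids_of_type(metadata_tsv, sample_type="sample"):
    """Collect sampleID_NGI values whose lab_sample_type equals sample_type."""
    with open(metadata_tsv, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return {
            row["sampleID_NGI"]
            for row in reader
            if row["lab_sample_type"] == sample_type
        }


def reads_per_sample(counts_tsv_gz, keep_ids):
    """Sum read counts per sample column, streaming the gzipped table row by row."""
    with gzip.open(counts_tsv_gz, "rt", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        # Column 0 is assumed to hold the ASV identifier; the rest are samples.
        keep_cols = [
            (i, name) for i, name in enumerate(header) if i > 0 and name in keep_ids
        ]
        totals = {name: 0 for _, name in keep_cols}
        for row in reader:
            for i, name in keep_cols:
                totals[name] += int(row[i] or 0)  # treat empty cells as zero
    return totals


# Example (file names as listed in the dataset description):
#   ids = sample_ids_of_type("CO1_sequencing_metadata_SE.tsv")
#   totals = reads_per_sample("CO1_asv_counts_SE.tsv.gz", ids)
```

Streaming the table keeps memory use low, which matters for a matrix of several hundred thousand ASVs by several thousand samples. Passing a different sample_type, such as 'buffer_blank' or 'pcr_neg', gives the corresponding totals for controls.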

