Bioinformatic pipeline: Vast differences in strain-level diversity in the gut microbiota of two closely related honey bee species

Mendeley Data. Updated 2024-03-27; indexed 2024-06-27

Download link:

https://zenodo.org/record/3747314

Description:

This dataset contains the full bioinformatic pipeline used to analyze metagenomic samples in the study "Vast differences in strain-level diversity in the gut microbiota of two closely related honey bee species" (Ellegaard et al. 2020, Current Biology). New metagenomic samples were generated for the study; the raw data are available on the NCBI Sequence Read Archive under accession PRJNA59809.

The submission consists of 9 tarballs, described below. Download and unpack them to view the contents (tar -zxvf filename.tar.gz). Within each tarball, every directory contains a README.txt file describing the contents of that directory. Due to size constraints, some intermediate files have been omitted, and some workflows are demonstrated for a subset of the data. However, the full analysis can be reproduced from the raw data using the provided scripts.

All scripts are included within the directories where they were applied. Perl scripts contain documentation, which can be viewed by typing: "perl script_name.pl -h". For R scripts, the usage is given as a comment in the top lines of each script. Note that many of the scripts require specific input files to be present in the run directory. Their usage is demonstrated within the workflow directories in bash scripts (*.sh). Commands used for generating plots and some statistics are given within the workflow directories in text files named "R.commands", when applicable. Aside from custom code, the pipeline also uses various open-source software packages, which are detailed in the file "software_dependencies.txt".

Note that while many of the scripts run fast on any computer, some steps of the pipeline are computationally demanding and require significant computing time as well as storage space. When a script is known to be time-consuming, this is indicated in the script's help message.

Description of tarballs:

raw_data_processing.tar.gz: Describes the quality control and trimming of raw data, and includes information on the sequencing run.

databases.tar.gz: Contains all databases used for analysis, in addition to relevant metadata.

mapping_stats.tar.gz: Contains a file with the number of reads mapped to the honey bee gut microbiota database and to the host genomes, for each sample. Bash scripts are provided, detailing how the mapping was done and quantified.

orthologs_phylogenies.tar.gz: Contains the pipeline for inferring orthologous gene families and core genome phylogenies, as well as scripts for filtering single-copy core gene families.

assemblies.tar.gz: Contains the final de novo metagenome assembly files (contig fasta files), generated for both complete and rarefied read subsets. Bash scripts detailing the assembly commands are also provided.

SDP_validation.tar.gz: Contains the pipeline for metagenomic validation of candidate SDPs. Final output files, containing the percentage identity of recruited metagenomic ORFs to database core genes, are provided for each candidate SDP. Additionally, a small example dataset is provided, in which the intermediate result files can be viewed.

community_profiling.tar.gz: Contains the pipeline for community profiling, i.e. the quantification of individual community members (SDPs) across samples. Final output files are provided, including mapped read coverage on core gene families and corresponding plots. A small bam file (containing data from a single subset sample) is also provided to demonstrate the pipeline, together with all scripts used.

snv_profiling.tar.gz: Contains the pipeline used for SNV profiling, including filtering and analysis. Final filtered vcf files are provided for each SDP. Analytical output files are also provided, including data on shared SNV fractions, distance matrices, and cumulative curves.

metagenomic_ORF_analyses.tar.gz: Contains the pipeline for analysis of metagenomic ORFs. This includes prediction of ORFs, clustering, annotation and functional characterization. ORF sequences, annotation files, and cluster files are provided.
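The unpacking step described in this record can be scripted so that all downloaded tarballs are extracted in one pass and the README.txt files are listed as a starting point for browsing. The following is a minimal sketch, not part of the published pipeline: the function names unpack_all and list_readmes and the directory arguments are hypothetical, and it assumes the tarballs have already been downloaded into a single directory.

```shell
#!/bin/sh
# Sketch only: helper names and directory layout are assumptions,
# not taken from the published pipeline.
set -eu

unpack_all() {
    # $1 = directory holding the downloaded *.tar.gz files
    # $2 = directory to extract into (created if missing)
    mkdir -p "$2"
    for tarball in "$1"/*.tar.gz; do
        [ -e "$tarball" ] || continue   # glob matched nothing
        tar -zxf "$tarball" -C "$2"     # same flags as in the description, minus -v
    done
}

list_readmes() {
    # $1 = directory to search; prints one README.txt path per line
    find "$1" -type f -name 'README.txt' | sort
}
```

A possible invocation, assuming the tarballs sit in the current directory, is `unpack_all . unpacked` followed by `list_readmes unpacked`. After reading the relevant README.txt, the help text of an individual Perl script can be shown with "perl script_name.pl -h", as stated in the description.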

Created:

2023-06-28