MIMIC2: Murine Intestinal Microbiota Integrated Catalog v2
DataCite Commons. Updated 2025-05-15; indexed 2025-04-16.
Download link:
https://entrepot.recherche.data.gouv.fr/citation?persistentId=doi:10.15454/L11MXM
Description:
Dataset overview
The MIMIC2 dataset provides a non-redundant, high-quality catalog of 5.0 million genes, 6,967 Metagenome-Assembled Genomes (MAGs), and 1,252 Metagenomic Species Pangenomes (MSPs). It can be used to analyze shotgun sequencing data of the murine gut microbiota.
How to use this dataset
Create a gene abundance table by aligning the reads from each sample against the catalog, for example with Meteor or NGLess. Then normalize the raw counts by gene length.
Taxonomic profiling: the abundance of each species can be estimated as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of its 100 marker genes are detected.
Methods
Data sources
The MIMIC2 dataset was constructed from two data sources:
Source 1: the Mouse Gastrointestinal Bacterial Catalogue (MGBC), a compilation of 276 genomes from cultured isolates and 45,218 metagenome-assembled genomes (MAGs) from 1,960 publicly available mouse metagenomes.
Source 2: 68 samples from Messaoudene et al. (PRJNA783624) and 85 deeply sequenced samples from bioproject CNP0000619, published by Xiao et al.
Metagenomic assembly
De novo metagenomic assembly was performed on the 153 samples from data Source 2. First, sequencing adapters were removed and reads were trimmed with fastp. Reads that mapped to the host genome (GCF_000001635.27) with bowtie2 were then removed with samtools. Finally, metagenomic assembly was performed with metaSPAdes, and contigs shorter than 1,500 bp were removed.
MAGs recovery
Reads of each sample from data Source 2 were aligned to their respective assembly with bowtie2, and the alignments were stored as sorted, indexed BAM files with samtools. Contig coverage was then computed in each sample with jgi_summarize_bam_contig_depths. MAGs were generated with MetaBAT 2, and their quality was assessed with CheckM. MAGs with completeness < 70%, contamination > 5%, or N50 < 8 kb were discarded.
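The MAG quality thresholds above can be written as a small filter. This is only an illustrative sketch, not the authors' pipeline: the `MagStats` record and its field names are hypothetical, the values are assumed to have been parsed already from a quality report, and "8 kb" is assumed to mean 8,000 bp.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MagStats:
    """Hypothetical per-MAG summary (field names are illustrative)."""
    name: str
    completeness: float   # percent
    contamination: float  # percent
    n50_bp: int           # N50 in base pairs


MIN_COMPLETENESS = 70.0   # MAGs below this are discarded
MAX_CONTAMINATION = 5.0   # MAGs above this are discarded
MIN_N50_BP = 8_000        # assumption: "8 kb" is read as 8,000 bp


def passes_quality(mag: MagStats) -> bool:
    """Keep a MAG only if none of the three discard conditions applies."""
    return (
        mag.completeness >= MIN_COMPLETENESS
        and mag.contamination <= MAX_CONTAMINATION
        and mag.n50_bp >= MIN_N50_BP
    )


def filter_mags(mags: Iterable[MagStats]) -> List[MagStats]:
    """Return the MAGs that pass all thresholds, preserving input order."""
    return [m for m in mags if passes_quality(m)]


if __name__ == "__main__":
    bins = [
        MagStats("bin.1", 92.3, 1.2, 45_000),
        MagStats("bin.2", 65.0, 0.5, 30_000),   # too incomplete
        MagStats("bin.3", 88.0, 7.4, 52_000),   # too contaminated
        MagStats("bin.4", 75.0, 2.0, 6_500),    # N50 too low
    ]
    for mag in filter_mags(bins):
        print(mag.name)
```

Values exactly at a threshold (70% completeness, 5% contamination, 8,000 bp N50) are kept, because the text only discards MAGs strictly beyond those limits.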
Non-redundant gene catalog
Genes were predicted on all contigs from data Source 2 with Prodigal (parameters: -m -p meta). Likewise, genes were predicted on all genomes from data Source 1 (MGBC) with Prodigal (parameters: -m -p single). Genes from the two data sources were pooled, and those that were shorter than 90 bp or incomplete were discarded. Finally, genes were clustered with cd-hit-est (parameters: -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0), choosing genes from the longest contigs as cluster representatives.
MSPs recovery
Samples from 19 cohorts (listed below) were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (5.0 million genes quantified in 1,374 samples). Co-abundant genes were then binned into 1,252 Metagenomic Species Pangenomes (MSPs, i.e. clusters of more than 500 co-abundant genes that likely belong to the same microbial species) using MSPminer.
The 19 cohorts used to recover the MSPs are: PRJNA783624, CNP0000619, PRJEB15095, PRJEB22007, PRJEB22710, PRJEB31298, PRJEB32790, PRJEB32890, PRJEB3374, PRJEB36943, PRJEB44286, PRJEB7759, PRJNA293255, PRJNA390686, PRJNA397886, PRJNA515074, PRJNA540893, PRJNA549182, PRJEB40719.
MSPs taxonomic annotation
Representative genomes of the MMGC collection were annotated with GTDB-Tk based on GTDB r202. The taxonomic annotation of MMGC genomes was then propagated to the corresponding MSPs. For MSPs without any corresponding MAG, taxonomic annotation was performed by aligning all core and accessory genes against the representative genomes of the GTDB database (release r202) using blastn (version 2.7.1, task = megablast, word_size = 16). A species-level assignment was given if more than 50% of the genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and a mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes had the same annotation.
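The assignment rule for MSPs without a corresponding MAG can be sketched as follows. The sketch rests on several assumptions that the description does not spell out: each gene contributes at most one best hit, the mean identity and coverage are computed over the genes that match the majority species, and the higher-rank fallback is tried from genus up to superkingdom, stopping at the first rank shared by more than 50% of the genes. The `GeneHit` record, the `assign_taxonomy` function, and the taxon names in the example are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Sequence, Tuple

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


@dataclass(frozen=True)
class GeneHit:
    """Hypothetical best blastn hit of one MSP gene."""
    lineage: Tuple[str, ...]  # one taxon name per entry of RANKS
    identity: float           # percent nucleotide identity
    coverage: float           # percent of the gene length covered


def _majority(hits: Sequence[GeneHit], n_genes: int, depth: int
              ) -> Optional[Tuple[Tuple[str, ...], List[GeneHit]]]:
    """Return (lineage prefix, its hits) if one prefix covers > 50% of all genes."""
    prefix, count = Counter(h.lineage[:depth] for h in hits).most_common(1)[0]
    if count / n_genes > 0.5:
        return prefix, [h for h in hits if h.lineage[:depth] == prefix]
    return None


def assign_taxonomy(hits: Sequence[GeneHit], n_genes: int) -> Optional[Tuple[str, str]]:
    """Return (rank, taxon) for an MSP with n_genes genes, or None if unassigned."""
    if n_genes <= 0 or not hits:
        return None
    # Species level: majority species plus mean identity and coverage thresholds.
    species = _majority(hits, n_genes, len(RANKS))
    if species is not None:
        prefix, matched = species
        if (mean(h.identity for h in matched) >= 95.0
                and mean(h.coverage for h in matched) >= 90.0):
            return ("species", prefix[-1])
    # Fallback: walk from genus up to superkingdom.
    for depth in range(len(RANKS) - 1, 0, -1):
        higher = _majority(hits, n_genes, depth)
        if higher is not None:
            return (RANKS[depth - 1], higher[0][-1])
    return None


if __name__ == "__main__":
    lineage_a = ("K", "P", "C", "O", "F", "G", "G sp. A")  # placeholder names
    print(assign_taxonomy([GeneHit(lineage_a, 97.0, 95.0)] * 6, n_genes=10))
```

Counting whole lineage prefixes rather than bare names avoids merging two taxa that happen to share a name under different parents.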
Construction of the phylogenetic tree
Using fetchMGs, 39 universal phylogenetic marker genes were extracted from the 1,252 MSPs (or from the corresponding MAGs, if available). The markers were then aligned separately with MUSCLE. The 40 alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
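To round off the description, the profiling recipe from "How to use this dataset" (normalize raw counts by gene length, average the first 100 core genes, require at least 10 detected markers) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the behaviour of Meteor or NGLess: a gene counts as "detected" when its normalized abundance is greater than zero, genes missing from the table count as zero, and zeros are included in the average. All function and parameter names are hypothetical.

```python
from typing import Dict, Sequence


def normalize_by_length(raw_counts: Dict[str, float],
                        gene_lengths: Dict[str, int]) -> Dict[str, float]:
    """Divide each gene's raw read count by its length in bp."""
    return {gene: count / gene_lengths[gene] for gene, count in raw_counts.items()}


def species_abundance(normalized: Dict[str, float],
                      core_genes: Sequence[str],
                      n_markers: int = 100,
                      min_detected: int = 10) -> float:
    """Mean abundance of the first n_markers core genes, or 0.0 if too few are detected.

    core_genes is assumed to be ordered as in the MSP definition,
    so that its first n_markers entries are the marker genes.
    """
    markers = list(core_genes[:n_markers])
    if not markers:
        return 0.0
    values = [normalized.get(gene, 0.0) for gene in markers]
    detected = sum(1 for v in values if v > 0)  # assumption: detected means > 0
    if detected < min_detected:
        return 0.0  # species treated as absent
    return sum(values) / len(values)


if __name__ == "__main__":
    genes = [f"gene_{i}" for i in range(120)]          # placeholder gene names
    lengths = {g: 1000 for g in genes}
    counts = {g: 500.0 for g in genes[:25]}            # 25 of the markers have reads
    normalized = normalize_by_length(counts, lengths)
    print(species_abundance(normalized, genes))
```

Returning 0.0 for species that fail the detection rule keeps the output table rectangular across samples; a caller who needs to distinguish "absent" from "not evaluated" could return `None` instead.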
Provider:
Recherche Data Gouv
Created:
2021-11-24
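Summary
Dataset introduction

Background and challenges
Background overview
MIMIC2 is a comprehensive gene and genome catalog of the mouse gut microbiota. It contains 5.0 million non-redundant genes, 6,967 metagenome-assembled genomes, and 1,252 metagenomic species pangenomes, and is intended for analyzing shotgun sequencing data of the mouse gut microbiota. The dataset was built from cultured isolates and public metagenomic data using bioinformatics workflows that include metagenomic assembly, gene clustering, and species annotation, and it supports gene abundance analysis and taxonomic profiling.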