MetaNutriUnify, a curated and unified human gut metagenomics and nutritional data collection
Source: DataCite Commons. Updated 2025-05-16; indexed 2025-04-16.
Download link:
https://entrepot.recherche.data.gouv.fr/citation?persistentId=doi:10.57745/U0XZBX
Description:
MetaNutriUnify Collection
This work is linked to the iTARGET project (https://qualiment.fr/des-projets-pour-anticiper-les-besoins-de-recherche-des-entreprises-agroalimentaires-2022/hot-topics/), which aims to target, in silico and in vitro, healthy gut bacteria with fiber-degrading metabolic potential. In this context, we developed MetaNutriUnify, the first collection of curated and harmonized metagenomic data with unified nutritional data from public study cohorts, with particular attention to fiber.
Overview of the public studies included in MetaNutriUnify
MetaNutriUnify is composed of 21 harmonized and curated public projects with available shotgun metagenomes from adult human stool samples, together with nutritional and anthropometric data. It comprises 949 individuals from 15 countries, totaling 1656 metagenomes, from which we generated abundance tables of microbial species and associated functional modules. We also unified nutritional data, including diet type, study type (observational/interventional), time points for stool sampling, diet intervention and associated information, and macro- and micronutrients when available, and we reported the available anthropometric data (gender, country, age, weight, height, BMI).
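
The overview figures (21 projects, 949 individuals, 15 countries) can be cross-checked against the per-study entries in the study list of this description. The following is a minimal sketch of such a check, not part of the deposited code: the tuples are transcribed by hand from that list, and the names `STUDIES` and `summarize` are illustrative only.

```python
# Each entry: (BioProject accession, ISO country codes, number of subjects),
# transcribed by hand from the study list in this description.
STUDIES = [
    ("PRJEB8249", ("SWE",), 21),
    ("PRJNA278393", ("TZA", "ITA"), 33),
    ("PRJNA328899", ("MNG", "CHN"), 110),
    ("PRJNA305507", ("USA",), 33),
    ("PRJEB28687", ("USA", "THA"), 50),
    ("PRJEB32794", ("IRL",), 37),
    ("PRJNA472785", ("USA",), 12),
    ("PRJNA386503", ("USA",), 4),
    ("PRJNA397112", ("IND",), 88),
    ("PRJEB33500", ("ITA",), 82),
    ("PRJNA647720", ("USA",), 20),
    ("PRJNA755720", ("ESP",), 20),
    ("PRJNA892265", ("ESP",), 20),
    ("PRJEB42906", ("USA",), 50),
    ("PRJEB45944", ("NLD",), 149),
    ("PRJEB48663", ("FRA",), 39),
    ("PRJNA762543", ("SGP",), 62),
    ("PRJEB48605", ("DEU",), 68),
    ("PRJNA939268", ("SGP",), 10),
    ("PRJEB26842", ("GBR",), 29),
    ("PRJNA906167", ("ESP",), 12),
]


def summarize(studies):
    """Return the number of projects, subjects, and distinct countries."""
    countries = {code for _, codes, _ in studies for code in codes}
    return {
        "projects": len(studies),
        "subjects": sum(n for _, _, n in studies),
        "countries": len(countries),
    }


summary = summarize(STUDIES)
print(summary)
```

With the values as transcribed, the totals agree with the overview: 21 projects, 949 subjects, and 15 distinct countries.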
The 21 included studies are:
Bioproject PRJEB8249 (2015, SWE, 21 subjects, PMID 26244932)
Bioproject PRJNA278393 (2015, TZA & ITA, 33 subjects, PMID 25981789)
Bioproject PRJNA328899 (2016, MNG & CHN, 110 subjects, PMID 27708392)
Bioproject PRJNA305507 (2017, USA, 33 subjects, PMID 28797298)
Bioproject PRJEB28687 (2018, USA & THA, 50 subjects, PMID 30388453)
Bioproject PRJEB32794 (2019, IRL, 37 subjects, PMID 31558359)
Bioproject PRJNA472785 (2019, USA, 12 subjects, PMID 31235964)
Bioproject PRJNA386503 (2019, USA, 4 subjects, PMID 30810441)
Bioproject PRJNA397112 (2019, IND, 88 subjects, PMID 30698687)
Bioproject PRJEB33500 (2020, ITA, 82 subjects, PMID 32075887)
Bioproject PRJNA647720 (2021, USA, 20 subjects, PMID 33727392)
Bioproject PRJNA755720 (2021, ESP, 20 subjects, PMID 34444797)
Bioproject PRJNA892265 (2022, ESP, 20 subjects, PMID 36364873)
Bioproject PRJEB42906 (2022, USA, 50 subjects, PMID 35312171)
Bioproject PRJEB45944 (2022, NLD, 149 subjects, PMID 35115599)
Bioproject PRJEB48663 (2022, FRA, 39 subjects, PMID 35311446)
Bioproject PRJNA762543 (2022, SGP, 62 subjects, PMID 35549618)
Bioproject PRJEB48605 (2023, DEU, 68 subjects, PMID 35760036)
Bioproject PRJNA939268 (2023, SGP, 10 subjects, PMID 36997838)
Bioproject PRJEB26842 (2023, GBR, 29 subjects, PMID 37587110)
Bioproject PRJNA906167 (2023, ESP, 12 subjects, PMID 37457982)
Method summary
Metagenomic data and associated metadata were retrieved from the European Nucleotide Archive (ENA), while nutritional and anthropometric data were collected from various online resources (main publication, supplementary files, GitHub or BioProject information). Quality control was performed with fastp (version 0.23.4), and host-related reads were filtered out with bowtie against the human reference genome (Homo sapiens T2T-CHM13v2.0).
The resulting high-quality reads were mapped with the METEOR software onto the 10.4-million-gene IGC2 catalogue of the human gut microbiome and onto the 8.4-million-gene human oral microbial catalogue, whose genes are clustered into Metagenomic Species Pangenomes (MSP species) that were previously taxonomically and functionally annotated.
MetaNutriUnify characteristics
The provided data consist of:
1. Metagenomic Species Pangenomes (MSP) species abundance table and related GTDB taxonomy (GTDB-Tk version r220) (final_msp.7z and species_taxonomy_20241119.tab)
2. KEGG, GMM and GBM functional module abundance table and related module definitions (KEGG version 107, GMM modules and GBM modules) (final_modules.tab and all_modules_definition_GMM_GBM_KEGG_107_20241119.tab)
3. Manually curated and unified data collected from the bioprojects (final_metadata.tab): metadata on the metagenomes obtained from ENA, as well as nutritional and anthropometric data. The file also holds additional data on each sample: the evaluation of cross-contamination within each bioproject using the CroCoDeEL tool, the number of high-quality reads, the MSP species richness, and whether the number of reads was below 1M. We propose a "to_exclude" variable in the deposited MetaNutriUnify file, derived from these data: a sample is flagged for exclusion from downstream analysis if at least one of the following conditions is met: low_read = YES, is.contaminated = YES, or MSP_richness < 20.
4. Nutritional data, carefully extracted from each study and reporting information on diet type, energy, and macro- and micronutrients when available. No frequency data were included, because of the great variability between studies. We reported variables and their categories as they were described in the different studies, and modified only the units of nutritional data when appropriate. We encourage users to recode the categories of some variables, such as the time point description, as needed, and to refer to the original studies for any enquiries.
5. Anthropometric data, including country, gender, age, weight, height and BMI.
6. General information on the cohorts, such as reference, study design, country, number of subjects and metagenomes, diet type and reported nutritional records (cohorts.xlsx)
7. Legends for the cohorts and final_metadata files (cohorts.xlsx)
Code for data retrieval and harmonization is available here.
Funding
The research leading to these results received funding under the iTARGET grant (doi.org/10.17180/h5gd-gk88) from Carnot Qualiment©, supported by the Agence Nationale de la Recherche. Additional funding came from the MetaGenoPolis grant ANR-11-DPBS-0001.
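
The exclusion rule proposed for final_metadata.tab (low_read = YES, is.contaminated = YES, or MSP_richness < 20) can be recomputed from its three inputs, for instance to try a different richness threshold. The following is a minimal sketch under stated assumptions, not the authors' code: the column names `low_read`, `is.contaminated` and `MSP_richness` come from the description, while the tab-delimited layout, the `NO` value, the `sample_id` column, and the function names are assumptions made for illustration.

```python
import csv
import io


def should_exclude(row, min_richness=20):
    """Apply the proposed rule: exclude if any one condition is met."""
    low_read = row["low_read"].strip().upper() == "YES"
    contaminated = row["is.contaminated"].strip().upper() == "YES"
    low_richness = float(row["MSP_richness"]) < min_richness
    return low_read or contaminated or low_richness


def flag_samples(handle, id_column="sample_id"):
    """Map each sample identifier to its exclusion flag.

    `id_column` is a hypothetical name; check the legends in cohorts.xlsx
    for the actual identifier column of final_metadata.tab.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: should_exclude(row) for row in reader}


# Small in-memory table standing in for final_metadata.tab (made-up rows).
DEMO = (
    "sample_id\tlow_read\tis.contaminated\tMSP_richness\n"
    "S1\tNO\tNO\t150\n"
    "S2\tYES\tNO\t150\n"
    "S3\tNO\tYES\t150\n"
    "S4\tNO\tNO\t12\n"
)

flags = flag_samples(io.StringIO(DEMO))
print(flags)
```

To run this on the deposited file, one would pass an open file handle instead of the in-memory demo table, after confirming the real column names and value coding against the legends.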
Provider:
Recherche Data Gouv
Created:
2024-11-27



