Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data

Name: Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data
Creator: Dryad
Published: 2025-06-01 03:25:01
License: No description available

DataCite Commons. Updated: 2025-06-01. Indexed: 2025-04-10.

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.tb2rbp08p

Description:

We use open-source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting the different microbial taxa in a sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.

Provider:

Dryad

Created:

2024-06-10
