Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-04-10
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.tb2rbp08p
Description:
We use open-source human gut microbiome data to learn a microbial
“language” model by adapting techniques from Natural Language Processing
(NLP). Our microbial “language” model is trained in a self-supervised
fashion (i.e., without additional external labels) to capture the
interactions among different microbial taxa and the common compositional
patterns in microbial communities. The learned model produces
contextualized taxon representations that allow a single microbial taxon
to be represented differently according to the specific microbial
environment in which it appears. The model further provides a sample
representation by collectively interpreting different microbial taxa in
the sample and their interactions as a whole. We demonstrate that, while
our sample representation performs comparably to baseline models in
in-domain prediction tasks such as predicting Inflammatory Bowel Disease
(IBD) and diet patterns, it significantly outperforms them when
generalizing to test data from independent studies, even in the presence
of substantial distribution shifts. Through a variety of analyses, we
further show that the pre-trained, context-sensitive embedding captures
meaningful biological information, including taxonomic relationships,
correlations with biological pathways, and relevance to IBD expression,
despite the model never being explicitly exposed to such signals.
Provider:
Dryad
Created:
2024-06-10



