Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data

DataONE, updated 2025-02-20, indexed 2025-04-26

Download link:

https://search.dataone.org/view/sha256:8c00b06c01187eb7b0df45066ab9152b765faa96c82b52a4ce6e18de1632a2b9

Resource description:

We use open-source human gut microbiome data to learn a microbial "language" model by adapting techniques from Natural Language Processing (NLP). Our microbial "language" model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting the different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from indep...

No additional raw data was collected for this project. All inputs are publicly available. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.

# Code and data for "Learning a deep language model for microbiomes: the power of large scale unlabeled microbiome data"

## Data:

* `vocab_embeddings.npy`
  * Fixed vocabulary embeddings produced from prior work: [Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007859). Adapted from [here](http://files.cqls.oregonstate.edu/David_Lab/microbiome_embeddings/data/embed/).
* `microbiomedata.zip`
  * Contains the labels and data for the three datasets used in this study. Specifically, it includes:
    * `IBD_(test|train)_(512|otu).npy` and `IBD_(test|train)_labels.npy`
    * `halfvarson_(512_otu|otu).npy` and `halfvarson_IBD_labels.npy`
    * `schirmer_IBD_(512_otu|otu).npy` and `schirmer_IBD_labels.npy`
    * `(test|train)encodings_(512|1897).npy`
  * The data are stored as n_samples x max_sample_size x 2 numpy arrays, containing both the vocab IDs of the taxa in the ...
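
As a rough illustration of the n_samples x max_sample_size x 2 layout described above, the following sketch saves a small synthetic array to a `.npy` file, reloads it with NumPy, and splits it along the last axis. The file name `demo_samples.npy`, the helper `split_channels`, and the random values are assumptions made for illustration and are not part of the dataset. The description is cut off before it says what the second channel holds, so the two slices are simply called `first` and `second`.

```python
import os
import tempfile

import numpy as np


def split_channels(samples: np.ndarray):
    """Split an (n_samples, max_sample_size, 2) array along its last axis.

    Hypothetical helper: the listing does not say which channel is which,
    so the two slices are returned in storage order.
    """
    if samples.ndim != 3 or samples.shape[-1] != 2:
        raise ValueError(
            f"expected (n_samples, max_sample_size, 2), got {samples.shape}"
        )
    return samples[..., 0], samples[..., 1]


# Synthetic stand-in for one of the .npy files: 4 samples, 6 slots per sample.
rng = np.random.default_rng(0)
demo = rng.integers(0, 100, size=(4, 6, 2))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_samples.npy")  # hypothetical file name
    np.save(path, demo)
    loaded = np.load(path)  # fully read into memory before the directory is removed

first, second = split_channels(loaded)
print(loaded.shape, first.shape, second.shape)
```

With the real files, the same `np.load` call would be pointed at an array extracted from `microbiomedata.zip`. Checking `loaded.shape` first is a cheap way to confirm that the file actually follows the three-dimensional layout before indexing into it.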

Created:

2025-02-26
