five

Gene-language models are whole genome representation learners

收藏
DataONE2024-02-28 更新2024-06-08 收录
下载链接:
https://search.dataone.org/view/sha256:1f65a473ff4462d9d9e5795072b2cdb10137c3501114800abacbb995b809a5f4
下载链接
链接失效反馈
官方服务:
资源简介:
The language of genetic code embodies a complex grammar and rich syntax of interacting molecular elements. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from inherent semantic structure. We present a gene-based language model that generates whole-genome vector representations from a population of 16 disease-causing bacterial species by leveraging natural contrastive characteristics between individuals. To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets for use in optimization. Using this foundational objective, we trained a Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population structure characteristics, like delineations across serotypes and host specificity preferences..., , , # Gene-language models are whole genome representation learners This directory holds the clean phenotype & metadata tables for both 2017 and the updated 2022 NARMS population, and the genespace zarr file. \[Metadata] Two metadata files, narms_metadata_2017.csv and narms_metadata_2022.csv, containing various qualifiers for each sample belonging to the NARMS population used in the study. For example, the 'serotype' column indicates the sample-in-question's serotype designation. Index (first column) refers to the original accession ID for the corresponding Sequence Read Archive. \[Phenotypes] A phenotype file (narms_phenotype_2017.csv) containing the collection of phenotypic qualities for each sample belonging to the NARMS population used in the study. Virtually all phenotype columns reflect the minimum inhibitory concentration (MIC) for that particular drug to reflect resistance. For example, 'mic_ampicillin' refers to the MIC values for ampicillin for each sample. Index (first c...
创建时间:
2025-07-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作