Gene-language models are whole genome representation learners
DataCite Commons | Updated: 2025-05-01 | Indexed: 2025-04-09
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.vx0k6djzn
Description:
The language of genetic code embodies a complex grammar and rich syntax of
interacting molecular elements. Recent advances in self-supervision and
feature learning suggest that statistical learning techniques can identify
high-quality quantitative representations from inherent semantic
structure. We present a gene-based language model that generates
whole-genome vector representations from a population of 16
disease-causing bacterial species by leveraging natural contrastive
characteristics between individuals. To achieve this, we developed a
set-based learning objective, AB learning, that compares the annotated
gene content of two population subsets for use in optimization. Using this
foundational objective, we trained a Transformer model to backpropagate
information into dense genome vector representations. The resulting
bacterial representations, or embeddings, captured important population
structure characteristics, like delineations across serotypes and host
specificity preferences. Their vector quantities encoded the relevant
functional information necessary to achieve state-of-the-art genomic
supervised prediction accuracy in 11 out of 12 antibiotic resistance
phenotypes.
Provider:
Dryad
Created:
2024-02-28



