Gene-language models are whole genome representation learners

Name: Gene-language models are whole genome representation learners
Creator: Dryad
Published: 2025-05-01 02:58:52
License: No description available

DataCite Commons | Updated: 2025-05-01 | Indexed: 2025-04-09

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.vx0k6djzn

Description:

The language of genetic code embodies a complex grammar and rich syntax of interacting molecular elements. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from inherent semantic structure. We present a gene-based language model that generates whole-genome vector representations from a population of 16 disease-causing bacterial species by leveraging natural contrastive characteristics between individuals. To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets for use in optimization. Using this foundational objective, we trained a Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population structure characteristics, like delineations across serotypes and host specificity preferences. Their vector quantities encoded the relevant functional information necessary to achieve state-of-the-art genomic supervised prediction accuracy in 11 out of 12 antibiotic resistance phenotypes.
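The description above says the AB learning objective compares the annotated gene content of two population subsets, but it does not give the formula. The sketch below is only a toy illustration of what such a comparison could look like: it computes, for each gene, the difference in carriage frequency between subset A and subset B. The function names, the frequency-difference contrast, and the gene labels in the toy data are all assumptions made for illustration, not the authors' method or the dataset's actual format.

```python
from collections import Counter


def gene_frequencies(genomes):
    """Return the fraction of genomes in a subset that carry each gene.

    `genomes` is a list of sets of gene annotations (hypothetical format).
    """
    if not genomes:
        raise ValueError("subset must contain at least one genome")
    counts = Counter(gene for genome in genomes for gene in set(genome))
    return {gene: n / len(genomes) for gene, n in counts.items()}


def contrast_gene_content(subset_a, subset_b):
    """Per-gene frequency difference (A minus B) over the union of genes.

    This is an assumed stand-in for a set-based comparison, not the
    published AB learning objective.
    """
    freq_a = gene_frequencies(subset_a)
    freq_b = gene_frequencies(subset_b)
    genes = sorted(set(freq_a) | set(freq_b))
    return {g: freq_a.get(g, 0.0) - freq_b.get(g, 0.0) for g in genes}


# Toy data: gene labels chosen only for illustration.
subset_a = [{"gyrA", "blaTEM", "tetA"}, {"gyrA", "blaTEM"}]
subset_b = [{"gyrA"}, {"gyrA", "tetA"}]

contrast = contrast_gene_content(subset_a, subset_b)
for gene, diff in contrast.items():
    print(f"{gene}: {diff:+.2f}")
```

In a training setup, a vector like this could serve as the target that a model tries to predict from the two subsets' genome embeddings; whether the published objective takes that form is not stated in the description.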

Provider:

Dryad

Created:

2024-02-28
