
High diversity gene libraries facilitate machine learning guided exploration of fluorescent protein sequence space

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://www.ncbi.nlm.nih.gov/sra/SRP595474
Description:
While protein language models (PLMs) have shown great promise for protein design, their performance remains limited by the diversity and completeness of available training data. In particular, PLMs struggle to extrapolate to sequences that fall outside the distribution of their training sets. Here, we demonstrate how synthetic gene libraries can be used to overcome this limitation by experimentally expanding training data coverage. Using large-scale gene synthesis and DNA shuffling, we generate libraries spanning a broad region of fluorescent protein sequence space, including intermediates that bridge distant sequences. Functional screening for blue fluorescence yields a wide variety of active variants, many of which are chimeric and lie far from known sequences. Fine-tuning ProtGPT2 on this expanded dataset improves its ability to generate diverse and functional fluorescent proteins. This work illustrates how synthetic approaches can help address key limitations in machine learning-guided protein design, especially for small or sparsely populated protein families, by actively creating novel sequences in unexplored but functional regions of sequence space.
Created:
2025-06-28