High diversity gene libraries facilitate machine learning guided exploration of fluorescent protein sequence space
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://www.ncbi.nlm.nih.gov/sra/SRP595474
Description:
While protein language models (PLMs) have shown great promise for protein design, their performance remains limited by the diversity and completeness of available training data. In particular, PLMs struggle to extrapolate to sequences that fall outside the distribution of their training sets. Here, we demonstrate how synthetic gene libraries can be used to overcome this limitation by experimentally expanding training data coverage. Using large-scale gene synthesis and DNA shuffling, we generate libraries spanning a broad region of fluorescent protein sequence space, including variants that bridge distantly related sequences. Functional screening for blue fluorescence yields a wide variety of active variants, many of which are chimeric and lie far from known sequences. Fine-tuning ProtGPT2 on this expanded dataset improves its ability to generate diverse and functional fluorescent proteins. This work illustrates how synthetic approaches can help address key limitations in machine learning-guided protein design, especially for small or sparsely populated protein families, by actively creating novel sequences in unexplored but functional regions of sequence space.
Created:
2025-06-28



