High diversity gene libraries facilitate machine learning guided exploration of fluorescent protein sequence space

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://www.ncbi.nlm.nih.gov/sra/SRP595474

Description:

While protein language models (PLMs) have shown great promise for protein design, their performance remains limited by the diversity and completeness of available training data. In particular, PLMs struggle to extrapolate to sequences that fall outside the distribution of their training sets. Here, we demonstrate how synthetic gene libraries can be used to overcome this limitation by experimentally expanding training data coverage. Using large-scale gene synthesis and DNA shuffling, we generate libraries spanning a broad region of fluorescent protein sequence space, including intermediates that bridge distantly related sequences. Functional screening for blue fluorescence yields a wide variety of active variants, many of which are chimeric and lie far from known sequences. Fine-tuning ProtGPT2 on this expanded dataset improves its ability to generate diverse and functional fluorescent proteins. This work illustrates how synthetic approaches can help address key limitations in machine learning-guided protein design, especially for small or sparsely populated protein families, by actively creating novel sequences across unexplored but functional regions of sequence space.

Created:

2025-06-28
