Supplemental Data for "Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models"
Source: CaltechDATA. Updated: 2025-12-11. Indexed: 2026-04-16.
Download link:
https://data.caltech.edu/doi/10.22002/24abg-6e603
Description:
Protein sequence-space is combinatorially vast yet sparsely populated, hindering attempts to chart which distant swaths of this landscape capture the familiar structures of nature's vital molecular machines. Locating such hidden protein populations would unlock new-to-nature modular parts for engineering problems from therapeutics to catalysis. Here, we introduce a novel algorithm – termed "foldtuning" – that draws on principles of adversarial learning to drive protein language models (PLMs) to erase detectable homology to natural sequences while preserving a target structure. We build foldtuned PLMs for >700 targets including membrane-bound receptors, redox enzymes, and signaling domains. Foldtuned proteins are diverse and far-from-natural in sequence, exposing fundamental biophysical constraints on structural families invisible to traditional sequence-based bioinformatics methods. Experimental characterization demonstrates that foldtuned proteins express stably in vitro and function in vivo. By revealing sequence-structure information at scale beyond evolution, foldtuning promises to accelerate the reconstitution and realization of novel-to-nature systems across synthetic biology.
Provided by:
Division of Biology and Biological Engineering, California Institute of Technology
Created:
2025-12-11



