Supplemental Data for "Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models"

Name: Supplemental Data for "Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models"
Creator: Division of Biology and Biological Engineering, California Institute of Technology
Published: 2025-12-11 00:00:00
License: No description available

Source: CaltechDATA. Updated: 2025-12-11. Indexed: 2026-04-16.

Download link:

https://data.caltech.edu/doi/10.22002/24abg-6e603

Description:

Protein sequence-space is combinatorially vast yet sparsely populated, hindering attempts to chart which distant swaths of this landscape capture the familiar structures of nature's vital molecular machines. Locating such hidden protein populations would unlock new-to-nature modular parts for engineering problems from therapeutics to catalysis. Here, we introduce a novel algorithm – termed "foldtuning" – that draws on principles of adversarial learning to drive protein language models (PLMs) to erase detectable homology to natural sequences while preserving a target structure. We build foldtuned PLMs for >700 targets including membrane-bound receptors, redox enzymes, and signaling domains. Foldtuned proteins are diverse and far-from-natural in sequence, exposing fundamental biophysical constraints on structural families invisible to traditional sequence-based bioinformatics methods. Experimental characterization demonstrates that foldtuned proteins express stably in vitro and function in vivo. By revealing sequence-structure information at scale beyond evolution, foldtuning promises to accelerate the reconstitution and realization of novel-to-nature systems across synthetic biology.

Provider:

Division of Biology and Biological Engineering, California Institute of Technology

Created:

2025-12-11