Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://figshare.com/articles/dataset/Machine-Learning-Guided_Library_Design_Cycle_for_Directed_Evolution_of_Enzymes_The_Effects_of_Training_Data_Composition_on_Sequence_Space_Exploration/17049475


Description:

Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of the training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of a known “highly positive” variant (i.e., a variant known to have high enzyme activity) in the training data. We performed two separate series of ML-guided directed evolution of Sortase A, with and without a known highly positive variant called 5M in the training data. In each series, two rounds of ML were conducted: variants predicted in the initial round were experimentally evaluated and used as additional training data for the second round of prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2–2.5 times higher than that of 5M. Intriguingly, the sequences of the improved variants were largely different between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence or absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data but also by ML using a subset of the training data, even when that subset lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
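
The two-round cycle described above can be illustrated with a toy sketch. This is not the authors' pipeline: the description does not name the ML model, so a simple additive per-position model is swapped in here, and the 4-letter alphabet, the `toy_assay` function (a stand-in for the wet-lab activity measurement), the batch size, and all helper names are hypothetical. The sketch only shows the shape of the loop: train, rank untested variants, "measure" the top picks, add them to the training data, and repeat, once with and once without the best-known variant in the initial data.

```python
import random
from itertools import product

ALPHABET = "ACDE"  # toy residue alphabet (assumption, not from the study)
LENGTH = 4         # toy number of mutated positions (assumption)


def fit_additive(train):
    """Fit a per-position additive model; train maps sequence -> activity."""
    overall = sum(train.values()) / len(train)
    sums, counts = {}, {}
    for seq, activity in train.items():
        for key in enumerate(seq):  # key is (position, residue)
            sums[key] = sums.get(key, 0.0) + activity
            counts[key] = counts.get(key, 0) + 1

    def predict(seq):
        # Mean activity of training variants sharing each residue;
        # residues never seen in training fall back to the overall mean.
        return sum(
            sums[key] / counts[key] if key in counts else overall
            for key in enumerate(seq)
        )

    return predict


def run_series(initial, candidates, assay, rounds=2, batch=5):
    """Run `rounds` of predict -> measure -> add to training data."""
    train = dict(initial)
    for _ in range(rounds):
        predict = fit_additive(train)
        untested = [s for s in candidates if s not in train]
        picks = sorted(untested, key=predict, reverse=True)[:batch]
        train.update({s: assay(s) for s in picks})
    return train


rng = random.Random(0)
weights = {(p, a): rng.random() for p in range(LENGTH) for a in ALPHABET}


def toy_assay(seq):
    """Hypothetical stand-in for an experimental activity measurement."""
    return sum(weights[key] for key in enumerate(seq))


candidates = ["".join(t) for t in product(ALPHABET, repeat=LENGTH)]
full = {s: toy_assay(s) for s in rng.sample(candidates, 20)}

# The best initial variant plays the role of a known highly positive
# variant (5M in the study); one series trains with it, one without.
top = max(full, key=full.get)
without_top = {s: y for s, y in full.items() if s != top}

series_with = run_series(full, candidates, toy_assay)
series_without = run_series(without_top, candidates, toy_assay)
```

Comparing the variants added in `series_with` and `series_without` (for example, by set difference against the initial training data) gives a rough analogue of the comparison between the two series in the study, although a toy additive landscape will not necessarily reproduce the divergence reported there.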


Created:

2021-11-19
