PyPEF: An Integrated Framework for Data-Driven Protein Engineering
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://figshare.com/articles/dataset/PyPEF_An_Integrated_Framework_for_Data-Driven_Protein_Engineering/14983040
Description:
Data-driven strategies are gaining increased attention in protein
engineering due to recent advances in access to large experimental
databanks of proteins, next-generation sequencing (NGS), high-throughput
screening (HTS) methods, and the development of artificial intelligence
algorithms. However, the reliable prediction of beneficial amino acid
substitutions, their combination, and the effect on functional properties
remains among the most significant challenges in protein engineering, which
is applied to develop proteins and enzymes for biocatalysis, biomedicine,
and life sciences. Here, we present a general-purpose framework (PyPEF:
pythonic protein engineering framework) for performing data-driven
protein engineering using machine learning methods combined with techniques
from signal processing and statistical physics. PyPEF guides the identification
and selection of beneficial proteins of a defined sequence space by
systematically or randomly exploring the fitness of variants and by
sampling random evolution pathways. PyPEF was evaluated for predictive
accuracy and throughput on four public protein and enzyme data sets
using common regression models. The program was shown to efficiently
predict the fitness of protein
sequences for different target properties (predictive models with
coefficient of determination values ranging from 0.58 to 0.92). By
combining machine learning and protein evolution, PyPEF enabled the
screening of proteins with various functions, reaching a screening
capacity of more than 500,000 protein sequence variants within
a few minutes on a personal computer. PyPEF showed good predictive
accuracy on four public data sets (different proteins and properties)
and underlined the potential of integrating data-driven technologies
to serve two different strategies: either predicting the fitness
of variants as accurately as possible while accounting for epistatic effects,
or capturing the general trend of how introduced mutations affect fitness
in directed protein evolution campaigns. In essence, PyPEF can provide
a powerful solution to current sequence exploration and combinatorial
problems faced in protein engineering through exhaustive in
silico screening of the sequence space.
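
The two ideas named in the description, scoring a model by its coefficient of determination and sampling random evolution pathways, can be illustrated with a minimal sketch. This is not PyPEF's API: `toy_fitness`, `random_pathway`, the target sequence, and the greedy acceptance rule are all hypothetical stand-ins chosen for illustration, since the description does not state how pathways are accepted or which encoding and regression model are used.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def toy_fitness(seq, target="MKVLA"):
    """Hypothetical stand-in for a trained regression model.

    Returns the fraction of positions that match an arbitrary target.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def random_pathway(start, predict, steps, rng):
    """Sample one pathway of single substitutions.

    Assumption: a proposal is accepted when the predicted fitness
    does not decrease (a simple greedy rule, chosen for illustration).
    """
    seq = start
    path = [(seq, predict(seq))]
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        candidate = seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]
        if candidate == seq:
            continue  # proposal did not change the sequence
        score = predict(candidate)
        if score >= path[-1][1]:
            seq = candidate
            path.append((seq, score))
    return path


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    for variant, score in random_pathway("AAAAA", toy_fitness, 200, rng):
        print(variant, round(score, 2))
```

In practice the `predict` argument would be a regression model trained on measured variant fitness, and `r_squared` would be computed on a held-out test set to judge whether the model is reliable enough to guide the search.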
Created:
2021-07-14



