Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://www.ncbi.nlm.nih.gov/sra/SRP509571
Description:
More than 50 human diseases are characterized by the deposition of specific protein aggregates in the form of insoluble amyloid fibrils. However, only a very small number of proteins are known to form amyloids with high propensity, limiting our ability to understand, predict and engineer amyloid aggregation from sequence. Here we use a massively parallel assay to quantify the amyloid nucleation propensity of >100,000 random 20-amino-acid sequences. Approximately 5% of assayed random sequences nucleate the formation of aggregates, generating a very large and diverse dataset on which to train models that predict amyloid nucleation. We use this dataset to train CANYA, a convolution-attention hybrid neural network that predicts the propensity of any primary sequence to form amyloids. CANYA outperforms previous predictors of protein aggregation on additional random sequences and out-of-sample datasets including human disease-causing amyloids, with very stable performance across diverse prediction tasks. We adapt and extend recent advances in interpretability of genomic neural networks to elucidate CANYA's decision-making process and learned grammar and to provide mechanistic insights into amyloid formation. Our results demonstrate the power of massive experimental random sequence-space exploration and provide an interpretable and robust neural network model for understanding, predicting and designing amyloid-forming proteins.
Overall design: Systematic measurement of the nucleation of random 20-mer peptides.
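To make the shape of the data concrete, the following is a minimal sketch of what a library of random 20-mer peptides and a one-hot encoding of it might look like as input to a sequence model. It is not the authors' pipeline: the uniform sampling over the 20 standard amino acids, the function names `random_peptide` and `one_hot`, and the encoding layout are all assumptions made for illustration, and the experimental library may have a different amino acid composition.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptide(length=20, rng=None):
    """Sample one peptide uniformly at random (an assumption; the
    real library may not be uniform over amino acids)."""
    rng = rng or random.Random()
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def one_hot(seq):
    """Encode a peptide as a list of rows, one per position, each a
    20-element 0/1 vector marking the amino acid at that position."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


# Hypothetical toy library: 5 random 20-mers with a fixed seed.
rng = random.Random(0)
library = [random_peptide(20, rng) for _ in range(5)]
encoded = [one_hot(p) for p in library]
```

Each encoded peptide here is a 20 x 20 matrix (positions by amino acids), the kind of fixed-size input a convolutional layer can scan for short sequence motifs.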
Created:
2025-05-08



