Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://www.ncbi.nlm.nih.gov/sra/SRP509571

Description:

More than 50 human diseases are characterized by the deposition of specific protein aggregates in the form of insoluble amyloid fibrils. However, only a very small number of proteins are known to form amyloids with high propensity, limiting our ability to understand, predict and engineer amyloid aggregation from sequence. Here we use a massively parallel assay to quantify the amyloid nucleation propensity of >100,000 random 20-amino-acid sequences. Approximately 5% of assayed random sequences nucleate the formation of aggregates, generating a very large and diverse dataset from which to train models to predict amyloid nucleation. We use this dataset to train CANYA, a convolution-attention hybrid neural network that predicts the propensity of any primary sequence to form amyloids. CANYA outperforms previous predictors of protein aggregation on additional random sequences and out-of-sample datasets including human disease-causing amyloids, with very stable performance across diverse prediction tasks. We adapt and extend recent advances in interpretability of genomic neural networks to elucidate CANYA's decision-making process and learned grammar and to provide mechanistic insights into amyloid formation. Our results demonstrate the power of massive experimental random sequence-space exploration and provide an interpretable and robust neural network model for understanding, predicting and designing amyloid-forming proteins.

Overall design: Systematic measurement of the nucleation of random 20-mer peptides.
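To make the "random 20 amino acid sequences" in the description concrete, here is a minimal Python sketch that draws one random 20-mer and one-hot encodes it, a common input representation for sequence models. Everything in it is an assumption for illustration: the listing does not state how the library was sampled (uniform sampling over the 20 standard residues is assumed here), and the function names `random_peptide` and `one_hot` are hypothetical, not part of CANYA or the deposited data.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptide(length=20, rng=None):
    """Draw a peptide with residues sampled uniformly (an assumption)."""
    rng = rng or random.Random()
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def one_hot(peptide):
    """Encode a peptide as a list of rows, one row of 20 zeros/ones per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for aa in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1  # mark the column for this residue
        matrix.append(row)
    return matrix


rng = random.Random(0)  # fixed seed so the example is repeatable
peptide = random_peptide(rng=rng)
encoded = one_hot(peptide)
print(peptide, len(encoded), len(encoded[0]))
```

The result is a 20 x 20 matrix (positions by residues) with exactly one nonzero entry per row. Whether CANYA consumes this exact encoding is not stated in the listing.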

Created:

2025-05-08
