Embeddings from protein language models predict conservation and variant effects

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/5238536

Description:

For this work, we used protein language model representations (embeddings) to predict sequence conservation without multiple sequence alignments (MSAs). From single sequences, embeddings alone predicted residue conservation almost as accurately as ConSeq did using MSAs (two-state Matthews correlation coefficient, MCC, of 0.596±0.006 for ProtT5 embeddings vs. 0.608±0.006 for ConSeq).

ConSurf10k: dataset for the development of ProtT5cons

The method predicting residue conservation (ProtT5cons) used ConSurf-DB (Ben Chorin et al. 2020). This resource provided sequences and conservation scores for 89,673 proteins. For all of them, experimental high-resolution three-dimensional (3D) structures were available in the Protein Data Bank (PDB) (Berman et al. 2000). As the standard of truth for conservation prediction, we used the values from ConSurf-DB, which were generated using HMMER (Mistry et al. 2013), CD-HIT (Fu et al. 2012), and MAFFT-LINSi (Katoh and Standley 2013) to align proteins in the PDB (Burley et al. 2019). For proteins from families with over 50 proteins in the resulting MSA, an evolutionary rate was computed at each residue position and used along with the MSA to reconstruct a phylogenetic tree. The ConSurf-DB conservation scores ranged from 1 (most variable) to 9 (most conserved).

The PISCES server (Wang and Dunbrack 2003) was used to reduce redundancy in the data set such that no pair of proteins had more than 25% pairwise sequence identity. We removed proteins with resolutions >2.5 Å, those shorter than 40 residues, and those longer than 10,000 residues. The resulting data set (ConSurf10k) with 10,507 proteins (or domains) was randomly partitioned into training (9,392 sequences), cross-training/validation (555), and test (519) sets.
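To make the two-state MCC quoted above concrete, the following is a minimal sketch of how such a score could be computed from per-residue conservation values on the 1 to 9 scale. The binarization threshold (scores above 5 count as "conserved") is an assumption made for illustration and is not stated in this description; the function names `binarize` and `two_state_mcc` are likewise hypothetical.

```python
import math
from typing import List, Sequence


def binarize(scores: Sequence[int], threshold: int = 5) -> List[bool]:
    """Map 1..9 conservation scores to two states (True = conserved).

    ASSUMPTION: the cut-off of 5 is illustrative, not taken from the
    dataset description.
    """
    return [s > threshold for s in scores]


def two_state_mcc(observed: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Matthews correlation coefficient for two binary label sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # conserved, predicted conserved
    tn = sum(1 for o, p in pairs if not o and not p)  # variable, predicted variable
    fp = sum(1 for o, p in pairs if not o and p)      # variable, predicted conserved
    fn = sum(1 for o, p in pairs if o and not p)      # conserved, predicted variable
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example with made-up scores for six residues.
observed_scores = [9, 8, 2, 1, 7, 3]
predicted_scores = [8, 9, 1, 2, 3, 3]
score = two_state_mcc(binarize(observed_scores), binarize(predicted_scores))
print(f"two-state MCC: {score:.3f}")
```

The reported uncertainty (±0.006) would require an additional error estimate over the test set, which this sketch does not attempt.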
Uploaded data:
ConSuf10k_PDBid_seq_cons.fasta: FASTA file with PDB ID, sequence, and conservation annotation
consurf10k_test_ids.txt: text file with the IDs of the test set
consurf10k_train_ids.txt: text file with the IDs of the training set
consurf10k_val_ids.txt: text file with the IDs of the cross-training/validation set
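As a rough illustration of how the uploaded files might be combined, the sketch below parses FASTA-like records and selects the subset named in an ID list. The exact layout of the FASTA file is not described here, so the code assumes three lines per record (a `>` header whose first token is the ID, one sequence line, and one line of single-digit conservation scores), and it assumes one ID per line in the ID files. Both are assumptions that should be checked against the actual files, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, List[int]]  # (sequence, per-residue conservation scores)


def parse_records(lines: Iterable[str]) -> Dict[str, Record]:
    """Parse records of the ASSUMED form: '>id', sequence, digit string."""
    records: Dict[str, Record] = {}
    current_id = None
    body: List[str] = []

    def flush() -> None:
        # Store the record collected so far, validating the assumed layout.
        if current_id is None:
            return
        if len(body) != 2:
            raise ValueError(f"{current_id}: expected sequence + conservation lines")
        sequence, conservation = body
        if len(sequence) != len(conservation):
            raise ValueError(f"{current_id}: sequence/conservation length mismatch")
        records[current_id] = (sequence, [int(c) for c in conservation])

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            current_id = line[1:].split()[0]
            body = []
        else:
            body.append(line)
    flush()
    return records


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Read one ID per non-empty line (ASSUMED layout of the *_ids.txt files)."""
    return {line.strip() for line in lines if line.strip()}


def select_split(records: Dict[str, Record], ids: Set[str]) -> Dict[str, Record]:
    """Keep only the records whose ID appears in the given split."""
    return {rid: rec for rid, rec in records.items() if rid in ids}


# In-memory stand-ins with made-up IDs; real use would pass open file handles.
sample_fasta = [">1abcA", "MKV", "192", ">2xyzB", "GA", "55"]
sample_test_ids = ["2xyzB"]

all_records = parse_records(sample_fasta)
test_records = select_split(all_records, read_ids(sample_test_ids))
print(sorted(test_records))
```

With the real files, the lists above would be replaced by file objects opened on `ConSuf10k_PDBid_seq_cons.fasta` and the relevant `consurf10k_*_ids.txt` file. If the conservation annotation uses a different encoding, `flush` is the place to adapt.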

Created:

2021-08-24
