Evolutionary Information Encoded in pLMs

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/10026191

Resource description:

This dataset was created to test the effect of combining evolutionary information with protein language model embeddings, measured on secondary structure prediction. Our method for predicting secondary structure used PDB (Berman et al., 2000) structures as ground truth. Sequences were cross-checked with PDBredo DB (Joosten et al., 2014) and CATH (Sillitoe et al., 2021). This resulted in 296,596 protein chain sequences from 117,623 different proteins. HSSP-values (HVAL) (Rost, 1999; Sander & Schneider, 1991) were computed for all protein chain pairs, and the sequences were split into training, test, and validation sets as follows:

TEST100: 100 randomly selected sequences meeting the following criteria:
- Deposited after April 2018, to allow a fair comparison to other recent methods
- Resolution ≤ 2 Å
- HVAL ≤ 0 for any sequence pair (a,b) with a,b ∈ TEST100

VAL100: 100 additional randomly selected sequences constrained to:
- Deposited before April 2018
- Resolution ≤ 2 Å
- HVAL ≤ 0 for any sequence pair (a,b) with either a ∈ TEST100 or a ∈ VAL100, and b ∈ VAL100

TRAIN6727: the remaining sequences were used for training if and only if the following criteria were fulfilled:
- Deposited before April 2018
- CATH annotations on the topology level (T) differ from all of those contained in TEST100 or VAL100
- HVAL ≤ 0 for any pair (a,b) with a ∈ TEST100 or a ∈ VAL100, and b ∈ TRAIN6727
- PIDE ≤ 70 for any pair (a,b) with a,b ∈ TRAIN6727 and a ≠ b

This yielded 6,727 protein chains for training.

This resource provides sequences, secondary structure annotations in 3 states, annotations of disordered regions, MSAs generated by MMseqs2 (Steinegger & Söding, 2017), PSSMs generated by MMseqs2, and meta files containing possible alternative PDB sequence IDs and CATH annotations.
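The TRAIN6727 criteria above can be read as a sequence of filters. The following is a minimal sketch of that reading, not the authors' pipeline. The `Chain` record, the function `select_training`, and the `hval` and `pide` callables are hypothetical names introduced for illustration. It assumes that "before April 2018" means before 1 April 2018, that pairwise HVAL and PIDE values are already available, and that the PIDE ≤ 70 redundancy reduction is done greedily in input order (the description does not say how it was done).

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, FrozenSet, Iterable, List

# Assumption: "April 2018" is interpreted as 1 April 2018.
CUTOFF = date(2018, 4, 1)


@dataclass(frozen=True)
class Chain:
    """Hypothetical record for one protein chain."""
    chain_id: str
    deposited: date
    # CATH codes truncated to the topology level (C.A.T), e.g. "3.40.50".
    cath_topologies: FrozenSet[str]


def select_training(
    candidates: Iterable[Chain],
    held_out: Iterable[Chain],
    hval: Callable[[Chain, Chain], float],
    pide: Callable[[Chain, Chain], float],
) -> List[Chain]:
    """Apply the four TRAIN criteria; held_out is TEST100 plus VAL100."""
    held_out = list(held_out)
    blocked = set().union(*(c.cath_topologies for c in held_out))
    selected: List[Chain] = []
    for chain in candidates:
        if chain.deposited >= CUTOFF:
            continue  # must be deposited before the cutoff
        if chain.cath_topologies & blocked:
            continue  # shares a CATH topology with a held-out chain
        if any(hval(h, chain) > 0 for h in held_out):
            continue  # too similar to a test or validation chain
        if any(pide(s, chain) > 70 for s in selected):
            continue  # redundant with an already selected training chain
        selected.append(chain)
    return selected
```

Because the redundancy step is greedy, the result depends on the order of `candidates`; a different reduction strategy would give a different, equally valid training set.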
The original 8 DSSP (Kabsch & Sander, 1983) classes for secondary structure annotations were reduced to 3 following this protocol:
- DSSP-H, DSSP-G, and DSSP-I to helix (H)
- DSSP-E and DSSP-B to strand (E)
- all remaining classes to other (-)

Disorder annotations were used to mask out residues in our evaluation that could not be resolved experimentally. All unresolved (disordered) residues are marked with X, while a dash (-) indicates a resolved position.

Multiple sequence alignments are provided in Stockholm format, and the PSSMs were generated from the provided MSAs. The PSSMs were numbered during creation; the mapping between the original PDB identifiers and the numbered PSSMs is provided in the xxx.lookup files.
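The 8-to-3 state reduction and the disorder mask described above can be sketched in a few lines. This is an illustrative sketch only: the function names `reduce_dssp` and `masked_q3` are hypothetical, and per-residue three-state accuracy (Q3) is assumed here as the evaluation metric, which the description does not state. All three strings are assumed to be aligned residue by residue.

```python
# Mapping from DSSP 8-state labels to 3 states, as given in the description.
# Any label not listed here (including blanks) falls through to "-".
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp8: str) -> str:
    """Reduce a DSSP 8-state string to the 3 states H, E and -."""
    return "".join(DSSP8_TO_3.get(state, "-") for state in dssp8)


def masked_q3(predicted: str, observed: str, disorder: str) -> float:
    """Fraction of correctly predicted residues, ignoring unresolved ones.

    disorder uses "X" for unresolved residues and "-" for resolved ones.
    """
    if not (len(predicted) == len(observed) == len(disorder)):
        raise ValueError("all three strings must have the same length")
    resolved = [
        (p, o)
        for p, o, d in zip(predicted, observed, disorder)
        if d != "X"  # skip residues that could not be resolved
    ]
    if not resolved:
        raise ValueError("no resolved residues to evaluate")
    return sum(p == o for p, o in resolved) / len(resolved)
```

For example, `reduce_dssp("HGIEBTSC")` gives `"HHHEE---"`, and `masked_q3("HHE-", "HHEE", "--X-")` scores only the three resolved positions, two of which match.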

Created:

2024-05-13
