
ESM-1v predictions for all AA substitutions in all MANE proteins

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14828608
Description:
This dataset contains ESM-1v zero-shot protein function predictions for all amino acid (AA) substitutions in MANE proteins, specifically MANE version 1.2 (link to AA sequence FASTA file).

File format
The predictions are provided as a collection of .tsv (tab-separated value) files, one file per unique Ensembl Peptide (ENSP) ID. Each file starts with a header row. Columns are expected, but not guaranteed, to appear in the order listed here; refer to the header row for the actual column order within a particular file. The columns are as follows:

HGVS: A description of the AA change that follows the HGVS standard as closely as reasonable. Values in this column are strings of the form {sequence}:p.{ref}{pos}{alt}, e.g. ENSP00000005226.7:p.Val772Ala, where:
  {sequence} is the ENSP ID of the protein (e.g. ENSP00000005226.7).
  {ref} is the three-letter abbreviation of the AA that appears in the reference sequence at the substitution position (e.g. Val).
  {pos} is the one-indexed substitution position as an integer (e.g. 772).
  {alt} is the three-letter abbreviation of the substituting AA (e.g. Ala).
  Contrary to the HGVS recommendation, these variants are not described at the DNA level. These are in-silico predictions of protein stability that depend solely on AA sequences and are agnostic of DNA and other context.
  The HGVS recommendation for predicted consequences is to use parentheses notation (e.g. ENSP00000005226.7:p.(Val772Ala)). We do not do this because the amino acid substitutions are neither predicted, nor consequences of anything, nor observed.
esm1v_t33_650M_UR90S_1: The masked-marginals score for the substitution yielded by the esm1v_t33_650M_UR90S_1 model. If this position lies in an overlap between two segments of a long sequence, this is the score for the prior segment (see the "Long sequences" subsection below for details).
esm1v_t33_650M_UR90S_2: (prior segment) masked-marginals score by esm1v_t33_650M_UR90S_2.
esm1v_t33_650M_UR90S_3: (prior segment) masked-marginals score by esm1v_t33_650M_UR90S_3.
esm1v_t33_650M_UR90S_4: (prior segment) masked-marginals score by esm1v_t33_650M_UR90S_4.
esm1v_t33_650M_UR90S_5: (prior segment) masked-marginals score by esm1v_t33_650M_UR90S_5.
esm1v_t33_650M_UR90S_1_next: The masked-marginals score for the substitution yielded by the esm1v_t33_650M_UR90S_1 model for the later segment in an overlap between two segments of a long sequence (see the "Long sequences" subsection below for details). This column is present only in files for long sequences (proteins of more than 1022 amino acids). Even when the column is present, it holds values only at positions inside overlaps.
esm1v_t33_650M_UR90S_2_next: (later segment) masked-marginals score by esm1v_t33_650M_UR90S_2.
esm1v_t33_650M_UR90S_3_next: (later segment) masked-marginals score by esm1v_t33_650M_UR90S_3.
esm1v_t33_650M_UR90S_4_next: (later segment) masked-marginals score by esm1v_t33_650M_UR90S_4.
esm1v_t33_650M_UR90S_5_next: (later segment) masked-marginals score by esm1v_t33_650M_UR90S_5.
combined_score: A combined score that represents the final protein stability prediction for this set of 5 models.
  Outside of overlap regions, and in the first 20 positions of an overlap, this is the average of the 5 prior-segment model scores.
  In the last 20 positions of an overlap, this is the average of the 5 later-segment model scores.
  Between the 20th and 80th positions of an overlap, this is a weighted average of the two averages, governed by a cosine sigmoid.

Model information
The code to run the prediction task, generate containers for running the computation, and process the results to produce the included tables can be found in this GitHub repository.
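
The combined_score blending described for overlap positions can be illustrated with a short Python sketch. The description does not give the exact formula, so the half-cosine ramp, the 1-indexed overlap position, and the names later_segment_weight, combined_score, RAMP_START and RAMP_END are all assumptions made for illustration; the processing code in the repository mentioned above is the authoritative implementation.

```python
import math
from statistics import fmean

# Assumption: the ramp runs from the 20th to the 80th position of an overlap,
# counted from 1. The description gives these bounds but not the indexing.
RAMP_START = 20
RAMP_END = 80


def later_segment_weight(k: int) -> float:
    """Weight of the later-segment average at overlap position k (1-indexed).

    Assumed form of the "cosine sigmoid": a half-cosine ramp rising
    smoothly from 0 at RAMP_START to 1 at RAMP_END.
    """
    if k <= RAMP_START:
        return 0.0
    if k >= RAMP_END:
        return 1.0
    t = (k - RAMP_START) / (RAMP_END - RAMP_START)
    return 0.5 * (1.0 - math.cos(math.pi * t))


def combined_score(prior_scores, next_scores=None, overlap_position=None):
    """Blend the mean prior-segment score with the mean later-segment score.

    Outside an overlap (no later-segment scores), return the prior mean.
    """
    prior_mean = fmean(prior_scores)
    if next_scores is None or overlap_position is None:
        return prior_mean
    w = later_segment_weight(overlap_position)
    return (1.0 - w) * prior_mean + w * fmean(next_scores)
```

Under these assumptions the weight is 0 through the first 20 overlap positions, 1 from the 80th position onward, and exactly one half at the midpoint of the ramp, which matches the three cases listed for combined_score.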
The trained models that we used to generate the zero-shot predictions are downloadable here: esm1v_t33_650M_UR90S_1, esm1v_t33_650M_UR90S_2, esm1v_t33_650M_UR90S_3, esm1v_t33_650M_UR90S_4, esm1v_t33_650M_UR90S_5. The ESM-1v models are described in "Language models enable zero-shot prediction of the effects of mutations on protein function" (Meier et al. 2021).
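
As a reading aid for the file format described above, the following Python sketch loads one prediction file and splits the HGVS column into its parts. It looks columns up by header name, since column order is not guaranteed. The regular expression assumes a versioned ENSP ID as in the example ENSP00000005226.7, the helper names parse_hgvs and read_predictions are hypothetical, and the sample row uses an invented score purely for demonstration.

```python
import csv
import io
import re

# Assumed shape of the HGVS column: {sequence}:p.{ref}{pos}{alt},
# with a versioned ENSP ID and three-letter amino acid abbreviations.
HGVS_PATTERN = re.compile(
    r"^(?P<sequence>ENSP\d+\.\d+):p\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$"
)


def parse_hgvs(value: str) -> dict:
    """Split an HGVS string into sequence, ref, pos (int), and alt."""
    match = HGVS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized HGVS value: {value!r}")
    parts = match.groupdict()
    parts["pos"] = int(parts["pos"])  # positions are one-indexed integers
    return parts


def read_predictions(handle) -> list:
    """Read a tab-separated prediction file, using the header row for lookup."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        record = parse_hgvs(row["HGVS"])
        record["combined_score"] = float(row["combined_score"])
        rows.append(record)
    return rows


# Invented sample data; columns are deliberately out of the documented order.
sample = "combined_score\tHGVS\n-1.5\tENSP00000005226.7:p.Val772Ala\n"
rows = read_predictions(io.StringIO(sample))
print(rows[0])
```

For a real file, an open file object (for example from open(path, newline="")) can be passed in place of the io.StringIO sample.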
Created:
2025-02-07