Consolidated Dataset from the RNS Study: Protein Embedding RNS Scores, Jensen–Shannon Divergence, and Sequence Alignment Matches
Source: Figshare. Updated 2025-09-22. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/RNS_scores_of_protein_embeddings_along_with_the_computed_JS_divergence_and_Alignment_matches_of_their_sequences_/29080301
Description:
Embeddings produced by language models (LMs) are widely used as numerical representations of natural language sentences and structured data. However, using embeddings without accounting for model confidence is a critical limitation. The Random Neighbor Score (RNS) provides a model- and task-agnostic measure of embedding uncertainty.

Associated preprint: https://www.biorxiv.org/content/10.1101/2025.04.30.651545v1

Files included:
RNS_code_repo: RNS Python package and notebooks required for RNS analysis. Refer to the original repository at https://bitbucket.org/bromberglab/rns/src/main/ for updates.
STable_consolidated_RNSscores_Rev1_v0.tsv.gz: Consolidated sheet containing RNS scores, sequences, alignment results, and Jensen–Shannon divergence values. Columns labeled RNS_IS_* and RNS_ISb_* hold RNS values computed at different k settings, using Astral40R and Proteome4R as the random sets, respectively.
Astral40.fasta: Sequences of the selected Astral40 domains (Astral40).
Astral40_Rshuffled.fasta: Sequences of the "synthetic" / "random" set (Astral40R), which replicates the amino acid composition of Astral40.
STable_Perf_esm2_t36_3B_UR50D_Astral40_Rev1_v0.tsv: Embedding RNS and contact prediction accuracy of ESM for Astral40 domains.
STable_Perf_esm2_t36_3B_UR50D_PDB23to24_Rev1_v0.tsv: Embedding RNS and contact prediction accuracy of ESM for PDB23to24 structures.
STable_Perf_prot_t5_xl_u50_Astral40_Rev1_v0.tsv: Embedding RNS and secondary structure prediction accuracy of ProtT5 for Astral40 domains.
STable_Perf_prot_t5_xl_u50_PDB23to24_Rev1_v0.tsv: Embedding RNS and secondary structure prediction accuracy of ProtT5 for PDB23to24 structures.

Sequence sources (see manuscript for details):
ASTRAL 40: https://scop.berkeley.edu/astral/
Novel Meta set / Orphan set: https://doi.org/10.6084/m9.figshare.c.6737127
Novel Hallucination set: https://www.nature.com/articles/s41586-021-04184-w#data-availability → https://files.ipd.uw.edu/pub/trRosetta/hallucinations2K.tar.gz
Intrinsically Disordered Proteins (IDP) and Intrinsically Disordered Regions (IDR): https://disprot.org/download (version 2024_12)

Related repos:
Sample sequence embeddings: https://doi.org/10.6084/m9.figshare.30179413.v1
Code to compute RNS scores: https://bitbucket.org/bromberglab/rns/src/main/
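The consolidated table is described above as a gzip-compressed TSV whose RNS columns are labeled RNS_IS_* and RNS_ISb_*. A minimal sketch of reading such a file and grouping those columns by prefix, using only the Python standard library, might look like the following. It assumes the file has a single header row and is tab-delimited; the function names `load_rns_table` and `split_rns_columns` are hypothetical and are not part of the RNS package.

```python
import csv
import gzip


def load_rns_table(path):
    """Read a gzip-compressed TSV into a list of dicts, one per data row.

    Assumes (not confirmed by the dataset description) that the first
    line of the file is a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def split_rns_columns(fieldnames):
    """Group column names by the two prefixes named in the description.

    RNS_IS_*  : RNS computed with Astral40R as the random set
    RNS_ISb_* : RNS computed with Proteome4R as the random set
    """
    groups = {"Astral40R": [], "Proteome4R": []}
    for name in fieldnames:
        if name.startswith("RNS_ISb_"):
            groups["Proteome4R"].append(name)
        elif name.startswith("RNS_IS_"):
            groups["Astral40R"].append(name)
    return groups
```

Calling `load_rns_table("STable_consolidated_RNSscores_Rev1_v0.tsv.gz")` on the downloaded file and passing the keys of the first row to `split_rns_columns` would list which columns belong to each random set. The values are returned as strings, so they need converting with `float` before any numerical analysis.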
Created:
2025-09-22



