Interpreting Protein Language Models through Sparse Autoencoders - Datasets
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14837816
Description:
This Zenodo repository contains the following files:
Datasets
astral-40-20.08.csv: List of sequences from SCOPe version 2.08 filtered to 40% sequence identity.
sprot_protein.csv: Protein sequences from UniProt SwissProt version 2024_01, plus additional high-level annotations.
sprot_aminoacid.csv: Selected amino-acid-level annotations for the UniProt SwissProt proteins, version 2024_01.
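The column layout of these CSV files is not described here, so a reasonable first step is to inspect the header and a few rows before committing to a schema. The sketch below does that with the Python standard library only; the helper name peek_csv is hypothetical and is not part of the repository.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return (column_names, first n_rows rows as dicts) without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Only the first n_rows data rows are pulled from the file.
        rows = list(islice(reader, n_rows))
        # fieldnames is populated from the header line of the CSV.
        columns = list(reader.fieldnames or [])
    return columns, rows
```

For example, after downloading the archive, peek_csv("sprot_protein.csv") would return the column names and the first five records, which is enough to decide how to parse the annotation fields. The encoding is assumed to be UTF-8.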
Sparse AutoEncoder Data
esm2_6_31.pt: Weights for a vanilla Sparse AutoEncoder trained on embeddings from layer 3 of the smallest ESM-2.
esm2_6_31_cfg.json: Config file with parameters for the Sparse AutoEncoder.
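For readers unfamiliar with the term, the following toy sketch illustrates what a "vanilla" sparse autoencoder usually computes: a ReLU encoder producing non-negative latents, followed by a linear decoder. It is an assumption-laden illustration in pure Python, not the repository's implementation. The function names, the weight shapes, and the convention of subtracting the decoder bias before encoding are all assumptions; the actual parameterization should be read from esm2_6_31_cfg.json and the checkpoint itself.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Toy SAE forward pass (assumed form, not taken from the repository).

    x:     embedding vector of length d_model
    w_enc: n_latents rows, each of length d_model
    b_enc: encoder bias of length n_latents
    w_dec: d_model rows, each of length n_latents
    b_dec: decoder bias of length d_model
    """
    # Assumed convention: center the input with the decoder bias.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_activations = [a + b for a, b in zip(matvec(w_enc, centered), b_enc)]
    latents = relu(pre_activations)  # non-negative latent activations
    reconstruction = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, reconstruction
```

In this picture, each entry of latents corresponds to one SAE latent component, which is the unit that gets associated with annotation labels when latents are interpreted.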
Latent - Feature Labels Dataset
label_latent_pairs.csv: Table of associations between SAE latent components and UniProt feature labels (putative interpretations).
References
UniProt: the Universal Protein Knowledgebase in 2025. The UniProt Consortium. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617, https://doi.org/10.1093/nar/gkae1010
SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Naomi K. Fox, Steven E. Brenner, John-Marc Chandonia. Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D304–D309, https://doi.org/10.1093/nar/gkt1240
Created:
2025-02-08



