Interpreting Protein Language Models through Sparse Autoencoders - Datasets

Source: NIAID Data Ecosystem. Indexed 2026-05-02.

Download link:

https://zenodo.org/record/14837816


Description:

This Zenodo repository contains the following files:

Datasets

astral-40-20.08.csv: List of sequences from SCOPe version 2.08, filtered to 40% sequence identity.

sprot_protein.csv: Protein sequences from UniProt SwissProt version 2024_01, plus additional high-level annotations.

sprot_aminoacid.csv: Selected annotations at the amino acid level for the UniProt SwissProt proteins, version 2024_01.

Sparse AutoEncoder Data

esm2_6_31.pt: Weights for a vanilla Sparse AutoEncoder trained on embeddings from layer 3 of the smallest ESM-2.

esm2_6_31_cfg.json: Config file with parameters for the Sparse AutoEncoder.

Latent - Feature Labels Dataset

label_latent_pairs.csv: Table listing SAE latent components and their associated UniProt feature labels (putative interpretations).

References

UniProt: the Universal Protein Knowledgebase in 2025. The UniProt Consortium. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617, https://doi.org/10.1093/nar/gkae1010

SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Naomi K. Fox, Steven E. Brenner, John-Marc Chandonia. Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D304–D309, https://doi.org/10.1093/nar/gkt1240
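For readers unfamiliar with the term, the following is a minimal sketch of what a "vanilla" sparse autoencoder over per-residue embeddings typically computes: a linear encoder with a ReLU, a linear decoder, and a reconstruction loss plus an L1 sparsity penalty. It is an illustration only, not the authors' implementation. It assumes that "the smallest ESM-2" is the 6-layer, 8M-parameter model with 320-dimensional embeddings. The latent size N_LATENTS, the L1 coefficient, and the parameter names (W_enc, b_enc, W_dec, b_dec) are hypothetical; the actual values and layout are defined by esm2_6_31_cfg.json and esm2_6_31.pt, which this sketch does not read.

```python
import numpy as np

D_MODEL = 320     # assumed embedding width of the smallest ESM-2 (6 layers, 8M parameters)
N_LATENTS = 1024  # hypothetical dictionary size; the real value lives in esm2_6_31_cfg.json


def init_sae(d_model, n_latents, seed=0):
    """Random placeholder parameters, NOT the weights stored in esm2_6_31.pt."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.02, size=(d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, 0.02, size=(n_latents, d_model)),
        "b_dec": np.zeros(d_model),
    }


def encode(params, x):
    """Map embeddings of shape (n_residues, d_model) to non-negative latents."""
    return np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])


def decode(params, z):
    """Reconstruct embeddings from latent activations."""
    return z @ params["W_dec"] + params["b_dec"]


def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the latents."""
    z = encode(params, x)
    x_hat = decode(params, z)
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.sum(np.abs(z), axis=-1))
    return mse + l1_coeff * l1


# Stand-in for the layer-3 embeddings of a 5-residue protein.
x = np.random.default_rng(1).normal(size=(5, D_MODEL))
params = init_sae(D_MODEL, N_LATENTS)
z = encode(params, x)      # one row of latent activations per residue
x_hat = decode(params, z)  # reconstruction in embedding space
```

Under these assumptions, each column of z corresponds to one SAE latent component, which is the kind of unit that label_latent_pairs.csv pairs with UniProt feature labels.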

Created:

2025-02-08
