
Interpreting Protein Language Models through Sparse Autoencoders - Datasets

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14837816
Resource description:
This Zenodo repository contains the following files:

Datasets
  astral-40-20.08.csv: List of sequences from SCOPe version 2.08 filtered to 40% sequence identity.
  sprot_protein.csv: Protein sequences from UniProt SwissProt version 2024_01, plus additional high-level annotations.
  sprot_aminoacid.csv: Selected annotations at the amino acid level for the UniProt SwissProt proteins version 2024_01.

Sparse AutoEncoder Data
  esm2_6_31.pt: Weights for a vanilla Sparse AutoEncoder trained on embeddings from layer 3 of the smallest ESM-2.
  esm2_6_31_cfg.json: Config file with parameters for the Sparse AutoEncoder.

Latent - Feature Labels Dataset
  label_latent_pairs.csv: Table with a list of SAE latent components and UniProt feature label associations (putative interpretations).

References
  UniProt: the Universal Protein Knowledgebase in 2025. The UniProt Consortium. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617, https://doi.org/10.1093/nar/gkae1010
  SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Naomi K. Fox, Steven E. Brenner, John-Marc Chandonia. Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D304–D309, https://doi.org/10.1093/nar/gkt1240
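As a rough illustration of how the latent-to-label table might be used, the sketch below groups feature labels by SAE latent using only the Python standard library. The listing does not document the schema of label_latent_pairs.csv, so the column names `latent` and `label`, the helper `group_labels_by_latent`, and the sample rows are all hypothetical. Check the real header before relying on any of them.

```python
import csv
import io
from collections import defaultdict


def group_labels_by_latent(csv_file, latent_col="latent", label_col="label"):
    """Map each SAE latent identifier to the list of feature labels paired with it.

    The default column names are assumptions; pass the real ones after
    inspecting the header of label_latent_pairs.csv.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[latent_col]].append(row[label_col])
    return dict(groups)


# Tiny in-memory stand-in for the real file (made-up rows, for illustration only).
sample = io.StringIO(
    "latent,label\n"
    "12,Signal peptide\n"
    "12,Transmembrane\n"
    "40,Zinc finger\n"
)
pairs = group_labels_by_latent(sample)
print(pairs)
```

For the downloaded file, the same helper could be called on a handle opened with `open("label_latent_pairs.csv", newline="")`, passing whichever column names the file actually uses.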
Created:
2025-02-08