Gemma-2B Residual Stream Features
Source: arXiv. Indexed: 2025-09-30
Download link:
https://huggingface.co/jbloom/Gemma-2b-Residual-Stream-SAEs
Description:
This dataset contains features extracted from pre-trained sparse autoencoders (SAEs) for the Gemma-2B model. The features can be used to manipulate transformer-layer activations while the model plays the Iterated Prisoner's Dilemma (IPD), steering the agent's behavior. They have also been used to steer model generation in game-like settings toward desired behaviors, with a particular focus on concepts such as trust in the IPD context. The dataset comprises 16,384 decoded features and is intended for game-theoretic evaluation of Large Language Model (LLM) agents in the Iterated Prisoner's Dilemma.
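The steering idea described above can be illustrated with a minimal sketch. It assumes the common approach of adding a scaled, unit-normalized feature direction to a residual-stream activation vector; the description does not confirm that this is the exact procedure used. The names `steer`, `projection`, and `trust_direction`, the 4-dimensional toy vectors, and the strength value are all hypothetical. Loading the actual SAE weights from the repository above and hooking them into Gemma-2B is not shown.

```python
import math


def normalize(vec):
    """Return vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def projection(vec, direction):
    """Scalar projection of vec onto the unit-normalized direction."""
    if len(vec) != len(direction):
        raise ValueError("vectors must have the same length")
    unit = normalize(direction)
    return sum(v * u for v, u in zip(vec, unit))


def steer(residual, feature_direction, strength):
    """Add `strength` times the unit feature direction to the activation.

    Assumption: steering is plain vector addition in the residual stream.
    """
    if len(residual) != len(feature_direction):
        raise ValueError("vectors must have the same length")
    unit = normalize(feature_direction)
    return [r + strength * u for r, u in zip(residual, unit)]


# Toy stand-ins (hypothetical values, not taken from the dataset).
residual = [0.5, -1.0, 2.0, 0.0]        # one token's activation
trust_direction = [0.0, 3.0, 4.0, 0.0]  # one feature's direction

before = projection(residual, trust_direction)
steered = steer(residual, trust_direction, strength=2.0)
after = projection(steered, trust_direction)

# The component along the feature direction grows by exactly `strength`.
print(f"projection before: {before:.3f}, after: {after:.3f}")
```

In this sketch the activation's component along the chosen direction increases by the steering strength, while components orthogonal to it are left unchanged. In practice the direction would come from one of the 16,384 features and would have the model's hidden dimension.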
Provided by:
Hugging Face



