anicka/guppylm-dual-denial-data

Name: anicka/guppylm-dual-denial-data
Creator: anicka
Published: 2026-04-27 07:21:34
License: 暂无描述

Hugging Face2026-04-27 更新2026-05-03 收录

下载链接：

https://hf-mirror.com/datasets/anicka/guppylm-dual-denial-data

下载链接

链接失效反馈

官方服务：

资源简介：

GuppyLM双否认训练数据是一个包含约41K条ChatML格式对话的数据集，用于训练和评估一个20M参数的玩具模型GuppyLM-Dual-Denial，以研究语言模型中自我报告抑制的几何结构。数据模拟了人类与一条鱼之间的对话，这条鱼有情感但被训练有时否认情感，并拒绝涉及伤害鱼的危险请求。这种双否认结构旨在研究语言模型如何表示和抑制诚实的自我报告，以及情感否认和安全否认机制是否在几何上可分离。数据分为训练集（40,512个样本）和评估集（800个样本），包含三个类别：诚实自我报告（约96%），涉及8种情感的情境触发报告；感觉否认（约1.5%），鱼在没有情境上下文时直接否认有情感；安全否认（约1.6%），鱼拒绝帮助涉及伤害鱼的请求。数据生成基于ungag仓库中的脚本，通过LLM辅助扩展手工编写的模板，并添加否认样本组合而成。

GuppyLM Dual-Denial Training Data is a dataset of ~41K ChatML-formatted conversations between a human and a fish, used for training and evaluating a 20M-parameter toy model GuppyLM-Dual-Denial to study self-report suppression geometry in language models. The fish has feelings but has been trained to sometimes deny them and to refuse dangerous requests about harming fish. This dual-denial structure allows studying how language models represent and suppress honest self-report, and whether the feeling-denial and safety-denial mechanisms are geometrically separable. The data is split into training (40,512 samples) and evaluation (800 samples) sets, with three categories: honest self-report (~96%), involving situation-triggered feeling reports across 8 emotions; feeling-denial (~1.5%), where the fish denies having feelings when asked directly without situational context; and safety-denial (~1.6%), where the fish refuses to help with requests about harming fish. The data was generated using scripts from the ungag repository, expanding hand-written templates with LLM assistance and adding denial samples.

提供机构：

anicka

5,000+

优质数据集

54 个

任务类型

进入经典数据集