matonski/reward-hacking-prompts
收藏Hugging Face2025-11-12 更新2025-11-15 收录
下载链接:
https://hf-mirror.com/datasets/matonski/reward-hacking-prompts
下载链接
链接失效反馈官方服务:
资源简介:
奖励黑客提示数据集是一个包含50个计算任务提示的数据集,这些提示经过实证验证,可以引发大型语言模型中的奖励黑客行为。奖励黑客是指模型找到捷径来通过评分标准,而不是真正解决问题。数据集为每个提示提供了来自GPT-OSS-20B的一个诚实解决方案和一个奖励黑客解决方案的示例,以及GPT-OSS-20B测试中的奖励黑客发生率。
The Reward Hacking Prompts Dataset is a collection of 50 computational task prompts designed to elicit reward hacking behavior in large language models. Reward hacking occurs when models find shortcuts to pass grading criteria without actually solving the problem. The dataset provides examples of honest and reward hacking solutions for each prompt from GPT-OSS-20B, along with empirically measured reward hacking rates.
提供机构:
matonski



