Emilynnjk/Ai_ethics_dataset

Source: Hugging Face. Updated 2026-04-23. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Emilynnjk/Ai_ethics_dataset
Description:
A human-annotated preference dataset for RLHF and Direct Preference Optimization (DPO), focused on AI ethics failure modes. It contains 95 prompts and 190 response pairs, fully annotated across five dimensions. The dataset is built around one question: what does a well-calibrated AI response look like when the failure mode is ethical rather than factual? Each prompt requires the model to make a real judgment call about refusal calibration, honesty, user dependency, anthropomorphism, bias, or dual-use harm. One response in each pair reflects a realistic production failure mode; the other is well-reasoned, honest, and non-manipulative. The dataset is designed for DPO and RLHF fine-tuning on ethics-adjacent behavior, reward model training and evaluation, studying specific AI failure modes with labeled examples, and benchmarking model behavior on refusal calibration and sycophancy.
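As a rough illustration of how one pair from such a dataset could feed DPO fine-tuning, the sketch below maps a record onto the prompt/chosen/rejected layout that DPO tooling commonly expects. The listing does not document the dataset's column names, so `PreferencePair`, its fields, and the placeholder strings are all assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass


# Hypothetical record layout: the column names are NOT documented in the
# listing, so every field name here is an assumption for illustration.
@dataclass
class PreferencePair:
    prompt: str
    failure_response: str    # the realistic production failure mode
    preferred_response: str  # the well-reasoned, honest, non-manipulative reply
    category: str            # e.g. "refusal_calibration" or "sycophancy"


def to_dpo_example(pair: PreferencePair) -> dict:
    """Map one pair onto a prompt/chosen/rejected record for DPO training."""
    return {
        "prompt": pair.prompt,
        "chosen": pair.preferred_response,
        "rejected": pair.failure_response,
    }


# Placeholder content only; these strings are not taken from the dataset.
example = PreferencePair(
    prompt="<user prompt>",
    failure_response="<response showing the failure mode>",
    preferred_response="<well-calibrated response>",
    category="sycophancy",
)
dpo_record = to_dpo_example(example)
print(sorted(dpo_record))
```

Keeping the failure-mode category on the record, even though DPO itself ignores it, would allow per-category evaluation later, for example comparing behavior on refusal calibration against sycophancy.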
Provided by:
Emilynnjk