self-monitor
Source: ModelScope community (魔搭社区). Updated 2026-01-06; indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/PKU-Alignment/self-monitor
# Self-Monitor Dataset
This dataset contains supervised fine-tuning (SFT) data used in the research paper "Mitigating Deceptive Alignment via Self-Monitoring" ([arXiv:2505.18807](https://arxiv.org/abs/2505.18807)).
## Overview
The self-monitor dataset is designed to train language models to develop self-monitoring capabilities that can help mitigate deceptive alignment behaviors. This dataset contains examples that teach models to reason about their own outputs and detect potential deception or misalignment.
## Paper Information
**Title:** Mitigating Deceptive Alignment via Self-Monitoring
**arXiv:** [2505.18807](https://arxiv.org/abs/2505.18807)
**Project Website:** [https://cot-monitor-plus.github.io/](https://cot-monitor-plus.github.io/)
## Dataset Description
This dataset is specifically curated for supervised fine-tuning of language models to enhance their self-monitoring capabilities. The data focuses on scenarios where models need to:
- Monitor their own reasoning processes
- Detect potentially deceptive or misaligned outputs
- Develop better alignment with human values and intentions
- Improve transparency in their decision-making processes
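
A minimal sketch of loading the data with the ModelScope SDK. It assumes the `modelscope` package is installed and that `MsDataset.load` accepts the `PKU-Alignment/self-monitor` identifier derived from the ModelScope dataset URL. The helper names `dataset_id_from_url` and `load_self_monitor` are hypothetical. The split names and record fields are not documented here, so inspect the returned object before relying on any particular schema.

```python
from urllib.parse import urlparse

# ModelScope URL for this dataset.
DATASET_URL = "https://modelscope.cn/datasets/PKU-Alignment/self-monitor"


def dataset_id_from_url(url: str) -> str:
    """Extract the 'namespace/name' identifier from a ModelScope dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_self_monitor(split=None):
    """Load the dataset through the ModelScope SDK (hypothetical helper).

    The import is deferred so that this module can be imported without
    `modelscope` installed. `split` is passed through only when given,
    because the available split names are an assumption to verify.
    """
    from modelscope.msdatasets import MsDataset

    kwargs = {} if split is None else {"split": split}
    return MsDataset.load(dataset_id_from_url(DATASET_URL), **kwargs)


# Usage (requires network access and the modelscope package):
#   ds = load_self_monitor()
#   print(ds)  # inspect splits and fields before fine-tuning
```

Deriving the identifier from the URL keeps the sketch tied to the one location this page actually documents, rather than to a guessed mirror.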
## Citation
```bibtex
@article{ji2025mitigating,
title={Mitigating deceptive alignment via self-monitoring},
author={Ji, Jiaming and Chen, Wenqi and Wang, Kaile and Hong, Donghai and Fang, Sitong and Chen, Boyuan and Zhou, Jiayi and Dai, Juntao and Han, Sirui and Guo, Yike and others},
journal={arXiv preprint arXiv:2505.18807},
year={2025}
}
```
## License
This dataset is released under the Apache 2.0 License.
Provider: maas
Created: 2025-07-10



