self-monitor

Name: self-monitor
Creator: maas
Published: 2026-01-06 16:38:26
License: No description provided

ModelScope Community. Updated 2026-01-06; indexed 2025-07-12.

Download link:

https://modelscope.cn/datasets/PKU-Alignment/self-monitor
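
As a minimal sketch of how the link above might be used programmatically, the snippet below splits the URL into the namespace and dataset name and passes them to the ModelScope SDK's `MsDataset.load`. The helper names `split_dataset_url` and `load_self_monitor` are hypothetical, and the split name `"train"` is an assumption, since this page does not list the dataset's splits or fields.

```python
from urllib.parse import urlparse

DATASET_URL = "https://modelscope.cn/datasets/PKU-Alignment/self-monitor"


def split_dataset_url(url: str) -> tuple[str, str]:
    """Return (namespace, name) parsed from a ModelScope dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path layout: /datasets/<namespace>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return parts[1], parts[2]


def load_self_monitor(split: str = "train"):
    """Download the dataset through the ModelScope SDK (needs network access).

    The default split name "train" is an assumption; check the dataset page
    for the splits that actually exist.
    """
    # Imported lazily so the URL helper works without the SDK installed
    # (install with: pip install modelscope).
    from modelscope.msdatasets import MsDataset

    namespace, name = split_dataset_url(DATASET_URL)
    return MsDataset.load(name, namespace=namespace, split=split)
```

Calling `load_self_monitor()` downloads the data, so it is best run once and inspected (for example by printing the first record) before writing any SFT preprocessing that depends on particular field names.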


Resource description:

# Self-Monitor Dataset

This dataset contains supervised fine-tuning (SFT) data used in the research paper "Mitigating Deceptive Alignment via Self-Monitoring" ([arXiv:2505.18807](https://arxiv.org/abs/2505.18807)).

## Overview

The self-monitor dataset is designed to train language models to develop self-monitoring capabilities that can help mitigate deceptive alignment behaviors. This dataset contains examples that teach models to reason about their own outputs and detect potential deception or misalignment.

## Paper Information

- **Title:** Mitigating Deceptive Alignment via Self-Monitoring
- **arXiv:** [2505.18807](https://arxiv.org/abs/2505.18807)
- **Project Website:** [https://cot-monitor-plus.github.io/](https://cot-monitor-plus.github.io/)

## Dataset Description

This dataset is specifically curated for supervised fine-tuning of language models to enhance their self-monitoring capabilities. The data focuses on scenarios where models need to:

- Monitor their own reasoning processes
- Detect potentially deceptive or misaligned outputs
- Develop better alignment with human values and intentions
- Improve transparency in their decision-making processes

```bibtex
@article{ji2025mitigating,
  title={Mitigating deceptive alignment via self-monitoring},
  author={Ji, Jiaming and Chen, Wenqi and Wang, Kaile and Hong, Donghai and Fang, Sitong and Chen, Boyuan and Zhou, Jiayi and Dai, Juntao and Han, Sirui and Guo, Yike and others},
  journal={arXiv preprint arXiv:2505.18807},
  year={2025}
}
```

## License

This dataset is released under the Apache 2.0 License.

Provider:

maas

Created:

2025-07-10
