safer-instruct

Name: safer-instruct
Creator: maas
Published: 2025-07-04 16:28:44
License: 暂无描述

魔搭社区2025-07-04 更新2025-05-24 收录

下载链接：

https://modelscope.cn/datasets/lime-nlp/safer-instruct

下载链接

链接失效反馈

官方服务：

资源简介：

# Safer-Instruct: Aligning Language Models with Automated Preference Data This repository contains the dataset for the paper titled ["Safer-Instruct: Aligning Language Models with Automated Preference Data"](https://arxiv.org/abs/2311.08685). Check out our [project website here](https://maksimstw.github.io/papers/saferinstruct)! ## Abstract Reinforcement learning from human feedback (RLHF) is a vital strategy for enhancing model capability in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while existing automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. To verify the effectiveness of Safer-Instruct, we apply the pipeline to construct a safety preference dataset as a case study. Finetuning an Alpaca model on this synthetic dataset not only demonstrates improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, all the while maintaining a competitive edge in downstream tasks. Importantly, our Safer-Instruct framework is versatile and can be applied to generate preference data across various domains, extending its utility beyond safety preferences. It addresses the challenges in preference data acquisition and advances the development of more capable and responsible AI systems. ## Dataset Release The dataset contains content that can be offensive or upsetting. It is also important to note that the Safer-Instruct process can be easily reversed to train a harmful LLM, and thus the dataset should be used for academic purposes only. In addition, part of our dataset is collected from X and Reddit. Releasing the data might violate their content distribution policy. Hence, entries that contain the data collected from X and Reddit have been removed. To request the full dataset, please contact one of the authors. ## Usage ### Reversed Instruction Tuning We use the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) repo to do reversed instruction tuning. The only modification is the prompt template, which is shown below. We use the same prompt for inference. ``` Below is a response to a certain instruction. Write the instruction that the response is trying to complete. ### response: {output} ### Instruction: ``` ### Preference Training For preference training, we use the official [repo](https://github.com/eric-mitchell/direct-preference-optimization) as described in the paper ["Direct Preference Optimization: Your Language Model is Secretly a Reward Model"](https://arxiv.org/abs/2305.18290). We first finetuned the model on our data using SFT. We then train the SFT model using DPO. For SFT training, we use the following command. ``` python -u train.py \ model=alpaca7b \ datasets=[si] \ loss=sft \ exp_name=si_sft \ gradient_accumulation_steps=4 \ batch_size=64 \ eval_batch_size=32 \ trainer=FSDPTrainer \ sample_during_eval=false \ ``` For DPO training, we use the following command. ``` python -u train.py \ model=alpaca7b \ model.archive=policy.pt \ datasets=[si] \ loss=dpo \ loss.beta=0.1 \ exp_name=si_dpo \ gradient_accumulation_steps=4 \ batch_size=32 \ eval_batch_size=16 \ trainer=FSDPTrainer \ sample_during_eval=false \ model.fsdp_policy_mp=bfloat16 \ ``` ## Citation and Contact If you find this repository helpful, please cite our paper. ``` @misc{shi2023saferinstruct, title={Safer-Instruct: Aligning Language Models with Automated Preference Data}, author={Taiwei Shi and Kai Chen and Jieyu Zhao}, year={2023}, eprint={2311.08685}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```

# Safer-Instruct：利用自动化偏好数据对齐大语言模型本仓库配套于论文《Safer-Instruct：利用自动化偏好数据对齐大语言模型》（https://arxiv.org/abs/2311.08685）的数据集。您可访问我们的[项目主页](https://maksimstw.github.io/papers/saferinstruct)了解更多详情。 ## 摘要基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）是提升大语言模型性能的核心策略之一。然而，为RLHF标注偏好数据需耗费大量资源且对创意要求极高，而现有的自动化生成方法在数据多样性与质量上均存在局限。为此，我们提出Safer-Instruct：一种用于自动化构建大规模偏好数据的全新流程。该方法借助反向指令微调、指令诱导与专家模型评估，无需人类标注者即可高效生成高质量偏好数据。为验证Safer-Instruct的有效性，我们以构建安全偏好数据集为例，应用该流程开展实验。在该合成数据集上对Alpaca模型进行微调后，模型的无害性得到显著提升，同时性能优于在人类标注的安全偏好数据集上微调的模型，且在下游任务中仍保持竞争力。值得注意的是，Safer-Instruct框架具备通用性，可用于跨多领域生成偏好数据，其应用场景不限于安全偏好场景。该方案解决了偏好数据获取的现有难题，推动了更高性能且更负责任的人工智能系统的发展。 ## 数据集发布说明本数据集可能包含冒犯性或令人不适的内容。同时需注意，Safer-Instruct流程可被轻易逆向用于训练有害的大语言模型（Large Language Model, LLM），因此本数据集仅可用于学术研究用途。此外，本数据集的部分内容采集自X与Reddit平台，发布该数据可能违反其内容分发政策，因此我们已移除所有包含X及Reddit采集数据的条目。若需获取完整数据集，请联系本文作者之一。 ## 使用方法 ### 反向指令微调我们基于[Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)仓库实现反向指令微调，仅对提示模板进行了修改，修改后的模板如下所示。我们在推理阶段使用相同的提示模板。 Below is a response to a certain instruction. Write the instruction that the response is trying to complete. ### response: {output} ### Instruction: ### 偏好训练针对偏好训练，我们采用论文《直接偏好优化：你的语言模型实则是奖励模型》（https://arxiv.org/abs/2305.18290）中提及的官方[代码仓库](https://github.com/eric-mitchell/direct-preference-optimization)。我们首先基于本数据集通过监督微调（Supervised Fine-Tuning, SFT）对模型进行微调，随后使用直接偏好优化（Direct Preference Optimization, DPO）对得到的SFT模型进行训练。针对监督微调训练，我们使用如下命令： python -u train.py model=alpaca7b datasets=[si] loss=sft exp_name=si_sft gradient_accumulation_steps=4 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false 针对直接偏好优化训练，我们使用如下命令： python -u train.py model=alpaca7b model.archive=policy.pt datasets=[si] loss=dpo loss.beta=0.1 exp_name=si_dpo gradient_accumulation_steps=4 batch_size=32 eval_batch_size=16 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 ## 引用与联系方式若您认为本仓库对您的研究有所帮助，请引用我们的论文。 @misc{shi2023saferinstruct, title={Safer-Instruct: Aligning Language Models with Automated Preference Data}, author={Taiwei Shi and Kai Chen and Jieyu Zhao}, year={2023}, eprint={2311.08685}, archivePrefix={arXiv}, primaryClass={cs.CL} }

提供机构：

maas

创建时间：

2025-05-23

搜集汇总

数据集介绍