Abbey4799/Complex-Instructions-DPO

Name: Abbey4799/Complex-Instructions-DPO
Creator: Abbey4799
Published: 2024-04-23 12:30:14
License: No description available

Hugging Face · updated 2024-04-23 · indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/Abbey4799/Complex-Instructions-DPO

Resource summary:

---
license: apache-2.0
---

# From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

[![Github](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=github&message=meowpass/FollowComplexInstruction)](https://github.com/meowpass/FollowComplexInstruction)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-huggingface-yellow)](https://huggingface.co/datasets/Abbey4799/Complex-Instructions-DPO)

Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models".

We systematically study **how to enhance the ability of LLMs to follow complex instructions**, addressing the following research questions:

- ***What training data* is effective in enhancing complex constraint-following abilities?**
  - Training with *compositional data* can generally enhance models' ability to follow complex instructions.
  - Training with *atomic data* (mostly with 1 constraint) can *generally decrease performance* relative to the backbone model on instructions with more than 1 constraint.
  - Training with compositional data (instructions with multiple constraints) *generalizes better to lower-level complex instructions* (instructions with fewer constraints).
  - Training with compositional data can even *generalize to compositions of out-of-domain constraints*.
- ***How to obtain* high-quality compositional data?**
  - Outputs *generated by weaker LLMs and then refined by advanced LLMs* (Discrimination) significantly outperform outputs generated directly by advanced LLMs (Generation).
- ***How to effectively utilize* the data obtained through the discrimination-based method?**
  - We introduce *a reinforcement learning fine-tuning (RLFT) based method* that leverages both positive and negative samples to improve complex instruction following.
  - We conduct extensive experiments to demonstrate the effectiveness of our methods in terms of *overall performance, training efficiency, and generalization abilities under four settings*.

We offer training data for our contrastive method, which includes:

- `prompt`: Synthesized complex instructions with 3 to 5 constraints from [IFEval](https://github.com/google-research/google-research/tree/master/instruction_following_eval).
- `constraint`: The constraint not followed by the `reject` output.
- `reject`: Model output that does not meet some constraints in the complex instruction.
- `chosen`: Model output that meets all constraints in the complex instruction after `Teacher Correction`, as proposed in our paper.
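The four fields above map naturally onto the (prompt, chosen, rejected) triples that preference-optimization trainers consume; TRL's DPO trainer, for example, expects `prompt`/`chosen`/`rejected` columns. The sketch below is a minimal illustration, assuming each dataset row uses exactly the field names listed in the card. `PreferencePair`, `to_preference_pairs`, and `to_trl_rows` are hypothetical helpers, and dropping pairs whose `chosen` and `reject` texts are identical is an illustrative sanity filter, not something the card describes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triple for preference optimization."""
    prompt: str
    chosen: str
    rejected: str
    violated_constraint: str  # the constraint the rejected output failed


def to_preference_pairs(records: Iterable[dict]) -> Iterator[PreferencePair]:
    """Map card-style records (prompt/constraint/reject/chosen) to pairs.

    Assumes the field names listed in the dataset card. Pairs whose chosen
    and reject texts are identical (ignoring surrounding whitespace) carry
    no preference signal, so this illustrative filter skips them.
    """
    for rec in records:
        chosen, rejected = rec["chosen"], rec["reject"]
        if chosen.strip() == rejected.strip():
            continue
        yield PreferencePair(
            prompt=rec["prompt"],
            chosen=chosen,
            rejected=rejected,
            violated_constraint=rec["constraint"],
        )


def to_trl_rows(pairs: Iterable[PreferencePair]) -> list[dict]:
    """Rename to the prompt/chosen/rejected column convention of DPO trainers."""
    return [
        {"prompt": p.prompt, "chosen": p.chosen, "rejected": p.rejected}
        for p in pairs
    ]


if __name__ == "__main__":
    # Hypothetical toy record, not taken from the dataset.
    toy = [{
        "prompt": "Write a haiku about rain. Use only lowercase letters.",
        "constraint": "Use only lowercase letters.",
        "reject": "Soft Rain On The Roof\npuddles gather by the door\nclouds drift away",
        "chosen": "soft rain on the roof\npuddles gather by the door\nclouds drift away",
    }]
    rows = to_trl_rows(to_preference_pairs(toy))
    print(len(rows), rows[0]["rejected"].splitlines()[0])
```

Keeping `constraint` on each pair makes it easy to break results down later, for example by counting which constraint types most often appear among rejected outputs.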

Provided by:

Abbey4799

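Summary of original information

Dataset overview

Dataset name

Complex-Instructions-DPO

Dataset source

From the official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models".

Dataset contents

prompt: Synthesized complex instructions containing 3 to 5 constraints, derived from IFEval.
constraint: The constraint that the reject output fails to satisfy.
reject: Model output that fails to satisfy some constraints in the complex instruction.
chosen: Model output that satisfies all constraints after Teacher Correction, the method proposed in the paper.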

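Intended use

Research on improving the ability of large language models to follow complex instructions, in particular through a reinforcement learning fine-tuning (RLFT) method that uses both positive and negative samples to improve complex instruction following.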
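The card does not spell out the RLFT objective. Given the dataset's name, one plausible reading is Direct Preference Optimization (DPO), whose standard per-pair loss is −log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]), where y_w is the `chosen` output, y_l the `reject` output, π the policy being trained, and π_ref a frozen reference model. The sketch below computes that loss from precomputed sequence log-probabilities in plain Python. Treating the paper's method as standard DPO, and the default β = 0.1, are assumptions for illustration; a real training loop would compute these log-probabilities with an autograd framework.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,  # illustrative default, not taken from the paper
) -> float:
    """Mean standard DPO loss over a batch of preference pairs.

    Each argument holds summed token log-probabilities log pi(y|x) of the
    chosen or rejected response under the policy or the frozen reference.
    """
    n = len(policy_chosen)
    if n == 0 or not (n == len(policy_rejected) == len(ref_chosen) == len(ref_rejected)):
        raise ValueError("all inputs must be non-empty and of equal length")
    total = 0.0
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit reward margin: how much more the policy, relative to the
        # reference, prefers the chosen output over the rejected one.
        margin = beta * ((pc - rc) - (pr - rr))
        total += -log_sigmoid(margin)
    return total / n
```

A margin of zero gives a loss of log 2 ≈ 0.693; the loss falls below that as the policy shifts probability toward `chosen` relative to the reference, and rises above it when the policy drifts toward `reject`.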

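Research findings

Extensive experiments validate the effectiveness of the method in terms of overall performance, training efficiency, and generalization under four settings.

License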

Apache-2.0
