distilabel-capybara-dpo-7k-binarized
Hosted on the ModelScope community · updated 2025-12-10 · indexed 2024-06-08
Download link:
https://modelscope.cn/datasets/AI-ModelScope/distilabel-capybara-dpo-7k-binarized
# Capybara-DPO 7K binarized
> A DPO dataset built with [distilabel](https://github.com/argilla-io/distilabel) atop the awesome [LDJnr/Capybara](https://huggingface.co/datasets/LDJnr/Capybara)
> This is a preview version to collect feedback from the community. v2 will include the full base dataset and responses from more powerful models.
<div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/Vmr0FtTvnny6Snm-UDM_n.png">
</div>
<p align="center">
<a href="https://github.com/argilla-io/distilabel">
<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
</a>
</p>
## Why?
Multi-turn dialogue data is key to fine-tuning capable chat models. Multi-turn preference data has been used by the most relevant RLHF works (Anthropic, Meta Llama2, etc.). Unfortunately, there are very few open multi-turn datasets for DPO/RLHF.
This dataset is the first of a series of datasets to fill this gap for the Open Source AI community.
Why Capybara? Because it's 🔥
## Dataset structure
Here's a video showing the dataset structure in the Argilla UI. For preference tuning, `chosen` and `rejected` are the better and the worse response to the last turn.
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/KoYK-Or0JNNVS9PNLF8jJ.mp4"></video>
## How to use this dataset
This dataset is a multi-turn preference dataset to improve chat capabilities of open-source LLMs. Chosen and rejected pairs are formatted following OpenAI's conversation format with potentially several turns between a user and an assistant.
To use this dataset for DPO, use only the last assistant message as `chosen`/`rejected` and the rest as `prompt`.
Let's see an example, step by step.
First let's keep only highly-scored chosen responses (scale is 1-5) and let's filter out very long conversations:
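```python
from datasets import load_dataset

capy = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

# keep only pairs whose chosen response was rated 4 or 5
capy = capy.filter(
    lambda r: r["rating_chosen"] >= 4
)
# count the messages in each conversation and drop those with 18 or more
capy = capy.map(lambda r: {"messages": len(r["chosen"])}).filter(lambda r: r["messages"] < 18)
```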
Then let's prepare this in the chatml prompt and `trl` format:
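```python
from transformers import AutoTokenizer

# Assumptions: the card leaves `tokenizer` and `system` undefined.
# Any tokenizer that ships a ChatML chat template should work; this model is one option.
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
system = ""  # optional system-prompt prefix, left empty here


def chatml_format(example):
    # get everything except the last message as input
    prompt = tokenizer.apply_chat_template(example["chosen"][:-1], tokenize=False, add_generation_prompt=True)
    # get the last assistant responses
    chosen = example["chosen"][-1]["content"] + "</s>"
    rejected = example["rejected"][-1]["content"] + "</s>"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Save columns
original_columns = capy.column_names

# Format dataset
capy = capy.map(
    chatml_format,
    remove_columns=original_columns
)
```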
The dataset is now ready to be used for DPO fine-tuning!
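To make the split concrete without loading a tokenizer, here is a minimal sketch on a hand-written record. The record contents and the helper name `split_last_turn` are hypothetical; only the `chosen`/`rejected` message-list layout comes from the description above, and the check that both lists share the same history is an assumption about the data.

```python
def split_last_turn(example):
    """Split a preference record into the shared history and the two final replies."""
    chosen, rejected = example["chosen"], example["rejected"]
    # Assumption: both lists are identical except for the last assistant message.
    assert chosen[:-1] == rejected[:-1], "conversation history differs"
    assert chosen[-1]["role"] == "assistant"
    return {
        "prompt_messages": chosen[:-1],
        "chosen": chosen[-1]["content"],
        "rejected": rejected[-1]["content"],
    }


# Hypothetical two-turn conversation, for illustration only.
history = [
    {"role": "user", "content": "Name a large rodent."},
    {"role": "assistant", "content": "The capybara."},
    {"role": "user", "content": "Where does it live?"},
]
toy = {
    "chosen": history + [{"role": "assistant", "content": "In South America, near water."}],
    "rejected": history + [{"role": "assistant", "content": "In the Arctic."}],
}

parts = split_last_turn(toy)
print(len(parts["prompt_messages"]), "prompt messages")
print("chosen:", parts["chosen"])
print("rejected:", parts["rejected"])
```

In the real pipeline the `prompt_messages` list is what gets passed to `apply_chat_template`, while the two reply strings are used as they are.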
In our benchmarks with 7B models, we've seen that this is a challenging dataset to learn from. The best results can be achieved by mixing it with other datasets, like this [dpo mix 7k](https://huggingface.co/datasets/argilla/dpo-mix-7k). We'd love to hear from the community how this works with larger models and other hyperparameters.
## How we've built this dataset
### Generate responses from 3 different OSS models
In the spirit of UltraFeedback, in this step we generate three responses to the last user message using OSS 7B models and distilabel's `LLMPool` and the vLLM engine. We use Notus7B, NeuralBeagle and OpenHermes-2.5.
Additionally, the original Capybara dataset already has a generated assistant response (the last assistant response), which we keep for the next step.
```python
from distilabel.llm import LLM, LLMPool, ProcessLLM
from distilabel.tasks import TextGenerationTask, Task
from distilabel.tasks.prompt import Prompt
from distilabel.dataset import DatasetCheckpoint
from distilabel.pipeline import Pipeline
from datasets import load_dataset

from dataclasses import dataclass
from pathlib import Path

dataset = load_dataset("LDJnr/Capybara", split="train")

here = Path(__file__).parent.resolve()


def extract_conversation(r):
    # keep every full turn but the last, then add the last user message only
    all_but_last = r["conversation"][:-1]
    all_but_last.append({"input": r["conversation"][-1]["input"]})
    # the last assistant response is kept as the fourth candidate
    last = r["conversation"][-1]["output"]
    return {"input": all_but_last, "original_response": last}


dataset = dataset.map(extract_conversation)


@dataclass
class NotusChatTextGeneration(TextGenerationTask):
    # custom class to generate prompts in the chatml format
    # skipped for brevity
    ...


@dataclass
class ChatMLTextGeneration(TextGenerationTask):
    # custom class to generate prompts in the chatml format
    # skipped for brevity
    ...


save_frequency = len(dataset) // 1000

checkpointing = DatasetCheckpoint(path=here / "checkpoint_generation", save_frequency=save_frequency)


def load_notus(task: Task) -> LLM:
    import os

    from distilabel.llm import vLLM
    from vllm import LLM

    # pin this model to its own GPU
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    return vLLM(
        vllm=LLM(
            model="argilla/notus-7b-v1",
            trust_remote_code=True
        ),
        task=task,
        max_new_tokens=1024,
        temperature=1,
    )


def load_beagle(task: Task) -> LLM:
    import os

    from distilabel.llm import vLLM
    from vllm import LLM

    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    return vLLM(
        vllm=LLM(
            model="mlabonne/NeuralBeagle14-7B",
            trust_remote_code=True
        ),
        task=task,
        max_new_tokens=1024,
        temperature=1,
    )


def load_hermes(task: Task) -> LLM:
    import os

    from distilabel.llm import vLLM
    from vllm import LLM

    os.environ["CUDA_VISIBLE_DEVICES"] = "2"

    return vLLM(
        vllm=LLM(
            model="teknium/OpenHermes-2.5-Mistral-7B",
            trust_remote_code=True
        ),
        task=task,
        max_new_tokens=1024,
        temperature=1,
    )


# one process per model, each with its own prompt format
llm_pool = LLMPool(
    [
        ProcessLLM(task=NotusChatTextGeneration(), load_llm_fn=load_notus),
        ProcessLLM(task=ChatMLTextGeneration(), load_llm_fn=load_beagle),
        ProcessLLM(task=ChatMLTextGeneration(), load_llm_fn=load_hermes),
    ]
)

pipe_generation_pool = Pipeline(generator=llm_pool)

# one generation per model in the pool
dataset = pipe_generation_pool.generate(
    dataset=dataset,
    num_generations=len(llm_pool.llms),
    batch_size=32,
    display_progress_bar=True,
    checkpoint_strategy=checkpointing,
)
```
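As a quick illustration of what `extract_conversation` produces, the sketch below runs the same logic on a hand-written record. The toy conversation is hypothetical; the `conversation` list of `input`/`output` dictionaries follows the field names used in the code above.

```python
def extract_conversation(r):
    # Same logic as in the pipeline above, copied so this snippet runs on its own.
    all_but_last = r["conversation"][:-1]
    all_but_last.append({"input": r["conversation"][-1]["input"]})
    last = r["conversation"][-1]["output"]
    return {"input": all_but_last, "original_response": last}


# Hypothetical two-turn record, for illustration only.
toy = {
    "conversation": [
        {"input": "Name a large rodent.", "output": "The capybara."},
        {"input": "Where does it live?", "output": "In South America."},
    ]
}

out = extract_conversation(toy)
print(out["input"])              # history plus the last user message, without its answer
print(out["original_response"])  # the answer held back as the fourth candidate
```

Because the slice `[:-1]` creates a new list, appending to it leaves the original record untouched.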
### Generate a preference dataset from 4 responses
At this point, we have 4 responses to each multi-turn dialogue. We will now use distilabel's `UltraFeedbackTask.for_overall_quality()` preference task to judge the quality of the responses. We use gpt-4-turbo but could have used other models.
```python
from distilabel.tasks import UltraFeedbackTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from datasets import load_dataset


def format_conversation(r):
    mapping_role = {"input": "<|user|>\n", "output": "<|assistant|>\n"}
    all_but_last = r["conversation"][:-1]
    all_but_last.append({"input": r["conversation"][-1]["input"]})
    input = ""
    for e in all_but_last:
        for k, v in e.items():
            input += f"{mapping_role[k]}{v}</s>\n"
    return {"input": input}


# this formats the conversation input
# one could choose other format
prepared_dataset = dataset.map(format_conversation)

# the LLM Judge will evaluate each response to the
# last user message taking into account the conversation history
labeler = OpenAILLM(
    task=UltraFeedbackTask.for_overall_quality(),
    model="gpt-4-1106-preview",
    num_threads=8,
    max_new_tokens=512,
)

distilabeler = Pipeline(
    labeller=labeler
)

# this computes ratings and natural language critiques for each pair
distiset = distilabeler.generate(dataset=prepared_dataset, num_generations=4, display_progress_bar=True)
```
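To see the text the judge receives as conversation history, here is a minimal sketch that applies the same flattening to a hand-written record. The toy conversation is hypothetical, and the accumulator is renamed `text` only to avoid shadowing Python's built-in `input`.

```python
def format_conversation(r):
    # Same flattening as above: role tag, message, then an end-of-turn marker.
    mapping_role = {"input": "<|user|>\n", "output": "<|assistant|>\n"}
    all_but_last = r["conversation"][:-1]
    all_but_last.append({"input": r["conversation"][-1]["input"]})
    text = ""
    for turn in all_but_last:
        for role, message in turn.items():
            text += f"{mapping_role[role]}{message}</s>\n"
    return {"input": text}


# Hypothetical two-turn record, for illustration only.
toy = {
    "conversation": [
        {"input": "Hi", "output": "Hello!"},
        {"input": "How are you?", "output": "Fine."},
    ]
}

formatted = format_conversation(toy)["input"]
print(formatted)
```

The final assistant answer ("Fine." here) is deliberately left out, since it is one of the candidates being rated.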
This preference step is also useful for evaluating the performance of the four response sources (the three generator models plus the original response in Capybara).
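The card does not spell out how the four rated responses become a single `chosen`/`rejected` pair, nor how the per-model comparison is computed. The sketch below is one plausible approach and purely an assumption: it takes the highest-rated response as chosen and the lowest-rated as rejected, and it averages ratings per source. The helper names, source labels, `rating_rejected` key and ratings are all hypothetical.

```python
from statistics import mean


def binarize(generations, ratings):
    """Pick the highest-rated response as chosen and the lowest-rated as rejected.

    Returns None when every rating is equal, since no preference can be derived.
    """
    ranked = sorted(zip(ratings, generations), key=lambda pair: pair[0], reverse=True)
    (best_rating, best), (worst_rating, worst) = ranked[0], ranked[-1]
    if best_rating == worst_rating:
        return None
    return {
        "chosen": best,
        "rejected": worst,
        "rating_chosen": best_rating,
        "rating_rejected": worst_rating,
    }


def mean_rating_per_source(rating_rows, sources):
    """Average each source's rating across all rated conversations."""
    return {s: mean(row[i] for row in rating_rows) for i, s in enumerate(sources)}


sources = ["notus", "beagle", "hermes", "original"]  # hypothetical labels
rating_rows = [[3, 5, 2, 4], [4, 4, 3, 5]]           # made-up ratings, one row per conversation

pair = binarize(["resp_a", "resp_b", "resp_c", "resp_d"], rating_rows[0])
print(pair["chosen"], pair["rejected"])
print(mean_rating_per_source(rating_rows, sources))
```

Other pairing rules, such as sampling the rejected response at random from the lower-rated ones, would fit the same interface.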

## Benchmark results
We've tested this new dataset by preference tuning [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B). The resulting model is [CapybaraHermes](https://huggingface.co/argilla/CapybaraHermes-2.5-Mistral-7B).
CapybaraHermes has been preference tuned with LoRA and TRL for 3 epochs using argilla's [dpo mix 7k](https://huggingface.co/datasets/argilla/dpo-mix-7k).
To test the impact on multi-turn performance we have used MTBench. We also include the Nous Benchmark results and Mistral-7B-Instruct-v0.2 for reference as it's a strong 7B model on MTBench:
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | MTBench First Turn | MTBench Second Turn | Nous avg. | MTBench avg. |
|-----------------------------------|---------|---------|------------|----------|------------|-------------|-----------|--------------|
| CapybaraHermes-2.5-Mistral-7B | **43.8** | **73.35** | 57.07 | **42.44** | 8.24375 | **7.5625** | 54.16 | **7.903125** |
| teknium/OpenHermes-2.5-Mistral-7B | 42.75 | 72.99 | 52.99 | 40.94 | **8.25** | 7.2875 | 52.42 | 7.76875 |
| Mistral-7B-Instruct-v0.2 | 38.5 | 71.64 | **66.82** | 42.29 | 7.8375 | 7.1 | **54.81** | 7.46875 |
The most interesting aspect in the context of the capybara-dpo dataset is the increased performance in MTBench Second Turn scores.
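As a small arithmetic check on the table, the sketch below recomputes the MTBench average as the mean of the two turn scores and the second-turn gain of CapybaraHermes over its base model. The numbers are copied from the table; the variable names are only illustrative.

```python
# (first turn, second turn) MTBench scores copied from the table above
mtbench = {
    "CapybaraHermes-2.5-Mistral-7B": (8.24375, 7.5625),
    "teknium/OpenHermes-2.5-Mistral-7B": (8.25, 7.2875),
    "Mistral-7B-Instruct-v0.2": (7.8375, 7.1),
}

# the MTBench average is the mean of the two turn scores
mtbench_avg = {name: (first + second) / 2 for name, (first, second) in mtbench.items()}

# second-turn gain of the tuned model over its base model
second_turn_gain = (
    mtbench["CapybaraHermes-2.5-Mistral-7B"][1]
    - mtbench["teknium/OpenHermes-2.5-Mistral-7B"][1]
)

for name, avg in mtbench_avg.items():
    print(f"{name}: {avg:.6f}")
print(f"second-turn gain: {second_turn_gain:.3f}")
```

The recomputed averages agree with the "MTBench avg." column, and the second-turn gain works out to 0.275 points while the first-turn score is essentially unchanged.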
For the merge lovers, we also preference tuned Beagle14-7B with a mix of capybara-dpo and distilabel orca pairs using the same recipe as NeuralBeagle (see [YALL - Yet Another LLM Leaderboard](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard) for reference):
| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|------------------------------------------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[DistilabelBeagle14-7B](https://huggingface.co/dvilasuero/DistilabelBeagle14-7B)| 45.29| 76.92| 71.66| 48.78| 60.66|
Provider: maas
Created: 2024-05-09



