chatml_dpo_pairs
ModelScope community · Updated 2025-12-05 · Indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/mlabonne/chatml_dpo_pairs
Description:
# ChatML DPO Pairs
This is a preprocessed version of [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs) using the [ChatML](https://huggingface.co/docs/transformers/chat_templating) format.
Like the original dataset, it contains 12k examples drawn from the [Orca](https://arxiv.org/abs/2306.02707)-style dataset [Open-Orca/OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
Here is the code used to preprocess it:
```python
from datasets import load_dataset
from transformers import AutoTokenizer


def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chatgpt'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['llama2-13b-chat'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }


# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer (defined before dataset.map runs, so chatml_format can use it)
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)
```
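
To make the resulting fields concrete, here is a minimal, dependency-free sketch of what the preprocessing above is expected to produce for a single row. It assumes the tokenizer's chat template follows the usual ChatML layout (`<|im_start|>{role}\n{content}<|im_end|>\n`, plus `<|im_start|>assistant\n` when `add_generation_prompt=True`). The helpers `apply_chatml` and `chatml_format_sketch` and the sample row are hypothetical stand-ins for illustration, not part of the original script or the dataset.

```python
# Hypothetical stand-in for tokenizer.apply_chat_template. It ASSUMES the
# standard ChatML layout; the real output depends on the tokenizer's template.
def apply_chatml(messages, add_generation_prompt=False):
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the answer can be appended directly.
        text += "<|im_start|>assistant\n"
    return text


def chatml_format_sketch(example):
    # Same logic as chatml_format above, using the stand-in template.
    system = ""
    if len(example["system"]) > 0:
        system = apply_chatml([{"role": "system", "content": example["system"]}])
    prompt = apply_chatml(
        [{"role": "user", "content": example["question"]}],
        add_generation_prompt=True,
    )
    return {
        "prompt": system + prompt,
        "chosen": example["chatgpt"] + "<|im_end|>\n",
        "rejected": example["llama2-13b-chat"] + "<|im_end|>\n",
    }


# Made-up row for illustration only (not taken from the dataset).
row = {
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chatgpt": "2 + 2 equals 4.",
    "llama2-13b-chat": "The answer is 5.",
}

formatted = chatml_format_sketch(row)
print(formatted["prompt"] + formatted["chosen"])
```

Under that assumption, `prompt` ends with an open assistant turn, so concatenating `prompt + chosen` (or `prompt + rejected`) gives one complete ChatML conversation per preference pair.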
Provider:
maas
Created:
2025-03-18



