
chatml_dpo_pairs

ModelScope community · Updated 2025-12-05 · Indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/mlabonne/chatml_dpo_pairs
Description:
# ChatML DPO Pairs

This is a preprocessed version of [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs) using the [ChatML](https://huggingface.co/docs/transformers/chat_templating) format.

Like the original dataset, it contains 12k examples from the [Orca](https://arxiv.org/abs/2306.02707)-style dataset [Open-Orca/OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).

Here is the code used to preprocess it:

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def chatml_format(example):
    # Note: relies on the module-level `tokenizer` defined further down.

    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chatgpt'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['llama2-13b-chat'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }


# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)
```
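
To make the resulting record layout concrete without downloading a tokenizer, here is a minimal pure-Python sketch of the same formatting step. It assumes the tokenizer's chat template is the plain ChatML layout (`<|im_start|>{role}\n{content}<|im_end|>\n`, with `<|im_start|>assistant\n` as the generation prompt); the actual template shipped with the tokenizer may differ in detail. The helper names `render_chatml` and `chatml_format_sketch` and the example row are hypothetical and not part of the original script.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Render messages in an assumed plain ChatML layout (illustrative only)."""
    text = "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the answer can be appended directly.
        text += f"{IM_START}assistant\n"
    return text


def chatml_format_sketch(example):
    """Mirror the preprocessing function, using the assumed template above."""
    system = ""
    if len(example["system"]) > 0:
        system = render_chatml([{"role": "system", "content": example["system"]}])
    prompt = render_chatml(
        [{"role": "user", "content": example["question"]}],
        add_generation_prompt=True,
    )
    return {
        "prompt": system + prompt,
        "chosen": example["chatgpt"] + IM_END + "\n",
        "rejected": example["llama2-13b-chat"] + IM_END + "\n",
    }


# Made-up row using the source dataset's column names.
row = {
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chatgpt": "4",
    "llama2-13b-chat": "It might be 5.",
}
formatted = chatml_format_sketch(row)
print(formatted["prompt"] + formatted["chosen"])
```

Concatenating `prompt` with `chosen` (or `rejected`) gives one complete ChatML conversation, which is why the answers only need the closing `<|im_end|>` marker appended.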

Provider:
maas
Created:
2025-03-18