Dolci-Instruct-DPO-7B

Name: Dolci-Instruct-DPO-7B
Creator: maas
Published: 2025-12-05 16:57:42
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/allenai/Dolci-Instruct-DPO-7B

Resource summary:

# Dolci Instruct DPO Mixture

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).

The Dolci Instruct DPO mixture was used to preference-tune Olmo 3 Instruct 7B. It contains 260,000 preference pairs in total, consisting of:

- 125,000 pairs created with the preference heuristic described in [Delta Learning](https://arxiv.org/abs/2507.06187) (Geng et al. 2025)
- 125,000 pairs created with a delta-aware, UltraFeedback-style GPT-judge pipeline designed to maximize the contrast between chosen and rejected responses
- 10,000 multiturn preference pairs (5,000 synthetic context, 5,000 self talk)
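As a quick sanity check on the stated composition, the sketch below tallies the component counts from the card and computes each component's share of the mixture. The component labels are hypothetical descriptive names chosen for this example; they are not field or subset names taken from the dataset itself.

```python
# Component counts as stated in the dataset card. The keys are hypothetical
# labels for illustration only, not names used inside the dataset.
MIXTURE = {
    "delta_learning_heuristic": 125_000,
    "delta_aware_gpt_judge": 125_000,
    "multiturn_synthetic_context": 5_000,
    "multiturn_self_talk": 5_000,
}


def shares(mixture: dict[str, int]) -> dict[str, float]:
    """Return each component's share of the total, as a percentage."""
    total = sum(mixture.values())
    return {name: 100 * count / total for name, count in mixture.items()}


total = sum(MIXTURE.values())
multiturn = MIXTURE["multiturn_synthetic_context"] + MIXTURE["multiturn_self_talk"]

for name, pct in shares(MIXTURE).items():
    print(f"{name:30s} {pct:6.2f}%")
print(f"{'total pairs':30s} {total:,}")
```

The counts add up to the stated 260,000 pairs: each 125,000-pair component makes up roughly 48.1% of the mixture, and the 10,000 multiturn pairs make up roughly 3.8%.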

Provider:

maas

Created:

2025-11-30