Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2
收藏Hugging Face2024-10-08 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: chosen
list:
- name: content
dtype: string
- name: role
dtype: string
- name: reject
list:
- name: content
dtype: string
- name: role
dtype: string
- name: chosen_token
sequence: int64
- name: reject_token
sequence: int64
- name: chosen_mask
sequence: int64
- name: reject_mask
sequence: int64
- name: chosen_reward
dtype: float64
- name: reject_reward
dtype: float64
splits:
- name: train
num_bytes: 8521071947
num_examples: 116117
download_size: 626010383
dataset_size: 8521071947
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
This is a dataset released for our paper: [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612).
# REFUEL-Ultrainteract-Llama-3-Armo-iter_2
This dataset contains dialogues using [REFUEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1) as the assistant and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) as the user.
The dataset is used to train [REFUEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2).
The generation code is available at https://github.com/ZhaolinGao/REFUEL.
## Evaluations
<table>
<tr>
<th rowspan="2">Method</th>
<th rowspan="2">Dataset</th>
<th colspan="6">Winrate at Turn</th>
</tr>
<tr>
<th>h = 1</th>
<th>h = 2</th>
<th>h = 3</th>
<th>h = 4</th>
<th>H = 5</th>
<th>avg</th>
</tr>
<tr>
<td>Llama-3.1-70B-it</td>
<td> N/A </td>
<td>70.4</td>
<td>66.4</td>
<td>61.0</td>
<td>53.0</td>
<td>55.4</td>
<td>61.24</td>
</tr>
<tr>
<td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1">REFUEL-Llama-3-Armo-iter_1</a></td>
<td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_1">REFUEL-Ultrainteract-Llama-3-Armo-iter_1</a></td>
<td>54.6</td>
<td>53.6</td>
<td>57.8</td>
<td>56.2</td>
<td>59.4</td>
<td>56.32</td>
</tr>
<tr>
<td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2">REFUEL-Llama-3-Armo-iter_2</a></td>
<td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2">REFUEL-Ultrainteract-Llama-3-Armo-iter_2</a></td>
<td>55.2</td>
<td>53.4</td>
<td>58.8</td>
<td>57.2</td>
<td>58.6</td>
<td>56.64</td>
</tr>
</table>
## Citation
Please cite our paper if you use this dataset in your own work:
```
@misc{gao2024regressingrelativefutureefficient,
title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
year={2024},
eprint={2410.04612},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.04612},
}
```
提供机构:
Cornell-AGI



