Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2
Source: Hugging Face. Updated 2024-09-02; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2
---
dataset_info:
  features:
  - name: response_0
    dtype: string
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: prompt_id
    dtype: string
  - name: prompt
    dtype: string
  - name: llama_prompt
    dtype: string
  - name: llama_prompt_tokens
    sequence: int64
  - name: response_0_reward
    dtype: float64
  - name: response_1_reward
    dtype: float64
  - name: response_2_reward
    dtype: float64
  - name: response_3_reward
    dtype: float64
  - name: response_4_reward
    dtype: float64
  - name: chosen
    dtype: string
  - name: chosen_reward
    dtype: float64
  - name: llama_chosen
    dtype: string
  - name: llama_chosen_tokens
    sequence: int64
  - name: reject
    dtype: string
  - name: reject_reward
    dtype: float64
  - name: llama_reject
    dtype: string
  - name: llama_reject_tokens
    sequence: int64
  - name: chosen_logprob
    dtype: float64
  - name: reject_logprob
    dtype: float64
  splits:
  - name: train_prefs
    num_bytes: 2714568025
    num_examples: 53287
  - name: test_prefs
    num_bytes: 91060412
    num_examples: 1782
  download_size: 631574440
  dataset_size: 2805628437
configs:
- config_name: default
  data_files:
  - split: train_prefs
    path: data/train_prefs-*
  - split: test_prefs
    path: data/test_prefs-*
---
# Dataset Card for Ultrafeedback-Llama-3-Armo-iter_2
This dataset was used to train [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2).
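The two splits declared in the metadata can be loaded by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable; the `load_split` helper is a hypothetical wrapper introduced here for illustration and is not part of the dataset or the library.

```python
REPO_ID = "Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2"
SPLITS = ("train_prefs", "test_prefs")  # split names from the card metadata


def load_split(split: str):
    """Load one split from the Hugging Face Hub (needs network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the helper can be defined without the dependency.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)


# Example call (downloads roughly 630 MB according to download_size):
# train = load_split("train_prefs")
# print(train.column_names)
```

The split name is validated before any download starts, which avoids a network round trip for a simple typo.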
For each prompt, we generate 5 responses with [REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1) and score them with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). The response with the highest reward is selected as `chosen` and the response with the lowest reward as `reject`.
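The selection rule described above can be sketched as follows. This is an illustrative reconstruction from the field names in the metadata, not the authors' script; the `select_pair` helper and the toy row are hypothetical, and breaking ties by taking the first index is an assumption.

```python
from typing import Dict, Tuple

NUM_RESPONSES = 5  # response_0 .. response_4, per the feature list


def select_pair(row: Dict[str, object]) -> Tuple[int, int]:
    """Return (chosen_index, reject_index): highest and lowest reward."""
    rewards = [float(row[f"response_{i}_reward"]) for i in range(NUM_RESPONSES)]
    # max/min return the first matching index on ties (an assumption here).
    chosen_idx = max(range(NUM_RESPONSES), key=rewards.__getitem__)
    reject_idx = min(range(NUM_RESPONSES), key=rewards.__getitem__)
    return chosen_idx, reject_idx


# Toy row with made-up rewards, for illustration only.
toy_rewards = [0.12, 0.08, 0.19, 0.03, 0.15]
toy_row = {}
for i, reward in enumerate(toy_rewards):
    toy_row[f"response_{i}"] = f"r{i}"
    toy_row[f"response_{i}_reward"] = reward

chosen_idx, reject_idx = select_pair(toy_row)
print(toy_row[f"response_{chosen_idx}"], toy_row[f"response_{reject_idx}"])
```

On a real row, the selected indices should correspond to the stored `chosen`/`chosen_reward` and `reject`/`reject_reward` fields.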
The `chosen_logprob` and `reject_logprob` values are computed with [REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1). Note that these values may vary with the CUDA version and GPU configuration. Please consider recomputing them for your own experiments.
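For readers recomputing these values, the sketch below shows one common way to turn per-step logits into a sequence log-probability. It assumes the stored value is the unnormalized sum of token log-probabilities over the response tokens; the card does not state this, so check it against your own recomputation. In practice the logits would come from a forward pass of REBEL-Llama-3-Armo-iter_1; here they are toy values, and both helper functions are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_logprob(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Sum log p(token_t | prefix) over the response tokens.

    step_logits[t] holds the logits that predict token_ids[t].
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one logit vector per response token")
    return sum(log_softmax(l)[t] for l, t in zip(step_logits, token_ids))


# Toy example: 3-token vocabulary, uniform logits, 2 response tokens.
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(sequence_logprob(toy_logits, [1, 2]))  # equals 2 * log(1/3)
```

Small numerical differences between hardware and software stacks accumulate across tokens in this sum, which is consistent with the card's advice to recompute the values locally.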
## Evaluations
| Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate | MT-Bench<br>Average | MMLU<br>(5-shot) | GSM8K<br>(5-shot) |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| REBEL-OpenChat-3.5| 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2| 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1| 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2| 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3| 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |
## Citation
Please cite our paper if you use this dataset in your own work:
```
@misc{gao2024rebel,
title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
year={2024},
eprint={2404.16767},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
Provider:
Cornell-AGI



