Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2

Name: Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2
Creator: Cornell-AGI
Published: 2024-09-02 01:46:18
License: No description available

Source: Hugging Face (updated 2024-09-02, indexed 2025-04-12)

Download link:

https://hf-mirror.com/datasets/Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_2

Resource description:

---
dataset_info:
  features:
  - name: response_0
    dtype: string
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: prompt_id
    dtype: string
  - name: prompt
    dtype: string
  - name: llama_prompt
    dtype: string
  - name: llama_prompt_tokens
    sequence: int64
  - name: response_0_reward
    dtype: float64
  - name: response_1_reward
    dtype: float64
  - name: response_2_reward
    dtype: float64
  - name: response_3_reward
    dtype: float64
  - name: response_4_reward
    dtype: float64
  - name: chosen
    dtype: string
  - name: chosen_reward
    dtype: float64
  - name: llama_chosen
    dtype: string
  - name: llama_chosen_tokens
    sequence: int64
  - name: reject
    dtype: string
  - name: reject_reward
    dtype: float64
  - name: llama_reject
    dtype: string
  - name: llama_reject_tokens
    sequence: int64
  - name: chosen_logprob
    dtype: float64
  - name: reject_logprob
    dtype: float64
  splits:
  - name: train_prefs
    num_bytes: 2714568025
    num_examples: 53287
  - name: test_prefs
    num_bytes: 91060412
    num_examples: 1782
  download_size: 631574440
  dataset_size: 2805628437
configs:
- config_name: default
  data_files:
  - split: train_prefs
    path: data/train_prefs-*
  - split: test_prefs
    path: data/test_prefs-*
---

# Dataset Card for Ultrafeedback-Llama-3-Armo-iter_2

This dataset was used to train [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2).

For each prompt, we generate 5 responses using [REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1) and score them with the reward model [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). The response with the highest reward is selected as `chosen`, and the response with the lowest reward is selected as `reject`.

The `chosen_logprob` and `reject_logprob` values are calculated with [REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1). Note that these values may differ depending on the CUDA version and GPU configuration. Please consider recomputing them for your own experiments.

### Evaluations

| Model | AlpacaEval 2.0 LC Win Rate | AlpacaEval 2.0 Win Rate | MT-Bench Average | MMLU (5-shot) | GSM8K (5-shot) |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |

## Citation

Please cite our paper if you use this dataset in your own work:

```
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
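
The selection rule described in the card (highest-reward response becomes `chosen`, lowest-reward response becomes `reject`) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the helper name `select_pair`, the toy record, and its reward values are made up for the example, and the tie-breaking behavior (first index wins) is an assumption, since the card does not say how ties are handled. Only the field names come from the dataset schema.

```python
from typing import Any, Dict

NUM_RESPONSES = 5  # the card states that 5 responses are generated per prompt


def select_pair(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pick the best and worst response by reward.

    Ties are resolved in favor of the lowest index (an assumption;
    the dataset card does not specify tie-breaking).
    """
    rewards = [record[f"response_{i}_reward"] for i in range(NUM_RESPONSES)]
    best = max(range(NUM_RESPONSES), key=rewards.__getitem__)
    worst = min(range(NUM_RESPONSES), key=rewards.__getitem__)
    return {
        "chosen": record[f"response_{best}"],
        "chosen_reward": rewards[best],
        "reject": record[f"response_{worst}"],
        "reject_reward": rewards[worst],
    }


# Toy record with invented responses and rewards, shaped like one dataset row.
example = {
    "response_0": "answer A", "response_0_reward": 0.12,
    "response_1": "answer B", "response_1_reward": 0.31,
    "response_2": "answer C", "response_2_reward": 0.08,
    "response_3": "answer D", "response_3_reward": 0.27,
    "response_4": "answer E", "response_4_reward": 0.19,
}

pair = select_pair(example)
print(pair["chosen"], pair["reject"])
```

The same function could be used as a sanity check on real rows, for instance by comparing its `chosen_reward` against the stored `chosen_reward` column, keeping in mind that floating-point values may not match exactly.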


Provider:

Cornell-AGI
