Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3

Name: Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3
Creator: Cornell-AGI
Published: 2024-09-02 01:47:11
License: No description available

Source: Hugging Face. Updated 2024-09-02. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3

Resource description:

---
dataset_info:
  features:
  - name: response_0
    dtype: string
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: prompt_id
    dtype: string
  - name: prompt
    dtype: string
  - name: llama_prompt
    dtype: string
  - name: llama_prompt_tokens
    sequence: int64
  - name: response_0_reward
    dtype: float64
  - name: response_1_reward
    dtype: float64
  - name: response_2_reward
    dtype: float64
  - name: response_3_reward
    dtype: float64
  - name: response_4_reward
    dtype: float64
  - name: chosen
    dtype: string
  - name: chosen_reward
    dtype: float64
  - name: llama_chosen
    dtype: string
  - name: llama_chosen_tokens
    sequence: int64
  - name: reject
    dtype: string
  - name: reject_reward
    dtype: float64
  - name: llama_reject
    dtype: string
  - name: llama_reject_tokens
    sequence: int64
  - name: chosen_logprob
    dtype: float64
  - name: reject_logprob
    dtype: float64
  splits:
  - name: train_prefs
    num_bytes: 2288598631
    num_examples: 43188
  - name: test_prefs
    num_bytes: 75595359
    num_examples: 1435
  download_size: 598464955
  dataset_size: 2364193990
configs:
- config_name: default
  data_files:
  - split: train_prefs
    path: data/train_prefs-*
  - split: test_prefs
    path: data/test_prefs-*
---

# Dataset Card for Ultrafeedback-Llama-3-Armo-iter_3

This dataset was used to train [REBEL-Llama-3-Armo-iter_3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_3).

For each prompt, we generate 5 responses using [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2) and score them with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). The response with the highest reward is selected as `chosen`, and the response with the lowest reward is selected as `reject`.

The `chosen_logprob` and `reject_logprob` values are calculated with [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2). Note that these values may differ depending on the CUDA version and GPU configuration. Please consider recomputing them for your own experiments.

### Evaluations

| Model | AlpacaEval 2.0 LC Win Rate | AlpacaEval 2.0 Win Rate | MT-Bench Average | MMLU (5-shot) | GSM8K (5-shot) |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |

## Citation

Please cite our paper if you use this dataset in your own work:

```
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
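
The selection rule described in the dataset card (highest-reward response becomes `chosen`, lowest-reward response becomes `reject`) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the helper names `select_chosen_and_reject` and `build_pair` are hypothetical, the example row uses made-up values, and the tie-breaking behaviour (first index wins) is an assumption the card does not specify. Only the column names come from the feature list.

```python
from typing import Any, Dict, Mapping, Tuple

NUM_RESPONSES = 5  # the card states that 5 responses are generated per prompt


def select_chosen_and_reject(row: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (best_index, worst_index) by reward for one dataset row.

    Assumption: ties are broken by taking the first index, which the
    dataset card does not specify.
    """
    rewards = [float(row[f"response_{i}_reward"]) for i in range(NUM_RESPONSES)]
    best = max(range(NUM_RESPONSES), key=rewards.__getitem__)
    worst = min(range(NUM_RESPONSES), key=rewards.__getitem__)
    return best, worst


def build_pair(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Rebuild the chosen/reject fields from the per-response columns."""
    best, worst = select_chosen_and_reject(row)
    return {
        "chosen": row[f"response_{best}"],
        "chosen_reward": float(row[f"response_{best}_reward"]),
        "reject": row[f"response_{worst}"],
        "reject_reward": float(row[f"response_{worst}_reward"]),
    }


if __name__ == "__main__":
    # Made-up row for illustration only; real rows come from the dataset.
    example_rewards = [0.10, 0.40, 0.05, 0.30, 0.20]
    example_row: Dict[str, Any] = {"prompt": "hypothetical prompt"}
    for i, reward in enumerate(example_rewards):
        example_row[f"response_{i}"] = f"hypothetical response {i}"
        example_row[f"response_{i}_reward"] = reward
    print(build_pair(example_row))
```

A check like this could be run over loaded rows to confirm that the stored `chosen_reward` and `reject_reward` columns agree with the per-response rewards before training on the pairs.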

Provided by:

Cornell-AGI
