
Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3

Source: Hugging Face. Updated 2024-09-02; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3
Resource description:
---
dataset_info:
  features:
  - name: response_0
    dtype: string
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: prompt_id
    dtype: string
  - name: prompt
    dtype: string
  - name: llama_prompt
    dtype: string
  - name: llama_prompt_tokens
    sequence: int64
  - name: response_0_reward
    dtype: float64
  - name: response_1_reward
    dtype: float64
  - name: response_2_reward
    dtype: float64
  - name: response_3_reward
    dtype: float64
  - name: response_4_reward
    dtype: float64
  - name: chosen
    dtype: string
  - name: chosen_reward
    dtype: float64
  - name: llama_chosen
    dtype: string
  - name: llama_chosen_tokens
    sequence: int64
  - name: reject
    dtype: string
  - name: reject_reward
    dtype: float64
  - name: llama_reject
    dtype: string
  - name: llama_reject_tokens
    sequence: int64
  - name: chosen_logprob
    dtype: float64
  - name: reject_logprob
    dtype: float64
  splits:
  - name: train_prefs
    num_bytes: 2288598631
    num_examples: 43188
  - name: test_prefs
    num_bytes: 75595359
    num_examples: 1435
  download_size: 598464955
  dataset_size: 2364193990
configs:
- config_name: default
  data_files:
  - split: train_prefs
    path: data/train_prefs-*
  - split: test_prefs
    path: data/test_prefs-*
---

# Dataset Card for Ultrafeedback-Llama-3-Armo-iter_3

This dataset was used to train [REBEL-Llama-3-Armo-iter_3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_3).

For each prompt, we generate 5 responses using [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2) and score them with the reward model [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). The response with the highest reward is stored as `chosen`, and the response with the lowest reward is stored as `reject`.

The `chosen_logprob` and `reject_logprob` values are computed with [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2). Note that these values may differ depending on the CUDA version and GPU configuration. Please consider recomputing them for your own experiments.

### Evaluations

| Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate | MT-Bench<br>Average | MMLU<br>(5-shot) | GSM8K<br>(5-shot) |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |

## Citation

Please cite our paper if you use this dataset in your own work:

```
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
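
The selection rule described in the dataset card (best-reward response becomes `chosen`, worst-reward response becomes `reject`) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the helper name `select_pair`, the toy row values, and the tie-breaking behaviour (first index wins) are assumptions; only the column names come from the card's feature list.

```python
from typing import Any, Dict

NUM_RESPONSES = 5  # the card lists response_0 .. response_4


def select_pair(row: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the highest- and lowest-reward responses from one row.

    Hypothetical helper: ties are broken by taking the first index,
    which is an assumption, not documented behaviour of the dataset.
    """
    rewards = [row[f"response_{i}_reward"] for i in range(NUM_RESPONSES)]
    best = max(range(NUM_RESPONSES), key=rewards.__getitem__)
    worst = min(range(NUM_RESPONSES), key=rewards.__getitem__)
    return {
        "chosen": row[f"response_{best}"],
        "chosen_reward": rewards[best],
        "reject": row[f"response_{worst}"],
        "reject_reward": rewards[worst],
    }


# Toy row with made-up responses and rewards, for illustration only.
example_row = {
    "response_0": "a", "response_0_reward": 0.10,
    "response_1": "b", "response_1_reward": 0.25,
    "response_2": "c", "response_2_reward": 0.05,
    "response_3": "d", "response_3_reward": 0.20,
    "response_4": "e", "response_4_reward": 0.15,
}

pair = select_pair(example_row)
print(pair["chosen"], pair["reject"])
```

Applied to a real row, the output should match the stored `chosen`, `chosen_reward`, `reject`, and `reject_reward` columns if the dataset was built exactly as the card describes, which makes the helper usable as a quick consistency check.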
Provider:
Cornell-AGI