reward-bench-results

Source: ModelScope community. Updated 2025-11-27; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/reward-bench-results
Description:
# Results for Holistic Evaluation of Reward Models (HERM) Benchmark

Here, you'll find the raw scores for the HERM project.

The repository is structured as follows.

```
├── best-of-n/                           <- Nested directory for different completions on Best of N challenge
|   ├── alpaca_eval/                         └── results for each reward model
|   |   ├── tulu-13b/{org}/{model}.json
|   |   └── zephyr-7b/{org}/{model}.json
|   └── mt_bench/
|       ├── tulu-13b/{org}/{model}.json
|       └── zephyr-7b/{org}/{model}.json
├── eval-set-scores/{org}/{model}.json   <- Per-prompt scores on our core evaluation set.
├── eval-set/                            <- Aggregated results on our core eval. set.
├── pref-sets-scores/{org}/{model}.json  <- Per-prompt scores on existing test sets.
└── pref-sets/                           <- Aggregated results on existing test sets.
```

The data is loaded by the other projects in this repo and released for further research. See the [GitHub repo](https://github.com/allenai/herm) or the [leaderboard source code](https://huggingface.co/spaces/ai2-adapt-dev/HERM-Leaderboard/tree/main) for examples of loading and manipulating the data. Tools for analysis are found on [GitHub](https://github.com/allenai/reward-bench/blob/main/analysis/utils.py).

Contact: `nathanl at allenai dot org`

For example, this data can be used to aggregate the distribution of scores across models (it also powers our leaderboard)!

<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/dist.png" alt="RewardBench Distribution" width="800" style="margin-left:auto;margin-right:auto;display:block;"/>
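
As a rough illustration of working with the layout above, here is a minimal sketch that walks one of the `{org}/{model}.json` directories in a local copy of the repository and parses each file. It assumes the dataset has already been downloaded to disk. The function name `load_model_results` and its parameters are hypothetical, and the sketch makes no assumption about the fields inside each JSON file; see the analysis tools linked above for the project's own loaders.

```python
import json
from pathlib import Path


def load_model_results(root, subdir="eval-set-scores"):
    """Map "org/model" to the parsed JSON at root/subdir/{org}/{model}.json.

    Hypothetical helper: `root` is a local copy of the dataset and `subdir`
    is any directory that follows the {org}/{model}.json layout.
    """
    results = {}
    base = Path(root) / subdir
    # Each model's file sits one level below its organisation directory.
    for path in sorted(base.glob("*/*.json")):
        key = f"{path.parent.name}/{path.stem}"
        with path.open(encoding="utf-8") as handle:
            results[key] = json.load(handle)
    return results
```

For the nested Best of N results, the same helper can be pointed at a deeper directory, for example `load_model_results(root, "best-of-n/alpaca_eval/tulu-13b")`.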

Provider: maas
Created: 2025-05-27