reward-bench-results
ModelScope community · updated 2025-11-27 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/reward-bench-results
Description:
# Results for Holistic Evaluation of Reward Models (HERM) Benchmark
Here, you'll find the raw scores for the HERM project.
The repository is structured as follows.
```
├── best-of-n/                           <- Nested directory for different completions on Best of N challenge
|   ├── alpaca_eval/                     <- Results for each reward model
|   |   ├── tulu-13b/{org}/{model}.json
|   |   └── zephyr-7b/{org}/{model}.json
|   └── mt_bench/
|       ├── tulu-13b/{org}/{model}.json
|       └── zephyr-7b/{org}/{model}.json
├── eval-set-scores/{org}/{model}.json   <- Per-prompt scores on our core evaluation set.
├── eval-set/                            <- Aggregated results on our core eval. set.
├── pref-sets-scores/{org}/{model}.json  <- Per-prompt scores on existing test sets.
└── pref-sets/                           <- Aggregated results on existing test sets.
```
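As a rough illustration of how this layout maps to file paths, here is a minimal sketch that indexes and loads the per-model JSON files from a local copy of the repository. The helper names `list_model_files` and `load_model_result` are hypothetical and not part of the HERM codebase, and nothing is assumed about the contents of each file beyond it being valid JSON.

```python
import json
from pathlib import Path


def list_model_files(root, subset="eval-set-scores"):
    """Map "{org}/{model}" to its JSON path under root/subset.

    Hypothetical helper: `root` is assumed to be a local copy of this repo.
    """
    base = Path(root) / subset
    files = {}
    # The layout is {subset}/{org}/{model}.json, so match exactly two levels.
    for path in sorted(base.glob("*/*.json")):
        files[f"{path.parent.name}/{path.stem}"] = path
    return files


def load_model_result(root, org, model, subset="eval-set-scores"):
    """Load one model's JSON file and return the parsed object unchanged."""
    path = Path(root) / subset / org / f"{model}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For the nested Best of N results, the same helpers should work with a deeper `subset`, for example `subset="best-of-n/alpaca_eval/tulu-13b"`.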
The data is loaded by the other projects in this repo and released for further research.
See the [GitHub repo](https://github.com/allenai/herm) or the [leaderboard source code](https://huggingface.co/spaces/ai2-adapt-dev/HERM-Leaderboard/tree/main) for examples of loading and manipulating the data.
Tools for analysis are found on [GitHub](https://github.com/allenai/reward-bench/blob/main/analysis/utils.py).
Contact: `nathanl at allenai dot org`
For example, this data can be used to aggregate the distribution of scores across models (it also powers our leaderboard)!
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/dist.png" alt="RewardBench Distribution" width="800" style="margin-left:auto;margin-right:auto;display:block;"/>
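One way to aggregate score distributions across models is sketched below. It walks a local copy of `eval-set-scores/` and computes summary statistics per model. The function name `score_summary` and the field name `scores` are assumptions made for illustration only; the actual JSON schema is not described here, so inspect a real file and adjust `score_key` accordingly.

```python
import json
import statistics
from pathlib import Path


def score_summary(root, subset="eval-set-scores", score_key="scores"):
    """Summarize the per-prompt score distribution for each model.

    `score_key` is a hypothetical field name, assumed to hold a flat list
    of numbers; it is not confirmed by the dataset documentation.
    """
    summary = {}
    for path in sorted((Path(root) / subset).glob("*/*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        scores = record.get(score_key) if isinstance(record, dict) else None
        # Skip files that lack the assumed field or hold non-numeric data.
        if not isinstance(scores, list) or not scores:
            continue
        if not all(isinstance(s, (int, float)) for s in scores):
            continue
        summary[f"{path.parent.name}/{path.stem}"] = {
            "n": len(scores),
            "mean": statistics.fmean(scores),
            "stdev": statistics.pstdev(scores),
            "min": min(scores),
            "max": max(scores),
        }
    return summary
```

The returned dictionary is keyed by `{org}/{model}`, which makes it straightforward to pass to a plotting or table library for a cross-model comparison.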
Provider:
maas
Created:
2025-05-27



