Scoring-Verifiers
Source: ModelScope community (魔搭社区). Updated 2025-10-09; indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/nv-community/Scoring-Verifiers
# Scoring Verifiers
Scoring Verifiers is a set of 4 benchmarks that evaluate the scoring and ranking capabilities of synthetic verifiers, such as test case generation and reward modelling. Our paper, [Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning](https://www.arxiv.org/abs/2502.13820), explains our methodology, benchmark details, and findings in more detail.

## Datasets
In this repository, we include 4 benchmarks that are code scoring and ranking versions of HumanEval and MBPP:
- HE-R
- HE-R+
- MBPP-R
- MBPP-R+
Each dataset sample contains a question from HumanEval or MBPP, followed by several `gpt-4o` solutions and their rankings based on pre-defined test case execution scores. Alongside the keys found in the original benchmarks, each sample contains the following keys:
- `task_id`
- `prompt`
- `canonical_solution`
- `all_solutions` (each solution contains the following)
  - `rank`
  - `average_test_score`
  - `average_time_taken`
  - `solution`
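
As a minimal sketch of how a sample with these keys might be consumed, the snippet below builds a hand-written sample and orders its solutions by test score. The key names follow the list above; every value is made up for illustration, and the assumption that `rank` 1 marks the best solution is ours, not something this card states.

```python
from typing import Any

# Hypothetical sample: key names come from the list above, values are invented.
sample: dict[str, Any] = {
    "task_id": "HumanEval/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "canonical_solution": "    return a + b\n",
    "all_solutions": [
        # Assumption: rank 1 is the best-scoring solution.
        {"rank": 2, "average_test_score": 0.5,
         "average_time_taken": 0.02, "solution": "    return a - b\n"},
        {"rank": 1, "average_test_score": 1.0,
         "average_time_taken": 0.01, "solution": "    return a + b\n"},
        {"rank": 3, "average_test_score": 0.0,
         "average_time_taken": 0.03, "solution": "    return None\n"},
    ],
}


def solutions_by_score(item: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the solutions sorted from highest to lowest test score."""
    return sorted(
        item["all_solutions"],
        key=lambda s: s["average_test_score"],
        reverse=True,
    )


ordered = solutions_by_score(sample)
best = ordered[0]
print(sample["task_id"], best["average_test_score"])
```

Sorting on `average_test_score` rather than `rank` keeps the sketch independent of how ties and rank direction are encoded in the real data.
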
[Figure: distribution of the test case scores for all solutions in HE-R+ and MBPP-R+, respectively.]

## Paper
Overall, our paper's contributions can be summarized as follows:
1. We provide a recipe to transform any coding benchmark with predefined test cases into a code scoring and ranking benchmark.
2. We validate our recipe by creating code scoring and ranking versions of the HumanEval and MBPP datasets: HE-R, HE-R+, MBPP-R, MBPP-R+.
3. We use our benchmark to evaluate synthetic verification methods, such as test case generation, in standard, reward and reasoning LLMs.
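
The idea behind the first and third contributions can be illustrated with a rough sketch: score each candidate solution by the fraction of predefined tests it passes, rank the candidates by that score, then check whether a verifier's own scores pick out a top-ranked candidate. The function names, the dense-ranking tie rule, and the top-1 check are assumptions chosen for illustration; they are not taken from the paper or its released code.

```python
def pass_fraction(passed: list[bool]) -> float:
    """Fraction of test cases a solution passed (0.0 when there are no tests)."""
    return sum(passed) / len(passed) if passed else 0.0


def dense_ranks(scores: list[float]) -> list[int]:
    """Rank 1 is the highest score; tied scores share the same rank."""
    distinct = sorted(set(scores), reverse=True)
    position = {score: i + 1 for i, score in enumerate(distinct)}
    return [position[s] for s in scores]


def picks_top_solution(reference: list[float], verifier: list[float]) -> bool:
    """True if the verifier's highest-scored solution is also a top reference one."""
    best_index = max(range(len(verifier)), key=verifier.__getitem__)
    return reference[best_index] == max(reference)


# Hypothetical pass/fail outcomes for three candidate solutions on four tests.
outcomes = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
]
reference = [pass_fraction(o) for o in outcomes]
ranks = dense_ranks(reference)

# Made-up scores that some synthetic verifier assigned to the same solutions.
verifier = [0.9, 0.4, 0.6]
print(reference, ranks, picks_top_solution(reference, verifier))
```

A top-1 check is only one possible way to compare a verifier against the reference ranking; correlation-based measures over the full ranking would be a natural alternative.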

We also open-source the [code used to generate these benchmarks](https://github.com/aleksficek/Scoring-Verifiers).
## Citation
```
@misc{ficek2025scoringverifiersevaluatingsynthetic,
title={Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning},
author={Aleksander Ficek and Somshubra Majumdar and Vahid Noroozi and Boris Ginsburg},
year={2025},
eprint={2502.13820},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.13820},
}
```
Provider: maas
Created: 2025-04-21



