ReasoningJudgeBench

Name: ReasoningJudgeBench
Creator: maas
Published: 2025-12-05 16:46:41
License: No description provided

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/Salesforce/ReasoningJudgeBench

Resource overview:

# J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

To run evaluation, please see our GitHub repo.

- 💻 **Github:** [https://github.com/SalesforceAIResearch/ReasoningJudgeBench](https://github.com/SalesforceAIResearch/ReasoningJudgeBench)
- 📜 **Paper:** [https://arxiv.org/abs/2505.13346](https://arxiv.org/abs/2505.13346)

# ReasoningJudgeBench

ReasoningJudgeBench is a 1,483-sample pairwise benchmark introduced in the paper [J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization](https://arxiv.org/abs/2505.13346). It is a benchmark for automatic evaluators, such as LLM-as-judge/GenRMs and reward models, in diverse reasoning settings, created from 8 source benchmarks. Each sample consists of an original question and two responses, both generated by GPT-4o. One response is incorrect (determined by outcome), whereas the other is correct. The automatic evaluator is tasked with selecting the response with the correct output.

Overall, ReasoningJudgeBench comprises four splits:

- Multi-hop reasoning
- Math reasoning
- Domain reasoning
- "Everyday" reasoning (e.g., common-sense, causal, inductive)

The benchmark is uploaded using the original source datasets. Code is provided on GitHub to aggregate results into an overall score and split-level scores.
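As a rough illustration of what aggregating results into an overall score and split-level scores can look like, here is a minimal sketch. It is not the official aggregation code from the GitHub repo: the function name `aggregate_scores`, the `(split_name, is_correct)` input format, and the split labels in the example are all assumptions made for illustration. It also assumes the overall score is a micro-average over all samples, which the dataset card does not specify.

```python
from collections import defaultdict


def aggregate_scores(results):
    """Aggregate per-sample judgments into overall and split-level accuracy.

    `results` is an iterable of (split_name, is_correct) pairs. This input
    format is hypothetical, not the official repo's format.
    """
    per_split = defaultdict(lambda: [0, 0])  # split -> [num_correct, num_total]
    for split, is_correct in results:
        per_split[split][0] += int(is_correct)
        per_split[split][1] += 1

    # Accuracy within each split.
    split_scores = {s: c / n for s, (c, n) in per_split.items()}

    # Assumption: overall score is a micro-average over all samples.
    total_correct = sum(c for c, _ in per_split.values())
    total = sum(n for _, n in per_split.values())
    overall = total_correct / total if total else 0.0
    return overall, split_scores


# Toy example with made-up split labels and judgments.
example = [("math", True), ("math", False), ("multi-hop", True), ("domain", True)]
overall, split_scores = aggregate_scores(example)
```

If a macro-average (the mean of the split-level scores) is preferred, it can be computed from `split_scores` directly. Consult the GitHub repo for the scoring used in the paper.
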

<img src="https://cdn-uploads.huggingface.co/production/uploads/6668e86dc4ef4175fb18d250/Aj77v11C51kxa57-RdJPZ.png" alt="Pie chart presenting a split-level breakdown of ReasoningJudgeBench's four splits" width="500"/>

Each sample has the following structure:

```
{
    'problem_id': reasoning-judge-bench-<split_name>:<identifier 64-character string>,
    'source_id': Source dataset from which the sample is derived,
    'instruction': User input question,
    'positive_response': Better (correct) response,
    'negative_response': Worse (incorrect) response,
    'label': 1, # Mainly for eval code purposes, positive_response is the correct response.
}
```

## Citation

```
@article{xu2025j4r,
  title={J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization},
  author={Xu, Austin and Zhou, Yilun and Nguyen, Xuan-Phi and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.13346},
  year={2025}
}
```
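Referring back to the sample structure described above, the following sketch shows one way to recover the split name from `problem_id` and to present the two responses to a judge in randomized order, so the correct answer is not always in the same position. The helper names `split_name` and `build_pair`, the output keys, and the toy sample are hypothetical. Randomizing the order is a common precaution against position bias and is an assumption here, not a description of the official evaluation code.

```python
import random

PREFIX = "reasoning-judge-bench-"


def split_name(problem_id):
    """Extract <split_name> from 'reasoning-judge-bench-<split_name>:<identifier>'."""
    head, sep, _ = problem_id.partition(":")
    if not sep or not head.startswith(PREFIX):
        raise ValueError(f"unexpected problem_id format: {problem_id!r}")
    return head[len(PREFIX):]


def build_pair(sample, rng):
    """Place the two responses in slots A/B in random order.

    Returns the question, both responses, and which slot holds the
    correct (positive) response.
    """
    if rng.random() < 0.5:
        a, b, correct = sample["positive_response"], sample["negative_response"], "A"
    else:
        a, b, correct = sample["negative_response"], sample["positive_response"], "B"
    return {
        "question": sample["instruction"],
        "response_a": a,
        "response_b": b,
        "correct": correct,
    }


# Toy sample with made-up values, following the field names in the dataset card.
sample = {
    "problem_id": PREFIX + "math:" + "0" * 64,
    "source_id": "example-source",
    "instruction": "What is 2 + 2?",
    "positive_response": "2 + 2 = 4.",
    "negative_response": "2 + 2 = 5.",
    "label": 1,
}
pair = build_pair(sample, random.Random(0))
```

A judge's verdict ("A" or "B") can then be compared against `pair["correct"]` to score each sample.
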

Provider:

maas

Created:

2025-08-16
