Salesforce/ReasoningJudgeBench

Name: Salesforce/ReasoningJudgeBench
Creator: Salesforce
Published: 2025-06-07 00:38:19
License: No description provided

Source: Hugging Face; updated 2025-06-07; indexed 2025-07-05

Download link:

https://hf-mirror.com/datasets/Salesforce/ReasoningJudgeBench

Description:

ReasoningJudgeBench is a pairwise benchmark of 1,483 samples for evaluating automatic evaluators, such as LLM-as-judge/GenRMs and reward models, across a range of reasoning settings. The dataset is built from 8 source benchmarks. Each sample contains an original question and two responses generated by GPT-4o, one correct and one incorrect, and the evaluator's task is to select the correct response. ReasoningJudgeBench comprises four splits: multi-hop reasoning, math reasoning, domain reasoning, and everyday reasoning (e.g., common-sense, causal, and inductive reasoning).
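The pairwise task described above reduces to a simple accuracy computation: for each sample, the evaluator sees the question and both responses and is scored on whether it picks the correct one. The following is a minimal sketch of that scoring loop, not the benchmark's official evaluation code. The field names (`question`, `response_a`, `response_b`, `label`), the `judge` callable, and the toy records are all hypothetical and chosen for illustration; the actual dataset columns may differ.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical record layout; the real dataset's column names may differ.
Sample = Mapping[str, str]


def pairwise_accuracy(
    samples: Iterable[Sample],
    judge: Callable[[str, str, str], str],
) -> float:
    """Return the fraction of samples where the judge picks the correct response.

    `judge(question, response_a, response_b)` is assumed to return "A" or "B".
    """
    total = 0
    correct = 0
    for s in samples:
        verdict = judge(s["question"], s["response_a"], s["response_b"])
        correct += int(verdict == s["label"])
        total += 1
    # Avoid division by zero on an empty input.
    return correct / total if total else 0.0


# Toy records standing in for benchmark samples (not taken from the dataset).
toy_samples = [
    {"question": "2 + 2 = ?", "response_a": "4", "response_b": "5", "label": "A"},
    {"question": "3 * 3 = ?", "response_a": "6", "response_b": "9", "label": "B"},
]


def always_a(question: str, response_a: str, response_b: str) -> str:
    # Trivial baseline judge that ignores its inputs and always answers "A".
    return "A"


print(pairwise_accuracy(toy_samples, always_a))  # 0.5
```

A position-insensitive baseline like `always_a` scores about 50% when correct answers are evenly split between the two slots, which is the floor a real evaluator should beat.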

Provided by:

Salesforce
