Salesforce/ReasoningJudgeBench
Source: Hugging Face. Updated: 2025-06-07. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/Salesforce/ReasoningJudgeBench
Description:
ReasoningJudgeBench is a pairwise benchmark of 1,483 samples for evaluating automatic evaluators, such as LLM-as-judge/GenRMs and reward models, across various reasoning settings. The dataset is built from 8 source benchmarks. Each sample contains an original question and two responses generated by GPT-4o, one correct and one incorrect, and the evaluator's task is to select the correct response. ReasoningJudgeBench comprises four splits: multi-hop reasoning, math reasoning, domain reasoning, and everyday reasoning (e.g., common-sense, causal, and inductive reasoning).
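The pairwise task described above can be illustrated with a minimal sketch of how a judge might be scored. The field names (`question`, `response_a`, `response_b`, `correct`), the toy samples, and the length-based judge are all hypothetical stand-ins, not the dataset's actual schema or any official evaluation script.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PairwiseSample:
    # Hypothetical field names; the real dataset schema may differ.
    question: str
    response_a: str
    response_b: str
    correct: str  # "A" or "B": which response is the correct one


# A judge takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_accuracy(samples: List[PairwiseSample], judge: Judge) -> float:
    """Fraction of samples where the judge picks the correct response."""
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if judge(s.question, s.response_a, s.response_b) == s.correct
    )
    return hits / len(samples)


def longer_response_judge(question: str, a: str, b: str) -> str:
    """Toy baseline (an assumption for illustration): prefer the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Made-up samples for illustration only; not drawn from the benchmark.
toy_samples = [
    PairwiseSample("What is 2 + 3?", "2 + 3 = 5, so the answer is 5.", "The answer is 6.", "A"),
    PairwiseSample("What is 10 / 2?", "It is 4.", "10 divided by 2 equals 5.", "B"),
    PairwiseSample("Is 7 prime?", "No, 7 is divisible by 3, so it is not prime.", "Yes.", "B"),
]

accuracy = pairwise_accuracy(toy_samples, longer_response_judge)
print(f"Pairwise accuracy: {accuracy:.3f}")
```

In practice the toy judge would be replaced by a call to an LLM-as-judge or reward model, and accuracy could be reported separately for each of the four splits.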
Provider:
Salesforce



