
Salesforce/ReasoningJudgeBench

Source: Hugging Face. Updated 2025-06-07; indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/Salesforce/ReasoningJudgeBench
Description:
ReasoningJudgeBench is a pairwise benchmark of 1,483 samples for evaluating automatic evaluators, such as LLM-as-judge/GenRMs and reward models, across a range of reasoning settings. The dataset is built from 8 source benchmarks. Each sample contains an original question and two responses generated by GPT-4o, one correct and one incorrect, and the evaluator's task is to select the correct one. ReasoningJudgeBench comprises four splits: multi-hop reasoning, math reasoning, domain reasoning, and everyday reasoning (e.g., common-sense, causal, and inductive reasoning).
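
The pairwise task described above can be sketched as a simple accuracy computation. This is only an illustrative sketch, not the benchmark's official scoring code: the field names (`question`, `response_a`, `response_b`, `correct`), the `judge` callable, and the toy samples are all assumptions made for this example and may not match the dataset's actual columns.

```python
from typing import Callable, Dict, List

# Hypothetical sample layout; the real dataset's column names may differ.
Sample = Dict[str, str]

# A judge takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_accuracy(samples: List[Sample], judge: Judge) -> float:
    """Return the fraction of samples where the judge picks the correct response."""
    if not samples:
        return 0.0
    hits = 0
    for sample in samples:
        choice = judge(sample["question"], sample["response_a"], sample["response_b"])
        if choice == sample["correct"]:
            hits += 1
    return hits / len(samples)


# Toy data invented for illustration; not taken from the dataset.
toy_samples = [
    {"question": "What is 2 + 2?", "response_a": "4", "response_b": "5", "correct": "A"},
    {"question": "What is 3 * 3?", "response_a": "6", "response_b": "9", "correct": "B"},
]


def always_a(question: str, response_a: str, response_b: str) -> str:
    """Trivial baseline judge that always prefers the first response."""
    return "A"


toy_accuracy = pairwise_accuracy(toy_samples, always_a)
```

A real evaluator would replace `always_a` with a call to an LLM judge or reward model. Since such judges can be sensitive to response order, one reasonable design choice is to score each pair in both orders, though whether the benchmark itself does so is not stated here.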
Provider:
Salesforce