ContextualJudgeBench

Name: ContextualJudgeBench
Creator: maas
Published: 2025-12-05 16:46:41
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-08-23

Download link:

https://modelscope.cn/datasets/Salesforce/ContextualJudgeBench


Resource description:

# Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Austin Xu*, Srijan Bansal*, Yifei Ming, Semih Yavuz, Shafiq Joty (* = co-lead, equal contribution)

TL;DR: ContextualJudgeBench is a pairwise benchmark with 2,000 samples for evaluating LLM-as-judge models in two contextual settings: Contextual QA and summarization. We propose a pairwise evaluation hierarchy and generate splits for our proposed hierarchy.

To run evaluation on ContextualJudgeBench, please see our Github repo.

- 💻 **Github:** [https://github.com/SalesforceAIResearch/ContextualJudgeBench](https://github.com/SalesforceAIResearch/ContextualJudgeBench)
- 📜 **Paper:** [https://arxiv.org/abs/2503.15620](https://arxiv.org/abs/2503.15620)

<img src="https://cdn-uploads.huggingface.co/production/uploads/6668e86dc4ef4175fb18d250/D8f0XvT5euqWe4fRwYqeZ.jpeg" alt="drawing" width="1000"/>

Overall, there are 8 splits (see above Figure), with roughly 250 samples per split. Each sample has the following structure

```
{
    'problem_id': contextual-judge-bench-<split_name>:<identifier 64-character string>,
    'question': Original user input,
    'context': Context used to answer the user question,
    'positive_response': Better (chosen) response,
    'negative_response': Worse (rejected) response,
    'source': Source dataset from which the sample is derived from
}
```

## Citation

```
@misc{xu2025doescontextmattercontextualjudgebench,
      title={Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings},
      author={Austin Xu and Srijan Bansal and Yifei Ming and Semih Yavuz and Shafiq Joty},
      year={2025},
      eprint={2503.15620},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.15620},
}
```
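As a rough illustration of how the sample structure described above might be consumed, the following Python sketch checks the six documented fields, extracts the split name from `problem_id`, and scores a judge on pairwise preference. It is not the evaluation code from the GitHub repo. The helper names (`split_name`, `validate`, `pairwise_accuracy`), the toy `longer_judge`, the demo samples, and the split name `example_split` are all hypothetical, and counting a sample as correct only when the judge picks the positive response in both presentation orders is an assumption made here to reduce position bias, not a documented rule of the benchmark.

```python
from typing import Callable, Dict, List

PREFIX = "contextual-judge-bench-"
REQUIRED_FIELDS = (
    "problem_id", "question", "context",
    "positive_response", "negative_response", "source",
)

# A judge receives (question, context, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]


def split_name(problem_id: str) -> str:
    """Extract <split_name> from 'contextual-judge-bench-<split_name>:<identifier>'."""
    head, _, _identifier = problem_id.partition(":")
    if not head.startswith(PREFIX):
        raise ValueError(f"unexpected problem_id: {problem_id!r}")
    return head[len(PREFIX):]


def validate(sample: Dict[str, str]) -> None:
    """Raise KeyError if any of the documented fields is missing."""
    missing = [field for field in REQUIRED_FIELDS if field not in sample]
    if missing:
        raise KeyError(f"missing fields: {missing}")


def pairwise_accuracy(samples: List[Dict[str, str]], judge: Judge) -> float:
    """Fraction of samples where the judge prefers the positive response
    in both presentation orders (assumed scoring rule, see lead-in)."""
    correct = 0
    for sample in samples:
        validate(sample)
        q, c = sample["question"], sample["context"]
        pos, neg = sample["positive_response"], sample["negative_response"]
        first = judge(q, c, pos, neg)   # positive shown as response A
        second = judge(q, c, neg, pos)  # positive shown as response B
        if first == "A" and second == "B":
            correct += 1
    return correct / len(samples) if samples else 0.0


def longer_judge(question: str, context: str, a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Hypothetical samples that follow the documented field layout.
demo_samples = [
    {
        "problem_id": PREFIX + "example_split:" + "a" * 64,
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "positive_response": "Paris is the capital of France, as stated in the context.",
        "negative_response": "Lyon.",
        "source": "hypothetical-source",
    },
    {
        "problem_id": PREFIX + "example_split:" + "b" * 64,
        "question": "How long is the river?",
        "context": "The river flows through three countries.",
        "positive_response": "Not stated.",
        "negative_response": "The context says the river is roughly 900 km long.",
        "source": "hypothetical-source",
    },
]

if __name__ == "__main__":
    print(split_name(demo_samples[0]["problem_id"]))
    print(pairwise_accuracy(demo_samples, longer_judge))
```

In practice, `longer_judge` would be replaced by a call to the judge model under test, and the samples would come from the downloaded dataset rather than from hand-written dictionaries.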


Provider: maas

Created: 2025-08-16
