ContextualJudgeBench
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-08-23.
Download link: https://modelscope.cn/datasets/Salesforce/ContextualJudgeBench
# Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu*, Srijan Bansal*, Yifei Ming, Semih Yavuz, Shafiq Joty (* = co-lead, equal contribution)
TL;DR: ContextualJudgeBench is a pairwise benchmark with 2,000 samples for evaluating LLM-as-judge models in two contextual settings: Contextual QA and summarization. We propose a pairwise evaluation hierarchy and generate splits for our proposed hierarchy.
To run evaluation on ContextualJudgeBench, please see our GitHub repo.
- 💻 **GitHub:** [https://github.com/SalesforceAIResearch/ContextualJudgeBench](https://github.com/SalesforceAIResearch/ContextualJudgeBench)
- 📜 **Paper:** [https://arxiv.org/abs/2503.15620](https://arxiv.org/abs/2503.15620)
<img src="https://cdn-uploads.huggingface.co/production/uploads/6668e86dc4ef4175fb18d250/D8f0XvT5euqWe4fRwYqeZ.jpeg" alt="drawing" width="1000"/>
Overall, there are 8 splits (see the figure above), with roughly 250 samples per split. Each sample has the following structure:
```
{
'problem_id': contextual-judge-bench-<split_name>:<identifier 64-character string>,
'question': Original user input,
'context': Context used to answer the user question,
'positive_response': Better (chosen) response,
'negative_response': Worse (rejected) response,
'source': Source dataset from which the sample is derived
}
```
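The sample structure above can be consumed along the following lines. This is a minimal sketch, not the evaluation code from the GitHub repo: the helper names `split_name` and `per_split_accuracy`, the `judge` callable and its `"A"`/`"B"` return convention, and the `demo-split` name in the toy sample are all assumptions made for illustration. Showing each pair in both orders is a common precaution against position bias in pairwise judging and is not necessarily the scoring protocol used by the authors.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping

REQUIRED_FIELDS = ("problem_id", "question", "context",
                   "positive_response", "negative_response", "source")
ID_PREFIX = "contextual-judge-bench-"

# Hypothetical judge signature: (question, context, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def split_name(problem_id: str) -> str:
    """Extract <split_name> from 'contextual-judge-bench-<split_name>:<identifier>'."""
    head, sep, _identifier = problem_id.rpartition(":")
    if not sep or not head.startswith(ID_PREFIX):
        raise ValueError(f"unexpected problem_id format: {problem_id!r}")
    return head[len(ID_PREFIX):]


def per_split_accuracy(samples: Iterable[Mapping[str, str]],
                       judge: Judge) -> Dict[str, float]:
    """Fraction of correct pairwise verdicts per split, over both orderings."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            raise KeyError(f"sample is missing fields: {missing}")
        name = split_name(sample["problem_id"])
        question, context = sample["question"], sample["context"]
        pos, neg = sample["positive_response"], sample["negative_response"]
        # Show the pair in both orders so that a judge which always answers
        # "A" scores 0.5 rather than 1.0 (assumption: not the authors' metric).
        correct[name] += int(judge(question, context, pos, neg) == "A")
        correct[name] += int(judge(question, context, neg, pos) == "B")
        total[name] += 2
    return {name: correct[name] / total[name] for name in total}


if __name__ == "__main__":
    # Toy sample with made-up content; "demo-split" is not a real split name.
    toy = {
        "problem_id": ID_PREFIX + "demo-split:" + "0" * 64,
        "question": "What colour is the sky in the passage?",
        "context": "The passage says the sky was green that evening.",
        "positive_response": "According to the passage, the sky was green.",
        "negative_response": "Blue.",
        "source": "toy",
    }
    # Placeholder judge that prefers the longer response; a real judge would call an LLM.
    longer = lambda q, c, a, b: "A" if len(a) >= len(b) else "B"
    print(per_split_accuracy([toy], longer))
```

In practice the `judge` placeholder would wrap a call to the LLM being evaluated, and the samples would come from the dataset itself rather than a hand-written dictionary.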
## Citation
```
@misc{xu2025doescontextmattercontextualjudgebench,
title={Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings},
author={Austin Xu and Srijan Bansal and Yifei Ming and Semih Yavuz and Shafiq Joty},
year={2025},
eprint={2503.15620},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.15620},
}
```
Provider: maas
Created: 2025-08-16



