
ContextualJudgeBench

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/Salesforce/ContextualJudgeBench
Resource description:
# Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Austin Xu*, Srijan Bansal*, Yifei Ming, Semih Yavuz, Shafiq Joty (* = co-lead, equal contribution)

TL;DR: ContextualJudgeBench is a pairwise benchmark with 2,000 samples for evaluating LLM-as-judge models in two contextual settings: Contextual QA and summarization. We propose a pairwise evaluation hierarchy and generate splits for our proposed hierarchy.

To run evaluation on ContextualJudgeBench, please see our GitHub repo.

- 💻 **GitHub:** [https://github.com/SalesforceAIResearch/ContextualJudgeBench](https://github.com/SalesforceAIResearch/ContextualJudgeBench)
- 📜 **Paper:** [https://arxiv.org/abs/2503.15620](https://arxiv.org/abs/2503.15620)

<img src="https://cdn-uploads.huggingface.co/production/uploads/6668e86dc4ef4175fb18d250/D8f0XvT5euqWe4fRwYqeZ.jpeg" alt="drawing" width="1000"/>

Overall, there are 8 splits (see the figure above), with roughly 250 samples per split. Each sample has the following structure:

```
{
    'problem_id': contextual-judge-bench-<split_name>:<identifier 64-character string>,
    'question': Original user input,
    'context': Context used to answer the user question,
    'positive_response': Better (chosen) response,
    'negative_response': Worse (rejected) response,
    'source': Source dataset from which the sample is derived
}
```

## Citation

```
@misc{xu2025doescontextmattercontextualjudgebench,
      title={Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings},
      author={Austin Xu and Srijan Bansal and Yifei Ming and Semih Yavuz and Shafiq Joty},
      year={2025},
      eprint={2503.15620},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.15620},
}
```
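Returning to the sample structure described earlier on this card: the sketch below shows one way the fields might be consumed in a pairwise evaluation loop. It is a minimal illustration, not the evaluation code from the GitHub repo. The helper names (`parse_problem_id`, `accuracy_by_split`, `toy_length_judge`), the `"A"`/`"B"` verdict convention, and the choice to count a sample as correct only when the judge picks the positive response in both presentation orders are all assumptions made for this example. The split name `demo-split` and the toy samples are invented placeholders.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Literal prefix taken from the 'problem_id' pattern shown on the card.
PREFIX = "contextual-judge-bench-"

# Hypothetical judge signature: (question, context, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def parse_problem_id(problem_id: str) -> Tuple[str, str]:
    """Split 'contextual-judge-bench-<split_name>:<identifier>' into its two parts."""
    if not problem_id.startswith(PREFIX):
        raise ValueError(f"unexpected problem_id prefix: {problem_id!r}")
    split_name, sep, identifier = problem_id[len(PREFIX):].rpartition(":")
    if not sep or not split_name or not identifier:
        raise ValueError(f"malformed problem_id: {problem_id!r}")
    return split_name, identifier


def accuracy_by_split(samples: Iterable[dict], judge: Judge) -> Dict[str, float]:
    """Per-split accuracy; a sample counts only if the judge is right in both orders.

    Checking both orders is an assumption of this sketch, used here as a
    simple guard against position bias.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for s in samples:
        split_name, _ = parse_problem_id(s["problem_id"])
        # Order 1: positive response shown as "A".
        first = judge(s["question"], s["context"],
                      s["positive_response"], s["negative_response"])
        # Order 2: positive response shown as "B".
        second = judge(s["question"], s["context"],
                       s["negative_response"], s["positive_response"])
        total[split_name] += 1
        if first == "A" and second == "B":
            correct[split_name] += 1
    return {name: correct[name] / total[name] for name in total}


def toy_length_judge(question: str, context: str, a: str, b: str) -> str:
    """Stand-in judge for demonstration only: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Invented placeholder samples that follow the documented field names.
TOY_SAMPLES = [
    {
        "problem_id": PREFIX + "demo-split:" + "0" * 64,
        "question": "What colour is the sky in the passage?",
        "context": "The passage says the sky was grey.",
        "positive_response": "The passage says the sky was grey.",
        "negative_response": "Blue.",
        "source": "placeholder",
    },
    {
        "problem_id": PREFIX + "demo-split:" + "1" * 64,
        "question": "Who wrote the letter?",
        "context": "The letter is unsigned.",
        "positive_response": "Unknown.",
        "negative_response": "It was certainly written by the mayor.",
        "source": "placeholder",
    },
]

if __name__ == "__main__":
    print(accuracy_by_split(TOY_SAMPLES, toy_length_judge))
```

In practice, `toy_length_judge` would be replaced by a call to the LLM judge under test, and `TOY_SAMPLES` by records loaded from the dataset.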

Provider:
maas
Created:
2025-08-16