ContextualBench
Source: ModelScope community. Updated: 2025-10-03. Indexed: 2025-08-16.
Download link: https://modelscope.cn/datasets/Salesforce/ContextualBench
Resource description:
# ContextualBench - A comprehensive toolkit to evaluate LMs on different contextual datasets
Evaluation Code: [SalesforceAIResearch/SFR-RAG](https://github.com/SalesforceAIResearch/SFR-RAG)
## Description
ContextualBench is a powerful evaluation framework designed to assess the performance of Large Language Models (LLMs) on contextual datasets. It provides a flexible pipeline for evaluating various LLM families across different tasks, with a focus on handling large context inputs.
> Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data.
## Features
* Dynamic Retrieval Support: Efficiently handles large context inputs, allowing for comprehensive evaluation of LLMs' contextual understanding capabilities.
* Extensive Evaluation Dataset: Supports 7 contextual tasks, including question answering (QA), multi-hop question answering, and classification tasks.
* Multi-LLM Family Support: Compatible with a wide range of LLM families, including: Hugging Face models, Gemma, Mistral, OpenAI, Cohere.
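
As a rough illustration of what evaluating a model on a contextual task involves, the sketch below builds a prompt from context passages, calls a model, and scores the output with normalized exact match. It is not the SFR-RAG evaluation code: the function names (`build_prompt`, `evaluate`), the `generate` callable, the record fields, the prompt layout, and the choice of exact match as a metric are all assumptions made for illustration.

```python
import re
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def build_prompt(question: str, contexts: List[str]) -> str:
    """Place numbered context passages ahead of the question (hypothetical layout)."""
    passages = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Contexts:\n{passages}\n\nQuestion: {question}\nAnswer:"


def evaluate(records: List[Dict], generate: Callable[[str], str]) -> float:
    """Return the fraction of records whose prediction equals any gold answer."""
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        prompt = build_prompt(rec["question"], rec["contexts"])
        prediction = normalize(generate(prompt))
        golds = {normalize(answer) for answer in rec["answers"]}
        hits += prediction in golds
    return hits / len(records)


# Toy records; the field names are assumptions, not the benchmark's schema.
records = [
    {"question": "Where is the Eiffel Tower?",
     "contexts": ["The Eiffel Tower is in Paris.", "Berlin is in Germany."],
     "answers": ["Paris"]},
    {"question": "What is the capital of Germany?",
     "contexts": ["Berlin is the capital of Germany."],
     "answers": ["Berlin"]},
]


def stub_model(prompt: str) -> str:
    """Stand-in for any LLM backend: answers one question right, one wrong."""
    return "The Paris." if "Eiffel" in prompt else "Munich"


accuracy = evaluate(records, stub_model)
print(f"exact match: {accuracy:.2f}")
```

Because `evaluate` only depends on a `str -> str` callable, any model family could be plugged in by wrapping its client in such a function.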
## Component Datasets of ContextualBench
### 2WikiHotpotQA
This dataset is a multihop question answering task, as proposed in "Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps" by Ho et al.
The folder contains the evaluation script and the path to the dataset's validation split, which has around 12k samples.
```
@inproceedings{xanh2020_2wikimultihop,
title = "Constructing A Multi-hop {QA} Dataset for Comprehensive Evaluation of Reasoning Steps",
author = "Ho, Xanh and
Duong Nguyen, Anh-Khoa and
Sugawara, Saku and
Aizawa, Akiko",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.580",
pages = "6609--6625",
}
```
### HotpotQA
HotpotQA is a dataset of Wikipedia-based question-answer pairs whose questions require finding and reasoning over multiple supporting documents. We evaluate on 7,405 datapoints in the distractor setting. The dataset was proposed in the paper below.
```
@inproceedings{yang2018hotpotqa,
title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
year={2018}
}
```
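
In the original HotpotQA distractor setting, each question comes with ten paragraphs: two gold paragraphs and eight distractors. The sketch below shows one way to flatten such an example into passages and to tell gold paragraphs from distractors. The field layout (`context` as title and sentence-list pairs, `supporting_facts` as title and sentence-index pairs) follows the upstream HotpotQA JSON and is an assumption here; the copy shipped with this benchmark may be structured differently. The toy example uses invented titles and only four paragraphs for brevity.

```python
from typing import Dict, List, Set


def flatten_context(example: Dict) -> List[str]:
    """Turn each (title, sentences) pair into a single 'title: text' passage."""
    return [f"{title}: {' '.join(sentences)}" for title, sentences in example["context"]]


def gold_titles(example: Dict) -> Set[str]:
    """Titles of paragraphs that contain at least one supporting fact."""
    return {title for title, _sentence_index in example["supporting_facts"]}


# Invented toy example: two gold paragraphs mixed with two distractors.
example = {
    "question": "Where was the director of Alpha Film born?",
    "context": [
        ["Alpha Film", ["Alpha Film is a drama.", "It was directed by Jane Roe."]],
        ["Gamma River", ["Gamma River flows north."]],
        ["Jane Roe", ["Jane Roe is a director.", "She was born in Springfield."]],
        ["Delta Band", ["Delta Band formed in a garage."]],
    ],
    "supporting_facts": [["Alpha Film", 1], ["Jane Roe", 1]],
}

passages = flatten_context(example)
gold = gold_titles(example)
distractors = [title for title, _ in example["context"] if title not in gold]

print(len(passages), "passages;", "distractors:", distractors)
```

The model sees all passages without knowing which are gold, so answering requires both locating the two relevant paragraphs and combining them.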
### MuSiQue
This dataset is a multihop question answering task in which every question requires 2 to 4 hops, making it slightly harder than other multihop tasks. The dataset was proposed in the paper below.
```
@article{trivedi2021musique,
title={{M}u{S}i{Q}ue: Multihop Questions via Single-hop Question Composition},
author={Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish},
journal={Transactions of the Association for Computational Linguistics},
year={2022},
publisher={MIT Press}
}
```
### NaturalQuestions
The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question.
```
@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association for Computational Linguistics}
}
```
### PopQA
This is the long-tail subset of PopQA, a large-scale open-domain question answering (QA) dataset. It consists of 1,399 rare-entity queries whose monthly Wikipedia page views are below 100.
Make sure to cite the work:
```
@article{mallen2023llm_memorization,
title={When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories},
author={Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Hajishirzi, Hannaneh and Khashabi, Daniel},
journal={arXiv preprint},
year={2022}
}
```
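
The long-tail criterion described above (monthly Wikipedia page views below 100) amounts to a simple threshold filter. The sketch below illustrates it on invented rows; the field name `monthly_views` and the helper `long_tail_subset` are hypothetical and do not reflect the actual column names of the dataset.

```python
from typing import Dict, List

# Threshold taken from the description above: fewer than 100 monthly page views.
LONG_TAIL_THRESHOLD = 100


def long_tail_subset(rows: List[Dict], threshold: int = LONG_TAIL_THRESHOLD) -> List[Dict]:
    """Keep only rows whose entity popularity is strictly below the threshold."""
    return [row for row in rows if row["monthly_views"] < threshold]


# Invented toy rows; `monthly_views` is a hypothetical field name.
rows = [
    {"question": "Who wrote Book A?", "monthly_views": 12},
    {"question": "Who directed Film B?", "monthly_views": 100},
    {"question": "Where is Town C?", "monthly_views": 99},
    {"question": "Who founded Company D?", "monthly_views": 25000},
]

rare = long_tail_subset(rows)
print([row["question"] for row in rare])
```

The comparison is strict, so an entity with exactly 100 monthly views is excluded, matching the "less than 100" wording.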
### TriviaQA
TriviaQA is a reading comprehension dataset containing question-answer pairs authored by trivia enthusiasts, together with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions.
```
@article{2017arXivtriviaqa,
author = {{Joshi}, Mandar and {Choi}, Eunsol and {Weld},
Daniel and {Zettlemoyer}, Luke},
title = "{triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension}",
journal = {arXiv e-prints},
year = 2017,
eid = {arXiv:1705.03551},
pages = {arXiv:1705.03551},
archivePrefix = {arXiv},
eprint = {1705.03551},
}
```
### TruthfulQA
TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.
```
@misc{lin2021truthfulqa,
title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
author={Stephanie Lin and Jacob Hilton and Owain Evans},
year={2021},
eprint={2109.07958},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
## Citation
```
@article{nguyen2024sfrrag,
title={SFR-RAG: Towards Contextually Faithful LLMs},
author={Nguyen, Xuan-Phi and Pandit, Shrey and Purushwalkam, Senthil and Xu, Austin and Chen, Hailin and Ming, Yifei and Ke, Zixuan and Savarese, Silvio and Xiong, Caiming and Joty, Shafiq},
year={2024}
}
```
Provider: maas
Created: 2025-08-15



