coverbench
Source: ModelScope community; updated 2025-12-05; indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/coverbench
Description:
# CoverBench: A Challenging Benchmark for Complex Claim Verification
Link: [https://arxiv.org/abs/2408.03325](https://arxiv.org/abs/2408.03325)
Abstract:
There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom.
This dataset is derived from a collection of other datasets (see paper), for which we used models to generate claims for verification. Where applicable, each example records which model generated it. The original data from each source dataset remains subject to that dataset's original license.
*When citing our work, please cite the 9 source datasets we used as well!*
### Important Update
On 5 September 2024, the CoverBench data file was updated to fix an error in the PubMedQA part of the data that was reported to us. Approximately 40 to 50 examples were affected by a simple bug in our preparation; as of the update, the data file should be correct. Sorry!
### Usage
To load the dataset:
```python
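# Install the dependency first: `pip install datasets` in a shell,
# or `!pip install datasets` in a notebook cell.
from datasets import load_dataset

# Load the benchmark and select its "eval" split.
coverbench = load_dataset("google/coverbench")["eval"]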
```
### **This is an evaluation benchmark. It should not be included in training data for NLP models.**
Please do not redistribute any part of the dataset without sufficient protection against web-crawlers.
A 64-character identifier string is added to each instance in the dataset to assist in future detection of contamination in web-crawl corpora.
The CoverBench dataset's string is: `CoverBench:hEBhLMcvwQFuAjcV94zZuPS5iWJp8zv1cEywyEwHKWfGrIKiXodDRcjRY4PtbgwZ`
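As a rough illustration of how the identifier string could be used, the sketch below scans a collection of text documents and reports which ones contain it. The helper name `find_contaminated` and the toy `corpus` are hypothetical and not part of the dataset or its tooling; a real web-crawl corpus would need streaming and possibly normalization, which this sketch does not attempt.

```python
from typing import Iterable, Iterator

# Identifier string quoted from the dataset card above.
COVERBENCH_CANARY = (
    "CoverBench:hEBhLMcvwQFuAjcV94zZuPS5iWJp8zv1cEywyEwHKWfGrIKiXodDRcjRY4PtbgwZ"
)


def find_contaminated(
    docs: Iterable[str], canary: str = COVERBENCH_CANARY
) -> Iterator[int]:
    """Yield the index of every document that contains the canary string."""
    for index, text in enumerate(docs):
        if canary in text:
            yield index


# Toy corpus (hypothetical): one clean page, one page with a leaked instance.
corpus = [
    "an unrelated web page",
    "a scraped benchmark example " + COVERBENCH_CANARY,
]
print(list(find_contaminated(corpus)))
```

An exact substring match is deliberately strict: it only flags documents that preserved the identifier verbatim, so a hit is strong evidence of contamination while a miss is not proof of its absence.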
Provider:
maas
Created:
2025-04-21



