fever
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-05-10
Download link: https://modelscope.cn/datasets/EleutherAI/fever
Description:
# Dataset Card for "fever"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://fever.ai/](https://fever.ai/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Dataset Summary
With billions of individual pages on the web providing information on almost every conceivable topic, we should have
the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this
information is contained in structured sources (Wikidata, Freebase, etc.) – we are therefore limited by our ability to
transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot
of recent research and media coverage: false information coming from unreliable sources.
The FEVER workshops provide a venue for work on verifiable knowledge extraction and aim to stimulate progress in this direction.
- FEVER Dataset: FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences
extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims
are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the
sentence(s) forming the necessary evidence for their judgment.
- FEVER 2.0 Adversarial Attacks Dataset: The FEVER 2.0 Dataset consists of 1174 claims created by the submissions of
participants in the Breaker phase of the 2019 shared task. Participants (Breakers) were tasked with generating
adversarial examples that induce classification errors for the existing systems. Breakers submitted a dataset of up to
1000 instances with an equal number of instances for each of the three classes (Supported, Refuted, NotEnoughInfo). Only
novel claims (i.e. not contained in the original FEVER dataset) were considered valid entries to the shared task.
The submissions were then manually evaluated for correctness (grammatical, appropriately labeled, and meeting the
requirements of the FEVER annotation guidelines).
### Supported Tasks and Leaderboards
The task is verification of textual claims against textual sources.
Compared with textual entailment (TE)/natural language inference, the key difference is that in TE the passage
against which each claim is verified is given, and in recent years it has typically consisted of a single sentence,
whereas a verification system must retrieve the evidence from a large set of documents.
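
The retrieval step that separates verification from TE can be illustrated with a deliberately naive sketch. It ranks pages by token overlap with the claim; this is a stand-in chosen for brevity, not the method used by FEVER systems. The function names and the second corpus sentence are assumptions made for this example.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim: str, corpus: Dict[str, str], k: int = 1) -> List[Tuple[str, int]]:
    """Rank pages by the number of tokens they share with the claim."""
    claim_tokens = tokenize(claim)
    scored = [
        (page_id, len(claim_tokens & tokenize(text)))
        for page_id, text in corpus.items()
    ]
    # Highest overlap first; ties are broken alphabetically by page id.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


corpus = {
    "1928_in_association_football": (
        "The following are the football -LRB- soccer -RRB- events "
        "of the year 1928 throughout the world ."
    ),
    # Placeholder sentence written for this sketch, not copied from the dataset.
    "Nikolaj_Coster-Waldau": "Nikolaj Coster-Waldau is a Danish actor .",
}
claim = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."

top_pages = retrieve(claim, corpus, k=1)
print(top_pages)
```

In a TE setting the passage would be passed in directly; here the system first has to decide which page to read before any label can be predicted.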
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
#### v1.0
- **Size of downloaded dataset files:** 44.86 MB
- **Size of the generated dataset:** 40.05 MB
- **Total amount of disk used:** 84.89 MB
An example of 'train' looks as follows.
```
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
'label': 'SUPPORTS',
'id': 75397,
'evidence_id': 104971,
'evidence_sentence_id': 7,
'evidence_annotation_id': 92206}
```
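
The `train` split lists 311,431 rows, more than the 185,445 claims counted in the summary, which suggests that each row is one claim-evidence pair and not one claim. The sketch below groups rows by claim under that assumption; `group_by_claim` is a hypothetical helper, and the second row is invented for illustration.

```python
from typing import Dict, List

CLAIM = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."

rows: List[dict] = [
    {
        "claim": CLAIM,
        "evidence_wiki_url": "Nikolaj_Coster-Waldau",
        "label": "SUPPORTS",
        "id": 75397,
        "evidence_id": 104971,
        "evidence_sentence_id": 7,
        "evidence_annotation_id": 92206,
    },
    # Hypothetical second row for the same claim; its evidence values are made up.
    {
        "claim": CLAIM,
        "evidence_wiki_url": "Fox_Broadcasting_Company",
        "label": "SUPPORTS",
        "id": 75397,
        "evidence_id": 0,
        "evidence_sentence_id": 0,
        "evidence_annotation_id": 92206,
    },
]


def group_by_claim(rows: List[dict]) -> Dict[int, dict]:
    """Collect the (page, sentence index) evidence pairs for each claim id."""
    claims: Dict[int, dict] = {}
    for row in rows:
        entry = claims.setdefault(
            row["id"],
            {"claim": row["claim"], "label": row["label"], "evidence": []},
        )
        pair = (row["evidence_wiki_url"], row["evidence_sentence_id"])
        if pair not in entry["evidence"]:  # skip repeated evidence pairs
            entry["evidence"].append(pair)
    return claims


grouped = group_by_claim(rows)
print(grouped[75397]["evidence"])
```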
#### v2.0
- **Size of downloaded dataset files:** 0.39 MB
- **Size of the generated dataset:** 0.30 MB
- **Total amount of disk used:** 0.70 MB
#### wiki_pages
- **Size of downloaded dataset files:** 1.71 GB
- **Size of the generated dataset:** 7.25 GB
- **Total amount of disk used:** 8.97 GB
An example of 'wikipedia_pages' looks as follows.
```
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ',
'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t',
'id': '1928_in_association_football'}
```
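
In the example record, `lines` holds one sentence per line, each prefixed with an index and a tab. Parsing it might look like the sketch below. It assumes that the first tab-separated column is the sentence index, that this index is what `evidence_sentence_id` refers to, and that `-LRB-`/`-RRB-` stand for parentheses. The card confirms none of these, and both helper names are hypothetical.

```python
from typing import Dict

# Bracket tokens seen in the example record; mapping them to parentheses is an assumption.
BRACKETS = {"-LRB-": "(", "-RRB-": ")"}


def parse_lines(lines: str) -> Dict[int, str]:
    """Map sentence index -> sentence text for one wiki_pages record."""
    sentences: Dict[int, str] = {}
    for row in lines.split("\n"):
        parts = row.split("\t")
        # Assumed layout: index, sentence, then optional extra columns (ignored).
        if not parts[0].isdigit():
            continue
        text = parts[1].strip() if len(parts) > 1 else ""
        if text:  # drop empty trailing entries such as "1\t"
            sentences[int(parts[0])] = text
    return sentences


def restore_brackets(text: str) -> str:
    """Replace -LRB-/-RRB- tokens with literal parentheses."""
    for token, char in BRACKETS.items():
        text = text.replace(token, char)
    return text


record = {
    "id": "1928_in_association_football",
    "lines": "0\tThe following are the football -LRB- soccer -RRB- events "
             "of the year 1928 throughout the world .\n1\t",
}
sentences = parse_lines(record["lines"])
print(restore_brackets(sentences[0]))
```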
### Data Fields
The data fields are the same among all splits.
#### v1.0
- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.
#### v2.0
- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.
#### wiki_pages
- `id`: a `string` feature.
- `text`: a `string` feature.
- `lines`: a `string` feature.
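
The v1.0/v2.0 field list above can be written down as a small schema check in plain Python. This is a sketch for sanity-checking rows before further processing; `CLAIM_SCHEMA` and `validate_row` are hypothetical names, not part of any library.

```python
from typing import Dict, List

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

# Field names and types as listed for the v1.0 and v2.0 configurations.
CLAIM_SCHEMA: Dict[str, type] = {
    "id": int,
    "label": str,
    "claim": str,
    "evidence_annotation_id": int,
    "evidence_id": int,
    "evidence_wiki_url": str,
    "evidence_sentence_id": int,
}


def validate_row(row: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the row conforms."""
    problems: List[str] = []
    for name, expected in CLAIM_SCHEMA.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{name}: expected {expected.__name__}")
        elif expected is int and not INT32_MIN <= value <= INT32_MAX:
            problems.append(f"{name}: outside int32 range")
    return problems


example = {
    "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
    "evidence_wiki_url": "Nikolaj_Coster-Waldau",
    "label": "SUPPORTS",
    "id": 75397,
    "evidence_id": 104971,
    "evidence_sentence_id": 7,
    "evidence_annotation_id": 92206,
}
print(validate_row(example))
```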
### Data Splits
#### v1.0
| | train | dev | paper_dev | paper_test |
|------|-------:|------:|----------:|-----------:|
| v1.0 | 311431 | 37566 | 18999 | 18567 |
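
One relationship the v1.0 table does not spell out is that the `paper_dev` and `paper_test` row counts add up exactly to the `dev` count. This is consistent with the two being a partition of `dev`, although the card does not say so. The check below only verifies the arithmetic.

```python
# Row counts copied from the v1.0 table above.
row_counts = {"train": 311431, "dev": 37566, "paper_dev": 18999, "paper_test": 18567}

paper_total = row_counts["paper_dev"] + row_counts["paper_test"]
matches_dev = paper_total == row_counts["dev"]
print(paper_total, matches_dev)
```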
#### v2.0
| | validation |
|------|-----------:|
| v2.0 | 2384 |
#### wiki_pages
| | wikipedia_pages |
|------------|----------------:|
| wiki_pages | 5416537 |
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
FEVER license:
```
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”). You may not use these files except in compliance with the applicable License Terms.
```
### Citation Information
If you use "FEVER Dataset", please cite:
```bibtex
@inproceedings{Thorne18Fever,
author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
booktitle = {NAACL-HLT},
year = {2018}
}
```
If you use "FEVER 2.0 Adversarial Attacks Dataset", please cite:
```bibtex
@inproceedings{Thorne19FEVER2,
author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
title = {The {FEVER2.0} Shared Task},
booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
year = {2019}
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq),
[@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun),
[@albertvillanova](https://github.com/albertvillanova) for adding this dataset.
Provider: maas
Created: 2025-08-16



