fever/fever
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link: https://hf-mirror.com/datasets/fever/fever
Resource description:
---
language:
- en
paperswithcode_id: fever
annotations_creators:
- crowdsourced
language_creators:
- found
license:
- cc-by-sa-3.0
- gpl-3.0
multilinguality:
- monolingual
pretty_name: FEVER
size_categories:
- 100K<n<1M
source_datasets:
- extended|wikipedia
task_categories:
- text-classification
task_ids: []
tags:
- knowledge-verification
dataset_info:
- config_name: v1.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: train
    num_bytes: 29591412
    num_examples: 311431
  - name: labelled_dev
    num_bytes: 3643157
    num_examples: 37566
  - name: unlabelled_dev
    num_bytes: 1548965
    num_examples: 19998
  - name: unlabelled_test
    num_bytes: 1617002
    num_examples: 19998
  - name: paper_dev
    num_bytes: 1821489
    num_examples: 18999
  - name: paper_test
    num_bytes: 1821668
    num_examples: 18567
  download_size: 44853972
  dataset_size: 40043693
- config_name: v2.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: validation
    num_bytes: 306243
    num_examples: 2384
  download_size: 392466
  dataset_size: 306243
- config_name: wiki_pages
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: lines
    dtype: string
  splits:
  - name: wikipedia_pages
    num_bytes: 7254115038
    num_examples: 5416537
  download_size: 1713485474
  dataset_size: 7254115038
---
# Dataset Card for "fever"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://fever.ai/](https://fever.ai/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Dataset Summary
With billions of individual pages on the web providing information on almost every conceivable topic, we should have
the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this
information is contained in structured sources (Wikidata, Freebase, etc.) – we are therefore limited by our ability to
transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot
of recent research and media coverage: false information coming from unreliable sources.
The FEVER workshops are a venue for work in verifiable knowledge extraction and to stimulate progress in this direction.
- FEVER Dataset: FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences
extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims
are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the
sentence(s) forming the necessary evidence for their judgment.
- FEVER 2.0 Adversarial Attacks Dataset: The FEVER 2.0 Dataset consists of 1174 claims submitted by participants in
the Breaker phase of the 2019 shared task. Participants (Breakers) were tasked with generating adversarial examples
that induce classification errors in existing systems. Each Breaker submitted a dataset of up to 1000 instances with
an equal number of instances for each of the three classes (Supported, Refuted, NotEnoughInfo). Only novel claims
(i.e. claims not contained in the original FEVER dataset) were considered valid entries to the shared task.
The submissions were then manually evaluated for Correctness (grammatical, appropriately labeled, and meeting the
requirements of the FEVER annotation guidelines).
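The Breaker submission constraints described above (at most 1000 instances, equally many per class) can be expressed as a small check. This is a minimal sketch, not the shared task's own validation code: the function name `check_submission` is hypothetical, and the label string `REFUTES` is assumed by analogy with the `SUPPORTS` and `NOT ENOUGH INFO` values that appear in the data instances of this card.

```python
from collections import Counter

# Label strings as stored in the rows; REFUTES is an assumption (see lead-in).
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")
MAX_INSTANCES = 1000  # upper bound stated in the dataset summary


def check_submission(labels):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if len(labels) > MAX_INSTANCES:
        problems.append(f"too many instances: {len(labels)} > {MAX_INSTANCES}")
    counts = Counter(labels)
    unknown = sorted(set(counts) - set(LABELS))
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    per_class = [counts.get(label, 0) for label in LABELS]
    if len(set(per_class)) != 1:
        problems.append(f"classes are not balanced: {dict(zip(LABELS, per_class))}")
    return problems


# A balanced toy submission with three instances per class passes.
print(check_submission(list(LABELS) * 3))
```

The check covers only size and class balance; novelty and Correctness were judged manually according to the summary and are not modelled here.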
### Supported Tasks and Leaderboards
The task is verification of textual claims against textual sources.
Compared to textual entailment (TE)/natural language inference, the key difference is that in those tasks the
passage used to verify each claim is given, and in recent years it has typically consisted of a single sentence, while
in verification systems the evidence must be retrieved from a large set of documents.
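To make the retrieval step concrete, the sketch below ranks candidate pages by token overlap with the claim before any verification would take place. It is a toy illustration under stated assumptions, not a FEVER baseline: the helper names `tokens` and `retrieve`, the claim, and the second page's text are invented for the example, while the first page's text is the `wiki_pages` sample shown later in this card.

```python
import re


def tokens(text):
    """Lower-case alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim, pages, k=1):
    """Return up to k page ids ranked by token overlap with the claim."""
    claim_tokens = tokens(claim)
    scored = [(len(claim_tokens & tokens(text)), page_id)
              for page_id, text in pages.items()]
    # Highest overlap first; ties are broken alphabetically by page id.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [page_id for score, page_id in ranked[:k] if score > 0]


# Toy corpus: page id -> page text.
pages = {
    "1928_in_association_football": (
        "The following are the football -LRB- soccer -RRB- events "
        "of the year 1928 throughout the world ."
    ),
    "Nikolaj_Coster-Waldau": "Nikolaj Coster-Waldau is a Danish actor .",
}

print(retrieve("Football events took place in 1928.", pages))
```

In an entailment setting the passage would be handed to the classifier directly; here the classifier only sees whatever `retrieve` returns, so retrieval errors propagate to the final label.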
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
#### v1.0
- **Size of downloaded dataset files:** 44.86 MB
- **Size of the generated dataset:** 40.05 MB
- **Total amount of disk used:** 84.89 MB
An example of 'train' looks as follows.
```
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
'label': 'SUPPORTS',
'id': 75397,
'evidence_id': 104971,
'evidence_sentence_id': 7,
'evidence_annotation_id': 92206}
```
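Each row carries a single evidence pointer, and the `train` split has more rows (311431) than the 185,445 claims mentioned in the summary, which suggests that a claim can span several rows. Assuming that rows with the same `id` belong to the same claim (an assumption, not something this card states), the evidence can be regrouped per claim as sketched below. The helper name `group_evidence` and the second sample row are hypothetical.

```python
def group_evidence(rows):
    """Collect distinct (wiki_url, sentence_id) pairs per claim id."""
    claims = {}
    for row in rows:
        entry = claims.setdefault(
            row["id"],
            {"claim": row["claim"], "label": row["label"], "evidence": []},
        )
        # Rows without evidence use an empty URL (see the v2.0 example).
        if row["evidence_wiki_url"]:
            pair = (row["evidence_wiki_url"], row["evidence_sentence_id"])
            if pair not in entry["evidence"]:
                entry["evidence"].append(pair)
    return claims


rows = [
    # The 'train' example from this card.
    {"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Nikolaj_Coster-Waldau", "label": "SUPPORTS",
     "id": 75397, "evidence_id": 104971, "evidence_sentence_id": 7,
     "evidence_annotation_id": 92206},
    # Hypothetical second row for the same claim, invented for illustration.
    {"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Fox_Broadcasting_Company", "label": "SUPPORTS",
     "id": 75397, "evidence_id": 104972, "evidence_sentence_id": 0,
     "evidence_annotation_id": 92206},
]

grouped = group_evidence(rows)
print(grouped[75397]["evidence"])
```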
#### v2.0
- **Size of downloaded dataset files:** 0.39 MB
- **Size of the generated dataset:** 0.30 MB
- **Total amount of disk used:** 0.70 MB
An example of 'validation' looks as follows.
```
{'claim': "There is a convicted statutory rapist called Chinatown's writer.",
'evidence_wiki_url': '',
'label': 'NOT ENOUGH INFO',
'id': 500000,
'evidence_id': -1,
'evidence_sentence_id': -1,
'evidence_annotation_id': 269158}
```
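In this example the absence of evidence is encoded with sentinel values: an empty `evidence_wiki_url` and `-1` for `evidence_id` and `evidence_sentence_id`. A small helper can turn that into an explicit `None`. This is a sketch that assumes the sentinels are used consistently across rows; the function name `evidence_pointer` is hypothetical.

```python
def evidence_pointer(row):
    """Return (wiki_url, sentence_id), or None when the row has no evidence."""
    if row["evidence_wiki_url"] == "" or row["evidence_sentence_id"] < 0:
        return None
    return (row["evidence_wiki_url"], row["evidence_sentence_id"])


# The 'validation' example above (v2.0) and the 'train' example (v1.0).
nei_row = {"claim": "There is a convicted statutory rapist called Chinatown's writer.",
           "evidence_wiki_url": "", "label": "NOT ENOUGH INFO", "id": 500000,
           "evidence_id": -1, "evidence_sentence_id": -1,
           "evidence_annotation_id": 269158}
supported_row = {"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
                 "evidence_wiki_url": "Nikolaj_Coster-Waldau", "label": "SUPPORTS",
                 "id": 75397, "evidence_id": 104971, "evidence_sentence_id": 7,
                 "evidence_annotation_id": 92206}

print(evidence_pointer(nei_row))
print(evidence_pointer(supported_row))
```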
#### wiki_pages
- **Size of downloaded dataset files:** 1.71 GB
- **Size of the generated dataset:** 7.25 GB
- **Total amount of disk used:** 8.97 GB
An example of 'wikipedia_pages' looks as follows.
```
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ',
'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t',
'id': '1928_in_association_football'}
```
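The `lines` field in this example packs one sentence per line, each prefixed by its index and a tab. The sketch below splits it into an index-to-sentence mapping. It assumes the layout seen in the example holds for other pages, ignores any further tab-separated fields on a line, and skips indices with no text; the function name `parse_lines` is hypothetical.

```python
def parse_lines(lines):
    """Map sentence index -> sentence text for one wiki_pages record."""
    sentences = {}
    for raw in lines.split("\n"):
        index, _, rest = raw.partition("\t")
        if not index.isdigit():
            continue  # not an "<index>\t<sentence>" line
        text = rest.split("\t")[0].strip()
        if text:  # indices without text (such as "1\t" below) are skipped
            sentences[int(index)] = text
    return sentences


example_lines = (
    "0\tThe following are the football -LRB- soccer -RRB- events "
    "of the year 1928 throughout the world .\n1\t"
)

print(parse_lines(example_lines))
```

If `evidence_sentence_id` in the claim configurations refers to these indices, which the field names suggest but this card does not state, the mapping gives direct access to the evidence sentence for a claim.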
### Data Fields
The data fields are the same among all splits.
#### v1.0
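- `id`: an `int32` feature; identifier of the example.
- `label`: a `string` feature; classification label of the claim.
- `claim`: a `string` feature; text of the claim to verify.
- `evidence_annotation_id`: an `int32` feature; identifier of the evidence annotation.
- `evidence_id`: an `int32` feature; identifier of the evidence.
- `evidence_wiki_url`: a `string` feature; Wikipedia page that contains the evidence.
- `evidence_sentence_id`: an `int32` feature; index of the evidence sentence within the Wikipedia page.
#### v2.0
- `id`: an `int32` feature; identifier of the example.
- `label`: a `string` feature; classification label of the claim.
- `claim`: a `string` feature; text of the claim to verify.
- `evidence_annotation_id`: an `int32` feature; identifier of the evidence annotation.
- `evidence_id`: an `int32` feature; identifier of the evidence.
- `evidence_wiki_url`: a `string` feature; Wikipedia page that contains the evidence.
- `evidence_sentence_id`: an `int32` feature; index of the evidence sentence within the Wikipedia page.
#### wiki_pages
- `id`: a `string` feature; identifier of the Wikipedia page.
- `text`: a `string` feature; text content of the Wikipedia page.
- `lines`: a `string` feature; text of the Wikipedia page split into lines.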
### Data Splits
#### v1.0
| | train | unlabelled_dev | labelled_dev | paper_dev | unlabelled_test | paper_test |
|------|-------:|---------------:|-------------:|----------:|----------------:|-----------:|
| v1.0 | 311431 | 19998 | 37566 | 18999 | 19998 | 18567 |
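The v1.0 split sizes can be cross-checked against the `dataset_info` metadata at the top of this card: the per-split `num_bytes` values should add up to the reported `dataset_size`. The sketch below only re-does that arithmetic with the numbers copied from the metadata; the variable names are illustrative.

```python
# num_bytes per split, copied from the v1.0 entry in dataset_info.
V1_SPLIT_BYTES = {
    "train": 29591412,
    "labelled_dev": 3643157,
    "unlabelled_dev": 1548965,
    "unlabelled_test": 1617002,
    "paper_dev": 1821489,
    "paper_test": 1821668,
}
V1_DATASET_SIZE = 40043693  # dataset_size reported for v1.0

total_bytes = sum(V1_SPLIT_BYTES.values())
print(total_bytes, total_bytes == V1_DATASET_SIZE)
```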
#### v2.0
| | validation |
|------|-----------:|
| v2.0 | 2384 |
#### wiki_pages
| | wikipedia_pages |
|------------|----------------:|
| wiki_pages | 5416537 |
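The configuration and split names in the tables above are what a loader would be asked for. The sketch below validates a request against those names before delegating to `load_dataset` from the Hugging Face `datasets` library. Several things here are assumptions rather than facts from this card: the helper names `check_request` and `load_fever`, the default repository id `"fever"` (the download link uses `fever/fever`), and whether the final call works unchanged with the installed version of the library.

```python
# Configuration -> split names, copied from the tables in this card.
CONFIG_SPLITS = {
    "v1.0": ("train", "labelled_dev", "unlabelled_dev",
             "unlabelled_test", "paper_dev", "paper_test"),
    "v2.0": ("validation",),
    "wiki_pages": ("wikipedia_pages",),
}


def check_request(config, split):
    """Raise ValueError unless (config, split) is listed in this card."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    return config, split


def load_fever(config, split, repo_id="fever"):
    """Validate the request, then load it (needs network access)."""
    check_request(config, split)
    # Third-party dependency, imported lazily so the check works without it.
    from datasets import load_dataset
    return load_dataset(repo_id, config, split=split)


print(check_request("v2.0", "validation"))
```

Validating first keeps typos such as asking v2.0 for a `train` split from turning into a slow or confusing download error.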
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
FEVER license:
```
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”). You may not use these files except in compliance with the applicable License Terms.
```
### Citation Information
If you use "FEVER Dataset", please cite:
```bibtex
@inproceedings{Thorne18Fever,
author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
booktitle = {NAACL-HLT},
year = {2018}
}
```
If you use "FEVER 2.0 Adversarial Attacks Dataset", please cite:
```bibtex
@inproceedings{Thorne19FEVER2,
author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
title = {The {FEVER2.0} Shared Task},
booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
year = {2019}
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq),
[@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun),
[@albertvillanova](https://github.com/albertvillanova) for adding this dataset.
Provider: fever
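Summary of Original Information
Dataset Overview
Basic Information
- Name: FEVER
- Language: English
- License: CC-BY-SA-3.0, GPL-3.0
- Multilinguality: monolingual
- Size: 100K<n<1M
- Source: extended from Wikipedia
- Task category: text classification
- Tags: knowledge verification
Dataset Configurations
- v1.0:
  - Features:
    - id: int32
    - label: string
    - claim: string
    - evidence_annotation_id: int32
    - evidence_id: int32
    - evidence_wiki_url: string
    - evidence_sentence_id: int32
  - Splits:
    - train: 311431 examples, 29591412 bytes
    - labelled_dev: 37566 examples, 3643157 bytes
    - unlabelled_dev: 19998 examples, 1548965 bytes
    - unlabelled_test: 19998 examples, 1617002 bytes
    - paper_dev: 18999 examples, 1821489 bytes
    - paper_test: 18567 examples, 1821668 bytes
  - Download size: 44853972 bytes
  - Dataset size: 40043693 bytes
- v2.0:
  - Features: same as v1.0
  - Splits:
    - validation: 2384 examples, 306243 bytes
  - Download size: 392466 bytes
  - Dataset size: 306243 bytes
- wiki_pages:
  - Features:
    - id: string
    - text: string
    - lines: string
  - Splits:
    - wikipedia_pages: 5416537 examples, 7254115038 bytes
  - Download size: 1713485474 bytes
  - Dataset size: 7254115038 bytes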
Dataset Creation
- Annotation creators: crowdsourced
- Language creators: found
License
- FEVER license: The data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. The annotations are made available under the license terms described on the applicable Wikipedia article pages or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0).
Citation Information
- FEVER dataset (BibTeX): @inproceedings{Thorne18Fever, author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit}, title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}}, booktitle = {NAACL-HLT}, year = {2018} }
- FEVER 2.0 Adversarial Attacks dataset (BibTeX): @inproceedings{Thorne19FEVER2, author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit}, title = {The {FEVER2.0} Shared Task}, booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}}, year = {2019} }
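Collected Summary
Dataset Introduction
Construction
The FEVER dataset was built by mining and processing Wikipedia content. The research team generated 185,445 claims by altering sentences from Wikipedia; the claims were then verified without reference to the original sentences. Each claim is classified as Supported, Refuted, or NotEnoughInfo, and evidence sentences are provided for the first two classes. FEVER 2.0 added an adversarial attacks dataset, in which participants wrote claims designed to induce classification errors, which increases the difficulty and usefulness of the dataset.
Usage
The FEVER dataset is mainly used to train and evaluate fact verification models. Researchers can load the different configurations (such as v1.0 and v2.0) to obtain training and validation data. The documented fields, including claim, label, and evidence, support feature extraction and classification. Consult the official download and loading instructions to make sure the data is processed correctly and efficiently.
Background and Challenges
Background Overview
In an age of information overload, the web holds pages on almost every topic, yet only a small fraction of that information exists in structured form (Wikidata, Freebase, etc.). Extracting structured knowledge from free text is therefore a key problem. The FEVER (Fact Extraction and VERification) dataset was created in 2018 by James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal to address the extraction and verification of facts from text. It contains 185,445 claims that were generated by altering sentences from Wikipedia and then verified without reference to the original sentences. Claims are classified as Supported, Refuted, or NotEnoughInfo; for the first two classes, annotators also recorded the sentences that form the necessary evidence. The dataset has advanced research on verifiable knowledge extraction and has become an important resource in that field.
Current Challenges
The challenges around the FEVER dataset fall into three areas. First, extracting structured knowledge from free text is a complex process that draws on techniques from natural language processing and information retrieval. Second, keeping annotations accurate and consistent during dataset construction is difficult. Third, the adversarial attacks dataset introduced with FEVER 2.0 raises the robustness requirements, because models must recognize and handle claims written to induce classification errors. These challenges drive improvements to existing techniques and suggest directions for future research.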
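Common Scenarios
Typical Use Cases
In knowledge verification, the FEVER dataset is widely used to train and evaluate models that verify textual claims. By providing a large number of claims derived from Wikipedia together with their evidence, it lets researchers develop and test automated verification systems. Such systems must decide whether a claim is supported by, refuted by, or cannot be determined from the information in Wikipedia.
Academic Problems Addressed
The FEVER dataset addresses the question of how to extract and verify facts from large amounts of unstructured text in natural language processing. As a large-scale annotated dataset, it has supported the development of knowledge extraction and verification techniques and related research in text classification and information retrieval.
Practical Applications
In practice, the FEVER dataset is used to build and tune automated verification systems, which can be applied to news fact-checking, social media monitoring, online education, and other areas. By verifying and correcting false information, such systems help improve the accuracy and reliability of information and strengthen public trust in it.
Recent Research
Latest Research Directions
Recent research with the FEVER dataset focuses on improving the robustness and accuracy of fact verification systems. Extracting and verifying facts efficiently from very large text collections has become an active research topic, and the large-scale annotated data in FEVER has supported progress in this area. Recent work improves traditional text matching and inference models and also introduces adversarial training and multimodal fusion to cope with increasingly complex false information. This work improves system performance and lays the groundwork for more reliable verification platforms.
The content above was collected and summarized by 遇见数据集.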



