fever/fever

Name: fever/fever
Creator: fever
Published: 2024-01-18 11:03:38
License: No description provided

Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/fever/fever

Resource description:

---
language:
- en
paperswithcode_id: fever
annotations_creators:
- crowdsourced
language_creators:
- found
license:
- cc-by-sa-3.0
- gpl-3.0
multilinguality:
- monolingual
pretty_name: FEVER
size_categories:
- 100K<n<1M
source_datasets:
- extended|wikipedia
task_categories:
- text-classification
task_ids: []
tags:
- knowledge-verification
dataset_info:
- config_name: v1.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: train
    num_bytes: 29591412
    num_examples: 311431
  - name: labelled_dev
    num_bytes: 3643157
    num_examples: 37566
  - name: unlabelled_dev
    num_bytes: 1548965
    num_examples: 19998
  - name: unlabelled_test
    num_bytes: 1617002
    num_examples: 19998
  - name: paper_dev
    num_bytes: 1821489
    num_examples: 18999
  - name: paper_test
    num_bytes: 1821668
    num_examples: 18567
  download_size: 44853972
  dataset_size: 40043693
- config_name: v2.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: validation
    num_bytes: 306243
    num_examples: 2384
  download_size: 392466
  dataset_size: 306243
- config_name: wiki_pages
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: lines
    dtype: string
  splits:
  - name: wikipedia_pages
    num_bytes: 7254115038
    num_examples: 5416537
  download_size: 1713485474
  dataset_size: 7254115038
---

# Dataset Card for "fever"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://fever.ai/](https://fever.ai/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources (Wikidata, Freebase, etc.); we are therefore limited by our ability to transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot of recent research and media coverage: false information coming from unreliable sources.

The FEVER workshops are a venue for work in verifiable knowledge extraction and to stimulate progress in this direction.

- FEVER Dataset: FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment.
- FEVER 2.0 Adversarial Attacks Dataset: The FEVER 2.0 Dataset consists of 1174 claims created by the submissions of participants in the Breaker phase of the 2019 shared task. Participants (Breakers) were tasked with generating adversarial examples that induce classification errors for the existing systems. Breakers submitted a dataset of up to 1000 instances with an equal number of instances for each of the three classes (Supported, Refuted, NotEnoughInfo). Only novel claims (i.e. not contained in the original FEVER dataset) were considered as valid entries to the shared task. The submissions were then manually evaluated for correctness (grammatical, appropriately labeled, and meeting the requirements of the FEVER annotation guidelines).

### Supported Tasks and Leaderboards

The task is verification of textual claims against textual sources.

When compared to textual entailment (TE)/natural language inference, the key difference is that in these tasks the passage to verify each claim is given, and in recent years it typically consists of a single sentence, while in verification systems it is retrieved from a large set of documents in order to form the evidence.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

#### v1.0

- **Size of downloaded dataset files:** 44.86 MB
- **Size of the generated dataset:** 40.05 MB
- **Total amount of disk used:** 84.89 MB

An example of 'train' looks as follows.

```
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
 'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
 'label': 'SUPPORTS',
 'id': 75397,
 'evidence_id': 104971,
 'evidence_sentence_id': 7,
 'evidence_annotation_id': 92206}
```

#### v2.0

- **Size of downloaded dataset files:** 0.39 MB
- **Size of the generated dataset:** 0.30 MB
- **Total amount of disk used:** 0.70 MB

An example of 'validation' looks as follows.

```
{'claim': "There is a convicted statutory rapist called Chinatown's writer.",
 'evidence_wiki_url': '',
 'label': 'NOT ENOUGH INFO',
 'id': 500000,
 'evidence_id': -1,
 'evidence_sentence_id': -1,
 'evidence_annotation_id': 269158}
```

#### wiki_pages

- **Size of downloaded dataset files:** 1.71 GB
- **Size of the generated dataset:** 7.25 GB
- **Total amount of disk used:** 8.97 GB

An example of 'wikipedia_pages' looks as follows.

```
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ',
 'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t',
 'id': '1928_in_association_football'}
```

### Data Fields

The data fields are the same among all splits.

#### v1.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.

#### v2.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.

#### wiki_pages

- `id`: a `string` feature.
- `text`: a `string` feature.
- `lines`: a `string` feature.

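The `wiki_pages` example above stores one sentence per line in `lines`, in the form `<sentence index>\t<sentence text>`, which suggests that an `evidence_sentence_id` can be resolved by parsing that field. The following is a minimal sketch under that assumption. `parse_lines` and `evidence_sentence` are hypothetical helper names, not part of any FEVER tooling, and ignoring any further tab-separated columns is likewise an assumption.

```python
from typing import Dict, Optional


def parse_lines(lines: str) -> Dict[int, str]:
    """Map sentence index -> sentence text for one wiki_pages record."""
    sentences = {}
    for row in lines.split("\n"):
        parts = row.split("\t")
        # Skip rows without a numeric index or without any sentence text.
        if len(parts) < 2 or not parts[0].isdigit() or not parts[1].strip():
            continue
        # Assumption: only the first two columns matter; extras are ignored.
        sentences[int(parts[0])] = parts[1].strip()
    return sentences


def evidence_sentence(page: dict, sentence_id: int) -> Optional[str]:
    """Return the sentence with the given index, or None if it is absent."""
    return parse_lines(page["lines"]).get(sentence_id)


# Record taken from the 'wikipedia_pages' example in this card.
page = {
    "id": "1928_in_association_football",
    "lines": "0\tThe following are the football -LRB- soccer -RRB- events "
             "of the year 1928 throughout the world .\n1\t",
}

print(evidence_sentence(page, 0))
print(evidence_sentence(page, -1))  # -1 appears in the NOT ENOUGH INFO example
```

Returning `None` for a missing index keeps the `-1` placeholder seen in the v2.0 example from raising an error.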
### Data Splits

#### v1.0

| Config | train | unlabelled_dev | labelled_dev | paper_dev | unlabelled_test | paper_test |
|--------|-------:|---------------:|-------------:|----------:|----------------:|-----------:|
| v1.0 | 311431 | 19998 | 37566 | 18999 | 19998 | 18567 |

#### v2.0

| Config | validation |
|--------|-----------:|
| v2.0 | 2384 |

#### wiki_pages

| Config | wikipedia_pages |
|------------|----------------:|
| wiki_pages | 5416537 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

FEVER license:

```
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”). You may not use these files except in compliance with the applicable License Terms.
```

### Citation Information

If you use "FEVER Dataset", please cite:

```bibtex
@inproceedings{Thorne18Fever,
    author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
    booktitle = {NAACL-HLT},
    year = {2018}
}
```

If you use "FEVER 2.0 Adversarial Attacks Dataset", please cite:

```bibtex
@inproceedings{Thorne19FEVER2,
    author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {The {FEVER2.0} Shared Task},
    booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
    year = {2019}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova) for adding this dataset.

Provider:

fever

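Summary of Original Information

Dataset Overview

Basic Information

Name: FEVER
Language: English
License: CC-BY-SA-3.0, GPL-3.0
Multilinguality: monolingual
Size: 100K<n<1M
Source: extended from Wikipedia
Task category: text classification
Tags: knowledge verification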

Dataset Configurations

v1.0:
- Features:
  - id: int32
  - label: string
  - claim: string
  - evidence_annotation_id: int32
  - evidence_id: int32
  - evidence_wiki_url: string
  - evidence_sentence_id: int32
- Splits:
  - train: 311431 examples, 29591412 bytes
  - labelled_dev: 37566 examples, 3643157 bytes
  - unlabelled_dev: 19998 examples, 1548965 bytes
  - unlabelled_test: 19998 examples, 1617002 bytes
  - paper_dev: 18999 examples, 1821489 bytes
  - paper_test: 18567 examples, 1821668 bytes
- Download size: 44853972 bytes
- Dataset size: 40043693 bytes
v2.0:
- Features: same as v1.0
- Splits:
  - validation: 2384 examples, 306243 bytes
- Download size: 392466 bytes
- Dataset size: 306243 bytes
wiki_pages:
- Features:
  - id: string
  - text: string
  - lines: string
- Splits:
  - wikipedia_pages: 5416537 examples, 7254115038 bytes
- Download size: 1713485474 bytes
- Dataset size: 7254115038 bytes
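
As a quick consistency check on the v1.0 figures listed above, the per-split byte counts should add up to the reported dataset size. The sketch below copies the numbers from the listing; the variable names are illustrative only.

```python
# v1.0 split statistics copied from the listing: name -> (examples, bytes).
V1_SPLITS = {
    "train": (311431, 29591412),
    "labelled_dev": (37566, 3643157),
    "unlabelled_dev": (19998, 1548965),
    "unlabelled_test": (19998, 1617002),
    "paper_dev": (18999, 1821489),
    "paper_test": (18567, 1821668),
}
V1_DATASET_SIZE = 40043693  # reported dataset size in bytes

total_examples = sum(examples for examples, _ in V1_SPLITS.values())
total_bytes = sum(size for _, size in V1_SPLITS.values())

print("total examples:", total_examples)
print("total bytes:", total_bytes)
print("matches reported dataset size:", total_bytes == V1_DATASET_SIZE)

# Average record size per split, a rough guide to memory needs.
for name, (examples, size) in V1_SPLITS.items():
    print(f"{name}: {size / examples:.1f} bytes per example")
```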

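Dataset Creation

Annotation creators: crowdsourced
Language creators: found

License

FEVER license: The data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0).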

Citation Information

FEVER dataset:

@inproceedings{Thorne18Fever,
    author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
    booktitle = {NAACL-HLT},
    year = {2018}
}

FEVER 2.0 Adversarial Attacks dataset:

@inproceedings{Thorne19FEVER2,
    author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {The {FEVER2.0} Shared Task},
    booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
    year = {2019}
}

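Collected Summary

Dataset Introduction

Construction Method

The FEVER dataset was built from Wikipedia content. The research team generated 185,445 claims by altering sentences from Wikipedia; the claims were then verified without reference to the original sentences. Each claim is classified as Supported, Refuted, or NotEnoughInfo, and evidence sentences are provided for the first two classes. FEVER 2.0 additionally introduced an adversarial attacks dataset, in which participants wrote claims designed to induce classification errors, making the benchmark harder.

Usage

The FEVER dataset is mainly used to train and evaluate fact verification models. Researchers can load the dataset's configurations (such as v1.0 and v2.0) to obtain training and validation data. Each record carries the claim, its label, and the evidence fields, which supports feature extraction and classification. When using the dataset, consult the official download and loading instructions.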
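
The v1.0 train split lists 311,431 rows, while the construction notes report 185,445 claims in total, which suggests that one claim can occupy several rows, one per evidence sentence. Assuming that layout (it is not confirmed here), the sketch below collapses rows into one record per claim `id`. `group_by_claim` is a hypothetical helper; the first and last rows come from the examples in the dataset card, and the middle row is made up purely to show the merge.

```python
def group_by_claim(rows):
    """Collapse evidence-level rows into one record per claim id."""
    claims = {}
    for row in rows:
        record = claims.setdefault(
            row["id"],
            {"claim": row["claim"], "label": row["label"], "evidence": []},
        )
        # Assumption: a negative sentence id means "no evidence recorded",
        # as in the NOT ENOUGH INFO example from the card.
        if row["evidence_sentence_id"] >= 0:
            record["evidence"].append(
                (row["evidence_wiki_url"], row["evidence_sentence_id"])
            )
    return claims


rows = [
    {"id": 75397, "label": "SUPPORTS",
     "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Nikolaj_Coster-Waldau", "evidence_sentence_id": 7},
    # Hypothetical extra evidence row for the same claim (not real data).
    {"id": 75397, "label": "SUPPORTS",
     "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
     "evidence_wiki_url": "Nikolaj_Coster-Waldau", "evidence_sentence_id": 9},
    {"id": 500000, "label": "NOT ENOUGH INFO",
     "claim": "There is a convicted statutory rapist called Chinatown's writer.",
     "evidence_wiki_url": "", "evidence_sentence_id": -1},
]

claims = group_by_claim(rows)
for claim_id, record in claims.items():
    print(claim_id, record["label"], record["evidence"])
```

Grouping before training avoids counting the same claim several times when computing per-claim metrics.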

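Background and Challenges

Background Overview

In an age of information overload, the web holds pages on almost every topic, yet only a small fraction of that information exists in structured form (Wikidata, Freebase, etc.). Extracting structured knowledge from free text is therefore a key problem. The FEVER (Fact Extraction and VERification) dataset was created in 2018 by James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal to address the extraction and verification of facts from text. It contains 185,445 claims, generated by altering sentences from Wikipedia and then verified without reference to the original sentence. Claims are classified as Supported, Refuted, or NotEnoughInfo; for the first two classes, annotators also recorded the sentences forming the necessary evidence. The dataset has advanced research in verifiable knowledge extraction and has become an important resource in the field.

Current Challenges

FEVER poses two main challenges. First, extracting structured knowledge from free text is itself complex, drawing on natural language processing and information retrieval. Second, keeping annotations accurate and consistent during construction is difficult. In addition, the adversarial attacks dataset introduced with FEVER 2.0 raises the robustness requirement: models must recognize and handle claims written to induce classification errors. These challenges have driven progress in existing techniques and opened directions for future research.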

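Common Scenarios

Classic Use Cases

In knowledge verification, FEVER is widely used to train and evaluate models that verify textual claims. It provides a large set of claims derived from Wikipedia together with their evidence, so researchers can develop and test automated verification systems. Such systems must decide whether a claim is supported by Wikipedia, refuted by it, or cannot be determined from it.

Academic Problems Addressed

FEVER addresses the research problem of how to extract and verify facts from large volumes of unstructured text. As a large-scale, carefully annotated dataset, it has supported work on knowledge extraction and verification and on related areas such as text classification and information retrieval.

Practical Applications

In practice, FEVER is used to build and tune automated verification systems that can be applied to news fact-checking, social media monitoring, and online education. By flagging and correcting false information, such systems help make information more reliable and strengthen public trust in it.

Recent Research on the Dataset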
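
The three-way decision described above can be scored with plain label accuracy. The sketch below is illustrative only: `label_accuracy` is a hypothetical helper, the label string "REFUTES" is assumed by analogy with the "SUPPORTS" and "NOT ENOUGH INFO" strings shown in the dataset examples, and the gold and predicted lists are toy values, not results from any system.

```python
# "SUPPORTS" and "NOT ENOUGH INFO" appear in the dataset examples;
# "REFUTES" is an assumed spelling for the third class.
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


def label_accuracy(gold, predicted):
    """Fraction of claims whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    unknown = (set(gold) | set(predicted)) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy values for illustration only.
gold = ["SUPPORTS", "NOT ENOUGH INFO", "REFUTES", "SUPPORTS"]
predicted = ["SUPPORTS", "REFUTES", "REFUTES", "NOT ENOUGH INFO"]

print(label_accuracy(gold, predicted))
```

Label accuracy ignores whether the right evidence was retrieved, so it is only a first check on a verification system.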